As stated earlier, there is the distinct possibility that dedicated DAE solvers could significantly speed up the solution of the equations that model the generators and power network in a transient stability analysis. The numerous techniques reported in the literature illustrate a rich, untapped source of potential research that could be applied to improve the performance of transient stability simulations on parallel architectures. While CDASSL, a concurrent implementation of DASSL, has shown potential in a petrochemical engineering application [37], there have been no reports in the power systems literature of existing DAE solvers being applied to transient stability analysis. Research into a combination of algorithmic and parallel-processing speedup based on these existing programs offers significant opportunities to improve the performance of transient stability software. Each of the existing programs mentioned above has some feature that offers potential performance improvement, although the improvement actually realized will depend strongly on the particular generator and control equations.
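To make the form of these equations concrete, the sketch below writes a DASSL-style residual F(t, y, y') = 0 for a single classical generator (swing equation) coupled to one algebraic power-balance equation; the model, names, and parameter values are illustrative and are not drawn from CDASSL or any of the programs cited above.

    import numpy as np

    # Illustrative constants (assumed values, not from any cited program).
    H, D, Pm = 3.0, 1.0, 0.9          # inertia, damping, mechanical power (per unit)
    E, V, X = 1.05, 1.0, 0.3          # internal EMF, bus voltage, reactance (per unit)
    ws = 2.0 * np.pi * 60.0           # synchronous speed (rad/s)

    def residual(t, y, yprime):
        """DASSL-style residual F(t, y, y') = 0: two differential rows, one algebraic row."""
        delta, omega, Pe = y
        ddelta, domega, _ = yprime
        return np.array([
            ddelta - ws * omega,                       # rotor angle dynamics
            2.0 * H * domega - (Pm - Pe - D * omega),  # swing equation
            Pe - (E * V / X) * np.sin(delta),          # algebraic network power balance
        ])

    # Evaluate the residual at an arbitrary (illustrative) point.
    print(residual(0.0, np.array([0.5, 0.0, 0.0]), np.zeros(3)))

A DASSL-family code integrates this implicit system directly, which is why the structure of the combined differential and algebraic rows matters for the linear algebra discussed below.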
Inherent in all of the aforementioned techniques is the solution of simultaneous systems of sparse linear equations. Extensive research illustrates that reasonable parallelism and speedup are feasible when solving general sparse linear systems on state-of-the-art distributed-memory multiprocessors, by direct techniques [17,18,19,23,26,37,43,48] as well as by iterative techniques [4,11,29,36]. Moreover, the sparse matrices encountered in power system transient stability simulations have significant structure that can be exploited for additional parallelism, both in the portions of the matrices that represent the generator equations and in the portion that represents the power network. This structure can improve the parallel performance of either direct or iterative methods. For parallel direct methods, it can expose additional parallelism, reduce fill-in, minimize communications if partial pivoting is required, and assist in finding a balance that distributes processing evenly while keeping interprocessor communications to a minimum and in a regular pattern. For parallel iterative methods, block-bordered diagonal structure can speed up preconditioning and simplify the search for an efficient load balance with regular communications.
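As a concrete point of reference, the sketch below uses SciPy routines (an assumption of convenience, not the solvers discussed in this report) to contrast a direct sparse LU solve with a preconditioned conjugate-gradient solve of the same small symmetric system; the matrix is a placeholder.

    import numpy as np
    import scipy.sparse as sp
    import scipy.sparse.linalg as spla

    # A small symmetric positive-definite sparse test matrix (illustrative only).
    n = 200
    A = sp.diags([-1.0, 4.0, -1.0], [-1, 0, 1], shape=(n, n), format='csc')
    b = np.ones(n)

    # Direct technique: sparse LU factorization, then triangular solves.
    lu = spla.splu(A)                  # factor once ...
    x_direct = lu.solve(b)             # ... and reuse the factors for each right-hand side

    # Iterative technique: conjugate gradients with a Jacobi (diagonal) preconditioner.
    M = spla.LinearOperator((n, n), matvec=lambda v: v / A.diagonal())
    x_iter, info = spla.cg(A, b, M=M)  # info == 0 indicates convergence

    print(np.linalg.norm(x_direct - x_iter))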
The transient stability DAEs are defined by a system of simultaneous equations that have a special form: blocks of generator equations along the diagonal, together with the algebraic equations representing the network admittance matrix (figure 1). After the matrix is reordered, the remaining coupling variables are located in relatively narrow borders of the matrix [15,41]. The generator equations naturally fall into distinct blocks on the diagonal, and the admittance matrix equations can also be placed in block-bordered diagonal form because power networks are hierarchical by nature. The matrices must be reordered for maximum parallelism, and the reordering must also minimize fill-in in the sparse matrices, and hence the amount of computation, while providing good load balance across the numerous parallel processors.
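A minimal sketch of assembling such a block-bordered diagonal matrix, again with SciPy as a stand-in; the block sizes, densities, and values are placeholders rather than data from an actual power network.

    import numpy as np
    import scipy.sparse as sp

    rng = np.random.default_rng(0)
    n_gen, gen_size, net_size = 4, 3, 6   # illustrative sizes, not a real system

    # Independent generator blocks along the diagonal.
    gen_blocks = [sp.csr_matrix(np.eye(gen_size) + 0.1 * rng.random((gen_size, gen_size)))
                  for _ in range(n_gen)]

    # Narrow borders coupling each generator to the network variables,
    # and the network admittance block in the lower-right corner.
    right_border = [sp.random(gen_size, net_size, density=0.2, random_state=rng)
                    for _ in range(n_gen)]
    lower_border = [sp.random(net_size, gen_size, density=0.2, random_state=rng)
                    for _ in range(n_gen)]
    network = sp.eye(net_size) * 4.0

    # Assemble the block-bordered diagonal matrix (cf. figure 1).
    blocks = [[gen_blocks[i] if i == j else None for j in range(n_gen)] + [right_border[i]]
              for i in range(n_gen)]
    blocks.append(lower_border + [network])
    J = sp.bmat(blocks, format='csr')
    print(J.shape)   # (n_gen*gen_size + net_size, n_gen*gen_size + net_size)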
Significant parallelism is possible in the direct solution of the linear equations encountered in the transient stability DAEs because of their block-bordered diagonal form. Portions of the matrix can be reordered into a hierarchical block-bordered diagonal form, further reducing fill-in and interprocessor communications in parallel implementations. For direct solutions of block-bordered diagonal matrices, significant parallelism is available in each of the three stages of the solution: factorization, forward substitution, and back substitution. In each stage the independent diagonal blocks can be processed concurrently, with interprocessor communication confined largely to the border equations.
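The dense sketch below illustrates these stages on a block-bordered diagonal system via the Schur complement of the border block; the helper name and the use of NumPy are assumptions made for illustration, and a production code would retain and reuse the per-block factors rather than calling a generic solve.

    import numpy as np

    def solve_bbd(diag_blocks, right, lower, corner, b_blocks, b_corner):
        """Sketch of a block-bordered diagonal solve (hypothetical helper)."""
        # Stage 1 -- factorization: each diagonal block is handled independently,
        # and its contribution to the Schur complement of the corner is accumulated.
        schur = corner.copy()
        updates = []
        for A, B, C in zip(diag_blocks, right, lower):
            Ainv_B = np.linalg.solve(A, B)     # stands in for an LU factorization + solve
            schur -= C @ Ainv_B
            updates.append(Ainv_B)

        # Stage 2 -- forward substitution: eliminate the block right-hand sides,
        # then solve the (much smaller) border system.
        rhs = b_corner.copy()
        fwd = []
        for A, C, b in zip(diag_blocks, lower, b_blocks):
            y = np.linalg.solve(A, b)
            rhs -= C @ y
            fwd.append(y)
        x_corner = np.linalg.solve(schur, rhs)

        # Stage 3 -- back substitution: recover each block's unknowns independently.
        x_blocks = [y - Ainv_B @ x_corner for y, Ainv_B in zip(fwd, updates)]
        return x_blocks, x_corner

In a distributed-memory implementation the loops over blocks would be distributed across processors, and only the border quantities (schur, rhs, and x_corner) would need to be communicated.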
Other research areas remain in the parallel direct solution of systems of linear equations. Even after a matrix is reordered, research is needed to determine the most efficient mapping of data to particular processors, and to determine whether fan-in or fan-out algorithms for the direct solution would be more efficient on the parallel architecture of interest. Throughout, balancing both the calculations and the communications is a continual concern.
In addition to the numerous research opportunities available with direct linear equation solvers, iterative methods exist for the solution of the linear equations. The implementation characteristics and numerical properties of iterative methods differ significantly from those of direct techniques. Conjugate gradient methods and waveform relaxation are potential choices for iterative techniques. Iterative methods generate no fill-in and are not limited by the precedence-imposed synchronization of parallel direct methods; however, there is no upper bound on the number of iterations required to converge to a desired accuracy. Convergence is determined primarily by the numerical behavior of the particular problem and by the initial values, the guesses used to start the calculations. Preconditioning techniques are often required to transform ill-conditioned matrices into more tractable ones that require significantly fewer iterations, and previous solutions may make good initial guesses because of the small time steps required to solve the stiff differential equations. Nevertheless, the stiff equations often have a rapid transient phase that could reduce the effectiveness of previous solutions as initial guesses.
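A minimal sketch of this idea, using SciPy's conjugate-gradient routine as a stand-in: a Jacobi (diagonal) preconditioner plus a warm start from the previous time step's solution; the matrix and the slowly varying right-hand side are placeholders.

    import numpy as np
    import scipy.sparse as sp
    import scipy.sparse.linalg as spla

    n = 500
    A = sp.diags([-1.0, 4.0, -1.0], [-1, 0, 1], shape=(n, n), format='csr')
    M = spla.LinearOperator((n, n), matvec=lambda v: v / A.diagonal())  # Jacobi preconditioner

    x_prev = np.zeros(n)              # solution from the previous (small) time step
    for step in range(3):             # a few illustrative time steps
        b = np.full(n, 1.0 + 0.01 * step)          # slowly varying right-hand side
        x, info = spla.cg(A, b, x0=x_prev, M=M)    # warm start from the previous solution
        assert info == 0, "CG failed to converge"
        x_prev = x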
At present, we believe that direct methods offer the best choice for solving the linear systems that arise from the non-linear equations inherent in the DAEs. Direct methods are robust and have well-defined processing times, which is important in real-time or faster-than-real-time software. Because the parallel implementations and numerical properties of direct and iterative techniques differ so fundamentally, numerous research opportunities remain to determine the most applicable technique for an implementation of the power system transient stability analysis software on a particular parallel architecture. There is also the possibility of developing hybrid techniques that use direct methods to solve portions of the matrix and iterative methods to update the solutions. Such a hybrid could be especially effective for symmetric portions of the matrix, where parallel Cholesky factorization can reduce the number of calculations.
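One way such a hybrid might look, sketched with SciPy and NumPy as stand-ins: the symmetric diagonal blocks are Cholesky-factored directly, and those factors then serve as a block-Jacobi preconditioner while an iterative method (GMRES here) updates the full solution. The matrix, block sizes, and coupling terms are illustrative.

    import numpy as np
    import scipy.linalg as la
    import scipy.sparse as sp
    import scipy.sparse.linalg as spla

    rng = np.random.default_rng(1)
    nb, bs = 8, 10                    # number of blocks and block size (illustrative)
    n = nb * bs

    # Symmetric positive-definite diagonal blocks (the "symmetric portions"),
    # plus a small unsymmetric sparse coupling between blocks.
    blocks = []
    for _ in range(nb):
        R = rng.random((bs, bs))
        blocks.append(R @ R.T + bs * np.eye(bs))
    A = sp.block_diag(blocks, format='csr') + 0.05 * sp.random(n, n, density=0.02, random_state=rng)
    b = rng.random(n)

    # Direct stage: Cholesky-factor each diagonal block (independently parallelizable).
    chol = [la.cho_factor(B) for B in blocks]

    def block_jacobi(v):
        """Apply the block-diagonal inverse via the precomputed Cholesky factors."""
        out = np.empty_like(v)
        for i, c in enumerate(chol):
            out[i * bs:(i + 1) * bs] = la.cho_solve(c, v[i * bs:(i + 1) * bs])
        return out

    # Iterative stage: GMRES updates the solution using the direct factors as a preconditioner.
    M = spla.LinearOperator((n, n), matvec=block_jacobi)
    x, info = spla.gmres(A, b, M=M)
    assert info == 0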
Another research area is to combine load balancing with variable time-step techniques when generators are partitioned into near and far groups: generators inside the study area, where significant disturbances will be encountered, and generators outside the study area, which will encounter only minor disturbances. Dynamic partitioning algorithms for parallel transient stability analysis are discussed in [47], where the technique is implemented on a shared-memory multiprocessor, but without a detailed discussion of the load-balancing algorithm needed to handle the potentially unbalanced communications that would be encountered on distributed-memory multiprocessors.
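A minimal sketch of the kind of static heuristic such research might start from: generators inside a hypothetical study area are given a smaller time step (and hence more work per simulated second), and the resulting work estimates are balanced greedily across processors. All names, areas, and weights are illustrative.

    import heapq

    def partition_and_balance(generators, study_area, n_procs, dt_near=0.001, dt_far=0.01):
        """Split generators into near/far groups and greedily balance estimated work.

        generators: dict mapping generator name -> area identifier (illustrative).
        Work per generator is proportional to 1/dt, i.e. to how many steps it must take.
        """
        work = {g: (1.0 / dt_near if area in study_area else 1.0 / dt_far)
                for g, area in generators.items()}

        # Greedy "largest work first" assignment to the least-loaded processor.
        heap = [(0.0, p, []) for p in range(n_procs)]
        heapq.heapify(heap)
        for g in sorted(work, key=work.get, reverse=True):
            load, p, assigned = heapq.heappop(heap)
            assigned.append(g)
            heapq.heappush(heap, (load + work[g], p, assigned))
        return sorted(heap, key=lambda t: t[1])     # (load, processor, generators)

    # Example: four generators, two inside the study area, mapped onto two processors.
    gens = {"G1": "area_A", "G2": "area_A", "G3": "area_B", "G4": "area_C"}
    for load, proc, assigned in partition_and_balance(gens, {"area_A"}, n_procs=2):
        print(proc, load, assigned)

A distributed-memory implementation would additionally weigh the communications each assignment induces, which is precisely the gap noted above for [47].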
Real-time or faster-than-real-time power system transient stability simulation will require significant research, possibly combining some or all of the aforementioned techniques, to produce a scalable, architecture-dependent implementation. Scalable implementations permit speedup to increase, and wall-clock time to decrease correspondingly, as the parallel algorithms are run on similar parallel computer architectures with greater numbers of processors. At the same time, the implementation may depend on architecture particulars; for example, distributed-memory multiprocessors with efficient broadcast communications may benefit from sparse matrix solvers that use fan-out techniques, while other distributed-memory multiprocessors may benefit more from fan-in techniques.
Some state-of-the-art distributed-memory multiprocessors can utilize Active Messages [45], which reduce communication overhead and latency by invoking a lightweight user-level handler as each message arrives, permitting communication to be overlapped with computation.
There is also research potential in examining the benefits of standards-based, reusable parallel-processing software such as Toolbox [37,38]. Toolbox contains both numerical analysis software and high-level communications software that abstracts away architecture dependencies. Numerical analysis algorithms are often a significant portion of applications software, and Toolbox provides a library of optimized numerical analysis routines for rapid development of efficient concurrent applications. CDASSL is an example of concurrent numerical analysis software presently available in Toolbox. CDASSL uses direct methods to solve the sparse matrices, although it has not been optimized for the hierarchical network structures inherent in power system problems. In addition, CDASSL may have difficulties with the large discontinuities in power system problems, so other DAE solvers must be researched for this application. Nevertheless, the communications software in Toolbox can simplify the implementation of new numerical analysis software, which in turn can be added to future releases of Toolbox.
The communications package in Toolbox is called Zipcode, and it provides high-level commands for communications operations that are optimized for particular architectures. All Toolbox user-application software is written using the Zipcode communications routines, which present similar, but optimally implemented, communications commands to simplify the migration of parallel algorithms and concurrent computer code between dissimilar architectures. Toolbox is presently implemented on the Thinking Machines Corporation CM-5, the BBN TC2000, and Intel hypercubes, and it will soon be modified to run on the nCube 2.
There is academic interest at NPAC in performing computational science research on the power system transient stability problem because of its stature as a computational grand challenge. Thus, there is a need for a parallel computing testbed at NPAC that can be used to compare the performance of parallel transient stability implementations and to determine whether the algorithms maintain sufficient accuracy to give answers comparable to those for known test systems [42]. We are not interested in developing new transient stability models; rather, we are interested in developing faster algorithms that draw on a broad range of computational science techniques, and faster parallel implementations that obtain the most from parallel architectures.
We are currently developing a Power System Transient Stability Testbed at NPAC, based on available software such as the EPRI Extended Transient/Mid-Term Stability Program (ETMSP) or the PTI Power System Simulation/E Program (PSS/E). The first target architectures for this testbed are the 32-node Thinking Machines Corporation CM-5 and the 32-node nCube 2 presently available at NPAC. While the generator equations from these software packages are of primary interest for the NPAC testbed, implementing software that interprets the results will permit the efficacy of new algorithms to be compared with the benchmark results available in [42].