At present, the techniques available to the computational scientist either to improve the performance of an application or to make the implementation of a grand-challenge computing problem feasible at all fall into two categories: algorithmic improvements that reduce the required computation, and concurrent (parallel) processing that spreads the computation across many processors.
Supercomputer technologies based on scalar or vector processing architectures are rapidly approaching the physical limitations of circuit size and logic switching speeds; consequently, concurrent or parallel processing has become the focus for addressing computational grand challenges [39]. Without a doubt, concurrent processing architectures will be the basis for the forthcoming teraflop (trillion floating-point operations per second) supercomputers. Detailed, accurate, faster-than-real-time transient stability analysis for state or regional electrical power governing authorities will require these impressive computing capabilities to be effectively harnessed using efficient algorithms.
Basic definitions of speedup for sequential and concurrent algorithms are required so that preliminary estimates of algorithm performance can be developed without extensive modeling of algorithms, computer architectures, and particular power systems.
Definition 1 (Speedup) Given a single problem with two sequential algorithms that exhibit execution times of $T_1$ and $T_2$ with $T_1 \geq T_2$, speedup is defined as
$$S = \frac{T_1}{T_2}.$$
This simple, intuitive definition will be expanded in order to compare the performance of sequential and concurrent algorithms.
Definition 2 (Relative Speedup) Given a single problem with a sequential algorithm running on one processor and a concurrent algorithm running on $p$ independent processors, relative speedup is defined as
$$S_p = \frac{T_1}{T_p},$$
where $T_1$ is the time to run the sequential algorithm as a single process and $T_p$ is the time to run the concurrent algorithm on $p$ processors.
Definition 3 (Fair Speedup) Given a single problem with a sequential algorithm and a concurrent algorithm running on $p$ independent processors, fair speedup is defined as
$$S_p^{fair} = \frac{T_{best}}{T_p},$$
where $T_{best}$ is the time to run the most efficient sequential algorithm as a single process and $T_p$ is the time to run the concurrent algorithm on $p$ processors.
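As a concrete illustration of the three definitions above, the following short Python sketch computes each ratio from hypothetical timings; the function names and all timing values are assumptions introduced here for illustration, not measurements from the text.

```python
# Illustrative sketch of Definitions 1-3 with hypothetical timings.

def speedup(t_slow, t_fast):
    """Definition 1: ratio of two sequential execution times."""
    return t_slow / t_fast

def relative_speedup(t_seq, t_par):
    """Definition 2: sequential time of the SAME algorithm versus
    its concurrent version on p processors."""
    return t_seq / t_par

def fair_speedup(t_best_seq, t_par):
    """Definition 3: time of the most efficient sequential algorithm
    versus the concurrent algorithm on p processors."""
    return t_best_seq / t_par

# Hypothetical example: a concurrent algorithm runs in 12 s on 8
# processors; its own sequential version takes 80 s, but the best
# known sequential algorithm takes only 60 s.
print(relative_speedup(80.0, 12.0))  # ~6.7 -- flatters the algorithm
print(fair_speedup(60.0, 12.0))      # 5.0 -- the honest comparison
```

The gap between the two printed values shows why fair speedup is the more conservative measure: relative speedup rewards a slow sequential baseline.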
Definition 4 (Amdahl's Law) Given $T_1$, the time to solve a problem on a single processor, then $T_p$ can be parameterized in $\sigma$ by
$$T_p = \sigma T_1 + (1 - \sigma)\frac{T_1}{p},$$
where $\sigma$ is the inherently sequential fraction of computations. The aforementioned estimate of $T_p$ can be used when estimating the relative speedup [37] by
$$S_p = \frac{T_1}{T_p} = \frac{1}{\sigma + \dfrac{1-\sigma}{p}}.$$
Amdahl's Law can be used to estimate the maximum potential relative speedup by taking the inverse of the sequential fraction of the parallel problem: as $p \rightarrow \infty$, $S_p \rightarrow 1/\sigma$. According to Amdahl's Law, a task with 1% sequential operations could obtain no more than a speedup of 100, regardless of the number of processors applied to the problem.
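A minimal Python sketch of Amdahl's Law as parameterized above, reproducing the 1%-sequential example; the processor counts are illustrative assumptions.

```python
# Amdahl's Law (Definition 4): relative speedup as a function of the
# inherently sequential fraction sigma and the processor count p.

def amdahl_speedup(sigma, p):
    return 1.0 / (sigma + (1.0 - sigma) / p)

# With 1% sequential operations (sigma = 0.01), speedup saturates
# near 1/sigma = 100 no matter how many processors are applied.
for p in (10, 100, 1000, 10_000):
    print(p, round(amdahl_speedup(0.01, p), 1))
# 10 -> 9.2, 100 -> 50.3, 1000 -> 91.0, 10000 -> 99.0
```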
Definition 5 (Overhead-Based Estimates of Speedup) Amdahl's Law gives one preliminary estimate of the potential speedup in a concurrent algorithm; however, for some concurrent algorithms, the overhead associated with concurrency appears more critical than the inherent percentage of sequential operations. In these instances, the time for a parallel algorithm can be defined as
$$T_p = \frac{T_{best}}{p} + T_o.$$
Consequently, an estimate of fair speedup can be obtained by
$$S_p^{fair} = \frac{T_{best}}{\dfrac{T_{best}}{p} + T_o}.$$
In this formula, $T_o$ is the total amount of overhead from numerous sources [16], which include interprocessor communication, synchronization, load imbalance among the processors, and additional calculations that the concurrent algorithm performs but the sequential algorithm does not.
It is not always easy to predict the amount of overhead in a parallel algorithm without extensive, detailed simulation. This measure of concurrent algorithm performance, along with Amdahl's Law, provides preliminary estimates of the potential performance improvement when comparing algorithms with differing communications properties or when comparing substantially different concurrent and sequential algorithms.
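The following sketch evaluates the overhead-based estimate of fair speedup from Definition 5; the overhead model used here (total overhead growing linearly with the number of processors, e.g. from message traffic) is a hypothetical assumption chosen only to show how overhead can come to dominate.

```python
# Overhead-based estimate of fair speedup (Definition 5):
#   T_p = T_best / p + T_o,   fair speedup = T_best / T_p.

def fair_speedup_with_overhead(t_best, p, t_overhead):
    t_p = t_best / p + t_overhead
    return t_best / t_p

# Hypothetical: best sequential time is 100 s; total overhead is
# modeled (as an assumption) as 0.05 s per processor.
for p in (4, 16, 64, 256):
    s = fair_speedup_with_overhead(100.0, p, t_overhead=0.05 * p)
    print(p, round(s, 1))
# 4 -> 4.0, 16 -> 14.2, 64 -> 21.0, 256 -> 7.6
```

Unlike Amdahl's Law, which only saturates, this estimate can predict a speedup that peaks and then declines as processors are added, because the overhead term eventually dominates the shrinking computation term.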
The two techniques available to the computational scientist to address this problem can thus be restated as: (1) reducing the total number of computer instructions that must be executed, through algorithmic speedup, and (2) executing as many of the remaining instructions as possible in parallel.
Meanwhile, algorithmic speedup research has been reported frequently in the IEEE Transactions on Power Systems, with only a single recent paper addressing algorithmic speedup in conjunction with parallelism [47]. Some parallel processing papers have addressed the synergism of algorithms with the computer architecture; however, the focus has generally been limited to techniques that permit parallelism by modifying the precedence when solving the DAEs at the normally sequential time-steps [2,10,33,34,35]. This parallel time-domain technique [2] has been widely used for two reasons: it may expose more calculations that can be performed in parallel than would a concurrent version of a sequential transient stability simulation, and it has permitted implementations that avoid parallelizing such algorithms as the direct solution of linear systems of equations [10]. Nevertheless, the parallel time-domain technique requires significant additional calculations to enhance parallelism.
The example in [2], which illustrates the parallel time-domain technique, requires nearly 35% more calculations than solving the DAEs for sequential time-steps. These additional calculations result from reordering the lower bi-diagonal matrix that represents the application of the trapezoidal rule to the entire set of time-steps. Reordering the matrix modifies the precedence between calculations and causes fill-in that is not present in the purely sequential algorithm. Such additional calculations contribute to algorithmic overhead: additional calculations that a parallel algorithm must perform that a sequential program does not. Any overhead specific to a concurrent implementation of an algorithm decreases the potential efficiency of that algorithm. For example, when 33% more calculations are required in a parallel algorithm, the overhead-based estimate of speedup bounds the maximum speedup by $P/1.33 \approx 0.75P$, where $P$ is the number of processors. In spite of the significant additional calculations they require, techniques like this are one way to keep more parallel processors busy at a time. Nevertheless, techniques that require additional calculations are an inefficient use of resources if other sources of parallelism can be exploited without the requirement for additional calculations.
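As a quick check of the bound just stated, here is a short sketch under the assumption that the 33% of extra calculations parallelize perfectly and constitute the only source of overhead:

```python
# With 33% extra calculations, the total parallel work is 1.33*T_1,
# so even with perfect load balance T_p >= 1.33*T_1/P, and the
# overhead-based estimate bounds speedup by P/1.33 (about 0.75*P).

def max_speedup_with_extra_work(p, extra_fraction=0.33):
    return p / (1.0 + extra_fraction)

for p in (8, 32, 128):
    print(p, round(max_speedup_with_extra_work(p), 1))
# 8 -> 6.0, 32 -> 24.1, 128 -> 96.2
```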
Additionally, the parallel time-domain technique is inherently limited to a low-order numerical integration method, the trapezoidal rule, which may require shorter time-steps, and consequently more calculations, due to accuracy limitations [20]. Because the power system transient stability problem primarily involves the solution of DAEs, additional discussion of DAE solvers is included in the next section.
One of the interesting areas for algorithmic speedup in transient stability simulations is the partitioning of the electrical network under test into study and external areas, which permits larger integration step sizes in the external areas. Variable time-steps can be used because the generator equations are better behaved the greater their electromagnetic distance from the initial fault. Such techniques are reported in [6,28,30,47], with a parallel implementation on an Alliant FX/8 shared-memory multiprocessor discussed in [47]. That article concludes that an algorithmic speedup of approximately five is possible simply by limiting this source of unnecessary calculations. Significant improvements, in both reducing the number of computer instructions and executing as many instructions as possible in parallel on distributed-memory multiprocessors, will require research into many areas not previously addressed in the state-of-the-art parallel power system transient stability analysis research described in the literature.