Definition --- Relative Speedup: Given a single problem with a sequential algorithm running on one processor and a concurrent algorithm running on $p$ independent processors, relative speedup is defined as
\[
S_p = \frac{T_1}{T_p},
\]
where $T_1$ is the time to run the sequential algorithm as a single process and $T_p$ is the time to run the concurrent algorithm on $p$ processors.
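For example, the empirical results reported below show a relative speedup of approximately eighteen on 32 processors for complex LU factorization of the BCSPWR10 matrix; the corresponding parallel efficiency, $S_p/p$, is
\[
E_{32} \;=\; \frac{S_{32}}{32} \;\approx\; \frac{18}{32} \;\approx\; 0.56 .
\]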
Graphs of relative speedup calculated from empirical performance data are provided in figures 48 through 50 for the three parallel direct solver implementations. Figures 48 and 49 each contain two families of speedup curves, one for factorization and one for the triangular solution, for the five power systems networks examined in this research. Figure 50 contains three families of curves, for factorization, forward reduction, and backward substitution, for the same five networks. Each curve plots relative speedup for 2, 4, 8, 16, and 32 processors.
Figure 48: Relative Speedup --- Complex Variate LU Factorization
Figure 49: Relative Speedup --- Double Precision LU Factorization
Figure 50: Relative Speedup --- Choleski Factorization
Figure 48 illustrates that relative speedup of the parallel complex LU factorization algorithm reaches as much as eighteen on 32 processors for the BCSPWR10 power systems network, while factorization speedup for the other data sets ranges from nearly eight to eleven. The BCSPWR10 matrix requires the most calculations, and careful examination of figures 43 through 47 shows that the empirical timing data for the BCSPWR10 matrix exhibit a greater relative increase in single-processor time from double precision LU factorization to complex LU factorization than the other power systems networks. A significant increase in the time to factor the matrix on a single processor causes a correspondingly large increase in relative speedup. We believe that the unusually good performance of the parallel solver for the BCSPWR10 data is a result of caching effects --- when the program is run on one or two processors, there is too much data on each processor to fit into the fast memory cache, but when more processors are used, each processor's portion of the data fits entirely into the fast cache and the program runs considerably faster with nearly all cache hits. Meanwhile, the complex triangular solutions provide speedups ranging between a high of eight and a low of four.
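A back-of-the-envelope sketch of this caching argument, using purely hypothetical matrix and cache sizes (none of these values are measurements from this study):
\begin{verbatim}
#include <stdio.h>

/* Hypothetical sizes -- chosen only to illustrate the working-set
 * argument, not measured values from this study.                        */
#define NONZEROS      100000          /* non-zeros in the factored matrix */
#define BYTES_PER_NZ  16              /* complex double: 2 x 8 bytes      */
#define CACHE_BYTES   (128 * 1024)    /* assumed per-processor cache size */

int main(void)
{
    /* Compare each processor's share of the factor data with the cache
     * size as the number of processors grows.                           */
    for (int p = 1; p <= 32; p *= 2) {
        long per_proc = (long)NONZEROS * BYTES_PER_NZ / p;
        printf("p = %2d: %8ld bytes per processor (%s the cache)\n",
               p, per_proc,
               per_proc <= CACHE_BYTES ? "fits in" : "exceeds");
    }
    return 0;
}
\end{verbatim}
With these assumed numbers the per-processor working set drops below the cache size between 8 and 16 processors, the kind of transition that can produce unusually high measured speedups.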
Figure 49 illustrates that relative speedup of the parallel double precision LU factorization algorithm reaches nearly ten on 32 processors for the BCSPWR10 power systems network, while factorization speedup falls between seven and eight for the other four networks. Likewise, the double precision triangular solutions provide speedups ranging from slightly greater than four down to three.
Parallel Choleski factorization yields speedups that are less than those of the similar LU algorithms. Empirical relative speedup on 32 processors varies between four and five, as illustrated in figure 50. This figure also presents empirical speedup data for forward reduction and backward substitution; because the implementations of these two triangular solution algorithms differ significantly, empirical data are presented for both.
Backward substitution is the simplest algorithm with the lowest communications overhead --- limited to only the broadcast of recently calculated values in $\mathbf{x}$ when performing the triangular backward substitution on $\mathbf{L}^{T}$. Empirical relative speedup ranges from a high of 3.5 to a low of 2.5. These speedups are only slightly less than the speedups for backward substitution associated with LU factorization.
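To illustrate why the communications in this step are so light, the following is a minimal sketch of the distributed backward substitution on $\mathbf{L}^{T}$. MPI, dense storage, the cyclic row mapping, and the function name are assumptions made purely for illustration; the actual implementation is sparse and targets the CM-5 rather than MPI.
\begin{verbatim}
#include <mpi.h>

/* Sketch: solve (L^T) x = y for the last diagonal block.  Rows of L^T
 * are assumed to be distributed cyclically; rhs[] initially holds the
 * entries of y for the rows this processor owns.  The only
 * interprocessor communication is the broadcast of each recently
 * calculated value in x.                                              */
void backward_substitution(int n, const double *lt /* n*n, row-major */,
                           double *x, double *rhs, int rank, int nproc)
{
    for (int i = n - 1; i >= 0; --i) {
        int owner = i % nproc;                /* assumed cyclic mapping */
        if (rank == owner)
            x[i] = rhs[i] / lt[i * n + i];

        /* Broadcast the recently calculated value in x.               */
        MPI_Bcast(&x[i], 1, MPI_DOUBLE, owner, MPI_COMM_WORLD);

        /* Each processor updates the right-hand sides of its rows.    */
        for (int j = i - 1; j >= 0; --j)
            if (j % nproc == rank)
                rhs[j] -= lt[j * n + i] * x[i];
    }
}
\end{verbatim}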
Meanwhile, essentially no speedup has been measured for the forward reduction algorithm, due primarily to greater communications overhead in this implementation than in either LU forward reduction or Choleski backward substitution. Additional communication is required to update the last diagonal block using data in the borders, and there is additional communication for reducing the last diagonal block. These additional communications occur because the data distribution forces interprocessor communication of partial updates when calculating values in $\mathbf{y}$, rather than the broadcast of completed values in $\mathbf{y}$ as in the LU-based forward reduction. Interprocessor communications increase from being proportional to the number of rows/columns in the last diagonal block to being proportional to the number of non-zeros in the last diagonal block, and after minimum degree ordering the number of non-zeros in the last diagonal block significantly exceeds the number of rows/columns. Due to the characteristics of Choleski factorization, it is inevitable that either forward reduction or backward substitution would have to deal with the problem of increased interprocessor communications [20].
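To make the difference in communication patterns concrete, the following is a minimal dense sketch of forward reduction with the factor distributed by columns, the layout used for the Choleski factor. MPI, dense storage, the cyclic mapping, and the function name are assumptions made for illustration only; in the sparse implementation discussed above, the corresponding partial updates are exchanged per non-zero, which is why the communication volume grows with the non-zero count, whereas the row-distributed LU layout needs only one broadcast per completed value of $\mathbf{y}$.
\begin{verbatim}
#include <mpi.h>

/* Sketch: forward reduction  L y = b  with the columns of L distributed
 * cyclically over processors (the layout used for the Choleski factor).
 * No single processor holds a complete row of L, so the contributions
 * to each y[i] are partial sums that must be combined across
 * processors.                                                          */
void forward_reduction_by_columns(int n, const double *l /* n*n, row-major */,
                                  double *y, const double *b,
                                  int rank, int nproc)
{
    for (int i = 0; i < n; ++i) {
        /* Each processor accumulates the terms from its own columns.   */
        double partial = 0.0;
        for (int j = 0; j < i; ++j)
            if (j % nproc == rank)
                partial += l[i * n + j] * y[j];

        /* Combine the partial updates on the processor that owns y[i]. */
        double total = 0.0;
        int owner = i % nproc;
        MPI_Reduce(&partial, &total, 1, MPI_DOUBLE, MPI_SUM,
                   owner, MPI_COMM_WORLD);

        /* Only the owner of column i needs y[i] for later terms,
         * since it is the only processor that multiplies by column i.  */
        if (rank == owner)
            y[i] = (b[i] - total) / l[i * n + i];
    }
}
\end{verbatim}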
The sensitivity of these parallel algorithms to communications overhead is clearly apparent when comparing the relative speedups presented in figures 48, 49, and 50. For the three solver implementations, the amount of floating point calculation increases from double precision Choleski factorization to double precision LU factorization to complex LU factorization, with relative workloads of approximately 6:2:1 (complex LU : double precision LU : Choleski); however, the amounts of communications are nearly equal. Communications in block-diagonal-bordered Choleski or LU factorization occur in two locations --- updating the last diagonal block using data in the borders and factoring the last diagonal block. When updating the last diagonal block, there are twice as many calculations and twice as many values to distribute to the processors holding data in the last diagonal block for LU factorization as for Choleski factorization, because the LU update requires data from both the lower ($\mathbf{L}$) and upper ($\mathbf{U}$) borders versus only the lower border for the symmetric Choleski factorization (sketched below). There are equal amounts of communications for LU and Choleski factorization when factoring the last diagonal block.
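As a sketch of the border update described above, using generic block notation introduced here for illustration (it is not necessarily the notation of the preceding chapters): if $\mathbf{A}_{qq}$ denotes the last diagonal block, $\mathbf{L}_{qi}$ the factored lower border blocks, and $\mathbf{U}_{iq}$ the factored upper border blocks, then the update performed before the last diagonal block is factored is
\[
\hat{\mathbf{A}}_{qq} \;=\; \mathbf{A}_{qq} - \sum_{i}\mathbf{L}_{qi}\,\mathbf{U}_{iq}
\quad\text{for LU,}
\qquad\text{versus}\qquad
\hat{\mathbf{A}}_{qq} \;=\; \mathbf{A}_{qq} - \sum_{i}\mathbf{L}_{qi}\,\mathbf{L}_{qi}^{T}
\quad\text{for Choleski,}
\]
so the LU update must distribute both border factors to the processors holding the last diagonal block, roughly doubling the communicated values.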
When factoring the last diagonal block, the rank 1 update of the submatrix in the Choleski algorithm uses data from both $\mathbf{L}$ and $\mathbf{L}^{T}$; however, the pipelined parallel algorithm for the last diagonal block requires only that the most recently calculated column of $\mathbf{L}$ be broadcast to all processors during the parallel rank 1 update.
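A minimal sketch of this pipelined step, again with MPI, dense column-cyclic storage, and the function name as illustrative assumptions: only the current pivot column of $\mathbf{L}$ is broadcast, after which each processor updates the columns of the submatrix it owns.
\begin{verbatim}
#include <mpi.h>
#include <math.h>
#include <stdlib.h>

/* Sketch: parallel Choleski factorization of the (dense) last diagonal
 * block A, with columns distributed cyclically.  For each pivot k, the
 * newly computed column of L is broadcast once, and every processor
 * performs its share of the rank 1 update of the remaining submatrix.  */
void choleski_last_block(int n, double *a /* n*n, column-major */,
                         int rank, int nproc)
{
    double *col = malloc(n * sizeof *col);

    for (int k = 0; k < n; ++k) {
        int owner = k % nproc;
        if (rank == owner) {
            /* Scale the pivot column to form column k of L.            */
            a[k * n + k] = sqrt(a[k * n + k]);
            for (int i = k + 1; i < n; ++i)
                a[k * n + i] /= a[k * n + k];
            for (int i = k; i < n; ++i)
                col[i] = a[k * n + i];
        }
        /* Broadcast only the most recently calculated column of L.     */
        MPI_Bcast(col + k, n - k, MPI_DOUBLE, owner, MPI_COMM_WORLD);

        /* Rank 1 update: each processor updates the columns it owns.   */
        for (int j = k + 1; j < n; ++j)
            if (j % nproc == rank)
                for (int i = j; i < n; ++i)
                    a[j * n + i] -= col[i] * col[j];
    }
    free(col);
}
\end{verbatim}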
Communications overhead is nearly constant across the three implementations, and the 6:2:1 relative workload of floating point operations results in relative speedups of approximately 4:2:1 for the BCSPWR10 power systems network. Consequently, if a speedup of 18 were required for a Choleski factorization algorithm embedded in a real-time application, one way to reach that design goal would be to improve interprocessor communications performance by a factor of six, causing a proportional reduction in the communications overhead. Another way to achieve the required performance would be to increase the floating point capability of the processor, although the communications-to-calculations performance ratio would then have to remain equal to that of the CM-5 used for these test runs.
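This trade-off can be illustrated with a simple fixed-overhead speedup model; the model and its terms are assumptions introduced here for illustration, not measurements from this study:
\[
S_p \;=\; \frac{T_1}{\;T_1/p + T_{\mathrm{comm}}(p)\;},
\]
where $T_{\mathrm{comm}}(p)$ is the communications time on $p$ processors. For fixed $p$, either reducing $T_{\mathrm{comm}}$ or increasing $T_1$ relative to $T_{\mathrm{comm}}$ (that is, performing more floating point work per communicated value, as when moving from Choleski to complex LU factorization) drives $S_p$ toward the ideal value of $p$, which is consistent with the observed 4:2:1 spread in speedups for the roughly 6:2:1 spread in workload.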