Another way to measure the performance of a parallel algorithm is in terms of speed-up, which is defined as the increase in performance (units of work per unit time) with increase in the number of nodes for a fixed problem size. The definition of speed-up is given in Chapter 4. The hallmark of good scalability is nearly linear speed-up on sufficiently large problems. For a large problem, we would like to halve the total solution time when we double the machine size. It should be noted that many parallel algorithms do not have good scalability due to the increasing dominance of internode communication as the number of nodes increases. Speed-up is computed as S = T(1)/T(N), where T(1) is the time needed on one node and T(N) is the time on N nodes.
It must be noted that it is not possible for speed-up to be linear for endlessly increasing machine size and fixed problem size, since at some point the number of nodes would exceed the total number of operations required to solve the problem! However, for large problems this is rarely a concern, and so we strive to achieve linear speed-ups.
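As a concrete illustration of the definition above, the following sketch computes speed-up and parallel efficiency from a table of wall-clock times; the node counts and timings are hypothetical values chosen only to show the arithmetic, not measurements from the text.

```python
# Hypothetical wall-clock times (seconds) for one fixed problem size,
# measured on 1, 2, 4, ... nodes. Values are illustrative only.
times = {1: 1024.0, 2: 516.0, 4: 262.0, 8: 134.0, 16: 70.0}

t1 = times[1]
for n, tn in sorted(times.items()):
    speedup = t1 / tn          # S(N) = T(1) / T(N)
    efficiency = speedup / n   # E(N) = S(N) / N; E = 1.0 is ideal linear speed-up
    print(f"N={n:3d}  S={speedup:6.2f}  E={efficiency:4.2f}")
```

Efficiency below 1.0 at the larger node counts reflects the growing share of internode communication discussed above.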
For large problem sizes it may not be possible to obtain the one-node timing required to compute speed-up due to memory or computation time limitations. In these cases, T(1) is estimated based on T(N) for the smallest possible N. This estimate is typically very accurate since the speed-up tends to vary linearly for small N on large problems.
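The extrapolation described above can be sketched as follows. Assuming near-linear speed-up at small N, the one-node time is estimated as T(1) ≈ N_min · T(N_min); the function name and the timings are hypothetical, introduced only for illustration.

```python
def estimated_speedup(times, n_min):
    """times: {nodes: seconds}; n_min: smallest node count that fits the problem.

    Assumes near-linear speed-up at small N, so T(1) ~ n_min * T(n_min).
    """
    t1_est = n_min * times[n_min]          # extrapolated one-node time
    return {n: t1_est / t for n, t in times.items()}

# Problem too large for one node: the smallest feasible run uses 16 nodes.
times = {16: 260.0, 32: 131.0, 64: 67.0, 128: 36.0}
s = estimated_speedup(times, 16)
# By construction s[16] == 16.0; the larger N show how far from linear we fall.
```

By construction the estimated speed-up is exactly linear at N_min, so any sublinearity visible at larger N is attributable to the algorithm, not to the extrapolation.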
Speed-ups for matrix fill on the CM-5 for several problem sizes are shown in Figure 5.1 and Figure 5.2. The small problems are nearly linear for small numbers of processors (small N), as shown in Figure 5.1, but for big N the use of additional nodes is less effective than on the large problems, as shown in Figure 5.2. It is important to note that the parallel algorithm reduces to the sequential algorithm in the special case N = 1, so the speed-up may be viewed as being relative to the original sequential algorithm. Thus we can fill the matrix 500 times as fast as we could with the original serial code implemented on a SUN SPARC workstation.
The large problems are nearly linear even for large N.
Figure 5.2 shows that the large problem sizes yield better
performance in terms of speed-up when a 512-node CM-5 partition is used. This
partition is one of the largest CM-5 partitions available for general use.
For small problems, the speed-up curves are almost flat beyond a certain number of nodes.
The matrix factor (LU decomposition) speed-up for the CM-5 is shown in Figure 5.3. Again the speed-ups are much better for the largest problem size than for the smallest. The speed-up curve may even have a negative slope when a small problem is executed on a large machine. When the CMSSL LU with partial pivoting is used, the performance cost of pivoting depends strongly on matrix size and layout. The extra cost of pivoting is greatest for relatively small matrices. For very large matrices (using nearly all processing-element memory), the performance of the factorization with pivoting is comparable to the performance without pivoting, whereas the solver remains about 50% slower in the pivoting version. For a small matrix, however, internode communication dominates the total run time, which increases the relative cost of pivoting.
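To make the pivoting cost concrete, here is a minimal serial sketch of LU factorization with partial pivoting. It is illustrative Python, not the CMSSL routine, and contains no parallelism; the comments mark the pivot search and row swap, which are the steps that require internode communication on a distributed machine.

```python
def lu_partial_pivot(a):
    """In-place LU factorization of a square matrix a (list of lists).

    Returns the row permutation applied. L's multipliers overwrite the
    strictly lower triangle; U overwrites the upper triangle.
    """
    n = len(a)
    perm = list(range(n))
    for k in range(n):
        # Partial pivoting: find the row with the largest |a[i][k]|.
        # In a distributed layout this search and the row swap below
        # need internode communication, which dominates the run time
        # for small matrices.
        p = max(range(k, n), key=lambda i: abs(a[i][k]))
        if p != k:
            a[k], a[p] = a[p], a[k]
            perm[k], perm[p] = perm[p], perm[k]
        for i in range(k + 1, n):
            a[i][k] /= a[k][k]              # multiplier, stored in L part
            for j in range(k + 1, n):
                a[i][j] -= a[i][k] * a[k][j]
    return perm
```

The pivot search is O(n) per step while the elimination update is O(n^2) per step, which is consistent with the observation that pivoting overhead fades for very large matrices.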
The speed-up curves obtained on the Intel machines are shown in Figure 5.4, Figure 5.5, Figure 5.6, and Figure 5.7. Figure 5.4 and Figure 5.5 show the speed-up curves on the Intel Paragon at NAS. Figure 5.6 and Figure 5.7 show the speed-up curves obtained on the Intel Touchstone Delta system at JPL/Caltech.
The speed-up curves obtained on the IBM SP-1 are shown in Figure 5.8 and Figure 5.9. The BLACS used in these test runs is an unoptimized pre-release of the package, running on the SP-1's high-performance switch with the EUI-H protocol, provided by ORNL's Clint Whaley. An optimized official release of the BLACS on EUI-H will be available in the near future.