
Scalability Analysis

Six separate parallel algorithms are implemented: the setup phase, the precomputation phase, the matrix fill phase, the RHS vector (or vectors) fill phase, the matrix factor/solve phase, and the far-field phase. As described in Chapter 4 on the parallel implementation, the setup and precomputation phases employ the same parallel mechanism, a global broadcast operation, so only one of the two needs to be analyzed. In the far-field algorithm, the far-field computation is carried out in parallel when there are multiple current vectors. The number of current vectors is the same as the number of excitation vectors, and this number is usually limited to about a thousand. These algorithms, described in Chapter 4, are scalable: each node performs the same amount of work when the number of excitation vectors and the number of nodes in the system increase by the same factor. The RHS vector fill algorithm is very similar to the far-field algorithm and is scalable in the same way. We therefore conduct a scalability analysis only for the precomputation phase, the matrix fill phase, and the matrix factor/solve phase. The scalability of each of these components is analyzed separately below.
Precomputation Phase Scalability

In the precomputation phase, several arrays are computed in a distributed fashion. These arrays contain the locations of the numerical integration points, as well as the values of the basis functions and their divergences at those points. These arrays are of order $N_p$, where $N_p$ is the number of surface patches, which is proportional to the number of basis functions, or matrix size $N$ (the relation between $N_p$ and the number of basis functions is given in Chapter 4). The work of computing these arrays is distributed almost evenly among the nodes.
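As a concrete sketch of this distribution, the fragment below assigns each node a nearly equal contiguous slice of the $N_p$ patches and computes only that slice locally; the names used here (patch_range, compute_patch_data, and the returned placeholder data) are illustrative only and do not correspond to the actual routines in the code.

    # Minimal sketch of the distributed precomputation, assuming a simple
    # block partition of the N_p patches among P nodes.  All names are
    # illustrative placeholders.

    def patch_range(node, P, N_p):
        """Half-open range [start, end) of patches owned by `node`."""
        base, extra = divmod(N_p, P)
        start = node * base + min(node, extra)
        end = start + base + (1 if node < extra else 0)
        return start, end

    def compute_patch_data(p):
        """Stand-in for evaluating quadrature points, basis functions,
        and divergences on patch p."""
        return {"patch": p}

    def precompute_local(node, P, N_p):
        """Each node fills only its own slice of the precomputed arrays;
        the slices are then broadcast so every node holds the full arrays."""
        start, end = patch_range(node, P, N_p)
        return [compute_patch_data(p) for p in range(start, end)]

    # Example: node 0 of a 32-node machine with 1000 patches
    print(patch_range(0, 32, 1000))   # -> (0, 32)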

One may ask what the broadcasting complexity is here. We give a short analysis to show that the broadcast operation used here is not a bottleneck of our approach.

Assume that there are $n_1$ elements in the basis function array, $n_2$ elements in the divergence-of-basis-function array, and $n_3$ elements in the position vector array. The total time $T$ of this procedure is

T = T_c + T_b ,

where $T_c$ is the maximum computing time needed for a node, and $T_b$ is the total broadcasting time. $T_c$ can be expressed as

T_c \approx ( n_1 t_1 + n_2 t_2 + n_3 t_3 ) / P ,

where $t_1$, $t_2$, and $t_3$ are the times required to compute an element of the basis function array, the divergence-of-basis-function array, and the position vector array, respectively. The total broadcasting time can be expressed as

T_b \approx P t_s + ( n_1 + n_2 + n_3 ) t_w ,

where $t_s$ is the latency for broadcasting a message and $t_w$ is the time required to transfer a four-byte word from one node to another.

In the CM-5, global operations are generally extremely fast. A broadcast of a single word to all nodes takes about 8 microseconds. Broadcast operations take longer for longer messages, reaching an aggregate bandwidth of 25 MB/sec on 32 nodes (see [74]).
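To get a feel for the numbers, the fragment below evaluates the broadcast cost $T_b = P t_s + (n_1 + n_2 + n_3) t_w$ using the CM-5 figures quoted above; the total word count is an invented illustrative value, not a measurement from this work.

    # Rough numeric check of T_b = P*t_s + (n1+n2+n3)*t_w using the quoted
    # CM-5 figures.  The word count n is illustrative only.
    t_s = 8e-6                # broadcast latency: about 8 microseconds
    t_w = 4.0 / 25e6          # time per four-byte word at 25 MB/sec
    n = 300_000               # n1 + n2 + n3 (illustrative)

    for P in (32, 256, 2048):
        T_b = P * t_s + n * t_w
        print(f"P={P:5d}: latency {P*t_s*1e3:6.2f} ms, "
              f"bandwidth {n*t_w*1e3:6.2f} ms, total {T_b*1e3:6.2f} ms")

Even at a few thousand nodes the $P t_s$ term remains in the tens of milliseconds, consistent with the observation below that the precomputation never threatens to dominate the fill and factor times.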

The only potential liability in the precomputation is seen to be the $P t_s$ term. However, since the practical upper limit on $P$ for the class of machines of interest is a few thousand, this term will never be drastically larger than in the worst case we have run to date. For reasonable problem sizes the setup and precomputation time was seen to be very small compared to the matrix fill and factor times. The ratio of fill time to setup and precomputation time remains approximately constant if $N$ and $P$ are increased by the same factor. Since $N$ is expected to increase as $P$ increases, it is clear that the setup and precomputation will not become dominant under any practical circumstance.
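As a quick check of this scaling claim, assume (as above) that the per-node fill time grows as $N^2/P$ and that the precomputation time behaves as $c_p N / P + P t_s + c_w N t_w$, with $n_1 + n_2 + n_3$ proportional to $N$; the proportionality constants here are illustrative, not measured. Replacing $N$ by $kN$ and $P$ by $kP$,

\frac{T_{fill}}{T_{pre}} \approx \frac{c_f N^2 / P}{c_p N / P + P t_s + c_w N t_w} \longrightarrow \frac{k ( c_f N^2 / P )}{c_p N / P + k ( P t_s + c_w N t_w )} ,

so the numerator and the dominant communication terms in the denominator all grow by the same factor $k$, and the ratio of fill time to precomputation time does not shrink as the problem and the machine grow together.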
Matrix Fill Scalability

No internode communication is performed during the matrix fill phase. However, the fill time is somewhat dependent on $P$ because of the possibility of redundant computations in the parallel algorithm. As described in the previous chapter, each node computes the necessary patch-to-patch geometric interactions. There is some redundancy in this scheme in that more than one node may compute the same interaction when the nine basis functions defined for the source and field patches are not all assigned to the same node.

In the slab-partitioned algorithm, the amount of redundancy increases with the number of nodes up to a statistical limit; the probability that a patch-to-patch interaction will be computed more than twice is very small. The extreme case in which the three basis functions on a source patch are assigned to three different nodes has extremely small probability. Therefore an upper bound on the worst case is obtained by assuming that each patch-to-patch interaction is computed twice. These computations typically represent less than 25% of the total fill time (this percentage is system dependent) for the sequential algorithm, and so the total work in the parallel fill is typically less than 1.25 times the total work in the sequential algorithm. In other words, the lower bound on the fill speedup relative to the sequential algorithm is $P/1.25 = 0.8P$.
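The bound follows directly from the work estimate: if $T_1$ is the sequential fill time and at most 25% of that work is duplicated in the parallel fill, then

T_{fill}^{par} \le \frac{1.25\, T_1}{P} \quad\Longrightarrow\quad S = \frac{T_1}{T_{fill}^{par}} \ge \frac{P}{1.25} = 0.8 P .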

Matrix Factor and Solve Scalability

Our approach in the matrix factor and solve phase is to utilize existing parallel LU matrix solvers. It is reasonable to assume that state-of-the-art solvers provided by a vendor or by a respected source such as the Oak Ridge National Laboratory have been designed to achieve scalable performance, and that their scalability characteristics are well documented [72]. Although it is impossible to characterize all parallel LU solvers in this report, we briefly describe the scalability of the ScaLAPACK solver as an example of the state of the art.

ScaLAPACK uses a block-scattered decomposition in which matrix blocks of size $r \times r$ are distributed among a ``virtual'' $P_r \times P_c$ array of processors (the total number of processors is $P = P_r P_c$). If $r P_r$ is smaller than $N$, there may be multiple blocks per node.
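To make the block-scattered mapping concrete, the fragment below applies the block-cyclic ownership rule (global block $(I, J)$ is held by process $(I \bmod P_r,\ J \bmod P_c)$) to a small illustrative matrix; the sizes are placeholders chosen only to show that each node ends up holding several scattered blocks.

    # Sketch of the block-scattered (2-D block-cyclic) mapping used by
    # ScaLAPACK: block (I, J) lives on process (I mod P_r, J mod P_c).
    # Matrix order, block size, and grid shape are illustrative values.
    from collections import Counter

    N, r = 16, 2                 # matrix order and block size
    P_r, P_c = 2, 2              # virtual process grid, P = P_r * P_c

    n_blk = (N + r - 1) // r     # number of block rows / columns
    owner = {(I, J): (I % P_r, J % P_c)
             for I in range(n_blk) for J in range(n_blk)}

    # With more blocks than processes, every node holds several blocks.
    print(Counter(owner.values()))   # each of the 4 processes owns 16 blocks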

The efficiency $E$ is defined in the previous chapter as $E = T_1 / (P T_P)$, where $T_1$ and $T_P$ are the solution times for one node and $P$ nodes, respectively. $E$ may be thought of as the percentage of the optimal speed-up obtained. For ScaLAPACK, $E$ is given by a formula of the following general form [72]:

E = \left[ 1 + c_v \frac{P_r + P_c}{N} \frac{t_v}{t_f} + c_m \frac{P_r P_c}{N^2} \frac{t_m}{t_f} \right]^{-1} ,

where $t_f$ is the time per floating-point operation, $t_v$ is the time to transfer one matrix element between nodes, $t_m$ is the message start-up time (latency), and $c_v$ and $c_m$ are constants of order unity that depend on the block size and on the details of the communication within the factorization.

Consider the special case of a square process grid, $P_r = P_c = \sqrt{P}$. This is the configuration used in the ScaLAPACK code. For sufficiently large $N$, the latency term is negligible and

E \approx \left[ 1 + 2 c_v \frac{\sqrt{P}}{N} \frac{t_v}{t_f} \right]^{-1} .

In this case, we see that $E$ is close to one provided that $N \gg \sqrt{P}\, t_v / t_f$. For smaller $N$, however, both the $t_v$ term and the $t_m$ term can be significant. On many multicomputers, $t_m$ is several orders of magnitude greater than $t_v$. On a network with limited bandwidth, $t_v$ is large and the term containing $t_v$ may dominate.
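As a rough illustration of how these terms trade off, the fragment below evaluates the simplified square-grid model above with invented machine parameters; the constants and times are placeholders rather than measured values for the CM-5 or any other machine.

    # Illustrative evaluation of the simplified efficiency model
    #   E ~ 1 / (1 + c_v*sqrt(P)/N*(t_v/t_f) + c_m*P/N**2*(t_m/t_f)).
    # All parameter values below are invented placeholders.
    from math import sqrt

    def efficiency(N, P, t_f=0.1e-6, t_v=1.0e-6, t_m=100e-6, c_v=2.0, c_m=1.0):
        bandwidth_term = c_v * sqrt(P) / N * (t_v / t_f)
        latency_term = c_m * P / N**2 * (t_m / t_f)
        return 1.0 / (1.0 + bandwidth_term + latency_term)

    for N in (1_000, 5_000, 20_000):
        print(N, [round(efficiency(N, P), 3) for P in (16, 64, 256)])

In this model the efficiency moves toward one as $N$ grows for a fixed $P$, and the latency term matters only when $N$ is small relative to $P$, mirroring the discussion above.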


