Six separate parallel algorithms are implemented: the setup phase, the precomputation phase, the matrix fill phase, the RHS vector (or vectors) fill phase, the matrix factor/solve phase, and the far-field phase. As discussed in Chapter 4 on the parallel implementation, the setup and precomputation phases employ the same parallel mechanism, the global broadcast operation; therefore, only one of these two phases needs to be analyzed. In the far-field algorithm, the far-field computation is carried out in parallel when there are multiple current vectors. The number of current vectors is the same as the number of excitation vectors, and this number is usually limited to about a thousand. These
algorithms, described in Chapter 4, are scalable: the amount of work per node stays the same when the number of excitation vectors and the number of nodes in the system increase by the same factor, as illustrated by the sketch below. The RHS vector fill algorithm is very similar to the far-field algorithm and is scalable in the same way.
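As a simple illustration of this form of scalability, the sketch below deals a set of excitation (current) vectors out to the nodes and checks that the per-node share is unchanged when both counts grow by the same factor. The round-robin assignment and the function name are assumptions made for illustration, not the actual scheme of Chapter 4.

```python
# Illustrative sketch: scalability of the far-field and RHS fill phases.
# Excitation (current) vectors are dealt to the nodes round-robin (an assumed
# rule for illustration); the per-node workload is unchanged when the number
# of vectors and the number of nodes grow by the same factor.

def vectors_per_node(num_excitations: int, num_nodes: int) -> list[int]:
    """Count how many excitation vectors each node processes."""
    counts = [0] * num_nodes
    for v in range(num_excitations):
        counts[v % num_nodes] += 1      # round-robin assignment
    return counts

base   = vectors_per_node(num_excitations=128, num_nodes=16)
scaled = vectors_per_node(num_excitations=128 * 4, num_nodes=16 * 4)
print(max(base), max(scaled))           # 8 8 -> per-node work is unchanged
```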
We therefore conduct a scalability analysis only for the precomputation phase, the matrix fill phase, and the matrix factor/solve phase. The scalability of each of these components is analyzed separately below.
Precomputation Phase Scalability
In the precomputation phase, several arrays are computed in a distributed
fashion. These arrays contain the locations of the numerical integration points,
as well as the values of the basis functions and their divergences
at those points. These arrays are of order $N_p$, where $N_p$ is the number of surface patches, which is proportional to the number of basis functions and hence to the matrix size $N$ (the relation between $N_p$ and the number of basis functions is given in
Chapter 4). The work of computing these arrays is distributed almost evenly
among the nodes.
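A minimal sketch of this distribution is given below, assuming a contiguous block partition of the patches among the nodes and a broadcast that concatenates the locally computed pieces; the partitioning rule and the placeholder array contents are assumptions, not the data structures of Chapter 4.

```python
# Sketch of the distributed precomputation under an assumed block partition.
# Each node computes quadrature-point locations, basis-function samples, and
# divergences for its own block of patches; the pieces are then broadcast so
# that every node holds the full arrays.  The entries are placeholders.

def patch_range(node: int, num_nodes: int, num_patches: int) -> range:
    """Contiguous block of patch indices assigned to one node."""
    lo = node * num_patches // num_nodes
    hi = (node + 1) * num_patches // num_nodes
    return range(lo, hi)

def precompute_local(patches: range) -> dict:
    """Placeholder for the per-patch quadrature/basis-function computation."""
    return {"points":     [f"quad_pts[{p}]"   for p in patches],
            "basis":      [f"basis_vals[{p}]" for p in patches],
            "divergence": [f"div_vals[{p}]"   for p in patches]}

num_nodes, num_patches = 8, 1000
local = [precompute_local(patch_range(n, num_nodes, num_patches))
         for n in range(num_nodes)]          # done concurrently in practice

# "Broadcast": concatenate every node's piece so all nodes see the full arrays.
full_points = [x for piece in local for x in piece["points"]]
assert len(full_points) == num_patches
```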
One may ask what the broadcasting complexity is here. A short analysis shows that the broadcast operation used here is not a bottleneck of our approach.
Assume that there are $n_b$ elements in the basis function array, $n_d$ elements in the array of divergences of the basis functions, and $n_r$ elements in the position vector array.
The total time $T$ of this procedure is
$$T = T_{\mathrm{comp}} + T_{\mathrm{bcast}},$$
where $T_{\mathrm{comp}}$ is the maximum computing time needed for a node and $T_{\mathrm{bcast}}$ is the total broadcasting time. Since the work is distributed almost evenly among the $P$ nodes, $T_{\mathrm{comp}}$ can be expressed as
$$T_{\mathrm{comp}} \approx \frac{n_b t_b + n_d t_d + n_r t_r}{P},$$
where $t_b$, $t_d$, and $t_r$ are the times required to compute an element of the basis function array, the divergence of the basis function array, and the position vector array, respectively.
The total broadcasting time can be expressed as
$$T_{\mathrm{bcast}} \approx P\, t_{\mathrm{lat}} + (n_b + n_d + n_r)\, t_w,$$
where $t_{\mathrm{lat}}$ is the latency for broadcasting a message and $t_w$ is the time required to transfer a four-byte word from one node to another.
On the CM-5, global operations are generally extremely fast. A broadcast of a single word to all nodes takes about 8 microseconds. Broadcast operations take longer for longer messages, reaching an aggregate bandwidth of 25 MB/sec on 32 nodes (see [74]).
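The broadcast model can be evaluated with these CM-5 figures. In the sketch below, the node count and total array size are hypothetical, chosen only to show the orders of magnitude involved; the 25 MB/sec figure is used to estimate $t_w$ for a four-byte word.

```python
# Rough evaluation of T_bcast = P*t_lat + (n_b + n_d + n_r)*t_w using the
# CM-5 figures quoted above.  The node count and array size are hypothetical.

t_lat = 8e-6                # s, single-word broadcast latency on the CM-5
t_w   = 4 / 25e6            # s per four-byte word at 25 MB/sec

P       = 512               # hypothetical node count
n_total = 3_000_000         # hypothetical n_b + n_d + n_r (array elements)

T_bcast = P * t_lat + n_total * t_w
print(f"latency term : {P * t_lat * 1e3:7.2f} ms")     # ~   4 ms
print(f"transfer term: {n_total * t_w * 1e3:7.2f} ms")  # ~ 480 ms
print(f"T_bcast      : {T_bcast * 1e3:7.2f} ms")
```

Even for a few thousand nodes the latency term amounts to only a few tens of milliseconds, so the broadcast time is dominated by the data volume rather than by $P$.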
The only potential liability in the precomputation is seen to be the $P\, t_{\mathrm{lat}}$ term. However, since the practical upper limit on $P$ for the class of machines of interest is a few thousand, this term will never be drastically larger than in the worst case we have run to date.
For reasonable problem sizes, the setup and precomputation time was seen to be very small compared to the matrix fill and factor times. The ratio of fill time to setup and precomputation time remains approximately the same if $N$ and $P$ are increased by the same factor. Since $P$ is expected to increase as $N$ increases, it is clear that the setup and precomputation will not become dominant under any practical circumstance.
Matrix Fill Scalability
No internode communication is performed during the matrix fill phase.
However, the fill time is somewhat dependent on $P$ due to the possibility
of redundant computations in the parallel algorithm. As described in the
previous chapter, each node computes the necessary patch-to-patch geometric
interactions. There is some redundancy in this scheme in that more than one node may compute the same interaction when the nine basis-function pairs defined by the source and field patches are not all assigned to the same node.
In the slab-partitioned algorithm, the amount of redundancy increases with the number of nodes up to a statistical limit; the probability that a patch-to-patch interaction will be computed more than twice is very small, because the extreme case in which the three basis functions on a source patch are assigned to three different nodes has extremely small probability. Therefore a practical worst-case bound is obtained by assuming that each patch-to-patch interaction is computed twice. These computations typically represent less than 25% of the total fill time for the sequential algorithm (the percentage is system dependent), and so the total work in the parallel fill is typically less than 1.25 times the total work in the sequential algorithm. In other words, the lower bound on the fill speedup relative to the sequential algorithm is approximately $P/1.25 = 0.8P$.
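The counting argument can be illustrated with the simple model below: three basis functions per patch, numbered consecutively, and matrix rows dealt to the nodes in contiguous slabs. The numbering convention, the slab rule, and the sizes are assumptions made for illustration; the sketch only shows that, under such a partition, almost every patch's basis functions land on at most two nodes.

```python
# Counting sketch for fill redundancy under a contiguous slab partition.
# Assumptions for illustration: three basis functions per patch, numbered
# consecutively, and rows dealt to nodes in contiguous slabs.  An interaction
# involving a patch is recomputed once for every distinct node that owns one
# of that patch's three basis functions.

def owner(row: int, num_rows: int, num_nodes: int) -> int:
    """Node owning a given row under a contiguous slab partition."""
    return row * num_nodes // num_rows

def owners_per_patch(num_patches: int, num_nodes: int) -> list[int]:
    num_rows = 3 * num_patches
    counts = []
    for p in range(num_patches):
        rows = (3 * p, 3 * p + 1, 3 * p + 2)   # the patch's basis functions
        counts.append(len({owner(r, num_rows, num_nodes) for r in rows}))
    return counts

counts = owners_per_patch(num_patches=10_000, num_nodes=256)
print("one node   :", counts.count(1))
print("two nodes  :", counts.count(2))
print("three nodes:", counts.count(3))   # 0 here; three owners would require
                                         # a slab only one row wide
```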
Matrix Factor and Solve Scalability
Our approach in the matrix factor and solve phase is to utilize existing parallel LU matrix solvers. It is reasonable to assume that state-of-the-art solvers provided by a vendor or by a respected source such as Oak Ridge National Laboratory have been designed to achieve scalable performance, and that their scalability characteristics are well documented [72]. Although it is impossible to characterize all parallel LU solvers in this report, we briefly describe the scalability of the ScaLAPACK solver as an example of the state of the art.
ScaLAPACK uses a block-scattered decomposition in which matrix blocks of size $r \times r$ are distributed among a ``virtual'' $P_r \times P_c$ array of processors (the total number of processors is $P = P_r P_c$). If $P_r$ (or $P_c$) is smaller than the number of block rows (or columns), $N/r$, there may be multiple blocks per node.
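As a concrete illustration of the block-scattered (block-cyclic) layout, the short sketch below maps global block coordinates to grid positions using the standard block-cyclic rule, block $(I, J)$ to process $(I \bmod P_r,\, J \bmod P_c)$; the matrix order, block size, and grid shape are arbitrary example values.

```python
# Block-cyclic mapping used by the block-scattered decomposition: global block
# (I, J) is stored on process (I mod Pr, J mod Pc) of the Pr x Pc grid.  The
# matrix order, block size, and grid shape below are arbitrary examples.

N, r   = 2000, 64        # matrix order and block size
Pr, Pc = 4, 8            # process grid (P = Pr * Pc = 32)

nblocks = -(-N // r)     # ceil(N / r) blocks in each dimension

owners = {}
for I in range(nblocks):
    for J in range(nblocks):
        owners.setdefault((I % Pr, J % Pc), []).append((I, J))

# With nblocks = 32 > Pr and Pc, every process holds several blocks.
blocks_per_proc = [len(b) for b in owners.values()]
print(min(blocks_per_proc), max(blocks_per_proc))   # 32 32
```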
The efficiency $E$ is defined in the previous chapter as $E = T_1/(P\,T_P)$, where $T_1$ and $T_P$ are the solution times for one node and $P$ nodes, respectively; $E$ may be thought of as the percentage of the optimal speed-up obtained. For ScaLAPACK, $E$ is given by a formula of the following form [72]:
$$E = \left[\,1 + c_v\,\frac{P_r + P_c}{N}\,\frac{t_w}{t_f} + c_m\,\frac{P}{N^2}\,\frac{t_{\mathrm{lat}}}{t_f}\,\right]^{-1},$$
where $t_f$ is the time per floating-point operation and $c_v$ and $c_m$ are constants determined by the block size $r$ and by details of the factorization algorithm.
Consider the special case of a square process grid, $P_r = P_c = \sqrt{P}$; this is the configuration used in the ScaLAPACK code. For sufficiently large $N$,
$$E \approx \left[\,1 + 2c_v\,\frac{\sqrt{P}}{N}\,\frac{t_w}{t_f} + c_m\,\frac{P}{N^2}\,\frac{t_{\mathrm{lat}}}{t_f}\,\right]^{-1} \longrightarrow 1 .$$
In this case, we see that $E$ is close to one provided that $N \gg \sqrt{P}\,(t_w/t_f)$ and $N^2 \gg P\,(t_{\mathrm{lat}}/t_f)$. For smaller $N$, however, the terms containing $t_{\mathrm{lat}}$ and $t_w$ can be significant. On many multicomputers, $t_{\mathrm{lat}}$ is several orders of magnitude greater than $t_w$. On a network with limited bandwidth, $t_w$ is large and the term containing $t_w$ may dominate.
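The behaviour described above can be seen by evaluating this efficiency model numerically. In the sketch below, the machine parameters $t_f$, $t_w$, $t_{\mathrm{lat}}$ and the constants $c_v$, $c_m$ are hypothetical placeholders, chosen only to show how $E$ improves with $N$ and degrades, for fixed $N$, as $P$ grows.

```python
# Numerical sketch of the efficiency model for a square sqrt(P) x sqrt(P) grid:
#   E = 1 / (1 + 2*cv*sqrt(P)/N * (tw/tf) + cm*P/N**2 * (tlat/tf))
# The machine parameters and constants below are hypothetical placeholders.

from math import sqrt

def efficiency(N, P, tf=0.1e-6, tw=0.5e-6, tlat=50e-6, cv=1.0, cm=1.0):
    word_term = 2 * cv * sqrt(P) / N * (tw / tf)      # bandwidth-bound term
    msg_term  = cm * P / N**2 * (tlat / tf)           # latency-bound term
    return 1.0 / (1.0 + word_term + msg_term)

for N in (1_000, 5_000, 20_000):
    for P in (32, 256, 1024):
        print(f"N={N:6d}  P={P:5d}  E={efficiency(N, P):.3f}")
```

With these placeholder values the communication terms are significant for small $N$ and large $P$, while $E$ approaches one as $N$ grows, which is the qualitative behaviour described above.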