The parallel ParaMoM code is implemented on three parallel platforms: the CM-5, the Intel Paragon, and the IBM SP-1. Each platform is a coarse-grained MIMD (multiple instruction, multiple data) machine with relatively powerful nodes, and each supports an explicit message-passing programming model. In each case, the message-passing library can be accessed from a Fortran program.
The explicit message-passing approach has a number of advantages for the MoM application. The performance of a message-passing program depends less on the interconnection topology of the parallel machine than it does in data-parallel and shared-memory approaches, and the programmer has more direct control over message volume and timing. Message-passing programs are therefore more likely to port efficiently from one parallel architecture to the next. In addition, the functions of the message-passing libraries are quite similar on all architectures; the differences are mainly in syntax rather than in philosophy. In contrast, data-parallel and SIMD machines rely to varying extents on compiler technology to perform the parallelization, and their parallel facilities may exist at different levels of abstraction. In data-parallel processing, an array object refers to all the elements of the array simultaneously from the software perspective, and the operations on the array's elements are performed simultaneously from the hardware perspective.
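To make the distinction concrete, the short Python/NumPy sketch below (illustrative only; the ParaMoM code itself is written in Fortran) contrasts a data-parallel whole-array statement with the explicit element-wise loop that a message-passing node would execute over its locally owned data.

    import numpy as np

    a = np.arange(8.0)
    b = np.arange(8.0)

    # Data-parallel view: one whole-array statement; the operation is applied
    # to every element "at once" from the programmer's perspective.
    c_data_parallel = a + b

    # Message-passing / explicit view: each node loops over the elements it owns
    # (here, for simplicity, a single node owning all indices).
    c_explicit = np.empty_like(a)
    for i in range(a.size):
        c_explicit[i] = a[i] + b[i]

    assert np.allclose(c_data_parallel, c_explicit)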
There are a number of methods available for solving dense linear systems of equations. In specialized problems, iterative approaches can be quite efficient in terms of memory and computation time. However, iterative techniques suffer when many right-hand-side vectors must be treated. Since this study is concerned with systems having a large number of right-hand-side vectors (i.e., scatterers illuminated by multiple incident waves), iterative methods are not considered. Gaussian elimination is employed instead, since the computationally intensive factorization need only be performed once and its result can be reused for every excitation.
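As a minimal illustration of why direct factorization is preferred when many excitations are present, the following Python sketch (using SciPy rather than the Fortran ParaMoM solver, with hypothetical problem sizes) factors a stand-in moment matrix once and then reuses the factors for every right-hand side.

    import numpy as np
    from scipy.linalg import lu_factor, lu_solve

    N, n_rhs = 500, 64                                            # hypothetical sizes
    Z = np.random.rand(N, N) + 1j * np.random.rand(N, N)          # stand-in moment matrix
    V = np.random.rand(N, n_rhs) + 1j * np.random.rand(N, n_rhs)  # multiple excitations

    lu, piv = lu_factor(Z)            # O(N^3) factorization, performed once
    I = lu_solve((lu, piv), V)        # O(N^2) forward/back substitution per RHS

    assert np.allclose(Z @ I, V)      # all excitations solved from one factorization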
The computational expense of the setup phase, precomputation phase, RHS vector fill phase, and scattered-field computation phase is of order $N$, where $N$ is the number of unknowns (proportional to the surface area of the target). The matrix fill phase is of order $N^2$, the matrix factorization is of order $N^3$, and the matrix solution is of order $N^2$. Because our primary interest in applying parallel computers lies in reducing the total time required to solve large problems (i.e., large $N$), we concentrate on reducing the $O(N^2)$ and $O(N^3)$ phases. In this effort we develop parallel algorithms not only for the matrix fill, factorization, and solution processes, but also for the setup, precomputation, RHS vector fill, and scattered-field computation phases, since the whole program must be run on the target system.
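For reference, these scalings can be collected into a single cost model; the constants $c_1,\dots,c_4$ below are illustrative placeholders rather than measured ParaMoM timings:

\[
T_{\mathrm{total}} \;\approx\; c_1 N \;+\; c_2 N^2 \;+\; c_3 N^3 \;+\; c_4 N_{\mathrm{rhs}} N^2 ,
\]

where the four terms correspond, respectively, to the $O(N)$ phases, the matrix fill, the LU factorization, and the solution for $N_{\mathrm{rhs}}$ right-hand sides.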
In addition to computational cost concerns, it is necessary to consider memory requirements. On the class of computers we are considering, the main memory (RAM) is distributed among the nodes. The moment matrix is generally much too large to fit in the memory of one node, so it must be distributed equitably among the nodes. The matrix storage is of order $N^2$, but all other arrays used in the ParaMoM code are of order $N$ or lower.
To parallelize the ParaMoM code, we first need to choose an algorithm design strategy. In designing a parallel algorithm for MIMD computers, several approaches are possible; a few commonly used ones are described here. In the first, each processor executes exactly the same program but with different input data, and no communication is ever required. This is the trivial parallelization strategy and the easiest to program. Its drawback is that the entire computation must fit into a single processor's memory, which is often impossible. The pipeline approach breaks the computation into a series of small tasks that must be finished sequentially for a large set of data. Each task is assigned to one or a few processors; after completing its computation, a processor passes its result on to the processors handling the next task. However, this approach is applicable only to a restricted class of problems.
The problem-partition approach has received the most attention in scientific computing applications. In this approach, each processor executes substantially the same program, but on a portion of the problem data. Processors are loosely coupled throughout the computation, exchanging information whenever necessary. This approach is well suited to solving large problems in which all available memory is required; the main implementation difficulty is partitioning the problem so as to achieve a good load balance.
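As a simple illustration of the partitioning idea (a sketch only; the row-oriented split and the function name are assumptions, not the actual ParaMoM decomposition), the following Python fragment assigns contiguous blocks of matrix rows to processors so that no processor owns more than one row more than any other:

    def partition_rows(n_rows, n_procs):
        """Return a list of (start, stop) row ranges, one per processor.

        Rows are split into nearly equal contiguous blocks; the first
        n_rows % n_procs processors receive one extra row each.
        """
        base, extra = divmod(n_rows, n_procs)
        ranges, start = [], 0
        for p in range(n_procs):
            stop = start + base + (1 if p < extra else 0)
            ranges.append((start, stop))
            start = stop
        return ranges

    # Example: 10 unknowns on 4 processors -> [(0, 3), (3, 6), (6, 8), (8, 10)]
    print(partition_rows(10, 4))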
The parallel algorithm implemented on the Intel Paragon and the IBM SP-1 is an example of the partitioning approach. On the CM-5, the algorithm combines the message-passing and data-parallel paradigms: the matrix fill and field computations use the partitioning strategy implemented with message passing, while the dense linear system is solved through the CM Fortran interface to the CMSSL library [67], which uses Gaussian elimination and back substitution. The connection between the two paradigms is made through a high-speed mass storage device.
The Intel implementation utilizes the NX message-passing library [68], the standard PVM (Parallel Virtual Machine) message-passing library [69], and ScaLAPACK [70] for excellent portability; the PVM-based version developed on the Intel machines was ported to the IBM SP-1 with little effort. ScaLAPACK is publicly available software developed at Oak Ridge National Laboratory.
ScaLAPACK is a distributed-memory version of the standard LAPACK linear
algebra library. It is built on the BLAS library, which is at the
heart of LAPACK
and its predecessor
LINPACK, in combination with a linear algebra
communication library (BLACS [71]). The current
portable version of the BLACS utilizes calls to the standard PVM (Parallel
Virtual Machine)
message-passing library and, therefore, can be easily ported to any machine
on which PVM runs
(for example Intel or Cray T3D). In addition, BLACS has also been implemented
using the Intel NX
message-passing library for optimum performance.
ScaLAPACK performance [72] is
scalable over a
wide range of problem sizes and machine sizes.
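ScaLAPACK distributes a dense matrix over a logical process grid in a two-dimensional block-cyclic fashion. The small Python sketch below illustrates the general mapping (with an assumed block size and grid shape; it is not code taken from ScaLAPACK) by computing which process row and column own a given global matrix element:

    def owner(i_global, j_global, mb, nb, p_rows, p_cols):
        """Process coordinates owning element (i_global, j_global) in a 2-D
        block-cyclic distribution with mb x nb blocks on a p_rows x p_cols
        process grid (zero-based indices, zero source offset)."""
        p_row = (i_global // mb) % p_rows
        p_col = (j_global // nb) % p_cols
        return p_row, p_col

    # Example with an assumed 64 x 64 block size on a 2 x 4 process grid:
    # element (130, 300) lies in block (2, 4), which wraps back to process (0, 0).
    print(owner(130, 300, 64, 64, 2, 4))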
In the Intel and IBM implementations, the parallel code retains the same overall structure as the sequential code shown in Figure 4.4; the individual components are simply implemented in parallel rather than sequentially.
ScaLAPACK is currently not an efficient matrix solution option on the CM-5 because a BLAS library optimized for the CM-5 vector units does not exist. Therefore, the platform-specific CMSSL matrix equation solver is used there.
The CMSSL library [67] uses a data-parallel programming model and cannot be directly interfaced with the message-passing matrix-fill algorithm. Therefore, in our implementation, the matrix fill and the matrix factorization and solution (factor/solve) are two distinct program units. A high-speed storage device, such as the scalable disk array (SDA) or the DataVault (see [73]), links these two stages: the message-passing MoM matrix-fill program fills the matrix and writes it to a file in the format required by the factor/solve stage, and the matrix is subsequently read in by the data-parallel matrix solver.
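The handoff between the two program units amounts to writing the distributed matrix to a file in an agreed layout and reading it back in the solver stage. The Python sketch below shows the idea serially; the file name, the row-block layout, and the helper functions are hypothetical and stand in for the actual CM-5 file format.

    import numpy as np

    # Stage 1 (fill): each node's block of rows is appended in global row order.
    def write_matrix(filename, blocks):
        with open(filename, "wb") as f:
            for block in blocks:
                np.asarray(block, dtype=np.complex64).tofile(f)

    # Stage 2 (factor/solve): the solver stage reads the full matrix back.
    def read_matrix(filename, n):
        return np.fromfile(filename, dtype=np.complex64).reshape(n, n)

    n = 6
    blocks = [np.random.rand(3, n) + 1j * np.random.rand(3, n) for _ in range(2)]
    write_matrix("moment_matrix.bin", blocks)           # hypothetical file name
    Z = read_matrix("moment_matrix.bin", n)
    assert np.allclose(Z, np.vstack(blocks).astype(np.complex64))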
There is very little performance penalty for using separate program units for the fill and factor/solve stages because the DataVault and the SDA support very high data rates. To demonstrate this claim, we ran a set of timing exercises; elapsed times measured with both the CMMD timer [74] and the CM Fortran timer [75] are listed in the accompanying tables. In the first table, the elapsed time for writing the moment matrix to an SDA file is measured with the CMMD timer. The write operation is executed by the CMMD global write under the CMMD synchronous sequential mode [77], [74]. The CM Fortran Utility library [75] provides an ``SO'' mode which is compatible with almost all CM systems. One can see that the extra effort of using a high-performance storage device to combine the message-passing and data-parallel paradigms is justified.
For this moderate-sized problem the I/O time is small compared to the factorization time. Since the I/O time scales as $N^2$ and the LU time scales as $N^3$, it is evident that for very large problems the I/O time will not be a limiting factor.
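This can be seen directly from the ratio of the two costs (the constants $a$ and $b$ are again illustrative placeholders):

\[
\frac{T_{\mathrm{I/O}}}{T_{\mathrm{LU}}} \;\sim\; \frac{a\,N^2}{b\,N^3} \;=\; \frac{a}{b}\,\frac{1}{N} \;\longrightarrow\; 0 \quad \text{as } N \to \infty ,
\]

so the relative I/O overhead shrinks as the problem size grows.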
The implementation of the matrix-fill algorithm on the CM-5 is written in Fortran 77 with calls to the CMMD message-passing library. The resulting code is adequate for investigating the efficiency of the parallel algorithm as measured by the speed-up (the ratio of the elapsed time on one node to the elapsed time on $p$ nodes). However, the per-node performance is expected to be poor because Fortran 77 code cannot utilize the CM-5 vector units efficiently [79], [78]. In contrast, the data-parallel code for the matrix factorization and solution does utilize the vector units, and the LU and solve operations account for most of the floating-point operations. The CM-5 implementation is therefore expected to achieve good overall speed-up.
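For reference, the speed-up used above, together with the corresponding parallel efficiency, is defined as

\[
S(p) \;=\; \frac{T(1)}{T(p)}, \qquad E(p) \;=\; \frac{S(p)}{p},
\]

where $T(p)$ is the elapsed time on $p$ nodes.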
Memory Requirements
The matrix size is $N \times N$, where $N$ is the number of basis functions (equal to the number of edges between triangular facets, or ``faces''; if the scatterer has an open surface, $N$ is slightly smaller than the number of edges, since boundary edges do not carry basis functions, but this effect is small). Each face has three edges and each edge is shared by two faces, so the relationship between the number of faces $F$ and $N$ is $N \approx 3F/2$.
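The counting argument is simply that the $3F$ face-edge incidences are shared two per edge:

\[
3F \;=\; 2E \quad\Longrightarrow\quad N \;\approx\; E \;=\; \tfrac{3}{2}F .
\]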
In the partitioning approach used here, each node runs the same program and the matrix is distributed among the $P$ nodes, so the matrix storage required on each node is approximately $8N^2/P$ bytes, assuming single-precision complex storage (8 bytes per element). The sequential ParaMoM code requires memory $M = M_0 + M_Z$, where $M_0$ denotes the portion of memory required by arrays other than the matrix and $M_Z = 8N^2$ bytes is the memory required by the matrix. In the parallel approach, the local memory of each node must be large enough to hold $M_0$ plus its portion of $M_Z$.
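As a rough illustration of the dominant per-node matrix storage term (the problem and machine sizes below are hypothetical, not measured ParaMoM configurations):

    def per_node_matrix_memory(n_unknowns, n_nodes, bytes_per_element=8):
        """Per-node storage for an N x N single-precision complex matrix
        (8 bytes per element) distributed evenly over the nodes."""
        return bytes_per_element * n_unknowns**2 / n_nodes

    # Hypothetical example: N = 20000 unknowns on 64 nodes.
    mb = per_node_matrix_memory(20_000, 64) / 2**20
    print(f"{mb:.0f} MB of matrix storage per node")   # ~48 MB per node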
The most important task before implementing ParaMoM as a parallel code is to restructure the program so as to reduce $M_0$; this is very difficult to accomplish. The new version of the ParaMoM code has a substantially lower $M_0$. This significant reduction of $M_0$ makes it possible to use a MIMD computer with a typical configuration of 16 to 32 Mbytes of local memory per node for large application problems. For a very large problem, where $N$ is large relative to the memory available on the nodes, an out-of-core algorithm may still be needed to limit the in-core memory requirement. In the next section, we discuss an out-of-core fill algorithm for the CM-5 implementation.