
The Parallel Programming Model and Approaches

The parallel ParaMoM code is implemented on three parallel platforms: the CM-5, the Intel Paragon, and the IBM SP-1. Each platform is a coarse-grained MIMD (multiple instruction, multiple data) machine with relatively powerful nodes. Each machine supports an explicit message-passing programming model, and in each case the message-passing library may be accessed from a Fortran program.

The explicit message-passing approach has a number of advantages for the MoM application. The performance of a message-passing program is less dependent on the interconnection topology of the parallel machine than it is in data-parallel and shared-memory approaches, and the programmer has more direct control over message volume and timing. Message-passing programs are therefore more likely to port efficiently from one parallel architecture to the next. In addition, the functions of the message-passing libraries are quite similar on all architectures; the differences are mainly in syntax rather than in philosophy. In contrast, data-parallel and SIMD machines rely to varying extents on compiler technology to perform the parallelization work, and their parallel facilities may exist at different levels of abstraction. In data-parallel processing, an array object refers to all elements of the array at once from the software perspective, while from the hardware perspective the operations on the individual elements are carried out simultaneously.

A number of methods are available for solving dense linear systems of equations. For specialized problems, iterative approaches can be quite efficient in both memory and computation time. However, iterative techniques suffer when many right-hand-side vectors must be treated. In this study we are interested in solving systems with a large number of right-hand-side vectors (i.e., scatterers illuminated by multiple incident waves), so iterative methods are not considered. Gaussian elimination is employed instead, since the computationally intensive factorization need only be performed once.
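As a rough back-of-the-envelope illustration of why direct factorization pays off for many right-hand sides, consider the sketch below. The operation counts, problem size, angle count, and iteration count are assumptions chosen only for illustration and are not taken from this study.

```python
# Order-of-magnitude cost model (assumed constants, illustrative only):
# LU factorization is paid once, after which each additional right-hand
# side costs only a forward/back substitution; an iterative solver pays
# roughly one matrix-vector product per iteration for every right-hand side.

def direct_flops(N, n_rhs):
    factor = (2.0 / 3.0) * N**3          # one-time LU factorization
    solve = 2.0 * N**2 * n_rhs           # substitution per right-hand side
    return factor + solve

def iterative_flops(N, n_rhs, n_iter):
    return 2.0 * N**2 * n_iter * n_rhs   # ~one mat-vec per iteration per RHS

N, n_rhs, n_iter = 10_000, 720, 200      # hypothetical RCS sweep over 720 angles
print(f"direct:    {direct_flops(N, n_rhs):.2e} flops")
print(f"iterative: {iterative_flops(N, n_rhs, n_iter):.2e} flops")
```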

The computational expenses of the setup phase, precomputation phase, right-hand-side (RHS) vector fill phase, and scattered-field computation phase are of order N, where N is the number of unknowns (proportional to the surface area of the target). The matrix fill phase is of order N^2, the matrix factorization is of order N^3, and the matrix solution is of order N^2. Because our primary interest in applying parallel computers lies in reducing the total time required to solve large problems (i.e., large N), we concentrate on reducing the O(N^2) and O(N^3) processes. In this effort we develop parallel algorithms not only for the matrix fill, factorization, and solution processes, but also for the setup, precomputation, RHS vector fill, and scattered-field computation phases, since the whole program must run on the target system.
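A small scaling sketch (with arbitrary, assumed per-operation constants, purely for illustration) makes concrete why the O(N) phases quickly become negligible next to the O(N^2) fill and O(N^3) factorization:

```python
# Assumed per-phase cost constants; only the growth rates matter here.
phases = {
    "O(N)   setup / precompute / RHS / fields": lambda N: 1.0e-5 * N,
    "O(N^2) matrix fill":                       lambda N: 1.0e-7 * N**2,
    "O(N^3) matrix factorization":              lambda N: 1.0e-9 * N**3,
}

for N in (1_000, 10_000, 50_000):
    print(f"N = {N}")
    for name, cost in phases.items():
        print(f"  {name:42s} ~ {cost(N):12.1f} (arbitrary time units)")
```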

In addition to computational cost concerns, it is necessary to consider memory requirements. On the class of computers we are considering, the main memory (RAM) is distributed among the nodes. The moment matrix is generally much too large to fit in the memory of one node, so it must be distributed equitably among the nodes. The matrix size is of order N^2, but all other arrays used in the ParaMoM code are of order N or lower.

To parallelize the ParaMoM code, we first need to decide what kind of algorithm design strategy to use. Several approaches are possible when designing a parallel algorithm for MIMD computers; a few commonly used ones are described here. In the first, each processor executes exactly the same program but with different input data, and no communication is ever required. This is the trivial parallelization strategy, and it is the easiest to program. Its drawback is that the entire computation must fit into a single processor's memory, which is often impossible. The pipeline approach breaks the computation into a series of small tasks that must be completed sequentially for a large set of data. Each task is assigned to one or a few processors; after completing its computation, a processor passes its result on to the processors that handle the next task. This approach, however, is applicable only to a restricted class of problems.

The problem-partition approach has received the most attention in scientific computing applications. In this approach, each processor executes substantially the same program, but on a portion of the problem data. Processors are loosely coupled throughout the computation, exchanging information whenever necessary. This approach is well suited to solving large problems in which all available memory is required. The main implementation difficulty is partitioning the problem so that the load is well balanced.
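As a minimal sketch of one common partitioning choice (a contiguous row-block decomposition; the function name and layout are our own illustration, not necessarily the ParaMoM scheme), the rows of the moment matrix can be spread over P nodes so that no node holds more than one extra row:

```python
def row_range(node, N, P):
    """Half-open [lo, hi) range of global matrix rows assigned to `node`,
    spreading any remainder of N/P one row at a time over the first nodes."""
    base, extra = divmod(N, P)
    lo = node * base + min(node, extra)
    hi = lo + base + (1 if node < extra else 0)
    return lo, hi

N, P = 10, 4
for node in range(P):
    print(node, row_range(node, N, P))   # -> (0, 3) (3, 6) (6, 8) (8, 10)
```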

The parallel algorithm implemented on the Intel Paragon and the IBM SP-1 is an example of the partitioning approach. On the CM-5, the algorithm combines the message-passing paradigm with the data-parallel paradigm: the matrix fill and field computation use the partitioning strategy implemented with message passing, while the dense linear system is solved through the CM Fortran interface to the CMSSL library [67], which uses Gaussian elimination and back substitution. The connection between the two paradigms is made through a high-speed mass storage device.

The implementation on the Intel machines utilizes the NX message-passing library [68], the standard PVM (Parallel Virtual Machine) message-passing library [69], and ScaLAPACK [70], which together provide excellent portability. The PVM-based implementation on the Intel machines was ported to the IBM SP-1 with little effort. ScaLAPACK, a publicly available software package, was developed at Oak Ridge National Laboratory.

ScaLAPACK is a distributed-memory version of the standard LAPACK linear algebra library. It is built on the BLAS library, which is at the heart of LAPACK and its predecessor LINPACK, in combination with a linear algebra communication library (the BLACS [71]). The current portable version of the BLACS makes calls to the standard PVM message-passing library and can therefore be easily ported to any machine on which PVM runs (for example, the Intel machines or the Cray T3D). In addition, the BLACS have also been implemented using the Intel NX message-passing library for optimum performance. ScaLAPACK performance [72] is scalable over a wide range of problem sizes and machine sizes.
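ScaLAPACK's solvers assume a two-dimensional block-cyclic distribution of the matrix over a logical process grid. The following sketch (block size and grid shape are illustrative choices of ours, not parameters from this study) shows how each global matrix element maps to a process under that layout:

```python
def owning_process(i, j, nb, p_rows, p_cols):
    """Process-grid coordinates owning global element (i, j) under a
    square-block (nb x nb) 2-D block-cyclic layout."""
    return (i // nb) % p_rows, (j // nb) % p_cols

# 8x8 matrix, 2x2 blocks, 2x2 process grid: blocks are interleaved among
# the processes, which keeps the LU workload balanced as elimination proceeds.
for i in range(8):
    print(" ".join("P%d%d" % owning_process(i, j, 2, 2, 2) for j in range(8)))
```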

In the Intel and IBM implementations, the parallel code has the same structure as the sequential code shown in Figure 4.4; its components are simply implemented in parallel rather than sequentially.

ScaLAPACK is currently not an efficient matrix solution option on the CM-5 because a BLAS library that is optimized for the CM-5 vector units does not exist. Therefore, the platform specific CMSSL matrix equation solver is being used.

The CMSSL library [67] uses a data-parallel programming model and cannot be globally interfaced with the message-passing, matrix-filling algorithm. Therefore, in our implementation, the matrix fill and matrix factorization and solution (factor/solve) are two distinct program units. A high-speed device, such as the scalable data array (SDA) or DataVault (see [73]), is used to link these two stages. The message-passing MoM matrix-filling program fills the matrix and writes it to a file in the format required for the factor/solve stage. The matrix is subsequently read in by the data-parallel matrix solver stage.
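A minimal serial analogy of this two-stage structure is sketched below. The file name, matrix contents, and numpy/LAPACK stand-ins are ours; the actual code uses the CMMD-based fill, the SDA or DataVault, and the CMSSL factor/solve.

```python
import numpy as np

def fill_stage(N, path="moment_matrix.npy"):
    # Stand-in for the message-passing matrix fill: build a (toy) dense
    # complex matrix and write it to storage for the next program unit.
    Z = (np.random.rand(N, N) + 1j * np.random.rand(N, N)).astype(np.complex64)
    Z += N * np.eye(N, dtype=np.complex64)   # keep the toy system well conditioned
    np.save(path, Z)                         # stands in for the SDA/DataVault write

def solve_stage(rhs, path="moment_matrix.npy"):
    # Stand-in for the data-parallel factor/solve: read the matrix back
    # and solve for the surface-current coefficients.
    Z = np.load(path)
    return np.linalg.solve(Z, rhs)

N = 64
fill_stage(N)
currents = solve_stage(np.ones(N, dtype=np.complex64))
print(currents.shape)
```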

There is very little performance penalty for using separate program units for the fill and factor/solve stages, because the DataVault and the SDA have very high data rates. As an illustration, we have run some exercises to demonstrate this claim. Timing data from both the CMMD timer [74] and the CM Fortran timer [75] are listed in the accompanying tables. The elapsed time for writing the moment matrix to an SDA file, measured by the CMMD timer, corresponds to a write executed by the CMMD global write under CMMD synchronous sequential mode [77][74]. The CM Fortran Utility library [75] provides an ``SO" mode which is compatible with almost all CM systems. One can see that the extra effort of using a high-performance storage device to exploit both the message-passing and data-parallel paradigms is justified.

For this moderate-sized problem the I/O time is small compared to the factorization time. Since the I/O time scales as N^2 and the LU time scales as N^3, it is evident that for very large problems the I/O time will not be a limiting factor.
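A simple model with assumed I/O bandwidth and LU throughput (neither number is measured here; both are placeholders) illustrates how the N^2 I/O cost falls behind the N^3 factorization cost as the problem grows:

```python
def io_seconds(N, bytes_per_s=25e6):
    return 8 * N * N / bytes_per_s              # single-precision complex matrix

def lu_seconds(N, gflops=1.0):
    return (8.0 / 3.0) * N**3 / (gflops * 1e9)  # approx. complex LU flop count

for N in (2_000, 10_000, 40_000):
    print(f"N={N:6d}  I/O ~{io_seconds(N):8.1f} s   LU ~{lu_seconds(N):10.1f} s")
```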

The implementation of the matrix-filling algorithm on the CM-5 is written in Fortran 77 with calls to the CMMD message-passing library. The resulting code is adequate for investigating the efficiency of the parallel algorithm as measured by the speed-up (the ratio of the elapsed time on one node to the elapsed time on p nodes). The per-node performance is expected to be poor, however, because Fortran 77 code cannot utilize the CM-5 vector units efficiently [79][78]. The data-parallel code for the matrix factor and solve, on the other hand, does utilize the vector units, and the LU and solve operations require the most floating-point operations. The CM-5 implementation is therefore expected to show good overall speed-up.

Memory Requirements
The matrix size is N x N, where N is the number of basis functions (or edges between the triangular facets, or ``faces"; if the scatterer has an open surface then N is slightly less than the number of edges, since boundary edges do not carry basis functions, but this effect is small). Each face has three edges and each edge is shared by two faces, so the relationship between the number of faces N_f and N is N = 3N_f/2.
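A quick consistency check of this face/edge relation on a closed triangulated surface, using the octahedron (8 triangular faces, 12 edges) purely as an example:

```python
N_f = 8                  # triangular faces of an octahedron
N = 3 * N_f // 2         # each face has 3 edges, each edge shared by 2 faces
print(N)                 # -> 12 edges, hence 12 basis functions
```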

In the partitioning approach used to develop the parallel algorithm, each node runs the same program. The matrix is distributed among the P nodes so that the matrix memory required on each node is 8N^2/P bytes, assuming single-precision complex storage. The sequential ParaMoM code requires memory

M_seq = M_0 + M_Z,

where M_0 denotes the portion of memory required by arrays other than the matrix and M_Z = 8N^2 bytes is the memory required by the matrix. In the parallel approach, the local memory of each node must be large enough to hold M_0 plus that node's portion of M_Z, approximately M_Z/P.
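As a back-of-the-envelope feasibility check of this estimate (the values of N, P, and M_0 below are assumptions chosen only for illustration), the per-node requirement M_0 + M_Z/P can be compared against the local memory of a typical node:

```python
def per_node_bytes(N, P, m0_bytes):
    m_z = 8 * N * N                   # single-precision complex matrix, 8 B/entry
    return m0_bytes + m_z / P

N, P = 20_000, 128                    # hypothetical problem and machine size
m0 = 20 * 8 * N                       # assume M_0 ~ a few tens of length-N real arrays
need = per_node_bytes(N, P, m0)
print(f"per-node requirement: {need / 2**20:.1f} MiB (node memory: 16-32 MB)")
```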

The most important task before implementing ParaMoM as a parallel code is to restructure the program so as to reduce M_0; this is very difficult to accomplish. The new version of the ParaMoM code has a substantially lower M_0.

This significant reduction of M_0 makes it possible to use a MIMD computer with a typical configuration of 16 to 32 Mbytes of local memory per node to solve large application problems. For a very large problem, which has a large M_Z and a moderate M_0, an out-of-core algorithm may still be needed to save memory. In the next section we discuss an out-of-core fill algorithm for the CM-5 implementation.




