For the purpose of measuring performance, the parallel RCS prediction code has been run for a set of conducting spheres. The number of equations ranges from a few hundred to about ten thousand, and there is only one right-hand side (RHS) vector.
The run time of each part of the parallel code gives a better view of the CPU time distribution from a user's perspective. The machine size (the number of nodes in the system) is fixed, while the problem size (the number of equations to be solved) is varied. For the CM-5 implementation, two machine sizes are chosen to present the run times: the 32-node CM-5 at NPAC and the 512-node CM-5 at MSCI/AHPCRC.
In Tables and , the matrix fill time, the total time for the extra I/O operations of the CM-5 implementation, the far-field time, and the precomputation time are presented. The fill, far-field, and precomputation times are recorded in Tables and using the CMMD timer.
The w/r time is the total time required for writing out the matrix and RHS vector and reading in the solution vectors, measured with the CMMD timer, plus the time required for reading in the matrix and RHS vectors and writing out the solution vectors, measured with the CM Fortran timer. The w/r time is elapsed time; the remaining times are CPU busy times.
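As an illustration only, the following minimal C sketch shows how such a composite w/r figure can be assembled: two elapsed-time measurements, one around the message-passing side of the transfer and one around the data-parallel side, are summed. Here gettimeofday() merely stands in for the CMMD and CM Fortran timers, and the commented-out routine names (write_matrix_and_rhs, read_solution, and so on) are hypothetical placeholders, not routines from the actual code.

    /* Sketch: sum two elapsed-time readings into one w/r figure. */
    #include <stdio.h>
    #include <sys/time.h>

    static double elapsed_seconds(struct timeval t0, struct timeval t1)
    {
        return (t1.tv_sec - t0.tv_sec) + (t1.tv_usec - t0.tv_usec) * 1.0e-6;
    }

    int main(void)
    {
        struct timeval t0, t1;
        double t_mp, t_dp;

        gettimeofday(&t0, NULL);
        /* write_matrix_and_rhs();  read_solution();    message-passing side */
        gettimeofday(&t1, NULL);
        t_mp = elapsed_seconds(t0, t1);

        gettimeofday(&t0, NULL);
        /* read_matrix_and_rhs();   write_solution();   data-parallel side   */
        gettimeofday(&t1, NULL);
        t_dp = elapsed_seconds(t0, t1);

        printf("w/r time: %.3f s\n", t_mp + t_dp);
        return 0;
    }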
The last column in Table is the Mflops rate achieved by the matrix factorization. The Mflops figure is computed as $\mathrm{Mflops} = (8N^{3}/3)/(t \times 10^{6})$, where $t$ is the factorization time in seconds: for the single-precision complex type, the CMSSL LU routine takes approximately $\frac{8}{3}N^{3}$ real operations to factorize an $N \times N$ matrix. The time used to compute the Mflops is the CPU busy time recorded by the CM Fortran timer.
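Purely as an illustration of this arithmetic, the minimal C sketch below computes the Mflops figure from a problem size n and a CPU busy time cpu_time; both values are placeholders, not measured data from the tables.

    /* Sketch: Mflops for complex LU, assuming (8/3)*N^3 real operations. */
    #include <stdio.h>

    int main(void)
    {
        double n = 10000.0;      /* number of equations (placeholder)      */
        double cpu_time = 1.0;   /* CPU busy time in seconds (placeholder) */
        double ops = (8.0 / 3.0) * n * n * n;
        double mflops = ops / (cpu_time * 1.0e6);

        printf("Mflops = %.1f\n", mflops);
        return 0;
    }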
From the run times listed in Tables and , we see that the extra time required for the write/read operations that connect the CMMD message-passing program and the data-parallel CM Fortran program is indeed very short. The matrix fill and the matrix factorization consume most of the CPU time. The matrix fill is slow for two reasons: the nature of the problem makes the fill code difficult to vectorize, and Fortran 77 cannot access the vector units within a CM-5 node. On the other hand, since CM Fortran code is able to utilize all the vector units associated with each CM-5 node, the matrix factor/solve code uses the system much more efficiently.
We list only the matrix fill time and the matrix factorization time for the Intel machines and the IBM SP-1 implementation, because these are the dominant parts of the code. To present the run times on the Intel machines, we choose the 32-node Intel Paragon at NAS and the 64-node and 512-node partitions of the Touchstone Delta at JPL/Caltech in Table . For the IBM SP-1 implementation, we list the timing data on the 58-node partition and the 32-node partition of the SP-1 at Argonne National Laboratory (ANL). As with the Intel machines, only the fill time and the factor time are listed in Table .
For the purpose of performance comparison, the PVM implementation of the RCS prediction code is ported to an 8-node DEC Alpha farm cluster with 64 Mbytes of memory per node. The Alpha cluster is connected by an FDDI-based GIGAswitch. In Table , we list the CPU time and Mflops obtained on these machines.