In this thesis, we described the parametric patch model of Wilkes and Cha and its basis functions, which are defined on a typical patch. Each patch is a nonplanar surface, namely a triangle in curvilinear coordinate space. The EFIE, MFIE, and CFIE formulations are derived for calculating scattering by 3-D arbitrarily shaped conducting bodies with and without dielectric material coating. A sequential computer program has been written based on the formulation developed in Chapter 3. This code has been merged into the ParaMoM code previously developed by Cha's group at SRC based on the formulation described in Chapter 2. The ParaMoM program calculates the Radar Cross Section (RCS) of conducting bodies whose surfaces are arbitrarily shaped, with or without dielectric coating.
If the unmodified ParaMoM code were parallelized, the large setup memory required would make it impractical for MIMD distributed-memory computers. ParaMoM was therefore modified to reduce the setup memory to a manageable level for such machines. The parallel implementation of the ParaMoM code is called ParaMoM-MPP. ParaMoM-MPP has been implemented on three different MIMD distributed-memory architectures: the Thinking Machines CM-5, the Intel machines, and the IBM SP-1. To demonstrate the performance of the ParaMoM-MPP code, a series of tests was run to obtain performance data. The performance data listed in the previous chapter show good performance for all three implementations. To demonstrate the accuracy of the formulation derived in the thesis, ParaMoM-MPP has been used to compute the scattering for the EMCC test cases. The ParaMoM-MPP RCS results agree well with the measured results.
From running the EMCC test cases, we observe that exploiting the target's geometric symmetry gives ParaMoM-MPP an advantage in RCS prediction applications. In the EMCC test cases, we take advantage of this symmetry to reduce the problem size by a factor of up to eight. In a distributed MIMD system there is only a fixed amount of RAM on each node and no virtual memory, so the above-mentioned reduction in the setup memory requirement enables a given system to solve a larger problem.
The parametric patch model requires fewer unknowns than the flat patch model for the same error tolerance, as demonstrated by Wilkes and Cha in the figures presented in the previous chapter.
To demonstrate this using ParaMoM-MPP, we choose the EMCC single ogive target. First, three symmetry planes are used so that only one eighth of the single ogive is modeled, using 2088 curved triangular patches. There are eight sub-problems to solve, and the average number of equations in each sub-problem is about 3150. Second, the single ogive is modeled without exploiting any symmetry; the dimension of the moment matrix for the whole ogive is chosen to be 7332, and the total memory allocated to run this problem is 21 Mbytes per node. The RCS results for both HH and VV polarizations, with and without symmetry planes, are shown in the figures of the previous chapter. The agreement shows that the problem can be solved accurately with far fewer patches, and it clearly illustrates the advantage of taking symmetry into account in terms of the per-node memory requirement. This is very important for a distributed-memory MPP system.
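To make the per-node memory argument concrete, the following minimal Python sketch estimates the dense-matrix storage per node with and without the three symmetry planes. The 32-node partition size, the 8-byte single-precision complex element, and the hypothetical full-mesh model at the same patch density are illustrative assumptions; only the roughly 3150 unknowns per sub-problem comes from the test case above.

    # Hedged sketch: per-node storage of an evenly distributed dense
    # complex moment matrix. All inputs below are illustrative assumptions.
    BYTES_PER_ELEMENT = 8          # single-precision complex (4 + 4 bytes)

    def per_node_mbytes(n, nodes):
        """Even split of an n x n complex matrix over the given node count."""
        return n * n * BYTES_PER_ELEMENT / nodes / 1.0e6

    n_sub = 3150          # unknowns in one symmetry sub-problem (from the text)
    n_full = 8 * n_sub    # hypothetical full mesh at the same patch density
    nodes = 32            # hypothetical partition size

    print(per_node_mbytes(n_sub, nodes))    # one sub-problem at a time
    print(per_node_mbytes(n_full, nodes))   # the same mesh without symmetry
    # Each sub-problem matrix is 1/64 the size of the full-mesh matrix,
    # and even the sum over all eight sub-problems is only 1/8 of it,
    # which is where the large per-node memory saving comes from.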
In order to take advantage of state-of-the-art computer technology as fully as possible, we designed parallel algorithms for all components of the ParaMoM code. We implemented a coordinated-parallelism algorithm that exploits both the message passing paradigm and the data parallel paradigm on the Thinking Machines CM-5. On the Intel computers, we implemented algorithms that use the NX message passing library [68] or the PVM [69] and ScaLAPACK libraries; PVM (Parallel Virtual Machine) and ScaLAPACK are used to accomplish the message passing and the direct LU matrix solution, respectively. PVM automatically converts data formats among computers from different vendors, making it possible to run an application over a collection of heterogeneous computers. To demonstrate portability, the PVM Intel version of ParaMoM-MPP was ported to the IBM SP-1 with little effort. We also ported the PVM Intel version of ParaMoM-MPP to an FDDI-based, GIGAswitch-connected DEC Alpha farm.
In general, high performance and good speed-up are achieved in these parallel implementations, and the performance scales well with both problem size and machine size. Highly portable code across three parallel systems with diverse configurations and hardware capabilities has been achieved by using the BLACS/ScaLAPACK libraries on the distributed systems, which in turn rest on the PVM message-passing programming interface. The code can also be easily ported to a heterogeneous computing environment. The systems used to run ParaMoM-MPP represent most of the state-of-the-art architectures in today's massively parallel processing technology. We may summarize the relative performance of the various parallel machines as follows.
The CM-5 Implementation
The flexibility given to a user to choose among programming models is an important feature of the CM-5, since it lets users choose the technique that is best not only for their application as a whole but for each part of it. The ParaMoM-MPP implementation is a good example. The matrix fill portion of ParaMoM-MPP is optimized by a slab data decomposition. Utilizing the edge numbering scheme to loop over patches rather than basis functions reduces the redundant computation to one Green's function evaluation for each source-field patch pair. The matrix fill algorithm has no internode communication and very good load balancing, and it is scalable, as discussed in Chapter 5.
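The following is a minimal Python sketch of the slab fill idea described above: each node owns a contiguous slab of matrix rows and loops over source-field patch pairs, evaluating each patch-pair interaction once and scattering it into the rows it owns. The patch list, the edge-numbering helper, and the interaction kernel are hypothetical placeholders, not the ParaMoM-MPP routines.

    # Hedged sketch of a slab-decomposed matrix fill on one node.
    import numpy as np

    def fill_local_slab(rank, nodes, n_unknowns, patches, edges_of_patch,
                        patch_pair_interaction):
        """Fill the slab of rows owned by `rank`; no internode communication."""
        rows_per_node = (n_unknowns + nodes - 1) // nodes
        row_lo = rank * rows_per_node
        row_hi = min(row_lo + rows_per_node, n_unknowns)
        local = np.zeros((row_hi - row_lo, n_unknowns), dtype=np.complex64)

        for p_field in patches:                     # field (testing) patch
            field_edges = edges_of_patch(p_field)   # global basis indices
            if not any(row_lo <= e < row_hi for e in field_edges):
                continue                            # nothing owned on this node
            for p_src in patches:                   # source patch
                # One interaction (one set of Green's function evaluations)
                # per source-field patch pair; `block` couples every edge
                # of the field patch to every edge of the source patch.
                block = patch_pair_interaction(p_field, p_src)
                src_edges = edges_of_patch(p_src)
                for i, e_f in enumerate(field_edges):
                    if row_lo <= e_f < row_hi:
                        for j, e_s in enumerate(src_edges):
                            local[e_f - row_lo, e_s] += block[i, j]
        return local

Because every node simply skips patch pairs whose field edges it does not own, the fill proceeds without any internode communication, which matches the behavior described above.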
The data parallel program for factoring and solving the matrix equation is written in CM Fortran interfaced with the CMSSL (Connection Machine Scalable Scientific Library). The factor/solve implementation is efficient because (1) it has good load balance, thanks to the block cyclic data decomposition; (2) CM Fortran utilizes the full power of the four vector units within each CM-5 node, raising the peak performance from 5 Mflops for a scalar CM-5 node to 128 Mflops; and (3) the CMSSL library is a well-written and well-respected piece of software. Among the three platforms, the factor/solve performance is best on the CM-5.
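To illustrate why the block cyclic decomposition balances the LU workload, the small sketch below maps a global matrix entry to its owning process in a generic two-dimensional block cyclic layout of the kind used by libraries such as CMSSL and ScaLAPACK; the block sizes and grid shape are arbitrary example values, not the settings used in ParaMoM-MPP.

    # Hedged sketch: owner of entry (i, j) in a 2-D block cyclic layout.
    def block_cyclic_owner(i, j, mb, nb, prow, pcol):
        """Return the (process-row, process-column) that owns global entry
        (i, j) for block sizes mb, nb on a prow x pcol process grid."""
        return ((i // mb) % prow, (j // nb) % pcol)

    # Example: a 2 x 2 process grid with 64 x 64 blocks.
    print(block_cyclic_owner(1000, 2000, 64, 64, 2, 2))
    # Because consecutive blocks cycle over the grid, every process keeps
    # receiving work even as elimination shrinks the active trailing
    # submatrix, which is what gives the good load balance noted above.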
The I/O performance is excellent for both the SDA and the Data Vault systems, and the CM-5 implementation can be ported to all CM-5 partitions without change.
The drawback of the CM-5 implementation is that the matrix fill performance is relatively poor. Fortran 77 cannot utilize the vector units in the CM-5 nodes, so poor performance is expected, since a CM-5 SPARC-2 node has only a 5 Mflops peak performance. Although it is possible to utilize the vector units by coding in CM DPEAC (the CM assembly language), this is not expected to improve the fill performance much, because the fill computation is by its nature very difficult to vectorize and pipeline efficiently.
The Intel Paragon Implementation
Both PVM and NX (the Intel message passing library) implementations run on various Intel machines, such as the iPSC/860, the Touchstone Delta, and the Paragon. These implementations are very flexible, since users can configure the virtual machine architecture as needed. The BLACS provides good communication routines and ScaLAPACK provides scalable factor/solve routines. Combined with the powerful Paragon nodes, the ParaMoM-MPP code gives good performance for both matrix fill and factor/solve.
For the matrix fill, a slab data decomposition not only minimizes the redundant computation of the Green's function for each source-field patch pair, as in the CM-5 implementation, but achieves good load balancing as well. However, a block-scattered data decomposition gives better load balancing for LU factorization with partial pivoting. Since our Intel implementation is a single program, it is impossible to achieve good load balancing for both matrix fill and factor/solve without reshuffling the data. We therefore implemented ParaMoM-MPP not only to optimize the matrix fill (the slab data decomposition) but also to use I/O to a parallel file system as a buffer for reshuffling the data on each node. We observe that about the same total wall clock time is used to run the code for the same application, while the factor/solve Mflops achieved after data reshuffling is almost double the figure obtained without reshuffling. We note that the I/O performance is relatively poor on the NAS Paragon, but that may be due in part to the relatively low number of I/O nodes in the NAS Paragon configuration. Portability is achieved through the PVM library, which is available on many systems.
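A minimal sketch of the reshuffling idea follows, assuming a shared parallel file system, a file pre-created at its full size, and a simple one-dimensional block cyclic solver layout; the record format and library calls of the real code are not shown. Each node writes its slab of rows at their global offsets, then reads back only the rows it owns under the solver's layout.

    # Hedged sketch: reshuffle from a slab layout (fill) to a block cyclic
    # layout (factor/solve) using a shared file as a buffer.
    import numpy as np

    ELEM = np.dtype(np.complex64)      # assumed single-precision complex

    def write_slab(path, local_rows, row_lo, n):
        """Write this node's contiguous rows at their global byte offsets.
        The file is assumed to already exist at its full n x n size."""
        with open(path, "r+b") as f:
            for k in range(local_rows.shape[0]):
                f.seek((row_lo + k) * n * ELEM.itemsize)
                f.write(local_rows[k].astype(ELEM).tobytes())

    def read_block_cyclic_rows(path, my_prow, prow, mb, n):
        """Read back only the rows this process row owns in a 1-D block
        cyclic layout with block size mb (the solver's layout)."""
        mine = []
        with open(path, "rb") as f:
            for r in range(n):
                if (r // mb) % prow == my_prow:
                    f.seek(r * n * ELEM.itemsize)
                    mine.append(np.frombuffer(f.read(n * ELEM.itemsize),
                                              dtype=ELEM))
        return np.vstack(mine) if mine else np.empty((0, n), dtype=ELEM)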
The IBM SP-1 Implementation
The IBM SP-1 has the best scalar node performance among the three platforms, and as a result it gives the best per-node matrix fill performance of the platforms tested. The SP-1 is still a new machine in the test stage, and its system software is not yet mature. It is believed that the IBM SP series can be very powerful once the software is further developed.
The DEC Alpha Farm Implementation
For testing purposes, we also implemented ParaMoM-MPP on an FDDI-based, GIGAswitch-connected DEC Alpha workstation cluster. Such a distributed system provides higher performance and larger memory per node than typical MPP systems, which makes the high performance distributed computing (HPDC) approach to electromagnetic applications feasible and attractive for modest-sized problems. However, the communication bandwidth of a typical network does not scale with an increasing number of nodes. Therefore a workstation cluster is probably not a practical solution for large problems, since the LU factor performance will be limited by the communication bottleneck.
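This communication argument can be made concrete with a crude, back-of-the-envelope model; the flop rate, bandwidth, and the assumption that at least on the order of n squared matrix elements must cross a shared medium of fixed aggregate bandwidth are illustrative assumptions, not measurements from the cluster.

    # Hedged model of LU time on a shared-medium cluster (illustrative only).
    def lu_time_estimate(n, nodes, node_mflops=100.0, shared_mbytes_s=12.0):
        """Crude estimate: complex LU needs about (8/3) n**3 real flops,
        and at least on the order of n**2 elements (8 bytes each, single
        precision complex) must traverse the shared network."""
        compute_s = (8.0 / 3.0) * n**3 / (nodes * node_mflops * 1.0e6)
        comm_s = (n * n * 8.0) / (shared_mbytes_s * 1.0e6)  # no scaling with nodes
        return compute_s, comm_s

    for p in (4, 8, 16, 32):
        print(p, lu_time_estimate(7332, p))
    # The compute term falls like 1/p as nodes are added, while the
    # communication term stays fixed and eventually dominates, which is
    # why the cluster is impractical for very large problems.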
Our contributions, in the form of the desirable features of the ParaMoM-MPP implementation, are summarized below:
Accuracy
The ParaMoM-MPP code is functionally identical to the mature,
stable, and well-validated ParaMoM code. Results that validate ParaMoM-MPP
are presented in Chapter 5.
Efficient Parallel Implementation
The overall performance of the parallel code is very good on the target
architectures. The solution time for relatively large problems is two
orders of magnitude shorter on typical MPPs than on the fastest
state-of-the-art single-processor workstation.
Portability
The parallel ParaMoM code was implemented on multiple MPP architectures.
Differences between the implementations are superficial.
Scalability
Empirical and analytical results demonstrate that excellent efficiency may
be obtained on very large problems and large MPP configurations.
Flexibility to Incorporate Future Developments
The matrix fill algorithm utilizes a flexible data partitioning scheme that can easily be interfaced to a variety of matrix solvers. The out-of-core fill algorithm gives the ability to fill the matrix either in-core or out-of-core; it has been incorporated into, and tested with, the CM-5 code.
Delivery of Useful MPP Codes on Multiple Platforms
The ParaMoM-MPP code has been fully tested on three basic configurations:
Intel, CM-5, and IBM SP-1.
Future Work
Our work demonstrates that parallel computing and advanced solution techniques are equally important for successfully achieving full-scale aircraft RCS prediction. Our work is a very successful example of combining state-of-the-art massively parallel processing technology with state-of-the-art computational electromagnetics techniques. This approach will lead the way to RCS prediction of a full-scale aircraft. To illustrate this, we examine the case of the VFY218. The full-scale VFY218 has a total surface area, including engine ducts and exhaust, of approximately 200 square meters. An exact method of decomposition can be used to isolate the interior solutions (ducts, exhaust) from the exterior solution (outside the aircraft), which reduces the surface area of the problem to 160 square meters. One symmetry plane can be used to halve the 160 square meters to 80 square meters. We then have about 900 square wavelengths at 1 GHz and 225 square wavelengths at 500 MHz. At 500 MHz, a sampling rate of 10 points per wavelength creates a moment matrix of dimension 45,000. An out-of-core implementation on a state-of-the-art 512-node IBM SP-2 system, which will be installed at the Cornell Theory Center later in 1994, can handle this case. SRC is developing an advanced method capable of accurate prediction at a sampling rate of around 6 points per wavelength. This method will be included in ParaMoM 2.0. At 1 GHz, this sampling rate leads to a matrix order of 65,000 for the VFY218. When ParaMoM 2.0 is developed, its parallel implementation will achieve the RCS prediction of a full-scale VFY218. Practical RCS prediction using numerical methods is realistic with massively parallel processing technology.
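The problem-size arithmetic above can be reproduced with the short estimate below. The speed of light value is standard, but the factor of roughly two unknowns per square sampling cell is an assumption chosen here to match the quoted matrix dimensions, not a number taken from the thesis.

    # Hedged sketch reproducing the VFY218 problem-size estimate.
    C = 3.0e8                       # speed of light, m/s

    def unknown_estimate(area_m2, freq_hz, pts_per_wavelength,
                         unknowns_per_cell=2.0):
        """Rough moment-matrix dimension: area in square wavelengths times
        the sampling density squared, times an assumed unknowns-per-cell
        factor (about two for edge-based surface bases)."""
        wavelength = C / freq_hz
        area_sq_wavelengths = area_m2 / wavelength**2
        return area_sq_wavelengths * pts_per_wavelength**2 * unknowns_per_cell

    print(unknown_estimate(80.0, 0.5e9, 10))   # roughly 45,000 at 500 MHz
    print(unknown_estimate(80.0, 1.0e9, 6))    # roughly 64,000 at 1 GHz
    # A 45,000 x 45,000 single-precision complex matrix occupies about
    # 45000**2 * 8 bytes, i.e. roughly 16 Gbytes, hence the need for an
    # out-of-core factorization on a 1994-era system.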
It is also possible to combine the MoM with other numerical techniques to develop a new algorithm, and such an algorithm may further reduce the computational cost of the present approach.