
Conclusion

In this thesis, we described Wilkes' and Cha's parametric patch model and the associated basis functions, which are defined on a typical patch. Each patch is a nonplanar surface, namely a triangle in curvilinear coordinate space. The EFIE, MFIE, and CFIE formulations were derived for calculating scattering by 3-D arbitrarily shaped conducting bodies with and without dielectric material coating. A sequential computer program was written based on the formulation developed in Chapter 3. This code was merged into the ParaMoM code previously developed by Cha's group at SRC based on the formulation described in Chapter 2. The ParaMoM program calculates the Radar Cross Section (RCS) of conducting bodies whose surfaces are arbitrarily shaped, with or without dielectric coating.

If the unmodified ParaMoM code were parallelized, the large setup memory required would make it impractical for MIMD distributed-memory computers. ParaMoM was therefore modified to reduce the setup memory to a manageable level for a MIMD distributed computer. The parallel implementation of the ParaMoM code is called ParaMoM-MPP. ParaMoM-MPP has been implemented on three different MIMD distributed-memory architectures: the Thinking Machines' CM-5, Intel machines, and the IBM SP-1. To demonstrate the performance of the ParaMoM-MPP code, a series of tests was run to obtain performance data. The performance data listed in the previous chapter show good performance for all three implementations. To demonstrate the accuracy of the formulation derived in this thesis, ParaMoM-MPP was used to compute the scattering for the EMCC test cases. The ParaMoM-MPP RCS results agree well with measured results.

From running the EMCC test cases, we observe that exploiting a target's geometric symmetry gives ParaMoM-MPP an advantage in RCS prediction applications. In the EMCC test cases, we exploited this symmetry to reduce the problem size by up to a factor of eight. In a distributed MIMD system, each node has only a fixed amount of RAM and there is no virtual memory. The above-mentioned reduction in the setup memory requirement therefore enables a given system to solve a larger problem.

The parametric patch model requires fewer unknowns than the flat patch model for the same error tolerance, as demonstrated by Wilkes and Cha in the figures in the previous chapter. To demonstrate this using ParaMoM-MPP, we chose the EMCC single ogive target. First, three symmetry planes are used so that only one eighth of the single ogive is modeled, using 2088 curved triangular patches; there are eight sub-problems to solve, and the average number of equations in each sub-problem is about 3150. Second, the single ogive is modeled without exploiting any symmetry; the dimension of the moment matrix for the whole ogive is chosen to be 7332, and the total allocated memory to run this problem is 21 Mbytes per node. The RCS results for both HH and VV polarizations, with and without symmetry planes, are shown in the corresponding figures. These results show that the number of patches could be much smaller, and they clearly illustrate the advantage of exploiting symmetry in terms of the memory required per node, which is very important for a distributed-memory MPP system.
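
As a rough illustration of why this matters on nodes with fixed memory and no virtual memory, the sketch below estimates the dominant per-node matrix storage with and without three symmetry planes. The matrix order, node count, and 8-byte (single-precision complex) element size are assumptions chosen for illustration, not the measured ParaMoM-MPP figures quoted above.

# Rough per-node storage estimate for the distributed moment matrix
# (illustrative only; order, node count, and element size are assumed).

BYTES_PER_ELEMENT = 8  # assumed: single-precision complex

def per_node_mbytes(order, nodes):
    """Dense order x order matrix storage per node, in Mbytes."""
    return order * order * BYTES_PER_ELEMENT / nodes / 1.0e6

n, nodes = 10_000, 64          # hypothetical problem and machine size
full = per_node_mbytes(n, nodes)

# Three symmetry planes give eight sub-problems of order roughly n/8, solved
# one at a time, so only one sub-matrix (about 1/64 of the storage) is
# resident at any moment.
sub = per_node_mbytes(n // 8, nodes)

print(f"no symmetry        : {full:8.2f} Mbytes/node")
print(f"three sym. planes  : {sub:8.2f} Mbytes/node")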

To take advantage of state-of-the-art computer technology as much as we could, we designed parallel algorithms for all the components of the ParaMoM code. We implemented a coordinated-parallelism algorithm that takes advantage of both the message-passing paradigm and the data-parallel paradigm on the Thinking Machines' CM-5. On the Intel computers, we implemented algorithms that use the NX message-passing library [68] or PVM [69] together with the ScaLAPACK library. The PVM (Parallel Virtual Machine) and ScaLAPACK libraries are used to accomplish message passing and the direct LU matrix solution, respectively. PVM automatically converts data formats among computers from different vendors, making it possible to run an application over a collection of heterogeneous computers. To demonstrate portability, the PVM Intel version of ParaMoM-MPP was ported to the IBM SP-1 with little effort. We also ported the PVM Intel version of ParaMoM-MPP to an FDDI-based, GIGAswitch-connected DEC Alpha farm.

In general, high performance and good speed-up are achieved in these parallel implementations, and the performance scales well with both problem size and machine size. Highly portable code has been developed across three parallel systems with diverse configurations and hardware capabilities by using BLACS/ScaLAPACK on these distributed systems; the BLACS are in turn built on the PVM message-passing programming interface. The code can also easily be ported to a heterogeneous computing environment. The systems used to run ParaMoM-MPP represent most of the state-of-the-art architectures in today's massively parallel processing technology. We may summarize the relative performance of the various parallel machines as follows.

The CM-5 Implementation
The flexibility given to the user to choose among programming models is an important feature of the CM-5, since it lets users choose the technique that is best not only for their application but for each part of their application. The ParaMoM-MPP implementation is a good example. The matrix fill portion of ParaMoM-MPP is optimized by a slab data decomposition. Utilizing the edge numbering scheme to loop over patches rather than basis functions reduces the redundant computation to one Green's function evaluation per source-field patch pair. The matrix fill algorithm has no internode communication, has very good load balancing, and is scalable, as discussed in Chapter 5. The data-parallel program for factoring and solving the matrix equation is written in CM Fortran interfaced with the CMSSL (Connection Machine Scientific Software Library). The factor/solve implementation is efficient because (1) it has good load balance, thanks to the block-cyclic data decomposition; (2) CM Fortran utilizes the full power of the four vector units within each CM-5 node, raising the peak performance per node from 5 Mflops (scalar) to 128 Mflops; and (3) the CMSSL library is well-written and well-respected software. The factor/solve performance on the CM-5 is the best among the three platforms, and the I/O performance is excellent to both the SDA and Data Vault systems. The CM-5 implementation can be ported to any CM-5 partition without change.
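
To make the fill strategy concrete, the following small Python sketch illustrates a slab-decomposed fill in which each node owns a contiguous block of matrix rows. It is not the CM-5 code itself (which is written in Fortran 77 and CM Fortran); the patch list, edge-numbering routine, and pair-interaction routine are placeholders standing in for the ParaMoM kernels. The point is only to show why the decomposition needs no internode communication and evaluates the Green's function interactions once per source-field patch pair.

# Conceptual sketch of the slab-decomposed matrix fill (illustrative only).
# Each node owns a contiguous slab of rows of the moment matrix. It loops over
# patch pairs, evaluates the interaction block once per pair, and scatters the
# result to the basis functions (edges) of that pair. No element owned by
# another node is ever touched, so no internode communication is needed.

import numpy as np

def fill_slab(node_id, num_nodes, patches, edges_of_patch, pair_interaction, n_unknowns):
    rows_per_node = (n_unknowns + num_nodes - 1) // num_nodes
    row_lo = node_id * rows_per_node
    row_hi = min(row_lo + rows_per_node, n_unknowns)
    slab = np.zeros((row_hi - row_lo, n_unknowns), dtype=np.complex64)

    for p_field in patches:                              # field (observation) patch
        owned = [e for e in edges_of_patch(p_field) if row_lo <= e < row_hi]
        if not owned:
            continue                                     # this node owns none of these rows
        for p_source in patches:                         # source patch
            # One interaction (Green's function) evaluation per patch pair.
            block = pair_interaction(p_field, p_source)  # placeholder routine
            for i, e_f in enumerate(edges_of_patch(p_field)):
                if not (row_lo <= e_f < row_hi):
                    continue
                for j, e_s in enumerate(edges_of_patch(p_source)):
                    slab[e_f - row_lo, e_s] += block[i, j]
    return slab

if __name__ == "__main__":
    # Tiny demo: 4 patches with 3 edges each (12 unknowns) and a dummy kernel.
    patches = range(4)
    edges_of = lambda p: [3 * p, 3 * p + 1, 3 * p + 2]
    kernel = lambda pf, ps: np.ones((3, 3), dtype=np.complex64)
    print(fill_slab(0, 2, patches, edges_of, kernel, 12).shape)   # (6, 12)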

The drawback of the CM-5 implementation is that the matrix fill performance is relatively poor. Fortran 77 cannot utilize the vector units in the CM-5 nodes, so the fill is limited by the 5 Mflops peak performance of the CM-5 SPARC scalar node. Although it is possible to utilize the vector units by coding in CM DPEAC (the CM-5 vector-unit assembly language), this is not expected to improve the fill performance much, because the fill computation is by nature very difficult to vectorize and pipeline efficiently.

The Intel Paragon Implementation
Both PVM and NX (the Intel message-passing library) are available on various Intel machines, such as the iPSC/860, the Touchstone Delta, and the Paragon. These implementations are very flexible: users can configure the machine architecture virtually. The BLACS provide good communication routines and ScaLAPACK provides scalable factor/solve functions. Combined with the powerful Paragon nodes, the ParaMoM-MPP code gives good performance for both the matrix fill and the factor/solve. For the matrix fill, a slab data decomposition not only minimizes the redundant computation of the Green's function for each source-field patch pair, as in the CM-5 implementation, but achieves good load balancing as well. However, the block-scattered (block-cyclic) data decomposition gives better load balancing for LU factorization with partial pivoting. Since our Intel implementation is a single program, it is impossible to achieve load balancing for both the matrix fill and the factor/solve without data reshuffling. We therefore implemented ParaMoM-MPP both in a version optimized for the matrix fill (slab data decomposition) and in a version that uses the parallel file system as a buffer to reshuffle the data on each node. We observed that the total wall-clock time for the same application was essentially the same in both versions, while the factor/solve Mflops achieved after data reshuffling was almost double that achieved without reshuffling. We note that the I/O performance is relatively poor on the NAS Paragon, but that may be due in part to the relatively small number of I/O nodes in the NAS Paragon configuration. Portability is achieved through the PVM library, which is available on many systems.
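
The reshuffling step can be viewed purely as a change of ownership rule for each matrix element. The short sketch below contrasts the two rules; the block size and process-grid shape are hypothetical, and in the actual Paragon implementation the data movement is done by writing slabs to the parallel file system and reading them back, not by the in-memory mapping shown here.

# Owner of global matrix element (i, j) under the two decompositions used by
# ParaMoM-MPP (illustrative sketch; all parameters below are hypothetical).

def slab_owner(i, n_unknowns, num_nodes):
    """Slab (contiguous block-of-rows) decomposition used for the matrix fill."""
    rows_per_node = (n_unknowns + num_nodes - 1) // num_nodes
    return i // rows_per_node

def block_cyclic_owner(i, j, nb, prow, pcol):
    """2-D block-cyclic decomposition used by ScaLAPACK's LU factorization."""
    return ((i // nb) % prow, (j // nb) % pcol)

# Reshuffling means rewriting each slab, block by block, so that every element
# lands with the process given by block_cyclic_owner().
if __name__ == "__main__":
    n, nodes, nb, prow, pcol = 7332, 64, 64, 8, 8   # hypothetical parameters
    i, j = 1000, 2500
    print("fill-time owner   :", slab_owner(i, n, nodes))
    print("factor-time owner :", block_cyclic_owner(i, j, nb, prow, pcol))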

The IBM SP-1 Implementation
The IBM SP-1 has the best scalar node performance among the three platforms, and as a result it gives the best per-node matrix fill performance of the platforms tested. The SP-1 is still a new machine in its test stage, and its system software is not yet mature. We believe that the IBM SP series can be very powerful once the software is further developed.

The DEC Alpha Farm Implementation
For testing purposes, we also implemented ParaMoM-MPP on an FDDI-based, GIGAswitch-connected DEC Alpha workstation cluster. Such a distributed system provides higher performance and larger memory per node than MPP systems, which makes the high-performance distributed computing (HPDC) approach to electromagnetic applications feasible and attractive for solving modest-sized problems. However, the communication bandwidth of a typical network does not scale with an increasing number of nodes, so a workstation cluster is probably not a practical solution for large problems, since the LU factorization performance will be limited by the communication bottleneck.

Our contributions and the desirable features of the ParaMoM-MPP implementation are summarized below:

Accuracy

The ParaMoM-MPP code is functionally identical to the mature, stable, and well-validated ParaMoM code. Results that validate ParaMoM-MPP are presented in Chapter 5.

Efficient Parallel Implementation

The overall performance of the parallel code is very good on the target architectures. The solution time for relatively large problems is two orders of magnitude shorter on typical MPPs than on the fastest state-of-the-art single-processor workstation.

Portability

The parallel ParaMoM code was implemented on multiple MPP architectures. Differences between the implementations are superficial.

Scalability

Empirical and analytical results demonstrate that excellent efficiency may be obtained on very large problems and large MPP configurations.

Flexibility to Incorporate Future Developments

The matrix filling algorithm utilizes a flexible data partitioning scheme that can easily be interfaced to a variety of matrix solvers. The out-of-core fill algorithm gives the ability to fill the matrix either in-core or out-of-core; this algorithm has been incorporated and tested in the CM-5 code.
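
A minimal sketch of such an out-of-core fill is given below, assuming the matrix is produced slab by slab and appended to a single file. The file layout, slab size, and helper routine names are ours for illustration and do not reflect the actual CM-5 I/O interface.

# Out-of-core matrix fill sketch (illustrative; file layout and slab size assumed).
# The matrix is produced one slab of rows at a time; each slab is written to disk
# as soon as it is filled, so only one slab is ever resident in memory. A solver
# (in-core or out-of-core) can later stream the slabs back in the same order.

import numpy as np

def fill_out_of_core(n_unknowns, slab_rows, fill_rows, path="moment_matrix.bin"):
    """fill_rows(lo, hi) must return dense rows lo..hi-1 of the moment matrix."""
    with open(path, "wb") as f:
        for lo in range(0, n_unknowns, slab_rows):
            hi = min(lo + slab_rows, n_unknowns)
            slab = fill_rows(lo, hi)                 # shape (hi - lo, n_unknowns)
            slab.astype(np.complex64).tofile(f)      # append slab to the matrix file

if __name__ == "__main__":
    # Dummy fill routine standing in for the real integral evaluations.
    n = 1024
    dummy = lambda lo, hi: np.ones((hi - lo, n), dtype=np.complex64)
    fill_out_of_core(n, slab_rows=128, fill_rows=dummy)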

Delivery of Useful MPP Codes on Multiple Platforms

The ParaMoM-MPP code has been fully tested on three basic configurations: Intel, CM-5, and IBM SP-1.


Future Work

Our work demonstrates that parallel computing and advanced solution techniques are equally important for successfully achieving full-scale aircraft RCS prediction. It is a very successful example of combining state-of-the-art massively parallel processing technology with state-of-the-art computational electromagnetics techniques, and this approach will lead the way to RCS prediction of a full-scale aircraft. To illustrate this, we examine the case of the VFY218. The full-scale VFY218 has a total surface area, including engine ducts and exhaust, of approximately 200 m². An exact method of decomposition can be used to isolate the interior solutions (ducts, exhaust) from the exterior solution (the outside of the aircraft), reducing the surface area of the problem to 160 m². One symmetry plane can be used to halve the 160 m² to 80 m², which corresponds to roughly 900 square wavelengths at 1 GHz and 225 square wavelengths at 500 MHz. At 500 MHz, a sampling rate of 10 points per wavelength creates a moment matrix of dimension 45,000. An out-of-core implementation on a state-of-the-art 512-node IBM SP-2 system, which will be installed at the Cornell Theory Center later in 1994, can handle this case. SRC is developing an advanced method, to appear in ParaMoM 2.0, that is capable of accurate prediction at a sampling rate of about 6 points per wavelength. At 1 GHz, this sampling rate leads to a matrix order of 65,000 for the VFY218. When ParaMoM 2.0 is developed, its parallel implementation will achieve the RCS prediction of a full-scale VFY218. Practical RCS prediction using numerical methods is thus realistic with massively parallel processing technology.
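
The matrix dimensions quoted above follow from a simple sampling estimate, reproduced in the sketch below. The factor of two unknowns per surface sample point is our assumption, chosen because it reproduces the 45,000 and 65,000 figures; it is not taken from the ParaMoM documentation.

# Back-of-the-envelope unknown count for the VFY218 study (illustrative only).
# Assumption: two unknowns (two surface-current components) per sample point.

C = 2.998e8  # speed of light, m/s

def matrix_order(area_m2, freq_hz, points_per_wavelength, unknowns_per_point=2):
    wavelength = C / freq_hz
    samples = area_m2 / wavelength**2 * points_per_wavelength**2
    return unknowns_per_point * samples

# 200 m^2 total surface; ducts/exhaust removed -> 160 m^2; one symmetry plane -> 80 m^2.
area = 80.0

print(f"500 MHz, 10 pts/wavelength: ~{matrix_order(area, 500e6, 10):,.0f} unknowns")
print(f"  1 GHz,  6 pts/wavelength: ~{matrix_order(area, 1e9, 6):,.0f} unknowns")
# Expected output: roughly 44,000-45,000 and 64,000-65,000, respectively.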

It is possible to combine the MoM with other numerical techniques to develop a new algorithm that may substantially reduce the required computation.




