Distributed Computational Electromagnetics Systems

G. Cheng, K. A. Hawick, G. Mortensen*, X. Shen and G. C. Fox
Northeast Parallel Architectures Center
Syracuse University, 111 College Place, Syracuse, NY 13244-4100
*Syracuse Research Corporation, Syracuse, NY
Tel. 315-443-2083, Fax. 315-443-1973
Email: gcheng@npac.syr.edu

(Extended Abstract)

In this paper, we describe our initial work on developing a real-world electromagnetics application on distributed computing systems. A computational electromagnetics (CEM) application for radar cross-section (RCS) modeling of full-scale airborne targets is ported to three networked workstation cluster systems: an IBM SP-1 with an Ethernet connection, an FDDI-based GIGAswitch-connected DEC Alpha farm, and an ATM-connected Sun IPX testbed. We use the ScaLAPACK LU solver from ORNL/UTK to factor and solve the dense matrix that forms the computationally intensive kernel of this application, and the PVM version of the BLACS as the message-passing interface, which gives high portability across the three distributed configurations. Performance data are reported, together with, for comparison, timing data from the MPP systems on which this application was initially implemented, the Intel iPSC/860 and the CM-5.

Most electromagnetics applications reported so far on high-performance computing systems have run on MPP systems. Traditional electromagnetic engineering simulations are limited in most cases by memory requirements, as well as by sequential processing time. Recent advances in workstation cluster technology, represented by IBM's SP-1 system and Digital's Alpha farm, provide high-performance node processors and sufficient memory capacity for this type of application. Together with the emergence of portable distributed linear-algebra packages such as the BLACS and ScaLAPACK, and of portable message-passing interfaces such as PVM and MPI, cluster-based computational electromagnetics systems offer an attractive solution for numerical simulation in engineering electromagnetic analysis and design at modest problem sizes. The workstation cluster solution achieves better cost/performance than MPP solutions, which is especially important for real-world applications because workstation clusters are more accessible to engineers than MPP systems.

The CEM application used in this work is a well-established CEM package from Syracuse Research Corporation (SRC) named PARAMOM, which stands for Parametric Patch Method of Moments (MoM). The most distinctive feature of SRC's PARAMOM is its use of basis functions that conform to curved parametric surfaces, which yields accurate and stable simulation results. The ability of PARAMOM to account accurately for surface curvature through parametric surface patches has in many cases shown a significant advantage over conventional techniques that employ flat facets as modeling elements.

The PARAMOM code consists of five basic processing phases with different computational requirements in terms of CPU time and memory consumption, as shown in Table 1.

Table 1: Processing Phases of Sequential PARAMOM and Their Computational Requirements
---------------------------------------------------------------------
Phase   Component                       CPU time    Memory
---------------------------------------------------------------------
1       setup                           O(N)        O(N)
2       matrix fill                     O(N^2)      O(N^2)
3       RHS vector fill                 O(N)        O(N)
4       matrix factor/solve             O(N^3)      O(N^2)
5       scattered field computation     O(N)        O(N)
---------------------------------------------------------------------
where N is the number of unknowns (proportional to the surface area of the target).
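As a rough illustration of the scalings in Table 1, the short Python sketch below estimates the in-core storage and the order of the operation counts for a given number of unknowns N. The sample value N = 10,000 and the assumption of 8-byte single-precision complex matrix elements are ours for illustration only and are not figures taken from the benchmark cases.

    def paramom_estimates(n_unknowns, bytes_per_element=8):
        """Rough resource estimates for the dense-matrix phases of Table 1.

        bytes_per_element = 8 assumes single-precision complex storage
        (two 4-byte reals per element); this is an illustrative assumption,
        not a figure quoted in the text.
        """
        n = n_unknowns
        z_matrix_bytes = bytes_per_element * n * n   # phases 2 and 4 memory, O(N^2)
        fill_work = n * n                            # phase 2 element evaluations, O(N^2)
        lu_work = n ** 3                             # phase 4 operation count, O(N^3) up to a constant
        return {
            "Z-matrix storage (MB)": z_matrix_bytes / 2**20,
            "matrix-fill element evaluations": fill_work,
            "LU operations (order of)": lu_work,
        }

    # Illustrative example: 10,000 unknowns need roughly 760 MB for the dense
    # Z-matrix alone, which is why the matrix is distributed across node memories.
    for name, value in paramom_estimates(10_000).items():
        print(f"{name}: {value:.3g}")

It is this memory pressure from the dense Z-matrix that motivates the distributed partition described next.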
Because the computations in phases 2-5 all operate on the impedance matrix (Z-matrix), we use the ScaLAPACK SSB (scattered square block, i.e., two-dimensional block-cyclic) partition to decompose the matrix in the parallel algorithms for phases 2-5. Our algorithm design focuses on the Z-matrix fill, with the ScaLAPACK LU solver package used for the matrix factor and solve. The major task is to partition the computation of the Z-matrix elements amongst processors by first distributing the memory allocated to the Z-matrix amongst processors. Since we employ the ScaLAPACK LU solver, which uses the SSB partition scheme, we use the same partition scheme for the fill as for the LU, in order to avoid the communication overhead of redistributing the Z-matrix among processors. Furthermore, in our current implementation we keep the identical data partitioning scheme and (BLACS) virtual machine configuration throughout phases 2-5. We also parallelize the other computational components where necessary to optimize the overall performance of the application. Our complete parallel implementation includes:

1) parallel input of the target geometry data and a parallel algorithm for the precomputation; only global broadcast operations are used in this setup phase.

2) an embarrassingly parallel algorithm for the Z-matrix fill, which requires some redundant calculation of patch-patch interactions but no inter-node communication (a schematic sketch of this owner-computes fill follows the list).

3) a similar embarrassingly parallel algorithm for filling the RHS excitation vectors, also with no inter-node communication.

4) the parallel ScaLAPACK LU solver (from ORNL/UTK) for the Z-matrix factor and solve. The ScaLAPACK LU package is a distributed-memory version of the LAPACK LU solver, with a block-oriented algorithm and a scalable data partition, built on the BLAS, BLACS, and PBBLAS. Ours is a COMPLEX implementation.

5) a parallel algorithm for the far-field calculation; only global summation operations are required.
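To make the data layout concrete, the following Python sketch illustrates, in schematic form, the owner-computes Z-matrix fill over the two-dimensional block-cyclic (SSB) distribution used by ScaLAPACK: each process evaluates only the matrix entries it owns, so no inter-node communication is needed. The block size, the process-grid shape, and the placeholder patch_interaction function are illustrative assumptions of ours; they stand in for, and are not, the actual PARAMOM kernels, which are written against the BLACS/ScaLAPACK interfaces.

    import numpy as np

    def block_cyclic_owner(i, j, nb, nprow, npcol):
        """Map a global entry (i, j) to its owning process (pr, pc) and local
        indices (li, lj) under a 2D block-cyclic (SSB) layout with square
        nb x nb blocks on an nprow x npcol process grid, as in ScaLAPACK."""
        pr, pc = (i // nb) % nprow, (j // nb) % npcol      # owning process coordinates
        li = (i // (nb * nprow)) * nb + i % nb             # local row index on that process
        lj = (j // (nb * npcol)) * nb + j % nb             # local column index
        return (pr, pc), (li, lj)

    def patch_interaction(m, n):
        """Placeholder for the MoM patch-patch interaction integral Z[m, n];
        the real kernel integrates over curved parametric surface patches."""
        return complex(m + 1, n + 1) / (1.0 + abs(m - n))

    def local_fill(myrow, mycol, n_unknowns, nb, nprow, npcol):
        """Owner-computes fill: this process evaluates only the Z entries it
        owns under the block-cyclic layout, with no communication."""
        # Count of locally owned rows/columns (what ScaLAPACK's NUMROC returns).
        nloc_r = sum(1 for i in range(n_unknowns) if (i // nb) % nprow == myrow)
        nloc_c = sum(1 for j in range(n_unknowns) if (j // nb) % npcol == mycol)
        z_local = np.zeros((nloc_r, nloc_c), dtype=np.complex64)
        for i in range(n_unknowns):
            for j in range(n_unknowns):
                (pr, pc), (li, lj) = block_cyclic_owner(i, j, nb, nprow, npcol)
                if (pr, pc) == (myrow, mycol):
                    z_local[li, lj] = patch_interaction(i, j)
        return z_local

    # Illustrative example: on a 2 x 2 process grid with 64 x 64 blocks and
    # 200 unknowns, process (0, 0) fills only its own 128 x 128 piece of Z.
    print(local_fill(0, 0, n_unknowns=200, nb=64, nprow=2, npcol=2).shape)

A production fill would loop only over the locally owned blocks rather than testing ownership of every global entry; the sketch tests ownership explicitly only to keep the mapping visible.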
Using the PVM version of the BLACS underneath the ScaLAPACK package, we were able to port the parallel implementation of the PARAMOM code to three platforms with different configurations: 1) an 8-node IBM SP-1 with an Ethernet connection; 2) an 8-node Alpha farm with an FDDI network; and 3) a 2-node Sun IPX testbed connected by an ATM link.

In conclusion, we have achieved the following:

1) Parallel algorithms for the matrix fill and solve were developed for this application. High performance and good speedup were observed in the parallel implementation. High data locality and good load balance were achieved for the parallel matrix fill, and performance scales well with both the problem size and the machine size. (Performance Scalability)

2) Highly portable code development across three platforms with diverse configurations and hardware capabilities was achieved by using the BLACS/ScaLAPACK on the distributed systems, implemented on top of the PVM message-passing programming interface. The code can also run in a heterogeneous computing environment. The three systems represent state-of-the-art architectures in today's workstation cluster technology. (Program Portability)

3) Real-world test cases (i.e., Electromagnetic Code Consortium benchmarking data) were run to test the parallel implementation on the three systems. These tests, together with the unique modeling features of PARAMOM, demonstrate that distributed computing technology is a viable approach to solving computational electromagnetics engineering problems of modest size in industry. (Applicability)

4) Performance comparisons of the implementations on the three platforms were conducted and the related system issues were analyzed. This application is also a good candidate for benchmarking purposes, based on the following three characteristics:
   a) two distinct computationally intensive components, matrix fill and matrix LU solve, with different computational requirements in processor performance (fill), cache and memory access (fill and solve), and inter-processor communication (solve only);
   b) the in-core memory used to store the dense matrix must be large enough to run the application on a coarse-grained cluster configuration, and parallel I/O must be used to achieve balanced setup and matrix fill/solve performance;
   c) this RCS problem represents a typical (classic) problem in computational electromagnetics, using boundary integral methods and involving heavy computation to fill and solve a large dense complex matrix. (Benchmarking)

---------------------------------------------------------------------------------
Format preferred: 20-minute presentation
Conference themes: Distributed Computing, Scalable Parallel Algorithms and Implementations in Electromagnetics, or Parallel Matrix Computation