Distributed Computational Electromagnetics Systems

G. Cheng, K. A. Hawick, G. Mortensen*, X. Shen and G. C. Fox
Northeast Parallel Architectures Center
Syracuse University, 111 College Place, Syracuse, NY 13244-4100
*Syracuse Research Corporation, Syracuse, NY
Tel. 315-443-2083, Fax. 315-443-1973
Email: gcheng@npac.syr.edu

(Extended Abstract)

In this paper, we describe our initial work on developing a real-world electromagnetics application on distributed computing systems. A computational electromagnetics (CEM) application for radar cross-section (RCS) modeling of full-scale airborne targets is ported to three networked workstation cluster systems: an IBM SP-1 with an Ethernet connection, an FDDI-based GIGAswitch-connected DEC Alpha farm, and an ATM-connected Sun IPX testbed. We use the ScaLAPACK LU solver from ORNL/UTK to factor and solve the dense matrix that forms the computationally intensive kernel of this application, and the PVM version of the BLACS as the message-passing interface, which gives high portability across the three distributed configurations. Performance data are reported, together with, for comparison, timing data from the MPP systems on which this application was initially implemented, the Intel iPSC/860 and the CM-5.

Most electromagnetics applications reported so far on high-performance computing systems have run on MPP systems. Traditional electromagnetic engineering simulations are limited in most cases by memory requirements, as well as by sequential processing time. Recent advances in workstation cluster technology, represented by IBM's SP-1 system and Digital's Alpha farm, provide high-performance node processors and sufficient memory capacity for this type of application. Together with the emergence of portable distributed linear-algebra packages such as the BLACS and ScaLAPACK, and of portable message-passing interfaces such as PVM and MPI, cluster-based computational electromagnetics systems offer an attractive solution for numerical simulation in engineering electromagnetic analysis and design at modest problem sizes. The workstation cluster solution achieves better cost/performance than MPP solutions, which is especially important for real-world applications because workstation clusters are more accessible to engineers than MPP systems.

The CEM application used in this work is a well-established CEM package from Syracuse Research Corporation (SRC) named PARAMOM, which stands for Parametric Patch Method of Moments (MoM). The most distinctive feature of SRC's PARAMOM is its use of basis functions that conform to curved parametric surfaces, which yields accurate and stable simulation results. The ability of PARAMOM to account accurately for surface curvature through parametric surface patches has in many cases shown a significant advantage over conventional techniques that employ flat facets as modeling elements.

The PARAMOM code consists of five basic processing phases with different computational requirements in terms of CPU time and memory consumption, as shown in Table 1.

Table 1: Processing Phases of Sequential PARAMOM and Their Computational Requirements
---------------------------------------------------------------------
Phase   Component                       CPU time    Memory
---------------------------------------------------------------------
1       setup                           O(N)        O(N)
2       matrix fill                     O(N^2)      O(N^2)
3       RHS vector fill                 O(N)        O(N)
4       matrix factor/solve             O(N^3)      O(N^2)
5       scattered field computation     O(N)        O(N)
---------------------------------------------------------------------
where N is the number of unknowns (proportional to the surface area of the target).
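As a rough illustration of the scalings in Table 1, the short Python sketch below estimates the in-core storage and the order of the operation counts for a given number of unknowns N. The sample value N = 10,000 and the assumption of 8-byte single-precision complex matrix elements are ours for illustration only and are not figures taken from the benchmark cases.

    def paramom_estimates(n_unknowns, bytes_per_element=8):
        """Rough resource estimates for the dense-matrix phases of Table 1.

        bytes_per_element = 8 assumes single-precision complex storage
        (two 4-byte reals per element); this is an illustrative assumption,
        not a figure quoted in the text.
        """
        n = n_unknowns
        z_matrix_bytes = bytes_per_element * n * n   # phases 2 and 4 memory, O(N^2)
        fill_work = n * n                            # phase 2 element evaluations, O(N^2)
        lu_work = n ** 3                             # phase 4 operation count, O(N^3) up to a constant
        return {
            "Z-matrix storage (MB)": z_matrix_bytes / 2**20,
            "matrix-fill element evaluations": fill_work,
            "LU operations (order of)": lu_work,
        }

    # Illustrative example: 10,000 unknowns need roughly 760 MB for the dense
    # Z-matrix alone, which is why the matrix is distributed across node memories.
    for name, value in paramom_estimates(10_000).items():
        print(f"{name}: {value:.3g}")

It is this memory pressure from the dense Z-matrix that motivates the distributed partition described next.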
Because the computations in phases 2-5 all operate on the impedance matrix (Z-matrix), we use the ScaLAPACK SSB (scattered square block, i.e., two-dimensional block-cyclic) partition to decompose the matrix in the parallel algorithms for phases 2-5. Our algorithm design focuses on the Z-matrix fill, with the ScaLAPACK LU solver package used for the matrix factor and solve. The major task is to partition the computation of the Z-matrix elements amongst processors by first distributing the memory allocated to the Z-matrix amongst processors. Since we employ the ScaLAPACK LU solver, which uses the SSB partition scheme, we use the same partition scheme for the fill as for the LU, in order to avoid the communication overhead of redistributing the Z-matrix among processors. Furthermore, in our current implementation we keep the identical data partitioning scheme and (BLACS) virtual machine configuration throughout phases 2-5. We also parallelize the other computational components where necessary to optimize the overall performance of the application. Our complete parallel implementation includes:

1) parallel input of the target geometry data and a parallel algorithm for the precomputation; only global broadcast operations are used in this setup phase.

2) an embarrassingly parallel algorithm for the Z-matrix fill, which requires some redundant calculation of patch-patch interactions but no inter-node communication (a schematic sketch of this owner-computes fill follows the list).

3) a similar embarrassingly parallel algorithm for filling the RHS excitation vectors, also with no inter-node communication.

4) the parallel ScaLAPACK LU solver (from ORNL/UTK) for the Z-matrix factor and solve. The ScaLAPACK LU package is a distributed-memory version of the LAPACK LU solver, with a block-oriented algorithm and a scalable data partition, built on the BLAS, BLACS, and PBBLAS. Ours is a COMPLEX implementation.

5) a parallel algorithm for the far-field calculation; only global summation operations are required.
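To make the data layout concrete, the following Python sketch illustrates, in schematic form, the owner-computes Z-matrix fill over the two-dimensional block-cyclic (SSB) distribution used by ScaLAPACK: each process evaluates only the matrix entries it owns, so no inter-node communication is needed. The block size, the process-grid shape, and the placeholder patch_interaction function are illustrative assumptions of ours; they stand in for, and are not, the actual PARAMOM kernels, which are written against the BLACS/ScaLAPACK interfaces.

    import numpy as np

    def block_cyclic_owner(i, j, nb, nprow, npcol):
        """Map a global entry (i, j) to its owning process (pr, pc) and local
        indices (li, lj) under a 2D block-cyclic (SSB) layout with square
        nb x nb blocks on an nprow x npcol process grid, as in ScaLAPACK."""
        pr, pc = (i // nb) % nprow, (j // nb) % npcol      # owning process coordinates
        li = (i // (nb * nprow)) * nb + i % nb             # local row index on that process
        lj = (j // (nb * npcol)) * nb + j % nb             # local column index
        return (pr, pc), (li, lj)

    def patch_interaction(m, n):
        """Placeholder for the MoM patch-patch interaction integral Z[m, n];
        the real kernel integrates over curved parametric surface patches."""
        return complex(m + 1, n + 1) / (1.0 + abs(m - n))

    def local_fill(myrow, mycol, n_unknowns, nb, nprow, npcol):
        """Owner-computes fill: this process evaluates only the Z entries it
        owns under the block-cyclic layout, with no communication."""
        # Count of locally owned rows/columns (what ScaLAPACK's NUMROC returns).
        nloc_r = sum(1 for i in range(n_unknowns) if (i // nb) % nprow == myrow)
        nloc_c = sum(1 for j in range(n_unknowns) if (j // nb) % npcol == mycol)
        z_local = np.zeros((nloc_r, nloc_c), dtype=np.complex64)
        for i in range(n_unknowns):
            for j in range(n_unknowns):
                (pr, pc), (li, lj) = block_cyclic_owner(i, j, nb, nprow, npcol)
                if (pr, pc) == (myrow, mycol):
                    z_local[li, lj] = patch_interaction(i, j)
        return z_local

    # Illustrative example: on a 2 x 2 process grid with 64 x 64 blocks and
    # 200 unknowns, process (0, 0) fills only its own 128 x 128 piece of Z.
    print(local_fill(0, 0, n_unknowns=200, nb=64, nprow=2, npcol=2).shape)

A production fill would loop only over the locally owned blocks rather than testing ownership of every global entry; the sketch tests ownership explicitly only to keep the mapping visible.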
Using the PVM version of the BLACS underneath the ScaLAPACK package, we were able to port the parallel implementation of the PARAMOM code to three platforms with different configurations: 1) an 8-node IBM SP-1 with an Ethernet connection; 2) an 8-node Alpha farm with an FDDI network; and 3) a 2-node Sun IPX testbed connected by an ATM link.

In conclusion, we have achieved the following:

1) Parallel algorithms for the matrix fill and solve were developed for this application. High performance and good speedup were observed in the parallel implementation. High data locality and good load balance were achieved for the parallel matrix fill, and performance scales well with both the problem size and the machine size. (Performance Scalability)

2) Highly portable code development across three platforms with diverse configurations and hardware capabilities was achieved by using the BLACS/ScaLAPACK on the distributed systems, implemented on top of the PVM message-passing programming interface. The code can also run in a heterogeneous computing environment. The three systems represent state-of-the-art architectures in today's workstation cluster technology. (Program Portability)

3) Real-world test cases (i.e., Electromagnetic Code Consortium benchmarking data) were run to test the parallel implementation on the three systems. These tests, together with the unique modeling features of PARAMOM, demonstrate that distributed computing technology is a viable approach to solving computational electromagnetics engineering problems of modest size in industry. (Applicability)

4) Performance comparisons of the implementations on the three platforms were conducted and the related system issues were analyzed. This application is also a good candidate for benchmarking purposes, based on the following three characteristics:
   a) two distinct computationally intensive components, matrix fill and matrix LU solve, with different computational requirements in processor performance (fill), cache and memory access (fill and solve), and inter-processor communication (solve only);
   b) the in-core memory used to store the dense matrix must be large enough to run the application on a coarse-grained cluster configuration, and parallel I/O must be used to achieve balanced setup and matrix fill/solve performance;
   c) this RCS problem represents a typical (classic) problem in computational electromagnetics, using boundary integral methods and involving heavy computation to fill and solve a large dense complex matrix. (Benchmarking)

---------------------------------------------------------------------------------
Format preferred: 20-minute presentation
Conference themes: Distributed Computing, Scalable Parallel Algorithms and Implementations in Electromagnetics, or Parallel Matrix Computation