TO: Dr. Chung-Chi Cha and Jay Mortensen
FROM: Geoffrey Fox, Gang Cheng
RE: CEM Proposal (NPAC's input)
CC: Northeast Parallel Architectures Center

Applying high performance computing technologies to real-world problems is the central mission of the Northeast Parallel Architectures Center (NPAC) at Syracuse University. NPAC, in collaboration with national research partners, engages in interdisciplinary research, development, education, and technology transfer in high performance computing. Core enabling technologies such as parallel algorithms, software systems, compilers, and databases are developed internally, integrated with related national research efforts, and then applied by NPAC researchers in federally and privately funded research application projects.

The mission of NPAC's ACTION program is to accelerate the use of high performance parallel computing and communications in industry. The outreach activities of the program have resulted in close ties with industrial firms such as NYNEX/New York Telephone, General Electric, IBM, UTC/Carrier, and SIAC. NPAC scientists have expertise and experience in large industrial application projects.

NPAC is a member of the prestigious Center for Research in Parallel Computation (CRPC), one of the 11 original NSF Science and Technology Centers. CRPC is a research consortium of six participating institutions: Argonne National Laboratory, the California Institute of Technology, Los Alamos National Laboratory, Rice University, Syracuse University, and the University of Tennessee. CRPC is committed to a program of research that will make parallel computer systems truly usable - as usable as sequential computer systems are today. CRPC research is both interdisciplinary and interinstitutional; its strength lies in the combined efforts in developing new software tools, new parallel algorithms, and prototype implementations of scientific application programs, and in the resources shared among the six sites. Through our close collaboration with CRPC, NPAC can exploit the newest parallel algorithms for scientific computing and gains access to the most advanced parallel computing systems available today.

The NPAC computational scientists will work together with SRC as a team to evaluate current CEM models, algorithms, and numerical methods. The NPAC contribution to this project will be 1) to design and implement state-of-the-science algorithms and parallel software for the project models, and 2) to provide the parallel systems needed to support this project, including both in-house machines for intensive program development and remote-site machines for production runs and evaluation.

1. The Application Problem and Computational Requirements

In the moment method, first studied by R. F. Harrington [12], the electromagnetic component is modeled by a three-dimensional surface integral equation that formulates the electric current induced on the component's 3D surface. To solve the integral equation, a set of dense matrix equations is assembled, and its solution yields the radar cross section (RCS). More specifically, for our existing Fortran 77 code the physical problem solved can be expressed as follows:

RCS = g(Es(r), Einc(r)) ........... (1)

where g is a quadratic function of Es and Einc, and Einc is the known incident wave. Es(r) is the scattered electric field at point r, computed by

Es = L(J) ........... (2)

where L is a linear operator and J is the induced electric current to be determined on the surface of the scattering component.
The method of moments divides the surface into a collection of small curved triangular patches and uses the following approximation for J:

J = sum(i=1,...,N) Ii fi ........... (3)

where fi is the ith known expansion function and Ii is an unknown coefficient. Based on a boundary condition and a set of chosen testing functions Ti (i = 1, ..., N), the unknown coefficient vector I is computed by solving the matrix equation

[Z][I] = [V] ........... (4)

where V is an excitation vector and [Z] is an NxN matrix called the impedance matrix. Their elements are determined by

Zmn = <Tm, L(fn)> ........... (5)
Vm = <Tm, Einc> ........... (6)

where <,> denotes the testing (inner) product of a testing function with a field over the surface.

A major contribution of the moment method is its capability to model the generally unconfined (open-region) nature of electromagnetic waves and to give highly accurate and stable solutions, which gives it advantages over other classical methods such as finite elements and finite differences. However, because the matrix is dense and its size depends directly on the electrical size of the component, the storage and execution-time requirements grow rapidly with the matrix size N and usually become the limiting factors in engineering applications.

For the discussion of computational requirements and, later, of the proposed approaches, we break the existing code down by time and storage complexity into three major steps (or components):

1. Matrix Fill
At this stage, the surface of the scattering component is discretized into small curved triangular patches, and the elements of the impedance matrix Z and the excitation vector V are computed (by equations (5) and (6) above).
Execution time: O(N^2) flops; more specifically, about 4000 N^2 flops for our Fortran 77 code.
Storage: O(N^2) bytes.

2. Matrix Equation Solve
A linear solver is used to solve the matrix equation (4) above. Because LU decomposition is used in the linear solver, this step can be broken down further into two stages:

1) LU factorization
Instead of solving (4) directly, Z is first factorized into two triangular matrices, [Z] = [L][U].
Execution time: O(N^3) flops; about 3 N^3 flops for the existing Fortran 77 implementation.
Storage: O(N^2) bytes.

2) Matrix solve
Forward and back substitution are used to obtain the final solution of the equation.
Execution time: O(N^2) flops.
Storage: O(N^2) bytes.

3. Field Calculation
The final solution is obtained by completing the computations of equations (3), (2), and (1) above.
Execution time: O(N) flops.
Storage: O(N) bytes.

Notice that for a filled matrix Z, steps 2 and 3 may be repeated a number of times, corresponding to different (independent) excitation vectors V. This, of course, increases the total execution time of the system.

Total Computational Requirements:

For the Phase 1 benchmark cases, with matrix size N = 5,000 - 6,000:
Total computation: 200 Gflop - 400 Gflop, Matrix Fill (50%), LU Factorization (50%).
Total storage (single precision): 250 Mbytes - 350 Mbytes.

For the Phase 2 benchmark cases, with matrix size N = 50,000 - 60,000:
Total computation: 385 Tflop - 663 Tflop, Matrix Fill (3%), LU Factorization (97%).
Total storage (single precision): 25 Gbytes - 35 Gbytes.
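As a quick cross-check of these figures, the short Fortran 77 program below evaluates the operation counts quoted above (about 4000 N^2 flops for the matrix fill and 3 N^3 flops for the LU factorization) and the storage of the impedance matrix alone, assuming 8 bytes per single-precision complex element. It is an illustrative estimate only, not a benchmark.

      PROGRAM ESTIM
C     Rough resource estimate for the moment-method code, using the
C     operation counts quoted in Section 1 (about 4000*N**2 flops for
C     the matrix fill and 3*N**3 flops for the LU factorization) and
C     8 bytes per single-precision complex matrix element.  The byte
C     figure covers the impedance matrix only.
      DOUBLE PRECISION XN, FILL, FLU, TOTAL, BYTES
      INTEGER NCASE(4), K
      DATA NCASE /5000, 6000, 50000, 60000/
      DO 10 K = 1, 4
         XN    = DBLE(NCASE(K))
         FILL  = 4000.0D0 * XN * XN
         FLU   = 3.0D0 * XN * XN * XN
         TOTAL = FILL + FLU
         BYTES = 8.0D0 * XN * XN
         WRITE (*, 20) NCASE(K), FILL, FLU, TOTAL, BYTES
   10 CONTINUE
   20 FORMAT (' N =', I6, ': fill', 1PE10.3, ', LU', 1PE10.3,
     &        ', total', 1PE10.3, ' flops; Z storage', 1PE10.3,
     &        ' bytes')
      END

For the Phase 2 sizes this gives about 3.85 x 10^14 and 6.6 x 10^14 operations, i.e., the 385 - 663 Tflop range quoted above, with the N^3 factorization term clearly dominant.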
2. Discussion of proposed technical approaches

2.1 Summary of the approaches

Based on an evaluation of the nature of this application problem, the existing sequential code, its computational requirements, and the objectives of the proposed study, we plan to develop an explicit message passing parallel implementation of the Fortran 77 model on MIMD (Multiple Instruction, Multiple Data) massively parallel systems.

We choose to implement our CEM application on the Connection Machine CM5, on Intel MPP machines (the Intel iPSC/860, Intel Touchstone Delta, and Intel Paragon XP/S), and, optionally, on the IBM SP-1. These machines are representative of the most advanced technologies in the massively parallel computing industry. We will evaluate and demonstrate MPP implementations of the advanced CEM techniques on the newest MPP architectures available today.

At the most general level, we will evaluate the issues that arise when implementing the parallel model on these MPP systems. More specifically, we must address scalable data decomposition, a stable and efficient parallel linear solver for large dense matrices, effective use of memory, and optimized computation and communication in the system.

We plan to use both in-core and out-of-core parallel linear solvers in our implementation. We choose to adopt the parallel linear solver from ScaLAPACK [1], developed at Oak Ridge National Laboratory by D. W. Walker's research group. Using this scalable and portable linear algebra package for general distributed-memory parallel systems, we can port our implementations to the CM5, the Intel machines, and the IBM SP-1 with minimum effort. We also plan to use vendor-supplied linear algebra packages, i.e., CMSSL on the CM5 and ProSolver-DES on the Intel MPP systems, for comparison and to obtain the most efficient dense matrix solver. To address the storage limitation imposed by the large matrix sizes of Phase 2, we will design an out-of-core scheme for the Phase 2 algorithm design and implementation that makes effective use of the parallel file system and memory hierarchy of the target MPP system.

In Phase 1, we will first convert the Fortran 77 code to the CM5 in Fortran 77 + CMMD (the CM5 message passing library), with the matrix equation solver taken from ScaLAPACK. This will serve as the working message passing implementation, from which the next steps, carried out simultaneously, are 1) to implement a CMFortran + CMMD version in which the matrix equation solver uses the Gaussian elimination routines provided by CMSSL, and 2) to port the code to the Intel iPSC/860 in Fortran 77 + NX (the Intel message passing library). Besides the ScaLAPACK version in Fortran 77 + NX, we will also implement a Fortran 77 + NX version using ProSolver-DES as the dense matrix solver on the Intel machines. As an option, we also plan to port the message passing implementation with ScaLAPACK to an IBM SP-1 system. After the implementation work, we will conduct benchmark experiments on these machines at different sizes (32-node and 1024-node CM5s; 16-node and 32-node Intel iPSC/860s; the 512-node Intel Touchstone Delta; the Intel Paragon XP/S; and, optionally, an 8-node IBM SP-1) to evaluate scalability and performance.

In Phase 2, we will choose the best target machine for the large, full-scale modeling implementation, based on the evaluation of the Phase 1 work in terms of model accuracy and stability, computational efficiency, and projected scalability and performance. We plan to design out-of-core block algorithms that use the parallel file system (e.g., the DataVault on the CM5, CFS on the Intel iPSC/860 and Touchstone Delta, or PFS on the Intel Paragon) as well as the memory hierarchy of the target machine. In this implementation, a vendor-supplied out-of-core linear solver (e.g., the out-of-core LU solver to be supported in CMSSL 3.1, or the out-of-core ProSolver-DES on the Intel MPP systems) will be adopted to achieve high performance.
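To make the data decomposition concrete, the short Fortran 77 sketch below prints the two-dimensional block-scattered (block-cyclic) mapping of the blocks of an N x N matrix onto a P x Q logical process grid, which is the decomposition style used by ScaLAPACK (see Section 2.4). The matrix size, block size, and grid shape used here (N = 16, NB = 4, P = Q = 2) are arbitrary illustration values, not proposed parameters.

      PROGRAM BLKMAP
C     Illustrative 2D block-scattered (block-cyclic) decomposition:
C     global block (IB,JB) of an N x N matrix partitioned into
C     NB x NB blocks is owned by process (MOD(IB,P), MOD(JB,Q)) of
C     a P x Q grid.  The sizes below are demonstration values only.
      INTEGER N, NB, P, Q
      PARAMETER (N = 16, NB = 4, P = 2, Q = 2)
      INTEGER IB, JB, PR, PC
      DO 20 IB = 0, N/NB - 1
         DO 10 JB = 0, N/NB - 1
            PR = MOD(IB, P)
            PC = MOD(JB, Q)
            WRITE (*, 30) IB, JB, PR, PC
   10    CONTINUE
   20 CONTINUE
   30 FORMAT (' global block (', I2, ',', I2, ') -> process (',
     &        I2, ',', I2, ')')
      END

Because the blocks owned by each process are interleaved across the whole matrix, the work stays balanced as the active submatrix shrinks during the LU factorization, which is the main reason this scattering is preferred over a simple contiguous partition.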
2.2 Why message passing?

Message passing gives the programmer maximum flexibility and control in exploiting the domain decomposition and control decomposition of the application problem, so that maximum efficiency on high performance MPP hardware can be achieved through careful design of explicit message passing parallel algorithms. One of the most useful features of message passing programs is that they allow the processing nodes to synchronize as frequently or as infrequently as a given application requires. A message passing program is free to distribute program tasks and data among the nodes as necessary, using synchronization and data communication between nodes only when needed. This freedom in balancing tasks and data makes message passing particularly useful for CEM applications that require dynamic allocation of tasks and data.

In our CEM application, according to the performance data reported in previous similar work [17], the matrix fill part needs to be carefully designed to make efficient use of the MPP's node performance, while the matrix solver part can be optimized using vendor-supplied, assembly-coded routines. Moreover, driven by the central concerns of scalability and of a large dense matrix solver in the algorithm design, a MIMD implementation in the explicit message passing paradigm appears to be the natural approach for this application. Furthermore, we believe MIMD architectures represent the most promising MPP technologies for adoption in industrial applications of high performance computing and communications (HPCC).

2.3 Algorithm Design Issues

In Phase 1, based on the techniques employed in ScaLAPACK, we will first concentrate on developing a block domain decomposition scheme that makes the message passing algorithms for the matrix fill, the matrix equation solver, and the field calculation both scalable (over variable problem and machine sizes) and efficient. Block algorithms, which update a block of contiguous vectors instead of one data element at a time, are the major technique used in ScaLAPACK for LU factorization and matrix-matrix multiplication; they will also be adapted in the algorithms for the matrix fill and the field calculation to obtain a scalable and optimized implementation. Programming with an explicit message passing library, we will implement the block algorithms in a pipelined way, so that calculations on matrix blocks local to a processing node are pipelined (overlapped) with data communication between nodes, and computation and communication are likewise pipelined across the different stages of the block algorithms. Through an optimized global algorithm design and implementation, we expect to make the most efficient use of the working hardware, including the scalable architecture supporting high floating-point vector performance, the high-bandwidth internal networks, and later, through an out-of-core scheme, balanced parallel I/O, processing, and memory bandwidth. We will evaluate the optimal block size of the LU factorization for overall performance on the different benchmark cases.
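For reference, the kernel being blocked here is ordinary LU factorization. The sketch below is a plain, unblocked, unpivoted right-looking factorization of a complex matrix, shown only to make the update pattern explicit; the production code will rely on the blocked, pivoted routines of ScaLAPACK, CMSSL, or ProSolver-DES rather than on a loop like this.

      SUBROUTINE ZLUNP(A, LDA, N)
C     Unblocked, unpivoted right-looking LU factorization of the
C     N x N complex matrix A (illustration of the update pattern
C     only, not the production solver).  On exit, the unit lower
C     triangle of L is stored below the diagonal of A and U in the
C     upper triangle.
      INTEGER LDA, N, I, J, K
      COMPLEX A(LDA, N)
      DO 30 K = 1, N - 1
C        Compute the multipliers (column K of L).
         DO 10 I = K + 1, N
            A(I, K) = A(I, K) / A(K, K)
   10    CONTINUE
C        Rank-1 update of the trailing submatrix.
         DO 20 J = K + 1, N
            DO 15 I = K + 1, N
               A(I, J) = A(I, J) - A(I, K) * A(K, J)
   15       CONTINUE
   20    CONTINUE
   30 CONTINUE
      END

The block algorithms group NB such steps together so that the trailing update becomes a matrix-matrix multiplication; this is what allows the Level 3 BLAS building blocks, the vector units, and the computation/communication pipelining described above to be exploited, and it is why the block size NB is a key tuning parameter.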
We expect that the effort of migrating the Fortran 77 + CMMD code to other MPP systems will be minimal, since CMMD supports the most commonly used message passing paradigms, including MIMD/SPMD distributed memory message passing, global and node-to-node communication, blocking, non-blocking, and active message passing, host-node and hostless program models, multiple parallel I/O modes, and interrupts and polling. Most of the porting work will therefore be confined to syntax, while the performance-critical structure of the parallel algorithms is maintained.

In Phase 2, using the required full-scale large model cases as test demonstrations, the design and implementation issues of Phase 1 will be revisited and further evaluated. An out-of-core block scheme and a vendor-supplied out-of-core linear solver will be incorporated into the block algorithm design and implementation. To solve effectively the storage problem that limits the large matrix cases, this scheme uses the target machine's parallel I/O mass storage system as a 'virtual memory' level in the memory hierarchy, so that the disk-based parallel file system, with its extremely high storage capacity, can be exploited in the algorithm design and implementation and accessed through directly connected parallel I/O channels at a high sustained rate. We expect this incorporation to be seamless and efficient, thanks to the block algorithms and pipelining techniques adopted in our Phase 1 approach (a small sketch of the storage pattern this implies appears at the end of this section).

In Phase 2 we will also investigate further promising analytical techniques and/or software developed under separate efforts, through our close relationships with ARPA, NASA Goddard, NSF, CRPC, NASA NRA, and other industry partners.
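The out-of-core scheme described above can be pictured with a small sketch of the storage pattern it implies. In the Fortran 77 fragment below, the impedance matrix is held on disk as column panels of NB columns in a direct-access file, and only one panel is memory resident at a time. The file name, the panel width, the placeholder fill values, and the assumption that the record length is given in bytes for 8-byte complex elements are all illustration choices; the actual implementation will use the target machine's parallel file system and the vendor's out-of-core solver instead.

      PROGRAM OOCPAN
C     Sketch of an out-of-core column-panel layout: the N x N
C     complex matrix lives in a direct-access scratch file as N/NB
C     panels of NB columns each, and only one panel is in memory at
C     a time.  The sizes, file name, placeholder fill values, and
C     byte-based record length are illustrative assumptions only.
      INTEGER N, NB, NPAN
      PARAMETER (N = 8, NB = 2, NPAN = N / NB)
      COMPLEX PANEL(N, NB)
      INTEGER IP, I, J
      OPEN (10, FILE = 'zpanels.dat', ACCESS = 'DIRECT',
     &      FORM = 'UNFORMATTED', RECL = 8 * N * NB,
     &      STATUS = 'UNKNOWN')
C     Matrix fill, one panel at a time: compute NB columns in
C     memory, then write the panel to its record on disk.
      DO 30 IP = 1, NPAN
         DO 20 J = 1, NB
            DO 10 I = 1, N
C              Placeholder for the moment-method fill of element
C              Z(I, (IP-1)*NB + J).
               PANEL(I, J) = CMPLX(REAL(I), REAL((IP - 1)*NB + J))
   10       CONTINUE
   20    CONTINUE
         WRITE (10, REC = IP) PANEL
   30 CONTINUE
C     Later stages (factorization and solve) read panels back one
C     at a time, update them, and write them back, so the in-core
C     working set stays bounded by a few panels.
      DO 40 IP = 1, NPAN
         READ (10, REC = IP) PANEL
C        ... panel factorization / trailing update would go here ...
         WRITE (10, REC = IP) PANEL
   40 CONTINUE
      CLOSE (10)
      END

How well the panel traffic can be overlapped with computation then depends on the sustained rate of the parallel I/O system; this is the overlap that, for example, ProSolver-DES obtains with asynchronous I/O (see Section 2.4).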
2.4 Description of tools to be used in this project

1) ScaLAPACK

ScaLAPACK is a distributed memory version of the LAPACK [5][6] software package for dense and banded matrix computations [1][2][3]. Developed at Oak Ridge National Laboratory and the University of Tennessee by a research group headed by Jack Dongarra and David W. Walker, the package is designed to be scalable, portable, flexible, and easy to use on MIMD parallel computers. Using distributed versions of the Level 3 BLAS [4] as building blocks, a square block scattered decomposition scheme, and an object-based interface to the ScaLAPACK routines, ScaLAPACK takes advantage of surface-to-volume effects and allows effective reuse of data in local memory. Combined with other techniques, such as two-dimensional data mapping, increasing the granularity of the algorithms, overlapping communication and calculation, and maximizing the size of the submatrices multiplied in each processor, this approach leads to near-optimal linear solver algorithms on distributed-memory machines. Portability is achieved through a collection of routines that work across a wide range of architectures without sacrificing performance.

2) CMMD 3.0 and NX

CMMD is a library of message passing routines for the CM5. Programs that use CMMD typically operate in a message passing style in which each processing node of the CM5 runs an independent copy of a single program and manages its own computations and data layout, while communication between nodes is handled by calls to CMMD message passing functions. CMMD supports a variety of message passing styles, including point-to-point messages, cooperative message passing, asynchronous message passing, virtual communication channels, global operations, interrupts and polling, and active messages (CMAML). In addition, CMMD programs written in data parallel languages such as CMFortran can use both the node microprocessor and the vector units: the parallel language provides the concept of a "node-level parallel" array, in which a parallel array is stored on a single node with its values spread across the memory of the node's four vector units, and can thus take advantage of the vector units for high-speed floating-point computation.

NX is the message passing library used for communication among processes in an Intel MPP system. The nodes are independent processor/memory pairs and do not share physical memory; even node processes running on the same node maintain distinct memory spaces. If node processes need to communicate, they do so via NX message passing calls. NX supports the common message passing paradigms, including synchronous/asynchronous message passing, interrupt-driven message passing, and global operations.

3) CMSSL 3.1 and ProSolver-DES

The Connection Machine Scientific Software Library (CMSSL) is a set of numerical routines that support computational applications while exploiting the massive parallelism of the CM system. CMSSL provides data parallel implementations of familiar numerical routines such as linear algebra, offering new solutions to problems of both performance and algorithm choice and design. Many of the routines allow parallel computation on either multiple independent objects or a single large object. For dense matrices, CMSSL offers two methods of solving a linear system represented as a real or complex general matrix: the first uses the Gauss-Jordan linear system solver, while the second uses the QR factorization operation in combination with the QR solver. An out-of-core implementation of the LU and QR solvers will be released in CMSSL 3.1 by Thinking Machines Corp. in June 1993.

ProSolver-DES is a dense solver for the Intel iPSC/860 and Paragon XP/S, with both in-core and out-of-core implementations. It is a library for solving large, dense, and/or disk-resident systems of linear equations, AX = B. ProSolver-DES implements a variant of block Gaussian elimination to compute the block LU factorization of A and then uses block forward and back substitution to solve for the unknown matrix X. ProSolver-DES has been carefully optimized for high performance on the iPSC/860. In the out-of-core implementation, asynchronous I/O is used to overlap most of the I/O time with computation, and the maximum problem size that ProSolver-DES can solve depends on the size of the file system.
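To complete the picture of the matrix solve step, the sketch below carries out the forward and back substitution that follows an LU factorization, solving L U x = b for a single right-hand side. It assumes the unpivoted, in-core storage convention of the factorization sketch in Section 2.3 and is an illustration only; the proposed implementations will use the block substitution routines of ProSolver-DES and the corresponding ScaLAPACK and CMSSL solvers.

      SUBROUTINE ZLUSOL(A, LDA, N, B)
C     Solve L*U*x = b by forward and back substitution.  The unit
C     lower triangle of L and the upper triangle of U are assumed
C     to be stored in A, as produced by the unpivoted factorization
C     sketch in Section 2.3.  B is overwritten by the solution x.
C     In-core, single right-hand-side illustration only.
      INTEGER LDA, N, I, J
      COMPLEX A(LDA, N), B(N)
C     Forward substitution: solve L*y = b (L has a unit diagonal).
      DO 20 J = 1, N - 1
         DO 10 I = J + 1, N
            B(I) = B(I) - A(I, J) * B(J)
   10    CONTINUE
   20 CONTINUE
C     Back substitution: solve U*x = y.
      DO 40 J = N, 1, -1
         B(J) = B(J) / A(J, J)
         DO 30 I = 1, J - 1
            B(I) = B(I) - A(I, J) * B(J)
   30    CONTINUE
   40 CONTINUE
      END

Since this step costs only O(N^2) flops per excitation vector, repeating it for several independent excitation vectors, as noted in Section 1, adds little to the total time compared with the O(N^3) factorization.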
W. Walker, "ScaLAPACK: A Scalable Linear Algebra Library for Distributed Memory Concurrent Computers," Proceedings of 4th Symposium on the Frontiers of Massively Parallel Computation (McLean, Virginia, October 19-21, 1992), IEEE Computer Society Press, Los Alamitos, California, 1992, pp. 120-127. [2] J. Choi, J.J. Dongarra, and D. W. Walker,"The Design of Scalable Software Libraries for Distributed Memory Concurrent Computers," CNRS-NSF Workshop on Environments and Tools for Parallel Scientific Computing (Saint Hilaire du Touvet, France, Sept. 7-8, 1992), Elsevier Science Publishers, 1992. [3] J. Demmel, J.J. Dongarra, R. van de Geijn, and D. W. Walker, "LAPACK for Distributed Memory Architectures: The Next Generation," Proceeding of 6th SIAM Conference on Parallel Processing for Scientific Computing(Norfolk, Virginia, March 22-24, 1993) (R. F. Sincovec, eds.), SIAM, Philadephia, 1993. [4] J. Choi, J.J. Dongarra, and D. W. Walker, "Level 3 BLAS for Distributed Memory Concurrent Computers," CNRS-NSF Workshop on Environments and Tools for Parallel Scientific Computing (Saint Hilaire du Touvet, France, Sept. 7-8, 1992), Elsevier Science Publishers, 1992. [5] E. Anderson, Z. Bai, C. Bischof, J. Demmel, J. J. Dongarra, J. DuCroz, A. Greenbaum, S. Hammarling, A. McKenney, and D. Sorensen, "Lapack: A Portable Linear Algebra Library for High Performance Computers," Proceeding of Supercomputing `90, pp. 1-10, IEEE Press, 1990. [6] E. Anderson, Z. Bai, C. Bischof, J. Demmel, J. J. Dongarra, J. DuCroz, A. Greenbaum, S. Hammarling, A. McKenney, S. Ostrouchov, and D. Sorensen, LAPACK Users' Guide, SIAM Press, Philadephia, 1991. [7] G. C. Fox, S. W. Otto, Matrix Algorithms on a Hypercube I: Matrix Multiplication, Parallel Computing, Vol. 4, pp. 17-31, 1987. [8] G. C. Fox, M. A. Johnson, G. A. Lyzenga, S. W. Otto, J. K. Salmon, and D. W. Walker, Solving Problems on Concurrent Processors, Volume 1, Prentice Hall, Englewood Cliffs, N.J., 1988. [9] G. Laszewski, M. Parashar, A. G. Mohamed, G. Fox, "On the Parallelization of Blocked LU Factorization Algorithms on Distributed Memory Architectures," Proceedings of Supercomputing `92, IEEE Press, 1992, pp. 170 - 179. [10] G. C. Fox, "The Architecture of Problems and Portable Parallel Software Systems," SCCS technical Report 134, also CRPC - TR91172, 1991. [11] G. Fox, "Parallel Computing in Industry: An Initial Survey", in Proc. of Fifth Australian Supercomputing Conference, World Congress Centre, Melbourne, Australia, December, 1992. [12] R. F. Harrington, Field Computation by Moment Methods, the Macmillan Co., New York (1968). Reprinted by Krieger Publishing Co., Malabar, FL (1982). [13] G. Cheng, K. Mills and G. Fox, "An Interactive Visualization Environment for Financial Modeling on Heterogeneous Computing Systems," Proceeding of 6th SIAM Conference on Parallel Processing for Scientific Computing(Norfolk, Virginia, March 22-24, 1993) (R. F. Sincovec, eds.), SIAM, Philadephia, 1993. [14] Y. Lu, A. G. Mohamed, G. Fox and R. F. Harrington, "Implementation of Electromagnetic Scattering from Conductors Containing Loaded Slots on the Connection Machine CM-2," Proceeding of 6th SIAM Conference on Parallel Processing for Scientific Computing(Norfolk, Virginia, March 22-24, 1993) (R. F. Sincovec, eds.), SIAM, Philadephia, 1993. [15] G. A. Mohmad, G. Fox, G. Laszewski, M. Parashar, T. Haupt, K. Mills, Y. Lu, N. Lin, and N. 
Yeh, "Applications Benchmarking Set for Fortran-D and High Performance Fortran," Technical Report, Syracuse Center for Computational Science 327, June, 1992, also CRPC-TR92260. [16] G. Cheng, Y. Lu, G.C. Fox, K. Mills and T. Haupt, "An Interactive Remote Visualization Environment for an Electromagnetic Scattering Simulation on a High Performance Computing System," submitted to Supercomputing `93, Portland, OR, Nov. 14-17, 1993.(in review). [17] T. Cwik, J. Patterson, D. Scott, "Electromagnetic Scattering Calculations on the Intel Touchstone Delta," Proceedings of Supercomputing `92, IEEE Press, 1992, pp. 538-542. [18] D. Scott, "Out of Core Dense Solvers on Intel Parallel Supercomputers," Proceedings of 4th Symposium on the Frontiers of Massively Parallel Computation (McLean, Virginia, October 19-21, 1992), IEEE Computer Society Press, Los Alamitos, California, 1992, pp 484-487. [19] K. Mills, M. Vinson, and G. Cheng, ``A Large Scale Comparison of Option Pricing Models with Historical Market Data,'' Proceedings of 4th Symposium on the Frontiers of Massively Parallel Computation (McLean, Virginia, October 19-21, 1992), IEEE Computer Society Press, Los Alamitos, California, 1992. 5. Resumes of G. Fox and G. Cheng (to be sent)