Hi, Geoffrey:

After the meeting with Dr. Cha and Jay of SRC this Tuesday, Xianneng Shen (Harrington's PhD student, now working at NPAC) and I went to their office the next morning and discussed further the issues we need to clarify, especially to get a better understanding of their application problem. Based on my reading of the related materials and my understanding of this project, I have compiled the following information, together with some thoughts on the technical approaches, which I need to discuss further with you. It is not elaborated and just reflects my personal view, which may be biased given my short involvement in this proposal preparation.

1. Goal of the proposed study (excerpted from the announcement sheet):

The objective of this project is to evaluate existing CEM (Computational Electromagnetics) technology concepts on massively parallel processing (MPP) machines, to determine which analytic concepts have potential for further development to model and predict the radar cross section (RCS) of full-scale, material-treated aerospace vehicles, and to demonstrate the potential of advanced CEM techniques when coupled to massively parallel computing systems.

Key objectives:
. Migrate ARPA-sponsored parallel computing technology to CEM techniques.
. Develop and demonstrate a major increase in CEM capability.
. Spearhead the development of new computational methods and physical modeling techniques that incorporate high levels of parallelism.

Phase 1 (6 months): A key element of Phase 1 is to demonstrate the effective use of MPP computing systems and the rationale for scaling the approach to a full-scale fighter aircraft. Criteria for evaluating the Phase 1 work, based on evaluations of the EMCC benchmark cases (listed in descending order of importance):
. Accuracy
. Scalability of the code (risk associated with the potential to model and predict the RCS of a 1/30-scale material-treated fighter and a full-scale PEC model)
. Computational speed
. Technical documentation

Phase 2 (12 months): Continue to develop and refine/modify the Phase 1 software and, as necessary, develop new algorithms based on physical modeling concepts, and/or integrate any promising analytical techniques and/or software developed under the separate NASA NRA effort. In Phase 2, two government-specified large-scale demonstrations must be conducted: one predicting the RCS of a full-scale perfectly conducting fighter model, the other predicting that of a material-clad subscale fighter model.

2. My major observations related to this proposal

Reading through the announcement and discussing with SRC, I came away with the following major impressions of the proposed project:

1) Feasibility and practicality issues vs. research and methodology issues

Of the practical and research issues involved in this project, in both physical modeling techniques and parallel algorithm design, the project puts much more emphasis on practical, demonstrable capability: parallelizing the existing code on an existing massively parallel system and evaluating it on real large-scale problems.

2) Scalability vs. portability

The announcement pays little attention to portability of the software or of the corresponding architecture. Moreover, it seems they are trying, through this project, to evaluate which specific (existing or future) architectures are best suited to large real CEM industry applications. The scalability issue, by contrast, appears throughout the announcement. One of the key evaluation criteria for the success of Phase 1 is whether the code both performs well on the benchmark cases (small scale, with matrix sizes between 3,000x3,000 and 5,000x5,000) and scales well to the full-target real case (large scale, with matrix sizes around 50,000x50,000 to 60,000x60,000).
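To make the scaling gap concrete, here is a back-of-the-envelope memory estimate of mine (not from the announcement), written as a small Python fragment in modern notation purely for illustration; it assumes the dense impedance matrix is stored in double-precision complex, 16 bytes per entry:

    # Memory needed to hold the dense impedance matrix alone,
    # assuming double-precision complex entries (16 bytes each);
    # single-precision complex would halve these figures.
    BYTES_PER_ENTRY = 16
    for n in (5000, 50000, 60000):
        print(n, "x", n, ":", n * n * BYTES_PER_ENTRY / 1e9, "GB")
    # prints 0.4 GB, 40.0 GB and 57.6 GB respectively

So the benchmark cases fit comfortably in memory, while the full-target case is roughly two orders of magnitude larger and exceeds the physical memory of a modest MPP partition. This is why memory capacity, rather than raw speed, is the first obstacle, and why the out-of-core and distributed-memory schemes discussed below matter.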
3) Advanced CEM modeling techniques vs. efficient parallel algorithm design

Accuracy (and stability) of the modeling in the parallel implementation is the first concern of the project, and it is largely tied to the CEM modeling techniques themselves. Part of the reason is that the traditional computational limitations of most CEM models are first memory capacity and second computational speed, both due to the large matrix sizes. The parallel algorithm design for CEM rests largely on linear algebra and data decomposition methodology. They are chiefly interested in investigating effective parallel implementations of various advanced CEM techniques with high accuracy and stability.

3. Technical Approaches*

We have discussed two alternative approaches for this project:

1) Target machines: CM-5, Intel machines, IBM SP-1. Message passing library: PVM.

Advantages:
. High portability across the three platforms, and even to a heterogeneous computing environment (networked workstations and MPP machines), which is a potential way to solve the memory limitation problem for the large application case (I will discuss this later).
. Good for comparing the computational and parallel implementation issues of CEM on MPP systems with different architectures, to find out which are best suited to this application and which are not.

Disadvantages:
. The only MPP system to which PVM has so far been ported is the Intel iPSC/860, although both the CM-5 and the IBM SP-1 are claimed to support PVM. I have contacted the PVM developers about this and was told it is unlikely that they themselves will start porting PVM to the CM-5 and IBM SP-1 any time soon.
. Owing to the nature of public domain software, there is no systematic technical support for PVM, and (as far as I know, and from what other PVM users report) PVM is still not truly stable compared with commercial products.

2) Target machines: CM-5 (Phase 1 & 2 working machine), Intel machines (Phase 2 only), IBM SP-1 (Phase 2 only). Message passing libraries: CMMD (Phase 1 & 2), PVM (Phase 2 only).

The second alternative is to start the Phase 1 work on the CM-5, using CM Fortran + CMMD to parallelize the existing Fortran 77 code, based on the above observations and the following facts (or advantages):
. The time frame for Phase 1 is relatively tight.
. Most of the Phase 1 effort is parallel algorithm design, coding and debugging, which requires intensive use of a working machine (an in-house machine is best for managing and speeding up progress, given such a tight time frame and an application-driven project).
. Scalability is more important than portability in Phase 1, while the latter can be investigated in Phase 2, when PVM will hopefully be available on all three platforms.
. It is straightforward to port a program written in one message passing library (e.g., CMMD) to another (e.g., Intel's NX library), especially if the message passing calls are isolated behind a thin interface (see the sketch after this subsection), and CMMD supports the most commonly used message passing paradigms (blocking, non-blocking and active message passing, host-node and hostless models, multiple parallel I/O modes, interrupts and polling, etc.).

Disadvantages:
. Portability becomes a non-major concern in Phase 1.
. Several implementation techniques specific to the CM-5 architecture (discussed in the technical section later) will be either difficult or inefficient to port to other platforms, though without loss of generality in terms of methodology.
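On the porting point above, here is a minimal sketch of the kind of thin message passing interface I have in mind. It is my own illustration in modern Python notation; the names Transport, send_block and recv_block and the loopback backend are invented for this sketch, and each real backend would simply wrap the native blocking send/receive of one library:

    # The numerical code calls only this interface; one small backend
    # per platform maps it onto the native message passing library.
    class Transport:
        def send_block(self, dest, tag, buf):
            raise NotImplementedError
        def recv_block(self, node, tag):
            raise NotImplementedError

    class LoopbackTransport(Transport):
        # Stand-in backend so this sketch runs by itself; a real
        # backend would issue one native blocking call per method.
        def __init__(self):
            self.mailbox = {}
        def send_block(self, dest, tag, buf):
            self.mailbox[(dest, tag)] = list(buf)
        def recv_block(self, node, tag):
            return self.mailbox.pop((node, tag))

    t = LoopbackTransport()
    t.send_block(dest=1, tag=7, buf=[1.0, 2.0, 3.0])
    print(t.recv_block(node=1, tag=7))   # [1.0, 2.0, 3.0]

Porting then amounts to rewriting one small backend per platform, rather than touching the numerical code.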
The reasons for choosing the CM-5 rather than the iPSC/860 are as follows.

CM-5 advantages:
. NPAC's in-house CM-5, with flexible allocation time and good system support (and experienced CM-5 users at NPAC), will greatly ease the code development that is critical in Phase 1.
. The modest system configuration (machine and memory size, Data Vault, vector units, etc.) is good enough for investigating the computational issues of Phase 1, including scalability, pipelined block algorithm design, a modest matrix equation solver, modest benchmark production runs, etc.
. So far there has been no reported experiment on the CM-5 using the techniques we plan to apply to the large-scale CEM application, while several works on similar applications have been carried out on the Intel Touchstone architecture. Our technical approach is "new" on the CM-5 architecture.
. The memory hierarchy of the CM-5 is well suited to our approach, which combines node-level data parallelism for matrix manipulation with global-level parallel message passing, pipelined block LU factorization, an out-of-core block Gaussian elimination technique, and overall optimized pipelining of computation and communication. So are its high-bandwidth, balanced internal networks, which are important for an efficient pipelined scheme.
. The high scalability of the CM-5's universal architecture and technology.

CM-5 disadvantages (or iPSC advantages):
. In a message passing program it is not easy to exploit the CM-5's high-performance hardware (e.g., the vector units); a CMMD program must be written carefully to obtain high efficiency.
. Unlike on the Intel iPSC/860, the BLAS routines, which will be used intensively in the parallel implementation, are not assembly coded.
. The majority of notable work in parallel CEM has been done on Intel machines. A key previous work we will build on, the block LU decomposition algorithm design from NPAC's benchmark project, was implemented on the iPSC/860, though I am optimistic it can be ported to the CM-5 with high efficiency. The message passing implementation of our EM application (Lu's work) was also done on the iPSC/860.

The reason for using CM Fortran + CMMD: so far, the only way to exploit the vector units from a message passing program on the CM-5 is to program in CM Fortran + CMMD. This suits our application, because most of the computations local to the nodes are matrix manipulations, including block matrix filling, block LU factorization and block Gaussian elimination. It is also expected to benefit our pipelined algorithms at both the local level and the optimized global level.

Summary of the proposed technical approaches:

The application can be abstracted into the following major components (% shows the projected percentage of computational time in the sequential code):
. Fill the impedance matrix (40%)
. Linear solve of the matrix equation (50%)
  . LU factorization (40%)
  . Matrix solve (forward elimination and backward substitution) (10%)
. Calculate the field (10%)

Phase 1:
. CM-5, CM Fortran + CMMD (NPAC's in-house CM-5 system).
. A scalable data decomposition scheme, scaled and optimized with problem size and machine size. The scheme will be determined by a block matrix filling algorithm.
. Block algorithms, including block matrix filling, block LU factorization and block matrix solve (a serial sketch of the blocked factorization follows this list).
. Efficient use of the CM-5's memory hierarchy (node registers, vector-unit memory and/or the Data Vault).
. Pipelined algorithms for the different stages of the block Gaussian elimination, possibly including the matrix filling.
. An out-of-core block matrix filling and block Gaussian elimination scheme (a kind of virtual memory scheme) utilizing the CM-5's parallel mass storage system, the Data Vault, for large-scale problems whose memory requirements exceed physical memory. This scheme will only be considered in the domain decomposition and algorithm design of Phase 1, as required by the scalability concern; the real implementation will be done in Phase 2.
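For reference, here is a minimal serial sketch of the right-looking blocked LU factorization at the core of the linear solve. It is my own illustration in modern Python/NumPy notation, without pivoting and without the distribution, pipelining and out-of-core staging described above; in the real code each block operation would be a node-local CM Fortran matrix computation, with panels exchanged by message passing:

    import numpy as np

    def blocked_lu(a, nb):
        # In-place right-looking blocked LU, A = L*U, no pivoting
        # (assumes a square matrix that is safely factorizable).
        # L is unit lower triangular, stored below the diagonal;
        # U is stored on and above the diagonal.
        n = a.shape[0]
        for k in range(0, n, nb):
            e = min(k + nb, n)
            # 1) unblocked LU of the diagonal block A11
            for j in range(k, e):
                a[j+1:e, j] /= a[j, j]
                a[j+1:e, j+1:e] -= np.outer(a[j+1:e, j], a[j, j+1:e])
            if e < n:
                l11 = np.tril(a[k:e, k:e], -1) + np.eye(e - k)
                u11 = np.triu(a[k:e, k:e])
                # 2) row panel:    U12 = inv(L11) @ A12
                a[k:e, e:] = np.linalg.solve(l11, a[k:e, e:])
                # 3) column panel: L21 = A21 @ inv(U11)
                a[e:, k:e] = np.linalg.solve(u11.T, a[e:, k:e].T).T
                # 4) trailing update -- the step the parallel scheme
                #    pipelines across nodes
                a[e:, e:] -= a[e:, k:e] @ a[k:e, e:]
        return a

    a0 = np.random.rand(6, 6) + 6 * np.eye(6)   # well conditioned
    lu = blocked_lu(a0.copy(), nb=2)
    l = np.tril(lu, -1) + np.eye(6)
    u = np.triu(lu)
    print(np.allclose(l @ u, a0))               # True

The point of the blocked form is that steps 2-4 are almost entirely matrix-matrix operations, which is what lets the node-local vector units (via CM Fortran) do the bulk of the work while the panels are pipelined around the machine.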
Phase 2:
. Port the Phase 1 software to the other two platforms, namely the iPSC/860 and the IBM SP-1, using either PVM (if the library is available and stable on those machines by then) or their native message passing libraries.
. Evaluate three further schemes to solve the memory limitation problem:
  . the out-of-core virtual memory scheme mentioned above, on the modest CM-5 system;
  . scaling the software to a 1024-node CM-5 system;
  . a heterogeneous processing scheme distributing the memory requirements over networked MPP systems and/or workstations (in PVM).
. Investigate further promising analytical techniques and/or software developed under the separate NASA NRA effort in our parallel implementation.

*: A detailed description of the approach is being prepared and will be sent in a separate mail.

Please give me your opinion of this planned approach, as it is somewhat different from what we originally discussed.

P.S.: A hardcopy of this email has been given to Lisa for your convenience.

--- Gang