Given by Geoffrey C. Fox at Mardi Gras Conference on Concurrent Computing in Physical Sciences on February 18, 1993. Foils prepared October 22 1997
Summary of Material
This overviews the status of HPCC architectures |
Software approaches, focusing on HPF, and an Application Analysis |
Problem Architectures, Load Balancing, and the Complex System Approach |
Geoffrey C. Fox |
gcf@nova.npac.syr.edu |
315.443.2163 |
Mardi Gras Conference |
Concurrent Computing in the Physical Sciences |
Louisiana State University |
Baton Rouge, LA |
February 18, 1993 |
This is a phase transition |
Needs headroom (Carver Mead), which is large (a factor of 10?) due to the large new software investment |
nCUBE-1 and CM-2 were comparable in cost performance to conventional supercomputers
|
Do the Intel Paragon, CM-5, DECmpp (MP-2), and IBM SP-1 have enough headroom? |
Different machines |
New types of computers |
New libraries |
Rewritten Applications |
Totally new fields able to use computers .... ==> Need new educational initiatives: Computational Science |
Will be a nucleus for the phase transition |
and accelerate use of parallel computers in the real world |
Data Parallelism - universal form of scalable parallelism |
Functional Parallelism - Important but typically modest speedup. - Critical in multidisciplinary applications. |
On any machine architecture
|
Domain decomposition is simple, but general and extensible to many more nodes |
All successful concurrent machines with
|
Problem is an algorithm applied to a data set. Obtain concurrency by acting on the data concurrently. The three architectures considered here differ as follows:
|
Hardware trends imply that all computers (PC's ---> Supercomputers) will be parallel by the year 2000
|
Software is the challenge and could prevent/delay the hardware trend that suggests parallelism will be a mainline computer architecture
|
Partial Differential Equations |
Particle Dynamics and Multidisciplinary Integration |
Image Processing |
Some: |
Visualization |
Artificial Intelligence |
Not Much: |
Network Simulation |
Economic (and other complex system) modeling |
Scheduling |
Manufacturing |
Education |
Entertainment |
Information Processing
|
Industrial and Government Problems
|
We know parallel versions of the basic algorithms currently being used in most kernels
|
The study of implementation of ACTION
|
Study of role of applications in CRPC
|
Study of relation of CRPC and NSF Supercomputer Centers which are again much larger
|
PCETECH includes and expands the technology development components of CRPC and ACTION |
PVM, Express, Linda |
Einstein (CSW) |
ISIS (Cornell) |
High Performance Fortran Compiler |
High Performance C, C++ Compiler |
HPF Extensions - PARTI |
Parallel / Distributed Computing Runtime Tools |
ADIFOR |
AVS and Extensions |
MOVIE (Syracuse) |
High Performance Fortran Interpreter |
Televirtuality / Virtual Reality OS |
Parallel Database - Oracle 7.0 |
OO Database
|
Mesh Generation |
SCALAPACK |
Sparse Matrix Solvers - Templates and libraries (Direct and Iterative) |
Particle Dynamics Kernels - Templates and Libraries (O(N^2) to fast multipole) |
Optimization Methodology and Templates
|
Scheduling (neural-net, parallel) Templates |
We may have to rewrite our code, but we want the resulting program to run on "All" current and future (anticipated) machines |
No compelling new languages: e.g. OCCAM does not solve enough problems to warrant adoption. Adapt existing languages: Fortran, C++, ADA, C, LISP, PROLOG, BASIC .......... COBOL? |
Can map onto individual sequential or parallel computers or networks of these (distributed computing) |
Experience of users |
Good sequential and node compilers exist |
Migration of existing codes |
Hard to build a new language, as one needs all the features/libraries of Fortran/C implemented well. This is why one uses Fortran 90 and not APL |
Fortran and C are good (OK) for a class of applications |
Natural MIMD model of programming |
Used in vast majority of current successful MIMD programming |
Each node has a program which controls and calculates its own data (owner-computes "rule"):
      Do i running over local data
         CALL READ (for non-local data)
         CALCULATE all i's DATA |
In simplest form one program per node of computer |
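A minimal sketch of such a node program (ours, for illustration; EXCHANGE_GHOSTS is a hypothetical routine standing in for the send/receive calls of whatever message-passing library - PVM, Express, etc. - is actually used):
      SUBROUTINE NODE_UPDATE(U, UNEW, NLOCAL)
C     One node's piece of a data-parallel update, message-passing style
      INTEGER NLOCAL, I
      REAL U(0:NLOCAL+1), UNEW(NLOCAL)
C     Read the non-local boundary data this node needs (hypothetical call)
      CALL EXCHANGE_GHOSTS(U, NLOCAL)
C     Owner-computes: loop only over locally owned points
      DO 10 I = 1, NLOCAL
         UNEW(I) = 0.5 * (U(I-1) + U(I+1))
   10 CONTINUE
      RETURN
      END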
Advantages
|
Disadvantages
|
e.g. Program represented as sequence of functions operating on arrays |
      DIMENSION A(100,100), B(100,100), C(100,100)
      A = B + C
|
All data owned by a particular processor |
Owner-computes rule: for an expression A(i,j) = .........., the processor holding A(i,j) calculates the expression, after communicating any needed data if the r.h.s. contains data not owned by that processor. |
Similar concepts in C, C++, LISP, ADA ....... |
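A small illustration of the owner-computes rule (a sketch of ours, not from the foil): each element of A is computed by the processor that owns it, fetching the shifted elements of C from a neighbor processor whenever they are stored elsewhere.
      REAL A(100,100), B(100,100), C(100,100)
!     The shifted reference to C may require neighbor communication;
!     each element of A is still computed by its owning processor
      A(:, 2:100) = B(:, 2:100) + C(:, 1:99)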
Advantages
|
Disadvantages
|
Program in an SPMD data parallel style
|
Program is annotated with assertions giving information about desired data locality and/or distribution |
Compiler generates code implementing data and work distribution from the user-written SPMD application
|
String theories are candidates for a TOE (Theory of Everything). They provide a plausible quantum theory of gravity. |
Basic objects are tiny one dimensional "strings" (rather than points). Dynamics described by a two dimensional space-time world-sheet (rather than world line). |
Calculations involve integrating over all possible world-sheets, i.e. all 2-d surfaces embedded in some higher dimensional space (3-d, 4-d, 26-d, ....) |
Done numerically by a Monte Carlo integration over different triangulations of the surface, produced dynamically using the Metropolis algorithm. |
Current research centers on finding good discretizations of the continuum theories, and trying to find critical points with divergent correlation lengths, so the details of the discrete lattice structure become irrelevant and we can extract continuum results (c.f. lattice QCD) |
Interesting critical points are phase transitions between a smooth phase and a crumpled phase of the surface. The order and critical exponents of this crumpling transition are still uncertain. |
We have done the largest simulations so far on 2-d systems, but these are only 576 node meshes. Critical slowing down is a major problem - need better algorithms. |
The same techniques are used to model real fluctuating surfaces, e.g. cell membranes, lipid bilayers, phase boundaries. |
The program for random surfaces is very irregular, with many linked lists and pointer trees to be traversed. The original code performed very poorly. Completely rewrote and simplified the code and gained a substantial speed-up. It still performed poorly on some RISC processors such as the Intel i860. |
Optimal use of cache memory is vital for processors such as the i860, which have a small cache memory, and the time to access external memory is much greater than the time to perform a floating point operation |
Due to the dynamic nature of the code, the neighbors of a site which are used for the update of that site are constantly changing. This means we cannot make good use of the cache unless we keep a new array which constantly readjusts pointers to neighbors, so that data for neighboring sites is kept in neighboring regions of memory. |
Other optimizations on i860 and use of non-IEEE arithmetic gave us a substantial improvement, but still nowhere near 80 Mflops peak performance! |
Program performs much better on more recent processors (IBM RS/6000, Hewlett-Packard). |
Independent Parallelism
|
Parallel Random Surfaces
|
FORTRAN 77 entirely contained within Fortran 90 |
Language goal is to "modernize Fortran, so that it may continue its long history as a scientific and engineering programming language" |
Major language additions include:
|
A significantly larger language than FORTRAN 77 |
Data Distribution Directives
|
Parallel Statements
|
Extended Intrinsic Functions and Standard Library
|
EXTRINSIC Procedures
|
Parallel I/O Statements
|
Changes in Sequence and Storage Association
|
In addition, a subset of HPF suitable for earlier implementation is defined |
Alpha version: demonstrated at SUPERCOMPUTING '92 in Minneapolis, MN
|
Now running on nCUBE2, iPSC/860 and network of SUN workstations |
beta release: June 1993 |
alliance with industry to develop commercial compiler for HPF, optimized for specific architectures |
next target: heterogeneous networks |
Syntactic form of a structured Fortran environment |
A program that contains directives is expected to generate the same results whether directives are processed or not, assuming that the directives are used correctly |
HPF directives are consistent with Fortran 90
|
Specifies lowest-level abstract alignment of array elements |
TEMPLATE is an abstract, very fine-grain virtual processor space. Template elements label physically independent objects, e.g. gluons, fluid particles. Align degrees of freedom, e.g. gluon components, to template members |
Use for machine-independent alignment between structures |
Specifies coarse-grain partitioning of data |
PROCESSORS is a coarse-grain processor grid, presumably in 1-1 correspondence with physical processors |
Use for machine-dependent partitioning of data structures |
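A minimal sketch of how these directives fit together (array names, sizes and the BLOCK distribution are our choices, not from the foil):
      REAL U(1000), F(1000)
!HPF$ TEMPLATE T(1000)                 ! fine-grain abstract index space
!HPF$ ALIGN U(I) WITH T(I)             ! machine-independent alignment
!HPF$ ALIGN F(I) WITH T(I)
!HPF$ PROCESSORS P(16)                 ! coarse-grain (physical) processor grid
!HPF$ DISTRIBUTE T(BLOCK) ONTO P       ! machine-dependent partitioning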
Compiler will normally use "owner-computes rule" |
Elemental array assignment statement (Similar to array expressions) |
FORALL is not a general parallel construct! |
Functions without side effects may be called from FORALL |
Other functions may be allowed as well |
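A small FORALL example (ours; A, B and N are assumed to be declared and distributed elsewhere):
!     Elemental assignment: index combinations may be evaluated
!     in any order or concurrently
      FORALL (I = 2:N-1, J = 2:N-1)
         A(I,J) = 0.25 * (B(I-1,J) + B(I+1,J) + B(I,J-1) + B(I,J+1))
      END FORALL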
An assertion that the iterations of the DO may be executed in any order or concurrently |
If the assertion is false, the compiler is free to take any action it deems necessary |
No restrictions on the loop body |
No way to perform explicit synchronization |
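A short sketch (EVENT, RESULT and ANALYSE are hypothetical names of ours): the directive asserts that the iterations do not interfere, so the compiler may execute them in any order or concurrently.
!HPF$ INDEPENDENT
      DO I = 1, NEVENT
!        Each event is analysed with no data shared between iterations
         CALL ANALYSE(EVENT(I), RESULT(I))
      END DO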
Much of HPF's Parallel "Power" is buried in its library |
Personal note: Fortran 90D motivated by similarities between Caltech message passing "collective" communication and F90 Intrinsics |
18 Non-elemental (true array) Fortran 90 Intrinsics e.g. CSHIFT, SUM |
7 new HPF Intrinsics e.g. Alignment, Template, Distribution inquiries |
4 new HPF Reduction (combine) Functions |
11 new HPF Combine-Scatter Functions |
22 new HPF Parallel Prefix Functions |
2 new HPF Parallel Sorting Functions |
3 other new HPF Parallel Functions |
7 HPF Local Intrinsics |
74 library routines (==> ~1000 routines when written in clumsy F77) |
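For instance (a sketch of ours), a periodic nearest-neighbor update and a global reduction can be written entirely with Fortran 90 array intrinsics, which the parallel runtime library then implements:
!     CSHIFT does the "collective" nearest-neighbor communication,
!     SUM is a global reduction
      UNEW = CSHIFT(U, 1, DIM=1) + CSHIFT(U, -1, DIM=1)
      UNEW = UNEW + CSHIFT(U, 1, DIM=2) + CSHIFT(U, -1, DIM=2)
      UNEW = 0.25 * UNEW
      TOTAL = SUM(UNEW)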
HPF makes much greater use of runtime optimizations than the traditional sequential or parallel compiler |
Fortran 77:
      DO 1 I = 1, N
    1 A(I) = FUNC(B(I), B(I +/- 1), C(I), C(I +/- 1), .....)
Traditionally, access to these elements is arranged in the COMPILER |
HPF:
      A = FUNC(B, C)
Invoke the ALREADY OPTIMIZED parallel implementation of FUNC. In many cases, the HPF compiler need not know anything about parallelization except interfaces, data layout, etc. |
Runtime optimizations are acceptable (and perhaps best) as the "grain size" (i.e. total number of elements in the array divided by Nproc) is large ==> "Startup" of Runtime Procedure calls is small ==> Can use an interpreter rather efficiently, cf. APL and *LISP on the CM-2 |
There is only ONE data parallelism ==> The same Runtime library can support C, C++, ADA, APL, and an HPF Interpreter |
Anonymous FTP:
minerva.npac.syr.edu (directory HPFF)
titan.cs.rice.edu (directory public/HPFF/draft)
think.com (directory public/HPFF)
ftp.gmd.de (directory hpf-europe)
theory.tc.cornell.edu (directory pub) |
Electronic Mail from the Softlib server at Rice University. This can be accessed in two ways:
A. Send electronic mail to softlib@cs.rice.edu with "send hpf-v10.ps" in the message body. The report is sent as a PostScript file.
B. Send electronic mail to softlib@cs.rice.edu with "send hpf-v10.tarZ" in the message body. The report is sent as a uuencoded compressed tar file containing LaTeX source |
Hardcopy: Send requests to Theresa Chatman, CITI/CRPC, Box 1892, Rice University, Houston, TX 77251. There is a charge of $50.00 for this report to cover copying and mailing costs. |
HPFF encourages reviewers to submit comments as soon as possible, with a deadline of March 1, 1993 for consideration. |
Send comments to hpff-comments@cs.rice.edu. |
To facilitate the processing of comments we request that separate comment messages be submitted for each chapter of the document and that the chapter be clearly identified in the "Subject:" line of the message. Comments about general overall impressions of the HPF document should be labeled as Chapter 1. All comments on the language specification become the property of Rice University. |
If email access is impossible for comment responses, hardcopy may be sent to: HPF Comments, c/o Theresa Chatman, CITI/CRPC, Box 1892, Rice University, Houston, TX 77251. HPFF plans to process the feedback received at a meeting in March. Best efforts will be made to reply to comments submitted. |
A somewhat related issue: what problems are suitable for HPF and make good use of MIMD architectures? i.e. run well on MIMD but poorly on SIMD |
Classification of Problems (Fox, 1988) |
Embarrassingly Parallel Problems: Basic entities can be very complicated, but they can be calculated independently. Use the INDEPENDENT directive in HPF |
Search databases for given text ( ELIXIR for DECmpp ) (can be SIMD) |
Analyse separately 10^7 events recorded from Synchrotron (SSC) collisions. (definitely MIMD "farm") |
Calculate matrix elements for (chemistry) Hamiltonian (may need MIMD)
|
Quantum Monte Carlo
|
Particle Dynamics with cutoff forces |
(Vij = 0 for |ri - rj| > d) and irregular spatial structure |
CHARMM, AMBER .............. |
Unstructured finite element meshes
|
Performance of SIMD or Autonomous SIMD (Maspar) versus MIMD unclear for these applications. |
Simple models reproduce quantitative behavior of real magnetic materials with complex interactions |
Fundamental to the development of the theory of phase transitions and critical phenomena, which has very wide applicability. |
Computational algorithms and techniques can be applied to other fields, e.g.
|
Since the models are simple and exact results are known in some cases (e.g. 2-d Ising Model), these are used as a testing ground for new computational techniques, e.g.
|
The Metropolis Algorithm
|
Cluster Algorithm
|
Metropolis Algorithm (see sketch below) |
|
Swendsen-Wang Algorithm
|
Wolff Algorithm
|
Parallel Component Labeling
|
Independent Parallelism
|
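The foil bodies for the algorithms above are not reproduced here; as a hedged illustration of the Metropolis algorithm, a single spin update for the 2-d Ising model (periodic boundaries; SPIN, L and BETA assumed declared elsewhere) might look like:
!     Flip spin (I,J) with the Metropolis acceptance probability
      IP = MOD(I, L) + 1
      IM = MOD(I - 2 + L, L) + 1
      JP = MOD(J, L) + 1
      JM = MOD(J - 2 + L, L) + 1
      NBRSUM = SPIN(IP,J) + SPIN(IM,J) + SPIN(I,JP) + SPIN(I,JM)
      DELTAE = 2.0 * SPIN(I,J) * NBRSUM      ! energy change if flipped
      CALL RANDOM_NUMBER(R)
      IF (DELTAE <= 0.0 .OR. R < EXP(-BETA * DELTAE)) SPIN(I,J) = -SPIN(I,J)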
Multiple Phase problems such as
|
Need to express "alignment" between phases (different levels of multigrid, particles and grid for PIC) and then optimize distribution and redistribution between phases. |
Irregular problems of this type are only suitable for MIMD. |
Very irregular cases with mismatches between phases perform poorly even on MIMD. |
Michael Warren (with John Salmon) has reformulated Barnes-Hut parallel algorithm |
New parallel algorithm appears suitable for incorporation into extended HPF compiler as a generalization of PARTI methods used to parallelize irregular grids |
Clustering algorithm (Appel, Barnes-Hut, Greengard) |
Can replace M body force calculation by one using center of mass of cluster
|
Can do O(10,000) particles with O(N^2) algorithm |
One to two orders of magnitude larger problem possible with clustering algorithm |
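A hedged sketch of the standard Barnes-Hut acceptance test behind this idea (variable names are ours): a cluster (cell) of size S at distance D from particle I is replaced by its center of mass when S/D is smaller than an opening angle THETA; otherwise the cell is opened and its children examined.
!     Decide whether the whole cell can be treated as one point mass
      DX = XCM - X(I)
      DY = YCM - Y(I)
      DZ = ZCM - Z(I)
      D  = SQRT(DX*DX + DY*DY + DZ*DZ)
      IF (S / D < THETA) THEN
!        One interaction with the cluster's center of mass
         FMAG = G * CELLMASS * PMASS(I) / D**2
      ELSE
!        Open the cell: recurse on its children / individual particles
      END IF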
Key idea behind data-parallel languages - and perhaps all good languages |
The language expresses problem and not the machine architecture |
Need different approach to express functional parallelism |
Use "object-oriented" features when the problem has natural objects |
Do not use "object-oriented" features when objects are artifacts of target machine |
Need data and functional (object) parallel paradigms in many problems - especially multidisciplinary |
We can divide problems and machines into interconnected modules - and we do this at different granularities. We can consider two limiting cases:
(I) High performance, Low Latency links
Machine:
o Classic MPP such as Intel Paragon
Problem:
o QCD: Components are gluon fields. Links are local.
Software model:
o High Performance Fortran
o Currently only regular structure
o Eventually all such tightly coupled loosely synchronous problems
o Fortran + Message passing |
Can divide a problem into "highly optimized" v. "flexible" at different levels |
HPF Compiler: the component is a full HPF program. Glue full programs together on a bus |
HPF Interpreter: the component is a single MATLAB/APL/HPF array statement. Glue data-parallel coarse-grain array-statement "function" calls together |
Three methods using physical analogies:
1. Simulated Annealing Algorithm (SAA): Formulate optimization as finding the ground state of a physical system. Use Monte Carlo to equilibrate and reduce the temperature.
2. Bold Neural Network (BNN): Mean field approximation to the same analogy as in SAA to find quick minima. Related to "Eigenvalue Bisection" methods.
3. Genetic Algorithm (GA): Maximize fitness (minimize the cost function) of an evolving population of structures (candidate solutions). |
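A minimal sketch of the SAA inner loop applied to load balancing, under assumptions of ours (MAP assigns graph nodes to processors; PROPOSE_MOVE and COST_CHANGE are hypothetical routines returning a candidate move and its change in load-imbalance-plus-communication cost):
      INTEGER STEP, NODE, NEWPROC
      T = T0
      DO STEP = 1, NSTEPS
!        Propose moving one graph node to another processor
         CALL PROPOSE_MOVE(MAP, NODE, NEWPROC)
         DELTA = COST_CHANGE(MAP, NODE, NEWPROC)
         CALL RANDOM_NUMBER(R)
!        Metropolis acceptance at temperature T
         IF (DELTA <= 0.0 .OR. R < EXP(-DELTA / T)) MAP(NODE) = NEWPROC
!        Slowly reduce the temperature towards the ground state
         T = 0.999 * T
      END DO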