Given by Geoffrey C. Fox at CRPC/MCNC Workshop on April 10-13, 1995. Foils prepared April 7, 1995
Summary of Material
This module describes many current approaches, including different languages which support message passing, data parallelism and task parallelism. We describe the status of the various approaches and which software is appropriate for which problems and which machines |
We describe High Performance Fortran and which features are needed for which applications, as well as the |
special needs of coarse grain task parallelism |
See Generic and Specific Application Glossaries |
As nearly all large scale parallel machines are distributed memory, experience is largely with explicit message passing
|
Explicit Message Passing is the only available way of generating good performance on "all" MIMD machines, although this comes at the cost of significant user effort
|
HPF and Message Passing both require substantial rewriting of code |
If existing or modestly modified Fortran 77, C, C++ codes would run "well" on a parallel machine, then users would be happy
|
Many users want to use workstation clusters as well as dedicated parallel machines
|
This classification is summarized in Parallel Computing Works Survey |
Synchronous: Regular in Space and Time
|
High Performance Fortran (HPF), CMFortran, Crystal, C*, APL are natural and can be automatically parallelized on any machine.
|
Asynchronous: Irregular in Space and Time
|
Loosely Synchronous: Irregular in Space, Regular in Time
|
Fortran + Message Passing "works" but is not portable |
Extended HPF (HPF+) designed to do most of these examples |
Data or geometric parallelism again, but with dynamic data structures which are not arrays
|
Synchronous: Parallel Fortran such as High Performance Fortran, CC++, Fortran-M, pC++, Crystal, APL, ... |
Loosely Synchronous: Extensions of the above. (HPF+, HPC++) |
Asynchronous: Web Technology, PCN, Linda, C++ object-oriented approaches....
|
Metaproblems: AVS, MOVIE, PCN, Linda, Ada, Fortran-M, CC++, ... controlling modules written in a synchronous or loosely synchronous approach |
Embarrassingly Parallel: Several approaches work?
|
Message Passing works in ALL cases! |
CC++, MOVIE (Furmanski at Syracuse), Linda, Fortran-M, PCN (Chandy)...
|
Parallel Fortran, C*, Fortran 90, ...
|
These approaches tackle different problem architectures
|
See NPAC's High Performance Fortran Applications Resource |
HPF should not be expected to solve ALL problems |
Goal: Express in a high-level, portable, scalable fashion those aspects of problems one would like to use Fortran for
|
Lessons from the study of:
|
Particle Dynamics and Parallel Differential Equation Solving have been studied in detail (for HPF); other fields are less completely understood. |
Data Parallelism
|
Task Parallelism
|
Metaproblems (task parallelism where each component is data parallel)
|
See the HPF Forum, the National Software Exchange list, or the NPAC list for educational resources |
HPF is designed so that parallelism is essentially explicit |
This is embodied in 2 classes of operations:
|
HPF makes much greater use of runtime optimizations than the traditional sequential or parallel compiler |
Fortran 77:
      DO 1 I = 1, N
 1       A(I) = FUNC(B(I), B(I +/- 1), C(I), C(I +/- 1), ...) |
Traditionally arrange access to these elements in COMPILER |
HPF: A=FUNC(B,C) |
Invoke an ALREADY OPTIMIZED parallel implementation of FUNC. In many cases, the HPF compiler need not know anything about the parallelization except interfaces, data layout, etc. |
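As a concrete stand-in for the schematic FUNC above (matrix multiplication is my illustration, not an example taken from the foils), compare an explicit Fortran 77 loop nest with a single Fortran 90/HPF call to an already optimized library routine: |

C     Fortran 77: the compiler must discover and parallelize the loops itself
      DO 10 I = 1, N
        DO 20 J = 1, N
          A(I,J) = 0.0
          DO 30 K = 1, N
            A(I,J) = A(I,J) + B(I,K)*C(K,J)
   30     CONTINUE
   20   CONTINUE
   10 CONTINUE

C     HPF / Fortran 90: one statement; the data decomposition and
C     communication live inside the optimized library implementation
      A = MATMUL(B, C)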
Much of HPF's Parallel "Power" is buried in its library |
Personal note: Fortran 90D motivated by similarities between Caltech message passing "collective" communication and F90 Intrinsics |
18 Non-elemental (true array) Fortran 90 Intrinsics e.g. CSHIFT, SUM |
7 new HPF Intrinsics e.g. Alignment, Template, Distribution inquiries |
4 new HPF Reduction (combine) Functions |
11 new HPF Combine-Scatter Functions |
22 new HPF Parallel Prefix Functions |
2 new HPF Parallel Sorting Functions |
3 other new HPF Parallel Functions |
7 HPF Local Intrinsics |
74 library routines (==> ~1000 routines when written in clumsy F77) |
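A hedged sketch of how a few of these library routines appear in user code (the array sizes and variable names are purely illustrative); SUM and CSHIFT are Fortran 90 intrinsics, while SUM_PREFIX and GRADE_UP come from the HPF library module as specified in the HPF standard: |

      PROGRAM LIBDEMO
      USE HPF_LIBRARY                ! SUM_PREFIX, GRADE_UP, ...
      REAL    X(1000), TOTAL, RUNNING(1000)
      INTEGER ORDER(1000)
C     ... fill X ...
      TOTAL   = SUM(X)               ! Fortran 90 reduction intrinsic
      RUNNING = SUM_PREFIX(X)        ! HPF parallel prefix (running sums)
      ORDER   = GRADE_UP(X, DIM=1)   ! HPF sorting permutation
      X       = CSHIFT(X, 1)         ! Fortran 90 circular shift
      END

Each call is a single data-parallel operation whose distributed implementation is supplied by the library rather than discovered by the compiler. |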
Non-Elemental Fortran 90 Intrinsics
|
HPF Intrinsics
|
New Reduction Functions
|
Combine-Scatter Functions
|
Sorting Functions
|
Parallel Prefix Functions
|
Other Functions
|
1. GLOBAL_ALIGNMENT |
2. GLOBAL_DISTRIBUTION |
3. GLOBAL_TEMPLATE |
4. ABSTRACT_TO_PHYSICAL |
5. PHYSICAL_TO_ABSTRACT |
6. LOCAL_TO_GLOBAL |
7. GLOBAL_TO_LOCAL |
See the general discussion in Parallel Computing Works |
STATIC
|
ADAPTIVE
|
ASYNCHRONOUS
|
INTEGRATION
|
Standardized syntax lowers risk in application development
|
HPF, HPC++ benchmarks will enable easier comparison between different machines
|
Allows vendors to concentrate their scarce software resources
|
Parallelism is a feature of problems |
It can be expressed in different languages with trade-offs in performance, flexibility and convenience |
Fortran 77 ==> HPF |
Fortran 90 ==> HPF |
C++ ==> HPC++
|
HPC++ more naturally handles complex data structures e.g. data parallel collections which are not the staple array data structure of Fortran |
Must have common runtime support
|
The following foils are expanded in the HPF Applications Resource at NPAC |
Somewhat related issue?
|
Classification of Problems (Fox, 1988) |
See overview discussion in Parallel Computing Works |
Synchronous: Data parallel. Tightly coupled, and the software needs to exploit features of the problem structure to get good performance. Comparatively easy, as different data elements are essentially identical. |
Loosely Synchronous:
|
Asynchronous:
|
Embarrassingly parallel:
|
Metaproblems
|
We know the following areas MUST be addressed for a really useful HPF compiler |
We roughly know how to do all of this and why |
Support for irregular data distributions
|
Enhancements to Forall and similar constructs |
Sparse and tree data structures |
Interface/Coupling with other languages (e.g., C++) |
Parallel I/O |
Heterogeneous systems |
Some sort of support for, or interface with, task (or perhaps control) parallelism |
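For orientation, a minimal HPF 1.0 fragment of the kind these extensions would build on (array names, sizes and the BLOCK choice are illustrative only): the directives describe a regular data layout - exactly the case the irregular-distribution item above goes beyond - and FORALL is the construct the "Enhancements to Forall" item refers to: |

      REAL U(1024,1024), V(1024,1024)
!HPF$ TEMPLATE T(1024,1024)
!HPF$ DISTRIBUTE T(BLOCK,BLOCK)
!HPF$ ALIGN U(I,J) WITH T(I,J)
!HPF$ ALIGN V(I,J) WITH T(I,J)
C     Data-parallel five-point update of the interior points
      FORALL (I=2:1023, J=2:1023)
     &   V(I,J) = 0.25*(U(I-1,J)+U(I+1,J)+U(I,J-1)+U(I,J+1))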
Synchronous Problems where |
the data structure is an array manipulated in a regular (synchronous) fashion
|
Regular grid partial differential equation solvers
|
"Crystalline" (regular) particle dynamics or Monte Carlo
|
The above two are local or nearest neighbor - use CSHIFT (see the sketch after this list) |
Full matrix Algorithms (may need MIMD for good performance) |
FFT |
Low Level Image Processing
|
O(N^2) Particle Dynamics
|
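A sketch of the nearest-neighbor CSHIFT pattern referred to in the list above: a 2-D Jacobi-style update with periodic boundaries (the names, sizes and 0.25 weighting are illustrative only): |

      REAL U(512,512), UNEW(512,512)
!HPF$ DISTRIBUTE U(BLOCK,BLOCK)
!HPF$ ALIGN UNEW(I,J) WITH U(I,J)
C     Each CSHIFT supplies one set of neighbor values; the compiler
C     turns the shifts into regular nearest-neighbor communication
      UNEW = 0.25 * ( CSHIFT(U, 1,DIM=1) + CSHIFT(U,-1,DIM=1)
     &              + CSHIFT(U, 1,DIM=2) + CSHIFT(U,-1,DIM=2) )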
In Embarrassingly Parallel Problems, the basic entities can be very complicated but they can be calculated independently.
|
Search databases for given text (can be SIMD) |
Analyze separately 10**7 events recorded from Synchrotron (SSC) collisions (definitely a MIMD "farm") |
Calculate matrix elements for (chemistry) Hamiltonian (may need MIMD) |
Programs such as MOPAC, GAUSSIAN
|
Quantum Monte Carlo
|
Molecular Dynamics
|
Basic steps in the algorithm:
|
Unstructured Finite Element Meshes
|
Performance of SIMD or Autonomous SIMD (Maspar) versus MIMD is unclear for these irregular applications |
Basic Steps:
|
Optimization not trivial - for instance on the CM-5
|
Cluster algorithms for Monte Carlo spin systems |
This is a very similar problem to region finding in image processing |
The dominant problem is component labelling - this can be done using "scan" (parallel prefix) with intrinsic library support |
Data structure is an array, with a difficult (irregular) parallel algorithm hidden in a library call. |
Multiple phase problems, such as the examples listed below
|
One needs to express "alignment" between phases (different levels of multigrid; particles and the grid for PIC) and then optimize the distribution and redistribution between phases (see the sketch below). |
Irregular problems of this type are only suitable for MIMD. |
Very irregular cases with mismatches between phases perform poorly even on MIMD. |
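A hedged sketch of the phase-to-phase remapping meant above, using HPF's DYNAMIC and REDISTRIBUTE directives (the array name and the particular BLOCK/CYCLIC choices are made up for illustration): |

      REAL FIELD(1024,1024)
!HPF$ DYNAMIC FIELD
!HPF$ DISTRIBUTE FIELD(BLOCK,BLOCK)
C     ... phase 1: regular grid (field-solve) stencil updates on FIELD ...
!HPF$ REDISTRIBUTE FIELD(CYCLIC,*)
C     ... phase 2: irregular gather/scatter between particles and FIELD ...

The cost of the REDISTRIBUTE itself is exactly the phase-mismatch overhead the bullets above warn about. |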
Particle in the Cell -- Basic steps of algorithm
|
Irregularly Coupled Regular Meshes (ICRM) -- Basic Algorithm
|
Direct methods for sparse matrix solvers
|
O(N) and O(N log N) fast multipole methods for particle dynamics
|
Original message passing code (Salmon thesis) won "Intel Delta Application Prize"
|
Current version (Warren and Salmon) looks possible with extended PARTI |
Interesting Parallel C++ work |
Clustering algorithm (Appel, Barnes-Hut, Greengard) |
Can replace an M-body force calculation by a single one using the center of mass of the cluster
|
Can do O(N=10,000) particles with the O(N^2) algorithm |
A three orders of magnitude larger problem is possible with the clustering algorithm (roughly, O(N^2) work at N ~ 10^4 is ~10^8 interactions, comparable to N log N work at N ~ 10^7) |
Key idea behind data-parallel languages - and perhaps all good languages |
The language expresses problem and not the machine architecture |
Need different approach to express functional parallelism |
Use"object-oriented" feature when problems has natural objects |
Do not use "object-oriented" features when objects are artifacts of target machine |
Need data and functional (object) parallel paradigms in many problems - especially multidisciplinary |
Architecture of "virtual problem" determines nature of language |
1) High Performance, Low Latency Links
|
Software model in this case is High Performance Fortran or |
Fortran plus Message Passing |
2) Flexible, "medium" performance links with high functionality. This case has examples: |
For Machine: Network example is Hardware Bus |
For Problem: Components are different programs in a complicated multidisciplinary design environment |
For Software: Model is "Software bus" or "Distributed Computing"
|
Other software examples are AVS, MOVIE (Syracuse), ... |
Can divide a project into
|
HPF Compiler
|
HPF Interpreter
|
Basic Features
|
Highlights
|
Flaws
|
Stock Option Price Modeling |
Data Parallel Pricing Models on the CM-5 and DECmpp-12000 (MasPar MP-1) |
An Interactive Visualization Environment on a Heterogeneous Computing System |
The Distributed System Configuration and Integration |
The Graphical User Interface |
Discussion |
Electromagnetic Scattering Simulation |
The Simulation Model and Physical Parameters |
Major Computational Issues
|
A General Performance Model of the Remote Visualization Environment |
EMS Subtasks and System Implementation |
The GUI and Inter-module communication |
Discussion |
Basic equations calculate matrix elements and then apply linear solvers |
The discussion of HPF is a "general discussion of data parallelism" |
HPC++ uses data parallelism from HPF |
Parallelism comes from the problem, not the language! |
Distinguish
|
Data parallel
|
AVS - coarse grain dataflow, with commercial support and many image processing modules |
Linda - a shared database accessed by messaging, with commercial support |
Speedes - event driven simulation, i.e. temporal synchronization |
PCN, C++, Fortran-M, ...
|
What is the relation of AVS and PVM? |
|
What is the relation of PVM and ??????? |
Should we build Integrated Systems |
Fortran-M ==> HPF-M |
or keep subsystems with distinct functionality |
HPF + AVS |
or |
HPF + PVM |
(HPF provides the data parallelism; AVS or PVM provides the task parallelism)