Given by Geoffrey C. Fox at CPS615 Computational Science on Spring Semester 2000. Foils prepared 13 February 2000
Summary of Material
We give a simple overview of parallel architectures today with distributed, shared or distributed shared memory |
We describe classes of parallel applications illustrating some key features such as load balancing and communication |
We describe programming models and how their features match applications |
Spring Semester 2000 |
Geoffrey Fox |
Northeast Parallel Architectures Center |
Syracuse University |
111 College Place |
Syracuse NY |
gcf@npac.syr.edu |
gcf@cs.fsu.edu |
Find what machine or class of Machines you have available |
Examine parallelism seemingly available in your application (algorithm) and decide on mechanism needed to exploit it.
|
Decide on and use programming model (HPF, MPI, Threads, openMP) and explicit realization of it
|
Worry about related issues
|
Evaluate possible tools
|
So imagine the world's simplest problem |
Find the electrostatic potential inside a box whose sides are at a given potential |
Set up a 16 by 16 grid on which the potential is defined and which must satisfy Laplace's Equation |
Initialize the internal 14 by 14 grid to anything you like and then apply forever! |
φ_New = (φ_Left + φ_Right + φ_Up + φ_Down) / 4 |
14 by 14 Internal Grid |
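A minimal sequential sketch of this update in C (the grid size matches the foil; the fixed boundary value of 1.0 and the iteration count are illustrative choices, not from the foils):

#include <stdio.h>

#define N 16                       /* 16 by 16 grid including the fixed boundary */

int main(void) {
    double phi[N][N], phinew[N][N];
    int i, j, iter;

    /* Boundary held at the given potential; interior initialized to anything */
    for (i = 0; i < N; i++)
        for (j = 0; j < N; j++)
            phi[i][j] = (i == 0 || j == 0 || i == N-1 || j == N-1) ? 1.0 : 0.0;

    /* "Apply forever": in practice iterate until the changes become tiny */
    for (iter = 0; iter < 1000; iter++) {
        for (i = 1; i < N-1; i++)              /* the 14 by 14 internal grid */
            for (j = 1; j < N-1; j++)
                phinew[i][j] = 0.25 * (phi[i-1][j] + phi[i+1][j] +
                                       phi[i][j-1] + phi[i][j+1]);
        for (i = 1; i < N-1; i++)
            for (j = 1; j < N-1; j++)
                phi[i][j] = phinew[i][j];
    }
    printf("phi at center = %f\n", phi[N/2][N/2]);
    return 0;
}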
If one has 16 processors, then decompose geometrical area into 16 equal parts |
Each processor updates 9, 12, or 16 grid points independently |
Updating edge points in any processor requires communication of values from neighboring processor |
For instance, the processor holding green points requires red points |
A parallel computer is any old collection of processing elements that cooperate to solve large problems fast
|
Some broad issues:
|
Parallel computers allow several CPUs to contribute to a computation simultaneously. |
For our purposes, a parallel computer has three types of parts:
|
Key points:
|
Colors Used in Following pictures |
Every processor has a memory others can't access. |
Advantages:
|
Disadvantages:
|
On distributed memory machines, each chunk of decomposed data resides in a separate memory space -- a processor is typically responsible for storing and processing that data (owner-computes rule) |
Information needed on edges for update must be communicated via explicitly generated messages |
Messages |
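A hedged sketch of such explicitly generated messages in C with MPI, assuming for simplicity a one-dimensional strip decomposition with ghost rows (the function name, strip width, and grid size are illustrative, not from the foils):

#include <mpi.h>

#define NCOLS 16
#define NROWS_LOCAL 4                  /* e.g. 16 rows split over 4 processors */

/* Exchange edge rows with the processors above and below.
   phi has two extra "ghost" rows: row 0 and row NROWS_LOCAL+1. */
void exchange_edges(double phi[NROWS_LOCAL + 2][NCOLS], int rank, int size) {
    int up   = (rank > 0)        ? rank - 1 : MPI_PROC_NULL;
    int down = (rank < size - 1) ? rank + 1 : MPI_PROC_NULL;

    /* Send my first real row up, receive neighbour's row into the bottom ghost row */
    MPI_Sendrecv(phi[1],             NCOLS, MPI_DOUBLE, up,   0,
                 phi[NROWS_LOCAL+1], NCOLS, MPI_DOUBLE, down, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    /* Send my last real row down, receive neighbour's row into the top ghost row */
    MPI_Sendrecv(phi[NROWS_LOCAL],   NCOLS, MPI_DOUBLE, down, 1,
                 phi[0],             NCOLS, MPI_DOUBLE, up,   1,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
}

Each processor then updates only the rows it owns (owner-computes) and reads its neighbours' edge values from the ghost rows filled in by the messages.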
Conceptually, the nCUBE, CM-5, Paragon, SP-2, and Beowulf PC clusters are quite similar.
|
To program these machines:
|
All processors access the same memory. |
Advantages:
|
Disadvantages:
|
On a shared memory machine, a CPU is responsible for processing a decomposed chunk of data but not for storing it |
The nature of the parallelism is identical to that for distributed memory machines, but communication is implicit as one "just" accesses memory |
The interconnection network shown here is actually for the BBN Butterfly, but the C-90 is in the same spirit. |
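On a shared memory machine the same sweep needs no explicit messages: a neighbour's edge values are read directly from the common memory. A minimal sketch with openMP in C (grid size and function name are illustrative, not from the foils):

#include <omp.h>

#define N 16

/* One Jacobi sweep: each thread updates a block of rows; reading the
   neighbouring rows is an ordinary memory access, not a message. */
void sweep(double phi[N][N], double phinew[N][N]) {
    int i, j;
    #pragma omp parallel for private(j)
    for (i = 1; i < N - 1; i++)
        for (j = 1; j < N - 1; j++)
            phinew[i][j] = 0.25 * (phi[i-1][j] + phi[i+1][j] +
                                   phi[i][j-1] + phi[i][j+1]);
}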
These machines share data by direct access.
|
Combining the (dis)advantages of shared and distributed memory |
Lots of hierarchical designs are appearing.
|
Distributed Shared Memory machines have communication features of both distributed (messages) and shared (memory access) architectures |
Note for distributed memory, the programming model must express data location (HPF DISTRIBUTE directive) and the invocation of messages (MPI syntax) |
For shared memory, one needs to express control (openMP) or processing parallelism and synchronization -- one must make certain that when a variable is updated, the "correct" version is used by other processors accessing this variable and that values living in caches are updated |
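A small sketch of that synchronization requirement in openMP terms (illustrative, not from the foils): a barrier separates computing the new values from copying them back, so no processor can pick up a stale or cached value.

#include <omp.h>

#define N 16

void two_phase_sweep(double phi[N][N], double phinew[N][N]) {
    #pragma omp parallel
    {
        int i, j;
        #pragma omp for
        for (i = 1; i < N - 1; i++)
            for (j = 1; j < N - 1; j++)
                phinew[i][j] = 0.25 * (phi[i-1][j] + phi[i+1][j] +
                                       phi[i][j-1] + phi[i][j+1]);
        /* The "for" construct already ends with an implicit barrier; the
           explicit barrier below is redundant but makes the requirement
           visible: no thread may copy phinew back into phi until every
           thread has finished computing, otherwise another processor
           (or its cache) could supply a stale value. */
        #pragma omp barrier
        #pragma omp for
        for (i = 1; i < N - 1; i++)
            for (j = 1; j < N - 1; j++)
                phi[i][j] = phinew[i][j];
    }
}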
4 by 4 regions in each processor
|
8 by 8 regions in each processor
|
Communication is an edge effect |
Give each processor plenty of memory and increase region in each machine |
Large Problems Parallelize Best |
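A hedged back-of-the-envelope version of this edge effect (the L by L block per processor is an illustrative parameter, not from the foils): the calculation per sweep grows with the area of each processor's block while the communication grows only with its perimeter, so

\[
\frac{T_{\mathrm{comm}}}{T_{\mathrm{calc}}} \;\propto\; \frac{4L}{L^{2}} \;=\; \frac{4}{L},
\]

and giving each processor a bigger region (a larger problem on the same number of processors) drives the communication overhead down. In d dimensions with n grid points per processor the ratio scales as n^{-1/d}.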
This is a (sophisticated) wave equation similar to the Laplace example: you divide Los Angeles geometrically and assign a roughly equal number of grid points to each processor |
The Laplace grid points become finite element mesh nodal points arranged as triangles filling space |
All the action (triangles) is near the wing boundary |
Use domain decomposition, but no longer with equal area; rather, with equal triangle count |
Simulation of a cosmological cluster (say 10 million stars) |
Lots of work per star where stars are very close together (may need a smaller time step) |
Little work per star where the force changes slowly and can be well approximated by a low-order multipole expansion |
Particle dynamics of this type (irregular with sophisticated force calculations) always need complicated decompositions |
Equal area decompositions, as shown here, lead to load imbalance |
Figure: equal volume decomposition for the universe simulation over 16 processors; each point is a galaxy, star, or ... |
If one uses simpler algorithms (full O(N^2) forces) or an FFT, then equal area is best |
Consider a geometric problem with 4 processors |
In top decomposition, we divide domain into 4 blocks with all points in a given block contiguous |
In bottom decomposition we give each processor the same amount of work but divided into 4 separate domains |
edge/area(bottom) = 2* edge/area(top) |
So minimizing communication implies we keep points in a given processor together |
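A sketch of where the factor of 2 can come from, assuming the earlier 16 by 16 grid so that the top decomposition gives each processor one contiguous 8 by 8 block and the bottom decomposition gives it four scattered 4 by 4 blocks, and counting the full perimeter of every block as communicated edge:

\[
\frac{\mathrm{edge}}{\mathrm{area}}(\mathrm{top}) \propto \frac{4 \times 8}{64} = \frac{1}{2},
\qquad
\frac{\mathrm{edge}}{\mathrm{area}}(\mathrm{bottom}) \propto \frac{4 \times (4 \times 4)}{64} = 1,
\]

which reproduces edge/area(bottom) = 2 * edge/area(top): scattering a processor's points doubles the boundary it must communicate across.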
But this has a flip side. Suppose we are decomposing the seismic wave problem and all the action is near a particular earthquake fault (marked in the figure). |
In Top decomposition only the white processor does any work while the other 3 sit idle.
|
In Bottom decomposition all the processors do roughly the same work and so we get good load balance ...... |
Here is a cracked plate and calculating stresses with an equal area decomposition leads to terrible results
|
Concentrating processors near the crack leads to good workload balance |
Equal nodal points -- not equal area -- but to minimize communication, the nodal points assigned to a particular processor are contiguous |
This is an NP-complete (exponentially hard) optimization problem, but in practice there are many ways of getting good, though not exactly optimal, decompositions |
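One standard heuristic -- not named on the foil, offered here only as an example of the "many ways" -- is recursive coordinate bisection: sort the points along one axis, cut at the median so both halves carry equal work, and recurse with the axis flipped. A rough C sketch, assuming the processor count is a power of two:

#include <stdlib.h>

typedef struct { double x, y; int proc; } Point;

static int cmp_x(const void *a, const void *b) {
    double d = ((const Point *)a)->x - ((const Point *)b)->x;
    return (d > 0) - (d < 0);
}
static int cmp_y(const void *a, const void *b) {
    double d = ((const Point *)a)->y - ((const Point *)b)->y;
    return (d > 0) - (d < 0);
}

/* Assign points[0..n-1] to processors proc0 .. proc0+nproc-1 by recursive
   coordinate bisection: sort along the chosen axis, cut at the median so
   both halves have an equal point count (not equal area), then recurse
   with the axis flipped.  nproc is assumed to be a power of two. */
void bisect(Point *points, int n, int proc0, int nproc, int axis) {
    if (nproc == 1) {
        for (int i = 0; i < n; i++) points[i].proc = proc0;
        return;
    }
    qsort(points, n, sizeof(Point), axis == 0 ? cmp_x : cmp_y);
    int half = n / 2;
    bisect(points,        half,     proc0,             nproc / 2, 1 - axis);
    bisect(points + half, n - half, proc0 + nproc / 2, nproc / 2, 1 - axis);
}

Because each cut is at the median, the pieces are load balanced, and because each piece is a contiguous slab of coordinate space, the communicated boundary stays small.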
Figure: region assigned to one processor and its work load -- not perfect! |
Not all decompositions are quite the same |
In defending against missile attacks, you track each missile on a separate node -- geometric again |
In playing chess, you decompose chess tree -- an abstract not geometric space |
Figure: computer chess tree -- the current position (a node in the tree), the first set of moves, the opponent's counter moves, "California gets its independence" |
A parallel algorithm is a collection of tasks and a partial ordering between them. |
Design goals:
|
Sources of parallelism:
|
Data-parallel algorithms exploit the parallelism inherent in many large data structures.
|
Features of Data Parallelism
|
Note data-parallel algorithms can be expressed by ALL programming models (Message Passing, HPF like, openMP like) |
Functional parallelism exploits the parallelism between the parts of many systems.
|
Analysis:
|
Many applications are what is called (essentially) embarrassingly parallel or, more kindly, pleasingly parallel |
These are made up of independent concurrent components
|
In contrast, points in a finite difference grid (from a differential equation) canNOT be updated independently |
Such pleasingly parallel problems are often formally data-parallel but can be handled much more easily -- like functional parallelism |
A parallel language provides an executable notation for implementing a parallel algorithm. |
Design criteria:
|
Usually a language reflects a particular style of expressing parallelism. |
Data parallel expresses concept of identical algorithm on different parts of array |
Message parallel expresses fact that at low level parallelism implies information is passed between different concurrently executing program parts |
Data-parallel languages provide an abstract, machine-independent model of parallelism.
|
Advantages:
|
Disadvantages:
|
Examples: HPF, C*, HPC++ |
Originated on SIMD machines, where parallel operations are in lock-step, but was generalized (not so successfully, as the compilers are too hard) to MIMD |
Program is based on typically coarse-grain tasks |
Separate address space and a processor number for each task |
Data shared by explicit messages
|
Examples: MPI, PVM, Occam for parallel computing |
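A minimal C illustration of these points -- each MPI task runs in its own address space, learns its processor number, and shares data only through an explicit message (the buffer contents and message tag are illustrative):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, size;
    double edge[16] = {0.0};                 /* data shared only by an explicit message */

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);    /* my processor number */
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0 && size > 1)
        MPI_Send(edge, 16, MPI_DOUBLE, 1, 99, MPI_COMM_WORLD);
    else if (rank == 1)
        MPI_Recv(edge, 16, MPI_DOUBLE, 0, 99, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    printf("Task %d of %d holds its own copy of edge\n", rank, size);
    MPI_Finalize();
    return 0;
}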
A universal model for distributed computing to link naturally decomposed parts -- e.g. HTTP, RMI, IIOP, etc. are all message passing
|
Advantages:
|
Disadvantages:
|
Experts in Java are familiar with this, as it is built into the Java language through thread primitives |
We take "ordinary" languages such as Fortran, C++, Java and add constructs to help compilers divide processing (automatically) into separate threads
|
openMP is a recent set of compiler directives supporting this model |
This model tends to be inefficient on distributed memory machines as optimizations (data layout, communication blocking, etc.) are not natural |
Applications are metaproblems with a mix of module (aka coarse grain functional) and data parallelism |
Modules are decomposed into parts (data parallelism) and composed hierarchically into full applications. They can be the
|
Modules are "natural" message-parallel components of problem and tend to have less stringent latency and bandwidth requirements than those needed to link data-parallel components
|
Assume that the primary goal of a metacomputing system is to add to existing parallel computing environments a higher level supporting module parallelism
|
Use Java/Distributed Object Technology for modules -- note that Java is to a growing extent used to write servers for CORBA and COM object systems |
We have multiple supercomputers in the backend -- one doing a CFD simulation of airflow, another structural analysis -- while in more detail you have linear algebra servers (NetSolve); optimization servers (NEOS); image processing filters (Khoros); databases (NCSA Biology Workbench); visualization systems (AVS, CAVEs)
|
All linked to collaborative information systems in a sea of middle tier servers (as on the previous page) to support design, crisis management, and multi-disciplinary research |
Figure: components include a Database, Matrix Solver, Optimization Service, MPP back ends, Parallel DB Proxy, NEOS Control Optimization, Origin 2000 Proxy, NetSolve Linear Alg. Server, IBM SP2 Proxy, Gateway Control, Agent-based Choice of Compute Engine, Multidisciplinary Control (WebFlow), and a Data Analysis Server |
The Java Language has several good design features
|
Java has a very good set of libraries covering everything from commerce, multimedia, images to math functions (under development at http://math.nist.gov/javanumerics) |
Java has the best available electronic and paper training and support resources |
Java is rapidly getting the best integrated program development environments |
Java, naturally integrated with the network and a universal virtual machine, supports the powerful "write once, run anywhere" model |
There is a large and growing trained labor force |
Can we exploit this in computational science? |
Use of Java for: |
High Performance Network Computing |
Scientific and Engineering Computation |
(Distributed) Modeling and Simulation |
Parallel and Distributed Computing |
Data Intensive Computing |
Communication and Computing Intensive Commercial and Academic Applications |
HPCC Computational Grids ........ |
Very difficult to find a "conventional name" that doesn't get misunderstood by some community! |
The Web integration of Java gives it excellent "network" classes and support for message passing. |
Thus "Java plus message passing" form of parallel computing is actually somewhat easier than in Fortran or C. |
Coarse grain parallelism very natural in Java |
"Data Parallel" languages features are NOT in Java and have to be added (as a translator) of NPAC's HPJava to Java+Messaging just as HPF translates to Fortran plus message passing |
Java has built in "threads" and a given Java Program can run multiple threads at a time
|
Can be used to do more general parallel computing but only on shared memory computers
|
Combine threads on a shared memory machine with message passing between distinct distributed memories |
"Distributed" or "Virtual" Shared memory does support the JavaVM as hardware gives illusion of shared memory to JavaVM |
Message Passing |
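The foils frame this combination in terms of Java threads; purely as an illustration of the same structure, here is a hedged C sketch using openMP threads inside each node and MPI messages between nodes (the "work" each thread does is illustrative):

#include <mpi.h>
#include <omp.h>
#include <stdio.h>

/* Hybrid pattern: one MPI task per shared memory machine, several threads
   inside each task.  Threads communicate through shared memory; tasks
   communicate by explicit messages. */
int main(int argc, char **argv) {
    int rank, provided;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double local_sum = 0.0, global_sum = 0.0;

    #pragma omp parallel reduction(+:local_sum)
    {
        /* each thread does a share of this node's work via shared memory */
        local_sum += omp_get_thread_num() + 1.0;   /* illustrative work */
    }

    /* message passing between the distinct distributed memories */
    MPI_Reduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0) printf("global sum = %f\n", global_sum);

    MPI_Finalize();
    return 0;
}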
So here is a recipe for developing HPCC (parallel) applications as of January 2000 |
Use MPI for data parallel distributed memory programs, as the alternatives today are HPF, HPC++, or parallelizing compilers
|
Use openMP or MPI on shared (distributed shared) memory machines
|
Pleasingly Parallel problems can use MPI or web/metacomputing technology |
We don't emphasize openMP in class, as the hard work (aka difficult programming model) of MPI is advantageous for the class since it teaches you parallel computing! |
Today Java can be used for client side applets and in systems middleware but it is too slow for production scientific code
|
Use metacomputers for pleasingly parallel and metaproblems -- not for closely knit problems with tight synchronization between parts |
Use where possible web and distributed object technology for "coordination" |