Given by Geoffrey C. Fox at CPS615 Spring Semester 2000 in February 2000. Foils prepared February 13, 2000
Spring Semester 2000 |
Geoffrey Fox |
Northeast Parallel Architectures Center |
Syracuse University |
111 College Place |
Syracuse NY |
gcf@npac.syr.edu |
gcf@cs.fsu.edu |
This takes Jacobi Iteration for Laplace's Equation in a 2D square and uses this to illustrate: |
Programming in both Data Parallel (HPF) and Message Passing (MPI and a simplified Syntax) |
SPMD -- Single Program Multiple Data -- Programming Model |
Stencil dependence of Parallel Program and use of Guard Rings |
Collective Communication |
Basic Speed Up, Efficiency and Performance Analysis with edge-over-area dependence and consideration of load imbalance and communication overhead effects |
So imagine the world's simplest problem |
Find the electrostatic potential inside a box whose sides are at a given potential |
Set up a 16 by 16 grid on which the potential is defined and which must satisfy Laplace's Equation |
Initialize the internal 14 by 14 grid to anything you like and then apply forever: |
φNew = (φLeft + φRight + φUp + φDown) / 4 |
14 by 14 Internal Grid |
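A minimal sequential sketch of this update (the array name PHI, the zero boundary value and the fixed iteration count are illustrative assumptions, not from the foils):
  program jacobi_seq
    real    :: phi(16,16), phinew(16,16)
    integer :: i, j, iter
    phi = 0.0             ! rows/columns 1 and 16 hold the fixed boundary potential
    phi(2:15,2:15) = 1.0  ! initialize the internal 14 by 14 grid to anything you like
    phinew = phi
    do iter = 1, 100      ! "apply forever" -- in practice, until converged
      do j = 2, 15
        do i = 2, 15
          phinew(i,j) = 0.25*(phi(i-1,j) + phi(i+1,j) + phi(i,j-1) + phi(i,j+1))
        end do
      end do
      phi(2:15,2:15) = phinew(2:15,2:15)
    end do
  end program jacobi_seq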
If one has 16 processors, then decompose the geometrical area into 16 equal parts |
Each processor updates 9, 12 or 16 grid points independently |
Updating edge points in any processor requires communication of values from the neighboring processor |
For instance, the processor holding the green points requires the red points |
4 by 4 regions in each processor |
8 by 8 regions in each processor |
Communication is an edge effect |
Give each processor plenty of memory and increase the region stored in each machine |
Large Problems Parallelize Best |
Asynchronous Communication: The style in which a node is free to send messages to any other node at any time, regardless of hardware connectivity and "node readiness" |
Blocking and Non-Blocking Communication: In blocking communication a send or receive call does not return until it is complete (or its buffer can safely be reused); in non-blocking communication the call returns immediately and completion is checked later |
Collective Communication: High level communication calls that carry out send/receive operations on groups of nodes e.g. broadcast (all) or multicast (partial set). These routines free the user from determining the optimal method to carry out such communication and can lead to significant performance gains and ease of programming. |
Interrupt Driven Communication: A system in which messages arriving at a node cause an interruption in the flow of the program executing there. The system (or user program) must handle the incoming message before returning to the place where the interrupt occurred. |
Loosely Synchronous Communication: The programming model in which an application is divided into compute phases followed by communication phases. |
The communication phase is often implemented with collective communication and serves to synchronize processors |
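A minimal sketch contrasting these styles with the standard MPI Fortran bindings (the buffer sizes and message tags are illustrative assumptions, not from the foils):
  program comm_styles
    include 'mpif.h'
    integer :: rank, nprocs, ierr, request
    integer :: status(MPI_STATUS_SIZE)
    real    :: edge(16), halo(16), bcast_data(16)
    call MPI_INIT(ierr)
    call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)
    call MPI_COMM_SIZE(MPI_COMM_WORLD, nprocs, ierr)
    edge = real(rank)
    if (nprocs >= 2) then
      if (rank == 0) then
        ! Blocking send: returns only when the buffer can safely be reused
        call MPI_SEND(edge, 16, MPI_REAL, 1, 0, MPI_COMM_WORLD, ierr)
      else if (rank == 1) then
        ! Non-blocking receive: returns immediately; completion checked with MPI_WAIT
        call MPI_IRECV(halo, 16, MPI_REAL, 0, 0, MPI_COMM_WORLD, request, ierr)
        call MPI_WAIT(request, status, ierr)
      end if
    end if
    ! Collective communication: broadcast from node 0 to all nodes
    if (rank == 0) bcast_data = 1.0
    call MPI_BCAST(bcast_data, 16, MPI_REAL, 0, MPI_COMM_WORLD, ierr)
    call MPI_FINALIZE(ierr)
  end program comm_styles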
Data Parallel programming is typified by CMFortran and its generalization, High Performance Fortran, which in previous years we discussed in detail but this year we will only discuss at a high level |
Message Passing is typified by the later discussion of the Laplace example; it specifies specific machine actions, i.e. send a message between nodes, whereas the data parallel model is at a higher level as it (tries to) specify a problem feature |
Note: We are always using "data parallelism" at problem level whether software is "message passing" or "data parallel" |
Data parallel software is translated by a compiler into "machine language", which is typically message passing on a distributed memory machine and threads on a shared memory machine |
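A minimal data parallel (HPF) sketch of the same Jacobi update, where the DISTRIBUTE directive and array syntax replace explicit message passing (the array names and distribution are illustrative assumptions, not from the foils):
  program jacobi_hpf
    real :: phi(16,16), phinew(16,16)
  !HPF$ DISTRIBUTE phi(BLOCK,BLOCK)
  !HPF$ DISTRIBUTE phinew(BLOCK,BLOCK)
    integer :: iter
    phi = 0.0             ! boundary rows/columns hold the fixed potential
    phi(2:15,2:15) = 1.0  ! arbitrary interior initial guess
    phinew = phi
    do iter = 1, 100
      ! Data parallel update; the compiler generates any needed message passing
      phinew(2:15,2:15) = 0.25*(phi(1:14,2:15) + phi(3:16,2:15)   &
                              + phi(2:15,1:14) + phi(2:15,3:16))
      phi(2:15,2:15) = phinew(2:15,2:15)
    end do
  end program jacobi_hpf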
The concept of guard rings/points/"halos" is well known in the sequential case; for a trivial one-dimensional example (shown above) one has 16 points. |
The end points are fixed boundary values |
One could save space and dimension PHI(14), handling the boundary values with conditional statements inside the I=1,14 update loop, as in the sketch below |
But this is slower and clumsy to program due to the conditionals INSIDE the loop, so one instead dimensions PHI(16), storing the boundary values in PHI(1) and PHI(16) |
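The code fragment itself did not survive in these notes; the following is a minimal sketch (the boundary variables BLEFT and BRIGHT and the initial values are illustrative assumptions) of the two choices:
  program guard_1d
    real    :: phi14(14), phinew14(14), left, right, bleft, bright
    real    :: phi16(16), phinew16(16)
    integer :: i
    bleft = 0.0
    bright = 0.0
    ! Choice 1: PHI(14) holds only the interior points, so the boundary
    ! values need conditionals INSIDE the update loop
    phi14 = 1.0
    do i = 1, 14
      if (i == 1) then
        left = bleft
      else
        left = phi14(i-1)
      end if
      if (i == 14) then
        right = bright
      else
        right = phi14(i+1)
      end if
      phinew14(i) = 0.5*(left + right)
    end do
    ! Choice 2: PHI(16) stores the boundary values in PHI(1) and PHI(16),
    ! so the inner loop needs no conditionals
    phi16(1) = bleft
    phi16(16) = bright
    phi16(2:15) = 1.0
    do i = 2, 15
      phinew16(i) = 0.5*(phi16(i-1) + phi16(i+1))
    end do
  end program guard_1d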
In the analogous 2D sequential case, one could dimension the array PHI as PHI(14,14) to hold the updated points only. However points on the edge would then need special treatment so that one uses boundary values in the update |
Rather, dimension PHI(16,16) to include both internal and boundary points |
This is easier and faster as there are no conditionals (IF statements) in the inner loops |
Now we decompose our 16 points (trivial example) into four groups and put 4 points in each processor |
Instead of dimensioning PHI(4) in each processor, one dimensions PHI(6) and runs the loops from 2 to 5, with either boundary values or communication setting the values of the end points |
Sequential: |
Parallel: |
PHI(6) for Processor 1 |
In the bi-color points, the upper color is that of the "owning processor" and the bottom color is that of the processor that needs the value for updating a neighboring point |
Owned by Green -- needed by Yellow |
Owned by Yellow -- needed by Green |
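A minimal SPMD sketch of this in MPI (assumptions, not from the foils: the processors form a line, the physical boundary value is 0.0, and a fixed iteration count is used):
  program halo_1d
    include 'mpif.h'
    integer :: rank, nprocs, left, right, ierr, i, iter
    integer :: status(MPI_STATUS_SIZE)
    real    :: phi(6), phinew(6)
    call MPI_INIT(ierr)
    call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)
    call MPI_COMM_SIZE(MPI_COMM_WORLD, nprocs, ierr)
    left  = rank - 1
    right = rank + 1
    if (left < 0)        left  = MPI_PROC_NULL   ! physical boundary on the left
    if (right >= nprocs) right = MPI_PROC_NULL   ! physical boundary on the right
    phi = 0.0                                    ! guard points double as boundary values
    phi(2:5) = 1.0                               ! arbitrary initial guess for owned points
    do iter = 1, 100
      ! Fill guard points PHI(1) and PHI(6) from the neighboring processors
      call MPI_SENDRECV(phi(2), 1, MPI_REAL, left, 0,   &
                        phi(6), 1, MPI_REAL, right, 0,  &
                        MPI_COMM_WORLD, status, ierr)
      call MPI_SENDRECV(phi(5), 1, MPI_REAL, right, 1,  &
                        phi(1), 1, MPI_REAL, left, 1,   &
                        MPI_COMM_WORLD, status, ierr)
      ! Update only the owned points, loop running from 2 to 5
      do i = 2, 5
        phinew(i) = 0.5*(phi(i-1) + phi(i+1))
      end do
      phi(2:5) = phinew(2:5)
    end do
    call MPI_FINALIZE(ierr)
  end program halo_1d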
This will only depend on 3 parameters |
n which is grain size -- amount of problem stored on each processor (bounded by local memory) |
tfloat which is typical time to do one calculation on one node |
tcomm which is typical time to communicate one word between two nodes |
The most important omission here is communication latency |
Time to communicate = tlatency + (Number of Words) tcomm |
Figure: two nodes, Node A and Node B, each with a CPU (time tfloat per calculation) and memory holding grain size n, linked by communication taking time tcomm |
Consider an N by N array of grid points on P Processors where √P is an integer and they are arranged in a √P by √P topology |
Suppose N is exactly divisible by √P and a general processor has a grain size n = N²/P grid points |
Sequential time T1 = (N-2)² tcalc |
Parallel Time TP = n tcalc |
Speedup S = T1/TP = P (1 - 2/N)² = P (1 - 2/√(nP))² |
S tends to P as N gets large at fixed P |
This expresses analytically the intuitive idea that the load imbalance is due to boundary effects and will go away for large N |
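For example, with N = 16 and P = 16 as on the earlier foils, S = 16 (1 - 2/16)² = 16 × 0.765625 = 12.25, the value appearing on the next foil |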
The largest communication load is communicating 16 words, to be compared with calculating 16 updates -- each update taking time tcalc |
Each communication is one value of φ, probably stored in a 4 byte word, and takes time tcomm |
Then on 16 processors, T16 = 16 tcalc + 16 tcomm |
Speedup S = T1/T16 = 12.25 / (1 + tcomm/tcalc) |
or S = 12.25 / (1 + 0.25 tcomm/tfloat) |
or S ≈ 12.25 * (1 - 0.25 tcomm/tfloat) |
Consider an N by N array of grid points on P processors with grain size n = N²/P |
Sequential Time T1 = 4 N² tfloat |
Parallel Time TP = 4 n tfloat + 4 √n tcomm |
Speed up S = P (1 - 2/N)² / (1 + tcomm/(√n tfloat)) |
Both overheads decrease like 1/√n as n increases |
This ignores communication latency but is otherwise accurate |
Speed up is reduced from P by both overheads |
Load Imbalance and Communication Overhead |
Efficiency ε = Speed Up S / P (Number of Processors) |
Overhead fcomm = (P TP - T1) / T1 = 1/ε - 1 |
As fcomm is linear in TP, overhead effects tend to be additive |
In the 2D Jacobi example fcomm = tcomm/(√n tfloat) |
While the efficiency takes the approximate form ε ≈ 1 - tcomm/(√n tfloat), valid when the overhead is small |
As expected efficiency is < 1 corresponding to speedup being < P |
In many problems there is an elegant formula fcomm = constant · tcomm/(n^(1/d) tfloat) |
d is the system information dimension, which is equal to the geometric dimension in problems like Jacobi where communication is a surface effect and calculation a volume effect |
d = 1 for Hadrian's Wall and d = 2 for Hadrian's Palace floor, while for Jacobi in 1, 2 or 3 dimensions, d = 1, 2 or 3 |
Note the formula only depends on local node and communication parameters; this implies that parallel computing does scale to large P if you build fast enough networks (tcomm/tfloat) and have a large enough problem (big n) |
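As an illustration (the numbers are assumed, not from the foils): with tcomm/tfloat = 10 and grain size n = 10,000 in the 2D Jacobi case (d = 2), fcomm = 10/√10000 = 0.1 and ε ≈ 0.9, independent of the number of processors P |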
For Jacobi, we have |
Calculation 4 n tfloat |
Communication 4 √n tcomm |
Communication Overhead fcomm = tcomm/(√n tfloat) |
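This follows as the ratio of communication to calculation time per node, 4 √n tcomm / (4 n tfloat) = tcomm/(√n tfloat), if the smaller load imbalance term is ignored |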
"Smallest" Communication but NOT smallest overhead |
Update Stencil |
Communicated Updated Processor Boundaries |
For Jacobi with fourth order differencing, we have |
Calculation 9 n tfloat |
Communication 8 √n tcomm |
Communication Overhead fcomm = 0.89 tcomm/(√n tfloat) |
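Here the coefficient 0.89 ≈ 8/9 is just the ratio of the 8 √n tcomm communication to the 9 n tfloat calculation |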
A little bit smaller as communication and computation both doubled |
Update Stencil |
Communicated Updated Processor Boundaries |
For Jacobi with diagonal neighbors, we have |
Calculation 9 n tfloat |
Communication 4 (√n + 1) tcomm |
Communication Overhead fcomm = 0.5 tcomm/(√n tfloat) |
Quite a bit smaller |
Update Stencil |
Communicated Updated Processor Boundaries |
Now systematically increase the size of the stencil. You get this in particle dynamics problems as you increase the range of the force |
Calculation per point and communication per point both increase, but the calculation increases faster, so fcomm decreases systematically |
Must re-use communicated values for this to work! |
Update Stencils of increasing range |
Now make range cover full domain as in long range force problems |
fcomm ≈ tcomm/(n tfloat) |
This is a case with geometric dimension 1, 2 or 3 (depending on the space the particles live in) but information dimension always 1 |
This is just like the one dimensional case |
First we decompose problem as we have seen |
Four Processors are shown |
Now look at the processor in the top left |
It needs the real boundary values for its updates, shown as black and green |
Then it needs points from the neighboring processors, shown hatched with green and the other processor's color |
Now we see the effect of all the guards, with the four points at the center needed by 3 processors and the other shaded points by 2 |
One dimensions the overlapping grids as PHI(10,10) here and arranges the communication order properly, as in the sketch below |
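A minimal SPMD sketch of this 2D case (assumptions, not from the foils: 4 processors in a 2 by 2 grid, an 8 by 8 block of owned points each, fixed boundary value 0.0):
  program halo_2d
    include 'mpif.h'
    integer :: rank, nprocs, ierr, iter, i, j
    integer :: myrow, mycol, north, south, west, east
    integer :: status(MPI_STATUS_SIZE)
    real    :: phi(10,10), phinew(10,10)
    call MPI_INIT(ierr)
    call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)
    call MPI_COMM_SIZE(MPI_COMM_WORLD, nprocs, ierr)    ! assumed to be 4
    myrow = rank / 2                                    ! row in the 2 by 2 processor grid
    mycol = mod(rank, 2)                                ! column in the processor grid
    north = merge(rank - 2, MPI_PROC_NULL, myrow > 0)   ! MPI_PROC_NULL marks a physical boundary
    south = merge(rank + 2, MPI_PROC_NULL, myrow < 1)
    west  = merge(rank - 1, MPI_PROC_NULL, mycol > 0)
    east  = merge(rank + 1, MPI_PROC_NULL, mycol < 1)
    phi = 0.0                                           ! guard ring doubles as the fixed boundary
    phi(2:9,2:9) = 1.0                                  ! arbitrary interior initial guess
    phinew = phi
    do iter = 1, 100
      ! Fill the four guard edges from the neighboring processors
      call MPI_SENDRECV(phi(2:9,2),  8, MPI_REAL, west, 0,   &
                        phi(2:9,10), 8, MPI_REAL, east, 0,   &
                        MPI_COMM_WORLD, status, ierr)
      call MPI_SENDRECV(phi(2:9,9),  8, MPI_REAL, east, 1,   &
                        phi(2:9,1),  8, MPI_REAL, west, 1,   &
                        MPI_COMM_WORLD, status, ierr)
      call MPI_SENDRECV(phi(2,2:9),  8, MPI_REAL, north, 2,  &
                        phi(10,2:9), 8, MPI_REAL, south, 2,  &
                        MPI_COMM_WORLD, status, ierr)
      call MPI_SENDRECV(phi(9,2:9),  8, MPI_REAL, south, 3,  &
                        phi(1,2:9),  8, MPI_REAL, north, 3,  &
                        MPI_COMM_WORLD, status, ierr)
      ! Jacobi update of the 8 by 8 owned points only
      do j = 2, 9
        do i = 2, 9
          phinew(i,j) = 0.25*(phi(i-1,j) + phi(i+1,j) + phi(i,j-1) + phi(i,j+1))
        end do
      end do
      phi(2:9,2:9) = phinew(2:9,2:9)
    end do
    call MPI_FINALIZE(ierr)
  end program halo_2d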