Given by Geoffrey C. Fox at CPS615 Basic Simulation Track for Computational Science on Fall Semester 97. Foils prepared 2 September 1997
Summary of Material
This takes Jacobi Iteration for Laplace's Equation in a 2D square and uses this to illustrate: |
Programming in both Data Parallel (HPF) and Message Passing (MPI and a simplified Syntax) |
SPMD -- Single Program Multiple Data -- Programming Model |
Stencil dependence of Parallel Program and use of Guard Rings |
Collective Communication |
Basic Speed Up, Efficiency and Performance Analysis with edge over area dependence and consideration of load imbalance and communication overhead effects. |
Geoffrey Fox |
NPAC |
Room 3-131 CST |
111 College Place |
Syracuse NY 13244-4100 |
Solve Laplace's Equation ∂²φ/∂x² + ∂²φ/∂y² = 0 |
on a rectangular domain, with specified boundary conditions. |
Use the simple iterative Jacobi algorithm (each new value computed entirely from old values): |
φ_new = ( φ_left + φ_right + φ_up + φ_down ) / 4 |
256 Grid Points; 16-Node Concurrent Processor; 16 Grid Points in each Processor |
φ is unknown at the interior grid points; φ is known (fixed) on the boundary |
X denotes φ values to be communicated between processors |
The basic sequential Fortran iteration (written in one dimension, as in the HPF version below) is: |
BEGIN TEST = 0
      DO 1 I = 2, NTOT1
      TEMP = 0.5 * (PHIOLD(I-1) + PHIOLD(I+1))
      TEST = AMAX1(TEST, ABS(TEMP - PHIOLD(I)))
1     PHINEW(I) = TEMP
      DO 2 I = 2, NTOT1
2     PHIOLD(I) = PHINEW(I)
      IF (TEST .GT. CONVG) GO TO BEGIN
Data Parallel programming is typified by CM Fortran and its generalization, High Performance Fortran (HPF), which we have discussed separately |
Typical Data Parallel Fortran statements are full array statements, as in the sketch below |
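For example, a few Fortran 90 style full array statements of this kind (our illustration, not taken from the foils): |
      REAL A(1000), B(1000), C(1000)
      A = 1.0                                 ! every element of A set at once
      B = EXP(A) + 2.0*A                      ! elementwise intrinsic and arithmetic
      C(2:999) = 0.5*(B(1:998) + B(3:1000))   ! array-section (stencil) form
      END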
Message Passing, typified by the later discussion of the Laplace example, specifies explicit machine actions, i.e. send a message between nodes, whereas the data parallel model is at a higher level as it (tries to) specify a problem feature |
Note: We are always using "data parallelism" at problem level whether software is "message passing" or "data parallel" |
Data parallel software is translated by a compiler into "machine language" which is typically message passing |
A particular way of using MIMD machines -- the DOMINANT successful use so far |
Each processor runs the same program but in general executes different instructions at different times |
We will later see that this corresponds to "loosely synchronous problems" |
This is the style of the current program example -- note that although each node is doing roughly the same thing, i.e. updating grid points, the nodes are NOT at the same point in the update at each clock cycle |
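A minimal SPMD skeleton in MPI Fortran (an illustration, not from the foils): every processor runs the identical program and differs only through the rank returned by MPI_COMM_RANK. |
      PROGRAM SPMDDEMO
      INCLUDE 'mpif.h'
      INTEGER PROCNUM, NPROC, IERR
      CALL MPI_INIT(IERR)
      CALL MPI_COMM_RANK(MPI_COMM_WORLD, PROCNUM, IERR)
      CALL MPI_COMM_SIZE(MPI_COMM_WORLD, NPROC, IERR)
!     Same code everywhere, but each node works on its own part of the data
!     and may take different branches (e.g. edge versus interior processors)
      IF (PROCNUM .EQ. 0) THEN
         PRINT *, 'I am processor 0 of', NPROC
      ELSE
         PRINT *, 'Processor', PROCNUM, 'updating its own grid points'
      END IF
      CALL MPI_FINALIZE(IERR)
      END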
!HPF$ TEMPLATE WORLD(NTOT) (1) |
!HPF$ DISTRIBUTE WORLD(BLOCK) (2) |
!HPF$ ALIGN PHINEW WITH WORLD (3) |
!HPF$ ALIGN PHIOLD WITH WORLD (3) |
BEGIN PHINEW(2:NTOT1) = 0.5 * (EOSHIFT(PHIOLD,1) + EOSHIFT(PHIOLD,-1)) (4) |
TEST = MAXVAL(ABS(PHINEW - PHIOLD)) (5) |
(1) Defines a data world of NTOT entries |
(2) Breaks up WORLD into equal parts, one block per processor, decomposed in one dimension |
(3) Instructs that PHINEW and PHIOLD are aligned exactly with WORLD, i.e. PHINEW(I) and PHIOLD(I) are stored in the Ith position of WORLD |
(4) EOSHIFT is "End-Off" SHIFT. The two calls shift PHIOLD by one to the left and to the right. The index range takes care of WORLD(1) and WORLD(NTOT) being boundary values |
Statement (4) can equivalently be written with explicit array sections: BEGIN PHINEW(2:NTOT1) = 0.5 * (PHIOLD(1:NTOT2) + PHIOLD(3:NTOT)) |
(5) Forms TEST as the maximum absolute change in φ at any world location. MAXVAL is a Fortran 90 intrinsic function |
NOTE: Any subscript n:m is an "array section" or set of indices |
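Putting fragments (1)-(5) together, a sketch of a complete HPF program might look as follows; the value NTOT = 16, the boundary values of 1.0 and the convergence tolerance CONVG are our assumptions for illustration. |
      PROGRAM JACOBI1D
      INTEGER, PARAMETER :: NTOT = 16
      INTEGER, PARAMETER :: NTOT1 = NTOT-1, NTOT2 = NTOT-2
      REAL, PARAMETER :: CONVG = 1.0E-4
      REAL PHINEW(NTOT), PHIOLD(NTOT), TEST
!HPF$ TEMPLATE WORLD(NTOT)
!HPF$ DISTRIBUTE WORLD(BLOCK)
!HPF$ ALIGN PHINEW WITH WORLD
!HPF$ ALIGN PHIOLD WITH WORLD
      PHIOLD = 0.0             ! initial guess in the interior
      PHIOLD(1) = 1.0          ! fixed boundary values (assumed)
      PHIOLD(NTOT) = 1.0
      PHINEW = PHIOLD
      TEST = 2.0*CONVG
      DO WHILE (TEST .GT. CONVG)
         PHINEW(2:NTOT1) = 0.5*(PHIOLD(1:NTOT2) + PHIOLD(3:NTOT))
         TEST = MAXVAL(ABS(PHINEW(2:NTOT1) - PHIOLD(2:NTOT1)))
         PHIOLD(2:NTOT1) = PHINEW(2:NTOT1)
      END DO
      END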
Basic operations are SEND and RECEIVE |
But there are a lot of subtleties |
How does program 2 know what message it is getting? Does it come from program 1 or from another program? |
Does program 1 get an acknowledgment (analogous to certified mail)? |
Do we use system or application level implementations? -- raises issues of protection and copying |
Broadcasts, multicasts -- other collective or multiple message/source/destination communications |
Do we have multiple processes or threads per processor? -- How do we address them? |
Asynchronous Communication: The style in which a node is free to send messages to any other node at any time, regardless of hardware connectivity and "node readiness" |
Blocking and Non-Blocking Communication: In the blocking style, receive operations suspend the calling program until a suitable message has arrived |
Collective Communication: High level communication calls that carry out send/receive operations on groups of nodes, e.g. broadcast (all) or multicast (partial set). These routines free the user from determining the optimal method to carry out such communication and can lead to significant performance gains and ease of programming |
Crystalline Communication: The style in which a node reading or writing is blocked until the matching operation is performed in the target node. Messages can only be sent or received between processors with a direct hardware connection |
Domain Decomposition or Data Parallelism: The programming style in which parallelism is achieved by dividing the data on which the algorithm works among the processors |
Interrupt Driven Communication: A system in which messages arriving at a node cause an interruption in the flow of the program executing there. The system (or user program) must handle the incoming message before returning to the place where the interrupt occurred |
Loosely Synchronous Communication: The programming model in which an application is divided into alternating compute and communication phases (see the small sketch after these definitions) |
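A small illustration (ours, not from the foils) of collective communication used in the loosely synchronous style: each node does a purely local compute phase, then all nodes join in a single collective call. |
      PROGRAM LOOSESYNC
      INCLUDE 'mpif.h'
      INTEGER PROCNUM, IERR
      REAL MYWORK, TOTAL
      CALL MPI_INIT(IERR)
      CALL MPI_COMM_RANK(MPI_COMM_WORLD, PROCNUM, IERR)
!     Compute phase: purely local work
      MYWORK = REAL(PROCNUM + 1)
!     Communication phase: a collective call involving every node
      CALL MPI_ALLREDUCE(MYWORK, TOTAL, 1, MPI_REAL, MPI_SUM,
     &                   MPI_COMM_WORLD, IERR)
      IF (PROCNUM .EQ. 0) PRINT *, 'Sum over all nodes =', TOTAL
      CALL MPI_FINALIZE(IERR)
      END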
Processor 1 ("typical") |
Initialization: NLOC = NTOT/NPROC (assume NTOT divisible by NPROC); the locally owned points are PHIOLD(2:NLOC1) with NLOC1 = NLOC+1 and guard cells at 1 and NLOC1+1 |
BASIC LOOP |
BEGIN TEST = 0 |
Shift to the right: |
      SOURCE = PHIOLD(NLOC1)                      ! address of data to be sent
      IF (PROCNUM .EQ. NPROC-1) SOURCE = "DUMMY"
      DEST = PHIOLD(1)                            ! address where data is to be stored
      IF (PROCNUM .EQ. 0) DEST = "DUMMY"
      CALL MPSHIFT(+1, SOURCE, DEST)
Shift to the left: |
      SOURCE = PHIOLD(2)                          ! address of data to be sent
      IF (PROCNUM .EQ. 0) SOURCE = "DUMMY"
      DEST = PHIOLD(NLOC1+1)                      ! address where data is to be stored
      IF (PROCNUM .EQ. NPROC-1) DEST = "DUMMY"
      CALL MPSHIFT(-1, SOURCE, DEST)
The example uses three idealized primitives (not present in a real message passing system) |
Message Passing Shift to right MPSHIFT (+1, SOURCE, DEST) |
Sends 1 word in location SOURCE to processor on the right |
Receives word in location DEST from the processor on the left |
SOURCE and DEST are locations -- if set to "DUMMY", then no information is to be sent or received |
Message Passing Shift to left MPSHIFT (-1, SOURCE, DEST) |
Sends 1 word in SOURCE to processor on the left |
Receives word in DEST from processor on the right |
GLOBALMAX (TEST) |
takes TEST from all processors |
forms TESTMAX = maximum value of TEST over all processors |
replaces TEST by TESTMAX in all processors |
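Once both shifts have filled the guard cells, the rest of the basic loop is the same update as in the sequential code, followed by GLOBALMAX. A hedged sketch as a subroutine (the name UPDATE and its argument list are our choices): |
      SUBROUTINE UPDATE(PHIOLD, PHINEW, NLOC1, TEST)
!     One local Jacobi sweep over the owned points 2..NLOC1; the guard
!     cells PHIOLD(1) and PHIOLD(NLOC1+1) must already hold neighbour data
      INTEGER NLOC1, I
      REAL PHIOLD(NLOC1+1), PHINEW(NLOC1+1), TEST, TEMP
      TEST = 0.0
      DO 1 I = 2, NLOC1
        TEMP = 0.5*(PHIOLD(I-1) + PHIOLD(I+1))
        TEST = AMAX1(TEST, ABS(TEMP - PHIOLD(I)))
 1    PHINEW(I) = TEMP
      DO 2 I = 2, NLOC1
 2    PHIOLD(I) = PHINEW(I)
!     The caller then calls GLOBALMAX(TEST) and repeats from BEGIN until
!     TEST falls below its convergence tolerance
      END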
Consider, for example, the shift to the right |
Then processors 0 .. NPROC-2 send to the right and processors 1 .. NPROC-1 receive from the left |
We can't necessarily send all messages and then receive all messages. A "standard" send can be "blocking", i.e. it will not return until the receive is completed by the destination processor. In this case we can deadlock, as all processors "hang" after the send, waiting for a receive |
So we are more careful |
Let PROCNUM be the processor "rank" (its number or order in the one dimensional decomposition) |
Note: MPI uses the reserved value MPI_PROC_NULL rather than "DUMMY". Also, we could remove the setting of "DUMMY" in the calling routine and place it inside MPSHIFT as a test on PROCNUM |
We can implement MPSHIFT directly in MPI as CALL MPI_SENDRECV(SOURCE, 1, MPI_REAL, PROCNUM+1, sendtag, DEST, 1, MPI_REAL, PROCNUM-1, recvtag, comm, status, ierr) -- a fuller sketch of MPSHIFT built on this call follows the notes below |
Notes: |
MPI_REAL denotes that variable is real |
"sendtag/recvtag" are for this purpose, a largely irrelevant additional message tag |
"comm" is extra system message tag defining "scope" -- i.e. the set of processors involved -- here it is all of them |
"status" tells you about received data. You needn't look at it if you trust your code and hardware |
In MPI, this is a single call |
CALL MPI_ALLREDUCE(TEST, TESTMAX, 1, MPI_REAL, MPI_MAX, comm, ierr) -- the result arrives in TESTMAX, as the send and receive buffers must be distinct |
Flag MPI_MAX specifies global maximum |
The implementation is quite subtle and usually involves a logarithmic tree |
There is a clever extension that gets the maximum into all processors in the "same" time as getting it into one processor, on hypercube and other architectures |
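For illustration only (this is not the MPI library's actual code), a sketch of the logarithmic "recursive doubling" idea behind such a global maximum, assuming the number of processors is a power of two; the routine name GLOBALMAX_TREE is ours: |
      SUBROUTINE GLOBALMAX_TREE(TEST, COMM)
      INCLUDE 'mpif.h'
      REAL TEST, PARTNERVAL
      INTEGER COMM, PROCNUM, NPROC, IBIT, PARTNER, IERR
      INTEGER STATUS(MPI_STATUS_SIZE)
      CALL MPI_COMM_RANK(COMM, PROCNUM, IERR)
      CALL MPI_COMM_SIZE(COMM, NPROC, IERR)
      IBIT = 1
      DO WHILE (IBIT .LT. NPROC)
!        Exchange the running maximum with the partner differing in one bit
         PARTNER = IEOR(PROCNUM, IBIT)
         CALL MPI_SENDRECV(TEST, 1, MPI_REAL, PARTNER, 0,
     &                     PARTNERVAL, 1, MPI_REAL, PARTNER, 0,
     &                     COMM, STATUS, IERR)
         TEST = MAX(TEST, PARTNERVAL)
         IBIT = 2*IBIT
      END DO
      END
After log2(NPROC) exchanges every processor holds the same global maximum, which is why the cost grows only logarithmically with the number of processors |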
In the sequential code, a single processor updates in turn the 196 internal points (16x16 = 256 total grid points minus 60 boundary points). Each update φ --> 0.25*(φ_up + φ_down + φ_left + φ_right) involves 4 floating point operations per point |
In the parallel case, each processor updates in turn the points for which it is responsible - this is "owner computes rule" |
A corner processor updates nine points (the small internal points) |
A "general" central processor updates sixteen points |
There are several different cases, illustrated by the points A, B, C in the case of the corner processor |
In each case we superimpose stencil on top of point to be updated. |
The large circle must be placed on top of each point "owned" by the processor. The cases A,B,C differ because the points used have different characteristics |
For the corner processor we need to communicate a total of 6 points (marked with an X) |
For a general processor, we need to communicate a total of 16 points (marked with an X) |
In both cases, one is "solving" the "same" problem φ --> 0.25*(φ_left + φ_right + φ_up + φ_down) |
over a set of points |
In sequential case, problem is "solved" over the full domain |
In parallel case, problem is "solved" over part of domain (roughly one sixteenth of it) i.e. each node is solving same problem as sequential case but over a different - smaller geometry |
In sequential case, boundary conditions are "just" fixed values on the edge |
In the parallel case, boundary conditions are, in addition, communicated values |
Parallel and sequential cases thus differ in their geometry (the size of the domain each solves over) and in their boundary conditions |
This analysis implies the SPMD -- Single Program Multiple Data -- programming model, i.e. each node executes the same program but runs through different data |
Also note that not only does each node of the parallel machine run the same program, but this program is very similar to the sequential program as long as one isolates the boundary and geometry sections |
Note that this problem can be solved on either a SIMD or a MIMD computer. Thus this SPMD example is not totally general; there are SPMD examples which require MIMD machines |
In the sequential case, one could dimension the array PHI holding φ as PHI(14,14) to hold the updated (internal) points only. However, points on the edge would then need special treatment so that boundary values are used in the update |
Rather, dimension PHI(16,16) to include both internal and boundary points |
Similarly, in the parallel case we will not dimension PHI(4,4) but rather PHI(6,6) to hold the communicated values as well |
Now we preload PHI(1,:), PHI(:,1), PHI(6,:), PHI(:,6) by communication, and then the basic "parallel" program update is identical to the sequential case: one loops over x(I), y(J) with I and J in the range 2 to 5 |
As "load imbalance" - not all processors have the same number of internal points, some differences are seen for "edge" processors with only 9 or 12 internal points |
For example, in the top right corner processor one needs to only run I, J from 2 to 4 |
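A hedged sketch of this guard ring layout on one node (the boundary value 1.0 stands in for values that would really arrive by communication, or be fixed boundary data on edge processors): |
      PROGRAM GUARD
      REAL PHI(6,6), PHINEW(6,6)
      INTEGER I, J
!     PHI(2:5,2:5) are the 16 points owned by this node;
!     rows/columns 1 and 6 form the guard ring
      PHI = 0.0
      PHI(1,:) = 1.0
      PHI(6,:) = 1.0
      PHI(:,1) = 1.0
      PHI(:,6) = 1.0
!     The update loop is identical to the sequential one, over I,J = 2..5
      DO J = 2, 5
         DO I = 2, 5
            PHINEW(I,J) = 0.25*(PHI(I-1,J) + PHI(I+1,J)
     &                        + PHI(I,J-1) + PHI(I,J+1))
         END DO
      END DO
      PHI(2:5,2:5) = PHINEW(2:5,2:5)
      END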
We look at speed up for this simple case |
Suppose you run a problem on one processor -- the sequential version -- and let the execution time be T1 |
Now run the "same" problem (the definition of "same" can be tricky, but it is not subtle here) on a parallel machine with P nodes whose CPUs are the same as the sequential machine's. Let the execution time be TP |
Then speed up is S=T1/TP |
We would like S to be P. What is it in our case, and why? |
The speed up S is reduced from P for two reasons: load imbalance and communication overhead |
In some cases, one also has effects from parallel method being different from sequential method but that is not relevant here |
We have processors with different amounts of work, as some update 16, some 12 and some only 9 internal points |
Each update takes time tcalc = 4 tfloat where tfloat is time taken to do floating point addition or multiplication |
Processors doing 9 or 12 updates must wait for those doing 16 to finish |
T1 = 196 tcalc and T16 = 16 tcalc, as all that counts is the maximum amount of work done by any one processor |
The speed up as degraded by load imbalance is: Sload = 196/16 = 12.25. This is the speed up ignoring any degradation due to communication |
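In formula form, together with the corresponding parallel efficiency (efficiency is just speed up divided by the number of processors): |
\[ S_{\mathrm{load}} = \frac{T_1}{T_{16}} = \frac{196\,t_{\mathrm{calc}}}{16\,t_{\mathrm{calc}}} = 12.25, \qquad \varepsilon = \frac{S_{\mathrm{load}}}{P} = \frac{12.25}{16} \approx 0.77 \]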
Suppose communicating a single word -- here a φ value, probably stored in a 4 byte word -- takes time tcomm |
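With the counts from the foils above (a general processor communicates 16 φ values while doing 16 updates of 4 floating point operations each), the fractional communication overhead fC and the resulting speed up can be estimated as follows; the √n form is the "edge over area" dependence mentioned in the summary, with n the number of grid points per processor: |
\[ f_C = \frac{16\,t_{\mathrm{comm}}}{16 \times 4\,t_{\mathrm{float}}} = \frac{t_{\mathrm{comm}}}{4\,t_{\mathrm{float}}} = \frac{t_{\mathrm{comm}}}{\sqrt{n}\,t_{\mathrm{float}}} \quad (n = 16), \qquad S = \frac{S_{\mathrm{load}}}{1 + f_C} = \frac{12.25}{1 + f_C} \]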
The previous foil showed that increasing stencil made slight improvements! |
This foil shows that larger stencils have much lower overheads (and hence better parallel performance) than the simple Laplace equation with its 5 point stencil |
Showing linear overhead behavior for fc |