Full HTML for Basic Foilset: Master Set for Parallel Programming for Laplace's Equation

Given by Geoffrey C. Fox at CPS615, Spring Semester 2000, in February 2000. Foils prepared February 13, 2000.
Summary of Material


This takes Jacobi Iteration for Laplace's Equation in a 2D square and uses this to illustrate:
Programming in both Data Parallel (HPF) and Message Passing (MPI and a simplified Syntax)
SPMD -- Single Program Multiple Data -- Programming Model
Stencil dependence of Parallel Program and use of Guard Rings
Collective Communication
Basic Speed Up, Efficiency and Performance Analysis with edge over area dependence and consideration of load imbalance and communication overhead effect

Table of Contents for full HTML of Master Set for Parallel Programming for Laplace's Equation


1 Parallel Programming for Laplace's Equation
2 Abstract of Parallel Programming for Laplace's Equation
3 Potential in a Vacuum Filled Rectangular Box
4 Basic Sequential Algorithm
5 Update on the Grid
6 Parallelism is Straightforward
7 Communication is Needed
8 Communication Must be Reduced
9 Various Styles/Terms in Message Passing I
10 Various Styles/Terms in Message Passing II
11 Approaches to Parallel Programming
12 Sequential Programming with Guard Rings
13 Sequential Guard Rings in Two Dimensions
14 Parallel Guard Rings in One Dimension
15 Summary of Parallel Guard Rings in One Dimension
16 Setup of Parallel Jacobi in One Dimension
17 Performance Analysis Parameters
18 Analytical analysis of Load Imbalance
19 Example of Communication Overhead
20 General Analytical Form of Communication Overhead for Jacobi
21 General Speed Up and Efficiency Analysis I
22 General Speed Up and Efficiency Analysis II
23 Communication to Calculation Ratio as a function of template I
24 Communication to Calculation Ratio as a function of template II
25 Communication to Calculation Ratio as a function of template III
26 Communication to Calculation IV
27 Communication to Calculation V
28 Parallel Guard Rings in Two Dimensions I
29 Parallel Guard Rings in Two Dimensions II
30 Parallel Guard Rings in Two Dimensions III




Foil 1 Parallel Programming for Laplace's Equation
Spring Semester 2000
Geoffrey Fox
Northeast Parallel Architectures Center
Syracuse University
111 College Place
Syracuse NY
gcf@npac.syr.edu
gcf@cs.fsu.edu

Foil 2 Abstract of Parallel Programming for Laplace's Equation
This takes Jacobi Iteration for Laplace's Equation in a 2D square and uses this to illustrate:
Programming in both Data Parallel (HPF) and Message Passing (MPI and a simplified Syntax)
SPMD -- Single Program Multiple Data -- Programming Model
Stencil dependence of Parallel Program and use of Guard Rings
Collective Communication
Basic Speed Up, Efficiency and Performance Analysis with edge over area dependence and consideration of load imbalance and communication overhead effect

Foil 3 Potential in a Vacuum Filled Rectangular Box
So imagine the world's simplest problem
Find the electrostatic potential inside a box whose sides are at a given potential
Set up a 16 by 16 grid on which the potential is defined and which must satisfy Laplace's Equation

Foil 4 Basic Sequential Algorithm
Initialize the internal 14 by 14 grid to anything you like and then apply forever!
φNew = ( φLeft + φRight + φUp + φDown ) / 4
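As a concrete illustration (a minimal sketch, not part of the original foils; the array names, iteration count and boundary values are illustrative choices), the sweep can be coded in Fortran with two arrays so that every new value is computed from old values, which is what makes it Jacobi rather than Gauss-Seidel:

  program jacobi_sketch
    implicit none
    ! phi holds boundary and interior values on the 16 by 16 grid;
    ! phinew receives the updated interior so old values are not overwritten
    real :: phi(16,16), phinew(16,16)
    integer :: i, j, iter
    phi = 0.0                    ! initialize the interior to anything you like
    phi(:,16) = 1.0              ! illustrative fixed potential on one side
    do iter = 1, 1000            ! "apply forever" approximated by a fixed count
       do j = 2, 15
          do i = 2, 15
             phinew(i,j) = 0.25 * ( phi(i-1,j) + phi(i+1,j) &
                                  + phi(i,j-1) + phi(i,j+1) )
          end do
       end do
       phi(2:15,2:15) = phinew(2:15,2:15)   ! boundary rows/columns stay fixed
    end do
  end program jacobi_sketch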

Foil 5 Update on the Grid
[Figure: the 14 by 14 internal grid of points being updated]

Foil 6 Parallelism is Straightforward
If one has 16 processors, then decompose the geometrical area into 16 equal parts
Each processor updates 9, 12 or 16 grid points independently (the 14 by 14 interior splits into blocks with 3 or 4 points on a side)

Foil 7 Communication is Needed
Updating edge points in any processor requires communication of values from a neighboring processor
For instance, the processor holding green points requires red points

Foil 8 Communication Must be Reduced
4 by 4 regions in each processor
  • 16 Green (Compute) and 16 Red (Communicate) Points
8 by 8 regions in each processor
  • 64 Green and "just" 32 Red Points
Communication is an edge effect
Give each processor plenty of memory and increase region in each machine
Large Problems Parallelize Best
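To make the edge effect quantitative (a small worked example consistent with the counts above): if each processor holds an m by m block, so the grain size is n = m², it computes m² points but only the 4m points along its edges involve communication. The communication to computation ratio therefore falls like 4m/m² = 4/m = 4/√n: it is 1 for the 4 by 4 case (16 green, 16 red) and 1/2 for the 8 by 8 case (64 green, 32 red).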

Foil 9 Various Styles/Terms in Message Passing I
Asynchronous Communication: The style in which a node is free to send messages to any other node at any time, regardless of hardware connectivity and "node readiness"
Blocking and Non-Blocking Communication:
  • In the blocking style, receive operations suspend the calling program until a suitable message has arrived.
  • In the non-blocking style, a read immediately returns with (typically) a status code indicating whether or not a suitable message has arrived. Message send operations usually return as soon as the data is safely "in transit" whether or not it has been received
Collective Communication: High level communication calls that carry out send/receive operations on groups of nodes e.g. broadcast (all) or multicast (partial set). These routines free user from determining optimal method to carry out such communication and can lead to significant performance gain and ease of programming.
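As an illustration of a collective call (a sketch, not from the original foils; it assumes the standard MPI Fortran bindings and an illustrative variable name), every process makes the same broadcast call and the library chooses how to route the data:

  program bcast_sketch
    implicit none
    include 'mpif.h'                         ! standard MPI Fortran definitions
    integer :: ierr, rank
    real    :: tolerance
    call MPI_INIT(ierr)
    call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)
    if (rank == 0) tolerance = 1.0e-4        ! root process sets the value
    ! collective communication: root 0 sends, all other processes receive
    call MPI_BCAST(tolerance, 1, MPI_REAL, 0, MPI_COMM_WORLD, ierr)
    call MPI_FINALIZE(ierr)
  end program bcast_sketch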

Foil 10 Various Styles/Terms in Message Passing II
Interrupt Driven Communication: A system in which messages arriving at a node cause an interruption in the flow of the program executing there. The system (or user program) must handle the incoming message before returning to the place where the interrupt occurred.
  • Active Messages and Java RMI - Remote Method Invocation (cf. UNIX RPC) - are implemented this way
Loosely Synchronous Communication: The programming model in which an application is divided into compute phases followed by communication phases.
  • The BSP (Bulk Synchronous Parallel) system uses this.
The communication phase is often implemented with collective communication and serves to synchronize processors

Foil 11 Approaches to Parallel Programming
Data Parallel, typified by CMFortran and its generalization High Performance Fortran (HPF), which in previous years we discussed in detail but this year we will only discuss at a high level
  • Typical Data Parallel Fortran Statements are full array statements
  • B=A1 + A2
  • B=EOSHIFT(A,-1)
  • Function operations on arrays representing full data domain
Message Passing, typified by the later discussion of the Laplace example, specifies explicit machine actions, i.e. sending a message between nodes, whereas the data parallel model is at a higher level as it (tries to) specify a problem feature
Note: We are always using "data parallelism" at the problem level whether the software is "message passing" or "data parallel"
Data parallel software is translated by a compiler into "machine language", which is typically message passing on a distributed memory machine and threads on a shared memory machine
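For example, the whole Jacobi sweep of Foil 4 collapses to one array statement in Fortran 90 array syntax (a sketch; the 16 by 16 shape follows the earlier foils, and the HPF DISTRIBUTE directive is shown only as an illustration of how the layout would be specified; a plain Fortran compiler treats it as a comment). Because the right hand side is evaluated in full before the assignment, this is a true Jacobi step:

  program array_syntax_sketch
    implicit none
    real :: phi(16,16)
!HPF$ DISTRIBUTE phi(BLOCK, BLOCK)   ! HPF data layout directive (illustrative)
    phi = 0.0
    phi(:,16) = 1.0                  ! illustrative boundary values
    ! one statement updates every interior point; the compiler/runtime generates
    ! whatever message passing or threading the chosen distribution requires
    phi(2:15,2:15) = 0.25 * ( phi(1:14,2:15) + phi(3:16,2:15) &
                            + phi(2:15,1:14) + phi(2:15,3:16) )
  end program array_syntax_sketch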

Foil 12 Sequential Programming with Guard Rings
The concept of guard rings/points/"halos" is well known in the sequential case; as a trivial example, take one dimension (shown above) with 16 points.
The end points are fixed boundary values
One could save space by dimensioning PHI(14) and supplying the boundary values with statements, inside the loop over I = 1,14, like
  • IF( I.EQ.1) Valueonleft = BOUNDARYLEFT
  • ELSE Valueonleft = PHI(I-1) etc.
But this is slower and clumsy to program because of the conditionals INSIDE the loop; instead one dimensions PHI(16), storing the boundary values in PHI(1) and PHI(16)
  • Updates are then performed as DO I = 2,15
  • and, without any test, Valueonleft = PHI(I-1)
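A minimal sketch of this guard ring version in one dimension (not from the original foils; the 1D two-neighbor update, iteration count and boundary values are illustrative):

  program guard_ring_1d
    implicit none
    ! phi(1) and phi(16) hold the fixed boundary values; only phi(2:15) is
    ! updated, so no IF test is needed inside the loop
    real :: phi(16), phinew(16)
    integer :: i, iter
    phi = 0.0
    phi(1)  = 1.0                    ! illustrative left boundary value
    phi(16) = 0.0                    ! illustrative right boundary value
    do iter = 1, 100
       do i = 2, 15
          phinew(i) = 0.5 * ( phi(i-1) + phi(i+1) )   ! 1D Jacobi average
       end do
       phi(2:15) = phinew(2:15)
    end do
  end program guard_ring_1d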

Foil 13 Sequential Guard Rings in Two Dimensions
In the analogous 2D sequential case, one could dimension the array PHI as PHI(14,14) to hold the updated points only. However, points on the edge would then need special treatment so that boundary values are used in their update
Rather dimension PHI(16,16) to include internal and boundary points
  • Run loops over x(I) and y(J) from 2 to 15 to only cover internal points
  • Preload boundary values in PHI(1, . ), PHI(16, . ), PHI( . ,1), PHI( . ,16)
This is easier and faster as there are no conditionals (IF statements) in the inner loops

Foil 14 Parallel Guard Rings in One Dimension
Now we decompose our 16 points (trivial example) into four groups and put 4 points in each processor
Instead of dimensioning PHI(4) in each processor, one dimensions PHI(6) and runs loops from 2 to 5, with either boundary values or communication setting the values of the end points, as in the sketch below
[Figure: the sequential array PHI(16) and its parallel decomposition; PHI(6) shown for Processor 1, four owned points plus two guard points]
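A message passing sketch of this setup (assuming the standard MPI Fortran bindings, one process per group of four points, and an illustrative 1D Jacobi update; variable names are not from the foils): each process dimensions phi(6), fills the guard cells phi(1) and phi(6) by exchanging its end points with its neighbours, and then updates phi(2:5). MPI_PROC_NULL turns the exchanges at the ends of the processor line into no-ops, so the fixed boundary values stored there are left alone:

  program guard_exchange_1d
    implicit none
    include 'mpif.h'
    integer :: ierr, rank, nproc, left, right, i, iter
    integer :: status(MPI_STATUS_SIZE)
    real    :: phi(6), phinew(6)

    call MPI_INIT(ierr)
    call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)
    call MPI_COMM_SIZE(MPI_COMM_WORLD, nproc, ierr)

    left  = rank - 1; if (rank == 0)         left  = MPI_PROC_NULL
    right = rank + 1; if (rank == nproc - 1) right = MPI_PROC_NULL

    phi = 0.0
    if (rank == 0)         phi(1) = 1.0   ! global boundary values live in the
    if (rank == nproc - 1) phi(6) = 0.0   ! outermost guard cells (illustrative)

    do iter = 1, 100
       ! send my first owned point to the left, receive my right guard cell
       call MPI_SENDRECV(phi(2), 1, MPI_REAL, left,  0, &
                         phi(6), 1, MPI_REAL, right, 0, &
                         MPI_COMM_WORLD, status, ierr)
       ! send my last owned point to the right, receive my left guard cell
       call MPI_SENDRECV(phi(5), 1, MPI_REAL, right, 1, &
                         phi(1), 1, MPI_REAL, left,  1, &
                         MPI_COMM_WORLD, status, ierr)
       do i = 2, 5
          phinew(i) = 0.5 * ( phi(i-1) + phi(i+1) )
       end do
       phi(2:5) = phinew(2:5)
    end do
    call MPI_FINALIZE(ierr)
  end program guard_exchange_1d

The same SPMD program runs on every node; only the rank, and hence the neighbours and data, differ.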

Foil 15 Summary of Parallel Guard Rings in One Dimension
In the bi-colored points, the upper color is the "owning" processor and the lower color is that of the processor that needs the value for updating a neighboring point
Owned by Green -- needed by Yellow
Owned by Yellow -- needed by Green

Foil 16 Setup of Parallel Jacobi in One Dimension

Foil 17 Performance Analysis Parameters
The analysis will depend on only 3 parameters:
n, the grain size -- the amount of the problem stored on each processor (bounded by local memory)
tfloat, the typical time to do one calculation on one node
tcomm, the typical time to communicate one word between two nodes
The most important omission here is communication latency:
Time to communicate = tlatency + (Number of Words) tcomm
[Figure: two nodes A and B, each a CPU (time tfloat per calculation) with memory holding grain size n, connected by a link with communication time tcomm per word]

Foil 18 Analytical analysis of Load Imbalance
Consider an N by N array of grid points on P processors, where √P is an integer and the processors are arranged in a √P by √P topology
Suppose N is exactly divisible by √P, so that a general processor has grain size n = N²/P grid points
Sequential time T1 = (N-2)² tcalc
Parallel time TP = n tcalc
Speedup S = T1/TP = P (1 - 2/N)² = P (1 - 2/√(nP))²
S tends to P as N gets large at fixed P
This expresses analytically the intuitive idea that load imbalance is due to boundary effects and will go away for large N
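As a check that connects to the next foil: with N = 16 and P = 16 (the 16 by 16 grid on 16 processors), S = 16 (1 - 2/16)² = 16 × 0.875² = 12.25, which is exactly the load-imbalance-limited speedup that appears in the communication overhead example on the following foil.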

Foil 19 Example of Communication Overhead
The largest communication load is communicating 16 words, to be compared with calculating 16 updates, each taking time tcalc
Each communication is one value of φ, probably stored in a 4 byte word, and takes time tcomm
Then on 16 processors, T16 = 16 tcalc + 16 tcomm
Speedup S = T1/T16 = 12.25 / (1 + tcomm/tcalc)
or S = 12.25 / (1 + 0.25 tcomm/tfloat), since each update takes roughly 4 floating point operations, so tcalc = 4 tfloat
or S ≈ 12.25 (1 - 0.25 tcomm/tfloat) when the overhead is small

Foil 20 General Analytical Form of Communication Overhead for Jacobi
Consider an N by N array of grid points on P processors, with grain size n = N²/P
Sequential time T1 = 4 N² tfloat
Parallel time TP = 4 n tfloat + 4 √n tcomm
Speedup S = P (1 - 2/N)² / (1 + tcomm/(√n tfloat))
Both overheads decrease like 1/√n as n increases
This ignores communication latency but is otherwise accurate
Speedup is reduced from P by both overheads: the (1 - 2/N)² factor is the load imbalance and the 1/(1 + tcomm/(√n tfloat)) factor is the communication overhead
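The two terms in TP can be read off from the decomposition (a brief reconstruction of the counting): each processor owns a √n by √n block; each of its n points costs about 4 floating point operations per update, giving 4 n tfloat, and each of its four edges needs √n guard values communicated per iteration, giving 4 √n tcomm. Dividing the second term by the first gives fcomm = tcomm/(√n tfloat).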

Foil 21 General Speed Up and Efficiency Analysis I
Efficiency ε = Speedup S / P (number of processors)
Overhead fcomm = (P TP - T1) / T1 = 1/ε - 1, so ε = 1/(1 + fcomm) and S = P/(1 + fcomm)
As fcomm is linear in TP, overhead effects tend to be additive
In the 2D Jacobi example, fcomm = tcomm/(√n tfloat)
The efficiency takes the approximate form ε ≈ 1 - tcomm/(√n tfloat), valid when the overhead is small
As expected, the efficiency is < 1, corresponding to the speedup being < P

Foil 22 General Speed Up and Efficiency Analysis II
In many problems there is an elegant formula fcomm = constant · tcomm/(n^(1/d) tfloat)
d is the system information dimension, which is equal to the geometric dimension in problems like Jacobi where communication is a surface effect and calculation a volume effect
  • We will soon see a case where d is NOT the geometric dimension
d = 1 for Hadrian's Wall and d = 2 for Hadrian's Palace floor, while for Jacobi in 1, 2 or 3 dimensions, d = 1, 2 or 3
Note that the formula depends only on local node and communication parameters; this implies that parallel computing does scale to large P if you build fast enough networks (small tcomm/tfloat) and have a large enough problem (big n)

Foil 23 Communication to Calculation Ratio as a function of template I
For Jacobi, we have
Calculation: 4 n tfloat
Communication: 4 √n tcomm
Communication Overhead fcomm = tcomm/(√n tfloat)
"Smallest" communication but NOT smallest overhead
[Figure: the update stencil and the communicated/updated processor boundary points]

Foil 24 Communication to Calculation Ratio as a function of template II
For Jacobi with fourth order differencing, we have
Calculation: 9 n tfloat
Communication: 8 √n tcomm
Communication Overhead fcomm = 0.89 tcomm/(√n tfloat)
A little bit smaller, as communication and computation have both roughly doubled
[Figure: the fourth order update stencil and the communicated/updated processor boundary points]

Foil 25 Communication to Calculation Ratio as a function of template III
For Jacobi with diagonal neighbors, we have
Calculation: 9 n tfloat
Communication: 4 (√n + 1) tcomm
Communication Overhead fcomm ≈ 0.5 tcomm/(√n tfloat)
Quite a bit smaller
[Figure: the update stencil including diagonal neighbors and the communicated/updated processor boundary points]

Foil 26 Communication to Calculation IV
Now systematically increase the size of the stencil. You get this in particle dynamics problems as you increase the range of the force
Calculation per point increases faster than communication, so fcomm decreases systematically
Must re-use the communicated values for this to work!
[Figure: update stencils of increasing range]

Foil 27 Communication to Calculation V
Now make the range cover the full domain, as in long range force problems
fcomm ∝ tcomm/(n tfloat)
This is a case with geometric dimension 1, 2 or 3 (depending on the space the particles are in) but information dimension always 1
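The counting behind this (a brief reconstruction, assuming the standard all-pairs picture): each of the n locally stored particles must interact with all N particles, so calculation grows like n N tfloat, while communication only has to bring each of the roughly N remote values through the node once, growing like N tcomm. The ratio is then fcomm ∝ (N tcomm)/(n N tfloat) = tcomm/(n tfloat), independent of the geometric dimension, which is why the information dimension d is 1.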

Foil 28 Parallel Guard Rings in Two Dimensions I
This is just like the one dimensional case
First we decompose the problem as we have seen
Four processors are shown

Foil 29 Parallel Guard Rings in Two Dimensions II
Now look at the processor in the top left
It needs real boundary values for its updates, shown as black and green
Then it needs points from neighboring processors, shown hatched with green and the other processor's color

Foil 30 Parallel Guard Rings in Two Dimensions III
Now we see the effect of all the guard rings: the four points at the center are needed by 3 processors and the other shaded points by 2
One dimensions the overlapping grids as PHI(10,10) here and arranges the communication order properly, as in the sketch below
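A two dimensional exchange sketch consistent with this (not from the original foils; it assumes four MPI processes in a 2 by 2 grid, the standard MPI Fortran bindings, and illustrative variable names): each process stores its 8 by 8 owned block in phi(2:9,2:9) of a 10 by 10 array. Exchanging west/east edges first and then south/north edges over the full width, guards included, is one way of "arranging the communication order properly": the corner guard cells then arrive without any extra diagonal messages.

  program guard_exchange_2d
    implicit none
    include 'mpif.h'
    integer :: ierr, rank, nproc, px, py, west, east, south, north
    integer :: status(MPI_STATUS_SIZE)
    real    :: phi(10,10), sbuf(8), rbuf(8)

    call MPI_INIT(ierr)
    call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)
    call MPI_COMM_SIZE(MPI_COMM_WORLD, nproc, ierr)

    px = mod(rank, 2)                              ! position in the 2 by 2 grid
    py = rank / 2
    west  = merge(rank - 1, MPI_PROC_NULL, px > 0)
    east  = merge(rank + 1, MPI_PROC_NULL, px < 1)
    south = merge(rank - 2, MPI_PROC_NULL, py > 0)
    north = merge(rank + 2, MPI_PROC_NULL, py < 1)

    phi = 0.0                                      ! illustrative values

    ! phase 1: west/east exchange of the owned edge lines i = 2 and i = 9
    sbuf = phi(2, 2:9)
    call MPI_SENDRECV(sbuf, 8, MPI_REAL, west, 0, rbuf, 8, MPI_REAL, east, 0, &
                      MPI_COMM_WORLD, status, ierr)
    if (east /= MPI_PROC_NULL) phi(10, 2:9) = rbuf
    sbuf = phi(9, 2:9)
    call MPI_SENDRECV(sbuf, 8, MPI_REAL, east, 1, rbuf, 8, MPI_REAL, west, 1, &
                      MPI_COMM_WORLD, status, ierr)
    if (west /= MPI_PROC_NULL) phi(1, 2:9) = rbuf

    ! phase 2: south/north exchange over the FULL width i = 1:10, so the guard
    ! values just received are passed on and the corner guard cells get filled
    call MPI_SENDRECV(phi(1,2), 10, MPI_REAL, south, 2, &
                      phi(1,10), 10, MPI_REAL, north, 2, MPI_COMM_WORLD, status, ierr)
    call MPI_SENDRECV(phi(1,9), 10, MPI_REAL, north, 3, &
                      phi(1,1), 10, MPI_REAL, south, 3, MPI_COMM_WORLD, status, ierr)

    ! ... the Jacobi update of phi(2:9,2:9) would follow, as in the 1D sketch ...
    call MPI_FINALIZE(ierr)
  end program guard_exchange_2d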

© Northeast Parallel Architectures Center, Syracuse University, npac@npac.syr.edu
