Full HTML for Basic Foilset: Master Set for Parallel Programming for Laplace's Equation

Given by Geoffrey C. Fox at CPS615, Spring Semester 2000, in February 2000. Foils prepared February 13, 2000.
Summary of Material


This takes Jacobi Iteration for Laplace's Equation in a 2D square and uses this to illustrate:
Programming in both Data Parallel (HPF) and Message Passing (MPI and a simplified Syntax)
SPMD -- Single Program Multiple Data -- Programming Model
Stencil dependence of Parallel Program and use of Guard Rings
Collective Communication
Basic Speed Up, Efficiency and Performance Analysis with edge over area dependence and consideration of load imbalance and communication overhead effect

Table of Contents for full HTML of Master Set for Parallel Programming for Laplace's Equation


1 Parallel Programming for Laplace's Equation
2 Abstract of Parallel Programming for Laplace's Equation
3 Potential in a Vacuum Filled Rectangular Box
4 Basic Sequential Algorithm
5 Update on the Grid
6 Parallelism is Straightforward
7 Communication is Needed
8 Communication Must be Reduced
9 Various Styles/Terms in Message Passing I
10 Various Styles/Terms in Message Passing II
11 Approaches to Parallel Programming
12 Sequential Programming with Guard Rings
13 Sequential Guard Rings in Two Dimensions
14 Parallel Guard Rings in One Dimension
15 Summary of Parallel Guard Rings in One Dimension
16 Setup of Parallel Jacobi in One Dimension
17 Performance Analysis Parameters
18 Analytical analysis of Load Imbalance
19 Example of Communication Overhead
20 General Analytical Form of Communication Overhead for Jacobi
21 General Speed Up and Efficiency Analysis I
22 General Speed Up and Efficiency Analysis II
23 Communication to Calculation Ratio as a function of template I
24 Communication to Calculation Ratio as a function of template II
25 Communication to Calculation Ratio as a function of template III
26 Communication to Calculation IV
27 Communication to Calculation V
28 Parallel Guard Rings in Two Dimensions I
29 Parallel Guard Rings in Two Dimensions II
30 Parallel Guard Rings in Two Dimensions III




Foil 1 Parallel Programming for Laplace's Equation
Spring Semester 2000
Geoffrey Fox
Northeast Parallel Architectures Center
Syracuse University
111 College Place
Syracuse NY
gcf@npac.syr.edu
gcf@cs.fsu.edu

Foil 2 Abstract of Parallel Programming for Laplace's Equation
This takes Jacobi Iteration for Laplace's Equation in a 2D square and uses this to illustrate:
Programming in both Data Parallel (HPF) and Message Passing (MPI and a simplified Syntax)
SPMD -- Single Program Multiple Data -- Programming Model
Stencil dependence of Parallel Program and use of Guard Rings
Collective Communication
Basic Speed Up, Efficiency and Performance Analysis with edge over area dependence and consideration of load imbalance and communication overhead effect

Foil 3 Potential in a Vacuum Filled Rectangular Box
So imagine the world's simplest problem
Find the electrostatic potential inside a box whose sides are at a given potential
Set up a 16 by 16 grid on which the potential is defined and which must satisfy Laplace's Equation

Foil 4 Basic Sequential Algorithm
Initialize the internal 14 by 14 grid to anything you like and then apply forever!
φNew = ( φLeft + φRight + φUp + φDown ) / 4
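As a concrete illustration (a minimal sketch, not part of the original foils; the array names, iteration count and boundary values are illustrative choices), the sweep can be coded in Fortran with two arrays so that every new value is computed from old values, which is what makes it Jacobi rather than Gauss-Seidel:

  program jacobi_sketch
    implicit none
    ! phi holds boundary and interior values on the 16 by 16 grid;
    ! phinew receives the updated interior so old values are not overwritten
    real :: phi(16,16), phinew(16,16)
    integer :: i, j, iter
    phi = 0.0                    ! initialize the interior to anything you like
    phi(:,16) = 1.0              ! illustrative fixed potential on one side
    do iter = 1, 1000            ! "apply forever" approximated by a fixed count
       do j = 2, 15
          do i = 2, 15
             phinew(i,j) = 0.25 * ( phi(i-1,j) + phi(i+1,j) &
                                  + phi(i,j-1) + phi(i,j+1) )
          end do
       end do
       phi(2:15,2:15) = phinew(2:15,2:15)   ! boundary rows/columns stay fixed
    end do
  end program jacobi_sketch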

Foil 5 Update on the Grid
[Figure: the 14 by 14 internal grid of points being updated]

Foil 6 Parallelism is Straightforward
If one has 16 processors, then decompose the geometrical area into 16 equal parts
Each processor updates 9, 12 or 16 grid points independently (the 14 by 14 interior splits into blocks with 3 or 4 points on a side)

Foil 7 Communication is Needed
Updating edge points in any processor requires communication of values from a neighboring processor
For instance, the processor holding green points requires red points

Foil 8 Communication Must be Reduced
4 by 4 regions in each processor
  • 16 Green (Compute) and 16 Red (Communicate) Points
8 by 8 regions in each processor
  • 64 Green and "just" 32 Red Points
Communication is an edge effect
Give each processor plenty of memory and increase region in each machine
Large Problems Parallelize Best
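To make the edge effect quantitative (a small worked example consistent with the counts above): if each processor holds an m by m block, so the grain size is n = m², it computes m² points but only the 4m points along its edges involve communication. The communication to computation ratio therefore falls like 4m/m² = 4/m = 4/√n: it is 1 for the 4 by 4 case (16 green, 16 red) and 1/2 for the 8 by 8 case (64 green, 32 red).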

Foil 9 Various Styles/Terms in Message Passing I
Asynchronous Communication: The style in which a node is free to send messages to any other node at any time, regardless of hardware connectivity and "node readiness"
Blocking and Non-Blocking Communication:
  • In the blocking style, receive operations suspend the calling program until a suitable message has arrived.
  • In the non-blocking style, a read immediately returns with (typically) a status code indicating whether or not a suitable message has arrived. Message send operations usually return as soon as the data is safely "in transit" whether or not it has been received
Collective Communication: High level communication calls that carry out send/receive operations on groups of nodes e.g. broadcast (all) or multicast (partial set). These routines free user from determining optimal method to carry out such communication and can lead to significant performance gain and ease of programming.
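As an illustration of a collective call (a sketch, not from the original foils; it assumes the standard MPI Fortran bindings and an illustrative variable name), every process makes the same broadcast call and the library chooses how to route the data:

  program bcast_sketch
    implicit none
    include 'mpif.h'                         ! standard MPI Fortran definitions
    integer :: ierr, rank
    real    :: tolerance
    call MPI_INIT(ierr)
    call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)
    if (rank == 0) tolerance = 1.0e-4        ! root process sets the value
    ! collective communication: root 0 sends, all other processes receive
    call MPI_BCAST(tolerance, 1, MPI_REAL, 0, MPI_COMM_WORLD, ierr)
    call MPI_FINALIZE(ierr)
  end program bcast_sketch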

Foil 10 Various Styles/Terms in Message Passing II
Interrupt Driven Communication: A system in which messages arriving at a node cause an interruption in the flow of the program executing there. The system (or user program) must handle the incoming message before returning to the place where the interrupt occurred.
  • Active Messages and Java RMI - Remote Method Invocation (cf. UNIX RPC) - are implemented this way
Loosely Synchronous Communication: The programming model in which an application is divided into compute phases followed by communication phases.
  • The BSP (Bulk Synchronous Parallel) system uses this.
The communication phase is often implemented with collective communication and serves to synchronize processors

Foil 11 Approaches to Parallel Programming
Data Parallel, typified by CMFortran and its generalization High Performance Fortran (HPF), which in previous years we discussed in detail but this year we will only discuss at a high level
  • Typical Data Parallel Fortran Statements are full array statements
  • B=A1 + A2
  • B=EOSHIFT(A,-1)
  • Function operations on arrays representing full data domain
Message Passing, typified by the later discussion of the Laplace example, specifies explicit machine actions, i.e. sending a message between nodes, whereas the data parallel model is at a higher level as it (tries to) specify a problem feature
Note: We are always using "data parallelism" at the problem level whether the software is "message passing" or "data parallel"
Data parallel software is translated by a compiler into "machine language", which is typically message passing on a distributed memory machine and threads on a shared memory machine
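For example, the whole Jacobi sweep of Foil 4 collapses to one array statement in Fortran 90 array syntax (a sketch; the 16 by 16 shape follows the earlier foils, and the HPF DISTRIBUTE directive is shown only as an illustration of how the layout would be specified; a plain Fortran compiler treats it as a comment). Because the right hand side is evaluated in full before the assignment, this is a true Jacobi step:

  program array_syntax_sketch
    implicit none
    real :: phi(16,16)
!HPF$ DISTRIBUTE phi(BLOCK, BLOCK)   ! HPF data layout directive (illustrative)
    phi = 0.0
    phi(:,16) = 1.0                  ! illustrative boundary values
    ! one statement updates every interior point; the compiler/runtime generates
    ! whatever message passing or threading the chosen distribution requires
    phi(2:15,2:15) = 0.25 * ( phi(1:14,2:15) + phi(3:16,2:15) &
                            + phi(2:15,1:14) + phi(2:15,3:16) )
  end program array_syntax_sketch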

Foil 12 Sequential Programming with Guard Rings
The concept of guard rings/points/"halos" is well known in the sequential case; as a trivial example, take one dimension (shown above) with 16 points.
The end points are fixed boundary values
One could save space by dimensioning PHI(14) and supplying the boundary values with statements, inside the loop over I = 1,14, like
  • IF( I.EQ.1) Valueonleft = BOUNDARYLEFT
  • ELSE Valueonleft = PHI(I-1) etc.
But this is slower and clumsy to program because of the conditionals INSIDE the loop; instead one dimensions PHI(16), storing the boundary values in PHI(1) and PHI(16)
  • Updates are then performed as DO I = 2,15
  • and, without any test, Valueonleft = PHI(I-1)
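A minimal sketch of this guard ring version in one dimension (not from the original foils; the 1D two-neighbor update, iteration count and boundary values are illustrative):

  program guard_ring_1d
    implicit none
    ! phi(1) and phi(16) hold the fixed boundary values; only phi(2:15) is
    ! updated, so no IF test is needed inside the loop
    real :: phi(16), phinew(16)
    integer :: i, iter
    phi = 0.0
    phi(1)  = 1.0                    ! illustrative left boundary value
    phi(16) = 0.0                    ! illustrative right boundary value
    do iter = 1, 100
       do i = 2, 15
          phinew(i) = 0.5 * ( phi(i-1) + phi(i+1) )   ! 1D Jacobi average
       end do
       phi(2:15) = phinew(2:15)
    end do
  end program guard_ring_1d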

Foil 13 Sequential Guard Rings in Two Dimensions
In the analogous 2D sequential case, one could dimension the array PHI as PHI(14,14) to hold the updated points only. However, points on the edge would then need special treatment so that boundary values are used in their update
Rather dimension PHI(16,16) to include internal and boundary points
  • Run loops over x(I) and y(J) from 2 to 15 to only cover internal points
  • Preload boundary values in PHI(1, . ), PHI(16, . ), PHI( . ,1), PHI( . ,16)
This is easier and faster as there are no conditionals (IF statements) in the inner loops

Foil 14 Parallel Guard Rings in One Dimension
Now we decompose our 16 points (trivial example) into four groups and put 4 points in each processor
Instead of dimensioning PHI(4) in each processor, one dimensions PHI(6) and runs loops from 2 to 5, with either boundary values or communication setting the values of the end points, as in the sketch below
[Figure: the sequential array PHI(16) and its parallel decomposition; PHI(6) shown for Processor 1, four owned points plus two guard points]
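A message passing sketch of this setup (assuming the standard MPI Fortran bindings, one process per group of four points, and an illustrative 1D Jacobi update; variable names are not from the foils): each process dimensions phi(6), fills the guard cells phi(1) and phi(6) by exchanging its end points with its neighbours, and then updates phi(2:5). MPI_PROC_NULL turns the exchanges at the ends of the processor line into no-ops, so the fixed boundary values stored there are left alone:

  program guard_exchange_1d
    implicit none
    include 'mpif.h'
    integer :: ierr, rank, nproc, left, right, i, iter
    integer :: status(MPI_STATUS_SIZE)
    real    :: phi(6), phinew(6)

    call MPI_INIT(ierr)
    call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)
    call MPI_COMM_SIZE(MPI_COMM_WORLD, nproc, ierr)

    left  = rank - 1; if (rank == 0)         left  = MPI_PROC_NULL
    right = rank + 1; if (rank == nproc - 1) right = MPI_PROC_NULL

    phi = 0.0
    if (rank == 0)         phi(1) = 1.0   ! global boundary values live in the
    if (rank == nproc - 1) phi(6) = 0.0   ! outermost guard cells (illustrative)

    do iter = 1, 100
       ! send my first owned point to the left, receive my right guard cell
       call MPI_SENDRECV(phi(2), 1, MPI_REAL, left,  0, &
                         phi(6), 1, MPI_REAL, right, 0, &
                         MPI_COMM_WORLD, status, ierr)
       ! send my last owned point to the right, receive my left guard cell
       call MPI_SENDRECV(phi(5), 1, MPI_REAL, right, 1, &
                         phi(1), 1, MPI_REAL, left,  1, &
                         MPI_COMM_WORLD, status, ierr)
       do i = 2, 5
          phinew(i) = 0.5 * ( phi(i-1) + phi(i+1) )
       end do
       phi(2:5) = phinew(2:5)
    end do
    call MPI_FINALIZE(ierr)
  end program guard_exchange_1d

The same SPMD program runs on every node; only the rank, and hence the neighbours and data, differ.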

Foil 15 Summary of Parallel Guard Rings in One Dimension
In the bi-colored points, the upper color is the "owning" processor and the lower color is that of the processor that needs the value for updating a neighboring point
Owned by Green -- needed by Yellow
Owned by Yellow -- needed by Green

Foil 16 Setup of Parallel Jacobi in One Dimension

Foil 17 Performance Analysis Parameters
The analysis will depend on only 3 parameters:
n, the grain size -- the amount of the problem stored on each processor (bounded by local memory)
tfloat, the typical time to do one calculation on one node
tcomm, the typical time to communicate one word between two nodes
The most important omission here is communication latency:
Time to communicate = tlatency + (Number of Words) tcomm
[Figure: two nodes A and B, each a CPU (time tfloat per calculation) with memory holding grain size n, connected by a link with communication time tcomm per word]

Foil 18 Analytical analysis of Load Imbalance
Consider an N by N array of grid points on P processors, where √P is an integer and the processors are arranged in a √P by √P topology
Suppose N is exactly divisible by √P, so that a general processor has grain size n = N²/P grid points
Sequential time T1 = (N-2)² tcalc
Parallel time TP = n tcalc
Speedup S = T1/TP = P (1 - 2/N)² = P (1 - 2/√(nP))²
S tends to P as N gets large at fixed P
This expresses analytically the intuitive idea that load imbalance is due to boundary effects and will go away for large N
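As a check that connects to the next foil: with N = 16 and P = 16 (the 16 by 16 grid on 16 processors), S = 16 (1 - 2/16)² = 16 × 0.875² = 12.25, which is exactly the load-imbalance-limited speedup that appears in the communication overhead example on the following foil.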

Foil 19 Example of Communication Overhead
The largest communication load is communicating 16 words, to be compared with calculating 16 updates, each taking time tcalc
Each communication is one value of φ, probably stored in a 4 byte word, and takes time tcomm
Then on 16 processors, T16 = 16 tcalc + 16 tcomm
Speedup S = T1/T16 = 12.25 / (1 + tcomm/tcalc)
or S = 12.25 / (1 + 0.25 tcomm/tfloat), since each update takes roughly 4 floating point operations, so tcalc = 4 tfloat
or S ≈ 12.25 (1 - 0.25 tcomm/tfloat) when the overhead is small

Foil 20 General Analytical Form of Communication Overhead for Jacobi
Consider an N by N array of grid points on P processors, with grain size n = N²/P
Sequential time T1 = 4 N² tfloat
Parallel time TP = 4 n tfloat + 4 √n tcomm
Speedup S = P (1 - 2/N)² / (1 + tcomm/(√n tfloat))
Both overheads decrease like 1/√n as n increases
This ignores communication latency but is otherwise accurate
Speedup is reduced from P by both overheads: the (1 - 2/N)² factor is the load imbalance and the 1/(1 + tcomm/(√n tfloat)) factor is the communication overhead
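The two terms in TP can be read off from the decomposition (a brief reconstruction of the counting): each processor owns a √n by √n block; each of its n points costs about 4 floating point operations per update, giving 4 n tfloat, and each of its four edges needs √n guard values communicated per iteration, giving 4 √n tcomm. Dividing the second term by the first gives fcomm = tcomm/(√n tfloat).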

Foil 21 General Speed Up and Efficiency Analysis I
Efficiency ε = Speedup S / P (number of processors)
Overhead fcomm = (P TP - T1) / T1 = 1/ε - 1, so ε = 1/(1 + fcomm) and S = P/(1 + fcomm)
As fcomm is linear in TP, overhead effects tend to be additive
In the 2D Jacobi example, fcomm = tcomm/(√n tfloat)
The efficiency takes the approximate form ε ≈ 1 - tcomm/(√n tfloat), valid when the overhead is small
As expected, the efficiency is < 1, corresponding to the speedup being < P

Foil 22 General Speed Up and Efficiency Analysis II
In many problems there is an elegant formula fcomm = constant · tcomm/(n^(1/d) tfloat)
d is the system information dimension, which is equal to the geometric dimension in problems like Jacobi where communication is a surface effect and calculation a volume effect
  • We will soon see a case where d is NOT the geometric dimension
d = 1 for Hadrian's Wall and d = 2 for Hadrian's Palace floor, while for Jacobi in 1, 2 or 3 dimensions, d = 1, 2 or 3
Note that the formula depends only on local node and communication parameters; this implies that parallel computing does scale to large P if you build fast enough networks (small tcomm/tfloat) and have a large enough problem (big n)

Foil 23 Communication to Calculation Ratio as a function of template I
For Jacobi, we have
Calculation: 4 n tfloat
Communication: 4 √n tcomm
Communication Overhead fcomm = tcomm/(√n tfloat)
"Smallest" communication but NOT smallest overhead
[Figure: the update stencil and the communicated/updated processor boundary points]

Foil 24 Communication to Calculation Ratio as a function of template II
For Jacobi with fourth order differencing, we have
Calculation: 9 n tfloat
Communication: 8 √n tcomm
Communication Overhead fcomm = 0.89 tcomm/(√n tfloat)
A little bit smaller, as communication and computation have both roughly doubled
[Figure: the fourth order update stencil and the communicated/updated processor boundary points]

Foil 25 Communication to Calculation Ratio as a function of template III
For Jacobi with diagonal neighbors, we have
Calculation: 9 n tfloat
Communication: 4 (√n + 1) tcomm
Communication Overhead fcomm ≈ 0.5 tcomm/(√n tfloat)
Quite a bit smaller
[Figure: the update stencil including diagonal neighbors and the communicated/updated processor boundary points]

Foil 26 Communication to Calculation IV
Now systematically increase the size of the stencil. You get this in particle dynamics problems as you increase the range of the force
Calculation per point increases faster than communication, so fcomm decreases systematically
Must re-use the communicated values for this to work!
[Figure: update stencils of increasing range]

Foil 27 Communication to Calculation V
Now make the range cover the full domain, as in long range force problems
fcomm ∝ tcomm/(n tfloat)
This is a case with geometric dimension 1, 2 or 3 (depending on the space the particles are in) but information dimension always 1
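The counting behind this (a brief reconstruction, assuming the standard all-pairs picture): each of the n locally stored particles must interact with all N particles, so calculation grows like n N tfloat, while communication only has to bring each of the roughly N remote values through the node once, growing like N tcomm. The ratio is then fcomm ∝ (N tcomm)/(n N tfloat) = tcomm/(n tfloat), independent of the geometric dimension, which is why the information dimension d is 1.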

Foil 28 Parallel Guard Rings in Two Dimensions I
This is just like the one dimensional case
First we decompose the problem as we have seen
Four processors are shown

Foil 29 Parallel Guard Rings in Two Dimensions II
Now look at the processor in the top left
It needs real boundary values for its updates, shown as black and green
Then it needs points from neighboring processors, shown hatched with green and the other processor's color

Foil 30 Parallel Guard Rings in Two Dimensions III
Now we see the effect of all the guard rings: the four points at the center are needed by 3 processors and the other shaded points by 2
One dimensions the overlapping grids as PHI(10,10) here and arranges the communication order properly, as in the sketch below
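A two dimensional exchange sketch consistent with this (not from the original foils; it assumes four MPI processes in a 2 by 2 grid, the standard MPI Fortran bindings, and illustrative variable names): each process stores its 8 by 8 owned block in phi(2:9,2:9) of a 10 by 10 array. Exchanging west/east edges first and then south/north edges over the full width, guards included, is one way of "arranging the communication order properly": the corner guard cells then arrive without any extra diagonal messages.

  program guard_exchange_2d
    implicit none
    include 'mpif.h'
    integer :: ierr, rank, nproc, px, py, west, east, south, north
    integer :: status(MPI_STATUS_SIZE)
    real    :: phi(10,10), sbuf(8), rbuf(8)

    call MPI_INIT(ierr)
    call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)
    call MPI_COMM_SIZE(MPI_COMM_WORLD, nproc, ierr)

    px = mod(rank, 2)                              ! position in the 2 by 2 grid
    py = rank / 2
    west  = merge(rank - 1, MPI_PROC_NULL, px > 0)
    east  = merge(rank + 1, MPI_PROC_NULL, px < 1)
    south = merge(rank - 2, MPI_PROC_NULL, py > 0)
    north = merge(rank + 2, MPI_PROC_NULL, py < 1)

    phi = 0.0                                      ! illustrative values

    ! phase 1: west/east exchange of the owned edge lines i = 2 and i = 9
    sbuf = phi(2, 2:9)
    call MPI_SENDRECV(sbuf, 8, MPI_REAL, west, 0, rbuf, 8, MPI_REAL, east, 0, &
                      MPI_COMM_WORLD, status, ierr)
    if (east /= MPI_PROC_NULL) phi(10, 2:9) = rbuf
    sbuf = phi(9, 2:9)
    call MPI_SENDRECV(sbuf, 8, MPI_REAL, east, 1, rbuf, 8, MPI_REAL, west, 1, &
                      MPI_COMM_WORLD, status, ierr)
    if (west /= MPI_PROC_NULL) phi(1, 2:9) = rbuf

    ! phase 2: south/north exchange over the FULL width i = 1:10, so the guard
    ! values just received are passed on and the corner guard cells get filled
    call MPI_SENDRECV(phi(1,2), 10, MPI_REAL, south, 2, &
                      phi(1,10), 10, MPI_REAL, north, 2, MPI_COMM_WORLD, status, ierr)
    call MPI_SENDRECV(phi(1,9), 10, MPI_REAL, north, 3, &
                      phi(1,1), 10, MPI_REAL, south, 3, MPI_COMM_WORLD, status, ierr)

    ! ... the Jacobi update of phi(2:9,2:9) would follow, as in the 1D sketch ...
    call MPI_FINALIZE(ierr)
  end program guard_exchange_2d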

© Northeast Parallel Architectures Center, Syracuse University, npac@npac.syr.edu
