
Basic foilset CPS615 Laplace Example -- Programming Models and Performance

Given by Geoffrey C. Fox in the CPS615 Basic Simulation Track for Computational Science, Fall Semester 1997. Foils prepared 2 September 1997


This takes Jacobi Iteration for Laplace's Equation in a 2D square and uses this to illustrate:
Programming in both Data Parallel (HPF) and Message Passing (MPI and a simplified Syntax)
SPMD -- Single Program Multiple Data -- Programming Model
Stencil dependence of Parallel Program and use of Guard Rings
Collective Communication
Basic Speed Up, Efficiency and Performance Analysis with edge over area dependence and consideration of load imbalance and communication overhead effects.

Table of Contents for full HTML of CPS615 Laplace Example -- Programming Models and Performance


1 CPS615 -- Base Course for the Simulation Track of Computational Science
Fall Semester 1997 --
The Laplace Example
Programming and Performance

2 Abstract of Laplace Example for CPS615
3 Parallel Computing
Algorithms and
Software --
Laplace Example

4 The Solution of Laplace's Equation
5 Discretized Form of Laplace's Equation on a Parallel Processor
6 Basic Structure of Domain to be Updated in Parallel Version
7 Sequential and Introduction to Parallel Coding for the
Laplace Example

8 SEQUENTIAL LAPLACE PROGRAMMING
JACOBI ITERATION IN ONE DIMENSION
(constant in y direction)

9 SEQUENTIAL LAPLACE PROGRAMMING
JACOBI ITERATION IN TWO DIMENSIONS

10 Approaches to Parallel Programming
11 SPMD or SCMD
Single Program (code) Multiple Data

12 Data Parallel
Programming for
Laplace Example

13 Parallel Laplace Programming
Data Parallel for Jacobi Iteration in One Dimension

14 Notes on HPF Implementation of Laplace Solver
15 Message Passing Model
Used for Parallel
Programming for
Laplace Example

16 Basic Message Passing Approach
17 Various Styles/Terms in Message Passing - I
18 Various Styles/Terms in Message Passing - II
19 Parallel Laplace Programming: Set Up of
Message Passing for Jacobi Iteration in One Dimension

20 Node Program: Message Passing for Laplace Solver
21 Collective Communication Primitives
22 Implementation of MPSHIFT(+1, SOURCE,DEST)
23 Possible Implementation of MPSHIFT in MPI
24 Implementation of SHIFT in MPI
25 Implementation of GLOBALMAX (TEST)
26 General Features
of
Laplace Example

27 What does the Laplace Update calculation look like?
28 The Various Stencil Special Cases
29 Communication Loads
30 What is the relation of sequential and parallel programming models ?
31 More on SPMD Programming Model and Sequential/Parallel Comparison
32 Programming with Guard Rings - Sequential
33 Programming with Guard Rings - Parallel
34 Special Case of Corner Processor
35 Analysis of Parallel Overheads
Efficiency and
Speedup

36 General Formalism for Speed Up
37 What Affects Speed Up ?
38 Load Imbalance and Speed-Up for Laplace Example -- I
39 Load Imbalance and Speed-Up for Laplace Example -- II
40 Analytical Analysis of Load Imbalance
41 Communication Overhead
42 Analytical Form of Speed Up for Communication Overhead
43 General Form of Efficiency
44 Communication to Calculation Ratio as a function of template
45 Performance for Increasing Stencil
46 Matrix Multiplication on the Hypercube
47 Efficiency of QCD Physics Simulation on JPL MarkIIIfp Hypercube
48 General Analysis of Overheads and Efficiency
49 Speed Up as a Function of Grain Size




Foil 1 CPS615 -- Base Course for the Simulation Track of Computational Science
Fall Semester 1997 --
The Laplace Example
Programming and Performance

Geoffrey Fox
NPAC
Room 3-131 CST
111 College Place
Syracuse NY 13244-4100

Foil 2 Abstract of Laplace Example for CPS615

This takes Jacobi Iteration for Laplace's Equation in a 2D square and uses this to illustrate:
Programming in both Data Parallel (HPF) and Message Passing (MPI and a simplified Syntax)
SPMD -- Single Program Multiple Data -- Programming Model
Stencil dependence of Parallel Program and use of Guard Rings
Collective Communication
Basic Speed Up, Efficiency and Performance Analysis with edge over area dependence and consideration of load imbalance and communication overhead effects.

Foil 3 Parallel Computing
Algorithms and
Software --
Laplace Example

Foil 4 The Solution of Laplace's Equation

Solve Laplace's Equation:
      ∂²f/∂x² + ∂²f/∂y² = 0
on a rectangular domain, with specified boundary conditions.
Use the simple iterative Jacobi algorithm:
      f new = ( f left + f right + f up + f down ) / 4
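Where this update rule comes from -- a one line sketch using standard second order central differences on a uniform grid of spacing h (h is notation introduced here only for this sketch):
      ∂²f/∂x² is approximately ( f(x+h,y) - 2 f(x,y) + f(x-h,y) ) / h²
      ∂²f/∂y² is approximately ( f(x,y+h) - 2 f(x,y) + f(x,y-h) ) / h²
Setting the sum of the two to zero and solving for f(x,y) gives exactly the update above:
      f(x,y) = ( f(x+h,y) + f(x-h,y) + f(x,y+h) + f(x,y-h) ) / 4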

Foil 5 Discretized Form of Laplace's Equation on a Parallel Processor

[Figure: 256 grid points decomposed onto a 16-node concurrent processor, 16 grid points in each processor; the legend marks where f is unknown and where f is known, and shows the update stencil.]

Foil 6 Basic Structure of Domain to be Updated in Parallel Version

X denotes f values to be communicated

Foil 7 Sequential and Introduction to Parallel Coding for the
Laplace Example

Foil 8 SEQUENTIAL LAPLACE PROGRAMMING
JACOBI ITERATION IN ONE DIMENSION
(constant in y direction)

BEGIN TEST = 0
      DO 1 I = 2, NTOT1
         TEMP = 0.5 * (PHIOLD(I-1) + PHIOLD(I+1))
         TEST = AMAX1(TEST, ABS(TEMP - PHIOLD(I)))
    1 PHINEW(I) = TEMP
      DO 2 I = 2, NTOT1
    2 PHIOLD(I) = PHINEW(I)
      IF (TEST > CONVG) GO TO BEGIN
      ELSE STOP
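For readers who want to run the loop above, here is a minimal self contained Fortran 90 sketch of the same Jacobi iteration; the array size, boundary values and convergence threshold are illustrative choices, not values from the foils:

      PROGRAM JACOBI1D
      IMPLICIT NONE
      INTEGER, PARAMETER :: NTOT = 16
      INTEGER, PARAMETER :: NTOT1 = NTOT - 1
      REAL, PARAMETER :: CONVG = 1.0E-4     ! illustrative convergence threshold
      REAL :: PHIOLD(NTOT), PHINEW(NTOT), TEMP, TEST
      INTEGER :: I
!     Fixed (illustrative) boundary values; the interior starts at zero
      PHIOLD = 0.0
      PHIOLD(1) = 1.0
      PHIOLD(NTOT) = 0.0
      PHINEW = PHIOLD
      TEST = 2.0 * CONVG
      DO WHILE (TEST > CONVG)
         TEST = 0.0
         DO I = 2, NTOT1
            TEMP = 0.5 * (PHIOLD(I-1) + PHIOLD(I+1))
            TEST = MAX(TEST, ABS(TEMP - PHIOLD(I)))
            PHINEW(I) = TEMP
         END DO
         PHIOLD(2:NTOT1) = PHINEW(2:NTOT1)
      END DO
      PRINT *, 'Converged, PHI = ', PHIOLD
      END PROGRAM JACOBI1D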

Foil 9 SEQUENTIAL LAPLACE PROGRAMMING
JACOBI ITERATION IN TWO DIMENSIONS

BEGIN TEST = 0
      DO 1 I = 2, NTOT1
      DO 1 J = 2, NTOT1
         TEMP = 0.25 * (PHIOLD(I,J+1) + PHIOLD(I,J-1) + PHIOLD(I+1,J) + PHIOLD(I-1,J))
         TEST = AMAX1(TEST, ABS(TEMP - PHIOLD(I,J)))
    1 PHINEW(I,J) = TEMP
      DO 2 I = 2, NTOT1
      DO 2 J = 2, NTOT1
    2 PHIOLD(I,J) = PHINEW(I,J)
      IF (TEST > CONVG) GO TO BEGIN
      ELSE STOP

Foil 10 Approaches to Parallel Programming

Data Parallel programming is typified by CMFortran and its generalization, High Performance Fortran, which we have discussed separately
Typical data parallel Fortran statements are full array statements (a small runnable Fortran 90 fragment illustrating this style follows at the end of this foil)
  • B = A1 + A2
  • B = EOSHIFT(A,-1)
  • Function operations on arrays representing the full data domain
Message Passing is typified by the later discussion of the Laplace example; it specifies specific machine actions, i.e. send a message between nodes, whereas the data parallel model is at a higher level as it (tries to) specify a problem feature
Note: we are always using "data parallelism" at the problem level, whether the software is "message passing" or "data parallel"
Data parallel software is translated by a compiler into "machine language", which is typically message passing
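A small, self contained Fortran 90 fragment illustrating the full array style of statement listed above (the names A, A1, A2, B and the array size are illustrative):

      PROGRAM ARRAYOPS
      IMPLICIT NONE
      REAL :: A(8), A1(8), A2(8), B(8)
      A  = 1.0
      A1 = 2.0
      A2 = 3.0
!     Whole-array operation: no explicit DO loop over elements
      B = A1 + A2
!     End-off shift: B(I) = A(I-1), with B(1) filled by the default boundary value 0.0
      B = EOSHIFT(A, SHIFT=-1)
      PRINT *, B
      END PROGRAM ARRAYOPS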

Foil 11 SPMD or SCMD
Single Program (code) Multiple Data

A particular way of using MIMD machines - the DOMINANT successful use so far.
Each processor runs the same program but in general executes different instructions at different times.
We will later see that this corresponds to "loosely synchronous problems".
This is the style of the current program example -- note that although each node is doing roughly the same thing, i.e. updating grid points, each node is NOT at the same point in the update at each clock cycle
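A minimal SPMD skeleton, assuming MPI: every node runs the identical program, and behaviour differs only through the rank returned by MPI_COMM_RANK (the printed message is purely illustrative):

      PROGRAM SPMD_SKELETON
      IMPLICIT NONE
      INCLUDE 'mpif.h'
      INTEGER :: IERR, RANK, NPROC
      CALL MPI_INIT(IERR)
      CALL MPI_COMM_RANK(MPI_COMM_WORLD, RANK, IERR)
      CALL MPI_COMM_SIZE(MPI_COMM_WORLD, NPROC, IERR)
!     Same code everywhere; each node would work on its own part of the data
      PRINT *, 'Node ', RANK, ' of ', NPROC, ' updating its own grid points'
      CALL MPI_FINALIZE(IERR)
      END PROGRAM SPMD_SKELETON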

Foil 12 Data Parallel
Programming for
Laplace Example

Foil 13 Parallel Laplace Programming
Data Parallel for Jacobi Iteration in One Dimension

!HPF$ TEMPLATE WORLD(NTOT)                                                (1)
!HPF$ DISTRIBUTE WORLD(BLOCK)                                             (2)
      DIMENSION PHINEW(NTOT), PHIOLD(NTOT)
!HPF$ ALIGN PHINEW WITH WORLD                                             (3)
!HPF$ ALIGN PHIOLD WITH WORLD                                             (3)
      NTOT1 = NTOT - 1
      NTOT2 = NTOT - 2
BEGIN PHINEW(2:NTOT1) = 0.5 * (EOSHIFT(PHIOLD,1) + EOSHIFT(PHIOLD,-1))    (4)
      TEST = MAXVAL(ABS(PHINEW(2:NTOT1) - PHIOLD(2:NTOT1)))               (5)
      PHIOLD = PHINEW
      IF (TEST > CONVG) GO TO BEGIN
      ELSE STOP

Foil 14 Notes on HPF Implementation of Laplace Solver

(1) Defines a data world of NTOT entries
(2) Breaks up the world into equal parts in each processor, decomposed in one dimension
(3) Instructs that PHINEW and PHIOLD are aligned exactly with the world, i.e. PHINEW(I) and PHIOLD(I) are stored in the I'th position of the world
(4) EOSHIFT is "End-Off SHIFT". The calls shift PHIOLD by one, both to the left and to the right. The indices take care of WORLD(1) and WORLD(NTOT) being boundary values
  • EOSHIFT (and the corresponding circular shift, CSHIFT) are "standard" Fortran 90 array manipulation routines. They are not specific to parallel computing. We can write statement (4) as a direct array operation:
BEGIN PHINEW(2:NTOT1) = 0.5 * (PHIOLD(1:NTOT2) + PHIOLD(3:NTOT))
(5) Forms TEST as the maximum change in the absolute value of f at any world location. MAXVAL is a Fortran 90 intrinsic function.
NOTE: Any subscript n:m is an "array section", i.e. a set of indices
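Putting notes (4) and (5) together, a self contained Fortran 90 sketch of the data parallel loop using array sections; it runs as ordinary (sequential) Fortran 90 with the HPF directives omitted, and the array size, boundary values and CONVG are again illustrative:

      PROGRAM JACOBI_F90
      IMPLICIT NONE
      INTEGER, PARAMETER :: NTOT = 16
      INTEGER, PARAMETER :: NTOT1 = NTOT - 1
      INTEGER, PARAMETER :: NTOT2 = NTOT - 2
      REAL, PARAMETER :: CONVG = 1.0E-4     ! illustrative convergence threshold
      REAL :: PHIOLD(NTOT), PHINEW(NTOT), TEST
      PHIOLD = 0.0
      PHIOLD(1) = 1.0                       ! illustrative fixed boundary value
      PHINEW = PHIOLD
      DO
!        Array-section form of statement (4): no shifts, no explicit loop
         PHINEW(2:NTOT1) = 0.5 * (PHIOLD(1:NTOT2) + PHIOLD(3:NTOT))
!        Statement (5): maximum change anywhere in the interior
         TEST = MAXVAL(ABS(PHINEW(2:NTOT1) - PHIOLD(2:NTOT1)))
         PHIOLD = PHINEW
         IF (TEST <= CONVG) EXIT
      END DO
      PRINT *, 'Converged, PHI = ', PHIOLD
      END PROGRAM JACOBI_F90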

Foil 15 Message Passing Model
Used for Parallel
Programming for
Laplace Example

Foil 16 Basic Message Passing Approach

Basic operations are SEND and RECEIVE
But there are a lot of subtleties
How does program 2 know what message it is getting? Does it come from program 1 or from another one?
Does program 1 get an acknowledgment (analogous to certified mail) ?
Do we use system or application level implementations ? -- raises issues of protection, copying
Broadcasts, Multicasts -- other collective or multiple message/source/destination communications
Do we have multiple processes or threads per processor ? -- How do we address?
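For concreteness, a minimal blocking send/receive exchange between node 0 and node 1 in MPI Fortran (the tag value 99 and the single REAL payload are illustrative choices; run with at least two MPI processes):

      PROGRAM SENDRECV_PAIR
      IMPLICIT NONE
      INCLUDE 'mpif.h'
      INTEGER :: IERR, RANK, STATUS(MPI_STATUS_SIZE)
      REAL :: VALUE
      CALL MPI_INIT(IERR)
      CALL MPI_COMM_RANK(MPI_COMM_WORLD, RANK, IERR)
      IF (RANK .EQ. 0) THEN
         VALUE = 3.14
         CALL MPI_SEND(VALUE, 1, MPI_REAL, 1, 99, MPI_COMM_WORLD, IERR)
      ELSE IF (RANK .EQ. 1) THEN
         CALL MPI_RECV(VALUE, 1, MPI_REAL, 0, 99, MPI_COMM_WORLD, STATUS, IERR)
         PRINT *, 'Node 1 received ', VALUE
      END IF
      CALL MPI_FINALIZE(IERR)
      END PROGRAM SENDRECV_PAIR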

Foil 17 Various Styles/Terms in Message Passing - I

Asynchronous Communication: The style in which a node is free to send messages to any other node at any time, regardless of hardware connectivity and "node readiness"
Blocking and Non-Blocking Communication: In the blocking style, receive operations suspend the calling program until a suitable message has arrived.
  • In the non-blocking style, a read immediately returns with (typically) a status code indicating whether or not a suitable message has arrived. Message send operations usually return as soon as the data is safely "in transit", whether or not it has been received
Collective Communication: High level communication calls that carry out send/receive operations on groups of nodes e.g. broadcast (all) or multicast (partial set). These routines free user from determining optimal method to carry out such communication and can lead to significant performance gain and ease of programming.

Foil 18 Various Styles/Terms in Message Passing - II

Crystalline Communication: The style in which a node reading or writing is blocked until the matching operation is performed in the target node. Messages can only be sent or received between processors with a direct hardware connection
Domain Decomposition or Data Parallelism: The programming style in which parallelism is achieved by dividing the data on which the algorithm works among the processors.
Interrupt Driven Communication: A system in which messages arriving at a node cause an interruption in the flow of the program executing there. The system (or user program) must handle the incoming message before returning to the place where the interrupt occurred.
  • Active messages and Java RMI - Remote Method Invocation (UNIX RPC) - are implemented this way
Loosely Synchronous Communication: The programming model in which an application is divided into compute phases followed by communication phases.
  • The BSP (Bulk Synchronous Parallel) system uses this. The communication phase is often implemented with collective communication and serves to synchronize the processors

Foil 19 Parallel Laplace Programming: Set Up of
Message Passing for Jacobi Iteration in One Dimension

[Figure: the message passing set up seen from Processor 1, a "typical" processor in the one dimensional decomposition.]

Foil 20 Node Program: Message Passing for Laplace Solver

Initialization:
      NLOC = NTOT/NPROC                     ! assume NTOT divisible by NPROC
      I1 = 2
      NLOC1 = NLOC + 1
      IF (PROCNUM.EQ.NPROC-1 .OR. PROCNUM.EQ.0) NLOC1 = NLOC1 - 1
BASIC LOOP
BEGIN TEST = 0
Shift to the right:
      SOURCE = PHIOLD(NLOC1)                ! address of data to be sent
      IF (PROCNUM.EQ.NPROC-1) SOURCE = "DUMMY"
      DEST = PHIOLD(1)                      ! address where data is to be stored
      IF (PROCNUM.EQ.0) DEST = "DUMMY"
      CALL MPSHIFT(+1, SOURCE, DEST)
Shift to the left:
      SOURCE = PHIOLD(I1)                   ! address of data to be sent
      IF (PROCNUM.EQ.0) SOURCE = "DUMMY"
      DEST = PHIOLD(NLOC1+1)                ! address where data is to be stored
      IF (PROCNUM.EQ.NPROC-1) DEST = "DUMMY"
      CALL MPSHIFT(-1, SOURCE, DEST)
Update:
      DO 1 I = I1, NLOC1
         TEMP = 0.5 * (PHIOLD(I-1) + PHIOLD(I+1))
         TEST = AMAX1(TEST, ABS(TEMP - PHIOLD(I)))
    1 PHINEW(I) = TEMP
      DO 2 I = I1, NLOC1
    2 PHIOLD(I) = PHINEW(I)
Convergence test:
      CALL GLOBALMAX(TEST)
      IF (TEST > CONVG) GO TO BEGIN
      ELSE STOP

Foil 21 Collective Communication Primitives

The example uses three idealized primitives (not present in a real message passing system)
Message Passing Shift to right MPSHIFT (+1, SOURCE, DEST)
Sends 1 word in location SOURCE to processor on the right
Receives word in location DEST from the processor on the left
SOURCE and DEST are locations -- if set to "DUMMY", then no information is to be sent or received
Message Passing Shift to left MPSHIFT (-1, SOURCE, DEST)
Sends 1 word in SOURCE to processor on the left
Receives word in DEST from processor on the right
GLOBALMAX (TEST)
takes TEST from all processors
forms TESTMAX = maximum value of TEST over all processors
replaces TEST by TESTMAX in all processors

Foil 22 Implementation of MPSHIFT(+1, SOURCE,DEST)

Consider for example, shift to right
Then Processors 0 .. Nproc-2 Send to right
    • Processors 1 .. Nproc-1 Receive from left
We can't necessarily send all messages and then receive all messages. A "standard" send can be "blocking", i.e. it will not return unless the receive is completed by the destination processor. In this case we can deadlock, as all processors "hang" after the send, waiting for a receive
So we must be more careful

Foil 23 Possible Implementation of MPSHIFT in MPI

Let PROCNUM be the processor "rank" (its number or order in the one dimensional decomposition)
      IPAR = MOD(PROCNUM, 2)
      IF (IPAR.EQ.0) THEN     ! even processors: send first, then receive
        IF (SOURCE.NE."DUMMY") CALL MPI_SEND(SOURCE,1,MPI_REAL,PROCNUM+1,tag,comm,ierr)
        IF (DEST.NE."DUMMY") CALL MPI_RECV(DEST,1,MPI_REAL,PROCNUM-1,tag,comm,status,ierr)
      ELSE                    ! odd processors: receive first, then send
        IF (DEST.NE."DUMMY") CALL MPI_RECV(DEST,1,MPI_REAL,PROCNUM-1,tag,comm,status,ierr)
        IF (SOURCE.NE."DUMMY") CALL MPI_SEND(SOURCE,1,MPI_REAL,PROCNUM+1,tag,comm,ierr)
      END IF
Note: MPI uses the reserved value MPI_PROC_NULL rather than "DUMMY". Also, we could remove the setting of "DUMMY" in the calling routine and place the test on PROCNUM inside MPSHIFT instead

Foil 24 Implementation of SHIFT in MPI

We can implement MPSHIFT directly in MPI as
CALL MPI_SENDRECV(SOURCE, 1, MPI_REAL, PROCNUM+1, sendtag, DEST, 1, MPI_REAL, PROCNUM-1, recvtag, comm, status, ierr)
Notes:
MPI_REAL denotes that the variable is REAL
"sendtag" and "recvtag" are, for this purpose, largely irrelevant additional message tags
"comm" is an extra system message tag defining the "scope" -- i.e. the set of processors involved -- here it is all of them
"status" tells you about the received data. You needn't look at it if you trust your code and hardware

Foil 25 Implementation of GLOBALMAX (TEST)

In MPI, this is a single call
CALL MPI_ALLREDUCE (TEST,TEST,1,MPI_REAL,MPI_MAX,comm)
Flag MPI_MAX specifies global maximum
The implementation is quite subtle and usually involves a logarithmic tree
There is a clever extension that gets the maximum into all processors in (essentially) the same time as getting it into one processor, on hypercube and other architectures
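A sketch of GLOBALMAX as a thin wrapper over MPI_ALLREDUCE, again assuming the processors live in MPI_COMM_WORLD; using a separate receive buffer (TESTMAX) avoids passing the same variable as both the send and receive argument:

      SUBROUTINE GLOBALMAX(TEST)
!     Replace TEST on every processor by the maximum of TEST over all processors
      IMPLICIT NONE
      INCLUDE 'mpif.h'
      REAL, INTENT(INOUT) :: TEST
      REAL :: TESTMAX
      INTEGER :: IERR
!     Every processor contributes its local TEST; all get back the maximum
      CALL MPI_ALLREDUCE(TEST, TESTMAX, 1, MPI_REAL, MPI_MAX, MPI_COMM_WORLD, IERR)
      TEST = TESTMAX
      END SUBROUTINE GLOBALMAX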

Foil 26 General Features
of
Laplace Example

Foil 27 What does the Laplace Update calculation look like?

In the sequential code, a single processor updates in turn the 196 internal points (16x16 = 256 points in all, of which 60 are boundary points). Each update f --> 0.25 * (f up + f down + f left + f right) involves 4 floating point operations per point
In the parallel case, each processor updates in turn the points for which it is responsible - this is the "owner computes" rule
A corner processor updates nine points (the small internal points)
A "general" central processor updates sixteen points

Foil 28 The Various Stencil Special Cases

There are several different cases, illustrated by the points A, B, C in the case of the corner processor
In each case we superimpose the stencil on top of the point to be updated.
The large circle must be placed on top of each point "owned" by the processor. The cases A, B, C differ because the points used have different characteristics

Foil 29 Communication Loads

For the corner processor we need to communicate a total of 6 points (marked with an X)
For a general processor, we need to communicate a total of 16 points (marked with an X)

Foil 30 What is the relation of sequential and parallel programming models ?

In both cases, one is "solving" the "same" problem f --> 0.25 * (f left + f right + f up + f down)
over a set of points
In the sequential case, the problem is "solved" over the full domain
In the parallel case, the problem is "solved" over part of the domain (roughly one sixteenth of it), i.e. each node is solving the same problem as the sequential case but over a different, smaller geometry
In the sequential case, boundary conditions are "just" fixed values on the edge
In the parallel case, boundary conditions are, in addition, communicated values

Foil 31 More on SPMD Programming Model and Sequential/Parallel Comparison

Parallel and Sequential cases differ in
  • geometry
  • boundaries
  • We also saw this in "society" analogy with Hadrian's wall
This analysis implies the SPMD - Single Program Multiple Data - programming model, i.e. each node executes the same program but runs through different data
Also note that not only does each node of the parallel machine run the same program, but this program is very similar to the sequential program as long as one isolates the boundary and geometry sections
Note that this problem can be solved on a SIMD or MIMD computer. Thus this SPMD example is not totally general; there are SPMD examples which require MIMD machines

Foil 32 Programming with Guard Rings - Sequential

In the sequential case, one could dimension the array PHI as PHI(14,14) to hold the updated points only. However, points on the edge would then need special treatment so that boundary values are used in their update
Rather, dimension PHI(16,16) to include both internal and boundary points
  • Run loops over x(I) and y(J) from 2 to 15 to cover only the internal points
  • Preload the boundary values into PHI(1, . ), PHI(16, . ), PHI( . , 1), PHI( . , 16)
  • This is easier and faster as there are no conditionals (IF statements) in the inner loops
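A minimal sketch of the guard ring layout in Fortran for the sequential 16x16 case described above; the particular boundary values are illustrative, and only one Jacobi sweep is shown:

      PROGRAM GUARDRING
      IMPLICIT NONE
      REAL :: PHI(16,16), PHINEW(16,16)
      INTEGER :: I, J
      PHI = 0.0
!     Preload boundary (guard ring) values once, outside the update loops
      PHI(1,:)  = 1.0
      PHI(16,:) = 1.0
      PHI(:,1)  = 0.0
      PHI(:,16) = 0.0
      PHINEW = PHI
!     Update loops run only over internal points 2..15: no IFs in the inner loop
      DO J = 2, 15
         DO I = 2, 15
            PHINEW(I,J) = 0.25*(PHI(I+1,J)+PHI(I-1,J)+PHI(I,J+1)+PHI(I,J-1))
         END DO
      END DO
      PHI(2:15,2:15) = PHINEW(2:15,2:15)
      PRINT *, 'One Jacobi sweep done'
      END PROGRAM GUARDRING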

Foil 33 Programming with Guard Rings - Parallel

Similarly, we will not dimension PHI(4,4) but rather PHI(6,6) to hold the communicated values
Now we preload PHI(1, . ), PHI( . , 1), PHI(6, . ), PHI( . , 6) by communication, and then the basic "parallel" program update is identical to the sequential case: one loops over x(I), y(J) with I and J in the range 2 to 5
As there is "load imbalance" - not all processors have the same number of internal points - some differences are seen for "edge" processors with only 9 or 12 internal points

Foil 34 Special Case of Corner Processor

For example, in the top right corner processor one needs to only run I, J from 2 to 4

Foil 35 Analysis of Parallel Overheads
Efficiency and
Speedup

Foil 36 General Formalism for Speed Up

We look at speed up for this simple case
Suppose you run a problem on one processor - the sequential version
  • Let execution time be T1
Now run the "same" problem (definition of "same" can be tricky but it is not subtle here) on a parallel machine with P nodes where C.P.U. same as sequential machine Let execution time be TP
Then speed up is S=T1/TP
We would like S to be P What is it in our case and why ?

Foil 37 What Affects Speed Up ?

The speed up S is reduced from P for two reasons
  • Load Imbalance
  • Communication Overhead
In some cases, one also has effects from parallel method being different from sequential method but that is not relevant here

Foil 38 Load Imbalance and Speed-Up for Laplace Example -- I

We have processors with different amounts of work - as some update 16, some update 12 and some 9 internal points
Each update takes time tcalc = 4 tfloat where tfloat is time taken to do floating point addition or multiplication

Foil 39 Load Imbalance and Speed-Up for Laplace Example -- II

Processors doing 9 or 12 updates must wait for those doing 16 to finish
T1 = 196 tcalc and T16 = 16 tcalc, as all that counts is the maximum amount of work done by any one processor
The speed up degraded by load imbalance is Sload = 196/16 = 12.25. This is the speed up ignoring any degradation due to communication
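Stated as an efficiency, the same numbers give (a one line worked example):
      efficiency = Sload / P = 12.25 / 16 = 0.77 (approximately)
so nearly a quarter of the 16 processors' capacity is already lost to load imbalance, before any communication cost is counted.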

Foil 40 Analytical Analysis of Load Imbalance

Foil 41 Communication Overhead

Suppose communicating a single word - here an f value, probably stored in a 4 byte word - takes time tcomm
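A back of the envelope sketch of where the edge over area dependence mentioned in the abstract comes from, assuming each processor holds a square block of n grid points (n = 16 in this example): each iteration, a processor updates its n points at a cost of about n tcalc = 4 n tfloat, and must receive roughly 4 sqrt(n) edge values from its neighbours at a cost of about 4 sqrt(n) tcomm. The communication to calculation ratio is therefore
      fC = 4 sqrt(n) tcomm / (4 n tfloat) = tcomm / ( sqrt(n) tfloat )
which falls as the grain size n grows (edge over area), and the speed up then behaves like S = P / (1 + fC).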

Foil 42 Analytical Form of Speed Up for Communication Overhead

Foil 43 General Form of Efficiency

Foil 44 Communication to Calculation Ratio as a function of template

Foil 45 Performance for Increasing Stencil

The previous foil showed that increasing the stencil size gives slight improvements!
This foil shows that larger stencils have much lower overheads (and hence better parallel performance) than the simple Laplace equation with its 5 point stencil

Foil 46 Matrix Multiplication on the Hypercube

Showing linear overhead behavior for fc

Foil 47 Efficiency of QCD Physics Simulation on JPL MarkIIIfp Hypercube

Foil 48 General Analysis of Overheads and Efficiency

Foil 49 Speed Up as a Function of Grain Size
