Given by Ian Foster, Gina Goff, Ehtesham Hayder, and Chuck Koelbel at the DoD Modernization Tutorial, 1995-1998. Foils prepared August 29, 1998
(c) 1995, 1996, 1997, 1998 |
Ian Foster Gina Goff |
Ehtesham Hayder Charles Koelbel |
A Tutorial Presented by the Department of Defense HPC Modernization Programming Environment & Training Program |
Day 1
|
Day 2
|
A standard message-passing library
|
An MPI program defines a set of processes
|
... that communicate by calling MPI functions
|
... and can be constructed in a modular fashion
|
MPI defines a language-independent interface
|
Bindings are defined for different languages
|
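As an illustration of the bindings, a minimal C sketch of querying a process rank; the equivalent Fortran call is shown in a comment (the program itself is illustrative, not taken from the tutorial):

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank;

        MPI_Init(&argc, &argv);

        /* C binding: the error code is the function's return value */
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* Fortran binding of the same operation:
         *   CALL MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)
         * The extra ierr argument carries the error code. */
        printf("rank %d\n", rank);

        MPI_Finalize();
        return 0;
    }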
Multiple implementations
|
Special compiler commands for simple programs
|
Options enable MPI profiling features
|
Standard makefiles for larger programs
|
To run a program on two processors
|
To list available command-line arguments
|
To list commands mpirun would execute
|
Standard does not specify startup mechanism
|
Questions:
|
1st generation message passing systems allowed only a contiguous array of bytes
|
MPI specifies the buffer by starting address, datatype, and count
|
Specification of elementary datatypes allows heterogeneous communication. |
Elimination of length in favor of count is clearer |
Specifying application-oriented layouts allows maximal use of special hardware |
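A hedged C sketch of an application-oriented layout: a strided column of a matrix is described once with MPI_Type_vector and then sent with a count of 1 (the matrix size N, destination, and tag are invented for the example):

    #include <mpi.h>

    #define N 8   /* illustrative matrix dimension, not from the tutorial */

    /* Send one column of an N x N row-major matrix: N elements, stride N. */
    void send_column(double a[N][N], int col, int dest, MPI_Comm comm)
    {
        MPI_Datatype column;

        MPI_Type_vector(N, 1, N, MPI_DOUBLE, &column);  /* count, blocklength, stride */
        MPI_Type_commit(&column);

        /* buffer = starting address, count = 1, datatype = the derived layout */
        MPI_Send(&a[0][col], 1, column, dest, /* tag */ 0, comm);

        MPI_Type_free(&column);
    }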
1st generation message passing systems used hardware addresses
|
MPI supports process groups
|
All communication takes place in groups
|
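A hedged C sketch of working with process groups, assuming MPI_Comm_split is used to divide MPI_COMM_WORLD into two halves (the split criterion is invented for illustration):

    #include <mpi.h>

    int main(int argc, char **argv)
    {
        int rank, size;
        MPI_Comm half;   /* communicator for this process's half of the machine */

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        /* color selects the subgroup; key (= rank) orders ranks within it */
        MPI_Comm_split(MPI_COMM_WORLD, rank < size / 2 ? 0 : 1, rank, &half);

        /* communication on 'half' involves only that subgroup ... */

        MPI_Comm_free(&half);
        MPI_Finalize();
        return 0;
    }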
1st generation message passing systems used an integer "tag" (a.k.a. "type" or "id") to match messages when received
|
Calls Sub1 and Sub2 are from different libraries |
Same sequence of calls on all processes, with no global synch
|
Each library was self-consistent
|
Interaction between the libraries killed them
|
The lesson:
|
Other examples teach other lessons:
|
A separate communication context for each family of messages
|
No wild cards allowed, for security |
Allocated by the system, for security |
Tags retained for use within a context
|
Thus the basic (blocking) send is:
|
and the receive is:
|
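In the C binding these are (the Fortran binding adds a final integer ierror argument):

    /* buffer = (start, count, datatype); envelope = destination/source, tag, communicator */
    int MPI_Send(void *buf, int count, MPI_Datatype datatype,
                 int dest, int tag, MPI_Comm comm);

    int MPI_Recv(void *buf, int count, MPI_Datatype datatype,
                 int source, int tag, MPI_Comm comm, MPI_Status *status);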
MPI is very simple: 6 functions (MPI_INIT, MPI_FINALIZE, MPI_COMM_SIZE, MPI_COMM_RANK, MPI_SEND, MPI_RECV) allow you to write many programs
|
program main |
include 'mpif.h' |
integer rank, size, tag, count, i, ierr |
integer src, dest, data, st_source, st_tag, st_count |
integer status(MPI_STATUS_SIZE) |
call MPI_INIT( ierr ) |
call MPI_COMM_RANK( MPI_COMM_WORLD, rank, ierr ) |
call MPI_COMM_SIZE( MPI_COMM_WORLD, size, ierr ) |
print *, 'Process ', rank, ' of ', size, ' is alive' |
... |
... |
c Pass a value down the line of processes: rank 0 starts it, each |
c intermediate rank adds its own rank, and the last rank prints the result |
dest = size - 1 |
tag = 99 |
count = 1 |
if (rank .eq. 0) then |
data = 0 |
call MPI_SEND( data, 1, MPI_INTEGER, rank+1, 99, |
+ MPI_COMM_WORLD, ierr ) |
else if (rank .ne. dest) then |
call MPI_RECV( data, count, MPI_INTEGER, rank-1, |
+ tag, MPI_COMM_WORLD, status, ierr ) |
data = data + rank |
call MPI_SEND( data, 1, MPI_INTEGER, rank+1, 99, |
+ MPI_COMM_WORLD, ierr ) |
else |
call MPI_RECV( data, count, MPI_INTEGER, rank-1, |
+ tag, MPI_COMM_WORLD, status, ierr ) |
print *, rank, ' received', data |
endif |
... |
... |
call MPI_FINALIZE( ierr ) |
end |
Collective communication
|
Buffering
|
Nonblocking communication
|
Intercommunicators |
Cartesian Grids |
Provides standard interfaces to common global operations
|
A collective operation uses a process group
|
Message tags not needed (generated internally) |
MPI_Barrier(comm)
|
Useful for producing meaningful timings
|
Typically, synchronization comes from messages themselves |
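A hedged C sketch of the usual timing idiom: a barrier so all processes start together, then the difference of two MPI_Wtime readings (the measured region is a placeholder):

    #include <mpi.h>
    #include <stdio.h>

    void time_section(MPI_Comm comm)
    {
        double t0, t1;
        int rank;

        MPI_Comm_rank(comm, &rank);

        MPI_Barrier(comm);        /* line everyone up before starting the clock */
        t0 = MPI_Wtime();

        /* ... code being timed goes here ... */

        MPI_Barrier(comm);        /* make sure everyone has finished */
        t1 = MPI_Wtime();

        if (rank == 0)
            printf("elapsed: %f seconds\n", t1 - t0);
    }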
"ALL" routines deliver results to all participating processes |
Routines ending in "V" allow different sized inputs on different processors |
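A hedged C sketch of a "V" routine, MPI_Gatherv, which collects a different number of elements from each process onto rank 0 (the per-rank counts are invented for illustration):

    #include <mpi.h>
    #include <stdlib.h>

    void gather_variable(MPI_Comm comm)
    {
        int rank, size, i;
        MPI_Comm_rank(comm, &rank);
        MPI_Comm_size(comm, &size);

        int mycount = rank + 1;                 /* each rank contributes a different amount */
        double *mine = malloc(mycount * sizeof(double));
        for (i = 0; i < mycount; i++) mine[i] = rank;

        /* counts, displacements, and the receive buffer matter only at the root */
        int *counts = NULL, *displs = NULL;
        double *all = NULL;
        if (rank == 0) {
            counts = malloc(size * sizeof(int));
            displs = malloc(size * sizeof(int));
            for (i = 0; i < size; i++) {
                counts[i] = i + 1;
                displs[i] = (i * (i + 1)) / 2;   /* prefix sum of the counts */
            }
            all = malloc((size * (size + 1) / 2) * sizeof(double));
        }

        MPI_Gatherv(mine, mycount, MPI_DOUBLE,
                    all, counts, displs, MPI_DOUBLE, 0, comm);

        free(mine); free(counts); free(displs); free(all);
    }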
integer rank, n, lo, hi, i, ierr |
double precision sum, sumout |
call MPI_COMM_RANK( comm, rank, ierr ) |
if (rank .eq. 0) then |
read *, n |
end if |
c Everyone needs n: broadcast it from rank 0 |
call MPI_BCAST( n, 1, MPI_INTEGER, 0, comm, ierr ) |
lo = rank*n+1 |
hi = lo+n-1 |
sum = 0.0d0 |
do i = lo, hi |
sum = sum + 1.0d0 / i |
end do |
c Combine the partial sums; every rank receives the total |
call MPI_ALLREDUCE( sum, sumout, 1, MPI_DOUBLE_PRECISION, |
& MPI_SUM, comm, ierr ) |
Where does data go when you send it?
|
The right-hand approach, which avoids intermediate copies, is more efficient, but not always correct |
Copies are not needed if
|
MPI provides modes to arrange this
|
All combinations are legal
|
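A hedged C sketch of two of those modes: MPI_Bsend always copies into a user-attached buffer, while MPI_Ssend does not complete until the matching receive has started (buffer size and message contents are illustrative):

    #include <mpi.h>
    #include <stdlib.h>

    void send_modes(int dest, MPI_Comm comm)
    {
        double msg[100] = {0};
        int bufsize = 100 * sizeof(double) + MPI_BSEND_OVERHEAD;
        void *buf = malloc(bufsize);

        /* Buffered mode: MPI copies the message, so the send returns right away */
        MPI_Buffer_attach(buf, bufsize);
        MPI_Bsend(msg, 100, MPI_DOUBLE, dest, 0, comm);
        MPI_Buffer_detach(&buf, &bufsize);   /* blocks until buffered data is delivered */

        /* Synchronous mode: completes only after the receiver has started to receive */
        MPI_Ssend(msg, 100, MPI_DOUBLE, dest, 1, comm);

        free(buf);
    }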
MPI provides support for connecting separate message-passing programs |
Intercommunicators connect disjoint communicators |
Programs use separate (disjoint) communicators for internal messages
|
To connect them
|
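A hedged C sketch of one way to connect them, assuming the two programs are modeled as two halves of MPI_COMM_WORLD joined by MPI_Intercomm_create (leader ranks and the peer tag are illustrative):

    #include <mpi.h>

    int main(int argc, char **argv)
    {
        int rank, size, color;
        MPI_Comm local, inter;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        /* Two disjoint "programs": lower half and upper half of the ranks */
        color = (rank < size / 2) ? 0 : 1;
        MPI_Comm_split(MPI_COMM_WORLD, color, rank, &local);

        /* Local leader is rank 0 of each half; the remote leader is named by
           its rank in the peer communicator MPI_COMM_WORLD. */
        MPI_Intercomm_create(local, 0, MPI_COMM_WORLD,
                             color == 0 ? size / 2 : 0, /* tag */ 99, &inter);

        /* sends on 'inter' go to processes in the other group ... */

        MPI_Comm_free(&inter);
        MPI_Comm_free(&local);
        MPI_Finalize();
        return 0;
    }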
MPI contains routines to simplify writing programs for regular grid operations
|
int size, dims[2], periods[2]; |
MPI_Comm comm2d; |
MPI_Comm_size( MPI_COMM_WORLD, &size ); |
dims[0] = dims[1] = 0;  /* 0 lets MPI_Dims_create choose both factors */ |
MPI_Dims_create( size, 2, dims ); |
periods[0] = periods[1] = 0; |
MPI_Cart_create( MPI_COMM_WORLD, 2, dims, |
periods, 1, &comm2d ); |
Getting neighbors |
Sending data
|
MPI_Cart_shift( comm2d, 0, 1, &nbrleft, &nbrright ); |
MPI_Cart_shift( comm2d, 1, 1, &nbrbottom, &nbrtop ); |
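A hedged C sketch of the sending step, using the neighbor ranks from MPI_Cart_shift in a halo exchange with MPI_Sendrecv (the row buffers and length N are placeholders):

    #include <mpi.h>

    #define N 128   /* illustrative edge length, not from the tutorial */

    /* Exchange boundary data with the left and right neighbors in comm2d.
       Neighbors that fall off a non-periodic grid come back as MPI_PROC_NULL,
       so the calls below are safe at the edges of the mesh. */
    void exchange(double send_left[N], double recv_right[N],
                  double send_right[N], double recv_left[N],
                  int nbrleft, int nbrright, MPI_Comm comm2d)
    {
        MPI_Status status;

        MPI_Sendrecv(send_left, N, MPI_DOUBLE, nbrleft, 0,
                     recv_right, N, MPI_DOUBLE, nbrright, 0,
                     comm2d, &status);
        MPI_Sendrecv(send_right, N, MPI_DOUBLE, nbrright, 1,
                     recv_left, N, MPI_DOUBLE, nbrleft, 1,
                     comm2d, &status);
    }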
Partitioning
|
Communication
|
Agglomeration
|
Mapping
|
Numerically solve a PDE on a square mesh |
Method:
|
Partitioning is simple
|
Communication is simple
|
Agglomeration works along dimensions
|
Mapping: Cartesian grid directly supported by MPI virtual topologies
|
For generality, write as the 2-D version
|
Adjust array bounds, iterate over local array
|
PARAMETER(nxblock=nx/nxp, nyblock=ny/nyp, nxlocal=nxblock+1, nylocal=nyblock+1) |
REAL u(0:nxlocal,0:nylocal),unew(0:nxlocal,0:nylocal),f(0:nxlocal,0:nylocal) |
dims(1) = nxp; dims(2) = nyp; |
periods(1) = .false.; periods(2) = .false. |
reorder = .true. |
ndim = 2 |
call MPI_CART_CREATE(MPI_COMM_WORLD,ndim,dims,periods, reorder,comm2d,ierr) |
CALL MPI_COMM_RANK( comm2d, myrank, ierr ) |
CALL MPI_CART_COORDS( comm2d, myrank, 2, coords, ierr ) |
CALL MPI_CART_SHIFT( comm2d, 0, 1, nbrleft, nbrright, ierr ) |
CALL MPI_CART_SHIFT( comm2d, 1, 1, nbrbottom, nbrtop, ierr ) |
CALL MPI_TYPE_VECTOR( nyblock, 1, nxlocal+1, MPI_REAL, rowtype, ierr ) |
CALL MPI_TYPE_COMMIT( rowtype, ierr ) |
dx = 1.0/nx; dy = 1.0/ny; err = tol * 1e6 |
DO j = 0, nylocal
|
END DO |
DO WHILE (err > tol)
|
END DO |