CSE302/CS320/ECE392
Introduction to Parallel Programming
Lecture 13
Message Passing Interface
Introduction to MPI
-
MPI - standard for explicit message passing in applications
on distributed memory machines
-
Standard needed for:
-
Portability and ease-of-use
-
Providing hardware vendors with a well defined
set of routines to implement efficiently
-
Development of the parallel software industry
Introduction to MPI
-
MPI provides:
-
Point-to-point message passing
-
Collective communication
-
Support for process groups
-
Support for communication contexts
-
Support for application topologies
-
Environmental enquiry functions
The MPI Programming Model
-
A computation comprises one or more processes that communicate
using the MPI library routines
-
In most implementations, a fixed set of processes is created
at program initialization and one process is created per processor
-
Processes may execute different programs, hence the MPI programming
model is an MPMD (multiple program, multiple data) model
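-
A minimal MPI program, sketched here in Fortran to match the routine names
used in these notes, initializes the library, finds its rank and the number
of processes, and finalizes:
      program hello
      include 'mpif.h'
      integer rank, nprocs, ierr
! initialize the MPI library
      call mpi_init(ierr)
! each process learns its own rank and the total number of processes
      call mpi_comm_rank(MPI_COMM_WORLD, rank, ierr)
      call mpi_comm_size(MPI_COMM_WORLD, nprocs, ierr)
      print *, 'process', rank, 'of', nprocs
      call mpi_finalize(ierr)
      end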
Process Model and Groups
-
Fundamental computational unit is a process. Each process has:
-
An independent thread of control
-
A separate address space
-
MPI processes execute in MIMD style, but:
-
No mechanism for loading code onto processors or assigning
processes to processors
-
No mechanism for creating or destroying processes
-
MPI supports dynamic process groups
-
Process groups can be created and destroyed
-
Membership is static
-
Groups may overlap
Communication Scope
-
In MPI, a process is specified by:
-
A group
-
A rank relative to the group

-
A message label is specified by:
-
A message context
-
A message tag relative to the context
-
Groups are used to partition process space
-
Contexts are used to partition ``message tag space''
-
Groups and contexts are bound together to form a
communicator object. Contexts are not visible at the
application level
-
A communicator defines the scope of a communication
operation
Process Groups
-
A group is a set of processes
-
These processes are uniquely labelled by the integers 0, 1, ..., n-1,
where n is the number of processes in the group
-
The integer that labels a process is called its rank within the
group
-
A group is represented by an opaque object
-
User does not know internal structure
-
Must call inquiry routine to determine attributes
-
Referenced by means of a handle
mpi_group_size (group, size, ierr)
mpi_group_rank (group, rank, ierr)
-
Initially, all processes are members of the group given by the
pre-defined communicator MPI_COMM_WORLD
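-
As a sketch (assuming mpif.h is included and mpi_init has been called), the
group underlying MPI_COMM_WORLD can be obtained with mpi_comm_group and then
queried:
      integer worldgroup, gsize, grank, ierr
! obtain the group associated with the predefined communicator
      call mpi_comm_group(MPI_COMM_WORLD, worldgroup, ierr)
! query the group's size and this process's rank within it
      call mpi_group_size(worldgroup, gsize, ierr)
      call mpi_group_rank(worldgroup, grank, ierr)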
Group Management
-
Operations on groups are local and involve no communication
-
Can convert rank in one group to rank in a second group:
mpi_group_translate_ranks (group1, n, ranks1, group2, ranks2, ierr)
-
Can check if two groups are the same:
mpi_group_compare (group1, group2, result, ierr)
Group Management
-
Set-wise operations on pairs of groups: union, intersection, and
difference
-
List ranks to be included in new group:
mpi_group_incl(group, n, ranks, newgroup, ierr)
ranks(i) in group has rank i in newgroup
-
List ranks to be excluded from new group:
mpi_group_excl(group, n, ranks, newgroup, ierr)
-
Can also form new groups by specifying ranges of ranks to be included in
or excluded from new group.
mpi_group_range_incl(group, n, ranges, newgroup, ierr)
mpi_group_range_excl(group, n, ranges, newgroup, ierr)
ranges is array of n triplets giving (first, last, stride)
of each range included
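-
For example, a new group containing only the even-ranked processes of an
existing group could be formed as follows (a fragment; worldgroup and its
size gsize are assumed to have been obtained as above, with at most 128
processes):
      integer evenranks(64), newgroup, n, i, ierr
! list every even rank of the parent group ...
      n = 0
      do i = 0, gsize - 1, 2
         n = n + 1
         evenranks(n) = i
      end do
! ... and include exactly those ranks in the new group
      call mpi_group_incl(worldgroup, n, evenranks, newgroup, ierr)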
Communicators
-
Communicators are used to create independent ``message universes''.
-
Communicators are used to disambiguate message selection when an
application calls a library routine that performs message passing.
Nondeterminacy may arise:
-
If processes enter the library routine asynchronously
-
If processes enter the library routine synchronously,
but there are outstanding communication operations
-
A communicator:
-
Binds together groups and contexts
-
Defines the scope of a communication operation
-
Is represented by an opaque object
-
Is referenced by a handle
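-
A common idiom, sketched below, is for a library to duplicate the
communicator it is given; mpi_comm_dup yields a communicator with the same
group but a new context, so library-internal messages can never match
receives posted by the rest of the application:
      integer libcomm, ierr
! same process group as MPI_COMM_WORLD, but a private context
      call mpi_comm_dup(MPI_COMM_WORLD, libcomm, ierr)
! ... all library-internal communication now uses libcomm ...
      call mpi_comm_free(libcomm, ierr)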
Asynchronous Library Calls
-
The following shows the correct and a possible incorrect sequence of
communication operations
-
Deadlock results in the second case
-
Need to differentiate between messages sent in library routine and
rest of application
Synchronous Library Calls
-
In this case the library call is made synchronously
-
Still have problems if there are communication operations outstanding on
entry
-
Again we must separate communication operations inside the library from
those in the rest of the application
Point-to-point Communication
-
MPI provides for point-to-point communication between pairs of processes
-
Message selectivity is by rank and message tag
-
Rank and tag are interpreted relative to the scope of the communication
-
The scope is specified by the communicator
-
Rank and tag may be wildcarded
-
The components of a communicator may not be wildcarded
Communication Completion
-
A communication operation is locally complete on a
process if the process has completed its part in the
operation
-
A communication operation is globally complete if
all processes involved have completed their part in
the operation. A communication operation is globally
complete if and only if it is locally complete for
all processes
Blocking Send/Recv
-
The standard blocking send routine:
mpi_send(buffer,numitems,datatype,dest,tag,comm,error)
-
IN - buffer,numitems,datatype,dest,tag,comm
-
OUT - error
-
The standard blocking receive routine:
mpi_recv(buffer,maxitems,datatype,src,tag,comm,status,error)
-
IN - buffer,maxitems,datatype,src,tag,comm
-
OUT - status,error
Return Status Objects
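-
In the Fortran binding, status is an integer array of size MPI_STATUS_SIZE;
its entries record the actual source and tag of the message received, and
mpi_get_count recovers the number of items received. A sketch (a fragment,
assuming the usual include and initialization):
      integer status(MPI_STATUS_SIZE), count, ierr
      real buf(100)
! receive from any source, with any tag
      call mpi_recv(buf, 100, MPI_REAL, MPI_ANY_SOURCE, MPI_ANY_TAG, &
                    MPI_COMM_WORLD, status, ierr)
! the status object reports who actually sent the message and with what tag
      print *, 'source =', status(MPI_SOURCE), ' tag =', status(MPI_TAG)
! and how many items actually arrived
      call mpi_get_count(status, MPI_REAL, count, ierr)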
Blocking Behaviour
-
Blocking send/receive:
-
Returns when send/receive is locally complete
-
Message buffer can be overwritten/read
-
Non-blocking send/receive:
-
Returns immediately
-
Message buffer contents cannot be overwritten/read until process
verifies completion of send/receive
Non-blocking Send/Recv
-
The standard non-blocking send routine:
mpi_isend(buffer,numitems,datatype,dest,tag,comm,req_id,error)
-
IN - buffer,numitems,datatype,dest,tag,comm
-
OUT - req_id,error
-
The standard non-blocking receive routine:
mpi_irecv(buffer,maxitems,datatype,src,tag,comm,req_id,error)
-
IN - buffer,maxitems,datatype,src,tag,comm
-
OUT - req_id,error
Completion Routines
-
Two basic ways to check on non-blocking routine status:
-
Call a routine that blocks until completion
-
Call a test routine that returns a flag to indicate if complete
-
Use of non-blocking communication and completion routines allows
for overlapping computation with communication
mpi_wait(req_id,status,ierr)
mpi_test(req_id,flag,status,ierr)
-
mpi_wait blocks until communication is complete
-
mpi_test returns immediately and sets flag to true if the communication
is complete
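-
A typical overlap pattern, sketched below as a fragment: post a non-blocking
receive, do work that does not touch the buffer, then wait before using the
data:
      integer req, status(MPI_STATUS_SIZE), ierr
      real buf(1000)
! post the receive early so the message can arrive while we compute
      call mpi_irecv(buf, 1000, MPI_REAL, 0, 7, MPI_COMM_WORLD, req, ierr)
! ... computation that does not read or write buf ...
! block until the message has actually been received
      call mpi_wait(req, status, ierr)
! buf may now be used safely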
Multiple Completions
-
Versions of the wait and test routines that act on
multiple communication operations
-
Block until all communication operations in a given
list have completed
mpi_waitall(count, req_id_list, status_list, ierr)
-
Block until one of the communication operations in a
given list has completed
mpi_waitany(count, req_id_list, index, status, ierr)
-
Block until at least one of the communication
operations in a given list has completed
mpi_waitsome(count, req_id_list, count_done, index_list, status_list, ierr)
-
There are similar test routines for checking
completion status of a list of communication operations
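-
For example, a process that has posted several receives can service the
messages in whatever order they complete (a fragment; reqs is assumed to
hold request ids from four earlier mpi_irecv calls):
      integer reqs(4), idx, status(MPI_STATUS_SIZE), ierr, i
      do i = 1, 4
! wait for whichever outstanding receive finishes next
         call mpi_waitany(4, reqs, idx, status, ierr)
! ... process the message belonging to request number idx ...
      end do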
Communication Modes
-
Different modes for point-to-point communications
-
Standard mode - send can be initiated without matching receive
being initiated
-
Ready mode - send may be initiated only when matching receive has
been initiated
-
Synchronous mode - same as standard mode, except that the send will not
complete until the matching receive has started
-
Buffered mode - similar to standard mode, but completion always
independent of matching receive, and message may be buffered to
ensure this
Buffered Mode
-
In buffered mode a user-supplied buffer may be
used to buffer messages, so that the sending process
can always return from the send before the message
has been received
-
To supply the system with the user buffer:
mpi_buffer_attach(buffer, size, ierr)
-
To get user buffer back from system:
mpi_buffer_detach(buffer, size, ierr)
-
This will block until all communication using buffer
has completed
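-
A sketch of buffered-mode use (the buffer size here is only an example; it
must be large enough for all outstanding buffered messages plus
MPI_BSEND_OVERHEAD per message):
      integer, parameter :: bufbytes = 100000
      character userbuf(bufbytes)
      real msg(100)
      integer ierr
      msg = 0.0
! hand a user-supplied buffer to MPI for buffered-mode sends
      call mpi_buffer_attach(userbuf, bufbytes, ierr)
! mpi_bsend completes locally once the message is copied into the buffer
      call mpi_bsend(msg, 100, MPI_REAL, 1, 0, MPI_COMM_WORLD, ierr)
! blocks until all communication using the buffer has completed
      call mpi_buffer_detach(userbuf, bufbytes, ierr)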
Flavors of Communication
-
For a send operation there are:
-
4 communication modes
-
2 blocking modes
-
8 types of send in all
-
For a receive operation there are:
-
1 communication mode
-
2 blocking modes
-
2 types of receive in all
Naming Conventions
-
Send routines:
mpi_send, mpi_bsend, mpi_ssend, mpi_rsend (blocking)
mpi_isend, mpi_ibsend, mpi_issend, mpi_irsend (non-blocking)
-
Receive routines:
mpi_recv (blocking) and mpi_irecv (non-blocking)
-
The prefix i denotes a non-blocking (``immediate'') routine; b, s, and r
denote buffered, synchronous, and ready mode
-
Any type of receive routine can be used to receive messages from
any type of send routine
Send/Receive Operations
-
In many applications, processes send to one process
while receiving from another
-
Deadlock may arise if care is not taken
-
MPI provides routines for such send/receive operations
-
For distinct send/receive buffers:
mpi_sendrecv
-
For identical send/receive buffers:
mpi_sendrecv_replace
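-
For example, a ring shift, in which every process sends to its right-hand
neighbour while receiving from its left-hand neighbour, can be written
without risk of deadlock as follows (a fragment):
      integer rank, nprocs, left, right, ierr
      integer status(MPI_STATUS_SIZE)
      real outbuf(10), inbuf(10)
      call mpi_comm_rank(MPI_COMM_WORLD, rank, ierr)
      call mpi_comm_size(MPI_COMM_WORLD, nprocs, ierr)
      right = mod(rank + 1, nprocs)
      left  = mod(rank - 1 + nprocs, nprocs)
      outbuf = rank
! combined send and receive: the library orders the transfers safely
      call mpi_sendrecv(outbuf, 10, MPI_REAL, right, 0, &
                        inbuf, 10, MPI_REAL, left, 0, &
                        MPI_COMM_WORLD, status, ierr)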
Collective Communication
-
Involves coordinated communication within a group of processes
-
No message tags used
-
All collective routines block until they are locally complete
-
Two broad classes:
-
Data movement routines - broadcast, gather, scatter
-
Global computation routines - reduction, scan
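-
Two typical collective calls, sketched below as a fragment: a broadcast from
a root process, and a global sum reduction whose result is left on rank 0:
      integer ierr
      real x, localsum, totalsum
! rank 0 broadcasts x to every process in the communicator
      call mpi_bcast(x, 1, MPI_REAL, 0, MPI_COMM_WORLD, ierr)
! every process contributes localsum; rank 0 receives the global total
      call mpi_reduce(localsum, totalsum, 1, MPI_REAL, MPI_SUM, &
                      0, MPI_COMM_WORLD, ierr)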
Communicating Non-Primitive Datatypes
-
Primitive datatypes:
MPI_INTEGER, MPI_REAL, MPI_DOUBLE_PRECISION
MPI_COMPLEX, MPI_DOUBLE_COMPLEX
MPI_LOGICAL, MPI_CHARACTER, MPI_BYTE
-
MPI also supports array sections and structures through general datatypes
-
A general datatype is a sequence of primitive types and integer byte
displacements:
datatype={(type0,disp0),(type1,disp1),...,(typeN,dispN)}
-
Together with a base address, a datatype defines a communication buffer
Contiguous Datatype Constructor
-
A general datatype is built up hierarchically from simpler components
mpi_type_contiguous (count, oldtype, newtype, ierr)
-
The above creates a new datatype made up of count repetitions of oldtype
-
For example:
oldtype = {(double, 0), (char, 8)}
then if count=3,
newtype = {(double,0),(char,8),(double,16),(char,24),(double,32),(char,40)}
Vector Datatype Constructor
-
This constructor replicates a datatype, taking blocks at fixed offsets.
mpi_type_vector(count, blocklen,stride, oldtype, newtype, ierr)
-
The new datatype consists of:
-
Count blocks
-
Each block is a repetition of blocklen items of oldtype
-
The start of successive blocks is offset by stride items of oldtype
-
If count=2, stride=4, blocklen=3, and oldtype is as above, then newtype is:
{(double,0),(char,8),(double,16),(char,24),(double,32),(char,40),
(double,64),(char,72),(double,80),(char,88),(double,96),(char,104)}
Indexed Datatype Constructor
-
This constructor replicates a datatype into a sequence of blocks whose lengths and offsets are specified individually.
mpi_type_indexed (count, B, I, oldtype, newtype, ierr)
-
The new datatype consists of:
-
Count blocks
-
The ith block is of length B[i] items of oldtype
-
The offset of the start of the ith block is I[i] items of oldtype
-
If count=2, I=4,0, and B=3,1, and oldtype is as above, then newtype is:
{(double,64),(char,72),(double,80),(char,88),(double,96),(char,104),
(double,0),(char,8)}
Structure Datatype Constructor
-
This constructor generalizes the indexed datatype by allowing each block
to be of a different datatype.
mpi_type_struct(count, B, I, T, newtype, ierr)
-
The new datatype consists of:
-
Count blocks
-
The length of the ith block is B[i] items of type T[i]
-
The offset of the start of the ith block is I[i] bytes
-
If count=3, T=MPI_FLOAT,type1,MPI_CHAR (where type1 is the datatype
{(double,0),(char,8)} used earlier), I=0,16,26, and B=2,1,3, then newtype is:
{(float,0),(float,4),(double,16),(char,24),(char,26),(char,27),(char,28)}
Other Datatype Routines
-
The span of memory covered by a datatype is termed its extent.
-
The extent is given by:
mpi_type_extent(datatype, extent, ierr)
-
mpi_type_hvector is same as mpi_type_vector, except stride is given in
bytes
-
mpi_type_hindexed is same as mpi_type_indexed, except offsets in array of
offsets are given in bytes
-
The offsets in a datatype may be given relative to a ``base address,''
given by the MPI constant mpi_bottom
-
The address of a location can be found thus:
mpi_address(location, address, ierr)
Example of a General Datatype
-
Send the array section A(1:17:2, 3:11, 2:10) from process 0 to process 1, where it is received into the array E
      program section
      include 'mpif.h'
      real a(100,100,100), e(9,9,9)
      integer oneslice, twoslice, threeslice, sizeofreal
      integer rank, ierr, status(MPI_STATUS_SIZE)

      call mpi_init(ierr)
      call mpi_comm_rank(MPI_COMM_WORLD, rank, ierr)
      if (rank .eq. 0) then
         call mpi_type_extent(MPI_REAL, sizeofreal, ierr)
! datatype for a 1D section: 9 reals, every second element
         call mpi_type_vector(9, 1, 2, MPI_REAL, oneslice, ierr)
! datatype for a 2D section: 9 1D sections, one column apart
         call mpi_type_hvector(9, 1, 100*sizeofreal, oneslice, twoslice, ierr)
! datatype for the full 3D section: 9 2D sections, one plane apart
         call mpi_type_hvector(9, 1, 100*100*sizeofreal, twoslice, threeslice, ierr)
         call mpi_type_commit(threeslice, ierr)
         call mpi_send(a(1,3,2), 1, threeslice, 1, 0, MPI_COMM_WORLD, ierr)
      else if (rank .eq. 1) then
         call mpi_recv(e, 9*9*9, MPI_REAL, 0, 0, MPI_COMM_WORLD, status, ierr)
      end if
      call mpi_finalize(ierr)
      end
Application Topologies
-
In many applications, processes are arranged with a particular topology,
e.g., a regular grid
-
MPI supports general application topologies by a graph in which
communicating processes are connected by an arc
-
MPI also provides explicit support for Cartesian grid topologies
mpi_cart_create(comm_old, ndims, dims, period, reorder, comm_cart, ierr)
-
Periodicity in each grid direction may be specified
-
Inquiry routines transform between rank in group and location in topology
-
For Cartesian topologies, row-major ordering is used for processes
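-
A sketch (the 4 x 4 shape is just an example): create a two-dimensional
periodic grid of processes and find this process's coordinates in it:
      integer comm2d, rank2d, coords(2), dims(2), ierr
      logical periods(2)
      dims(1) = 4
      dims(2) = 4
      periods(1) = .true.
      periods(2) = .true.
! build a 4 x 4 periodic grid, letting MPI reorder ranks if it helps
      call mpi_cart_create(MPI_COMM_WORLD, 2, dims, periods, .true., &
                           comm2d, ierr)
      call mpi_comm_rank(comm2d, rank2d, ierr)
! map our rank in the new communicator to a (row, column) position
      call mpi_cart_coords(comm2d, rank2d, 2, coords, ierr)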
Topological Inquiry Routines
-
mpi_topo_test returns the type of topology associated with a
communicator
-
Can find number of dimensions in a Cartesian topology:
mpi_cartdim_get (comm, ndims, ierr)
-
More information on a Cartesian topology can be obtained with:
mpi_cart_get (comm, maxdims, dims, periods, coords, ierr)
-
Mapping of coordinate position in Cartesian topology to rank:
mpi_cart_rank (comm, coords, rank, ierr)
-
Mapping of rank to coordinate position:
mpi_cart_coords (comm, rank, maxdims, coords, ierr)
Uses of Topologies
-
Knowledge of application topology can be used to efficiently assign
processes to processors
-
Cartesian grids can be divided into hyperplanes by removing specified
dimensions
-
MPI provides support for shifting data along a specified dimension of a
Cartesian grid
-
MPI provides support for performing collective communication operations
along a specified grid direction
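-
For example, mpi_cart_shift computes the source and destination ranks for a
shift along one dimension of the grid, which can be passed straight to
mpi_sendrecv (a fragment; comm2d is the Cartesian communicator created
earlier):
      integer src, dest, status(MPI_STATUS_SIZE), ierr
      real outrow(100), inrow(100)
! neighbours one step away along dimension 0 of the grid
      call mpi_cart_shift(comm2d, 0, 1, src, dest, ierr)
! on a non-periodic grid, off-grid neighbours come back as MPI_PROC_NULL
      call mpi_sendrecv(outrow, 100, MPI_REAL, dest, 0, &
                        inrow, 100, MPI_REAL, src, 0, &
                        comm2d, status, ierr)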
Pack and Unpack
-
MPI provides routines that:
-
Pack data into a contiguous buffer before sending it
-
Unpack data from a contiguous buffer after receiving it
-
These routines are provided:
-
For compatibility with other message passing libraries
-
To allow a message to be received in parts
-
To buffer outgoing messages in user space, thereby overriding
the system buffering policy
Packing Data
-
To pack data:
mpi_pack(inbuf, incount, datatype, outbuf, outsize, position, comm, ierr)
This takes incount items of specified datatype from buffer
inbuf and packs them into outbuf starting at offset position
-
The offset position is specified in bytes
-
On return position is set to the next location in outbuf
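-
A sketch of the sending side that matches the unpacking example shown later:
a type identifier, an item count, and the items themselves are packed into
one buffer, followed by a terminating identifier of 0, and the buffer is
sent as MPI_PACKED:
      integer pos, typeid, nitems, ierr
      integer ints(100)
      character buf(1000)
      ints = 0
      pos = 0
! pack the type identifier (1 = integers) and the item count
      typeid = 1
      nitems = 10
      call mpi_pack(typeid, 1, MPI_INTEGER, buf, 1000, pos, &
                    MPI_COMM_WORLD, ierr)
      call mpi_pack(nitems, 1, MPI_INTEGER, buf, 1000, pos, &
                    MPI_COMM_WORLD, ierr)
! pack the items themselves
      call mpi_pack(ints, nitems, MPI_INTEGER, buf, 1000, pos, &
                    MPI_COMM_WORLD, ierr)
! pack the end-of-message marker
      typeid = 0
      call mpi_pack(typeid, 1, MPI_INTEGER, buf, 1000, pos, &
                    MPI_COMM_WORLD, ierr)
! pos now holds the number of bytes actually packed
      call mpi_send(buf, pos, MPI_PACKED, 1, 0, MPI_COMM_WORLD, ierr)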
Unpacking Data
-
To unpack data:
mpi_unpack (inbuf, insize, position, outbuf, outcount, datatype, comm, ierr)
This extracts outcount items of specified datatype from the
buffer inbuf, starting at offset position, and stores in the buffer outbuf
-
On return position is set to the next location in inbuf after the data
just unpacked
Example of Unpacking
-
A message consists of sequences of reals or integers prefixed by a type
identifier and the number of items
Code for Unpacking Example
      call mpi_recv(message, size, MPI_PACKED, MPI_ANY_SOURCE, &
                    MPI_ANY_TAG, comm, status, ierr)
      pos = 0
! get typeid : 0 - end of message, 1 - integers, 2 - reals
      call mpi_unpack(message, size, pos, typeid, 1, MPI_INTEGER, comm, ierr)
! loop until the end-of-message marker (typeid = 0) is found
      do while (typeid .gt. 0)
!    unpack the number-of-items field
         call mpi_unpack(message, size, pos, nitems, 1, MPI_INTEGER, comm, ierr)
!    select the type of unpack based on typeid
!    (nints and nreals index the next free slots in ints and reals)
         if (typeid .eq. 1) then
            call mpi_unpack(message, size, pos, ints(nints), nitems, &
                            MPI_INTEGER, comm, ierr)
            nints = nints + nitems
         else
            call mpi_unpack(message, size, pos, reals(nreals), nitems, &
                            MPI_REAL, comm, ierr)
            nreals = nreals + nitems
         end if
!    unpack the next typeid
         call mpi_unpack(message, size, pos, typeid, 1, MPI_INTEGER, comm, ierr)
      end do