Given by Geoffrey C. Fox at CPS615 Basic Simulation Track for Computational Science on Fall Semester 95. Foils prepared 10 Oct 1995
Summary of Material
This covers MPI from a user's point of view and is meant to be a supplement to other online resources: the MPI Forum documents, David Walker's Tutorial, Ian Foster's "Designing and Building Parallel Programs", and Gropp, Lusk and Skjellum's "Using MPI" |
An overview is based on a subset of 6 routines covering send/receive, environment inquiry (for rank and total number of processors), initialization and finalization, with simple examples |
Processor Groups, Collective Communication and Computation and Derived Datatypes are also discussed |
Geoffrey Fox |
NPAC |
Syracuse University |
111 College Place |
Syracuse NY 13244-4100 |
MPI collected ideas from many previous message passing systems and put them into a "standard" so we could write portable (runs on all current machines) and scalable (runs on future machines we can think of) parallel software |
MPI was agreed in May 1994 after a process that began with a workshop in April 1992 |
MPI plays the same role for message passing systems that HPF does for data parallel languages |
BUT whereas MPI has essentially all one could want, since message passing is fully understood, HPF will still evolve because many data parallel compiler issues remain unsolved |
HPF runs on SIMD and MIMD machines and is high level as it expresses a style of programming or problem architecture |
MPI runs on MIMD machines (in principle it could run on SIMD machines but this would be unnatural and inefficient) -- it expresses a machine architecture |
The traditional software model has high level languages compiled down to a universal machine language |
So in this analogy MPI is the universal "machine language" of parallel processing |
Point to Point Message Passing |
Collective Communication -- messages between >2 simultaneous processes |
Support for Process Groups -- messaging in subsets of processors |
Support for communication contexts -- general specification of message labels, ensuring they are unique to a set of routines, as in a precompiled library |
Support for application (virtual) topologies analogous to distribution types in HPF |
Inquiry routines to find out about environment such as number of processors |
The "kitchen sink" has 129 functions, each with many arguments |
It is not a complete operating environment and does not have the ability to create and spawn processes etc. |
PVM is the previously dominant approach |
MPI sits outside the distributed computing world of HTTP on the Web, ATM protocols and systems like ISIS from Cornell |
However it does look as though MPI is being adopted as the general messaging system by parallel computer vendors |
We find a good example when we consider a typical matrix algorithm (matrix multiplication): |
A(i,j) = Σ_k B(i,k) C(k,j) |
summed over k, combining the k'th column of B with the k'th row of C |
Consider a square decomposition of 16 by 16 matrices B and C, as for Laplace's equation (a good choice) |
Each operation involves a subset (group) of 4 processors, as sketched below |
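A minimal sketch of how such groups of 4 could be formed, assuming 16 processes arranged as a logical 4 by 4 grid (the splitting rule and the name row_comm are illustrative, not from the foil): |

#include "mpi.h"
/* Sketch: split MPI_COMM_WORLD into 4 row groups of 4 processes each,
   assuming 16 processes arranged as a 4 by 4 grid (rank = 4*row + col) */
int main(int argc, char **argv)
{
    int rank;
    MPI_Comm row_comm;                    /* communicator for this processor's grid row */
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_split(MPI_COMM_WORLD, rank / 4, rank % 4, &row_comm);
    /* ... the broadcasts and sums needed by the matrix multiply now run inside row_comm ... */
    MPI_Comm_free(&row_comm);
    MPI_Finalize();
    return 0;
}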
All MPI routines are prefixed by MPI_ |
MPI constants are in upper case, e.g. the MPI datatype MPI_FLOAT for a floating point number in C |
Specify overall constants with the appropriate include file: mpi.h in C, mpif.h in Fortran |
C routines are actually integer functions and always return a status (error) code |
Fortran routines are really subroutines and return the status code as their last argument |
There is a set of predefined constants in the include files for each language and these include: |
MPI_SUCCESS -- successful return code |
MPI_COMM_WORLD (everything) and MPI_COMM_SELF(current process) are predefined reserved communicators in C and Fortran |
Fortran elementary datatypes are: |
MPI_INTEGER, MPI_REAL, MPI_DOUBLE_PRECISION, MPI_COMPLEX, MPI_DOUBLE_COMPLEX, MPI_LOGICAL, MPI_CHARACTER, MPI_BYTE, MPI_PACKED |
C elementary datatypes are: |
MPI_CHAR, MPI_SHORT, MPI_INT, MPI_LONG, MPI_UNSIGNED_CHAR, MPI_UNSIGNED_SHORT, MPI_UNSIGNED, MPI_UNSIGNED_LONG, MPI_FLOAT, MPI_DOUBLE, MPI_LONG_DOUBLE, MPI_BYTE, MPI_PACKED |
call MPI_INIT(mpierr) -- initialize |
call MPI_COMM_RANK (comm,rank,mpierr) -- find processor label (rank) in group |
call MPI_COMM_SIZE(comm,size,mpierr) -- find total number of processors |
call MPI_SEND (sndbuf,count,datatype,dest,tag,comm,mpierr) -- send a message |
call MPI_RECV (recvbuf,count,datatype,source,tag,comm,status,mpierr) -- receive a message |
call MPI_FINALIZE(mpierr) -- End Up |
This MUST be called to set up MPI before any other MPI routines may be called |
For C: int MPI_Init(int *argc, char **argv[]) |
For Fortran: call MPI_INIT(mpierr) |
This allows you to identify each processor by a unique integer called the rank which runs from 0 to N-1 where there are N processors |
If we divide the region 0 to 1 by domain decomposition into N parts, the processor with rank r controls the subregion from r/N to (r+1)/N |
For C: int MPI_Comm_rank(MPI_Comm comm, int *rank) |
For Fortran: call MPI_COMM_RANK (comm,rank,mpierr) |
This returns in the integer size the number of processes in the given communicator comm (remember this specifies the processor group) |
For C: int MPI_Comm_size(MPI_Comm comm, int *size) |
For Fortran: call MPI_COMM_SIZE (comm,size,mpierr) |
Before an MPI application exits, it is courteous to clean up the MPI state, and MPI_FINALIZE does this. No MPI routine may be called in a given process after that process has called MPI_FINALIZE |
For C: int MPI_Finalize() |
For Fortran: call MPI_FINALIZE(mpierr) |
#include <stdio.h>
#include <mpi.h>
int main(int argc, char *argv[])
{
    int rank, size;
    MPI_Init(&argc, &argv);                  /* set up MPI */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);    /* find this process's rank */
    MPI_Comm_size(MPI_COMM_WORLD, &size);    /* find total number of processes */
    printf("Hello World from process %d of %d\n", rank, size);
    MPI_Finalize();                          /* clean up MPI */
    return 0;
}
Parallel I/O has technical issues -- how best to optimize access to a file whose contents may be stored on N different disks which can deliver data in parallel -- and |
semantic issues -- what does printf in C (and PRINT in Fortran) mean? |
The meaning of printf/PRINT is both undefined and changing |
Today, memory costs have declined and ALL mainstream MIMD distributed memory machines whether clusters of workstations or integrated systems such as T3D/ Paragon/ SP-2 have enough memory on each node to run UNIX |
Thus printf today typically means that the node on which it runs will write the output to the "standard output" file for that node |
New MPI-IO initiative will link I/O to MPI in a standard fashion |
Data is propagated between processors via messages, which can be divided into packets, but at the MPI level we see only logically complete messages |
The building block is Point to Point Communication with one processor sending information and one other receiving it |
Collective communication involves more than one message |
Collective communication can ALWAYS be implemented in terms of elementary point to point communications, but is provided for convenience and because the library implementation can be more efficient than what a user would write |
Communication is between two processors and the receiving process must expect a message, although it can be uncertain as to the message type and sending process |
Information required to specify a message includes the data (buffer address, count and datatype) and the envelope (destination or source, message tag and communicator) |
Two types of communication operations are applicable to send and receive: blocking and nonblocking |
In addition there are four types of send operation: standard, ready, synchronous and buffered |
call MPI_SEND (sndbuf, count, datatype, dest, tag, comm, mpierr) -- the arguments are the address of the data to send, the number of elements, their MPI datatype, the rank of the destination process, the message tag, the communicator and the returned error code |
Fortran example: |
call MPI_SEND (sndbuf,count,datatype,dest,tag,comm,mpierr) |
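In C the corresponding syntax (the standard C binding, added here for comparison with the MPI_Recv form given below) is |
int error_code = MPI_Send(void *start_of_buffer, int buffer_len, MPI_Datatype datatype, int dest_rank, int tag, MPI_Comm communicator) |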
call MPI_RECV (recvbuf, count, datatype, source, tag, comm, status, mpierr) |
Note that the returned status is used after completion of the receive to find the actual received length (the count given to MPI_RECV is a MAXIMUM), the actual source processor rank and the actual message tag |
In C the syntax is |
int error_message = MPI_Recv(void *start_of_buffer, int buffer_len, MPI_Datatype datatype, int source_rank, int tag, MPI_Comm communicator, MPI_Status *return_status) |
integer status(MPI_STATUS_SIZE)   ! an array to store the returned status
integer mpierr, count, datatype, source, tag, comm
real recvbuf(100)
count=100
datatype=MPI_REAL
comm=MPI_COMM_WORLD
source=MPI_ANY_SOURCE             ! accept any source processor
tag=MPI_ANY_TAG                   ! accept any message tag
call MPI_RECV (recvbuf,count,datatype,source,tag,comm,status,mpierr)
Note source and tag can be wild-carded |
#include "mpi.h" |
main( int argc, char **argv ) |
{
|
} |
In C status is a structure of type MPI_Status; status.MPI_SOURCE and status.MPI_TAG give the actual source and tag |
In Fortran the status is an integer array and different elements give the actual source and tag: status(MPI_SOURCE) and status(MPI_TAG) |
In C and Fortran, the number of elements (called count) in the message can be found from a call to |
MPI_GET_COUNT (IN status, IN datatype, OUT count, OUT error_message) |
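A small C sketch (a hypothetical two-process program, not from the foils) showing how the returned status and MPI_GET_COUNT are used after a wildcarded receive: |

#include <stdio.h>
#include "mpi.h"
/* Hypothetical two-process example: rank 1 sends 37 integers, rank 0 receives with
   wildcarded source and tag and then inspects the status and MPI_Get_count */
int main(int argc, char **argv)
{
    int rank, count, i, buf[100];
    MPI_Status status;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 1) {
        for (i = 0; i < 37; i++) buf[i] = i;
        MPI_Send(buf, 37, MPI_INT, 0, 5, MPI_COMM_WORLD);
    } else if (rank == 0) {
        MPI_Recv(buf, 100, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, &status);
        MPI_Get_count(&status, MPI_INT, &count);   /* actual number of elements received (at most 100) */
        printf("received %d ints from rank %d with tag %d\n",
               count, status.MPI_SOURCE, status.MPI_TAG);
    }
    MPI_Finalize();
    return 0;
}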
SEND Blocking Nonblocking |
Standard MPI_Send MPI_Isend |
Ready MPI_Rsend MPI_Irsend |
Synchronous MPI_Ssend MPI_Issend |
Buffered MPI_Bsend MPI_Ibsend |
RECEIVE Blocking Nonblocking |
Standard MPI_Recv MPI_Irecv |
Any type of receive routine can be used to receive messages from any type of send routine |
MPI_BARRIER(comm) Global Synchronization within a given communicator |
MPI_BCAST Global Broadcast |
MPI_GATHER Concatenate data from all processors in a communicator into one process |
MPI_SCATTER takes data from one processor and scatters over all processors |
MPI_ALLTOALL sends data from all processes to all other processes |
MPI_SENDRECV exchanges data between two processors -- often used to implement "shifts" |
#include "mpi.h" |
main( int argc, char **argv ) |
{
|
} |
One can often perform computing during a collective communication |
MPI_REDUCE performs a reduction operation with type chosen from MPI_MAX, MPI_MIN, MPI_SUM, MPI_PROD, MPI_LAND, MPI_LOR, MPI_LXOR, MPI_BAND, MPI_BOR, MPI_BXOR, MPI_MAXLOC, MPI_MINLOC (or a user-defined operation) |
MPI_ALLREDUCE is the same as MPI_REDUCE but stores the result in all processors, not just one |
MPI_SCAN performs reductions with result for processor r depending on data in processors 0 to r |
Four Processors where each has a send buffer of size 2 |
0 1 2 3 Processors |
(2,4) (5,7) (0,3) (6,2) Initial Send Buffers |
MPI_BCAST with root=2 |
(0,3) (0,3) (0,3) (0,3) Resultant Buffers |
MPI_REDUCE with action MPI_MIN and root=0 |
(0,2) (_,_) (_,_) (_,_) Resultant Buffers |
MPI_ALLREDUCE with action MPI_MIN (no root is needed as the result goes to all processors) |
(0,2) (0,2) (0,2) (0,2) Resultant Buffers |
MPI_REDUCE with action MPI_SUM and root=1 |
(_,_) (13,16) (_,_) (_,_) Resultant Buffers |
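A C sketch of the MPI_MIN cases above (assuming it is run with exactly 4 processes; the buffer names are illustrative): |

#include "mpi.h"
/* Run with 4 processes so the initial buffers are (2,4) (5,7) (0,3) (6,2);
   MPI_Reduce leaves (0,2) on root 0 and MPI_Allreduce leaves (0,2) everywhere */
int main(int argc, char **argv)
{
    static const int initial[4][2] = { {2,4}, {5,7}, {0,3}, {6,2} };
    int rank, sendbuf[2], recvbuf[2];
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    sendbuf[0] = initial[rank][0];
    sendbuf[1] = initial[rank][1];
    MPI_Reduce(sendbuf, recvbuf, 2, MPI_INT, MPI_MIN, 0, MPI_COMM_WORLD);   /* result only on rank 0 */
    MPI_Allreduce(sendbuf, recvbuf, 2, MPI_INT, MPI_MIN, MPI_COMM_WORLD);   /* result on all ranks */
    MPI_Finalize();
    return 0;
}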
Four Processors where each has a send buffer of size 2 |
0 1 2 3 Processors |
(2,4) (5,7) (0,3) (6,2) Initial Send Buffers |
MPI_SENDRECV with 0,1 and 2,3 paired |
(5,7) (2,4) (6,2) (0,3) Resultant Buffers |
MPI_GATHER with root=0 |
(2,4,5,7,0,3,6,2) (_,_) (_,_) (_,_) Resultant Buffers |
Four Processors where only rank=0 has send buffer |
(2,4,5,7,0,3,6,2) (_,_) (_,_) (_,_) Initial send Buffers |
MPI_SCATTER with root=0 |
(2,4) (5,7) (0,3) (6,2) Resultant Buffers |
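A C sketch of the gather and scatter patterns above (assuming 4 processes; the buffer names are illustrative): |

#include "mpi.h"
/* Each process contributes 2 values; MPI_Gather concatenates them into all[8]
   on root 0, and MPI_Scatter hands 2 values back out to each process */
int main(int argc, char **argv)
{
    int rank, mine[2], all[8];
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    mine[0] = 2 * rank;                                                  /* illustrative data */
    mine[1] = 2 * rank + 1;
    MPI_Gather(mine, 2, MPI_INT, all, 2, MPI_INT, 0, MPI_COMM_WORLD);    /* all[] is filled on rank 0 only */
    MPI_Scatter(all, 2, MPI_INT, mine, 2, MPI_INT, 0, MPI_COMM_WORLD);   /* rank 0 distributes 2 values to each */
    MPI_Finalize();
    return 0;
}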
All to All Communication with i'th location in j'th processor being sent to j'th location in i'th processor |
Processor 0 1 2 3 |
Start (a0,a1,a2,a3) (b0,b1,b2,b3) (c0,c1,c2,c3) (d0,d1,d2,d3) |
After (a0,b0,c0,d0) (a1,b1,c1,d1) (a2,b2,c2,d2) (a3,b3,c3,d3) |
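A C sketch of this all-to-all transpose (assuming 4 processes; the buffer names are illustrative): |

#include "mpi.h"
/* Run with 4 processes: element i of process j's send buffer arrives as
   element j of process i's receive buffer, i.e. the data is transposed */
int main(int argc, char **argv)
{
    int rank, i, sendbuf[4], recvbuf[4];
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    for (i = 0; i < 4; i++)
        sendbuf[i] = 10 * rank + i;       /* rank 0 holds a0..a3, rank 1 holds b0..b3, ... */
    MPI_Alltoall(sendbuf, 1, MPI_INT, recvbuf, 1, MPI_INT, MPI_COMM_WORLD);
    /* rank j now holds (a_j, b_j, c_j, d_j) */
    MPI_Finalize();
    return 0;
}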
There are extensions such as MPI_ALLTOALLV to handle the case where the data is stored in a noncontiguous fashion in each processor and where each processor sends different amounts of data to other processors |
Many MPI routines have such "vector" extensions |
These are an elegant solution to a problem we struggled with a lot in the early days -- all message passing is naturally built on buffers holding contiguous data |
However often (usually) the data is not stored contiguously. One can address this with a set of small MPI_SEND commands but we want messages to be as big as possible as latency is so high |
One can copy all the data elements into a single buffer and transmit this but this is tedious for the user and not very efficient |
It has extra memory to memory copies which are often quite slow |
So derived datatypes can be used to set up arbitrary memory templates with variable offsets and primitive datatypes. Derived datatypes can then be used in "ordinary" MPI calls in place of primitive datatypes MPI_REAL, MPI_FLOAT, etc. |
Derived Datatypes should be declared integer in Fortran and MPI_Datatype in C |
Generally have form { (type0,disp0), (type1,disp1) ... (type(n-1),disp(n-1)) } with list of primitive data types typei and displacements (from start of buffer) dispi |
call MPI_TYPE_CONTIGUOUS (count, oldtype, newtype, ierr) creates a new datatype newtype consisting of count contiguous copies of oldtype |
one must use call MPI_TYPE_COMMIT(derivedtype,ierr) |
before one can use the type derivedtype in a communication call |
call MPI_TYPE_FREE(derivedtype,ierr) frees up space used by this derived type |
integer derivedtype, ... |
call MPI_TYPE_CONTIGUOUS(10, MPI_REAL, derivedtype, ierr) |
call MPI_TYPE_COMMIT(derivedtype, ierr) |
call MPI_SEND(data, 1, derivedtype, dest,tag, MPI_COMM_WORLD, ierr) |
call MPI_TYPE_FREE(derivedtype, ierr) |
is equivalent to simpler single call |
call MPI_SEND(data, 10, MPI_REAL, dest, tag, MPI_COMM_WORLD, ierr) |
and each sends 10 contiguous real values at location data to process dest |
MPI_TYPE_VECTOR (count,blocklen,stride,oldtype,newtype,ierr) creates newtype as count blocks, each of blocklen copies of oldtype, with the starts of consecutive blocks separated by stride elements of oldtype |
MPI_TYPE_INDEXED (count,blocklens,indices,oldtype,newtype,ierr) is similar but allows a different block length (blocklens) and displacement (indices, in units of oldtype) for each of the count blocks -- see the sketch below |
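A minimal C sketch of MPI_TYPE_INDEXED (the block lengths and displacements are just an illustration, not from the foils): |

#include "mpi.h"
/* Build a derived type that picks two blocks out of an array of MPI_FLOATs:
   3 elements starting at offset 0 and 2 elements starting at offset 6 */
int main(int argc, char **argv)
{
    int blocklens[2] = { 3, 2 };
    int indices[2]   = { 0, 6 };      /* displacements measured in units of the old type */
    MPI_Datatype picktype;
    MPI_Init(&argc, &argv);
    MPI_Type_indexed(2, blocklens, indices, MPI_FLOAT, &picktype);
    MPI_Type_commit(&picktype);       /* picktype can now replace a primitive datatype in sends/receives */
    MPI_Type_free(&picktype);
    MPI_Finalize();
    return 0;
}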
Assume each processor stores an NLOC by NLOC set of grid points in an array PHI dimensioned PHI(NLOC2,NLOC2) with NLOC2=NLOC+2 to establish guard rings |
integer north, south, east, west   ! processor ranks of the 4 nearest neighbors
integer rowtype, coltype           ! the new derived types
! Fortran stores column elements contiguously (C has the opposite convention!)
call MPI_TYPE_CONTIGUOUS(NLOC, MPI_REAL, coltype, ierr)
call MPI_TYPE_COMMIT(coltype, ierr)
! rows (north and south boundaries) are not contiguous
call MPI_TYPE_VECTOR(NLOC, 1, NLOC2, MPI_REAL, rowtype, ierr)
call MPI_TYPE_COMMIT(rowtype, ierr)
call MPI_SEND(PHI(2,2), 1, coltype, west, 0, comm, ierr)
call MPI_SEND(PHI(2,NLOC+1), 1, coltype, east, 0, comm, ierr)
call MPI_SEND(PHI(2,2), 1, rowtype, north, 0, comm, ierr)
call MPI_SEND(PHI(NLOC+1,2), 1, rowtype, south, 0, comm, ierr)
More general versions of MPI_?SEND and associated inquiry routines to see if messages have arrived. Use of these allows you to overlap communication and computation; in general this is not used even though it is more efficient (a sketch is given after this list) |
Application Topology routines allow you to find the rank of the nearest neighbor processors, such as north, south, east and west in a Jacobi iteration |
Packing and Unpacking of data to make single buffers -- derived datatypes are usually a more elegant approach to this |
Communicators to set up subgroups of processors (remember the matrix example) and to set up independent MPI universes, as needed to build libraries so that messages generated by a library do not interfere with those from other libraries or user code |
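A minimal C sketch of the nonblocking send/receive style mentioned in the first item of this list (assuming 2 processes; the buffer sizes and tags are illustrative): |

#include "mpi.h"
/* Post a nonblocking receive and send, do local computation while the
   messages are in transit, then wait for both operations to complete */
int main(int argc, char **argv)
{
    int rank, i, inbuf[100], outbuf[100];
    MPI_Request reqs[2];
    MPI_Status stats[2];
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    for (i = 0; i < 100; i++) outbuf[i] = rank;
    MPI_Irecv(inbuf, 100, MPI_INT, MPI_ANY_SOURCE, 0, MPI_COMM_WORLD, &reqs[0]);
    MPI_Isend(outbuf, 100, MPI_INT, (rank + 1) % 2, 0, MPI_COMM_WORLD, &reqs[1]);  /* partner rank assumes 2 processes */
    /* ... computation that does not touch inbuf or outbuf can overlap here ... */
    MPI_Waitall(2, reqs, stats);      /* both operations are now complete */
    MPI_Finalize();
    return 0;
}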