This is the in-depth discussion layer of a two-part module. For an explanation of the layers and how to navigate within and between them, return to the top page of this module.
There are certain communication patterns that appear in many different types of applications. Rather than requiring each programmer to code these using point-to-point communication, MPI provides routines that handle these patterns for you, called collective communication routines. This module surveys these routines, the communication patterns they establish, and their syntax. In the lab exercise, you will add collective communication routines to an incomplete program.
References
Lab Exercises
Quiz
Evaluation
1. Introduction
Collective communication involves all the processes in a communicator. (Communicators were introduced in Basics of MPI Programming.) The purpose of collective communication is to manipulate a "common" piece or set of information. Collective communication routines are built from point-to-point communication routines. You can build your own collective communication routines, but doing so involves a lot of tedious work and might not be as efficient.
Although other message-passing libraries provide some collective
communication calls, none of them provides a set of collective
communication routines as complete and robust as those provided by
MPI. In this module, we introduce these routines in three categories:
synchronization, data movement, and global computation.
2. MPI Collective Communication Routines
2.1 Characteristics
MPI collective communication routines differ in many ways from MPI point-to-point communication routines, which were introduced in MPI Point-to-Point Communication I.
Here are the characteristics of MPI collective communication routines:
- They involve every process in the communicator, and every process must make the call with consistent arguments (for example, the same root and comm).
- No message tags are used.
- The amount of data sent must exactly match the amount of data specified by the receiver.
- Except for MPI_BARRIER, a collective call does not necessarily synchronize the processes; a process may return as soon as its own participation is complete.
MPI collective communication can be divided into three subsets:
synchronization, data movement, and global computation, which are
covered in the following three sections.
2.2 Barrier Synchronization Routine
In parallel applications in the distributed memory environment, explicit or implicit synchronization is sometimes required. As with other message-passing libraries, MPI provides a function call, MPI_BARRIER, to synchronize all processes within a communicator. A barrier is simply a synchronization primitive. A node calling it will be blocked until all the nodes within the group have called it. The syntax of MPI_BARRIER for both C and FORTRAN programs is given below:
C:

int MPI_Barrier(MPI_Comm comm)

where:

Argument | Description |
---|---|
MPI_Comm | is an MPI predefined structure for communicators |
comm | is a communicator |

Fortran:

MPI_BARRIER(comm, ierr)

where:

Argument | Description |
---|---|
comm | is an integer denoting a communicator |
ierr | is an integer return error code |
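As a simple illustration, here is a minimal C sketch (the timing scenario is just an illustrative assumption) in which MPI_Barrier is used so that all processes start a timed section together:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank;
    double t0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* ... each process does its own setup work here ... */

    /* block until every process in MPI_COMM_WORLD has reached this point,
       so the timing below starts from a common instant */
    MPI_Barrier(MPI_COMM_WORLD);
    t0 = MPI_Wtime();

    /* ... timed computation ... */

    if (rank == 0)
        printf("elapsed time: %f seconds\n", MPI_Wtime() - t0);

    MPI_Finalize();
    return 0;
}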
2.3 Data Movement Routines
MPI provides three types of collective data movement routines: broadcast, gather, and scatter. There are also allgather and alltoall routines. The gather, scatter, allgather, and alltoall routines have variable-data versions, in which each process can send and/or receive a different number of elements. Now, let's take a look at the functionality and syntax of these routines.

2.3.1 Broadcast
The syntax of MPI_BCAST is:
C:

int MPI_Bcast(void* buffer, int count, MPI_Datatype datatype, int root, MPI_Comm comm)

Fortran:

MPI_BCAST(buffer, count, datatype, root, comm, ierr)

where:

Argument | Description |
---|---|
buffer | is the starting address of the buffer |
count | is an integer indicating the number of data elements in the buffer |
datatype | is an MPI-defined constant indicating the data type of the elements in the buffer |
root | is an integer indicating the rank of the broadcast root process |
comm | is the communicator |
The MPI_BCAST must be called by each node in the group, specifying the same comm and root. The message is sent from the root process to all processes in the group, including the root process.
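For example, a minimal C sketch (the variable nsteps and its value of 100 are illustrative assumptions) in which the root sets a parameter and broadcasts it to every process:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank;
    int root = 0;
    int nsteps = 0;                 /* parameter to be shared by all processes */

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == root)
        nsteps = 100;               /* e.g. read from an input file on the root only */

    /* every process, including the root, makes the same call */
    MPI_Bcast(&nsteps, 1, MPI_INT, root, MPI_COMM_WORLD);

    printf("process %d: nsteps = %d\n", rank, nsteps);

    MPI_Finalize();
    return 0;
}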
2.3.2 Gather and Scatter
If an array is scattered across all processors in the group and
one wants to collect each piece of the array into a specified
array in the order of process rank, the call to use
is GATHER.
On the other hand, if one wants to distribute the data into n equal segments,
where the ith segment is sent to the ith process in the group which
has n processes, use SCATTER. MPI provides two variants of the
gather/scatter operations: one in which the numbers of data items
collected from/sent to nodes can be different; and a more efficient one
in the special case where the number per node is the same.
Their syntax is given below:
C:

int MPI_Gather(void* sbuf, int scount, MPI_Datatype stype, void* rbuf, int rcount, MPI_Datatype rtype, int root, MPI_Comm comm)

int MPI_Scatter(void* sbuf, int scount, MPI_Datatype stype, void* rbuf, int rcount, MPI_Datatype rtype, int root, MPI_Comm comm)

Fortran:

MPI_GATHER(sbuf, scount, stype, rbuf, rcount, rtype, root, comm, ierr)

MPI_SCATTER(sbuf, scount, stype, rbuf, rcount, rtype, root, comm, ierr)
where, for the Gather routines:

Argument | Description |
---|---|
sbuf | is the starting address of the send buffer |
scount | is the number of elements in the send buffer |
stype | is the data type of the send buffer elements |
rbuf | is the starting address of the receive buffer |
rcount | is the number of elements for any single receive |
rtype | is the data type of the receive buffer elements |
root | is the rank of the receiving process |
comm | is the communicator |
and for the Scatter routines:

Argument | Description |
---|---|
sbuf | is the address of the send buffer |
scount | is the number of elements sent to each process |
stype | is the data type of the send buffer elements |
rbuf | is the address of the receive buffer |
rcount | is the number of elements in the receive buffer |
rtype | is the data type of the receive buffer elements |
root | is the rank of the sending process |
comm | is the communicator |
In the gather operation, each process (root process included) sends scount elements of type stype from sbuf to the root process. The root process receives the messages and stores them in rank order in rbuf. For scatter, the reverse holds: the root process sends a buffer of N chunks of data (N = number of processors in the group), so that processor 1 gets the first chunk, processor 2 gets the second chunk, etc.
This is illustrated by the following example for gather.
DIMENSION A(25,100), b(100), cpart(25), ctotal(100)
INTEGER root
DATA root/0/
DO I=1,25
   cpart(I) = 0.
   DO K=1,100
      cpart(I) = cpart(I) + A(I,K)*b(K)
   END DO
END DO
call MPI_GATHER(cpart, 25, MPI_REAL, ctotal, 25, MPI_REAL, root, MPI_COMM_WORLD, ierr)
Here b is a vector shared by all the processors, and cpart is the piece of the result vector that each processor computes independently; MPI_GATHER collects these pieces, in rank order, into ctotal on the root process.
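Going in the other direction, here is a minimal C sketch of MPI_Scatter (the chunk size of 25 and the use of MPI_COMM_WORLD are illustrative assumptions): the root distributes one equal-sized chunk of an array to each process.

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define CHUNK 25                    /* elements each process receives */

int main(int argc, char *argv[])
{
    int rank, size, i;
    float *full = NULL;             /* significant only on the root */
    float part[CHUNK];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0) {
        full = malloc(size * CHUNK * sizeof(float));
        for (i = 0; i < size * CHUNK; i++)
            full[i] = (float) i;
    }

    /* chunk i of 'full' (CHUNK elements) ends up in 'part' on rank i */
    MPI_Scatter(full, CHUNK, MPI_FLOAT, part, CHUNK, MPI_FLOAT,
                0, MPI_COMM_WORLD);

    printf("process %d received elements starting with %f\n", rank, part[0]);

    if (rank == 0)
        free(full);
    MPI_Finalize();
    return 0;
}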
2.3.3 Gatherv and Scatterv
The variable-data versions, MPI_GATHERV and MPI_SCATTERV, have the following syntax:

C:

int MPI_Gatherv(void* sbuf, int scount, MPI_Datatype stype, void* rbuf, int* rcount, int* displs, MPI_Datatype rtype, int root, MPI_Comm comm)

int MPI_Scatterv(void* sbuf, int* scount, int* displs, MPI_Datatype stype, void* rbuf, int rcount, MPI_Datatype rtype, int root, MPI_Comm comm)

Fortran:

MPI_GATHERV(sbuf, scount, stype, rbuf, rcount, displs, rtype, root, comm, ierr)

MPI_SCATTERV(sbuf, scount, displs, stype, rbuf, rcount, rtype, root, comm, ierr)
The variables for Gatherv are:

Argument | Description |
---|---|
sbuf | starting address of send buffer |
scount | number of elements in send buffer |
stype | data type of send buffer elements |
rbuf | starting address of receive buffer |
rcount | array containing number of elements to be received from each process |
displs | array specifying the displacement relative to rbuf at which to place the incoming data from the corresponding process |
rtype | data type of receive buffer |
root | rank of receiving process |
comm | group communicator |
The variables for Scatterv are:

Argument | Description |
---|---|
sbuf | address of send buffer |
scount | integer array specifying the number of elements to send to each process |
displs | array specifying the displacement relative to sbuf from which to take the data going out to the corresponding process |
stype | data type of send buffer elements |
rbuf | address of receive buffer |
rcount | number of elements in receive buffer |
rtype | data type of receive buffer elements |
root | rank of sending process |
comm | group communicator |
For the purpose of illustrating the usage of MPI_GATHERV and MPI_SCATTERV, a program fragment is given below.
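In this minimal C sketch (the choice that rank i contributes i+1 integers is an illustrative assumption), each process contributes a different number of elements and the root collects them with MPI_Gatherv; an MPI_Scatterv call with the same rcounts and displs arrays would perform the reverse distribution.

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char *argv[])
{
    int rank, size, i, total = 0;
    int root = 0;
    int *rcounts = NULL, *displs = NULL, *rbuf = NULL;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* process i contributes i+1 integers (an arbitrary choice) */
    int scount = rank + 1;
    int *sbuf = malloc(scount * sizeof(int));
    for (i = 0; i < scount; i++)
        sbuf[i] = rank;

    if (rank == root) {
        /* the root needs one count and one displacement per process */
        rcounts = malloc(size * sizeof(int));
        displs  = malloc(size * sizeof(int));
        for (i = 0; i < size; i++) {
            rcounts[i] = i + 1;     /* how many elements rank i sends   */
            displs[i]  = total;     /* where rank i's data goes in rbuf */
            total += rcounts[i];
        }
        rbuf = malloc(total * sizeof(int));
    }

    MPI_Gatherv(sbuf, scount, MPI_INT,
                rbuf, rcounts, displs, MPI_INT, root, MPI_COMM_WORLD);

    if (rank == root)
        printf("root gathered %d elements in rank order\n", total);

    MPI_Finalize();
    return 0;
}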
2.3.4 Allgather and Allgatherv
MPI_ALLGATHER can be thought of as a gather in which every process, not just the root, receives the result; MPI_ALLGATHERV is its variable-data version. Their syntax is:

C:

int MPI_Allgather(void* sbuf, int scount, MPI_Datatype stype, void* rbuf, int rcount, MPI_Datatype rtype, MPI_Comm comm)

int MPI_Allgatherv(void* sbuf, int scount, MPI_Datatype stype, void* rbuf, int* rcount, int* displs, MPI_Datatype rtype, MPI_Comm comm)

Fortran:

MPI_ALLGATHER(sbuf, scount, stype, rbuf, rcount, rtype, comm, ierr)

MPI_ALLGATHERV(sbuf, scount, stype, rbuf, rcount, displs, rtype, comm, ierr)
The variables for Allgather are:

Argument | Description |
---|---|
sbuf | starting address of send buffer |
scount | number of elements in send buffer |
stype | data type of send buffer elements |
rbuf | address of receive buffer |
rcount | number of elements received from any process |
rtype | data type of receive buffer elements |
comm | group communicator |
The gather example above can be rewritten with MPI_ALLGATHER, so that every process (not just the root) ends up with the complete result vector ctotal:

DIMENSION A(25,100), b(100), cpart(25), ctotal(100)
DO I=1,25
   cpart(I) = 0.
   DO K=1,100
      cpart(I) = cpart(I) + A(I,K)*b(K)
   END DO
END DO
call MPI_ALLGATHER(cpart, 25, MPI_REAL, ctotal, 25, MPI_REAL, MPI_COMM_WORLD, ierr)
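The variable-data version works similarly. Here is a minimal C sketch of MPI_Allgatherv (again assuming, purely for illustration, that rank i contributes i+1 integers) in which every process ends up with every process's contribution:

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char *argv[])
{
    int rank, size, i, total = 0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int scount = rank + 1;                 /* this rank contributes rank+1 ints */
    int *sbuf = malloc(scount * sizeof(int));
    for (i = 0; i < scount; i++)
        sbuf[i] = rank;

    /* every process needs the counts and displacements of all processes */
    int *rcounts = malloc(size * sizeof(int));
    int *displs  = malloc(size * sizeof(int));
    for (i = 0; i < size; i++) {
        rcounts[i] = i + 1;
        displs[i]  = total;
        total += rcounts[i];
    }
    int *rbuf = malloc(total * sizeof(int));

    MPI_Allgatherv(sbuf, scount, MPI_INT,
                   rbuf, rcounts, displs, MPI_INT, MPI_COMM_WORLD);

    printf("process %d now holds all %d elements\n", rank, total);

    MPI_Finalize();
    return 0;
}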
2.3.5 All to All
In applications like matrix transpose and FFT, an MPI_ALLTOALL call is very helpful. It is an extension of ALLGATHER in which each process sends distinct data to each receiver: the jth block sent from processor i is received by processor j and stored in the ith block of its receive buffer. The syntax of MPI_ALLTOALL is:
C:

int MPI_Alltoall(void* sbuf, int scount, MPI_Datatype stype, void* rbuf, int rcount, MPI_Datatype rtype, MPI_Comm comm)

Fortran:

MPI_ALLTOALL(sbuf, scount, stype, rbuf, rcount, rtype, comm, ierr)
The variables for Alltoall are:

Argument | Description |
---|---|
sbuf | is the starting address of send buffer |
scount | is the number of elements sent to each process |
stype | is the data type of send buffer elements |
rbuf | is the address of receive buffer |
rcount | is the number of elements received from any process |
rtype | is the data type of receive buffer elements |
comm | is the group communicator |
Note: Same specification as ALLGATHER, except sbuf must contain scount*NPROC elements.
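As an illustration, here is a minimal C sketch of MPI_Alltoall (one integer per destination is an illustrative choice) in which every process sends a distinct value to every other process:

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char *argv[])
{
    int rank, size, i;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* sbuf holds one value destined for each process;
       rbuf receives one value from each process */
    int *sbuf = malloc(size * sizeof(int));
    int *rbuf = malloc(size * sizeof(int));
    for (i = 0; i < size; i++)
        sbuf[i] = 100 * rank + i;    /* block i goes to process i */

    MPI_Alltoall(sbuf, 1, MPI_INT, rbuf, 1, MPI_INT, MPI_COMM_WORLD);

    /* after the call, rbuf[i] on this process equals 100*i + rank */
    printf("process %d received %d from process 0\n", rank, rbuf[0]);

    MPI_Finalize();
    return 0;
}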
2.4 Global Computation Routines
MPI provides three global reduction (computation) routines: MPI_REDUCE, MPI_ALLREDUCE, and MPI_REDUCE_SCATTER. Their syntax is:

C:

int MPI_Reduce(void* sbuf, void* rbuf, int count, MPI_Datatype stype, MPI_Op op, int root, MPI_Comm comm)

int MPI_Allreduce(void* sbuf, void* rbuf, int count, MPI_Datatype stype, MPI_Op op, MPI_Comm comm)

int MPI_Reduce_scatter(void* sbuf, void* rbuf, int* rcounts, MPI_Datatype stype, MPI_Op op, MPI_Comm comm)

Fortran:

MPI_REDUCE(sbuf, rbuf, count, stype, op, root, comm, ierr)

MPI_ALLREDUCE(sbuf, rbuf, count, stype, op, comm, ierr)

MPI_REDUCE_SCATTER(sbuf, rbuf, rcounts, stype, op, comm, ierr)
The differences among the three routines are: MPI_REDUCE returns the reduced result to the single root process; MPI_ALLREDUCE returns the result to all processes in the group; and MPI_REDUCE_SCATTER reduces the data element by element and then scatters the result, with process i receiving the number of elements given by the ith entry of rcounts. Their arguments are:
Argument | Description |
---|---|
sbuf | is the address of the send buffer |
rbuf | is the address of the receive buffer |
count | is the number of elements in the send buffer |
stype | is the data type of the elements of the send buffer |
op | is the reduce operation (which may be MPI predefined, or your own) |
root | is the rank of the root process (MPI_REDUCE only) |
comm | is the group communicator |
In each case the reduction combines, element by element, sbuf(process 0) * sbuf(process 1) * ... * sbuf(process N-1), where * is the reduction function, which may be either an MPI predefined function or a user-defined function.
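For example, a minimal C sketch (the local values are illustrative) that uses MPI_Allreduce to form a global sum that every process receives:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank, size;
    double local, global;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    local = (double) rank;          /* each process's partial result */

    /* sum the partial results; every process receives the total */
    MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    printf("process %d: global sum = %f (expected %f)\n",
           rank, global, (double) size * (size - 1) / 2.0);

    MPI_Finalize();
    return 0;
}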
MPI_SCAN performs a parallel prefix (scan) reduction: the result returned on process i is the reduction of the values contributed by processes 0, 1, ..., i. The syntax of the MPI scan routine is:
C:

int MPI_Scan(void* sbuf, void* rbuf, int count, MPI_Datatype datatype, MPI_Op op, MPI_Comm comm)

Fortran:

MPI_SCAN(sbuf, rbuf, count, datatype, op, comm, ierr)
where:

Argument | Description |
---|---|
sbuf | starting address of send buffer |
rbuf | starting address of receive buffer |
count | number of elements in input buffer |
datatype | data type of elements of input buffer |
op | operation |
comm | group communicator |
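A minimal C sketch of MPI_Scan (the contributed values are illustrative), in which process i receives the sum of the values contributed by processes 0 through i:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank;
    int value, prefix;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    value = rank + 1;               /* each process contributes one value */

    /* inclusive prefix sum: prefix on rank i is value(0)+...+value(i) */
    MPI_Scan(&value, &prefix, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);

    printf("process %d: prefix sum = %d\n", rank, prefix);

    MPI_Finalize();
    return 0;
}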
Examples of MPI predefined operations are summarized in Table 1. MPI also provides a mechanism for user-defined operations used in MPI_ALLREDUCE and MPI_REDUCE.
Name | Meaning | C type | FORTRAN type |
---|---|---|---|
MPI_MAX | maximum | integer, float | integer, real |
MPI_MIN | minimum | integer, float | integer, real |
MPI_SUM | sum | integer, float | integer, real, complex |
MPI_PROD | product | integer, float | integer, real, complex |
MPI_LAND | logical and | integer | logical |
MPI_BAND | bit-wise and | integer, MPI_BYTE | integer, MPI_BYTE |
MPI_LOR | logical or | integer | logical |
MPI_BOR | bit-wise or | integer, MPI_BYTE | integer, MPI_BYTE |
MPI_LXOR | logical xor | integer | logical |
MPI_BXOR | bit-wise xor | integer, MPI_BYTE | integer, MPI_BYTE |
MPI_MAXLOC | max value and location | combination of int, float, double, and long double | combination of integer, real, double precision |
MPI_MINLOC | min value and location | combination of int, float, double, and long double | combination of integer, real, double precision |
For example, the following Fortran fragment finds the maximum of maxht over all tasks and returns the result in globmx on task 0:

INTEGER maxht, globmx
. . .
(calculations which determine maximum height)
. . .
call MPI_REDUCE(maxht, globmx, 1, MPI_INTEGER, MPI_MAX, 0, MPI_COMM_WORLD, ierr)
IF (taskid .eq. 0) THEN
   . . .
   (Write output)
   . . .
END IF
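The user-defined operation mechanism works through MPI_Op_create. Here is a minimal C sketch (the "largest absolute value" operation is an illustrative choice, not an MPI predefined one):

#include <mpi.h>
#include <stdio.h>
#include <math.h>

/* user-defined reduction: elementwise maximum of absolute values */
void maxabs(void *invec, void *inoutvec, int *len, MPI_Datatype *dtype)
{
    double *in = (double *) invec;
    double *inout = (double *) inoutvec;
    int i;
    for (i = 0; i < *len; i++) {
        double a = fabs(in[i]), b = fabs(inout[i]);
        inout[i] = (a > b) ? a : b;
    }
}

int main(int argc, char *argv[])
{
    int rank;
    double local, result;
    MPI_Op op;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    local = (rank % 2 == 0) ? -1.5 * rank : 1.5 * rank;   /* sample data */

    /* register the function; the argument 1 states that it is commutative */
    MPI_Op_create(maxabs, 1, &op);

    MPI_Allreduce(&local, &result, 1, MPI_DOUBLE, op, MPI_COMM_WORLD);

    if (rank == 0)
        printf("largest |value| over all processes: %f\n", result);

    MPI_Op_free(&op);
    MPI_Finalize();
    return 0;
}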
3.1 Example: Broadcast to 8 Processors
Example of two ways (simple way, better way) to do a broadcast. In what follows, N is the number of processors and p is the size of the message; in the figures, a solid line indicates a data transfer and a dotted line indicates a carry-over from the previous transfer.

Simple approach: One processor broadcasts the same message to all the other processors, one at a time.
Number of steps: N-1
Amount of data transferred: (N-1)*p

Smarter approach: Let the other processors help propagate the message. In the first step, processor 1 sends the message to processor 2. In the second step, both processors can now participate: processor 1 sends to processor 3, while processor 2 sends to processor 4. Similarly, by the third step, four processors (1, 3, 2, and 4) are sending the message to the four remaining processors (5, 6, 7, and 8).
Number of steps: log2(N)
Amount of data transferred: (N-1)*p

Compare the number of steps in the two approaches:
Simple approach: N-1
Smarter approach: log2(N)
The latter scales better as N increases.
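The smarter approach is essentially a binomial tree. Below is a minimal C sketch (assuming ranks numbered from 0, a power-of-two number of processes, and an integer message) of how such a broadcast could be written with point-to-point calls; in practice, MPI_Bcast is free to use an algorithm of this kind internally, so you would normally just call MPI_Bcast.

#include <mpi.h>

/* Binomial-tree broadcast of 'count' ints from rank 0; assumes the
   number of processes is a power of two.  Illustrative sketch only. */
void tree_bcast(int *buf, int count, MPI_Comm comm)
{
    int rank, size, mask;
    MPI_Status status;

    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    /* in step k (mask = size/2, size/4, ..., 1), every rank that already
       holds the data passes it to the rank 'mask' positions away */
    for (mask = size / 2; mask > 0; mask /= 2) {
        if (rank % (2 * mask) == 0)
            MPI_Send(buf, count, MPI_INT, rank + mask, 0, comm);
        else if (rank % (2 * mask) == mask)
            MPI_Recv(buf, count, MPI_INT, rank - mask, 0, comm, &status);
    }
}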
3.2 Example: Scatter to 8 Processors
Example of two ways (simple way, better way) to do a scatter. As before, N is the number of processors and p is the size of each message; a solid line indicates a data transfer and a dotted line indicates a carry-over from the previous transfer.

Simple approach: One processor scatters the N messages to the other processors, one at a time.
Number of steps: N-1
Amount of data transferred: (N-1)*p

Smarter approach: Let the other processors help propagate the message. In the first step, processor 1 sends half of the data (size 4*p) to processor 2. In the second step, both processors can now participate: processor 1 sends one half of the data it received in step 1 (size 2*p) to processor 3, while processor 2 sends one half of the data it received in step 1 to processor 4. In the third step, four processors (1, 3, 2, and 4) send the final messages (size p) to the remaining four processors (5, 6, 7, and 8).
Number of steps: log2(N)
Amount of data transferred: log2(N) * N * p/2

Compare the number of steps in the two approaches:
Simple approach: N-1
Smarter approach: log2(N)
The latter scales better as N increases.
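A matching C sketch for the smarter scatter, under the same assumptions (ranks from 0, power-of-two process count) plus the simplifying assumption that every rank supplies a buffer large enough to stage the chunks it forwards; MPI_Scatter provides this functionality without any of this bookkeeping.

#include <mpi.h>

/* Binomial-tree scatter from rank 0.  'chunk' is the number of ints
   destined for each rank.  For simplicity, 'buf' is assumed to have room
   for size*chunk ints on every rank, because intermediate ranks stage the
   chunks they forward; on return, this rank's own chunk is in buf[0..chunk-1]. */
void tree_scatter(int *buf, int chunk, MPI_Comm comm)
{
    int rank, size, mask;
    MPI_Status status;

    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    for (mask = size / 2; mask > 0; mask /= 2) {
        if (rank % (2 * mask) == 0)
            /* pass the upper 'mask' chunks currently held on to rank+mask */
            MPI_Send(buf + mask * chunk, mask * chunk, MPI_INT,
                     rank + mask, 0, comm);
        else if (rank % (2 * mask) == mask)
            /* receive the chunks destined for ranks rank .. rank+mask-1 */
            MPI_Recv(buf, mask * chunk, MPI_INT, rank - mask, 0, comm,
                     &status);
    }
}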
References
Book: Using MPI: Portable Parallel Programming with the Message-Passing Interface, by William Gropp, Ewing Lusk, and Anthony Skjellum. MIT Press, 1994, 328 pages.
World Wide Web: A sampling of programs available on the Web that illustrate commands covered in this module.
Frequently Asked Questions
Lab: MPI Collective Communication.
Quiz: Take a multiple-choice quiz on this material, and submit it for grading.
EXTRA CREDIT: Take a multiple-choice quiz on this material, and submit it for grading.
Evaluation: Please complete this short evaluation form. Thank you!
URL: http://www.tc.cornell.edu/Edu/Talks/MPI/Collective/more.html