Cornell Theory Center


Discussion:
MPI Collective Communication I

10/98

This is the in-depth discussion layer of a two-part module. For an explanation of the layers and how to navigate within and between them, return to the top page of this module.


There are certain communication patterns that appear in many different types of applications. Rather than requiring each programmer to code these using point-to-point communication, MPI provides routines that handle these patterns for you, called collective communication routines. This module surveys these routines, the communication pattern each one establishes, and their syntax. In the lab exercise, you will add collective communication routines to an incomplete program.


Table of Contents

  1. Introduction
  2. MPI Collective Communication Routines
    2.1 Characteristics
    2.2 Barrier Synchronization Routine
    2.3 Data Movement Routines
    2.3.1 Broadcast
    2.3.2 Gather and Scatter
    2.3.3 Gatherv and Scatterv
    2.3.4 Allgather
    2.3.5 Alltoall
    2.4 Global Computation Routines
  3. Performance Issues
    3.1 Example: Broadcast to 8 Processors
    3.2 Example: Scatter to 8 Processors




1. Introduction

Collective communication involves all the processes in a communicator. (Communicators were introduced in Basics of MPI Programming.) The purpose of collective communication is to manipulate a "common" piece or set of information. Collective communication routines are built on top of point-to-point communication routines. You could write your own collective routines from point-to-point calls, but doing so involves a lot of tedious work and will probably not be as efficient.

Although other message-passing libraries provide some collective communication calls, none of them provides a set of collective communication routines as complete and robust as those provided by MPI. In this talk, we introduce these routines in three categories: synchronization, data movement, and global computation.



2. MPI Collective Communication Routines


2.1 Characteristics

MPI collective communication routines differ in several ways from the MPI point-to-point communication routines introduced in MPI Point-to-Point Communication I: a collective routine involves every process in the communicator, and every process must call it; collective calls carry no message tags; the amount of data sent must exactly match the amount of data specified by the receiver; and (in MPI-1) collective routines come in blocking versions only.

MPI collective communication can be divided into three subsets: synchronization, data movement, and global computation, which are covered in the following three sections.


2.2 Barrier Synchronization Routine

In parallel applications in the distributed memory environment, explicit or implicit synchronization is sometimes required. As with other message-passing libraries, MPI provides a function call, MPI_BARRIER, to synchronize all processes within a communicator. A barrier is simply a synchronization primitive. A node calling it will be blocked until all the nodes within the group have called it. The syntax of MPI_BARRIER for both C and FORTRAN programs is given below:
C
int MPI_Barrier (MPI_Comm comm)

where:
MPI_Comm is the MPI predefined handle type for communicators, and
comm is a communicator.

FORTRAN
MPI_BARRIER (comm, ierr)

where:
comm is an integer denoting a communicator
ierr is an integer return error code.
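
To illustrate, here is a minimal C sketch (our own example, not part of the original module; the setup and timed sections are placeholders). A common use of MPI_BARRIER is to line all processes up before starting a timer, so that the timers start together:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank;
    double t0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* ... per-process setup work would go here ... */

    MPI_Barrier(MPI_COMM_WORLD);   /* no process passes this point until all have reached it */
    t0 = MPI_Wtime();              /* timers now start (nearly) together */

    /* ... timed section would go here ... */

    printf("process %d started timing at %f\n", rank, t0);
    MPI_Finalize();
    return 0;
}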


2.3 Data Movement Routines

MPI provides three basic types of collective data movement routines: broadcast, gather, and scatter. There are also allgather and alltoall routines. The gather, scatter, allgather, and alltoall routines have variable-data ("v") versions, in which each process can send and/or receive a different number of elements. The MPI collective data movement routines are:

MPI_BCAST
MPI_GATHER, MPI_GATHERV
MPI_SCATTER, MPI_SCATTERV
MPI_ALLGATHER, MPI_ALLGATHERV
MPI_ALLTOALL, MPI_ALLTOALLV

Now, let's take a look at the functionality and syntax of these routines.


2.3.1 Broadcast

In many cases, one processor needs to send (broadcast) some data (either a scalar or vector) to all the processes in a group. MPI provides the broadcast primitive MPI_BCAST to accomplish this task. The syntax of the MPI_BCAST is given by:
C
int MPI_Bcast (void* buffer, int count, MPI_Datatype datatype, int root, MPI_Comm comm)
FORTRAN
MPI_BCAST (buffer, count, datatype, root, comm, ierr)

where:
buffer is the starting address of a buffer,
count is an integer indicating the number of data elements in the buffer,
datatype is an MPI-defined constant indicating the data type of the elements in the buffer,
root is an integer indicating the rank of the broadcast root process, and
comm is the communicator.

MPI_BCAST must be called by every process in the group, specifying the same comm and root. The message is sent from the root process to all processes in the group, including the root process itself.
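
As an illustration, here is a minimal C sketch (our own example, not part of the original module; the variable n and its value are made up). The root obtains a problem size and broadcasts it so that every process ends up with the same value:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, n = 0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0)
        n = 100;                        /* value known only to the root */

    /* every process, including the root, makes the same call */
    MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD);

    printf("process %d: n = %d\n", rank, n);
    MPI_Finalize();
    return 0;
}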


2.3.2 Gather and Scatter

If an array is distributed across all the processes in a group and one wants to collect each piece into a single array on a specified process, in the order of process rank, the call to use is GATHER. Conversely, if one wants to split an array into n equal segments and send the ith segment to the ith process of an n-process group, use SCATTER. MPI provides two variants of the gather/scatter operations: one in which the number of data items collected from or sent to each process can differ, and a more efficient one for the special case where the number per process is the same. Their syntax is given below:
C
int MPI_Gather (void* sbuf, int scount, MPI_Datatype stype, void* rbuf, int rcount, MPI_Datatype rtype, int root, MPI_Comm comm )
int MPI_Scatter (void* sbuf, int scount, MPI_Datatype stype, void* rbuf, int rcount, MPI_Datatype rtype, int root, MPI_Comm comm)
FORTRAN
MPI_GATHER (sbuf, scount, stype, rbuf, rcount, rtype, root, comm, ierr)
MPI_SCATTER (sbuf, scount, stype, rbuf, rcount, rtype, root, comm, ierr)

where, for the Gather routines:
sbuf is the starting address of send buffer,
scount is the number of elements in the send buffer,
stype is the data type of send buffer elements,
rbuf is the starting address of the receive buffer,
rcount is the number of elements for any single receive,
rtype is the data type of the receive buffer elements,
root is the rank of receiving process, and
comm is the communicator.

and for the Scatter routines:
sbuf is the address of the send buffer,
scount is the number of elements sent to each process,
stype is the data type of the send buffer elements,
rbuf is the address of the receive buffer,
rcount is the number of elements in the receive buffer,
rtype is the data type of the receive buffer elements,
root is the rank of the sending process, and
comm is the communicator.

In the gather operation, each process (the root included) sends the scount elements of type stype in its sbuf to the root process. The root receives these messages and stores them in rank order in rbuf. For scatter, the reverse holds: the root process sends a buffer of N chunks of data (N = number of processes in the group), so that process 0 gets the first chunk, process 1 gets the second chunk, and so on.


Gather & Scatter Effect

In order to illustrate these functions, we give a graph below:

[Figure: the gather and scatter communication patterns - gather collects one piece from every process onto the root; scatter distributes one piece from the root to every process]

This picture is augmented by the following example for gather.


Sample Code


      INCLUDE 'mpif.h'
      DIMENSION A(25,100), b(100), cpart(25), ctotal(100)
      INTEGER root, ierr
      DATA root/0/

C     Each process computes its 25-element piece of the product c = A*b
      DO I=1,25
            cpart(I) = 0.
            DO K=1,100
                  cpart(I) = cpart(I) + A(I,K)*b(K)
            END DO
      END DO
C     Gather the 25-element pieces into ctotal on the root, in rank order
      call MPI_GATHER(cpart, 25, MPI_REAL, ctotal, 25, MPI_REAL,
     &      root, MPI_COMM_WORLD, ierr)

The scatter operation is the natural counterpart: before a computation like the one above, the root can use MPI_SCATTER to deal out equal pieces of an array to all processes, as shown in the sketch below.
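Here is a minimal C sketch of that use of MPI_SCATTER (our own example; the array names and the piece size of 25 are illustrative, not taken from the original fragments):

#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int rank, size, i;
    double *x = NULL;      /* full vector, allocated on the root only */
    double xpart[25];      /* each process's piece (assumed size 25) */

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0) {
        x = (double *) malloc((size_t) size * 25 * sizeof(double));
        for (i = 0; i < size * 25; i++)
            x[i] = (double) i;
    }

    /* piece i of x goes to the process of rank i; the root keeps piece 0 */
    MPI_Scatter(x, 25, MPI_DOUBLE, xpart, 25, MPI_DOUBLE,
                0, MPI_COMM_WORLD);

    /* ... each process now operates on its own 25 elements in xpart ... */

    if (rank == 0) free(x);
    MPI_Finalize();
    return 0;
}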


2.3.3 Gatherv and Scatterv

MPI_GATHERV and MPI_SCATTERV are the variable-count versions of MPI_GATHER and MPI_SCATTER. MPI_GATHERV extends the functionality of MPI_GATHER by changing the receive count from a single integer to an integer array and by adding a new array argument, displs. MPI_GATHERV therefore allows a varying amount of data from each process and gives some flexibility in where the gathered data is placed on the root. MPI_SCATTERV extends MPI_SCATTER in the same way.

C
int MPI_Gatherv (void* sbuf, int scount, MPI_Datatype stype, void* rbuf, int* rcount, int* displs, MPI_Datatype rtype, int root, MPI_Comm comm)
int MPI_Scatterv (void* sbuf, int* scount, int* displs, MPI_Datatype stype, void* rbuf, int rcount, MPI_Datatype rtype, int root, MPI_Comm comm)
FORTRAN
MPI_GATHERV (sbuf, scount, stype, rbuf, rcount, displs, rtype, root, comm, ierr)
MPI_SCATTERV (sbuf, scount, displs, stype, rbuf, rcount, rtype, root, comm, ierr)

The variables for Gatherv are:
sbuf starting address of send buffer,
scount number of elements in send buffer,
stype data type of send buffer elements,
rbuf starting address of receive buffer,
rcount array containing number of elements to be received from each process,
displs array specifying the displacement relative to rbuf at which to place the incoming data from corresponding process,
rtype data type of receive buffer elements,
root rank of receiving process,
comm group communicator.

The variables for Scatterv are:
sbuf address of send buffer,
scount integer array specifying the number of elements to send to each process,
displs array specifying the displacement relative to sbuf from which to take the data going out to the corresponding process,
stype data type of send buffer elements,
rbuf address of receive buffer,
rcount number of elements in receive buffer,
rtype data type of receive buffer elements,
root rank of sending process,
comm group communicator

To illustrate the usage of MPI_GATHERV (MPI_SCATTERV is used in the same way, with the roles of sender and receivers reversed), consider the case where each process contributes a different number of elements to the root; a sketch is given below.
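The following minimal C sketch is our own example, not one of the original Fortran fragments; the choice of rank+1 elements per process is illustrative. Each process contributes a different number of integers, and the root builds rcounts and displs so the pieces land contiguously in rank order:

#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int rank, size, i;
    int *sbuf, *rcounts = NULL, *displs = NULL, *rbuf = NULL;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* each process sends rank+1 copies of its own rank */
    sbuf = (int *) malloc((rank + 1) * sizeof(int));
    for (i = 0; i <= rank; i++)
        sbuf[i] = rank;

    if (rank == 0) {
        /* rcounts, displs, and rbuf are significant only at the root */
        rcounts = (int *) malloc(size * sizeof(int));
        displs  = (int *) malloc(size * sizeof(int));
        for (i = 0; i < size; i++) {
            rcounts[i] = i + 1;                                   /* varying count per process */
            displs[i]  = (i == 0) ? 0 : displs[i-1] + rcounts[i-1];
        }
        rbuf = (int *) malloc((displs[size-1] + rcounts[size-1]) * sizeof(int));
    }

    MPI_Gatherv(sbuf, rank + 1, MPI_INT,
                rbuf, rcounts, displs, MPI_INT, 0, MPI_COMM_WORLD);

    MPI_Finalize();
    return 0;
}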


2.3.4 Allgather

MPI_ALLGATHER can be thought of as MPI_GATHER in which all processes, not just the root, receive the result. The jth block of the receive buffer is the block of data sent from the jth process. A similar relationship holds between MPI_ALLGATHERV and MPI_GATHERV. The syntax of MPI_ALLGATHER and MPI_ALLGATHERV is similar to that of MPI_GATHER and MPI_GATHERV, respectively, except that the argument root is dropped.

C
int MPI_Allgather (void* sbuf, int scount, MPI_Datatype stype, void* rbuf, int rcount, MPI_Datatype rtype, MPI_Comm comm )
int MPI_Allgatherv (void* sbuf, int scount, MPI_Datatype stype, void* rbuf, int* rcount, int* displs, MPI_Datatype rtype, MPI_Comm comm)
FORTRAN
MPI_ALLGATHER (sbuf, scount, stype, rbuf, rcount, rtype, comm, ierr)
MPI_ALLGATHERV (sbuf, scount, stype, rbuf, rcount, displs, rtype, comm, ierr)

The variables for Allgather are:
sbuf starting address of send buffer,
scount number of elements in send buffer,
stype data type of send buffer elements,
rbuf address of receive buffer,
rcount number of elements received from any process,
rtype data type of receive buffer elements,
comm group communicator.
Note: the arguments are the same as for MPI_GATHER and MPI_GATHERV, except that there is no root argument.

Example:

Sample Code

      INCLUDE 'mpif.h'
      DIMENSION A(25,100), b(100), cpart(25), ctotal(100)
      INTEGER ierr

C     Each process computes its 25-element piece of the product c = A*b
      DO I=1,25
            cpart(I) = 0.
            DO K=1,100
                  cpart(I) = cpart(I) + A(I,K)*b(K)
            END DO
      END DO
C     Every process receives the complete 100-element result in ctotal
      call MPI_ALLGATHER(cpart, 25, MPI_REAL, ctotal, 25,
     &      MPI_REAL, MPI_COMM_WORLD, ierr)


2.3.5 Alltoall

In applications such as matrix transpose and the FFT, the MPI_ALLTOALL call is very helpful. It is an extension of ALLGATHER in which each process sends distinct data to each receiver: the jth block sent from process i is received by process j and placed in the ith block of its receive buffer. A graphic representation of MPI_ALLTOALL is shown below:

[Figure: the MPI_ALLTOALL communication pattern - block j of process i's send buffer becomes block i of process j's receive buffer]

The syntax of MPI_ALLTOALL is:

C
int MPI_Alltoall (void* sbuf, int scount, MPI_Datatype stype, void* rbuf, int rcount, MPI_Datatype rtype, MPI_Comm comm )
FORTRAN
MPI_ALLTOALL (sbuf, scount, stype, rbuf, rcount, rtype, comm, ierr)

The variables for Alltoall are:
sbuf is the starting address of send buffer,
scount is the number of elements sent to each process,
stype is the data type of send buffer elements,
rbuf is the address of receive buffer,
rcount is the number of elements received from any process,
rtype is the data type of receive buffer elements,
comm is the group communicator.

Note: Same specification as ALLGATHER, except sbuf must contain scount*NPROC elements.
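
As an illustration, here is a minimal C sketch (our own example, not part of the original module; the values 100*rank+i are arbitrary). Each process prepares one distinct integer for every process; after MPI_ALLTOALL, element j of the receive buffer on process i is the value process j prepared for process i:

#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int rank, size, i;
    int *sbuf, *rbuf;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    sbuf = (int *) malloc(size * sizeof(int));   /* scount*NPROC elements */
    rbuf = (int *) malloc(size * sizeof(int));
    for (i = 0; i < size; i++)
        sbuf[i] = 100 * rank + i;                /* block i goes to process i */

    MPI_Alltoall(sbuf, 1, MPI_INT, rbuf, 1, MPI_INT, MPI_COMM_WORLD);
    /* now rbuf[j] == 100*j + rank on every process */

    MPI_Finalize();
    return 0;
}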


2.4 Global Computation Routines

There are two types of global computation routines: reduce and scan. The operation passed to a global computation routine is either a predefined MPI operation or a user-supplied function. MPI provides four global computation routines and three auxiliary routines for user-supplied functions; the auxiliary routines are covered in the advanced-topics talk in this workshop. In this section we first discuss the reduction routines and then the scan routine.

MPI_REDUCE

One of the most useful collective operations is a global reduction, or combine, operation. The partial result from each process in the group is combined, either on one specified process or on all processes, using a desired function. If there are n processes in the process group and D(i,j) is the jth data item in process i, then the jth item of data in the root process resulting from a reduce routine is given by:

Dj = D(0,j)*D(1,j)* ... *D(n-1,j)

where * is the reduction function, which is always assumed to be associative. All MPI predefined operations are also assumed to be commutative; user-defined operations may be associative but not commutative. Each process can provide either one element or a sequence of elements, in which case the combine operation is executed element-wise on each element of the sequence. There are three versions of reduce: MPI_REDUCE, MPI_ALLREDUCE, and MPI_REDUCE_SCATTER. The form of these reduction primitives is listed below:

C
int MPI_Reduce (void* sbuf, void* rbuf, int count, MPI_Datatype stype, MPI_Op op, int root, MPI_Comm comm)
int MPI_Allreduce (void* sbuf, void* rbuf, int count, MPI_Datatype stype, MPI_Op op, MPI_Comm comm)
int MPI_Reduce_scatter (void* sbuf, void* rbuf, int* rcounts, MPI_Datatype stype, MPI_Op op, MPI_Comm comm)
FORTRAN
MPI_REDUCE (sbuf, rbuf, count, stype, op, root, comm, ierr)
MPI_ALLREDUCE (sbuf, rbuf, count, stype, op, comm, ierr)
MPI_REDUCE_SCATTER (sbuf, rbuf, rcounts, stype, op, comm, ierr)

The difference among the three reduction routines is where the result ends up: MPI_REDUCE returns the combined result to the root process only; MPI_ALLREDUCE returns it to all processes in the group; and MPI_REDUCE_SCATTER scatters the combined result across the processes, with process i receiving rcounts(i) elements of it. The arguments are:

sbuf is the address of send buffer,
rbuf is the address of receive buffer,
count is the number of elements in send buffer,
stype is the data type of elements of send buffer,
op is the reduce operation (which may be MPI predefined, or your own),
root is the rank of the root process,
comm is the group communicator.
Notes:
* for MPI_REDUCE, rbuf is significant only at the root process,
* the argument rcounts in MPI_REDUCE_SCATTER is an array.

Scan

A scan, or prefix-reduction, operation performs partial reductions on distributed data. Specifically, let n be the size of the process group, let D(k,j) be the jth data item in process k after returning from the scan, and let d(k,j) be the jth data item in process k before the scan, for k = 0, 1, ..., n-1. A scan returns

D(k,j) = d(0,j) * d(1,j) * ... *d(k,j)

where * is the reduction function which may be either an MPI predefined function or a user defined function.

The syntax of the MPI scan routine is

C
int MPI_Scan (void* sbuf, void* rbuf, int count, MPI_Datatype datatype, MPI_Op op, MPI_Comm comm)
FORTRAN
MPI_SCAN (sbuf, rbuf, count, datatype, op, comm, ierr)

The variables for Scan are:
sbuf starting address of send buffer,
rbuf starting address of receive buffer,
count number of elements in input buffer,
datatype data type of elements of input buffer,
op operation,
comm group communicator.
Note that a segmented scan can be done by creating a subgroup for each segment.
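
As an illustration, here is a minimal C sketch (our own example, not part of the original module). An MPI_SCAN with MPI_SUM computes inclusive prefix sums across the ranks, so process k ends up with the sum of the contributions of processes 0 through k:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, val, psum;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    val = rank + 1;                /* each process contributes rank+1 */
    MPI_Scan(&val, &psum, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);

    /* process k prints 1 + 2 + ... + (k+1) */
    printf("process %d: prefix sum = %d\n", rank, psum);

    MPI_Finalize();
    return 0;
}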

Predefined Reduce Operations

Examples of MPI predefined operations are summarized in Table 1. MPI also provides a mechanism for user-defined operations used in MPI_ALLREDUCE and MPI_REDUCE.

MPI Predefined Reduce Operations
Name        Meaning                  C type                       FORTRAN type
MPI_MAX     maximum                  integer, float               integer, real, complex
MPI_MIN     minimum                  integer, float               integer, real, complex
MPI_SUM     sum                      integer, float               integer, real, complex
MPI_PROD    product                  integer, float               integer, real, complex
MPI_LAND    logical and              integer                      logical
MPI_BAND    bit-wise and             integer, MPI_BYTE            integer, MPI_BYTE
MPI_LOR     logical or               integer                      logical
MPI_BOR     bit-wise or              integer, MPI_BYTE            integer, MPI_BYTE
MPI_LXOR    logical xor              integer                      logical
MPI_BXOR    bit-wise xor             integer, MPI_BYTE            integer, MPI_BYTE
MPI_MAXLOC  max value and location   combination of int, float,   combination of integer, real,
                                     double, and long double      complex, double precision
MPI_MINLOC  min value and location   combination of int, float,   combination of integer, real,
                                     double, and long double      complex, double precision

Example

  INCLUDE 'mpif.h'
  INTEGER maxht, globmx, taskid, ierr
      .
      .  (MPI_COMM_RANK has returned taskid; calculations
      .   determine the local maximum height maxht)
      .
C Combine the local maxima with MPI_MAX; the global maximum arrives
C in globmx on the root process (rank 0) only
  call MPI_REDUCE (maxht, globmx, 1, MPI_INTEGER, MPI_MAX, 0, MPI_COMM_WORLD, ierr)
  IF (taskid.eq.0) THEN
      .
      .  (Write output)
      .
  END IF

User-defined Operations

Besides the predefined operations, MPI lets you register your own reduction function with the auxiliary routine MPI_OP_CREATE (and release it with MPI_OP_FREE); these auxiliary routines are covered in detail in the advanced-topics talk. A brief preview is sketched below.
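
The following minimal C sketch is our own illustration under assumptions of our own (the Cplx type, the function name cplx_prod, and the values are made up): a combining function is registered with MPI_OP_CREATE and the resulting handle is then used like any predefined operation:

#include <mpi.h>

typedef struct { double re, im; } Cplx;

/* user function: inoutvec[i] = invec[i] op inoutvec[i], element-wise */
void cplx_prod(void *invec, void *inoutvec, int *len, MPI_Datatype *dtype)
{
    Cplx *in = (Cplx *) invec, *inout = (Cplx *) inoutvec;
    int i;
    for (i = 0; i < *len; i++) {
        Cplx c;
        c.re = in[i].re * inout[i].re - in[i].im * inout[i].im;
        c.im = in[i].re * inout[i].im + in[i].im * inout[i].re;
        inout[i] = c;
    }
}

int main(int argc, char **argv)
{
    Cplx a, result;
    MPI_Op myop;
    MPI_Datatype ctype;
    int rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    a.re = 1.0; a.im = 0.01 * rank;          /* each process's local value */

    MPI_Type_contiguous(2, MPI_DOUBLE, &ctype);  /* a complex number as two doubles */
    MPI_Type_commit(&ctype);
    MPI_Op_create(cplx_prod, 1, &myop);          /* 1 = the operation is commutative */

    /* result (the product of all the local values) is significant only on rank 0 */
    MPI_Reduce(&a, &result, 1, ctype, myop, 0, MPI_COMM_WORLD);

    MPI_Op_free(&myop);
    MPI_Type_free(&ctype);
    MPI_Finalize();
    return 0;
}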



3. Performance Issues



3.1 Example: Broadcast to 8 Processors

Example of two ways (simple way, better way) to do a broadcast.


Simple approach: One processor broadcasts the same message to all the other processors, one at a time.




Smarter approach: Let other processors help propagate the message. In the first step, processor 1 sends the message to processor 2. In the second step, both processors can now participate: processor 1 sends to processor 3, while processor 2 sends to processor 4. Similarly, by the third step, four processors (1, 3, 2, and 4) are sending the message to the four remaining processors (5, 6, 7, and 8).


Compare the two approaches for N processors and a message of size p:

Simple approach: N-1 steps; the root sends N-1 messages of size p.
Smarter approach: log2(N) steps; at each step every participating process sends at most one message of size p.

The latter scales better as N increases.
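
To make the smarter pattern concrete, here is a minimal C sketch (our own illustration built from point-to-point calls, not the algorithm an MPI library is required to use) of a tree broadcast from rank 0. At each step, every process that already holds the message forwards it to the process whose rank is higher by the current step size (1, then 2, then 4, ...):

#include <mpi.h>

/* Called by every process in comm; assumes the data starts on rank 0.
 * Has the same effect as MPI_Bcast(buf, count, type, 0, comm). */
void tree_bcast(void *buf, int count, MPI_Datatype type, MPI_Comm comm)
{
    int rank, size, step;
    MPI_Status status;

    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    for (step = 1; step < size; step *= 2) {
        if (rank < step) {
            int dest = rank + step;        /* a process that does not yet have the message */
            if (dest < size)
                MPI_Send(buf, count, type, dest, 0, comm);
        } else if (rank < 2 * step) {
            /* receive the message from the partner that already has it */
            MPI_Recv(buf, count, type, rank - step, 0, comm, &status);
        }
    }
}

With 8 processes this takes 3 steps (0 sends to 1; 0 and 1 send to 2 and 3; 0 through 3 send to 4 through 7), matching the log2(N) count above.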


3.2 Example: Scatter to 8 Processors

Example of two ways (simple way, better way) to do a scatter.


Simple approach: One processor scatters the N messages to the other processors, one at a time.




Smarter approach: Let other processors help propagate the message. In the first step, processor 1 sends half of the data (size 4*p) to processor 2. In the second step, both processors can now participate: processor 1 sends one half of the data it received in step 1 (size 2*p) to processor 3, while processor 2 sends one half of the data it received in step 1 to processor 4. At the third step, four processors (1, 3, 2, and 4) are sending the final messages (size p) to the remaining four processors (5, 6, 7, and 8).


Compare the two approaches for N processors and final pieces of size p:

Simple approach: N-1 steps; the root sends N-1 messages of size p.
Smarter approach: log2(N) steps; the root still sends a total of (N-1)*p data, but the message size halves at each step and the number of message start-ups drops from N-1 to log2(N).

The latter scales better as N increases.
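
Again as our own illustration (a minimal C sketch assuming rank 0 is the root, a power-of-two number of processes, and integer data; tree_scatter is a hypothetical helper, not an MPI routine), the smarter scatter repeatedly hands off the half of the data destined for the partner's half of the group, so the message size halves at every step:

#include <mpi.h>
#include <stdlib.h>
#include <string.h>

/* Called by every process; equivalent in effect to
 * MPI_Scatter(sendbuf, count, MPI_INT, recvbuf, count, MPI_INT, 0, comm)
 * under the stated assumptions. */
void tree_scatter(const int *sendbuf, int *recvbuf, int count, MPI_Comm comm)
{
    int rank, size, mask;
    int *tmp;
    MPI_Status status;

    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    /* tmp holds blocks indexed by destination rank */
    tmp = (int *) malloc((size_t) size * count * sizeof(int));
    if (rank == 0)
        memcpy(tmp, sendbuf, (size_t) size * count * sizeof(int));

    for (mask = size / 2; mask >= 1; mask /= 2) {
        if (rank % (2 * mask) == 0) {
            /* I hold blocks [rank, rank+2*mask); ship the upper half */
            MPI_Send(&tmp[(rank + mask) * count], mask * count, MPI_INT,
                     rank + mask, 0, comm);
        } else if (rank % (2 * mask) == mask) {
            /* receive blocks [rank, rank+mask) from rank-mask */
            MPI_Recv(&tmp[rank * count], mask * count, MPI_INT,
                     rank - mask, 0, comm, &status);
        }
    }

    /* keep only my own block */
    memcpy(recvbuf, &tmp[rank * count], (size_t) count * sizeof(int));
    free(tmp);
}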





