Collective routine calls can return as soon as their participation in the collective communication is complete. The completion of a call indicates that the caller is now free to access locations in the communication buffer. It does not indicate that other processors in the group have completed or even started their operations (unless otherwise indicated in the description of the operation). Thus, a collective communication call may, or may not, have the effect of synchronizing all calling processes (except for a barrier call).
MPI guarantees that a message generated by collective communication calls will not be confused with a message generated by point-to-point communication.
The key concept of the collective functions is to have a "group" of participating processes. The routines do not have a group identifier as an explicit argument. Instead, there is a communicator that can be thought of as a group identifier linked with a context.
A barrier is simply a synchronization primitive. A node calling it will block until all the nodes within the group have called it. The syntax is given by
MPI_Barrier(MPI_Comm comm);
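As a simple sketch, a barrier can be used to make sure that every process has finished one phase of work before any process begins the next; do_phase1 and do_phase2 are hypothetical user routines:

do_phase1();                     /* each process finishes its part of phase 1   */
MPI_Barrier(MPI_COMM_WORLD);     /* no process returns until all have arrived   */
do_phase2();                     /* phase 2 starts only after phase 1 has
                                    completed on every process                  */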
Often one node has data that is needed by a collection of other nodes. MPI provides the broadcast primitive to accomplish this task:
MPI_Bcast(void *buf, int count, MPI_Datatype datatype, int root, MPI_Comm comm);
For example, the following fragment loads an array on process 0 and broadcasts it to every process in MPI_COMM_WORLD (load is a user routine):

int rank;
float x[100];
MPI_Init(&argc, &argv);
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
if (rank == 0) {
    load(x);                                           /* only the root fills x */
    MPI_Bcast(x, 100, MPI_FLOAT, 0, MPI_COMM_WORLD);   /* root sends x          */
} else {
    MPI_Bcast(x, 100, MPI_FLOAT, 0, MPI_COMM_WORLD);   /* others receive x      */
}
Gather and scatter are inverse operations. Gather collects data from every member of the group (including the root) on the root node in linear order by the rank of the node (that is, the data are concatenated together node-wise on the root). Scatter parcels out data from the root to every member of the group in linear order by node. MPI provides two variants of the gather/scatter operations: one in which the number of data items collected from/sent to each node can be different; and a more efficient one in the special case where the number per node is uniform. You call the uniform case versions of gather and scatter like this:
MPI_Gather(void *sndbuf, int scount, MPI_Datatype sdatatype, void *recvbuf, int rcount, MPI_Datatype rdatatype, int root, MPI_Comm comm);
MPI_Scatter(void *sndbuf, int scount, MPI_Datatype sdatatype, void *recvbuf, int rcount, MPI_Datatype rdatatype, int root, MPI_Comm comm);
MPI has another primitive, MPI_Gatherv, which extends MPI_Gather by allowing a varying count of data from each process. It has the following format in the C binding:
MPI_Gatherv(void *sndbuf, int scount, MPI_Datatype sdatatype, void *recvbuf, int *rcounts, int *displs, MPI_Datatype rdatatype, int root, MPI_Comm comm);
The inverse operation to MPI_Gatherv is MPI_Scatterv, which has the following format in the C binding:
MPI_Scatterv(void *sndbuf, int *scounts, int *displs, MPI_Datatype sdatatype, void *recvbuf, int rcount, MPI_Datatype rdatatype, int root, MPI_Comm comm);
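As a sketch of the varying-count case, the fragment below gathers rank+1 floats from each process onto the root. Here comm is assumed to be a valid communicator, stdlib.h is assumed to be included for malloc, and the process count is assumed not to exceed the fixed send buffer size:

int i, rank, size;
float sendbuf[64];                        /* assumes size <= 64                     */
float *recvbuf = NULL;
int *rcounts = NULL, *displs = NULL;

MPI_Comm_rank(comm, &rank);
MPI_Comm_size(comm, &size);
for (i = 0; i <= rank; i++)               /* process of rank r contributes r+1 values */
    sendbuf[i] = (float) rank;
if (rank == 0) {                          /* only the root needs counts, displacements,
                                             and a receive buffer                    */
    rcounts = (int *) malloc(size * sizeof(int));
    displs  = (int *) malloc(size * sizeof(int));
    for (i = 0; i < size; i++) {
        rcounts[i] = i + 1;
        displs[i]  = (i * (i + 1)) / 2;   /* data from rank i follows 1+2+...+i items */
    }
    recvbuf = (float *) malloc(((size * (size + 1)) / 2) * sizeof(float));
}
MPI_Gatherv(sendbuf, rank + 1, MPI_FLOAT,
            recvbuf, rcounts, displs, MPI_FLOAT, 0, comm);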
For example, the following fragment scatters two arrays from the root, forms the element-wise product on each process, and gathers the result back on the root:

float *x, *y, x0[100], y0[100];
int i, rank, size;
. . .
MPI_Comm_rank(comm, &rank);
MPI_Comm_size(comm, &size);
if (rank == 0) {
    .... allocate memory for x, y ....
    .... load x, y ....
}
MPI_Scatter(x, 100, MPI_FLOAT, x0, 100, MPI_FLOAT, 0, comm);
MPI_Scatter(y, 100, MPI_FLOAT, y0, 100, MPI_FLOAT, 0, comm);
for (i = 0; i < 100; i++)
    x0[i] = x0[i] * y0[i];
MPI_Gather(x0, 100, MPI_FLOAT, x, 100, MPI_FLOAT, 0, comm);
. . .
An allgather operation provides a more efficient way to do a gather followed by a broadcast: all members of the group receive the collected data.
MPI_Allgather(void *sndbuf, int scount, MPI_Datatype sdatatype, void *recvbuf, int rcount, MPI_Datatype rdatatype, MPI_Comm comm);
Similarly, an allgatherv operation provides a more efficient way to do a gatherv followed by a broadcast: all members of the group receive the collected data.
MPI_Allgatherv(void *sndbuf, int scount, MPI_Datatype sdatatype, void *recvbuf, int *rcounts, int *displs, MPI_Datatype rdatatype, MPI_Comm comm);
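For instance, with the uniform allgather each process can contribute a single value and end up with the whole vector of contributions. A minimal sketch, assuming rank, size, and comm have already been set up and that there are at most 64 processes:

int myval = rank;                /* this process's contribution                 */
int allvals[64];                 /* assumes size <= 64                          */
MPI_Allgather(&myval, 1, MPI_INT, allvals, 1, MPI_INT, comm);
/* on every process, allvals[i] now holds the value contributed by rank i */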
MPI also provides all-to-all operations (MPI_Alltoall and its varying-count form MPI_Alltoallv), in which every process sends distinct data to every other process. The uniform-count version has the following C binding:
MPI_Alltoall(void *sndbuf, int scount, MPI_Datatype sdatatype, void *recvbuf, int rcount, MPI_Datatype rdatatype, MPI_Comm comm);
Note that these are typically very expensive operations since they involve communicating a lot of data. You can find the details of their calling sequences in the MPI reference.
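As an illustration, the sketch below has every process send one distinct int to each of the others (rank, size, and comm are assumed to be set up, and the buffers assume at most 64 processes):

int i, sendbuf[64], recvbuf[64];
for (i = 0; i < size; i++)
    sendbuf[i] = 100 * rank + i;               /* element i is destined for process i */
MPI_Alltoall(sendbuf, 1, MPI_INT, recvbuf, 1, MPI_INT, comm);
/* process j now holds recvbuf[i] == 100*i + j, one value from every process */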
Some of the most frequently used collective operations are global reductions, or combine operations. A global reduction combines partial results from each node in the group using a basic function, such as sum, max, or min, and returns the answer to the root node:
MPI_Reduce(void *sndbuf, void *recvbuf, int count, MPI_Datatype datatype, MPI_Op op, int root, MPI_Comm comm);
The predefined operators that can be supplied for op are listed below.

MPI Operator    Operation
----------------------------------------------
MPI_MAX         maximum
MPI_MIN         minimum
MPI_SUM         sum
MPI_PROD        product
MPI_LAND        logical and
MPI_BAND        bitwise and
MPI_LOR         logical or
MPI_BOR         bitwise or
MPI_LXOR        logical exclusive or
MPI_BXOR        bitwise exclusive or
MPI_MAXLOC      max value and location
MPI_MINLOC      min value and location
The basic C datatypes on which these operations are defined fall into three groups:

Group            MPI Datatypes
----------------------------------------------
Integer          MPI_INT, MPI_LONG, MPI_SHORT, MPI_UNSIGNED_SHORT, MPI_UNSIGNED, MPI_UNSIGNED_LONG
Floating point   MPI_FLOAT, MPI_DOUBLE
Byte             MPI_BYTE
Combining the two charts above, the valid datatype groups for each operation are given below.
MPI Operator                   Allowed Types
--------------------------------------------------
MPI_MAX, MPI_MIN               Integer, Floating point
MPI_SUM, MPI_PROD              Integer, Floating point
MPI_LAND, MPI_LOR, MPI_LXOR    Integer
MPI_BAND, MPI_BOR, MPI_BXOR    Integer, Byte
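As a simple sketch, a global sum of one partial result per process, delivered to process 0, looks like this (local_work is a hypothetical routine that computes the local contribution; rank and comm are assumed to be set up):

float partial = local_work();    /* hypothetical local computation              */
float total;
MPI_Reduce(&partial, &total, 1, MPI_FLOAT, MPI_SUM, 0, comm);
/* total is defined only on the root (rank 0) */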
In order to use MPI_MINLOC and MPI_MAXLOC in a reduce operation, you must provide a datatype argument that represents a pair (value and index). MPI provides several such predefined pair datatypes; you can use MPI_MINLOC and MPI_MAXLOC with each of the following.
In the C binding:
MPI Datatype       Description
--------------------------------------
MPI_FLOAT_INT      float and int
MPI_DOUBLE_INT     double and int
MPI_LONG_INT       long and int
MPI_2INT           pair of int
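For example, to find the largest value across the group together with the rank that owns it, each process can pack a (value, rank) pair into a struct laid out like MPI_DOUBLE_INT; my_largest_value is assumed to have been computed locally, and rank and comm to be set up:

struct { double val; int rank; } in, out;
in.val  = my_largest_value;      /* assumed local result                        */
in.rank = rank;
MPI_Reduce(&in, &out, 1, MPI_DOUBLE_INT, MPI_MAXLOC, 0, comm);
/* on the root, out.val is the global maximum and out.rank the rank that holds it */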
The allreduce variant returns the result of the reduction to all members of the group:
MPI_Allreduce(void *sndbuf, void *recvbuf, int count, MPI_Datatype datatype, MPI_Op op, MPI_Comm comm);
MPI includes a reduce-scatter operation that has the following syntax in C:
MPI_Reduce_scatter(void *sndbuf, void *recvbuf, int *recvcounts, MPI_Datatype datatype, MPI_Op op, MPI_Comm comm);
The MPI_Reduce_scatter routine is functionally equivalent to an MPI_Reduce with count equal to the sum of the recvcounts[i], followed by an MPI_Scatterv with sendcounts equal to recvcounts. It is included as a separate primitive because it allows a more efficient implementation. You can find the details in the MPI document.
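As a sketch of that equivalence, suppose there are exactly four processes and each contributes a vector of eight ints; with recvcounts = {2, 2, 2, 2}, process i ends up with elements 2i and 2i+1 of the element-wise sum (rank and comm are assumed to be set up):

int i, sendbuf[8], recvbuf[2];
int recvcounts[4] = {2, 2, 2, 2};            /* assumes exactly 4 processes      */
for (i = 0; i < 8; i++)
    sendbuf[i] = rank + i;                   /* this process's contribution      */
MPI_Reduce_scatter(sendbuf, recvbuf, recvcounts, MPI_INT, MPI_SUM, comm);
/* recvbuf on process p now holds elements 2p and 2p+1 of the summed vector */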
A scan, or a prefix-reduction operation, performs partial reductions on distributed data. Specifically, the scan operation returns the reduction of the values in the send buffers of processes ranked 0, 1, ..., n into the receive buffer of the node ranked n. The syntax is:
MPI_Scan(void *sndbuf, void* recvbuf, int count, MPI_Datatype datatype, MPI_Op op, MPI_Comm comm);
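A common use of scan is to compute each process's starting offset into a global array from the local item counts; count_my_items is a hypothetical routine, and comm is assumed to be set up:

int nlocal = count_my_items();   /* hypothetical: number of locally owned items */
int end, offset;
MPI_Scan(&nlocal, &end, 1, MPI_INT, MPI_SUM, comm);  /* inclusive prefix sum     */
offset = end - nlocal;           /* items owned by lower-ranked processes        */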
Note that collective operations must still be performed in the same order on each node to ensure correctness and to avoid deadlocks.
Here, we start to introduce some MPI group and communicator routines by example.
int rank, count, count1;
int *sbuf, *rbuf, *sbuf1, *rbuf1;            /* assumed allocated and filled elsewhere */
MPI_Comm wcomm, subcomm;
MPI_Group wgroup, group1;
static int ranks[] = {0};

MPI_Init(&argc, &argv);
wcomm = MPI_COMM_WORLD;
MPI_Comm_rank(wcomm, &rank);
MPI_Comm_group(wcomm, &wgroup);              /* group underlying MPI_COMM_WORLD        */
MPI_Group_excl(wgroup, 1, ranks, &group1);   /* group1 = all processes except rank 0   */
MPI_Comm_create(wcomm, group1, &subcomm);    /* communicator for group1; processes not
                                                in group1 get MPI_COMM_NULL            */
if (rank != 0) {
    ..... /* compute on nodes within group1 */
    MPI_Reduce(sbuf, rbuf, count, MPI_INT, MPI_SUM, 1, subcomm);
    .....
    MPI_Comm_free(&subcomm);                 /* only members of subcomm may free it    */
}
MPI_Reduce(sbuf1, rbuf1, count1, MPI_INT, MPI_SUM, 0, wcomm);
MPI_Group_free(&group1);
MPI_Group_free(&wgroup);
MPI_Finalize();
It is good practice to free the groups and communicators created in a program, using the routines MPI_Comm_free and MPI_Group_free.