This is the in-depth discussion layer of a two-part module. For an explanation of the layers and how to navigate within and between them, return to the top page of this module.
This module takes a closer look at how scatter/gather collective communication is used when the data on the root process is not contiguous, when different processes need to send or receive messages of different sizes, or when the order in which data is assigned to (or collected from) processes differs from the rank order.
References
Lab Exercises
Quiz
Evaluation
Hello, this module is MPI Collective Communication II. Our purpose in this brief module is to introduce you to some features of MPI that allow for greater flexibility when you are using MPI collective operations like gather or scatter.
MPI_Scatterv, MPI_Gatherv, MPI_Allgatherv, MPI_Alltoallv
What does the "v" stand for?
- varying -- size, relative location of messages
Examples for discussion: MPI_Gatherv and MPI_Scatterv
I guess you could say that this module is brought to you by the letter V! The MPI routines we'll be talking about all end with V--for example, MPI_Scatterv. You can pretend that the V stands for something like "varying" or "variable". This is because these routines allow you to vary both the (1) size and the (2) memory locations of the messages that you are using for the communication. We will take MPI_Scatterv and MPI_Gatherv as examples, but you can easily extend these ideas to MPI_Allgatherv and MPI_Alltoallv.
- Advantages
- more flexibility in writing code
- less need to copy data into temporary buffers
- more compact final code
- vendor's implementation may be optimal
(if not, may be trading performance for convenience)
As you'll see, there are several advantages to using these more general MPI calls. The main one is that there is less need to rearrange the data within a processor before doing an MPI collective operation. Furthermore, when you use these calls you give the vendor an opportunity to optimize the data movement for your particular platform. But even if the vendor hasn't bothered to optimize an operation like MPI_Gatherv for you, it will always be more convenient and compact for you to issue just a single call to MPI_Gatherv if your data access pattern is irregular.
2. Scatter vs. Scatterv
2.1 Scatter
Scatter requires contiguous data, uniform message size
- Purpose of scatter operation:
- to send different chunks of data to different processes
- not the same as broadcast (same chunk goes to all)
[Figure: MPI_Scatter sends equal-sized, contiguous chunks of the root's buffer to the processes in rank order.]
For our first example, let's compare scatter and scatterv. Recall that the purpose of a scatter is to send out different chunks of data to different tasks--that is, you're trying to split up your data among the remote processes. Recall also that a scatter is not the same as a broadcast--in a broadcast, the same chunk of data is being sent to everybody. As you can see from the figure, the basic MPI_Scatter requires that the sender's data are stored in contiguous memory addresses and that the chunks are uniform in size. Thus, the first N data items are sent to the first process in the communicator, the next N items go to the second process, and so on. However, for some applications this might be overly restrictive. What if some processes need fewer than N items, for instance? Should you just pad out the sending buffer with unnecessary data? The answer is no: you should use MPI_Scatterv instead.
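Before moving on, here is a minimal sketch of what the plain, uniform MPI_Scatter call just described looks like in Fortran (the names N, SENDBUF, RECVBUF, and ROOT are illustrative choices, not taken from the figure):

```fortran
! Plain MPI_Scatter: ROOT sends N contiguous REALs to each process,
! in rank order, from a single contiguous SENDBUF.
CALL MPI_SCATTER( SENDBUF, N, MPI_REAL,  &
                  RECVBUF, N, MPI_REAL,  &
                  ROOT, MPI_COMM_WORLD, IERROR )
```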
- Extra capabilities in scatterv:
- gaps allowed between messages in source data
- (but individual messages must still be contiguous)
- irregular message sizes allowed
- data can be distributed to processes in any order
[Figure: MPI_Scatterv allows chunks of varying size, gaps between chunks, and distribution to the processes in any order.]
MPI_Scatterv gives you extra capabilities that are most easily described by comparing this figure to the previous one. First of all, you can see that varying numbers of items are allowed to be sent to the different processors. What's more, the first item in each chunk doesn't have to be positioned at some regular spacing away from the first item of the previous chunk. And the chunks don't even have to be stored in the correct sequence! This is the flexibility I alluded to earlier. However, we have not done away with all the restrictions: the chunks themselves must still be contiguous, and they must not overlap with one another. (I will add here as a parenthetical note that if you really, really want to, you can insert spaces into a chunk by creating an MPI derived datatype that has some built-in stride. But that trick may carry with it a potentially severe penalty in efficiency, and the chunks still can't overlap.)
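As a rough sketch of that parenthetical trick (the names COUNT, BLOCKLEN, STRIDE, and STRIDED_TYPE are illustrative assumptions, not part of this module's example):

```fortran
! Build a derived datatype that takes BLOCKLEN items out of every
! STRIDE items, repeated COUNT times, so a "chunk" contains gaps.
CALL MPI_TYPE_VECTOR( COUNT, BLOCKLEN, STRIDE, MPI_REAL, STRIDED_TYPE, IERROR )
CALL MPI_TYPE_COMMIT( STRIDED_TYPE, IERROR )
! STRIDED_TYPE could then be used as the SENDTYPE in MPI_SCATTERV,
! subject to the efficiency caveat mentioned above.
```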
```fortran
INTEGER SENDCOUNTS(0:NPROC-1), DISPLS(0:NPROC-1)
...
CALL MPI_SCATTERV( SENDBUF, SENDCOUNTS, DISPLS, SENDTYPE,
                   RECVBUF, RECVCOUNT, RECVTYPE,
                   ROOT, COMM, IERROR )
```
The syntax for achieving an MPI_Scatterv operation is really pretty straightforward--dare I say, self-explanatory? Anyway, here's what it looks like in Fortran. Compared to plain scatter, there are two differences: the scalar SENDCOUNT argument has become an array, SENDCOUNTS, and a new argument, DISPLS, has been added to the list.
- SENDCOUNTS(I) is the number of items of type SENDTYPE to send from process ROOT to process I. Thus its value is significant only on ROOT.
- DISPLS(I) is the displacement from SENDBUF to the beginning of the I-th message, in units of SENDTYPE. It also has significance only for the ROOT process.
And that's pretty much it! The other arguments in the list simply mirror the usual MPI_Scatter calling sequence. The C syntax differs only in minor C-like details; if you want, you can see more about those details in the MPI Collective Communication module.
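To make the SENDCOUNTS/DISPLS bookkeeping concrete, here is a tiny made-up layout for three processes (the numbers are purely illustrative, not taken from any figure in this module):

```fortran
! ROOT's SENDBUF holds at least 10 items of SENDTYPE (offsets 0 through 9).
! Rank 0 receives 3 items from offsets 0-2, rank 2 receives 4 items from
! offsets 3-6, offset 7 is left unused (a gap), and rank 1 receives 2 items
! from offsets 8-9. Chunks may be out of rank order and may leave gaps,
! but they must not overlap.
INTEGER SENDCOUNTS(0:2), DISPLS(0:2)
DATA SENDCOUNTS / 3, 2, 4 /
DATA DISPLS     / 0, 8, 3 /
```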
```fortran
INTEGER RECVCOUNTS(0:NPROC-1), DISPLS(0:NPROC-1)
...
CALL MPI_GATHERV( SENDBUF, SENDCOUNT, SENDTYPE,
                  RECVBUF, RECVCOUNTS, DISPLS, RECVTYPE,
                  ROOT, COMM, IERROR )
```
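MPI_Gatherv is simply the mirror image of MPI_Scatterv: RECVCOUNTS and DISPLS now describe where each process's contribution lands in RECVBUF and are significant only on ROOT, while every process supplies its own SENDCOUNT for the chunk it contributes. A brief sketch (the name LOCAL_N and the use of MPI_REAL are illustrative assumptions):

```fortran
! Each process contributes LOCAL_N REALs; ROOT places rank I's chunk
! at offset DISPLS(I) in RECVBUF and expects RECVCOUNTS(I) items from it.
CALL MPI_GATHERV( SENDBUF, LOCAL_N, MPI_REAL,             &
                  RECVBUF, RECVCOUNTS, DISPLS, MPI_REAL,  &
                  ROOT, MPI_COMM_WORLD, IERROR )
```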
5. When might these calls be useful?
5.1 Example Problem
Example: Process 0 reads the initial data and distributes it to
other processes, according to the domain decomposition.
[Figure: Process 0 initially holds all of the data; colors indicate the target distribution of points across the processes, including process 0.]
Problem: What if the number of points can't be divided evenly?
A typical use of MPI_Scatterv might arise when you distribute initial data in a data-parallel code. In other words, we're going to assume you've parallelized your code via domain decomposition. Obviously, if the number of points is not divisible by the number of processes, not all the processes will receive the same number of data points. In this simplified illustration, process 0 has all the data initially, while the colors indicate a possible target distribution of data points across processes including process 0.
- Solution: Use MPI_Scatterv with SENDCOUNTS and DISPLS arrays computed as follows...
- (Assume NPTS is not evenly divisible by NPROCS.)

```fortran
NMIN   = NPTS/NPROCS        ! every process gets at least this many points
NEXTRA = MOD(NPTS,NPROCS)   ! leftover points to be spread around
K = 0
DO I = 0, NPROCS-1
   IF (I .LT. NEXTRA) THEN  ! the first NEXTRA processes get one extra point
      SENDCOUNTS(I) = NMIN + 1
   ELSE
      SENDCOUNTS(I) = NMIN
   END IF
   DISPLS(I) = K            ! starting offset of process I's chunk in SENDBUF
   K = K + SENDCOUNTS(I)
END DO
CALL MPI_SCATTERV(SENDBUF, SENDCOUNTS, DISPLS, ...)
```
The code here divides up the points among processors as evenly as possible. First, it figures out how many extra points there are by using the modulo function. Then it loops over the processes. Via the SENDCOUNTS array, it assigns one extra point to each process until all NEXTRA extra points have been accounted for. And while it's doing this, it's using the DISPLS array to keep track of the relative starting locations of the chunks. Finally, MPI_SCATTERV is called with the calculated SENDCOUNTS and DISPLS arrays in order to perform the desired data distribution.
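One practical detail on the receiving side (not shown in the snippet above): each process must pass a RECVCOUNT that matches what ROOT sends it. Here is a hedged sketch of one way to do that, assuming MYRANK has been obtained from MPI_COMM_RANK, that NMIN and NEXTRA are computed on every rank as above, and that MYCOUNT, RECVBUF, and the use of MPI_REAL are our own illustrative choices:

```fortran
! Every rank recomputes its own count with the same rule ROOT used,
! so its RECVCOUNT agrees with SENDCOUNTS(MYRANK) on the root side.
IF (MYRANK .LT. NEXTRA) THEN
   MYCOUNT = NMIN + 1
ELSE
   MYCOUNT = NMIN
END IF
CALL MPI_SCATTERV( SENDBUF, SENDCOUNTS, DISPLS, MPI_REAL,  &
                   RECVBUF, MYCOUNT, MPI_REAL,             &
                   0, MPI_COMM_WORLD, IERROR )   ! process 0 is the root, as in the example
```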
Thanks for your attention! Here are some references that may be helpful, and please do try the exercises so you can get better acquainted with some of the more advanced features of MPI collective communication.
References
Message Passing Interface Forum (June 1995). MPI: A Message-Passing Interface Standard.
A sampling of programs available on the Web that illustrate the commands covered in this module.

Lab Exercises
Lab exercises for MPI Collective Communication II.

Quiz
Take a multiple-choice quiz on this material, and submit it for grading.

Evaluation
Please complete this short evaluation form. Thank you!