This is the in-depth discussion layer of a two-part module. For an explanation of the layers and how to navigate within and between them, return to the top page of this module.
If a point-to-point message-passing routine is called repeatedly with the same arguments, persistent communication can be used to avoid redundancy in setting up the message each time it is sent. Persistent communication reduces the overhead of communication between the parallel task and the network adapter, but not the overhead between network adapters on different nodes.
One class of program that is appropriate for persistent communication is data decomposition problems in which points are updated based on the values of neighboring points. In this case, for many iterations, tasks send points that border their neighbors' domains, and receive points that border their own. At each iteration, the location, amount, and type of message data, the destination or source task, and the communicator stay the same. The same message tags can be used, because persistent communication requires the communication to be completed within each loop iteration.
Persistent communication is non-blocking. You can choose to have both sides of the communication use persistent communication, or only one side.
MPI objects are the internal representations of important entities such as groups, communicators, and datatypes. To increase program safety, programmers cannot directly create, write to, or destroy objects. Instead, objects are manipulated via handles, which are returned from or passed to MPI routines. An example of a handle is MPI_COMM_WORLD, which accesses a communicator object.
You have also encountered the request handle returned by the non-blocking communication calls. The request object accessed by this handle is the internal representation of a send or receive call. It archives all the information contained in the arguments to the message-passing call (but not the message data itself), plus the communication mode and the status of the message.
When a program calls a non-blocking message-passing routine such as MPI_Isend, a request object is created, and then the communication is started. These steps are equivalent to two other MPI calls, MPI_Send_init and MPI_Start. When the program calls MPI_Wait, it waits until all necessary local operations have completed, and then frees the memory used to store the request object. This second step is equivalent to a call to MPI_Request_free.
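The equivalence described above can be sketched in C. This is a minimal illustration, not code from the worksheet; the buffer, count, and destination names are assumptions.

```c
#include <mpi.h>

/* Fragment 1: the familiar non-blocking send. MPI_Isend both creates
   the request object and starts the communication; MPI_Wait both
   completes the communication and frees the request. */
void nonblocking_send(double *buf, int count, int dest, MPI_Comm comm)
{
    MPI_Request req;
    MPI_Isend(buf, count, MPI_DOUBLE, dest, 0, comm, &req); /* create + start */
    MPI_Wait(&req, MPI_STATUS_IGNORE);                      /* complete + free */
}

/* Fragment 2: the same operation decomposed into its persistent pieces. */
void decomposed_send(double *buf, int count, int dest, MPI_Comm comm)
{
    MPI_Request req;
    MPI_Send_init(buf, count, MPI_DOUBLE, dest, 0, comm, &req); /* create request */
    MPI_Start(&req);                    /* start the communication */
    MPI_Wait(&req, MPI_STATUS_IGNORE);  /* complete; a persistent request is
                                           NOT freed by MPI_Wait */
    MPI_Request_free(&req);             /* free explicitly */
}
```

Both fragments transfer the same message; the second simply exposes the create/start/complete/free steps as separate calls.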
When you call a non-blocking message-passing routine many times with the same arguments, you are repeatedly creating the same request object. Similarly, when you wait for completion of these communications, you repeatedly free the request object.
The idea behind persistent communication is to allow the request object to persist, and be reused, after the MPI_Wait call. You create the request object once (using MPI_Send_init), start and complete the communication as many times as needed (using MPI_Start and MPI_Wait), and then free the request object once (using MPI_Request_free).
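That lifecycle can be written out as a short C sketch. The loop count, buffer, and destination names here are illustrative assumptions.

```c
#include <mpi.h>

/* Persistent send lifecycle: create the request once, start and complete
   the communication many times, free the request once. */
void persistent_send_loop(double *buf, int count, int dest, int niter,
                          MPI_Comm comm)
{
    MPI_Request req;

    MPI_Send_init(buf, count, MPI_DOUBLE, dest, 0, comm, &req); /* once */

    for (int i = 0; i < niter; i++) {
        /* ... update buf for this iteration ... */
        MPI_Start(&req);                    /* start this iteration's send */
        MPI_Wait(&req, MPI_STATUS_IGNORE);  /* complete it; req survives */
    }

    MPI_Request_free(&req);                 /* once, after the loop */
}
```

Compared with calling MPI_Isend every iteration, the request setup work is paid only once, outside the loop.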
The plain text in the worksheet above shows a computational loop that uses a non-blocking receive and a non-blocking send. Since the loop is repeated many times, and the arguments to the communication routines do not change, this program can use persistent communication to improve performance.
The selection boxes on the worksheet show the steps required to convert the program to use persistent communication for both the send and receive.
The first step in converting the program is to initialize the persistent communication. This is done outside the loop. The receive is initialized with MPI_Recv_init. Since the worksheet uses standard mode, the persistent send request is created using MPI_Send_init.
The initialization routines create persistent request objects and return handles (req[0] and req[1] in the worksheet). They do not cause any data to be transferred.
The initialization routines (MPI_Send_init and variants, and MPI_Recv_init) have the same argument lists as the non-blocking message-passing calls (MPI_Isend and variants, and MPI_Irecv). When adding the initialization routine, simply "borrow" the argument list from the message-passing call which is to be replaced.
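For example, converting a receive is a one-word change, since the argument lists match. This sketch uses assumed names (rbuf, source, tag); it is not the worksheet code.

```c
#include <mpi.h>

void compare_receives(double *rbuf, int count, int source, int tag,
                      MPI_Comm comm)
{
    MPI_Request req;

    /* Non-persistent: creates the request AND starts the receive. */
    MPI_Irecv(rbuf, count, MPI_DOUBLE, source, tag, comm, &req);
    MPI_Wait(&req, MPI_STATUS_IGNORE);

    /* Persistent: identical argument list, but this call only creates
       the request; the receive is started later with MPI_Start. */
    MPI_Recv_init(rbuf, count, MPI_DOUBLE, source, tag, comm, &req);
    MPI_Start(&req);
    MPI_Wait(&req, MPI_STATUS_IGNORE);
    MPI_Request_free(&req);
}
```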
In addition to creating request objects using MPI_Send_init, it is sometimes useful to initialize a request handle to the null request MPI_REQUEST_NULL. This can simplify writing code when not all tasks do exactly the same thing, or when behavior may vary at runtime.
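One way this helps, sketched below with assumed names: in a 1-D decomposition the leftmost task has no left neighbor, but initializing unused handles to MPI_REQUEST_NULL keeps the completion code uniform, because MPI_Waitall treats null requests as complete. (MPI_Start itself still needs a real request, so it is guarded.)

```c
#include <mpi.h>

void conditional_receive(double *rbuf, int count, int rank, int niter,
                         MPI_Comm comm)
{
    MPI_Request req[2];
    req[0] = MPI_REQUEST_NULL;   /* receive-from-left slot */
    req[1] = MPI_REQUEST_NULL;   /* second slot, unused in this sketch */

    if (rank > 0)                /* only tasks with a left neighbor */
        MPI_Recv_init(rbuf, count, MPI_DOUBLE, rank - 1, 0, comm, &req[0]);

    for (int i = 0; i < niter; i++) {
        if (req[0] != MPI_REQUEST_NULL)
            MPI_Start(&req[0]);  /* MPI_Start requires a real request */
        /* Uniform on every task: null entries complete immediately. */
        MPI_Waitall(2, req, MPI_STATUSES_IGNORE);
    }

    if (req[0] != MPI_REQUEST_NULL)
        MPI_Request_free(&req[0]);
}
```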
The second step in porting the program is to replace the receive and send calls (which create requests and start the communication) with calls that only start the communication, based on pre-existing requests.
There are two start calls: MPI_Start, which starts the communication for a single request, and MPI_Startall, which starts the communications for an array of requests.
These are non-blocking calls. Therefore, the call to MPI_Waitall at the end of the loop is retained. In fact, a Wait is required between successive MPI_Start calls that use the same request object -- a request object cannot track two communications at once.
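Putting the two steps together gives a converted loop along these lines, following the worksheet's req[0] (receive) and req[1] (send). The neighbor ranks, buffers, and halo details are illustrative assumptions.

```c
#include <mpi.h>

void converted_loop(double *rbuf, double *sbuf, int count,
                    int left, int right, int niter, MPI_Comm comm)
{
    MPI_Request req[2];

    /* Step 1: initialize once, outside the loop. */
    MPI_Recv_init(rbuf, count, MPI_DOUBLE, left,  0, comm, &req[0]);
    MPI_Send_init(sbuf, count, MPI_DOUBLE, right, 0, comm, &req[1]);

    for (int i = 0; i < niter; i++) {
        /* Step 2: start both communications on the existing requests. */
        MPI_Startall(2, req);
        /* ... compute on interior points while messages are in flight ... */
        /* The Wait is required before the requests can be restarted. */
        MPI_Waitall(2, req, MPI_STATUSES_IGNORE);
    }

    /* Step 3: free the persistent requests once, after the loop. */
    MPI_Request_free(&req[0]);
    MPI_Request_free(&req[1]);
}
```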
After the loop exits, the program no longer needs the persistent request objects. Since persistent request objects are not deallocated by Wait or Test, they must be explicitly deallocated with MPI_Request_free.
msize (bytes) | async improvement | sync improvement
---|---|---
8 | 19 % | 15 %
4096 | 11 % | 4.7 %
8192 | 5.9 % | 2.9 %
800,000 | - | 0 %
8,000,000 | - | 0 %
These timings were made with a very simple Fortran program in which, within a loop, one task repeatedly sends a message to a second task. For the three smaller message sizes, the number of loop iterations was varied between 10,000 and 1,000,000. The improvement with persistent communication did not vary with the number of loop iterations, nor between sends and receives.
Measurements were made for asynchronous and synchronous communication. All asynchronous communication was measured using standard mode. Since standard mode switches from asynchronous to synchronous behavior above the eager limit, for the 8192-byte runs the eager limit was raised above the default (4096 bytes) to 8193 bytes (using the MP_EAGER_LIMIT environment variable). Synchronous communication was measured using synchronous mode for messages of 4096 bytes or less, and standard mode for messages larger than 4096 bytes.
Take a multiple-choice quiz on this material, and submit it for grading.
Lab exercise for MPI Persistent Communication
Please complete this short evaluation form. Thank you!