Cornell Theory Center


Discussion
MPI Persistent Communication

10/98

This is the in-depth discussion layer of a two-part module. For an explanation of the layers and how to navigate within and between them, return to the top page of this module.


Persistent communication is an optimization for point-to-point message passing. This module describes when it is appropriate, how it reduces system overhead, and how it is used in a program. The improvement in wallclock time for IBM's MPI on the SP is documented. In the lab exercises, you will modify a code to use persistent communication, and compare the timings for the original and optimized versions.

Table of Contents

  1. Overview
  2. Internals
  3. Usage
  4. Performance

  References
  Lab Exercises
  Quiz
  Evaluation



1. Overview

If a point-to-point message-passing routine is called repeatedly with the same arguments, persistent communication can be used to avoid redundancy in setting up the message each time it is sent. Persistent communication reduces the overhead of communication between the parallel task and the network adapter, but not the overhead between network adapters on different nodes.

One class of program well suited to persistent communication is the data-decomposition problem, in which points are updated based on the values of neighboring points. For many iterations, each task sends the points that border its neighbors' domains, and receives the points that border its own. At each iteration, the location, amount, and type of message data, the destination or source task, and the communicator stay the same. The same message tags can be reused, because persistent communication requires each communication to complete within its loop iteration.

Persistent communication is non-blocking. You can choose to have both sides of the communication use persistent communication, or only one side.



2. Internals

MPI objects are the internal representations of important entities such as groups, communicators, and datatypes. To increase program safety, programmers cannot directly create, write to, or destroy objects. Instead, objects are manipulated via handles, which are returned from or passed to MPI routines. An example of a handle is MPI_COMM_WORLD, which accesses a communicator object. You have also encountered the request handle returned by the non-blocking communication calls.

The request object accessed by this handle is the internal representation of a send or receive call. It stores all the information contained in the arguments to the message-passing call (but not the message data itself), plus the communication mode and the status of the message.

When a program calls a non-blocking message-passing routine such as MPI_Isend, a request object is created, and then the communication is started. These steps are equivalent to two other MPI calls, MPI_Send_init and MPI_Start. When the program calls MPI_Wait, it waits until all necessary local operations have completed, and then frees the memory used to store the request object. This second step is equivalent to a call to MPI_Request_free.
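Spelled out in C, the equivalence looks like this (a sketch; buf, cnt, tp, dst, tag, and com stand for the usual send arguments):

MPI_Request req;
MPI_Status  status;

MPI_Send_init (buf, cnt, tp, dst, tag, com, &req);  /* create the request object */
MPI_Start (&req);                                   /* start the communication   */
/* ... together equivalent to MPI_Isend (buf, cnt, tp, dst, tag, com, &req) */

MPI_Wait (&req, &status);    /* wait for local completion */
MPI_Request_free (&req);     /* free the request object   */
/* ... together equivalent to MPI_Wait on a non-persistent request */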

When you call a non-blocking message-passing routine many times with the same arguments, you are repeatedly creating the same request object. Similarly, when you wait for completion of these communications, you repeatedly free the request object.

The idea behind persistent communication is to allow the request object to persist, and be reused, after the MPI_Wait call. You create the request object once (using MPI_Send_init), start and complete the communication as many times as needed (using MPI_Start and MPI_Wait), and then free the request object once (using MPI_Request_free).
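A minimal sketch of the resulting pattern, with the same placeholder arguments as above:

MPI_Send_init (buf, cnt, tp, dst, tag, com, &req);   /* create once */

for (i = 0; i < n; i++)
{
    MPI_Start (&req);            /* start this iteration's send       */
    MPI_Wait (&req, &status);    /* complete it; the request persists */
}

MPI_Request_free (&req);         /* free once */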



3. Usage

MPI_Request req[2];
MPI_Status  status[2];

for (i = 1; i < BIGNUM; i++)
{
    MPI_Irecv (buf1, cnt, tp, src, tag, com, &req[0]);
    MPI_Isend (buf2, cnt, tp, dst, tag, com, &req[1]);

    MPI_Waitall (2, req, status);

    do_work (buf1, buf2);
}


The loop above uses a non-blocking receive and a non-blocking send. Since the loop is repeated many times, and the arguments to the communication routines do not change, the program can use persistent communication to improve performance.

The steps below convert the program to use persistent communication for both the send and the receive.

Step 1: Create requests

The first step in converting the program is to initialize the persistent communication. This is done outside the loop. The receive is initialized with MPI_Recv_init, borrowing the argument list from the MPI_Irecv call it will replace:
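/* same argument list as the MPI_Irecv it replaces */
MPI_Recv_init (buf1, cnt, tp, src, tag, com, &req[0]);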

There are four MPI routines for initialization of persistent sends, corresponding to the four communication modes:

  MPI_Send_init -- standard mode
  MPI_Bsend_init -- buffered mode
  MPI_Ssend_init -- synchronous mode
  MPI_Rsend_init -- ready mode

Since the program uses standard mode (MPI_Isend), the persistent send request is created using MPI_Send_init.
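The matching send initialization borrows its argument list from the MPI_Isend call:

/* same argument list as the MPI_Isend it replaces */
MPI_Send_init (buf2, cnt, tp, dst, tag, com, &req[1]);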

The initialization routines create persistent request objects and return handles (req[0] and req[1] in the code above). They do not cause any data to be transferred.

The initialization routines (MPI_Send_init and variants, and MPI_Recv_init) have the same argument lists as the non-blocking message-passing calls (MPI_Isend and variants, and MPI_Irecv). When adding the initialization routine, simply "borrow" the argument list from the message-passing call which is to be replaced.

In addition to creating request objects using MPI_Send_init, it is sometimes useful to initialize a request handle to the null request MPI_REQUEST_NULL. This can simplify writing code when all tasks do not do exactly the same thing, or when behavior may vary at runtime.
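A sketch of such a case, assuming a hypothetical runtime flag have_neighbor that is true on only some tasks:

MPI_Request req[2] = { MPI_REQUEST_NULL, MPI_REQUEST_NULL };
MPI_Status  status[2];

if (have_neighbor)   /* hypothetical runtime condition */
{
    MPI_Recv_init (buf1, cnt, tp, src, tag, com, &req[0]);
    MPI_Send_init (buf2, cnt, tp, dst, tag, com, &req[1]);
}

/* Inside the loop: start only the requests that were created, but
   wait on both slots -- MPI_Waitall treats MPI_REQUEST_NULL entries
   as already complete. */
if (req[0] != MPI_REQUEST_NULL) MPI_Start (&req[0]);
if (req[1] != MPI_REQUEST_NULL) MPI_Start (&req[1]);
MPI_Waitall (2, req, status);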

Step 2: Replace receive and send

The second step in converting the program is to replace the receive and send calls (which create requests and start the communication) with calls that only start the communication, using the pre-existing requests.

There are two start calls: MPI_Start, which starts the communication for a single request, and MPI_Startall, which starts the communications for an array of requests.

These are non-blocking calls. Therefore, the call to MPI_Waitall at the end of the loop is kept. In fact, a Wait is required between successive MPI_Start calls that use the same request object -- a request object cannot track two communications at once.

Step 3: Deallocate requests

After the loop exits, the program no longer needs the persistent request objects. Since persistent request objects are not deallocated by Wait or Test calls, they must be explicitly deallocated with MPI_Request_free.
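Putting the three steps together, the converted program looks like this (a sketch assembled from the steps above, using the variables of the original loop):

MPI_Request req[2];
MPI_Status  status[2];

/* Step 1: create the persistent requests, outside the loop */
MPI_Recv_init (buf1, cnt, tp, src, tag, com, &req[0]);
MPI_Send_init (buf2, cnt, tp, dst, tag, com, &req[1]);

for (i = 1; i < BIGNUM; i++)
{
    /* Step 2: start, then complete, the communications */
    MPI_Startall (2, req);
    MPI_Waitall (2, req, status);

    do_work (buf1, buf2);
}

/* Step 3: free the requests, once, after the loop */
MPI_Request_free (&req[0]);
MPI_Request_free (&req[1]);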



4. Performance

Improvement in Wallclock Time
Persistent vs. Conventional Communication
  msize (bytes)    async improvement    sync improvement
          8             19 %                15 %
       4096             11 %                4.7 %
       8192             5.9 %               2.9 %
    800,000              -                  0 %
  8,000,000              -                  0 %

These timings were made with a very simple Fortran program in which, within a loop, one task repeatedly sends a message to a second task. For the three smaller message sizes, the number of loop iterations was varied between 10,000 and 1,000,000. The improvement with persistent communication did not vary with the number of loop iterations, nor between sends and receives.

Measurements were made for asynchronous and synchronous communication. All asynchronous communication was measured using standard mode. Since standard mode switches from asynchronous to synchronous behavior above the eager limit, for the 8192-byte runs the eager limit was raised from the default (4096 bytes) to 8193 bytes (using the MP_EAGER_LIMIT environment variable). Synchronous communication was measured using synchronous mode for messages of 4096 bytes or less, and standard mode for messages larger than 4096 bytes.



References

Message Passing Interface Forum (June 1995). MPI: A Message-Passing Interface Standard.

CTC's MPI Documentation

[FAQ] Frequently Asked Questions


[Quiz] Take a multiple-choice quiz on this material, and submit it for grading.

[Exercise] Lab exercise for MPI Persistent Communication

[Evaluation] Please complete this short evaluation form. Thank you!

