Cornell Theory Center


Discussion:
MPI Point to Point Communication II

This is the in-depth discussion layer of a two-part module. For an explanation of the layers and how to navigate within and between them, return to the top page of this module.


MPI Point to Point Communication I introduced the routines related to sending a message between two processes. There are several programming options for point to point communication that relate to how the system handles the message, and this module will give you an overview of some of them. These topics are followed by a short set of hard-won recommendations to keep in mind during application design and construction.


Table of Contents

  1. Overview
  2. Basic Deadlock
  3. Determining Information about Messages
  4. Special Parameters
  5. IBM MPI Implementation
  6. Programming Recommendations




1. Overview

We will start with an introduction to the programming options covered in this module:

Deadlock

Deadlock is an often-encountered situation in parallel processing, resulting when two or more processes contend for the same set of resources. In communications, a typical scenario involves two processes wishing to exchange messages, each trying to give its message to the other while neither is yet ready to accept one. A number of strategies will be described to help ensure that this situation does not occur in your application.

Checking and acting on communications state

For non-blocking transactions, the wait, test, and probe calls make it possible for the application to query the status of a particular message, or to check on any message that meets a certain set of characteristics, without necessarily taking any completed transaction out of the queue.

Checking the information returned from a transaction (status) allows the application to, for example, take corrective action if an error has occurred.

Using special parameters for special cases

Wildcards - MPI_ANY_SOURCE and MPI_ANY_TAG

These wildcards allow the receiving process to specify receipt of a message using a wildcard for the sender, for the kind of message (the tag), or both.

Null Processes and Requests

Special null parameters can simplify applications with regular data structures.



2. Basic Deadlock

Deadlock is a phenomenon most common with blocking communication. It occurs when all tasks are waiting for events that haven't been initiated yet.

Consider two SPMD tasks: both call blocking standard sends at the same point in the program, and each task's matching receive occurs later in the other task's program.

A simple example of deadlock: the two sends each wait on their corresponding receives in order to complete, but those receives are executed after the sends. Since the sends cannot complete and return, the receives are never executed, and both communications stall indefinitely.

A more complex example occurs when the message size is greater than the threshold (the message size at which the implementation switches from the eager to the rendezvous protocol; see Section 5): deadlock results because neither task can synchronize with its matching receive. If the message size is less than or equal to the threshold, deadlock can still occur if insufficient system buffer space is available. Both tasks then wait for a receive to draw message data out of the system buffer, but those receives cannot be executed because both tasks are blocked at the send.
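To make the failure mode concrete, here is a minimal C sketch of the unsafe pattern just described (the buffer size and names are illustrative; the count is chosen to exceed a typical threshold):

    #include <mpi.h>

    #define N 100000   /* large enough to exceed a typical eager threshold */

    static double sendbuf[N], recvbuf[N];

    int main(int argc, char **argv)
    {
        int rank;
        MPI_Status status;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* Run with exactly two tasks.  Both tasks send first, so each
           blocking send waits for a matching receive that is never
           reached: both tasks stall indefinitely. */
        MPI_Send(sendbuf, N, MPI_DOUBLE, 1 - rank, 0, MPI_COMM_WORLD);
        MPI_Recv(recvbuf, N, MPI_DOUBLE, 1 - rank, 0, MPI_COMM_WORLD, &status);

        MPI_Finalize();
        return 0;
    }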

There are four ways to avoid deadlock (the first and third are sketched in C after this list):

  1. different ordering of calls between tasks

    Arrange for one task to post its receive first and for the other to post its send first. That clearly establishes that the message in one direction will precede the other.

  2. non-blocking calls

    Have each task post a non-blocking receive before it does any other communication. This allows each message to be received, no matter what the task is working on when the message arrives or in what order the sends are posted.

  3. MPI_Sendrecv
    MPI_Sendrecv_replace

    Use MPI_Sendrecv; this is an elegant solution. In the MPI_Sendrecv_replace version, the system allocates some buffer space (not subject to the threshold limit) to handle the exchange of messages.

  4. buffered mode

    Use buffered sends so that computation can proceed after copying the message to the user-supplied buffer. This will allow the receives to be executed. Buffered sends were discussed earlier, in MPI Point to Point Communication I.
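As promised above, here are sketches of the first and third remedies for the two-task exchange (names are illustrative, and the buffers are declared as in the earlier sketch):

    #include <mpi.h>

    #define N 100000
    static double sendbuf[N], recvbuf[N];

    /* Remedy 1: order the calls differently on the two tasks, so that
       one receive is posted before the matching send must complete. */
    void ordered_exchange(int rank, int other)
    {
        MPI_Status status;
        if (rank == 0) {
            MPI_Send(sendbuf, N, MPI_DOUBLE, other, 0, MPI_COMM_WORLD);
            MPI_Recv(recvbuf, N, MPI_DOUBLE, other, 0, MPI_COMM_WORLD, &status);
        } else {
            MPI_Recv(recvbuf, N, MPI_DOUBLE, other, 0, MPI_COMM_WORLD, &status);
            MPI_Send(sendbuf, N, MPI_DOUBLE, other, 0, MPI_COMM_WORLD);
        }
    }

    /* Remedy 3: let MPI pair the send and the receive itself. */
    void sendrecv_exchange(int rank, int other)
    {
        MPI_Status status;
        MPI_Sendrecv(sendbuf, N, MPI_DOUBLE, other, 0,
                     recvbuf, N, MPI_DOUBLE, other, 0,
                     MPI_COMM_WORLD, &status);
    }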



3. Determining Information about Messages

Applications can use specific calls to control how they deal with incomplete transactions, or transactions whose state is not known, without necessarily having to complete the operation (this is analogous to wanting to know whether or not your aunt has written you, but not particularly caring what she had to say). The application can be programmed to be aware of the state of its communications, and thereby act intelligently in different situations:

3.1 Wait, Test, and Probe

Having successfully decoupled your communications from your main computational thread, you will nonetheless need to obtain information about the status of a communication transaction, and to arrange to act on what you find. Three calls are among the most commonly useful for these situations:

  MPI_Wait blocks until the specified non-blocking operation has completed.

  MPI_Test returns immediately, setting a flag that indicates whether the operation has completed.

  MPI_Probe (and its non-blocking counterpart, MPI_Iprobe) checks for an incoming message matching a given source, tag, and communicator, without actually receiving it.
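As an illustration, here is a hedged sketch of the common pattern of overlapping computation with a pending receive, testing periodically for completion (do_some_work and the message parameters are illustrative):

    #include <mpi.h>

    static void do_some_work(void)
    {
        /* placeholder for useful computation */
    }

    void overlap_receive(int source)
    {
        double buf[100];
        MPI_Request request;
        MPI_Status status;
        int done = 0;

        /* Post the receive early, then keep computing until the
           message has actually arrived. */
        MPI_Irecv(buf, 100, MPI_DOUBLE, source, 0, MPI_COMM_WORLD, &request);
        while (!done) {
            do_some_work();
            MPI_Test(&request, &done, &status);
        }
        /* Blocking alternative, once no more work remains:
           MPI_Wait(&request, &status); */
    }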


3.2 Status

The status object returns the message source, tag, and error code (standard).

Status is the object to examine for information on the message source and tag, and on any error incurred by the communication call. In Fortran these are returned as status(MPI_SOURCE), status(MPI_TAG), and status(MPI_ERROR) respectively; in C they are returned as status.MPI_SOURCE, status.MPI_TAG, and status.MPI_ERROR. The length of the received message can be extracted from status with MPI_Get_count. In the case of sends, the status information is generally no different from what was already in the request, so it is of little use.
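For example, the following sketch receives a message whose length is not known in advance, then uses the status object to find out how much actually arrived (the counts and tag are illustrative):

    #include <mpi.h>
    #include <stdio.h>

    void receive_unknown_length(int source)
    {
        int buf[100];        /* room for up to 100 integers */
        int count;
        MPI_Status status;

        MPI_Recv(buf, 100, MPI_INT, source, 0, MPI_COMM_WORLD, &status);

        /* status records what actually happened */
        MPI_Get_count(&status, MPI_INT, &count);
        printf("received %d ints from task %d with tag %d\n",
               count, status.MPI_SOURCE, status.MPI_TAG);
    }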



4. Special Parameters
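As outlined in the Overview, the special parameters fall into two groups: the wildcards MPI_ANY_SOURCE and MPI_ANY_TAG, which let a single receive match traffic from any sender or with any tag, and the null parameters MPI_PROC_NULL and MPI_REQUEST_NULL, which simplify applications with regular data structures (see also the example at the end of Section 6). As a sketch of the wildcard case, here is a receive loop that handles two kinds of messages; the tag values and names are illustrative:

    #include <mpi.h>

    #define DATA_TAG 1
    #define STOP_TAG 2

    void serve_messages(void)
    {
        double buf[100];
        MPI_Status status;
        int running = 1;

        while (running) {
            /* Match any sender and any tag, then let the status
               object tell us what actually arrived. */
            MPI_Recv(buf, 100, MPI_DOUBLE, MPI_ANY_SOURCE, MPI_ANY_TAG,
                     MPI_COMM_WORLD, &status);
            if (status.MPI_TAG == STOP_TAG)
                running = 0;     /* any task may request shutdown */
            /* else: process data sent by task status.MPI_SOURCE */
        }
    }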



5. IBM MPI Implementation

5.1 Protocols

IBM's MPI implements the four communication modes with two protocols. Here, protocols can be thought of as code that coordinates the message transfer.

A "rendezvous" protocol is used to implement the synchronous mode and the standard mode for message size greater than the threshold.

The "eager" protocol is used to implement ready mode, buffered mode, and standard mode for message size less than or equal to the threshold. Ready mode is implemented with the eager protocol and no buffer. Standard mode for small messages adds a receive-side system buffer. Buffered mode adds both a user-supplied (send-side) buffer and the receive-side system buffer.


5.2 Optimization and Customization

IBM's MPI implements some optimizations for handling messages smaller than the threshold size. Buffered mode bypasses the user-supplied (send-side) buffer, and copies the message directly to the adapter (and thence to the network). This decreases the overall system overhead by removing the buffer copy. It also slightly increases the time required for the blocking buffered send to return, because the copy to the adapter takes longer than a copy to the user-supplied buffer.

In standard mode, when the message size is smaller than the threshold, the receive-side buffer will be bypassed if the receive call is posted before the send.

IBM's MPI offers environment variables that can be used to adjust message-passing behavior; for example, MP_EAGER_LIMIT adjusts the threshold between the eager and rendezvous protocols, and MP_BUFFER_MEM adjusts the amount of memory available for the receive-side system buffer.


6. Programming Recommendations

Avoid deadlock by intelligent arrangement of sends/receives, or by posting non-blocking receives early.

If you choose to use blocking transactions, guarantee that deadlock is avoided by carefully tailoring your communication strategy so that sends and receives are properly paired and correctly ordered. Alternatively, post non-blocking receives as early as possible, so that each send stays in the system for as little time as necessary.

Use the appropriate "operation-status" call ("wait", "test", or "probe") to control the operation of non-blocking communications calls.

Correctly knowing the state of communications transactions allows the application to steer itself toward more efficient use of the available cycles. Pending traffic must ultimately be accepted, but that acceptance can be long in coming, and much can be accomplished while it is incomplete. The wait, test, and probe calls allow the application to match the appropriate activity to the particular situation.

Check the value of the "status" fields for problem reports.

Don't just assume that things are running smoothly: make it a general rule that every transaction is checked for success, and that failures are promptly reported, with as much related information as possible gathered and made available for debugging.

Intelligent use of wildcards can greatly simplify logic and code.

General receives, which are capable of handling more than one kind of message traffic (in terms of sender, message type, or both), can greatly simplify the structure of your application and can potentially save on system resources (if you are in the habit of using a unique message buffer for each transaction).

Null processes and requests move tests out of user code.

Use of MPI_PROC_NULL and MPI_REQUEST_NULL does not get rid of boundary tests; it simply allows the programmer to substitute a call that will be ignored by the system.
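For instance, in a one-dimensional decomposition the end tasks have no neighbor on one side. Aiming the shift at MPI_PROC_NULL makes those calls complete immediately, with no explicit boundary test in the user code; a sketch (names are illustrative):

    #include <mpi.h>

    void shift_right(int rank, int ntasks, double *out, double *in, int n)
    {
        int left  = (rank == 0)          ? MPI_PROC_NULL : rank - 1;
        int right = (rank == ntasks - 1) ? MPI_PROC_NULL : rank + 1;
        MPI_Status status;

        /* Send to the right neighbor, receive from the left.  Calls
           aimed at MPI_PROC_NULL return at once and leave 'in'
           untouched at the boundary. */
        MPI_Sendrecv(out, n, MPI_DOUBLE, right, 0,
                     in,  n, MPI_DOUBLE, left,  0,
                     MPI_COMM_WORLD, &status);
    }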



References

CTC's MPI Documentation
http://www.tc.cornell.edu/UserDoc/Software/PTools/mpi/

Extensive materials written at CTC, as well as links to postscript versions of the following papers:

Franke, H. (1995) MPI-F, An MPI Implementation for IBM SP-1/SP-2. Version 1.41 5/2/95.

Franke, H., Wu, C.E., Riviere, M., Pattnaik, P. and Snir, M. (1995) MPI Programming Environment for IBM SP1/SP2. Proceedings of ICDCS 95, Vancouver, 1995.

Gropp, W., Lusk, E. and Skjellum, A. (1994) Using MPI: Portable Parallel Programming with the Message-Passing Interface. The MIT Press, Cambridge, Massachusetts.

Message Passing Interface Forum (1995) MPI: A Message Passing Interface Standard. June 12, 1995. Available in postscript from http://www.epm.ornl.gov/~walker/mpi


[Quiz] Take a multiple-choice quiz on this material, and submit it for grading.

[Exercise] Lab exercises for MPI Point to Point Communication II

[Evaluation] Please complete this short evaluation form. Thank you!

