This is the in-depth discussion layer of a two-part module. For an explanation of the layers and how to navigate within and between them, return to the top page of this module.
To introduce you to MPI, this module first gives an overview of MPI functionality and discusses the value of having a standard message-passing library. Having already been through the IBM SP Parallel Environment module, you're now ready for a tailored tour of the basic concepts and mechanisms that will be of most value to you as a beginning MPI programmer. This includes gaining a familiarity with the format of MPI routines, the structure of a simple MPI program, the parts of an MPI message (its data and envelope), and the role of communicators.
In the lab exercises, you'll be given the opportunity to work with actual MPI code. At the conclusion of this module, you can expect to be conversant with a basic set of six MPI calls necessary to begin designing and coding your own MPI applications.
Please note that much of the content of this module is accessed by links to other files. Printing this file will not capture this information.
1. Overview of MPI
1.1 What Is MPI?
MPI is a message-passing library, a collection of routines for
facilitating communication (exchange of data and
synchronization of tasks)
among the processors in a
distributed-memory parallel program.
The acronym stands for Message-Passing Interface.
MPI is the first standard and portable message passing library
that offers good performance.
MPI is not a true standard; that is, it was not issued by a standards organization such as ANSI or ISO. Instead, it is a "standard by consensus," designed in an open forum that included hardware vendors, researchers, academics, software library developers, and users, representing over 40 organizations. This broad participation in its development ensured MPI's rapid emergence as a widely used standard for writing message-passing programs.
The MPI "standard" was introduced by the MPI Forum in May 1994 and updated in June 1995. The document that defines it is entitled "MPI: A Message-Passing Standard", published by the University of Tennesee, and available on the World Wide Web at Argonne National Laboratory. If you are not already familiar with MPI, you will probably want to print a copy of this document and use it as a reference for the syntax of MPI routines, which this workshop will not cover except to illustrate specific cases.
MPI-2 provides extensions to the MPI message-passing standard. This effort did not change MPI; it extended MPI in areas such as dynamic process creation and management, one-sided communication, extended collective operations, parallel I/O, and additional language bindings.
MPI-2 was completed in July 1997.
1.2 What does MPI offer?
MPI offers portability, standardization, performance, functionality,
and several high quality implementations.
MPI is standardized on many levels. For example, since the syntax is standardized, you can rely on your MPI code to execute under any MPI implementation running on your architecture. Since the functional behavior of MPI calls is standardized too, you don't have to worry (as you do with different versions of PVM) about which implementation of MPI is currently on your machine; your MPI calls should behave the same regardless of the implementation. Performance, however, will vary slightly among different implementations.
In a rapidly changing environment of high-performance computing and communication technology, portability is on almost everyone's mind. Who wants to develop a program that can be run on only one machine, or that runs poorly on others? All massively parallel processing (MPP) systems provide some sort of message-passing library specific to their hardware. These libraries give great performance, but an application code written for one platform cannot easily be ported to another.
With MPI, you can write portable programs that still take advantage of the specific hardware and software provided by vendors. Happily, this is largely taken care of simply by using MPI calls, because the implementors have tuned those calls to the underlying hardware and software environment.
A number of environments, including PVM, Express, and P4, have attempted to provide a standardized parallel computing environment. However, none of these attempts has shown the high performance of MPI.
MPI has more than one quality implementation. These implementations provide asynchronous communication, efficient message buffer management, efficient groups, and rich functionality. MPI includes a large set of collective communication operations, virtual topologies, and different communication modes, and MPI supports libraries and heterogeneous networks as well.
Implementations currently available include vendor-supplied versions, such as IBM's MPI for the SP, as well as freely available, portable implementations such as MPICH (from Argonne National Laboratory and Mississippi State University) and LAM (from the Ohio Supercomputer Center).
1.3 MPI at the Cornell Theory Center
Two implementations of MPI are installed on CTC's SP system:
This workshop focuses on the IBM implementation.
IBM's MPI is the IBM product implementation of the MPI standard for the SP and RS/6000 workstation clusters. It is an evolution of the Message Passing Library (MPL) and has benefited from the MPI-F prototype developed at the IBM T.J. Watson Research Center. MPI and MPL routines now reside in the same library, so it is possible to mix MPI and MPL calls in the same program.
IBM's MPI is installed on the Cornell Theory Center's SP2 and is available to users who have accounts on this machine. It is also available at all IBM SP sites that have made the switch to PSSP (Parallel System Support Programs) Version 2.1 and the AIX 4.1 operating system.
Programs written using the MPI library run on the SP under the Parallel Environment. These facilities provide:
IBM MPI also includes routines with the "MPE" prefix, which marks extensions to the MPI standard; these routines are not part of the standard itself. MPE routines are provided to enhance the functionality and performance of user applications, but applications that use them will not be directly portable to other MPI implementations.
1.4 How to Use MPI
If you already have a serial version of your program and are going to modify it to use MPI, make sure the serial version is as thoroughly debugged as possible before going parallel. This will make debugging the parallel version much easier. Then add calls to MPI routines in the appropriate places in your program.
If you are writing an MPI program from scratch, and it's not much extra work to write a serial version first (without MPI calls), you should do that. Again, identifying and removing the non-parallel bugs first will make parallel debugging that much easier. Design your parallel algorithm, taking advantage of any parallelism inherent in your serial code, e.g., large arrays that can be broken down into subtasks and processed independently.
When debugging in parallel, start with a small number of nodes. Make sure your program runs successfully on a few nodes, then increase the number gradually, e.g., from 2 to 4 to 8. That way, you won't waste a lot of machine time on additional bugs.
This section will give you an introduction to a simple MPI program; the intent here is just to give you a visual image that you can relate to and refer back to if you have questions concerning things like the overall structure of an MPI program or the format of its calls.
The program itself is nothing more than the venerable Hello,
world program, so we don't have to be at all concerned about
understanding the purpose or the algorithm -- rather, we can focus
completely on the mechanics of accomplishing an extremely easy task.
2.1 Format of MPI routines
First, we'll look at the actual calling formats used by MPI.
C programs should include the file "mpi.h", which contains definitions for MPI constants and functions. In C, the MPI routines are functions that return an integer error code.

Fortran programs should include 'mpif.h', which contains definitions for MPI constants and functions. In Fortran, the MPI routines are subroutines, called with an integer error-code variable as the last argument.

The exceptions to these formats are the timing routines (MPI_WTIME and MPI_WTICK), which are functions in both C and Fortran and return double-precision values.
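As a quick illustration, MPI_WTIME and MPI_WTICK can be used in C as shown below; this is only a sketch, and the section being timed is a placeholder.

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    double t_start, t_end;

    MPI_Init(&argc, &argv);

    t_start = MPI_Wtime();          /* wall-clock time in seconds (a double) */
    /* ... the work you want to time goes here ... */
    t_end = MPI_Wtime();

    printf("elapsed time = %f seconds (timer resolution = %g seconds)\n",
           t_end - t_start, MPI_Wtick());

    MPI_Finalize();
    return 0;
}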
As you'll see, the basic outline of an MPI program follows these general steps: initialize MPI, determine the number of processes and this process's rank, pass messages as the algorithm requires, and finalize MPI before exiting.
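Here is a minimal C sketch of that outline (a Fortran example with the same structure appears in the next section):

#include <mpi.h>

int main(int argc, char **argv)
{
    int size, rank;

    MPI_Init(&argc, &argv);                 /* initialize MPI               */
    MPI_Comm_size(MPI_COMM_WORLD, &size);   /* how many processes are there? */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* which one am I?              */

    /* ... exchange messages with MPI_Send / MPI_Recv as needed ... */

    MPI_Finalize();                         /* clean up before exiting      */
    return 0;
}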
2.3 An MPI sample program (Fortran)
As you look at the code below, note the six basic calls to MPI routines. Click on the name of each MPI routine to read a detailed description of that routine's purpose and syntax.
      program hello
      include 'mpif.h'
      integer rank, size, ierror, tag, status(MPI_STATUS_SIZE)
      character(12) message

      call MPI_INIT(ierror)
      call MPI_COMM_SIZE(MPI_COMM_WORLD, size, ierror)
      call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierror)
      tag = 100

      if (rank .eq. 0) then
         message = 'Hello, world'
         do i = 1, size-1
            call MPI_SEND(message, 12, MPI_CHARACTER, i, tag,
     &                    MPI_COMM_WORLD, ierror)
         enddo
      else
         call MPI_RECV(message, 12, MPI_CHARACTER, 0, tag,
     &                 MPI_COMM_WORLD, status, ierror)
      endif

      print*, 'node', rank, ':', message

      call MPI_FINALIZE(ierror)
      end
To summarize the program: This is a SPMD code, so copies of this program are running on multiple nodes. Each process initializes itself with MPI (MPI_INIT), determines the number of processes (MPI_COMM_SIZE), and learns its rank (MPI_COMM_RANK). Then one process (with rank 0) sends messages in a loop (MPI_SEND), setting the destination argument to the loop index to ensure that each of the other processes is sent one message. The remaining processes receive one message (MPI_RECV). All processes then print the message, and exit from MPI (MPI_FINALIZE).
It is also worth noting what doesn't happen in this program. There is no routine that causes additional copies of the program to run. For MPI-1 all processes are started on the command line, in an implementation-specific manner.
In the lab exercise at the end, you'll be asked to enhance this program with some additional calls to MPI routines.
3. MPI messages
MPI messages consist of two basic parts: the actual data that you want
to send/receive, and an envelope of information that helps to route
the data. There are usually three calling parameters in MPI
message-passing calls that describe the data, and another three
parameters that specify the routing:
   startbuf, count, datatype,        dest, tag, comm
       \        |        /             \    |    /
        \------DATA-----/               \-ENVELOPE-/

Let's look at the data and envelope in more detail. We'll describe each parameter, and discuss whether these must be coordinated between the sender and receiver.
3.1 Data
When we use the term buffer to describe parameters in your MPI
calls, we
mean a space in the computer's memory where your MPI messages are to
be sent from or stored to. So, in this context, a buffer is
simply memory the compiler has assigned to a variable (usually an
array) in your program. To specify the buffer, you have to give
three parameters in the MPI call:
Startbuf: the address where the data start. For example, this could be the start of an array in your program.
Count: the number of elements (items) of data in the message. Note that this is elements, not bytes. This makes for portable code, since you don't have to worry about different representations of datatypes on different computers; the MPI implementation determines the number of bytes automatically. The count specified by the receive call must be greater than or equal to the count specified by the send. If more data is sent than the receive buffer can hold, an error occurs.

Datatype: the type of data to be transmitted, for example, floating point. The datatype should be the same for the send and receive calls. As an aside, an exception to this rule is the datatype MPI_PACKED, which is one method of handling mixed-type messages (the preferred method is to use derived datatypes); type-checking is relaxed in this case.
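To make the three data parameters concrete, here is a hedged C sketch; the array, the counts, and the tag value are invented for illustration, and it assumes at least two processes. The receiver specifies a larger count than the sender, then uses MPI_Get_count on the returned status to learn how many elements actually arrived:

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    double a[50];                 /* startbuf: the message buffer            */
    int i, rank, n, tag = 17;     /* the tag value is arbitrary              */
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        for (i = 0; i < 20; i++)
            a[i] = (double) i;
        /* startbuf = a, count = 20, datatype = MPI_DOUBLE */
        MPI_Send(a, 20, MPI_DOUBLE, 1, tag, MPI_COMM_WORLD);
    } else if (rank == 1) {
        /* the receive count (50) only needs to be >= the count sent (20) */
        MPI_Recv(a, 50, MPI_DOUBLE, 0, tag, MPI_COMM_WORLD, &status);
        MPI_Get_count(&status, MPI_DOUBLE, &n);
        printf("received %d elements\n", n);   /* prints 20 */
    }

    MPI_Finalize();
    return 0;
}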
The types of data already defined for you are called "basic datatypes," and are listed below. WARNING: note that the names are slightly different between the C implementation and the Fortran implementation.
---------------------------------------
MPI Datatype           C datatype
---------------------------------------
MPI_CHAR               signed char
MPI_SHORT              signed short int
MPI_INT                signed int
MPI_LONG               signed long int
MPI_UNSIGNED_CHAR      unsigned char
MPI_UNSIGNED_SHORT     unsigned short int
MPI_UNSIGNED           unsigned int
MPI_UNSIGNED_LONG      unsigned long int
MPI_FLOAT              float
MPI_DOUBLE             double
MPI_LONG_DOUBLE        long double
MPI_BYTE
MPI_PACKED
---------------------------------------
---------------------------------------
MPI Datatype           Fortran Datatype
---------------------------------------
MPI_INTEGER            INTEGER
MPI_REAL               REAL
MPI_DOUBLE_PRECISION   DOUBLE PRECISION
MPI_COMPLEX            COMPLEX
MPI_LOGICAL            LOGICAL
MPI_CHARACTER          CHARACTER(1)
MPI_BYTE
MPI_PACKED
---------------------------------------
You can also define additional datatypes, called derived datatypes. For example, you could define a datatype for noncontiguous data, or for a sequence of mixed basic datatypes. This might make your programming easier, or even make your code run faster. Derived datatypes are beyond the scope of this introduction, but are covered in the talk Derived Datatypes.

Derived datatypes are built with constructor routines such as MPI_TYPE_CONTIGUOUS, MPI_TYPE_VECTOR, MPI_TYPE_INDEXED, and MPI_TYPE_STRUCT.
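For a quick flavor only, here is a hedged C sketch that uses MPI_Type_vector to send one column of a row-major array between two processes; the array, its dimensions, and the ranks used are invented for illustration, and it assumes at least two processes.

#include <mpi.h>

int main(int argc, char **argv)
{
    double a[4][6];              /* a 4 x 6 row-major array (made up)       */
    MPI_Datatype column;         /* will describe one column of that array  */
    MPI_Status status;
    int i, j, rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    for (i = 0; i < 4; i++)
        for (j = 0; j < 6; j++)
            a[i][j] = 10.0 * i + j;

    /* 4 blocks of 1 double, separated by a stride of 6 doubles:
       exactly the memory layout of one column of the 4 x 6 array */
    MPI_Type_vector(4, 1, 6, MPI_DOUBLE, &column);
    MPI_Type_commit(&column);

    if (rank == 0)
        MPI_Send(&a[0][2], 1, column, 1, 0, MPI_COMM_WORLD);           /* send column 2 */
    else if (rank == 1)
        MPI_Recv(&a[0][2], 1, column, 0, 0, MPI_COMM_WORLD, &status);  /* receive it    */

    MPI_Type_free(&column);
    MPI_Finalize();
    return 0;
}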
As mentioned earlier, a message consists of the actual data and the message envelope. The envelope provides information on how to match sends to receives. The three parameters used to specify the message envelope are:
Dest/source: destination is specified by the send and is used to route the message to the appropriate process. Source is specified by the receive; only messages coming from that source can be accepted by the receive call. The receive can set source to MPI_ANY_SOURCE to indicate that any source is acceptable.

Tag: an arbitrary number to help distinguish among messages. The tags specified by the sender and receiver must match. The receiver can specify MPI_ANY_TAG to indicate that any tag is acceptable.

Comm: the communicator specified by the send must equal that specified by the receive. We'll describe communicators in more depth later in the module. For now, we'll just say that a communicator defines a communication "universe", and that processes may belong to more than one communicator. In this module, we will only be working with the predefined communicator MPI_COMM_WORLD, which includes all processes in the application.
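As a hedged C sketch of how the envelope is used (the payload values and the choice of tag are invented for illustration), the code below has process 0 accept one message from each other process, in whatever order they arrive, using the wildcards MPI_ANY_SOURCE and MPI_ANY_TAG, and then read the actual source and tag from the status argument:

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, size, value, i;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0) {
        /* accept one message from each other process, in arrival order */
        for (i = 1; i < size; i++) {
            MPI_Recv(&value, 1, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG,
                     MPI_COMM_WORLD, &status);
            printf("got %d from process %d with tag %d\n",
                   value, status.MPI_SOURCE, status.MPI_TAG);
        }
    } else {
        value = rank * rank;                   /* arbitrary payload          */
        MPI_Send(&value, 1, MPI_INT, 0, rank,  /* tag = rank, for illustration */
                 MPI_COMM_WORLD);
    }

    MPI_Finalize();
    return 0;
}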
To help understand the message envelope, let's consider the analogy of a bill collection agency that collects for several utilities. When sending a bill, the agency must specify: the customer who is to receive it (like the destination), which bill it is, for instance the month being billed (like the tag), and which utility the bill is for (like the communicator).
As promised earlier, we are now going to explain a little more about communicators. We will not go into great detail -- only enough that you will have some understanding of how they are used. Communicators are covered in more depth in the module MPI Groups and Communicator Management.
As described above, a message's eligibility to be picked up by a specific receive call depends on its source, tag, and communicator. Tag allows the program to distinguish between types of messages. Source simplifies programming. Instead of having a unique tag for each message, each process sending the same information can use the same tag. But why is a communicator needed?
Suppose you are sending messages between your processes, but you are also calling a library you obtained elsewhere, one that also runs on multiple nodes and communicates within itself using MPI. In this case, you want to make sure that the messages you send go to your own processes and do not get confused with the messages being sent internally among the processes that make up the library routine.
In this example, we have three processes communicating with each other. Each process also calls a library routine, and the three parallel parts of the library routine communicate with each other. We want to have two different message "spaces" here, one for our messages, and one for the library's messages. We do not want any intermingling of the messages.
You may continue with the example by reading the text below, or by running the Java applet which follows the text. (Or both, if you're feeling adventuresome.) Both illustrate the same thing. The boxes represent parts of three parallel processes. Time progresses from the top to the bottom of each diagram. The numbers in parentheses are NOT parameters, but rather process numbers. For example, send(1) means send a message to process 1. Recv(any) means receive a message from any processor. The user's (caller's) code is in the white (unshaded) boxes. The shaded boxes (callee) represent a (parallel) library package being called by the user. Finally, the arrows represent the movement of a message from sender to receiver.
The diagram below shows what we would like to happen. In this case, everything works as intended.
However, there is no guarantee that things will occur in this order, given the fact that the relative scheduling of processes on different nodes can vary from run to run. Suppose, for example, that we change the third process by adding some computation at the beginning. The sequence of events might then occur as follows:
In this case, communications do not occur as intended. The first "receive" in process 0 now receives the "send" from the library routine in process 1, not the intended (and now delayed) "send" from process 2. As a result, all three processes hang.
This problem is solved by the library developer requesting a new and unique communicator, and specifying this in all send and receive calls made by the library. This creates a library ("callee") message space separate from the user's ("caller") message space.
You might wonder whether tags alone could be used to create separate message spaces. The problem with tags is that their values are chosen by the programmer, who might happen to pick the same tags used by a parallel library that also uses MPI. With communicators, the system, not the programmer, assigns the identification: it gives the user one communicator and the library a different one, so there is no possibility of overlap, as there is with tags.
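As a sketch of how a library might set up its own message space (the mylib_* names are invented, and this is one possible approach rather than a prescribed one), it can duplicate the caller's communicator with MPI_Comm_dup when it is initialized:

#include <mpi.h>

/* hypothetical library-internal state */
static MPI_Comm mylib_comm;

/* called once by the user after MPI_Init, passing the caller's communicator */
void mylib_init(MPI_Comm comm)
{
    /* MPI_Comm_dup creates a communicator with the same processes but a new,
       system-assigned context, so messages sent on mylib_comm can never be
       matched by sends and receives on the caller's communicator */
    MPI_Comm_dup(comm, &mylib_comm);
}

/* internal library communication would then use mylib_comm, e.g.:
       MPI_Send(buf, n, MPI_DOUBLE, partner, tag, mylib_comm);        */

void mylib_finalize(void)
{
    MPI_Comm_free(&mylib_comm);
}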
4.2 Communicators and process groups
In addition to development of parallel libraries, communicators are useful in organizing communication within an application.
Up to this point, we've looked at communicators that include all processes in the application. But the programmer can also define a subset of processes, called a process group, and attach one or more communicators to the process group. Communication specifying that communicator is now restricted to those processes.
In the example below, the communication pattern is a 2-D mesh. Each of the six boxes represents a process. Each process must exchange data with its upper and lower, and right and left neighbors. Coding this communication is simpler if the processes are grouped by column (for up/down communication) and row (for right/left communication). So, each process belongs to three communicators, which are indicated by the words in that process's box: one communicator for all processes (the default world communicator), one communicator for its row, and one communicator for its column. The communicators are as follows:
world communicator -- all processes -- uncolored in diagram
comm1 -- communicator for row 1 -- yellow in diagram
comm2 -- communicator for row 2 -- purple in diagram
comm3 -- communicator for column 1 -- pink in diagram
comm4 -- communicator for column 2 -- green in diagram
comm5 -- communicator for column 3 -- blue in diagram
This also ties directly into use of collective communications (covered in MPI Collective Communication I).
To revisit the bill collection analogy: one person may have an account with the electric and phone companies (2 communicators) but not with the water company. The electric communicator may contain different people than the phone communicator. A person's ID number (rank) may vary with the utility (communicator). So, it is critical to note that the rank given as message source or destination is the rank in the specified communicator.
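For instance, the row and column communicators of the 2 x 3 mesh above could be created with MPI_Comm_split. The following C sketch assumes exactly six processes and is meant only as an illustration; it also shows that a process's rank differs from communicator to communicator.

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int world_rank, row, col, row_rank, col_rank;
    MPI_Comm row_comm, col_comm;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

    /* lay the six processes out as a 2-row by 3-column mesh */
    row = world_rank / 3;        /* 0 or 1     */
    col = world_rank % 3;        /* 0, 1, or 2 */

    /* processes passing the same "color" end up in the same new communicator */
    MPI_Comm_split(MPI_COMM_WORLD, row, world_rank, &row_comm);
    MPI_Comm_split(MPI_COMM_WORLD, col, world_rank, &col_comm);

    MPI_Comm_rank(row_comm, &row_rank);
    MPI_Comm_rank(col_comm, &col_rank);
    printf("world rank %d: row rank %d, column rank %d\n",
           world_rank, row_rank, col_rank);

    MPI_Comm_free(&row_comm);
    MPI_Comm_free(&col_comm);
    MPI_Finalize();
    return 0;
}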
Although MPI provides an extensive, and sometimes complex, set of calls, you can begin with just the six basic calls: MPI_INIT, MPI_COMM_SIZE, MPI_COMM_RANK, MPI_SEND, MPI_RECV, and MPI_FINALIZE.
However, for programming convenience and optimization of your code, you should consider using other calls, such as those described in the more advanced talks which come after this one.
Communicators guarantee unique message spaces. In conjunction with process groups, they can be used to limit communication to a subset of processes.
References

Message Passing Interface Forum (1995) MPI: A Message-Passing Interface Standard. June 12, 1995. Available online or in PostScript.
IBM provides man pages for all MPI routines and constants. The routine name must be given with the capitalization used in the C binding. For example, to read the man page for the routine that initializes the environment, you must enter:
man MPI_Init
IBM's manual for its implementation of MPI is available online.
A sampling of programs available on the Web that
illustrate commands covered in this module.
Take a multiple-choice quiz on this material, and submit it for grading.
EXTRA CREDIT: Take a multiple-choice quiz on this material, and submit
it for grading.
Basics of MPI Programming: Lab
Please complete this short evaluation form. Thank you!