Fox Presentation Fall 1995
	The Message Passing Interface MPI for
	CPS 615 -- Computational Science in
	Simulation Track
	October 1, 1995
	Updated Oct 31 1996
		Geoffrey Fox
		NPAC
		Syracuse University
		111 College Place
		Syracuse NY 13244-4100
	Abstract of MPI -- The Message Passing Interface -- Presentation
		This covers MPI from a user's point of view and is meant to be a supplement to other online resources from MPI Forum, David Walker's Tutorial, Ian Foster's "Designing and Building Parallel Programs", Gropp,Lusk and Skjellum "Using MPI"
		An Overview is based on subset of 6 routines that cover send/receive, environment inquiry (for rank and total number of processors) initialize and finalization with simple examples
		Processor Groups, Collective Communication and Computation and Derived Datatypes are also discussed
	MPI Overview -- Comparison with HPF -- I
		MPI collected ideas from many previous message passing systems and put them into a "standard" so we could write portable (runs on all current machines) and scalable (runs on future machines we can think of) parallel software
		MPI agreed May 1994 after a process that began with a workshop in April 1992
		MPI plays same role to message passing systems that HPF does to data parallel languages
		BUT whereas MPI has essentially all one could want -- as message passing fully understood
		HPF will still evolve as many unsolved data parallel compiler issues
			e.g. HPC++ -- the C++ version version of HPF still uncertain 
			and there is no data parallel version of C due to pointers (C* has restrictions)
			HPJava is our new idea
			whereas MPI fine with Fortran C or C++ and even Java 
	MPI Overview -- Comparison with HPF -- II
		HPF runs on SIMD and MIMD machines and is high level as it expresses a style of programming or problem architecture
		MPI runs on MIMD machines (in principle it could run on SIMD but unnatural and inefficient) -- it expresses a machine architecture
		Traditional Software Model is
			Problem --> High Level Language --> Assembly Language --> Machine
			--------      Expresses Problem -- Expresses Machine
		So in this analogy MPI is universal "machine-language" of Parallel processing
	Some Key Features of MPI
		Point to Point Message Passing
		Collective Communication -- messages between >2 simultaneous processes
		Support for Process Groups -- messaging in subsets of processors
		Support for communication contexts -- general specification of message labels and ensuring unique to a set of routines as in a precompiled library
			Note a communicator incorporates contexts and groups as single data structure and defines the scope of communication operation
		Support for application (virtual) topologies analogous to distribution types in HPF
		Inquiry routines to find out about environment such as number of processors
	Some Difficulties with MPI
		Kitchen Sink has 129 functions and each has many arguments
			This completeness is strength and weakness!
			Hard to implement efficiently and hard to learn
		It is not  a complete operating environment and does not have ability to create and spawn processes etc.
		PVM is previous dominant approach
			It is very simple with much less functionality than MPI
			However it runs on essentially all machines including workstation clusters
			Further it is complete albeit simple operating environment
		MPI outside distributed computing world with HTTP of the Web, ATM protocols and systems like ISIS from Cornell 
		However it does look as though MPI is being adopted as general messaging system by parallel computer vendors 
	Why use Processor Groups?
		We find a good example when we consider typical Matrix Algorithm
		(matrix multiplication)
		A i,j = Sk B i,k C k,j
		summed over k'th column of B and k'th row of C
		Consider a square decomposition of 16 by 16 matrices B and C as for Laplace's equation. (Good Choice)
		Each operation involvea a subset(group) of 4 processors
	MPI Conventions
		All MPI routines are prefixed by MPI_
			C is always MPI_Xnnnnn(parameters) : C is case sensitive
			Fortran is case insensitive but we will write MPI_XNNNNN(parameters)
		MPI constants are in upper case as MPI datatype MPI_FLOAT for floating point number in C
		Specify overall constants with
			#include "mpi.h" in C programs
			      include "mpif.h" in Fortran
		C routines are actually integer functions and always return a status (error) code
		Fortran routines are really subroutines and have returned status code as argument
			Please check on status codes although this is often skipped!
	Standard Constants in MPI
		There a set of predefined constants in include files for each language and these include:
		MPI_SUCCESS -- succesful return code
		MPI_COMM_WORLD (everything) and MPI_COMM_SELF(current process) are predefined reserved communicators in C and Fortran
		Fortran elementary datatypes are:
		MPI_INTEGER, MPI_REAL, MPI_DOUBLE_PRECISION, MPI_COMPLEX, MPI_DOUBLE_COMPLEX, MPI_LOGICAL, MPI_CHARACTER, MPI_BYTE, MPI_PACKED
		C elementary datatypes are:
		MPI_CHAR, MPI_SHORT, MPI_INT, MPI_LONG, MPI_UNSIGNED_CHAR, MPI_UNSIGNED_SHORT, MPI_UNSIGNED, MPI_UNSIGNED_LONG, MPI_FLOAT, MPI_DOUBLE, MPI_LONG_DOUBLE, MPI_BYTE, MPI_PACKED
	The Six Fundamental MPI routines
		call MPI_INIT(mpierr) -- initialize
		call MPI_COMM_RANK (comm,rank,mpierr) -- find processor label (rank) in group
		call MPI_COMM_SIZE(comm,size,mpierr) -- find total number of processors
		call MPI_SEND (sndbuf,count,datatype,dest,tag,comm,mpierr) -- send a message
		call MPI_RECV (recvbuf,count,datatype,source,tag,comm,status,mpierr) -- receive a message
		call MPI_FINALIZE(mpierr) -- End Up
	MPI_Init -- Environment Management
		This MUST be called to set up MPI before any other MPI routines may be called
		For C: int MPI_Init(int *argc, char **argv[])
			argc and argv[] are conventional C main routine arguments
			As usual MPI_Init returns an error handle
		For Fortran: call MPI_INIT(mpierr)
			nonzero (more pedantically values not equal to MPI_SUCCESS) values of mpierr represent errors
	MPI_Comm_rank -- Environment Inquiry
		This allows you to identify each processor by a unique integer called the rank which runs from 0 to N-1 where there are N processors
		If we divide the region 0 to 1 by domain decomposition into N parts, the processor with rank r controls 
			region: r/N to (r+1)/N
		for C:int MPI_Comm_rank(MPI_Comm comm, int *rank)
			comm is an MPI communicator of type MPI_Comm
		for FORTRAN: call MPI_COMM_RANK (comm,rank,mpierr)
	MPI_Comm_size -- Environment Inquiry
		This returns in integer size number of processes in given communicator comm (remember this specifies processor group)
		For C: int MPI_Comm_size(MPI_Comm comm,int *size)
		For Fortran: call MPI_COMM_SIZE (comm,size,mpierr)
			where comm, size, mpierr are integers
			comm is input; size mpierr returned
	MPI_Finalize -- Environment Management
		Before exiting, an MPI application it is courteous to clean up the MPI state and MPI_FINALIZE does this. No MPI routine may be called in a given process after that process has called MPI_FINALIZE
		for C: int MPI_Finalize()
		for Fortran: MPI_FINALIZE(mpierr)
			mpierr is an integer
	Hello World in C plus MPI
		#include <stdio.h>
		#include <mpi.h>
		void main(int argc,char *argv[])
		{
			int ierror, rank, size
			MPI_Init(&argc, &argv); # Initialize
			MPI_Comm_rank(MPI_COMM_WORLD, &rank); # Find Processor Number
			if( rank == 0)
				printf ("hello World!\n");
			ierror=MPI_Comm_size(MPI_COMM_WORLD, &size);
			if( ierror  != MPI_SUCCESS ) # Find Total number of processors above
				MPI_Abort(MPI_COMM_WORLD,ierror); # Abort
			printf("I am processor %d out of total of %d\n", rank, size);
			MPI_Finalize(); # Finalize
		}
	Comments on Parallel Input/Output - I
		Parallel I/O has technical issues -- how best to optimize access to a file whose contents may be stored on N different disks which can deliver data in parallel and
		Semantic issues -- what does printf in C (and PRINT in Fortran) mean?
		The meaning of printf/PRINT is both undefined and changing
			In my old Caltech days, printf on a node of a parallel machine was a modification of UNIX which automatically transferred data from nodes to "host" and produced a single stream
			In those days, full UNIX did not run on every node of machine
			We introduced new UNIX I/O modes (singular and multiple) to define meaning of parallel I/O and I thought this was a great idea but it didn't catch on!!
	Comments on Parallel Input/Output - II
		Today, memory costs have declined and ALL mainstream MIMD distributed memory machines whether clusters of workstations or integrated systems such as T3D/ Paragon/ SP-2 have enough memory on each node to run UNIX 
		Thus printf today means typically that the node on which it runs will stick it out on "standard output" file for that node
			However this is implementation dependent
			e.g. If you want a stream of output with information in order
				Say that from node 0, then node 1, then node 2 etc.
				This was default on old Caltech machines but
			In general you need to communicate information from nodes 1 to N-1 to node 0 and let node 0 sort it and output in required order
		New MPI-IO initiative will link I/O to MPI in a standard fashion
	Review of Message Passing Paradigm
		Data is propagated between processors via messages which can be divided into packets but at MPI level we only see logically single complete messages
		The building block is Point to Point Communication with one processor sending information and one other receiving it
		Collective communication involves more than one message
			Broadcast of one processor to group of other processors (call a multicast on Internet when restricted to group)
			Synchronization or barrier
			Exchange of Information between processors
			Reduction Operation such as a Global Sum
		Collective Communication can ALWAYS be implemented in terms of elementary point to point communications but is provided
			for user convenience and
			often the best algorithm is hardware dependent and "official" MPI collective routines ought to be faster than portable implementation in terms of point to point primitive routines.
	Basic Point to Point Message Passing I
		Communication is between two processors and receiving process must expect a message although can be uncertain asto message type and sending process
		Information required to specify a message includes
			Identification of sender process(or)
			Identification of destination process
			Type of data -- Integers, Floats, Characaters -- note floating point numbers may need conversion if sending between machines of different type
			Number of data elements to send and in destination process, maximum size of message that can be received
			Where the data lives in sending process -- typically a pointer
			Where the received data should be stored into -- again a pointer
	Basic Point to Point Message Passing II
		Two types of communication operations applicable to send and receive
			Blocking -- sender and receiver wait until communication is complete
			Non-blocking -- send and receive handled concurrently with computation -- MPI send and receive routines return immediately before message processed so processor can do other things
		In addition four types of send operation
			Standard -- A send may be initiated even if matching receive has not been initiated
			Synchronous -- Waits for recipent to complete and so message delivery guaranteed
			Buffered -- Copies message into a buffer (if necessary) on destination processor so that completion of send independent of matching receive 
			Ready -- A send is only started when matching receive  initiated
	Blocking Send MPI_Send(C) MPI_SEND(Fortran)
		call MPI_SEND (
			IN message             start address of data to send
			IN message_len     number of items (length in bytes
			                                determined by type)
			IN datatype             type of each data element
			IN dest_rank           Processor number (rank) of destination
			IN message_tag     tag of message to allow receiver to filter
			IN communicator    Communicator of both sender and receiver group
			OUT error_message)       Error Flag (absent in C)
		Fortran example:
			integer count, datatype, dest, tag, comm, mpierr
			comm= MPI_COMM_WORLD
			tag=0
			datatype=MPI_REAL
		call MPI_SEND (sndbuf,count,datatype,dest,tag,comm,mpierr)
	Blocking Receive MPI_Recv(C) MPI_RECV(Fortran)
		call MPI_RECV(
			IN    start_of_buffer      Address of place to store data(address is INput
			                  values of data are of course output starting at this adddress!)
			IN    buffer_len              Maximum number of items allowed
			IN    datatype                Type of each data type
			IN    source_rank          Processor number (rank) of source
			IN    tag                          only accept messages with this tag value
			IN   communicator       Communicator of both sender and receiver group
			OUT return_status       Data structure describing what happened!
			OUT error_message)       Error Flag (absent in C)
		Note that return_status is used after completion of receive to find actual received length (buffer_len is a MAXIMUM), actual source processor rank and actual message tag
		In C syntax is
		int error_message=MPI_Recv(void *start_of_buffer,int buffer_len, MPI_DATATYPE datatype, int source_rank, int tag, MPI_Comm communicator, MPI_Status *return_status)
	Fortran example:Blocking Receive MPI_RECV
		integer status(MPI_STATUS_SIZE)  An array to store status
		integer mpierr, count, datatype, source, tag, comm
		integer recvbuf(100)
		count=100
		datatype=MPI_REAL
		comm=MPI_COMM_WORLD
		source=MPI_ANY_SOURCE   accept any source processor
		tag=MPI_ANY_TAG             accept anmy message tag
		call MPI_RECV (recvbuf,count,datatype,source,tag,comm,status,mpierr)
		
		Note source  and tag can be wild-carded
	Hello World:C Example of Send and Receive
		#include "mpi.h"
		main( int argc, char **argv )
		{
			char message[20];
			int i, rank, size, tag=137; # Any value of type allowed
			MPI_Status status;

			MPI_Init (&argc, &argv);
			MPI_Comm_size(MPI_COMM_WORLD, &size); # Number of Processors
			MPI_Comm_rank(MPI_COMM_WORLD, &rank); # Who is this processor
			if( rank == 0 )  {  # We are on "root" -- Processor 0
				strcpy(message,"Hello MPI World"); # Generate message
				for(i=1; i<size; i++)  # Send message to size-1 other processors
				MPI_Send(message, strlen(message)+1, MPI_CHAR, i, tag, MPI_COMM_WORLD);
			}
			else  
				MPI_Recv(message,20, MPI_CHAR, 0, tag, MPI_COMM_WORLD, &status);
			printf("This is a message from node %d saying %s\n", rank, message);
			MPI_Finalize();
		}
	Interpretation of Returned Message Status
		In C status is a structure of type MPI_Status
			status.source gives actual source process
			status.tag gives the actual message tag
		In Fortran the status is an integer array and different elements give:
			in status(MPI_SOURCE) the actual source process 
			in status(MPI_TAG) the actual message tag
		In C and Fortran, the number of elements (called count) in the message can be found from call to
		MPI_GET_COUNT (IN status, IN datatype, 
		OUT count, OUT error_message)
			where as usual in C last argument is missing as returned in function call
	Naming Conventions for Send and Receive
		SEND	              Blocking          Nonblocking
		Standard            MPI_Send        MPI_Isend
		Ready	              MPI_Rsend      MPI_Irsend
		Synchronous     MPI_Ssend      MPI_Issend
		Buffered             MPI_Bsend      MPI_Ibsend

		RECEIVE            Blocking          Nonblocking
		Standard            MPI_Recv         MPI_Irecv

		Any type of receive routine routine can be used to receive messages from any type of send routine
	Collective Communication
		MPI_BARRIER(comm)  Global Synchronization within a given communicator
		MPI_BCAST Global Broadcast
		MPI_GATHER Concatenate data from all processors in a communicator into one process
			MPI_ALLGATHER puts result of concatenation in all processors
		MPI_SCATTER takes data from one processor and scatters over all processors
		MPI_ALLTOALL sends data from all processes to all other processes
		MPI_SENDRECV exchanges data between two processors -- often used to implement "shifts"
			viewed as pure point to point by some
	Hello World:C Example of Broadcast
		#include "mpi.h"
		main( int argc, char **argv )
		{
			char message[20];
			int rank;

			MPI_Init (&argc, &argv);
			MPI_Comm_rank(MPI_COMM_WORLD, &rank); # Who is this processor
			if( rank == 0 )    # We are on "root" -- Processor 0
				strcpy(message,"Hello MPI World"); # Generate message
			# MPI_Bcast sends from root=0 and receives on all other processors
			MPI_Bcast(message,20, MPI_CHAR, 0, MPI_COMM_WORLD);
			printf("This is a message from node %d saying %s\n", rank,
			                                                                 message);
			MPI_Finalize();
		}
	Collective Computation
		One can often perform computing during a collective communication
		MPI_REDUCE performs reduction operation of type chosen from
			maximum(value or value and location), minimum(value or value and location), sum, product, logical and/or/xor, bit-wise and/or/xor
			e.g. operation labelled MPI_MAX stores in location result of processor rank the global maximum of original in each processor as in
			call MPI_REDUCE(original,result,1,MPI_REAL,MPI_MAX,rank,comm,ierror)
			One can also supply one's own reduction function
		MPI_ALLREDUCE is as MPI_REDUCE but stores result in all -- not just one -- processors
		MPI_SCAN performs reductions with result for processor r depending on data in processors 0 to r
	Examples of Collective Communication/Computation
		Four Processors where each has a send buffer of size 2
		  0               1              2               3      Processors
		(2,4)         (5,7)         (0,3)           (6,2) Initial Send Buffers

		MPI_BCAST with root=2
		(0,3)         (0,3)         (0,3)           (0,3) Resultant  Buffers

		MPI_REDUCE with action MPI_MIN and root=0 
		(0,2)         (_,_)         (_,_)           (_,_) Resultant  Buffers

		MPI_ALLREDUCE with action MPI_MIN and root=0 
		(0,2)         (0,2)         (0,2)           (0,2) Resultant  Buffers

		MPI_REDUCE with action MPI_SUM and root=1
		(_,_)       (13,16)       (_,_)           (_,_) Resultant  Buffers
	More Examples of Collective Communication/Computation 
		Four Processors where each has a send buffer of size 2
		  0               1              2               3      Processors
		(2,4)         (5,7)         (0,3)           (6,2) Initial Send Buffers

		MPI_SENDRECV with 0,1 and 2,3 paired
		(5,7)         (2,4)         (6,2)           (0,3) Resultant Buffers

		MPI_GATHER with root=0
		(2,4,5,7,0,3,6,2) (_,_) (_,_)         (_,_) Resultant  Buffers

		Four Processors where only rank=0 has send buffer
		(2,4,5,7,0,3,6,2) (_,_) (_,_)         (_,_) Initial send Buffers

		MPI_SCATTER with root=0
		(2,4)         (5,7)         (0,3)           (6,2) Resultant Buffers
	Examples of MPI_ALLTOALL
		All to All Communication with i'th location in j'th processor being sent to j'th location in i'th processor

		Processor    0                   1                      2                     3
		Start (a0,a1,a2,a3)  (b0,b1,b2,b3)  (c0,c1,c2,c3)  (d0,d1,d2,d3)
		After (a0,b0,c0,d0)  (a1,b1,c1,d1)  (a2,b2,c2,d2)  (a3,b3,c3,d3

		There are extensions MPI_ALLTOALLV to handle case where data stored in noncontiguous fashion in each processor and when each processor sends different amounts of data to other processors
		Many MPI routines have such "vector" extensions
	Motivation for Derived Datatypes in MPI
		These are an elegant solution to a problem we struggled with a lot in the early days -- all message passing is naturally built on buffers holding contiguous data
		However often (usually) the data is not stored contiguously. One can address this with a set of small MPI_SEND commands but we want messages to be as big as possible as latency is so high
		One can copy all the data elements into a single buffer and transmit this but this is tedious for the user and not very efficient
		It has extra memory to memory copies which are often quite slow
		So derived datatypes can be used to set up arbitary memory templates with variable offsets and primitive datatypes. Derived datatypes can then be used in "ordinary" MPI calls in place of primitive datatypes MPI_REAL MPI_FLOAT etc.
	Derived Datatype Basics
		Derived Datatypes should be declared integer in Fortran and MPI_Datatype in C
		Generally have form { (type0,disp0), (type1,disp1) ... (type(n-1),disp(n-1)) } with list of primitive data types typei and displacements (from start of buffer) dispi
		call MPI_TYPE_CONTIGUOUS (count, oldtype, newtype, ierr)
			creates a new datatype newtype made up of count repetitions of old datatype oldtype
		one must use call MPI_TYPE_COMMIT(derivedtype,ierr)
		before one can use the type derivedtype in a communication call 
		call MPI_TYPE_FREE(derivedtype,ierr) frees up space used by this derived type
	Simple Example of Derived Datatype
		integer derivedtype,  ...
		call MPI_TYPE_CONTIGUOUS(10, MPI_REAL, derivedtype, ierr)
		call MPI_TYPE_COMMIT(derivedtype, ierr)
		call MPI_SEND(data, 1, derivedtype, dest,tag, MPI_COMM_WORLD, ierr)
		call MPI_TYPE_FREE(derivedtype, ierr)

		is equivalent to simpler single call
		call MPI_SEND(data, 10, MPI_REAL, dest, tag, MPI_COMM_WORLD, ierr)
		and each sends 10 contiguous real values at location data to process dest
	More Complex Datatypes MPI_TYPE_VECTOR/INDEXED
		MPI_TYPE_VECTOR (count,blocklen,stride,oldtype,newtype,ierr)
			IN count        Number of blocks to be added
			IN blocklen   Number of elements in block
			IN stride        Number of elements (NOT bytes) between start of each block
			IN oldtype     Datatype of each element
			OUT newtype Handle(pointer) for new derived type
		MPI_TYPE_INDEXED (count,blocklens,indices,oldtype,newtype,ierr)
			IN count        Number of blocks to be added
			IN blocklens  Number of elements in each block -- an array of length count
			IN indices     Displacements (an array of length count) for each block
			IN oldtype     Datatype of each element
			OUT newtype Handle(pointer) for new derived type
	Use of Derived Types in Jacobi Iteration with Guard Rings--I
		Assume each processor stores NLOC by NLOC set of grid points in an array PHI dimensioned PHI(NLOC2,NLOC2) with NLOC=NLOC+2 to establish guard rings
	Use of Derived Types in Jacobi Iteration with Guard Rings--II
		integer North,South,East,West 
		# These are the processor ranks of 4 nearest neighbors
		integer rowtype,coltype  # the new derived types
		# Fortran stores elements in columns contiguously
		#  (C has opposite convention!)
		call MPI_TYPE_CONTIGUOUS(NLOC, MPI_REAL, coltype, ierr)
		call MPI_TYPE_COMMIT(coltype,ierr)

		# rows (North and South) are not contiguous
		call MPI_TYPE_VECTOR(NLOC, 1, NLOC2, MPI_REAL, rowtype, ierr)
		call MPI_TYPE_COMMIT(rowtype,ierr)

		call MPI_SEND(array(2,2), 1, coltype, west,0,comm,ierr)
		call MPI_SEND(array(2,NLOC+1), 1, coltype, east,0,comm,ierr)
		call MPI_SEND(array(2,2), rowtype, north, 0,comm,ierr)
		call MPI_SEND(array(NLOC+1,2), 1, rowtype, south, 0,comm,ierr)
	
	Other Useful Concepts in MPI
		More general versions of MPI_?SEND and associated inquiry routines to see if messages have arrived. Use of these allows you to overlap communication and computation. In general this is not used even though more efficient
			Also use in more general asynchronous applications -- blocking routines are most natural in loosely syncronous communicate-compute cycles
		Application Topology routines allow to find rank of nearest neighbor processors as North,South,East,West in Jacobi iteration
		Packing and Unpacking of data to make single buffers -- derived datatypes are usually a more elegant approach to this 
		Communicators to set up subgroups of processors (remember matrix example) and to set up independent MPI universes as needed to build libraries so that messages generated by library do not interfere with those from other libraries or user code
			Historically (in my work) WOULD have been useful to distinguish debugging and application messages