CPS615 Foils -- 23 September 1995
CPS615 -- Base Course for the Simulation Track of Computational Science
Fall Semester 1995 -- Lecture Stream 3 -- The Laplace Example: Programming and Performance
Geoffrey Fox
NPAC, Room 3-131 CST, 111 College Place, Syracuse NY 13244-4100

Abstract of Laplace Example for CPS615

This takes Jacobi Iteration for Laplace's Equation in a 2D square and uses it to illustrate:
Programming in both Data Parallel (HPF) and Message Passing (MPI and a simplified syntax)
SPMD -- Single Program Multiple Data -- programming model
Stencil dependence of the parallel program and use of guard rings
Collective communication
Basic speed up, efficiency and performance analysis with edge over area dependence, and consideration of load imbalance and communication overhead effects

Parallel Computing Algorithms and Software -- Laplace Example

The Solution of Laplace's Equation

Solve Laplace's Equation, Laplacian(phi) = d2(phi)/dx2 + d2(phi)/dy2 = 0, on a rectangular domain with specified boundary conditions.
Use the simple iterative Jacobi algorithm:
phi(new) = ( phi(left) + phi(right) + phi(up) + phi(down) ) / 4

Discretized Form of Laplace's Equation on a Parallel Processor

256 Grid Points on a 16-Node Concurrent Processor: 16 Grid Points in each Processor
(Figure legend: phi is unknown at the internal points, phi is known on the boundary; the Update Stencil is shown)

Basic Structure of Domain to be Updated in Parallel Version
(X denotes phi values to be communicated)

Sequential and Introduction to Parallel Coding for the Laplace Example

SEQUENTIAL LAPLACE PROGRAMMING
JACOBI ITERATION IN ONE DIMENSION (constant in the y direction)

      DIMENSION PHINEW(NTOT), PHIOLD(NTOT)
      NTOT1 = NTOT-1
BEGIN TEST = 0
      DO 1 I = 2, NTOT1
      TEMP = 0.5 * (PHIOLD(I-1) + PHIOLD(I+1))
      TEST = AMAX1(TEST, ABS(TEMP-PHIOLD(I)))
1     PHINEW(I) = TEMP
      DO 2 I = 2, NTOT1
2     PHIOLD(I) = PHINEW(I)
      IF (TEST > CONVG) GO TO BEGIN ELSE STOP

SEQUENTIAL LAPLACE PROGRAMMING
JACOBI ITERATION IN TWO DIMENSIONS

      DIMENSION PHINEW(NTOT,NTOT), PHIOLD(NTOT,NTOT)
      NTOT1 = NTOT-1
BEGIN TEST = 0
      DO 1 I = 2, NTOT1
      DO 1 J = 2, NTOT1
      TEMP = 0.25 * (PHIOLD(I,J+1) + PHIOLD(I,J-1) + PHIOLD(I+1,J) + PHIOLD(I-1,J))
      TEST = AMAX1(TEST, ABS(TEMP-PHIOLD(I,J)))
1     PHINEW(I,J) = TEMP
      DO 2 I = 2, NTOT1
      DO 2 J = 2, NTOT1
2     PHIOLD(I,J) = PHINEW(I,J)
      IF (TEST > CONVG) GO TO BEGIN ELSE STOP

Approaches to Parallel Programming

Data Parallel -- typified by CMFortran and its generalization, High Performance Fortran, which we have discussed separately.
Typical data parallel Fortran statements are full array statements such as
      B = A1 + A2
      B = EOSHIFT(A, -1)
i.e. function operations on arrays representing the full data domain.
Message Passing -- typified by the later discussion of the Laplace example -- specifies specific machine actions, i.e. send a message between nodes, whereas the data parallel model is at a higher level as it (tries to) specify a problem feature.
Note: We are always using "data parallelism" at the problem level whether the software is "message passing" or "data parallel". Data parallel software is translated by a compiler into "machine language", which is typically message passing.

SPMD or SCMD -- Single Program (Code) Multiple Data

A particular way of using MIMD machines -- the DOMINANT successful use so far.
Each processor runs the same program but in general executes different instructions at different times.
We will later see that this corresponds to "loosely synchronous problems".
This is the style of the current program example -- note that although each node is doing roughly the same thing, i.e. updating grid points, each node is NOT at the same point in the update at each clock cycle.
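To make the SPMD model concrete, here is a minimal sketch (not from the original foils) of an SPMD node program in Fortran with MPI: every processor executes the identical code, but MPI_COMM_RANK returns a different PROCNUM on each node, so each node works on a different block of the NTOT grid points. The program and variable names are illustrative only.

      PROGRAM SPMDDEMO
C     Minimal SPMD sketch: the same program runs on every processor,
C     but PROCNUM differs, so each owns a different block of points.
      INCLUDE 'mpif.h'
      INTEGER IERR, PROCNUM, NPROC, NLOC, NTOT
      PARAMETER (NTOT = 256)
      CALL MPI_INIT(IERR)
      CALL MPI_COMM_RANK(MPI_COMM_WORLD, PROCNUM, IERR)
      CALL MPI_COMM_SIZE(MPI_COMM_WORLD, NPROC, IERR)
      NLOC = NTOT / NPROC
C     Processor PROCNUM owns global points PROCNUM*NLOC+1 .. (PROCNUM+1)*NLOC
      PRINT *, 'Processor', PROCNUM, ' of', NPROC, ' owns', NLOC
      CALL MPI_FINALIZE(IERR)
      END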
Data Parallel Programming for Laplace Example

Parallel Laplace Programming: Data Parallel for Jacobi Iteration in One Dimension

!HPF$ TEMPLATE WORLD(NTOT)                                               (1)
!HPF$ DISTRIBUTE WORLD(BLOCK)                                            (2)
      DIMENSION PHINEW(NTOT), PHIOLD(NTOT)
!HPF$ ALIGN PHINEW WITH WORLD                                            (3)
!HPF$ ALIGN PHIOLD WITH WORLD                                            (3)
      NTOT1 = NTOT-1
      NTOT2 = NTOT-2
BEGIN PHINEW(2:NTOT1) = 0.5 * (EOSHIFT(PHIOLD,1) + EOSHIFT(PHIOLD,-1))   (4)
      TEST = MAXVAL(ABS(PHINEW(2:NTOT1) - PHIOLD(2:NTOT1)))              (5)
      PHIOLD = PHINEW
      IF (TEST > CONVG) GO TO BEGIN ELSE STOP

Notes on HPF Implementation of Laplace Solver

(1) Defines a data world of NTOT entries
(2) Breaks up WORLD into equal parts in each processor, decomposed in one dimension
(3) Instructs that PHINEW and PHIOLD are aligned exactly with WORLD, i.e. PHINEW(I) and PHIOLD(I) are stored in the I'th position of WORLD
(4) EOSHIFT is "End-Off" SHIFT. The two calls shift PHIOLD by one to the left and to the right. The section indices take care of WORLD(1) and WORLD(NTOT) being boundary values. EOSHIFT (and the corresponding circular shift CSHIFT) are "standard" Fortran 90 array manipulation routines; they are not specific to parallel computing. We can write statement (4) as a direct array operation:
BEGIN PHINEW(2:NTOT1) = 0.5 * (PHIOLD(1:NTOT2) + PHIOLD(3:NTOT))
(5) Forms TEST as the maximum change in the absolute value of phi at any WORLD location. MAXVAL is a Fortran 90 intrinsic function.
NOTE: Any subscript n:m is an "array section" or set of indices

Message Passing Model Used for Parallel Programming for Laplace Example

Basic Message Passing Approach

The basic operations are SEND and RECEIVE, but there are a lot of subtleties:
How does program 2 know what message it is getting? Does it come from program 1 or another one?
Does program 1 get an acknowledgment (analogous to certified mail)?
Do we use system or application level implementations? -- this raises issues of protection and copying
Broadcasts, multicasts -- other collective or multiple message/source/destination communications
Do we have multiple processes or threads per processor? -- How do we address them?

Message Passing Overview
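As a minimal illustration of the two basic operations (a sketch, not from the original foils; the tag value and variable names are arbitrary), here is a point-to-point exchange in Fortran with MPI in which processor 0 sends one real value to processor 1. The message tag and communicator answer the "which message is this and from whom" questions raised above.

      PROGRAM SENDRECVDEMO
      INCLUDE 'mpif.h'
      REAL VALUE
      INTEGER IERR, PROCNUM, TAG, STATUS(MPI_STATUS_SIZE)
      CALL MPI_INIT(IERR)
      CALL MPI_COMM_RANK(MPI_COMM_WORLD, PROCNUM, IERR)
      TAG = 17
      IF (PROCNUM .EQ. 0) THEN
C        Processor 0 sends one REAL to processor 1 with message tag 17
         VALUE = 3.14
         CALL MPI_SEND(VALUE, 1, MPI_REAL, 1, TAG,
     &                 MPI_COMM_WORLD, IERR)
      ELSE IF (PROCNUM .EQ. 1) THEN
C        Processor 1 receives one REAL from processor 0; STATUS records
C        the actual source and tag of the message that arrived
         CALL MPI_RECV(VALUE, 1, MPI_REAL, 0, TAG,
     &                 MPI_COMM_WORLD, STATUS, IERR)
      END IF
      CALL MPI_FINALIZE(IERR)
      END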
Parallel Laplace Programming: Set Up of Message Passing for Jacobi Iteration in One Dimension
(Processor 1 is shown as the "typical" node)

Node Program: Message Passing for Laplace Solver

      DIMENSION PHINEW(NLOC+2), PHIOLD(NLOC+2)
Initialization
      NLOC = NTOT/NPROC                  ! assume NTOT divisible by NPROC
      I1 = 2
      NLOC1 = NLOC + 1
      IF (PROCNUM.EQ.0) I1 = 3                    ! local point 2 is a fixed global boundary value
      IF (PROCNUM.EQ.NPROC-1) NLOC1 = NLOC1 - 1   ! local point NLOC+1 is a fixed global boundary value
BASIC LOOP
BEGIN TEST = 0
Shift to the right
      SOURCE = PHIOLD(NLOC1)             ! address of data to be sent
      IF (PROCNUM.EQ.NPROC-1) SOURCE = "DUMMY"
      DEST = PHIOLD(1)                   ! address where data is to be stored
      IF (PROCNUM.EQ.0) DEST = "DUMMY"
      CALL MPSHIFT(+1, SOURCE, DEST)
Shift to the left
      SOURCE = PHIOLD(I1)                ! address of data to be sent
      IF (PROCNUM.EQ.0) SOURCE = "DUMMY"
      DEST = PHIOLD(NLOC1+1)             ! address where data is to be stored
      IF (PROCNUM.EQ.NPROC-1) DEST = "DUMMY"
      CALL MPSHIFT(-1, SOURCE, DEST)
      DO 1 I = I1, NLOC1
      TEMP = 0.5 * (PHIOLD(I-1) + PHIOLD(I+1))
      TEST = AMAX1(TEST, ABS(TEMP-PHIOLD(I)))
1     PHINEW(I) = TEMP
      DO 2 I = I1, NLOC1
2     PHIOLD(I) = PHINEW(I)
      CALL GLOBALMAX(TEST)
      IF (TEST > CONVG) GO TO BEGIN ELSE STOP

Collective Communication Primitives

The example uses three idealized primitives (not present in real message passing systems):
Message passing shift to the right, MPSHIFT(+1, SOURCE, DEST):
Sends 1 word in location SOURCE to the processor on the right
Receives a word into location DEST from the processor on the left
SOURCE and DEST are locations -- if set to "DUMMY", then no information is to be sent or received
Message passing shift to the left, MPSHIFT(-1, SOURCE, DEST):
Sends 1 word in SOURCE to the processor on the left
Receives a word into DEST from the processor on the right
GLOBALMAX(TEST):
takes TEST from all processors
forms TESTMAX = the maximum value of TEST over all processors
replaces TEST by TESTMAX in all processors

Implementation of MPSHIFT(+1, SOURCE, DEST)

Consider, for example, the shift to the right. Then
processors 0 .. NPROC-2 send to the right, and
processors 1 .. NPROC-1 receive from the left.
We can't necessarily send all messages and then receive all messages: a "standard" send can be "blocking", i.e. it will not return until the receive has been completed by the destination processor. In this case we can deadlock, as all processors "hang" after the send, waiting for a receive.
So we are more careful.

Possible Implementation of MPSHIFT in MPI

Let PROCNUM be the processor "rank" (its number or order in the one dimensional decomposition).

      IPAR = MOD(PROCNUM, 2)
      IF (IPAR.EQ.0) THEN       ! even processors: send first, then receive
        IF (SOURCE.NE."DUMMY") CALL MPI_SEND(SOURCE,1,MPI_REAL,PROCNUM+1,tag,comm,ierr)
        IF (DEST.NE."DUMMY")   CALL MPI_RECV(DEST,1,MPI_REAL,PROCNUM-1,tag,comm,status,ierr)
      ELSE                      ! odd processors: receive first, then send
        IF (DEST.NE."DUMMY")   CALL MPI_RECV(DEST,1,MPI_REAL,PROCNUM-1,tag,comm,status,ierr)
        IF (SOURCE.NE."DUMMY") CALL MPI_SEND(SOURCE,1,MPI_REAL,PROCNUM+1,tag,comm,ierr)
      END IF

Note: MPI uses the reserved value MPI_PROC_NULL rather than "DUMMY". Also, we could remove the setting of "DUMMY" in the calling routine and place it in MPSHIFT as a test on PROCNUM.

Implementation of SHIFT in MPI

We can implement MPSHIFT directly in MPI as

      CALL MPI_SENDRECV(SOURCE,1,MPI_REAL,PROCNUM+1,sendtag,
     &                  DEST,1,MPI_REAL,PROCNUM-1,recvtag,comm,status,ierr)

Notes:
MPI_REAL denotes that the variable is real
"sendtag"/"recvtag" are, for this purpose, largely irrelevant additional message tags
"comm" is an extra system message tag defining the "scope" -- i.e. the set of processors involved -- here it is all of them
"status" tells you about the received data; you needn't look at it if you trust your code and hardware

Implementation of GLOBALMAX(TEST)

In MPI this is essentially a single call:

      CALL MPI_ALLREDUCE(TEST, TESTMAX, 1, MPI_REAL, MPI_MAX, comm, ierr)

The flag MPI_MAX specifies a global maximum, and the result TESTMAX is returned in every processor.
The implementation is quite subtle and usually involves a logarithmic tree. There is a clever extension to get the maximum in all processors in the "same" time as that on one processor on the hypercube and other architectures.
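To illustrate the logarithmic tree idea, here is a sketch (not from the original foils, and assuming the number of processors is a power of two, as on a hypercube) of GLOBALMAX written with explicit pairwise exchanges: at step K each processor swaps its current maximum with the partner whose rank differs in bit K, so after log2(NPROC) steps every processor holds the global maximum. In practice one would simply call MPI_ALLREDUCE.

      SUBROUTINE GLOBALMAX(TEST)
C     Recursive doubling sketch of the idealized GLOBALMAX primitive.
C     Assumes NPROC is a power of two.
      INCLUDE 'mpif.h'
      INTEGER PROCNUM, NPROC, PARTNER, K, IERR
      INTEGER STATUS(MPI_STATUS_SIZE)
      REAL TEST, OTHER
      CALL MPI_COMM_RANK(MPI_COMM_WORLD, PROCNUM, IERR)
      CALL MPI_COMM_SIZE(MPI_COMM_WORLD, NPROC, IERR)
      K = 1
      DO WHILE (K .LT. NPROC)
C        Exchange current maxima with the partner differing in bit K
         PARTNER = IEOR(PROCNUM, K)
         CALL MPI_SENDRECV(TEST,  1, MPI_REAL, PARTNER, 0,
     &                     OTHER, 1, MPI_REAL, PARTNER, 0,
     &                     MPI_COMM_WORLD, STATUS, IERR)
         TEST = AMAX1(TEST, OTHER)
         K = K * 2
      END DO
      RETURN
      END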
General Features of Laplace Example

What does the Laplace update calculation look like?
In the sequential code, a single processor updates in turn the internal points: of the 16x16 = 256 grid points, 60 are boundary points, leaving 196 internal points.
Each update, phi --> 0.25 * (phi(up) + phi(down) + phi(left) + phi(right)), involves 4 floating point operations per point.
In the parallel case, each processor updates in turn the points for which it is responsible -- this is the "owner computes" rule.
A corner processor updates nine points (the small internal points).
A "general" central processor updates sixteen points.

The Various Stencil Special Cases

There are several different cases, illustrated by the points A, B and C in the case of the corner processor.
In each case we superimpose the stencil on top of the point to be updated. The large circle must be placed on top of each point "owned" by the processor. The cases A, B and C differ because the points used in the stencil have different characteristics.

Communication Loads

For the corner processor we need to communicate a total of 6 points (marked with an X).
For a general processor, we need to communicate a total of 16 points (marked with an X).

What is the relation of sequential and parallel programming models?

In both cases, one is "solving" the "same" problem, phi --> 0.25 * (phi(left) + phi(right) + phi(up) + phi(down)), over a set of points.
In the sequential case, the problem is "solved" over the full domain.
In the parallel case, the problem is "solved" over part of the domain (roughly one sixteenth of it), i.e. each node is solving the same problem as the sequential case but over a different, smaller geometry.
In the sequential case, boundary conditions are "just" fixed values on the edge.
In the parallel case, boundary conditions are, in addition, communicated values.

More on SPMD Programming Model and Sequential/Parallel Comparison

Parallel and sequential cases differ in geometry and boundaries.
We also saw this in the "society" analogy with Hadrian's Wall.
This analysis implies the SPMD -- Single Program Multiple Data -- programming model, i.e. each node executes the same program but runs through different data.
Also note that not only does each node of the parallel machine run the same program, but this program is very similar to the sequential program as long as one isolates the boundary and geometry sections.
Note that this problem can be solved on a SIMD or MIMD computer. Thus this SPMD example is not totally general; there are SPMD examples which require MIMD machines.

Programming with Guard Rings - Sequential

In the sequential case, one could dimension the array PHI as PHI(14,14) to hold the updated points only. However, points on the edge would then need special treatment so that one uses boundary values in the update.
Rather, dimension PHI(16,16) to include both internal and boundary points.
Run the loops over x(I) and y(J) from 2 to 15 to cover only the internal points.
Preload the boundary values in PHI(1, . ), PHI(16, . ), PHI( . , 1) and PHI( . , 16).
This is easier and faster as there are no conditionals (IF statements) in the inner loops.
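A minimal sketch (not from the original foils) of that sequential guard-ring update for the 16x16 case; the boundary preload is assumed to have been done already, so the loops run cleanly from 2 to 15.

C     Sequential guard-ring update: PHI(16,16) holds boundary plus
C     internal points, so the loops need no IF tests at the edges.
      REAL PHI(16,16), PHINEW(16,16), TEMP
      INTEGER I, J
C     ... boundary values already preloaded in PHI(1, . ), PHI(16, . ),
C     PHI( . , 1) and PHI( . , 16) ...
      DO 10 J = 2, 15
      DO 10 I = 2, 15
      TEMP = 0.25 * (PHI(I,J+1) + PHI(I,J-1) + PHI(I+1,J) + PHI(I-1,J))
 10   PHINEW(I,J) = TEMP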
Programming with Guard Rings - Parallel

Similarly, we do not dimension PHI(4,4) but rather PHI(6,6) to hold the communicated values.
Now we preload PHI(1, . ), PHI( . , 1), PHI(6, . ) and PHI( . , 6) by communication, and then the basic "parallel" program update is identical to the sequential case: one loops over x(I) and y(J) with I and J in the range 2 to 5.
Because of "load imbalance" -- not all processors have the same number of internal points -- some differences are seen for "edge" processors, which have only 9 or 12 internal points.

Special Case of Corner Processor

For example, in the top right corner processor one needs to run I and J only from 2 to 4.

Analysis of Parallel Overheads, Efficiency and Speedup

General Formalism for Speed Up

We look at speed up for this simple case.
Suppose you run a problem on one processor -- the sequential version -- and let the execution time be T1.
Now run the "same" problem (the definition of "same" can be tricky, but it is not subtle here) on a parallel machine with P nodes whose C.P.U. is the same as the sequential machine's, and let the execution time be TP.
Then the speed up is S = T1/TP.
We would like S to be P. What is it in our case, and why?

What Affects Speed Up?

The speed up S is reduced from P for two reasons:
Load imbalance
Communication overhead
In some cases one also has effects from the parallel method being different from the sequential method, but that is not relevant here.

Load Imbalance and Speed-Up for Laplace Example -- I

We have processors with different amounts of work, as some update 16, some 12 and some 9 internal points.
Each update takes time tcalc = 4 tfloat, where tfloat is the time taken to do a floating point addition or multiplication.

Load Imbalance and Speed-Up for Laplace Example -- II

Processors doing 9 or 12 updates must wait for those doing 16 to finish.
T1 = 196 tcalc
T16 = 16 tcalc, as all that counts is the maximum amount of work done by any one processor.
The speed up degraded by load imbalance is Sload = 196/16 = 12.25.
This is the speed up ignoring degradation due to communications.

Analytical Analysis of Load Imbalance

Communication Overhead

Suppose communicating a single word -- here a phi value, probably stored in a 4 byte word -- takes time tcomm (a sketch of the resulting edge-over-area estimate is given at the end of this section).

Analytical Form of Speed Up for Communication Overhead

General Form of Efficiency

Communication to Calculation Ratio as a Function of Template

Performance for Increasing Stencil

The previous foil showed that increasing the stencil made only slight improvements!
This foil shows that larger stencils have much lower overheads (and hence better parallel performance) than the simple Laplace's equation with its 5 point stencil.

Matrix Multiplication on the Hypercube, Showing Linear Overhead Behavior for fC

Efficiency of QCD Physics Simulation on the JPL MarkIIIfp Hypercube

General Analysis of Overheads and Efficiency

Speed Up as a Function of Grain Size
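As a guide to the analytical foils above, here is a sketch of the standard edge-over-area estimate for the 5 point stencil. The grain size n (grid points per processor) is introduced here for illustration, and the constants are only indicative; the exact formulas on the original foils may differ.

Each processor holds n grid points in a square subdomain of side sqrt(n).
Calculation per iteration:   about 4 n tfloat        (4 floating point operations per update)
Communication per iteration: about 4 sqrt(n) tcomm   (one word for each of the roughly 4 sqrt(n) edge points)
Fractional communication overhead:  fC = (4 sqrt(n) tcomm) / (4 n tfloat) = tcomm / (sqrt(n) tfloat)
Efficiency:  e = 1 / (1 + fC)     Speed up:  S = e P
So the overhead falls as the grain size n grows -- communication scales with the edge while calculation scales with the area -- and S approaches P for large n.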