Designing Parallel Algorithms

- Partition: Divide the problem into tasks to expose potential parallelism
- Communicate: Determine the amount and pattern of data to share between tasks
- Agglomerate: Combine tasks to increase granularity and reduce communication
- Map: Assign agglomerated tasks to physical processors

Designing Parallel Programs in HPF

- Partition: Divide the problem into tasks
  - In HPF, look for array update operators: elemental computations, linear algebra
- Communicate: Determine the amount and pattern of communication
  - In HPF, consider which elements are combined: topological neighbors, data reductions
- Agglomerate: Combine tasks
  - In HPF, choose the data mapping: BLOCK for index locality, CYCLIC for load balance
- Map: Assign agglomerated tasks to physical processors
  - In HPF, trust the compiler

Jacobi Iteration

Jacobi Iteration: The Algorithm

The Problem
- Given a partial differential equation & boundary conditions
- Find the solution

The Approach
- Divide (continuous) space into a (discrete) grid
- Guess a solution on the grid
- Update the solution at every grid point
- Repeat the update until the solution doesn't change

Jacobi Iteration: Equations and Pictures

Jacobi Iteration: Partitioning

- At the most abstract level, Jacobi iteration is a sequential process.
- However, each step in the process is itself composed of many smaller operations.
- Conclusion: Jacobi iteration has plenty of data parallelism

Jacobi Iteration: Communication

- Element updates:
  - Each requires 4 values from the previous update step
  - Static, local communication
  - Generally the second-fastest kind (after no communication)
- Convergence test:
  - Uses all values from the latest update step
  - Static, global communication
  - Reduction operation
  - Efficient methods are known and encapsulated in libraries

Jacobi Iteration: Agglomeration

- Element updates require nearest neighbors
  - CYCLIC ⇒ communicate the entire array
  - BLOCK ⇒ least communication volume
  - (BLOCK,*) ⇒ move u(i-1,j), u(i+1,j) for all j
  - (BLOCK,BLOCK) ⇒ move u(ILOW-1,j), u(IHIGH+1,j) for all j, and u(i,JLOW-1), u(i,JHIGH+1) for all i
- Convergence test requires a whole-array reduction
  - Any distribution ⇒ static, structured communication
- The bottom line:
  - (BLOCK,*) on high-latency machines or small problem sizes
  - (BLOCK,BLOCK) on low-latency machines

Jacobi Iteration: HPF Program

      REAL u(0:nx,0:ny), unew(0:nx,0:ny), f(0:nx,0:ny)
!HPF$ DISTRIBUTE u(BLOCK,*)
!HPF$ ALIGN (:,:) WITH u(:,:) :: unew, f

      dx = 1.0/nx;  dy = 1.0/ny;  err = tol * 1e6
      FORALL ( i=0:nx, j=0:ny )
         f(i,j) = -2*(dx*i)**2 + 2*dx*i - 2*(dy*j)**2 + 2*dy*j
      END FORALL
      u = 0.0;  unew = 0.0

      DO WHILE (err > tol)
         FORALL ( i=1:nx-1, j=1:ny-1 )                      &
            unew(i,j) = (u(i-1,j)+u(i+1,j)+u(i,j-1)+        &
                         u(i,j+1)+f(i,j))/4
         err = MAXVAL( ABS(unew-u) )
         u = unew
      END DO

Jacobi Iteration: Mapping

This program is a piece of cake for the compiler.
- Allocate a portion of each array on every processor, based on DISTRIBUTE
- Apply the owner-computes rule analytically, based on the left-hand side
- Detect shift communication from dependence analysis of subscripts or pattern matching
- Recognize the MAXVAL intrinsic as reduction communication
- Place all communication directly outside the parallel construct where it occurs
- Number processors so that shifts do not cause contention

Gaussian Elimination

Gaussian Elimination: The Algorithm

The Problem
- Given N linear equations in N unknowns xi
- Find values of all xi that satisfy the equations

The Approach
- Use Eq. 1 to eliminate x1 from Eq. 2, 3, ..., N
- Use Eq. 2 to eliminate x2 from Eq. 3, ..., N
- ...
- Eq. N involves only xN ⇒ solve it!
- Work backwards to find xN-1, ..., x1 (sketched below)
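
The HPF program later in this section carries out only the elimination phase. Purely as an illustration of the "work backwards" step, here is a minimal back-substitution sketch in Fortran array syntax; it assumes, hypothetically (the slides do not show these), that the reduced upper-triangular coefficients remain in a(n,n), that a right-hand side b(n) was carried through the elimination, and that the unknowns are returned in x(n).

      ! Back substitution on the reduced upper-triangular system
      ! (a, b, x as assumed above; not part of the original program)
      x(n) = b(n) / a(n,n)
      DO k = n-1, 1, -1
         ! The SUM over the already-solved unknowns is a reduction
         x(k) = ( b(k) - SUM( a(k,k+1:n) * x(k+1:n) ) ) / a(k,k)
      END DO

Note that the k loop itself is sequential; only the reduction inside each step is parallel, which matches the later observation that the parallelism runs out in the last few stages.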
Gaussian Elimination: Pictures

Gaussian Elimination: Partitioning

- At the most abstract level, Gaussian elimination is a sequential process.
  - Need all elements in column k to find the pivot row
  - Need all elements in column k and the pivot row to perform the pivoting
- However, each step in the process is itself composed of many smaller operations.
  - Perform all element updates independently
- Conclusion: Gaussian elimination has plenty of data parallelism
  - Until the last few stages, anyway

Gaussian Elimination: Communication

- Pivot search:
  - Reduction along a column
  - Static, global communication
- Element updates:
  - Each requires its own old value plus elements from the pivot column & row
  - Static, global communication
  - Broadcast

Gaussian Elimination: Agglomeration

- Pivot selection requires a 1-D reduction
  - Distribute rows ⇒ parallel, with communication
  - Distribute columns ⇒ sequential, but no communication
- Element updates require the old value and elements from the pivot row and column
  - Distribute rows ⇒ parallel, but broadcast the pivot row
  - Distribute columns ⇒ parallel, but broadcast the pivot column
- Each stage works on a smaller contiguous region of the array
  - BLOCK ⇒ processors drop out of the computation
  - CYCLIC ⇒ work stays (fairly) evenly distributed until the end
- The bottom line:
  - (*,CYCLIC) if broadcast > pivoting one column
  - (CYCLIC,*) if broadcast < one column, synchronous communication
  - (CYCLIC,CYCLIC) if broadcast < one column, overlapped communication

Gaussian Elimination: HPF Program

      REAL a(n,n), tmp(n)
!HPF$ DISTRIBUTE a(CYCLIC,CYCLIC)
!HPF$ ALIGN tmp(i) WITH a(*,i)

      DO k = 1, n-1
         ! Select the pivot
         ipivot = MAXLOC( ABS(a(k:n,k)), DIM=1 ) + k - 1
         ! Swap the rows
         tmp(k:n) = a(ipivot,k:n)
         a(ipivot,k:n) = a(k,k:n)
         a(k,k:n) = tmp(k:n)
         ! Update the submatrix
         FORALL ( i=k+1:n, j=k+1:n )                        &
            a(i,j) = a(i,j) - a(i,k)/tmp(k)*tmp(j)
      END DO

Gaussian Elimination: Mapping

This program is harder for the compiler.
- Allocate a portion of the array on every processor, based on DISTRIBUTE
- Apply the owner-computes rule analytically, based on the left-hand side
- Detect communication from dependence analysis & intrinsics
- Here, it really pays to transform the program!
  - Reorder the computation to always precompute the next pivot column
  - Rearrange communication to pipeline the series of updates
  - Do broadcasts asynchronously
  - Net result: 2× speedup
- Use standard numbering for processors

Conjugate Gradient

Conjugate Gradient: The Algorithm

The Problem
- Given a partial differential equation & boundary conditions
- Find the solution

The Approach
- Divide (continuous) space into a (discrete) grid
- Guess a solution on the grid
- Estimate how the solution should change
- Move in that direction
- Repeat the estimate-and-move steps until the solution doesn't change

Conjugate Gradient: Equations and Pictures

Conjugate Gradient: More Equations

      u = (initial guess)
      r = f - A·u
      δ = max( |r| )
      i = 0;  ρ = 0
      WHILE (δ > ε) DO
         i = i + 1;  ρold = ρ
         ρ = r·r
         IF (i = 1) THEN p = r ELSE p = r + (ρ/ρold)·p
         q = A·p
         α = ρ / (p·q)
         u = u + α·p
         r = r - α·q
         δ = max( |r| )
      END WHILE
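
The HPF program later in this section declares the arrays and the a_times_vector interface for this method, but the slide stops before the iteration itself. Purely as a sketch, the loop above might be completed in HPF array syntax as follows; tol (playing the role of ε), rho_old, and alpha are hypothetical names not taken from the slides, and boundary handling is assumed to live inside a_times_vector.

      DO WHILE (err > tol)
         i = i + 1;  rho_old = rho
         rho = SUM( r*r )                     ! dot product: global reduction
         IF (i == 1) THEN
            p = r
         ELSE
            p = r + (rho/rho_old)*p           ! vector update: no communication
         END IF
         CALL a_times_vector( p, q )          ! sparse matrix-vector product
         alpha = rho / SUM( p*q )
         u = u + alpha*p
         r = r - alpha*q
         err = MAXVAL( ABS(r(1:n-1,1:n-1)) )  ! convergence test: global reduction
      END DO

Each line maps onto one of the communication classes discussed below: reductions for the dot products and the convergence test, no communication for the vector updates, and Jacobi-like local communication inside the matrix-vector product.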
Conjugate Gradient: Partitioning

- At the most abstract level, conjugate gradient has minimal parallelism
  - u, r, α can be updated independently
- The real potential parallelism is in the matrix and vector operations, however.
  - r·r is a reduction of size N
  - u + α·p is a vector update of size N
  - A·p is a (sparse) matrix-vector multiply, in this case of size O(N)
    - It looks a lot like the operator in Jacobi iteration
- Conclusions:
  - Stick to the matrix/vector operators
  - Task parallelism (and pipelining) may give some additional improvement

Conjugate Gradient: Communication

- Convergence test:
  - Global reduction
- Dot products:
  - Global reductions
- Vector updates:
  - Elementwise scaling and addition of two vectors
  - No communication if the vectors are aligned
- Matrix-vector multiplies:
  - Depends on the matrix (the PDE operator) in the problem
  - For the model problem, equivalent to the Jacobi iteration grid update
  - Static, local communication

Conjugate Gradient: Agglomeration

- Dot products and the convergence test always require global communication
  - No reason to pick one DISTRIBUTE over another
- Vector updates require no communication
  - Again, no reason to choose a particular DISTRIBUTE
- Matrix-vector multiply does care where its data come from
  - In this case, same advantages/disadvantages as Jacobi iteration
- The bottom line:
  - (BLOCK,*) on high-latency machines or small problem sizes
  - (BLOCK,BLOCK) on low-latency machines

Conjugate Gradient: HPF Program

      REAL u(0:n,0:n), r(0:n,0:n), p(0:n,0:n)
      REAL q(0:n,0:n), f(0:n,0:n)
!HPF$ DISTRIBUTE u(BLOCK,*)
!HPF$ ALIGN (:,:) WITH u(:,:) :: r, p, q, f

      INTERFACE
         SUBROUTINE a_times_vector( x, y )
            REAL, INTENT(IN)  :: x(:,:)
            REAL, INTENT(OUT) :: y(:,:)
!HPF$       DISTRIBUTE x *(BLOCK,*)
!HPF$       ALIGN y(:,:) WITH *x(:,:)
         END SUBROUTINE a_times_vector
      END INTERFACE

      u = 0.0
      r = f
      err = MAXVAL( ABS(r(1:n-1,1:n-1)) )
      i = 0;  rho = 0

Conjugate Gradient: Mapping

This program looks like a more complicated version of Jacobi iteration to the compiler.
- Allocate arrays based on DISTRIBUTE
- Apply the owner-computes rule
- Detect communication from dependence analysis and intrinsics
- Useful optimizations include aggregating communication and overlapping communication with computation
- All of this becomes more interesting if the program encapsulates operations in subroutines
  - The programmer must trade off efficiency against flexibility and maintainability

Irregular Mesh

Irregular Mesh Relaxation: The Algorithm

The Problem
- Given an irregular mesh of values
- Update each value using its neighbors in the mesh

The Approach
- Store the mesh as a list of edges
- Process all edges in parallel
  - Compute the contribution of each edge
  - Add it to one endpoint, subtract it from the other

Irregular Mesh: Sequential Program

      REAL x(nnode), flux(nedge)
      INTEGER iedge(nedge,2)

      err = tol * 1e6
      DO WHILE (err > tol)
         err = 0.0                 ! reset the error accumulator each sweep
         DO i = 1, nedge
            flux(i) = (x(iedge(i,1)) - x(iedge(i,2))) / 2
            err = err + flux(i)*flux(i)
         END DO
         DO i = 1, nedge
            x(iedge(i,1)) = x(iedge(i,1)) - flux(i)
            x(iedge(i,2)) = x(iedge(i,2)) + flux(i)
         END DO
         err = err / nedge
      END DO

Irregular Mesh: Partitioning & Communication

- Each iteration of the relaxation uses all the data computed in the previous step, plus the edge array
  - No parallelism at this level (does this sound familiar?)
- Instead, use the data-parallel edge and node updates
  - flux(i) = (x(iedge(i,1)) - x(iedge(i,2))) / 2
    - Independent, because edge values ≠ node values
  - x(iedge(i,1)) = x(iedge(i,1)) - flux(i);  x(iedge(i,2)) = x(iedge(i,2)) + flux(i)
    - Not independent, because sometimes iedge(iY,1) = iedge(iX,2)
    - Fortunately, HPF provides the SUM_SCATTER function (illustrated below)
- Communication is needed in both stages
  - Between edges and nodes to compute flux
  - Edge-node and node-node to compute x
- All communication is static and local with respect to the grid, but unstructured with respect to the array indices
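
For readers who have not met SUM_SCATTER before, here is a tiny, self-contained illustration of its behavior with made-up data (the values and names are hypothetical; the real use appears in the HPF program below). Colliding indices accumulate rather than overwrite, which is exactly what the node update needs.

      PROGRAM scatter_demo
         USE HPF_LIBRARY                        ! provides SUM_SCATTER
         REAL    :: x(4)    = 0.0
         REAL    :: flux(3) = (/ 1.0, 2.0, 3.0 /)
         INTEGER :: ind(3)  = (/ 2, 2, 4 /)
         ! Each flux(i) is added into x(ind(i)); flux(1) and flux(2)
         ! both land in x(2) and are summed there.
         x = SUM_SCATTER( flux, x, ind )
         PRINT *, x                             ! expected: 0.0  3.0  0.0  3.0
      END PROGRAM scatter_demo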
Irregular Mesh: Agglomeration I

Warning: Your compiler may do things differently!
- Computing edge values requires the edge list and node values
  - Distribute edges ⇒ parallel, no communication for edges
  - Replicate edges ⇒ sequential, or broadcast edge values
  - Distribute nodes ⇒ move "shared" endpoints
  - Replicate nodes ⇒ no movement for endpoints
- Updating node values requires the edge list, edge values, and node values
  - Distribute edges ⇒ parallel, no communication for edges
  - Replicate edges ⇒ sequential, no communication for edges
  - Distribute nodes ⇒ move "shared" endpoints
  - Replicate nodes ⇒ move all endpoints
- The bottom line, part I:
  - Always distribute edges
  - Distribute nodes unless the problem is very small

Irregular Mesh: Agglomeration II

Warning: Your compiler may do things differently!
- Computation is static, homogeneous, and over the full array (with respect to the edges)
  - No load-balancing issues
- Accesses to the node array are "nearest neighbor" in the mesh
  - This is not reflected in the index order!
  - Therefore, this does not favor either BLOCK or CYCLIC
- To minimize communication, the edge and node distributions must fit the mesh topology
  - HPF's regular distributions are not ideal for this
  - HPF 2.0 indirect distributions may be better, but they require careful construction
- The bottom line, part II:
  - No silver bullet
  - Order the nodes and edges to bring "close" entities together, then use BLOCK

Irregular Mesh: Pictures

Irregular Mesh: Bad Data Distribution

Irregular Mesh: Good Data Distribution

Irregular Mesh: HPF Program

      USE HPF_LIBRARY
      REAL x(nnode), flux(nedge)
      INTEGER iedge(nedge,2)
      INTEGER permute_node(nnode), permute_edge(nedge)
!HPF$ DISTRIBUTE x(BLOCK)
!HPF$ DISTRIBUTE flux(BLOCK)
!HPF$ ALIGN iedge(i,*) WITH flux(i)
!HPF$ ALIGN permute_edge(i) WITH flux(i)
!HPF$ ALIGN permute_node(i) WITH x(i)

      ! Renumber nodes and edges to bring "close" entities together
      CALL renumber_nodes( iedge, permute_node )
      x( permute_node(:) ) = x
      FORALL (i=1:nedge) iedge(i,:) = permute_node(iedge(i,:))
      permute_edge = GRADE_UP( iedge(:,1) )
      FORALL (i=1:nedge) iedge(i,:) = iedge(permute_edge(i),:)

      err = tol * 1e6
      DO WHILE (err > tol)
         flux = (x(iedge(1:nedge,1)) - x(iedge(1:nedge,2))) / 2
         x = SUM_SCATTER( -flux(1:nedge), x, iedge(1:nedge,1) )
         x = SUM_SCATTER(  flux(1:nedge), x, iedge(1:nedge,2) )
         err = SUM( flux*flux ) / nedge
      END DO

Irregular Mesh: Mapping

This program is going to be challenging for the compiler.
- Indexing of the arrays will be difficult
- The owner-computes rule is difficult to apply
- Key technique: inspector-executor communication
  - The first time the code executes, generate a table of the required communication at run time (the inspector)
    - Problem: How big does that table get?
    - Problem: How do you distribute that table efficiently to all processors?
  - Use this table to manage the unstructured communication until the communication pattern changes (the executor)
    - Problem: How do you know the pattern has changed?
- Commercial compilers are attacking these problems, but there is a long way to go.
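
To make the inspector idea concrete, here is a small, purely illustrative Fortran sketch; it is not from the slides and not how any particular compiler implements the technique. It assumes a BLOCK distribution of nodes and edges over nproc processors with hypothetical sizes and a made-up ring mesh, and it builds the table of off-processor node values each processor would have to fetch. An executor would reuse this table on every sweep while the mesh stays fixed.

      PROGRAM inspector_sketch
         IMPLICIT NONE
         INTEGER, PARAMETER :: nnode = 8, nedge = 8, nproc = 2
         INTEGER :: iedge(nedge,2), owner(nnode)
         LOGICAL :: fetch(nproc,nnode)   ! fetch(p,j): proc p needs remote node j
         INTEGER :: nblk, eblk, e, j, p, side

         ! A simple ring mesh: edge e joins node e and node MOD(e,nnode)+1
         DO e = 1, nedge
            iedge(e,1) = e
            iedge(e,2) = MOD(e,nnode) + 1
         END DO

         ! BLOCK ownership of nodes and edges
         nblk = (nnode + nproc - 1) / nproc
         eblk = (nedge + nproc - 1) / nproc
         DO j = 1, nnode
            owner(j) = (j-1)/nblk + 1
         END DO

         ! Inspector: record every endpoint that a processor's local edges
         ! reference but does not own
         fetch = .FALSE.
         DO e = 1, nedge
            p = (e-1)/eblk + 1               ! owner of edge e
            DO side = 1, 2
               j = iedge(e,side)
               IF (owner(j) /= p) fetch(p,j) = .TRUE.
            END DO
         END DO

         DO p = 1, nproc
            PRINT *, 'processor', p, 'must fetch', COUNT(fetch(p,:)),  &
                     'remote node values'
         END DO
      END PROGRAM inspector_sketch

How large this table gets, and how to distribute it, are exactly the problems listed above; the sketch only shows why the table exists.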