Designing Parallel Algorithms

- Partition: Divide the problem into tasks to expose potential parallelism
- Communicate: Determine the amount and pattern of data to share between tasks
- Agglomerate: Combine tasks to increase granularity and reduce communication
- Map: Assign agglomerated tasks to physical processors

Designing Parallel Programs in HPF

- Partition: Divide the problem into tasks
  - In HPF, look for array update operators: elemental computations, linear algebra
- Communicate: Determine the amount and pattern of communication
  - In HPF, consider which elements are combined: topological neighbors, data reductions
- Agglomerate: Combine tasks
  - In HPF, choose the data mapping: BLOCK for index locality, CYCLIC for load balance
- Map: Assign agglomerated tasks to physical processors
  - In HPF, trust the compiler

Jacobi Iteration

Jacobi Iteration: The Algorithm

The Problem
- Given a partial differential equation & boundary conditions
- Find the solution

The Approach
- Divide (continuous) space into a (discrete) grid
- Guess a solution on the grid
- Update the solution at every grid point
- Repeat the update until the solution doesn't change

Jacobi Iteration: Equations and Pictures

Jacobi Iteration: Partitioning

- At the most abstract level, Jacobi iteration is a sequential process.
- However, each step in the process is itself composed of many smaller operations.
- Conclusion: Jacobi iteration has plenty of data parallelism

Jacobi Iteration: Communication

- Element updates:
  - Each requires 4 values from the previous update step
  - Static, local communication
  - Generally the second-fastest kind (after no communication)
- Convergence test:
  - Uses all values from the latest update step
  - Static, global communication
  - Reduction operation
  - Efficient methods are known and encapsulated in libraries

Jacobi Iteration: Agglomeration

- Element updates require nearest neighbors
  - CYCLIC ⇒ communicate the entire array
  - BLOCK ⇒ least communication volume
  - (BLOCK,*) ⇒ move u(i-1,j), u(i+1,j) for all j
  - (BLOCK,BLOCK) ⇒ move u(ILOW-1,j), u(IHIGH+1,j) for all j, and u(i,JLOW-1), u(i,JHIGH+1) for all i
- Convergence test requires a whole-array reduction
  - Any distribution ⇒ static, structured communication
- The bottom line:
  - (BLOCK,*) on high-latency machines or small problem sizes
  - (BLOCK,BLOCK) on low-latency machines

Jacobi Iteration: HPF Program

      REAL u(0:nx,0:ny), unew(0:nx,0:ny), f(0:nx,0:ny)
!HPF$ DISTRIBUTE u(BLOCK,*)
!HPF$ ALIGN (:,:) WITH u(:,:) :: unew, f

      dx = 1.0/nx;  dy = 1.0/ny;  err = tol * 1e6
      FORALL ( i=0:nx, j=0:ny )
         f(i,j) = -2*(dx*i)**2 + 2*dx*i - 2*(dy*j)**2 + 2*dy*j
      END FORALL
      u = 0.0;  unew = 0.0

      DO WHILE (err > tol)
         FORALL ( i=1:nx-1, j=1:ny-1 )                      &
            unew(i,j) = (u(i-1,j)+u(i+1,j)+u(i,j-1)+        &
                         u(i,j+1)+f(i,j))/4
         err = MAXVAL( ABS(unew-u) )
         u = unew
      END DO

Jacobi Iteration: Mapping

This program is a piece of cake for the compiler.
- Allocate a portion of each array on every processor, based on DISTRIBUTE
- Apply the owner-computes rule analytically, based on the left-hand side
- Detect shift communication from dependence analysis of subscripts or pattern matching
- Recognize the MAXVAL intrinsic as reduction communication
- Place all communication directly outside the parallel construct where it occurs
- Number processors so that shifts do not cause contention

Gaussian Elimination

Gaussian Elimination: The Algorithm

The Problem
- Given N linear equations in N unknowns xi
- Find values of all xi that satisfy the equations

The Approach
- Use Eq. 1 to eliminate x1 from Eq. 2, 3, ..., N
- Use Eq. 2 to eliminate x2 from Eq. 3, ..., N
- ...
- Eq. N involves only xN ⇒ solve it!
- Work backwards to find xN-1, ..., x1 (sketched below)
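
The HPF program later in this section carries out only the elimination phase. Purely as an illustration of the "work backwards" step, here is a minimal back-substitution sketch in Fortran array syntax; it assumes, hypothetically (the slides do not show these), that the reduced upper-triangular coefficients remain in a(n,n), that a right-hand side b(n) was carried through the elimination, and that the unknowns are returned in x(n).

      ! Back substitution on the reduced upper-triangular system
      ! (a, b, x as assumed above; not part of the original program)
      x(n) = b(n) / a(n,n)
      DO k = n-1, 1, -1
         ! The SUM over the already-solved unknowns is a reduction
         x(k) = ( b(k) - SUM( a(k,k+1:n) * x(k+1:n) ) ) / a(k,k)
      END DO

Note that the k loop itself is sequential; only the reduction inside each step is parallel, which matches the later observation that the parallelism runs out in the last few stages.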
Gaussian Elimination: Pictures

Gaussian Elimination: Partitioning

- At the most abstract level, Gaussian elimination is a sequential process.
  - Need all elements in column k to find the pivot row
  - Need all elements in column k and the pivot row to perform the pivoting
- However, each step in the process is itself composed of many smaller operations.
  - Perform all element updates independently
- Conclusion: Gaussian elimination has plenty of data parallelism
  - Until the last few stages, anyway

Gaussian Elimination: Communication

- Pivot search:
  - Reduction along a column
  - Static, global communication
- Element updates:
  - Each requires its own old value plus elements from the pivot column & row
  - Static, global communication
  - Broadcast

Gaussian Elimination: Agglomeration

- Pivot selection requires a 1-D reduction
  - Distribute rows ⇒ parallel, with communication
  - Distribute columns ⇒ sequential, but no communication
- Element updates require the old value and elements from the pivot row and column
  - Distribute rows ⇒ parallel, but broadcast the pivot row
  - Distribute columns ⇒ parallel, but broadcast the pivot column
- Each stage works on a smaller contiguous region of the array
  - BLOCK ⇒ processors drop out of the computation
  - CYCLIC ⇒ work stays (fairly) evenly distributed until the end
- The bottom line:
  - (*,CYCLIC) if broadcast > pivoting one column
  - (CYCLIC,*) if broadcast < one column, synchronous communication
  - (CYCLIC,CYCLIC) if broadcast < one column, overlapped communication

Gaussian Elimination: HPF Program

      REAL a(n,n), tmp(n)
!HPF$ DISTRIBUTE a(CYCLIC,CYCLIC)
!HPF$ ALIGN tmp(i) WITH a(*,i)

      DO k = 1, n-1
         ! Select the pivot
         ipivot = MAXLOC( ABS(a(k:n,k)), DIM=1 ) + k - 1
         ! Swap the rows
         tmp(k:n) = a(ipivot,k:n)
         a(ipivot,k:n) = a(k,k:n)
         a(k,k:n) = tmp(k:n)
         ! Update the submatrix
         FORALL ( i=k+1:n, j=k+1:n )                        &
            a(i,j) = a(i,j) - a(i,k)/tmp(k)*tmp(j)
      END DO

Gaussian Elimination: Mapping

This program is harder for the compiler.
- Allocate a portion of the array on every processor, based on DISTRIBUTE
- Apply the owner-computes rule analytically, based on the left-hand side
- Detect communication from dependence analysis & intrinsics
- Here, it really pays to transform the program!
  - Reorder the computation to always precompute the next pivot column
  - Rearrange communication to pipeline the series of updates
  - Do broadcasts asynchronously
  - Net result: 2× speedup
- Use standard numbering for processors

Conjugate Gradient

Conjugate Gradient: The Algorithm

The Problem
- Given a partial differential equation & boundary conditions
- Find the solution

The Approach
- Divide (continuous) space into a (discrete) grid
- Guess a solution on the grid
- Estimate how the solution should change
- Move in that direction
- Repeat the estimate-and-move steps until the solution doesn't change

Conjugate Gradient: Equations and Pictures

Conjugate Gradient: More Equations

      u = (initial guess)
      r = f - A·u
      δ = max( |r| )
      i = 0;  ρ = 0
      WHILE (δ > ε) DO
         i = i + 1;  ρold = ρ
         ρ = r·r
         IF (i = 1) THEN p = r ELSE p = r + (ρ/ρold)·p
         q = A·p
         α = ρ / (p·q)
         u = u + α·p
         r = r - α·q
         δ = max( |r| )
      END WHILE
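
The HPF program later in this section declares the arrays and the a_times_vector interface for this method, but the slide stops before the iteration itself. Purely as a sketch, the loop above might be completed in HPF array syntax as follows; tol (playing the role of ε), rho_old, and alpha are hypothetical names not taken from the slides, and boundary handling is assumed to live inside a_times_vector.

      DO WHILE (err > tol)
         i = i + 1;  rho_old = rho
         rho = SUM( r*r )                     ! dot product: global reduction
         IF (i == 1) THEN
            p = r
         ELSE
            p = r + (rho/rho_old)*p           ! vector update: no communication
         END IF
         CALL a_times_vector( p, q )          ! sparse matrix-vector product
         alpha = rho / SUM( p*q )
         u = u + alpha*p
         r = r - alpha*q
         err = MAXVAL( ABS(r(1:n-1,1:n-1)) )  ! convergence test: global reduction
      END DO

Each line maps onto one of the communication classes discussed below: reductions for the dot products and the convergence test, no communication for the vector updates, and Jacobi-like local communication inside the matrix-vector product.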
Conjugate Gradient: Partitioning

- At the most abstract level, conjugate gradient has minimal parallelism
  - u, r, α can be updated independently
- The real potential parallelism is in the matrix and vector operations, however.
  - r·r is a reduction of size N
  - u + α·p is a vector update of size N
  - A·p is a (sparse) matrix-vector multiply, in this case of size O(N)
    - It looks a lot like the operator in Jacobi iteration
- Conclusions:
  - Stick to the matrix/vector operators
  - Task parallelism (and pipelining) may give some additional improvement

Conjugate Gradient: Communication

- Convergence test:
  - Global reduction
- Dot products:
  - Global reductions
- Vector updates:
  - Elementwise scaling and addition of two vectors
  - No communication if the vectors are aligned
- Matrix-vector multiplies:
  - Depends on the matrix (the PDE operator) in the problem
  - For the model problem, equivalent to the Jacobi iteration grid update
  - Static, local communication

Conjugate Gradient: Agglomeration

- Dot products and the convergence test always require global communication
  - No reason to pick one DISTRIBUTE over another
- Vector updates require no communication
  - Again, no reason to choose a particular DISTRIBUTE
- Matrix-vector multiply does care where its data come from
  - In this case, same advantages/disadvantages as Jacobi iteration
- The bottom line:
  - (BLOCK,*) on high-latency machines or small problem sizes
  - (BLOCK,BLOCK) on low-latency machines

Conjugate Gradient: HPF Program

      REAL u(0:n,0:n), r(0:n,0:n), p(0:n,0:n)
      REAL q(0:n,0:n), f(0:n,0:n)
!HPF$ DISTRIBUTE u(BLOCK,*)
!HPF$ ALIGN (:,:) WITH u(:,:) :: r, p, q, f

      INTERFACE
         SUBROUTINE a_times_vector( x, y )
            REAL, INTENT(IN)  :: x(:,:)
            REAL, INTENT(OUT) :: y(:,:)
!HPF$       DISTRIBUTE x *(BLOCK,*)
!HPF$       ALIGN y(:,:) WITH *x(:,:)
         END SUBROUTINE a_times_vector
      END INTERFACE

      u = 0.0
      r = f
      err = MAXVAL( ABS(r(1:n-1,1:n-1)) )
      i = 0;  rho = 0

Conjugate Gradient: Mapping

This program looks like a more complicated version of Jacobi iteration to the compiler.
- Allocate arrays based on DISTRIBUTE
- Apply the owner-computes rule
- Detect communication from dependence analysis and intrinsics
- Useful optimizations include aggregating communication and overlapping communication with computation
- All of this becomes more interesting if the program encapsulates operations in subroutines
  - The programmer must trade off efficiency against flexibility and maintainability

Irregular Mesh

Irregular Mesh Relaxation: The Algorithm

The Problem
- Given an irregular mesh of values
- Update each value using its neighbors in the mesh

The Approach
- Store the mesh as a list of edges
- Process all edges in parallel
  - Compute the contribution of each edge
  - Add it to one endpoint, subtract it from the other

Irregular Mesh: Sequential Program

      REAL x(nnode), flux(nedge)
      INTEGER iedge(nedge,2)

      err = tol * 1e6
      DO WHILE (err > tol)
         err = 0.0                 ! reset the error accumulator each sweep
         DO i = 1, nedge
            flux(i) = (x(iedge(i,1)) - x(iedge(i,2))) / 2
            err = err + flux(i)*flux(i)
         END DO
         DO i = 1, nedge
            x(iedge(i,1)) = x(iedge(i,1)) - flux(i)
            x(iedge(i,2)) = x(iedge(i,2)) + flux(i)
         END DO
         err = err / nedge
      END DO

Irregular Mesh: Partitioning & Communication

- Each iteration of the relaxation uses all the data computed in the previous step, plus the edge array
  - No parallelism at this level (does this sound familiar?)
- Instead, use the data-parallel edge and node updates
  - flux(i) = (x(iedge(i,1)) - x(iedge(i,2))) / 2
    - Independent, because edge values ≠ node values
  - x(iedge(i,1)) = x(iedge(i,1)) - flux(i);  x(iedge(i,2)) = x(iedge(i,2)) + flux(i)
    - Not independent, because sometimes iedge(iY,1) = iedge(iX,2)
    - Fortunately, HPF provides the SUM_SCATTER function (illustrated below)
- Communication is needed in both stages
  - Between edges and nodes to compute flux
  - Edge-node and node-node to compute x
- All communication is static and local with respect to the grid, but unstructured with respect to the array indices
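
For readers who have not met SUM_SCATTER before, here is a tiny, self-contained illustration of its behavior with made-up data (the values and names are hypothetical; the real use appears in the HPF program below). Colliding indices accumulate rather than overwrite, which is exactly what the node update needs.

      PROGRAM scatter_demo
         USE HPF_LIBRARY                        ! provides SUM_SCATTER
         REAL    :: x(4)    = 0.0
         REAL    :: flux(3) = (/ 1.0, 2.0, 3.0 /)
         INTEGER :: ind(3)  = (/ 2, 2, 4 /)
         ! Each flux(i) is added into x(ind(i)); flux(1) and flux(2)
         ! both land in x(2) and are summed there.
         x = SUM_SCATTER( flux, x, ind )
         PRINT *, x                             ! expected: 0.0  3.0  0.0  3.0
      END PROGRAM scatter_demo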
Irregular Mesh: Agglomeration I

Warning: Your compiler may do things differently!
- Computing edge values requires the edge list and node values
  - Distribute edges ⇒ parallel, no communication for edges
  - Replicate edges ⇒ sequential, or broadcast edge values
  - Distribute nodes ⇒ move "shared" endpoints
  - Replicate nodes ⇒ no movement for endpoints
- Updating node values requires the edge list, edge values, and node values
  - Distribute edges ⇒ parallel, no communication for edges
  - Replicate edges ⇒ sequential, no communication for edges
  - Distribute nodes ⇒ move "shared" endpoints
  - Replicate nodes ⇒ move all endpoints
- The bottom line, part I:
  - Always distribute edges
  - Distribute nodes unless the problem is very small

Irregular Mesh: Agglomeration II

Warning: Your compiler may do things differently!
- Computation is static, homogeneous, and over the full array (with respect to the edges)
  - No load-balancing issues
- Accesses to the node array are "nearest neighbor" in the mesh
  - This is not reflected in the index order!
  - Therefore, this does not favor either BLOCK or CYCLIC
- To minimize communication, the edge and node distributions must fit the mesh topology
  - HPF's regular distributions are not ideal for this
  - HPF 2.0 indirect distributions may be better, but they require careful construction
- The bottom line, part II:
  - No silver bullet
  - Order the nodes and edges to bring "close" entities together, then use BLOCK

Irregular Mesh: Pictures

Irregular Mesh: Bad Data Distribution

Irregular Mesh: Good Data Distribution

Irregular Mesh: HPF Program

      USE HPF_LIBRARY
      REAL x(nnode), flux(nedge)
      INTEGER iedge(nedge,2)
      INTEGER permute_node(nnode), permute_edge(nedge)
!HPF$ DISTRIBUTE x(BLOCK)
!HPF$ DISTRIBUTE flux(BLOCK)
!HPF$ ALIGN iedge(i,*) WITH flux(i)
!HPF$ ALIGN permute_edge(i) WITH flux(i)
!HPF$ ALIGN permute_node(i) WITH x(i)

      ! Renumber nodes and edges to bring "close" entities together
      CALL renumber_nodes( iedge, permute_node )
      x( permute_node(:) ) = x
      FORALL (i=1:nedge) iedge(i,:) = permute_node(iedge(i,:))
      permute_edge = GRADE_UP( iedge(:,1) )
      FORALL (i=1:nedge) iedge(i,:) = iedge(permute_edge(i),:)

      err = tol * 1e6
      DO WHILE (err > tol)
         flux = (x(iedge(1:nedge,1)) - x(iedge(1:nedge,2))) / 2
         x = SUM_SCATTER( -flux(1:nedge), x, iedge(1:nedge,1) )
         x = SUM_SCATTER(  flux(1:nedge), x, iedge(1:nedge,2) )
         err = SUM( flux*flux ) / nedge
      END DO

Irregular Mesh: Mapping

This program is going to be challenging for the compiler.
- Indexing of the arrays will be difficult
- The owner-computes rule is difficult to apply
- Key technique: inspector-executor communication
  - The first time the code executes, generate a table of the required communication at run time (the inspector)
    - Problem: How big does that table get?
    - Problem: How do you distribute that table efficiently to all processors?
  - Use this table to manage the unstructured communication until the communication pattern changes (the executor)
    - Problem: How do you know the pattern has changed?
- Commercial compilers are attacking these problems, but there is a long way to go.
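
To make the inspector idea concrete, here is a small, purely illustrative Fortran sketch; it is not from the slides and not how any particular compiler implements the technique. It assumes a BLOCK distribution of nodes and edges over nproc processors with hypothetical sizes and a made-up ring mesh, and it builds the table of off-processor node values each processor would have to fetch. An executor would reuse this table on every sweep while the mesh stays fixed.

      PROGRAM inspector_sketch
         IMPLICIT NONE
         INTEGER, PARAMETER :: nnode = 8, nedge = 8, nproc = 2
         INTEGER :: iedge(nedge,2), owner(nnode)
         LOGICAL :: fetch(nproc,nnode)   ! fetch(p,j): proc p needs remote node j
         INTEGER :: nblk, eblk, e, j, p, side

         ! A simple ring mesh: edge e joins node e and node MOD(e,nnode)+1
         DO e = 1, nedge
            iedge(e,1) = e
            iedge(e,2) = MOD(e,nnode) + 1
         END DO

         ! BLOCK ownership of nodes and edges
         nblk = (nnode + nproc - 1) / nproc
         eblk = (nedge + nproc - 1) / nproc
         DO j = 1, nnode
            owner(j) = (j-1)/nblk + 1
         END DO

         ! Inspector: record every endpoint that a processor's local edges
         ! reference but does not own
         fetch = .FALSE.
         DO e = 1, nedge
            p = (e-1)/eblk + 1               ! owner of edge e
            DO side = 1, 2
               j = iedge(e,side)
               IF (owner(j) /= p) fetch(p,j) = .TRUE.
            END DO
         END DO

         DO p = 1, nproc
            PRINT *, 'processor', p, 'must fetch', COUNT(fetch(p,:)),  &
                     'remote node values'
         END DO
      END PROGRAM inspector_sketch

How large this table gets, and how to distribute it, are exactly the problems listed above; the sketch only shows why the table exists.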