Element updates require nearest neighbors
-
CYCLIC ⇒ communicate entire array
-
BLOCK ⇒ least communication volume
-
(BLOCK,*) ⇒ move u(i-1,j), u(i+1,j) ∀ j
-
(BLOCK,BLOCK) ⇒ move u(ILOW-1,j), u(IHIGH+1,j) ∀ j, u(i,JLOW-1), u(i,JHIGH+1) ∀ i
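A rough count, assuming an nx x ny interior and P processors: with (BLOCK,*) each interior processor exchanges its two boundary rows, about 2*ny elements in 2 messages per iteration; with (BLOCK,BLOCK) on a sqrt(P) x sqrt(P) processor grid it exchanges about 2*(nx+ny)/sqrt(P) elements, but in 4 messages: less volume, more messages, which is the trade-off behind the bottom line below.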
|
Convergence test requires a whole-array reduction
-
Any distribution ⇒ static, structured communication
|
The bottom line
-
(BLOCK,*) on high-latency machines or small problem sizes
-
(BLOCK,BLOCK) on low-latency machines
|
REAL u(0:nx,0:ny), unew(0:nx,0:ny), f(0:nx,0:ny)
|
!HPF$ DISTRIBUTE u(BLOCK,*)
|
!HPF$ ALIGN (:,:) WITH u(:,:) :: unew, f
|
dx = 1.0/nx; dy = 1.0/ny; err = tol * 1e6
|
FORALL ( i=0:nx, j=0:ny )
-
f(i,j) = -2*(dx*i)**2+2*dx*i-2*(dy*j)**2+2*dy*j
|
END FORALL
|
u = 0.0; unew = 0.0
|
DO WHILE (err > tol)
-
FORALL ( i=1:nx-1, j=1:ny-1 ) &
-
unew(i,j) = (u(i-1,j)+u(i+1,j)+u(i,j-1)+ &
-
u(i,j+1)+f(i,j))/4
-
err = MAXVAL( ABS(unew-u) )
-
u = unew
|
END DO
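To try the (BLOCK,BLOCK) mapping recommended for low-latency machines, only the distribution directive changes; unew and f follow automatically through the ALIGN:
!HPF$ DISTRIBUTE u(BLOCK,BLOCK)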
|
Pivot selection requires a 1-D reduction
-
Distribute rows ⇒ parallel, with communication
-
Distribute columns ⇒ sequential, but no communication
|
Element updates require the old value and elements from the pivot row and column
-
Distribute rows ⇒ parallel, but broadcast the pivot row
-
Distribute columns ⇒ parallel, but broadcast the pivot column
|
Each stage works on a smaller contiguous region of the array
-
BLOCK ⇒ processors drop out of the computation
-
CYCLIC ⇒ work stays (fairly) evenly distributed until the end
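For example, with n = 1000 on 4 processors: at stage k = 750, a BLOCK row distribution leaves three of the four processors idle (the active rows 751:1000 all live on the last processor), while CYCLIC still gives each processor about 62 of the 250 active rows.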
|
The bottom line
-
(*,CYCLIC) if broadcast > pivoting one column
-
(CYCLIC,*) if broadcast < one column, synchronous comm.
-
(CYCLIC,CYCLIC) if broadcast < one col., overlapped comm.
|
REAL a(n,n), tmp(n)
|
!HPF$ DISTRIBUTE a(CYCLIC,CYCLIC)
|
!HPF$ ALIGN tmp(i) WITH a(*,i)
|
DO k = 1, n-1
-
! Select the pivot
-
ipivot = MAXLOC( ABS(a(k:n,k)), DIM=1 ) + k - 1
-
! Swap the rows
-
tmp(k:n) = a(ipivot,k:n)
-
a(ipivot,k:n) = a(k,k:n)
-
a(k,k:n) = tmp(k:n)
-
! Update the submatrix
-
FORALL ( i=k+1:n, j=k+1:n ) &
-
& a(i,j) = a(i,j) - a(i,k)/tmp(k)*tmp(j)
|
END DO
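Note that the pivot-row broadcast discussed above never appears explicitly: since tmp(i) is aligned with a(*,i), tmp is replicated along the first processor dimension, so the assignment tmp(k:n) = a(ipivot,k:n) is where a compiler would typically insert the broadcast of the pivot row.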
|
REAL u(0:n,0:n), r(0:n,0:n), p(0:n,0:n)
|
REAL q(0:n,0:n), f(0:n,0:n)
|
!HPF$ DISTRIBUTE u(BLOCK,*)
|
!HPF$ ALIGN (:,:) WITH u(:,:) :: r, p, q, f
|
INTERFACE
-
SUBROUTINE a_times_vector( x, y )
-
REAL, INTENT(IN) :: x(:,:)
-
REAL, INTENT(OUT) :: y(:,:)
-
!HPF$ DISTRIBUTE x *(BLOCK,*)
-
!HPF$ ALIGN y(:,:) WITH *x(:,:)
|
END SUBROUTINE a_times_vector
|
END INTERFACE
|
u = 0.0
|
r = f
|
err = MAXVAL( ABS(r(1:n-1,1:n-1)) )
|
i = 0; rho = 0
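Given this interface, the distributed arrays can be passed directly. A minimal sketch of the matrix-vector step inside the CG iteration, assuming a_times_vector applies the operator to p and returns the product in q:
CALL a_times_vector( p, q )
The descriptive mappings (the leading * in the dummy's DISTRIBUTE and ALIGN directives) assert that the actual arguments already arrive in this layout, so no remapping is needed at the call site.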
|
Each relaxation sweep uses all of the node values computed in the previous sweep, plus the edge array
-
No parallelism at this level
-
(Does this sound familiar?)
|
Instead, use the data-parallel edge and node updates
-
flux(i)=(x(iedge(i,1))-x(iedge(i,2)))/2
-
Independent because edge_val ≠ node_val
-
x(iedge(i,1)) = x(iedge(i,1)) - flux(i);
-
x(iedge(i,2)) = x(iedge(i,2)) + flux(i)
-
Not independent because sometimes iedge(i1,1) = iedge(i2,2) for i1 ≠ i2
-
Fortunately, HPF provides the SUM_SCATTER function
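SUM_SCATTER(array, base, indx) returns base with array(i) added into element indx(i), summing every contribution that lands on the same element. For example, with base = (/10., 20./), array = (/1., 2., 3./), and indx = (/1, 1, 2/), the result is (/13., 23./), so repeated endpoints are combined correctly.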
|
Communication needed in both stages
-
Between edges and nodes to compute flux
-
Edge-node and node-node to compute x
-
All communication is static, local with respect to grid, but unstructured with respect to array indices
|
Warning: Your compiler may do things differently!
|
Computing edge values requires edge list and node values
-
Distribute edges ⇒ parallel, no communication for edges
-
Replicate edges ⇒ sequential or broadcast edge values
-
Distribute nodes ⇒ move "shared" endpoints
-
Replicate nodes ⇒ no movement for endpoints
|
Updating node values requires edge list, edge values, and node values
-
Distribute edges ⇒ parallel, no communication for edges
-
Replicate edges ⇒ sequential, no communication for edges
-
Distribute nodes ⇒ move "shared" endpoints
-
Replicate nodes ⇒ move all endpoints
|
The bottom line, part I
-
Always distribute edges
-
Distribute nodes unless the problem is very small
|
Warning: Your compiler may do things differently!
|
Computation is static, homogeneous, and over full array (with respect to the edges)
|
Accesses to the node array are "nearest neighbor" in the mesh
-
This is not reflected in the index order!
-
∴ This does not favor either BLOCK or CYCLIC
|
To minimize communication, edge and node distributions must fit the mesh topology
-
HPF's regular distributions are not ideal for this
-
HPF 2.0 indirect distributions may be better, but require careful construction
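A sketch of the HPF 2.0 form, assuming a user-computed array map_node(nnode) whose i-th entry names the processor that should own node i:
INTEGER map_node(nnode)
!HPF$ DISTRIBUTE x(INDIRECT(map_node))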
|
The bottom line, part II
-
No silver bullet
-
Order the nodes and edges to bring "close" entities together, then use BLOCK
|
USE HPF_LIBRARY
|
REAL x(nnode), flux(nedge)
|
INTEGER iedge(nedge,2)
|
INTEGER permute_node(nnode), permute_edge(nedge)
|
!HPF$ DISTRIBUTE x(BLOCK)
|
!HPF$ DISTRIBUTE flux(BLOCK)
|
!HPF$ ALIGN iedge(i,*) WITH flux(i)
|
!HPF$ ALIGN permute_edge(i) WITH flux(i)
|
!HPF$ ALIGN permute_node(i) WITH x(i)
|
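! Renumber the nodes so that mesh neighbors get nearby indices
! (renumber_nodes is assumed to compute such a permutation, e.g. a
! bandwidth-reducing ordering), then apply it to x and to the edge list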
CALL renumber_nodes( iedge, permute_node )
|
x( permute_node(:) ) = x
|
FORALL (i=1:nedge) iedge(i,:) = permute_node(iedge(i,:))
|
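! Sort the edges by their first endpoint so that each edge is stored near the nodes it connects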
permute_edge = GRADE_UP( iedge(:,1) )
|
FORALL (i=1:nedge) iedge(i,:) = iedge(permute_edge(i),:)
|
err = tol * 1e6
|
DO WHILE (err > tol)
-
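! One sweep: compute a flux for every edge, then scatter the contributions back to the two endpoint nodes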
flux=(x(iedge(1:nedge,1))-x(iedge(1:nedge,2)))/2
-
x=SUM_SCATTER(-flux(1:nedge),x,iedge(1:nedge,1))
-
x=SUM_SCATTER( flux(1:nedge),x,iedge(1:nedge,2))
-
err = SUM( flux*flux ) / nedge
|
END DO
|