Intro HPF, September 27, 1995

Introduction to High Performance Fortran (HPF), CPS615
- Implementations of the data parallel language Fortran 90 on parallel machines
- HPF/Fortran 90 parallel language constructs
- HPF data mapping directives

Parallel Implementations of Fortran 90
Data decomposition: divide the data into pieces with roughly equal amounts of processing required, and distribute the data blocks among processors - shown here on a MIMD distributed-memory or SIMD machine. Each processor runs a traditional sequential program implementing the Fortran 90 program, looping over the subarray in its own memory. If an array element A(i) is used together with an array element B(j) that happens to reside on another processor, communication is required.

Example Program: a PDE Solver (Laplace's Equation)
The equation for the field can be solved over a rectangular domain if the boundary conditions are known. The domain is discretized into a rectangular grid with fixed values at the boundary points. To solve, start with an initial guess and use Jacobi or Gauss-Seidel iterative techniques to improve the solution to within a small error tolerance, using the update formula expressed as a Fortran 90 statement, where W is the array of grid points:

    W(2:R-1, 2:C-1) = 0.25 * (W(2:R-1, 1:C-2) + W(2:R-1, 3:C) + W(1:R-2, 2:C-1) + W(3:R, 2:C-1))

Grid Decomposition
The compiler decomposes the grid array onto processors in equal pieces. For each interior point, the algorithm must access the immediate neighbors according to a 5-point stencil, which may sometimes involve communication with other processors. The algorithm can then be implemented by code that first communicates the internal boundaries and then loops over the interior points on each processor.
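The Jacobi update above can be sketched in NumPy, as a Python illustration of the same array statement (this is not the HPF implementation; the grid size, iteration count, and boundary values below are made up for the example):

```python
import numpy as np

def jacobi_step(W):
    """One Jacobi sweep: replace each interior point with the average
    of its four neighbors (the Fortran 90 array statement, written
    with NumPy slicing). Boundary rows and columns stay unchanged."""
    V = W.copy()
    V[1:-1, 1:-1] = 0.25 * (W[1:-1, :-2] + W[1:-1, 2:] +   # left, right neighbors
                            W[:-2, 1:-1] + W[2:, 1:-1])    # up, down neighbors
    return V

# Example: top edge held at 1.0, the other boundaries at 0.0
R, C = 6, 6
W = np.zeros((R, C))
W[0, :] = 1.0
for _ in range(200):          # iterate until (roughly) converged
    W = jacobi_step(W)
```

Like the Fortran 90 statement, the whole interior is updated from the previous iterate at once, which is exactly the data parallelism an HPF compiler distributes across processors.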
Implementation of the Gauss-Jordan Program
Consider the decomposition into roughly equal blocks of the N x (N+1) augmented matrix c = [ a | b ], given here in a 9 x 10 example where most processors hold 9 elements and the last column of processors holds 12 elements. In the first iteration of the algorithm, assume that each processor has roughly n x n elements and find the running time of each part of the algorithm:
- Divide first row:
- Transform remaining rows:

Load imbalance arises after i iterations over columns: many processors are completely idle, while the last column of processors still has the same running time.

Better Decomposition: Block by Rows
After iterating over i columns, each processor is still working on n-i+1 row elements. This gives better load balance and better running time, since the time of the longest-running processor decreases with each iteration.

Load Balancing for Regular Gaussian Elimination during Forward Reduction
This algorithm is similar to the Gauss-Jordan algorithm except that it transforms only the rows below row i. After i iterations over columns:

Cyclic Distribution by Rows
Suppose that there are only three processors. Then a better decomposition would scatter the rows cyclically across the processors, so that each processor still has work to do:
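To see why the cyclic distribution balances the load, the sketch below counts how many still-active rows each of three processors holds after i elimination steps of forward reduction, under block and cyclic row distributions. This is a small Python model of the work distribution, not HPF code; the 9-row, 3-processor sizes and the function names are illustrative:

```python
def block_owner(row, n_rows, n_procs):
    # Block distribution: contiguous chunks of n_rows // n_procs rows each
    return (row - 1) // (n_rows // n_procs)

def cyclic_owner(row, n_rows, n_procs):
    # Cyclic distribution: rows dealt out round-robin to the processors
    return (row - 1) % n_procs

def active_per_proc(owner, step, n_rows=9, n_procs=3):
    """After `step` steps of forward reduction, only rows step+1 .. n_rows
    are still being transformed. Count active rows on each processor."""
    counts = [0] * n_procs
    for row in range(step + 1, n_rows + 1):
        counts[owner(row, n_rows, n_procs)] += 1
    return counts

# After 4 of 9 elimination steps: the block distribution already idles
# processor 0, while the cyclic distribution keeps all three busy.
print(active_per_proc(block_owner, 4))   # -> [0, 2, 3]
print(active_per_proc(cyclic_owner, 4))  # -> [1, 2, 2]
```

With the cyclic distribution the active-row counts on any two processors never differ by more than one, so the time of the longest-running processor shrinks at every step instead of concentrating on the last block.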