Given by Tom Haupt and Nancy McCracken at CPS615 Basic Simulation Track for Computational Science on Fall Semester 95. Foils prepared October 1995
CPS615
September 27, 1995
- Implementations of the data parallel language Fortran 90 on parallel machines. |
- HPF/Fortran90 parallel language constructs |
- HPF data mapping directives |
Data decomposition: divide the data into pieces with equal amounts of processing required.
Distribute the data blocks among processors, shown here on a MIMD distributed-memory or SIMD machine.
Each processor runs a traditional sequential program implementing the Fortran 90 program, looping over the subarray in its own memory.
If an array element A(i) is used with an array element B(j) that happens to be on another processor, communication is required, as in the sketch below.
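As a concrete sketch of these four points (the array sizes and the four-processor layout are our assumptions, not from the foils), an HPF fragment in which a shifted array assignment forces neighbor communication:

      PROGRAM block_demo
      REAL A(1000), B(1000)
!HPF$ PROCESSORS P(4)
!HPF$ DISTRIBUTE A(BLOCK) ONTO P
!HPF$ DISTRIBUTE B(BLOCK) ONTO P
      B = 1.0
! Each processor owns a contiguous block of 250 elements and runs the
! same sequential loop over its own block.  The reference B(2:1000)
! reaches one element past each block's right edge, so the compiler
! generates communication with the right-hand neighbor.
      A(1:999) = B(2:1000)
      END PROGRAM block_demo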
The equation for the field can be solved over a rectangular domain if the boundary conditions are known. The domain is discretized into a rectangular grid with known values at the boundary points.
To solve, start with an initial guess and use the Jacobi or Gauss-Seidel iterative technique to improve the solution to within a small error tolerance, using the formula in which each interior point is replaced by the average of its four neighbors:
w(i,j) = 0.25 * ( w(i-1,j) + w(i+1,j) + w(i,j-1) + w(i,j+1) )
The corresponding Fortran 90 statement, where W is the R x C array of grid points:
W(2:R-1, 2:C-1) = 0.25 * (W(2:R-1, 1:C-2) + W(2:R-1, 3:C) + W(1:R-2, 2:C-1) + W(3:R, 2:C-1))
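For context, here is that statement embedded in a complete sequential Fortran 90 iteration; a minimal sketch in which the grid size, boundary values, and convergence tolerance are illustrative assumptions:

      PROGRAM jacobi
      INTEGER, PARAMETER :: R = 100, C = 100    ! assumed grid size
      REAL :: W(R,C), WNEW(R,C)
      W = 0.0                       ! initial guess
      W(1,:) = 1.0                  ! assumed boundary condition on one edge
      DO
         WNEW = W
         ! Jacobi step: each interior point becomes the average of its
         ! four neighbors, all read from the old iterate W
         WNEW(2:R-1, 2:C-1) = 0.25 * (W(2:R-1, 1:C-2) + W(2:R-1, 3:C) &
                                    + W(1:R-2, 2:C-1) + W(3:R, 2:C-1))
         IF (MAXVAL(ABS(WNEW - W)) < 1.0E-4) EXIT   ! assumed tolerance
         W = WNEW
      END DO
      END PROGRAM jacobi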
The compiler decomposes the grid array into equal pieces, one piece per processor.
For each interior point, the algorithm must access the immediate neighbors according to a 5-point stencil, which may sometimes involve communication with other processors.
Then the algorithm can be implemented by code which first communicates the internal boundaries and then loops over the interior points on each processor. |
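In HPF the decomposition itself is one directive, and the boundary communication is generated by the compiler; a minimal sketch, assuming a 2 x 2 processor arrangement:

      INTEGER, PARAMETER :: R = 100, C = 100
      REAL W(R,C)
!HPF$ PROCESSORS MESH(2,2)
!HPF$ DISTRIBUTE W(BLOCK,BLOCK) ONTO MESH
! Each processor holds one (R/2) x (C/2) block.  Shifted references
! such as W(2:R-1, 1:C-2) read just across the block edges, so the
! compiler exchanges the internal boundary strips between neighboring
! processors before each local loop over the interior points.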
Consider the decomposition into roughly equal blocks of the N x (N+1) augmented matrix c = [ a | b ], given here in a 9 x 10 example where most processors have 9 elements and the last column of processors has 12 elements.
In the first iteration of the algorithm, suppose that each processor has roughly n x n elements, and consider the running time of each part of the algorithm.
In later iterations, many processors are completely idle, while the last column of processors still has the same running time.
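In HPF terms this decomposition is a two-dimensional block distribution of the augmented matrix; a sketch for the 9 x 10 example (the 3 x 3 processor grid is our assumption, and HPF's default BLOCK sizes may split the tenth column slightly differently from the foil's picture):

      REAL C(9,10)
!HPF$ PROCESSORS GRID(3,3)
!HPF$ DISTRIBUTE C(BLOCK,BLOCK) ONTO GRID
! The 10 columns cannot be divided evenly over 3 processor columns,
! so the blocks are unequal: in the foil's picture most processors own
! 3 x 3 = 9 elements and one processor column owns 3 x 4 = 12.  As the
! elimination proceeds, processors whose blocks lie entirely in the
! already-eliminated rows and columns fall idle.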
Better Decomposition is Block by Rows: After iterating over i columns, each processor is still working on rows of n-i+1 elements. This gives better load balance and better running time, since the time of the longest-running processor is reduced at each iteration.
This algorithm is similar to the Gauss-Jordan algorithm, except that it transforms only the rows after row i. After i iterations over the columns, only the rows below row i remain active, each with n-i+1 elements still to be updated.
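Written out sequentially in Fortran 90 (a sketch with no pivoting, using the c = [ a | b ] matrix notation from above), the elimination makes the shrinking active region explicit:

      INTEGER, PARAMETER :: N = 9
      REAL C(N,N+1)          ! assume C already holds the augmented matrix
      INTEGER i, k
      DO i = 1, N-1
         DO k = i+1, N       ! only the rows after row i are transformed
            C(k, i:N+1) = C(k, i:N+1) - (C(k,i) / C(i,i)) * C(i, i:N+1)
         END DO
      END DO
! After i iterations, only rows i+1..N and columns i+1..N+1 remain
! active: an (N-i) x (N-i+1) submatrix that shrinks every step.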
Cyclic Distribution by Rows: Suppose that there are only three processors. Then a better decomposition scatters the rows among the processors so that each processor still has work to do throughout the elimination.
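In HPF the scattered decomposition is again a single directive; a minimal sketch for the assumed three-processor case:

      REAL C(9,10)
!HPF$ PROCESSORS P(3)
!HPF$ DISTRIBUTE C(CYCLIC,*) ONTO P
! Rows 1,4,7 live on P(1), rows 2,5,8 on P(2), and rows 3,6,9 on P(3);
! the * leaves each row whole on one processor.  Every processor keeps
! some active rows until the final iterations, in contrast to a
! (BLOCK,*) distribution, where the processors owning the early rows
! go idle first.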