Given by Tom Haupt and Nancy McCracken at CPS615 Basic Simulation Track for Computational Science on Fall Semester 95. Foils prepared October 1995
CPS615
September 27, 1995
- Implementations of the data parallel language Fortran 90 on parallel machines. |
- HPF/Fortran90 parallel language constructs |
- HPF data mapping directives |
Data decomposition: divide the data into pieces with equal amounts of processing required.
Distribute the data blocks among processors, shown here on a MIMD distributed-memory or SIMD machine.
Each processor runs a traditional sequential program implementing the Fortran 90 program, looping over the subarray in its own memory.
If an array element A(i) is used with an array element B(j) that happens to be on another processor, communication is required, as in the sketch below.
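As a concrete sketch of these four points (the array sizes and the four-processor layout are our assumptions, not from the foils), an HPF fragment in which a shifted array assignment forces neighbor communication:

      PROGRAM block_demo
      REAL A(1000), B(1000)
!HPF$ PROCESSORS P(4)
!HPF$ DISTRIBUTE A(BLOCK) ONTO P
!HPF$ DISTRIBUTE B(BLOCK) ONTO P
      B = 1.0
! Each processor owns a contiguous block of 250 elements and runs the
! same sequential loop over its own block.  The reference B(2:1000)
! reaches one element past each block's right edge, so the compiler
! generates communication with the right-hand neighbor.
      A(1:999) = B(2:1000)
      END PROGRAM block_demo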
The equation for the field can be solved over a rectangular domain if the boundary conditions are known. The domain is discretized into a rectangular grid with known values at the boundary points.
To solve, start with an initial guess and use the Jacobi or Gauss-Seidel iterative technique to improve the solution to within a small error tolerance, using the formula in which each interior point is replaced by the average of its four neighbors:
w(i,j) = 0.25 * ( w(i-1,j) + w(i+1,j) + w(i,j-1) + w(i,j+1) )
The corresponding Fortran 90 statement, where W is the R x C array of grid points:
W(2:R-1, 2:C-1) = 0.25 * (W(2:R-1, 1:C-2) + W(2:R-1, 3:C) + W(1:R-2, 2:C-1) + W(3:R, 2:C-1))
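For context, here is that statement embedded in a complete sequential Fortran 90 iteration; a minimal sketch in which the grid size, boundary values, and convergence tolerance are illustrative assumptions:

      PROGRAM jacobi
      INTEGER, PARAMETER :: R = 100, C = 100    ! assumed grid size
      REAL :: W(R,C), WNEW(R,C)
      W = 0.0                       ! initial guess
      W(1,:) = 1.0                  ! assumed boundary condition on one edge
      DO
         WNEW = W
         ! Jacobi step: each interior point becomes the average of its
         ! four neighbors, all read from the old iterate W
         WNEW(2:R-1, 2:C-1) = 0.25 * (W(2:R-1, 1:C-2) + W(2:R-1, 3:C) &
                                    + W(1:R-2, 2:C-1) + W(3:R, 2:C-1))
         IF (MAXVAL(ABS(WNEW - W)) < 1.0E-4) EXIT   ! assumed tolerance
         W = WNEW
      END DO
      END PROGRAM jacobi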
The compiler decomposes the grid array into equal pieces, one piece per processor.
For each interior point, the algorithm must access the immediate neighbors according to a 5-point stencil, which may sometimes involve communication with other processors.
Then the algorithm can be implemented by code which first communicates the internal boundaries and then loops over the interior points on each processor. |
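In HPF the decomposition itself is one directive, and the boundary communication is generated by the compiler; a minimal sketch, assuming a 2 x 2 processor arrangement:

      INTEGER, PARAMETER :: R = 100, C = 100
      REAL W(R,C)
!HPF$ PROCESSORS MESH(2,2)
!HPF$ DISTRIBUTE W(BLOCK,BLOCK) ONTO MESH
! Each processor holds one (R/2) x (C/2) block.  Shifted references
! such as W(2:R-1, 1:C-2) read just across the block edges, so the
! compiler exchanges the internal boundary strips between neighboring
! processors before each local loop over the interior points.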
Consider the decomposition into roughly equal blocks of the N x (N+1) augmented matrix c = [ a | b ], given here in a 9 x 10 example where most processors have 9 elements and the last column of processors has 12 elements.
In the first iteration of the algorithm, suppose that each processor has roughly n x n elements, and consider the running time of each part of the algorithm.
In later iterations, many processors are completely idle, while the last column of processors still has the same running time.
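In HPF terms this decomposition is a two-dimensional block distribution of the augmented matrix; a sketch for the 9 x 10 example (the 3 x 3 processor grid is our assumption, and HPF's default BLOCK sizes may split the tenth column slightly differently from the foil's picture):

      REAL C(9,10)
!HPF$ PROCESSORS GRID(3,3)
!HPF$ DISTRIBUTE C(BLOCK,BLOCK) ONTO GRID
! The 10 columns cannot be divided evenly over 3 processor columns,
! so the blocks are unequal: in the foil's picture most processors own
! 3 x 3 = 9 elements and one processor column owns 3 x 4 = 12.  As the
! elimination proceeds, processors whose blocks lie entirely in the
! already-eliminated rows and columns fall idle.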
Better Decomposition is Block by Rows: After iterating over i columns, each processor is still working on rows of n-i+1 elements. This gives better load balance and better running time, since the time of the longest-running processor is reduced at each iteration.
This algorithm is similar to the Gauss-Jordan algorithm, except that it transforms only the rows after row i. After i iterations over the columns, only the rows below row i remain active, each with n-i+1 elements still to be updated.
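Written out sequentially in Fortran 90 (a sketch with no pivoting, using the c = [ a | b ] matrix notation from above), the elimination makes the shrinking active region explicit:

      INTEGER, PARAMETER :: N = 9
      REAL C(N,N+1)          ! assume C already holds the augmented matrix
      INTEGER i, k
      DO i = 1, N-1
         DO k = i+1, N       ! only the rows after row i are transformed
            C(k, i:N+1) = C(k, i:N+1) - (C(k,i) / C(i,i)) * C(i, i:N+1)
         END DO
      END DO
! After i iterations, only rows i+1..N and columns i+1..N+1 remain
! active: an (N-i) x (N-i+1) submatrix that shrinks every step.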
Cyclic Distribution by Rows: Suppose that there are only three processors. Then a better decomposition scatters the rows among the processors so that each processor still has work to do throughout the elimination.
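In HPF the scattered decomposition is again a single directive; a minimal sketch for the assumed three-processor case:

      REAL C(9,10)
!HPF$ PROCESSORS P(3)
!HPF$ DISTRIBUTE C(CYCLIC,*) ONTO P
! Rows 1,4,7 live on P(1), rows 2,5,8 on P(2), and rows 3,6,9 on P(3);
! the * leaves each row whole on one processor.  Every processor keeps
! some active rows until the final iterations, in contrast to a
! (BLOCK,*) distribution, where the processors owning the early rows
! go idle first.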