Given by Chuck Koelbel -- Rice University at DoD Training and Others on 1995-98. Foils prepared August 7 98
Outside Index
Summary of Material
Fourth in HPF Tutorial Sequence of Chuck Koelbel |
Discussion of Data Mapping in HPF using Multigrid as an exampleobullet1:DISTRIBUTE ALIGN DYNAMIC REALIGN REDISTRIBUTE |
Subroutine Interfaces |
Outside Index
Summary of Material
1. Introduction to Data-Parallelism |
2. Fortran 90/95 Features |
3. HPF Parallel Features |
4. HPF Data Mapping Features ** Contents of This Presentation |
5. Parallel Programming in HPF |
6. HPF Version 2.0 |
Data mapping complements data parallelism by placing data for parallel access |
HPF uses a two-phase data mapping:
|
Goals:
|
These goals are sometimes in conflict
|
Syntax:
|
Semantics:
|
Options for dist-format
|
REAL W(12,12),X(12,12),Y(12,12),Z(12,12) |
!HPF$ DISTRIBUTE W(BLOCK,*) |
!HPF$ DISTRIBUTE X(*,CYCLIC) |
!HPF$ DISTRIBUTE Y(BLOCK,BLOCK) |
!HPF$ DISTRIBUTE Z(CYCLIC(2),CYCLIC(3)) |
REAL a(12,12), b(12,12), |
REAL c(12,12), d(12,12) |
!HPF$ PROCESSORS p(6,1), q(2,3) |
!HPF$ PROCESSORS r(3,2), s(1,6) |
!HPF$ DISTRIBUTE a(BLOCK,BLOCK) ONTO p |
!HPF$ DISTRIBUTE b(BLOCK,BLOCK) ONTO q |
!HPF$ DISTRIBUTE c(BLOCK,BLOCK) ONTO r |
!HPF$ DISTRIBUTE d(BLOCK,BLOCK) ONTO s |
Compilation is based on the data distribution
|
Communication and synchronization are based on the data distribution
|
Communication (data movement) happens when two data items on different processors must be brought together
|
How communication is accomplished is a system problem
|
BLOCK retains locality for small shifts
|
CYCLIC is a disaster for shift computations
|
Broadcasts (copying one element) and reductions (combining all elements) favor neither BLOCK nor CYCLIC |
Irregular communications (sorting, scatters and gathers) favor neither BLOCK nor CYCLIC
|
Copying between distributions creates a lot of communication
|
Interpolation (as in multigrid) works like a shift operator |
CYCLIC always balances number of elements per processor
|
If P (# of processors) divides N (# of elements), BLOCK balances elements per processor
|
If P*K divides N, CYCLIC(K) balances elements per processor |
BLOCK is good for local (e.g. nearest-neighbor) communication
|
CYCLIC has non-obvious locality
|
CYCLIC(K) has some of the advantage of each |
Strides are expensive on any distribution
|
Broadcasts are equally expensive on any distribution |
Communication between different distributions is very expensive |
Nearest-neighbor relaxation
|
Dense linear algebra
|
Indirection array data structures
|
Single shared array for all levels
|
List of arrays, one per level
|
Allocate enough memory on each processor for its section of each distributed array
|
Adjust indexing
|
Adjust loops (including implicit loops)
|
Handle nonlocal data
|
Syntax:
|
Semantics:
|
Options for subscript-list
|
REAL U(6,6), V(6,6), W(6,6) |
REAL X(6), Y(6), Z(3) |
!HPF$ALIGN V(I,J) WITH U(I,J) |
!HPF$ALIGN W(I,J) WITH U(J,I) |
!HPF$ALIGN X(K) WITH W(K,1) |
!HPF$ALIGN Y(K) WITH W(*,K) |
!HPF$ALIGN Z(K) WITH X(2*K-1) |
!HPF$ DISTRIBUTE U(BLOCK,*) |
REAL U(6,6), V(6,6), W(6,6) |
REAL X(6), Y(6), Z(3) |
!HPF$ALIGN V(I,J) WITH U(I,J) |
!HPF$ALIGN W(I,J) WITH U(J,I) |
!HPF$ALIGN X(K) WITH W(K,1) |
!HPF$ALIGN Y(K) WITH W(*,K) |
!HPF$ALIGN Z(K) WITH X(2*K-1) |
!HPF$ DISTRIBUTE U(*,BLOCK) |
REAL U(6,6), V(6,6), W(6,6) |
REAL X(6), Y(6), Z(3) |
!HPF$ALIGN V(I,J) WITH U(I,J) |
!HPF$ALIGN W(I,J) WITH U(J,I) |
!HPF$ALIGN X(K) WITH W(K,1) |
!HPF$ALIGN Y(K) WITH W(*,K) |
!HPF$ALIGN Z(K) WITH X(2*K-1) |
!HPF$ DISTRIBUTE U(*,CYCLIC) |
REAL U(6,6), V(6,6), W(6,6) |
REAL X(6), Y(6), Z(3) |
!HPF$ALIGN V(I,J) WITH U(I,J) |
!HPF$ALIGN W(I,J) WITH U(J,I) |
!HPF$ALIGN X(K) WITH W(K,1) |
!HPF$ALIGN Y(K) WITH W(*,K) |
!HPF$ALIGN Z(K) WITH X(2*K-1) |
!HPF$DISTRIBUTE U(BLOCK,BLOCK) |
Communication (data movement) happens when two data items on different processors must be brought together |
ALIGN relates array elements, ensuring they are mapped together
|
You can also think of this as modifying DISTRIBUTE
|
³Copy² distributions
|
Off-by-one problems
|
Differing array sizes
|
Substitute ALIGN expression into array reference subscripts |
Treat like DISTRIBUTE
|
One data mapping is not always appropriate for an entire program
|
Therefore, HPF 1.x provides executable DISTRIBUTE and ALIGN
|
Syntax:
|
Semantics:
|
Syntax:
|
Semantics:
|
Syntax:
|
Semantics:
|
Picking mappings based on input data
|
FFT, ADI, and other directional sweeps
|
Cyclic reduction and other recursive doubling methods
|
Some subroutines require data to use a specific mapping, so actual arguments must be remapped |
Some subroutines can use any mapping, so actual arguments should be passed in place |
Sometimes the programmer knows the incoming data mappings, sometimes not |
HPF has options to say all of this! |
Any remappings are undone on procedure return |
Three classes of mappings for dummy arguments
|
Sorting some of these cases out requires doing special things in the caller
|
Always use explicit interfaces |
Require good errors and warnings from your compiler |
There are several cases where HPF requires mappings to be ³compatible²
|
³Compatible² is defined by a partial ordering
|
You can avoid learning the detailed rules!
|
There is a hierarchy of directives, based on how precisely they describe a mapping |
An explicit interface is required if
|
Libraries
|
Non-reusable modules
|
PROCESSORS
|
TEMPLATE
|
ALLOCATABLE Arrays and POINTER Targets
|
ALIGN data based on physical domains
|
DISTRIBUTE based on parallelism
|
Pick distribution pattern based on communications
|
REALIGN with care; REDISTRIBUTE with extreme care
|