"High Performance Fortran in Practice" tutorial Presented at SIAM Parallel Processing San Francisco, Feb. 14, 1995 Presented at Supercomputer '95, Mannheim, Germany, June 27, 1995 Presented at Supercomputing '95, San Diego, December 4, 1995 Presented at University of Tennessee (short form), Knoxville, March 21, 1996 Presented at Metacenter Regional Alliances, Cornell, May 6, 1996 Presented at Summer of HPF Workshop, Vienna, July 1, 1996 Presented at Institute for Mathematics & its Applications, Minneapolis, September 11-13, 1996 Presented at Corps of Engineers Waterways Experiments Station, Vicksburg, MS, October 30-November 1, 1996 Presented at Supercomputing '96, Pittsburgh, PA, November 17, 1996 Presented at NAVO, Stennis Space Center, MS, Feb 13, 1997 Presented at HPF Users Group (short version), Santa Fe, NM, February 23, 1997 Presented at ASC, Wright-Patterson Air Force Base, OH, March 5, 1997 Parts presented at SC'97, November 17, 1997 Parts presented (slideshow mode) at SCı97, November 15-21, 1997 Presented at DOD HPC Users Group, June 1, 1998 IV. HPF Data Mapping Features Outline 1. Introduction to Data-Parallelism 2. Fortran 90/95 Features 3. HPF Parallel Features 4. HPF Data Mapping Features 5. Parallel Programming in HPF 6. HPF Version 2.0 Data Mapping Data mapping complements data parallelism by placing data for parallel access HPF uses a two-phase data mapping: ALIGN: Creates a relationship between objects DISTRIBUTE: Partitions an object between processors Vendors may define further levels of mapping Data Mapping, cont. Goals: Avoiding Contention: Data updated in parallel should be on different processors Locality of Reference: Data used together should be on the same processor These goals are sometimes in conflict To avoid all contention, put every data item on its own processor To maximize locality, put all data on one processor HPF gives you the tools to resolve these conflicts, but doesnıt solve them for you A. distribute directive The DISTRIBUTE Directive Syntax: !HPF$ DISTRIBUTE array(dist-format-list ) [ONTO procs] !HPF$ DISTRIBUTE(dist-format-list ) [ONTO procs]::array-list Semantics: Each dimension of array is divided according to the corresponding pattern in dist-format-list ONTO (if present) names the processor array to distribute on Options for dist-format (Let N be the number of elements.) (Let P be the number of processors.) BLOCK : Contiguous pieces of size N/P on each processor CYCLIC : Every Pth element to the same processor CYCLIC(K) : Every Pth block of size K to the same processor * : Dimension not distributed Examples of DISTRIBUTE REAL W(12,12),X(12,12),Y(12,12),Z(12,12) !HPF$ DISTRIBUTE W(BLOCK,*) !HPF$ DISTRIBUTE X(*,CYCLIC) !HPF$ DISTRIBUTE Y(BLOCK,BLOCK) !HPF$ DISTRIBUTE Z(CYCLIC(2),CYCLIC(3)) More Examples of DISTRIBUTE REAL a(12,12), b(12,12), REAL c(12,12), d(12,12) !HPF$ PROCESSORS p(6,1), q(2,3) !HPF$ PROCESSORS r(3,2), s(1,6) !HPF$ DISTRIBUTE a(BLOCK,BLOCK) ONTO p !HPF$ DISTRIBUTE b(BLOCK,BLOCK) ONTO q !HPF$ DISTRIBUTE c(BLOCK,BLOCK) ONTO r !HPF$ DISTRIBUTE d(BLOCK,BLOCK) ONTO s Why Use DISTRIBUTE? Compilation is based on the data distribution Computations will execute in parallel if They are conceptually parallel (e.g. array operations) The data is partitioned (e.g. 
Why Use DISTRIBUTE?
Compilation is based on the data distribution.
Computations will execute in parallel if
- They are conceptually parallel (e.g. array operations)
- The data is partitioned (e.g. by DISTRIBUTE)
Communication and synchronization are based on the data distribution:
- BLOCK reduces the surface-to-volume ratio
- CYCLIC (and CYCLIC(K)) improves load balance
- * keeps things on one processor

DISTRIBUTE and Communication
Communication (data movement) happens when two data items on different processors must be brought together.
Assume a(n) and a(m) are on different processors:
- a(n) = a(m): communicate one element
- x = a(n) + a(m): communicate one of {a(n), a(m)}
- a(n) = a(n): no communication
How communication is accomplished is a system problem. It depends on
- the data mapping (DISTRIBUTE and ALIGN)
- the data access (subscripts of arrays)
- the implementation (whose compiler?)

Choosing a Good Distribution Pattern
BLOCK retains locality for small shifts
- But large shifts (relative to the block size) are a problem
CYCLIC is a disaster for shift computations
- But there are some interesting special cases

Choosing a Good Distribution Pattern (cont.)
Broadcasts (copying one element) and reductions (combining all elements) favor neither BLOCK nor CYCLIC.
Irregular communications (sorting, scatters, and gathers) favor neither BLOCK nor CYCLIC.
- But some special cases are natural

Choosing a Good Distribution Pattern (cont.)
Copying between distributions creates a lot of communication.
- This includes copying between different block sizes
Interpolation (as in multigrid) works like a shift operator.

Distribution Patterns and Load Balance
CYCLIC always balances the number of elements per processor.
- This balances the computational load if all elements cost the same
If P (the number of processors) divides N (the number of elements), BLOCK balances elements per processor.
- There are pathological cases if P < N < P^2
If P*K divides N, CYCLIC(K) balances elements per processor.

Rules of Thumb for Communication
BLOCK is good for local (e.g. nearest-neighbor) communication
- Look for subscripts like a(i+1,j-1)
- Look for intrinsics like CSHIFT
- Warning: BLOCK depends on the array size
CYCLIC has non-obvious locality
- Look for subscripts like x(i+jmp), where jmp is a multiple of the number of processors
- For example, jmp may be a power of 2 on a hypercube machine
- CYCLIC always balances the memory load
CYCLIC(K) has some of the advantages of each
Strides are expensive on any distribution
- But some special cases are worth recognizing
Broadcasts are equally expensive on any distribution
Communication between different distributions is very expensive

Typical Uses of DISTRIBUTE
Nearest-neighbor relaxation
- BLOCK in the dimension(s) with the most parallelism
- CYCLIC(K) may help with load balancing
Dense linear algebra
- CYCLIC(K) in one or two dimensions
Indirection-array data structures
- "No silver bullet"
- BLOCK, with careful numbering of data elements?

Multigrid: A Complicated Case for DISTRIBUTE

Possible Data Structures for Multigrid

Possible DISTRIBUTE Patterns for Multigrid
Single shared array for all levels
- Smoothing: all CSHIFT operations
- Prolongation and restriction: all CSHIFT operations
- Conclusion: use some form of BLOCK
  - If start-up cost is high: 1-D BLOCK (in the longest dimension)
  - Otherwise: 2-D BLOCK may be better
List of arrays, one per level
- Smoothing: all CSHIFT operations
- Prolongation and restriction: copies between different-sized arrays
- Conclusion: use some form of BLOCK on each level (see the sketch below)
  - May want to change for coarser grid levels

Possible DISTRIBUTE Patterns for Multigrid (cont.)
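As an illustration of the "list of arrays, one per level" option, here is a minimal sketch (the level sizes and the injection-style restriction are illustrative assumptions, not the tutorial's multigrid code): each level is a separate allocatable array with its own (BLOCK,BLOCK) distribution, so smoothing stays mostly local, while restriction copies between differently sized, differently blocked arrays and therefore communicates.

PROGRAM multigrid_sketch
  ! Illustrative sketch: one BLOCK-distributed allocatable array per level.
  REAL, ALLOCATABLE :: fine(:,:), mid(:,:), coarse(:,:)
!HPF$ DISTRIBUTE (BLOCK,BLOCK) :: fine, mid, coarse
  ALLOCATE (fine(64,64), mid(32,32), coarse(16,16))
  fine = 1.0
  ! Smoothing on one level: nearest-neighbor shifts, mostly local under BLOCK
  fine(2:63,2:63) = 0.25 * (fine(1:62,2:63) + fine(3:64,2:63) &
                          + fine(2:63,1:62) + fine(2:63,3:64))
  ! Restriction by injection: copies between different-sized, differently
  ! blocked arrays, so the compiler must generate communication here
  mid    = fine(1:64:2, 1:64:2)
  coarse = mid(1:32:2, 1:32:2)
  PRINT *, 'coarse sum =', SUM(coarse)
END PROGRAM multigrid_sketch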
Implementation of DISTRIBUTE
Allocate enough memory on each processor for its section of each distributed array
- Distributed memory: one malloc per processor, or shrink the declared bounds
- Shared memory: one shared area, with its usage divided among processors
Adjust indexing
- Distributed memory: translate between global indices and local numbering
- Shared memory: permute elements, keeping each processor's elements together
Adjust loops (including implicit loops)
- Pick a reference in the loop (e.g. A(J+1))
- Each processor executes the iterations for which that reference is local (e.g. lb-1:ub-1)
Handle nonlocal data
- Distributed memory: allocate buffer space, SEND/RECV the needed data
- Shared memory: access the data directly, or allocate and make local copies

B. The ALIGN Directive

The ALIGN Directive
Syntax:
!HPF$ ALIGN array(source-list) WITH target(subscript-list)
!HPF$ ALIGN (source-list) WITH target(subscript-list) :: array-list
Semantics:
- Creates a relationship between array and target so that, for all values of the source-list variables, array(source-list) and target(subscript-list) are stored on the same processor
- Only target can be distributed explicitly

Options for subscript-list
- Linear function of one source-list variable
- Triplet notation
  - Must match a ":" in source-list
  - Element-wise matching, as in array assignment
- * (replication)

Examples of ALIGN
REAL U(6,6), V(6,6), W(6,6)
REAL X(6), Y(6), Z(3)
!HPF$ ALIGN V(I,J) WITH U(I,J)
!HPF$ ALIGN W(I,J) WITH U(J,I)
!HPF$ ALIGN X(K) WITH W(K,1)
!HPF$ ALIGN Y(K) WITH W(*,K)
!HPF$ ALIGN Z(K) WITH X(2*K-1)
!HPF$ DISTRIBUTE U(BLOCK,*)

More Examples of ALIGN
(Same declarations and alignments as above, but with)
!HPF$ DISTRIBUTE U(*,BLOCK)

Even More Examples of ALIGN
(Same declarations and alignments as above, but with)
!HPF$ DISTRIBUTE U(*,CYCLIC)

Son of Even More Examples of ALIGN
(Same declarations and alignments as above, but with)
!HPF$ DISTRIBUTE U(BLOCK,BLOCK)

ALIGN and Communication
Communication (data movement) happens when two data items on different processors must be brought together.
ALIGN relates array elements, ensuring they are mapped together:
!HPF$ ALIGN a(i) WITH b(i+1)
- No communication for a(10) = b(11)
- No information about a(10) = b(10)
You can also think of this as modifying DISTRIBUTE:
- Substitute the ALIGN subscripts before doing the communication analysis

Why Use ALIGN?
"Copy" distributions
- DISTRIBUTE one array as the "master" for data layout
- ALIGN other arrays to it
- Just modify one line to change all the distributions
- This will be common when porting codes! (A sketch of this idiom appears below.)
Off-by-one problems
- Sometimes boundaries get in the way of DISTRIBUTE
Differing array sizes
- Will the compiler handle this better?
  REAL a(16), b(8)
  !HPF$ ALIGN b(i) WITH a(2*i)
  !HPF$ DISTRIBUTE a(BLOCK)
- Or this?
  REAL a(16), b(8)
  !HPF$ DISTRIBUTE a(BLOCK), b(BLOCK)

Implementation of ALIGN
Substitute the ALIGN expression into array reference subscripts
Then treat it like DISTRIBUTE
- Except that the expressions get trickier
- For non-stride-1 accesses and nontrivial alignments, much trickier
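Here is a minimal sketch of the "copy distributions" idiom described above (the array names and sizes are illustrative assumptions): one master array carries the DISTRIBUTE, everything else is aligned to it, and changing the single DISTRIBUTE line remaps all three arrays.

PROGRAM align_sketch
  ! Illustrative sketch: u is the "master"; v and rhs follow it everywhere.
  REAL u(100,100), v(100,100), rhs(100,100)
!HPF$ DISTRIBUTE u(BLOCK,BLOCK)   ! change only this line to remap all three
!HPF$ ALIGN v(i,j)   WITH u(i,j)
!HPF$ ALIGN rhs(i,j) WITH u(i,j)
  v = 1.0
  rhs = 2.0
  ! Aligned operands: no communication is needed for this whole-array update
  u = v + rhs
  PRINT *, 'u(50,50) =', u(50,50)
END PROGRAM align_sketch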
C. Dynamic Mapping

Dynamic Data Mapping
One data mapping is not always appropriate for an entire program:
- Different behavior in different phases ("First sweep in the X dimension, then sweep in the Y dimension")
- ALLOCATABLE arrays can change size
Therefore, HPF 1.x provides executable forms of DISTRIBUTE and ALIGN:
- Called REALIGN and REDISTRIBUTE
- DYNAMIC attribute (compare to the ALLOCATABLE attribute)

The DYNAMIC Directive
Syntax:
!HPF$ DYNAMIC entity-decl-list
!HPF$ DYNAMIC :: entity-decl-list
Semantics:
- Any array in entity-decl-list can be used in a REALIGN or REDISTRIBUTE directive
- DYNAMIC appears only in the declarations section
- DYNAMIC arrays can also appear in ALIGN or DISTRIBUTE (to get initial mappings)

The REALIGN Directive
Syntax:
!HPF$ REALIGN array(source-list) WITH target(subscript-list)
!HPF$ REALIGN (source-list) WITH target(subscript-list) :: array-list
Semantics:
- Creates a dynamic relationship between array and target so that, for all values of the source-list variables, array(source-list) and target(subscript-list) are stored on the same processor
- array must have the DYNAMIC attribute
- Ends any previous ALIGN or DISTRIBUTE for array
- Communicates the data already in array to its new home
- Lasts until another REALIGN or REDISTRIBUTE for the same array (i.e., not static scoping)

The REDISTRIBUTE Directive
Syntax:
!HPF$ REDISTRIBUTE array(dist-format-list) [ONTO processors]
!HPF$ REDISTRIBUTE (dist-format-list) [ONTO processors] :: array-list
Semantics:
- Dynamically divides each dimension of array according to the corresponding pattern in dist-format-list
- Also changes the mappings of anything already aligned to array (i.e., anything for which array was previously the target of an ALIGN or REALIGN)
- array must have the DYNAMIC attribute
- Communicates the data to its new processor home
- Lasts until another REALIGN or REDISTRIBUTE for the same array

Typical Use of Dynamic Data Mapping
Picking mappings based on input data:
!HPF$ DYNAMIC x
!HPF$ ALIGN (:,:) WITH x(:,:) :: y, z
IF (n1 > n2) THEN
!HPF$ REDISTRIBUTE x(BLOCK,*)
ELSE
!HPF$ REDISTRIBUTE x(*,BLOCK)
END IF
! Note: the compiler doesn't know the precise data
! mapping of any of these arrays at this point

More Typical Uses of Dynamic Data Mapping
FFT, ADI, and other directional sweeps (see the sketch below)
- (RE)DISTRIBUTE BLOCK in the parallel dimension for the first sweep
- Perform the sweep
- REDISTRIBUTE BLOCK in the next dimension
- Perform the sweep, ...
Cyclic reduction and other recursive-doubling methods
- (Only if the number of processors P is a power of 2)
- (RE)DISTRIBUTE BLOCK
- Reduce by step 1, then 2, ..., up to P/2
- REDISTRIBUTE CYCLIC
- Reduce by P, 2*P, ...
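As a minimal sketch of the directional-sweep pattern (the array size and the simple per-line update are illustrative stand-ins for real ADI or FFT work, not the tutorial's code), the array stays BLOCK in the dimension carrying the parallelism for each sweep, and a single REDISTRIBUTE moves the data between sweeps; the computation per sweep must outweigh the cost of remapping the whole array.

PROGRAM sweep_sketch
  ! Illustrative sketch of x-then-y sweeps with one remapping in between.
  REAL a(256,256)
!HPF$ DYNAMIC a
!HPF$ DISTRIBUTE a(*,BLOCK)    ! whole columns local; parallelism across columns
  a = 1.0
  ! Sweep along dimension 1: shifts stay within a column, hence local
  a(2:255,:) = 0.5 * (a(1:254,:) + a(3:256,:))
!HPF$ REDISTRIBUTE a(BLOCK,*)  ! remap so whole rows are local
  ! Sweep along dimension 2: shifts stay within a row, hence local
  a(:,2:255) = 0.5 * (a(:,1:254) + a(:,3:256))
  PRINT *, 'sum =', SUM(a)
END PROGRAM sweep_sketch

D. Subroutine Linkage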
Data Mapping in Subroutine Calls
Some subroutines require data to use a specific mapping, so actual arguments must be remapped.
Some subroutines can use any mapping, so actual arguments should be passed in place.
Sometimes the programmer knows the incoming data mappings, sometimes not.
HPF has options to say all of this!
Any remappings are undone on procedure return.

Mapping Options for Dummy Arguments
There are three classes of mappings for dummy arguments:
Prescriptive: "Wherever the actual is, remap it here"
- Syntax: DISTRIBUTE A(BLOCK)
- Note: the remapping is undone on return
Descriptive: "The actual should be here"
- Syntax: DISTRIBUTE A *(BLOCK)
- Note: if the actual is not there, a warning is suggested but not required
Transcriptive: "Leave the actual where it is"
- Syntax: DISTRIBUTE A *
- Or: INHERIT A
Sorting some of these cases out requires doing special things in the caller.
- When this is needed, an explicit interface is required
- Explicit interfaces are a good idea in other cases too
(A sketch of these styles appears at the end of this section.)

Subroutine Interfaces for Dummies
- Always use explicit interfaces
- Require good errors and warnings from your compiler

When Are Two Mappings the Same?
There are several cases where HPF requires mappings to be "compatible":
- Argument passing with implicit interfaces
- Argument passing with explicit descriptive mappings (sort of)
- Pointer association
"Compatible" is defined by a partial ordering:
- A mapping is "more specialized" (or "more generalized") than another if it has more (or less) information
- Some mappings can't be compared
You can avoid learning the detailed rules:
- Always ALIGN things that ought to have the same mapping
- Always use explicit interfaces

Specialized and Generalized Mappings
There is a hierarchy of directives, based on how precisely they describe a mapping.

Explicit Interfaces for Mapping Dummy Arguments
An explicit interface is required if
- Any dummy has a transcriptive mapping
  - Reason: the caller has to send a descriptor with the array's mapping
- Any dummy has the INHERIT attribute
  - Reason: the caller has to send a descriptor with the array's mapping
- For some argument, the actual's mapping is not a specialization of the dummy's
  - Translation: if the argument has to be remapped, you need an explicit interface
  - Reason: somebody has to move the data around

Rules of Thumb for Dummy Arguments
Libraries
- Prescriptive mapping ("remap the actual") if a specific mapping is needed
- Transcriptive mapping ("take anything") for embarrassingly parallel routines
- Descriptive mapping ("this is coming") for internal routines, probably after checking the distribution of the (transcriptively mapped) user arguments
Non-reusable modules
- Descriptive mapping is probably fastest
- Prescriptive mapping is probably safer

Other Mapping Features
PROCESSORS
- Defines the size and shape of a processor array
- Only guaranteed when the size of the processor array equals NUMBER_OF_PROCESSORS()
- !HPF$ PROCESSORS name(shape-spec-list)
TEMPLATE
- Defines an index domain for alignment and distribution, but allocates no memory
- !HPF$ TEMPLATE name(shape-spec-list)
ALLOCATABLE arrays and POINTER targets
- ALIGN and DISTRIBUTE take effect (and have their expressions evaluated) when the array is allocated
- If used as the target of a REALIGN, the array must be allocated when the REALIGN takes effect

Hints for Using Data Mapping
ALIGN data based on physical domains
- Arrays with the same domain should be aligned
- Arrays not physically connected should not be
DISTRIBUTE based on parallelism
- Optimal performance comes from parallel operations on a distributed axis
Pick the distribution pattern based on communication
- BLOCK is generally good for local stencils and fully-filled arrays
- CYCLIC and CYCLIC(K) are generally good for load balancing and triangular loops
- Conflicts require compromises, remapping, complex compilers, or new algorithms
REALIGN with care; REDISTRIBUTE with extreme care
- The computation performed must outweigh the communication required for the remapping
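Finally, here is a minimal sketch of dummy-argument mappings behind an explicit interface (the routine names and the trivial per-element work are illustrative assumptions, not from the tutorial): solve_block uses a prescriptive (BLOCK) mapping, so its actual is remapped on entry and restored on return, while scale_any uses a transcriptive INHERIT mapping and accepts whatever layout arrives.

PROGRAM linkage_sketch
  ! Illustrative sketch: prescriptive vs. transcriptive dummy mappings.
  INTERFACE
     SUBROUTINE solve_block(x)
       REAL x(:)
!HPF$ DISTRIBUTE x(BLOCK)   ! prescriptive: remap the actual to BLOCK
     END SUBROUTINE solve_block
     SUBROUTINE scale_any(x)
       REAL x(:)
!HPF$ INHERIT x             ! transcriptive: leave the actual where it is
     END SUBROUTINE scale_any
  END INTERFACE
  REAL a(1000)
!HPF$ DISTRIBUTE a(CYCLIC)
  a = 1.0
  CALL scale_any(a)    ! no remapping; works on the CYCLIC layout
  CALL solve_block(a)  ! remapped to BLOCK for the call, restored to CYCLIC after
  PRINT *, 'sum =', SUM(a)
END PROGRAM linkage_sketch

SUBROUTINE solve_block(x)
  REAL x(:)
!HPF$ DISTRIBUTE x(BLOCK)
  x = x + 1.0          ! stand-in for work that wants a BLOCK layout
END SUBROUTINE solve_block

SUBROUTINE scale_any(x)
  REAL x(:)
!HPF$ INHERIT x
  x = 2.0 * x          ! embarrassingly parallel: any layout will do
END SUBROUTINE scale_any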