"High Performance Fortran in Practice" tutorial Presented at SIAM Parallel Processing San Francisco, Feb. 14, 1995 Presented at Supercomputer '95, Mannheim, Germany, June 27, 1995 Presented at Supercomputing '95, San Diego, December 4, 1995 Presented at University of Tennessee (short form), Knoxville, March 21, 1996 Presented at Metacenter Regional Affiliates, Cornell, May 6, 1996 Presented at Summer of HPF Workshop, Vienna, July 1, 1996 Presented at Institute for Mathematics & its Applications, Minneapolis, September 11-13, 1996 Presented at Corps of Engineers Waterways Experiments Station, Vicksburg, MS, October 30-November 1, 1996 Presented at Supercomputing '96, Pittsburgh, PA, November 17, 1996 Presented at NAVO, Stennis Space Center, MS, Feb 13, 1997 Presented at HPF Users Group (short version), Santa Fe, NM, February 23, 1997 Presented at ASC, Wright-Patterson Air Force Base, OH, March 5, 1997 Parts presented at SC'97, November 17, 1997 Parts presented (slideshow mode) at SCą97, November 15-21, 1997 Presented at DOD HPC Users Group, June 1, 1998 III. HPF Parallel Features Outline 1. Introduction to Data-Parallelism 2. Fortran 90/95 Features 3. HPF Parallel Features 4. HPF Data Mapping Features 5. Parallel Programming in HPF 6. HPF Version 2.0 Data-Parallel Statements Data parallelism emphasizes having many fine-grain operations, such as computations on every element of an array HPF has several ways to exploit data parallelism: Array expressions: Taken from Fortran 90 FORALL: Tightly-coupled parallel execution based on the structure of an index space PURE: Procedures without side effects that may be called in FORALL INDEPENDENT: Assertion that iterations do not interfere with each other HPF library and intrinsics: Extended from Fortran 90 B. FORALL statements, etc. The Single-Statement FORALL Syntax: FORALL ( index-spec-list [,mask-expr] ) forall-assignment index-spec is int-variable = triplet-spec forall-assignment is ordinary assignment or pointer assignment Semantics: Equivalent to array assignment in Fortran 90 For each value of indices, check the mask Compute right-hand sides for unmasked values Make assignments to left-hand sides for unmasked values Multiple assignments to the same location are not standard-conforming (i.e. are undefined) Note: FORALL is not a general-purpose parallel loop! The Multi-Statement FORALL Syntax: FORALL ( index-spec-list [, mask] ) forall-body-list END FORALL forall-body can be a forall-assignment, FORALL, or WHERE Semantics: Multi-statement FORALL is shorthand for a series of single-statement FORALLs Multi-statement FORALLs can be nested to produce more complex iteration spaces Each bottom-level assignment statement is completed before the next one starts Note: FORALL is not a general-purpose parallel loop! An Example of FORALL Initially, a = [0, 1, 2, 3, 4] b = [0, 10, 20, 30, 40] c = [-1, -1, -1, -1, -1] FORALL ( i = 2:4 ) a(i) = a(i-1) + a(i+1) c(i) = b(i) * a(i+1) END FORALL Afterwards, a = [0, 2, 4, 6, 4] b = [0, 10, 20, 30, 40] c = [-1, 40, 120, 120, -1] An Example of DO Initially, a = [0, 1, 2, 3, 4] b = [10, 20, 30, 40, 50] c = [-1, -1, -1, -1, -1] DO i = 2, 4 a(i) = a(i-1) + a(i+1) c(i) = b(i) * a(i+1) END DO Afterwards, a = [0, 2, 5, 9, 4] b = [10, 20, 30, 40, 50] c = [-1, 20, 60, 120, -1] An Example of Nested FORALLs FORALL ( i = 1:3 ) a(i) = b(i) FORALL ( j = 1:i ) c(i,j) = d(i,j) END FORALL END FORALL Why Use FORALL? Assignments to array sections FORALL ( i = 1:4, j = 2:4 ) a(i,j) = a(i,j-1) FORALL ( i = 1:4 ) a(i,i) = a(i,i) * scale FORALL ( i = 1:4 ) FORALL ( j=i:4 ) a(i,j) = a(i,j) / a(i,i) END FORALL FORALL ( i=1:4 ) FORALL (j=ilo(i):ihi(i)) x(j) = x(j)*y(i) END FORALL Calculating based on a subscript FORALL (i=0:n,j=0:n) a(i,j) = SQRT(1.0*(i*i+j*j))/n Determinate Behavior of FORALL Consider the statement: FORALL ( i = 1:n ) a(ix(i)) = a(i) If ix has no repeated values (e.g. ix is a permutation), this is well-defined Note that a(i) is always the łold˛ value, not the new one computed elsewhere in the FORALL If ix has repeated values (e.g. ix(i)=i/2), this is not defined by HPF The compiler may take any action it feels appropriateŠ Assigning one of the possible values is appropriate Reporting an error is appropriate Assigning a random number is appropriate Implementation of FORALL Use owner-computes rule to partition computation See implementation of DISTRIBUTE Semantics of the construct allow parallelism All rows of dependence diagram can execute in parallel Dependence analysis can further limit synchronization If RHS and LHS do not access the same element, no synchronization is needed Data need only be moved at synchronization points Before each RHS and before each LHS These may be combined with previous/following statement Implementation of FORALL (cont.) Data movement becomes much more difficult on distributed memory machines if the FORALL calls a function There is no race condition But the data may not be on the processor where it is needed And there is no synchronization point where it can be exchanged For this reason, many implementations serialize some cases of parallel constructs Usually there is a compiler override switch to parallelize at least some cases Best advice: Donąt use global data in functions D. PURE functions PURE Functions Syntax: PURE [type-spec] FUNCTION function-name( [ dummy-arg-name-list ] ) PURE functions have no side effects Syntactic constraints: Global variables and dummy arguments cannot be used in any context that may cause the variable to become defined Left hand side of assignment DO index, ASSIGN, ALLOCATE Actual argument with INTENT(OUT) Targets of pointer assignments (due to later use of pointers) Full list of restrictions is too long to fit on this slide! No external I/O or file operations Only inherited distribution/alignment of dummies and locals Intrinsic functions are PURE PURE Functions in Pictures COMMON X REAL Y, Z Z = FCN( Y ) PURE Functions and FORALL PURE functions are the only ones that can be invoked from a FORALL Safe to do this because they have no side effects Useful to do this because there are things you cannot do (directly) in a FORALL Conditionals and iteration E.g., Do point-wise iteration this way Local variables E.g., Handle temporaries this way Why Use PURE Functions? Elemental functions Intrinsics Equations of state, etc. Note: A row can be an element! Pointwise iteration Mandelbrot sets Pointwise Newton iterations PURE for Mandelbrot Sets ! The caller (Explicit interface not shown) FORALL ( i=1:n, j=1:m ) k(i,j) = mandelbrot( CMPLX((i-1)*1.0/(n-1), & (j-1)*1.0/(m-1)), 1000 ) END FORALL ! The callee PURE INTEGER FUNCTION mandelbrot(x, itol) COMPLEX, INTENT(IN) :: x INTEGER, INTENT(IN) :: itol COMPLEX xtmp INTEGER k k = 0 xtmp = -x DO WHILE (ABS(xtmp)<2.0 .AND. k err_tol) !HPF$ INDEPENDENT DO i = 2, n-1 b(i) = a(i) a(i) = 0.5 * (a(i-1) + a(i+1)) b(i) = ABS(b(i) - a(i)) END DO err = MAXVAL(b(2:n-1)) END DO Example of Data-Dependent INDEPENDENT Assertion Sometimes true !HPF$ INDEPENDENT, NEW(j, n1) DO i = 1, nblue n1 = iblue(i) DO j = ibegin(n1), ibegin(n1+1)-1 x(n1) = x(n1) + y(j)*x(ired(j)) END DO END DO Why Use INDEPENDENT? Yet another way to do (some) array assignments !HPF$ INDEPENDENT DO i = 1, n a(i) = b(i) END DO Express application-dependent information ! łcolors˛ don't interfere with each other !HPF$ INDEPENDENT, NEW(ix, i1,i2,f12) DO i = 1, ncolor DO ix = color_beg(i), color_end(i) i1 = icolor(ix,1) i2 = icolor(ix,2) f12 = w(i1)-w(i2) x(i1) = x(i1) + f12 x(i2) = x(i2) - f12 END DO END DO Implementation of INDEPENDENT Virtually identical to FORALL Use owner-computes rule to partition computation See implementation of DISTRIBUTE Semantics of the construct allow parallelism All iterations can execute in parallel Data need only be moved at synchronization points Before and after the loop No further dependence analysis needed Warnings about functions still apply Possibly worse, due to more pathological cases In Summary: DO, FORALL and INDEPENDENT Hints for Using Data Parallel Statements Use FORALL to extend array operations More shapes User-defined elemental functions Better than array syntax? Beware of hidden overheads Intra-statement overheads Inter-statement synchronizations Complex pure functions Assert INDEPENDENT only when it is true This has implications for debuggers and programming environments! C. Library, intrinsic, and EXTRINSIC functions The HPF Library and New Intrinsics Extended intrinsics: MAXLOC, MINLOC One elemental intrinsic: ILEN System inquiry intrinsics: NUMBER_OF_PROCESSORS New reduction functions: IAND Combining-scatter functions: SUM_SCATTER Prefix reduction functions: SUM_PREFIX Sorting functions: GRADE_UP Bit manipulation functions: POPCNT Data distribution inquiry subroutines: HPF_DISTRIBUTION Examples of HPF Library Many implement data-parallel operations that are not elemental Why Use the HPF Library? If you need the functionsŠ Round out the set of reductions All built-in associative, commutative functions Partial reductions (prefix reductions) for all reductions Sorting Functions were chosen for the library because They are useful They are data-parallel They are hard to implement efficiently by hand Typical Uses of HPF Library Accumulations through indirection arrays x = SUM_SCATTER( flux, x, nbr(1,1:n) ) x = SUM_SCATTER( -flux, x, nbr(2,1:n) ) ! Equivalent to the following DO i = 1, n x(nbr(1,i)) = x(nbr(1,i)) + flux(i) x(nbr(2,i)) = x(nbr(2,i)) - flux(i) END DO Manipulating array-based sparse structures inum(1:n) = MAX( iend(1:n)-ibeg(1:n)+1, 0 ) ibeg_new(1:n) = SUM_PREFIX(inum(1:n)) + 1 iend_new(1:n) = ibeg_new(1:n)+inum(1:n)-1 ! Moving the data left as exercise for reader Implementation of the HPF 1 Library Functions were added to the library because They are useful They can be implemented in parallel on all (known) machines The optimal implementation is complex or machine-dependent Therefore, HPF puts the onus on the vendors to provide the best possible implementation Can be implemented by subroutine library or in-line expansion Should use the best known algorithms for reduction, prefix operations, etc. EXTRINSIC Procedures Syntax: EXTRINSIC ( extrinsic-kind-keyword ) Goes in FUNCTION or SUBROUTINE header (like PURE) An escape mechanism for allowing HPF to call other languages and paradigms On the caller (HPF) side: There must be an explicit interface with the EXTRINSIC directive Remapping occurs (if needed) to meet distribution specifications System synchronizes all processors before the call System calls łlocal˛ routine on every processor On the callee (non-HPF) side: INTENT(IN) and INTENT(OUT) must be obeyed If variables are replicated, callee must make them consistent before return Processors can access their own section of distributed arrays EXTRINSIC(HPF_LOCAL) € HPF_LOCAL is a SPMD language that meshes well with łglobal˛ HPF ­ It can do things HPF canąt ­ Global HPF can do things it canąt € On the caller (global HPF) side EXTRINSIC(HPF_LOCAL) REAL FUNCTION foo(x) REAL x(:) !HPF$ DISTRIBUTE x(BLOCK) € On the callee (local HPF) side EXTRINSIC(HPF_LOCAL) REAL FUNCTION foo(x) REAL x(:) € Inside foo, SIZE(x) checks the local section of x € HPF_LOCAL_LIBRARY provides GLOBAL_SIZE(x) to check the global size of x Example of HPF_LOCAL EXTRINSIC(F77_LOCAL) € HPF_LOCAL requires a Fortran 90 interface to the local subroutine with assumed-shape arrays - Many libraries do not use assumed-shape arrays - Thus was born the need for F77_LOCAL € On the caller (HPF) side: EXTRINSIC(HPF_LOCAL) REAL FUNCTION foo(x) REAL x(:) !HPF$ DISTRIBUTE x(BLOCK) € On the callee side: - Implicit interface (FORTRAN 77 doesn't have explicit interfaces) Example of F77_LOCAL (From NAS FT Benchmark) ! ! 3D FFT subroutine used by the PESSL implementation. ! subroutine fft (n1, n2, n3, isign, scale, x, y) use blacs use types implicit none ! ! Arguments ! integer, intent(in) :: n1, n2, n3, isign real(R8), intent(in) :: scale ! complex(R8), dimension(:,:,:), intent(in) :: x !hpf$ template gridx(n1,n2,n3) !hpf$ distribute(*,*,block) :: gridx !hpf$ align(:,:,:) with *gridx :: x ! complex(R8), dimension(:,:,:), intent(out) :: y !hpf$ template gridy(n3,n2,n1) !hpf$ distribute(*,*,block) :: gridy !hpf$ align(:,:,:) with *gridy :: y Example of F77_LOCAL (From NAS FT Benchmark) II ! ! Interfaces ! interface extrinsic (f77_local) subroutine & & pdcft3 (x, y, n1, n2, n3, isign, scale, ictxt, ip) use types implicit none integer, intent(in) :: n1, n2, n3, isign, ictxt real(R8), intent(in) :: scale integer, dimension(:), intent(in) :: ip complex(R8), dimension(:,:,:), intent(in) :: x !hpf$ template tempx(n1,n2,n3) !hpf$ distribute(*,*,block) :: tempx !hpf$ align(:,:,:) with *tempx :: x complex(R8), dimension(:,:,:), intent(out) :: y !hpf$ template tempy(n3,n2,n1) !hpf$ distribute(*,*,block) :: tempy !hpf$ align(:,:,:) with *tempy :: y end subroutine pdcft3 end interface ! ip = 0 call pdcft3 (x, y, n1, n2, n3, isign, scale, ictxt, ip) ! Example of F77_LOCAL (From NAS FT Benchmark) III ! call blacs_get (0, 0, val) ictxt = val(1) ! ! Calculate the USERMAP for the BLACS grid, and set up the grid. The ! processor ID assignment is done by the same algorithm used by PGHPF, ! and may be different for other HPF implementions. ! nprow = size (usermap, 1) npcol = size (usermap, 2) do j = 1, npcol do i = 1, nprow usermap(i,j) = (j - 1) * nprow + (i - 1) end do end do call blacs_gridmap (ictxt, usermap, nprow, nprow, npcol) ! ip = 0 call pdcft3 (x, y, n1, n2, n3, isign, scale, ictxt, ip) ! call blacs_gridexit (ictxt) ! Why Use EXTRINSIC Procedures? Access to other programming languages and paradigms HPF_SERIAL: Sequential execution For example, interfacing with graphics libraries HPF_LOCAL: Single Program Multiple Data (SPMD) paradigm HPF_CRAFT: Cray native programming language (HPF 2.0) Expressiveness Some things are just hard to write in HPF Some things are impossible to do in HPF Nondeterminism Efficiency Because of interface requirements, this works best for superlinear operations (e.g. ScaLAPACK)