"High Performance Fortran in Practice" tutorial Presented at Supercomputing '95, San Diego, December 4, 1995 Presented at University of Tennessee (short form), Knoxville, March 21, 1996 Revised and presented at High Performance Computing and Networking Europe, April 18, 1996 Presented at Metacenter Regional Alliances, Cornell, May 6, 1996 Presented at Summer of HPF Workshop, Vienna, July 1, 1996 Revised, expanded, and presented at Institute for Mathematics & its Applications, Minneapolis, September 11-13, 1994 Presented at Corps of Engineers Waterways Experiments Station, Vicksburg, MS, October 30-November 1, 1996 Presented at Supercomputing '96, Pittsburgh, PA, November 17, 1996 Presented at NAVO, Stennis Space Center, MS, Feb 13, 1997 Presented at HPF Users Group (short version), Santa Fe, NM, February 23, 1997 Presented at ASC, Wright-Patterson Air Force Base, OH, March 5, 1997 Parts presented at SC'97, November 17, 1997 Parts presented (slideshow mode) at SCı97, November 15-21, 1997 Presented at DOD HPC Users Group, June 1, 1998 HPF 2.0 preview Outline 1. Introduction to Data-Parallelism 2. Fortran 90/95 Features 3. HPF Parallel Features 4. HPF Data Mapping Features 5. Parallel Programming in HPF 6. HPF Version 2.0 HPF 2 Background New HPFF meetings were held to develop extensions to HPF Meetings began January 1995 Preliminary presentations at Supercomputing ı95 Complete draft at Supercomputing ı96 Finalized draft January 31, 1997 Areas for extensions Control parallelism (hpff-task@cs.rice.edu) Data distribution (hpff-distribute@cs.rice.edu) External interfaces (hpff-external@cs.rice.edu) Input is still welcome! Send mail to majordomo@cs.rice.edu to get on a list Send mail to hpff-interpret@cs.rice.edu to comment HPF 1.x Features HPF 2.0 Features (Language Laywer View) HPF 2.0 Features (Technical View) HPF 2.0 Deletions and Simplifications DYNAMIC, REALIGN, and REDISTRIBUTE Now ³just² approved extensions Removed due to complexity of implementation Mapping in the presence of sequence association Now forbidden in all cases Removed due to complexity, and lack of user demand Subroutine interfaces Explicit interface required if mapping changes, or INHERIT used Descriptive (³star²) syntax retained, primarily for error checking Changed to match spirit of Fortran 95, and simplify user model Methods of Avoiding DYNAMIC Distributions REDISTRIBUTE/REALIGN had two main uses: Changing access patterns dynamically Picking static distributions based on data size Changing access patterns Declare multiple arrays and copy data between them Picking a distribution based on input Clone sections of the program Sequence Association for Dummies Donıt Use It! Methods of Avoiding Sequence Association Just say "No!" 
Sequence Association for Dummies
- Don't Use It!

Methods of Avoiding Sequence Association
- Just say "No!"
- EQUIVALENCE was never a good idea for new HPF codes
  - Included for ease of porting F77 codes, with debatable benefits
- For new codes:
  - Always declare arrays to be their natural rank
  - Use ALLOCATABLE to make arrays their natural size
  - Use MODULE for global arrays, or pass them as explicit arguments
- For porting codes:
  - Top-down conversion of subroutines
  - If a subroutine really needs EQUIVALENCE, it may be better as an EXTRINSIC

Methods of Rewriting Subroutine Interfaces
- Always use an explicit interface
  - Create a MODULE with INTERFACE blocks
  - Create an INCLUDE file with INTERFACE blocks

New Data Mapping Features
- Extended distribution patterns
!HPF$ SHADOW x( 1, 0 )
!HPF$ DISTRIBUTE y( GEN_BLOCK( (/ 12,10,10,12 /) ) )
!HPF$ DISTRIBUTE z( INDIRECT(map_array) )
- Distribution to processor subsets
!HPF$ PROCESSORS procs(1:np)
!HPF$ DISTRIBUTE b(BLOCK) ONTO procs(1:np/2-1)
- Distribution of derived type components
      TYPE set_of_meshes
        REAL p(100,100), q(100,100), r(100,100)
!HPF$   DISTRIBUTE (BLOCK,*) :: p, q, r
      END TYPE
      TYPE(set_of_meshes) multi_block(32)
      !!! Do not try to DISTRIBUTE array multi_block !!!
- Rules for matching distributions (pointers, dummy arguments)

Examples of Extended Distributions
!HPF$ DISTRIBUTE x(BLOCK(SHADOW=1),*)
!HPF$ DISTRIBUTE y(GEN_BLOCK((/4,2,2,4/)),*)
!HPF$ DISTRIBUTE z(*,INDIRECT(map))

Example of Distribution to Subsets
!HPF$ PROCESSORS p(6)
!HPF$ DISTRIBUTE a(*,BLOCK) ONTO p
!HPF$ DISTRIBUTE b(*,BLOCK) ONTO p(1:3)
!HPF$ DISTRIBUTE c(*,BLOCK) ONTO p(4:6)

Rules for Mapping Pointers
- In the core language of HPF 2
  - Pointers cannot be mapped
  - Targets of pointers cannot be mapped (i.e. variables with the TARGET attribute)
- In the HPF 2 approved extensions
  - Pointers can be mapped
    - ALIGN and DISTRIBUTE take effect after ALLOCATE
    - INHERIT can be used to declare a pointer to "anything"
  - Targets can be mapped
  - In a pointer assignment, the target's mapping must be a specialization of the pointer's
    - That is, the pointer has to be higher in the diagram than the thing it points at
    - If the target is an array section, the pointer must have the INHERIT attribute

Implementation of HPF 2 DISTRIBUTE Patterns
- Conceptually, the process remains the same
  - Allocate memory, adjust indexing and loops, handle nonlocal data
  - New patterns require more elaborate methods to achieve this
- SHADOW
  - Add extra space to the allocation
  - Use that space for buffering nonlocal data and adjusting indices
  - Ignore that space when adjusting loop bounds
  - Simplifies addressing, may avoid copying
- GEN_BLOCK
  - Keep a table of block bounds on each processor (see the sketch below)
  - Search the table to find the home of nonlocal elements; adjust indices and loops
  - Allows some load balancing with locality
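To make the GEN_BLOCK bookkeeping concrete, here is a minimal hand-written sketch of the block-bounds table search. It is an illustration of the idea, not actual compiler output; the names owner_of, block_hi, and np are invented for this example.

! Sketch of the block-bounds search described above for GEN_BLOCK.
! block_hi(p) holds the largest global index owned by processor p;
! for GEN_BLOCK((/4,2,2,4/)) the table would be (/ 4, 6, 8, 12 /).
      INTEGER FUNCTION owner_of(i, block_hi, np)
        INTEGER, INTENT(IN) :: i, np, block_hi(np)
        INTEGER :: p
        DO p = 1, np
          IF (i <= block_hi(p)) THEN
            owner_of = p              ! first block whose upper bound covers i
            RETURN
          END IF
        END DO
        owner_of = np                 ! fall back to the last block
      END FUNCTION owner_of

A real runtime would more likely use a binary search or cache the result, but the table itself is all the extra state GEN_BLOCK needs beyond ordinary BLOCK.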
Implementation of HPF 2 DISTRIBUTE Patterns (cont.)
- INDIRECT
  - Keep a copy of the map array distributed by BLOCK
  - Keep a list of all elements on the local processor for adjusting loop bounds
  - Inspector/executor strategy for locating nonlocal elements
    - Inspector: gather all information needed from the map array
    - Executor: use this information to perform the computation
    - Only do the inspector once, if possible (i.e. when distributions and array access patterns stay exactly the same)
  - Allows arbitrary load balancing and communication reduction through partitioning
  - But finding the right partition is NP-complete

New Parallel Execution Features
- Loop reductions
!HPF$ INDEPENDENT, NEW(xinc), REDUCTION(x)
      DO i = 1, n
        CALL sub(i, xinc)
        x = x + xinc
      END DO
- Computation placement
!HPF$ INDEPENDENT
      DO i = 1, n
!HPF$   ON HOME( ix(i) )
        x(i) = y(ix(i)) - y(iy(i))
      END DO
- Task parallelism
!HPF$ TASK_REGION
!HPF$ ON HOME(p(1:8))
      CALL foo(x,y)
!HPF$ ON HOME(p(9:16))
      CALL bar(z)
!HPF$ END TASK_REGION

Example of Loop Reductions
!HPF$ INDEPENDENT, NEW(xinc), REDUCTION(x)
      DO i = 1, n
        CALL sub(i, xinc)
        x = x + xinc
      END DO

Example of Computation Placement
!HPF$ INDEPENDENT
      DO i = 1, 12
!HPF$   ON HOME( ix(i) )
        x(i) = y(ix(i)) - y(iy(i))
      END DO

Example of Locality Assertion
!HPF$ ALIGN (i) WITH x(i) :: ix, y, iy
!HPF$ DISTRIBUTE x(BLOCK)
!HPF$ INDEPENDENT
      DO i = 1, n
!HPF$   ON HOME( ix(i) ), RESIDENT(y(iy(i)))
        x(i) = y(ix(i)) - y(iy(i))
      END DO

Example of Task Parallelism
!HPF$ PROCESSORS p(8)
!HPF$ DISTRIBUTE a1(BLOCK,*) ONTO p(1:4)
!HPF$ DISTRIBUTE a2(*,BLOCK) ONTO p(5:8)
!HPF$ TEMPLATE, DIMENSION(4), DISTRIBUTE(BLOCK) ONTO p(1:4) :: td1
!HPF$ ALIGN WITH td1(*) :: done1
!HPF$ TASK_REGION
      done1 = .false.
      DO WHILE (.true.)
!HPF$   ON HOME(p(1:4)) BEGIN, RESIDENT
        READ (unit=iu, end=100) a1
        CALL rowffts(a1)
        GOTO 101
100     done1 = .true.
101     CONTINUE
!HPF$   END ON
        IF (done1) EXIT
        a2 = a1
!HPF$   ON HOME(p(5:8)) BEGIN, RESIDENT
        CALL colffts(a2)
        WRITE (unit=ou) a2
!HPF$   END ON
      END DO
!HPF$ END TASK_REGION

GRADE_UP versus SORT_UP
- Sometimes you want to keep things together (GRADE_UP)
- Sometimes you don't (SORT_UP)

Implementation of HPF 2 Parallel Features
- All new features represent useful information to the compiler
  - They do not change the meaning of the program if properly used
- REDUCTION
  - Choose an efficient order of evaluation for the reduction tree
  - For scalar reductions, keep a local sum on each node and have a global combining phase at the end (see the sketch below)
  - Alternate implementation: critical region
  - For vector reductions, use the same algorithm as XXX_SCATTER
- ON HOME
  - Base the loop partitioning on the HOME expression
  - Invert the subscripting function to derive loop bounds
  - Does not affect where communication/synchronization can be placed, but may change what must be communicated
  - Warning: you can outsmart the compiler this way, to your detriment

Implementation of HPF 2 Parallel Features (cont.)
- RESIDENT
  - Do not generate communication for the RESIDENT expressions (and look for logically equivalent expressions as well)
  - If no expression is given, then no communication for any variable is needed
  - Allows INDEPENDENT with CALL
- TASK_REGION
  - Shared memory: use barrier synchronization for tasks entering each ON block; also synchronize on shared data access outside ON blocks
  - Distributed memory: processor groups communicate only internally within an ON block (use ordinary communication) and communicate with other tasks at ON boundaries (use group communication)
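As a concrete illustration of the scalar REDUCTION strategy above (a local partial sum on each node plus one global combining phase), here is a hand-written sketch in ordinary message-passing Fortran. It is not HPF compiler output; the subroutine name reduced_loop, the bounds my_lo and my_hi, and the use of MPI for the combining step are assumptions made only for this example.

! Sketch only: one possible node program for the REDUCTION(x) loop above.
      SUBROUTINE reduced_loop(my_lo, my_hi, x)
        USE MPI
        INTEGER, INTENT(IN) :: my_lo, my_hi   ! this node's share of i = 1, n
        REAL, INTENT(INOUT) :: x
        REAL :: xinc, local_sum, global_sum
        INTEGER :: i, ierr
        local_sum = 0.0
        DO i = my_lo, my_hi                   ! purely local iterations
          CALL sub(i, xinc)                   ! same sub as in the example loop
          local_sum = local_sum + xinc
        END DO
        ! one global combining phase instead of n serialized updates of x
        CALL MPI_ALLREDUCE(local_sum, global_sum, 1, MPI_REAL, MPI_SUM, MPI_COMM_WORLD, ierr)
        x = x + global_sum
      END SUBROUTINE reduced_loop

The alternate critical-region implementation mentioned above serializes the updates; the combining phase is usually preferred because its cost grows only logarithmically with the number of processors.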
External Interfaces
- Calling HPF from other languages
  - How to ensure HPF is called consistently?
- Calling other languages from HPF
  - How to avoid unnecessary synchronization?
- Parallel input/output
  - What do you mean by "parallel I/O"?
  - Asynchronous I/O
- Tools
  - Can we define a symbol table format for debuggers, etc.?

Example of Asynchronous I/O
      i0 = 0; i1 = 1
      READ (file0, ID=id0, END=100) a(i0,1:1048576)
      DO
        WAIT (ID=id0, END=100)
        ! start next read into other row
        itmp = i0; i0 = i1; i1 = itmp
        READ (file0, ID=id0, END=100) a(i0,1:1048576)
        ! overlap I/O with useful work
        CALL PROCESSING( a(i1,1:1048576) )
      END DO
100   CONTINUE

Conclusions

For More Information
- These are listed in order of usefulness
  - www.crpc.rice.edu is physically at Rice University, Houston, TX
  - www.vcpc.univie.ac.at is physically at the Vienna Center for Parallel Computing, Vienna, Austria
- World Wide Web
  - http://www.crpc.rice.edu/HPFF/home.html
  - http://www.vcpc.univie.ac.at/HPFF/home.html
- These slides
  - http://www.cs.rice.edu/~chk/hpf-tutorial.html
  - http://renoir.csc.ncsu.edu/MRA/HTML/Workshop2/Koelbel/
- Mailing lists: write to majordomo@cs.rice.edu
  - In the message body: subscribe
  - Lists: hpff, hpff-interpret, hpff-core
- Anonymous FTP: connect to titan.cs.rice.edu
  - Full draft in public/HPFF/draft
  - See public/HPFF/README for the latest file list