"High Performance Fortran in Practice" tutorial Presented at SIAM Parallel Processing San Francisco, Feb. 14, 1995 Presented at Supercomputer '95, Mannheim, Germany June 27, 1995 Presented at Supercomputing '95, San Diego, December 4, 1995 Presented at University of Tennessee (short form), Knoxville, March 21, 1996 Presented at Metacenter Regional Alliances, Cornell, May 6, 1996 Presented at Summer of HPF Workshop, Vienna, July 1, 1996 Presented at Institute for Mathematics & its Applications, Minneapolis, September 11-13, 1996 Presented at Corps of Engineers Waterways Experiments Station, Vicksburg, MS, October 30-November 1, 1996 Presented at Supercomputing '96, Pittsburgh, PA, November 17, 1996 Presented at NAVO, Stennis Space Center, MS, Feb 13, 1997 Presented at HPF Users Group (short version), Santa Fe, NM, February 23, 1997 Presented at ASC, Wright-Patterson Air Force Base, OH, March 5, 1997 Parts presented at SC'97, November 17, 1997 Parts presented (slideshow mode) at SC¹97, November 15-21, 1997 Presented at DOD HPC Users Group, June 1, 1998 High Performance Fortran in Practice Charles Koelbel Click here for body text Support Supported in part by National Science Foundation DoD HPCMP CEWES MSRC, Prime Contract DAHC94-96-C-0002 DoD HPCMP ASC MSRC, Prime Contract DAHC94-96-C-0003 As alwaysŠ Views, opinions, and/or findings contained in this report are those of the author and should not be contrued as an offical Department of Defense position, policy, or decision unless so designated by other official documentation. Thanks toŠ The High Performance Fortran Forum The D System Group (Rice University) The Fortran 90D Group (Syracuse University) Ken Kennedy (Rice) David Loveman (DEC) Piyush Mehrotra (ICASE) Rob Schreiber (HP Labs) Guy Steele (Sun) Mary Zosel (Livermore) High Performance Fortran Background This is in no way a comprehensive history of HPF There are just too many influences to list (like 700 people on the mailing list) Some individuals who were particularly influential to HPF (and to me): Alok Choudhary (Syracuse) Geoffrey Fox (Syracuse) Ken Kennedy (Rice) Bob Knighten (Intel) David Loveman (DEC) Piyush Mehrotra (ICASE) Andy Meltzer (Cray Research) Rob Schreiber (RIACS) Guy Steele (Thinking Machines) Hans Zima (University of Vienna) Mary Zosel (Livermore) Defined by the High Performance Fortran Forum (HPFF) as a portable language for data-parallel computation History: Proposed at Supercomputing '91 and HPFF Kickoff Meeting Meetings every 6 weeks during 1992 in Dallas Final draft of HPF, version 1.0 in June 1993 New meetings in 1994 and 1995 to make corrections, define further requirements Influences: Industry: CM Fortran, C*, MasPar Fortran, DEC Academia: ADAPT, Fortran D, Kali, Vienna Fortran, CRPC Government: Applications experience, $$$ HPF Features Fortran 90 is the most recent Fortran standard Fortran 95 is in the works Data-parallel operations are the major new executable construct Some were adopted in Fortran 95 The heart of HPF is the data mapping features describing how data is partitioned among parallel processors Assume that computation will be partitioned along with the data The 1995 and 1996 meetings have (and will) address the omissions in HPF Most were intentional, due to the difficulty of reaching consensus All of Fortran 90 FORALL and INDEPENDENT Data Alignment and Distribution Miscellaneous Support Operations But: No parallel I/O No explicit message-passing Little support for irregular computations Little support for non-data parallelism Performance of HPF 
Performance of HPF
Performance is highly dependent on the compiler and on the nature of the code. The best performance is for data-parallel computations; tie the "natural" parallelism in the problem to distributed array dimensions.
Commercial compilers are now competitive with MPI for regular problems
Subscripts like A(L+1,K-1)
Research continues on irregular problems and task parallelism
Subscripts like A(INDX(I1,2))
Tree searches with branch-and-bound cutoffs

Recent Performance Results: Princeton Ocean Model
A full application for ocean modeling
Same source code on Cray T3E, SGI Origin 2000, and Compaq Pro 8000
87 x 57 x 19 grid with (*,BLOCK,BLOCK) distribution (a sketch of this kind of distribution appears at the end of this section)
Portland Group PGHPF compiler, versions 2.2 and 2.3
Contact: Steve Piacsek (piacsek@nrlssc.navy.mil)

Recent Performance Results: NAS Parallel Benchmarks
Well-known compact application benchmarks from NASA, modeling CFD applications
Class A (small) and B (large) input data
BT, EP, SP benchmarks written purely in HPF
FT benchmark calls an EXTRINSIC FFT

For More Information
These are listed in order of usefulness. www.crpc.rice.edu is physically at Rice University, Houston, TX; www.vcpc.univie.ac.at is physically at the Vienna Center for Parallel Computing, Vienna, Austria.
World Wide Web:
http://www.crpc.rice.edu/HPFF/home.html
http://www.vcpc.univie.ac.at/HPFF/home.html
These slides:
http://www.cs.rice.edu/~chk/hpf-tutorial.html
http://renoir.csc.ncsu.edu/MRA/HTML/Workshop2/Koelbel/
Mailing lists:
Write to: majordomo@cs.rice.edu
In message body: subscribe
Lists: hpff, hpff-interpret, hpff-core
Anonymous FTP:
Connect to titan.cs.rice.edu
Full draft in public/HPFF/draft
See public/HPFF/README for latest file list

I. Introduction to Data-Parallelism

Outline
1. Introduction to Data-Parallelism
2. Fortran 90/95 Features
3. HPF Parallel Features
4. HPF Data Mapping Features
5. Parallel Programming in HPF
6. HPF Version 2.0

Parallel Machines
Two theories of parallel execution: "Two heads are better than one" and "Too many cooks spoil the soup." Both can happen in practice. Keeping all the processors busy and accessing local memory means a good collaboration; synchronization and accessing nonlocal memory mean the chefs are fighting over the pots.
Generalities about modern parallel processors:
Individual processors are reasonably fast
Local memory access costs one to ten cycles
Nonlocal access costs tens to hundreds of cycles
Synchronization costs hundreds to thousands of cycles
Parallel computers allow several CPUs to contribute to a computation simultaneously. For our purposes, a parallel computer has three types of parts:
Processors
Memory modules
Communication / synchronization network
Key points:
All processors must be busy for peak speed
Local memory is directly connected to each processor
Accessing local memory is much faster than other memory
Synchronization is expensive, but necessary for correctness

Distributed Memory Machines
Every processor has a memory that the others cannot access.
Conceptually, the Paragon is very similar to the CM-5: bandwidth is higher, latency is longer, and the network topology is a two-dimensional torus.
To program these machines:
Divide the problem to minimize the number of messages while retaining parallelism
Convert all references to global structures into references to local pieces
Optimization: pack messages together to reduce fixed overhead (almost always needed)
Optimization: carefully schedule messages (usually done by a library)
Advantages:
Can be scalable
Can hide the latency of communication
Disadvantages:
Hard to program
Program and O/S (and sometimes data) must be replicated
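In HPF, the division of data that distributed-memory programming requires is written as a directive rather than coded by hand. The sketch below (invented routine and array names, made-up update formula; not taken from the Princeton Ocean Model source) shows a 3-D grid mapped with the (*,BLOCK,BLOCK) distribution mentioned earlier, together with the kind of regular nearest-neighbor subscripts, like A(L+1,K-1), that commercial compilers handle well:

      SUBROUTINE relax(u, unew, nz, ny, nx)
      INTEGER :: nz, ny, nx
      INTEGER :: k, j, i
      REAL :: u(nz,ny,nx), unew(nz,ny,nx)
      ! Keep each vertical column on one processor; split the
      ! horizontal plane into contiguous blocks
!HPF$ DISTRIBUTE u(*,BLOCK,BLOCK)
!HPF$ ALIGN unew(k,j,i) WITH u(k,j,i)

      ! Regular stencil subscripts: only neighboring blocks need to
      ! exchange boundary values, and the compiler arranges that
      FORALL (k = 1:nz, j = 2:ny-1, i = 2:nx-1)
         unew(k,j,i) = 0.25 * (u(k,j-1,i) + u(k,j+1,i)   &
                             + u(k,j,i-1) + u(k,j,i+1))
      END FORALL
      END SUBROUTINE relax

With this mapping, a reference like u(k,j+1,i) leaves the local block only at block boundaries, so communication grows with the surface of each block rather than its volume.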
Shared-Memory Machines
All processors access the same memory. (The interconnection network shown here is actually that of the BBN Butterfly, but the C-90 is in the same spirit.)
These machines share data by direct access, so potentially conflicting accesses must be protected by synchronization. Simultaneous access to the same memory bank causes contention, degrading performance, and some access patterns collide in the network (or bus), also causing contention. Many machines have caches at the processors. All of these features make it profitable to have each processor concentrate on one area of memory that the others access infrequently.
Advantages:
Easy to program (correctly)
Can share code and data among processors
Disadvantages:
Hard to program (optimally)
Not scalable due to bandwidth limitations

Distributed Shared Memory Machines
These combine the advantages of shared and distributed memory. Many hierarchical designs are appearing:
Typically, "nodes" with 4 to 32 processors
Each processor has a local cache
Processors within a node access shared memory
Nodes can get data from or put data to other nodes' memories

Parallel Algorithms
This is just one view of parallel programs (and algorithms). Note that the tasks mentioned here may not be the tasks you are used to. Here, "task" means "a sequence of operations that is scheduled atomically and can run to completion once started"; that is, a task is the computation between synchronization points. For example, an MPI process is a collection of tasks.
A parallel algorithm is a collection of tasks and a partial ordering between them.
Design goals:
Match tasks to the available processors (exploit parallelism)
Minimize ordering (avoid unnecessary synchronizations)
Utilize the remaining ordering (exploit locality)
Sources of parallelism:
Data parallelism: updating array elements simultaneously
Functional parallelism: conceptually different tasks which combine to solve the problem
Speculative parallelism: temporarily ignore the partial ordering, repairing later if this causes problems
This list is not exhaustive...

Data Parallelism in Algorithms
Data-parallel algorithms exploit the parallelism inherent in many large data structures: many elements in the structure mean many update operations.
Example: red-black relaxation, in which all red points can be updated in parallel (a data-parallel sketch of this update appears at the end of this part).
Analysis:
Scalable parallelism
Synchronization and locality may be suspect

Functional Parallelism in Algorithms
Functional parallelism exploits the parallelism between the parts of many systems: many pieces to work on mean many independent operations.
Example: aeroelasticity (wing design), in which CFD and CSM (and other models, if need be) can be evaluated in parallel.
Analysis:
Parallelism limited
Synchronization probably good

Parallel Languages
A parallel language provides an executable notation for implementing a parallel algorithm.
Design criteria:
How are parallel operations defined? Static tasks vs. dynamic tasks vs. implicit operations
How is data shared between tasks? Explicit communication/synchronization vs. shared memory
How is the language implemented? Low-overhead runtime systems vs. optimizing compilers
Usually a language reflects a particular type of parallelism.
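The red-black relaxation mentioned under "Data Parallelism in Algorithms" is a good example of how a data-parallel language expresses such an update. In the sketch below (array name, size, and relaxation formula are invented; this is not code from the tutorial), a masked FORALL updates every red point (i+j even) in one parallel operation, and the implicit synchronization between operations separates the red sweep from the black one:

      SUBROUTINE red_black_sweep(v, n)
      INTEGER :: n
      INTEGER :: i, j
      REAL :: v(n,n)
!HPF$ DISTRIBUTE v(BLOCK,BLOCK)

      ! Red sweep: each red point reads only black neighbors, so all
      ! red points may be updated simultaneously
      FORALL (i = 2:n-1, j = 2:n-1, MOD(i+j,2) == 0)
         v(i,j) = 0.25 * (v(i-1,j) + v(i+1,j) + v(i,j-1) + v(i,j+1))
      END FORALL

      ! Black sweep: the implicit synchronization guarantees that it
      ! sees the updated red values
      FORALL (i = 2:n-1, j = 2:n-1, MOD(i+j,2) == 1)
         v(i,j) = 0.25 * (v(i-1,j) + v(i+1,j) + v(i,j-1) + v(i,j+1))
      END FORALL
      END SUBROUTINE red_black_sweep

The program says nothing about which processor owns which points; the BLOCK distribution and the compiler take care of locality and the boundary exchanges.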
Data-Parallel Languages
Data parallelism originated on SIMD machines, but it has since been generalized in concept and implemented on many architectures; the definition is still somewhat fuzzy. HPF is largely based on data parallelism.
Other data-parallel languages of note:
C*
DataParallel C (derived from C*)
NESL
Fortran D, Fortran 90D, Vienna Fortran
Data-parallel languages provide an abstract, machine-independent model of parallelism:
Fine-grain parallel operations, such as element-wise operations on arrays
Shared data in large, global arrays with mapping "hints"
Implicit synchronization between operations
Partially explicit communication from the operation definitions
Advantages:
Global operations are conceptually simple
Easy to program (particularly for certain scientific applications)
Disadvantages:
Unproven compilers

Data-Parallel Languages, cont.
Data parallelism is (an attempt at) a high-level language, and high-level languages have inherent advantages and disadvantages. It is easier to concentrate on the high-level ideas: you can write the program faster, and you are more likely to get it right. The underlying system takes over the low-level details: you cannot micro-manage the execution, and you cannot always tell what will happen.
Abstractions like data parallelism split the work between the programmer and the compiler.
Programmer's task: solve the problem in this model
Concentrate on high-level structure and concepts
Aggregate operations on large data structures
Data in global arrays with mapping information
Compiler's task: map the conceptual (massive) parallelism to the physical (finite) machine
Fill in the grungy details
Guided by user "hints" like mapping information
Optimize computation and communication

Message-Passing Systems
For comparison to data-parallel languages:
The program is based on relatively coarse-grain tasks
Separate address space and a processor number for each task
Data shared by explicit messages
Point-to-point and collective communication patterns
Examples: MPI, PVM, Occam
Advantages: close to the hardware
Disadvantages: many low-level details

How HPF Is Implemented
Analysis:
Traditional dataflow and dependence analysis
Data mapping analysis
Computation partitioning:
Use data mapping information to create locality
Transform code to enhance locality
Communication introduction:
Communicate (move data) if the data mapping and computation partitioning don't agree
Lots of optimization to minimize and package communication
Code generation:
The really ugly details
Lots of optimization possible here, too

What Compilation Means for Programmers
Help the analysis with assertions:
ALIGN and DISTRIBUTE
INDEPENDENT
Distribute array dimensions that exhibit parallelism:
Conflicts require complex compilers, REDISTRIBUTE, or new algorithms
Consider communication patterns:
BLOCK is generally good for local stencils and fully-filled arrays
CYCLIC and CYCLIC(K) are generally good for load balancing and triangular loops
Don't hide what you are doing.
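A small sketch of the last two points (the routine name, array, and computation are invented for illustration): the triangular loop below does progressively less work per row, so a CYCLIC distribution spreads the load far more evenly than BLOCK would, and the INDEPENDENT directive asserts something the compiler may not be able to prove on its own, namely that the iterations do not interfere.

      SUBROUTINE scale_upper(a, n)
      INTEGER :: n
      INTEGER :: i, j
      REAL :: a(n,n)
      ! CYCLIC deals the rows out round-robin, so every processor
      ! gets a mix of long (early) and short (late) rows
!HPF$ DISTRIBUTE a(CYCLIC,*)

      ! Assert that iterations of I do not interfere; NEW(J) gives
      ! each iteration its own copy of the inner index
!HPF$ INDEPENDENT, NEW(j)
      DO i = 1, n
         DO j = i, n
            a(i,j) = a(i,j) / REAL(i)
         END DO
      END DO
      END SUBROUTINE scale_upper

With a BLOCK distribution of the rows, the processor holding the first rows would do most of the arithmetic; CYCLIC gives each processor roughly the same total, which is exactly the load-balancing point made above. And the directives are in plain sight, which is one way of not hiding what you are doing.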