"High Performance Fortran in Practice" tutorial Presented at SIAM Parallel Processing San Francisco, Feb. 14, 1995 Presented at Supercomputer '95, Mannheim, Germany June 27, 1995 Presented at Supercomputing '95, San Diego, December 4, 1995 Presented at University of Tennessee (short form), Knoxville, March 21, 1996 Presented at Metacenter Regional Alliances, Cornell, May 6, 1996 Presented at Summer of HPF Workshop, Vienna, July 1, 1996 Presented at Institute for Mathematics & its Applications, Minneapolis, September 11-13, 1996 Presented at Corps of Engineers Waterways Experiments Station, Vicksburg, MS, October 30-November 1, 1996 Presented at Supercomputing '96, Pittsburgh, PA, November 17, 1996 Presented at NAVO, Stennis Space Center, MS, Feb 13, 1997 Presented at HPF Users Group (short version), Santa Fe, NM, February 23, 1997 Presented at ASC, Wright-Patterson Air Force Base, OH, March 5, 1997 Parts presented at SC'97, November 17, 1997 Parts presented (slideshow mode) at SC¹97, November 15-21, 1997 Presented at DOD HPC Users Group, June 1, 1998 High Performance Fortran in Practice Charles Koelbel Click here for body text Support Supported in part by National Science Foundation DoD HPCMP CEWES MSRC, Prime Contract DAHC94-96-C-0002 DoD HPCMP ASC MSRC, Prime Contract DAHC94-96-C-0003 As alwaysŠ Views, opinions, and/or findings contained in this report are those of the author and should not be contrued as an offical Department of Defense position, policy, or decision unless so designated by other official documentation. Thanks toŠ The High Performance Fortran Forum The D System Group (Rice University) The Fortran 90D Group (Syracuse University) Ken Kennedy (Rice) David Loveman (DEC) Piyush Mehrotra (ICASE) Rob Schreiber (HP Labs) Guy Steele (Sun) Mary Zosel (Livermore) High Performance Fortran Background This is in no way a comprehensive history of HPF There are just too many influences to list (like 700 people on the mailing list) Some individuals who were particularly influential to HPF (and to me): Alok Choudhary (Syracuse) Geoffrey Fox (Syracuse) Ken Kennedy (Rice) Bob Knighten (Intel) David Loveman (DEC) Piyush Mehrotra (ICASE) Andy Meltzer (Cray Research) Rob Schreiber (RIACS) Guy Steele (Thinking Machines) Hans Zima (University of Vienna) Mary Zosel (Livermore) Defined by the High Performance Fortran Forum (HPFF) as a portable language for data-parallel computation History: Proposed at Supercomputing '91 and HPFF Kickoff Meeting Meetings every 6 weeks during 1992 in Dallas Final draft of HPF, version 1.0 in June 1993 New meetings in 1994 and 1995 to make corrections, define further requirements Influences: Industry: CM Fortran, C*, MasPar Fortran, DEC Academia: ADAPT, Fortran D, Kali, Vienna Fortran, CRPC Government: Applications experience, $$$ HPF Features Fortran 90 is the most recent Fortran standard Fortran 95 is in the works Data-parallel operations are the major new executable construct Some were adopted in Fortran 95 The heart of HPF is the data mapping features describing how data is partitioned among parallel processors Assume that computation will be partitioned along with the data The 1995 and 1996 meetings have (and will) address the omissions in HPF Most were intentional, due to the difficulty of reaching consensus All of Fortran 90 FORALL and INDEPENDENT Data Alignment and Distribution Miscellaneous Support Operations But: No parallel I/O No explicit message-passing Little support for irregular computations Little support for non-data parallelism Performance of HPF 
Performance of HPF
Performance is highly dependent on the compiler and on the nature of the code. The best performance is for data-parallel computations; tie the "natural" parallelism in the problem to distributed array dimensions.
Commercial compilers are now competitive with MPI for regular problems
Subscripts like A(L+1,K-1)
Research continues on irregular problems and task parallelism
Subscripts like A(INDX(I1,2))
Tree searches with branch-and-bound cutoffs

Recent Performance Results: Princeton Ocean Model
A full application for ocean modeling
Same source code on Cray T3E, SGI Origin 2000, and Compaq Pro 8000
87 x 57 x 19 grid with (*,BLOCK,BLOCK) distribution (a sketch of this kind of distribution appears at the end of this section)
Portland Group PGHPF compiler, versions 2.2 and 2.3
Contact: Steve Piacsek (piacsek@nrlssc.navy.mil)

Recent Performance Results: NAS Parallel Benchmarks
Well-known compact application benchmarks from NASA, modeling CFD applications
Class A (small) and B (large) input data
BT, EP, SP benchmarks written purely in HPF
FT benchmark calls an EXTRINSIC FFT

For More Information
These are listed in order of usefulness. www.crpc.rice.edu is physically at Rice University, Houston, TX; www.vcpc.univie.ac.at is physically at the Vienna Center for Parallel Computing, Vienna, Austria.
World Wide Web:
http://www.crpc.rice.edu/HPFF/home.html
http://www.vcpc.univie.ac.at/HPFF/home.html
These slides:
http://www.cs.rice.edu/~chk/hpf-tutorial.html
http://renoir.csc.ncsu.edu/MRA/HTML/Workshop2/Koelbel/
Mailing lists:
Write to: majordomo@cs.rice.edu
In message body: subscribe
Lists: hpff, hpff-interpret, hpff-core
Anonymous FTP:
Connect to titan.cs.rice.edu
Full draft in public/HPFF/draft
See public/HPFF/README for latest file list

I. Introduction to Data-Parallelism

Outline
1. Introduction to Data-Parallelism
2. Fortran 90/95 Features
3. HPF Parallel Features
4. HPF Data Mapping Features
5. Parallel Programming in HPF
6. HPF Version 2.0

Parallel Machines
Two theories of parallel execution: "Two heads are better than one" and "Too many cooks spoil the soup." Both can happen in practice. Keeping all the processors busy and accessing local memory means a good collaboration; synchronization and accessing nonlocal memory mean the chefs are fighting over the pots.
Generalities about modern parallel processors:
Individual processors are reasonably fast
Local memory access costs one to ten cycles
Nonlocal access costs tens to hundreds of cycles
Synchronization costs hundreds to thousands of cycles
Parallel computers allow several CPUs to contribute to a computation simultaneously. For our purposes, a parallel computer has three types of parts:
Processors
Memory modules
Communication / synchronization network
Key points:
All processors must be busy for peak speed
Local memory is directly connected to each processor
Accessing local memory is much faster than other memory
Synchronization is expensive, but necessary for correctness

Distributed Memory Machines
Every processor has a memory that the others cannot access.
Conceptually, the Paragon is very similar to the CM-5: bandwidth is higher, latency is longer, and the network topology is a two-dimensional torus.
To program these machines:
Divide the problem to minimize the number of messages while retaining parallelism
Convert all references to global structures into references to local pieces
Optimization: pack messages together to reduce fixed overhead (almost always needed)
Optimization: carefully schedule messages (usually done by a library)
Advantages:
Can be scalable
Can hide the latency of communication
Disadvantages:
Hard to program
Program and O/S (and sometimes data) must be replicated
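In HPF, the division of data that distributed-memory programming requires is written as a directive rather than coded by hand. The sketch below (invented routine and array names, made-up update formula; not taken from the Princeton Ocean Model source) shows a 3-D grid mapped with the (*,BLOCK,BLOCK) distribution mentioned earlier, together with the kind of regular nearest-neighbor subscripts, like A(L+1,K-1), that commercial compilers handle well:

      SUBROUTINE relax(u, unew, nz, ny, nx)
      INTEGER :: nz, ny, nx
      INTEGER :: k, j, i
      REAL :: u(nz,ny,nx), unew(nz,ny,nx)
      ! Keep each vertical column on one processor; split the
      ! horizontal plane into contiguous blocks
!HPF$ DISTRIBUTE u(*,BLOCK,BLOCK)
!HPF$ ALIGN unew(k,j,i) WITH u(k,j,i)

      ! Regular stencil subscripts: only neighboring blocks need to
      ! exchange boundary values, and the compiler arranges that
      FORALL (k = 1:nz, j = 2:ny-1, i = 2:nx-1)
         unew(k,j,i) = 0.25 * (u(k,j-1,i) + u(k,j+1,i)   &
                             + u(k,j,i-1) + u(k,j,i+1))
      END FORALL
      END SUBROUTINE relax

With this mapping, a reference like u(k,j+1,i) leaves the local block only at block boundaries, so communication grows with the surface of each block rather than its volume.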
Shared-Memory Machines
All processors access the same memory. (The interconnection network shown here is actually that of the BBN Butterfly, but the C-90 is in the same spirit.)
These machines share data by direct access, so potentially conflicting accesses must be protected by synchronization. Simultaneous access to the same memory bank causes contention, degrading performance, and some access patterns collide in the network (or bus), also causing contention. Many machines have caches at the processors. All of these features make it profitable to have each processor concentrate on one area of memory that the others access infrequently.
Advantages:
Easy to program (correctly)
Can share code and data among processors
Disadvantages:
Hard to program (optimally)
Not scalable due to bandwidth limitations

Distributed Shared Memory Machines
These combine the advantages of shared and distributed memory. Many hierarchical designs are appearing:
Typically, "nodes" with 4 to 32 processors
Each processor has a local cache
Processors within a node access shared memory
Nodes can get data from or put data to other nodes' memories

Parallel Algorithms
This is just one view of parallel programs (and algorithms). Note that the tasks mentioned here may not be the tasks you are used to. Here, "task" means "a sequence of operations that is scheduled atomically and can run to completion once started"; that is, a task is the computation between synchronization points. For example, an MPI process is a collection of tasks.
A parallel algorithm is a collection of tasks and a partial ordering between them.
Design goals:
Match tasks to the available processors (exploit parallelism)
Minimize ordering (avoid unnecessary synchronizations)
Utilize the remaining ordering (exploit locality)
Sources of parallelism:
Data parallelism: updating array elements simultaneously
Functional parallelism: conceptually different tasks which combine to solve the problem
Speculative parallelism: temporarily ignore the partial ordering, repairing later if this causes problems
This list is not exhaustive...

Data Parallelism in Algorithms
Data-parallel algorithms exploit the parallelism inherent in many large data structures: many elements in the structure mean many update operations.
Example: red-black relaxation, in which all red points can be updated in parallel (a data-parallel sketch of this update appears at the end of this part).
Analysis:
Scalable parallelism
Synchronization and locality may be suspect

Functional Parallelism in Algorithms
Functional parallelism exploits the parallelism between the parts of many systems: many pieces to work on mean many independent operations.
Example: aeroelasticity (wing design), in which CFD and CSM (and other models, if need be) can be evaluated in parallel.
Analysis:
Parallelism limited
Synchronization probably good

Parallel Languages
A parallel language provides an executable notation for implementing a parallel algorithm.
Design criteria:
How are parallel operations defined? Static tasks vs. dynamic tasks vs. implicit operations
How is data shared between tasks? Explicit communication/synchronization vs. shared memory
How is the language implemented? Low-overhead runtime systems vs. optimizing compilers
Usually a language reflects a particular type of parallelism.
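The red-black relaxation mentioned under "Data Parallelism in Algorithms" is a good example of how a data-parallel language expresses such an update. In the sketch below (array name, size, and relaxation formula are invented; this is not code from the tutorial), a masked FORALL updates every red point (i+j even) in one parallel operation, and the implicit synchronization between operations separates the red sweep from the black one:

      SUBROUTINE red_black_sweep(v, n)
      INTEGER :: n
      INTEGER :: i, j
      REAL :: v(n,n)
!HPF$ DISTRIBUTE v(BLOCK,BLOCK)

      ! Red sweep: each red point reads only black neighbors, so all
      ! red points may be updated simultaneously
      FORALL (i = 2:n-1, j = 2:n-1, MOD(i+j,2) == 0)
         v(i,j) = 0.25 * (v(i-1,j) + v(i+1,j) + v(i,j-1) + v(i,j+1))
      END FORALL

      ! Black sweep: the implicit synchronization guarantees that it
      ! sees the updated red values
      FORALL (i = 2:n-1, j = 2:n-1, MOD(i+j,2) == 1)
         v(i,j) = 0.25 * (v(i-1,j) + v(i+1,j) + v(i,j-1) + v(i,j+1))
      END FORALL
      END SUBROUTINE red_black_sweep

The program says nothing about which processor owns which points; the BLOCK distribution and the compiler take care of locality and the boundary exchanges.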
Data-Parallel Languages
Data parallelism originated on SIMD machines, but it has since been generalized in concept and implemented on many architectures; the definition is still somewhat fuzzy. HPF is largely based on data parallelism.
Other data-parallel languages of note:
C*
DataParallel C (derived from C*)
NESL
Fortran D, Fortran 90D, Vienna Fortran
Data-parallel languages provide an abstract, machine-independent model of parallelism:
Fine-grain parallel operations, such as element-wise operations on arrays
Shared data in large, global arrays with mapping "hints"
Implicit synchronization between operations
Partially explicit communication from the operation definitions
Advantages:
Global operations are conceptually simple
Easy to program (particularly for certain scientific applications)
Disadvantages:
Unproven compilers

Data-Parallel Languages, cont.
Data parallelism is (an attempt at) a high-level language, and high-level languages have inherent advantages and disadvantages. It is easier to concentrate on the high-level ideas: you can write the program faster, and you are more likely to get it right. The underlying system takes over the low-level details: you cannot micro-manage the execution, and you cannot always tell what will happen.
Abstractions like data parallelism split the work between the programmer and the compiler.
Programmer's task: solve the problem in this model
Concentrate on high-level structure and concepts
Aggregate operations on large data structures
Data in global arrays with mapping information
Compiler's task: map the conceptual (massive) parallelism to the physical (finite) machine
Fill in the grungy details
Guided by user "hints" like mapping information
Optimize computation and communication

Message-Passing Systems
For comparison to data-parallel languages:
The program is based on relatively coarse-grain tasks
Separate address space and a processor number for each task
Data shared by explicit messages
Point-to-point and collective communication patterns
Examples: MPI, PVM, Occam
Advantages: close to the hardware
Disadvantages: many low-level details

How HPF Is Implemented
Analysis:
Traditional dataflow and dependence analysis
Data mapping analysis
Computation partitioning:
Use data mapping information to create locality
Transform code to enhance locality
Communication introduction:
Communicate (move data) if the data mapping and computation partitioning don't agree
Lots of optimization to minimize and package communication
Code generation:
The really ugly details
Lots of optimization possible here, too

What Compilation Means for Programmers
Help the analysis with assertions:
ALIGN and DISTRIBUTE
INDEPENDENT
Distribute array dimensions that exhibit parallelism:
Conflicts require complex compilers, REDISTRIBUTE, or new algorithms
Consider communication patterns:
BLOCK is generally good for local stencils and fully-filled arrays
CYCLIC and CYCLIC(K) are generally good for load balancing and triangular loops
Don't hide what you are doing.
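A small sketch of the last two points (the routine name, array, and computation are invented for illustration): the triangular loop below does progressively less work per row, so a CYCLIC distribution spreads the load far more evenly than BLOCK would, and the INDEPENDENT directive asserts something the compiler may not be able to prove on its own, namely that the iterations do not interfere.

      SUBROUTINE scale_upper(a, n)
      INTEGER :: n
      INTEGER :: i, j
      REAL :: a(n,n)
      ! CYCLIC deals the rows out round-robin, so every processor
      ! gets a mix of long (early) and short (late) rows
!HPF$ DISTRIBUTE a(CYCLIC,*)

      ! Assert that iterations of I do not interfere; NEW(J) gives
      ! each iteration its own copy of the inner index
!HPF$ INDEPENDENT, NEW(j)
      DO i = 1, n
         DO j = i, n
            a(i,j) = a(i,j) / REAL(i)
         END DO
      END DO
      END SUBROUTINE scale_upper

With a BLOCK distribution of the rows, the processor holding the first rows would do most of the arithmetic; CYCLIC gives each processor roughly the same total, which is exactly the load-balancing point made above. And the directives are in plain sight, which is one way of not hiding what you are doing.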