"High Performance Fortran in Practice" tutorial Presented at SIAM Parallel Processing San Francisco, Feb. 14, 1995 Presented at Supercomputer '95, Mannheim, Germany, June 27, 1995 Presented at Supercomputing '95, San Diego, December 4, 1995 Presented at University of Tennessee (short form), Knoxville, March 21, 1996 Presented at Metacenter Regional Alliances, Cornell, May 6, 1996 Presented at Summer of HPF Workshop, Vienna, July 1, 1996 Presented at Institute for Mathematics & its Applications, Minneapolis, September 11-13, 1996 Presented at Corps of Engineers Waterways Experiments Station, Vicksburg, MS, October 30-November 1, 1996 Presented at Supercomputing '96, Pittsburgh, PA, November 17, 1996 Presented at NAVO, Stennis Space Center, MS, Feb 13, 1997 Presented at HPF Users Group (short version), Santa Fe, NM, February 23, 1997 Presented at ASC, Wright-Patterson Air Force Base, OH, March 5, 1997 Parts presented at SC'97, November 17, 1997 Parts presented (slideshow mode) at SCı97, November 15-21, 1997 Presented at DOD HPC Users Group, June 1, 1998 IV. HPF Data Mapping Features Outline 1. Introduction to Data-Parallelism 2. Fortran 90/95 Features 3. HPF Parallel Features 4. HPF Data Mapping Features 5. Parallel Programming in HPF 6. HPF Version 2.0 Data Mapping Data mapping complements data parallelism by placing data for parallel access HPF uses a two-phase data mapping: ALIGN: Creates a relationship between objects DISTRIBUTE: Partitions an object between processors Vendors may define further levels of mapping Data Mapping, cont. Goals: Avoiding Contention: Data updated in parallel should be on different processors Locality of Reference: Data used together should be on the same processor These goals are sometimes in conflict To avoid all contention, put every data item on its own processor To maximize locality, put all data on one processor HPF gives you the tools to resolve these conflicts, but doesnıt solve them for you A. distribute directive The DISTRIBUTE Directive Syntax: !HPF$ DISTRIBUTE array(dist-format-list ) [ONTO procs] !HPF$ DISTRIBUTE(dist-format-list ) [ONTO procs]::array-list Semantics: Each dimension of array is divided according to the corresponding pattern in dist-format-list ONTO (if present) names the processor array to distribute on Options for dist-format (Let N be the number of elements.) (Let P be the number of processors.) BLOCK : Contiguous pieces of size N/P on each processor CYCLIC : Every Pth element to the same processor CYCLIC(K) : Every Pth block of size K to the same processor * : Dimension not distributed Examples of DISTRIBUTE REAL W(12,12),X(12,12),Y(12,12),Z(12,12) !HPF$ DISTRIBUTE W(BLOCK,*) !HPF$ DISTRIBUTE X(*,CYCLIC) !HPF$ DISTRIBUTE Y(BLOCK,BLOCK) !HPF$ DISTRIBUTE Z(CYCLIC(2),CYCLIC(3)) More Examples of DISTRIBUTE REAL a(12,12), b(12,12), REAL c(12,12), d(12,12) !HPF$ PROCESSORS p(6,1), q(2,3) !HPF$ PROCESSORS r(3,2), s(1,6) !HPF$ DISTRIBUTE a(BLOCK,BLOCK) ONTO p !HPF$ DISTRIBUTE b(BLOCK,BLOCK) ONTO q !HPF$ DISTRIBUTE c(BLOCK,BLOCK) ONTO r !HPF$ DISTRIBUTE d(BLOCK,BLOCK) ONTO s Why Use DISTRIBUTE? Compilation is based on the data distribution Computations will execute in parallel if They are conceptually parallel (e.g. array operations) The data is partitioned (e.g. 
Why Use DISTRIBUTE?
Compilation is based on the data distribution.
Computations will execute in parallel if
- They are conceptually parallel (e.g. array operations)
- The data is partitioned (e.g. by DISTRIBUTE)
Communication and synchronization are based on the data distribution:
- BLOCK reduces the surface-to-volume ratio
- CYCLIC (and CYCLIC(K)) improves load balance
- * keeps things on one processor

DISTRIBUTE and Communication
Communication (data movement) happens when two data items on different processors must be brought together.
Assume a(n) and a(m) are on different processors:
- a(n) = a(m): communicate one element
- x = a(n) + a(m): communicate one of {a(n), a(m)}
- a(n) = a(n): no communication
How communication is accomplished is a system problem. It depends on
- the data mapping (DISTRIBUTE and ALIGN)
- the data access (subscripts of arrays)
- the implementation (whose compiler?)

Choosing a Good Distribution Pattern
BLOCK retains locality for small shifts
- But large shifts (relative to the block size) are a problem
CYCLIC is a disaster for shift computations
- But there are some interesting special cases

Choosing a Good Distribution Pattern (cont.)
Broadcasts (copying one element) and reductions (combining all elements) favor neither BLOCK nor CYCLIC.
Irregular communications (sorting, scatters, and gathers) favor neither BLOCK nor CYCLIC.
- But some special cases are natural

Choosing a Good Distribution Pattern (cont.)
Copying between distributions creates a lot of communication.
- This includes copying between different block sizes
Interpolation (as in multigrid) works like a shift operator.

Distribution Patterns and Load Balance
CYCLIC always balances the number of elements per processor.
- This balances the computational load if all elements cost the same
If P (the number of processors) divides N (the number of elements), BLOCK balances elements per processor.
- There are pathological cases if P < N < P^2
If P*K divides N, CYCLIC(K) balances elements per processor.

Rules of Thumb for Communication
BLOCK is good for local (e.g. nearest-neighbor) communication
- Look for subscripts like a(i+1,j-1)
- Look for intrinsics like CSHIFT
- Warning: BLOCK depends on the array size
CYCLIC has non-obvious locality
- Look for subscripts like x(i+jmp), where jmp is a multiple of the number of processors
- For example, jmp may be a power of 2 on a hypercube machine
- CYCLIC always balances the memory load
CYCLIC(K) has some of the advantages of each
Strides are expensive on any distribution
- But some special cases are worth recognizing
Broadcasts are equally expensive on any distribution
Communication between different distributions is very expensive

Typical Uses of DISTRIBUTE
Nearest-neighbor relaxation
- BLOCK in the dimension(s) with the most parallelism
- CYCLIC(K) may help with load balancing
Dense linear algebra
- CYCLIC(K) in one or two dimensions
Indirection-array data structures
- "No silver bullet"
- BLOCK, with careful numbering of data elements?

Multigrid: A Complicated Case for DISTRIBUTE

Possible Data Structures for Multigrid

Possible DISTRIBUTE Patterns for Multigrid
Single shared array for all levels
- Smoothing: all CSHIFT operations
- Prolongation and restriction: all CSHIFT operations
- Conclusion: use some form of BLOCK
  - If start-up cost is high: 1-D BLOCK (in the longest dimension)
  - Otherwise: 2-D BLOCK may be better
List of arrays, one per level
- Smoothing: all CSHIFT operations
- Prolongation and restriction: copies between different-sized arrays
- Conclusion: use some form of BLOCK on each level (see the sketch below)
  - May want to change for coarser grid levels

Possible DISTRIBUTE Patterns for Multigrid (cont.)
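As an illustration of the "list of arrays, one per level" option, here is a minimal sketch (the level sizes and the injection-style restriction are illustrative assumptions, not the tutorial's multigrid code): each level is a separate allocatable array with its own (BLOCK,BLOCK) distribution, so smoothing stays mostly local, while restriction copies between differently sized, differently blocked arrays and therefore communicates.

PROGRAM multigrid_sketch
  ! Illustrative sketch: one BLOCK-distributed allocatable array per level.
  REAL, ALLOCATABLE :: fine(:,:), mid(:,:), coarse(:,:)
!HPF$ DISTRIBUTE (BLOCK,BLOCK) :: fine, mid, coarse
  ALLOCATE (fine(64,64), mid(32,32), coarse(16,16))
  fine = 1.0
  ! Smoothing on one level: nearest-neighbor shifts, mostly local under BLOCK
  fine(2:63,2:63) = 0.25 * (fine(1:62,2:63) + fine(3:64,2:63) &
                          + fine(2:63,1:62) + fine(2:63,3:64))
  ! Restriction by injection: copies between different-sized, differently
  ! blocked arrays, so the compiler must generate communication here
  mid    = fine(1:64:2, 1:64:2)
  coarse = mid(1:32:2, 1:32:2)
  PRINT *, 'coarse sum =', SUM(coarse)
END PROGRAM multigrid_sketch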
Implementation of DISTRIBUTE
Allocate enough memory on each processor for its section of each distributed array
- Distributed memory: one malloc per processor, or shrink the declared bounds
- Shared memory: one shared area, with its usage divided among processors
Adjust indexing
- Distributed memory: translate between global indices and local numbering
- Shared memory: permute elements, keeping each processor's elements together
Adjust loops (including implicit loops)
- Pick a reference in the loop (e.g. A(J+1))
- Each processor executes the iterations for which that reference is local (e.g. lb-1:ub-1)
Handle nonlocal data
- Distributed memory: allocate buffer space, SEND/RECV the needed data
- Shared memory: access the data directly, or allocate and make local copies

B. The ALIGN Directive

The ALIGN Directive
Syntax:
!HPF$ ALIGN array(source-list) WITH target(subscript-list)
!HPF$ ALIGN (source-list) WITH target(subscript-list) :: array-list
Semantics:
- Creates a relationship between array and target so that, for all values of the source-list variables, array(source-list) and target(subscript-list) are stored on the same processor
- Only target can be distributed explicitly

Options for subscript-list
- Linear function of one source-list variable
- Triplet notation
  - Must match a ":" in source-list
  - Element-wise matching, as in array assignment
- * (replication)

Examples of ALIGN
REAL U(6,6), V(6,6), W(6,6)
REAL X(6), Y(6), Z(3)
!HPF$ ALIGN V(I,J) WITH U(I,J)
!HPF$ ALIGN W(I,J) WITH U(J,I)
!HPF$ ALIGN X(K) WITH W(K,1)
!HPF$ ALIGN Y(K) WITH W(*,K)
!HPF$ ALIGN Z(K) WITH X(2*K-1)
!HPF$ DISTRIBUTE U(BLOCK,*)

More Examples of ALIGN
(Same declarations and alignments as above, but with)
!HPF$ DISTRIBUTE U(*,BLOCK)

Even More Examples of ALIGN
(Same declarations and alignments as above, but with)
!HPF$ DISTRIBUTE U(*,CYCLIC)

Son of Even More Examples of ALIGN
(Same declarations and alignments as above, but with)
!HPF$ DISTRIBUTE U(BLOCK,BLOCK)

ALIGN and Communication
Communication (data movement) happens when two data items on different processors must be brought together.
ALIGN relates array elements, ensuring they are mapped together:
!HPF$ ALIGN a(i) WITH b(i+1)
- No communication for a(10) = b(11)
- No information about a(10) = b(10)
You can also think of this as modifying DISTRIBUTE:
- Substitute the ALIGN subscripts before doing the communication analysis

Why Use ALIGN?
"Copy" distributions
- DISTRIBUTE one array as the "master" for data layout
- ALIGN other arrays to it
- Just modify one line to change all the distributions
- This will be common when porting codes! (A sketch of this idiom appears below.)
Off-by-one problems
- Sometimes boundaries get in the way of DISTRIBUTE
Differing array sizes
- Will the compiler handle this better?
  REAL a(16), b(8)
  !HPF$ ALIGN b(i) WITH a(2*i)
  !HPF$ DISTRIBUTE a(BLOCK)
- Or this?
  REAL a(16), b(8)
  !HPF$ DISTRIBUTE a(BLOCK), b(BLOCK)

Implementation of ALIGN
Substitute the ALIGN expression into array reference subscripts
Then treat it like DISTRIBUTE
- Except that the expressions get trickier
- For non-stride-1 accesses and nontrivial alignments, much trickier
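Here is a minimal sketch of the "copy distributions" idiom described above (the array names and sizes are illustrative assumptions): one master array carries the DISTRIBUTE, everything else is aligned to it, and changing the single DISTRIBUTE line remaps all three arrays.

PROGRAM align_sketch
  ! Illustrative sketch: u is the "master"; v and rhs follow it everywhere.
  REAL u(100,100), v(100,100), rhs(100,100)
!HPF$ DISTRIBUTE u(BLOCK,BLOCK)   ! change only this line to remap all three
!HPF$ ALIGN v(i,j)   WITH u(i,j)
!HPF$ ALIGN rhs(i,j) WITH u(i,j)
  v = 1.0
  rhs = 2.0
  ! Aligned operands: no communication is needed for this whole-array update
  u = v + rhs
  PRINT *, 'u(50,50) =', u(50,50)
END PROGRAM align_sketch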
C. Dynamic Mapping

Dynamic Data Mapping
One data mapping is not always appropriate for an entire program:
- Different behavior in different phases ("First sweep in the X dimension, then sweep in the Y dimension")
- ALLOCATABLE arrays can change size
Therefore, HPF 1.x provides executable forms of DISTRIBUTE and ALIGN:
- Called REALIGN and REDISTRIBUTE
- DYNAMIC attribute (compare to the ALLOCATABLE attribute)

The DYNAMIC Directive
Syntax:
!HPF$ DYNAMIC entity-decl-list
!HPF$ DYNAMIC :: entity-decl-list
Semantics:
- Any array in entity-decl-list can be used in a REALIGN or REDISTRIBUTE directive
- DYNAMIC appears only in the declarations section
- DYNAMIC arrays can also appear in ALIGN or DISTRIBUTE (to get initial mappings)

The REALIGN Directive
Syntax:
!HPF$ REALIGN array(source-list) WITH target(subscript-list)
!HPF$ REALIGN (source-list) WITH target(subscript-list) :: array-list
Semantics:
- Creates a dynamic relationship between array and target so that, for all values of the source-list variables, array(source-list) and target(subscript-list) are stored on the same processor
- array must have the DYNAMIC attribute
- Ends any previous ALIGN or DISTRIBUTE for array
- Communicates the data already in array to its new home
- Lasts until another REALIGN or REDISTRIBUTE for the same array (i.e., not static scoping)

The REDISTRIBUTE Directive
Syntax:
!HPF$ REDISTRIBUTE array(dist-format-list) [ONTO processors]
!HPF$ REDISTRIBUTE (dist-format-list) [ONTO processors] :: array-list
Semantics:
- Dynamically divides each dimension of array according to the corresponding pattern in dist-format-list
- Also changes the mappings of anything already aligned to array (i.e., anything for which array was previously the target of an ALIGN or REALIGN)
- array must have the DYNAMIC attribute
- Communicates the data to its new processor home
- Lasts until another REALIGN or REDISTRIBUTE for the same array

Typical Use of Dynamic Data Mapping
Picking mappings based on input data:
!HPF$ DYNAMIC x
!HPF$ ALIGN (:,:) WITH x(:,:) :: y, z
IF (n1 > n2) THEN
!HPF$ REDISTRIBUTE x(BLOCK,*)
ELSE
!HPF$ REDISTRIBUTE x(*,BLOCK)
END IF
! Note: the compiler doesn't know the precise data
! mapping of any of these arrays at this point

More Typical Uses of Dynamic Data Mapping
FFT, ADI, and other directional sweeps (see the sketch below)
- (RE)DISTRIBUTE BLOCK in the parallel dimension for the first sweep
- Perform the sweep
- REDISTRIBUTE BLOCK in the next dimension
- Perform the sweep, ...
Cyclic reduction and other recursive-doubling methods
- (Only if the number of processors P is a power of 2)
- (RE)DISTRIBUTE BLOCK
- Reduce by step 1, then 2, ..., up to P/2
- REDISTRIBUTE CYCLIC
- Reduce by P, 2*P, ...
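As a minimal sketch of the directional-sweep pattern (the array size and the simple per-line update are illustrative stand-ins for real ADI or FFT work, not the tutorial's code), the array stays BLOCK in the dimension carrying the parallelism for each sweep, and a single REDISTRIBUTE moves the data between sweeps; the computation per sweep must outweigh the cost of remapping the whole array.

PROGRAM sweep_sketch
  ! Illustrative sketch of x-then-y sweeps with one remapping in between.
  REAL a(256,256)
!HPF$ DYNAMIC a
!HPF$ DISTRIBUTE a(*,BLOCK)    ! whole columns local; parallelism across columns
  a = 1.0
  ! Sweep along dimension 1: shifts stay within a column, hence local
  a(2:255,:) = 0.5 * (a(1:254,:) + a(3:256,:))
!HPF$ REDISTRIBUTE a(BLOCK,*)  ! remap so whole rows are local
  ! Sweep along dimension 2: shifts stay within a row, hence local
  a(:,2:255) = 0.5 * (a(:,1:254) + a(:,3:256))
  PRINT *, 'sum =', SUM(a)
END PROGRAM sweep_sketch

D. Subroutine Linkage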
Data Mapping in Subroutine Calls
Some subroutines require data to use a specific mapping, so actual arguments must be remapped.
Some subroutines can use any mapping, so actual arguments should be passed in place.
Sometimes the programmer knows the incoming data mappings, sometimes not.
HPF has options to say all of this!
Any remappings are undone on procedure return.

Mapping Options for Dummy Arguments
There are three classes of mappings for dummy arguments:
Prescriptive: "Wherever the actual is, remap it here"
- Syntax: DISTRIBUTE A(BLOCK)
- Note: the remapping is undone on return
Descriptive: "The actual should be here"
- Syntax: DISTRIBUTE A *(BLOCK)
- Note: if the actual is not there, a warning is suggested but not required
Transcriptive: "Leave the actual where it is"
- Syntax: DISTRIBUTE A *
- Or: INHERIT A
Sorting some of these cases out requires doing special things in the caller.
- When this is needed, an explicit interface is required
- Explicit interfaces are a good idea in other cases too
(A sketch of these styles appears at the end of this section.)

Subroutine Interfaces for Dummies
- Always use explicit interfaces
- Require good errors and warnings from your compiler

When Are Two Mappings the Same?
There are several cases where HPF requires mappings to be "compatible":
- Argument passing with implicit interfaces
- Argument passing with explicit descriptive mappings (sort of)
- Pointer association
"Compatible" is defined by a partial ordering:
- A mapping is "more specialized" (or "more generalized") than another if it has more (or less) information
- Some mappings can't be compared
You can avoid learning the detailed rules:
- Always ALIGN things that ought to have the same mapping
- Always use explicit interfaces

Specialized and Generalized Mappings
There is a hierarchy of directives, based on how precisely they describe a mapping.

Explicit Interfaces for Mapping Dummy Arguments
An explicit interface is required if
- Any dummy has a transcriptive mapping
  - Reason: the caller has to send a descriptor with the array's mapping
- Any dummy has the INHERIT attribute
  - Reason: the caller has to send a descriptor with the array's mapping
- For some argument, the actual's mapping is not a specialization of the dummy's
  - Translation: if the argument has to be remapped, you need an explicit interface
  - Reason: somebody has to move the data around

Rules of Thumb for Dummy Arguments
Libraries
- Prescriptive mapping ("remap the actual") if a specific mapping is needed
- Transcriptive mapping ("take anything") for embarrassingly parallel routines
- Descriptive mapping ("this is coming") for internal routines, probably after checking the distribution of the (transcriptively mapped) user arguments
Non-reusable modules
- Descriptive mapping is probably fastest
- Prescriptive mapping is probably safer

Other Mapping Features
PROCESSORS
- Defines the size and shape of a processor array
- Only guaranteed when the size of the processor array equals NUMBER_OF_PROCESSORS()
- !HPF$ PROCESSORS name(shape-spec-list)
TEMPLATE
- Defines an index domain for alignment and distribution, but allocates no memory
- !HPF$ TEMPLATE name(shape-spec-list)
ALLOCATABLE arrays and POINTER targets
- ALIGN and DISTRIBUTE take effect (and have their expressions evaluated) when the array is allocated
- If used as the target of a REALIGN, the array must be allocated when the REALIGN takes effect

Hints for Using Data Mapping
ALIGN data based on physical domains
- Arrays with the same domain should be aligned
- Arrays not physically connected should not be
DISTRIBUTE based on parallelism
- Optimal performance comes from parallel operations on a distributed axis
Pick the distribution pattern based on communication
- BLOCK is generally good for local stencils and fully-filled arrays
- CYCLIC and CYCLIC(K) are generally good for load balancing and triangular loops
- Conflicts require compromises, remapping, complex compilers, or new algorithms
REALIGN with care; REDISTRIBUTE with extreme care
- The computation performed must outweigh the communication required for the remapping
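Finally, here is a minimal sketch of dummy-argument mappings behind an explicit interface (the routine names and the trivial per-element work are illustrative assumptions, not from the tutorial): solve_block uses a prescriptive (BLOCK) mapping, so its actual is remapped on entry and restored on return, while scale_any uses a transcriptive INHERIT mapping and accepts whatever layout arrives.

PROGRAM linkage_sketch
  ! Illustrative sketch: prescriptive vs. transcriptive dummy mappings.
  INTERFACE
     SUBROUTINE solve_block(x)
       REAL x(:)
!HPF$ DISTRIBUTE x(BLOCK)   ! prescriptive: remap the actual to BLOCK
     END SUBROUTINE solve_block
     SUBROUTINE scale_any(x)
       REAL x(:)
!HPF$ INHERIT x             ! transcriptive: leave the actual where it is
     END SUBROUTINE scale_any
  END INTERFACE
  REAL a(1000)
!HPF$ DISTRIBUTE a(CYCLIC)
  a = 1.0
  CALL scale_any(a)    ! no remapping; works on the CYCLIC layout
  CALL solve_block(a)  ! remapped to BLOCK for the call, restored to CYCLIC after
  PRINT *, 'sum =', SUM(a)
END PROGRAM linkage_sketch

SUBROUTINE solve_block(x)
  REAL x(:)
!HPF$ DISTRIBUTE x(BLOCK)
  x = x + 1.0          ! stand-in for work that wants a BLOCK layout
END SUBROUTINE solve_block

SUBROUTINE scale_any(x)
  REAL x(:)
!HPF$ INHERIT x
  x = 2.0 * x          ! embarrassingly parallel: any layout will do
END SUBROUTINE scale_any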