"High Performance Fortran in Practice" tutorial
	Presented at SIAM Parallel Processing San Francisco, Feb. 14, 1995
	Presented at Supercomputer '95, Mannheim, Germany, June 27, 1995
	Presented at Supercomputing '95, San Diego, December 4, 1995
	Presented at University of Tennessee (short form), Knoxville, March 21, 1996
	Presented at Metacenter Regional Affiliates, Cornell, May 6, 1996
	Presented at Summer of HPF Workshop, Vienna, July 1, 1996
	Presented at Institute for Mathematics & its Applications, Minneapolis, September 11-13, 1996
	Presented at Corps of Engineers Waterways Experiments Station, Vicksburg, MS, October 30-November 1, 1996
	Presented at Supercomputing '96, Pittsburgh, PA, November 17, 1996
	Presented at NAVO, Stennis Space Center, MS, Feb 13, 1997
	Presented at HPF Users Group (short version), Santa Fe, NM, February 23, 1997
	Presented at ASC, Wright-Patterson Air Force Base, OH, March 5, 1997
	Parts presented at SC'97, November 17, 1997
	Parts presented (slideshow mode) at SC¹97, November 15-21, 1997
	Presented at DOD HPC Users Group, June 1, 1998
	III. HPF Parallel Features
	Outline
		1. Introduction to Data-Parallelism
		2. Fortran 90/95 Features
		3. HPF Parallel Features
		4. HPF Data Mapping Features
		5. Parallel Programming in HPF
		6. HPF Version 2.0
	Data-Parallel Statements
		Data parallelism emphasizes having many fine-grain operations, such as computations on every element of an array
		HPF has several ways to exploit data parallelism:
			Array expressions: Taken from Fortran 90
			FORALL: Tightly-coupled parallel execution based on the structure of an index space
				PURE: Procedures without side effects that may be called in FORALL
			INDEPENDENT: Assertion that iterations do not interfere with each other
			HPF library and intrinsics: Extended from Fortran 90
	B. FORALL statements, etc.
	The Single-Statement FORALL
		Syntax:
			FORALL ( index-spec-list [,mask-expr] ) forall-assignment
			index-spec is int-variable = triplet-spec
			forall-assignment is ordinary assignment or pointer assignment
		Semantics:
			Equivalent to array assignment in Fortran 90
			For each value of indices, check the mask
			Compute right-hand sides for unmasked values
			Make assignments to left-hand sides for unmasked values
			Multiple assignments to the same location are not standard-conforming (i.e. are undefined)
		Note: FORALL is not a general-purpose parallel loop!
	The Multi-Statement FORALL
		Syntax:
			FORALL ( index-spec-list [, mask] ) 
				forall-body-list
			END FORALL
			forall-body can be a forall-assignment, FORALL, or WHERE
		Semantics:
			Multi-statement FORALL is shorthand for a series of single-statement FORALLs
			Multi-statement FORALLs can be nested to produce more complex iteration spaces
			Each bottom-level assignment statement is completed before the next one starts
		Note: FORALL is not a general-purpose parallel loop!
	An Example of FORALL
		Initially,
			a = [0, 1, 2, 3, 4]
			b = [0, 10, 20, 30, 40]
			c = [-1, -1, -1, -1, -1]
		FORALL ( i = 2:4 )
			a(i) = a(i-1) + a(i+1)
			c(i) = b(i) * a(i+1)
		END FORALL
		Afterwards,
			a = [0, 2, 4, 6, 4]
			b = [0, 10, 20, 30, 40]
			c = [-1, 40, 120, 120, -1]
	An Example of DO
		Initially,
			a = [0, 1, 2, 3, 4]
			b = [10, 20, 30, 40, 50]
			c = [-1, -1, -1, -1, -1]
		DO i = 2, 4
			a(i) = a(i-1) + a(i+1)
			c(i) = b(i) * a(i+1)
		END DO
		Afterwards,
			a = [0, 2, 5, 9, 4]
			b = [10, 20, 30, 40, 50]
			c = [-1, 20, 60, 120, -1]
	An Example of Nested FORALLs
		FORALL ( i = 1:3 )
			a(i) = b(i)
			FORALL ( j = 1:i  )
				c(i,j) = d(i,j)
			END FORALL
		END FORALL
	Why Use FORALL?
		Assignments to array sections
			FORALL ( i = 1:4, j = 2:4 ) a(i,j) = a(i,j-1)
			FORALL ( i = 1:4 ) a(i,i) = a(i,i) * scale
			FORALL ( i = 1:4 )
				FORALL ( j=i:4 ) a(i,j) = a(i,j) / a(i,i)
			END FORALL
			FORALL ( i=1:4 )
				FORALL (j=ilo(i):ihi(i)) x(j) = x(j)*y(i)
			END FORALL
		Calculating based on a subscript
			FORALL (i=0:n,j=0:n) a(i,j) = SQRT(1.0*(i*i+j*j))/n
	Determinate Behavior of FORALL
		Consider the statement:
			FORALL ( i = 1:n ) a(ix(i)) = a(i)
		If ix has no repeated values (e.g. ix is a permutation), this is well-defined
			Note that a(i) is always the ³old² value, not the new one computed elsewhere in the FORALL
		If ix has repeated values (e.g. ix(i)=i/2), this is not defined by HPF
			The compiler may take any action it feels appropriateŠ
			Assigning one of the possible values is appropriate
			Reporting an error is appropriate
			Assigning a random number is appropriate
	Implementation of FORALL
		Use owner-computes rule to partition computation
			See implementation of DISTRIBUTE
		Semantics of the construct allow parallelism
			All rows of dependence diagram can execute in parallel
		Dependence analysis can further limit synchronization
			If RHS and LHS do not access the same element, no synchronization is needed
		Data need only be moved at synchronization points
			Before each RHS and before each LHS
			These may be combined with previous/following statement
	Implementation of FORALL (cont.)
		Data movement becomes much more difficult on distributed memory machines if the FORALL calls a function
			There is no race condition
			But the data may not be on the processor where it is needed
			And there is no synchronization point where it can be exchanged
		For this reason, many implementations serialize some cases of parallel constructs
			Usually there is a compiler override switch to parallelize at least some cases
			Best advice: Don¹t use global data in functions
	D. PURE functions
	PURE Functions
		Syntax:
			PURE [type-spec] FUNCTION function-name( [ dummy-arg-name-list ] )
		PURE functions have no side effects
		Syntactic constraints:
			Global variables and dummy arguments cannot be used in any context that may cause the variable to become defined
				Left hand side of assignment
				DO index, ASSIGN, ALLOCATE
				Actual argument with INTENT(OUT)
				Targets of pointer assignments (due to later use of pointers)
				Full list of restrictions is too long to fit on this slide!
			No external I/O or file operations
			Only inherited distribution/alignment of dummies and locals
		Intrinsic functions are PURE
	PURE Functions in Pictures
		COMMON X
		REAL Y, Z
		Z = FCN( Y )
	PURE Functions and FORALL
		PURE functions are the only ones that can be invoked from a FORALL
		Safe to do this because they have no side effects
		Useful to do this because there are things you cannot do (directly) in a FORALL
			Conditionals and iteration
				E.g., Do point-wise iteration this way
			Local variables
				E.g., Handle temporaries this way
	Why Use PURE Functions?
		Elemental functions
			Intrinsics
			Equations of state, etc.
			Note: A row can be an element!
		Pointwise iteration
			Mandelbrot sets
			Pointwise Newton iterations
	PURE for Mandelbrot Sets
		! The caller (Explicit interface not shown)
		FORALL ( i=1:n, j=1:m )
			k(i,j) = mandelbrot( CMPLX((i-1)*1.0/(n-1), &
				(j-1)*1.0/(m-1)), 1000 )
		END FORALL
		
		! The callee
		PURE INTEGER FUNCTION mandelbrot(x, itol)
		COMPLEX, INTENT(IN) :: x
		INTEGER, INTENT(IN) :: itol
		COMPLEX xtmp
		INTEGER k
			k = 0
			xtmp = -x
			DO WHILE (ABS(xtmp)<2.0 .AND. k<itol)
				xtmp = xtmp*xtmp - x
				k = k + 1
			END DO
			mandelbrot = k
		END FUNCTION mandelbrot
	Avoiding the PURE Function in Mandelbrot
		k0 = 0
		FORALL ( i=1:n, j=1:m )
			x(i,j) = CMPLX((i-1)*1.0/(n-1),(j-1)*1.0/(m-1))
			k(i,j) = 0
			xtmp(i,j) = -x(i,j)
			mask(i,j) = .TRUE.
		END FORALL
		DO WHILE (ANY(mask(1:n,1:m)) .AND. k0<1000)
			FORALL ( i=1:n, j=1:m, mask(i,j))
				xtmp(i,j) = xtmp(i,j)*xtmp(i,j)-x(i,j)
				k(i,j) = k(i,j) + 1
				mask(i,j) = ABS(xtmp(i,j))<2.0
			END FORALL
			k0 = k0 + 1
		END DO
	A. INDEPENDENT directives
	The INDEPENDENT Directive
		Syntax:
			!HPF$ INDEPENDENT [ , NEW( variable-list ) ]
		Semantics:
			INDEPENDENT is an assertion that no iteration affects any other iteration in any way
			NEW variables are treated as if they were allocated anew for each iteration (DO only)
			Applied to a DO: states that there are no loop carried dependences (except for NEW variables)
			Applied to a FORALL: states that no index point assigns to any location that another uses
			If the assertion is false, the program is not standard-conforming (i.e. results are not defined)
		Note: INDEPENDENT is not a general parallel loop!
	An Example of INDEPENDENT
		Initially,
			a = [0, 2, 4, 6, 1, 3, 5, 7]
			b = [6, 5, 4, 3, 2, 3, 4, 5]
			c = [-1,-1,-1,-1,-1,-1,-1,-1]
		!HPF$ INDEPENDENT
		DO j = 1, 3
			a(j) = a(b(j))
			c(a(j)) = a(j)*b(a(j))
		END DO
		Afterwards,
			a = [3, 1, 6, 6, 1, 3, 5, 7]
			b = [6, 5, 4, 3, 2, 3, 4, 5]
			c = [6,-1,12,-1,-1,18,-1,-1]
	Another Example of FORALL
	With INDEPENDENT
		Initially,
			a = [0, 2, 4, 6, 1, 3, 5, 7]
			b = [6, 5, 4, 3, 2, 3, 4, 5]
			c = [-1,-1,-1,-1,-1,-1,-1,-1]
		!HPF$ INDEPENDENT
		FORALL ( j = 1:3 )
			a(j) = a(b(j))
			c(a(j)) = a(j)*b(a(j))
		END FORALL
		Afterwards,
			a = [3, 1, 6, 6, 1, 3, 5, 7]
			b = [6, 5, 4, 3, 2, 3, 4, 5]
			c = [6,-1,12,-1,-1,18,-1,-1]
	An Example of
	Nested INDEPENDENT FORALLs
		!HPF$ INDEPENDENT
		FORALL ( i = 1:3 )
			a(i) = b(i)
			FORALL ( j = 1:i )
				c(i) = d(i)
			END FORALL
		END FORALL
	The INDEPENDENT Directive:
	More Details
		The Fundamental Rule:
			If one iteration writes to an object, others cannot read or write it
		Things that write to an object:
			Assignment, ASSIGN, DO index, Š
				To the object itself
				To an aggregate that contains it
				Through a pointer to it
			Input/Output statements
				Write the file pointer (except INQUIRE)
				READ assigns to its input list
		Things that read an object
			Uses in expressions (as you expect)
			Input/Output statements
				Read the file pointer (always)
	Examples of Correct
	INDEPENDENT Assertions
		Always true
			!HPF$ INDEPENDENT
			FORALL (i=2:n-1) a(i)=b(i-1)+b(i)+b(i+1)
			
			!HPF$ INDEPENDENT, NEW(j)
			DO k = 2, m-1, 2
				!HPF$ INDEPENDENT, NEW(vl,vr)
				DO j = 2, n-1, 2
					vr = x(j,k) - x(j-1,k)
					vl = x(j+1,k) - x(j,k)
					x(j,k) = x(j,k) + 0.5*(vr-vl)
				END DO
			END DO
		Some compilers will catch these on their own; some won't
	Examples of Incorrect
	INDEPENDENT Assertions
		INDEPENDENT does not handle reductions
			!HPF$ INDEPENDENT
			DO i = 1, n
				x = x + a(i)*a(i)
			END DO
		INDEPENDENT does not know about higher-level correctness
			DO WHILE (err > err_tol)
				!HPF$ INDEPENDENT
				DO i = 2, n-1
					b(i) = a(i)
					a(i) = 0.5 * (a(i-1) + a(i+1))
					b(i) = ABS(b(i) - a(i))
				END DO
				err = MAXVAL(b(2:n-1))
			END DO
	Example of Data-Dependent
	INDEPENDENT Assertion
		Sometimes true
			!HPF$ INDEPENDENT, NEW(j, n1)
			DO i = 1, nblue
				n1 = iblue(i)
				DO j = ibegin(n1), ibegin(n1+1)-1
					x(n1) = x(n1) + y(j)*x(ired(j))
				END DO
			END DO
	Why Use INDEPENDENT?
		Yet another way to do (some) array assignments
			!HPF$ INDEPENDENT
			DO i = 1, n
				a(i) = b(i)
			END DO
		Express application-dependent information
			! ³colors² don't interfere with each other
			!HPF$ INDEPENDENT, NEW(ix, i1,i2,f12)
			DO i = 1, ncolor
				DO ix = color_beg(i), color_end(i)
					i1 = icolor(ix,1)
					i2 = icolor(ix,2)
					f12 = w(i1)-w(i2)
					x(i1) = x(i1) + f12
					x(i2) = x(i2) - f12
				END DO
			END DO
	Implementation of INDEPENDENT
		Virtually identical to FORALL
		Use owner-computes rule to partition computation
			See implementation of DISTRIBUTE
		Semantics of the construct allow parallelism
			All iterations can execute in parallel
		Data need only be moved at synchronization points
			Before and after the loop
		No further dependence analysis needed
		Warnings about functions still apply
			Possibly worse, due to more pathological cases
	In Summary: 
	DO, FORALL and INDEPENDENT
	Hints for Using
	Data Parallel Statements
		Use FORALL to extend array operations
			More shapes
			User-defined elemental functions
			Better than array syntax?
		Beware of hidden overheads
			Intra-statement overheads
			Inter-statement synchronizations
			Complex pure functions
		Assert INDEPENDENT only when it is true
			This has implications for debuggers and programming environments!
	C. Library, intrinsic, and EXTRINSIC functions
	The HPF Library and New Intrinsics
		Extended intrinsics: MAXLOC, MINLOC
		One elemental intrinsic: ILEN
		System inquiry intrinsics: NUMBER_OF_PROCESSORS
		New reduction functions: IAND
		Combining-scatter functions: SUM_SCATTER
		Prefix reduction functions: SUM_PREFIX
		Sorting functions: GRADE_UP
		Bit manipulation functions: POPCNT
		Data distribution inquiry subroutines: HPF_DISTRIBUTION
	Examples of HPF Library
		Many implement data-parallel operations that are not elemental
	Why Use the HPF Library?
		If you need the functionsŠ
			Round out the set of reductions
				All built-in associative, commutative functions
			Partial reductions (prefix reductions) for all reductions
			Sorting
		Functions were chosen for the library because
			They are useful
			They are data-parallel
			They are hard to implement efficiently by hand
	Typical Uses of HPF Library
		Accumulations through indirection arrays
			x = SUM_SCATTER(  flux, x, nbr(1,1:n) )
			x = SUM_SCATTER( -flux, x, nbr(2,1:n) )
			! Equivalent to the following
			DO i = 1, n
				x(nbr(1,i)) = x(nbr(1,i)) + flux(i)
				x(nbr(2,i)) = x(nbr(2,i)) - flux(i)
			END DO
		Manipulating array-based sparse structures
			inum(1:n) = MAX( iend(1:n)-ibeg(1:n)+1, 0 )
			ibeg_new(1:n) = SUM_PREFIX(inum(1:n)) + 1
			iend_new(1:n) = ibeg_new(1:n)+inum(1:n)-1
			! Moving the data left as exercise for reader
	Implementation of the HPF 1 Library
		Functions were added to the library because
			They are useful
			They can be implemented in parallel on all (known) machines
			The optimal implementation is complex or machine-dependent
		Therefore, HPF puts the onus on the vendors to provide the best possible implementation
			Can be implemented by subroutine library or in-line expansion
			Should use the best known algorithms for reduction, prefix operations, etc.
	EXTRINSIC Procedures
		Syntax:
			EXTRINSIC ( extrinsic-kind-keyword  )
			Goes in FUNCTION or SUBROUTINE header (like PURE)
		An escape mechanism for allowing HPF to call other languages and paradigms
		On the caller (HPF) side:
			There must be an explicit interface with the EXTRINSIC directive
			Remapping occurs (if needed) to meet distribution specifications
			System synchronizes all processors before the call
			System calls ³local² routine on every processor
		On the callee (non-HPF) side:
			INTENT(IN) and INTENT(OUT) must be obeyed
			If variables are replicated, callee must make them consistent before return
			Processors can access their own section of distributed arrays
	EXTRINSIC(HPF_LOCAL)
		€ HPF_LOCAL is a SPMD language that meshes well with ³global² HPF
			­ It can do things HPF can¹t
			­ Global HPF can do things it can¹t
		€ On the caller (global HPF) side
			EXTRINSIC(HPF_LOCAL) REAL FUNCTION foo(x)
			REAL x(:)
			!HPF$ DISTRIBUTE x(BLOCK)
		€ On the callee (local HPF) side
			EXTRINSIC(HPF_LOCAL) REAL FUNCTION foo(x)
			REAL x(:)
		€ Inside foo, SIZE(x) checks the local section of x
		€ HPF_LOCAL_LIBRARY provides GLOBAL_SIZE(x) to check the global size of x
	Example of HPF_LOCAL
	EXTRINSIC(F77_LOCAL)
		€ HPF_LOCAL requires a Fortran 90 interface to the local subroutine with assumed-shape arrays
			- Many libraries do not use assumed-shape arrays
			- Thus was born the need for F77_LOCAL
		€ On the caller (HPF) side:
			EXTRINSIC(HPF_LOCAL) REAL FUNCTION foo(x)
			REAL x(:)
			!HPF$ DISTRIBUTE x(BLOCK)
		€ On the callee side:
			- Implicit interface (FORTRAN 77 doesn't have explicit interfaces)
	Example of F77_LOCAL 
	(From NAS FT Benchmark)
		!
		! 3D FFT subroutine used by the PESSL implementation.
		!
		      subroutine fft (n1, n2, n3, isign, scale, x, y)
		         use blacs
		         use types
		         implicit none
		!
		! Arguments
		!
		         integer, intent(in) :: n1, n2, n3, isign
		         real(R8), intent(in) :: scale
		!
		         complex(R8), dimension(:,:,:), intent(in) :: x
		!hpf$    template gridx(n1,n2,n3)
		!hpf$    distribute(*,*,block) :: gridx
		!hpf$    align(:,:,:) with *gridx :: x
		!
		         complex(R8), dimension(:,:,:), intent(out) :: y
		!hpf$    template gridy(n3,n2,n1)
		!hpf$    distribute(*,*,block) :: gridy
		!hpf$    align(:,:,:) with *gridy :: y
	Example of F77_LOCAL 
	(From NAS FT Benchmark) II
		!
		! Interfaces
		!
		         interface
		            extrinsic (f77_local) subroutine   &
		     &            pdcft3 (x, y, n1, n2, n3, isign, scale, ictxt, ip)
		               use types
		               implicit none
		               integer, intent(in) :: n1, n2, n3, isign, ictxt
		               real(R8), intent(in) :: scale
		               integer, dimension(:), intent(in) :: ip
		               complex(R8), dimension(:,:,:), intent(in) :: x
		!hpf$          template tempx(n1,n2,n3)
		!hpf$          distribute(*,*,block) :: tempx
		!hpf$          align(:,:,:) with *tempx :: x
		               complex(R8), dimension(:,:,:), intent(out) :: y
		!hpf$          template tempy(n3,n2,n1)
		!hpf$          distribute(*,*,block) :: tempy
		!hpf$          align(:,:,:) with *tempy :: y
		            end subroutine pdcft3
		         end interface
		!
		         ip = 0
		         call pdcft3 (x, y, n1, n2, n3, isign, scale, ictxt, ip)
		!
	Example of F77_LOCAL 
	(From NAS FT Benchmark) III
		!
		         call blacs_get (0, 0, val)
		         ictxt = val(1)
		!
		! Calculate the USERMAP for the BLACS grid, and set up the grid.  The
		! processor ID assignment is done by the same algorithm used by PGHPF,
		! and may be different for other HPF implementions.
		!
		         nprow = size (usermap, 1)
		         npcol = size (usermap, 2)
		         do j = 1, npcol
		            do i = 1, nprow
		               usermap(i,j) = (j - 1) * nprow + (i - 1)
		            end do
		         end do
		         call blacs_gridmap (ictxt, usermap, nprow, nprow, npcol)
		!
		         ip = 0
		         call pdcft3 (x, y, n1, n2, n3, isign, scale, ictxt, ip)
		!
		         call blacs_gridexit (ictxt)
		!
	Why Use EXTRINSIC Procedures?
		Access to other programming languages and paradigms
			HPF_SERIAL: Sequential execution
				For example, interfacing with graphics libraries
			HPF_LOCAL: Single Program Multiple Data (SPMD) paradigm
			HPF_CRAFT: Cray native programming language (HPF 2.0)
		Expressiveness
			Some things are just hard to write in HPF
			Some things are impossible to do in HPF
				Nondeterminism
		Efficiency
			Because of interface requirements, this works best for superlinear operations (e.g. ScaLAPACK)