"High Performance Fortran in Practice" tutorial Presented at Supercomputing '95, San Diego, December 4, 1995 Presented at University of Tennessee (short form), Knoxville, March 21, 1996 Revised and presented at High Performance Computing and Networking Europe, April 18, 1996 Presented at Metacenter Regional Alliances, Cornell, May 6, 1996 Presented at Summer of HPF Workshop, Vienna, July 1, 1996 Revised, expanded, and presented at Institute for Mathematics & its Applications, Minneapolis, September 11-13, 1994 Presented at Corps of Engineers Waterways Experiments Station, Vicksburg, MS, October 30-November 1, 1996 Presented at Supercomputing '96, Pittsburgh, PA, November 17, 1996 Presented at NAVO, Stennis Space Center, MS, Feb 13, 1997 Presented at HPF Users Group (short version), Santa Fe, NM, February 23, 1997 Presented at ASC, Wright-Patterson Air Force Base, OH, March 5, 1997 Parts presented at SC'97, November 17, 1997 Parts presented (slideshow mode) at SCı97, November 15-21, 1997 Presented at DOD HPC Users Group, June 1, 1998 HPF 2.0 preview Outline 1. Introduction to Data-Parallelism 2. Fortran 90/95 Features 3. HPF Parallel Features 4. HPF Data Mapping Features 5. Parallel Programming in HPF 6. HPF Version 2.0 HPF 2 Background New HPFF meetings were held to develop extensions to HPF Meetings began January 1995 Preliminary presentations at Supercomputing ı95 Complete draft at Supercomputing ı96 Finalized draft January 31, 1997 Areas for extensions Control parallelism (hpff-task@cs.rice.edu) Data distribution (hpff-distribute@cs.rice.edu) External interfaces (hpff-external@cs.rice.edu) Input is still welcome! Send mail to majordomo@cs.rice.edu to get on a list Send mail to hpff-interpret@cs.rice.edu to comment HPF 1.x Features HPF 2.0 Features (Language Laywer View) HPF 2.0 Features (Technical View) HPF 2.0 Deletions and Simplifications DYNAMIC, REALIGN, and REDISTRIBUTE Now ³just² approved extensions Removed due to complexity of implementation Mapping in the presence of sequence association Now forbidden in all cases Removed due to complexity, and lack of user demand Subroutine interfaces Explicit interface required if mapping changes, or INHERIT used Descriptive (³star²) syntax retained, primarily for error checking Changed to match spirit of Fortran 95, and simplify user model Methods of Avoiding DYNAMIC Distributions REDISTRIBUTE/REALIGN had two main uses: Changing access patterns dynamically Picking static distributions based on data size Changing access patterns Declare multiple arrays and copy data between them Picking a distribution based on input Clone sections of the program Sequence Association for Dummies Donıt Use It! Methods of Avoiding Sequence Association Just say "No!" 
Sequence Association for Dummies
- Don't Use It!

Methods of Avoiding Sequence Association
- Just say "No!"
- EQUIVALENCE was never a good idea for new HPF codes
  - Included for ease of porting F77 codes, with debatable benefits
- For new codes:
  - Always declare arrays to be their natural rank
  - Use ALLOCATABLE to make arrays their natural size
  - Use MODULE for global arrays, or pass them as explicit arguments
- For porting codes:
  - Top-down conversion of subroutines
  - If a subroutine really needs EQUIVALENCE, it may be better as an EXTRINSIC

Methods of Rewriting Subroutine Interfaces
- Always use an explicit interface
  - Create a MODULE with INTERFACE blocks
  - Create an INCLUDE file with INTERFACE blocks

New Data Mapping Features
- Extended distribution patterns
!HPF$ SHADOW x( 1, 0 )
!HPF$ DISTRIBUTE y( GEN_BLOCK( (/ 12,10,10,12 /) ) )
!HPF$ DISTRIBUTE z( INDIRECT(map_array) )
- Distribution to processor subsets
!HPF$ PROCESSORS procs(1:np)
!HPF$ DISTRIBUTE b(BLOCK) ONTO procs(1:np/2-1)
- Distribution of derived type components
      TYPE set_of_meshes
        REAL p(100,100), q(100,100), r(100,100)
!HPF$   DISTRIBUTE (BLOCK,*) :: p, q, r
      END TYPE
      TYPE(set_of_meshes) multi_block(32)
      !!! Do not try to DISTRIBUTE array multi_block !!!
- Rules for matching distributions (pointers, dummy arguments)

Examples of Extended Distributions
!HPF$ DISTRIBUTE x(BLOCK(SHADOW=1),*)
!HPF$ DISTRIBUTE y(GEN_BLOCK((/4,2,2,4/)),*)
!HPF$ DISTRIBUTE z(*,INDIRECT(map))

Example of Distribution to Subsets
!HPF$ PROCESSORS p(6)
!HPF$ DISTRIBUTE a(*,BLOCK) ONTO p
!HPF$ DISTRIBUTE b(*,BLOCK) ONTO p(1:3)
!HPF$ DISTRIBUTE c(*,BLOCK) ONTO p(4:6)

Rules for Mapping Pointers
- In the core language of HPF 2
  - Pointers cannot be mapped
  - Targets of pointers cannot be mapped (i.e. variables with the TARGET attribute)
- In the HPF 2 approved extensions
  - Pointers can be mapped
    - ALIGN and DISTRIBUTE take effect after ALLOCATE
    - INHERIT can be used to declare a pointer to "anything"
  - Targets can be mapped
  - In a pointer assignment, the target's mapping must be a specialization of the pointer's
    - That is, the pointer has to be higher in the diagram than the thing it points at
    - If the target is an array section, the pointer must have the INHERIT attribute

Implementation of HPF 2 DISTRIBUTE Patterns
- Conceptually, the process remains the same
  - Allocate memory, adjust indexing and loops, handle nonlocal data
  - New patterns require more elaborate methods to achieve this
- SHADOW
  - Add extra space to the allocation
  - Use that space for buffering nonlocal data and adjusting indices
  - Ignore that space when adjusting loop bounds
  - Simplifies addressing, may avoid copying
- GEN_BLOCK
  - Keep a table of block bounds on each processor (see the sketch below)
  - Search the table to find the home of nonlocal elements; adjust indices and loops
  - Allows some load balancing with locality
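To make the GEN_BLOCK bookkeeping concrete, here is a minimal hand-written sketch of the block-bounds table search. It is an illustration of the idea, not actual compiler output; the names owner_of, block_hi, and np are invented for this example.

! Sketch of the block-bounds search described above for GEN_BLOCK.
! block_hi(p) holds the largest global index owned by processor p;
! for GEN_BLOCK((/4,2,2,4/)) the table would be (/ 4, 6, 8, 12 /).
      INTEGER FUNCTION owner_of(i, block_hi, np)
        INTEGER, INTENT(IN) :: i, np, block_hi(np)
        INTEGER :: p
        DO p = 1, np
          IF (i <= block_hi(p)) THEN
            owner_of = p              ! first block whose upper bound covers i
            RETURN
          END IF
        END DO
        owner_of = np                 ! fall back to the last block
      END FUNCTION owner_of

A real runtime would more likely use a binary search or cache the result, but the table itself is all the extra state GEN_BLOCK needs beyond ordinary BLOCK.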
Implementation of HPF 2 DISTRIBUTE Patterns (cont.)
- INDIRECT
  - Keep a copy of the map array distributed by BLOCK
  - Keep a list of all elements on the local processor for adjusting loop bounds
  - Inspector/executor strategy for locating nonlocal elements
    - Inspector: gather all information needed from the map array
    - Executor: use this information to perform the computation
    - Only do the inspector once, if possible (i.e. when distributions and array access patterns stay exactly the same)
  - Allows arbitrary load balancing and communication reduction through partitioning
  - But finding the right partition is NP-complete

New Parallel Execution Features
- Loop reductions
!HPF$ INDEPENDENT, NEW(xinc), REDUCTION(x)
      DO i = 1, n
        CALL sub(i, xinc)
        x = x + xinc
      END DO
- Computation placement
!HPF$ INDEPENDENT
      DO i = 1, n
!HPF$   ON HOME( ix(i) )
        x(i) = y(ix(i)) - y(iy(i))
      END DO
- Task parallelism
!HPF$ TASK_REGION
!HPF$ ON HOME(p(1:8))
      CALL foo(x,y)
!HPF$ ON HOME(p(9:16))
      CALL bar(z)
!HPF$ END TASK_REGION

Example of Loop Reductions
!HPF$ INDEPENDENT, NEW(xinc), REDUCTION(x)
      DO i = 1, n
        CALL sub(i, xinc)
        x = x + xinc
      END DO

Example of Computation Placement
!HPF$ INDEPENDENT
      DO i = 1, 12
!HPF$   ON HOME( ix(i) )
        x(i) = y(ix(i)) - y(iy(i))
      END DO

Example of Locality Assertion
!HPF$ ALIGN (i) WITH x(i) :: ix, y, iy
!HPF$ DISTRIBUTE x(BLOCK)
!HPF$ INDEPENDENT
      DO i = 1, n
!HPF$   ON HOME( ix(i) ), RESIDENT(y(iy(i)))
        x(i) = y(ix(i)) - y(iy(i))
      END DO

Example of Task Parallelism
!HPF$ PROCESSORS p(8)
!HPF$ DISTRIBUTE a1(BLOCK,*) ONTO p(1:4)
!HPF$ DISTRIBUTE a2(*,BLOCK) ONTO p(5:8)
!HPF$ TEMPLATE, DIMENSION(4), DISTRIBUTE(BLOCK) ONTO p(1:4) :: td1
!HPF$ ALIGN WITH td1(*) :: done1
!HPF$ TASK_REGION
      done1 = .false.
      DO WHILE (.true.)
!HPF$   ON HOME(p(1:4)) BEGIN, RESIDENT
        READ (unit=iu, end=100) a1
        CALL rowffts(a1)
        GOTO 101
100     done1 = .true.
101     CONTINUE
!HPF$   END ON
        IF (done1) EXIT
        a2 = a1
!HPF$   ON HOME(p(5:8)) BEGIN, RESIDENT
        CALL colffts(a2)
        WRITE (unit=ou) a2
!HPF$   END ON
      END DO
!HPF$ END TASK_REGION

GRADE_UP versus SORT_UP
- Sometimes you want to keep things together (GRADE_UP)
- Sometimes you don't (SORT_UP)

Implementation of HPF 2 Parallel Features
- All new features represent useful information to the compiler
  - They do not change the meaning of the program if properly used
- REDUCTION
  - Choose an efficient order of evaluation for the reduction tree
  - For scalar reductions, keep a local sum on each node and have a global combining phase at the end (see the sketch below)
  - Alternate implementation: critical region
  - For vector reductions, use the same algorithm as XXX_SCATTER
- ON HOME
  - Base the loop partitioning on the HOME expression
  - Invert the subscripting function to derive loop bounds
  - Does not affect where communication/synchronization can be placed, but may change what must be communicated
  - Warning: you can outsmart the compiler this way, to your detriment

Implementation of HPF 2 Parallel Features (cont.)
- RESIDENT
  - Do not generate communication for the RESIDENT expressions (and look for logically equivalent expressions as well)
  - If no expression is given, then no communication for any variable is needed
  - Allows INDEPENDENT with CALL
- TASK_REGION
  - Shared memory: use barrier synchronization for tasks entering each ON block; also synchronize on shared data access outside ON blocks
  - Distributed memory: processor groups communicate only internally within an ON block (use ordinary communication) and communicate with other tasks at ON boundaries (use group communication)
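As a concrete illustration of the scalar REDUCTION strategy above (a local partial sum on each node plus one global combining phase), here is a hand-written sketch in ordinary message-passing Fortran. It is not HPF compiler output; the subroutine name reduced_loop, the bounds my_lo and my_hi, and the use of MPI for the combining step are assumptions made only for this example.

! Sketch only: one possible node program for the REDUCTION(x) loop above.
      SUBROUTINE reduced_loop(my_lo, my_hi, x)
        USE MPI
        INTEGER, INTENT(IN) :: my_lo, my_hi   ! this node's share of i = 1, n
        REAL, INTENT(INOUT) :: x
        REAL :: xinc, local_sum, global_sum
        INTEGER :: i, ierr
        local_sum = 0.0
        DO i = my_lo, my_hi                   ! purely local iterations
          CALL sub(i, xinc)                   ! same sub as in the example loop
          local_sum = local_sum + xinc
        END DO
        ! one global combining phase instead of n serialized updates of x
        CALL MPI_ALLREDUCE(local_sum, global_sum, 1, MPI_REAL, MPI_SUM, MPI_COMM_WORLD, ierr)
        x = x + global_sum
      END SUBROUTINE reduced_loop

The alternate critical-region implementation mentioned above serializes the updates; the combining phase is usually preferred because its cost grows only logarithmically with the number of processors.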
External Interfaces
- Calling HPF from other languages
  - How to ensure HPF is called consistently?
- Calling other languages from HPF
  - How to avoid unnecessary synchronization?
- Parallel input/output
  - What do you mean by "parallel I/O"?
  - Asynchronous I/O
- Tools
  - Can we define a symbol table format for debuggers, etc.?

Example of Asynchronous I/O
      i0 = 0; i1 = 1
      READ (file0, ID=id0, END=100) a(i0,1:1048576)
      DO
        WAIT (ID=id0, END=100)
        ! start next read into other row
        itmp = i0; i0 = i1; i1 = itmp
        READ (file0, ID=id0, END=100) a(i0,1:1048576)
        ! overlap I/O with useful work
        CALL PROCESSING( a(i1,1:1048576) )
      END DO
100   CONTINUE

Conclusions

For More Information
- These are listed in order of usefulness
  - www.crpc.rice.edu is physically at Rice University, Houston, TX
  - www.vcpc.univie.ac.at is physically at the Vienna Center for Parallel Computing, Vienna, Austria
- World Wide Web
  - http://www.crpc.rice.edu/HPFF/home.html
  - http://www.vcpc.univie.ac.at/HPFF/home.html
- These slides
  - http://www.cs.rice.edu/~chk/hpf-tutorial.html
  - http://renoir.csc.ncsu.edu/MRA/HTML/Workshop2/Koelbel/
- Mailing lists: write to majordomo@cs.rice.edu
  - In the message body: subscribe
  - Lists: hpff, hpff-interpret, hpff-core
- Anonymous FTP: connect to titan.cs.rice.edu
  - Full draft in public/HPFF/draft
  - See public/HPFF/README for the latest file list