CPS615 HPF and Fortran90 Discussion -- Sept 17 96

CPS615 -- Base Course for the Simulation Track of Computational Science
Fall Semester 1996 -- Introduction to High Performance Fortran and Fortran 90
Geoffrey Fox, Tom Haupt
NPAC, Room 3-131 CST, 111 College Place, Syracuse NY 13244-4100

Abstract of HPF and Fortran90 Technology Discussion
A brief discussion of Fortran90 and Fortran77, and why Fortran90 has advantages and disadvantages
Overview of key features of Fortran90
  See Metcalf and Reid, Fortran90 Explained, Oxford Scientific Publications
Overview of key features of HPF: parallel constructs, data mapping, examples
The future -- HPF2
  See Chuck Koelbel from Rice University at http://renoir.csc.ncsu.edu/MRA/HTML/Workshop2/Koelbel

HPF is an extension of Fortran 90
Its purpose is to express data parallelism and so hide machine dependent features of parallel programming from the user.
We use Fortran90 as the base language both because it is more advanced than Fortran 77, with object oriented features, and because its array syntax allows an elegant, explicitly parallel expression of some operations.

Use of Fortran90 is a Problem because
It is a complex language for which it is difficult to build compilers.
Not many people use Fortran90, and maybe they will all switch to Java before Fortran90 comes into common practice.
So perhaps it is "too little too late", and we should focus on supporting the past (Fortran77) and the "correct" future ((HP)Java), not an irrelevant middle solution .....

Why is Fortran90 Easier than Fortran77
      DO 1 I=1,N
    1 A(I)=B(I)
is obviously parallel. The Fortran90 array notation
      A=B
expresses this parallelism naturally, in a way a compiler can detect easily and deterministically.
      DO 1 I=1,N
      J=I
      K=I
    1 A(J)=B(K)
is not so obviously parallel. Deciding on the existence and implementation of parallelism here needs an algorithm that is difficult to define in the general case (especially if IF statements, i.e. conditionals, define J and K).
Use of Fortran77 has "thrown away" the natural parallelism at the language level; the run time restores it by creating explicit values for variables such as J and K, which are known at compile time only through (sometimes impossible) analysis.

Important Features of Fortran90
Arrays are very well supported, with memory allocation, a set of intrinsics, better passing to procedures etc. This is the key capability for HPF.
Derived types allow general object structure (without inheritance) in Fortran90.
Pointers -- this area was NOT well supported by HPF compilers as of Summer 1996.
Modules replace COMMON, INCLUDE etc.
Procedures (functions, subroutines) allow better interfaces, recursion, optional parameters etc.
Better syntax, with free form source, more loop control etc.

Introduction to Fortran90 Arrays - I
Arrays are "true" objects in Fortran90 and are stored as the values of the elements plus a data descriptor!
There are operations on full arrays which have natural parallel implementations, as seen in HPF.
There is a new set of array intrinsic (built-in) functions: elemental functions, reductions (which process an array into a single value), and transformational (reshaping) functions.
Note these functions are NOT sufficient for real problems; one needs FORALL to define new parallel array functions.
FORALL is in HPF but not Fortran90 -- it is expected in the next round of the standard (Fortran95):
      FORALL( I = 1:nx-1 )
         A(I) = (A(I+1)+A(I-1))/(I+1)
      END FORALL

Introduction to Fortran90 Arrays - II
One can extract sections (subarrays) of arrays as u(lb:ub:step), where
  lb is the lower bound
  ub is the upper bound
  step (defaulting to 1) is the stride
There are masked (conditional) array operations using the WHERE ... ELSEWHERE construct.
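To make the section and masking notation concrete, here is a minimal sketch (the arrays u and v are invented for this illustration):

      REAL u(100), v(100)
      ...
      WHERE ( v > 0.0 )
         u = SQRT(v)            ! executed elementwise where the mask is true
      ELSEWHERE
         u = 0.0                ! executed elementwise where the mask is false
      END WHERE
      u(1:99:2) = v(2:100:2)    ! section assignment: odd-indexed u gets even-indexed v

Both the WHERE construct and the section assignment are natural targets for parallel execution in HPF.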
One can still write Fortran77 element-by-element operations (DO loops), but of course these may not be interpretable for efficient parallelism by an HPF compiler.
Note Fortran90 was designed for science and engineering, originally with special concern for vector supercomputers -- yet Cray supports Fortran77 better than Fortran90(!)

Fortran90 Arrays and Memory Allocation
ALLOCATABLE arrays can be defined at runtime with variable sizing:
      REAL, ALLOCATABLE :: u(:,:) , f(:,:)
      ALLOCATE ( u(0:nx,0:ny) , f(1:27,0:ny) )
One can define POINTER and TARGET attributes, which can be used like REAL, DIMENSION etc. The => operator allows one to set a POINTER to "point to" a TARGET.
Arguments of a subroutine need NOT have their array dimensions declared in the subroutine, as these are passed by the calling program in the data descriptor.
Local arrays are created on the stack, and their bounds may be non-constant, evaluated at procedure entry.

More on Fortran90 Arrays and Subroutines
One passes "assumed-shape" arrays from caller to callee using INTERFACE syntax:
      INTERFACE
         SUBROUTINE residual (r,u,f)
            REAL r(:,:) , u(:,:) , f(:,:)
         END SUBROUTINE
      END INTERFACE
which is called by
      CALL residual (r,u,f)
or
      CALL residual ( r(0:nx:2, 0:ny:2) , u(0:nx:2, 0:ny:2) , f(0:nx:2, 0:ny:2) )
where the latter example just processes every other element of the arrays.

Typical Use of Array and Intrinsic Operations
      REAL u(0:nx,0:ny), A(100,100) , fact , avg
      u = fact * (u - avg)
scales and translates all elements of u.
      u = .25*( CSHIFT(u,1,1) + CSHIFT(u,-1,1) + CSHIFT(u,1,2) + CSHIFT(u,-1,2) )
calculates the average of the 4 array elements surrounding each point. Note the third argument of CSHIFT is the label for the axis (1=x, 2=y).
      SQRT( A(1:100,1) )
calculates a new array containing 100 square roots.
      SUM(A)
is a reduction operator, summing all elements of array A into a scalar.
      SIZE(A,1)
is an array query intrinsic giving the size of A in the first dimension; it is particularly useful for "assumed-shape" arrays passed into subroutines.

Derived Type in Fortran90
      TYPE PERSON
         CHARACTER(LEN=10) NAME
         REAL AGE
         INTEGER ID
      END TYPE PERSON
      TYPE(PERSON) YOU, ME
The identification number of YOU would be accessed as YOU%ID, an ordinary integer.
One can define global operators, so that YOU+ME could be defined.
One can use the name of a derived type as a constructor:
      YOU = PERSON ('Pamela Fox', 12.0, 3)

Examples of POINTERs in Fortran90
One can define a linked list as:
      TYPE ENTRY
         REAL VALUE
         INTEGER INDEX
         TYPE(ENTRY), POINTER :: NEXT
      END TYPE ENTRY
ALLOCATE creates elements of a linked list dynamically:
      ALLOCATE (CURRENT)
      CURRENT = ENTRY( NEW_VALUE, NEW_INDEX, FIRST )
      FIRST => CURRENT
adds a new entry at the start of the linked list and renames it with the POINTER FIRST.

MODULEs in Fortran90
The general syntax is:
      MODULE name
         ! specification part
      CONTAINS              ! this part is optional
         ! module subprograms
      END MODULE name
For example
      MODULE IllustratingCommonBlock
         INTEGER, DIMENSION(52) :: CARDS
      END MODULE IllustratingCommonBlock
replaces the COMMON construct and can be used as
      USE IllustratingCommonBlock

MODULEs, INTERFACEs and Overloaded Operators in Fortran90
      MODULE INTERVAL_ARITHMETIC
         TYPE INTERVAL
            REAL LOWER, UPPER
         END TYPE INTERVAL
         INTERFACE OPERATOR(+)           ! define overloaded + operator
            MODULE PROCEDURE ADD_INTERVALS
         END INTERFACE
      CONTAINS
         FUNCTION ADD_INTERVALS(A,B)
            TYPE(INTERVAL) ADD_INTERVALS
            TYPE(INTERVAL), INTENT(IN) :: A, B
            ADD_INTERVALS%LOWER = A%LOWER + B%LOWER
            ADD_INTERVALS%UPPER = A%UPPER + B%UPPER
         END FUNCTION ADD_INTERVALS
      END MODULE INTERVAL_ARITHMETIC
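A minimal usage sketch of this module (the program name and numeric values are invented for illustration):

      PROGRAM TEST_INTERVALS
         USE INTERVAL_ARITHMETIC
         TYPE(INTERVAL) :: X, Y, Z
         X = INTERVAL(1.0, 2.0)
         Y = INTERVAL(0.5, 1.5)
         Z = X + Y                 ! resolves to ADD_INTERVALS: Z = INTERVAL(1.5, 3.5)
         PRINT *, Z%LOWER, Z%UPPER
      END PROGRAM TEST_INTERVALS

The + applied to intervals is dispatched through the INTERFACE OPERATOR(+) block to ADD_INTERVALS.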
Outline of HPF Discussion
What HPF is, what we need it for, and where it came from
How HPF gets its parallelism
Why it is called "High Performance"
What the HPF compiler directives are
Data mapping in HPF: TEMPLATE, PROCESSORS, DISTRIBUTE, ALIGN
Parallel statements and constructs in HPF: array statements, WHERE/ELSEWHERE, intrinsics, FORALL, PURE, INDEPENDENT
Latest discussions -- HPF-2: ON HOME, TASKING, dynamic data mapping, reductions in INDEPENDENT DO loops

Information on HPF and HPF Forum (HPFF)
Rice has taken the lead in the HPF Forum, which is a much faster mechanism for getting agreement than the formal ten-year standards process that Fortran90 suffered.
World Wide Web pages at Rice and Vienna:
  http://www.crpc.rice.edu/HPFF/home.html
  http://www.vcpc.univie.ac.at/HPFF/home.html
Mailing list: majordomo@cs.rice.edu -- choose the list (hpff, hpff-interpret, hpff-core) you wish to subscribe to.
Anonymous FTP to titan.cs.rice.edu; see public/HPFF/README for the latest list of files.

Possible Programming Models
Explicit message passing, as in PVM or MPI: the user breaks the program into parts, and the parts send messages between them to implement the communication necessary for synchronization and integration of the parts into the solution of a single program. This matches the hardware, but it is not particularly natural for the problem and can be machine dependent.
Object oriented programming is like message passing, but now objects and not programs communicate. It is very good when the objects are natural to the problem and represent functional parallelism. However, in data parallel problems tackled with an object oriented approach, one must break the problem up into a number of objects that depends on the number of processors, and so reflects the machine and not the problem.

Data Parallel Programming Model
Data parallelism is higher level than either the message passing or object models (if objects are used to break up data to respect the computer).
It provides a shared memory programming model which can be executed on SIMD or MIMD computers, distributed or shared memory computers.
Note it specifies the problem, not the machine structure.
It in principle provides the most attractive machine independent model for programmers, as it reflects the problem and not the computer.
Its disadvantage is that it is hard to build compilers for, especially for the most interesting new algorithms, which are dynamic and irregular!

Parallelism in HPF
Parallelism in HPF is expressed explicitly:
  Fortran 90 array expressions and assignments (including WHERE)
  HPF library and array intrinsics
  FORALL statement and construct
  PURE, which labels procedures that can be used in FORALL as they have no "side effects"
  INDEPENDENT assertion on DO loops
The compiler may choose not to exploit information about parallelism.
The compiler may also detect parallelism in sequential code (Fortran77 is part of Fortran90). For example,
      A = B
or more interestingly
      WHERE( B > 0. )
         A = B
      ELSEWHERE
         A = 0.
      END WHERE
can be written
      DO I = n1, n2
         DO J = m1, m2
            IF( B(I,J) > 0. ) THEN
               A(I,J) = B(I,J)
            ELSE
               A(I,J) = 0.
            END IF
         END DO
      END DO
A good HPF compiler will recognize that the DO loops can be parallelized and give the same answer for the Fortran90 and Fortran77 forms, but often the detection of parallelism is not clear.
Note FORALL is guaranteed to be parallelizable, as by definition it has no side effects.
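For comparison, the same conditional assignment can also be written with masked FORALL statements -- a sketch; note that two complementary masks are needed, since FORALL has no ELSEWHERE:

      FORALL ( I = n1:n2, J = m1:m2, B(I,J) > 0. )  A(I,J) = B(I,J)
      FORALL ( I = n1:n2, J = m1:m2, B(I,J) <= 0. ) A(I,J) = 0.

Unlike the DO-loop form, these statements are parallel by construction and need no dependence analysis.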
HPF Features
All of Fortran90
New instructions FORALL and INDEPENDENT enhancing DO loops
Data alignment and distribution assertions
Miscellaneous support operations, but NO parallel input/output
Little support for irregular computations
Little support for any form of non-mainstream data parallelism
Extrinsics, supporting links with explicit message passing

What gives high performance in HPF
There is a tradeoff between parallelism and communication.
The programmer defines the data mapping, and the compiler uses this to assign processing.
The underlying assumptions are that:
  an operation on two or more data objects is likely to be carried out much faster if they all reside in the same processor,
  and that it may be possible to carry out many such operations concurrently if they can be performed on different processors.
This is embodied in the "owner computes" rule -- namely that in, for instance,
      A(i,j) = .....
one brings everything on the right hand side to the processor "owning" A(i,j) and performs the computation in that processor.
The owner computes algorithm is usually good and often best.

Compiler directives used in HPF
The directives are structured comments that suggest implementation strategies or assert facts about a program to the compiler.
They may affect the efficiency of the computation performed, but do not change the value computed by the program.
As with Fortran 90 statements, there are both declarative directives and executable directives.

What does an HPF Compiler do?
It must generate Fortran77(90) + message passing code, or possibly in one pass map HPF code onto parallel machine code.
Traditional dataflow and dependence analysis is especially critical in the Fortran77 parts of the code.
It must use the data mapping assertions to decide what is stored where, and so organize the computation; the code must be transformed to respect this owner-computes model.
It must typically use the "loosely synchronous" model with alternating communicate-compute phases, and the compiler then generates all the communication needed.
Due to latency issues, the compiler must minimize the communication needed and maximize the size of the packets sent.
We need an excellent run-time library, which the compiler invokes for parallel intrinsics etc.

Syntax of HPF Directives
HPF directives are consistent with Fortran 90 syntax except for the special prefix for directives:
  !HPF$ (the only version allowed with free format source)
  CHPF$
  *HPF$
Two forms of the directives are allowed:
  specification statements, such as
      !HPF$ DISTRIBUTE MYTEMPLATE(BLOCK) ONTO P
  the equivalent attributed form, such as
      !HPF$ DISTRIBUTE (BLOCK) ONTO P :: MYTEMPLATE

Data Mapping in HPF
Data mapping in HPF is all you need to do to get parallelism, as long as you use the explicit array syntax such as A=B+C.
The owner computes rule implies that specifying the location of variables specifies (optimally or not) the parallel execution!
The new HPF-2 ON HOME directive is an exception to this rule, as it specifies where a particular statement is to be executed.
(RE)DISTRIBUTE tells you where data is to be placed.
(RE)ALIGN tells you how different data structures are to be placed relative to each other.
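As a sketch of how owner computes turns a data mapping into a communication pattern (the arrays a and b are invented for this illustration):

      REAL a(1000), b(1000)
!HPF$ DISTRIBUTE a(BLOCK)
!HPF$ ALIGN b(:) WITH a(:)
      ...
      a(1:999) = b(2:1000)

The owner of a(i) computes a(i) = b(i+1); with the BLOCK mapping, b(i+1) is local except when i sits on a block boundary, so the compiler need only communicate one boundary element per processor -- and can batch such transfers into large messages.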
Staged Data Mapping in HPF
(Figure: data mapping is staged -- arrays are ALIGNed to a TEMPLATE, the template is DISTRIBUTEd onto abstract PROCESSORS, and the processor arrangement is mapped onto the physical machine.)

Template in HPF
A template is an abstract space of indexed positions (an "array of nothings").
In CMFortran terminology, a template is a set of virtual processors -- one per data point.
A template typically specifies precisely the full data parallelism that is natural for the problem; HPF maps this problem parallelism onto the particular machine.
A template is declared by the TEMPLATE directive, which specifies:
  the name of the template
  the rank (i.e., number of dimensions)
  the extent in each dimension
Examples:
      CHPF$ TEMPLATE T(1000)
      !HPF$ TEMPLATE FRED(N, 2*N)
      *HPF$ TEMPLATE, DIMENSION(5,100,50) :: MINE, YOURS

Abstract Processors in HPF
Abstract processors always form a rectilinear grid in 1 or more dimensions.
They are abstract coarse grain collections of data points.
Remember, efficiency says we must "block" communication into large chunks -- processors give us a general target for this.
A processor arrangement is defined by the PROCESSORS directive, which specifies:
  the name of the processor arrangement
  the rank (i.e., number of dimensions)
  the extent in each dimension
Examples:
      !HPF$ PROCESSORS P(N)
      *HPF$ PROCESSORS BIZARRO(1972:1997,-20:17)
      CHPF$ PROCESSORS SCALARPROC     ! scalar (by default sequential)

Example of Template and Processors
      REAL, DIMENSION(40) :: A, B, C
!HPF$ PROCESSORS P(4)
!HPF$ TEMPLATE X(40)
!HPF$ ALIGN WITH X :: A, B, C
!HPF$ DISTRIBUTE X(BLOCK)
      ...
      C = A + B
      ...

Align Directive in HPF
The syntax of ALIGN is
      !HPF$ ALIGN alignee WITH align-target
where -- note [..] implies an optional component --
  alignee is: alignee [(align-source-list)]
  align-target is: align-target [(align-subscript-list)]
Alternatively
      *HPF$ ALIGN (align-source-list) WITH align-target :: alignee

Examples of Align Directive
Note a colon (:) in a directive denotes all values of an array index.
Examples with explicit array indices:
      CHPF$ ALIGN A(i) WITH B(i)
      *HPF$ ALIGN (i,j) WITH TEMPL(i,j) :: A, B
Examples using the colon:
      !HPF$ ALIGN A(:) WITH B(:)
      CHPF$ ALIGN (:,:) WITH TEMPL(:,:) :: A, B

Changing Rank in Align Directive
The ranks of the alignee and the align-target may be different.
Examples:
      !HPF$ ALIGN A(:,j) WITH B(:)
      CHPF$ ALIGN A(:,*) WITH B(:)

Replication in Align Directive
      !HPF$ ALIGN A(:) WITH TEMPL(:,*)
replicates A, placing a copy at every position of the second template dimension; going the other way round,
      !HPF$ ALIGN A(:) WITH TEMPL(:,i)
only puts A on one part of the template.

General Alignments in HPF
HPF allows more general alignments, such as:
      REAL, DIMENSION(5,8) :: A, B
!HPF$ TEMPLATE T(12,12)
!HPF$ ALIGN A(:,J) WITH T(:,J+1)
!HPF$ ALIGN B(I,J) WITH T(I+4,J+4)
This is useful for simple numerical shifts as in the example, but not useful in the general case of arbitrary index expressions allowed by the ALIGN syntax.
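A classic use of replication is aligning vectors with a matrix for matrix-vector multiply; here is a sketch (the array names are invented):

      REAL A(100,100), x(100), y(100)
!HPF$ TEMPLATE T(100,100)
!HPF$ ALIGN A(I,J) WITH T(I,J)
!HPF$ ALIGN x(:) WITH T(*,:)     ! a copy of x(J) wherever column J of A lives
!HPF$ ALIGN y(:) WITH T(:,*)     ! a copy of y(I) wherever row I of A lives

With this alignment every product A(I,J)*x(J) needed in y(I) = SUM(A(I,:)*x) is a local operation, whatever DISTRIBUTE is later applied to T; only the final reduction requires communication.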
Formal Definition of Align Directive
Each align-dummy variable is considered to range over all valid index values for the corresponding dimension of the alignee.
An align-subscript is evaluated for any specific combination of values of the align-dummy variables simply by evaluating it as an expression. The resulting subscript values must be legitimate subscripts for the align-target.

More Obscure, Complicated Examples of Align Directive
(Figure: alignment examples with non-unit stride, as might appear in "red-black" iterative solver algorithms.)

Distribution Directive in HPF
Syntax:
      !HPF$ DISTRIBUTE distributee (dist-format) [ONTO dist-target]
Allowed forms of dist-format:
  *                -- implies no distribution in this index
  BLOCK            -- critical to minimize communication
  CYCLIC           -- critical for load balancing
  BLOCK(int-expr)  -- not obviously useful!
  CYCLIC(int-expr) -- very useful
Examples:
      CHPF$ DISTRIBUTE TEMP(BLOCK,CYCLIC)
      !HPF$ DISTRIBUTE FRED(BLOCK(10)) ONTO P
      *HPF$ DISTRIBUTE (BLOCK,*) :: MYTEMPLATE

Basic Examples of Distribute Directive
!HPF$ PROCESSORS P(4)
      REAL, DIMENSION(16) :: A
!HPF$ TEMPLATE T(16)
!HPF$ ALIGN A(:) WITH T(:)
(Figure: the layouts of A when T is distributed BLOCK and CYCLIC onto P.)

Two Dimensional Example of Distribute Directive
      REAL, DIMENSION(4,4) :: A
*HPF$ PROCESSORS SQUARE(2,2)
*HPF$ TEMPLATE T(4,4)
*HPF$ ALIGN A(:,:) WITH T(:,:)
*HPF$ DISTRIBUTE T(BLOCK,CYCLIC) ONTO SQUARE

The Two Basic Distributions in HPF
We used BLOCK in the Laplace equation example, and it is the appropriate distribution for "local" or geometric type problems.
CYCLIC was called "scattered" in our early work (more precisely, it is a special case of scattered decomposition, whose most general form is a random distribution of objects onto processors). It is appropriate in cases where "load balancing" is more important than locality.
The simplest examples are matrix inversion and graphics rendering problems.
In solving equations Ax=b (which we will do later), there is no "nearest neighbor" structure between rows and columns; rather, one eliminates rows and columns, and a cyclic distribution ensures the work remains balanced.
In calculating pixels, the work depends on the complexity of the picture at each pixel, so it is best to distribute pixels cyclically (or randomly) to the processors.

The Example of Matrix Inversion
(Figure: matrix inversion set up on two processors after 0, 2 and 4 rows/columns have been eliminated.)
Note a BLOCK decomposition leads to all the work being on one processor at the end, even if it starts off balanced.

Example of Graphics Rendering
(Figure: a 16 by 16 array of pixels distributed either CYCLIC or as an 8 by 8 two dimensional (BLOCK,BLOCK).)

Example of Distribute Directive with Complex Alignment
      REAL, DIMENSION(12,16) :: A
      REAL, DIMENSION(8,10) :: B
CHPF$ PROCESSORS Q(4)
CHPF$ TEMPLATE FRED(16,16)
CHPF$ ALIGN A(:,:) WITH FRED(:,:)
CHPF$ ALIGN B(I,J) WITH FRED(I+2,J+2)
CHPF$ DISTRIBUTE FRED(BLOCK,*)
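Working the last example out by hand (a derived layout, not compiler output): FRED(BLOCK,*) on Q(4) gives each processor 4 complete template rows, and the alignments then determine who owns what:

!     FRED rows  1:4 -> Q(1);  5:8 -> Q(2);  9:12 -> Q(3);  13:16 -> Q(4)
!     A(I,J) sits at FRED(I,J),    so A rows 1:4 -> Q(1), 5:8 -> Q(2), 9:12 -> Q(3)
!     B(I,J) sits at FRED(I+2,J+2), so B rows 1:2 -> Q(1), 3:6 -> Q(2), 7:8 -> Q(3)
!     Q(4) holds only empty template positions

The offset alignment shifts B's rows relative to A's so that the corresponding elements A(I+2,J+2) and B(I,J) always share a processor.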
Dynamic Data Mapping
One data mapping is often not appropriate for an entire program: often a program has phases in which different distributions are needed.
For example, in a 2D FFT one typically transforms F(I,J) by first distributing it so that for each J all I values (x values) are in the same processor, and then remapping so that for each I all J values are in the same processor. This ensures there is no communication within the FFT phases, which is important, as a distributed one-dimensional FFT typically carries substantial overhead.
ALLOCATABLE arrays can change size.
REALIGN and REDISTRIBUTE are executable forms of the ALIGN and DISTRIBUTE commands, but they may only be applied to arrays declared DYNAMIC.
Naturally, DYNAMIC arrays can be initialized by ALIGN or DISTRIBUTE statements.

Advanced Mapping Directives -- ReDistribution and ReAlign
This example illustrates remapping A from a one to a two dimensional decomposition, and changing B from alignment with columns to alignment with rows:
      REAL, DIMENSION(64,64) :: A
      REAL, DIMENSION(64) :: B
!HPF$ PROCESSORS P(64)
!HPF$ PROCESSORS Q(8,8)
!HPF$ DYNAMIC :: A,B
!HPF$ ALIGN B(:) WITH A(:,*)
!HPF$ DISTRIBUTE A(*,BLOCK) ONTO P
      ...
!HPF$ REALIGN B(:) WITH A(*,:)
      ...
!HPF$ REDISTRIBUTE A(CYCLIC,CYCLIC) ONTO Q
      ...

Advanced Mapping Directives -- Allocatable Arrays and Pointers
      SUBROUTINE SUB(N,M)
      REAL, ALLOCATABLE, DIMENSION(:) :: A,B
      REAL, POINTER :: P(:)
!HPF$ PROCESSORS Q(64)
!HPF$ ALIGN B(I) WITH A(I+N)
!HPF$ DISTRIBUTE A(BLOCK(M))
!HPF$ DISTRIBUTE (BLOCK), DYNAMIC :: P
      ...
      ALLOCATE(A(128))
      ALLOCATE(B(64))
      ALLOCATE(P(1024))
      ...
!HPF$ REDISTRIBUTE P(CYCLIC)
      ...
      RETURN
      END

Subprograms in HPF
The scope of any mapping directive is a single (sub)program unit.
A template or distribution is not a first-class Fortran 90 object: it cannot be passed as a subprogram argument, and this creates significant complication!
An HPF compiler will typically pass an extra argument, effectively an array descriptor, telling the subroutine about the distribution of the passed arrays.
One can use array query intrinsics to find out what is going on, but of course the compiler does this implicitly.

Passing Distributed Arrays as Subprogram Arguments in HPF
There are three typical cases:
  The subroutine requires the data to use a particular mapping determined by the subroutine; the arguments must be remapped.
  The subroutine can use any mapping, so the actual argument should be passed and used with its current mapping. Here there are two sub-cases, depending on whether the programmer knows (and tells the subroutine) what the incoming distribution is.
  Sometimes we need to remap because array sections are being passed.
Any remappings must be undone on return from the subroutine.

Mapping Options for Dummy (Subroutine) Arguments
(These forms are sketched in code after the summary below.)
DISTRIBUTE:
  * instead of a dist-format or ONTO clause indicates that the incoming distribution is acceptable, i.e. leave the data in place.
  * before a dist-format or ONTO clause indicates that the data should stay in place, and asserts that the distribution is what you claim.
ALIGN: * instead of or before the target has similar meanings to DISTRIBUTE.
INHERIT: a new attribute allowing references back to the original full array, used when sections of an array are passed.

Inherit Distribution Directive in HPF
(Figure: an example of INHERIT -- not a comprehensive discussion.)

Summary of Mapping Directives in HPF
PROCESSORS, TEMPLATE, ALIGN, DISTRIBUTE, INHERIT, DYNAMIC, REALIGN, REDISTRIBUTE
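A sketch of the dummy-argument forms described above (the subroutine and array names are invented, and the exact directive spellings should be checked against the HPF specification):

      SUBROUTINE solve(a)
      REAL a(:,:)
!HPF$ DISTRIBUTE a(BLOCK,BLOCK)    ! force this mapping: remap on entry, undo on return

      SUBROUTINE apply(a)
      REAL a(:,:)
!HPF$ DISTRIBUTE a *               ! accept whatever mapping arrives; leave data in place

      SUBROUTINE smooth(a)
      REAL a(:,:)
!HPF$ DISTRIBUTE a * (BLOCK,*)     ! no remapping, and the programmer asserts the
                                   ! incoming distribution really is (BLOCK,*)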
Fundamental Parallelism Assumptions in HPF
An operation on two or more data objects is likely to be carried out much faster if they all reside in the same processor -- i.e. minimize communication.
It may be possible to carry out many such operations concurrently if they can be performed on different processors -- i.e. data parallelism.

Parallel Statements and Constructs in HPF
Parallel statements:
  Fortran 90 array assignments
  masked array assignments (WHERE)
  FORALL statement
Parallel constructs:
  WHERE and WHERE...ELSEWHERE constructs
  FORALL construct
  INDEPENDENT DO
Intrinsic functions and the HPF library
Extrinsic functions

Parallelism in Fortran 90 Array Assignments
This is as in CMFortran and Maspar MPFortran.
(Figure: example of a parallel array assignment.)

WHERE (Masked Array Assignment) in HPF
This is as in CMFortran and Maspar MPFortran, with example:
      WHERE (A .GT. 0) A = A - 100
Semantics of the WHERE statement:
  1. evaluate the mask (in parallel) and store it as a temporary T1
  2. for each i with T1(i)=.TRUE., compute T2(i) = A(i) - 100
  3. for each i with T1(i)=.TRUE., assign A(i) = T2(i)

WHERE...ELSEWHERE / IF...ELSE Constructs in HPF
There is a fundamental difference in semantics between the IF...ELSE and WHERE...ELSEWHERE constructs: the WHERE mask is an array evaluated once for the whole construct, and every statement in both branches is executed as a masked array assignment, whereas IF...ELSE tests a single scalar condition and executes only one branch.

Intrinsic Functions in HPF
Elemental examples:
      A = SIN(X)
      FORALL (i=1:100:2) A(i) = EXP(A(i))
Transformational and inquiry functions:
  Fortran 90: SUM, PRODUCT, ANY, DOT_PRODUCT, EOSHIFT, MAXVAL, ...
  HPF system inquiry functions: NUMBER_OF_PROCESSORS, PROCESSORS_SHAPE
  extensions of MAXLOC and MINLOC, ILEN

HPF Library
The HPF library functions comprise:
  new array reduction functions: IALL, IANY, IPARITY and PARITY (reducing with IAND, IOR, IEOR, and .NEQV.)
  array combining scatter functions: XXX_SCATTER
  array prefix and suffix functions: XXX_PREFIX, XXX_SUFFIX
  array sorting functions: GRADE_DOWN, GRADE_UP
  bit manipulation functions: LEADZ, POPCNT, POPPAR
  mapping inquiry subroutines: HPF_ALIGNMENT, HPF_TEMPLATE, HPF_DISTRIBUTION

SUM, SUM_PREFIX and SUM_SCATTER Defined
X = SUM(A) sums all elements of A and places the result in the scalar X.
Y = SUM_PREFIX(A) sets an array Y of the same size as A, so that Y(i) holds the sum of all A(j) for 1 <= j <= i.
Y = SUM_SCATTER(A, B, IND) sets array element Y(i) to the sum of array element B(i) plus those elements A(j) for which IND(j) = i.
      X = SUM_SCATTER( flux, X, INDEX )
is equivalent to
      FORALL (i=1:N)
         X(INDEX(i)) = X(INDEX(i)) + flux(i)
      END FORALL
assuming INDEX(i) just permutes the numbers 1 to N and has no repeated values.

HPF Intrinsic Example: SUM
(Figure: worked example of SUM.)

FORALL Statement in HPF
This is a very important extension to Fortran 90, and it defines one class of parallel DO loop. FORALL will be a language feature of Fortran95.
It relaxes the restriction that the operands of the rhs expression must be conformable with the lhs array.
It may be masked with a scalar logical expression (an extension of the WHERE construct).
A FORALL statement may call user-defined (PURE) functions on the elements of an array, simulating Fortran 90 elemental function invocation (albeit with a different syntax).
      FORALL( index-spec-list [,mask-expr] ) forall-assignment
where forall-assignment is a conventional single Fortran90 statement.

Examples of FORALL Statements in HPF
      FORALL (i=1:100, k=1:100) a(i,k) = b(i,k)    ! same as A = B
      FORALL (i=2:100:2) a(i) = a(i-1)             ! same as A(2:100:2) = A(1:99:2)
      FORALL (i=1:100) a(i) = i                    ! sets a(i) to i
      FORALL (i=1:100, j=1:100) a(i,j) = i+j
      FORALL (i=1:100) a(i,i) = b(i)
      FORALL (i=1:100, j=1:100) a(i,j) = b(j,i)
      FORALL (i=1:100) a(i, 1:100) = b(1:100, i)
      FORALL (i=1:100, j=1:100, y(i,j).NE.0) x(i,j) = REAL(i+j)/y(i,j)
      FORALL (i=1:100) a(i,ix(i)) = x(i)
      FORALL (i=1:9) x(i) = SUM(x(1:10:i))
      FORALL (i=1:100) a(i) = myfunction(a(i+1))

Semantics of the FORALL Statement in HPF
These are similar to Fortran 90 array assignments and WHERE: all right hand sides are evaluated (using old values) before any left hand side is assigned.
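A small worked example of these semantics (the five-element array is invented for illustration):

      REAL a(5)
      a = (/ 1., 2., 3., 4., 5. /)
      FORALL (i = 2:5) a(i) = a(i-1)
!     All a(i-1) are read before any a(i) is written, so a becomes (/1.,1.,2.,3.,4./)
!     (equivalent to a(2:5) = a(1:4)) -- NOT the sequential DO result (/1.,1.,1.,1.,1./).

This evaluate-then-assign rule is exactly what makes the vector-index cases below well defined only when the index vector has no repeated values.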
Vector Indices in FORALLs
Consider
      FORALL( i=1:n ) a(ix(i)) = a(i)
This is allowed in HPF, but it will only give sensible, reproducible results if ix(i) is a true permutation of 1...n; if ix(i) has repeated values, the result is undefined.
Of course we always use the "old" value of a(i) on the rhs, so that if ix(i) = i+1, the result is a(i+1)new = a(i)old -- and not the result of the recursion
      a(1)new = a(0)old
      a(2)new = a(1)new just calculated = a(0)old
      ...

Multiple Statement FORALLs
      FORALL( index-spec-list [,mask-expr] )
         forall-body
      END FORALL
where the forall-body can be a list of forall-assignment statements, FORALL statements or WHERE statements.
So multi-statement FORALLs support nesting of FORALLs, but in general a multi-statement FORALL is shorthand for a sequence of single statement FORALLs with, by definition, each statement completed before the next one begins.
The multi-statement FORALL is likely to be more efficient than several single statement ones, as the latter incur synchronization overhead on each statement.

HPF FORALL Construct Pictorially
(Figure: execution of a FORALL construct.)

PURE Functions in HPF
PURE functions have no side effects. ALL intrinsics are PURE.
DO loops can call any function, and the parallelism is then unclear, as a function call can destroy parallelism:
      DO I=1,1000
         A(I) = FUNC(A(I-1),X)
      END DO
If FUNC alters A(I-1), or in fact A(any index except I), then this loop cannot easily be parallelized.
FORALL statements can only call PURE functions, and these must NOT define (assign to) any global variable (e.g. any element of A in the example) or dummy argument (A(I-1) or X).
PURE functions can only INHERIT distribution and alignment specifications.

Example of PURE Function from Chuck Koelbel
      FORALL( i=1:n, j=1:m )
         k(i,j) = mandelbrot( CMPLX((i-1)*1.0/(n-1), (j-1)*1.0/(m-1)), 1000 )
      END FORALL
This can call the PURE function mandelbrot, which is essentially a generalized intrinsic:
      PURE INTEGER FUNCTION mandelbrot(x, itol)
         COMPLEX, INTENT(IN) :: x
         INTEGER, INTENT(IN) :: itol
         COMPLEX xtmp
         INTEGER k
         k = 0
         xtmp = -x
         DO WHILE( ABS(xtmp) < 2. .AND. k < itol )
            xtmp = xtmp*xtmp - x
            k = k + 1
         END DO
         mandelbrot = k
      END FUNCTION mandelbrot

The INDEPENDENT Assertion in HPF
      !HPF$ INDEPENDENT [ ,NEW (variable-list) ]
INDEPENDENT asserts that no iteration affects any other in any way.
It implements the "embarrassingly parallel" problem class we discussed under the structure of problems. Note the rest of HPF tackles mainly the synchronous problem class, with some loosely synchronous capability; HPF2 has "tasking" for the metaproblem class and some extensions for further irregular loosely synchronous problems.
NEW variables are defined to have fresh instantiations for each iteration, as is typically needed for embarrassingly parallel problems, where in fact essentially all variables in a loop would be NEW.
Note INDEPENDENT can also be applied to FORALL, where it asserts that no index point assigns to any location that another index value uses. This reduces the copying (into temporaries) that the compiler must otherwise generate for FORALL.
HPF2 (see later) has the extra feature of allowing REDUCTION (accumulation) variables in INDEPENDENT DO loops.

!HPF$ INDEPENDENT FORALL Pictorially
(Figure: data references of an INDEPENDENT FORALL.)

!HPF$ INDEPENDENT DO Pictorially
(Figure: data references of an INDEPENDENT DO.)

!HPF$ INDEPENDENT, NEW Variables
NEW variables are an exception to the conventional HPF picture of a global name space with either distributed or replicated variables.
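A sketch of NEW in action (the loop body is invented for illustration):

!HPF$ INDEPENDENT, NEW(tmp)
      DO i = 1, n
         tmp = 2.0*x(i) + 1.0    ! tmp is a fresh private instance in each iteration
         y(i) = tmp*tmp          ! each iteration writes only its own y(i)
      END DO

Without NEW(tmp), the shared scalar tmp would be written by every iteration and the INDEPENDENT assertion would be false.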
Extrinsics in HPF
An extrinsic function is a function written in a language other than HPF -- most naturally a node programming language (e.g. Fortran77) targeted at a single processor, or SPMD code with message passing such as MPI.
HPF defines the (Fortran90) interface and invocation characteristics:
  The interface defines what is input and output via INTENT, and these rules must be obeyed by the callee (the non-HPF side).
  If variables are implicitly replicated, the callee must make them consistent before return, i.e. the callee must respect the HPF model of parallelism.
  HPF will execute any remapping commands and hand the callee the "right" part of any distributed arrays.
  All processors are synchronized before the call to an EXTRINSIC function.
Extrinsics allow one to get efficient parallel code where the HPF language or compiler is inadequate.
This is analogous to calling assembly language from Fortran, native classes from Java, C from PERL etc.

High Performance Fortran -- HPF2 Changes
The original HPF 1.0 omitted some key capabilities which were known to be important, but whose syntax and functionality were unclear in 1993. Further experience has shown that HPF compilers are difficult to write!
The HPF Forum met in 1995-96 and approved a set of extensions and simplifications of HPF. The concept of a base HPF 2.0 plus Approved Extensions was agreed.
Note the approved extensions (which presumably vendors need not implement) include critical capabilities for dynamic irregular problems.
DYNAMIC, REALIGN and REDISTRIBUTE are no longer in the base language and are now just "approved extensions".
Surprisingly, no parallel I/O capabilities were approved!

ON HOME for Computation Placement
!HPF$ INDEPENDENT
      DO i = 1 , n
!HPF$    ON HOME( ix(i) )
         x(i) = y(ix(i)) - y(iy(i))
      END DO
This modifies the owner computes rule by specifying where a statement is executed, which need not be the processor owning the left hand side.
ON HOME is an approved extension.

Reductions in INDEPENDENT DO Loops
Many applications of INDEPENDENT DO loops do require reductions, as they typically calculate quantities independently but store the results as parts of various averages. For example, in High Energy Physics data analysis, each measured event can be processed by an INDEPENDENT DO, but one wishes to find particular observables (histograms, scatterplots) which are averaged over all events (a histogram version is sketched after the tasking example below). Financial modelling is similar.
      x = 0
!HPF$ INDEPENDENT, NEW(xinc), REDUCTION(x)
      DO i = 1 , N
         CALL sub(i, xinc)
         x = x + xinc
      END DO
xinc is a separate NEW variable in each iteration, but the result is accumulated into the global x.

Spawning Tasks in HPF
Task parallelism is sort of supported in HPF, but it is not clear to me that this is a great idea: it may be better to keep sophisticated task parallelism outside HPF, which is really only designed to support data parallelism.
!HPF$ TASKING
!HPF$ ON HOME(p(1:8))
      CALL foo(x,y)
!HPF$ ON HOME(p(9:16))
      CALL bar(z)
!HPF$ END
This extends the SPMD model, with foo running on eight processors and bar on another eight. Note foo and bar are expected to contain data parallel statements which distribute execution over their 8 processors using conventional HPF.
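A sketch of the histogramming pattern mentioned above (the names are invented; this assumes, as an illustration, that the HPF 2.0 REDUCTION clause may name an array variable):

      hist = 0
!HPF$ INDEPENDENT, NEW(bin), REDUCTION(hist)
      DO i = 1, nevents
         bin = INT( energy(i)/binwidth ) + 1   ! per-event work, independent across events
         hist(bin) = hist(bin) + 1             ! accumulation, safe because + commutes
      END DO

Each iteration is independent except for the commuting additions into hist, which is exactly what REDUCTION licenses.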
In this example the guard ring has extent 1: if x has dimension 100 and there are 4 processors, we allocate storage for
  x(1:26)   in processor 1
  x(25:51)  in processor 2
  x(50:76)  in processor 3
  x(75:100) in processor 4
!HPF$ DISTRIBUTE x( BLOCK( /26,24,24,26/ ) )
specifies a general BLOCK distribution; in this example
  x(1:26)   in processor 1
  x(27:50)  in processor 2
  x(51:74)  in processor 3
  x(75:100) in processor 4
!HPF$ DISTRIBUTE x( INDIRECT(map_array) )
allows an arbitrary user array to specify the location of x(i): the processor number in map_array(i) would typically be calculated by your favorite load balancing routine! (A sketch appears at the end of this section.)

New Data Mapping Features in HPF 2.0 - II
Distribution is now allowed onto processor subsets, with typical syntax:
!HPF$ PROCESSORS procs(1:np)
!HPF$ DISTRIBUTE b(BLOCK) ONTO procs(1:np/2-1)
Distribution is allowed for derived types, but can only be done at ONE level:
      TYPE bunch_of_meshes
         REAL A1(100,100,100), A2(100,100,100)
!HPF$    DISTRIBUTE (BLOCK,CYCLIC,*) :: A1, A2
      END TYPE
Another example shows distribution outside the TYPE definition:
      TYPE tree                      ! each node has one hundred children
         TYPE(tree), POINTER, DIMENSION(:) :: children
!HPF$    DYNAMIC children
         REAL value
      END TYPE
      TYPE(tree) actualtree
      ALLOCATE( actualtree%children(100) )     ! allocates POINTERs to the children
!HPF$ REDISTRIBUTE actualtree%children(BLOCK)  ! distributes (for the first time) these pointers
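A sketch of how INDIRECT might be used together with a load balancer (the routine my_load_balancer is invented for illustration):

      REAL x(100)
      INTEGER map_array(100)
!HPF$ PROCESSORS P(4)
!HPF$ DYNAMIC :: x
      ...
      CALL my_load_balancer(map_array)    ! invented routine: sets map_array(i) to the
                                          ! processor (1..4) that should own x(i)
!HPF$ REDISTRIBUTE x( INDIRECT(map_array) ) ONTO P

Because the map is computed at run time, the array is declared DYNAMIC and remapped with REDISTRIBUTE rather than given a static DISTRIBUTE.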