CPS615 HPF and Fortran90 Discussion -- Sept 17 96

CPS615 -- Base Course for the Simulation Track of Computational Science
Fall Semester 1996 -- Introduction to High Performance Fortran and Fortran 90
Geoffrey Fox, Tom Haupt
NPAC, Room 3-131 CST, 111 College Place, Syracuse NY 13244-4100

Abstract of HPF and Fortran90 Technology Discussion
A brief discussion of Fortran90 and Fortran77, and why Fortran90 has advantages and disadvantages
Overview of key features of Fortran90
  See Metcalf and Reid, Fortran90 Explained, Oxford Scientific Publications
Overview of key features of HPF: parallel constructs, data mapping, examples
The future -- HPF2
  See Chuck Koelbel from Rice University at http://renoir.csc.ncsu.edu/MRA/HTML/Workshop2/Koelbel

HPF is an extension of Fortran 90
Its purpose is to express data parallelism and so hide machine dependent features of parallel programming from the user.
We use Fortran90 as the base language both because it is more advanced than Fortran 77, with object oriented features, and because its array syntax allows an elegant, explicitly parallel expression of some operations.

Use of Fortran90 is a Problem because
It is a complex language for which it is difficult to build compilers.
Not many people use Fortran90, and maybe they will all switch to Java before Fortran90 comes into common practice.
So perhaps it is "too little too late", and we should focus on supporting the past (Fortran77) and the "correct" future ((HP)Java), not an irrelevant middle solution .....

Why is Fortran90 Easier than Fortran77
      DO 1 I=1,N
    1 A(I)=B(I)
is obviously parallel. The Fortran90 array notation
      A=B
expresses this parallelism naturally, in a way a compiler can detect easily and deterministically.
      DO 1 I=1,N
      J=I
      K=I
    1 A(J)=B(K)
is not so obviously parallel. Deciding on the existence and implementation of parallelism here needs an algorithm that is difficult to define in the general case (especially if IF statements, i.e. conditionals, define J and K).
Use of Fortran77 has "thrown away" the natural parallelism at the language level; the run time restores it by creating explicit values for variables such as J and K, which are known at compile time only through (sometimes impossible) analysis.

Important Features of Fortran90
Arrays are very well supported, with memory allocation, a set of intrinsics, better passing to procedures etc. This is the key capability for HPF.
Derived types allow general object structure (without inheritance) in Fortran90.
Pointers -- this area was NOT well supported by HPF compilers as of Summer 1996.
Modules replace COMMON, INCLUDE etc.
Procedures (functions, subroutines) allow better interfaces, recursion, optional parameters etc.
Better syntax, with free form source, more loop control etc.

Introduction to Fortran90 Arrays - I
Arrays are "true" objects in Fortran90 and are stored as the values of the elements plus a data descriptor!
There are operations on full arrays which have natural parallel implementations, as seen in HPF.
There is a new set of array intrinsic (built-in) functions: elemental functions, reductions (which process an array into a single value), and transformational (reshaping) functions.
Note these functions are NOT sufficient for real problems; one needs FORALL to define new parallel array functions.
FORALL is in HPF but not Fortran90 -- it is expected in the next round of the standard (Fortran95):
      FORALL( I = 1:nx-1 )
         A(I) = (A(I+1)+A(I-1))/(I+1)
      END FORALL

Introduction to Fortran90 Arrays - II
One can extract sections (subarrays) of arrays as u(lb:ub:step), where
  lb is the lower bound
  ub is the upper bound
  step (defaulting to 1) is the stride
There are masked (conditional) array operations using the WHERE ... ELSEWHERE construct.
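To make the section and masking notation concrete, here is a minimal sketch (the arrays u and v are invented for this illustration):

      REAL u(100), v(100)
      ...
      WHERE ( v > 0.0 )
         u = SQRT(v)            ! executed elementwise where the mask is true
      ELSEWHERE
         u = 0.0                ! executed elementwise where the mask is false
      END WHERE
      u(1:99:2) = v(2:100:2)    ! section assignment: odd-indexed u gets even-indexed v

Both the WHERE construct and the section assignment are natural targets for parallel execution in HPF.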
One can still write Fortran77 element-by-element operations (DO loops), but of course these may not be interpretable for efficient parallelism by an HPF compiler.
Note Fortran90 was designed for science and engineering, originally with special concern for vector supercomputers -- yet Cray supports Fortran77 better than Fortran90(!)

Fortran90 Arrays and Memory Allocation
ALLOCATABLE arrays can be defined at runtime with variable sizing:
      REAL, ALLOCATABLE :: u(:,:) , f(:,:)
      ALLOCATE ( u(0:nx,0:ny) , f(1:27,0:ny) )
One can define POINTER and TARGET attributes, which can be used like REAL, DIMENSION etc. The => operator allows one to set a POINTER to "point to" a TARGET.
Arguments of a subroutine need NOT have their array dimensions declared in the subroutine, as these are passed by the calling program in the data descriptor.
Local arrays are created on the stack, and their bounds may be non-constant, evaluated at procedure entry.

More on Fortran90 Arrays and Subroutines
One passes "assumed-shape" arrays from caller to callee using INTERFACE syntax:
      INTERFACE
         SUBROUTINE residual (r,u,f)
            REAL r(:,:) , u(:,:) , f(:,:)
         END SUBROUTINE
      END INTERFACE
which is called by
      CALL residual (r,u,f)
or
      CALL residual ( r(0:nx:2, 0:ny:2) , u(0:nx:2, 0:ny:2) , f(0:nx:2, 0:ny:2) )
where the latter example just processes every other element of the arrays.

Typical Use of Array and Intrinsic Operations
      REAL u(0:nx,0:ny), A(100,100) , fact , avg
      u = fact * (u - avg)
scales and translates all elements of u.
      u = .25*( CSHIFT(u,1,1) + CSHIFT(u,-1,1) + CSHIFT(u,1,2) + CSHIFT(u,-1,2) )
calculates the average of the 4 array elements surrounding each point. Note the third argument of CSHIFT is the label for the axis (1=x, 2=y).
      SQRT( A(1:100,1) )
calculates a new array containing 100 square roots.
      SUM(A)
is a reduction operator, summing all elements of array A into a scalar.
      SIZE(A,1)
is an array query intrinsic giving the size of A in the first dimension; it is particularly useful for "assumed-shape" arrays passed into subroutines.

Derived Type in Fortran90
      TYPE PERSON
         CHARACTER(LEN=10) NAME
         REAL AGE
         INTEGER ID
      END TYPE PERSON
      TYPE(PERSON) YOU, ME
The identification number of YOU would be accessed as YOU%ID, an ordinary integer.
One can define global operators, so that YOU+ME could be defined.
One can use the name of a derived type as a constructor:
      YOU = PERSON ('Pamela Fox', 12.0, 3)

Examples of POINTERs in Fortran90
One can define a linked list as:
      TYPE ENTRY
         REAL VALUE
         INTEGER INDEX
         TYPE(ENTRY), POINTER :: NEXT
      END TYPE ENTRY
ALLOCATE creates elements of a linked list dynamically:
      ALLOCATE (CURRENT)
      CURRENT = ENTRY( NEW_VALUE, NEW_INDEX, FIRST )
      FIRST => CURRENT
adds a new entry at the start of the linked list and renames it with the POINTER FIRST.

MODULEs in Fortran90
The general syntax is:
      MODULE name
         ! specification part
      CONTAINS              ! this part is optional
         ! module subprograms
      END MODULE name
For example
      MODULE IllustratingCommonBlock
         INTEGER, DIMENSION(52) :: CARDS
      END MODULE IllustratingCommonBlock
replaces the COMMON construct and can be used as
      USE IllustratingCommonBlock

MODULEs, INTERFACEs and Overloaded Operators in Fortran90
      MODULE INTERVAL_ARITHMETIC
         TYPE INTERVAL
            REAL LOWER, UPPER
         END TYPE INTERVAL
         INTERFACE OPERATOR(+)           ! define overloaded + operator
            MODULE PROCEDURE ADD_INTERVALS
         END INTERFACE
      CONTAINS
         FUNCTION ADD_INTERVALS(A,B)
            TYPE(INTERVAL) ADD_INTERVALS
            TYPE(INTERVAL), INTENT(IN) :: A, B
            ADD_INTERVALS%LOWER = A%LOWER + B%LOWER
            ADD_INTERVALS%UPPER = A%UPPER + B%UPPER
         END FUNCTION ADD_INTERVALS
      END MODULE INTERVAL_ARITHMETIC
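A minimal usage sketch of this module (the program name and numeric values are invented for illustration):

      PROGRAM TEST_INTERVALS
         USE INTERVAL_ARITHMETIC
         TYPE(INTERVAL) :: X, Y, Z
         X = INTERVAL(1.0, 2.0)
         Y = INTERVAL(0.5, 1.5)
         Z = X + Y                 ! resolves to ADD_INTERVALS: Z = INTERVAL(1.5, 3.5)
         PRINT *, Z%LOWER, Z%UPPER
      END PROGRAM TEST_INTERVALS

The + applied to intervals is dispatched through the INTERFACE OPERATOR(+) block to ADD_INTERVALS.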
Outline of HPF Discussion
What HPF is, what we need it for, and where it came from
How HPF gets its parallelism
Why it is called "High Performance"
What the HPF compiler directives are
Data mapping in HPF: TEMPLATE, PROCESSORS, DISTRIBUTE, ALIGN
Parallel statements and constructs in HPF: array statements, WHERE/ELSEWHERE, intrinsics, FORALL, PURE, INDEPENDENT
Latest discussions -- HPF-2: ON HOME, TASKING, dynamic data mapping, reductions in INDEPENDENT DO loops

Information on HPF and HPF Forum (HPFF)
Rice has taken the lead in the HPF Forum, which is a much faster mechanism for getting agreement than the formal ten-year standards process that Fortran90 suffered.
World Wide Web pages at Rice and Vienna:
  http://www.crpc.rice.edu/HPFF/home.html
  http://www.vcpc.univie.ac.at/HPFF/home.html
Mailing list: majordomo@cs.rice.edu -- choose the list (hpff, hpff-interpret, hpff-core) you wish to subscribe to.
Anonymous FTP to titan.cs.rice.edu; see public/HPFF/README for the latest list of files.

Possible Programming Models
Explicit message passing, as in PVM or MPI: the user breaks the program into parts, and the parts send messages between them to implement the communication necessary for synchronization and integration of the parts into the solution of a single program. This matches the hardware, but it is not particularly natural for the problem and can be machine dependent.
Object oriented programming is like message passing, but now objects and not programs communicate. It is very good when the objects are natural to the problem and represent functional parallelism. However, in data parallel problems tackled with an object oriented approach, one must break the problem up into a number of objects that depends on the number of processors, and so reflects the machine and not the problem.

Data Parallel Programming Model
Data parallelism is higher level than either the message passing or object models (if objects are used to break up data to respect the computer).
It provides a shared memory programming model which can be executed on SIMD or MIMD computers, distributed or shared memory computers.
Note it specifies the problem, not the machine structure.
It in principle provides the most attractive machine independent model for programmers, as it reflects the problem and not the computer.
Its disadvantage is that it is hard to build compilers for, especially for the most interesting new algorithms, which are dynamic and irregular!

Parallelism in HPF
Parallelism in HPF is expressed explicitly:
  Fortran 90 array expressions and assignments (including WHERE)
  HPF library and array intrinsics
  FORALL statement and construct
  PURE, which labels procedures that can be used in FORALL as they have no "side effects"
  INDEPENDENT assertion on DO loops
The compiler may choose not to exploit information about parallelism.
The compiler may also detect parallelism in sequential code (Fortran77 is part of Fortran90). For example,
      A = B
or more interestingly
      WHERE( B > 0. )
         A = B
      ELSEWHERE
         A = 0.
      END WHERE
can be written
      DO I = n1, n2
         DO J = m1, m2
            IF( B(I,J) > 0. ) THEN
               A(I,J) = B(I,J)
            ELSE
               A(I,J) = 0.
            END IF
         END DO
      END DO
A good HPF compiler will recognize that the DO loops can be parallelized and give the same answer for the Fortran90 and Fortran77 forms, but often the detection of parallelism is not clear.
Note FORALL is guaranteed to be parallelizable, as by definition it has no side effects.
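For comparison, the same conditional assignment can also be written with masked FORALL statements -- a sketch; note that two complementary masks are needed, since FORALL has no ELSEWHERE:

      FORALL ( I = n1:n2, J = m1:m2, B(I,J) > 0. )  A(I,J) = B(I,J)
      FORALL ( I = n1:n2, J = m1:m2, B(I,J) <= 0. ) A(I,J) = 0.

Unlike the DO-loop form, these statements are parallel by construction and need no dependence analysis.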
HPF Features
All of Fortran90
New instructions FORALL and INDEPENDENT enhancing DO loops
Data alignment and distribution assertions
Miscellaneous support operations, but NO parallel input/output
Little support for irregular computations
Little support for any form of non-mainstream data parallelism
Extrinsics, supporting links with explicit message passing

What gives high performance in HPF
There is a tradeoff between parallelism and communication.
The programmer defines the data mapping, and the compiler uses this to assign processing.
The underlying assumptions are that:
  an operation on two or more data objects is likely to be carried out much faster if they all reside in the same processor,
  and that it may be possible to carry out many such operations concurrently if they can be performed on different processors.
This is embodied in the "owner computes" rule -- namely that in, for instance,
      A(i,j) = .....
one brings everything on the right hand side to the processor "owning" A(i,j) and performs the computation in that processor.
The owner computes algorithm is usually good and often best.

Compiler directives used in HPF
The directives are structured comments that suggest implementation strategies or assert facts about a program to the compiler.
They may affect the efficiency of the computation performed, but do not change the value computed by the program.
As with Fortran 90 statements, there are both declarative directives and executable directives.

What does an HPF Compiler do?
It must generate Fortran77(90) + message passing code, or possibly in one pass map HPF code onto parallel machine code.
Traditional dataflow and dependence analysis is especially critical in the Fortran77 parts of the code.
It must use the data mapping assertions to decide what is stored where, and so organize the computation; the code must be transformed to respect this owner-computes model.
It must typically use the "loosely synchronous" model with alternating communicate-compute phases, and the compiler then generates all the communication needed.
Due to latency issues, the compiler must minimize the communication needed and maximize the size of the packets sent.
We need an excellent run-time library, which the compiler invokes for parallel intrinsics etc.

Syntax of HPF Directives
HPF directives are consistent with Fortran 90 syntax except for the special prefix for directives:
  !HPF$ (the only version allowed with free format source)
  CHPF$
  *HPF$
Two forms of the directives are allowed:
  specification statements, such as
      !HPF$ DISTRIBUTE MYTEMPLATE(BLOCK) ONTO P
  the equivalent attributed form, such as
      !HPF$ DISTRIBUTE (BLOCK) ONTO P :: MYTEMPLATE

Data Mapping in HPF
Data mapping in HPF is all you need to do to get parallelism, as long as you use the explicit array syntax such as A=B+C.
The owner computes rule implies that specifying the location of variables specifies (optimally or not) the parallel execution!
The new HPF-2 ON HOME directive is an exception to this rule, as it specifies where a particular statement is to be executed.
(RE)DISTRIBUTE tells you where data is to be placed.
(RE)ALIGN tells you how different data structures are to be placed relative to each other.
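As a sketch of how owner computes turns a data mapping into a communication pattern (the arrays a and b are invented for this illustration):

      REAL a(1000), b(1000)
!HPF$ DISTRIBUTE a(BLOCK)
!HPF$ ALIGN b(:) WITH a(:)
      ...
      a(1:999) = b(2:1000)

The owner of a(i) computes a(i) = b(i+1); with the BLOCK mapping, b(i+1) is local except when i sits on a block boundary, so the compiler need only communicate one boundary element per processor -- and can batch such transfers into large messages.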
Staged Data Mapping in HPF
(Figure: data mapping is staged -- arrays are ALIGNed to a TEMPLATE, the template is DISTRIBUTEd onto abstract PROCESSORS, and the processor arrangement is mapped onto the physical machine.)

Template in HPF
A template is an abstract space of indexed positions (an "array of nothings").
In CMFortran terminology, a template is a set of virtual processors -- one per data point.
A template typically specifies precisely the full data parallelism that is natural for the problem; HPF maps this problem parallelism onto the particular machine.
A template is declared by the TEMPLATE directive, which specifies:
  the name of the template
  the rank (i.e., number of dimensions)
  the extent in each dimension
Examples:
      CHPF$ TEMPLATE T(1000)
      !HPF$ TEMPLATE FRED(N, 2*N)
      *HPF$ TEMPLATE, DIMENSION(5,100,50) :: MINE, YOURS

Abstract Processors in HPF
Abstract processors always form a rectilinear grid in 1 or more dimensions.
They are abstract coarse grain collections of data points.
Remember, efficiency says we must "block" communication into large chunks -- processors give us a general target for this.
A processor arrangement is defined by the PROCESSORS directive, which specifies:
  the name of the processor arrangement
  the rank (i.e., number of dimensions)
  the extent in each dimension
Examples:
      !HPF$ PROCESSORS P(N)
      *HPF$ PROCESSORS BIZARRO(1972:1997,-20:17)
      CHPF$ PROCESSORS SCALARPROC     ! scalar (by default sequential)

Example of Template and Processors
      REAL, DIMENSION(40) :: A, B, C
!HPF$ PROCESSORS P(4)
!HPF$ TEMPLATE X(40)
!HPF$ ALIGN WITH X :: A, B, C
!HPF$ DISTRIBUTE X(BLOCK)
      ...
      C = A + B
      ...

Align Directive in HPF
The syntax of ALIGN is
      !HPF$ ALIGN alignee WITH align-target
where -- note [..] implies an optional component --
  alignee is: alignee [(align-source-list)]
  align-target is: align-target [(align-subscript-list)]
Alternatively
      *HPF$ ALIGN (align-source-list) WITH align-target :: alignee

Examples of Align Directive
Note a colon (:) in a directive denotes all values of an array index.
Examples with explicit array indices:
      CHPF$ ALIGN A(i) WITH B(i)
      *HPF$ ALIGN (i,j) WITH TEMPL(i,j) :: A, B
Examples using the colon:
      !HPF$ ALIGN A(:) WITH B(:)
      CHPF$ ALIGN (:,:) WITH TEMPL(:,:) :: A, B

Changing Rank in Align Directive
The ranks of the alignee and the align-target may be different.
Examples:
      !HPF$ ALIGN A(:,j) WITH B(:)
      CHPF$ ALIGN A(:,*) WITH B(:)

Replication in Align Directive
      !HPF$ ALIGN A(:) WITH TEMPL(:,*)
replicates A, placing a copy at every position of the second template dimension; going the other way round,
      !HPF$ ALIGN A(:) WITH TEMPL(:,i)
only puts A on one part of the template.

General Alignments in HPF
HPF allows more general alignments, such as:
      REAL, DIMENSION(5,8) :: A, B
!HPF$ TEMPLATE T(12,12)
!HPF$ ALIGN A(:,J) WITH T(:,J+1)
!HPF$ ALIGN B(I,J) WITH T(I+4,J+4)
This is useful for simple numerical shifts as in the example, but not useful in the general case of arbitrary index expressions allowed by the ALIGN syntax.
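A classic use of replication is aligning vectors with a matrix for matrix-vector multiply; here is a sketch (the array names are invented):

      REAL A(100,100), x(100), y(100)
!HPF$ TEMPLATE T(100,100)
!HPF$ ALIGN A(I,J) WITH T(I,J)
!HPF$ ALIGN x(:) WITH T(*,:)     ! a copy of x(J) wherever column J of A lives
!HPF$ ALIGN y(:) WITH T(:,*)     ! a copy of y(I) wherever row I of A lives

With this alignment every product A(I,J)*x(J) needed in y(I) = SUM(A(I,:)*x) is a local operation, whatever DISTRIBUTE is later applied to T; only the final reduction requires communication.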
Formal Definition of Align Directive
Each align-dummy variable is considered to range over all valid index values for the corresponding dimension of the alignee.
An align-subscript is evaluated for any specific combination of values of the align-dummy variables simply by evaluating it as an expression. The resulting subscript values must be legitimate subscripts for the align-target.

More Obscure, Complicated Examples of Align Directive
(Figure: alignment examples with non-unit stride, as might appear in "red-black" iterative solver algorithms.)

Distribution Directive in HPF
Syntax:
      !HPF$ DISTRIBUTE distributee (dist-format) [ONTO dist-target]
Allowed forms of dist-format:
  *                -- implies no distribution in this index
  BLOCK            -- critical to minimize communication
  CYCLIC           -- critical for load balancing
  BLOCK(int-expr)  -- not obviously useful!
  CYCLIC(int-expr) -- very useful
Examples:
      CHPF$ DISTRIBUTE TEMP(BLOCK,CYCLIC)
      !HPF$ DISTRIBUTE FRED(BLOCK(10)) ONTO P
      *HPF$ DISTRIBUTE (BLOCK,*) :: MYTEMPLATE

Basic Examples of Distribute Directive
!HPF$ PROCESSORS P(4)
      REAL, DIMENSION(16) :: A
!HPF$ TEMPLATE T(16)
!HPF$ ALIGN A(:) WITH T(:)
(Figure: the layouts of A when T is distributed BLOCK and CYCLIC onto P.)

Two Dimensional Example of Distribute Directive
      REAL, DIMENSION(4,4) :: A
*HPF$ PROCESSORS SQUARE(2,2)
*HPF$ TEMPLATE T(4,4)
*HPF$ ALIGN A(:,:) WITH T(:,:)
*HPF$ DISTRIBUTE T(BLOCK,CYCLIC) ONTO SQUARE

The Two Basic Distributions in HPF
We used BLOCK in the Laplace equation example, and it is the appropriate distribution for "local" or geometric type problems.
CYCLIC was called "scattered" in our early work (more precisely, it is a special case of scattered decomposition, whose most general form is a random distribution of objects onto processors). It is appropriate in cases where "load balancing" is more important than locality.
The simplest examples are matrix inversion and graphics rendering problems.
In solving equations Ax=b (which we will do later), there is no "nearest neighbor" structure between rows and columns; rather, one eliminates rows and columns, and a cyclic distribution ensures the work remains balanced.
In calculating pixels, the work depends on the complexity of the picture at each pixel, so it is best to distribute pixels cyclically (or randomly) to the processors.

The Example of Matrix Inversion
(Figure: matrix inversion set up on two processors after 0, 2 and 4 rows/columns have been eliminated.)
Note a BLOCK decomposition leads to all the work being on one processor at the end, even if it starts off balanced.

Example of Graphics Rendering
(Figure: a 16 by 16 array of pixels distributed either CYCLIC or as an 8 by 8 two dimensional (BLOCK,BLOCK).)

Example of Distribute Directive with Complex Alignment
      REAL, DIMENSION(12,16) :: A
      REAL, DIMENSION(8,10) :: B
CHPF$ PROCESSORS Q(4)
CHPF$ TEMPLATE FRED(16,16)
CHPF$ ALIGN A(:,:) WITH FRED(:,:)
CHPF$ ALIGN B(I,J) WITH FRED(I+2,J+2)
CHPF$ DISTRIBUTE FRED(BLOCK,*)
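Working the last example out by hand (a derived layout, not compiler output): FRED(BLOCK,*) on Q(4) gives each processor 4 complete template rows, and the alignments then determine who owns what:

!     FRED rows  1:4 -> Q(1);  5:8 -> Q(2);  9:12 -> Q(3);  13:16 -> Q(4)
!     A(I,J) sits at FRED(I,J),    so A rows 1:4 -> Q(1), 5:8 -> Q(2), 9:12 -> Q(3)
!     B(I,J) sits at FRED(I+2,J+2), so B rows 1:2 -> Q(1), 3:6 -> Q(2), 7:8 -> Q(3)
!     Q(4) holds only empty template positions

The offset alignment shifts B's rows relative to A's so that the corresponding elements A(I+2,J+2) and B(I,J) always share a processor.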
Dynamic Data Mapping
One data mapping is often not appropriate for an entire program: often a program has phases in which different distributions are needed.
For example, in a 2D FFT one typically transforms F(I,J) by first distributing it so that for each J all I values (x values) are in the same processor, and then remapping so that for each I all J values are in the same processor. This ensures there is no communication within the FFT phases, which is important, as a distributed one-dimensional FFT typically carries substantial overhead.
ALLOCATABLE arrays can change size.
REALIGN and REDISTRIBUTE are executable forms of the ALIGN and DISTRIBUTE commands, but they may only be applied to arrays declared DYNAMIC.
Naturally, DYNAMIC arrays can be initialized by ALIGN or DISTRIBUTE statements.

Advanced Mapping Directives -- ReDistribution and ReAlign
This example illustrates remapping A from a one to a two dimensional decomposition, and changing B from alignment with columns to alignment with rows:
      REAL, DIMENSION(64,64) :: A
      REAL, DIMENSION(64) :: B
!HPF$ PROCESSORS P(64)
!HPF$ PROCESSORS Q(8,8)
!HPF$ DYNAMIC :: A,B
!HPF$ ALIGN B(:) WITH A(:,*)
!HPF$ DISTRIBUTE A(*,BLOCK) ONTO P
      ...
!HPF$ REALIGN B(:) WITH A(*,:)
      ...
!HPF$ REDISTRIBUTE A(CYCLIC,CYCLIC) ONTO Q
      ...

Advanced Mapping Directives -- Allocatable Arrays and Pointers
      SUBROUTINE SUB(N,M)
      REAL, ALLOCATABLE, DIMENSION(:) :: A,B
      REAL, POINTER :: P(:)
!HPF$ PROCESSORS Q(64)
!HPF$ ALIGN B(I) WITH A(I+N)
!HPF$ DISTRIBUTE A(BLOCK(M))
!HPF$ DISTRIBUTE (BLOCK), DYNAMIC :: P
      ...
      ALLOCATE(A(128))
      ALLOCATE(B(64))
      ALLOCATE(P(1024))
      ...
!HPF$ REDISTRIBUTE P(CYCLIC)
      ...
      RETURN
      END

Subprograms in HPF
The scope of any mapping directive is a single (sub)program unit.
A template or distribution is not a first-class Fortran 90 object: it cannot be passed as a subprogram argument, and this creates significant complication!
An HPF compiler will typically pass an extra argument, effectively an array descriptor, telling the subroutine about the distribution of the passed arrays.
One can use array query intrinsics to find out what is going on, but of course the compiler does this implicitly.

Passing Distributed Arrays as Subprogram Arguments in HPF
There are three typical cases:
  The subroutine requires the data to use a particular mapping determined by the subroutine; the arguments must be remapped.
  The subroutine can use any mapping, so the actual argument should be passed and used with its current mapping. Here there are two sub-cases, depending on whether the programmer knows (and tells the subroutine) what the incoming distribution is.
  Sometimes we need to remap because array sections are being passed.
Any remappings must be undone on return from the subroutine.

Mapping Options for Dummy (Subroutine) Arguments
(These forms are sketched in code after the summary below.)
DISTRIBUTE:
  * instead of a dist-format or ONTO clause indicates that the incoming distribution is acceptable, i.e. leave the data in place.
  * before a dist-format or ONTO clause indicates that the data should stay in place, and asserts that the distribution is what you claim.
ALIGN: * instead of or before the target has similar meanings to DISTRIBUTE.
INHERIT: a new attribute allowing references back to the original full array, used when sections of an array are passed.

Inherit Distribution Directive in HPF
(Figure: an example of INHERIT -- not a comprehensive discussion.)

Summary of Mapping Directives in HPF
PROCESSORS, TEMPLATE, ALIGN, DISTRIBUTE, INHERIT, DYNAMIC, REALIGN, REDISTRIBUTE
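A sketch of the dummy-argument forms described above (the subroutine and array names are invented, and the exact directive spellings should be checked against the HPF specification):

      SUBROUTINE solve(a)
      REAL a(:,:)
!HPF$ DISTRIBUTE a(BLOCK,BLOCK)    ! force this mapping: remap on entry, undo on return

      SUBROUTINE apply(a)
      REAL a(:,:)
!HPF$ DISTRIBUTE a *               ! accept whatever mapping arrives; leave data in place

      SUBROUTINE smooth(a)
      REAL a(:,:)
!HPF$ DISTRIBUTE a * (BLOCK,*)     ! no remapping, and the programmer asserts the
                                   ! incoming distribution really is (BLOCK,*)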
Fundamental Parallelism Assumptions in HPF
An operation on two or more data objects is likely to be carried out much faster if they all reside in the same processor -- i.e. minimize communication.
It may be possible to carry out many such operations concurrently if they can be performed on different processors -- i.e. data parallelism.

Parallel Statements and Constructs in HPF
Parallel statements:
  Fortran 90 array assignments
  masked array assignments (WHERE)
  FORALL statement
Parallel constructs:
  WHERE and WHERE...ELSEWHERE constructs
  FORALL construct
  INDEPENDENT DO
Intrinsic functions and the HPF library
Extrinsic functions

Parallelism in Fortran 90 Array Assignments
This is as in CMFortran and Maspar MPFortran.
(Figure: example of a parallel array assignment.)

WHERE (Masked Array Assignment) in HPF
This is as in CMFortran and Maspar MPFortran, with example:
      WHERE (A .GT. 0) A = A - 100
Semantics of the WHERE statement:
  1. evaluate the mask (in parallel) and store it as a temporary T1
  2. for each i with T1(i)=.TRUE., compute T2(i) = A(i) - 100
  3. for each i with T1(i)=.TRUE., assign A(i) = T2(i)

WHERE...ELSEWHERE / IF...ELSE Constructs in HPF
There is a fundamental difference in semantics between the IF...ELSE and WHERE...ELSEWHERE constructs: the WHERE mask is an array evaluated once for the whole construct, and every statement in both branches is executed as a masked array assignment, whereas IF...ELSE tests a single scalar condition and executes only one branch.

Intrinsic Functions in HPF
Elemental examples:
      A = SIN(X)
      FORALL (i=1:100:2) A(i) = EXP(A(i))
Transformational and inquiry functions:
  Fortran 90: SUM, PRODUCT, ANY, DOT_PRODUCT, EOSHIFT, MAXVAL, ...
  HPF system inquiry functions: NUMBER_OF_PROCESSORS, PROCESSORS_SHAPE
  extensions of MAXLOC and MINLOC, ILEN

HPF Library
The HPF library functions comprise:
  new array reduction functions: IALL, IANY, IPARITY and PARITY (reducing with IAND, IOR, IEOR, and .NEQV.)
  array combining scatter functions: XXX_SCATTER
  array prefix and suffix functions: XXX_PREFIX, XXX_SUFFIX
  array sorting functions: GRADE_DOWN, GRADE_UP
  bit manipulation functions: LEADZ, POPCNT, POPPAR
  mapping inquiry subroutines: HPF_ALIGNMENT, HPF_TEMPLATE, HPF_DISTRIBUTION

SUM, SUM_PREFIX and SUM_SCATTER Defined
X = SUM(A) sums all elements of A and places the result in the scalar X.
Y = SUM_PREFIX(A) sets an array Y of the same size as A, so that Y(i) holds the sum of all A(j) for 1 <= j <= i.
Y = SUM_SCATTER(A, B, IND) sets array element Y(i) to the sum of array element B(i) plus those elements A(j) for which IND(j) = i.
      X = SUM_SCATTER( flux, X, INDEX )
is equivalent to
      FORALL (i=1:N)
         X(INDEX(i)) = X(INDEX(i)) + flux(i)
      END FORALL
assuming INDEX(i) just permutes the numbers 1 to N and has no repeated values.

HPF Intrinsic Example: SUM
(Figure: worked example of SUM.)

FORALL Statement in HPF
This is a very important extension to Fortran 90, and it defines one class of parallel DO loop. FORALL will be a language feature of Fortran95.
It relaxes the restriction that the operands of the rhs expression must be conformable with the lhs array.
It may be masked with a scalar logical expression (an extension of the WHERE construct).
A FORALL statement may call user-defined (PURE) functions on the elements of an array, simulating Fortran 90 elemental function invocation (albeit with a different syntax).
      FORALL( index-spec-list [,mask-expr] ) forall-assignment
where forall-assignment is a conventional single Fortran90 statement.

Examples of FORALL Statements in HPF
      FORALL (i=1:100, k=1:100) a(i,k) = b(i,k)    ! same as A = B
      FORALL (i=2:100:2) a(i) = a(i-1)             ! same as A(2:100:2) = A(1:99:2)
      FORALL (i=1:100) a(i) = i                    ! sets a(i) to i
      FORALL (i=1:100, j=1:100) a(i,j) = i+j
      FORALL (i=1:100) a(i,i) = b(i)
      FORALL (i=1:100, j=1:100) a(i,j) = b(j,i)
      FORALL (i=1:100) a(i, 1:100) = b(1:100, i)
      FORALL (i=1:100, j=1:100, y(i,j).NE.0) x(i,j) = REAL(i+j)/y(i,j)
      FORALL (i=1:100) a(i,ix(i)) = x(i)
      FORALL (i=1:9) x(i) = SUM(x(1:10:i))
      FORALL (i=1:100) a(i) = myfunction(a(i+1))

Semantics of the FORALL Statement in HPF
These are similar to Fortran 90 array assignments and WHERE: all right hand sides are evaluated (using old values) before any left hand side is assigned.
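A small worked example of these semantics (the five-element array is invented for illustration):

      REAL a(5)
      a = (/ 1., 2., 3., 4., 5. /)
      FORALL (i = 2:5) a(i) = a(i-1)
!     All a(i-1) are read before any a(i) is written, so a becomes (/1.,1.,2.,3.,4./)
!     (equivalent to a(2:5) = a(1:4)) -- NOT the sequential DO result (/1.,1.,1.,1.,1./).

This evaluate-then-assign rule is exactly what makes the vector-index cases below well defined only when the index vector has no repeated values.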
Vector Indices in FORALLs
Consider
      FORALL( i=1:n ) a(ix(i)) = a(i)
This is allowed in HPF, but it will only give sensible, reproducible results if ix(i) is a true permutation of 1...n; if ix(i) has repeated values, the result is undefined.
Of course we always use the "old" value of a(i) on the rhs, so that if ix(i) = i+1, the result is a(i+1)new = a(i)old -- and not the result of the recursion
      a(1)new = a(0)old
      a(2)new = a(1)new just calculated = a(0)old
      ...

Multiple Statement FORALLs
      FORALL( index-spec-list [,mask-expr] )
         forall-body
      END FORALL
where the forall-body can be a list of forall-assignment statements, FORALL statements or WHERE statements.
So multi-statement FORALLs support nesting of FORALLs, but in general a multi-statement FORALL is shorthand for a sequence of single statement FORALLs with, by definition, each statement completed before the next one begins.
The multi-statement FORALL is likely to be more efficient than several single statement ones, as the latter incur synchronization overhead on each statement.

HPF FORALL Construct Pictorially
(Figure: execution of a FORALL construct.)

PURE Functions in HPF
PURE functions have no side effects. ALL intrinsics are PURE.
DO loops can call any function, and the parallelism is then unclear, as a function call can destroy parallelism:
      DO I=1,1000
         A(I) = FUNC(A(I-1),X)
      END DO
If FUNC alters A(I-1), or in fact A(any index except I), then this loop cannot easily be parallelized.
FORALL statements can only call PURE functions, and these must NOT define (assign to) any global variable (e.g. any element of A in the example) or dummy argument (A(I-1) or X).
PURE functions can only INHERIT distribution and alignment specifications.

Example of PURE Function from Chuck Koelbel
      FORALL( i=1:n, j=1:m )
         k(i,j) = mandelbrot( CMPLX((i-1)*1.0/(n-1), (j-1)*1.0/(m-1)), 1000 )
      END FORALL
This can call the PURE function mandelbrot, which is essentially a generalized intrinsic:
      PURE INTEGER FUNCTION mandelbrot(x, itol)
         COMPLEX, INTENT(IN) :: x
         INTEGER, INTENT(IN) :: itol
         COMPLEX xtmp
         INTEGER k
         k = 0
         xtmp = -x
         DO WHILE( ABS(xtmp) < 2. .AND. k < itol )
            xtmp = xtmp*xtmp - x
            k = k + 1
         END DO
         mandelbrot = k
      END FUNCTION mandelbrot

The INDEPENDENT Assertion in HPF
      !HPF$ INDEPENDENT [ ,NEW (variable-list) ]
INDEPENDENT asserts that no iteration affects any other in any way.
It implements the "embarrassingly parallel" problem class we discussed under the structure of problems. Note the rest of HPF tackles mainly the synchronous problem class, with some loosely synchronous capability; HPF2 has "tasking" for the metaproblem class and some extensions for further irregular loosely synchronous problems.
NEW variables are defined to have fresh instantiations for each iteration, as is typically needed for embarrassingly parallel problems, where in fact essentially all variables in a loop would be NEW.
Note INDEPENDENT can also be applied to FORALL, where it asserts that no index point assigns to any location that another index value uses. This reduces the copying (into temporaries) that the compiler must otherwise generate for FORALL.
HPF2 (see later) has the extra feature of allowing REDUCTION (accumulation) variables in INDEPENDENT DO loops.

!HPF$ INDEPENDENT FORALL Pictorially
(Figure: data references of an INDEPENDENT FORALL.)

!HPF$ INDEPENDENT DO Pictorially
(Figure: data references of an INDEPENDENT DO.)

!HPF$ INDEPENDENT, NEW Variables
NEW variables are an exception to the conventional HPF picture of a global name space with either distributed or replicated variables.
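A sketch of NEW in action (the loop body is invented for illustration):

!HPF$ INDEPENDENT, NEW(tmp)
      DO i = 1, n
         tmp = 2.0*x(i) + 1.0    ! tmp is a fresh private instance in each iteration
         y(i) = tmp*tmp          ! each iteration writes only its own y(i)
      END DO

Without NEW(tmp), the shared scalar tmp would be written by every iteration and the INDEPENDENT assertion would be false.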
Extrinsics in HPF
An extrinsic function is a function written in a language other than HPF -- most naturally a node programming language (e.g. Fortran77) targeted at a single processor, or SPMD code with message passing such as MPI.
HPF defines the (Fortran90) interface and invocation characteristics:
  The interface defines what is input and output via INTENT, and these rules must be obeyed by the callee (the non-HPF side).
  If variables are implicitly replicated, the callee must make them consistent before return, i.e. the callee must respect the HPF model of parallelism.
  HPF will execute any remapping commands and hand the callee the "right" part of any distributed arrays.
  All processors are synchronized before the call to an EXTRINSIC function.
Extrinsics allow one to get efficient parallel code where the HPF language or compiler is inadequate.
This is analogous to calling assembly language from Fortran, native classes from Java, C from PERL etc.

High Performance Fortran -- HPF2 Changes
The original HPF 1.0 omitted some key capabilities which were known to be important, but whose syntax and functionality were unclear in 1993. Further experience has shown that HPF compilers are difficult to write!
The HPF Forum met in 1995-96 and approved a set of extensions and simplifications of HPF. The concept of a base HPF 2.0 plus Approved Extensions was agreed.
Note the approved extensions (which presumably vendors need not implement) include critical capabilities for dynamic irregular problems.
DYNAMIC, REALIGN and REDISTRIBUTE are no longer in the base language and are now just "approved extensions".
Surprisingly, no parallel I/O capabilities were approved!

ON HOME for Computation Placement
!HPF$ INDEPENDENT
      DO i = 1 , n
!HPF$    ON HOME( ix(i) )
         x(i) = y(ix(i)) - y(iy(i))
      END DO
This modifies the owner computes rule by specifying where a statement is executed, which need not be the processor owning the left hand side.
ON HOME is an approved extension.

Reductions in INDEPENDENT DO Loops
Many applications of INDEPENDENT DO loops do require reductions, as they typically calculate quantities independently but store the results as parts of various averages. For example, in High Energy Physics data analysis, each measured event can be processed by an INDEPENDENT DO, but one wishes to find particular observables (histograms, scatterplots) which are averaged over all events (a histogram version is sketched after the tasking example below). Financial modelling is similar.
      x = 0
!HPF$ INDEPENDENT, NEW(xinc), REDUCTION(x)
      DO i = 1 , N
         CALL sub(i, xinc)
         x = x + xinc
      END DO
xinc is a separate NEW variable in each iteration, but the result is accumulated into the global x.

Spawning Tasks in HPF
Task parallelism is sort of supported in HPF, but it is not clear to me that this is a great idea: it may be better to keep sophisticated task parallelism outside HPF, which is really only designed to support data parallelism.
!HPF$ TASKING
!HPF$ ON HOME(p(1:8))
      CALL foo(x,y)
!HPF$ ON HOME(p(9:16))
      CALL bar(z)
!HPF$ END
This extends the SPMD model, with foo running on eight processors and bar on another eight. Note foo and bar are expected to contain data parallel statements which distribute execution over their 8 processors using conventional HPF.
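A sketch of the histogramming pattern mentioned above (the names are invented; this assumes, as an illustration, that the HPF 2.0 REDUCTION clause may name an array variable):

      hist = 0
!HPF$ INDEPENDENT, NEW(bin), REDUCTION(hist)
      DO i = 1, nevents
         bin = INT( energy(i)/binwidth ) + 1   ! per-event work, independent across events
         hist(bin) = hist(bin) + 1             ! accumulation, safe because + commutes
      END DO

Each iteration is independent except for the commuting additions into hist, which is exactly what REDUCTION licenses.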
In this example the guard ring has extent 1: if x has dimension 100 and there are 4 processors, we allocate storage for
  x(1:26)   in processor 1
  x(25:51)  in processor 2
  x(50:76)  in processor 3
  x(75:100) in processor 4
!HPF$ DISTRIBUTE x( BLOCK( /26,24,24,26/ ) )
specifies a general BLOCK distribution; in this example
  x(1:26)   in processor 1
  x(27:50)  in processor 2
  x(51:74)  in processor 3
  x(75:100) in processor 4
!HPF$ DISTRIBUTE x( INDIRECT(map_array) )
allows an arbitrary user array to specify the location of x(i): the processor number in map_array(i) would typically be calculated by your favorite load balancing routine! (A sketch appears at the end of this section.)

New Data Mapping Features in HPF 2.0 - II
Distribution is now allowed onto processor subsets, with typical syntax:
!HPF$ PROCESSORS procs(1:np)
!HPF$ DISTRIBUTE b(BLOCK) ONTO procs(1:np/2-1)
Distribution is allowed for derived types, but can only be done at ONE level:
      TYPE bunch_of_meshes
         REAL A1(100,100,100), A2(100,100,100)
!HPF$    DISTRIBUTE (BLOCK,CYCLIC,*) :: A1, A2
      END TYPE
Another example shows distribution outside the TYPE definition:
      TYPE tree                      ! each node has one hundred children
         TYPE(tree), POINTER, DIMENSION(:) :: children
!HPF$    DYNAMIC children
         REAL value
      END TYPE
      TYPE(tree) actualtree
      ALLOCATE( actualtree%children(100) )     ! allocates POINTERs to the children
!HPF$ REDISTRIBUTE actualtree%children(BLOCK)  ! distributes (for the first time) these pointers
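A sketch of how INDIRECT might be used together with a load balancer (the routine my_load_balancer is invented for illustration):

      REAL x(100)
      INTEGER map_array(100)
!HPF$ PROCESSORS P(4)
!HPF$ DYNAMIC :: x
      ...
      CALL my_load_balancer(map_array)    ! invented routine: sets map_array(i) to the
                                          ! processor (1..4) that should own x(i)
!HPF$ REDISTRIBUTE x( INDIRECT(map_array) ) ONTO P

Because the map is computed at run time, the array is declared DYNAMIC and remapped with REDISTRIBUTE rather than given a static DISTRIBUTE.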