Given by Geoffrey C. Fox, Tom Haupt at Basic Simulation Track for Computational Science CPS615 on Fall Semester 96. Foils prepared 17 Sept 1996
Summary of Material
A brief discussion of Fortran90 and Fortran77 and why Fortran90 has advantages and disadvantages |
Overview of Key Features of Fortran90 |
See Metcalf and Reid, Fortran90 Explained, Oxford Scientific Publications |
Overview of Key Features of HPF |
The Future -- HPF2 |
See Chuck Koelbel from Rice University at |
http://renoir.csc.ncsu.edu/MRA/HTML/Workshop2/Koelbel |
Geoffrey Fox, Tom Haupt |
NPAC |
Room 3-131 CST |
111 College Place |
Syracuse NY 13244-4100 |
To express data parallelism and so hide machine dependent features of parallel programming from the user |
We use Fortran90 as the base language both because its array notation naturally expresses data parallelism and because it is a modern improvement of Fortran77, the dominant language of scientific computing |
Use of Fortran90 is a Problem because it is a large and complex language for which (in 1996) mature compilers are scarce |
DO 1 I = 1,N
1     A(I) = B(I)
is obviously parallel |
Fortran90 Array Notation |
DO 1 I = 1,N
1     A(J) = B(K)
(with J and K computed from I) is not so obviously parallel |
Deciding on the existence and implementation of parallelism needs an algorithm that is difficult to define in the general case (especially if IF statements, i.e. conditionals, are involved in defining J and K) |
Use of Fortran77 has "thrown away" the natural parallelism at the language level, even though the run time restores it by creating explicit values for variables such as J and K, which at compile time are only known by analysis. |
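A minimal sketch of the contrast (assuming conformable REAL arrays A and B): in Fortran90 the parallelism is explicit in the array syntax, while in the equivalent Fortran77 loop the compiler must rediscover it.
      REAL A(100), B(100)
! Fortran90: one array assignment, explicitly parallel
      A = B
! Fortran77: the same operation as an indexed loop
      DO 1 I = 1, 100
1        A(I) = B(I)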
Arrays are very well supported with memory allocation and set of intrinsics, better passing to procedures etc. |
Derived Types allow general object structure (without inheritance) in F90 |
Modules replace COMMON INCLUDE etc. |
Procedures (functions,subroutines) allow better interfaces, recursion, optional parameters etc. |
Better Syntax with free form, more loop control etc. |
Arrays are "true" objects in Fortran90 and are stored as values of elements plus a data descriptor! |
There are operations on full arrays which have natural parallel implementations seen in HPF |
New set of array intrinsic (built-in) functions for elements, reductions (process an array into a single value), transformational (reshaping) |
Extract sections (subarrays) of arrays as u(lb:ub:step) |
masked (conditional) Array operations using WHERE .... ELSEWHERE |
Can still do Fortran77 array element operations (DO loops), but of course these might not be interpretable for efficient parallelism by the HPF compiler |
Note Fortran90 was designed for science and engineering, originally with special concern for vector supercomputers, yet Cray supports F77 better than F90(!) |
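A sketch of the masked array operations just mentioned (u and v assumed to be conformable REAL arrays):
      WHERE ( u /= 0. )
         v = 1./u          ! executed only where the mask is true
      ELSEWHERE
         v = 0.            ! executed on the remaining elements
      END WHERE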
ALLOCATABLE Arrays can be defined at runtime with variable sizing |
One can define POINTER and TARGET attributes which can be used like REAL, DIMENSION etc. |
Arguments of a subroutine need NOT define array dimensions in the subroutine, as these are passed by the calling program in the data descriptor |
Local arrays are created on the stack; bounds may be non-constant and are evaluated at procedure entry |
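A minimal sketch of run-time sizing (the array name work and the sizes n, m are illustrative):
      REAL, ALLOCATABLE :: work(:,:)
      INTEGER :: n, m
      n = 64; m = 128              ! sizes known only at run time
      ALLOCATE( work(n,m) )        ! storage created here
      work = 0.                    ! full-array operations apply as usual
      DEALLOCATE( work )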
One passes "assumed-shape" arrays from the calling routine to the called routine using the INTERFACE syntax |
INTERFACE
   SUBROUTINE residual( r, u, f )
      REAL :: r(:,:), u(:,:), f(:,:)     ! assumed-shape dummy arguments
   END SUBROUTINE residual
END INTERFACE
is called by |
call residual (r,u,f) or |
call residual ( r(0:nx:2, 0:ny:2) , u(0:nx:2, 0:ny:2) , f(0:nx:2, 0:ny:2) ) |
where the latter example just processes every other element of the arrays |
REAL u(0:nx,0:ny), A(100,100) , fact , avg |
u = fact * (u - avg) scales and translates all elements of u |
u = .25*( CSHIFT(u,1,1) + CSHIFT(u,-1,1) + CSHIFT(u,1,2) + CSHIFT(u,-1,2) ) |
calculates the average of the 4 array elements surrounding each point (the result is an array, so it is assigned to the array u). Note the third argument of CSHIFT is the label for the axis (1=x, 2=y) |
SQRT( A(1:100,1) ) calculates a new array containing 100 square roots |
SUM(A) is a reduction operator summing all elements of array A into a scalar |
SIZE(A,1) is an Array Query Intrinsic giving size of A in the first dimension and is particularly useful for "assumed-shape" arrays passed into subroutines |
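For example, a sketch of SIZE at work inside a routine receiving an assumed-shape array (the routine name is illustrative):
      SUBROUTINE report_shape( u )
         REAL :: u(:,:)                     ! assumed-shape dummy argument
         PRINT *, SIZE(u,1), SIZE(u,2)      ! extents come from the data descriptor
      END SUBROUTINE report_shape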
TYPE PERSON
   CHARACTER(LEN=10) :: NAME
   INTEGER :: AGE
   INTEGER :: ID          ! component order matches the constructor example below
END TYPE PERSON |
TYPE(PERSON) YOU,ME |
The Identification number of YOU would be accessed as YOU%ID as an ordinary integer |
One can define global operators so that YOU+ME could be defined |
One can use name of derived type as a constructor |
YOU = PERSON ('Pamela Fox', 12, 3) |
One can define a linked list as: |
TYPE ENTRY
   REAL :: VALUE
   INTEGER :: INDEX
   TYPE(ENTRY), POINTER :: NEXT      ! points to the next entry in the list
END TYPE ENTRY |
ALLOCATE dynamically creates elements of a linked list |
ALLOCATE( CURRENT ) |
CURRENT = ENTRY( NEW_VALUE, NEW_INDEX, FIRST ) |
FIRST => CURRENT |
adds a new entry at the start of the linked list and makes the POINTER FIRST point to it |
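A sketch of traversing such a list (assuming the NEXT component above, with the final NEXT nullified):
      CURRENT => FIRST
      DO WHILE ( ASSOCIATED(CURRENT) )
         PRINT *, CURRENT%VALUE, CURRENT%INDEX
         CURRENT => CURRENT%NEXT       ! step to the next entry
      END DO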
General Syntax is: |
MODULE name
   ! specification statements (types, variables, interfaces)
CONTAINS          ! this part is optional
   ! definitions of module procedures
END MODULE name |
MODULE IllustratingCommonBlock
   ! variables declared here are shared by every routine that USEs the module
END MODULE IllustratingCommonBlock |
replaces COMMON construct and can be used as |
USE IllustratingCommonBlock |
MODULE INTERVAL_ARITHMETIC
   TYPE INTERVAL
      REAL :: LOWER, UPPER
   END TYPE INTERVAL
   INTERFACE OPERATOR(+)
      MODULE PROCEDURE ADD_INTERVALS
   END INTERFACE
CONTAINS
   ! FUNCTION ADD_INTERVALS(A,B) returns the interval sum of A and B
END MODULE INTERVAL_ARITHMETIC |
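Usage is then (a sketch; the variable names are illustrative):
      USE INTERVAL_ARITHMETIC
      TYPE(INTERVAL) :: A, B, C
      C = A + B          ! resolved to ADD_INTERVALS via OPERATOR(+)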
What is HPF, what we need it for, where it came from |
How does HPF Get its Parallelism |
Why is it called "High Performance"? |
What are HPF compiler directives |
Data mapping in HPF |
Parallel statements and constructs in HPF |
Latest Discussions -- HPF-2 |
Rice has taken the lead in the HPF Forum, which is a much faster mechanism for reaching agreement than the formal ten-year process which Fortran90 suffered |
World Wide Web pages at Rice and Vienna |
Mailing List is majordomo@cs.rice.edu and choose list (hpff, hpff-interpret, hpff-core) you wish to subscribe to |
Anonymous FTP to titan.cs.rice.edu and look at |
Explicit Message Passing as in PVM or MPI |
The user breaks the program into parts, and the parts send messages between themselves to implement the communication necessary for synchronization and for integration of the parts into the solution of a single program |
This matches the hardware but is not particularly natural for the problem and can be machine dependent |
Object Oriented programming is like message passing but now objects and not programs communicate |
Data Parallelism is higher level than either message passing or object models (where objects are used to break up data to respect the computer) |
It provides a Shared Memory Programming Model which can be executed on SIMD or MIMD computers, distributed or shared memory computers |
Note it specifies problem not machine structure |
It in principle provides the most attractive machine independent model for programmers as it reflects problem and not computer |
Its disadvantage is that it is hard to build compilers, especially for the most interesting new algorithms, which are dynamic and irregular! |
Parallelism in HPF is expressed explicitly |
Compiler may choose not to exploit information about parallelism |
Compiler may detect parallelism in sequential code |
A=B or more interestingly |
WHERE( B > 0. ) |
   A = B |
ELSEWHERE |
   A = 0. |
END WHERE |
can be written |
DO I = n1,n2 |
   DO J = m1,m2 |
      IF( B(I,J) > 0. ) THEN |
         A(I,J) = B(I,J) |
      ELSE |
         A(I,J) = 0. |
      END IF |
   END DO |
END DO |
Now a good HPF compiler will recognize that the DO loops can be parallelized and will give the same answer for the Fortran90 and Fortran77 forms, but often the detection of parallelism is not so clear |
Note FORALL is guaranteed to be parallelizable as, by definition, it has no side effects. |
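The same masked computation can be written as a pair of masked FORALLs, a form guaranteed parallelizable (a sketch using the loop bounds above):
      FORALL ( i = n1:n2, j = m1:m2, B(i,j) > 0. )  A(i,j) = B(i,j)
      FORALL ( i = n1:n2, j = m1:m2, B(i,j) <= 0. ) A(i,j) = 0.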
All of Fortran90 |
New instructions FORALL and INDEPENDENT enhancing DO loops |
Data Alignment and Distribution Assertions |
Miscellaneous Support Operations but |
NO parallel Input/Output |
Little Support for Irregular Computations |
Little Support for any form of non mainstream data-parallelism |
Extrinsics, supporting links with explicit message-passing |
There is tradeoff between parallelism and communication |
Programmer defines the data mapping and compiler uses this to assign processing |
Underlying assumptions are that: |
An operation on two or more data objects is likely to be carried out much faster if they all reside in the same processor, |
And that it may be possible to carry out many such operations concurrently if they can be performed on different processors |
This is embodied in the "owner computes" rule -- namely that, for instance, in A(i) = B(i) + C(i) the calculation is performed by the processor where the left-hand side A(i) resides |
Owner computes algorithm is usually good and often best |
The directives are structured comments that suggest implementation strategies or assert facts about a program to the compiler |
They may affect the efficiency of the computation performed, but do not change the value computed by the program |
As with Fortran 90 statements, there are both declarative forms (e.g. DISTRIBUTE) and executable forms (e.g. REDISTRIBUTE) |
It must generate Fortran77(90) plus message-passing code, or possibly map HPF code onto parallel machine code in one pass |
Traditional dataflow and dependency analysis is especially critical in Fortran77 parts of code |
It must use data mapping assertions to decide what is stored where and so organize computation |
Code must be transformed to respect this owner-computes model |
It must typically use the "Loosely Synchronous" model with communicate-compute phases, with the compiler generating all the communication needed |
We need an excellent run-time library which the compiler invokes with parallel Intrinsics etc. |
HPF directives are consistent with Fortran 90 syntax except for the special prefix for a directive: !HPF$ (or CHPF$ / *HPF$ in fixed source form) |
Two forms of the directives are allowed: the statement form, as in !HPF$ DISTRIBUTE A(BLOCK), and the attributed form, as in !HPF$ DISTRIBUTE(BLOCK), DYNAMIC :: A |
Data Mapping in HPF is all you need to do to get parallelism as long as you use the explicit array type syntax such as A=B+C |
The Owner Computes rule implies that specifying location of variables specifies (optimally or not) parallel execution! |
The new HPF-2 ON HOME directive is an exception to this rule as it specifies where a particular statement is to be executed |
(RE)DISTRIBUTE tells you where data is to be placed |
(RE)ALIGN tells you how different data structures are to be placed relative to each other |
A Template is an abstract space of indexed positions (an "array of nothings") |
In CMFortran terminology, a Template is a set of Virtual Processors -- one per data point |
A template is declared by the TEMPLATE directive, which specifies its name, its rank (number of dimensions) and its extent in each dimension |
Examples: |
!HPF$ TEMPLATE T(100) |
!HPF$ TEMPLATE FRED(16,16) |
Abstract processors always form a rectilinear grid in 1 or more dimensions |
They are abstract coarse grain collections of data-points |
The processor arrangement is defined by the PROCESSORS directive, which specifies its name, its rank and its extent in each dimension |
Examples: |
!HPF$ PROCESSORS P(4) |
!HPF$ PROCESSORS SQUARE(2,2) |
!HPF$ PROCESSORS P(4) |
!HPF$ TEMPLATE X(40) |
!HPF$ ALIGN WITH X :: A, B, C |
!HPF$ DISTRIBUTE X(BLOCK) |
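With this mapping, BLOCK gives each abstract processor one contiguous quarter of the template; a worked illustration:
! X(1:10)  live on P(1),  X(11:20) on P(2)
! X(21:30) live on P(3),  X(31:40) on P(4)
! A, B and C follow X because of the ALIGN directive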
Syntax of Align: |
!HPF$ ALIGN alignee WITH align-target |
Alternatively |
*HPF$ ALIGN (align-source-list) WITH align-target :: alignee |
Note a colon(:) in directive denotes all values of array index |
Examples of array indices: align-dummy variables and expressions in them, as in B(I,J) or T(I+4,J+4) below |
Use of : examples: A(:) denotes all of the one-dimensional array A, and TEMPL(:,i) denotes column i of a two-dimensional template |
Ranks of the alignee and the align-target may be different |
Examples: |
!HPF$ ALIGN A(:) WITH TEMPL(:,*) |
replicates the one-dimensional array A across all columns of the two-dimensional template TEMPL ... |
... or the other way round: a * in the alignee collapses a dimension, so that its index does not affect the mapping |
while this only puts A on some parts of template... |
!HPF$ ALIGN A(:) WITH TEMPL(:,i) |
HPF allows for more general alignments such as: |
!HPF$ TEMPLATE T(12,12) |
!HPF$ ALIGN A(:,J) WITH T(:,J+1) |
!HPF$ ALIGN B(I,J) WITH T(I+4,J+4) |
Useful for simple numerical shifts as in the example, but not useful in the general case of arbitrary index values allowed by the ALIGN syntax |
Each align-dummy variable is considered to range over all valid index values for the corresponding dimension of the alignee. An align-subscript is evaluated for any specific combination of values of the align-dummy variables simply by evaluating each align-subscript as an expression. The resulting subscript values must be legitimate subscripts for the align-target |
These examples have non-unit stride as perhaps in "red-black" Iterative Solver algorithms: |
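For instance (an illustrative alignment, assuming a one-dimensional array A and template T):
!HPF$ ALIGN A(I) WITH T(2*I-1)    ! A occupies the odd ("red") positions of T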
Syntax: |
!HPF$ DISTRIBUTE distributee (dist-format) |
[ONTO dist-target] |
Allowed forms of dist-format: BLOCK, BLOCK(n), CYCLIC, CYCLIC(n), and * (meaning that dimension is not distributed) |
Examples: |
!HPF$ PROCESSORS P(4) |
!HPF$ TEMPLATE T(16) |
!HPF$ ALIGN A(:) WITH T(:) |
!HPF$ DISTRIBUTE T(BLOCK) ONTO P |
*HPF$ PROCESSORS SQUARE(2,2) |
*HPF$ TEMPLATE T(4,4) |
*HPF$ ALIGN A(:,:) WITH T(:,:) |
*HPF$ DISTRIBUTE T(BLOCK,CYCLIC) ONTO SQUARE |
We used BLOCK in the Laplace equation example, and so this is the appropriate distribution for "local" or geometric type problems |
CYCLIC was called scattered in our early work (or rather is a special case of scattered, which is perhaps a random distribution of objects onto processors); it is appropriate in cases where "load-balancing" is more important than locality |
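A worked contrast for a one-dimensional array x(8) on 2 processors (sizes chosen purely for illustration):
! BLOCK :  P(1) holds x(1:4)                P(2) holds x(5:8)
! CYCLIC:  P(1) holds x(1),x(3),x(5),x(7)   P(2) holds x(2),x(4),x(6),x(8)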
Matrix Inversion set up on two processors after 0, 2 and 4 rows/columns have been eliminated |
Note the BLOCK decomposition leads to all work being on one processor at the end, even if it starts off balanced |
Here we show a 16 by 16 array of pixels with either CYCLIC or an 8 by 8 two-dimensional (BLOCK,BLOCK) distribution |
CHPF$ PROCESSORS Q(4) |
CHPF$ TEMPLATE FRED(16,16) |
CHPF$ ALIGN A(:,:) WITH FRED(:,:) |
CHPF$ ALIGN B(I,J) WITH FRED(I+2,J+2) |
CHPF$ DISTRIBUTE FRED(BLOCK,*) |
One data mapping is often not appropriate for an entire program |
ALLOCATABLE arrays can change size |
REALIGN and REDISTRIBUTE are executable versions of the ALIGN and DISTRIBUTE commands, but are only to be used if one declares the arrays on which they act DYNAMIC |
Naturally DYNAMIC arrays can be initialized by ALIGN or DISTRIBUTE statements |
This example illustrates remapping from a one- to a two-dimensional decomposition for A, and changing B from alignment with columns to alignment with rows |
!HPF$ PROCESSORS P(64) |
!HPF$ PROCESSORS Q(8,8) |
!HPF$ DYNAMIC :: A,B |
!HPF$ ALIGN B(:) WITH A(:,*) |
!HPF$ DISTRIBUTE A(*,BLOCK) ONTO P |
!HPF$ REALIGN B(:) WITH A(*,:) |
!HPF$ REDISTRIBUTE A(CYCLIC,CYCLIC) ONTO Q |
!HPF$ PROCESSORS Q(64) |
!HPF$ ALIGN B(I) WITH A(I+N) |
!HPF$ DISTRIBUTE A(BLOCK(M)) |
!HPF$ DISTRIBUTE(BLOCK), DYNAMIC :: P |
!HPF$ REDISTRIBUTE P(CYCLIC) |
Scope of any mapping directives is a single (sub)program unit |
A template or distribution is not a first-class Fortran 90 object: |
It cannot be passed as a subprogram argument and this creates significant complication! |
HPF Compiler will typically pass an extra argument which is effectively an array-descriptor telling subroutine about distribution of passed arrays |
One can use the array query intrinsics to find out what is going on, but of course the compiler does this implicitly |
There are three typical cases: |
Subroutine requires data to use a particular mapping determined by the subroutine |
Subroutine can use any mapping, so the actual argument should be passed and used with its current mapping |
Sometimes we need to remap due to array sections being passed |
Any remappings must be undone on return from subroutine |
DISTRIBUTE -- prescribes a particular distribution for the dummy argument |
ALIGN -- prescribes the mapping of the dummy argument relative to another object |
INHERIT -- the dummy argument simply keeps (inherits) the mapping of the actual argument |
(not a comprehensive discussion; just an example) |
PROCESSORS |
TEMPLATE |
ALIGN |
DISTRIBUTE |
INHERIT |
DYNAMIC |
REALIGN |
REDISTRIBUTE |
An operation on two or more data objects is likely to be carried out much faster if they all reside in the same processor |
and it may be possible to carry out many such operations concurrently if they can be performed on different processors |
Parallel Statements |
Parallel Constructs |
Intrinsic functions and the HPF library |
Extrinsic functions |
This is as in CMFortran and Maspar MPFortran, with example: |
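For instance (an illustrative full-array statement of the kind meant here, cf. A=B+C earlier):
      A = B + C      ! all elements computed independently, hence in parallel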
Semantics of WHERE statement: the mask is evaluated first, and each array assignment is then carried out only for the elements where the mask is true |
There is a fundamental difference in semantics between IF...ELSE and WHERE...ELSEWHERE constructs |
elemental |
transformational and inquiry functions |
new array reduction functions |
array combining scatter functions |
array prefix and suffix functions |
array sorting functions |
bit manipulation functions |
mapping inquiry subroutines |
X=SUM(A) sums all elements of A and places result in scalar X |
Y = SUM_PREFIX(A) sets array Y of same size as A so that Y(i) has the sum of all A(j) for 1 <= j <= i |
Y = SUM_SCATTER(A, B, IND) sets array element Y(i) to the sum of B(i) plus those elements A(j) for which IND(j) = i |
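A worked instance with small illustrative arrays:
! A = (/1,2,3,4/), IND = (/1,2,1,2/), B = (/10,20/)
! Y(1) = B(1) + A(1) + A(3) = 10 + 1 + 3 = 14
! Y(2) = B(2) + A(2) + A(4) = 20 + 2 + 4 = 26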
A very important extension to Fortran 90 and defines one class of parallel DO loop |
FORALL will be a language feature of Fortran95 |
It relaxes the restriction that operands of the rhs expressions must be conformable with the lhs array |
It may be masked with a scalar logical expression (extension of WHERE construct) |
A FORALL statement may call user-defined (PURE) functions on the elements of an array, simulating Fortran 90 elemental function invocation (albeit with a different syntax) |
FORALL( index-spec-list [,mask-expr] ) forall-assignment |
FORALL (i=1:100, k=1:100) a(i,k) = b(i,k) is equivalent to A = B |
FORALL (i=2:100:2) a(i) = a(i-1) is equivalent to A(2:100:2) = A(1:99:2) |
FORALL (i=1:100) a(i) = i is equivalent to A = (/ (i, i=1,100) /) |
FORALL (i=1:100, j=1:100) a(i, j) = i+j |
FORALL (i=1:100) a(i,i) = b(i) |
FORALL (i=1:100, j=1:100) a(i,j) = b(j,i) |
FORALL (i=1:100) a(i, 1:100) = b(1:100, i) |
FORALL (i=1:100, j=1:100, y(i,j).NE.0) x(i,j) = REAL(i+j)/y(i,j) |
FORALL (i=1:100) a(i,ix(i)) = x(i) |
FORALL (i=1:9) x(i) = SUM(x(1:10:i)) |
FORALL (i=1:100) a(i) = myfunction(a(i+1)) |
Similar to Fortran 90 array assignments and WHERE |
Consider the example FORALL( i=1:n ) a(ix(i)) = a(i) |
This is allowed in HPF but will only give sensible reproducible results if ix(i) is a true permutation of 1...n |
Of course we always use the "old" values of a(i) on the rhs, so that if for instance ix(i) = i+1 (and a(0) is defined), the result is a uniform shift of the old values rather than a sequential recurrence |
FORALL( index-spec-list [,mask-expr] ) |
forall-body |
END FORALL |
where forall-body can be a list of forall-assignment statements, FORALL or WHERE statements |
So multi-statement FORALLs support nesting of FORALLs, but a multi-statement FORALL is in general shorthand for a sequence of single-statement FORALLs, with by definition each statement completed before the next one begins |
The multi-statement FORALL is likely to be more efficient than several single-statement ones, as the latter have a synchronization overhead on each statement |
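A sketch of a multi-statement FORALL (arrays a and b illustrative; the first statement completes over all (i,j) before the second begins):
      FORALL ( i=2:n-1, j=2:n-1 )
         a(i,j) = 0.25*( a(i-1,j) + a(i+1,j) + a(i,j-1) + a(i,j+1) )
         b(i,j) = a(i,j)          ! sees the a values updated by the previous statement
      END FORALL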
PURE functions have no side effects |
DO loops can call any function, and the parallelism is then unclear, as a function call can destroy parallelism |
If FUNC alters A(I-1) or in fact A(any index except I), then this loop cannot be easily parallelized |
FORALL statements can only call PURE functions, and these must NOT define any global (e.g. any element of A in the example) or dummy (A(i-1) or X) variable |
FORALL( i=1:n, j=1:m ) |
   k(i,j) = mandelbrot( CMPLX(x(i),y(j)), itol )   ! illustrative body: arrays k, x, y assumed |
END FORALL |
This can call the PURE function mandelbrot which is essentially a generalized intrinsic |
PURE INTEGER FUNCTION mandelbrot (x,itol) |
   COMPLEX, INTENT(IN) :: x |
   INTEGER, INTENT(IN) :: itol |
   ! body: iterate z = z*z + x up to itol times, with no side effects |
END FUNCTION mandelbrot |
!HPF$ INDEPENDENT [ ,NEW (variable-list) ] |
INDEPENDENT asserts that no iteration affects any other in any way |
It implements the "embarrassingly parallel" problem class we discussed under the structure of problems |
Note the rest of HPF tackles mainly the synchronous problem class, with some loosely synchronous capability |
NEW variables are defined to have fresh instantiations for each iteration, as is typically needed for embarrassingly parallel problems where in fact essentially all variables in a loop would be NEW |
Note INDEPENDENT can be applied to FORALL, where it asserts that no index point assigns to any location that another index value uses (see the sketch below) |
HPF2 (see later) has extra feature of allowing REDUCTION (accumulated) variables in INDEPENDENT DO loops |
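A sketch of the INDEPENDENT assertion in use (the programmer promises ix contains no repeated values):
!HPF$ INDEPENDENT
      DO i = 1, n
         a(ix(i)) = b(i)      ! no two iterations touch the same element
      END DO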
This is an exception from the conventional HPF picture of a global name space with either distributed or replicated variables |
An extrinsic function is a function written in a language other than HPF, most naturally any node programming language (e.g. Fortran77 targeted to a single processor, SPMD style) with message passing such as MPI |
HPF defines the (Fortran90) interface and invocation characteristics |
Allows one to get efficient parallel code where the HPF language or compiler is inadequate |
The original HPF 1.0 omitted some key capabilities which were known to be important, but whose syntax and functionality were unclear in 1993 |
The HPF Forum met in 1995-96 and has approved a set of Extensions and Simplifications of HPF |
The concept of a base HPF 2.0 and Approved Extensions has been agreed |
Note approved extensions (which presumably vendors need not implement) include critical capability for dynamic irregular problems |
DYNAMIC, REALIGN and REDISTRIBUTE are no longer in the base language and are just "approved extensions" |
Surprisingly, no parallel I/O capabilities were approved! |
!HPF$ INDEPENDENT |
      DO i = 1 , n |
!HPF$    ON HOME( b(i) )            ! illustrative: run iteration i where b(i) lives |
         a(ix(i)) = b(i) |
      END DO |
This modifies the owner computes rule by specifying where the computation is performed, which need not be the processor owning the left-hand side |
ON HOME is an approved extension |
Many applications of INDEPENDENT DO loops do require reductions, as they typically calculate quantities independently but store the results as parts of various averages |
e.g. in High Energy Physics data analysis, each measured event can be processed via an INDEPENDENT DO, but one wishes to form particular observables (histogram, scatterplot) which are averaged over all events |
Financial modelling is similar |
x = 0 |
!HPF$ INDEPENDENT, NEW(xinc), REDUCTION(x) |
      do i = 1 , N |
         xinc = ...                  ! illustrative: some per-iteration contribution |
         x = x + xinc |
      END DO |
xinc is a separate new variable in each iteration, but the result is accumulated into the global x |
Task Parallelism is sort of supported in HPF, but it is not clear to me that this is a great idea, as it is better to keep sophisticated task parallelism outside HPF, which is really only designed to support data parallelism |
!HPF$ TASKING |
!HPF$    ON( P(1:8) ) |
         CALL foo            ! illustrative calls; foo and bar are data parallel routines |
!HPF$    ON( P(9:16) ) |
         CALL bar |
!HPF$ END |
This extends the SPMD model, with foo running on eight processors and bar on another set |
Note foo and bar are expected to contain data parallel statements which distribute execution using conventional HPF over 8 processors |
!HPF$ DISTRIBUTE x( BLOCK(SHADOW = 1) ) |
gives each block an overlap ("shadow" or ghost) region of width 1 |
!HPF$ DISTRIBUTE x( BLOCK( /26,24,24,26/ ) ) |
gives contiguous blocks of the explicitly specified, unequal sizes |
!HPF$ DISTRIBUTE x( INDIRECT(map_array) ) |
maps element i to the processor given by map_array(i) |
Distribution is now allowed onto Processor Subsets, with typical syntax !HPF$ DISTRIBUTE x(BLOCK) ONTO P(2:5) |
Distribution is allowed for Derived Types but can only be done at ONE level |