Given by Geoffrey C. Fox at Delivered Lectures of CPS615 Basic Simulation Track for Computational Science on 26 September 96. Foils prepared 26 September 1996
Summary of Material
We go through the 2D Laplace's Equation with both HPF and MPI for Simple Jacobi Iteration |
HPF and Fortran90 are reviewed followed by MPI |
We also discuss the structure of problems as these determine why and when certain software approaches are appropriate |
Geoffrey Fox |
NPAC |
Room 3-131 CST |
111 College Place |
Syracuse NY 13244-4100 |
Solve Laplace's Equation: |
on a rectangular domain, with specified boundary conditions. |
Use the simple iterative Jacobi algorithm: |
f_new = ( f_left + f_right + f_up + f_down ) / 4 |
[Figure: 256 grid points decomposed over a 16-node concurrent processor, 16 grid points in each processor. Interior points (f unknown) are updated, boundary points (f known) are fixed, and X denotes the f values that must be communicated between processors.] |
Fragments of the sequential Fortran code from the foil: each sweep begins at BEGIN with TEST = 0, statement 1 stores the updated value, and statement 2 copies the new solution back before returning to BEGIN: |
BEGIN TEST = 0
    ...
  1 PHINEW(I) = TEMP
    ...
  2 PHIOLD(I) = PHINEW(I)
    ... (back to BEGIN with TEST = 0)
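These fragments appear to come from a sequential Fortran77 Jacobi sweep; a minimal sketch of the loop they belong to might read as follows (the array size NTOT, tolerance EPS and boundary values are assumptions, and the one-dimensional form matches the HPF version shown later): |
C     Sketch of a sequential Fortran77 Jacobi iteration (assumed sizes)
      PARAMETER ( NTOT = 256, EPS = 1.0E-4 )
      REAL PHINEW(NTOT), PHIOLD(NTOT), TEMP, TEST
C     assumed initial guess and boundary values
      DO 5 I = 1, NTOT
    5 PHIOLD(I) = 0.0
      PHIOLD(1)    = 1.0
      PHIOLD(NTOT) = 1.0
   10 TEST = 0.0
      DO 1 I = 2, NTOT-1
         TEMP = 0.5 * ( PHIOLD(I-1) + PHIOLD(I+1) )
         TEST = MAX( TEST, ABS( TEMP - PHIOLD(I) ) )
    1 PHINEW(I) = TEMP
      DO 2 I = 2, NTOT-1
    2 PHIOLD(I) = PHINEW(I)
      IF ( TEST .GT. EPS ) GO TO 10
      END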
Data Parallel typified by CMFortran and its generalization - High Performance Fortran which we have discussed separately |
Typical Data Parallel Fortran Statements are full array statements
|
Message Passing, typified by the later discussion of the Laplace example, specifies particular machine actions, i.e. send a message between nodes, whereas the data parallel model is at a higher level as it (tries to) specify a problem feature |
Note: We are always using "data parallelism" at problem level whether software is "message passing" or "data parallel" |
Data parallel software is translated by a compiler into "machine language" which is typically message passing |
A particular way of using MIMD machines - DOMINANT successful use so far. |
Each processor runs same program but in general executes different instructions at different times. |
We will later see that this corresponds to "loosely synchronous problems". |
This is the style of the current program example -- note that although each node is doing roughly the same thing, i.e. updating grid points, the nodes are NOT at the same point in the update at each clock cycle |
!HPF$ TEMPLATE WORLD(NTOT) (1) |
!HPF$ DISTRIBUTE WORLD(BLOCK) (2)
|
!HPF$ ALIGN PHINEW WITH WORLD (3) |
!HPF$ ALIGN PHIOLD WITH WORLD (3)
|
BEGIN PHINEW (2:NTOT1) =0.5* (EOSHIFT (PHIOLD,1) + EOSHIFT (PHIOLD, -1)) (4)
|
(1) Defines a data world of NTOT entries |
(2) Breaks up world into equal parts in each processor decomposed in one dimension |
(3) Instructs that PHINEW AND PHIOLD are aligned exactly with world i.e. PHINEW(I) and PHIOLD(I) are stored in the I'th position of world |
(4) EOSHIFT is "End-Off" SHIFT. The calls shift PHIOLD by one both to left and right. The indices take care of WORLD (1 and NTOT) being boundary values
|
BEGIN PHINEW(2:NTOT1) = 0.5 * (PHIOLD (1:NTOT2) + PHIOLD (3:NTOT)) |
(5) Forms TEST as the maximum absolute change in PHI at any world location. MAXVAL is a Fortran90 intrinsic function. |
NOTE: Any subscript n:m is an "array section" or set of indices |
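Putting the numbered directives and array statements together, a minimal sketch of the whole HPF Jacobi iteration might read as follows (the declarations, the convergence loop, the tolerance EPS and the boundary values are assumptions added around the statements on the foil; the TEST statement (5) is written out as the MAXVAL expression described above): |
      PROGRAM JACOBI_HPF
      INTEGER, PARAMETER :: NTOT = 256, NTOT1 = NTOT-1, NTOT2 = NTOT-2
      REAL,    PARAMETER :: EPS = 1.0E-4
      REAL :: PHINEW(NTOT), PHIOLD(NTOT), TEST
!HPF$ TEMPLATE WORLD(NTOT)
!HPF$ DISTRIBUTE WORLD(BLOCK)
!HPF$ ALIGN PHINEW WITH WORLD
!HPF$ ALIGN PHIOLD WITH WORLD
      PHIOLD = 0.0                            ! assumed initial guess
      PHIOLD(1) = 1.0 ; PHIOLD(NTOT) = 1.0    ! assumed boundary values
      PHINEW = PHIOLD
      TEST = 2.0*EPS
      DO WHILE ( TEST > EPS )
         PHINEW(2:NTOT1) = 0.5 * ( PHIOLD(1:NTOT2) + PHIOLD(3:NTOT) )      ! statement (4)
         TEST = MAXVAL( ABS( PHINEW(2:NTOT1) - PHIOLD(2:NTOT1) ) )         ! statement (5)
         PHIOLD(2:NTOT1) = PHINEW(2:NTOT1)
      END DO
      END PROGRAM JACOBI_HPF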
To express data parallelism and so hide machine dependent features of parallel programming from the user |
We use Fortran90 as base language both because it is
|
Use of Fortran90 is a Problem because
|
Inside a Fortran77 DO loop, the body 1 A(I)=B(I) is obviously parallel |
In Fortran90 Array Notation this is simply A = B |
A loop body such as 1 A(J)=B(K), where the indices J and K are computed elsewhere, is not so obviously parallel |
It needs a difficult to define (in the general case) algorithm (especially if IF statements, i.e. conditionals, define J and K) to decide on the existence and implementation of parallelism |
Use of Fortran77 has "thrown away" the natural parallelism at the language level, even though the run-time restores it by creating explicit values for variables such as J and K, which a compiler can recover only by analysis at compile time. |
Arrays are very well supported with memory allocation and set of intrinsics, better passing to procedures etc.
|
Derived Types allow general object structure (without inheritance) in F90
|
Modules replace COMMON INCLUDE etc. |
Procedures (functions,subroutines) allow better interfaces, recursion, optional parameters etc. |
Better Syntax with free form, more loop control etc. |
Arrays are "true" objects in Fortran90 and are stored as values of elements plus a data descriptor! |
There are operations on full arrays which have natural parallel implementations seen in HPF |
New set of array intrinsic (built-in) functions for elements, reductions (process array into a single value), transformational (reshaping)
|
Extract sections (subarrays) of arrays as u(lb:ub:step)
|
masked (conditional) Array operations using WHERE .... ELSEWHERE |
Can still do Fortran77 array element operations (DO Loops) but of course these might not be recognized and translated into efficient parallelism by the HPF compiler |
Note Fortran90 designed for science and engineering with originally special concern for vector supercomputers but Cray supports F77 better than F90(!) |
ALLOCATABLE Arrays can be defined at runtime with variable sizing
|
One can define POINTER and TARGET attributes which can be used like REAL, DIMENSION etc.
|
Arguments of a subroutine need NOT define array dimensions in the subroutine as these are passed by the calling program in the data descriptor |
Local arrays are created on the stack, and bounds may be non-constant and evaluated at procedure entry |
One passes "assumed-shape" arrays from calling to called routines using INTERFACE syntax |
An INTERFACE ... END INTERFACE block declaring residual (sketched below) is called by |
call residual (r,u,f) or |
call residual ( r(0:nx:2, 0:ny:2) , u(0:nx:2, 0:ny:2) , f(0:nx:2, 0:ny:2) ) |
where latter example just processes every other element of arrays |
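The INTERFACE block itself was not readable on the foil; a plausible form, assuming residual takes three assumed-shape two-dimensional REAL arrays (the declarations here are an assumption), is: |
      INTERFACE
         SUBROUTINE residual( r, u, f )
            REAL, DIMENSION(:,:) :: r, u, f     ! assumed-shape dummy arrays
         END SUBROUTINE residual
      END INTERFACE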
REAL u(0:nx,0:ny), A(100,100) , fact , avg |
u= fact * (u -avg) Scales and translates all elements of u |
avg = .25*( CSHIFT(u,1,1) + CSHIFT(u,-1,1) + CSHIFT(u,1,2) + CSHIFT(u,-1,2) ) |
calculates the average of the 4 array elements surrounding each point. Note the third argument of CSHIFT is the label for the axis (1=x, 2=y) |
SQRT( A(1:100) ) calculates a new array containing 100 square roots |
SUM(A) is a reduction operator summing all elements of array A into a scalar |
SIZE(A,1) is an Array Query Intrinsic giving size of A in the first dimension and is particularly useful for "assumed-shape" arrays passed into subroutines |
TYPE PERSON ... END TYPE PERSON (the component definitions are sketched below) |
TYPE(PERSON) YOU,ME |
The Identification number of YOU would be accessed as YOU%ID as an ordinary integer |
One can define global operators so that YOU+ME could be defined |
One can use name of derived type as a constructor |
YOU = PERSON ('Pamela Fox', 12, 3) |
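The component definitions of TYPE PERSON were not shown on the foil; a plausible sketch consistent with the constructor call and the YOU%ID reference (the NAME and AGE components are assumptions) is: |
      TYPE PERSON
         CHARACTER (LEN = 10) :: NAME
         REAL                 :: AGE
         INTEGER              :: ID
      END TYPE PERSON

      TYPE(PERSON) :: YOU, ME
      YOU = PERSON( 'Pamela Fox', 12, 3 )   ! constructor fills NAME, AGE, ID in order
      PRINT *, YOU%ID                       ! the identification number, here 3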
One can define a linked list as: |
TYPE ENTRY ... END TYPE ENTRY (the component definitions are sketched below) |
ALLOCATE dynamically creates the elements of a linked list |
CURRENT = ENTRY( NEW_VALUE, NEW_INDEX, FIRST) |
FIRST => CURRENT |
adds a new entry at the start of the linked list and makes the POINTER FIRST point at it |
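The components of TYPE ENTRY were likewise not shown; a plausible sketch of the linked list (the component names VALUE, INDEX and NEXT are assumptions chosen to match the constructor call) is: |
      PROGRAM LINKED_LIST_SKETCH
      TYPE ENTRY
         REAL                 :: VALUE
         INTEGER              :: INDEX
         TYPE(ENTRY), POINTER :: NEXT
      END TYPE ENTRY
      TYPE(ENTRY), POINTER :: FIRST, CURRENT
      NULLIFY( FIRST )                      ! the list starts empty
      ALLOCATE( CURRENT )                   ! dynamically create a new element
      CURRENT = ENTRY( 3.14, 1, FIRST )     ! constructor fills VALUE, INDEX and NEXT
      FIRST => CURRENT                      ! the new element becomes the head of the list
      END PROGRAM LINKED_LIST_SKETCH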
General Syntax is: |
MODULE name
   (specification statements: types, variables, interfaces)
CONTAINS -- this part is optional
   (module procedures)
END MODULE name |
MODULE IllustratingCommonBlock ... END MODULE IllustratingCommonBlock (body sketched below) |
replaces COMMON construct and can be used as |
USE IllustratingCommonBlock |
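The module body was not shown; a minimal sketch of shared data declared once in a module and made visible with USE (the particular variables are assumptions standing in for whatever used to live in the COMMON block) is: |
      MODULE IllustratingCommonBlock
         REAL    :: PRESSURE(1000)      ! data that would previously have sat in COMMON
         INTEGER :: NSTEPS
      END MODULE IllustratingCommonBlock

      SUBROUTINE USER_OF_MODULE
         USE IllustratingCommonBlock    ! PRESSURE and NSTEPS are now visible here
         NSTEPS   = 10
         PRESSURE = 0.0
      END SUBROUTINE USER_OF_MODULE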
MODULE INTERVAL_ARITHMETIC ... CONTAINS ... END MODULE INTERVAL_ARITHMETIC (body sketched below) |
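The body of this module was also not shown; a plausible sketch (the interval type, its components and the overloaded + operator are assumptions illustrating the standard use of CONTAINS for module procedures) is: |
      MODULE INTERVAL_ARITHMETIC
         TYPE INTERVAL
            REAL :: LOWER, UPPER
         END TYPE INTERVAL
         INTERFACE OPERATOR (+)
            MODULE PROCEDURE ADD_INTERVALS
         END INTERFACE
      CONTAINS
         FUNCTION ADD_INTERVALS( A, B ) RESULT( C )
            TYPE(INTERVAL), INTENT(IN) :: A, B
            TYPE(INTERVAL)             :: C
            C%LOWER = A%LOWER + B%LOWER     ! interval sum adds corresponding endpoints
            C%UPPER = A%UPPER + B%UPPER
         END FUNCTION ADD_INTERVALS
      END MODULE INTERVAL_ARITHMETIC
With USE INTERVAL_ARITHMETIC one could then write X = Y + Z for variables of TYPE(INTERVAL). |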
What is HPF, what we need it for, where it came from |
How does HPF Get its Parallelism |
Why is it called "High Performance"? |
What are HPF compiler directives |
Data mapping in HPF
|
Parallel statements and constructs in HPF
|
Latest Discussions -- HPF-2
|
Rice has taken the lead in the HPF Forum, which is a much faster mechanism for reaching agreement than the formal 10-year process which Fortran90 suffered |
World Wide Web pages at Rice and Vienna |
Mailing List is majordomo@cs.rice.edu and choose list (hpff, hpff-interpret, hpff-core) you wish to subscribe to |
Anonymous FTP to titan.cs.rice.edu and look at
|
Explicit Message Passing as in PVM or MPI |
User breaks program into parts and the parts send messages between them to implement communication necessary for synchronization and integration of parts into solution of a single program |
This matches hardware but is not particularly natural for problem and can be machine dependent |
Object Oriented programming is like message passing but now objects and not programs communicate
|
Data Parallelism is higher level than either message passing or object models (if objects used to break up data to respect computer) |
It provides a Shared Memory Programming Model which can be executed on SIMD or MIMD computers, distributed or shared memory computers |
Note it specifies problem not machine structure |
It in principle provides the most attractive machine independent model for programmers as it reflects problem and not computer |
Its disadvantage is that it is hard to build compilers, especially for the most interesting new algorithms, which are dynamic and irregular! |
See Parallel Computing Works for general discussion of problem architecture |
Software Maps |
Problem ---> Machine |
Often software designed around machine architecture |
Rather software should be designed around problem architecture |
Only then will we be able to persuade application scientists to parallelize |
Synchronous: Data Parallel Tightly coupled and software needs to exploit features of problem structure to get good performance. Comparatively easy as different data elements are essentially identical. |
Loosely Synchronous:
|
Asynchronous:
|
Embarrassingly parallel:
|
Metaproblems
|
See Parallel Computing Works for QCD and Synchronous problems |
eg. Quantum chromodynamics calculations (calculate proton mass) |
Regular grid in 4D space-time - domain decomposition |
Update algorithm identical at each point |
Time evolution very simple |
The world looked at microscopically in terms of a set of similar fundamental quantities |
Monte Carlo - the subtlety is that one cannot concurrently update points that are linked in the Hamiltonian |
See Parallel Computing Works for Loosely Synchronous problems |
The world looked at macroscopically in terms of interactions between irregular inhomogeneous objects evolved as a time synchronized simulation |
eg. circuit simulation for computer or biological applications |
Linked sets of neurons (components) of different types |
The time taken to model one time step of a component of type i depends on the nature of the node and the interconnect between them |
See Parallel Computing Works Asynchronous problems (including Computer Chess) |
The world looked at macroscopically in terms of interactions between irregular inhomogeneous objects evolved as an event driven simulation |
eg. Battle of Hastings |
Loosely Synchronous, Synchronous or Asynchronous classify problems by their "control" or "synchronization" structure |
However there is an important class of problems where this does not matter, as the synchronization overhead -- even in the difficult asynchronous case -- is irrelevant |
This is when the overhead is small or zero. These are "embarrassingly parallel" problems, where each component of the decomposed problem is essentially independent |
Examples are:
|
See Parallel Computing Works for Metaproblems including Command and Control |
Metaproblems are of growing importance in general HPCC community |
One important Example is: |
Manufacturing and Design including Multidisciplinary Optimization which combines many fields to look at total product design, manufacturability etc.
|
Also link in concurrent engineering
|
USMADE is US Multidisciplinary Analysis and Design Environment Project of MADIC Industrial Consortium |
A hierarchy of mapping problems |
We would like to optimize the overall mapping |
Both problems and computer are |
Collections of homogeneous or heterogeneous basic entities with some sort of static or dynamic connection between them |
Modeling involves mapping one "complex problem" into another |
Simulation or parallel computing involves map
|
What parameters of these two underlying complex systems control effectiveness of map? |
Execution time
|
User happiness
|
See Parallel Computing Works for problem architecture |
Architecture of "virtual problem" determines nature of language |
Key idea behind data-parallel languages - and perhaps all good languages |
The language expresses problem and not the machine architecture |
Need different approach to express functional parallelism |
Use"object-oriented" feature when problems has natural objects |
Do not use "object-oriented" features when objects are artifacts of target machine |
Need data and functional (object) parallel paradigms in many problems - especially multidisciplinary |
See NPAC's High Performance Fortran Applications Resource |
Should not solve ALL Problems |
Goal: Express in a high level portable scalable fashion those aspects of problems one would like to use Fortran for
|
Lessons from study of
|
Particle Dynamics and Parallel Differential Equation Solving have been studied in detail (for HPF). Other fields less completely understood. |
Data Parallelism
|
Task Parallelism
|
Metaproblems (task parallelism where each component data parallel)
|
See HPF Forum or National Software Exchange List or NPAC List for educational resources |
HPF is designed so that parallelism is essentially explicit |
This is embodied in 2 classes of operations:
|
HPF makes much greater use of runtime optimizations than the traditional sequential or parallel compiler |
Fortran 77:
      DO 1 I = 1, N
    1 A(I) = FUNC( B(I), B(I +/- 1), C(I), C(I +/- 1), ..... ) |
Traditionally arrange access to these elements in COMPILER |
HPF: A=FUNC(B,C) |
Invoke an ALREADY OPTIMIZED parallel implementation of FUNC. In many cases, the HPF compiler need not know anything about the parallelization except interfaces, data layout etc. |
Much of HPF's Parallel "Power" is buried in its library |
Personal note: Fortran 90D motivated by similarities between Caltech message passing "collective" communication and F90 Intrinsics |
18 Non-elemental (true array) Fortran 90 Intrinsics e.g. CSHIFT, SUM |
7 new HPF Intrinsics e.g. Alignment, Template, Distribution inquiries |
4 new HPF Reduction (combine) Functions |
11 new HPF Combine-Scatter Functions |
22 new HPF Parallel Prefix Functions |
2 new HPF Parallel Sorting Functions |
3 other new HPF Parallel Functions |
7 HPF Local Intrinsics |
74 library routines (==> ~1000 routines when written in clumsy F77) |
Non-Elemental Fortran 90 Intrinsics
|
HPF Intrinsics
|
New Reduction Functions
|
Combining- Scatter Functions
|
Sorting Functions
|
Parallel Prefix Functions
|
Other Functions
|
See Parallel Computing Works General Discussion |
STATIC
|
ADAPTIVE
|
ASYNCHRONOUS
|
INTEGRATION
|
Standardized syntax lowers risk in application development
|
HPF, HPC++ benchmarks will enable easier comparison between different machines
|
Allows vendors to concentrate their scarce software resources
|
Parallelism is a feature of problems |
It can be expressed in different languages with trade-offs in performance, flexibility and convenience |
Fortran 77 => HPF |
Fortran 90 => HPF |
C++ => HPC++
|
HPC++ more naturally handles complex data structures e.g. data parallel collections which are not the staple array data structure of Fortran |
Must have common runtime support
|
The following foils are expanded in HPF Applications Resource at NPAC |
Somewhat related issue ?
|
Classification of Problems (Fox, 1988) |
See overview discussion in Parallel Computing Works |
Synchronous: Data Parallel Tightly coupled and software needs to exploit features of problem structure to get good performance. Comparatively easy as different data elements are essentially identical. |
Loosely Synchronous:
|
Asynchronous:
|
Embarrassingly parallel:
|
Metaproblems
|
Parallelism in HPF is expressed explicitly
|
Compiler may choose not to exploit information about parallelism |
Compiler may detect parallelism in sequential code |
A=B or more interestingly |
WHERE( B > 0. ) A = B |
ELSEWHERE A=0. |
END WHERE |
can be written |
DO I = n1, n2 |
DO J = m1, m2 |
IF ( B(I,J) > 0. ) THEN |
A(I,J) = B(I,J) |
ELSE |
A(I,J) = 0. |
END IF |
END DO |
END DO |
Now a good HPF compiler will recognize the DO loops can be parallelized and give the same answer for Fortran90 and Fortran77 forms but often the detection of parallelism is not clear |
Note FORALL is guaranteed to be parallelizable as by definition it has no side effects. |
All of Fortran90 |
New instructions FORALL and INDEPENDENT enhancing DO loops |
Data Alignment and Distribution Assertions |
Miscellaneous Support Operations but |
NO parallel Input/Output |
Little Support for Irregular Computations |
Little Support for any form of non mainstream data-parallelism |
Extrinsics as supporting links with explicit message-passing |
There is tradeoff between parallelism and communication |
Programmer defines the data mapping and compiler uses this to assign processing |
Underlying assumptions are that: |
An operation on two or more data objects is likely to be carried out much faster if they all reside in the same processor, |
And that it may be possible to carry out many such operations concurrently if they can be performed on different processors |
This is embodied in the "owner computes" rule -- namely that, in an assignment such as A(i) = B(i) + C(i), the computation is performed by the processor that owns the left-hand-side element A(i) |
|
Owner computes algorithm is usually good and often best |
The directives are structured comments that suggest implementation strategies or assert facts about a program to the compiler |
They may affect the efficiency of the computation performed, but do not change the value computed by the program |
As in Fortran 90 statements, there are both:
|
It must generate Fortran77(90) + Message Passing code or possibly in one pass map HPF code onto parallel machine code |
Traditional dataflow and dependency analysis is especially critical in Fortran77 parts of code |
It must use data mapping assertions to decide what is stored where and so organize computation |
Code must be transformed to respect this owner-computes model |
It must typically use "Loosely Synchronous" model with communicate-compute phases and then compiler generates all the communication needed
|
We need an excellent run-time library which the compiler invokes with parallel Intrinsics etc. |
HPF directives are consistent with Fortran 90 syntax except for the special prefix for directive:
|
Two forms of the directives are allowed
|
Data Mapping in HPF is all you need to do to get parallelism as long as you use the explicit array type syntax such as A=B+C |
The Owner Computes rule implies that specifying location of variables specifies (optimally or not) parallel execution! |
The new HPF-2 ON HOME directive is an exception to this rule, as it specifies where a particular statement is to be executed |
(RE)DISTRIBUTE tells you where data is to be placed |
(RE)ALIGN tells you how different data structures are to be placed relative to each other |
A Template is an abstract space of indexed positions (an "array of nothings") |
In CMFortran terminology, Template is set of Virtual Processors -- one per data point
|
A template is declared by the TEMPLATE directive that specifies:
|
Examples:
|
Abstract processors always form a rectilinear grid in 1 or more dimensions |
They are abstract coarse grain collections of data-points
|
The processor arrangement is defined by the PROCESSORS directive that specifies:
|
Examples:
|
!HPF$ PROCESSORS P(4) |
!HPF$ TEMPLATE X(40) |
!HPF$ ALIGN WITH X :: A, B, C |
!HPF$ DISTRIBUTE X(BLOCK)
|
Syntax of Align: |
!HPF$ ALIGN alignee WITH align-target
|
Alternatively |
*HPF$ ALIGN (align-source-list) WITH align-target :: alignee |
Note a colon(:) in directive denotes all values of array index |
Examples of array indices:
|
Use of : examples:
|
Ranks of the alignee and the align-target may be different |
Examples:
|
... or other way round
|
while this only puts A on some parts of template... |
!HPF$ ALIGN A(:) WITH TEMPL(:,i) |
HPF allows for more general alignments such as:
|
!HPF$ TEMPLATE T(12,12) |
!HPF$ ALIGN A(:,J) WITH T(:,J+1) |
!HPF$ ALIGN B(I,J) WITH T(I+4,J+4) |
Useful for simple numerical shifts as in the example, but not useful in the general case of arbitrary index values allowed by the ALIGN syntax |
Each align-dummy variable is considered to range over all valid index values for the corresponding dimension of the alignee. An align-subscript is evaluated for any specific combination of values of the align-dummy variables simply by evaluating each align-subscript as an expression. The resulting subscript values must be legitimate subscripts for the align-target |
These examples have non-unit stride as perhaps in "red-black" Iterative Solver algorithms: |
Syntax: |
!HPF$ DISTRIBUTE distributee (dist-format) |
[ONTO dist-target] |
Allowed forms of dist-format:
|
Examples:
|
!HPF$ PROCESSORS P(4)
|
!HPF$ TEMPLATE T(16) |
!HPF$ ALIGN A(:) WITH T(:) |
*HPF$ PROCESSORS SQUARE(2,2) |
*HPF$ TEMPLATE T(4,4) |
*HPF$ ALIGN A(:,:) WITH T(:,:) |
*HPF$ DISTRIBUTE T(BLOCK,CYCLIC) ONTO SQUARE |
We used BLOCK in the Laplace equation example and so this is appropriate distribution for "local" or geometric type problems |
CYCLIC was called scattered in our early work (or rather is a special case of scattered, which is perhaps a random distribution of objects onto processors); it is appropriate in cases where "load-balancing" is more important than locality
|
Matrix Inversion set up on two processors after 0, 2 and 4 rows/columns have been eliminated |
Note BLOCK decomposition leads to all work being on one processor at end even if starts off balanced |
Here we show a 16 by 16 array of pixels with either CYCLIC or 8 by 8 two dimensional BLOCK,BLOCK |
CHPF$ PROCESSORS Q(4) |
CHPF$ TEMPLATE FRED(16,16) |
CHPF$ ALIGN A(:,:) WITH FRED(:,:) |
CHPF$ ALIGN B(I,J) WITH FRED(I+2,J+2) |
CHPF$ DISTRIBUTE FRED(BLOCK,*) |
One data mapping is often not appropriate for an entire program
|
ALLOCATABLE arrays can change size |
REALIGN and REDISTRIBUTE are executable DISTRIBUTE and ALIGN commands but are only to be used if one declares arrays on which they act DYNAMIC |
Naturally DYNAMIC arrays can be initialized by ALIGN or DISTRIBUTE statements |
This example illustrates remapping from one to two dimensional decomposition for A and changing B from alignment with columns to alignment with rows
|
!HPF$ PROCESSORS P(64) |
!HPF$ PROCESSORS Q(8,8) |
!HPF$ DYNAMIC :: A,B |
!HPF$ ALIGN B(:) WITH A(:,*) |
!HPF$ DISTRIBUTE A(*,BLOCK)ONTO P
|
!HPF$ REALIGN B(:) WITH A(*,:)
|
!HPF$ REDISTRIBUTE A(CYCLIC,CYCLIC) ONTO Q
|
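Filling in the missing declarations, the whole remapping example might be laid out as follows (the array sizes and the comments marking the two phases of the computation are assumptions): |
      REAL A(1000,1000), B(1000)
!HPF$ PROCESSORS P(64)
!HPF$ PROCESSORS Q(8,8)
!HPF$ DYNAMIC :: A, B
!HPF$ ALIGN B(:) WITH A(:,*)                  ! B aligned with the first dimension of A
!HPF$ DISTRIBUTE A(*,BLOCK) ONTO P            ! one-dimensional decomposition over 64 processors
!        ... first phase of the computation ...
!HPF$ REALIGN B(:) WITH A(*,:)                ! B now aligned with the second dimension of A
!HPF$ REDISTRIBUTE A(CYCLIC,CYCLIC) ONTO Q    ! two-dimensional cyclic decomposition on the 8x8 grid
!        ... second phase of the computation ...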
!HPF$ PROCESSORS Q(64) |
!HPF$ ALIGN B(I) WITH A(I+N) |
!HPF$ DISTRIBUTE A(BLOCK(M)) |
!HPF$ DISTRIBUTE(BLOCK), DYNAMIC :: P
|
!HPF$ REDISTRIBUTE P(CYCLIC)
|
Scope of any mapping directives is a single (sub)program unit |
A template or distribution is not a first-class Fortran 90 object: |
It cannot be passed as a subprogram argument and this creates significant complication! |
HPF Compiler will typically pass an extra argument which is effectively an array-descriptor telling subroutine about distribution of passed arrays |
One can use array query intrinsics to find out what is going on but of course compiler does this implicitly |
There are three typical cases: |
Subroutine requires data to use a particular mapping determined by subroutine
|
Subroutine can use any mapping so actual argument should be passed and used with current mapping
|
Sometimes we need to remap due to array sections being passed |
Any remappings must be undone on return from subroutine |
DISTRIBUTE
|
ALIGN
|
INHERIT
|
(not a comprehensive discussion; just an example) |
PROCESSORS |
TEMPLATE |
ALIGN |
DISTRIBUTE |
INHERIT |
DYNAMIC |
REALIGN |
REDISTRIBUTE |
An operation on two or more data objects is likely to be carried out much faster if they all reside in the same processor
|
it may be possible to carry out many such operations concurrently if they can be performed on different processors
|
Parallel Statements
|
Parallel Constructs
|
Intrinsic functions and the HPF library |
Extrinsic functions |
This is as in CMFortran and Maspar MPFortran with example:
|
Semantics of WHERE statement:
|
There is a fundamental difference in semantics between IF...ELSE and WHERE...ELSEWHERE constructs |
elemental
|
transformational and inquiry functions
|
new array reduction functions
|
array combining scatter functions
|
array prefix and suffix functions
|
array sorting functions
|
bit manipulation functions
|
mapping inquiry subroutines
|
X=SUM(A) sums all elements of A and places result in scalar X |
Y = SUM_PREFIX(A) sets array Y of same size as A so that Y(i) has the sum of all A(j) for 1 <= j <= i |
Y = SUM_SCATTER(A,B, IND) sets array element Y(i) as the sum of array element B(i) plus those elements of A(j) where IND(j) = I
|
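A tiny worked illustration of these three calls (the values are just an example; in HPF the library routines are obtained with USE HPF_LIBRARY): |
      PROGRAM SCAN_EXAMPLE
      USE HPF_LIBRARY                      ! provides SUM_PREFIX, SUM_SCATTER etc.
      REAL    :: A(4)   = (/ 1.0, 2.0, 3.0, 4.0 /)
      REAL    :: B(3)   = (/ 10.0, 20.0, 30.0 /)
      INTEGER :: IND(4) = (/ 1, 2, 1, 3 /)
      REAL    :: X, Y(4), Z(3)
      X = SUM(A)                   ! X = 10.0
      Y = SUM_PREFIX(A)            ! Y = (/ 1.0, 3.0, 6.0, 10.0 /), the running sums
      Z = SUM_SCATTER(A, B, IND)   ! Z = (/ 10+1+3, 20+2, 30+4 /) = (/ 14.0, 22.0, 34.0 /)
      END PROGRAM SCAN_EXAMPLE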
A very important extension to Fortran 90 and defines one class of parallel DO loop |
FORALL will be a language feature of Fortran95 |
It relaxes the restriction that operands of the rhs expressions must be conformable with the lhs array |
It may be masked with a scalar logical expression (extension of WHERE construct) |
A FORALL statement may call user-defined (PURE) functions on the elements of an array, simulating Fortran 90 elemental function invocation (albeit with a different syntax) |
FORALL( index-spec-list [,mask-expr] ) forall assignment
|
FORALL (i=1:100, k=1:100) a(i,k) = b(i,k) is equivalent to A = B |
FORALL (i=2:100:2) a(i) = a(i-1) is equivalent to A(2:100:2) = A(1:99:2) |
FORALL (i=1:100) a(i) = i is equivalent to A = [1,2,...,100] |
FORALL (i=1:100, j=1:100) a(i, j) = i+j |
FORALL (i=1:100) a(i,i) = b(i) |
FORALL (i=1:100, j=1:100) a(i,j) = b(j,i) |
FORALL (i=1:100) a(i, 1:100) = b(1:100, i) |
FORALL (i=1:100, j=1:100, y(i,j).NE.0) x(i,j) = REAL(i+j)/y(i,j) |
FORALL (i=1:100) a(i,ix(i)) = x(i) |
FORALL (i=1:9) x(i) = SUM(x(1:10:i)) |
FORALL (i=1:100) a(i) = myfunction(a(i+1)) |
Similar to Fortran 90 array assignments and WHERE |
Consider the example FORALL( i=1:n ) a(ix(i)) = a(i) , which |
is allowed in HPF but will only give sensible reproducible results if ix(i) is a true permutation of 1...n
|
Of course we always use the "old" value of a(i) on the rhs, so that if ix(i) = i+1 and a(0) is defined, then the result is
|
FORALL( index-spec-list [,mask-expr] ) |
forall-body |
END FORALL |
where forall-body can be a list of forall-assignment statements, FORALL or WHERE statements |
So Multi-Statement FORALLs support nesting of FORALLs, but |
are in general shorthand for a sequence of single-statement FORALLs with, by definition, each statement completed before the next one begins |
The multi-statement FORALL is likely to be more efficient than several single statement ones as latter have synchronization overhead on each statement |
PURE functions have no side effects
|
DO loops can call any function, and the parallelism is unclear as a function call can destroy parallelism
|
If FUNC alters A(I-1) or in fact A(any index except I), then this loop cannot be easily parallelized |
FORALL statements can only call PURE functions and these must NOT define any global (e.g. any element of A in example) or dummy (A(i-1) or X) variable
|
FORALL( i=1:n, j=1:m )
|
END FORALL |
This can call the PURE function mandelbrot which is essentially a generalized intrinsic |
PURE INTEGER FUNCTION mandelbrot (x,itol)
|
END FUNCTION mandelbrot |
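Neither the FORALL body nor the function body was shown on the foil; a plausible sketch (the grid size, scaling and iteration limit are assumptions; the key point is that mandelbrot is PURE and so may be called inside FORALL) is: |
      PROGRAM MANDEL_SKETCH
      INTEGER, PARAMETER :: n = 64, m = 64, itmax = 255
      REAL,    PARAMETER :: dx = 4.0/n, dy = 4.0/m
      INTEGER :: k(n,m), i, j
      FORALL ( i = 1:n, j = 1:m )
         k(i,j) = mandelbrot( CMPLX( -2.0 + (i-1)*dx, -2.0 + (j-1)*dy ), itmax )
      END FORALL
      CONTAINS
         PURE INTEGER FUNCTION mandelbrot( x, itol )
            COMPLEX, INTENT(IN) :: x
            INTEGER, INTENT(IN) :: itol
            COMPLEX :: z
            INTEGER :: iter
            z    = x
            iter = 0
            DO WHILE ( ABS(z) < 2.0 .AND. iter < itol )
               z    = z*z + x             ! the usual Mandelbrot iteration
               iter = iter + 1
            END DO
            mandelbrot = iter             ! number of iterations before divergence
         END FUNCTION mandelbrot
      END PROGRAM MANDEL_SKETCH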
!HPF$ INDEPENDENT [ ,NEW (variable-list) ] |
INDEPENDENT asserts that no iteration affects any other in any way |
It implements the "embarrassingly parallel" problem class we discussed under the structure of problems |
Note rest of HPF tackles mainly the synchronous problem class with some loosely synchronous capability
|
NEW variables are defined to have fresh instantiations for each iteration, as is typically needed for embarrassingly parallel problems, where in fact essentially all variables in a loop would be NEW |
Note INDEPENDENT can be applied to FORALL and asserts that no index point assigns to any location that another iteration index value uses
|
HPF2 (see later) has extra feature of allowing REDUCTION (accumulated) variables in INDEPENDENT DO loops |
This is an exception from the conventional HPF picture of a global name space with either distributed or replicated variables |
An extrinsic function is a function written in a language other than HPF, including most naturally any node programming language (e.g. Fortran77) targeted to a single processor (SPMD style) with message passing such as MPI |
HPF defines (Fortran90) interface and invocation characteristics
|
Allows one to get efficient parallel code where HPF language or compiler inadequate
|
The original HPF 1.0 omitted some key capabilities which were known to be important but whose syntax and functionality were unclear in 1993
|
The HPF Forum met in 1995-96 and has approved a set of Extensions and Simplifications of HPF |
The concept of a base HPF 2.0 and Approved Extensions has been agreed |
Note approved extensions (which presumably vendors need not implement) include critical capability for dynamic irregular problems |
DYNAMIC, REALIGN and REDISTRIBUTE are no longer in the base language and are just "approved extensions" |
Surprisingly, no parallel I/O capabilities were approved! |
!HPF$ INDEPENDENT |
DO i = 1 , n
|
END DO |
This modifies the owner computes rule by specifying where the computation is performed, which need not be the processor owning the left hand side |
ON HOME is an approved extension |
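A rough sketch of how the missing loop body might use ON HOME (the arrays a, b and ix are assumptions, and the exact HPF-2 placement rules for the directive should be checked against the final specification): |
!HPF$ INDEPENDENT
      DO i = 1, n
!HPF$    ON HOME( b(i) )
         a( ix(i) ) = b(i)     ! each iteration runs on the processor owning b(i),
                               ! which need not be the owner of a(ix(i))
      END DO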
Many applications of INDEPENDENT DO loops do require reductions as they are typically calculating independently quantities but storing results as parts of various averages |
e.g. in High Energy Physics data analysis, each measured event can be computed via an INDEPENDENT DO, but one wishes to form particular observables (histograms, scatterplots) which are accumulated over all the events |
Financial modelling is similar |
x = 0 |
!HPF$ INDEPENDENT, NEW(xinc), REDUCTION(x) |
do i = 1 , N
|
END DO |
xinc is a separate new variable each iteration but result is accumulated into global x |
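The loop body was not shown; a minimal self-contained sketch consistent with the NEW and REDUCTION clauses (the events array and the squaring stand in for some real per-event computation) is: |
      REAL    :: x, xinc, events(1000)
      INTEGER :: i
      CALL RANDOM_NUMBER( events )       ! stand-in for the measured event data
      x = 0.0
!HPF$ INDEPENDENT, NEW(xinc), REDUCTION(x)
      DO i = 1, 1000
         xinc = events(i)*events(i)      ! independent per-iteration work; xinc is private (NEW)
         x = x + xinc                    ! accumulated across iterations via REDUCTION(x)
      END DO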
Task Parallelism is partially supported in HPF, but it is not clear to me that this is a great idea, as it is better to keep sophisticated task parallelism outside HPF, which is really only designed to support data parallelism |
!HPF$ TASKING
|
!HPF$ END |
This extends the SPMD model, with foo running on eight processors and bar on another set of processors |
Note foo and bar are expected to contain data parallel statements which distribute execution using conventional HPF over 8 processors |
!HPF$ DISTRIBUTE x( BLOCK(SHADOW = 1 ) )
|
!HPF$ DISTRIBUTE x( BLOCK( /26,24,24,26/ ) )
|
!HPF$ DISTRIBUTE x( INDIRECT(map_array) )
|
Distribution is now allowed to Processor Subsets with typical Syntax:
|
Distribution is allowed for Derived Types but can only be done at ONE level
|