H W Yau, 4th of June, 1996.
This document begins with some introductory paragraphs on
the use of parallel computers, and the approach taken in this
exercise. Next there will be brief descriptions of the programming
language used, the problem we wish to solve, and the (short) code
itself. After all this introduction, you will then be asked to
perform some exercises with the aforementioned code. These
experiments should hopefully clarify a few of the main concepts of
parallel programming. A list of HTML reference links is also
provided, which the reader is encouraged to investigate.
Parallel Computation
Traditionally, parallel computation has had its roots in the
scientific and engineering disciplines. This was simply because the
problems to be solved there were typically simulations of
phenomena in the natural world, where the accuracy of the results
depended on the complexity of the computer model. That is, the
researcher would always demand more compute power to solve a more
complex and hence realistic problem. In response to this, the obvious
idea of tasking many processors to work on the same problem first came
to fruition in the late 1970s. Today, almost two decades on, the
problem of effectively and easily using parallel computers is still
not entirely resolved, although much genuine progress has been made.
One of the ideas we would like to bring across is that of decomposing the problem into sub-parts, which can then be computed concurrently by each of the participating processors. This approach (`domain decomposition') has been known since the first parallel computers, and research into them would have been rather brief if this were all there was to their use. Fortunately for institutes like NPAC, the vast majority of problems in the real world demand that data be communicated from one processor's domain to another. Since one cannot always assume this would be done automatically by the hardware (as would be the case for shared memory machines, which have their own limitations), it is left to the software to solve this problem.
A fool-proof parallelising compiler is still an `active
area of research', and the traditional way processors were
co-ordinated in a parallel program was through calls to a so-called
message passing library, such as the Message Passing Interface (MPI)
standard. However, modifying one's serial code to use such library
calls is often laborious and error-prone. Hence the strong motivation
to design a language with which the application programmer can express
to the compiler both how data should be distributed amongst the
processors, and where the parallelism in the code may be exploited.
Moreover, this should be done in such a fashion as to preserve the
functionality of the original serial code ---a feat not normally
possible with the more intrusive message passing library calls. Such
a language is the goal of the High Performance Fortran project.
High Performance Fortran
The principal goal of the High Performance Fortran (HPF)
language is to define a set of features with which a programmer can
achieve the best performance from a multiprocessor computer. The
basis for this is not to use an entirely new language, but to build on
the standard Fortran 90 definition, which is in turn a superset of the
Fortran 77 definition. The intent is that, modulo a couple of
exceptions, an HPF programme for a parallel computer can be compiled
and executed correctly by a Fortran 90 compiler for a serial machine.
This trick of benignly tuning the application code for a parallel
computer is done with so-called compiler directives. They exploit
the Fortran comment characters `!' or `C' in the first column of a
program line, together with the keyword `HPF$', to express
information to the HPF compiler which can be safely ignored by a
Fortran 90 compiler. So for example the line:

!HPF$ PROCESSORS :: procs( NUMBER_OF_PROCESSORS() )

is an HPF statement defining a processor arrangement consisting of a
number of processors in a linear array, where the number of
processors is given by an HPF intrinsic function and is determined at
run-time.
The choice of Fortran 90 as the starting point for HPF was not simply
dictated by the anticipated market, ie, scientists and engineers.
Fortran 90 also allows users to express parallelism in their code
through operations on entire arrays. Hence the following loop in
Fortran 77:
DO i = 1,N
  a(i) = b(i) + c(i)
END DO

can be succinctly expressed by the Fortran 90 array assignment
expression:
a = b + c

and since each of these elements of `a()' may be computed
independently of each other, an intelligent compiler would be able to
translate this to a set of parallel operations. This promotion of
arrays to basic datatypes extends to the Fortran 90 intrinsics
library. Hence we may find the sine of the value in each array
element with the expression:
asin = SIN( a )

instead of the Fortran 77 loop:

DO i = 1,N
  asin(i) = SIN( a(i) )
END DO
The array operations in Fortran 90 are made a wee bit more flexible
through the use of the triplet notation, for selecting the start,
end, and stride for a given array index. So for example, the
expression:
a(2:12) = 42.0

will set the elements a(2), a(3)...a(12) to the value 42.0; and the
expression:

a(2:12:2) = 42.0

will set the elements a(2), a(4), a(6)...a(12) to 42.0. The `:'
specifier refers to `all the elements along this rank'.
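For example, on a hypothetical rank-2 array `a2()' (not one of the
arrays used elsewhere in this exercise), the bare `:' picks out every
element along a rank:

REAL, DIMENSION(8,8) :: a2
a2(:,3) = 0.0     ! all the elements along the first rank, in column 3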
Finally, Fortran 90 also includes a first class construct
`WHERE...END WHERE' for masked array operations, as a generalisation
of the array assignment expression. For example, if instead of adding
all elements of `b()' with `c()' one wishes to restrict the operation
to only those elements of `b()' which are positive, one could express
this in Fortran 90 as:

WHERE( b .GT. 0.0 )
  a = b + c
END WHERE

instead of the Fortran 77:

DO i = 1,N
  IF( b(i) .GT. 0.0 )THEN
    a(i) = b(i) + c(i)
  END IF
END DO
In addition to compiler directives, HPF introduced a first class
language construct not in Fortran 90 (but which was mooted, and is
planned for Fortran 95), the parallel loop
`FORALL...END FORALL' for array assignments not easily done by the
above approaches. For example, the following expression, whilst
parallelisable, would be very hard to express with either array
assignments or the `WHERE' construct:

FORALL( i=1:N ) x(i) = REAL(i)
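A further illustration (using a hypothetical rank-2 array `d()', not
part of the exercise code) is zeroing the diagonal of a matrix, which
has no simple whole-array or `WHERE' equivalent but is a one-liner
with `FORALL':

FORALL( i=1:N ) d(i,i) = 0.0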
We may summarise the HPF language as consisting largely of the
following parts:

- The Fortran 90 features, such as the `WHERE' construct and array
  assignment expressions, but implemented for a parallel computer.
- The new construct `FORALL' for parallel loop operations.
- Compiler directives, such as `PROCESSORS', `DISTRIBUTE' and
  `ALIGN', for describing how data are to be mapped onto the
  processors.
- New intrinsic functions, such as the `NUMBER_OF_PROCESSORS()'
  function shown above.

A short sketch combining these parts is given below.
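The following minimal program shows how these parts might fit
together; it is purely illustrative and is not part of the exercise
sources (the names `hpf_sketch', `x', `y' and the size N are
assumptions):

PROGRAM hpf_sketch
  INTEGER, PARAMETER :: N = 1024
  INTEGER            :: i
  REAL, DIMENSION(N) :: x, y
!HPF$ PROCESSORS :: procs( NUMBER_OF_PROCESSORS() )
!HPF$ DISTRIBUTE (BLOCK) ONTO procs :: x, y

  FORALL( i=1:N ) x(i) = REAL(i) - REAL(N)/2.0   ! parallel loop
  y = 0.0                                        ! array assignment
  WHERE( x .GT. 0.0 )                            ! masked array operation
    y = SIN( x )
  END WHERE
  PRINT *, SUM( y )      ! Fortran 90 array-valued intrinsic
END PROGRAM hpf_sketch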
The code for this exercise consists of a main program `main.f90', and
two subroutines `find_rhs_psi.f90' and `tridiag_psi.f90'; all such
HPF codes have the `.f90' filename extension. At the level of the
main program, the code simply loops over a number of time-steps,
inside which are two pairs of calls to the aforementioned subroutines
and the TRANSPOSE intrinsic function ---corresponding to solving for
the x (column-wise) and y (row-wise) directions of the three NN x NN
arrays `psi', `zeta' and `rhs'.
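To make that structure concrete, a rough sketch of such a time-step
loop is given below; the argument lists of the two subroutines, the
loop variables `istep' and `nsteps', and the exact placement of the
TRANSPOSE calls are illustrative guesses rather than the actual
contents of `main.f90':

DO istep = 1, nsteps
  ! solve along the x (column-wise) direction
  CALL find_rhs_psi( psi, zeta, rhs )
  CALL tridiag_psi( psi, rhs )
  ! transpose and solve along the y (row-wise) direction
  psi  = TRANSPOSE( psi )
  zeta = TRANSPOSE( zeta )
  CALL find_rhs_psi( psi, zeta, rhs )
  CALL tridiag_psi( psi, rhs )
  ! restore the original orientation for the next time-step
  psi  = TRANSPOSE( psi )
  zeta = TRANSPOSE( zeta )
END DO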
Before the start of the executable code there are a number of HPF directives to explain.
!HPF$ PROCESSORS :: procs(NUMBER_OF_PROCESSORS())
!HPF$ TEMPLATE,DIMENSION(NN) :: tplate
!HPF$ DISTRIBUTE (BLOCK) ONTO procs :: tplate
These three directives declare a linear arrangement of processors,
declare an abstract template of NN elements, and distribute that
template over the processors by `BLOCK' such that adjacent columns
are mapped to the same processor's memory. There are other
distributions allowed in HPF, including `CYCLIC' which would have
shuffled the columns across the set of processors.
!HPF$ ALIGN beta(*,:) WITH tplate(:)
The `ALIGN' directive actually partitions the array amongst the
processors, according to the method described by the template.
Because here the template is a rank-1 array, whilst the arrays to be
distributed are rank-2, one of the ranks has the `*' specifier to
indicate that it should not be distributed, but live on the same
processor. In the code example, this first rank specifies that all
the elements of a given column of the array are placed on the same
processor. The second rank of the distributed array and the template
index have the specifier `:' to indicate that the alignment is over
all the elements in that rank. Note that this specifier is an example
of the Fortran 90 array index triplet notation, hence one can
manipulate how arrays are distributed with respect to the template.
!HPF$ DISTRIBUTE *(*,BLOCK) :: beta,rhs,u,temp
This directive asserts that the arrays `beta', `rhs', `u' and `temp'
are already distributed column-wise in a `BLOCK' manner. The extra
`*' after the `DISTRIBUTE' keyword tells the compiler that this
really is the distribution used. Should this be erroneous (eg, the
distributions are actually `(BLOCK,*)'), then the behaviour of the
compiler and the code will be undefined.
A graphical illustration of the use of templates, block distribution, and the alignment directive is given below.
An example of distributing an 8 x 8 array `ar()' over a linear chain
of four processors, by aligning the array with the distributed 8 x 8
template `tplate()'. The distribution method used is `BLOCK' such
that a given processor would see adjacent columns in its memory.
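As a concrete rendering of the figure just described (these
declarations are illustrative and are not taken from the exercise
sources), the 8 x 8 array `ar()' could be mapped in this way with:

REAL, DIMENSION(8,8) :: ar
!HPF$ PROCESSORS :: procs( NUMBER_OF_PROCESSORS() )
!HPF$ TEMPLATE, DIMENSION(8) :: tplate
!HPF$ DISTRIBUTE (BLOCK) ONTO procs :: tplate
!HPF$ ALIGN ar(*,:) WITH tplate(:)

With four processors, each would then hold two adjacent columns of
`ar()' in its local memory.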
Compiling and Running the Code
The aim of this section is to explain how you can obtain the
code for this exercise, how to compile it, and (of course) the actual
running part.
To obtain your own copy of the code, type:

% cp -r /home/T6G/hpfa/REU ./

The exercise is to be run on the nodes of the SP-2, which are named
merlin1, merlin2, merlin3...merlin8. You will have to do a Unix
rlogin into one of these eight nodes in order to perform this
exercise (reply with `xterm' for the terminal type). A fiendishly
clever load balancing algorithm for allocating these SP-2 nodes
amongst the REU students is currently being devised. But once you
have logged in, you may use the resources of the other nodes in your
parallel computation.
The HPF compiler system we shall be using is the product from Portland Group Inc (PGI), and is probably the best HPF compiler NPAC currently has. There is however a little incantation that you need to invoke before you can use the compiler:
% cd REU
% source Incantation

in order to set up your environment variables to point to the PGI
binaries. Alternatively, you can concatenate the Incantation script
at the end of your `~/.cshrc' file, exit the SP-2, and re-rlogin;
this way, you won't need to invoke the incantation every time you log
in to use the PGI compiler. If everything has gone smoothly, when you
type:

% echo $PGI

you should get the reply:

/usr/local/pgi
Inside the REU/ directory there are the three .f90 source files
already discussed above, and some other ancillary directories and
files associated with the compilation process. The code uses the Unix
Make utility to handle the compilation and linking phase. Hence after
you have changed one of the source files, you should recompile the
code with the command:

% make

which will generate the executable exec.rs6000. You should now
compile your code, and hopefully see several lines of error-free
diagnostic messages.
To run the executable, you must give it the `-pghpf' flag, and tell
it how many nodes you wish to grab from the SP-2 with the `-np'
flag. Hence for running on two nodes, you should type:

% ./exec.rs6000 -pghpf -np 2

After the execution has completed, a binary file of profiling
information `pgprof.out' is generated, which you will investigate in
the following section. Also generated are files with the extension
`.f'. These are intermediate files in Fortran 77, written by the PGI
compiler, and may give you an idea of the effort required to produce
an HPF compiler!
If you wish to investigate further the capabilities of the
PGI HPF compiler, please follow the relevant link amongst the
references listed below.
Profiling & Exercises
The profiler we shall use has an X11 graphical interface, so
we need to do some more setting up before continuing. This is best
illustrated by an example. If the workstation you are currently using
is (say) called tapeworm, and you have logged onto the SP-2 node
merlin4, please enter on your workstation:

% xhost + merlin4

and from within your SP-2 node:

% setenv DISPLAY tapeworm.syr.edu:0.0

You can now run the PGI profiler in the background with the command:

% pgprof pgprof.out &

The first display you will see is that of the times spent in each of
the routines. Near the top there are buttons for you to select which
processor to examine, what statistics to display, and how to show the
HPF-statistics ---eg, selecting `all' will display the timings for
all the processors used in the run.
After you have familiarised yourself with the various display options, you can home in on individual routines by clicking on them. This will pop up another window, with its own set of display and HPF-Statistics options. You should then have enough information to perform the tasks in the following section. There is also an HTML link to the PGI profiler user guide in the references below.
The speedup obtained can be expressed as an efficiency measure Eta:
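One common way of writing such a measure is via Amdahl's law;
assuming f denotes the fraction of the run which must execute
serially and P the number of processors (a convention assumed here),
the speedup and efficiency are:

    S(P) = 1 / ( f + (1 - f)/P )
    Eta  = S(P) / P = 1 / ( f*P + (1 - f) )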
Understand what happens to this efficiency measure at the limits of f=0.0 and f=1.0.