NPAC REU Programme
Tutorial on Distributed Computing with High Performance Fortran (HPF)

H W Yau, 4th of June, 1996.


Aims of this Tutorial

The following exercise is an attempt to illustrate how the resources of a parallel computer can be harnessed by the application programmer. The technological tool we shall use here is the High Performance Fortran language, and the compute resource is the 12-node IBM SP-2 parallel computer at NPAC. Together these represent something close to the current state of the art in parallel computation.

This document begins with some introductory paragraphs on the use of parallel computers and the approach taken in this exercise. Next come brief descriptions of the programming language used, the problem we wish to solve, and the (short) code itself. After all this introduction, you will be asked to perform some exercises with the aforementioned code. These experiments should clarify a few of the main concepts of parallel programming. A list of HTML reference links is also provided, which the reader is encouraged to investigate.

Parallel Computation

Traditionally, parallel computation has had its roots in the scientific and engineering disciplines. This was simply because the problems to be solved there were typically simulations of phenomena in the natural world, where the accuracy of the results depended on the complexity of the computer model. That is, the researcher would always demand more compute power to solve a more complex, and hence more realistic, problem. In response to this, the obvious idea of tasking many processors to work on the same problem first came to fruition in the late 1970s. Today, almost two decades on, the problem of effectively and easily using parallel computers is still not entirely resolved, although much genuine progress has been made.

One of the ideas we would like to bring across is that of decomposing the problem into sub-parts, which can then be computed concurrently by each of the participating processors. This approach (`domain decomposition') has been known since the first parallel computers, and research into them would have been rather brief if this were all there was to their use. Fortunately for institutes like NPAC, the vast majority of problems in the real world demand that data be communicated from one processor's domain to another. Since one cannot always assume this will be done automatically by the hardware (as would be the case for shared memory machines, which have their own limitations), it is left to the software to solve this problem.

A fool-proof parallelising compiler is still an `active area of research', and the traditional way processors were co-ordinated in a parallel program was through calls to a so-called message passing library, such as the Message Passing Interface (MPI) standard. However, modifying one's serial code to use such library calls is often laborious and error-prone. Hence the strong motivation to design a language with which the application programmer can express to the compiler both how data should be distributed amongst the processors, and where the parallelism in the code may be exploited. Moreover, this should be done in such a fashion as to preserve the functionality of the original serial code, a feat not normally possible with the more intrusive message passing library calls. Such a language is the goal of the High Performance Fortran project.

High Performance Fortran

The principal goal of the High Performance Fortran (HPF) language is to define a set of features with which a programmer can achieve the best performance from a multiprocessor computer. Rather than inventing an entirely new language, HPF builds on the standard Fortran 90 definition, which is in turn a superset of the Fortran 77 definition. The intent is that, modulo a couple of exceptions, an HPF programme for a parallel computer can be compiled and executed correctly by a Fortran 90 compiler for a serial machine.

This trick of benignly tuning the application code for a parallel computer is performed with so-called compiler directives. These exploit the Fortran comment characters `!' or `C' in the first column of a program line: by prefixing such a comment with the keyword `HPF$', information can be expressed to the HPF compiler while being safely ignored by a Fortran 90 compiler. So, for example, the line:

!HPF$ PROCESSORS :: procs( NUMBER_OF_PROCESSORS() ) 

is an HPF directive defining a processor arrangement consisting of a linear array of processors, where the number of processors is given by an HPF intrinsic function and is determined at run time.
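
A companion directive of the same form can then map data onto that arrangement. For instance, a hypothetical one-dimensional array `a()' (not one taken from the exercise code) could be given a block distribution over `procs' with:

      REAL, DIMENSION(1000) :: a
!HPF$ DISTRIBUTE a( BLOCK ) ONTO procs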

The choice of Fortran 90 as the starting point for HPF was not dictated simply by the anticipated market, ie, scientists and engineers. Fortran 90 also allows users to express parallelism in their code through operations on entire arrays. Hence the Fortran 77 loop:

      DO i = 1,N
        a(i) = b(i) + c(i)
      END DO
can be succinctly expressed by the Fortran 90 array assignment expression:
      a = b + c
and since each of the elements of `a()' may be computed independently of the others, an intelligent compiler would be able to translate this into a set of parallel operations. This promotion of arrays to basic datatypes extends to the Fortran 90 intrinsics library. Hence we may find the sine of the value in each array element with the expression:
      asin = SIN( a )
instead of the Fortran 77 loop:
      DO i = 1,N
        asin(i) = SIN( a(i) )
      END DO

The array operations in Fortran 90 are made a wee bit more flexible through the use of the triplet notation, which selects the start, end, and stride for a given array index. So, for example, the expression:

      a(2:12) = 42.0
will set the elements a(2), a(3)...a(12) to the value 42.0; and the expression:
      a(2:12:2) = 42.0
will set the elements a(2), a(4), a(6)...a(12) to 42.0. A bare `:' specifier refers to all the elements along that dimension.

Finally, Fortran 90 also includes a first-class construct `WHERE...END WHERE' for masked array operations, as a generalisation of the array assignment expression. For example, if instead of adding all elements of `b()' to `c()' one wishes to restrict the operation to only those elements of `b()' which are positive, one could express this in Fortran 90 as:

      WHERE( b .GT. 0.0 )
        a = b + c
      END WHERE
instead of the Fortran 77:
      DO i = 1,N
        IF( b(i) .GT. 0.0 )THEN
          a(i) = b(i) + c(i)
        END IF
      END DO

In addition to compiler directives, HPF introduces a first-class language construct not in Fortran 90 (though it was mooted there, and is planned for Fortran 95): the parallel loop `FORALL...END FORALL', for array assignments not easily expressed by the above approaches. For example, the following statement, whilst parallelisable, would be very hard to express with either array assignments or the `WHERE' construct:

      FORALL( i=1:N ) x(i) = REAL(i)
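
For completeness, the block form of the construct (shown here with a hypothetical two-dimensional array `y()') looks like:

      FORALL( i = 1:N, j = 1:N )
        y(i,j) = REAL( i + j )
      END FORALL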

We may summarise the HPF language as consisting largely of the following parts: the Fortran 90 base language, with its array syntax and intrinsics; compiler directives for describing processor arrangements and for aligning and distributing data across them; new language constructs such as the `FORALL' parallel loop; and new intrinsic functions such as `NUMBER_OF_PROCESSORS()'.

If you want to know more about the HPF language, an HTML version of the specification document is listed in the references below.

The Physical Problem

The problem the example code solves is a very simple fluid dynamics problem in a two-dimensional box. This box is discretised, and the properties at each grid point are calculated by the program from one simulation time step to the next. At each such time step a series of computations is performed along the columns of the problem's arrays, where each column can be calculated independently of the other columns. Then all the arrays are transposed, and the computation is repeated on the new columns (which were previously the rows of the arrays). This alternation between computing along the columns and the rows of the problem mesh couples all the points in a non-trivial manner, and also gives the algorithm its name: `Alternating Direction Implicit'. For more information, please take a look at the ADI postscript paper in the references below.
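
As a rough illustration of this sweep-transpose-sweep pattern (and only as a sketch: the names `NSTEPS' and `column_sweep', and the trivial column operation, are illustrative rather than taken from the exercise code), the structure can be written in Fortran 90 as:

      PROGRAM adi_sketch
      ! Schematic of the Alternating Direction Implicit sweep pattern.
        IMPLICIT NONE
        INTEGER, PARAMETER :: NN = 8, NSTEPS = 4
        REAL, DIMENSION(NN,NN) :: psi
        INTEGER :: step

        psi = 1.0
        DO step = 1, NSTEPS
          CALL column_sweep( psi )   ! every column updated independently
          psi = TRANSPOSE( psi )     ! rows become columns
          CALL column_sweep( psi )   ! same kernel now sweeps the old rows
          psi = TRANSPOSE( psi )     ! restore the original orientation
        END DO
        PRINT *, 'psi(1,1) = ', psi(1,1)

      CONTAINS

        ! Stand-in for the real column solver: scale each column in place.
        SUBROUTINE column_sweep( a )
          REAL, DIMENSION(:,:), INTENT(INOUT) :: a
          INTEGER :: j
          DO j = 1, SIZE( a, 2 )
            a(:,j) = 0.5 * a(:,j)
          END DO
        END SUBROUTINE column_sweep

      END PROGRAM adi_sketch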

The Code

The code is in three parts, consisting of the main program `main.f90' and the two subroutines `find_rhs_psi.f90' and `tridiag_psi.f90'; all such HPF codes have the `.f90' filename extension. At the level of the main program, the code simply loops over a number of time-steps, inside which are two pairs of calls to the aforementioned subroutines and the TRANSPOSE intrinsic function, corresponding to solving for the x (column-wise) and y (row-wise) directions of the three NN x NN arrays `psi', `zeta' and `rhs'.

Before the start of the executable code there are a number of HPF directives to explain.

A graphical illustration of the use of templates, block distribution, and the alignment directive is given below.


An example of distributing an 8 x 8 array `ar()' over a linear chain of four processors, by aligning the array with the distributed 8 x 8 template `tplate()'. The distribution method used is `BLOCK' such that a given processor would see adjacent columns in its memory.
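
Expressed as HPF directives, the arrangement in this figure might look roughly as follows (a sketch only; the directives actually used in `main.f90' may differ in their details):

!HPF$ PROCESSORS :: procs( 4 )
      REAL, DIMENSION(8,8) :: ar
!HPF$ TEMPLATE :: tplate( 8, 8 )
!HPF$ ALIGN ar(i,j) WITH tplate(i,j)
!HPF$ DISTRIBUTE tplate( *, BLOCK ) ONTO procs

With this mapping each processor owns two adjacent columns of `tplate()', and hence of `ar()'.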

Compiling and Running the Code

The aim of this section is to explain how you can obtain the code for this exercise, how to compile it, and (of course) the actual running part.

Obtaining the Code

The code may be obtained by copying from its home directory to wherever you want, viz:
% cp -r /home/T6G/hpfa/REU ./

The Hardware and Software

You shall be making your runs on the NPAC IBM SP-2. This is a 12-node machine, but we mortals are only allowed to use the first eight nodes, which are referred to as merlin1, merlin2, merlin3...merlin8. You will have to do a Unix rlogin into one of these eight nodes in order to perform this exercise (reply with `xterm' when asked for the terminal type). A fiendishly clever load balancing algorithm for allocating these SP-2 nodes amongst the REU students is currently being devised. But once you have logged in, you may use the resources of the other nodes in your parallel computation.

The HPF compiler system we shall be using is the product from Portland Group Inc (PGI), and is probably the best HPF compiler NPAC currently has. There is however a little incantation that you need to invoke before you can use the compiler:

% cd REU
% source Incantation
in order to set up your environment variables to point to the PGI binaries. Alternatively, you can concatenate the Incantation script at the end of your `~/.cshrc' file, exit the SP-2, and re-rlogin; this way, you won't need to invoke the incantation every time you log in to use the PGI compiler. If everything has gone smoothly, when you type:
% echo $PGI
you should get the reply:
/usr/local/pgi

Compiling the Code

Inside your copy of the REU/ directory there are the three .f90 source files already discussed above, and some other ancillary directories and files associated with the compilation process. The code uses the Unix Make utility to handle the compilation and linking phase. Hence after you have changed one of the source files, you should recompile the code with the command:
% make
which will generate the executable: exec.rs6000. You should now compile your code, and hopefully see several lines of error-free diagnostic messages.

Running the Code

To run the code, you have to tell the executable that it was compiled by the PGI compiler(!) with the `-pghpf' flag, and tell it how many nodes you wish to grab from the SP-2 with the `-np' flag. Hence for running on two nodes, you should type:
% ./exec.rs6000 -pghpf -np 2
After the execution has completed, a binary file of profiling information, `pgprof.out', is generated, which you will investigate in the following section. Also generated are files with the extension `.f'. These are intermediate files in Fortran 77, written by the PGI compiler, and may give you an idea of the effort required to produce an HPF compiler!

If you wish to investigate further the capabilities of the PGI HPF compiler, please follow the relevant link amongst the references listed below.

Profiling & Exercises

The profiler we shall use has an X11 graphical interface, so we need to do some more setting up before continuing. This is best illustrated by an example. If the workstation you are currently using is (say) called tapeworm, and you have logged onto the SP-2 node merlin4, please enter on your workstation:
% xhost + merlin4
and from within your SP-2 node:
% setenv DISPLAY tapeworm.syr.edu:0.0
You can now run the PGI profiler in the background with the command:
% pgprof pgprof.out &
The first display you will see is that of the times spent in each of the routines. Near the top there are buttons for you to select which processor to examine, what statistics to display, and how to show the HPF statistics; for example, selecting `all' will display the timings for all the processors used in the run.

After you have familiarised yourself with the various display options, you can home in on individual routines by clicking on them. This will pop up another window, with its own set of display and HPF-statistics options. You should then have enough information to perform the tasks in the following section. There is also an HTML link to the PGI profiler user guide in the references below.

Tasks

  1. Determine which part of the code is computationally most expensive.
  2. Determine which part of the code is communication intensive.
  3. How does overall execution time differ with the number of processors (1, 2, 4, 6, 8)?
  4. How do the profile figures differ with the number of processors?
  5. What are the bottlenecks in the code?
  6. Speed-up and scalability. The speed-up for a parallel computation is usually calculated as the ratio of the `Time taken by one processor' to the `Time taken by P processors'. How far is this code from the expected ideal?
  7. Amdahl's Law. A crude performance model of this code consists of two terms: the `Time taken by the non-scalable parts of the code' plus the `Time taken by the scalable parts of the code divided by the number of processors'. These times can be translated into the fraction f of the code which is scalable, and the fraction (1-f) which is not. This gives an expression for the speed-up of the code:

        S(P) = 1 / ( (1 - f) + f/P )

    which can be expressed as an efficiency measure Eta:

        Eta(P) = S(P) / P = 1 / ( f + (1 - f) P )

    Understand what happens to this efficiency measure at the limits of f=0.0 and f=1.0.

  8. Derive the equations above.
  9. Try to measure what the fraction f of scalability is in the code.
  10. How many measurements do you think are necessary before the statistics are meaningful?
  11. Increasing/decreasing the problem size. Do these changes make the code less or more scalable? Why?
  12. From the point of view of someone wishing to make a code run faster, can the profiler provide more information or a better interface?
  13. Try other data decomposition schemes, such as cyclic (a sketch of the directive syntax is given below).
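
As a hint for the last task, a cyclic distribution of the template from the earlier figure could be written as (again only a sketch, using the hypothetical names from that figure):

!HPF$ DISTRIBUTE tplate( *, CYCLIC ) ONTO procs

so that, over four processors, processor k owns columns k and k+4 of `tplate()' rather than a contiguous block.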

References


Written: HWY 30th of May, 1996.
