NPAC REU Programme
Tutorial on Distributed Computing with High Performance Fortran (HPF)

H W Yau, 4th of June, 1996.


Aims of this Tutorial

The following exercise is an attempt to illustrate how the resources of a parallel computer can be harnessed by the application programmer. The technological tool we shall use here is the High Performance Fortran language, and the compute resource is the 12-node IBM SP-2 parallel computer at NPAC. Together these represent something close to the current state of the art in parallel computation.

This document begins with some introductory paragraphs on the use of parallel computers and the approach taken in this exercise. Next come brief descriptions of the programming language used, the problem we wish to solve, and the (short) code itself. After all this introduction, you will be asked to perform some exercises with the aforementioned code. These experiments should clarify a few of the main concepts of parallel programming. A list of HTML reference links is also provided, which the reader is encouraged to investigate.

Parallel Computation

Traditionally, parallel computation has had its roots in the scientific and engineering disciplines. This was simply because the problems to be solved there were typically simulations of phenomena in the natural world, where the accuracy of the results depended on the complexity of the computer model. That is, the researcher would always demand more compute power to solve a more complex, and hence more realistic, problem. In response to this, the obvious idea of tasking many processors to work on the same problem first came to fruition in the late 1970s. Today, almost two decades on, the problem of effectively and easily using parallel computers is still not entirely resolved, although much genuine progress has been made.

One of the ideas we would like to bring across is that of decomposing the problem into sub-parts, which can then be computed concurrently by each of the participating processors. This approach (`domain decomposition') has been known since the first parallel computers, and research into them would have been rather brief if this were all there was to their use. Fortunately for institutes like NPAC, the vast majority of problems in the real world demand that data be communicated from one processor's domain to another. Since one cannot always assume this will be done automatically by the hardware (as would be the case for shared memory machines, which have their own limitations), it is left to the software to solve this problem.

A fool-proof parallelising compiler is still an `active area of research', and the traditional way processors were co-ordinated in a parallel program was through calls to a so-called message passing library, such as the Message Passing Interface (MPI) standard. However, modifying one's serial code to use such library calls is often laborious and error-prone. Hence the strong motivation to design a language with which the application programmer can express to the compiler both how data should be distributed amongst the processors, and where the parallelism in the code may be exploited. Moreover, this should be done in such a fashion as to preserve the functionality of the original serial code, a feat not normally possible with the more intrusive message passing library calls. Such a language is the goal of the High Performance Fortran project.

High Performance Fortran

The principal goal of the High Performance Fortran (HPF) language is to define a set of features with which a programmer can achieve the best performance from a multiprocessor computer. Rather than inventing an entirely new language, HPF builds on the standard Fortran 90 definition, which is in turn a superset of the Fortran 77 definition. The intent is that, modulo a couple of exceptions, an HPF programme for a parallel computer can be compiled and executed correctly by a Fortran 90 compiler for a serial machine.

This trick of benignly tuning the application code for a parallel computer is performed with so-called compiler directives. These exploit the Fortran comment characters `!' or `C' in the first column of a program line: by prefixing such a comment with the keyword `HPF$', information can be expressed to the HPF compiler while being safely ignored by a Fortran 90 compiler. So, for example, the line:

!HPF$ PROCESSORS :: procs( NUMBER_OF_PROCESSORS() ) 

is an HPF directive defining a processor arrangement consisting of a linear array of processors, where the number of processors is given by an HPF intrinsic function and is determined at run time.
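
A companion directive of the same form can then map data onto that arrangement. For instance, a hypothetical one-dimensional array `a()' (not one taken from the exercise code) could be given a block distribution over `procs' with:

      REAL, DIMENSION(1000) :: a
!HPF$ DISTRIBUTE a( BLOCK ) ONTO procs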

The choice of Fortran 90 as the starting point for HPF was not dictated simply by the anticipated market, ie, scientists and engineers. Fortran 90 also allows users to express parallelism in their code through operations on entire arrays. Hence the Fortran 77 loop:

      DO i = 1,N
        a(i) = b(i) + c(i)
      END DO
can be succinctly expressed by the Fortran 90 array assignment expression:
      a = b + c
and since each of the elements of `a()' may be computed independently of the others, an intelligent compiler would be able to translate this into a set of parallel operations. This promotion of arrays to basic datatypes extends to the Fortran 90 intrinsics library. Hence we may find the sine of the value in each array element with the expression:
      asin = SIN( a )
instead of the Fortran 77 loop:
      DO i = 1,N
        asin(i) = SIN( a(i) )
      END DO

The array operations in Fortran 90 are made a wee bit more flexible through the use of the triplet notation, which selects the start, end, and stride for a given array index. So, for example, the expression:

      a(2:12) = 42.0
will set the elements a(2), a(3)...a(12) to the value 42.0; and the expression:
      a(2:12:2) = 42.0
will set the elements a(2), a(4), a(6)...a(12) to 42.0. A bare `:' specifier refers to all the elements along that dimension.

Finally, Fortran 90 also includes a first-class construct `WHERE...END WHERE' for masked array operations, as a generalisation of the array assignment expression. For example, if instead of adding all elements of `b()' to `c()' one wishes to restrict the operation to only those elements of `b()' which are positive, one could express this in Fortran 90 as:

      WHERE( b .GT. 0.0 )
        a = b + c
      END WHERE
instead of the Fortran 77:
      DO i = 1,N
        IF( b(i) .GT. 0.0 )THEN
          a(i) = b(i) + c(i)
        END IF
      END DO

In addition to compiler directives, HPF introduces a first-class language construct not in Fortran 90 (though it was mooted there, and is planned for Fortran 95): the parallel loop `FORALL...END FORALL', for array assignments not easily expressed by the above approaches. For example, the following statement, whilst parallelisable, would be very hard to express with either array assignments or the `WHERE' construct:

      FORALL( i=1:N ) x(i) = REAL(i)
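
For completeness, the block form of the construct (shown here with a hypothetical two-dimensional array `y()') looks like:

      FORALL( i = 1:N, j = 1:N )
        y(i,j) = REAL( i + j )
      END FORALL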

We may summarise the HPF language as consisting largely of the following parts: the Fortran 90 base language, with its array syntax and intrinsics; compiler directives for describing processor arrangements and for aligning and distributing data across them; new language constructs such as the `FORALL' parallel loop; and new intrinsic functions such as `NUMBER_OF_PROCESSORS()'.

If you want to know more about the HPF language, an HTML version of the specification document is listed in the references below.

The Physical Problem

The problem the example code solves is a very simple fluid dynamics problem in a two-dimensional box. This box is discretised, and the properties at each grid point are calculated by the program from one simulation time step to the next. At each such time step a series of computations is performed along the columns of the problem's arrays, where each column can be calculated independently of the other columns. Then all the arrays are transposed, and the computation is repeated on the new columns (which were previously the rows of the arrays). This alternation between computing along the columns and the rows of the problem mesh couples all the points in a non-trivial manner, and also gives the algorithm its name: `Alternating Direction Implicit'. For more information, please take a look at the ADI postscript paper in the references below.
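
As a rough illustration of this sweep-transpose-sweep pattern (and only as a sketch: the names `NSTEPS' and `column_sweep', and the trivial column operation, are illustrative rather than taken from the exercise code), the structure can be written in Fortran 90 as:

      PROGRAM adi_sketch
      ! Schematic of the Alternating Direction Implicit sweep pattern.
        IMPLICIT NONE
        INTEGER, PARAMETER :: NN = 8, NSTEPS = 4
        REAL, DIMENSION(NN,NN) :: psi
        INTEGER :: step

        psi = 1.0
        DO step = 1, NSTEPS
          CALL column_sweep( psi )   ! every column updated independently
          psi = TRANSPOSE( psi )     ! rows become columns
          CALL column_sweep( psi )   ! same kernel now sweeps the old rows
          psi = TRANSPOSE( psi )     ! restore the original orientation
        END DO
        PRINT *, 'psi(1,1) = ', psi(1,1)

      CONTAINS

        ! Stand-in for the real column solver: scale each column in place.
        SUBROUTINE column_sweep( a )
          REAL, DIMENSION(:,:), INTENT(INOUT) :: a
          INTEGER :: j
          DO j = 1, SIZE( a, 2 )
            a(:,j) = 0.5 * a(:,j)
          END DO
        END SUBROUTINE column_sweep

      END PROGRAM adi_sketch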

The Code

The code is in three parts, consisting of the main program `main.f90' and the two subroutines `find_rhs_psi.f90' and `tridiag_psi.f90'; all such HPF codes have the `.f90' filename extension. At the level of the main program, the code simply loops over a number of time-steps, inside which are two pairs of calls to the aforementioned subroutines and the TRANSPOSE intrinsic function, corresponding to solving for the x (column-wise) and y (row-wise) directions of the three NN x NN arrays `psi', `zeta' and `rhs'.

Before the start of the executable code there are a number of HPF directives to explain.

A graphical illustration of the use of templates, block distribution, and the alignment directive is given below.


An example of distributing an 8 x 8 array `ar()' over a linear chain of four processors, by aligning the array with the distributed 8 x 8 template `tplate()'. The distribution method used is `BLOCK' such that a given processor would see adjacent columns in its memory.
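
Expressed as HPF directives, the arrangement in this figure might look roughly as follows (a sketch only; the directives actually used in `main.f90' may differ in their details):

!HPF$ PROCESSORS :: procs( 4 )
      REAL, DIMENSION(8,8) :: ar
!HPF$ TEMPLATE :: tplate( 8, 8 )
!HPF$ ALIGN ar(i,j) WITH tplate(i,j)
!HPF$ DISTRIBUTE tplate( *, BLOCK ) ONTO procs

With this mapping each processor owns two adjacent columns of `tplate()', and hence of `ar()'.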

Compiling and Running the Code

The aim of this section is to explain how you can obtain the code for this exercise, how to compile it, and (of course) the actual running part.

Obtaining the Code

The code may be obtained by copying from its home directory to wherever you want, viz:
% cp -r /home/T6G/hpfa/REU ./

The Hardware and Software

You shall be making your runs on the NPAC IBM SP-2. This is a 12-node machine, but we mortals are only allowed to use the first eight nodes, which are referred to as merlin1, merlin2, merlin3...merlin8. You will have to do a Unix rlogin into one of these eight nodes in order to perform this exercise (reply with `xterm' when asked for the terminal type). A fiendishly clever load balancing algorithm for allocating these SP-2 nodes amongst the REU students is currently being devised. But once you have logged in, you may use the resources of the other nodes in your parallel computation.

The HPF compiler system we shall be using is the product from Portland Group Inc (PGI), and is probably the best HPF compiler NPAC currently has. There is however a little incantation that you need to invoke before you can use the compiler:

% cd REU
% source Incantation
in order to set up your environment variables to point to the PGI binaries. Alternatively, you can concatenate the Incantation script at the end of your `~/.cshrc' file, exit the SP-2, and re-rlogin; this way, you won't need to invoke the incantation every time you log in to use the PGI compiler. If everything has gone smoothly, when you type:
% echo $PGI
you should get the reply:
/usr/local/pgi

Compiling the Code

Inside your copy of the REU/ directory there are the three .f90 source files already discussed above, and some other ancillary directories and files associated with the compilation process. The code uses the Unix Make utility to handle the compilation and linking phase. Hence after you have changed one of the source files, you should recompile the code with the command:
% make
which will generate the executable: exec.rs6000. You should now compile your code, and hopefully see several lines of error-free diagnostic messages.

Running the Code

To run the code, you have to tell the executable that it was compiled by the PGI compiler(!) with the `-pghpf' flag, and tell it how many nodes you wish to grab from the SP-2 with the `-np' flag. Hence for running on two nodes, you should type:
% ./exec.rs6000 -pghpf -np 2
After the execution has completed, a binary file of profiling information, `pgprof.out', is generated, which you will investigate in the following section. Also generated are files with the extension `.f'. These are intermediate files in Fortran 77, written by the PGI compiler, and may give you an idea of the effort required to produce an HPF compiler!

If you wish to investigate further the capabilities of the PGI HPF compiler, please follow the relevant link amongst the references listed below.

Profiling & Exercises

The profiler we shall use has an X11 graphical interface, so we need to do some more setting up before continuing. This is best illustrated by an example. If the workstation you are currently using is (say) called tapeworm, and you have logged onto the SP-2 node merlin4, please enter on your workstation:
% xhost + merlin4
and from within your SP-2 node:
% setenv DISPLAY tapeworm.syr.edu:0.0
You can now run the PGI profiler in the background with the command:
% pgprof pgprof.out &
The first display you will see is that of the times spent in each of the routines. Near the top there are buttons for you to select which processor to examine, what statistics to display, and how to show the HPF statistics; for example, selecting `all' will display the timings for all the processors used in the run.

After you have familiarised yourself with the various display options, you can home in on individual routines by clicking on them. This will pop up another window, with its own set of display and HPF-statistics options. You should then have enough information to perform the tasks in the following section. There is also an HTML link to the PGI profiler user guide in the references below.

Tasks

  1. Determine which part of the code is computationally most expensive.
  2. Determine which part of the code is communication intensive.
  3. How does overall execution time differ with the number of processors (1, 2, 4, 6, 8)?
  4. How do the profile figures differ with the number of processors?
  5. What are the bottlenecks in the code?
  6. Speed-up and scalability. The speed-up for a parallel computation is usually calculated as the ratio of the `Time taken by one processor' to the `Time taken by P processors'. How far is this code from the expected ideal?
  7. Amdahl's Law. A crude performance model of this code consists of two terms: the `Time taken by the non-scalable parts of the code' plus the `Time taken by the scalable parts of the code divided by the number of processors'. These times can be translated into the fraction f of the code which is scalable, and the fraction (1-f) which is not. This gives an expression for the speed-up of the code:

        S(P) = 1 / ( (1 - f) + f/P )

    which can be expressed as an efficiency measure Eta:

        Eta(P) = S(P) / P = 1 / ( f + (1 - f) P )

    Understand what happens to this efficiency measure at the limits of f=0.0 and f=1.0.

  8. Derive the equations above.
  9. Try to measure what the fraction f of scalability is in the code.
  10. How many measurements do you think are necessary before the statistics are meaningful?
  11. Increasing/decreasing the problem size. Do these changes make the code less or more scalable? Why?
  12. From the point of view of someone wishing to make a code run faster, can the profiler provide more information or a better interface?
  13. Try other data decomposition schemes, such as cyclic (a sketch of the directive syntax is given below).
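
As a hint for the last task, a cyclic distribution of the template from the earlier figure could be written as (again only a sketch, using the hypothetical names from that figure):

!HPF$ DISTRIBUTE tplate( *, CYCLIC ) ONTO procs

so that, over four processors, processor k owns columns k and k+4 of `tplate()' rather than a contiguous block.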

References


Written: HWY 30th of May, 1996.
