We use Gaussian elimination (GE) with partial pivoting as an example of translating a Fortran 90D/HPF program into a Fortran+MP program, and to show the effectiveness of the communication elimination optimization.
The Fortran 90D/HPF code is shown in Figure .
Arrays a and row are partitioned by compiler directives.
The second dimension of a is block-partitioned, while the first
dimension is not partitioned.
Array row is block-partitioned.
This program illustrates how a programmer working
in Fortran 90D/HPF may easily parallelize a program.
Data parallelism is concisely represented
by array operations, while the sequential computation is expressed
by do loops. More importantly, explicit communication primitives
do not need to be added
since the program is written for a single address space using Fortran 90D/HPF.
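The figure's source is not reproduced here, but the style it describes can be sketched. The following NumPy version of GE with partial pivoting is a hypothetical illustration (not the paper's actual Fortran 90D/HPF code): whole-array operations carry the data parallelism, a sequential loop carries the sequential part, and no communication calls appear because a single address space is assumed.

```python
import numpy as np

def ge_partial_pivot(a):
    """Gaussian elimination with partial pivoting in array syntax.

    Illustrative sketch only: whole-column operations express the
    data parallelism, mirroring Fortran 90 array syntax; nothing
    here refers to processors or communication.
    """
    a = a.astype(float).copy()
    n = a.shape[0]
    for k in range(n):                        # sequential outer loop
        p = k + int(np.argmax(np.abs(a[k:, k])))  # MAXLOC-style pivot search
        a[[k, p], :] = a[[p, k], :]           # swap pivot row (array operation)
        fac = a[k + 1:, k] / a[k, k]          # scale factors (array operation)
        a[k + 1:, :] -= np.outer(fac, a[k, :])  # rank-1 update (array operation)
    return a

# Small demonstration on a 2x2 system
result = ge_partial_pivot(np.array([[2.0, 1.0], [4.0, 3.0]]))
```

After the call, `result` is upper triangular; the pivot search moved the larger element 4.0 into the pivot position.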
Figure shows how the Fortran 90D/HPF compiler translates
the GE code into Fortran 77+MP form.
It is easy to see that
the generated code is structured as alternating phases of
local computation and global communication.
Local computations consist of operations by each processor on
the data in its own memory. Global communication includes any transfer of
data among processors.
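The alternating structure can be sketched as a shared-memory simulation of the node program (a hypothetical illustration, not the compiler's actual output): each outer iteration performs one global communication, in which the owner of column k broadcasts it, followed by purely local computation on each processor's own columns.

```python
import numpy as np

def spmd_ge(a, nproc):
    """Simulate the alternating local-computation / global-communication
    phases of a translated GE node program.

    The second dimension is block-partitioned over `nproc` processors;
    the broadcast is modelled as a copy plus a counter. Illustrative
    simulation only, not generated code.
    """
    a = a.astype(float).copy()
    n = a.shape[0]
    blk = (n + nproc - 1) // nproc            # block size of 2nd dimension
    comm_calls = 0
    for k in range(n):
        # Global communication phase: processor k // blk owns column k
        # and broadcasts it to all others (modelled here as a copy).
        col_k = a[:, k].copy()
        comm_calls += 1
        # Local computation phase: each processor updates the columns
        # in its own memory using the broadcast column.
        p = k + int(np.argmax(np.abs(col_k[k:])))   # pivot row (MAXLOC)
        a[[k, p], :] = a[[p, k], :]                 # swap pivot row locally
        col_k[[k, p]] = col_k[[p, k]]               # keep the copy consistent
        fac = col_k[k + 1:] / col_k[k]              # scale factors
        a[k + 1:, :] -= np.outer(fac, a[k, :])      # local rank-1 update
    return a, comm_calls

tri, ncomm = spmd_ge(np.array([[2.0, 1.0], [4.0, 3.0]]), nproc=2)
```

One broadcast per outer iteration is the minimum communication this phase structure needs; the optimization discussed below is about not exceeding it.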
The compiler partitions the distributed arrays into smaller local arrays, and
the parallel constructs are sequentialized into do loops.
The compiler generates the appropriate communication primitives depending on the reference patterns of distributed arrays. For example, the statement
temp = ABS(a(:,k)) is transformed into a broadcast primitive, since the column a(:,k) resides on a single processor (the second dimension of a is block-partitioned) but is needed by all processors.
The program code shows how
the intrinsic function MAXLOC is translated into
the library routine MaxLoc_R_M. The suffix R specifies
the data type (real) and M
specifies that the MAXLOC intrinsic has an optional
mask array. Once again, the array information is passed to the run-time system
with the associated
data structure.
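A distributed MAXLOC is naturally implemented as a two-phase reduction: each processor finds the maximum and its global index over its own (optionally masked) chunk, and the partial results are then combined. The sketch below is a hypothetical illustration of that pattern; `maxloc_reduce` and its interface are not the actual Fortran 90D/HPF run-time routines.

```python
def maxloc_reduce(chunks, mask_chunks=None):
    """Two-phase MAXLOC reduction in the style of MaxLoc_R_M.

    `chunks` holds each processor's local piece of the distributed
    array; `mask_chunks` optionally holds the matching pieces of a
    logical mask. Illustrative sketch, not the real run-time API.
    """
    partials = []
    offset = 0                       # global index of the chunk's first element
    for p, chunk in enumerate(chunks):
        mask = mask_chunks[p] if mask_chunks else [True] * len(chunk)
        best = None
        # Local phase: each processor scans only its own chunk.
        for i, (v, m) in enumerate(zip(chunk, mask)):
            if m and (best is None or v > best[0]):
                best = (v, offset + i)
        if best is not None:
            partials.append(best)
        offset += len(chunk)
    # Combine phase: pick the overall maximum (a real system would do
    # this with a tree or all-reduce communication pattern).
    return max(partials)             # (value, global index)

val, idx = maxloc_reduce([[1.0, 5.0], [3.0, 9.0], [2.0]])
mval, midx = maxloc_reduce([[1.0, 5.0], [3.0, 9.0]],
                           [[True, True], [True, False]])
```

With the mask, the global maximum 9.0 is excluded and the reduction returns 5.0 at global index 1 instead.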
The Fortran 90D/HPF compiler
may perform several optimizations to reduce the
total cost of communication.
The compiler can generate better code by observing the following
from Figure :
15. temp = ABS(a(:,k))
17. maxNum = a(indxRow,k)
19. fac = a(:,k) / maxNum
The distributed array section a(:,k) is used at lines 15, 17 and 19.
Array a is not modified between lines 15 and 19.
Each statement causes a broadcast operation,
because the compiler performs statement-level code generation
for the above three
statements.
However, the compiler can eliminate two of the three communication calls
by performing the above dependency analysis.
It needs to generate only one broadcast, for line 15, which communicates
a column of array a.
Lines 17 and 19 can then use the same data.
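The effect of the optimization can be sketched by instrumenting the broadcast. In this hypothetical illustration, `broadcast` is a stand-in that merely records each call, and the two functions mimic statement-level versus dependence-aware code generation for lines 15, 17 and 19; none of these names belong to the actual run-time system.

```python
comm_log = []

def broadcast(column, tag):
    """Stand-in for the run-time broadcast primitive: records the call
    so the two code-generation strategies can be compared."""
    comm_log.append(tag)
    return list(column)

def pivot_unoptimized(col_k, indxRow):
    # Statement-level code generation: each use of a(:,k) triggers
    # its own broadcast, one per statement.
    temp = [abs(v) for v in broadcast(col_k, "line 15")]
    maxNum = broadcast(col_k, "line 17")[indxRow]
    fac = [v / maxNum for v in broadcast(col_k, "line 19")]
    return temp, maxNum, fac

def pivot_optimized(col_k, indxRow):
    # a(:,k) is unchanged between lines 15 and 19, so one broadcast
    # suffices and the later statements reuse the received column.
    col = broadcast(col_k, "line 15")
    temp = [abs(v) for v in col]
    maxNum = col[indxRow]
    fac = [v / maxNum for v in col]
    return temp, maxNum, fac

column = [2.0, -4.0, 1.0]
r1 = pivot_unoptimized(column, 1)
unopt_calls = len(comm_log)          # three broadcasts of the same column
comm_log.clear()
r2 = pivot_optimized(column, 1)
opt_calls = len(comm_log)            # a single broadcast
```

Both versions compute identical results; only the number of communication calls differs.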
The optimized code is shown in Figure .
We generated this program by hand since the optimizations
have not yet been implemented in the Fortran 90D/HPF compiler.
Currently our compiler performs statement-level
optimizations. It does not perform basic-block-level optimizations.
To validate the performance of the optimization on GE,
which is part of the Fortran D/HPF benchmark test suite [66],
we tested three codes on the iPSC/860 and plotted the results.
1-) The code given in Figure . This is shown as the dotted line in
Figure and represents the compiler-generated code.
2-) The code in Figure . This appears as the dashed line in Figure
and represents the hand-optimized version of the compiler-generated code, as discussed above.
3-) The hand-written GE in Fortran 77+MP.
This appears as the solid line in Figure .
This code was written outside the compiler group at NPAC so that the comparison is unbiased.
The programs were compiled using the Parasoft Express Fortran compiler, which calls the Portland Group if77 release 4.0 compiler with all optimizations turned on (-O4). We observe that the performance of the compiler-generated code is within 10% of the hand-written code. This is because the compiler-generated code produces extra communication calls that can be eliminated by optimization. The hand-optimized code gives performance close to that of the hand-written code. From this experiment we conclude that the Fortran 90D/HPF compiler, incorporated with node-level optimizations, can compete with hand-crafted code on some significant algorithms, such as GE.