
An Example Program for Optimization

We use Gaussian Elimination (GE) with partial pivoting as an example of translating a Fortran 90D/HPF program into a Fortran+MP program, to show the effectiveness of the communication elimination optimization. The Fortran 90D/HPF code is shown in Figure . Arrays a and row are partitioned by compiler directives. The second dimension of a is block-partitioned, while the first dimension is not partitioned. Array row is block-partitioned. This program illustrates how a programmer working in Fortran 90D/HPF may easily parallelize a program. Data parallelism is concisely represented by array operations, while sequential computation is expressed by do loops. More importantly, explicit communication primitives need not be added, since the program is written for a single address space in Fortran 90D/HPF.
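Since the figure with the source program is not reproduced here, a minimal sequential sketch of GE with partial pivoting may help fix the algorithm being parallelized. The C code below is illustrative only (the function name ge_solve and the loop structure are ours, not the compiler's); the Fortran 90D/HPF version expresses the column scan and the elimination update as array operations rather than explicit loops.

```c
#include <assert.h>
#include <math.h>

#define N 3

/* Sequential Gaussian Elimination with partial pivoting on an N x N
 * system Ax = b; a minimal reference for the algorithm that the
 * Fortran 90D/HPF program expresses with array operations. */
void ge_solve(double a[N][N], double b[N], double x[N])
{
    for (int k = 0; k < N; k++) {
        /* Partial pivoting: pick the row with the largest |a(i,k)|
         * (the MAXLOC step in the Fortran 90D/HPF code). */
        int p = k;
        for (int i = k + 1; i < N; i++)
            if (fabs(a[i][k]) > fabs(a[p][k])) p = i;
        for (int j = 0; j < N; j++) {   /* swap rows k and p */
            double t = a[k][j]; a[k][j] = a[p][j]; a[p][j] = t;
        }
        double t = b[k]; b[k] = b[p]; b[p] = t;

        /* Elimination: the data-parallel update of the trailing rows. */
        for (int i = k + 1; i < N; i++) {
            double fac = a[i][k] / a[k][k];
            for (int j = k; j < N; j++) a[i][j] -= fac * a[k][j];
            b[i] -= fac * b[k];
        }
    }
    /* Back substitution. */
    for (int i = N - 1; i >= 0; i--) {
        x[i] = b[i];
        for (int j = i + 1; j < N; j++) x[i] -= a[i][j] * x[j];
        x[i] /= a[i][i];
    }
}
```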

Figure shows how the Fortran 90D/HPF compiler translates the GE code into Fortran 77+MP form. It is easy to see that the generated code is structured as alternating phases of local computation and global communication. Local computation consists of operations by each processor on the data in its own memory. Global communication includes any transfer of data among processors. The compiler partitions the distributed arrays into local blocks, and the parallel constructs are sequentialized into do loops.
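The partitioning step relies on a mapping from global to local indices. For a one-dimensional BLOCK distribution, the mapping can be sketched as below; the function name block_map and the struct are illustrative, not the compiler's actual runtime interface.

```c
#include <assert.h>

/* BLOCK distribution of a dimension of global extent n over p
 * processors: the owner of global index g and the corresponding
 * local index.  A sketch of the mapping implied when the compiler
 * "partitions the distributed arrays into local blocks". */
typedef struct { int owner; int local; } block_loc;

block_loc block_map(int g, int n, int p)
{
    int blk = (n + p - 1) / p;          /* ceiling(n/p) elements per processor */
    block_loc loc = { g / blk, g % blk };
    return loc;
}
```

With n = 100 elements over p = 4 processors, each processor owns a contiguous block of 25 elements, and each global index maps to exactly one (owner, local) pair.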

The compiler generates the appropriate communication primitives depending on the reference patterns of distributed arrays. For example, the statement:


      temp = ABS(a(:,k))
is transformed into a broadcast primitive since array a is distributed in its second dimension. All run-time routines are classified according to data type. For example, the suffix R specifies that the data type of the communication is real, and the suffix V specifies that it is vector communication. The primitive set_DAD is used to fill the Distributed Array Descriptor (DAD) associated with array a so that it can be passed to the broadcast communication primitive at run time. The DAD has sufficient information for the communication primitives to compute all the necessary information, including local lower and upper bounds, distributions, and local and global shapes. In this way, the communication routines also have the option of combining messages with the same source and destination into a single message to reduce communication overhead. This is a typical characteristic of our compiler: since we parallelize only array assignments and statements in Fortran 90D/HPF, there is no data dependency between different iterations. Thus, all the required communication can be performed before or after the execution of the loop on each of the processors involved.
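The shape of a DAD can be sketched as a small descriptor structure. The field names and layout below are our assumptions for illustration; the text specifies only what information the descriptor must carry, not its actual definition.

```c
#include <assert.h>

/* A sketch of a per-dimension Distributed Array Descriptor (DAD)
 * entry, holding the information the communication primitives need.
 * Field and type names are illustrative assumptions, not the
 * Fortran 90D/HPF compiler's actual definitions. */
typedef struct {
    int global_lb, global_ub;   /* global bounds of this dimension   */
    int local_lb, local_ub;     /* this processor's local bounds     */
    int dist;                   /* e.g. 0 = not distributed, 1 = BLOCK */
} dad_dim;

typedef struct {
    int rank;                   /* number of array dimensions         */
    dad_dim dim[7];             /* Fortran allows up to 7 dimensions  */
} dad;

/* Number of elements this processor owns in one dimension. */
int local_extent(const dad_dim *d)
{
    return d->local_ub - d->local_lb + 1;
}
```

A run-time primitive handed such a descriptor can derive local loop bounds and message sizes without any further compiler-inserted bookkeeping.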

The program code shows how the intrinsic function MAXLOC is translated into the library routine MaxLoc_R_M. The suffix R specifies the data type, and the suffix M specifies that the MAXLOC intrinsic has an optional mask array. Once again, the array information is passed to the run-time system with the associated data structure.
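A distributed MAXLOC is naturally a two-phase reduction: each processor computes the (value, global index) pair for its local block, and the pairs are then combined across processors. The sketch below simulates the combining step with a loop; the names local_maxloc and combine are ours, and the actual MaxLoc_R_M routine additionally takes the mask argument.

```c
#include <assert.h>

/* Two-phase reduction sketch for a distributed MAXLOC.  Each
 * processor reduces its local block to a (value, global index) pair;
 * the pairs are then combined pairwise.  Names are illustrative. */
typedef struct { double val; int idx; } maxloc_pair;

maxloc_pair local_maxloc(const double *x, int n, int global_base)
{
    maxloc_pair best = { x[0], global_base };
    for (int i = 1; i < n; i++)
        if (x[i] > best.val) { best.val = x[i]; best.idx = global_base + i; }
    return best;
}

/* Combining operator: keep the larger value (ties: lower index). */
maxloc_pair combine(maxloc_pair a, maxloc_pair b)
{
    return (a.val >= b.val) ? a : b;
}
```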

The Fortran 90D/HPF compiler may perform several optimizations to reduce the total cost of communication. The compiler can generate better code by observing the following from Figure :


      15.          Tmp = ABS(a(:,k))
      17.          maxNum = a(indxRow,k)
      19.          fac = a(:,k) / maxNum

The distributed array section a(:,k) is used at lines 15, 17, and 19, and array a is not changed between lines 15 and 19. Nevertheless, each statement causes a broadcast operation, because the compiler performs statement-level code generation for the three statements. The compiler can eliminate two of the three communication calls by performing this dependency analysis: it needs to generate only one broadcast, at line 15, which communicates a column of array a; lines 17 and 19 can then use the same data. The optimized code is shown in Figure . We generated this program by hand, since these optimizations have not yet been implemented in the Fortran 90D/HPF compiler. Currently our compiler performs statement-level optimizations; it does not perform basic-block level optimizations.
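The effect of the optimization can be sketched by counting broadcasts. In the simulation below, broadcast_column is a stand-in for the compiler's run-time broadcast primitive (its real name and interface are not shown here): statement-level code generation issues three broadcasts, while the dependency-analysis version issues one and reuses the buffer.

```c
#include <assert.h>

#define NCOL 4
static int broadcast_count = 0;

/* Stand-in for the run-time broadcast primitive; the actual routine's
 * name and interface are assumptions for this sketch. */
static void broadcast_column(const double *col, double *buf, int n)
{
    broadcast_count++;
    for (int i = 0; i < n; i++) buf[i] = col[i];
}

/* Statement-level code generation: each statement that reads a(:,k)
 * triggers its own broadcast of the column. */
static void unoptimized(const double *col)
{
    double buf[NCOL];
    broadcast_column(col, buf, NCOL);   /* 15. Tmp = ABS(a(:,k))     */
    broadcast_column(col, buf, NCOL);   /* 17. maxNum = a(indxRow,k) */
    broadcast_column(col, buf, NCOL);   /* 19. fac = a(:,k) / maxNum */
}

/* After dependency analysis: a is unchanged between lines 15 and 19,
 * so one broadcast suffices and lines 17 and 19 reuse the buffer. */
static void optimized(const double *col)
{
    double buf[NCOL];
    broadcast_column(col, buf, NCOL);   /* single broadcast, line 15 */
    (void)buf;                          /* lines 17 and 19 read buf  */
}
```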

To validate the performance of this optimization on GE, which is part of the Fortran D/HPF benchmark test suite [66], we tested three codes on the iPSC/860 and plotted the results. 1) The code given in Figure , shown as the dotted line in Figure , representing the compiler-generated code. 2) The code in Figure , which appears as the dashed line in Figure and represents the hand-optimized version of the compiler-generated code, as discussed above. 3) The hand-written GE in Fortran 77+MP, which appears as the solid line in Figure . This code was written outside the compiler group at NPAC to avoid bias.

The programs were compiled using the Parasoft Express Fortran compiler, which calls the Portland Group if77 release 4.0 compiler with all optimizations turned on (-O4). We can observe that the performance of the compiler-generated code is within 10% of the hand-written code; the gap is due to the extra communication calls in the compiler-generated code, which can be eliminated using the optimizations. The hand-optimized code gives performance close to that of the hand-written code. From this experiment we conclude that the Fortran 90D/HPF compiler, incorporated with node-level optimizations, can compete with hand-crafted code on some significant algorithms, such as GE.





zbozkus@
Thu Jul 6 21:09:19 EDT 1995