Next: Why Use Runtime Up: Communication Model Previous: Communication Model

Computation Partitioning

Once data is distributed, there are several alternatives to assign computations to processing elements for each instance of a forall statement. One of the most common methods is to use the owner computes rule. Using the owner computes rule, the computation is assigned to the processor owning the lhs data element. This rule is easy to implement and performs well in a large number of cases. Most the current implementations of parallelizing compilers uses the owner computes rule [53][19]. However, it may not be possible to apply the owner computes rule for every case without extensive overhead.

Figure shows the possible data and iteration distributions for the assignment caused by iteration instance . Cases 1 and 2 illustrate the order of communication and computation arising from the owner computes rule. Essentially, all the communications to fetch the off-processor data required to execute an iteration instance are performed before the computation is performed. The generated code will have the following communication and computation order.


       Communications  ! some global communication primitives
       Computation     ! local computation

The following examples describe how Fortran 90D/HPF performs computation partitioning.

Example 1 (canonical form) Consider the following statement, taken from the Jacobi relaxation program:


      forall (i=1:N, j=1:N)
     &   B(i,j) = 0.25*(A(i-1,j)+A(i+1,j)+A(i,j-1)+A(i,j+1))

In the above example, as in a large number of scientific computations, the forall statement can be written in the canonical form. In this form, the subscript value in the lhs is identical to the forall iteration variable. In such cases, the iterations can be easily distributed using the owner computes rule. Furthermore, it is also simpler to detect structured communication by using this form ( This will be considered further in Section .).

Example 2 (non-canonical form) Consider the following statement, taken from an FFT program:


      forall (i=1:incrm, j=1:nx/2)
     &   x(i+j*incrm*2+incrm) = x(i+j*incrm*2) - term2(i+j*incrm*2+incrm)

The lhs array index is not in the canonical form. In this case, the compiler equally distributes the iteration space on the number of processors on which the array is distributed. Hence, the total number of iterations will still be the same as the number of array elements being assigned. However, this type of forall statement results in either Case 3 or Case 4 in Figure 2. The generated code will be in the following order:


       Communications  ! some global communication primitives to read 
       Computation     ! local computation
       Communication   ! a communication primitive to write

For reasonably simple expressions, the compiler transforms such index expressions into the canonical form by performing symbolic expression operations [54]. However, it may not always be possible to perform such transformations for complex expressions.

Example 3 (vector-valued index) Consider the statement


      forall (i=1:N) A(U(i)) = B(V(i)) + C(i)

The iteration causes an assignment to element , where may only be known at run-time. Therefore, if iterations are statically assigned at compile time to various PEs, iteration is likely to be assigned to a PE other than the one owning . This is also illustrated in cases 3 and 4 of Figure . In this case, the compiler distributes the computation with respect to the owner of .

Having presented the computation partitioning alternatives for various reference patterns of arrays on the , we now present a primitive to perform global to local transformations for loop bounds.


        ! computes local lb, ub, st from global ones	
        set_BOUND(llb,lub,lst,glb,gub,gst,DIST,dim)

The set_BOUND primitive takes a global computation range with global lower bound, upper bound and stride. It distributes this global range statically among the group of processors specified by the dim parameter on the logical processor dimension. The DIST parameter gives the distribution attribute such as block or cyclic. The set_BOUND primitive computes and returns the local computation range in local lower bound, local upper bound and local stride for each processor. The algorithm to implement this primitive can be found in [50].

The other functionality of the set_BOUND primitive is to mask inactive processors by returning appropriate local bounds. For example, such a case may arise when the global bounds do not specify the entire range of the array. If the inputs for this primitive are compile-time constants, the compiler can calculate the local bounds at compile-time.

In summary, our computation and data distributions have two implications.



Next: Why Use Runtime Up: Communication Model Previous: Communication Model


zbozkus@
Thu Jul 6 21:09:19 EDT 1995