Once data is distributed, there are several alternatives to assign computations to processing elements for each instance of a forall statement. One of the most common methods is to use the owner computes rule. Using the owner computes rule, the computation is assigned to the processor owning the lhs data element. This rule is easy to implement and performs well in a large number of cases. Most the current implementations of parallelizing compilers uses the owner computes rule [53][19]. However, it may not be possible to apply the owner computes rule for every case without extensive overhead.
Figure shows the possible data and iteration distributions for
the
assignment caused by iteration instance
.
Cases 1 and 2 illustrate the order of communication and computation
arising from the owner computes rule.
Essentially, all the communications to fetch the off-processor data required
to execute an iteration instance are performed
before the computation is performed.
The generated code will have the following communication and computation order.
Communications ! some global communication primitives
Computation ! local computation
The following examples describe how Fortran 90D/HPF performs computation partitioning.
Example 1 (canonical form) Consider the following statement, taken from the Jacobi relaxation program:
forall (i=1:N, j=1:N)
& B(i,j) = 0.25*(A(i-1,j)+A(i+1,j)+A(i,j-1)+A(i,j+1))
In the above example, as in a large number of
scientific computations,
the forall statement can be written in the canonical form.
In this form, the subscript value in the lhs is identical to the
forall iteration variable.
In such cases, the iterations can be easily distributed using
the owner computes rule. Furthermore, it is also simpler to detect
structured communication by using this form
( This will be considered further in Section .).
Example 2 (non-canonical form) Consider the following statement, taken from an FFT program:
forall (i=1:incrm, j=1:nx/2)
& x(i+j*incrm*2+incrm) = x(i+j*incrm*2) - term2(i+j*incrm*2+incrm)
The lhs array index is not in the canonical form.
In this case, the compiler equally distributes the iteration space
on the number of processors on which the array is distributed.
Hence, the total number of iterations will still be the same as the number of
array elements being assigned.
However, this type of forall statement results in either Case 3 or Case 4
in Figure 2.
The generated code will be in the following order:
Communications ! some global communication primitives to read
Computation ! local computation
Communication ! a communication primitive to write
For reasonably simple expressions, the compiler transforms such index expressions into the canonical form by performing symbolic expression operations [54]. However, it may not always be possible to perform such transformations for complex expressions.
Example 3 (vector-valued index) Consider the statement
forall (i=1:N) A(U(i)) = B(V(i)) + C(i)
The iteration causes an assignment to element
,
where
may only be known at run-time.
Therefore, if iterations are statically assigned at compile time to various
PEs, iteration
is likely to be assigned to a PE other than
the one owning
.
This is also illustrated in cases 3 and 4 of Figure
.
In this case, the compiler distributes the computation
with respect to the owner of
.
Having presented the computation partitioning alternatives for various
reference patterns of arrays on the
, we now present a primitive to perform global to local transformations
for loop bounds.
! computes local lb, ub, st from global ones
set_BOUND(llb,lub,lst,glb,gub,gst,DIST,dim)
The set_BOUND primitive takes a global computation range with global lower bound, upper bound and stride. It distributes this global range statically among the group of processors specified by the dim parameter on the logical processor dimension. The DIST parameter gives the distribution attribute such as block or cyclic. The set_BOUND primitive computes and returns the local computation range in local lower bound, local upper bound and local stride for each processor. The algorithm to implement this primitive can be found in [50].
The other functionality of the set_BOUND primitive is to mask
inactive processors
by returning appropriate local bounds. For example, such a case may arise
when the global bounds do not specify the entire range of the array.
If the inputs for this primitive are compile-time constants,
the compiler can calculate the local bounds at compile-time.
In summary, our computation and data distributions have two implications.