The class LocBlocksIndex is a superclass of Index that provides members to enumerate local blocks of a range, rather than individual local elements. Library functions return base addresses of the array sections associated with these blocks. Elements within the block are then enumerated with a simple, efficient for loop, computing offsets from base addresses using linear expressions. The performance-critical inner loops can be compiled with high efficiency. If the block size is large enough, most of the cost of the library calls is amortized.
LocBlocksIndex is a subclass of Location and of Block. The outer level in the translation of a overall construct now looks something like
LocBlocksIndex i(x) ; apgStack.push(&apg) ; apg.restrict(i.dim) ; for(i.beginLocBlk() ; i.test() ; i.nextLocBlk()) { ... deal with block `i' } apgStack.pop(&apg) ;
In this translation scheme we have an outer loop enumerating the locally held blocks of the range. In general this loop is needed because Adlib supports higher-level distribution formats like block-cyclic. These allow multiple blocks of a single range to reside on the same processor. In a context where it is known in advance that the local process holds a single block (for example, if the range involved has level 1), the translation scheme given in the next section allows further optimizations.
In general code to ``deal with block i'', takes the form
... precompute some bases and increments for block for(int l = 0 ; l < i.count ; l++) { ... }
Before filling the details of this code, we need to put the source code into a normalized form. The body of a overall construct parametrized by an index i may use i in several contexts:
... a(i, j) ...
can be replaced with
float* a_dat = a.dat() ; ... a_dat [a.map(0).offset(i) + a.map(1).offset(j)] ...
Of course the inquiries dat() and map() can be lifted outside any loop. Similarly, the section construction
... a.sect(i, y) ...
can be replaced with
float* a_dat = a.dat() ; ... Section1<float>(y, a.grp() / i, a.map(1), a_dat + a.map(0).offset(i)) ...
By applying these transformations, and replacing expression of the form y.idx(i) by linear expressions in the template global subscript i.tem, the set of uses of i inside the loop can be reduced to four cases
Figure 7.1: Translation of offset computation.
Figure 7.2: Translation of global subscript computation.
Figure 7.3: Translation for computation of group
restriction.
Array2<float> a(x, y) ; Array1<float> b(y) ; ... Index i(x), j(y) ; OVERALL(i) { OVERALL(j) { a(i, j) = 2 * b(j) + i ; } ALLOVER(j) ; } ALLOVER(i) ;
It can be normalized to the form
Array2<float> a(x, y) ; Array1<float> b(y) ; ... float* a_dat = a.dat() ; float* b_dat = b.dat() ; Index i(x), j(y) ; OVERALL(i) { OVERALL(j) { a_dat [a.map(0).offset(i) + a.map(1).offset(j)] = b_dat [b.map(0).offset(j)] + x.idx(i) ; } ALLOVER(j) ; } ALLOVER(i) ;
A translation of the loop nest is given in figure 7.4.
Figure 7.4: Translation of example.
Note that in this example the manipulations of apg could have been omitted, because there are no collective operations inside the loop that depend on the state of apg.
As a final straightforward optimization, when we have perfectly nested overall constructs, the loop nesting can be changed to put all intra-block loops innermost. In that case the inner loops become
for(int l = 0 ; l < i.count ; l++) { for(int m = 0 ; m < j.count ; m++) { a_dat [a_off0_bas + a_off0_stp * l + a_off1_bas + a_off1_stp * m] = b_dat [b_off0_bas + b_off0_stp * m] + i.glb_bas + i.glb_stp * l ; } }
All subscript expressions are linear in the loop induction variables, and we expect very good code generation from these loops.
The translation scheme described in this section is summarized in figure 7.5.
Figure 7.5: Summary of
LocBlocksIndex-based translation scheme for overall construct.