Translation of overall construct using LocBlocksIndex

Next: Translation using the kernel Up: class LocBlocksIndex Previous: Methods

Translation of overall construct using `LocBlocksIndex`

The class LocBlocksIndex is a superclass of Index that provides members to enumerate local blocks of a range, rather than individual local elements. Library functions return base addresses of the array sections associated with these blocks. Elements within the block are then enumerated with a simple, efficient for loop, computing offsets from base addresses using linear expressions. The performance-critical inner loops can be compiled with high efficiency. If the block size is large enough, most of the cost of the library calls is amortized.

LocBlocksIndex is a subclass of Location and of Block. The outer level in the translation of a overall construct now looks something like

  LocBlocksIndex i(x) ;

  apgStack.push(&apg) ;
  apg.restrict(i.dim) ;
  for(i.beginLocBlk() ; i.test() ; i.nextLocBlk()) {
    ... deal with block `i'
  }
  apgStack.pop(&apg) ;

In this translation scheme we have an outer loop enumerating the locally held blocks of the range. In general this loop is needed because Adlib supports higher-level distribution formats like block-cyclic. These allow multiple blocks of a single range to reside on the same processor. In a context where it is known in advance that the local process holds a single block (for example, if the range involved has level 1), the translation scheme given in the next section allows further optimizations.

In general code to ``deal with block i'', takes the form

    ... precompute some bases and increments for block

    for(int l = 0 ; l < i.count ; l++) {
      ...
    }

Before filling the details of this code, we need to put the source code into a normalized form. The body of a overall construct parametrized by an index i may use i in several contexts:

As a subscript in a local subscripting operation.
In an expression such as x.idx(i) used to obtain the global subscript of the current iteration relative to the index range.
In an expression such as y.idx(i) used to obtain the global subscript of the current iteration relative to some super-range of the index range.
As a scalar subscript in a section subscripting operation.
In a group restriction operation of the form p / i, where p is some group.

As described in chapter 4, subscripting operations on arrays can be replaced with lower-level operations on the Map and Group classes. For example, if a is an array of float, the reference

  ... a(i, j) ...

can be replaced with

  float* a_dat  = a.dat() ;
  
  ... a_dat [a.map(0).offset(i) + a.map(1).offset(j)] ...

Of course the inquiries dat() and map() can be lifted outside any loop. Similarly, the section construction

  ... a.sect(i, y) ...

can be replaced with

  float* a_dat  = a.dat() ;

  ... Section1<float>(y, a.grp() / i,
                      a.map(1), a_dat + a.map(0).offset(i)) ...

By applying these transformations, and replacing expression of the form y.idx(i) by linear expressions in the template global subscript i.tem, the set of uses of i inside the loop can be reduced to four cases

As an argument of Map :: offset.
In x.idx(i), yielding the global subscript.
In i.tem, yielding the template global subscript.
In a group restriction operation of the form p / i.

We will deal with these cases in turn. [Replace follwing enumeration and associated figures with straight descriptive text.]

Figure 7.1: Translation of offset computation.

Figure 7.2: Translation of global subscript computation.

Figure 7.3: Translation for computation of group restriction.

: Offset computation. Within a particular block the expression u.offset(i) can be rewritten as a linear function of the inner induction variable. The base for this function is given by the value of offset for the first Location of the range, which is the Location component of i. Its increment is returned by the member Map :: step. Figure 7.1 illustrates the translation. The call to offset has been removed from the inner loop. A good compiler will generate very efficient code for the linear expression that replaces it.
: Global subscript computation is slightly simpler. The base and increment for this expression are already contained in the Block component of i, computed by its iterator members. Figure 7.2 illustrates the translation.
: Group restriction is translated trivially, as illustrated in figure 7.3.

Now, consider this example from the preamble to this chapter:

  Array2<float> a(x, y) ;
  Array1<float> b(y) ;
  ...
  Index i(x), j(y) ;
  OVERALL(i) {
    OVERALL(j) {
      a(i, j) = 2 * b(j) + i ;
    } ALLOVER(j) ;
  } ALLOVER(i) ;

It can be normalized to the form

  Array2<float> a(x, y) ;
  Array1<float> b(y) ;
  ...
  float* a_dat = a.dat() ;

  float* b_dat = b.dat() ;

  Index i(x), j(y) ;
  OVERALL(i) {
    OVERALL(j) {
      a_dat [a.map(0).offset(i) + a.map(1).offset(j)] =
          b_dat [b.map(0).offset(j)] + x.idx(i) ;
    } ALLOVER(j) ;
  } ALLOVER(i) ;

A translation of the loop nest is given in figure 7.4.

Figure 7.4: Translation of example.

Note that in this example the manipulations of apg could have been omitted, because there are no collective operations inside the loop that depend on the state of apg.

As a final straightforward optimization, when we have perfectly nested overall constructs, the loop nesting can be changed to put all intra-block loops innermost. In that case the inner loops become

      for(int l = 0 ; l < i.count ; l++) {
        for(int m = 0 ; m < j.count ; m++) {
          a_dat [a_off0_bas + a_off0_stp * l +
                 a_off1_bas + a_off1_stp * m] =
                    b_dat [b_off0_bas + b_off0_stp * m] +
                    i.glb_bas + i.glb_stp * l ;
        }
      }

All subscript expressions are linear in the loop induction variables, and we expect very good code generation from these loops.

The translation scheme described in this section is summarized in figure 7.5.

Figure 7.5: Summary of LocBlocksIndex-based translation scheme for overall construct.

Next: Translation using the kernel Up: class LocBlocksIndex Previous: Methods

Guansong Zhang
Fri Oct 9 12:29:23 EDT 1998