The program for the Fox algorithm is completed by the definition of matmul. First in HPJava:
void matmul (float[[*,*]] c, float[[*,*]] b, float[[*,*]] c) { for (int i=0; i<B; i++) for (int j=0; j<B; j++) for (int k=0; k<B; k++) c[i,j]+=a[i,k]*b[k,j]; }
Translated naively to the adJava interface, this becomes:
public static void matmul(Section2ccF c, Section2ccF a, Section2ccF b) { for (int i=0; i<B; i++) for (int j=0; j<B; j++) for (int k=0; k<B; k++) c.dat()[c.pos(i, j)] += a.dat()[a.pos(i, k)] * b.dat()[b.pos(k, j)]; }
The methods dat and pos were introduced earlier.
It is clear that the segment of code above will have very poor run-time performance, because it involves many method invocations for each array element access. Because the array data is actually stored in a certain regularly strided section of a Java array, these calls are not really necessary. All that is needed is to find the address of the first array element, then write the other addresses as a linear expression in the loop variable and this initial value. The code above can be rewritten in the form given in figure 6. This optimization again exposes various low-level functions in the runtime--we omit details (see [3]). The effect is to compute the parameters of the linear expressions for the local address offsets. This allows inlining of the element calls. In this case the resulting expressions are linear in the induction variables of the for loops. If necessary the multiplications can be eliminated by standard compiler optimizations.
Figure 6: Optimized translation of matmul
This segment of Java code will certainly run much faster. The drawback is that, compared with the first Java procedure, the optimized code is less readable. This is a simple example of the need for compiler intervention if the HPJava style of programming is to be made acceptable. Similar and more compelling examples arise in optimization of the overall construct. As described in [15] and illustrated in the example of the last section, a trivial implementation of the general overall construct is by a for loop surrounding an at construct. More sensibly, all the machines across a process dimension should simultaneously execute the body for all locally held locations in the relevant distributed range. Computation of the local offset of the array element can again be reduced to a linear expression in a loop variable instead of a function call.