Recall that if A and B are square matrix of order n, then
C=AB is also a square matrix of order n, and is obtained
by taking the dot product of the ith row of A and jth column of
B. That is,
Fox's algorithm [3] for multiplication organize A, B and C into submatrix on a P by P process array. It take P steps, in each step, broadcast corresponding submatrix of A on each row of the processes, do local computation and then shift array B for the next step computation. Write this algorithm in HPJava, we still use Adlib.remap to broadcast submatrix, matmul is a subroutine for local matrix multiplication. Adlib.shift is used to shift array B, and Adlib.copy copies data back after shift, it can also be implemented as nested over and for loops.
Group p=new Procs2(P,P); Range x=p.dim(0); Range y=p.dim(1); on(p) { float [[#,#, , ]] a = new float [[x,y,B,B]]; float [[#,#, , ]] b = new float [[x,y,B,B]]; //input a, b here; float [[#,#, , ]] c = new float [[x,y,B,B]]; float [[#,#, , ]] temp = new float [[x,y,B,B]]; for (int k=0; k<p; k++) { over(Location i=x|:) { float [[,]] sub = new float [[B,B]]; //Broadcast submatrix in 'a' ... Adlib.remap(sub, a[[i, (i+k)%P, z, z]]); over(Location j=y|:) { //Local matrix multiplication ... matmul(c[[i, j, z, z]], sub, b[[i, j, z, z]]); } } //Cyclic shift 'b' in 'y' dimension ... Adlib.shift(tmp, b, 1, 0, 0); // dst, src, shift, dim, mode; Adlib.copy(b, tmp); } }