One disadvantage of the program in Figure 3.15 is that it allocates two very large temporary arrays, ta and tb. Because these are both replicated in one dimension, they can easily consume more memory than the original data. This problem can be solved by only storing copies of elements in restricted bands of the original matrices at any one time. Figure 4.5 gives a modified algorithm where the maximum band width is B.
For simplicity we assumed here that B is a compile-time constant. Alternatively we can compute this value dynamically. The volume() method on Range is used internally by array constructors to control allocation of memory for array elements. It defines the largest block of locations for the current range held by any processor. Hence an upper bound on the number of elements held by any processor for ta and tb combined is
B * x.volume() + B * y.volume()If MAX_TEMPORARY_SIZE is a constant defining a limit on the total volume of memory we ever wish to allocate for temporary arrays, a suitable formula for B might be
B = MAX_TEMPORARY_SIZE / (x.volume() + y.volume())With a few refinements like this, the algorithm of Figure 4.5 is a credible basis for a library matrix multiplication routine, applicable to generic distributed arrays.