Suppose we want to multiply the matrices A and B together to form matrix |
C: |
We will assume all matrices are square - the algorithm can be generalized to deal with rectangular matrices. |
The input matrices, A and B, are decomposed into rectangular sub-blocks. |
If we have N processors we have rows and columns of sub-blocks. This means that N should be a perfect square. (Of course this condition can be relaxed -- it is used for simplicity of discussion) |
One sub-block is assigned to each processor in a grid corresponding to the sub-block structure. |
The algorithm ensures that the output matrix C has the same decomposition as A and B. |
If A B and C are M by M with M2 = N m2 and thus M = m |
Then each processor stores an m by m subblock |
Processors are labelled by a two vector (l,k) l,k = 1 ... |