1 | Suppose we want to multiply the matrices A and B together to form matrix |
2 | C: |
3 | We will assume all matrices are square - the algorithm can be generalized to deal with rectangular matrices. |
4 | The input matrices, A and B, are decomposed into rectangular sub-blocks. |
5 | If we have N processors we have rows and columns of sub-blocks. This means that N should be a perfect square. (Of course this condition can be relaxed -- it is used for simplicity of discussion) |
6 | One sub-block is assigned to each processor in a grid corresponding to the sub-block structure. |
7 | The algorithm ensures that the output matrix C has the same decomposition as A and B. |
8 | If A B and C are M by M with M2 = N m2 and thus M = m |
9 | Then each processor stores an m by m subblock |
10 | Processors are labelled by a two vector (l,k) l,k = 1 ... |