We must choose a decomposition of the matrix A that stays load balanced throughout the algorithm and that minimizes communication.
-
Contiguous blocks of rows or columns?
-
This won't work, because it is not load balanced. Once the processing of a block of rows or columns is complete, the processor that owns that block has nothing left to do. For example, in a decomposition by blocks of rows, once the computational window reaches the point shaded in the diagram, processor 0 is idle for the rest of the algorithm.
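The imbalance can be made concrete with a small simulation. The following is only a sketch under stated assumptions: it assumes an elimination-style algorithm in which, at step k, the computational window is the set of rows k+1 .. n-1, and it assumes the n rows are dealt out to p processors in contiguous blocks of ceil(n/p) rows. The function names are hypothetical and chosen for illustration.

```python
def block_row_owner(row, n, p):
    """Processor owning `row` when n rows are split into p contiguous blocks."""
    block = -(-n // p)  # ceil(n / p) rows per processor
    return row // block


def active_rows_per_processor(n, p, k):
    """Count the rows still inside the window (rows k+1 .. n-1) on each processor.

    Assumption: at step k the window is the trailing set of rows below row k.
    """
    counts = [0] * p
    for row in range(k + 1, n):
        counts[block_row_owner(row, n, p)] += 1
    return counts


def idle_steps(n, p):
    """For each processor, count the steps (out of n-1) in which it owns no active row."""
    idle = [0] * p
    for k in range(n - 1):
        counts = active_rows_per_processor(n, p, k)
        for proc in range(p):
            if counts[proc] == 0:
                idle[proc] += 1
    return idle


if __name__ == "__main__":
    n, p = 8, 4
    for k in range(n - 1):
        print(k, active_rows_per_processor(n, p, k))
    print("idle steps:", idle_steps(n, p))
```

With n = 8 and p = 4, processor 0 owns rows 0 and 1, so it has no active rows from step 1 onward and is idle for 6 of the 7 steps, while processor 3 is never idle.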