1 | Many processors are completely idle, while the last column of processors still has the same running time. |
2 | A better decomposition is block by rows: after iterating over i columns, every processor is still working on n-i+1 elements in each of its rows. This gives better load balance and a better running time, because the time of the longest-running processor shrinks with each iteration. |
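
The comparison above can be checked with a small counting sketch. It assumes a simplified model that the notes do not spell out: an n x n matrix, p processors with p dividing n, and an iteration i (1-based) in which only columns i..n are still active while all rows stay active. The function names `column_block_work` and `row_block_work` are hypothetical and chosen only for illustration.

```python
def column_block_work(n, p, i):
    """Active elements per processor when each owns n/p contiguous columns.

    Assumed model: at iteration i (1-based), only columns i..n are active.
    """
    width = n // p  # columns per processor (assumes p divides n)
    work = []
    for k in range(p):
        first = k * width + 1      # first owned column, 1-based
        last = (k + 1) * width     # last owned column, 1-based
        active_cols = max(0, last - max(first, i) + 1)
        work.append(active_cols * n)  # each active column holds n elements
    return work


def row_block_work(n, p, i):
    """Active elements per processor when each owns n/p contiguous rows.

    Every owned row still has n - i + 1 active elements at iteration i.
    """
    rows = n // p  # rows per processor (assumes p divides n)
    return [rows * (n - i + 1)] * p


if __name__ == "__main__":
    n, p, i = 8, 4, 5
    print("block by columns:", column_block_work(n, p, i))
    print("block by rows:   ", row_block_work(n, p, i))
```

With n = 8, p = 4, and i = 5, the column decomposition gives `[0, 0, 16, 16]`: two processors are idle and the last one still holds its full 16 elements. The row decomposition gives `[8, 8, 8, 8]`: the total work is the same, but the busiest processor has half as much to do, and that maximum keeps shrinking as i grows.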