A significant advantage of coding in Fortran 90D/HPF is the ability to specify different distribution directives and measure performance differences without extensive recoding. The block distribution strategy, which allocates contiguous elements to a processor, is ideal for computations that reference adjacent elements along an axis, as is the case in many relaxation methods [67]. The number of references to non-local variables for a given number of local variables is minimized when the volume-to-surface ratio is maximized. However, block distribution may result in poor load balance. Some experimentation along these lines was performed on the Gauss benchmark (gauss), a program that measures the performance of a Gaussian elimination algorithm.
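The difference between the two mappings can be made concrete with a small sketch (illustrative only, not from the benchmark; the array size and processor count are assumed values) of how HPF's BLOCK and CYCLIC distributions assign a one-dimensional index space to processors:

```python
# Illustrative sketch of HPF-style ownership mappings.
# n = number of array elements, p = number of processors (assumed values).

def block_owner(i, n, p):
    """Processor owning index i under a BLOCK distribution."""
    b = (n + p - 1) // p          # block size = ceil(n / p)
    return i // b

def cyclic_owner(i, p):
    """Processor owning index i under a CYCLIC distribution."""
    return i % p

n, p = 16, 4
print([block_owner(i, n, p) for i in range(n)])
# -> [0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3]
print([cyclic_owner(i, p) for i in range(n)])
# -> [0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3]
```

BLOCK keeps neighboring indices on the same processor (good for nearest-neighbor stencils), while CYCLIC scatters them round-robin (good when the active region shrinks over time, as in elimination algorithms).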
The figure gives the main factorization loop of gauss, which converts the matrix to upper triangular form.
This Gaussian elimination algorithm is sub-optimal because a mask in the inner loop prevents vectorization.
Figure (a) shows the updated values of the matrix in the shaded region after the factorization loop.
Since the compiler uses the owner computes rule to assign computations,
only owners of data in the shaded region will participate in the computation.
The remaining processors are masked out of the computation. Figures (b) and (c) show the computation distribution on four processors in block and cyclic fashion, respectively. The x-axis shows how the data is distributed over the four-processor grid.
In this particular benchmark,
cyclic distribution results in better load balancing than block distribution.
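The load imbalance can be quantified with a small count (illustrative values: a 16-column matrix on 4 processors, examined at elimination step 8, none of which are figures from the benchmark): under BLOCK, processors owning only already-eliminated columns become idle, while CYCLIC keeps every processor working.

```python
# Count active (still-updated) columns per processor at elimination step k.
# n, p, k are assumed illustrative values, not benchmark parameters.

def active_cols(owner, n, p, k):
    """Columns j >= k still being updated, tallied by owning processor."""
    counts = [0] * p
    for j in range(k, n):
        counts[owner(j)] += 1
    return counts

n, p, k = 16, 4, 8
blk = lambda j: j // (n // p)        # BLOCK owner of column j
cyc = lambda j: j % p                # CYCLIC owner of column j

print(active_cols(blk, n, p, k))     # -> [0, 0, 4, 4]  (two processors idle)
print(active_cols(cyc, n, p, k))     # -> [2, 2, 2, 2]  (evenly balanced)
```

As the elimination front advances, the BLOCK imbalance worsens until only one processor remains active, which is consistent with cyclic winning on this benchmark.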
The figure presents the performance of the cyclic and block distributions on the Intel Paragon.
As expected, the cyclic distribution exhibits
better performance because of load balancing.
The communication requirements of the two distributions are identical; both use multicast.