
An Experiment with Distributions

A significant advantage of coding in Fortran 90D/HPF is the ability to specify different distribution directives and measure the resulting performance differences without extensive recoding. The block distribution strategy, which allocates a contiguous chunk of elements to each processor, is ideal for computations that reference adjacent elements along an axis, as is the case in many relaxation methods [67]: the number of references to non-local variables for a given number of local variables is minimized when the volume-to-surface ratio is maximized. However, block distribution may result in poor load balance. Some experimentation along these lines was performed on the Gauss benchmark (gauss), a program that measures the performance of a Gaussian elimination algorithm.
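To illustrate why block distribution suits adjacent-element references, the following is a minimal sketch of a Jacobi-style relaxation sweep in Fortran 90D/HPF array syntax. The program name, array names, problem size, and four-processor arrangement are illustrative assumptions, not code from any of the benchmarks.

      PROGRAM relax
        INTEGER, PARAMETER :: n = 1024
        REAL :: u(n,n), unew(n,n)
  !HPF$ PROCESSORS p(4)
  !HPF$ DISTRIBUTE u(*,BLOCK) ONTO p
  !HPF$ ALIGN unew(i,j) WITH u(i,j)

        u = 0.0
        u(1,:) = 1.0          ! illustrative boundary condition

        ! One relaxation sweep: each interior point reads only its four
        ! neighbours, so with the BLOCK distribution non-local
        ! references arise only at the block boundaries.
        unew(2:n-1,2:n-1) = 0.25 * ( u(1:n-2,2:n-1) + u(3:n,2:n-1) &
                                   + u(2:n-1,1:n-2) + u(2:n-1,3:n) )
        u(2:n-1,2:n-1) = unew(2:n-1,2:n-1)
      END PROGRAM relax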

Figure gives the main factorization loop of gauss, which reduces the matrix to upper triangular form. This Gaussian elimination algorithm is sub-optimal because of a mask in the inner loop that prevents vectorization.
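The figure itself is not reproduced here; the following is a hedged sketch of what such a factorization loop with a masked inner update might look like in Fortran 90D array syntax. The routine name, the SPREAD-based rank-1 update, and the mask array are assumptions for illustration and are not the benchmark's actual source.

      SUBROUTINE factor(a, mask, n)
        INTEGER, INTENT(IN)    :: n
        REAL,    INTENT(INOUT) :: a(n,n)
        LOGICAL, INTENT(IN)    :: mask(n,n)
        INTEGER :: k

        DO k = 1, n-1
          ! Scale the pivot column below the diagonal.
          a(k+1:n,k) = a(k+1:n,k) / a(k,k)
          ! Masked rank-1 update of the trailing submatrix; a WHERE
          ! mask on this inner update is the kind of construct that
          ! prevents the loop from vectorizing.
          WHERE (mask(k+1:n,k+1:n))
            a(k+1:n,k+1:n) = a(k+1:n,k+1:n)                        &
              - SPREAD(a(k+1:n,k), DIM=2, NCOPIES=n-k)             &
              * SPREAD(a(k,k+1:n), DIM=1, NCOPIES=n-k)
          END WHERE
        END DO
      END SUBROUTINE factor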

Figure (a) shows the values of the matrix that are updated (the shaded region) by the factorization loop. Since the compiler uses the owner-computes rule to assign computations, only the owners of data in the shaded region participate in the computation; the remaining processors are masked out. Figures (b) and (c) show the distribution of this computation over four processors in block and cyclic fashion, respectively. The x-axis shows how the data is distributed across the four-processor grid. In this particular benchmark, cyclic distribution results in better load balancing than block distribution.
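Switching between the block layout of figure (b) and the cyclic layout of figure (c) amounts to changing a single directive. The sketch below, with an assumed array name, problem size, and four-processor arrangement, shows the cyclic form and notes the block alternative.

      PROGRAM distdemo
        INTEGER, PARAMETER :: n = 1024
        REAL :: a(n,n)
  !HPF$ PROCESSORS p(4)
  !HPF$ DISTRIBUTE a(*,CYCLIC) ONTO p
        ! Changing CYCLIC to BLOCK in the directive above gives the
        ! block layout of figure (b): under BLOCK, the owners of
        ! already-factored columns fall idle as elimination proceeds,
        ! whereas CYCLIC keeps the active columns spread over all four
        ! processors, so the owner-computes work stays balanced.
        a = 0.0
      END PROGRAM distdemo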

Figure presents the performance of the cyclic and block distributions on the Intel Paragon. As expected, the cyclic distribution performs better because of its superior load balance. The communication requirements of the two distributions are identical; both use multicast.

