The numerical algorithm used to solve for the stream function
values was the Red-Black SOR method. This method is the
logical choice as it combines the quick convergence of the
Gauss-Seidel SOR method with a degree of parallelism not
found in the GS method, which is inherently sequential.
The
tolerance condition was set to so that
the timings would not take excessively long.
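The red-black ordering described above can be sketched as follows. This is only an illustration, not the program that was timed: it assumes the stream function satisfies Laplace's equation on a rectangular grid with fixed boundary values, and the names `red_black_sor`, `omega`, `tol`, and `max_sweeps`, along with their default values, are hypothetical.

```python
def red_black_sor(psi, omega=1.5, tol=1e-4, max_sweeps=10_000):
    """Relax the interior of `psi` in place; return the number of sweeps used.

    `psi` is a list of lists whose outer ring holds fixed boundary values.
    Assumption: the interior satisfies Laplace's equation (5-point stencil).
    """
    n_rows, n_cols = len(psi), len(psi[0])
    for sweep in range(1, max_sweeps + 1):
        max_change = 0.0
        # Colour 0 = "red" points with (i + j) even, colour 1 = "black".
        # Points of one colour depend only on points of the other colour,
        # so each half-sweep could be updated in parallel.
        for colour in (0, 1):
            for i in range(1, n_rows - 1):
                # First interior column j with (i + j) % 2 == colour.
                start = 1 + ((i + 1 + colour) % 2)
                for j in range(start, n_cols - 1, 2):
                    gauss_seidel = 0.25 * (psi[i - 1][j] + psi[i + 1][j]
                                           + psi[i][j - 1] + psi[i][j + 1])
                    change = omega * (gauss_seidel - psi[i][j])
                    psi[i][j] += change
                    max_change = max(max_change, abs(change))
        # Tolerance condition: stop once no point moved by more than tol.
        if max_change < tol:
            return sweep
    return max_sweeps
```

A looser `tol` ends the iteration after fewer sweeps, which is the trade-off made to keep the timings short.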
Once the program was running smoothly, the HPF DISTRIBUTE directive was changed to reflect different data distribution choices. Because of the symmetry of the problem, it is not necessary to fiddle with the first dimension of the distribution directive, as the same effect can be achieved more simply by collapsing this dimension and distributing only over the columns. For example, the distribution (CYCLIC(25),BLOCK) is the same as (*,CYCLIC(25)). However, the first distribution may confuse some compilers (especially the DEC!) and generally takes longer to compile. Collapsing the first dimension (notice the * in the directive) makes compilation easy and achieves the same, if not superior, performance. The times for each distribution tested (using four processors) are listed in Figure 2.
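To make the column distributions concrete, the sketch below maps a column index to the processor that owns it under BLOCK and CYCLIC(k). It follows the usual HPF rules (BLOCK uses contiguous chunks of ceil(n/p) indices; CYCLIC(k) deals blocks of k indices round-robin), but uses 0-based indices. The function names and the grid width of 100 columns in the usage lines are hypothetical, not taken from the program.

```python
def block_owner(j, n, p):
    """Owner of 0-based index j under BLOCK: contiguous chunks of ceil(n/p)."""
    chunk = -(-n // p)  # ceiling division
    return j // chunk


def cyclic_owner(j, k, p):
    """Owner of 0-based index j under CYCLIC(k): blocks of k dealt round-robin."""
    return (j // k) % p


# Hypothetical example: 100 columns on 4 processors.
n_cols, n_procs = 100, 4
owners_block = [block_owner(j, n_cols, n_procs) for j in range(n_cols)]
owners_cyclic20 = [cyclic_owner(j, 20, n_procs) for j in range(n_cols)]
```

With (*,BLOCK) each processor holds one contiguous strip of columns; with (*,CYCLIC(20)) the fifth block of 20 columns wraps around to the first processor again.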
Figure 2: Variation of execution time with data distribution scheme.
All times are the average of five runs on four processors.
Indeed, a (BLOCK,BLOCK) distribution yields a poor time, because two processors are stuck with a full workload while the other two have only the lightly populated upper part of the vent. A (BLOCK,*) distribution has exactly the same load balancing problem, so it should have the same or a worse time, as its edge over area ratio is not as good as that of (BLOCK,BLOCK). The timing results bear this out, as performance drops slightly when (BLOCK,*) is used.
The first distribution that begins to ease the load balancing problem is (*,BLOCK), the distribution of blocks of columns. Now the processor with the most data does not have as much as it did with the other distributions (indeed, this is what fixing a load balancing problem really means!). The expected improvement pans out, with this scheme being noticeably faster than the two above it. A change to (*,CYCLIC(40)) groups the columns in blocks of 40 each and distributes these blocks cyclically. This distribution gives only a slight benefit in terms of load balancing, and it also gives a worse edge over area ratio. As such, the performance decreases relative to the (*,BLOCK) pattern.
To gain a larger benefit in terms of load balancing, the directive was changed to (*,CYCLIC(20)). At this point, each processor holds nearly the same amount of work as its neighbors, and thus load balancing is nearly achieved. The result is the best run time yet, a mere 14.8 seconds. Continuing down to (*,CYCLIC(10)) has a detrimental effect, as the load balance cannot really be improved further and the edge over area ratio worsens considerably. Making the CYCLIC blocks any smaller only worsens this effect.
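The trade-off between load balance and edge over area can be illustrated with a toy model. Everything in it is hypothetical: the 160-column grid, the per-column work counts in `work`, and the helper names are invented for illustration and do not describe the vent geometry or reproduce the measured times. The number of internal block edges serves only as a rough proxy for communication cost.

```python
def column_loads(work_per_column, k, p):
    """Total work per processor under (*, CYCLIC(k)) on p processors."""
    loads = [0] * p
    for j, w in enumerate(work_per_column):
        loads[(j // k) % p] += w
    return loads


def imbalance(loads):
    """Largest load divided by mean load; 1.0 means perfect balance."""
    return max(loads) / (sum(loads) / len(loads))


def block_boundaries(n_cols, k):
    """Internal edges between adjacent column blocks (communication proxy)."""
    return -(-n_cols // k) - 1


# Hypothetical workload: 80 lightly populated columns, then 80 full ones.
work = [40] * 80 + [160] * 80
for k in (40, 20, 10):
    loads = column_loads(work, k, 4)
    print(k, imbalance(loads), block_boundaries(len(work), k))
```

In this toy case, halving the block size from 40 to 20 removes the imbalance, while halving it again to 10 leaves the balance unchanged and only adds more block edges, which matches the qualitative pattern described above.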