We believe this converges as the below average load processors stay fixed, and extra load will flow "downhill" until it gets there. This is probably more slowly converging than necessary - there are lots of possible algorithms, including arbitrarily stopping after some number of iterations. |
The load balancing iterations may take up to Nproc sends and receives of only a small number of values. If there is no significant load imbalance, this may be much less, making this method have less message-passing than the CommAll algorithm. |
Variation: In this application and in general, the units of computation do not have to be contiguous. Each processor can have a list of the indices that it is working on and send load to neighbors with wraparound, or communicate load to other choices of processors. |