We can solve the problems of small message size and poor communication/computation ratio by a simple change of algorithm that still preserves the owner-computes rule. |
Reverse the loops over i and j so that j is the outer loop and i the inner loop |
First, for each i in the processor, set MPGrav(i) = 0 |
Now, looping over j, increment each MPGrav(i) by the contribution of those j that are local to the processor |
Now fetch each j that is off processor, communicating M(j) and Xuse(j) as before |
For each such j, run over all i in the processor, incrementing MPGrav(i) by the contribution of this j |
This version of the algorithm reduces communication by a factor of the grain size n = N/Nproc
and total communication overhead is approximately |
which is small for n > 100 |
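
The loop reversal above can be sketched as follows. This is a minimal illustration, not the original code: the processors are simulated inside one Python process, a counter stands in for messages, and it assumes the earlier version fetched each off-processor j once per local i. The function names, the block distribution in `owner`, and the 1D force law in `contribution` are all hypothetical; only MPGrav, M(j) and Xuse(j) come from the text.

```python
G = 1.0  # gravitational constant in arbitrary units (assumption)

def owner(j, n):
    """Processor that owns particle j under a block distribution (assumption)."""
    return j // n

def contribution(i, j, M, X):
    """Illustrative 1D gravitational acceleration on particle i due to particle j."""
    if i == j:
        return 0.0
    d = X[j] - X[i]
    return G * M[j] * d / abs(d) ** 3

def gravity_i_outer(p, n, M, X):
    """Earlier loop order: i outer, j inner; off-processor j is fetched per (i, j)."""
    local = range(p * n, (p + 1) * n)
    mpgrav = {i: 0.0 for i in local}
    fetches = 0
    for i in local:
        for j in range(len(M)):
            if owner(j, n) != p:
                fetches += 1  # one small message carrying M(j), Xuse(j)
            mpgrav[i] += contribution(i, j, M, X)
    return mpgrav, fetches

def gravity_j_outer(p, n, M, X):
    """Reversed loop order: j outer, i inner; each off-processor j is fetched once."""
    local = range(p * n, (p + 1) * n)
    mpgrav = {i: 0.0 for i in local}   # first set MPGrav(i) = 0
    fetches = 0
    for j in local:                    # contributions of j local to the processor
        for i in local:
            mpgrav[i] += contribution(i, j, M, X)
    for j in range(len(M)):            # now the j that are off processor
        if owner(j, n) == p:
            continue
        fetches += 1                   # fetch M(j), Xuse(j) once ...
        for i in local:                # ... and reuse it for every local i
            mpgrav[i] += contribution(i, j, M, X)
    return mpgrav, fetches
```

With N = 12 particles on Nproc = 3 simulated processors (n = 4), each processor performs 32 fetches in the first version and 8 in the second, a ratio equal to the grain size n, while both versions still update only the MPGrav(i) that the processor owns.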