1 | We can fix both the small message size and the poor communication/computation ratio with a simple change of algorithm that still preserves the owner-computes rule. |
2 | Interchange the loops over i and j so that j is the outer loop and i the inner loop |
3 | First, for each i in the processor, set MPGrav(i) = 0 |
4 | Then, looping over the j that are local to the processor, increment each MPGrav(i) by the contribution of that j |
5 | Next, fetch each j that is off processor, communicating M(j) and Xuse(j) as before |
6 | For each fetched j, run over all i in the processor, incrementing MPGrav(i) by the contribution of this j |
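7 | This version of the algorithm reduces communication by a factor of the grain size n = N/Nproc, since each off-processor j is fetched once per processor rather than once for each of the n local i |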
8 | and total communication overhead is approximately |
9 | which is small for n > 100 |
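The loop interchange described above can be sketched for a single processor as follows. This is a minimal illustration, not the original code: the pair term `contribution` (taken here as M(j)/|Xuse(i) - Xuse(j)| in one dimension), the block distribution of particles, and the `fetch_log` list that stands in for real messages are all assumptions made for the example. The i-outer version is included only to count fetches for comparison.

```python
import math


def contribution(x_i, m_j, x_j):
    """Hypothetical pair term (an assumption): m_j / |x_i - x_j| in 1-D."""
    return m_j / abs(x_i - x_j)


def mpgrav_j_outer(M, Xuse, local, fetch_log):
    """j is the outer loop: each off-processor j is fetched once."""
    local_set = set(local)
    MPGrav = {i: 0.0 for i in local}      # zero the locally owned accumulators
    for j in local:                       # contributions from local j
        for i in local:
            if i != j:
                MPGrav[i] += contribution(Xuse[i], M[j], Xuse[j])
    for j in range(len(M)):               # contributions from off-processor j
        if j in local_set:
            continue
        m_j, x_j = M[j], Xuse[j]          # stands in for one message carrying M(j), Xuse(j)
        fetch_log.append(j)
        for i in local:                   # reuse the fetched j for every local i
            MPGrav[i] += contribution(Xuse[i], m_j, x_j)
    return MPGrav


def mpgrav_i_outer(M, Xuse, local, fetch_log):
    """i is the outer loop: every off-processor j is fetched again for each i."""
    local_set = set(local)
    MPGrav = {}
    for i in local:
        total = 0.0
        for j in range(len(M)):
            if j == i:
                continue
            if j not in local_set:
                fetch_log.append(j)       # one small message per (i, j) pair
            total += contribution(Xuse[i], M[j], Xuse[j])
        MPGrav[i] = total
    return MPGrav


# Illustrative sizes (assumed): N particles, Nproc processors, grain size n = N/Nproc.
N, Nproc = 16, 4
n = N // Nproc
M = [1.0 + k for k in range(N)]           # made-up masses
Xuse = [0.5 + k for k in range(N)]        # made-up, distinct 1-D positions
local = list(range(n, 2 * n))             # particles owned by processor 1

fetches_j_outer, fetches_i_outer = [], []
result_j_outer = mpgrav_j_outer(M, Xuse, local, fetches_j_outer)
result_i_outer = mpgrav_i_outer(M, Xuse, local, fetches_i_outer)

print("fetches with j outer:", len(fetches_j_outer))
print("fetches with i outer:", len(fetches_i_outer))
```

With these sizes the j-outer version logs N - n = 12 fetches and the i-outer version logs n(N - n) = 48, a ratio equal to the grain size n = 4, while both produce the same MPGrav values up to floating-point rounding.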