The parallelism is perfect and both communication and computation are load balanced |
However, messages are small (as described, 4 words: a 3-vector Xuse and a mass M), and |
communication ≈ 4 tcomm (for small messages) |
where the estimate of ~15 tfloat could be higher depending on how 1/r_{i,j}^3 is calculated. As this involves a division and a square root, it is expensive if done directly. It is sometimes better to calculate it by table lookup, but this involves slow memory access |
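As a minimal sketch of the direct calculation mentioned above, the following Python function evaluates 1/r_{i,j}^3 from two 3-vectors. The function name inv_r_cubed and the tuple representation of positions are assumptions made for illustration only. The comments mark where the square root and the division occur, which are the operations that make the direct route expensive.

```python
import math

def inv_r_cubed(xi, xj):
    """Directly compute 1/r^3 for the separation of two 3-vectors (illustrative helper)."""
    # Three subtractions for the separation components
    dx = xi[0] - xj[0]
    dy = xi[1] - xj[1]
    dz = xi[2] - xj[2]
    # Three multiplies and two adds for the squared distance
    r2 = dx * dx + dy * dy + dz * dz
    # One square root (typically much slower than an add or multiply)
    r = math.sqrt(r2)
    # One multiply and one division (also typically slow)
    return 1.0 / (r2 * r)
```

For example, inv_r_cubed((0.0, 0.0, 0.0), (2.0, 0.0, 0.0)) evaluates 1/2^3. A table lookup would replace the square root and division with a memory read indexed by r2, trading arithmetic for memory access.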
Since typically tcomm/tfloat ≈ 10 even for quite large messages (and worse for small ones), the above estimate suggests that communication dominates ...
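The arithmetic behind this comparison can be sketched in a few lines of Python. This assumes that the 4 words communicated and the ~15 floating point operations refer to the same unit of work, and takes tcomm/tfloat = 10 as a representative value. The function name comm_to_comp_ratio and its parameters are hypothetical.

```python
def comm_to_comp_ratio(tcomm_over_tfloat, words=4, flops=15):
    """Ratio of communication time (words * tcomm) to computation time (flops * tfloat).

    Assumes both counts refer to the same unit of work; all names are illustrative.
    """
    return (words * tcomm_over_tfloat) / flops

# With tcomm/tfloat = 10: 4 * 10 / 15, i.e. communication costs more than computation
ratio = comm_to_comp_ratio(10)
```

Under these assumptions the ratio exceeds 1, and it grows if tcomm/tfloat is larger for small messages. It shrinks if the computation estimate is higher than 15 tfloat.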