As described so far, the algorithm still sends small messages. This can be addressed for both the "very bad" and the "much better" algorithm by "blocking" the j loop, so that one fetches not 1 but J values of j at a time.
|
This implies messages can be "arbitrarily" large, and the user can choose J so that:
-
Messages are long enough to avoid performance degradation from latency (start-up time)
-
Messages are short enough that they don't use too much space in the memory of each processor (otherwise choose J = N - Nproc)
|
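The trade-off in choosing J can be illustrated with a minimal sketch. It does no real message passing: a local list stands in for the data held on other processors, and each block of J values counts as one message. The names `blocked_fetch`, `accumulate`, `message_time`, `t_latency` and `t_per_value` are hypothetical, and the cost model (a fixed start-up cost per message plus a per-value transfer cost) is an assumption made for illustration only.

```python
def blocked_fetch(remote_values, J):
    """Yield remote values in blocks of at most J (one block = one message)."""
    for start in range(0, len(remote_values), J):
        yield remote_values[start:start + J]


def accumulate(local_values, remote_values, J, interact):
    """Sum interact(xi, xj) over all local xi and remote xj, with the j loop blocked by J."""
    totals = [0.0] * len(local_values)
    messages = 0
    for block in blocked_fetch(remote_values, J):
        messages += 1  # the whole block arrives in a single message
        for idx, xi in enumerate(local_values):
            for xj in block:  # only J remote values are held in memory at once
                totals[idx] += interact(xi, xj)
    return totals, messages


def message_time(n_remote, J, t_latency, t_per_value):
    """Assumed cost model: every message pays t_latency, every value pays t_per_value."""
    n_messages = -(-n_remote // J)  # ceiling division
    return n_messages * t_latency + n_remote * t_per_value


if __name__ == "__main__":
    local = [1.0, 2.0]
    remote = [1.0, 2.0, 3.0, 4.0, 5.0]
    for J in (1, 2, 5):
        totals, messages = accumulate(local, remote, J, lambda a, b: a * b)
        print(J, messages, totals, message_time(len(remote), J, 10.0, 1.0))
```

In this sketch the computed totals do not depend on J; only the number of messages and the size of the buffer holding each block change. Larger J amortizes the start-up cost over more values, while smaller J keeps the buffer small.
|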
See later comments on cache use and pipelining of messages for further related issues.
|