1 |
Since the owner-computes rule is obeyed, a good parallelizing compiler should be able to "automatically" find the "much better" algorithm, as inverting loops (loop interchange) and blocking are standard optimization strategies
|
2 |
Note that the "much better" parallel algorithm is also a correct sequential algorithm: it naturally uses each j value N-1 times, since a block of j values stays fixed in cache while the i values are cycled through
-
The block size J is now controlled by cache size, not by processor memory size as in the parallel case
-
Note that each i value is used J times per pass over a j block, and each j value N-1 times
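The blocked sequential loop order described above can be sketched as follows. This is only an illustration under stated assumptions: the function names, the list `x`, and the pairwise term `x[j] - x[i]` are hypothetical stand-ins for whatever interaction the real algorithm computes, and `block` plays the role of J.

```python
def blocked_pairwise(x, block):
    """Accumulate a hypothetical pairwise term with the j loop blocked.

    A block of j values is held fixed (as it would be in cache) while
    all i values are cycled through it.
    """
    n = len(x)
    acc = [0.0] * n      # per-i accumulated result
    j_uses = [0] * n     # how many times each j value is used
    for j0 in range(0, n, block):        # outer loop over j blocks of size J
        j1 = min(j0 + block, n)
        for i in range(n):               # cycle every i through this block
            xi = x[i]                    # loaded once, reused for the block
            for j in range(j0, j1):
                if i != j:               # skip the self-interaction
                    acc[i] += x[j] - xi  # illustrative interaction only
                    j_uses[j] += 1
    return acc, j_uses


def naive_pairwise(x):
    """Reference version: plain i-outer, j-inner loop without blocking."""
    n = len(x)
    acc = [0.0] * n
    for i in range(n):
        for j in range(n):
            if i != j:
                acc[i] += x[j] - x[i]
    return acc
```

Calling `blocked_pairwise(x, block)` on a list of length N gives the same `acc` as `naive_pairwise(x)`, and every entry of `j_uses` is N-1, since each j value sits in exactly one block and meets every other i there.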
|
3 |
The general lesson is that the amount of computation and the amount of data re-use are as important as the amount of communication
|