In our description of the "much better" and "best" algorithms, we assumed that each J block is broadcast to every processor
|
There are some different ways of setting this up which can be more efficient on some architectures
-
Especially on classic architectures of times gone by, where there were different costs for different communication paths
|
The data parallel part of these foils in fact describes the natural pipeline algorithm, which rotates the J blocks through the processors one step at a time
|
This has the feature (different from the previous explanation) that each processor is handling a different set of j's at any given stage of the computation.
|
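The pipeline idea above can be sketched as a small serial simulation. This is only an illustrative sketch, not the code from the foils: the names `direct_forces`, `pipeline_forces`, `own`, `held` and `schedule` are hypothetical, the interaction is assumed to be a 1D inverse-square force, and the number of particles is assumed to divide evenly among the processors. Each simulated processor keeps its own I block fixed, while the J blocks move one processor around a ring after every stage, so no broadcast is needed.

```python
def direct_forces(x, m):
    """Reference O(N^2) sum: force on each particle i from every other j."""
    n = len(x)
    f = [0.0] * n
    for i in range(n):
        for j in range(n):
            if i != j:
                d = x[j] - x[i]
                f[i] += m[i] * m[j] * d / abs(d) ** 3
    return f


def pipeline_forces(x, m, num_procs):
    """Serial simulation of the ring pipeline (assumed setup, see lead-in).

    Returns the forces and, for each stage, the J block held by each processor.
    """
    n = len(x)
    assert n % num_procs == 0, "sketch assumes equal-sized blocks"
    size = n // num_procs
    # own[p]: indices of the I block that stays on processor p
    own = [list(range(p * size, (p + 1) * size)) for p in range(num_procs)]
    # held[p]: indices of the J block currently visiting processor p
    held = [list(block) for block in own]
    f = [0.0] * n
    schedule = []
    for stage in range(num_procs):
        # Record which J block each processor works on at this stage
        schedule.append([block[0] // size for block in held])
        for p in range(num_procs):
            for i in own[p]:
                for j in held[p]:
                    if i != j:
                        d = x[j] - x[i]
                        f[i] += m[i] * m[j] * d / abs(d) ** 3
        # Rotate: every J block moves to the next processor on the ring
        held = [held[(p - 1) % num_procs] for p in range(num_procs)]
    return f, schedule


if __name__ == "__main__":
    x = [0.0, 1.0, 2.5, 4.0, 5.5, 7.0, 9.0, 12.0]
    m = [1.0, 2.0, 1.5, 1.0, 3.0, 1.0, 2.0, 1.0]
    forces, schedule = pipeline_forces(x, m, num_procs=4)
    for stage, blocks in enumerate(schedule):
        print("stage", stage, "J block on each processor:", blocks)
```

Each row of `schedule` is a permutation of the block numbers, which is the feature noted above: at any stage every processor holds a different set of j's. After as many stages as there are processors, each I block has met every J block exactly once, so the result should agree with the direct sum up to rounding.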