1 |
Traditionally algorithms are iterative, going across each phase in the tree starting with all the N/2 2 point FFT's at phase d-1 and ending with combining two N/2 point FFT's at phase 0.
-
However FFTW papers at their web site points out that most such implementations make poor use of cache; recursive version can be better in cache use and we will describe this in general later
|
2 |
We can decompose in many ways but here we take the simplest case of block decomposition illustrated for N=16 on next foil.
|
3 |
Let Nproc = 2P be the number of processors in the parallel computer
|
4 |
Consider the d digits in binary representation of m
-
The bottom d-P digits md-P-1 through m0 representation location inside processor labeled by top P digits md-1 through md-P
|
5 |
P digits labeling 2P processors
|
6 |
d-P digits labeling position
|
7 |
in processor
|