Traditionally algorithms are iterative, going across each phase in the tree starting with all the N/2 2 point FFT's at phase d-1 and ending with combining two N/2 point FFT's at phase 0.
-
However FFTW papers at their web site points out that most such implementations make poor use of cache; recursive version can be better in cache use and we will describe this in general later
|
We can decompose in many ways but here we take the simplest case of block decomposition illustrated for N=16 on next foil.
|
Let Nproc = 2P be the number of processors in the parallel computer
|
Consider the d digits in binary representation of m
-
The bottom d-P digits md-P-1 through m0 representation location inside processor labeled by top P digits md-1 through md-P
|
P digits labeling 2P processors
|
d-P digits labeling position
|
in processor
|