Basic Parallel FFT Algorithms I
Traditionally algorithms are iterative, going across each phase in the tree starting with all the N/2 2 point FFT’s at phase d-1 and ending with combining two N/2 point FFT’s at phase 0.
- However FFTW papers at their web site points out that most such implementations make poor use of cache; recursive version can be better in cache use and we will describe this in general later
We can decompose in many ways but here we take the simplest case of block decomposition illustrated for N=16 on next foil.
Let Nproc = 2P be the number of processors in the parallel computer
Consider the d digits in binary representation of m
- The bottom d-P digits md-P-1 through m0 representation location inside processor labeled by top P digits md-1 through md-P
P digits labeling 2P processors
d-P digits labeling position