Basic Parallel FFT Algorithms I

Traditionally algorithms are iterative, going across each phase in the tree starting with all the N/2 2 point FFT’s at phase d-1 and ending with combining two N/2 point FFT’s at phase 0.
- However FFTW papers at their web site points out that most such implementations make poor use of cache; recursive version can be better in cache use and we will describe this in general later

We can decompose in many ways but here we take the simplest case of block decomposition illustrated for N=16 on next foil.

Let Nproc = 2P be the number of processors in the parallel computer

Consider the d digits in binary representation of m
- The bottom d-P digits md-P-1 through m0 representation location inside processor labeled by top P digits md-1 through md-P

P digits labeling 2P processors

d-P digits labeling position

in processor

Previous slide Next slide Back to first slide View graphic version