Sequential and Parallel Performance II
This implies a reordering of computations, so that L digits are done at a time and one fixes storage indices d-1, d-2, d-3, .. L and calculates 2L point FFT on lower indices
- This 2L point FFT can be calculated by iterating 2 point FFT’s as in Cooley Tukey or by some custom FFT of this size as in FFTW
- Clearly this makes excellent use of cache and each data point is used L times (multiplied by number of times used in basic single digit FFT -- 2.5 times on average per 1 floating point word)
So basic operation needed is one of reordering:
- Move L digits from current positions to another set of L positions
- Assume these are not overlapping
- Time taken is 2(d-P).2(complex floats) . (1-2L) with one memory read and one write each
This re-ordering is also algorithm used to undo bit reversal if needed