Performance of Simplest Parallel DIT FFT IV
For DIT (and DIF), however, one can get much better performance by dropping the “owner computes” rule and changing which processor calculates a given FFT component.
This can be seen by examining a typical step in a phase of the parallel FFT where the current algorithm needs communication. Consider any two processors -- called a and b -- which need to swap data in the current algorithm
- a has a vector fa and b has a vector fb -- both of length N/Nproc
- Currently each processor sends N/Nproc numbers: fb goes to a and fa goes to b
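
The swap step above can be sketched as a small single-process simulation. This is only an illustration under stated assumptions: the function name `owner_computes_step`, the `twiddle` argument, and the butterfly form (fa + w*fb on a, fa - w*fb on b, the usual radix-2 DIT update) are not taken from these slides, and no real message passing is done.

```python
import numpy as np

def owner_computes_step(fa, fb, twiddle):
    """Simulate one communicating step between processors a and b.

    Hypothetical helper: a owns fa, b owns fb, each of length N/Nproc.
    The butterfly form used here is an assumption (standard radix-2 DIT).
    """
    # Full swap: a receives all of fb, b receives all of fa.
    recv_on_a = fb.copy()
    recv_on_b = fa.copy()
    sent_per_processor = fa.size  # N/Nproc numbers in each direction

    # Owner computes: each processor updates only the components it holds.
    new_fa = fa + twiddle * recv_on_a   # evaluated on a
    new_fb = recv_on_b - twiddle * fb   # evaluated on b
    return new_fa, new_fb, sent_per_processor

# Example sizes (assumed): N = 16 points on Nproc = 4 processors.
N, Nproc = 16, 4
fa = np.ones(N // Nproc, dtype=complex)
fb = 2 * np.ones(N // Nproc, dtype=complex)
w = np.ones(N // Nproc, dtype=complex)   # placeholder twiddle factors
new_fa, new_fb, sent = owner_computes_step(fa, fb, w)
```

In this sketch each processor sends its whole local vector, and both a and b form the product `twiddle * fb` separately.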