Performance of Simplest Parallel DIT FFT III
Tparallel (phases d-P ≤ p < d) = N (Tcalc + Tcomm)/Nproc, and again this is perfectly parallel (load balanced)
Tcomm is the time to swap one complex word between 2 processors (to amortize latency, this would be done as a single block transfer of N/Nproc words)
Tcalc = T+ + T*, as each processor must both multiply and add/subtract. Note this implies that one loses parallelism in the multiplication, as both processors of an exchanging pair must perform it.
Communication overhead fcomm = Tparallel * Nproc / Tsequential - 1
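The two formulas on this slide can be evaluated with a minimal Python sketch. The function names and the sample timings are hypothetical and chosen only for illustration. Tsequential is not defined on this slide, so it is passed in as an input rather than derived here.

```python
def parallel_phase_time(n, n_proc, t_add, t_mult, t_comm):
    """Time for one communicating phase (d-P <= p < d) of the parallel DIT FFT.

    n      : number of complex points N
    n_proc : number of processors Nproc
    t_add  : time for one complex add/subtract (T+)
    t_mult : time for one complex multiply (T*)
    t_comm : time to swap one complex word between 2 processors (Tcomm)
    """
    # Each processor both multiplies and adds/subtracts: Tcalc = T+ + T*
    t_calc = t_add + t_mult
    # Tparallel = N (Tcalc + Tcomm) / Nproc
    return n * (t_calc + t_comm) / n_proc


def comm_overhead(t_parallel, n_proc, t_sequential):
    """fcomm = Tparallel * Nproc / Tsequential - 1."""
    return t_parallel * n_proc / t_sequential - 1.0


if __name__ == "__main__":
    # Illustrative numbers only (arbitrary time units), not measurements.
    t_par = parallel_phase_time(n=1024, n_proc=4, t_add=1.0, t_mult=2.0, t_comm=5.0)
    # t_sequential is an assumed input; substitute the value from the
    # sequential analysis for the same phase.
    print(t_par, comm_overhead(t_par, n_proc=4, t_sequential=4096.0))
```

Note that fcomm as written absorbs every extra cost of the parallel version, so it also includes the duplicated multiplication mentioned above and not only the Tcomm term.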