Given by Geoffrey C. Fox at Computational Science Class CPS615 in the Spring Semester 2000. Foils prepared April 22, 2000
Summary of Material
We start by motivating the FFT (Fast Fourier Transform) as a solver for Poisson's equation
Then we discuss the sequential 1D discrete FFT in both DIF (Decimation in Frequency) and DIT (Decimation in Time) modes
We describe the general N = N1·N2 case but soon specialize to the binary (Cooley-Tukey) FFT
For the binary case, we go through parallelism and the use of cache, giving a performance analysis
These lectures motivate later lectures on the Fast Multipole method as a general Green's function solver, which is more flexible than the FFT
CPS615 Computational Science
Spring Semester 2000
Geoffrey Fox
Northeast Parallel Architectures Center
Syracuse University
111 College Place
Syracuse NY
gcf@npac.syr.edu
gcf@cs.fsu.edu
Lecture 24 of Demmel's 1999 CS267 course at UC Berkeley
FFTW (Fastest Fourier Transform in the West) at http://www.fftw.org
http://www.eptools.com by Engineering Productivity Tools Inc., which has a good tutorial on the FFT
Chapter 11 of "Solving Problems on Concurrent Processors, Volume 1", Prentice Hall, 1988 (chapter written by John Salmon)
Express any 2D function defined in 0 ≤ x, y ≤ 1 as a series Φ(x,y) = Σ_j Σ_k Φ_jk sin(πjx) sin(πky)
The inverse of this is: Φ_jk = 4 ∫₀¹ ∫₀¹ Φ(x,y) sin(πjx) sin(πky) dx dy
Poisson's equation ∂²Φ/∂x² + ∂²Φ/∂y² = f(x,y) becomes (-π²j² - π²k²) Φ_jk = f_jk for each mode (j,k)
This implies the PDE can be solved exactly algebraically: Φ_jk = f_jk / (-π²j² - π²k²)
So the solution of Poisson's equation involves the following steps:
1) Find the Fourier coefficients f_jk of f(x,y) by performing the integral above
2) Form the Fourier coefficients of Φ by Φ_jk = f_jk / (-π²j² - π²k²)
3) Construct the solution by performing the sum Φ(x,y) = Σ_j Σ_k Φ_jk sin(πjx) sin(πky), as in the sketch below
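To make the three steps concrete, here is a minimal Python sketch (our illustration, not code from the foils); it uses SciPy's discrete sine transform in place of the continuous sine-series integral, and the function name poisson_solve and the interior-grid convention are assumptions:

```python
import numpy as np
from scipy.fft import dstn, idstn

def poisson_solve(f):
    # f holds samples of the right-hand side on the n x n interior grid
    # of the unit square (homogeneous boundary values assumed).
    n = f.shape[0]
    fjk = dstn(f, type=1, norm="ortho")           # step 1: coefficients f_jk
    j = np.arange(1, n + 1)                       # sine modes j, k = 1, 2, ...
    jj, kk = np.meshgrid(j, j, indexing="ij")
    phijk = fjk / (-(np.pi * jj) ** 2 - (np.pi * kk) ** 2)  # step 2: divide
    return idstn(phijk, type=1, norm="ortho")     # step 3: sum the series
```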
There is another version of this (the Discrete Fourier Transform) which deals with functions defined at grid points rather than directly with the continuous integral.
Let f be a typical 1D function defined on a grid and labeled with index m = 0 ... N-1.
Let i = sqrt(-1), and index matrices and vectors from 0.
The Discrete Fourier Transform G(f) of an N-element vector f is another vector of length N, given by the matrix-vector multiplication Ω f,
where Ω is the N×N matrix with matrix elements:
Ω_km = ω^(-k·m)
and ω is:
ω = e^(2πi/N) = cos(2π/N) + i·sin(2π/N)
This is a complex number whose Nth power is 1 and is therefore called the Nth root of unity.
E.g., for N = 4:
ω = 0+1i, ω² = -1+0i, ω³ = 0-1i, ω⁴ = 1+0i
f can be reconstructed from its discrete Fourier transform G(f) by
f = Ω* G(f)/N, where * denotes complex conjugation.
Most applications require both calculating FFTs and reconstructing functions from their FFTs.
For solving the Poisson equation and various other applications, we use variations on the FFT.
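As a check on these definitions, a short NumPy sketch (our illustration) builds Ω, applies it, and reconstructs f via f = Ω* G(f)/N:

```python
import numpy as np

def dft_matrix(N):
    # Omega[k, m] = omega**(-k*m), with omega = exp(2*pi*i/N) the Nth root of unity
    omega = np.exp(2j * np.pi / N)
    k, m = np.meshgrid(np.arange(N), np.arange(N), indexing="ij")
    return omega ** (-(k * m))

N = 8
f = np.random.rand(N) + 1j * np.random.rand(N)
Omega = dft_matrix(N)
G = Omega @ f                        # forward transform: O(N^2) matrix-vector product
f_back = np.conj(Omega) @ G / N      # reconstruction: f = Omega* G(f) / N
assert np.allclose(f, f_back)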
The basic discrete Fourier transform is: G_N(k,f) = Σ_m ω^(-km) f(m), summed over m = 0 ... N-1.
In a naïve analysis this takes O(N²) complex floating-point additions and multiplications. However, we can group terms together to reduce this to a time O(N log N).
This can be done in many ways, and it starts with a factorization N = N1·N2 (a product of integers) and a recursive iteration of the factorization down to values N1, N2 for which special algorithms exist.
There are two special cases of particular importance:
Decimation in frequency (DIF): N1 = 2, N2 = N/2, where it is the frequency-space index k of G_N(k,f) that is decimated
Decimation in time (DIT): N1 = N/2, N2 = 2, where it is the time-domain index m of f(m) that is decimated
These are used in cases where N = 2^d is a power of 2, so the decimations can be applied recursively d times.
You can apply N = 2^d to a general value of N by taking, say, N = 773 and adding 251 zeros to pad it to 1024, as in the sketch below.
This can be wasteful, especially in multidimensional transforms, where you waste space in the x, y, and z directions.
The binary FFT illustrates the issues and has much simpler algebra -- FFTW uses a special-purpose compiler to cope with the general case.
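For illustration (our example, not from the foils), padding N = 773 up to 1024 in NumPy looks like this:

```python
import numpy as np

f = np.random.rand(773)                             # general-length input
padded = np.concatenate([f, np.zeros(1024 - 773)])  # add 251 zeros: 2**10 = 1024
G = np.fft.fft(padded)                              # binary FFT now applies
```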
Consider the exponential term ω^(-km) and divide indices k and m into two terms: k = 2k' + k0 with k0 = 0 or 1, and m = m' + (N/2)m_{d-1} with 0 ≤ m' < N/2.
Then ω^(-km) = S · (ω²)^(-k'm'), and the term (ω²)^(-k'm') is just what we need for an FFT of size N/2, as ω² is the (N/2)th root of unity.
We now will see that the factor S will give us a sum of two such FFTs with particular algebraic forms for the functions to be Fourier transformed.
Now ω^(-2k'(N/2)m_{d-1}) = ω^(-Nk'm_{d-1}) is +1, since ω^N = 1.
Further, ω^(-k0(N/2)m_{d-1}) = (-1)^(k0·m_{d-1}), so S = ω^(-k0 m') · (-1)^(k0·m_{d-1}).
Define two functions fE, fO of half the size of f (i.e., vectors of length N/2), in terms of whose FFTs we will express the FFT of f.
fE and fO are obtained by summing and multiplying as instructed by the factor S: fE(m') = f(m') + f(m' + N/2) gives G_N(2k', f) = G_{N/2}(k', fE), and fO(m') = ω^(-m') (f(m') - f(m' + N/2)) gives G_N(2k'+1, f) = G_{N/2}(k', fO).
We can now apply the same idea recursively: chop off the lowest binary digit of k so as to derive a formula for each FFT in terms of two more FFTs of half the size.
Let C(N) be the computational complexity of the FFT of size N.
Let Tbutterfly = 2T+ + T* be the time to calculate fE and fO for one value of m'.
Here T+ is the time for a complex addition and T* the time for a complex multiplication.
We assume that the complex twiddle factor T_N(m') = ω^(-m') has been precalculated and is available by table lookup.
Then C(N) = 2C(N/2) + (N/2)·Tbutterfly
This can be solved as C(N) = N·log₂N·Tbutterfly/2
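A minimal recursive decimation-in-frequency sketch in Python (our illustration, assuming the ω^(-km) sign convention above) makes the recursion and the fE/fO split explicit; it agrees with NumPy's FFT:

```python
import numpy as np

def fft_dif(f):
    # Radix-2 DIF: split the frequency index k into even (k0 = 0) and odd (k0 = 1)
    N = len(f)
    if N == 1:
        return f.copy()
    half = N // 2
    m = np.arange(half)
    fE = f[:half] + f[half:]                                   # feeds G(2k')
    fO = (f[:half] - f[half:]) * np.exp(-2j * np.pi * m / N)   # feeds G(2k'+1)
    G = np.empty(N, dtype=complex)
    G[0::2] = fft_dif(fE)      # even frequency components
    G[1::2] = fft_dif(fO)      # odd frequency components
    return G

f = np.random.rand(16).astype(complex)
assert np.allclose(fft_dif(f), np.fft.fft(f))
```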
Consider the exponential term again and divide indices k and m into two terms in a different way: m = 2m' + m0 with m0 = 0 or 1, and k = k' + (N/2)k_{d-1} with 0 ≤ k' < N/2.
Then ω^(-km) = ω^(-km0) · ω^(-2km'), and again the term (ω²)^(-k'm') is just what we need for an FFT of size N' = N/2.
We now will see that the factor S will give us a sum of two different (from DIF) N/2-length FFTs.
Now ω^(-2m'(N/2)k_{d-1}) = ω^(-Nm'k_{d-1}) is +1, since ω^N = 1.
Further, ω^(-km0) = ω^(-k'm0) · (-1)^(k_{d-1}·m0), so S = ω^(-k'm0) · (-1)^(k_{d-1}·m0).
Define two (different from DIF) functions fE, fO of half the size of f (i.e., vectors of length N/2), in terms of whose FFTs we will express the FFT of f.
fE and fO are obtained as instructed by the factor S: fE(m') = f(2m') (the even time samples) and fO(m') = f(2m'+1) (the odd samples), with G_N(k, f) = G_{N/2}(k', fE) + ω^(-k')·(-1)^(k_{d-1})·G_{N/2}(k', fO).
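The corresponding decimation-in-time sketch (again our illustration, same conventions) splits the time samples into even and odd subsets and combines them with the twiddle factor ω^(-k'):

```python
import numpy as np

def fft_dit(f):
    # Radix-2 DIT: split the time index m into even (m0 = 0) and odd (m0 = 1)
    N = len(f)
    if N == 1:
        return f.copy()
    GE = fft_dit(f[0::2])                    # FFT of fE: even time samples
    GO = fft_dit(f[1::2])                    # FFT of fO: odd time samples
    k = np.arange(N // 2)
    t = np.exp(-2j * np.pi * k / N) * GO     # twiddle omega**(-k') times G(fO)
    return np.concatenate([GE + t, GE - t])  # k_(d-1) = 0 and k_(d-1) = 1 halves
```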
The recursive structure of the binary FFT algorithm is a complete binary tree of log₂N levels, illustrated for N = 16.
At each phase, one relates the FFT of M numbers to the FFTs of the even and odd subsets fE, fO of this set -- these are of size M/2.
(Figure: the recursion tree, with the binary digits (XXXX) of the relevant elements marked.)
e.g., XXX1 denotes all odd indices m; XX11 is the set 4m'+3, where m' lies between 0 and 3.
DIT and DIF have identical computational complexity analyses at the superficial level used so far.
First we will discuss a way of thinking about binary FFTs which will make the later discussion much clearer.
The critical idea is to represent indices by their binary representation, written as d-digit binary words for k and m:
e.g., k = (k_{d-1} k_{d-2} ... k_1 k_0) and m = (m_{d-1} m_{d-2} ... m_1 m_0), i.e., m = Σ_l m_l 2^l.
Here, of course, the vector size is N = 2^d.
For DIT, the first step manipulates (in "butterfly fashion") the lowest binary digit m_0 of the time index and creates the highest binary digit k_{d-1} of the frequency domain.
Note that most FFT implementations overwrite f(m) by its FFT G_N(k,f) as they go along.
This implies that we start with f(m) labeled by the binary digits (m_{d-1} ... m_1 m_0)
and end with G_N(k,f) labeled by (k_0 k_1 ... k_{d-1}).
Note the characteristic bit reversal in the storage of G_N(k,f): each digit k_{d-1-p} is created at the "wrong end" of the label, in the position vacated by m_p.
We iterate this idea with phases p running from 0 to d-1. At the p'th phase, one manipulates the p'th binary digit of f(m) and forms the (d-1-p)'th binary digit of G_N(k,f).
Thus, starting with f(m) with binary labeling (m_{d-1} ... m_1 m_0),
one ends up with G_N(k,f) with the totally reversed index scheme (k_0 k_1 ... k_{d-1}).
Note that in recursive mode, one starts forming the recursion at phase p = 0, but the first computations are done at phase p = d-1, where one manipulates f(m) with the highest binary digit m_{d-1} and forms G_N(k,f) with binary digit k_0.
Data dependencies in the 1D FFT show a characteristic butterfly structure.
We show 4 processing phases p.
At phase p for DIT, one manipulates the p'th binary digit of f(m).
(Figure: butterfly data dependencies for N = 16; the index m is shown with binary labels, and the phases 3, 2, 1, 0 run across the diagram.)
Traditionally, algorithms are iterative, going across each phase in the tree, starting with all the N/2 two-point FFTs at phase d-1 and ending by combining two N/2-point FFTs at phase 0.
We can decompose in many ways, but here we take the simplest case of block decomposition, illustrated for N = 16 on the next foil.
Let Nproc = 2^P be the number of processors in the parallel computer.
Consider the d digits in the binary representation of m: the top P digits label the 2^P processors, and the remaining d-P digits label the position within a processor.
Remembering that the first computations are performed at phase d-1, we see that one needs communication for the first few phases (P of them), as dependency lines cross processor boundaries. For the final phases, all the information needed for an FFT component is stored within one processor.
So communication is needed for phases d-1 through d-P,
and no communication at all for phases d-P-1 through 0, as in the helper below.
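Stated as a one-line helper (our illustration, not from the foils):

```python
def phase_needs_comm(p, d, P):
    # Phase p manipulates binary digit p of m; digits d-1 .. d-P select the
    # processor under block decomposition, so those phases cross processors.
    return d - P <= p <= d - 1
```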
Note that the end result -- the FFT G_N(k,f) -- is bit-reversed; we will discuss later the communication cost of undoing this bit reversal.
(Figure: the N = 16 butterfly redrawn with the binary representation of index m and a natural block decomposition over 4 processors; processor boundaries are marked between the blocks, and phases 2 and 3 show communication as dependency lines cross the processor boundaries.)
As discussed, we have two types of phases.
The sequential time Tsequential(phase p) is identical for each phase p, at N·Tbutterfly/2.
Tparallel(phases 0 ≤ p ≤ d-P-1) = N·Tbutterfly/(2·Nproc), as these phases are perfectly parallel (load balanced).
At the remaining phases one must communicate: in each computation, fE and fO (DIT) are exchanged between two processors. This must be done for every one of the N/Nproc points stored in each processor.
At phase q, the processor holding fE fetches fO and forms in place (i.e., in the location originally holding fE) the FFT component with k_{d-q-1} = 0 (stored in the place with m_q = 0).
Similarly, the processor holding fO forms the FFT component with k_{d-q-1} = 1.
Tparallel(phases d-P ≤ p ≤ d-1) = N·(Tcalc + Tcomm)/Nproc, and again this is perfectly parallel (load balanced).
Tcomm is the time to swap one complex word between 2 processors (to reduce latency, this would be done as a block transfer of N/Nproc words).
Tcalc = T+ + T*, as each processor must both multiply and add/subtract. Note this implies that one loses parallelism in the multiplication, as both processors must do it.
The communication overhead is fcomm = Tparallel·Nproc/Tsequential - 1, combined in the sketch below.
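Combining the phase timings above into this formula gives a small sketch like the following (our illustration; the parameter names are assumptions):

```python
def fcomm_simple(d, P, t_add, t_mul, t_comm):
    # f_comm = Tparallel * Nproc / Tsequential - 1 for the simple algorithm
    t_butterfly = 2 * t_add + t_mul        # Tbutterfly = 2T+ + T*
    t_calc = t_add + t_mul                 # both processors add and multiply
    seq = d * t_butterfly / 2              # d phases at N*Tbutterfly/2, per point
    par = (d - P) * t_butterfly / 2 + P * (t_calc + t_comm)  # per point * Nproc
    return par / seq - 1
```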
One can, however, for DIT (and DIF) get much better performance if one avoids the "owner computes" rule and changes which processor calculates a given FFT component.
This can be seen by examining a typical step in a phase of the parallel FFT where communication is needed in the current algorithm. Consider any two processors -- called a and b -- which need to swap data in the current algorithm.
Alternatively, we could send the even-indexed members of fa to b and the odd-indexed entries of fb to a.
The communication overhead fcomm = Tparallel·Nproc/Tsequential - 1 is correspondingly reduced:
we have avoided the load imbalance and halved the communication compared to the simple algorithm. In later foils we will find even better methods that get rid of the log₂Nproc term.
For DIF, the phases are of course reversed in the way they step through the binary digits of f(m) and G_N(k,f).
One still has d phases labeled by p.
At the first recursion, p = 0, one manipulates f(m) with the highest binary digit m_{d-1} and forms G_N(k,f) with binary digit k_0.
The first DIF computations are done at the end of the recursion, with the "natural" digits m_0 and k_{d-1} of G_N(k,f).
DIF has an identical parallel performance to DIT, although some of the steps differ in detail.
Now, to understand the performance of general algorithms in parallel or in caches, one must understand the storage of the arrays.
Note first that labeling is not necessarily the same as storage:
if p(l) is any permutation of l = 0 ... d-1, then we can store the element labeled (m_{d-1} ... m_1 m_0) at the location whose l'th storage digit is m_{p(l)}.
This can be used to improve sequential performance by ensuring that when a given digit l is being processed, its storage digit p(l) is in the range 0 ≤ p(l) < L, where the cache can hold 2^L entries.
This implies a reordering of computations, so that L digits are done at a time: one fixes storage indices d-1, d-2, d-3, ..., L and calculates a 2^L-point FFT on the lower indices.
So the basic operation needed is one of reordering: swapping one set of binary digits of the storage index with another.
This reordering is also the algorithm used to undo the bit reversal if needed, as in the sketch below.
An interesting point about the swapping time is that it is independent of L, except for the factor (1 - 2^(-L)), which is a factor-of-two increase from L = 1 to the largest L.
After spending this time, one can then do "in-cache" computations, even in an iterative algorithm.
The computations are larger by a factor L (roughly the log of the cache size) compared to the cost of the reordering.
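As a concrete instance of such a digit reordering, here is a bit-reversal permutation sketch (our illustration):

```python
import numpy as np

def bit_reverse_permute(f, d):
    # Move the entry labeled (m_(d-1) ... m_1 m_0) to location (m_0 m_1 ... m_(d-1)),
    # undoing the bit reversal left behind by the in-place binary FFT.
    out = np.empty_like(f)
    for m in range(1 << d):
        r = int(format(m, f"0{d}b")[::-1], 2)   # reverse the d binary digits
        out[r] = f[m]
    return out
```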
We can use the same idea to improve the parallel algorithm.
Now the top P = log₂Nproc digits label the processor number, and we transfer these to digits stored inside the processor. This involves communication, but after this transfer all computations are internal to a processor.
An interesting point about the resultant communication overhead is that one no longer gets the log₂Nproc dependence in the numerator.
Remember that in D dimensions, we found fcomm proportional to n^(-1/D), where n is the grain size.
For the FFT, fcomm is proportional to 1/log₂n, which is the "standard" formula with D = ∞.
In days gone by, we used to use the hypercube topology for parallel machines.
Such machines have 2^P processor nodes, labeled by P-digit binary numbers, and the communication channels are such that every node is connected to P other nodes -- connections run to those nodes whose binary representations differ in just one digit. For instance, the nodes labeled 10110 and 10100 are connected.
The relevance of this to the Cooley-Tukey FFT can be seen from the basic recursive step, which precisely involves linking Fourier coefficients that likewise differ in just a single binary digit; see the sketch below.
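The one-digit-difference rule is easy to state in code (our illustrative helper):

```python
def hypercube_neighbors(x, P):
    # Node x in a 2**P-node hypercube connects to the P nodes whose binary
    # labels differ from x in exactly one digit.
    return [x ^ (1 << bit) for bit in range(P)]

print([format(n, "05b") for n in hypercube_neighbors(0b10110, 5)])
```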
The hypercube had other useful features: e.g., the 64-node 6-dimensional hypercube includes meshes (8 by 8, 4 by 4 by 4, 2 by 8 by 4, etc.).
However, nowadays one builds general-purpose switches (communication systems) and does not focus on particular communication topologies.
(Figure: hypercube topology for 8 machines.)
(Photo: Seitz and Fox with the Caltech 64-node hypercube, 1983.)
Multidimensional FFTs are very common and are the most important for parallel computing, as they are more computationally intense.
Take a 2^d2 by 2^d1 FFT in two dimensions and label it by a d-digit binary number with d = d1 + d2.
Then all discussions are essentially identical to the one-dimensional FFT with dimension d.
The algorithm is different, but the communication issues and the strategy of reordering the binary index are the same.
In particular, one would normally first do the d1-bit transforms in, say, the low d1 bits, and then swap the upper d2 bits into the lower part of the d-digit label -- now do the d2-bit transforms. This way one again only gets communication in the swap, and otherwise one has purely within-processor one-dimensional FFTs; see the sketch below.
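In serial NumPy, the row-then-column strategy looks like this (our sketch; the parallel version would insert the digit swap between the two passes):

```python
import numpy as np

def fft2_two_passes(f):
    step1 = np.fft.fft(f, axis=1)      # d1-bit transforms along each row
    return np.fft.fft(step1, axis=0)   # then d2-bit transforms along each column

f = np.random.rand(8, 16)
assert np.allclose(fft2_two_passes(f), np.fft.fft2(f))
```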