Full HTML for

Basic foilset Parallel FFT and use in PDE Solvers

Given by Geoffrey C. Fox at Computational Science Class CPS615 on Winter Semester 2000. Foils prepared April 22 2000
Outside Index Summary of Material


We start by motivating the FFT (Fast Fourier Transform) as a solver for Poisson's equation
The we discuss sequential 1D discrete FFT in both DIF (Decimation in Frequency) and DIT (Decimation in Time) modes
We describe general N=N1*N2 case but soon specialize to binary (Cooley Tukey) FFT.
For binary case, we go through parallelism and use of cache giving a performance analysis
These lectures motivate later lectures on Fast Multipole method as general Green's function solver which is more flexible than FFT

Table of Contents for full HTML of Parallel FFT and use in PDE Solvers

Denote Foils where Image Critical
Denote Foils where HTML is sufficient

1 Parallel FFT and Use in PDE Solvers
2 Abstract of CPS615 FFT Lectures
3 References for Fast Fourier Transform
4 Solving Poisson's Equation with the FFT I
5 Solving Poisson's Equation with the FFT II
6 Discrete Fourier Transform
7 Inverse and other FFT Transforms
8 The 1D FFT in explicit detail I
9 The 1D FFT in explicit detail II
10 DIF Manipulation of the exponential I
11 DIF Manipulation of the exponential II
12 Recursive Formula for DIF
13 The Computational Complexity of DIF FFT
14 DIT Manipulation of the exponential I
15 DIT Manipulation of the exponential II
16 Recursive Formula for DIT
17 Recursive Structure for DIT
18 What's Going On with DIT/DIF I
19 What's Going On with DIT/DIF II
20 What's Going On with DIT/DIF III
21 Butterfly Pattern in 1D DIT FFT
22 Basic Parallel FFT Algorithms I
23 Basic Parallel FFT Algorithms II
24 Parallelism in 1D FFT
25 Performance of Simplest Parallel DIT FFT I
26 Performance of Simplest Parallel DIT FFT II
27 Performance of Simplest Parallel DIT FFT III
28 Performance of Simplest Parallel DIT FFT IV
29 Performance of Simplest Parallel DIT FFT V
30 What's Going On with DIT/DIF IV
31 Sequential and Parallel Performance I
32 Sequential and Parallel Performance II
33 Sequential and Parallel Performance III
34 Sequential and Parallel Performance IV
35 Sequential and Parallel Performance V
36 Parallel FFT and Hypercubes I
37 Parallel FFT and Hypercubes II
38 Multi Dimensional FFT's

Outside Index Summary of Material



HTML version of Basic Foils prepared April 22 2000

Foil 1 Parallel FFT and Use in PDE Solvers

From Parallel FFT and use in PDE Solvers Computational Science Class CPS615 -- Winter Semester 2000. *
Full HTML Index
CPS615 Computational Science
Spring Semester 2000
Geoffrey Fox
Northeast Parallel Architectures Center
Syracuse University
111 College Place
Syracuse NY
gcf@npac.syr.edu
gcf@cs.fsu.edu

HTML version of Basic Foils prepared April 22 2000

Foil 2 Abstract of CPS615 FFT Lectures

From Parallel FFT and use in PDE Solvers Computational Science Class CPS615 -- Winter Semester 2000. *
Full HTML Index
We start by motivating the FFT (Fast Fourier Transform) as a solver for Poisson's equation
The we discuss sequential 1D discrete FFT in both DIF (Decimation in Frequency) and DIT (Decimation in Time) modes
We describe general N=N1*N2 case but soon specialize to binary (Cooley Tukey) FFT.
For binary case, we go through parallelism and use of cache giving a performance analysis
These lectures motivate later lectures on Fast Multipole method as general Green's function solver which is more flexible than FFT

HTML version of Basic Foils prepared April 22 2000

Foil 3 References for Fast Fourier Transform

From Parallel FFT and use in PDE Solvers Computational Science Class CPS615 -- Winter Semester 2000. *
Full HTML Index
Lecture 24 Demmel 1999 CS267 UC Berkeley
FFTW (Fastest Fourier Transform in the West) at http://www.fftw.org
  • Seems to be best (sequential) FFT routine and has a lot of links to other sequential and parallel one and multidimensional FFT's
http://www.eptools.com by Engineering Productivity Tools Inc. has a good tutorial on the FFT
  • Aimed at Signal Processing but basic description generally useful
Chapter 11 of "Solving Problems on Concurrent Processors, Volume 1", Prentice Hall 1988. (Chapter written by John Salmon)
  • Fox, Johnson, Lyzenga, Otto, Salmon, Walker

HTML version of Basic Foils prepared April 22 2000

Foil 4 Solving Poisson's Equation with the FFT I

From Parallel FFT and use in PDE Solvers Computational Science Class CPS615 -- Winter Semester 2000. *
Full HTML Index
Express any 2D function defined in 0 ? x,y ? 1 as a series ?(x,y) = Sj Sk ?jk sin(p jx) sin(p ky)
  • Here ?jk are called Fourier coefficient of ?(x,y)
The inverse of this is: ?jk = 4 ?(x,y) sin(p jx) sin(p ky)
Poisson's equation ?2 ? /? x2 + ? 2 ? /? y2 = f(x,y) becomes
  • Sj Sk (-p2j2 - p2k2) ?jk sin(p jx) sin(p ky)
  • = Sj Sk fjk sin(p jx) sin(p ky)
  • where fjk are Fourier coefficients of f(x,y)
  • and f(x,y) = Sj Sk fjk sin(p jx) sin(p ky)
This implies PDE can be solved exactly algebraically, ?jk = fjk / (-p2j2 - p2k2)

HTML version of Basic Foils prepared April 22 2000

Foil 5 Solving Poisson's Equation with the FFT II

From Parallel FFT and use in PDE Solvers Computational Science Class CPS615 -- Winter Semester 2000. *
Full HTML Index
So solution of Poisson's equation involves the following steps
1) Find the Fourier coefficients fjk of f(x,y) by performing integral
2) Form the Fourier coefficients of ? by ?jk = fjk / (-p2j2 - p2k2)
3) Construct the solution by performing sum ?(x,y)
There is another version of this (Discrete Fourier Transform) which deals with functions defined at grid points and not directly the continuous integral
  • Also the simplest (mathematically) transform uses exp(-2pijx) not sin(p jx)
  • Let us first consider 1D discrete version of this case
  • PDE case normally deals with discretized functions as these needed for other parts of problem

HTML version of Basic Foils prepared April 22 2000

Foil 6 Discrete Fourier Transform

From Parallel FFT and use in PDE Solvers Computational Science Class CPS615 -- Winter Semester 2000. *
Full HTML Index
Let f be in 1D a typical function defined on a grid and labeled with index m= 0 ... N-1
Let i = sqrt(-1) and index matrices and vectors from 0.
The Discrete Fourier Transform G(f) of an N-element vector f is another vector of length N given by Matrix Vector Multiplication ? f
Where ? is the N*N matrix with matrix elements:
?km = v (-k*m)
and v is:
v = e (2pi/N) = cos(2p/N) + i*sin(2p/N)
This is a complex number with whose Nth power is 1 and is therefore called the Nth root of unity
E.g., for N = 4:
v = 0+1*i, v2 = -1+0*i, v3 = 0-1*i, v4 = 1+0*i,

HTML version of Basic Foils prepared April 22 2000

Foil 7 Inverse and other FFT Transforms

From Parallel FFT and use in PDE Solvers Computational Science Class CPS615 -- Winter Semester 2000. *
Full HTML Index
f can be reconstructed from its discrete Fourier transformation G(f) by
f = ?*. G(f)/N where * denotes complex conjugation
  • ?* is an N by N matrix and G(f) is a vector of length N
Most applications require both calculating FFT's and reconstructing functions from their FFT's
  • However these are essentially the same algorithm as seen above and so we only need to illustrate one case
  • Issues with parallelism and optimal performance are identical
For solving the Poisson equation and various other applications, we use variations on the FFT
  • The sin transform – use imaginary part of G
  • The cos transform – use real part of G

HTML version of Basic Foils prepared April 22 2000

Foil 8 The 1D FFT in explicit detail I

From Parallel FFT and use in PDE Solvers Computational Science Class CPS615 -- Winter Semester 2000. *
Full HTML Index
The basic discrete FFT is:
This in a naïve analysis takes O(N2) complex floating point additions and multiplications. However we can group terms together to reduce this to a time O(NlogN).
This can be done in many ways and starts with a factorization N=N1N2 (product of integers) and a recursive iteration of factorization to values N1, N2 where special algorithms exist.
  • The power of this general approach is made clear at FFTW web site
  • eptools web site describes case N1=2, 4 and a general prime

HTML version of Basic Foils prepared April 22 2000

Foil 9 The 1D FFT in explicit detail II

From Parallel FFT and use in PDE Solvers Computational Science Class CPS615 -- Winter Semester 2000. *
Full HTML Index
There are two special cases of particular importance
Decimation in frequency (DIF) N1=2 N2=N/2 where frequency space is GN( k,f )
Decimation in time (DIT) N1=N/2 N2=2 where time domain is f(m)
These are used in cases where N=2d is a power of 2 and decimations can be applied recursively d times.
You can apply N=2d to a general value of N by taking say N=773 and adding 251 zeros to make it equal to 1024.
This can be wasteful especially in multi dimensional transforms where you waste space in x y and z directions
The binary FFT illustrates issues and has much simpler algebra -- FFTW uses a special purpose compiler to cope with the general case

HTML version of Basic Foils prepared April 22 2000

Foil 10 DIF Manipulation of the exponential I

From Parallel FFT and use in PDE Solvers Computational Science Class CPS615 -- Winter Semester 2000. *
Full HTML Index
Consider the exponential term and divide indices k and m into two terms
Then term is just what we need for an FFT of size N/2

HTML version of Basic Foils prepared April 22 2000

Foil 11 DIF Manipulation of the exponential II

From Parallel FFT and use in PDE Solvers Computational Science Class CPS615 -- Winter Semester 2000. *
Full HTML Index
We now will see that factor S with give us a sum of two such FFT's with particular algebraic forms for the functions to be Fourier transformed
Now is +1 when =0
  • and -1 when =1
Further

HTML version of Basic Foils prepared April 22 2000

Foil 12 Recursive Formula for DIF

From Parallel FFT and use in PDE Solvers Computational Science Class CPS615 -- Winter Semester 2000. *
Full HTML Index
Define two functions fE fO of half the size of f ( i.e.vectors of length N/2) in terms of whose FFT we will express the FFT of f
fE fO are gotten by summing and multiplying as instructed by function S on previous foil

HTML version of Basic Foils prepared April 22 2000

Foil 13 The Computational Complexity of DIF FFT

From Parallel FFT and use in PDE Solvers Computational Science Class CPS615 -- Winter Semester 2000. *
Full HTML Index
We can now apply the same idea recursively; chop off the lowest binary digit of k so as to derive a formula for each FFT in terms of two more half the size.
Let C(N) be computational complexity of the FFT of size N
Let Tbutterfly = 2T+ + T* be the time to calculate fE and fO for one value of m-
Here T+ is time for a complex addition and T* the time for a complex multiplication.
We assume that complex number TN(m-) has been precalculated and is available by table lookup
Then C(N) = 2C(N/2) + (N/2) Tbutterfly
This can be solved as C(N) = NlogNTbutterfly/2

HTML version of Basic Foils prepared April 22 2000

Foil 14 DIT Manipulation of the exponential I

From Parallel FFT and use in PDE Solvers Computational Science Class CPS615 -- Winter Semester 2000. *
Full HTML Index
Consider the exponential term again and divide indices k and m into two terms in a different way
Then again term is just what we need for an FFT of size N- =N/2

HTML version of Basic Foils prepared April 22 2000

Foil 15 DIT Manipulation of the exponential II

From Parallel FFT and use in PDE Solvers Computational Science Class CPS615 -- Winter Semester 2000. *
Full HTML Index
We now will see that factor S with give us a sum of two different (from DIF) N/2 length FFT's
Now is +1 when =0
  • and -1 when =1
Further

HTML version of Basic Foils prepared April 22 2000

Foil 16 Recursive Formula for DIT

From Parallel FFT and use in PDE Solvers Computational Science Class CPS615 -- Winter Semester 2000. *
Full HTML Index
Define two (different from DIF) functions fE fO of half the size of f ( i.e.vectors of length N/2) in terms of whose FFT we will express the FFT of f
fE fO are gotten by summing and multiplying as instructed by function S on previous foil

HTML version of Basic Foils prepared April 22 2000

Foil 17 Recursive Structure for DIT

From Parallel FFT and use in PDE Solvers Computational Science Class CPS615 -- Winter Semester 2000. *
Full HTML Index
The recursive structure of the binary FFT algorithm is a complete binary tree of logN levels shown below for N=16
At each phase, one relates the FFT of M numbers to FFT of even and odd subsets fE fO of this set -- these are of size M/2
Diagram shows binary digit (XXXX) of relevant elements
e.g. XXX1 are all odd indices m; XX11 is set 4m-+3 where m- lies between 0 and 3

HTML version of Basic Foils prepared April 22 2000

Foil 18 What's Going On with DIT/DIF I

From Parallel FFT and use in PDE Solvers Computational Science Class CPS615 -- Winter Semester 2000. *
Full HTML Index
DIT and DIF have identical computational complexity analyses at the superficial level we did so far
  • We will see some important features when we analyze where the data is stored and issues of good cache use and parallelism
First we will discuss a way of thinking about binary FFT's which will make the future discussion much clearer
Critical idea is to represent indices by their binary representation shown below as d digit words for k and m:
e.g.
k as a binary word
m as a binary word

HTML version of Basic Foils prepared April 22 2000

Foil 19 What's Going On with DIT/DIF II

From Parallel FFT and use in PDE Solvers Computational Science Class CPS615 -- Winter Semester 2000. *
Full HTML Index
Here of course vector size N = 2d
For DIT, the first step, manipulates (in "butterfly fashion"), the lowest component m0 for time and creates the highest binary digit kd-1 for frequency domain.
Note that most FFT implementation overwrite f[m] by its FFT GN(k,f) as they go along
This implies that we start with f(m) where m labeled like
And end with that start of a recursive FFT labeled like:
Note characteristic bit reversal in storage of GN(k,f)
"Wrong End"

HTML version of Basic Foils prepared April 22 2000

Foil 20 What's Going On with DIT/DIF III

From Parallel FFT and use in PDE Solvers Computational Science Class CPS615 -- Winter Semester 2000. *
Full HTML Index
We iterate this idea with phases p running from 0 to d-1. At the p'th phase, one is manipulating the p'th digit of f(m) and forming the d-1-p'th index of GN(k,f).
Thus starting with f(m) with binary labeling:
One ends up with GN(k,f) with totally reversed index scheme
Note that in recursive mode, one starts forming recursion with phase p=0 but first computations are done at phase p=d-1 where one is manipulating f(m) with highest binary index md-1 and forming GN(k,f) with binary index k0

HTML version of Basic Foils prepared April 22 2000

Foil 21 Butterfly Pattern in 1D DIT FFT

From Parallel FFT and use in PDE Solvers Computational Science Class CPS615 -- Winter Semester 2000. *
Full HTML Index
Data dependencies in 1D FFT show a characteristic butterfly structure
We show 4 processing phases p
At phase p for DIT, one manipulates p'th digit of f(m)
Index m
with
binary
labels
Phase 3 2 1 0

HTML version of Basic Foils prepared April 22 2000

Foil 22 Basic Parallel FFT Algorithms I

From Parallel FFT and use in PDE Solvers Computational Science Class CPS615 -- Winter Semester 2000. *
Full HTML Index
Traditionally algorithms are iterative, going across each phase in the tree starting with all the N/2 2 point FFT's at phase d-1 and ending with combining two N/2 point FFT's at phase 0.
  • However FFTW papers at their web site points out that most such implementations make poor use of cache; recursive version can be better in cache use and we will describe this in general later
We can decompose in many ways but here we take the simplest case of block decomposition illustrated for N=16 on next foil.
Let Nproc = 2P be the number of processors in the parallel computer
Consider the d digits in binary representation of m
  • The bottom d-P digits md-P-1 through m0 representation location inside processor labeled by top P digits md-1 through md-P
P digits labeling 2P processors
d-P digits labeling position
in processor

HTML version of Basic Foils prepared April 22 2000

Foil 23 Basic Parallel FFT Algorithms II

From Parallel FFT and use in PDE Solvers Computational Science Class CPS615 -- Winter Semester 2000. *
Full HTML Index
Remembering that first computations are performed at phase d-1, we see that one needs communication for the first few (P phases) as dependency lines cross boundaries. For the final phases all the information needed for FFT component is stored within processor.
So communication needed for phases d-1 through d-P
  • These are phases labeled by the digits used to label Processor and not position in the processor
No communication at all for phases d-P-1 through 0
Note end result -- the FFT GN(k,f) -- is bit reversed: we will discuss later the communication cost of undoing bit reversal but
  • In many cases, bit reversal is unimportant and FFT does not need to be reversed.
  • For instance in solving Poisson's equation, FFT is followed by inverse FFT and in final result, there is no bit reversal

HTML version of Basic Foils prepared April 22 2000

Foil 24 Parallelism in 1D FFT

From Parallel FFT and use in PDE Solvers Computational Science Class CPS615 -- Winter Semester 2000. *
Full HTML Index
Binary
Representation
of Index m
Consider 4 Processors and natural block decomposition
note phases 2 and 3 have communication as dependency lines cross processor boundaries
1
2
3
Processor
Boundary
Processor
Boundary
Processor
Boundary
Processor
Number

HTML version of Basic Foils prepared April 22 2000

Foil 25 Performance of Simplest Parallel DIT FFT I

From Parallel FFT and use in PDE Solvers Computational Science Class CPS615 -- Winter Semester 2000. *
Full HTML Index
As discussed we have two types of phases
The sequential time Tsequential (phase p) is identical for each phase p at NTbutterfly/2
Tparallel (phases 0? p? d-P-1) = NTbutterfly/(2Nproc) as perfectly parallel (load balanced)
At the remaining stages, one must communicate in each of computations where fE and fO (DIT) are exchanged between two processors. This must be done for every one of the N/Nproc points stored in each processor

HTML version of Basic Foils prepared April 22 2000

Foil 26 Performance of Simplest Parallel DIT FFT II

From Parallel FFT and use in PDE Solvers Computational Science Class CPS615 -- Winter Semester 2000. *
Full HTML Index
This shows how processor holding fE fetches fO and forms in-place (i.e. in location originally holding fE) the FFT component with kd-q-1 = 0 (stored in place mq=0).
Similarily processor holding fO forms the FFT component with kd-q-1 = 1.

HTML version of Basic Foils prepared April 22 2000

Foil 27 Performance of Simplest Parallel DIT FFT III

From Parallel FFT and use in PDE Solvers Computational Science Class CPS615 -- Winter Semester 2000. *
Full HTML Index
Tparallel (phases d-P? p< d) = N(Tcalc+Tcomm)/(Nproc) and again this is perfectly parallel (load balanced)
Tcomm is time to swap one complex word between 2 processors (to reduce latency this would be done as a block transfer of N/Nproc words)
Tcalc = T+ + T* as each processor must multiply and add/subtract. Note this implies that one loses parallelism in multiplication as both processors must do it.
Communication overhead fcomm = Tparallel *Nproc/Tsequential -1

HTML version of Basic Foils prepared April 22 2000

Foil 28 Performance of Simplest Parallel DIT FFT IV

From Parallel FFT and use in PDE Solvers Computational Science Class CPS615 -- Winter Semester 2000. *
Full HTML Index
One can however for DIT (and DIF) get much better performance if one avoids the "owner computes rule" and changes the processor which calculates a given FFT component.
This can seen by examining a typical step in a phase of parallel FFT where communication is needed in current algorithm. Consider any two processors -- called a and b --- which need to swap data in current algorithm
  • a has vector fa and b has a vector fb -- both of length N/Nproc
  • Currently we swap N/Nproc numbers by sending fb to a and fa to b

HTML version of Basic Foils prepared April 22 2000

Foil 29 Performance of Simplest Parallel DIT FFT V

From Parallel FFT and use in PDE Solvers Computational Science Class CPS615 -- Winter Semester 2000. *
Full HTML Index
Alternatively we could send even members of fa to b and odd indexed entries in fb to a.
  • We would take resultant vector members in processors a and b and combine them in pairs to get FFT components
Communication overhead fcomm = Tparallel *Nproc/Tsequential -1 is now given by
We have avoided load imbalance and halved the communication
compared to simple algorithm. In later foils we will find even better
methods that get rid of log2N term

HTML version of Basic Foils prepared April 22 2000

Foil 30 What's Going On with DIT/DIF IV

From Parallel FFT and use in PDE Solvers Computational Science Class CPS615 -- Winter Semester 2000. *
Full HTML Index
For DIF, the phases are of course reversed in the way the step through the binary digits of f(m) and GN(k,f)
One still has d phases labeled by p
At first recursion p=0 one is manipulating f(m) with highest binary index md-1 and forming GN(k,f) with binary index k0
The first DIF computations are done at end of recursion and this is "natural" m=0 and kd-1 index for GN(k,f)
DIF has an identical parallel performance to DIT although some of the steps differ in detail

HTML version of Basic Foils prepared April 22 2000

Foil 31 Sequential and Parallel Performance I

From Parallel FFT and use in PDE Solvers Computational Science Class CPS615 -- Winter Semester 2000. *
Full HTML Index
Now to understand performance of general algorithms in parallel or in caches, one must understand storage of the arrays.
Note first that labeling is not necessarily same as storage
if p(l) is any permutation of l=0...d-1, then we can store
This can be used to improve sequential performance by ensuring that when a given digit l is being processed, its storage digit p(l) is in range 0? p(l) < L, where the cache can hold 2L entries

HTML version of Basic Foils prepared April 22 2000

Foil 32 Sequential and Parallel Performance II

From Parallel FFT and use in PDE Solvers Computational Science Class CPS615 -- Winter Semester 2000. *
Full HTML Index
This implies a reordering of computations, so that L digits are done at a time and one fixes storage indices d-1, d-2, d-3, .. L and calculates 2L point FFT on lower indices
  • This 2L point FFT can be calculated by iterating 2 point FFT's as in Cooley Tukey or by some custom FFT of this size as in FFTW
  • Clearly this makes excellent use of cache and each data point is used L times (multiplied by number of times used in basic single digit FFT -- 2.5 times on average per 1 floating point word)
So basic operation needed is one of reordering:
  • Move L digits from current positions to another set of L positions
  • Assume these are not overlapping
  • Time taken is 2(d-P).2(complex floats) . (1-2L) with one memory read and one write each
This re-ordering is also algorithm used to undo bit reversal if needed

HTML version of Basic Foils prepared April 22 2000

Foil 33 Sequential and Parallel Performance III

From Parallel FFT and use in PDE Solvers Computational Science Class CPS615 -- Winter Semester 2000. *
Full HTML Index
Interesting point about swapping time is that it is independent of L except for the factor (1-2L) which is a factor of two increase from L=1 to largest L
After spending this time, one can then do "in-cache" computations even in an iterative algorithm
Computations are larger by a factor L (roughly log of Cache Size) compared to cost of re-ordering

HTML version of Basic Foils prepared April 22 2000

Foil 34 Sequential and Parallel Performance IV

From Parallel FFT and use in PDE Solvers Computational Science Class CPS615 -- Winter Semester 2000. *
Full HTML Index
We can use the same idea to improve the parallel algorithm
Now the top P=logNproc digits label processor number and we transfer these to digits stored inside processor. This involves communication but after this transfer, all computations are internal to a processor
  • Total Communication time is 2(d-P). (1-2L) . Tcomm

HTML version of Basic Foils prepared April 22 2000

Foil 35 Sequential and Parallel Performance V

From Parallel FFT and use in PDE Solvers Computational Science Class CPS615 -- Winter Semester 2000. *
Full HTML Index
An interesting point about the resultant communication overhead is that one no longer gets the logNproc dependence in the numerator
Remember that in D dimensions, we found fcomm proportional to n-1/D where n is grain size
For FTT fcomm is proportional to 1/log2n which is "standard" formula with D=?

HTML version of Basic Foils prepared April 22 2000

Foil 36 Parallel FFT and Hypercubes I

From Parallel FFT and use in PDE Solvers Computational Science Class CPS615 -- Winter Semester 2000. *
Full HTML Index
In days gone by, We used to use hypercube topology for parallel machines
Such machines have 2P processor nodes labeled
and the communication channels are such that every node x is connected to P other nodes -- connections run to those nodes whose binary representations differs in just one digit. For instance
and
are connected

HTML version of Basic Foils prepared April 22 2000

Foil 37 Parallel FFT and Hypercubes II

From Parallel FFT and use in PDE Solvers Computational Science Class CPS615 -- Winter Semester 2000. *
Full HTML Index
The relevance of this to Cooley Tukey FFT can be seen from basic recursive step which precisely involves linking Fourier coefficients which also just differ by one in a single binary digit
Hypercube had other useful features e.g. the 64 node 6 dimensional hypercube includes meshes (8 by 8, 4 by 4 by 4, 2 by 8 by 4 etc.)
However nowadays one builds general purpose switches (communication systems) and does not focus on particular communication topologies
Hypercube Topology for 8 machines
Seitz and
Fox with
Caltech
64 node
Hypercube
1983

HTML version of Basic Foils prepared April 22 2000

Foil 38 Multi Dimensional FFT's

From Parallel FFT and use in PDE Solvers Computational Science Class CPS615 -- Winter Semester 2000. *
Full HTML Index
Multi-dimensional FFT's are very common and the most important for parallel computing as they are more computationally intense
  • Images naturally give two dimensional FFT's
Take a 2d2 by 2d1 FFT in two dimensions and label it by a d digit binary number with d=d1+d2
Then all discussions are essentially identically to one dimensional FFT with dimension d
The algorithm is different but communication issues and strategy of re-ordering binary index are same
In particular one would normally first do the d1 transform in say low d1 bits and then swap upper d2 bits into lower part of d digit label -- now do d2 digit transform. This way one again only gets communication in swap and otherwise one has perfectly one dimensional parallel FFT's inside each processor

© Northeast Parallel Architectures Center, Syracuse University, npac@npac.syr.edu

If you have any comments about this server, send e-mail to webmaster@npac.syr.edu.

Page produced by wwwfoil on Mon Apr 24 2000