Given by Geoffrey C. Fox at Computational Science Class CPS615 in the Spring Semester 2000. Foils prepared April 22, 2000
Summary of Material
We start by motivating the FFT (Fast Fourier Transform) as a solver for Poisson's equation
Then we discuss the sequential 1D discrete FFT in both DIF (Decimation in Frequency) and DIT (Decimation in Time) modes
We describe the general N = N1·N2 case but soon specialize to the binary (Cooley-Tukey) FFT
For the binary case, we go through parallelism and the use of cache, giving a performance analysis
These lectures motivate later lectures on the Fast Multipole method as a general Green's function solver, which is more flexible than the FFT
CPS615 Computational Science
Spring Semester 2000
Geoffrey Fox
Northeast Parallel Architectures Center
Syracuse University
111 College Place
Syracuse NY
gcf@npac.syr.edu
gcf@cs.fsu.edu
Lecture 24 of Demmel's 1999 CS267 course at UC Berkeley
FFTW (Fastest Fourier Transform in the West) at http://www.fftw.org
http://www.eptools.com by Engineering Productivity Tools Inc., which has a good tutorial on the FFT
Chapter 11 of "Solving Problems on Concurrent Processors, Volume 1", Prentice Hall, 1988 (chapter written by John Salmon)
Express any 2D function defined in 0 ≤ x, y ≤ 1 as a series Φ(x,y) = Σ_j Σ_k Φ_jk sin(πjx) sin(πky)
The inverse of this is: Φ_jk = 4 ∫₀¹ ∫₀¹ Φ(x,y) sin(πjx) sin(πky) dx dy
Poisson's equation ∂²Φ/∂x² + ∂²Φ/∂y² = f(x,y) becomes (-π²j² - π²k²) Φ_jk = f_jk for each mode (j,k)
This implies the PDE can be solved exactly algebraically: Φ_jk = f_jk / (-π²j² - π²k²)
So the solution of Poisson's equation involves the following steps:
1) Find the Fourier coefficients f_jk of f(x,y) by performing the integral above
2) Form the Fourier coefficients of Φ by Φ_jk = f_jk / (-π²j² - π²k²)
3) Construct the solution by performing the sum Φ(x,y) = Σ_j Σ_k Φ_jk sin(πjx) sin(πky), as in the sketch below
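To make the three steps concrete, here is a minimal Python sketch (our illustration, not code from the foils); it uses SciPy's discrete sine transform in place of the continuous sine-series integral, and the function name poisson_solve and the interior-grid convention are assumptions:

```python
import numpy as np
from scipy.fft import dstn, idstn

def poisson_solve(f):
    # f holds samples of the right-hand side on the n x n interior grid
    # of the unit square (homogeneous boundary values assumed).
    n = f.shape[0]
    fjk = dstn(f, type=1, norm="ortho")           # step 1: coefficients f_jk
    j = np.arange(1, n + 1)                       # sine modes j, k = 1, 2, ...
    jj, kk = np.meshgrid(j, j, indexing="ij")
    phijk = fjk / (-(np.pi * jj) ** 2 - (np.pi * kk) ** 2)  # step 2: divide
    return idstn(phijk, type=1, norm="ortho")     # step 3: sum the series
```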
There is another version of this (the Discrete Fourier Transform) which deals with functions defined at grid points rather than directly with the continuous integral.
Let f be a typical 1D function defined on a grid and labeled with index m = 0 ... N-1.
Let i = sqrt(-1), and index matrices and vectors from 0.
The Discrete Fourier Transform G(f) of an N-element vector f is another vector of length N, given by the matrix-vector multiplication Ω f,
where Ω is the N×N matrix with matrix elements:
Ω_km = ω^(-k·m)
and ω is:
ω = e^(2πi/N) = cos(2π/N) + i·sin(2π/N)
This is a complex number whose Nth power is 1 and is therefore called the Nth root of unity.
E.g., for N = 4:
ω = 0+1i, ω² = -1+0i, ω³ = 0-1i, ω⁴ = 1+0i
f can be reconstructed from its discrete Fourier transform G(f) by
f = Ω* G(f)/N, where * denotes complex conjugation.
Most applications require both calculating FFTs and reconstructing functions from their FFTs.
For solving the Poisson equation and various other applications, we use variations on the FFT.
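As a check on these definitions, a short NumPy sketch (our illustration) builds Ω, applies it, and reconstructs f via f = Ω* G(f)/N:

```python
import numpy as np

def dft_matrix(N):
    # Omega[k, m] = omega**(-k*m), with omega = exp(2*pi*i/N) the Nth root of unity
    omega = np.exp(2j * np.pi / N)
    k, m = np.meshgrid(np.arange(N), np.arange(N), indexing="ij")
    return omega ** (-(k * m))

N = 8
f = np.random.rand(N) + 1j * np.random.rand(N)
Omega = dft_matrix(N)
G = Omega @ f                        # forward transform: O(N^2) matrix-vector product
f_back = np.conj(Omega) @ G / N      # reconstruction: f = Omega* G(f) / N
assert np.allclose(f, f_back)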
The basic discrete Fourier transform is: G_N(k,f) = Σ_m ω^(-km) f(m), summed over m = 0 ... N-1.
In a naïve analysis this takes O(N²) complex floating-point additions and multiplications. However, we can group terms together to reduce this to a time O(N log N).
This can be done in many ways, and it starts with a factorization N = N1·N2 (a product of integers) and a recursive iteration of the factorization down to values N1, N2 for which special algorithms exist.
There are two special cases of particular importance:
Decimation in frequency (DIF): N1 = 2, N2 = N/2, where it is the frequency-space index k of G_N(k,f) that is decimated
Decimation in time (DIT): N1 = N/2, N2 = 2, where it is the time-domain index m of f(m) that is decimated
These are used in cases where N = 2^d is a power of 2, so the decimations can be applied recursively d times.
You can apply N = 2^d to a general value of N by taking, say, N = 773 and adding 251 zeros to pad it to 1024, as in the sketch below.
This can be wasteful, especially in multidimensional transforms, where you waste space in the x, y, and z directions.
The binary FFT illustrates the issues and has much simpler algebra -- FFTW uses a special-purpose compiler to cope with the general case.
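For illustration (our example, not from the foils), padding N = 773 up to 1024 in NumPy looks like this:

```python
import numpy as np

f = np.random.rand(773)                             # general-length input
padded = np.concatenate([f, np.zeros(1024 - 773)])  # add 251 zeros: 2**10 = 1024
G = np.fft.fft(padded)                              # binary FFT now applies
```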
Consider the exponential term ω^(-km) and divide indices k and m into two terms: k = 2k' + k0 with k0 = 0 or 1, and m = m' + (N/2)m_{d-1} with 0 ≤ m' < N/2.
Then ω^(-km) = S · (ω²)^(-k'm'), and the term (ω²)^(-k'm') is just what we need for an FFT of size N/2, as ω² is the (N/2)th root of unity.
We now will see that the factor S will give us a sum of two such FFTs with particular algebraic forms for the functions to be Fourier transformed.
Now ω^(-2k'(N/2)m_{d-1}) = ω^(-Nk'm_{d-1}) is +1, since ω^N = 1.
Further, ω^(-k0(N/2)m_{d-1}) = (-1)^(k0·m_{d-1}), so S = ω^(-k0 m') · (-1)^(k0·m_{d-1}).
Define two functions fE, fO of half the size of f (i.e., vectors of length N/2), in terms of whose FFTs we will express the FFT of f.
fE and fO are obtained by summing and multiplying as instructed by the factor S: fE(m') = f(m') + f(m' + N/2) gives G_N(2k', f) = G_{N/2}(k', fE), and fO(m') = ω^(-m') (f(m') - f(m' + N/2)) gives G_N(2k'+1, f) = G_{N/2}(k', fO).
We can now apply the same idea recursively: chop off the lowest binary digit of k so as to derive a formula for each FFT in terms of two more FFTs of half the size.
Let C(N) be the computational complexity of the FFT of size N.
Let Tbutterfly = 2T+ + T* be the time to calculate fE and fO for one value of m'.
Here T+ is the time for a complex addition and T* the time for a complex multiplication.
We assume that the complex twiddle factor T_N(m') = ω^(-m') has been precalculated and is available by table lookup.
Then C(N) = 2C(N/2) + (N/2)·Tbutterfly
This can be solved as C(N) = N·log₂N·Tbutterfly/2
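A minimal recursive decimation-in-frequency sketch in Python (our illustration, assuming the ω^(-km) sign convention above) makes the recursion and the fE/fO split explicit; it agrees with NumPy's FFT:

```python
import numpy as np

def fft_dif(f):
    # Radix-2 DIF: split the frequency index k into even (k0 = 0) and odd (k0 = 1)
    N = len(f)
    if N == 1:
        return f.copy()
    half = N // 2
    m = np.arange(half)
    fE = f[:half] + f[half:]                                   # feeds G(2k')
    fO = (f[:half] - f[half:]) * np.exp(-2j * np.pi * m / N)   # feeds G(2k'+1)
    G = np.empty(N, dtype=complex)
    G[0::2] = fft_dif(fE)      # even frequency components
    G[1::2] = fft_dif(fO)      # odd frequency components
    return G

f = np.random.rand(16).astype(complex)
assert np.allclose(fft_dif(f), np.fft.fft(f))
```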
Consider the exponential term again and divide indices k and m into two terms in a different way: m = 2m' + m0 with m0 = 0 or 1, and k = k' + (N/2)k_{d-1} with 0 ≤ k' < N/2.
Then ω^(-km) = ω^(-km0) · ω^(-2km'), and again the term (ω²)^(-k'm') is just what we need for an FFT of size N' = N/2.
We now will see that the factor S will give us a sum of two different (from DIF) N/2-length FFTs.
Now ω^(-2m'(N/2)k_{d-1}) = ω^(-Nm'k_{d-1}) is +1, since ω^N = 1.
Further, ω^(-km0) = ω^(-k'm0) · (-1)^(k_{d-1}·m0), so S = ω^(-k'm0) · (-1)^(k_{d-1}·m0).
Define two (different from DIF) functions fE, fO of half the size of f (i.e., vectors of length N/2), in terms of whose FFTs we will express the FFT of f.
fE and fO are obtained as instructed by the factor S: fE(m') = f(2m') (the even time samples) and fO(m') = f(2m'+1) (the odd samples), with G_N(k, f) = G_{N/2}(k', fE) + ω^(-k')·(-1)^(k_{d-1})·G_{N/2}(k', fO).
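The corresponding decimation-in-time sketch (again our illustration, same conventions) splits the time samples into even and odd subsets and combines them with the twiddle factor ω^(-k'):

```python
import numpy as np

def fft_dit(f):
    # Radix-2 DIT: split the time index m into even (m0 = 0) and odd (m0 = 1)
    N = len(f)
    if N == 1:
        return f.copy()
    GE = fft_dit(f[0::2])                    # FFT of fE: even time samples
    GO = fft_dit(f[1::2])                    # FFT of fO: odd time samples
    k = np.arange(N // 2)
    t = np.exp(-2j * np.pi * k / N) * GO     # twiddle omega**(-k') times G(fO)
    return np.concatenate([GE + t, GE - t])  # k_(d-1) = 0 and k_(d-1) = 1 halves
```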
The recursive structure of the binary FFT algorithm is a complete binary tree of log₂N levels, illustrated for N = 16.
At each phase, one relates the FFT of M numbers to the FFTs of the even and odd subsets fE, fO of this set -- these are of size M/2.
(Figure: the recursion tree, with the binary digits (XXXX) of the relevant elements marked.)
e.g., XXX1 denotes all odd indices m; XX11 is the set 4m'+3, where m' lies between 0 and 3.
DIT and DIF have identical computational complexity analyses at the superficial level used so far.
First we will discuss a way of thinking about binary FFTs which will make the later discussion much clearer.
The critical idea is to represent indices by their binary representation, written as d-digit binary words for k and m:
e.g., k = (k_{d-1} k_{d-2} ... k_1 k_0) and m = (m_{d-1} m_{d-2} ... m_1 m_0), i.e., m = Σ_l m_l 2^l.
Here, of course, the vector size is N = 2^d.
For DIT, the first step manipulates (in "butterfly fashion") the lowest binary digit m_0 of the time index and creates the highest binary digit k_{d-1} of the frequency domain.
Note that most FFT implementations overwrite f(m) by its FFT G_N(k,f) as they go along.
This implies that we start with f(m) labeled by the binary digits (m_{d-1} ... m_1 m_0)
and end with G_N(k,f) labeled by (k_0 k_1 ... k_{d-1}).
Note the characteristic bit reversal in the storage of G_N(k,f): each digit k_{d-1-p} is created at the "wrong end" of the label, in the position vacated by m_p.
We iterate this idea with phases p running from 0 to d-1. At the p'th phase, one manipulates the p'th binary digit of f(m) and forms the (d-1-p)'th binary digit of G_N(k,f).
Thus, starting with f(m) with binary labeling (m_{d-1} ... m_1 m_0),
one ends up with G_N(k,f) with the totally reversed index scheme (k_0 k_1 ... k_{d-1}).
Note that in recursive mode, one starts forming the recursion at phase p = 0, but the first computations are done at phase p = d-1, where one manipulates f(m) with the highest binary digit m_{d-1} and forms G_N(k,f) with binary digit k_0.
Data dependencies in the 1D FFT show a characteristic butterfly structure.
We show 4 processing phases p.
At phase p for DIT, one manipulates the p'th binary digit of f(m).
(Figure: butterfly data dependencies for N = 16; the index m is shown with binary labels, and the phases 3, 2, 1, 0 run across the diagram.)
Traditionally, algorithms are iterative, going across each phase in the tree, starting with all the N/2 two-point FFTs at phase d-1 and ending by combining two N/2-point FFTs at phase 0.
We can decompose in many ways, but here we take the simplest case of block decomposition, illustrated for N = 16 on the next foil.
Let Nproc = 2^P be the number of processors in the parallel computer.
Consider the d digits in the binary representation of m: the top P digits label the 2^P processors, and the remaining d-P digits label the position within a processor.
Remembering that the first computations are performed at phase d-1, we see that one needs communication for the first few phases (P of them), as dependency lines cross processor boundaries. For the final phases, all the information needed for an FFT component is stored within one processor.
So communication is needed for phases d-1 through d-P,
and no communication at all for phases d-P-1 through 0, as in the helper below.
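Stated as a one-line helper (our illustration, not from the foils):

```python
def phase_needs_comm(p, d, P):
    # Phase p manipulates binary digit p of m; digits d-1 .. d-P select the
    # processor under block decomposition, so those phases cross processors.
    return d - P <= p <= d - 1
```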
Note that the end result -- the FFT G_N(k,f) -- is bit-reversed; we will discuss later the communication cost of undoing this bit reversal.
(Figure: the N = 16 butterfly redrawn with the binary representation of index m and a natural block decomposition over 4 processors; processor boundaries are marked between the blocks, and phases 2 and 3 show communication as dependency lines cross the processor boundaries.)
As discussed, we have two types of phases.
The sequential time Tsequential(phase p) is identical for each phase p, at N·Tbutterfly/2.
Tparallel(phases 0 ≤ p ≤ d-P-1) = N·Tbutterfly/(2·Nproc), as these phases are perfectly parallel (load balanced).
At the remaining phases one must communicate: in each computation, fE and fO (DIT) are exchanged between two processors. This must be done for every one of the N/Nproc points stored in each processor.
At phase q, the processor holding fE fetches fO and forms in place (i.e., in the location originally holding fE) the FFT component with k_{d-q-1} = 0 (stored in the place with m_q = 0).
Similarly, the processor holding fO forms the FFT component with k_{d-q-1} = 1.
Tparallel(phases d-P ≤ p ≤ d-1) = N·(Tcalc + Tcomm)/Nproc, and again this is perfectly parallel (load balanced).
Tcomm is the time to swap one complex word between 2 processors (to reduce latency, this would be done as a block transfer of N/Nproc words).
Tcalc = T+ + T*, as each processor must both multiply and add/subtract. Note this implies that one loses parallelism in the multiplication, as both processors must do it.
The communication overhead is fcomm = Tparallel·Nproc/Tsequential - 1, combined in the sketch below.
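Combining the phase timings above into this formula gives a small sketch like the following (our illustration; the parameter names are assumptions):

```python
def fcomm_simple(d, P, t_add, t_mul, t_comm):
    # f_comm = Tparallel * Nproc / Tsequential - 1 for the simple algorithm
    t_butterfly = 2 * t_add + t_mul        # Tbutterfly = 2T+ + T*
    t_calc = t_add + t_mul                 # both processors add and multiply
    seq = d * t_butterfly / 2              # d phases at N*Tbutterfly/2, per point
    par = (d - P) * t_butterfly / 2 + P * (t_calc + t_comm)  # per point * Nproc
    return par / seq - 1
```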
One can, however, for DIT (and DIF) get much better performance if one avoids the "owner computes" rule and changes which processor calculates a given FFT component.
This can be seen by examining a typical step in a phase of the parallel FFT where communication is needed in the current algorithm. Consider any two processors -- called a and b -- which need to swap data in the current algorithm.
Alternatively, we could send the even-indexed members of fa to b and the odd-indexed entries of fb to a.
The communication overhead fcomm = Tparallel·Nproc/Tsequential - 1 is correspondingly reduced:
we have avoided the load imbalance and halved the communication compared to the simple algorithm. In later foils we will find even better methods that get rid of the log₂Nproc term.
For DIF, the phases are of course reversed in the way they step through the binary digits of f(m) and G_N(k,f).
One still has d phases labeled by p.
At the first recursion, p = 0, one manipulates f(m) with the highest binary digit m_{d-1} and forms G_N(k,f) with binary digit k_0.
The first DIF computations are done at the end of the recursion, with the "natural" digits m_0 and k_{d-1} of G_N(k,f).
DIF has an identical parallel performance to DIT, although some of the steps differ in detail.
Now, to understand the performance of general algorithms in parallel or in caches, one must understand the storage of the arrays.
Note first that labeling is not necessarily the same as storage:
if p(l) is any permutation of l = 0 ... d-1, then we can store the element labeled (m_{d-1} ... m_1 m_0) at the location whose l'th storage digit is m_{p(l)}.
This can be used to improve sequential performance by ensuring that when a given digit l is being processed, its storage digit p(l) is in the range 0 ≤ p(l) < L, where the cache can hold 2^L entries.
This implies a reordering of computations, so that L digits are done at a time: one fixes storage indices d-1, d-2, d-3, ..., L and calculates a 2^L-point FFT on the lower indices.
So the basic operation needed is one of reordering: swapping one set of binary digits of the storage index with another.
This reordering is also the algorithm used to undo the bit reversal if needed, as in the sketch below.
An interesting point about the swapping time is that it is independent of L, except for the factor (1 - 2^(-L)), which is a factor-of-two increase from L = 1 to the largest L.
After spending this time, one can then do "in-cache" computations, even in an iterative algorithm.
The computations are larger by a factor L (roughly the log of the cache size) compared to the cost of the reordering.
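As a concrete instance of such a digit reordering, here is a bit-reversal permutation sketch (our illustration):

```python
import numpy as np

def bit_reverse_permute(f, d):
    # Move the entry labeled (m_(d-1) ... m_1 m_0) to location (m_0 m_1 ... m_(d-1)),
    # undoing the bit reversal left behind by the in-place binary FFT.
    out = np.empty_like(f)
    for m in range(1 << d):
        r = int(format(m, f"0{d}b")[::-1], 2)   # reverse the d binary digits
        out[r] = f[m]
    return out
```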
We can use the same idea to improve the parallel algorithm.
Now the top P = log₂Nproc digits label the processor number, and we transfer these to digits stored inside the processor. This involves communication, but after this transfer all computations are internal to a processor.
An interesting point about the resultant communication overhead is that one no longer gets the log₂Nproc dependence in the numerator.
Remember that in D dimensions, we found fcomm proportional to n^(-1/D), where n is the grain size.
For the FFT, fcomm is proportional to 1/log₂n, which is the "standard" formula with D = ∞.
In days gone by, we used to use the hypercube topology for parallel machines.
Such machines have 2^P processor nodes, labeled by P-digit binary numbers, and the communication channels are such that every node is connected to P other nodes -- connections run to those nodes whose binary representations differ in just one digit. For instance, the nodes labeled 10110 and 10100 are connected.
The relevance of this to the Cooley-Tukey FFT can be seen from the basic recursive step, which precisely involves linking Fourier coefficients that likewise differ in just a single binary digit; see the sketch below.
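The one-digit-difference rule is easy to state in code (our illustrative helper):

```python
def hypercube_neighbors(x, P):
    # Node x in a 2**P-node hypercube connects to the P nodes whose binary
    # labels differ from x in exactly one digit.
    return [x ^ (1 << bit) for bit in range(P)]

print([format(n, "05b") for n in hypercube_neighbors(0b10110, 5)])
```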
The hypercube had other useful features: e.g., the 64-node 6-dimensional hypercube includes meshes (8 by 8, 4 by 4 by 4, 2 by 8 by 4, etc.).
However, nowadays one builds general-purpose switches (communication systems) and does not focus on particular communication topologies.
(Figure: hypercube topology for 8 machines.)
(Photo: Seitz and Fox with the Caltech 64-node hypercube, 1983.)
Multidimensional FFTs are very common and are the most important for parallel computing, as they are more computationally intense.
Take a 2^d2 by 2^d1 FFT in two dimensions and label it by a d-digit binary number with d = d1 + d2.
Then all discussions are essentially identical to the one-dimensional FFT with dimension d.
The algorithm is different, but the communication issues and the strategy of reordering the binary index are the same.
In particular, one would normally first do the d1-bit transforms in, say, the low d1 bits, and then swap the upper d2 bits into the lower part of the d-digit label -- now do the d2-bit transforms. This way one again only gets communication in the swap, and otherwise one has purely within-processor one-dimensional FFTs; see the sketch below.
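In serial NumPy, the row-then-column strategy looks like this (our sketch; the parallel version would insert the digit swap between the two passes):

```python
import numpy as np

def fft2_two_passes(f):
    step1 = np.fft.fft(f, axis=1)      # d1-bit transforms along each row
    return np.fft.fft(step1, axis=0)   # then d2-bit transforms along each column

f = np.random.rand(8, 16)
assert np.allclose(fft2_two_passes(f), np.fft.fft2(f))
```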