Assignment 4

This assignment asks you both to implement a parallel 3D fast fourier transform and to model your implementation.

Again, you'll be working in groups of two to three people. This assignment is due on (or by) Friday, March 21st (the last day before spring break). The TAs will then test and time the codes and the results of the contest will be announced after spring break.


Getting Started

A list of references which should be enough to get you started is provided here . The references listed range from extremely basic introductions to the fourier and fast fourier transforms to publically available codes for the FFT to recent papers on efficient parallelization of the FFT.


Part I: Observing the Fourier Transform in Action!

Check out a Java demo of a 1D Fourier Transform located here.

A.The inverse Fourier transform performs the frequency to time domain transformation. The Discrete Inverse FT is computed by (where h_k is the kth value in the time domain):

h_k = 1/N * SUM(H_n* e^(-2*pi*i*k*n/N)) where n varies from 0 to N-1

Go to the 1D Fourier Transform Java Demo. Clear all points to zero, and then place a single spike in the frequency domain (F(K) real plane at the frequency 1 (the second point, since the first point is the 0 frequency). Notice that the inverse of this single spike is a cos wave in the real plane and a negative sin wave in the imaginary plane. Given the formula above, return the symbolic answer for the real and imaginary time-domain functions as a function of k. (HINT: H_n is nonzero only when n == 1.)  Note that e^(i*x) = cos(x) + i*sin(x). Assume that the amplitude of the spike is A.

B. What is the DFT of a constant function, e.g. y = 4 + 0*i , and Why? (You can use the Java demo as a guide.)

C. The formula for the 1D DFT is:

H_n = SUM(h_k*e^(2*pi*i*k*n/N)) where k varies from 0 to N-1. Using the Java Demo, zero everything and then manually draw a cosine wave with amplitude A in the real plane of the time domain.

Notice that this is transformed into two spikes. Why is this transformed into two spikes? What frequency does each spike represent? Are they the same or different frequencies? Given that the amplitude of the cosine wave you drew was A, what is the amplitude of each spike? Show your math. (HINT: It might be easier for you to think in terms of the inverse DFT, e.g. given these two spikes, what is the inverse transform in each of the imaginary and real planes?) Note that h_k in this example is simply (A*cos(2*pi*k/N)).


Part II: Observing network behavior

Check out the LogP paper located on the links page.

In order to estimate the running time of your 3D FFT Algorithm, you will have to create a model of the underlying network. One such model is the LogP model, which models the overhead, latency, and gap of short messages. There is an extension to LogP, called LogGP which models not only the cost for short messages but also the cost for large (bulk) messages as well.

After you have picked the language you are going to use for the 3D FFT implementation, start out by measuring the latency and bandwidth for the communication primitives in your language.

For Split-C,

For MPI

For HPF

For Sun Threads

This should give you a feel for the costs of the different communication primitives in your language which will help you:


Part III: Parallel Fast Fourier Transform

The basic rules for the contest are as follows:

Specification:

  • Interface:
  •         void fft3d(int n, 
                       double (*gen_re)(int n, int i, int j, int k), 
                       double (*gen_im)(int n, int i, int j, int k), 
                       char *outfile);
            
  • Input:
  • n: # of points
  • gen_re: pointer to a function which takes n, i, j, and k and returns the real part of the initial value of A[i*n^2 + j*n + k]
  • gen_im: pointer to a function which takes n, i, j, and k and returns the imaginary part of the initial value of A[i*n^2 + j*n + k]
  • outfile: name of an output file into which your code will output the 3D discrete fourier transform of the input, in normal order.
  • Assumptions:
  • the input is complex and the output will be complex.
  • the output should be in the following form:
  • The first line contains a single integer. This is n.
  • It's followed by 2*n^3 numbers, one pair of numbers per line. Line (n^2*i+n*j+k) contains the pair of numbers:
    Re(A(i,j,k)) Im(A(i,j,k))
  • n is a power of two
  • the number of processors will be a power of two.
  • Contest:

  • Your code will probably consist of the following basic steps:
  • read in (or calculate) the input data
  • calculate the roots of unity that you'll need and store them
  • perform the fourier transform (returning the data in whatever way is most convenient)
  • transform the data as needed so that it's back to the original order of the input
  • write out the data
  • Timings:
  • only the third step (performing the fourier transform) will be timed. Note that this means you're free to return the output in bit-reversed order for timing purposes, although for testing purposes we'll require you to write the output to a file in normal order.
  • The contest will be run on 64=4^3 processors
  • Language: You may write the code in the parallel language of your choice.
  • Turn in a description of the different parallel algorithms/implementations you considered as well as an explanation of why you chose the algorithm/implementation you did. Describe anything special your code does (for example, does it work on 3D blocks that aren't cubes?).

    You should hand in a pointer to a directory that contains your code, or alternatively a pointer to a web page that contains links to your code (and your Makefile(s)).

    We'd like the following format:

    To build your code for timing only (no output is needed): make fast

    In other words, your makefile should have a dependency for 'fast' which will end up building your code without any output.

    To build your code for correctness: make

    NOTE: The code should be the same. IOW - don't use two separate algorithms for performance and correctness. We just need an easy way to compile your code when we are testing for performance vs. correctness.

    Your program should output the FFT data and/or timing information to stdout or a file. For example, your program should be able to run either via: a.out > fft.output or a.out fft.output

    It's your choice - you don't have to support both options. Just document your code using a README file in the code directory telling us which method you've chosen, and any other environment variables that you need set, etc.


    PART IV: Performance Model

    Using your performance assessment of the communication primitives, and a model of the speed of the processor, develop a model of the run time of your 3D FFT given N. Plot the estimated running time and the real running time for two different input sizes. What caused your model to behave incorrectly?


    Testing

  • A serial 3D discrete fourier transform (not a fast fourier transform, so it does the 1D dft in O(n^2) time) written in C is provided for you to test your codes against. Obviously the code provided is slow, and will not be able to solve very large problems. The code is here .
  • A tool for visualizing your solutionwill also be provided. Check the newsgroup for a pointer.