Each processor:

1. reads the rows of data assigned to it from the file.
2. performs a 1D FFT on each of its rows.
3. sends the results to every other processor so that the data can be transposed.
4. receives the corresponding blocks from the other processors.
5. performs a 1D FFT on each row of the transposed data.
6. writes the result back to the file.

For more on implementation issues, please refer to this link.
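The six steps above can be sketched in a single process, with the "processors" simulated as blocks of rows. This is only an illustrative sketch under several assumptions that are not in the original text: the input is a square n x n matrix held in memory rather than read from a file, n is a power of two, the number of processors divides n, and the all-to-all exchange is modeled with plain list indexing instead of real message passing. The names `fft1d` and `distributed_fft2d` are hypothetical.

```python
import cmath


def fft1d(x):
    """Recursive radix-2 FFT of a list; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft1d(x[0::2])
    odd = fft1d(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out


def distributed_fft2d(matrix, num_procs):
    """Simulate the row-decomposed 2D FFT on an n x n matrix (assumed layout)."""
    n = len(matrix)
    rows_per_proc = n // num_procs  # assumes num_procs divides n

    # Step 1: each simulated processor takes its contiguous block of rows.
    local = [matrix[p * rows_per_proc:(p + 1) * rows_per_proc]
             for p in range(num_procs)]

    # Step 2: 1D FFT on every locally owned row.
    local = [[fft1d(row) for row in rows] for rows in local]

    # Steps 3-4: all-to-all exchange. Processor q ends up owning global
    # columns q*rows_per_proc .. (q+1)*rows_per_proc - 1, stored as rows.
    received = []
    for q in range(num_procs):
        new_rows = []
        for j in range(q * rows_per_proc, (q + 1) * rows_per_proc):
            # Column j, gathered in order from every processor's rows.
            new_rows.append([row[j] for p in range(num_procs)
                             for row in local[p]])
        received.append(new_rows)

    # Step 5: 1D FFT on every row of the transposed data.
    local = [[fft1d(row) for row in rows] for rows in received]

    # Step 6: gather the result. It is still transposed at this point, so
    # this sketch transposes it back (an assumption; the original text does
    # not say in which layout the result is written).
    transposed = [row for rows in local for row in rows]
    return [[transposed[c][r] for c in range(n)] for r in range(n)]


if __name__ == "__main__":
    n = 4
    data = [[complex(r * n + c) for c in range(n)] for r in range(n)]
    result = distributed_fft2d(data, num_procs=2)
    print(result[0][0])  # DC term: the sum of all input values
```

In a real parallel run the exchange in steps 3 and 4 would be a collective communication among the processors, and it is typically the dominant cost, since every processor must send a block to every other processor.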