Full HTML for Basic Foilset: Further PDE Solvers for the NAS Benchmarks

Given by Geoffrey C. Fox at CPS713 Case Studies in Computational Science in Spring Semester 1996. Foils prepared 15 March 1996.
Summary of Material


This is the third of three foilsets on CFD and the NAS Benchmarks
It completes the analysis of parallel ADI begun with the first (BT) application benchmark
It covers the second (SP) and third (LU) benchmarks, their diagonalized ADI and SSOR methods, and their parallelization
Details of SSOR and its parallelization with different decompositions
Relation of SSOR to related iterative solvers -- SLOR, red-black, zebra
Brief remarks on other more sophisticated modern solvers:
  • ILU and Incomplete Cholesky
  • Domain Decomposition
  • Multigrid

Table of Contents for full HTML of Further PDE Solvers for the NAS Benchmarks


1 CPS713 Case Study (II) CFD and Numerical Relativity Benchmarks -- Part III: Further PDE Solvers

2 Abstract of CPS713NAS -- Part III -- NAS Benchmarks -- Further PDE Solvers
3 General Analysis of Parallel ADI Performance -- the Pipeline start-up
4 General Analysis of Parallel ADI Performance -- Dependence of Computation and Communication
5 SP -- The Second NAS Benchmark -- Diagonalized ADI
6 LU or SSOR -- The Third NAS Benchmark -- Roughly Gauss Seidel
7 Symmetric Successive Overrelaxation Structure
8 The Relaxation in Symmetric Successive Overrelaxation
9 Parallelization of LU NAS Benchmark and Gauss Seidel Iteration
10 Ideas behind Wavefront or Hyperplane Parallelization of LU NAS Benchmark
11 Basic Formulae for Hyperplane Parallelization of LU NAS Benchmark
12 Cyclic -- Decomposition I for Hyperplane Method for LU NAS Benchmark
13 Block Cyclic -- Decomposition II for Hyperplane Method for LU NAS Benchmark
14 Block -- Decomposition III for Hyperplane Method for LU NAS Benchmark
15 Ill-Chosen Scattered -- Decomposition IV for Hyperplane Method for LU NAS Benchmark
16 Comments on Three NAS Benchmarks
17 Overview of Iterative Solvers for Partial Differential Equations
18 Iterative Methods for Solving Partial Differential Equations
19 SLOR or Successive Line Over Relaxation Exemplified for Laplace's Equation
20 Solution of SLOR -- Successive Line Over Relaxation
21 Red-Black Point Iteration Schemes
22 Red-Black Line or Zebra Schemes
23 Preconditioners and Other Partial Differential Equation Solution Schemes
24 Incomplete Cholesky Factorization or ILU -- Incomplete LU Decomposition
25 Physical Picture of Domain Decomposition
26 Mathematical Formulation of Domain Decomposition Preconditioning
27 Multigrid Method
28 Physical Picture of the Multigrid Algorithm on a 16 by 16 Grid
29 Mathematical Structure of Exemplar 16 by 16 Multigrid Solution




Foil 1 CPS713 Case Study (II) CFD and Numerical Relativity Benchmarks -- Part III: Further PDE Solvers

Geoffrey Fox
NPAC
111 College Place
Syracuse NY 13244-4100

Foil 2 Abstract of CPS713NAS -- Part III -- NAS Benchmarks -- Further PDE Solvers

This is the third of three foilsets on CFD and the NAS Benchmarks
It completes the analysis of parallel ADI begun with the first (BT) application benchmark
It covers the second (SP) and third (LU) benchmarks, their diagonalized ADI and SSOR methods, and their parallelization
Details of SSOR and its parallelization with different decompositions
Relation of SSOR to related iterative solvers -- SLOR, red-black, zebra
Brief remarks on other more sophisticated modern solvers:
  • ILU and Incomplete Cholesky
  • Domain Decomposition
  • Multigrid

Foil 3 General Analysis of Parallel ADI Performance -- the Pipeline Start-up

Consider the N_NAS^3 grid decomposed onto P processors
We must start each ADI solve at the beginning i,j,k = 1 or at the end i,j,k = N_NAS
For each sweep (where elements are not all stored in the same processor), start half the solves at the beginning and half at the end
  • in each case work towards the middle
After at most N_NAS/2 steps, we will have "made it" to the middle on the initial solve and all processors will be active.
This algorithm is sensitive to latency as, in racing to reach the middle fastest, we have only performed part of one solve in each processor.
  • We communicate only one (5 component) element
Maximizing message size by partial solution of several solves before communication:
  • Decreases communication cost as fewer messages
  • But worsens load balance as it delays the start of processors near the middle

Foil 4 General Analysis of Parallel ADI Performance -- Dependence of Computation and Communication

Just consider the dependence on N_NAS and the number of processors P (a small numerical comparison is sketched at the end of this foil)
  • Ignore constants
Pipeline start-up latency is proportional to N_NAS
Total Computation: 3 N_NAS^3 / P
Communication:
  • 3 Dimensional Decomposition: 3 N_NAS^2 / P^(2/3)
  • 1 Dimensional Decomposition: N_NAS^2
An alternative approach is Redistribution, arranging that each ADI sweep is always inside a node
  • Redistribution Time proportional to N_NAS^3 / P
  • No Pipeline Start-Up
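To make the scaling concrete, here is a minimal sketch (not taken from the benchmark code) that tabulates these terms; the grid size N and the processor counts are arbitrary example values and, as the foil says, constant factors are ignored:

# Rough scaling of parallel ADI cost terms per iteration, ignoring constants
# (illustrative only; N and P values below are assumptions for the example).
def adi_terms(N, P):
    compute      = 3 * N**3 / P            # total computation
    comm_3d      = 3 * N**2 / P**(2.0 / 3)  # communication, 3D decomposition
    comm_1d      = N**2                     # communication, 1D decomposition
    pipeline     = N                        # pipeline start-up latency
    redistribute = N**3 / P                 # redistribution alternative (no start-up)
    return compute, comm_3d, comm_1d, pipeline, redistribute

for P in (16, 64, 256):
    print(P, adi_terms(N=64, P=P))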

Foil 5 SP -- The Second NAS Benchmark -- Diagonalized ADI

This has two essential differences from the first BT benchmark
Firstly, the fourth-order dissipation terms are kept in the implicit (LHS) part of the equations
  • Remember these were kept only in the explicit right-hand side for the first BT benchmark
  • This step makes the ADI equations PENTAdiagonal (2 nonzeros each side of the diagonal) as opposed to TRIdiagonal in BT
The next step is an approximation to the resultant 5 by 5 block matrices on the LHS of the 3 ADI sweeps. These are approximately diagonalized, in a manner described in the benchmark paper, which ignores smaller operators and the dependence of the diagonalization on position
This converts the 5 by 5 BLOCK tridiagonal matrices into a set of 1 by 1 (scalar) pentadiagonal equations
The general parallel analysis is similar to BT as this is still ADI, but there is probably less computation per communication and so lower efficiency.
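As a rough illustration of what one scalar pentadiagonal line solve looks like (the coefficient values below are made up, not taken from SP), a sketch using SciPy's banded solver:

import numpy as np
from scipy.linalg import solve_banded

# Solve one scalar pentadiagonal system (2 bands on each side of the
# diagonal), the kind of line system the diagonalized ADI sweeps produce.
# The coefficient values here are arbitrary illustrative numbers.
n = 64
ab = np.zeros((5, n))        # banded storage: rows are diagonals +2 ... -2
ab[0, 2:]  = 0.05            # second super-diagonal
ab[1, 1:]  = -0.6            # super-diagonal
ab[2, :]   = 2.0             # main diagonal
ab[3, :-1] = -0.6            # sub-diagonal
ab[4, :-2] = 0.05            # second sub-diagonal
rhs = np.ones(n)
du = solve_banded((2, 2), ab, rhs)   # (lower, upper) bandwidths
print(du[:4])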

Foil 6 LU or SSOR -- The Third NAS Benchmark -- Roughly Gauss Seidel

Take the basic implicit equation K ΔU = RHS
and split K = D + Lo + Up where
  • D is diagonal
  • Lo is lower triangular (3 distinct bands)
  • Up is upper triangular (3 distinct bands)
Note Lo and Up are of size Δt and so we can follow the ADI trick and write
K = (D + Lo) (I + D^-1 Up)
  • ignoring terms of order Δt^2
and we get the two sweeps (sketched in code below):
(D + Lo) ΔU1 = RHS
(I + D^-1 Up) ΔU = ΔU1
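A minimal sketch of this two-sweep solve, with small random matrices standing in for the benchmark's operators (in LU itself D, Lo and Up come from the discretized Navier-Stokes equations and Lo, Up carry a factor of Δt):

import numpy as np
from scipy.linalg import solve_triangular

# Two-sweep solve of K ΔU = RHS with K split as D + Lo + Up and
# approximated by (D + Lo)(I + D^-1 Up).  Matrices are illustrative only.
rng = np.random.default_rng(0)
n = 8
D  = np.diag(2.0 + rng.random(n))                 # diagonal part
Lo = np.tril(0.1 * rng.random((n, n)), k=-1)      # strictly lower triangular
Up = np.triu(0.1 * rng.random((n, n)), k=1)       # strictly upper triangular
rhs = rng.random(n)

# Sweep 1: (D + Lo) ΔU1 = RHS   (forward substitution)
dU1 = solve_triangular(D + Lo, rhs, lower=True)
# Sweep 2: (I + D^-1 Up) ΔU = ΔU1   (backward substitution)
dU = solve_triangular(np.eye(n) + np.linalg.inv(D) @ Up, dU1, lower=False)

# The product differs from K = D + Lo + Up only by the O(Δt^2) term Lo D^-1 Up
err = (D + Lo) @ (np.eye(n) + np.linalg.inv(D) @ Up) - (D + Lo + Up)
print(np.allclose(err, Lo @ np.linalg.inv(D) @ Up))   # True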

Foil 7 Symmetric Successive Overrelaxation Structure

The two sweeps of the LU benchmark are just what you get in Gauss-Seidel if you
  • First solve equations 1, 2, ..., Ntot (where Ntot = N_NAS^3)
  • Then sweep backwards solving Ntot, Ntot-1, ..., 1
This is the classic SSOR structure, sketched below:
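The original foil showed the structure as a figure; as a stand-in, the textbook SSOR form for a splitting A = D + L + U (forward sweep followed by backward sweep, with relaxation parameter ω) is:

(D + \omega L)\, x^{k+1/2} = \big[(1-\omega) D - \omega U\big]\, x^{k} + \omega b
(D + \omega U)\, x^{k+1}   = \big[(1-\omega) D - \omega L\big]\, x^{k+1/2} + \omega b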

Foil 8 The Relaxation in Symmetric Successive Overrelaxation

The equations are in fact more complex than written above, as one introduces an additional relaxation parameter ω.
This does not change the computational structure of the equations.

Foil 9 Parallelization of LU NAS Benchmark and Gauss Seidel Iteration

This is interesting and has identical issues to those of parallelizing the Gauss-Seidel iterative solver when one preserves the natural update order and does not use the red-black artifice
Note that for elliptic solvers the natural lexicographic ordering
  • Index in the ξ direction runs fastest, then that in η and then that in ζ
has similar convergence properties to red-black ordering
As red-black has a simple parallelization, one always uses it for elliptic problems
However, according to:
  • Solution of Regular Sparse Triangular Linear Systems on Vector and Distributed-Memory Processors by E. Barszcz, R. Fatoohi, V. Venkatakrishnan, S. Weeratunga, RNR-93-007 (NASA Ames Memo)
  • which looks at the LU NAS Benchmark on the Cray Y-MP, CM-2, and iPSC/860,
this is NOT true for hyperbolic solvers, where lexicographic ordering seems superior to red-black ordering.

Foil 10 Ideas behind Wavefront or Hyperplane Parallelization of LU NAS Benchmark

Let us discuss parallelization of the first, lower triangular equation (D + Lo) ΔU1 = RHS, as the upper triangular case is trivially the same with the order of the indices reversed
RNR-93-007 distinguishes wavefront (2D) and hyperplane (3D) methods and concludes that the hyperplane method is best.
  • We agree but will draw all pictures in two dimensions -- they have obvious three dimensional extensions
Key observation is that:
  • After Processor 1 has finished case i=j=k=1
  • Processor 1 can proceed with case i=2 j=k=1 while
  • Processor 2 can compute the case i=1, j=1, k=2
  • This (only taking k=1 for Processor 1 and k=2 for Processor 2) gives a pipeline with Processor 2 following one grid point behind Processor 1.

Foil 11 Basic Formulae for Hyperplane Parallelization of LU NAS Benchmark

The essential idea is that we divide the equations into sweeps (hyperplanes) labelled by m
  • m = i + j + k
Once one has solved sweeps 1 through m-1, ALL equations of sweep m can be solved INDEPENDENTLY and hence with natural parallelization (see the sketch at the end of this foil)
We will use the owner-computes strategy and assign each grid point (as usual) to a given processor
There are several choices for the assignment of grid points to processors, and it is not clear that RNR-93-007 chooses the best strategy
The natural strategy is NOT BLOCK (divide the grid points into P contiguous cubic domains) but rather BLOCK-CYCLIC
  • BLOCK decomposition has load-balancing problems
  • BLOCK-CYCLIC has better load-balancing but more communication
  • P denotes number of Processors
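A minimal sketch of the hyperplane ordering itself (the grid size N is an arbitrary small example):

# Group grid points by the hyperplane index m = i + j + k.  Within one
# hyperplane every point depends only on already-solved points
# (i-1,j,k), (i,j-1,k), (i,j,k-1), so all points of a hyperplane can be
# updated in parallel once the previous sweep is finished.
from collections import defaultdict

N = 4
hyperplanes = defaultdict(list)
for i in range(1, N + 1):
    for j in range(1, N + 1):
        for k in range(1, N + 1):
            hyperplanes[i + j + k].append((i, j, k))

for m in sorted(hyperplanes):            # sweeps m = 3 ... 3N
    points = hyperplanes[m]
    # all of `points` may be solved independently once sweep m-1 is done
    print(m, len(points))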

Foil 12 Cyclic -- Decomposition I for Hyperplane Method for LU NAS Benchmark

Consider just two Processors P=2
Consider a two-dimensional example with m=1..15 sweeps on an 8 by 8 array of grid points
Example I: Cyclic Distribution in One Dimension
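The original figure is not reproduced here; a minimal sketch of what it illustrates -- cyclic assignment of grid columns to the two processors, and the number of points each processor owns on every hyperplane sweep m = i + j:

# Cyclic (in one dimension) decomposition on an 8x8 grid with P=2 processors:
# column i goes to processor i mod P.  For each hyperplane sweep m we count
# how many points each processor owns, which shows the load balance.
N, P = 8, 2
for m in range(2, 2 * N + 1):              # hyperplane sweeps m = i + j
    load = [0] * P
    for i in range(1, N + 1):
        j = m - i
        if 1 <= j <= N:
            load[i % P] += 1               # cyclic: column i owned by processor i mod P
    print(m, load)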

Foil 13 Block Cyclic -- Decomposition II for Hyperplane Method for LU NAS Benchmark

Consider just two Processors P=2
Consider a two-dimensional example with m=1..15 sweeps on an 8 by 8 array of grid points
Example II: Block Cyclic Distribution in One Dimension

Foil 14 Block -- Decomposition III for Hyperplane Method for LU NAS Benchmark

Consider just two Processors P=2
Consider a two-dimensional example with m=1..15 sweeps on an 8 by 8 array of grid points
Example III: Block Distribution in One Dimension

Foil 15 Ill-Chosen Scattered -- Decomposition IV for Hyperplane Method for LU NAS Benchmark

Consider just two Processors P=2
Consider a two-dimensional example with m=1..15 sweeps on an 8 by 8 array of grid points
Example IV: Scattered Distribution in One Dimension

Foil 16 Comments on Three NAS Benchmarks

It is not clear to me how realistic these algorithms are, although they do clearly illustrate the sparse matrix operations of many important CFD algorithms
ADI (BT and SP) is an important CFD algorithm
The SP set is based on complex approximations to BT which may not be significant in today's world where we have powerful computers
SSOR is an important elliptic (steady-state) solver, but I find its application to implicit hyperbolic equations a little peculiar
  • Normally Δt plays the role of the relaxation parameter
  • Note the benchmark solver uses the choice ω = 1.2 -- not the value near 2 expected for an elliptic solver
Note LU Benchmark does NOT involve "LU Decomposition Linear Equation Solver" which has very different computational complexity and issues

Foil 17 Overview of Iterative Solvers for Partial Differential Equations

This fills in gaps which we did not cover in either CPS615 or the NAS Benchmark CFD modules of CPS713
References are:
  • C. Hirsch, Numerical Computation of Internal and External Flows, Volume 1: Fundamentals of Numerical Discretization, Chapter 12, Wiley, 1988
  • Richard Barrett, Michael Berry, Tony Chan, James Demmel, June Donato, Jack Dongarra, Victor Eijkhout, Roldan Pozo, Charles Romine, Henk van der Vorst, "Templates for the Solution of Linear Systems: Building Blocks for Iterative Methods", SIAM, Philadelphia, 1994

Foil 18 Iterative Methods for Solving Partial Differential Equations

Point Jacobi
Point Gauss-Seidel
Over Relaxation
  • for Jacobi
  • for Gauss-Seidel called SOR -- Successive Overrelaxation
  • SSOR or Symmetric Successive Over Relaxation
Conjugate Gradient
  • This important method has several variants for special circumstances which are well covered in the Templates book
ALL of these except SSOR are covered in CPS615 and in the Templates book
  • SSOR exemplified in CPS713 NAS Benchmark Discussion

Foil 19 SLOR or Successive Line Over Relaxation Exemplified for Laplace's Equation

This is a variant of Gauss-Seidel where we solve not for a single point at a time but for a whole line of points
We illustrate for the Laplace case, where simple point Gauss-Seidel reads:
With two terms already known at the (n+1)'th iteration and two yet to be determined
We now write the more general line equation:
where we solve for all values of i at fixed j, using the known value at the (n+1)'th iteration at index j-1 and the old n'th iteration value at index j+1
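The two equations referred to above appeared as figures in the original foils; a reconstruction of the standard forms they describe, for the 5-point Laplace stencil on a uniform grid, is:

% Point Gauss-Seidel for Laplace's equation (lexicographic sweep):
u^{n+1}_{i,j} = \tfrac{1}{4}\left( u^{n+1}_{i-1,j} + u^{n}_{i+1,j} + u^{n+1}_{i,j-1} + u^{n}_{i,j+1} \right)

% SLOR: solve simultaneously for the whole line at fixed j,
% coupling all i through a tridiagonal system:
u^{n+1}_{i-1,j} - 4\, u^{n+1}_{i,j} + u^{n+1}_{i+1,j} = -\left( u^{n+1}_{i,j-1} + u^{n}_{i,j+1} \right)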

Foil 20 Solution of SLOR -- Successive Line Over Relaxation

The SLOR equations are easily solved, as they form a set of N_NAS tridiagonal equations soluble in about 5 N_NAS operations
One can now use the usual overrelaxation procedure
Note that up to a constant, SLOR has the same complexity as the original SOR
For Laplace's equation, one can show that, with the appropriate optimal ω chosen for each, SLOR is twice as fast as SOR -- as an SLOR iteration requires somewhat more computation, one will not gain this full factor
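A minimal sketch of the Thomas-algorithm tridiagonal solve that each SLOR line requires; the diagonal values match the Laplace line system above, while the right-hand side is a placeholder for the neighbouring-line terms:

import numpy as np

def thomas(a, b, c, d):
    """Solve a tridiagonal system with sub-diagonal a, diagonal b,
    super-diagonal c and right-hand side d in O(n) operations."""
    n = len(d)
    cp = np.empty(n); dp = np.empty(n)
    cp[0] = c[0] / b[0]
    dp[0] = d[0] / b[0]
    for i in range(1, n):
        denom = b[i] - a[i] * cp[i - 1]
        cp[i] = c[i] / denom
        dp[i] = (d[i] - a[i] * dp[i - 1]) / denom
    x = np.empty(n)
    x[-1] = dp[-1]
    for i in range(n - 2, -1, -1):
        x[i] = dp[i] - cp[i] * x[i + 1]
    return x

# One SLOR line for the Laplace stencil: -4 on the diagonal, 1 off-diagonal;
# the right-hand side would come from the neighbouring lines (placeholder here).
n = 8
a = np.ones(n); a[0] = 0.0       # sub-diagonal (a[0] unused)
c = np.ones(n); c[-1] = 0.0      # super-diagonal (c[-1] unused)
b = -4.0 * np.ones(n)            # main diagonal
rhs = -np.ones(n)                # placeholder for -(u_{i,j-1}^{new} + u_{i,j+1}^{old})
print(thomas(a, b, c, rhs))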

Foil 21 Red-Black Point Iteration Schemes

This ordering can be applied to Jacobi (which Hirsch discusses) or to Gauss-Seidel (which we discussed in CPS615); it is used because the red-black ordering is easier to parallelize
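A minimal sketch of one red-black Gauss-Seidel sweep for Laplace's equation (the grid size and boundary data are arbitrary illustrations); each colour's update reads only points of the other colour, so it is trivially parallel:

import numpy as np

# One red-black Gauss-Seidel sweep for Laplace's equation on the interior of
# grid u (boundary values held fixed).  All "red" points ((i+j) even) are
# updated first, then all "black" points.
def red_black_sweep(u):
    for colour in (0, 1):                   # 0 = red, 1 = black
        for i in range(1, u.shape[0] - 1):
            for j in range(1, u.shape[1] - 1):
                if (i + j) % 2 == colour:
                    u[i, j] = 0.25 * (u[i - 1, j] + u[i + 1, j]
                                      + u[i, j - 1] + u[i, j + 1])
    return u

u = np.zeros((10, 10)); u[0, :] = 1.0       # simple boundary condition
for _ in range(50):
    red_black_sweep(u)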

Foil 22 Red-Black Line or Zebra Schemes

These combine the ideas of SLOR with those of red-black ordering
They provide naturally parallel SLOR Schemes

Foil 23 Preconditioners and Other Partial Differential Equation Solution Schemes

There is a set of sophisticated methods of great importance as PDE solvers which are more powerful than the simple iterative solvers
Alternating Direction Iterative ADI Solvers
  • This has already been discussed under the SP and BT NAS Benchmarks
  • It can be viewed as a general preconditioning method when the matrix to be inverted can be written as a diagonal matrix plus a sum of easily solved matrices with small coefficients
Incomplete Cholesky Factorization
And two more which deserve major discussion but which we will overview very superficially
  • Domain Decomposition
  • Multigrid Methods

Foil 24 Incomplete Cholesky Factorization or ILU -- Incomplete LU Decomposition

This used to be a major preconditioning technique for a sparse matrix A: form an approximate factorization P = L U that keeps the sparsity pattern of A
Then apply conventional iterative methods to the matrix P^-1 A
  • i.e. use as preconditioner the P obtained by keeping only those matrix elements of L and U in positions where A was nonzero
Very difficult to parallelize as the computation of the incomplete factors involves a lot of complex dependencies (full LU factorization is easier to parallelize)
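A minimal sketch of ILU preconditioning using SciPy's built-in incomplete factorization (this is illustrative, not the NAS benchmark code; the grid size and drop tolerance are arbitrary choices):

import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import spilu, gmres, LinearOperator

# Build a 2D Laplace matrix, form an incomplete LU factorization, and use it
# as the preconditioner P in a Krylov solve of P^-1 A x = P^-1 b.
n = 32
A = sp.kron(sp.eye(n), sp.diags([1, -4, 1], [-1, 0, 1], shape=(n, n))) \
    + sp.kron(sp.diags([1, 1], [-1, 1], shape=(n, n)), sp.eye(n))
A = A.tocsc()
b = np.ones(n * n)

ilu = spilu(A, drop_tol=1e-4)                     # incomplete factorization
M = LinearOperator(A.shape, matvec=ilu.solve)     # action of P^-1
x, info = gmres(A, b, M=M)
print(info)                                       # 0 means converged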

Foil 25 Physical Picture of Domain Decomposition

This is intuitively obvious, and for that very reason it is a very powerful approach

Foil 26 Mathematical Formulation of Domain Decomposition Preconditioning

Mathematically for two domains this corresponds to preconditioning
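The preconditioning matrix referred to was shown as a figure in the original foils; as a stand-in, the standard two-subdomain block form (ignoring the interface coupling blocks in the preconditioner) is:

A = \begin{pmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{pmatrix},
\qquad
P = \begin{pmatrix} A_{11} & 0 \\ 0 & A_{22} \end{pmatrix},
\qquad
\text{and one iterates with } P^{-1} A .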

Foil 27 Multigrid Method

This is a particular example of the multi-scale or multi-resolution approach which we also find in CPS713 with cluster algorithms in statistical physics
All physical systems have different effects going on at different physical length scales
The best algorithms build this in and tailor the algorithm to the features at each scale
In iterative solvers, we saw in the eigenvalue analysis that
  • Short wavelength features converge fast
  • Slow convergence corresponds to long wavelength effects
In Multigrid, you precondition the matrix to remove long wavelength effects so the local iterative solver only has to address short wavelengths
This is done recursively so you successively remove effects at wavelengths that increase at each step by a factor of two
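A minimal two-grid sketch of this idea for the 1D Poisson equation (the grid sizes and the weighted-Jacobi smoother are illustrative choices, not the scheme of any particular benchmark):

import numpy as np

# One two-grid cycle for -u'' = f with zero boundary values: smooth on the
# fine grid (removes short-wavelength error), transfer the residual to a grid
# twice as coarse and solve there (removes the long wavelengths), interpolate
# the correction back and smooth again.
def laplacian_1d(n, h):
    return (2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)) / h**2

def jacobi(A, u, f, sweeps=3, w=2.0 / 3.0):
    d = np.diag(A)
    for _ in range(sweeps):
        u = u + w * (f - A @ u) / d
    return u

n_c = 7                      # coarse interior points (grid spacing 2h)
n_f = 2 * n_c + 1            # fine interior points   (grid spacing h)
h = 1.0 / (n_f + 1)
A_f = laplacian_1d(n_f, h)
A_c = laplacian_1d(n_c, 2 * h)
f = np.ones(n_f)
u = np.zeros(n_f)

for _ in range(10):                                        # repeat two-grid cycles
    u = jacobi(A_f, u, f)                                  # pre-smoothing
    r = f - A_f @ u                                        # fine-grid residual
    r_c = 0.25 * r[0:-2:2] + 0.5 * r[1::2] + 0.25 * r[2::2]  # full-weighting restriction
    e_c = np.linalg.solve(A_c, r_c)                        # coarse-grid correction
    e = np.zeros(n_f)                                      # interpolate back to fine grid
    e[1::2] = e_c
    e[2:-1:2] = 0.5 * (e_c[:-1] + e_c[1:])
    e[0], e[-1] = 0.5 * e_c[0], 0.5 * e_c[-1]
    u = jacobi(A_f, u + e, f)                              # post-smoothing

print(np.max(np.abs(A_f @ u - f)))                         # residual should be small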

Foil 28 Physical Picture of the Multigrid Algorithm on a 16 by 16 Grid

This example shows four grids
  • 16 by 16 (label 1)
  • 8 by 8 (label 2)
  • 4 by 4 (label 3)
  • 2 by 2 (label 4)

Foil 29 Mathematical Structure of Exemplar 16 by 16 Multigrid Solution

First solve the matrix A4 on the coarsest, 2 by 2, grid
Use the A4 solution to precondition and solve the matrix A3 on the 4 by 4 grid (grid 3)
Use the A3 solution to precondition and solve the matrix A2 on the 8 by 8 grid (grid 2)
Use the A2 solution to precondition and solve the matrix A1 on the 16 by 16 grid (grid 1)
Note both Domain Decomposition and Multigrid have Good Physical Motivation but they are quite different
  • Multigrid works from the bottom (fine scale) up
  • Domain Decomposition from the Macroscopic Structure down

Northeast Parallel Architectures Center, Syracuse University, npac@npac.syr.edu
