Given by Geoffrey C. Fox at CPS713 Case Studies in Computational Science in Spring Semester 1996. Foils prepared 15 March 1996
Summary of Material
This is the third of three foilsets on CFD and the NAS Benchmarks |
This completes the analysis of parallel ADI from the first (BT) application benchmark |
The second (SP) and third (LU) benchmarks with diagonalized ADI and SSOR methods and their parallelization |
Details of SSOR and its parallelization with different decompositions |
Relation of SSOR to related iterative solvers -- SLOR, red-black, zebra |
Brief remarks on other more sophisticated modern solvers
|
Geoffrey Fox |
NPAC |
111 College Place |
Syracuse NY 13244-4100 |
Consider an NNAS x NNAS x NNAS grid decomposed onto P processors |
We must start each ADI solve either at the beginning i,j,k = 1 or at the end i,j,k = NNAS |
For each sweep (where the elements are not all stored in the same processor) start half the solves at the beginning and half at the end
|
After at most NNAS/2 steps, we will have "made it" to the middle on the initial solve and all processors will be active. |
This algorithm is sensitive to latency, as in racing to reach the middle fastest we have only performed part of one solve in each processor.
|
Maximize the message size by partially completing several solves before each communication (a pipelined sketch follows these cost estimates):
|
Just consider dependence on NNAS and number of Processors P
|
Pipeline Start-up latency is proportional to NNAS |
Total Computation 3NNAS3/P |
Communication:
|
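As an illustration of this trade-off, here is a hypothetical sketch (not the NPB code) of pipelined Thomas forward elimination with message aggregation using mpi4py: each rank owns a contiguous slab of rows, and each pipeline message carries the boundary coefficients of a bundle of BUNDLE independent systems rather than one. All array names and sizes are assumptions for illustration.

# Sketch: pipelined Thomas forward elimination with message aggregation.
# Each rank owns `local` consecutive rows of NSYS independent tridiagonal
# systems (a, b, c are the sub-, main- and super-diagonals, d the RHS).
# Bundling BUNDLE systems per message amortizes the latency term.
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

local, NSYS, BUNDLE = 16, 64, 8              # assumed sizes for illustration
a = np.random.rand(local, NSYS); b = 2.0 + np.random.rand(local, NSYS)
c = np.random.rand(local, NSYS); d = np.random.rand(local, NSYS)
cp = np.zeros_like(c); dp = np.zeros_like(d)  # modified coefficients

for s0 in range(0, NSYS, BUNDLE):             # pipeline over bundles of systems
    sl = slice(s0, s0 + BUNDLE)
    # boundary (c', d') of the previous rank's last row, one message per bundle
    bnd = np.zeros((2, BUNDLE))
    if rank > 0:
        comm.Recv(bnd, source=rank - 1, tag=s0)
    cprev, dprev = bnd[0], bnd[1]
    for i in range(local):                    # forward elimination inside the slab
        denom = b[i, sl] - a[i, sl] * cprev
        cp[i, sl] = c[i, sl] / denom
        dp[i, sl] = (d[i, sl] - a[i, sl] * dprev) / denom
        cprev, dprev = cp[i, sl], dp[i, sl]
    if rank < size - 1:                       # pass the new boundary downstream
        comm.Send(np.ascontiguousarray(np.stack([cprev, dprev])), dest=rank + 1, tag=s0)
# (back-substitution then runs the pipeline in the reverse direction)

Larger BUNDLE means fewer, bigger messages (less latency) but a longer pipeline fill; this is exactly the balance between start-up latency and message size discussed above.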
An alternative approach is redistribution, rearranging the data so that each ADI solve is always inside a node
|
This has two essential differences from the first (BT) benchmark |
Firstly, the fourth-order dissipation terms are kept in the implicit (LHS) part of the equations
|
The next step is an approximation to the resultant 5 by 5 matrices on the LHS of the 3 ADI sweeps: these are approximately diagonalized in a manner, described in the paper, which ignores the smaller operators and the dependence of the diagonalization on position |
This converts the 5 by 5 BLOCK Tridiagonal matrices into a set of 1 by 1 (scalar) Pentadiagonal equations (a banded-solver sketch follows these bullets) |
General parallel analysis is similar to BT as it is still ADI, but there is probably less computation per communication and so lower efficiency. |
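For concreteness, here is a minimal sketch of how one such scalar pentadiagonal line solve might be done, using scipy's general banded solver. The line length n and the test coefficients are assumptions for illustration; the actual SP benchmark code uses its own specialized elimination routine.

# Sketch: solving one scalar pentadiagonal system (bandwidth 2 on each side)
# with a general banded solver.  In SP such a solve is done along each grid
# line in each of the three ADI sweep directions.
import numpy as np
from scipy.linalg import solve_banded

n = 64                                        # assumed line length for illustration
# Diagonals of a diagonally dominant test matrix
main = 4.0 * np.ones(n)
sup1 = -1.0 * np.ones(n - 1); sub1 = -1.0 * np.ones(n - 1)
sup2 = 0.25 * np.ones(n - 2); sub2 = 0.25 * np.ones(n - 2)

# Banded storage expected by solve_banded((2, 2), ...): row u+i-j holds A[i, j]
ab = np.zeros((5, n))
ab[0, 2:]  = sup2          # second superdiagonal
ab[1, 1:]  = sup1          # first superdiagonal
ab[2, :]   = main          # main diagonal
ab[3, :-1] = sub1          # first subdiagonal
ab[4, :-2] = sub2          # second subdiagonal

rhs = np.random.rand(n)
x = solve_banded((2, 2), ab, rhs)             # O(n) work, like the Thomas algorithm

# Check against a dense solve
A = (np.diag(main) + np.diag(sup1, 1) + np.diag(sub1, -1)
     + np.diag(sup2, 2) + np.diag(sub2, -2))
assert np.allclose(A @ x, rhs)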
Take the basic implicit equation K DU = RHS |
and Split K = D + Lo + Up where
|
Note Lo and Up are of size Dt and so we can follow the ADI trick and write |
K ≈ (D + Lo) (I + D^-1 Up), where the neglected cross term Lo D^-1 Up is of order Dt^2
|
and we get the two sweeps: |
(D + Lo) DU1 = RHS |
(I + D^-1 Up) DU = DU1 |
The two sweeps of the LU benchmark are just what you get in Gauss-Seidel if you sweep through the grid points first in increasing and then in decreasing index order
|
This is classic SSOR structure shown below: |
The equations are in fact more complex than I wrote as one introduces an additional relaxation parameter w. |
This does not change Computational Structure of Equations |
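As a concrete (scalar) illustration of this two-sweep structure, here is a minimal sketch using scipy sparse triangular solves. The test operator K and right hand side are assumptions and stand in for the LU benchmark's 5 by 5 block system.

# Sketch: the two SSOR-style sweeps  (D + Lo) dU1 = RHS,  (I + D^-1 Up) dU = dU1
# applied to a scalar sparse matrix K = D + Lo + Up.
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import spsolve_triangular

n = 100
# Assumed test operator: 1D 3-point stencil (stand-in for the LU benchmark's K)
K = sp.diags([-1.0, 2.5, -1.0], [-1, 0, 1], shape=(n, n), format="csr")
rhs = np.random.rand(n)

D = sp.diags(K.diagonal())                    # diagonal part
Lo = sp.tril(K, k=-1)                         # strictly lower part
Up = sp.triu(K, k=1)                          # strictly upper part
Dinv = sp.diags(1.0 / K.diagonal())

# Sweep 1: lower triangular solve, points visited in increasing index order
dU1 = spsolve_triangular((D + Lo).tocsr(), rhs, lower=True)
# Sweep 2: upper triangular solve, points visited in decreasing index order
dU = spsolve_triangular((sp.eye(n) + Dinv @ Up).tocsr(), dU1, lower=False)

# The product of the two factors only approximates K (error of order Lo D^-1 Up)
print(np.linalg.norm((D + Lo) @ (sp.eye(n) + Dinv @ Up) @ dU - rhs))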
This is interesting and has identical issues to those of parallelizing the Gauss-Seidel iterative solver when one preserves the natural update order and does not use the red-black artifice
Note that for Elliptic solvers the natural lexicographic ordering has similar convergence properties to red-black ordering |
As red-black has a simple parallelization, one always uses it |
However according to:
|
This is NOT true for hyperbolic solvers where lexicographic ordering seems superior to red-black ordering. |
Let us discuss the parallelization of the first, lower triangular, equation (D + Lo) DU1 = RHS, as the upper triangular case is trivially the same with the order of the indices reversed |
RNR-93-007 distinguishes wavefront (2D) and hyperplane (3D) methods and concludes that the hyperplane method is best.
|
The key observation is that in the lower triangular sweep, grid point (i,j,k) depends only on its neighbours with smaller indices, all of which lie on an earlier hyperplane i+j+k = constant:
|
Essential idea is that we divide equations into sweeps labelled by m
|
Once one has solved sweeps 1 through m-1, ALL equations of sweep m can be solved INDEPENDENTLY and hence with natural parallelism (see the sketch after these bullets) |
We will use an Owner-Computes strategy and assign each grid point (as normal) to a given processor |
There are several choices for the assignment of grid points to processors, and it is not clear that RNR-93-007 chooses the best strategy |
The natural strategy is NOT BLOCK (divide the grid points into P contiguous cubic domains) but rather BLOCK-CYCLIC
|
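Here is a minimal serial sketch of the hyperplane sweep for the lower triangular solve, assuming the scalar case and a lower operator Lo that couples (i,j,k) only to (i-1,j,k), (i,j-1,k) and (i,j,k-1); the grid size and coefficients are assumptions for illustration.

# Sketch: hyperplane ordering for the lower triangular sweep (D + Lo) dU1 = RHS.
# Assume (scalar case) that Lo couples point (i,j,k) only to (i-1,j,k),
# (i,j-1,k) and (i,j,k-1).  All points with i+j+k == m then depend only on
# points of earlier hyperplanes, so each hyperplane can be updated in parallel.
import numpy as np

N = 16                                            # assumed grid size
d   = 3.0 + np.zeros((N, N, N))                   # diagonal D
lx = ly = lz = -1.0 * np.ones((N, N, N))          # coefficients of Lo
rhs = np.random.rand(N, N, N)
dU1 = np.zeros((N, N, N))

for m in range(0, 3 * (N - 1) + 1):               # sweep over hyperplanes i+j+k = m
    # every (i,j,k) on this hyperplane is independent of the others:
    # a parallel version distributes this loop over the processors
    for i in range(max(0, m - 2 * (N - 1)), min(N - 1, m) + 1):
        for j in range(max(0, m - i - (N - 1)), min(N - 1, m - i) + 1):
            k = m - i - j
            s = rhs[i, j, k]
            if i > 0: s -= lx[i, j, k] * dU1[i - 1, j, k]
            if j > 0: s -= ly[i, j, k] * dU1[i, j - 1, k]
            if k > 0: s -= lz[i, j, k] * dU1[i, j, k - 1]
            dU1[i, j, k] = s / d[i, j, k]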
Consider just two Processors P=2 |
Consider a two-dimensional example with m=1..15 sweeps on an 8 by 8 array of grid points |
Example I: Cyclic Distribution in One Dimension |
Example II: Block Cyclic Distribution in One Dimension |
Example III: Block Distribution in One Dimension |
Example IV: Scattered Distribution in One Dimension
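The load-balance differences between these distributions can be checked with a small counting script such as the following (a hypothetical illustration of the 8 by 8, P=2 example; the block-cyclic block size b=2 is an assumption, and only the block, cyclic and block-cyclic cases are shown).

# Sketch: count how many grid points each of P = 2 processors updates on each
# anti-diagonal sweep m = i + j of an 8 x 8 grid, for 1D column distributions.
# A distribution is good for the wavefront sweep if both counts stay similar.
P, N = 2, 8

def owner(j, scheme, b=2):
    if scheme == "block":        return j // (N // P)   # contiguous halves
    if scheme == "cyclic":       return j % P           # columns dealt out one by one
    if scheme == "block-cyclic": return (j // b) % P    # dealt out b columns at a time
    raise ValueError(scheme)

for scheme in ("block", "cyclic", "block-cyclic"):
    print(scheme)
    for m in range(15):                                  # the 15 sweeps, 0-based m = i + j
        counts = [0] * P
        for i in range(N):
            j = m - i
            if 0 <= j < N:
                counts[owner(j, scheme)] += 1
        print(f"  sweep {m:2d}: per-processor work {counts}")

The block distribution leaves one processor idle on the early and late sweeps, while the cyclic and block-cyclic distributions keep the work nearly balanced on every sweep.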
It is not clear to me how realistic these algorithms are, although they do clearly illustrate the sparse matrix operations of many important CFD algorithms |
ADI (BT and SP) is an important CFD algorithm |
The SP benchmark is based on complex approximations to BT, which may not be significant in today's world where we have powerful computers |
SSOR is an important Elliptic (steady state) solver, but I find its application to implicit (hyperbolic) equations a little peculiar
|
Note the LU Benchmark does NOT involve the "LU Decomposition" linear equation solver, which has very different computational complexity and issues |
This fills in gaps which we did not cover in either CPS615 or NAS Benchmark CFD modules of CPS713 |
References are:
|
Point Jacobi |
Point Gauss-Seidel |
Over Relaxation
|
Conjugate Gradient
|
ALL of these except SSOR are covered in CPS615 and in the Templates book
|
This is a variant of Gauss-Seidel where we solve not for single points at a time but for lines |
We illustrate for the Laplace case, where simple Gauss-Seidel reads: |
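(A reconstruction of the formula displayed here, consistent with the next bullet, in standard 5-point notation:)

\phi^{n+1}_{i,j} = \tfrac{1}{4}\left( \phi^{n+1}_{i-1,j} + \phi^{n+1}_{i,j-1} + \phi^{n}_{i+1,j} + \phi^{n}_{i,j+1} \right)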
With two terms already known at n+1'th iteration and two yet to be determined |
We now write the more general equation: |
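(A reconstruction of the line form, consistent with the next bullet: all unknowns on line j are taken at iteration n+1, which couples them into a tridiagonal system in i:)

\phi^{n+1}_{i-1,j} - 4\,\phi^{n+1}_{i,j} + \phi^{n+1}_{i+1,j} = -\,\phi^{n+1}_{i,j-1} - \phi^{n}_{i,j+1}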
where we solve for all values of i at fixed j, using the known values at the n+1'th iteration at index j-1 and the old n'th iteration values at index j+1 |
The SLOR equations can be solved directly, as they are a set of NNAS tridiagonal equations soluble in 5NNAS operations (a line solve is sketched after these bullets) |
One can now use the usual over relaxation procedure |
Note that up to a constant SLOR has the same complexity as the original SOR |
For Laplace's equation, one can show that if you choose the appropriate optimal w for both, SLOR converges twice as fast as SOR -- as an SLOR iteration involves somewhat more computation, one will not gain this full factor |
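Here is a minimal sketch of one SLOR sweep for this Laplace example, solving each line with a tridiagonal (banded) solve; the grid size, boundary values and relaxation factor w are assumptions for illustration.

# Sketch: one SLOR sweep for Laplace's equation on an (N+2) x (N+2) grid with
# fixed boundary values.  Each row j is updated by solving one tridiagonal
# system in i (cost O(N) per line), then over-relaxed with parameter w.
import numpy as np
from scipy.linalg import solve_banded

N, w = 32, 1.7                                    # assumed size and relaxation factor
phi = np.zeros((N + 2, N + 2))
phi[0, :] = phi[-1, :] = 1.0                      # assumed boundary condition

ab = np.zeros((3, N))                             # tridiagonal in banded storage
ab[0, 1:] = 1.0                                   # superdiagonal
ab[1, :] = -4.0                                   # main diagonal
ab[2, :-1] = 1.0                                  # subdiagonal

for j in range(1, N + 1):                         # sweep the lines in order
    # RHS uses the already-updated line j-1 and the old line j+1
    rhs = -phi[1:N + 1, j - 1] - phi[1:N + 1, j + 1]
    rhs[0] -= phi[0, j]                           # move known boundary values in i to the RHS
    rhs[-1] -= phi[N + 1, j]
    line = solve_banded((1, 1), ab, rhs)          # tridiagonal solve along the line
    phi[1:N + 1, j] = (1 - w) * phi[1:N + 1, j] + w * line   # over-relaxation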
These can be applied to Jacobi (which Hirsch discusses) or to Gauss-Seidel, which we discussed in CPS615 because the red-black ordering is easier to parallelize |
These combine the ideas of SLOR with those of Red-Black Ordering |
They provide naturally parallel SLOR Schemes |
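A minimal sketch of the zebra idea follows; solve_line is a hypothetical helper standing for the tridiagonal line solve of the previous sketch.

# Sketch: zebra (line red-black) ordering.  Odd lines depend only on even
# lines and vice versa, so each half-sweep is embarrassingly parallel.
def zebra_sweep(phi, solve_line):
    # solve_line(phi, j) is a hypothetical helper returning the new interior
    # values of line j (e.g. the tridiagonal SLOR line solve sketched above)
    N = phi.shape[1] - 2
    for parity in (0, 1):                      # odd lines first, then even lines
        # all lines of one colour are independent: this loop can be distributed
        for j in range(1 + parity, N + 1, 2):
            phi[1:N + 1, j] = solve_line(phi, j)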
There is a set of sophisticated methods of great importance as PDE solvers which are more powerful than simple iterative solvers |
Alternating Direction Iterative (ADI) Solvers
|
Incomplete Cholesky Factorization |
And two more which deserve major discussion but which we will overview very superficially
|
This used to be a major preconditioning technique for sparse matrices: |
Then apply conventional methods to the matrix P^-1 A
|
Very difficult to parallelize, as the computation of the incomplete factorization involves a lot of complex dependencies (parallelizing full LU is easier) |
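A minimal sketch of the idea "work with P^-1 A", using scipy's incomplete LU (spilu) as a stand-in for incomplete Cholesky and preconditioned conjugate gradient as the conventional method; the test matrix is an assumption, and strictly one would use an incomplete Cholesky factor to preserve symmetry.

# Sketch: incomplete factorization used as a preconditioner P, so the Krylov
# method effectively iterates on P^-1 A instead of A.
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import spilu, LinearOperator, cg

n = 400
# Assumed test matrix: 1D Poisson operator (symmetric positive definite)
A = sp.diags([-1.0, 2.0, -1.0], [-1, 0, 1], shape=(n, n), format="csc")
b = np.ones(n)

ilu = spilu(A, drop_tol=1e-3)                     # incomplete factorization P ~ A
M = LinearOperator((n, n), matvec=ilu.solve)      # action of P^-1

x_plain, info1 = cg(A, b)                         # conventional CG on A
x_prec,  info2 = cg(A, b, M=M)                    # CG preconditioned with P^-1
print(info1, info2,
      np.linalg.norm(A @ x_plain - b), np.linalg.norm(A @ x_prec - b))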
This is intuitively obvious, but because of this it is a very powerful approach |
Mathematically for two domains this corresponds to preconditioning |
This is a particular example of multi-scale or multi-resolution approach which we also find in CPS713 with cluster algorithms in statistical physics |
All Physical Systems have different effects going on at different physical length scales |
The best algorithms build this in and tailor the algorithm for the features at each scale |
In Iterative solvers, we saw in the eigenvalue analysis that short wavelength error components are damped quickly while long wavelength components decay very slowly
|
In Multigrid, you precondition matrix to remove long wavelength effects so local iterative solver only has to address short wavelengths |
This is done recursively so you successively remove effects at wavelengths that increase at each step by a factor of two |
This example shows four grids
|
First solve the 2 by 2 matrix A4 on the coarsest grid |
Use A4 to precondition and solve the 4 by 4 matrix A3 on grid 3 |
Use A3 to precondition and solve the 8 by 8 matrix A2 on grid 2 |
Use A2 to precondition and solve the 16 by 16 matrix A1 on grid 1 |
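A minimal two-grid sketch of this idea for the 1D Poisson model problem; the grid sizes, the weighted Jacobi smoother and the cycle structure are assumptions, and real multigrid, as in the four-grid example above, applies the coarse correction recursively.

# Sketch: one two-grid correction cycle for the 1D Poisson problem -u'' = f.
# The coarse grid removes the long wavelength error that the smoother damps slowly.
import numpy as np

def poisson(n):                     # n interior points, spacing h = 1/(n+1)
    h = 1.0 / (n + 1)
    return (np.diag(2.0 * np.ones(n)) - np.diag(np.ones(n - 1), 1)
            - np.diag(np.ones(n - 1), -1)) / h**2

def jacobi(A, u, f, sweeps=3, w=2.0 / 3.0):       # weighted Jacobi smoother
    d = np.diag(A)
    for _ in range(sweeps):
        u = u + w * (f - A @ u) / d
    return u

def restrict(r):                    # full weighting: fine (2m+1 pts) -> coarse (m pts)
    return 0.25 * (r[0:-2:2] + 2.0 * r[1:-1:2] + r[2::2])

def prolong(e, n_fine):             # linear interpolation: coarse (m) -> fine (2m+1)
    u = np.zeros(n_fine)
    u[1:-1:2] = e                   # coarse points carry over
    u[0:-2:2] += 0.5 * e            # fine points average their coarse neighbours
    u[2::2] += 0.5 * e
    return u

def two_grid(u, f, n):
    A, Ac = poisson(n), poisson((n - 1) // 2)
    u = jacobi(A, u, f)                           # pre-smooth: kill short wavelengths
    r = f - A @ u                                 # the remaining residual is smooth
    e = np.linalg.solve(Ac, restrict(r))          # coarse-grid solve (recursive in real MG)
    u = u + prolong(e, n)                         # interpolate the correction back
    return jacobi(A, u, f)                        # post-smooth

n = 31                                            # fine grid; the coarse grid has 15 points
f = np.ones(n)
u = np.zeros(n)
for cycle in range(10):
    u = two_grid(u, f, n)
    print(cycle, np.linalg.norm(f - poisson(n) @ u))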
Note both Domain Decomposition and Multigrid have Good Physical Motivation but they are quite different
|