Given by Geoffrey C. Fox at CPS713 Case Studies in Computational Science, Spring Semester 1996. Foils prepared 15 March 1996
Summary of Material
This is the second of three foilsets on CFD and the NAS Benchmarks |
We describe spatial differencing, including numerical dissipation for stability |
General analysis of the resultant sparse matrix structure and iterative solvers |
Role of preconditioning and artificial time formulations |
Focus on the NAS ADI solution and its parallel solution with various distributions and their consequent communication costs |
Geoffrey Fox |
NPAC |
111 College Place |
Syracuse NY 13244-4100 |
Sections 3.4.2, 3.4.3 and 3.4.4 cover the basic spatial discretization |
We are using a uniform grid in the ξ, η, ζ directions with respective step sizes h_ξ, h_η, h_ζ |
These cover the unit cube in (ξ, η, ζ) space, with a total number of points that is, for the benchmark problems, the same N_NAS in each spatial direction |
We can use finite difference methods because of the uniform grid, and these involve such approximations as: |
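A representative second order central difference in, say, the ξ direction (an illustrative sketch of the kind of approximation meant here, not a transcription of the benchmark's formulas): |
\[
\left.\frac{\partial U}{\partial \xi}\right|_{i} \approx \frac{U_{i+1}-U_{i-1}}{2h_{\xi}},
\qquad
\left.\frac{\partial^{2} U}{\partial \xi^{2}}\right|_{i} \approx \frac{U_{i+1}-2U_{i}+U_{i-1}}{h_{\xi}^{2}}
\]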
These discretizations need to be extended for the general case, where the quantity differenced is not U itself but rather a function of U and its partial derivatives |
All the derivatives need to be modified near boundaries, where some of the needed points are missing |
NAS also adds a numerical dissipation term to improve the stability of the numerical solution. Previously the highest order derivative came from viscous terms like ∇²U |
The new term is proportional to ∇⁴U |
Note the negative sign, which gives dissipation, i.e. stability |
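Generically, a fourth difference dissipation term of this kind added along the ξ direction looks like the following (ε > 0 a small constant; the benchmark specifies its own coefficient): |
\[
-\varepsilon\left(U_{i-2}-4U_{i-1}+6U_{i}-4U_{i+1}+U_{i+2}\right)
\;\approx\;
-\varepsilon\,h_{\xi}^{4}\,\frac{\partial^{4} U}{\partial \xi^{4}}
\]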
Equations (3.25), (3.26) and (3.27) summarize the equations after spatial and temporal discretization |
We can write the right hand side of equation (3.26) as a matrix vector product |
The vector U_vec has N_vec = 5 N_ξ N_η N_ζ components and RHS is an N_vec by N_vec matrix. The forcing term H_vec is also a vector with N_vec components |
Formally we can write the full equations: |
If we had used the explicit Euler method, then the equations would have read simply: |
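Schematically, in the U_vec / RHS / H_vec notation above, the implicit scheme and its explicit Euler counterpart can be sketched as follows (a hedged reconstruction, not a transcription of equations (3.25)-(3.27)): |
\[
\text{implicit:}\qquad
\left(I-\Delta t\,\mathrm{RHS}\right)\Delta U^{n}_{\mathrm{vec}}
=\Delta t\left(\mathrm{RHS}\,U^{n}_{\mathrm{vec}}+H_{\mathrm{vec}}\right),
\qquad
U^{n+1}_{\mathrm{vec}}=U^{n}_{\mathrm{vec}}+\Delta U^{n}_{\mathrm{vec}}
\]
\[
\text{explicit Euler:}\qquad
\Delta U^{n}_{\mathrm{vec}}
=\Delta t\left(\mathrm{RHS}\,U^{n}_{\mathrm{vec}}+H_{\mathrm{vec}}\right)
\]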
The nonzero structure of a typical row of RHS is: |
with 5 nonzeros about the diagonal and two bands on each side of the diagonal, each band containing two nonzeros |
Each nonzero corresponds to a 5 by 5 block matrix (5 being the number of CFD labels) which is approximately "full" |
So each of the 5 N_NAS³ rows of RHS has 65 nonzero elements (13 nonzero blocks times 5 columns per block) |
This can be contrasted with simple differencing to solve the three dimensional Laplace equation ∇²ψ = 0, where the nonzero structure of a typical row is: |
with 7 nonzeros in each row of the underlying matrix |
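For reference, the standard 7 point stencil for ∇²ψ = 0 on a uniform grid with spacing h is: |
\[
\frac{\psi_{i+1,j,k}+\psi_{i-1,j,k}+\psi_{i,j+1,k}+\psi_{i,j-1,k}+\psi_{i,j,k+1}+\psi_{i,j,k-1}-6\,\psi_{i,j,k}}{h^{2}}=0
\]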
As with all nonlinear coupled differential equations, one has three distinct iterative processes in the solution of the NAS equations |
1) The first iteration is due to the nonlinearity of the equations: one linearizes the nonlinear CFD equations as in equation (3.17) by Taylor expansion, to make the equations linear in the unknown U^(n+1) |
2) The second iteration is that needed, for fixed n, to solve the implicit equations for ΔU^n governed by the matrix |
3) The third iteration is that in the index n, needed to evolve in time t. In our case, where time is somewhat artificial as we only need the steady state solution, one can view n as labelling the iterative solution of the steady state equations of CFD |
Both types of equations can be written in matrix form |
Elliptic or steady state hyperbolic-parabolic, as in steady state CFD |
Time dependent hyperbolic or parabolic |
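In matrix notation these two cases can be written as below (the sign conventions are an assumption, chosen to match the iteration matrix G = I + ω_faketime K quoted later): |
\[
\text{elliptic / steady state:}\quad K\,U=b,
\qquad
\text{time dependent:}\quad \frac{\partial U}{\partial t}=K\,U-b
\]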
We already showed in CPS615 how one could derive the Jacobi, Gauss-Seidel and SOR iterative schemes from artificial diffusion equations |
One derives iterative methods by splitting the matrix K of the steady state equation KU = b into two pieces |
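A sketch of the splitting (the notation K = M - N is an assumption; for Jacobi, M is the diagonal of K): |
\[
K=M-N,
\qquad
M\,U^{(m+1)}=N\,U^{(m)}+b
\;\Longleftrightarrow\;
U^{(m+1)}=U^{(m)}-M^{-1}\!\left(K\,U^{(m)}-b\right)
\]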
The matrix M corresponds to M_im(ex)plicit in a typical time dependent hyperbolic numerical solver |
Note that M is, in simple cases (Jacobi), diagonal, and is typically diagonally dominant, as in a hyperbolic solver |
We can write |
as the time discretization of: |
Putting Δt = -1, one gets Richardson's method, where ω_faketime is an acceleration or over relaxation parameter which is, up to a factor, the same as the over relaxation parameter ω in SOR |
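Concretely, Richardson's method can be sketched as the iteration below (a reconstruction consistent with the iteration matrix G quoted on the next foil): |
\[
U^{(m+1)}=U^{(m)}+\omega_{\mathrm{faketime}}\left(K\,U^{(m)}-b\right),
\qquad
G=I+\omega_{\mathrm{faketime}}\,K
\]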
This is identical, up to a scale factor, with that of relaxation for the Jacobi method, and one finds the plot of eigenvalues of the iteration matrix |
G = I + ω_faketime K as below |
The optimal choice of ω_faketime is 0.25 for a normalization where the diagonal terms of K for the Laplace equation are -4. This is identical to ω = 1 being optimal in the relaxation analysis of Jacobi Over Relaxation |
We can generalize the simple equations |
to the form |
with an arbitrary matrix P, since the left hand side vanishes in equilibrium |
This can be discretized in time to: |
with P playing the role of the implicit matrix in the NAS Benchmark |
Note that only the product ωΔt enters, i.e. the time step Δt plays the same role as the relaxation parameter ω |
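A hedged reconstruction of the generalized form and its time discretization (the exact signs are assumptions): |
\[
P\,\frac{\partial U}{\partial t}=K\,U-b
\quad\longrightarrow\quad
P\left(U^{(m+1)}-U^{(m)}\right)=\omega\,\Delta t\left(K\,U^{(m)}-b\right)
\]
so the left hand side indeed vanishes at the equilibrium K U = b, and only the combination ωΔt appears |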
The matrix form of preconditioning was to replace |
by the form: |
which has obvious similarities with the artificial time form: |
Exploiting the analogy between the preconditioning matrix and P, we can relate particular approaches to solving the implicit equations of CFD to particular preconditioning schemes for the steady state equations |
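In matrix form the replacement can be sketched as below, writing the preconditioning matrix as P to make the analogy explicit (an illustrative reconstruction): |
\[
K\,U=b
\quad\longrightarrow\quad
P^{-1}K\,U=P^{-1}b
\]
which mirrors the artificial time step P ΔU^(m) = ωΔt (K U^(m) - b) above: a good preconditioner P plays the role of a good implicit matrix |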
In Dongarra's template book and Hirsch, ADI (Alternating Direction Iteration) is referred to as a preconditioning scheme |
In our treatment of the NAS benchmarks, it is derived intuitively from the solution of the implicit equations |
ADI exploits the special form of the implicit matrix, which is |
one minus a sum of "small" terms -- corresponding to the separate discretizations in the ξ, η and ζ directions |
This is the approximate factorization (Beam-Warming) algorithm, with unnumbered equations at the start of section 3.4.7 of the NAS Benchmark |
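Writing the three one dimensional implicit pieces schematically as A_ξ, A_η, A_ζ (illustrative names, not the benchmark's own notation), the approximate factorization is: |
\[
I-\Delta t\left(A_{\xi}+A_{\eta}+A_{\zeta}\right)
\;\approx\;
\left(I-\Delta t\,A_{\xi}\right)\left(I-\Delta t\,A_{\eta}\right)\left(I-\Delta t\,A_{\zeta}\right)
\]
the neglected cross terms are of order Δt², consistent with the temporal accuracy of the scheme |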
This factorized operator can now be solved in three steps: |
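In the same illustrative notation, with ΔU* and ΔU** as intermediate results and the explicit right hand side written schematically as R: |
\[
\left(I-\Delta t\,A_{\xi}\right)\Delta U^{*}=R,
\qquad
\left(I-\Delta t\,A_{\eta}\right)\Delta U^{**}=\Delta U^{*},
\qquad
\left(I-\Delta t\,A_{\zeta}\right)\Delta U^{n}=\Delta U^{**}
\]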
Each step involves the solution of a 5 by 5 block tridiagonal matrix |
The matrix is tridiagonal -- not pentadiagonal -- as the numerical dissipation with ∇⁴U is only added to the explicit RHS term in this approximation |
The original nonfactorized equations were sparse matrices with bandwidth L = 2 N_NAS² + 1 |
Solving equations with M rows and columns and bandwidth L takes M L² operations |
This gives the original complexity, for M = N_NAS³, proportional to N_NAS⁷ (as M L² ≈ N_NAS³ × (2 N_NAS²)²) |
A tridiagonal matrix has bandwidth 3, and so the complexity of the ADI factorized equations is just proportional to M, i.e. N_NAS³ |
Effect of the 5 by 5 blocks: |
Here each symbol denotes a nonzero element of the tridiagonal matrix |
Each individual tridiagonal solve is not usefully parallelized |
However this is not important, as one has N_NAS² independent tridiagonal solves for each of the three sweeps |
For instance, for the equation |
with components labelled by (i,j,k) for the three ξ, η and ζ directions, the equations for each distinct value of (j,k) are independent |
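As an illustrative sketch (not the NAS benchmark's own code), the independence over (j,k) in the ξ sweep can be expressed as a doubly nested loop of banded solves; the names here are assumptions, SciPy's banded solver stands in for the benchmark's block tridiagonal routine, and for simplicity the unknowns are scalars rather than 5 by 5 blocks: |
```python
import numpy as np
from scipy.linalg import solve_banded

def adi_sweep_xi(lower, diag, upper, rhs):
    """One ADI sweep in the xi (i) direction.

    lower, diag, upper, rhs all have shape (N, N, N); for every fixed
    (j, k) the system along i is tridiagonal and independent of all
    other (j, k), so the double loop below is trivially parallel.
    """
    N = rhs.shape[0]
    out = np.empty_like(rhs)
    for j in range(N):
        for k in range(N):
            # Pack the three diagonals into SciPy's banded storage.
            ab = np.zeros((3, N))
            ab[0, 1:] = upper[:-1, j, k]    # super-diagonal
            ab[1, :] = diag[:, j, k]        # main diagonal
            ab[2, :-1] = lower[1:, j, k]    # sub-diagonal
            out[:, j, k] = solve_banded((1, 1), ab, rhs[:, j, k])
    return out
```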
We decompose the three dimensional grid over the N_proc nodes of a parallel machine |
There are the usual choices for this, depending on the shape of the regions stored in each processor |
Example 1: Distribute over j (η), with all values of i and k (ξ and ζ) for each j stored in a node |
Example 2: Distribute as cubic blocks, with all three indices i, j and k distributed |
This type of distribution is normally optimal for nearest neighbor problems, as it minimizes surface over volume effects |
No communication at sweep 1 (loop over i for fixed j) |
Sweep 2 has communication for each j value in the loop for fixed i |
Communication at each sweep, but only at every fourth index value |
Need to pipeline, as at the start only the edge processors are active and one needs to work from both ends towards the middle |
Communication is needed when one sweeps in a given index, say i, and that index is distributed, so the sweep crosses processor boundaries |
Note Algorithms for sequential tridiagonal solvers are given in the appendix of volume 1 of Hirsch |
For a scalar tridiagonal matrix with N rows, the solution takes time 5N |
For a block tridiagonal matrix, one increases this by a factor between 25 and 125 for 5 by 5 blocks. |
Note that communication is proportional to 5, i.e. the size of the vector being calculated, as one changes block size |
However calculation is proportional to 5² to 5³ as one changes block size, i.e. it increases faster with block size than communication |
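For reference, a minimal sketch of the Thomas algorithm for a scalar tridiagonal system (the kind of sequential solver referenced from Hirsch above); the small fixed number of operations per row in the two passes is the origin of the roughly 5N count quoted earlier. The function name is illustrative: |
```python
import numpy as np

def thomas_solve(lower, diag, upper, rhs):
    """Sequential scalar tridiagonal solve (Thomas algorithm).

    lower[i] multiplies x[i-1] and upper[i] multiplies x[i+1] in row i;
    lower[0] and upper[-1] are unused.
    """
    n = len(diag)
    c = np.empty(n)   # modified super-diagonal
    d = np.empty(n)   # modified right hand side
    c[0] = upper[0] / diag[0]
    d[0] = rhs[0] / diag[0]
    # Forward elimination: a handful of operations per row.
    for i in range(1, n):
        denom = diag[i] - lower[i] * c[i - 1]
        c[i] = upper[i] / denom if i < n - 1 else 0.0
        d[i] = (rhs[i] - lower[i] * d[i - 1]) / denom
    # Back substitution: two operations per row.
    x = np.empty(n)
    x[-1] = d[-1]
    for i in range(n - 2, -1, -1):
        x[i] = d[i] - c[i] * x[i + 1]
    return x
```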
Consider a 64 by 64 by 64 grid with 64 processors |
Example 1: One Dimensional Decomposition -- each processor holds one j plane of 64 by 64 points |
Example 2: Two Dimensional Decomposition -- an 8 by 8 processor array, each processor holding a 64 by 8 by 8 block |
Example 3: Three Dimensional Decomposition -- a 4 by 4 by 4 processor array, each processor holding a 16 by 16 by 16 block |
The naive ratio of communication in the three distributions is: |
64 : 16 : 12 for One : Two : Three Dimensional Decompositions (one distributed sweep direction crossing 64 processors, versus 8 + 8, versus 4 + 4 + 4) |
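A small sketch of the naive counting behind this ratio (the counting model -- one crossing per processor along each distributed sweep direction -- is an assumption chosen to reproduce the quoted numbers): |
```python
def naive_comm(procs_per_direction):
    """procs_per_direction: processor counts along the (xi, eta, zeta) axes.

    A sweep along a direction needs communication only if that direction
    is split over more than one processor; it then crosses that many
    processors.
    """
    return sum(p for p in procs_per_direction if p > 1)

decompositions = {
    "1D (1 x 64 x 1)": (1, 64, 1),
    "2D (1 x 8 x 8)":  (1, 8, 8),
    "3D (4 x 4 x 4)":  (4, 4, 4),
}

for name, shape in decompositions.items():
    print(name, "->", naive_comm(shape))   # prints 64, 16 and 12
```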
Three dimensions also have lower pipelining start-up effects |
However -- certainly for the one dimensional case -- it is better to redistribute the data before a sweep that crosses processor boundaries |