Given by Geoffrey C. Fox at CPS713 Case Studies in Computational Science in Spring Semester 1996. Foils prepared 15 March 1996
Summary of Material
This is the third of three foilsets on CFD and the NAS Benchmarks |
This completes the analysis of parallel ADI from the first (BT) application benchmark |
The second (SP) and third (LU) benchmarks with diagonalized ADI and SSOR methods and their parallelization |
Details of SSOR and its parallelization with different decompositions |
Relation of SSOR to related iterative solvers -- SLOR, red-black, zebra |
Brief remarks on other more sophisticated modern solvers
|
Geoffrey Fox |
NPAC |
111 College Place |
Syracuse NY 13244-4100 |
Consider an NNAS x NNAS x NNAS grid decomposed onto P processors |
We must start each ADI solve either at the beginning i,j,k = 1 or at the end i,j,k = NNAS |
For each sweep (where the elements are not all stored in the same processor) start half the solves at the beginning and half at the end
|
After at most NNAS/2 steps, we will have "made it" to the middle on the initial solve and all processors will be active. |
This algorithm is sensitive to latency, as in racing to reach the middle fastest we have only performed part of one solve in each processor.
|
Maximize the message size by partially completing several solves before each communication (a pipelined sketch follows these cost estimates):
|
Just consider dependence on NNAS and number of Processors P
|
Pipeline Start-up latency is proportional to NNAS |
Total Computation 3NNAS3/P |
Communication:
|
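As an illustration of this trade-off, here is a hypothetical sketch (not the NPB code) of pipelined Thomas forward elimination with message aggregation using mpi4py: each rank owns a contiguous slab of rows, and each pipeline message carries the boundary coefficients of a bundle of BUNDLE independent systems rather than one. All array names and sizes are assumptions for illustration.

# Sketch: pipelined Thomas forward elimination with message aggregation.
# Each rank owns `local` consecutive rows of NSYS independent tridiagonal
# systems (a, b, c are the sub-, main- and super-diagonals, d the RHS).
# Bundling BUNDLE systems per message amortizes the latency term.
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

local, NSYS, BUNDLE = 16, 64, 8              # assumed sizes for illustration
a = np.random.rand(local, NSYS); b = 2.0 + np.random.rand(local, NSYS)
c = np.random.rand(local, NSYS); d = np.random.rand(local, NSYS)
cp = np.zeros_like(c); dp = np.zeros_like(d)  # modified coefficients

for s0 in range(0, NSYS, BUNDLE):             # pipeline over bundles of systems
    sl = slice(s0, s0 + BUNDLE)
    # boundary (c', d') of the previous rank's last row, one message per bundle
    bnd = np.zeros((2, BUNDLE))
    if rank > 0:
        comm.Recv(bnd, source=rank - 1, tag=s0)
    cprev, dprev = bnd[0], bnd[1]
    for i in range(local):                    # forward elimination inside the slab
        denom = b[i, sl] - a[i, sl] * cprev
        cp[i, sl] = c[i, sl] / denom
        dp[i, sl] = (d[i, sl] - a[i, sl] * dprev) / denom
        cprev, dprev = cp[i, sl], dp[i, sl]
    if rank < size - 1:                       # pass the new boundary downstream
        comm.Send(np.ascontiguousarray(np.stack([cprev, dprev])), dest=rank + 1, tag=s0)
# (back-substitution then runs the pipeline in the reverse direction)

Larger BUNDLE means fewer, bigger messages (less latency) but a longer pipeline fill; this is exactly the balance between start-up latency and message size discussed above.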
An alternative approach is redistribution, rearranging the data so that each ADI solve is always inside a node
|
This has two essential differences from the first (BT) benchmark |
Firstly, the fourth-order dissipation terms are kept in the implicit (LHS) part of the equations
|
The next step is an approximation to the resultant 5 by 5 matrices on the LHS of the 3 ADI sweeps: these are approximately diagonalized in a manner, described in the paper, which ignores the smaller operators and the dependence of the diagonalization on position |
This converts the 5 by 5 BLOCK Tridiagonal matrices into a set of 1 by 1 (scalar) Pentadiagonal equations (a banded-solver sketch follows these bullets) |
General parallel analysis is similar to BT as it is still ADI, but there is probably less computation per communication and so lower efficiency. |
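For concreteness, here is a minimal sketch of how one such scalar pentadiagonal line solve might be done, using scipy's general banded solver. The line length n and the test coefficients are assumptions for illustration; the actual SP benchmark code uses its own specialized elimination routine.

# Sketch: solving one scalar pentadiagonal system (bandwidth 2 on each side)
# with a general banded solver.  In SP such a solve is done along each grid
# line in each of the three ADI sweep directions.
import numpy as np
from scipy.linalg import solve_banded

n = 64                                        # assumed line length for illustration
# Diagonals of a diagonally dominant test matrix
main = 4.0 * np.ones(n)
sup1 = -1.0 * np.ones(n - 1); sub1 = -1.0 * np.ones(n - 1)
sup2 = 0.25 * np.ones(n - 2); sub2 = 0.25 * np.ones(n - 2)

# Banded storage expected by solve_banded((2, 2), ...): row u+i-j holds A[i, j]
ab = np.zeros((5, n))
ab[0, 2:]  = sup2          # second superdiagonal
ab[1, 1:]  = sup1          # first superdiagonal
ab[2, :]   = main          # main diagonal
ab[3, :-1] = sub1          # first subdiagonal
ab[4, :-2] = sub2          # second subdiagonal

rhs = np.random.rand(n)
x = solve_banded((2, 2), ab, rhs)             # O(n) work, like the Thomas algorithm

# Check against a dense solve
A = (np.diag(main) + np.diag(sup1, 1) + np.diag(sub1, -1)
     + np.diag(sup2, 2) + np.diag(sub2, -2))
assert np.allclose(A @ x, rhs)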
Take the basic implicit equation K DU = RHS |
and Split K = D + Lo + Up where
|
Note Lo and Up are of size Dt and so we can follow the ADI trick and write |
K ≈ (D + Lo) (I + D^-1 Up), where the neglected cross term Lo D^-1 Up is of order Dt^2
|
and we get the two sweeps: |
(D + Lo) DU1 = RHS |
(I + D^-1 Up) DU = DU1 |
The two sweeps of the LU benchmark are just what you get in Gauss-Seidel if you sweep through the grid points first in increasing and then in decreasing index order
|
This is classic SSOR structure shown below: |
The equations are in fact more complex than I wrote as one introduces an additional relaxation parameter w. |
This does not change Computational Structure of Equations |
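As a concrete (scalar) illustration of this two-sweep structure, here is a minimal sketch using scipy sparse triangular solves. The test operator K and right hand side are assumptions and stand in for the LU benchmark's 5 by 5 block system.

# Sketch: the two SSOR-style sweeps  (D + Lo) dU1 = RHS,  (I + D^-1 Up) dU = dU1
# applied to a scalar sparse matrix K = D + Lo + Up.
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import spsolve_triangular

n = 100
# Assumed test operator: 1D 3-point stencil (stand-in for the LU benchmark's K)
K = sp.diags([-1.0, 2.5, -1.0], [-1, 0, 1], shape=(n, n), format="csr")
rhs = np.random.rand(n)

D = sp.diags(K.diagonal())                    # diagonal part
Lo = sp.tril(K, k=-1)                         # strictly lower part
Up = sp.triu(K, k=1)                          # strictly upper part
Dinv = sp.diags(1.0 / K.diagonal())

# Sweep 1: lower triangular solve, points visited in increasing index order
dU1 = spsolve_triangular((D + Lo).tocsr(), rhs, lower=True)
# Sweep 2: upper triangular solve, points visited in decreasing index order
dU = spsolve_triangular((sp.eye(n) + Dinv @ Up).tocsr(), dU1, lower=False)

# The product of the two factors only approximates K (error of order Lo D^-1 Up)
print(np.linalg.norm((D + Lo) @ (sp.eye(n) + Dinv @ Up) @ dU - rhs))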
This is interesting and has identical issues to those of parallelizing the Gauss-Seidel iterative solver when one preserves the natural update order and does not use the red-black artifice
Note that for Elliptic solvers the natural lexicographic ordering has similar convergence properties to red-black ordering |
As red-black has a simple parallelization, one always uses it |
However according to:
|
This is NOT true for hyperbolic solvers where lexicographic ordering seems superior to red-black ordering. |
Let us discuss the parallelization of the first, lower triangular, equation (D + Lo) DU1 = RHS, as the upper triangular case is trivially the same with the order of the indices reversed |
RNR-93-007 distinguishes wavefront (2D) and hyperplane (3D) methods and concludes that the hyperplane method is best.
|
The key observation is that in the lower triangular sweep, grid point (i,j,k) depends only on its neighbours with smaller indices, all of which lie on an earlier hyperplane i+j+k = constant:
|
Essential idea is that we divide equations into sweeps labelled by m
|
Once one has solved sweeps 1 through m-1, ALL equations of sweep m can be solved INDEPENDENTLY and hence with natural parallelism (see the sketch after these bullets) |
We will use an Owner-Computes strategy and assign each grid point (as normal) to a given processor |
There are several choices for the assignment of grid points to processors, and it is not clear that RNR-93-007 chooses the best strategy |
The natural strategy is NOT BLOCK (divide the grid points into P contiguous cubic domains) but rather BLOCK-CYCLIC
|
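Here is a minimal serial sketch of the hyperplane sweep for the lower triangular solve, assuming the scalar case and a lower operator Lo that couples (i,j,k) only to (i-1,j,k), (i,j-1,k) and (i,j,k-1); the grid size and coefficients are assumptions for illustration.

# Sketch: hyperplane ordering for the lower triangular sweep (D + Lo) dU1 = RHS.
# Assume (scalar case) that Lo couples point (i,j,k) only to (i-1,j,k),
# (i,j-1,k) and (i,j,k-1).  All points with i+j+k == m then depend only on
# points of earlier hyperplanes, so each hyperplane can be updated in parallel.
import numpy as np

N = 16                                            # assumed grid size
d   = 3.0 + np.zeros((N, N, N))                   # diagonal D
lx = ly = lz = -1.0 * np.ones((N, N, N))          # coefficients of Lo
rhs = np.random.rand(N, N, N)
dU1 = np.zeros((N, N, N))

for m in range(0, 3 * (N - 1) + 1):               # sweep over hyperplanes i+j+k = m
    # every (i,j,k) on this hyperplane is independent of the others:
    # a parallel version distributes this loop over the processors
    for i in range(max(0, m - 2 * (N - 1)), min(N - 1, m) + 1):
        for j in range(max(0, m - i - (N - 1)), min(N - 1, m - i) + 1):
            k = m - i - j
            s = rhs[i, j, k]
            if i > 0: s -= lx[i, j, k] * dU1[i - 1, j, k]
            if j > 0: s -= ly[i, j, k] * dU1[i, j - 1, k]
            if k > 0: s -= lz[i, j, k] * dU1[i, j, k - 1]
            dU1[i, j, k] = s / d[i, j, k]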
Consider just two Processors P=2 |
Consider a two-dimensional example with m=1..15 sweeps on an 8 by 8 array of grid points |
Example I: Cyclic Distribution in One Dimension |
Example II: Block Cyclic Distribution in One Dimension |
Example III: Block Distribution in One Dimension |
Example IV: Scattered Distribution in One Dimension
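The load-balance differences between these distributions can be checked with a small counting script such as the following (a hypothetical illustration of the 8 by 8, P=2 example; the block-cyclic block size b=2 is an assumption, and only the block, cyclic and block-cyclic cases are shown).

# Sketch: count how many grid points each of P = 2 processors updates on each
# anti-diagonal sweep m = i + j of an 8 x 8 grid, for 1D column distributions.
# A distribution is good for the wavefront sweep if both counts stay similar.
P, N = 2, 8

def owner(j, scheme, b=2):
    if scheme == "block":        return j // (N // P)   # contiguous halves
    if scheme == "cyclic":       return j % P           # columns dealt out one by one
    if scheme == "block-cyclic": return (j // b) % P    # dealt out b columns at a time
    raise ValueError(scheme)

for scheme in ("block", "cyclic", "block-cyclic"):
    print(scheme)
    for m in range(15):                                  # the 15 sweeps, 0-based m = i + j
        counts = [0] * P
        for i in range(N):
            j = m - i
            if 0 <= j < N:
                counts[owner(j, scheme)] += 1
        print(f"  sweep {m:2d}: per-processor work {counts}")

The block distribution leaves one processor idle on the early and late sweeps, while the cyclic and block-cyclic distributions keep the work nearly balanced on every sweep.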
It is not clear to me how realistic these algorithms are, although they do clearly illustrate the sparse matrix operations of many important CFD algorithms |
ADI (BT and SP) is an important CFD algorithm |
The SP benchmark is based on complex approximations to BT, which may not be significant in today's world where we have powerful computers |
SSOR is an important Elliptic (steady state) solver, but I find its application to implicit (hyperbolic) equations a little peculiar
|
Note the LU Benchmark does NOT involve the "LU Decomposition" linear equation solver, which has very different computational complexity and issues |
This fills in gaps which we did not cover in either CPS615 or NAS Benchmark CFD modules of CPS713 |
References are:
|
Point Jacobi |
Point Gauss-Seidel |
Over Relaxation
|
Conjugate Gradient
|
ALL of these except SSOR are covered in CPS615 and in the Templates book
|
This is a variant of Gauss-Seidel where we solve not for single points at a time but for lines |
We illustrate for the Laplace case, where simple Gauss-Seidel reads: |
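(A reconstruction of the formula displayed here, consistent with the next bullet, in standard 5-point notation:)

\phi^{n+1}_{i,j} = \tfrac{1}{4}\left( \phi^{n+1}_{i-1,j} + \phi^{n+1}_{i,j-1} + \phi^{n}_{i+1,j} + \phi^{n}_{i,j+1} \right)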
With two terms already known at n+1'th iteration and two yet to be determined |
We now write the more general equation: |
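(A reconstruction of the line form, consistent with the next bullet: all unknowns on line j are taken at iteration n+1, which couples them into a tridiagonal system in i:)

\phi^{n+1}_{i-1,j} - 4\,\phi^{n+1}_{i,j} + \phi^{n+1}_{i+1,j} = -\,\phi^{n+1}_{i,j-1} - \phi^{n}_{i,j+1}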
where we solve for all values of i at fixed j, using the known values at the n+1'th iteration at index j-1 and the old n'th iteration values at index j+1 |
The SLOR equations can be solved directly, as they are a set of NNAS tridiagonal equations soluble in 5NNAS operations (a line solve is sketched after these bullets) |
One can now use the usual over relaxation procedure |
Note that up to a constant SLOR has the same complexity as the original SOR |
For Laplace's equation, one can show that if you choose the appropriate optimal w for both, SLOR converges twice as fast as SOR -- as an SLOR iteration involves somewhat more computation, one will not gain this full factor |
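Here is a minimal sketch of one SLOR sweep for this Laplace example, solving each line with a tridiagonal (banded) solve; the grid size, boundary values and relaxation factor w are assumptions for illustration.

# Sketch: one SLOR sweep for Laplace's equation on an (N+2) x (N+2) grid with
# fixed boundary values.  Each row j is updated by solving one tridiagonal
# system in i (cost O(N) per line), then over-relaxed with parameter w.
import numpy as np
from scipy.linalg import solve_banded

N, w = 32, 1.7                                    # assumed size and relaxation factor
phi = np.zeros((N + 2, N + 2))
phi[0, :] = phi[-1, :] = 1.0                      # assumed boundary condition

ab = np.zeros((3, N))                             # tridiagonal in banded storage
ab[0, 1:] = 1.0                                   # superdiagonal
ab[1, :] = -4.0                                   # main diagonal
ab[2, :-1] = 1.0                                  # subdiagonal

for j in range(1, N + 1):                         # sweep the lines in order
    # RHS uses the already-updated line j-1 and the old line j+1
    rhs = -phi[1:N + 1, j - 1] - phi[1:N + 1, j + 1]
    rhs[0] -= phi[0, j]                           # move known boundary values in i to the RHS
    rhs[-1] -= phi[N + 1, j]
    line = solve_banded((1, 1), ab, rhs)          # tridiagonal solve along the line
    phi[1:N + 1, j] = (1 - w) * phi[1:N + 1, j] + w * line   # over-relaxation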
These can be applied to Jacobi (which Hirsch discusses) or to Gauss-Seidel, which we discussed in CPS615 because the red-black ordering is easier to parallelize |
These combine the ideas of SLOR with those of Red-Black Ordering |
They provide naturally parallel SLOR Schemes |
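A minimal sketch of the zebra idea follows; solve_line is a hypothetical helper standing for the tridiagonal line solve of the previous sketch.

# Sketch: zebra (line red-black) ordering.  Odd lines depend only on even
# lines and vice versa, so each half-sweep is embarrassingly parallel.
def zebra_sweep(phi, solve_line):
    # solve_line(phi, j) is a hypothetical helper returning the new interior
    # values of line j (e.g. the tridiagonal SLOR line solve sketched above)
    N = phi.shape[1] - 2
    for parity in (0, 1):                      # odd lines first, then even lines
        # all lines of one colour are independent: this loop can be distributed
        for j in range(1 + parity, N + 1, 2):
            phi[1:N + 1, j] = solve_line(phi, j)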
There is a set of sophisticated methods of great importance as PDE solvers which are more powerful than simple iterative solvers |
Alternating Direction Iterative (ADI) Solvers
|
Incomplete Cholesky Factorization |
And two more which deserve major discussion but which we will overview very superficially
|
This used to be a major preconditioning technique for sparse matrices: |
Then apply conventional methods to the matrix P^-1 A
|
Very difficult to parallelize, as the computation of the incomplete factorization involves a lot of complex dependencies (parallelizing full LU is easier) |
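A minimal sketch of the idea "work with P^-1 A", using scipy's incomplete LU (spilu) as a stand-in for incomplete Cholesky and preconditioned conjugate gradient as the conventional method; the test matrix is an assumption, and strictly one would use an incomplete Cholesky factor to preserve symmetry.

# Sketch: incomplete factorization used as a preconditioner P, so the Krylov
# method effectively iterates on P^-1 A instead of A.
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import spilu, LinearOperator, cg

n = 400
# Assumed test matrix: 1D Poisson operator (symmetric positive definite)
A = sp.diags([-1.0, 2.0, -1.0], [-1, 0, 1], shape=(n, n), format="csc")
b = np.ones(n)

ilu = spilu(A, drop_tol=1e-3)                     # incomplete factorization P ~ A
M = LinearOperator((n, n), matvec=ilu.solve)      # action of P^-1

x_plain, info1 = cg(A, b)                         # conventional CG on A
x_prec,  info2 = cg(A, b, M=M)                    # CG preconditioned with P^-1
print(info1, info2,
      np.linalg.norm(A @ x_plain - b), np.linalg.norm(A @ x_prec - b))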
This is intuitively obvious, but because of this it is a very powerful approach |
Mathematically for two domains this corresponds to preconditioning |
This is a particular example of multi-scale or multi-resolution approach which we also find in CPS713 with cluster algorithms in statistical physics |
All Physical Systems have different effects going on at different physical length scales |
The best algorithms build this in and tailor the algorithm for the features at each scale |
In Iterative solvers, we saw in the eigenvalue analysis that short wavelength error components are damped quickly while long wavelength components decay very slowly
|
In Multigrid, you precondition matrix to remove long wavelength effects so local iterative solver only has to address short wavelengths |
This is done recursively so you successively remove effects at wavelengths that increase at each step by a factor of two |
This example shows four grids
|
First solve the 2 by 2 matrix A4 on the coarsest grid |
Use A4 to precondition and solve the 4 by 4 matrix A3 on grid 3 |
Use A3 to precondition and solve the 8 by 8 matrix A2 on grid 2 |
Use A2 to precondition and solve the 16 by 16 matrix A1 on grid 1 |
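A minimal two-grid sketch of this idea for the 1D Poisson model problem; the grid sizes, the weighted Jacobi smoother and the cycle structure are assumptions, and real multigrid, as in the four-grid example above, applies the coarse correction recursively.

# Sketch: one two-grid correction cycle for the 1D Poisson problem -u'' = f.
# The coarse grid removes the long wavelength error that the smoother damps slowly.
import numpy as np

def poisson(n):                     # n interior points, spacing h = 1/(n+1)
    h = 1.0 / (n + 1)
    return (np.diag(2.0 * np.ones(n)) - np.diag(np.ones(n - 1), 1)
            - np.diag(np.ones(n - 1), -1)) / h**2

def jacobi(A, u, f, sweeps=3, w=2.0 / 3.0):       # weighted Jacobi smoother
    d = np.diag(A)
    for _ in range(sweeps):
        u = u + w * (f - A @ u) / d
    return u

def restrict(r):                    # full weighting: fine (2m+1 pts) -> coarse (m pts)
    return 0.25 * (r[0:-2:2] + 2.0 * r[1:-1:2] + r[2::2])

def prolong(e, n_fine):             # linear interpolation: coarse (m) -> fine (2m+1)
    u = np.zeros(n_fine)
    u[1:-1:2] = e                   # coarse points carry over
    u[0:-2:2] += 0.5 * e            # fine points average their coarse neighbours
    u[2::2] += 0.5 * e
    return u

def two_grid(u, f, n):
    A, Ac = poisson(n), poisson((n - 1) // 2)
    u = jacobi(A, u, f)                           # pre-smooth: kill short wavelengths
    r = f - A @ u                                 # the remaining residual is smooth
    e = np.linalg.solve(Ac, restrict(r))          # coarse-grid solve (recursive in real MG)
    u = u + prolong(e, n)                         # interpolate the correction back
    return jacobi(A, u, f)                        # post-smooth

n = 31                                            # fine grid; the coarse grid has 15 points
f = np.ones(n)
u = np.zeros(n)
for cycle in range(10):
    u = two_grid(u, f, n)
    print(cycle, np.linalg.norm(f - poisson(n) @ u))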
Note both Domain Decomposition and Multigrid have Good Physical Motivation but they are quite different
|