CPS713 Case Study II: CFD and Numerical Relativity Module on NAS Benchmarks -- Part II: Spatial Differencing and ADI
Geoffrey Fox, NPAC, 111 College Place, Syracuse NY 13244-4100

Abstract of CPS713NAS-II: ADI and Spatial Differencing Module
This is the second of three foilsets on CFD and the NAS Benchmarks. We describe spatial differencing, including numerical dissipation for stability, give a general analysis of the resultant sparse matrix structure and of iterative solvers, and discuss the role of preconditioning and artificial time formulations. We focus on the NAS ADI solution and its parallel solution with various distributions and their consequent communication cost.

Spatial Differencing in NAS Benchmarks
Sections 3.4.2, 3.4.3 and 3.4.4 cover the basic spatial discretization. We use a uniform grid in the ξ, η and ζ directions with respective step sizes h_ξ, h_η and h_ζ. These cover the unit cube in (ξ, η, ζ) space, with the total number of points in each spatial direction being the same value N_NAS for the benchmark problems: N_NAS = 64 for class A and 102 for class B. We can use finite difference methods because of the uniform grid; these involve approximations such as the central difference
  ∂U/∂ξ ≈ (U_{i+1,j,k} − U_{i−1,j,k}) / (2 h_ξ).
Of course each U should carry a time superscript n, which is left off for clarity.

Numerical Dissipation in NAS Benchmarks
These discretizations need to be extended for the general case, where one differences not U itself but rather a function of U and its partial derivatives. All the derivatives need to be modified near boundaries, where some of the needed points are missing. NAS also adds a new numerical dissipation term to improve the stability of the numerical solution. Previously the highest order derivative came from viscous terms like ∇²U; the new term is proportional to ∇⁴U with a negative sign. Note the negative sign, which gives dissipation, i.e. stability. Equations (3.25), (3.26) and (3.27) summarize the equations after spatial and temporal discretization.

Matrix Formalism for NAS Benchmarks
We can write the right hand side of equation (3.26) as a matrix-vector product plus a forcing term. The vector U_vec has N_vec = 5 N_ξ N_η N_ζ components and RHS is an N_vec by N_vec matrix; the forcing term H_vec is also a vector with N_vec components. The components of the vectors, and the rows and columns of the matrices, are labelled by the grid indices i, j, k and a CFD index running from 1 to 5. Formally we can write the full implicit equations in this matrix notation. If we had used the explicit Euler method, the equations would have read simply
  U_vec^(n+1) = U_vec^n + Δt (RHS·U_vec^n + H_vec),
which only needs sparse matrix-vector multiplication as its computational kernel.

Structure of Sparse Matrices in NAS Benchmarks
A typical row of RHS has 5 nonzeros clustered about the diagonal, plus two bands -- each with two nonzeros -- on each side of the diagonal, i.e. 13 nonzero positions in all. Each position corresponds to a 5 by 5 (the number of CFD labels) block matrix which is approximately "full". So each of the 5 N_NAS³ rows of RHS has 65 nonzero elements. This can be contrasted with simple differencing for the three dimensional Laplace equation ∇²Ψ = 0, where a typical row of the underlying matrix has just 7 nonzeros. The extra complexity of CFD gives parallel algorithms better performance, as the calculation to communication ratio is increased.
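To make the nonzero counting above concrete, here is a small Python sketch (my own illustration, not the NAS code): a matrix-free matvec for the 7-point Laplace stencil -- the kernel an explicit Euler step would need -- and an enumeration of the 13 block positions of a stencil that couples a point to offsets ±1 and ±2 along each of ξ, η and ζ (my reading of the "5 plus two bands" structure), which gives the 65 scalar nonzeros per row for 5 by 5 blocks.

```python
import numpy as np

def laplace_7pt_matvec(u, h=1.0):
    """Apply the standard 7-point 3D Laplace stencil to interior points.

    u is an (N, N, N) array of grid values; this is the sparse
    matrix-vector product with 7 nonzeros per interior row
    (boundary rows are left at zero here for simplicity)."""
    v = np.zeros_like(u)
    v[1:-1, 1:-1, 1:-1] = (
        u[2:, 1:-1, 1:-1] + u[:-2, 1:-1, 1:-1] +
        u[1:-1, 2:, 1:-1] + u[1:-1, :-2, 1:-1] +
        u[1:-1, 1:-1, 2:] + u[1:-1, 1:-1, :-2] -
        6.0 * u[1:-1, 1:-1, 1:-1]
    ) / h**2
    return v

# One explicit-Euler style kernel application on a small test grid.
u = np.random.default_rng(0).standard_normal((8, 8, 8))
v = laplace_7pt_matvec(u)

# Block positions per row of the NAS-style RHS matrix (assumed here to be
# the point itself plus offsets +-1 and +-2 along each direction).
nas_block_stencil = [(0, 0, 0)] + [
    tuple(s * off if ax == d else 0 for ax in range(3))
    for d in range(3) for s in (+1, -1) for off in (1, 2)
]
print(len(nas_block_stencil))       # 13 block positions per row
print(len(nas_block_stencil) * 5)   # 65 scalar nonzeros per row with 5x5 blocks
```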
Three Types of Iteration Needed in Solution of NAS Benchmarks
As with all nonlinear coupled differential equations, one has three distinct iterative processes in the solution of the NAS equations. A similar structure is seen for sets of coupled ordinary differential equations describing electrical transmission networks.
1) The first iteration is due to the nonlinearity of the equations: one linearizes the nonlinear CFD equations, as in equation (3.17), by Taylor expansion to make the equations linear in the unknown U^(n+1).
2) The second iteration is that needed, for fixed n, to solve the implicit equations for ΔU^n governed by the implicit matrix.
3) The third iteration is that in the index n, needed to evolve in time t. In our case, where time is somewhat artificial as we only need the steady state solution, one can view n as labelling the iterative solution of the steady state equations of CFD.

Comparison of Structure of Elliptic (Steady State) and Hyperbolic Equations
Both types of equations can be written in matrix form.
Elliptic, or steady state hyperbolic-parabolic as in steady state CFD: K U = b. Note that K is typically not diagonally dominant, as the off diagonal terms are of the same size as the diagonal; e.g. in two dimensions a typical row of the Laplace iteration matrix has diagonal entry −4 and four off diagonal entries +1, so that the sum of the elements in a general row is zero.
Time dependent hyperbolic or parabolic: note that the matrix multiplying the unknown time level is typically nearly diagonal, as in the NAS case.

Relation between Discretized Time in Hyperbolic/Parabolic Equations and the Iteration Index for Solution of Steady State Equations
We already showed in CPS615 how one can derive the Jacobi, Gauss-Seidel and SOR iterative schemes from artificial diffusion equations. Other important steady state methods such as conjugate gradient are not elegantly viewed in this fashion, but the basic idea is still relevant. One derives iterative methods by splitting the matrix K of the steady state equation K U = b into two parts, K = M − N, so that M U = N U + b and one finds the iteration
  M U^(n+1) = N U^(n) + b,   or equivalently   M (U^(n+1) − U^(n)) = −K U^(n) + b.
The matrix M corresponds to the matrix multiplying the unknown time level in a typical implicit (or explicit) time dependent hyperbolic solver. Note that M is diagonal in simple cases (Jacobi) and is typically diagonally dominant, like the matrix of the hyperbolic solver.

Richardson's Method and General Preconditioning Formalism in the Artificial Time Framework
We can write this splitting iteration as the time discretization of an artificial time equation whose equilibrium is K U = b. Putting Δt = −1 one gets Richardson's method, which can be written
  U^(n+1) = U^(n) + ω_faketime (K U^(n) − b),
where ω_faketime is an acceleration or over relaxation parameter which is, up to a factor, the same as the over relaxation parameter ω in SOR.

Analysis of Richardson's Method for Laplace or Poisson Equation
This analysis is identical, up to a scale factor, with the relaxation analysis of the Jacobi method, and one examines the eigenvalues of the iteration matrix G = I + ω_faketime K (the original foil plots these eigenvalues). The optimal choice of ω_faketime is 0.25 for a normalization in which the diagonal terms of K for the Laplace equation are −4. This is identical to ω = 1 being optimal in the relaxation analysis of Jacobi Over Relaxation.

Preconditioning in the Artificial Time Approach to Iteration Methods
We can generalize the simple artificial time equation to the form P dU/dt ∝ K U − b with an arbitrary (nonsingular) matrix P; as the left hand side vanishes in equilibrium, the steady state solution is unchanged. This can be discretized in time to
  P (U^(n+1) − U^(n)) = ω Δt (K U^(n) − b),
with P playing the role of the implicit (left hand side) matrix in the NAS benchmark. Note that only the product ω Δt enters, i.e. the time step Δt plays the same role as the relaxation parameter ω.

Relation of Artificial Time and Matrix Preconditioning
The matrix form of preconditioning was to replace K U = b by the form P⁻¹ K U = P⁻¹ b, which has obvious similarities with the artificial time form above. Exploiting the analogy between the implicit matrix of the time stepping and the preconditioner P, we can relate particular approaches to solving the implicit equations of CFD to particular preconditioning schemes for the steady state equations.

Alternating Direction Iteration -- ADI
In Dongarra's Templates book and in Hirsch, ADI (Alternating Direction Iteration) is referred to as a preconditioning scheme. In our treatment of the NAS benchmarks it is derived intuitively from the solution of the implicit equations. Our derivation is probably clearer, as it naturally gives the "correct" relaxation parameter ω Δt. ADI exploits the special form of the implicit matrix, which is one (the identity) minus a sum of "small" terms corresponding to the separate discretizations in the ξ, η and ζ directions; this special form is what makes the approximate factorization introduced below accurate, as the sketch after this foil illustrates.
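Before writing out the factorized form used by the BT benchmark, the following small numerical check (my own illustration; the random matrices A_xi, A_eta, A_zeta merely stand in for the "small" one dimensional terms, and the size and scale are arbitrary) shows why replacing "one minus a sum of small terms" by a product of three one dimensional factors is accurate to second order in the small terms.

```python
import numpy as np

# Minimal check of the approximate factorization ADI exploits:
#   (I - A_xi)(I - A_eta)(I - A_zeta) = I - A_xi - A_eta - A_zeta + O(eps^2),
# so the product of three one-dimensional factors differs from the
# unfactored implicit matrix only by second order cross terms.
rng = np.random.default_rng(0)
n, eps = 20, 1e-3                       # stand-in matrix size and "smallness"
I = np.eye(n)
A_xi, A_eta, A_zeta = (eps * rng.standard_normal((n, n)) for _ in range(3))

unfactored = I - A_xi - A_eta - A_zeta
factored = (I - A_xi) @ (I - A_eta) @ (I - A_zeta)

# The retained terms are O(eps); the factorization error is O(eps^2).
print("size of retained terms :", np.linalg.norm(A_xi + A_eta + A_zeta))
print("factorization error    :", np.linalg.norm(factored - unfactored))
```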
ADI for NAS Benchmarks -- The First BT or Block Tridiagonal Benchmark
This is the approximate factorization (Beam-Warming) algorithm given in the unnumbered equations at the start of section 3.4.7 of the NAS benchmark document. Writing the implicit equation to be solved schematically as
  (I − A_ξ − A_η − A_ζ) ΔU^n = RHS^n,
where A_ξ, A_η and A_ζ denote the "small" one dimensional implicit terms, we write this in approximately factorized fashion:
  (I − A_ξ) (I − A_η) (I − A_ζ) ΔU^n = RHS^n.

Explicit Equations for ADI with NAS Benchmarks
This factorized operator can now be solved in three steps, one sweep per direction:
  (I − A_ξ) ΔU* = RHS^n,   (I − A_η) ΔU** = ΔU*,   (I − A_ζ) ΔU^n = ΔU**.
Each step involves the solution of a 5 by 5 block tridiagonal system. The matrix is tridiagonal -- not pentadiagonal -- as the numerical dissipation with ∇⁴U is only added to the explicit RHS term in this approximation.

Computational Complexity of Solution of ADI Equations
The original nonfactorized equations form a sparse matrix with bandwidth L = 2 N_NAS² + 1. Solving a banded system with M rows and columns and bandwidth L takes of order M L² operations. This gives an original complexity, for M = N_NAS³, proportional to N_NAS⁷. A tridiagonal matrix has bandwidth 3, so the complexity of the ADI factorized equations is just proportional to M, i.e. to N_NAS³.
Effect of the 5 by 5 blocks: every operation becomes a 5 by 5 matrix-vector multiplication or matrix/vector addition, i.e. one increases the estimate by roughly 5³, or 125.

Solution of Tridiagonal Equations
A tridiagonal matrix has nonzero elements only on the diagonal and on the first sub- and super-diagonals (the original foil shows this nonzero pattern, with a symbol marking each nonzero element of the tridiagonal matrix).

Parallelization of ADI
Each individual tridiagonal solve is not usefully parallelized. However this is not important, as one has N_NAS² independent tridiagonal solves for each of the three sweeps. For instance, with components labelled by (i,j,k) for the three directions ξ, η and ζ, the equations of the ξ sweep for each distinct value of (j,k) are independent.

Communication Cost for Parallel ADI in NAS Benchmarks
We decompose the three dimensional grid over the N_proc nodes of a parallel machine. There are the usual choices for this, depending on the shape of the region stored in each processor.
Example 1: Distribute over j (η), with all values of i and k (ξ and ζ) for each j stored in one node. Then one has communication at sweep 2 but none at sweeps 1 or 3.
Example 2: Distribute as cubic blocks, with all three indices i, j and k distributed. This type of distribution is normally optimal for nearest neighbor problems, as it minimizes surface over volume effects. It is not so clear for ADI, as each sweep is only in one dimension.

Examples of Possible Distributions in Two Dimensions for Parallel ADI
Distribute over j (η) with all values of i (ξ) in each node: there is no communication at sweep 1 (the loop over i for fixed j), while sweep 2 has communication for each j value in the loop for fixed i.
Alternatively, distribute both i and j (ξ and η) over square subdomains: there is communication at each sweep, but only at every fourth index value. One then needs to pipeline, as at the start only the edge processors are active and one must work from both ends towards the middle, e.g. four computation steps before the middle processors can get started; this contrasts with the sixteen computation steps in each processor.

Basis of Estimates for Communication Costs for ADI
Communication is needed when one sweeps in a given index, say i, and the processor holding that index changes as the sweep proceeds. Algorithms for sequential tridiagonal solvers are given in the appendix of volume 1 of Hirsch. For a scalar tridiagonal matrix with N rows, the solution takes time of about 5N operations; for a block tridiagonal matrix one increases this by a factor of between 25 and 125 for 5 by 5 blocks. Note that as one changes the block size, communication is proportional to 5, i.e. to the size of the vector being calculated, whereas calculation is proportional to 5² to 5³, i.e. it increases faster with block size than communication does. This illustrates that the extra CFD components (1 goes to 5) increase the efficiency of parallel CFD codes compared to comparable scalar equations. A sketch of the scalar tridiagonal solve follows.
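Since the cost estimates above rest on the order-5N scalar tridiagonal solve, here is a minimal sketch of the standard Thomas algorithm (my own code, not that of Hirsch or the NAS benchmark; the array names a, b, c, d and the random test problem are assumptions made for illustration).

```python
import numpy as np

def thomas(a, b, c, d):
    """Solve a scalar tridiagonal system by the Thomas algorithm.

    a: sub-diagonal (length N, a[0] unused)
    b: diagonal (length N)
    c: super-diagonal (length N, c[-1] unused)
    d: right hand side (length N)
    Forward elimination plus back substitution cost O(N) operations
    with a small constant, consistent with the roughly 5N estimate
    quoted above for one tridiagonal solve."""
    n = len(d)
    cp, dp = np.empty(n), np.empty(n)
    cp[0], dp[0] = c[0] / b[0], d[0] / b[0]
    for i in range(1, n):                      # forward elimination
        m = b[i] - a[i] * cp[i - 1]
        cp[i] = c[i] / m
        dp[i] = (d[i] - a[i] * dp[i - 1]) / m
    x = np.empty(n)
    x[-1] = dp[-1]
    for i in range(n - 2, -1, -1):             # back substitution
        x[i] = dp[i] - cp[i] * x[i + 1]
    return x

# Quick check against a dense solve on a diagonally dominant test system.
n = 8
rng = np.random.default_rng(1)
a, c = rng.standard_normal(n), rng.standard_normal(n)
b = 4.0 + np.abs(a) + np.abs(c)                # ensure diagonal dominance
d = rng.standard_normal(n)
A = np.diag(b) + np.diag(a[1:], -1) + np.diag(c[:-1], 1)
assert np.allclose(thomas(a, b, c, d), np.linalg.solve(A, d))
```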
Examples of Different Distributions for Parallel ADI Solvers
Consider a 64 by 64 by 64 grid on 64 processors.
Example 1: One dimensional decomposition -- each processor holds 1 by 64 by 64 grid points. Communication occurs at one sweep only, of size 1 (number of sweeps) x 64² (number of equations) x 64 (number of times processor boundaries are crossed).
Example 2: Two dimensional decomposition -- each processor holds 8 by 8 by 64 grid points. Communication occurs at two sweeps, of total size 2 (number of sweeps) x 64² (number of equations) x 8 (number of processor boundaries crossed).
Example 3: Three dimensional decomposition -- each processor holds 16 by 16 by 16 grid points. Communication occurs at every sweep, of total size 3 (number of sweeps) x 64² (number of equations) x 4 (number of processor boundaries crossed).

Comparisons of Different Distributions for Parallel ADI Solvers
The naive ratio of communication in the three distributions is therefore 64 : 16 : 12 for the one : two : three dimensional decompositions; these counts are worked through in the short calculation below. Three dimensions also has lower pipelining start-up effects. However -- certainly for the one dimensional case -- it is better to redistribute the data before the sweep that crosses processor boundaries. The total algorithm communication cost of the one dimensional decomposition is 64³ units. For a redistribution that reduces the algorithm communication cost to zero, the redistribution cost is proportional to 64² (the number of elements per processor) times a number which depends on the communication network -- this number is log₂64 = 6 for a hypercube.
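To make the 64 : 16 : 12 comparison concrete, here is a short sketch (my own arithmetic following the counts on the foils above; the function name adi_comm_volume and the word-count units are assumptions) that tallies the per-iteration ADI communication volume for the three decompositions of the 64³ grid on 64 processors.

```python
# Worked example of the naive per-iteration ADI communication volume.
N, P = 64, 64

def adi_comm_volume(proc_grid):
    """Sum over sweeps of (lines per sweep) x (processor boundaries crossed).

    proc_grid gives how many processors the grid is cut into along each of
    the three directions; a sweep along a direction that is not cut
    (one processor along it) needs no communication."""
    volume = 0
    for p_along_sweep in proc_grid:
        if p_along_sweep > 1:                    # sweep crosses boundaries
            volume += (N * N) * p_along_sweep    # 64^2 lines, ~p crossings each
    return volume

decompositions = {"1D (1x64x64 per node)":  (64, 1, 1),
                  "2D (8x8x64 per node)":   (8, 8, 1),
                  "3D (16x16x16 per node)": (4, 4, 4)}
for name, grid in decompositions.items():
    v = adi_comm_volume(grid)
    print(f"{name:26s} {v:8d} words   ratio {v // (N * N):3d}")
# Prints ratios 64 : 16 : 12, matching the comparison above.  For the 1D
# case a redistribution instead costs ~ 64^2 elements per processor times
# log2(64) = 6 network steps on a hypercube.
```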