CPS713 Case Study II: CFD and Numerical Relativity Module on NAS Benchmarks -- Part II: Spatial Differencing and ADI
Geoffrey Fox, NPAC, 111 College Place, Syracuse NY 13244-4100

Abstract of CPS713NAS-II: ADI and Spatial Differencing Module
This is the second of three foilsets on CFD and the NAS Benchmarks. We describe spatial differencing, including numerical dissipation for stability, give a general analysis of the resultant sparse matrix structure and of iterative solvers, and discuss the role of preconditioning and artificial time formulations. We focus on the NAS ADI solution and its parallel solution with various distributions and their consequent communication cost.

Spatial Differencing in NAS Benchmarks
Sections 3.4.2, 3.4.3 and 3.4.4 cover the basic spatial discretization. We use a uniform grid in the ξ, η and ζ directions with respective step sizes h_ξ, h_η and h_ζ. These cover the unit cube in (ξ, η, ζ) space, with the total number of points in each spatial direction being the same value N_NAS for the benchmark problems: N_NAS = 64 for class A and 102 for class B. We can use finite difference methods because of the uniform grid; these involve approximations such as the central difference
  ∂U/∂ξ ≈ (U_{i+1,j,k} − U_{i−1,j,k}) / (2 h_ξ).
Of course each U should carry a time superscript n, which is left off for clarity.

Numerical Dissipation in NAS Benchmarks
These discretizations need to be extended for the general case, where one differences not U itself but rather a function of U and its partial derivatives. All the derivatives need to be modified near boundaries, where some of the needed points are missing. NAS also adds a new numerical dissipation term to improve the stability of the numerical solution. Previously the highest order derivative came from viscous terms like ∇²U; the new term is proportional to ∇⁴U with a negative sign. Note the negative sign, which gives dissipation, i.e. stability. Equations (3.25), (3.26) and (3.27) summarize the equations after spatial and temporal discretization.

Matrix Formalism for NAS Benchmarks
We can write the right hand side of equation (3.26) as a matrix-vector product plus a forcing term. The vector U_vec has N_vec = 5 N_ξ N_η N_ζ components and RHS is an N_vec by N_vec matrix; the forcing term H_vec is also a vector with N_vec components. The components of the vectors, and the rows and columns of the matrices, are labelled by the grid indices i, j, k and a CFD index running from 1 to 5. Formally we can write the full implicit equations in this matrix notation. If we had used the explicit Euler method, the equations would have read simply
  U_vec^(n+1) = U_vec^n + Δt (RHS·U_vec^n + H_vec),
which only needs sparse matrix-vector multiplication as its computational kernel.

Structure of Sparse Matrices in NAS Benchmarks
A typical row of RHS has 5 nonzeros clustered about the diagonal, plus two bands -- each with two nonzeros -- on each side of the diagonal, i.e. 13 nonzero positions in all. Each position corresponds to a 5 by 5 (the number of CFD labels) block matrix which is approximately "full". So each of the 5 N_NAS³ rows of RHS has 65 nonzero elements. This can be contrasted with simple differencing for the three dimensional Laplace equation ∇²Ψ = 0, where a typical row of the underlying matrix has just 7 nonzeros. The extra complexity of CFD gives parallel algorithms better performance, as the calculation to communication ratio is increased.
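To make the nonzero counting above concrete, here is a small Python sketch (my own illustration, not the NAS code): a matrix-free matvec for the 7-point Laplace stencil -- the kernel an explicit Euler step would need -- and an enumeration of the 13 block positions of a stencil that couples a point to offsets ±1 and ±2 along each of ξ, η and ζ (my reading of the "5 plus two bands" structure), which gives the 65 scalar nonzeros per row for 5 by 5 blocks.

```python
import numpy as np

def laplace_7pt_matvec(u, h=1.0):
    """Apply the standard 7-point 3D Laplace stencil to interior points.

    u is an (N, N, N) array of grid values; this is the sparse
    matrix-vector product with 7 nonzeros per interior row
    (boundary rows are left at zero here for simplicity)."""
    v = np.zeros_like(u)
    v[1:-1, 1:-1, 1:-1] = (
        u[2:, 1:-1, 1:-1] + u[:-2, 1:-1, 1:-1] +
        u[1:-1, 2:, 1:-1] + u[1:-1, :-2, 1:-1] +
        u[1:-1, 1:-1, 2:] + u[1:-1, 1:-1, :-2] -
        6.0 * u[1:-1, 1:-1, 1:-1]
    ) / h**2
    return v

# One explicit-Euler style kernel application on a small test grid.
u = np.random.default_rng(0).standard_normal((8, 8, 8))
v = laplace_7pt_matvec(u)

# Block positions per row of the NAS-style RHS matrix (assumed here to be
# the point itself plus offsets +-1 and +-2 along each direction).
nas_block_stencil = [(0, 0, 0)] + [
    tuple(s * off if ax == d else 0 for ax in range(3))
    for d in range(3) for s in (+1, -1) for off in (1, 2)
]
print(len(nas_block_stencil))       # 13 block positions per row
print(len(nas_block_stencil) * 5)   # 65 scalar nonzeros per row with 5x5 blocks
```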
Three Types of Iteration Needed in Solution of NAS Benchmarks
As with all nonlinear coupled differential equations, one has three distinct iterative processes in the solution of the NAS equations. A similar structure is seen for sets of coupled ordinary differential equations describing electrical transmission networks.
1) The first iteration is due to the nonlinearity of the equations: one linearizes the nonlinear CFD equations, as in equation (3.17), by Taylor expansion to make the equations linear in the unknown U^(n+1).
2) The second iteration is that needed, for fixed n, to solve the implicit equations for ΔU^n governed by the implicit matrix.
3) The third iteration is that in the index n, needed to evolve in time t. In our case, where time is somewhat artificial as we only need the steady state solution, one can view n as labelling the iterative solution of the steady state equations of CFD.

Comparison of Structure of Elliptic (Steady State) and Hyperbolic Equations
Both types of equations can be written in matrix form.
Elliptic, or steady state hyperbolic-parabolic as in steady state CFD: K U = b. Note that K is typically not diagonally dominant, as the off diagonal terms are of the same size as the diagonal; e.g. in two dimensions a typical row of the Laplace iteration matrix has diagonal entry −4 and four off diagonal entries +1, so that the sum of the elements in a general row is zero.
Time dependent hyperbolic or parabolic: note that the matrix multiplying the unknown time level is typically nearly diagonal, as in the NAS case.

Relation between Discretized Time in Hyperbolic/Parabolic Equations and the Iteration Index for Solution of Steady State Equations
We already showed in CPS615 how one can derive the Jacobi, Gauss-Seidel and SOR iterative schemes from artificial diffusion equations. Other important steady state methods such as conjugate gradient are not elegantly viewed in this fashion, but the basic idea is still relevant. One derives iterative methods by splitting the matrix K of the steady state equation K U = b into two parts, K = M − N, so that M U = N U + b and one finds the iteration
  M U^(n+1) = N U^(n) + b,   or equivalently   M (U^(n+1) − U^(n)) = −K U^(n) + b.
The matrix M corresponds to the matrix multiplying the unknown time level in a typical implicit (or explicit) time dependent hyperbolic solver. Note that M is diagonal in simple cases (Jacobi) and is typically diagonally dominant, like the matrix of the hyperbolic solver.

Richardson's Method and General Preconditioning Formalism in the Artificial Time Framework
We can write this splitting iteration as the time discretization of an artificial time equation whose equilibrium is K U = b. Putting Δt = −1 one gets Richardson's method, which can be written
  U^(n+1) = U^(n) + ω_faketime (K U^(n) − b),
where ω_faketime is an acceleration or over relaxation parameter which is, up to a factor, the same as the over relaxation parameter ω in SOR.

Analysis of Richardson's Method for Laplace or Poisson Equation
This analysis is identical, up to a scale factor, with the relaxation analysis of the Jacobi method, and one examines the eigenvalues of the iteration matrix G = I + ω_faketime K (the original foil plots these eigenvalues). The optimal choice of ω_faketime is 0.25 for a normalization in which the diagonal terms of K for the Laplace equation are −4. This is identical to ω = 1 being optimal in the relaxation analysis of Jacobi Over Relaxation.

Preconditioning in the Artificial Time Approach to Iteration Methods
We can generalize the simple artificial time equation to the form P dU/dt ∝ K U − b with an arbitrary (nonsingular) matrix P; as the left hand side vanishes in equilibrium, the steady state solution is unchanged. This can be discretized in time to
  P (U^(n+1) − U^(n)) = ω Δt (K U^(n) − b),
with P playing the role of the implicit (left hand side) matrix in the NAS benchmark. Note that only the product ω Δt enters, i.e. the time step Δt plays the same role as the relaxation parameter ω.

Relation of Artificial Time and Matrix Preconditioning
The matrix form of preconditioning was to replace K U = b by the form P⁻¹ K U = P⁻¹ b, which has obvious similarities with the artificial time form above. Exploiting the analogy between the implicit matrix of the time stepping and the preconditioner P, we can relate particular approaches to solving the implicit equations of CFD to particular preconditioning schemes for the steady state equations.

Alternating Direction Iteration -- ADI
In Dongarra's Templates book and in Hirsch, ADI (Alternating Direction Iteration) is referred to as a preconditioning scheme. In our treatment of the NAS benchmarks it is derived intuitively from the solution of the implicit equations. Our derivation is probably clearer, as it naturally gives the "correct" relaxation parameter ω Δt. ADI exploits the special form of the implicit matrix, which is one (the identity) minus a sum of "small" terms corresponding to the separate discretizations in the ξ, η and ζ directions; this special form is what makes the approximate factorization introduced below accurate, as the sketch after this foil illustrates.
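Before writing out the factorized form used by the BT benchmark, the following small numerical check (my own illustration; the random matrices A_xi, A_eta, A_zeta merely stand in for the "small" one dimensional terms, and the size and scale are arbitrary) shows why replacing "one minus a sum of small terms" by a product of three one dimensional factors is accurate to second order in the small terms.

```python
import numpy as np

# Minimal check of the approximate factorization ADI exploits:
#   (I - A_xi)(I - A_eta)(I - A_zeta) = I - A_xi - A_eta - A_zeta + O(eps^2),
# so the product of three one-dimensional factors differs from the
# unfactored implicit matrix only by second order cross terms.
rng = np.random.default_rng(0)
n, eps = 20, 1e-3                       # stand-in matrix size and "smallness"
I = np.eye(n)
A_xi, A_eta, A_zeta = (eps * rng.standard_normal((n, n)) for _ in range(3))

unfactored = I - A_xi - A_eta - A_zeta
factored = (I - A_xi) @ (I - A_eta) @ (I - A_zeta)

# The retained terms are O(eps); the factorization error is O(eps^2).
print("size of retained terms :", np.linalg.norm(A_xi + A_eta + A_zeta))
print("factorization error    :", np.linalg.norm(factored - unfactored))
```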
ADI for NAS Benchmarks -- The First BT or Block Tridiagonal Benchmark
This is the approximate factorization (Beam-Warming) algorithm given in the unnumbered equations at the start of section 3.4.7 of the NAS benchmark document. Writing the implicit equation to be solved schematically as
  (I − A_ξ − A_η − A_ζ) ΔU^n = RHS^n,
where A_ξ, A_η and A_ζ denote the "small" one dimensional implicit terms, we write this in approximately factorized fashion:
  (I − A_ξ) (I − A_η) (I − A_ζ) ΔU^n = RHS^n.

Explicit Equations for ADI with NAS Benchmarks
This factorized operator can now be solved in three steps, one sweep per direction:
  (I − A_ξ) ΔU* = RHS^n,   (I − A_η) ΔU** = ΔU*,   (I − A_ζ) ΔU^n = ΔU**.
Each step involves the solution of a 5 by 5 block tridiagonal system. The matrix is tridiagonal -- not pentadiagonal -- as the numerical dissipation with ∇⁴U is only added to the explicit RHS term in this approximation.

Computational Complexity of Solution of ADI Equations
The original nonfactorized equations form a sparse matrix with bandwidth L = 2 N_NAS² + 1. Solving a banded system with M rows and columns and bandwidth L takes of order M L² operations. This gives an original complexity, for M = N_NAS³, proportional to N_NAS⁷. A tridiagonal matrix has bandwidth 3, so the complexity of the ADI factorized equations is just proportional to M, i.e. to N_NAS³.
Effect of the 5 by 5 blocks: every operation becomes a 5 by 5 matrix-vector multiplication or matrix/vector addition, i.e. one increases the estimate by roughly 5³, or 125.

Solution of Tridiagonal Equations
A tridiagonal matrix has nonzero elements only on the diagonal and on the first sub- and super-diagonals (the original foil shows this nonzero pattern, with a symbol marking each nonzero element of the tridiagonal matrix).

Parallelization of ADI
Each individual tridiagonal solve is not usefully parallelized. However this is not important, as one has N_NAS² independent tridiagonal solves for each of the three sweeps. For instance, with components labelled by (i,j,k) for the three directions ξ, η and ζ, the equations of the ξ sweep for each distinct value of (j,k) are independent.

Communication Cost for Parallel ADI in NAS Benchmarks
We decompose the three dimensional grid over the N_proc nodes of a parallel machine. There are the usual choices for this, depending on the shape of the region stored in each processor.
Example 1: Distribute over j (η), with all values of i and k (ξ and ζ) for each j stored in one node. Then one has communication at sweep 2 but none at sweeps 1 or 3.
Example 2: Distribute as cubic blocks, with all three indices i, j and k distributed. This type of distribution is normally optimal for nearest neighbor problems, as it minimizes surface over volume effects. It is not so clear for ADI, as each sweep is only in one dimension.

Examples of Possible Distributions in Two Dimensions for Parallel ADI
Distribute over j (η) with all values of i (ξ) in each node: there is no communication at sweep 1 (the loop over i for fixed j), while sweep 2 has communication for each j value in the loop for fixed i.
Alternatively, distribute both i and j (ξ and η) over square subdomains: there is communication at each sweep, but only at every fourth index value. One then needs to pipeline, as at the start only the edge processors are active and one must work from both ends towards the middle, e.g. four computation steps before the middle processors can get started; this contrasts with the sixteen computation steps in each processor.

Basis of Estimates for Communication Costs for ADI
Communication is needed when one sweeps in a given index, say i, and the processor holding that index changes as the sweep proceeds. Algorithms for sequential tridiagonal solvers are given in the appendix of volume 1 of Hirsch. For a scalar tridiagonal matrix with N rows, the solution takes time of about 5N operations; for a block tridiagonal matrix one increases this by a factor of between 25 and 125 for 5 by 5 blocks. Note that as one changes the block size, communication is proportional to 5, i.e. to the size of the vector being calculated, whereas calculation is proportional to 5² to 5³, i.e. it increases faster with block size than communication does. This illustrates that the extra CFD components (1 goes to 5) increase the efficiency of parallel CFD codes compared to comparable scalar equations. A sketch of the scalar tridiagonal solve follows.
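Since the cost estimates above rest on the order-5N scalar tridiagonal solve, here is a minimal sketch of the standard Thomas algorithm (my own code, not that of Hirsch or the NAS benchmark; the array names a, b, c, d and the random test problem are assumptions made for illustration).

```python
import numpy as np

def thomas(a, b, c, d):
    """Solve a scalar tridiagonal system by the Thomas algorithm.

    a: sub-diagonal (length N, a[0] unused)
    b: diagonal (length N)
    c: super-diagonal (length N, c[-1] unused)
    d: right hand side (length N)
    Forward elimination plus back substitution cost O(N) operations
    with a small constant, consistent with the roughly 5N estimate
    quoted above for one tridiagonal solve."""
    n = len(d)
    cp, dp = np.empty(n), np.empty(n)
    cp[0], dp[0] = c[0] / b[0], d[0] / b[0]
    for i in range(1, n):                      # forward elimination
        m = b[i] - a[i] * cp[i - 1]
        cp[i] = c[i] / m
        dp[i] = (d[i] - a[i] * dp[i - 1]) / m
    x = np.empty(n)
    x[-1] = dp[-1]
    for i in range(n - 2, -1, -1):             # back substitution
        x[i] = dp[i] - cp[i] * x[i + 1]
    return x

# Quick check against a dense solve on a diagonally dominant test system.
n = 8
rng = np.random.default_rng(1)
a, c = rng.standard_normal(n), rng.standard_normal(n)
b = 4.0 + np.abs(a) + np.abs(c)                # ensure diagonal dominance
d = rng.standard_normal(n)
A = np.diag(b) + np.diag(a[1:], -1) + np.diag(c[:-1], 1)
assert np.allclose(thomas(a, b, c, d), np.linalg.solve(A, d))
```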
Examples of Different Distributions for Parallel ADI Solvers
Consider a 64 by 64 by 64 grid on 64 processors.
Example 1: One dimensional decomposition -- each processor holds 1 by 64 by 64 grid points. Communication occurs at one sweep only, of size 1 (number of sweeps) x 64² (number of equations) x 64 (number of times processor boundaries are crossed).
Example 2: Two dimensional decomposition -- each processor holds 8 by 8 by 64 grid points. Communication occurs at two sweeps, of total size 2 (number of sweeps) x 64² (number of equations) x 8 (number of processor boundaries crossed).
Example 3: Three dimensional decomposition -- each processor holds 16 by 16 by 16 grid points. Communication occurs at every sweep, of total size 3 (number of sweeps) x 64² (number of equations) x 4 (number of processor boundaries crossed).

Comparisons of Different Distributions for Parallel ADI Solvers
The naive ratio of communication in the three distributions is therefore 64 : 16 : 12 for the one : two : three dimensional decompositions; these counts are worked through in the short calculation below. Three dimensions also has lower pipelining start-up effects. However -- certainly for the one dimensional case -- it is better to redistribute the data before the sweep that crosses processor boundaries. The total algorithm communication cost of the one dimensional decomposition is 64³ units. For a redistribution that reduces the algorithm communication cost to zero, the redistribution cost is proportional to 64² (the number of elements per processor) times a number which depends on the communication network -- this number is log₂64 = 6 for a hypercube.
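To make the 64 : 16 : 12 comparison concrete, here is a short sketch (my own arithmetic following the counts on the foils above; the function name adi_comm_volume and the word-count units are assumptions) that tallies the per-iteration ADI communication volume for the three decompositions of the 64³ grid on 64 processors.

```python
# Worked example of the naive per-iteration ADI communication volume.
N, P = 64, 64

def adi_comm_volume(proc_grid):
    """Sum over sweeps of (lines per sweep) x (processor boundaries crossed).

    proc_grid gives how many processors the grid is cut into along each of
    the three directions; a sweep along a direction that is not cut
    (one processor along it) needs no communication."""
    volume = 0
    for p_along_sweep in proc_grid:
        if p_along_sweep > 1:                    # sweep crosses boundaries
            volume += (N * N) * p_along_sweep    # 64^2 lines, ~p crossings each
    return volume

decompositions = {"1D (1x64x64 per node)":  (64, 1, 1),
                  "2D (8x8x64 per node)":   (8, 8, 1),
                  "3D (16x16x16 per node)": (4, 4, 4)}
for name, grid in decompositions.items():
    v = adi_comm_volume(grid)
    print(f"{name:26s} {v:8d} words   ratio {v // (N * N):3d}")
# Prints ratios 64 : 16 : 12, matching the comparison above.  For the 1D
# case a redistribution instead costs ~ 64^2 elements per processor times
# log2(64) = 6 network steps on a hypercube.
```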