U(i,j) = ( U(i-1,j) + U(i+1,j) + U(i,j-1) + U(i,j+1) + b(i,j) )/4

We let U(i,j,m) be our approximation for U(i,j) after the m-th iteration. We can use any initial guess U(i,j,0); U(i,j,0) = 0 will do. To get a formula for U(i,j,m+1) in terms of the previous iteration U(i,j,m), we simply use
U(i,j,m+1) = ( U(i-1,j,m) + U(i+1,j,m) + U(i,j-1,m) + U(i,j+1,m) + b(i,j) )/4

In other words, U(i,j,m+1) is just a weighted average of its four neighboring values and b(i,j).
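For concreteness, here is a minimal serial sketch of one Jacobi sweep, under the assumption that U and b are stored as (n+2)-by-(n+2) numpy arrays whose outer ring holds the boundary values; the layout and names are ours, not part of the original notes.

    import numpy as np

    def jacobi_sweep(U, b):
        # One Jacobi sweep: U holds the m-th iterate on an (n+2)-by-(n+2) grid
        # whose outer ring stores boundary values; returns the (m+1)-st iterate.
        U_new = U.copy()
        U_new[1:-1, 1:-1] = (U[:-2, 1:-1] + U[2:, 1:-1] +
                             U[1:-1, :-2] + U[1:-1, 2:] +
                             b[1:-1, 1:-1]) / 4.0
        return U_new

    # initial guess U(i,j,0) = 0, as in the text:
    # U = np.zeros((n + 2, n + 2))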
We analyze the convergence rate of Jacobi as follows, presenting only the final results; for details, see any good text on numerical linear algebra. To state the result, we define the error as the root-sum-of-squares of the errors at the grid points:
error(m) = sqrt( sum_{ij} ( U(i,j,m) - U(i,j) )^2 )

It turns out that Jacobi decreases the error by a constant factor rho_Jacobi(n) < 1, depending on dimension n, at each step:
error(m) <= rho_Jacobi(n) * error(m-1) <= ( rho_Jacobi(n) )^m * error(0)

where
rho_Jacobi(n) = cos(pi/(n+1))

When n is large, the interesting case, we can approximate rho_Jacobi(n) by
rho_Jacobi(n) ~ 1 - .5 * (pi/(n+1))^2

which gets closer to 1 as n grows. So to decrease the error by a factor of 2, say, requires taking m iterations where
.5 > ( rho_Jacobi(n) )^m ~ ( 1 - .5 * (pi/(n+1))^2 )^m ~ 1 - .5 * (pi/(n+1))^2 * m   ... when n is large

or
m ~ ((n+1)/pi)^2

In other words, convergence gets slower for large problems, with the number of steps proportional to N = n^2. This makes the serial complexity
Serial complexity of Jacobi = number_of_steps * cost_per_step = O(N) * O(N) = O(N^2)

This is easy to implement in parallel, because each grid cell value may be updated independently. On a PRAM (parallel random access machine), this means the cost per step is O(1), lowering the PRAM complexity to O(N). On a realistic parallel machine, one would simply assign grid point (i,j) to processors in a block fashion (see class notes), and have each processor update the values of U(i,j,m+1) at the grid points it owns. This requires U values on the boundary of the blocks to be communicated to neighboring processors. When n >> p, so that each processor owns a large number n^2/p of grid points, the amount of data communicated, n/p words with each neighbor, will be relatively small.
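As a concrete illustration (the numbers here are ours, not from the original notes): with n = 1000, so N = 10^6 unknowns, halving the error takes m ~ (1001/pi)^2, or roughly 10^5 Jacobi sweeps, each of which updates all 10^6 grid points.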
To express the parallel running time in detail, we let
p     = number of processors
f     = time per flop
alpha = startup for a message
beta  = time per word in a message

Then
Time(Jacobi) = number_of_steps * cost_per_step
             = O(N) * ( (N/p)*f + alpha + (n/p)*beta )
             = O(N^2/p)*f + O(N)*alpha + O(N^(3/2) /p)*beta

where O(N/p) flops are needed to update local values, alpha is the start-up cost of messages to each neighbor, and O(n/p) boundary values are communicated to neighbors.
NOTE: Convergence slowing down, so that more steps are needed for larger problems, is common to many methods and problems. It does not hold for the Multigrid method, however, which takes a constant number of iterations to decrease the error by a constant factor, and this makes Multigrid very attractive.
The first improvement is motivated by examining the serial implementation, and noticing that while evaluating
U(i,j,m+1) = ( U(i-1,j,m) + U(i+1,j,m) + U(i,j-1,m) + U(i,j+1,m) + b(i,j) )/4

new values for the solution at some of the grid points on the right hand side are already available, and may be used. This leads to the algorithm:
U(i,j,m+1) = ( U(i-1,j,latest) + U(i+1,j,latest) + U(i,j-1,latest) + U(i,j+1,latest) + b(i,j) )/4

where "latest" is either m or m+1, depending on whether that particular grid point has been updated already. This depends on the order in which we loop through the (i,j) pairs. For example, if we use the same rowwise ordering, we get the Gauss-Seidel algorithm:
for i=1 to n
   for j=1 to n
      U(i,j,m+1) = ( U(i-1,j,m+1) + U(i+1,j,m) + U(i,j-1,m+1) + U(i,j+1,m) + b(i,j) )/4
   end for
end for

This particular update order cannot be straightforwardly parallelized, because there is a recurrence, with each U(i,j,m+1) depending on the immediately previously updated value U(i-1,j,m+1).
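For reference, here is a minimal serial sketch of this sweep, under the same assumed (n+2)-by-(n+2) numpy layout as the Jacobi sketch above; updating U in place is exactly what supplies the "latest" values and creates the dependence just described.

    import numpy as np

    def gauss_seidel_sweep(U, b):
        # One rowwise Gauss-Seidel sweep, updating U in place so that the
        # U[i-1, j] and U[i, j-1] terms are already the new (m+1) values.
        n = U.shape[0] - 2
        for i in range(1, n + 1):
            for j in range(1, n + 1):
                U[i, j] = (U[i - 1, j] + U[i + 1, j] +
                           U[i, j - 1] + U[i, j + 1] + b[i, j]) / 4.0
        return U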
So instead we loop through the grid points in the Red-Black order (named after the resemblance to a chess board) shown below.
The idea is to update all the black points, followed by all the red points. The black grid points are connected only to red grid points, and so can all be updated simultaneously using the most recent red values. Then the red values can be updated using the most recent black values. This yields the following algorithm:
for all black (i,j) grid points
   U(i,j,m+1) = ( U(i-1,j,m) + U(i+1,j,m) + U(i,j-1,m) + U(i,j+1,m) + b(i,j) )/4
end for
for all red (i,j) grid points
   U(i,j,m+1) = ( U(i-1,j,m+1) + U(i+1,j,m+1) + U(i,j-1,m+1) + U(i,j+1,m+1) + b(i,j) )/4
end for

This can be implemented in two parallel steps, one for updating the black points, and one for the red.
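A vectorized serial sketch of one red-black sweep, under the same assumed numpy layout as before; the checkerboard masks built from index parity are our own device for illustration.

    import numpy as np

    def red_black_sweep(U, b):
        # One red-black Gauss-Seidel sweep: update the black points from the
        # current red values, then the red points from the fresh black values.
        n = U.shape[0] - 2
        i, j = np.meshgrid(np.arange(n + 2), np.arange(n + 2), indexing='ij')
        interior = (i >= 1) & (i <= n) & (j >= 1) & (j <= n)
        black = interior & ((i + j) % 2 == 0)
        red   = interior & ((i + j) % 2 == 1)
        for color in (black, red):
            avg = np.zeros_like(U)
            avg[1:-1, 1:-1] = (U[:-2, 1:-1] + U[2:, 1:-1] +
                               U[1:-1, :-2] + U[1:-1, 2:] + b[1:-1, 1:-1]) / 4.0
            U[color] = avg[color]   # each color sees the other's latest values
        return U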
By itself, this reordering does not speed up the overall solution. Gauss-Seidel does converge twice as fast as Jacobi (rho_Gauss_Seidel(n) = rho_Jacobi(n)^2), but on a PRAM each red-black Gauss-Seidel step takes the same time as two parallel Jacobi steps, so there is no overall benefit.
The second improvement is called Successive Overrelaxation, and depends on the Red-Black ordering for its speedup. Let us rewrite the algorithm so far as
U(i,j,m+1) = some_expression                          ... an abbreviation for the update scheme above
           = U(i,j,m) + ( some_expression - U(i,j,m) )
           = U(i,j,m) + correction(i,j,m)

The idea behind overrelaxation is that if correction(i,j,m) is a good direction in which to move U(i,j,m) to make it a better solution, one should move even farther in that direction, by a factor w > 1:
U(i,j,m+1) = U(i,j,m) + w * correction(i,j,m)

w (omega) is called the overrelaxation factor (w < 1 would correspond to underrelaxation, which is not worth doing). The word successive refers to using the latest data to evaluate the correction.
Altogether, the SOR(w) algorithm is as follows:
for all black (i,j) grid points
   U(i,j,m+1) = U(i,j,m) + w * ( U(i-1,j,m) + U(i+1,j,m) + U(i,j-1,m) + U(i,j+1,m) + b(i,j) - 4 * U(i,j,m) )/4
end for
for all red (i,j) grid points
   U(i,j,m+1) = U(i,j,m) + w * ( U(i-1,j,m+1) + U(i+1,j,m+1) + U(i,j-1,m+1) + U(i,j+1,m+1) + b(i,j) - 4 * U(i,j,m) )/4
end for

As before, we measure the error of iteration m by
error(m) = sqrt( sum_{ij} ( U(i,j,m) - U(i,j) )^2 )

As with Jacobi, SOR(w) decreases the error by a constant factor rho_SOR(n,w) < 1 at each step:
error(m) <= rho_SOR(n,w) * error(m-1) <= ( rho_SOR(n,w) )^m * error(0)

We want to pick w to minimize rho_SOR(n,w). There turns out to be a simple expression for the minimizing w,
w_opt = 2 / ( 1 + sin(pi/(n+1)) )

which is between 1 and 2. The corresponding rho_SOR(n,w_opt) is
rho_SOR(n,w_opt) = ( cos(pi/(n+1)) / ( 1 + sin(pi/(n+1)) ) )^2

As with Jacobi, rho_SOR(n,w_opt) approaches 1 as n grows, so that convergence slows down, but it slows down much less than for Jacobi:
rho_SOR(n,w_opt) ~ 1 - 2 * pi/(n+1)

Therefore, reducing the error by a factor of 2, say, requires taking m iterations where
.5 > ( rho_SOR(n,w_opt) )^m ~ ( 1 - 2 * pi/(n+1) )^m ~ 1 - 2 * pi/(n+1) * m   ... when n is large

or
m ~ (n+1)/(2 * pi)

This is approximately the square root of the number of steps Jacobi requires.
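For the same illustrative n = 1000 used earlier (numbers ours), this is m ~ 1001/(2 * pi), about 160 SOR steps, versus roughly 10^5 Jacobi steps.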
A complexity analysis similar to that for Jacobi shows that
Time(SOR) = number_of_steps * cost_per_step
          = O(sqrt(N)) * ( (N/p)*f + alpha + (n/p)*beta )
          = O(N^(3/2) /p)*f + O(sqrt(N))*alpha + O(N/p)*beta

where O(N/p) flops are needed to update local values, alpha is the start-up cost of messages to each neighbor, and O(n/p) boundary values are communicated to neighbors.
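Putting the pieces together, here is a serial sketch of one SOR(w) sweep in red-black order, with w taken from the w_opt formula above; the numpy layout and names are the same assumptions used in the earlier sketches.

    import numpy as np

    def sor_sweep(U, b, w):
        # One SOR(w) sweep in red-black order: each point moves by w times
        # the Gauss-Seidel correction instead of by the correction itself.
        n = U.shape[0] - 2
        i, j = np.meshgrid(np.arange(n + 2), np.arange(n + 2), indexing='ij')
        interior = (i >= 1) & (i <= n) & (j >= 1) & (j <= n)
        for color in (interior & ((i + j) % 2 == 0),    # black points first
                      interior & ((i + j) % 2 == 1)):   # then red points
            corr = np.zeros_like(U)
            corr[1:-1, 1:-1] = (U[:-2, 1:-1] + U[2:, 1:-1] +
                                U[1:-1, :-2] + U[1:-1, 2:] +
                                b[1:-1, 1:-1] - 4.0 * U[1:-1, 1:-1]) / 4.0
            U[color] += w * corr[color]
        return U

    # the optimal w from the formula above, for a given n:
    # w_opt = 2.0 / (1.0 + np.sin(np.pi / (n + 1)))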
x = initial guess for inv(A)*b
r = b - A * x                          ... residual, to be made small
p = r                                  ... initial "search direction"
repeat
   v = A * p                           ... matrix-vector multiply
   a = ( r' * r ) / ( p' * v )         ... dot products (in BLAS1)
   x = x + a * p                       ... updated solution, using saxpy (in BLAS1)
   new_r = r - a * v                   ... updated residual, using saxpy (in BLAS1)
   g = ( new_r' * new_r ) / ( r' * r ) ... dot products (in BLAS1)
   p = new_r + g * p                   ... updated search direction, using saxpy (in BLAS1)
   r = new_r
until ( (new_r' * new_r) small enough )

Here is a rough motivation for this algorithm. CG maintains 3 vectors at each step: the approximate solution x, its residual r = b - A * x, and a search direction p, which is also called a conjugate gradient. At each step, x is improved by searching for a better solution in the direction p, yielding an improved solution x + (a * p). The direction p is called a gradient because we are in fact doing gradient descent on a certain measure of the error (namely sqrt(r' * inv(A) * r)). The directions p_i and p_j from steps i and j of the algorithm are called conjugate, or more precisely A-conjugate, because they satisfy (p_i' * A * p_j) = 0 if i is not equal to j. One can also show that after i iterations x_i is the "optimal" solution among all possible linear combinations of the form
a0 * x + a1 * (A * x) + a2 * (A^2 * x) + a3 * (A^3 * x) + ... + ai * (A^i * x)
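For concreteness, here is a direct transcription of the pseudocode above into numpy; it is a sketch under the assumption that the matrix is supplied matrix-free, as a function apply_A(p) returning A*p (the function and parameter names are ours).

    import numpy as np

    def conjugate_gradient(apply_A, b, x0, tol=1e-8, max_iter=1000):
        # CG sketch following the pseudocode above; apply_A(p) must return A*p.
        x = x0.copy()
        r = b - apply_A(x)                 # residual, to be made small
        p = r.copy()                       # initial search direction
        rr = np.sum(r * r)                 # r'*r (dot product, BLAS1)
        for _ in range(max_iter):
            v = apply_A(p)                 # matrix-vector multiply
            a = rr / np.sum(p * v)         # step length
            x = x + a * p                  # updated solution (saxpy)
            r = r - a * v                  # updated residual (saxpy)
            new_rr = np.sum(r * r)
            if new_rr <= tol * tol:        # until r'*r small enough
                break
            g = new_rr / rr
            p = r + g * p                  # updated search direction (saxpy)
            rr = new_rr
        return x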
For most matrices, the majority of work is in the sparse matrix-vector multiplication v = (A * p) in the first step. For Poisson's equation, where we can think of p and v living on a square grid, this means computing
v(i,j) = 4 * p(i,j) - p(i-1,j) - p(i+1,j) - p(i,j-1) - p(i,j+1)

which is nearly identical to the inner loop of Jacobi or SOR in the way it is parallelized. We will deal with more general techniques for sparse-matrix-vector multiplication in a later lecture.
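A matrix-free sketch of this product, with p stored as an n-by-n numpy array and neighbors outside the grid treated as zero (our storage convention); combined with the CG sketch above, one could call conjugate_gradient(poisson_matvec, b, np.zeros_like(b)).

    import numpy as np

    def poisson_matvec(p):
        # v = A*p for the 5-point Poisson stencil on an n-by-n grid;
        # neighbors outside the grid are treated as zero.
        v = 4.0 * p
        v[1:, :]  -= p[:-1, :]    # subtract p(i-1,j)
        v[:-1, :] -= p[1:, :]     # subtract p(i+1,j)
        v[:, 1:]  -= p[:, :-1]    # subtract p(i,j-1)
        v[:, :-1] -= p[:, 1:]     # subtract p(i,j+1)
        return v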
The other operations in CG are easy to parallelize. The saxpy operations are embarrassingly parallel, and require no communication. The dot-products require local dot-products and then a global add using, say, a tree to form the sum in log(p) steps.
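To illustrate the tree summation (a serial stand-in, not real message passing): each pass of the loop below corresponds to one parallel step in which the surviving partial sums are combined pairwise, so p per-processor partial dot-products are reduced in about log2(p) steps.

    def tree_sum(partials):
        # Combine per-processor partial sums pairwise; each while-iteration
        # models one parallel step, so the total is formed in ~log2(p) steps.
        vals = list(partials)
        while len(vals) > 1:
            vals = [vals[k] + vals[k + 1] if k + 1 < len(vals) else vals[k]
                    for k in range(0, len(vals), 2)]
        return vals[0]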
The rate of convergence of CG depends on a number (kappa) called the condition number of A. It is defined as the ratio of the largest to the smallest eigenvalue of A (and so is always at least 1). A roughly equivalent quantity is norm(inv(A)) * norm(A), where the norm of a matrix is the magnitude of the largest entry. The larger the condition number, the slower the convergence. One can show that the number of CG iterations required to reduce the error by a constant factor g < 1 is proportional to the square root of kappa. For Poisson's equation, kappa is O(N), so n = sqrt(N) iterations are needed. This means SOR and CG take about the same number of steps. An advantage of CG over SOR is that CG is applicable to a much larger class of problems.
There are methods similar to CG that are suitable for matrices A that are not symmetric or positive definite. Like CG, most of their work involves computing the matrix-vector product A * p (or possibly A' * p), as well as dot-products and saxpys (BLAS1).
For more information on CG, check the class notes as well as the excellent review "An Introduction to the Conjugate Gradient Method Without the Agonizing Pain" by J. Shewchuk, in the reports section.
Jacobi, SOR (Successive Overrelaxation) and CG (Conjugate Gradients) can be thought of as performing mostly nearest-neighbor operations on the grid (CG also requires summing across the grid). In other words, information at one grid point can propagate only to an adjacent grid point in 1 iteration. Since the solution U(i,j) at each grid point depends on the value of b(l,m) at every other grid point, it takes n steps for information to propagate all the way across the grid. This would lead us not to expect a good solution in many fewer than n = sqrt(N) steps, which is what these methods require.
BLAS1 is the Level 1 [B]asic [L]inear [A]lgebra [S]ubprograms, which perform basic vector-vector operations. Only the single-precision real and complex data types are supported.
Examples of such operations: the dot product of two vectors (SDOT), the addition of a multiple of one vector to another (SAXPY), and the index of the element with largest absolute value (ISAMAX).
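For concreteness, rough numpy equivalents of those three operations (the numpy calls are our stand-ins, not the BLAS interfaces themselves):

    import numpy as np

    x = np.array([3.0, -7.0, 2.0])
    y = np.array([1.0,  4.0, 5.0])
    a = 2.0

    dot    = np.dot(x, y)                # SDOT:   dot product of two vectors
    y_axpy = a * x + y                   # SAXPY:  a times one vector plus another
    imax   = int(np.argmax(np.abs(x)))   # ISAMAX: index of entry with largest magnitude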
NIST GAMS has more information on BLAS1, with links that will take you to Netlib.
Also, the Computational Science Education Project has a good overview of Iterative Methods for Linear Systems.