Given by Geoffrey C. Fox, Nancy J. McCracken at CPS615 Basic Simulation Track for Computational Science on Fall Semester 95. Foils prepared 6 December 96
Summary of Material
This CPS615 module covers basic full matrix parallel algorithms, with a discussion of matrix multiplication and LU decomposition, the latter covered for the banded as well as the true full case.
Matrix multiplication covers the approach given in "Solving Problems on Concurrent Processors" as well as Cannon's algorithm.
We review those applications -- especially computational electromagnetics and chemistry -- where full matrices are commonly used.
Of course sparse matrices are far more important than full matrices!
Geoffrey Fox
Nancy McCracken
NPAC
Syracuse University
111 College Place
Syracuse NY 13244-4100
We have studied partial differential equations, for example, and shown how they translate into matrix equations.
These matrices were "sparse": the operator only linked a few states (a few neighboring values, i.e. a few components of the solution vector).
"Full" matrices are those with "essentially all" elements nonzero. More precisely, it is not worth exploiting the zero's.
|
Full matrices come from "operators" which link several (~all) basis states used in expressing general state. |
Examples from chemistry:
The operator is the Hamiltonian H with matrix elements H_ij = <i|H|j>.
The states can be labelled by many quantities.
Often you want to find the eigenstates, i.e. a basis set with respect to which the operator H is diagonal. However, this is usually impossible.
Often one knows that H = H0 + V,
where, for a molecular compound, H0 is the "free" Hamiltonian for the isolated molecules
and V is the interaction, i.e. the forces between atoms.
Simple states diagonalize H0, but V_ij will be nonzero for "most" i and j.
According to a survey by Edelman, the dominant use of "large" matrix inversion on supercomputers is the Method of Moments for computational electromagnetics. This method of solving matrix equations was invented by Harrington at Syracuse University in about 1967.
Suppose we have an operator equation L phi = f, and we expand the unknown phi as
    phi = Sum_j x_j phi_j
where the phi_j are suitable expansion functions for which L phi_j can be calculated. Then
    Sum_j x_j L phi_j = f
The easiest method would be to use eigenfunctions, i.e. phi_j with L phi_j = lambda_j phi_j, so that the problem becomes diagonal.
This is a very important method, although often you can't find the eigenfunctions.
For example, in "scattering problems", which are usual in electromagnetics, the eigenvalues are continuous: for the Laplacian operator, for any wave vector k, the plane wave exp(i k.x) is an eigenfunction with the continuously varying eigenvalue -k^2.
So we look at problems where phi_j is not an eigenfunction,
and where we need L phi_j explicitly.
Choose a "suitable" set of weight functions w_i. Taking the inner product of the expanded equation with each w_i, the operator equation becomes the matrix equation A x = b with
    A_ij = (w_i, L phi_j)   and   b_i = (w_i, f)
Read the literature (for example, Computer Physics Communications, Nov. 1991) for choices of phi_j and w_i. Clearly one will take the phi_j to be functions for which L phi_j can be easily calculated.
Comments:
Eigenvalues/vectors -- to find "bound states", "equilibrium states" -- for example, using MOPAC
Equation Solutions -- for reasons similar to those just discussed
Multiplication -- to "change basis"
From the previous discussion, we see that matrix multiplication is actually very rarely used in scientific computing for large N. (Yet it's a favorite computer science algorithm!)
Full matrix equation solvers are sometimes used, but are not very common. Of course equation solvers are incredibly important for sparse matrices, because if the matrix is large one cannot afford to treat it as full.
In solving A x = b, formally x = A^(-1) b, but this is NOT normally the best numerical method.
If A is sparse, both A^(-1) and the better LU decomposition are NOT sparse.
Suppose we want to multiply the matrices A and B together to form the matrix C = A B.
We will assume all matrices are square -- the algorithm can be generalized to deal with rectangular matrices.
The input matrices, A and B, are decomposed into rectangular sub-blocks.
If we have N processors, we have sqrt(N) rows and sqrt(N) columns of sub-blocks, so N should be a perfect square. (Of course this condition can be relaxed -- it is used for simplicity of discussion.)
One sub-block is assigned to each processor, in a grid corresponding to the sub-block structure.
The algorithm ensures that the output matrix C has the same decomposition as A and B.
If A, B and C are M by M matrices, with M^2 = N m^2 and thus M = sqrt(N) m,
then each processor stores an m by m sub-block.
Processors are labelled by a two-vector (l,k), l,k = 1 ... sqrt(N).
Let A_lk be the sub-block at position (l,k); then the problem can be stated in block matrix form:
    C_lk = Sum_j A_lj B_jk ,   j = 1 ... sqrt(N)
Example of the sub-block structure of matrix A where N = 16 (sqrt(N) = 4):
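To make the block form concrete, here is a minimal sequential C sketch (not from the original foils; blocked_matmul is an illustrative name). It stores the full M by M matrices row-major and loops over the (l,k) sub-blocks.

    /* Sequential illustration of the block form C_lk = Sum_j A_lj B_jk.
       A, B, C are full M x M row-major matrices with M = sqrtN * m;
       the (l,k) sub-block occupies rows l*m..l*m+m-1 and columns k*m..k*m+m-1.
       C must be zeroed before the call. */
    void blocked_matmul(int M, int m, const double *A, const double *B, double *C)
    {
        int sqrtN = M / m;
        for (int l = 0; l < sqrtN; l++)
          for (int k = 0; k < sqrtN; k++)
            for (int j = 0; j < sqrtN; j++)
              /* C_lk += A_lj * B_jk : an m x m block multiply-add */
              for (int i = 0; i < m; i++)
                for (int p = 0; p < m; p++)
                  for (int q = 0; q < m; q++)
                    C[(l*m+i)*M + (k*m+q)] += A[(l*m+i)*M + (j*m+p)] * B[(j*m+p)*M + (k*m+q)];
    }

In the parallel algorithm, the processor at grid position (l,k) performs only the inner j loop for its own output block C_lk, with the needed blocks of A and B arriving by communication.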
Fox, Johnson, Lyzenga, Otto, Salmon and Walker, "Solving Problems on Concurrent Processors", Vol. 1, 1988, p. 167, for the broadcast, multiply, roll algorithm.
Cannon, L., Ph.D. thesis, Montana State University, Bozeman, MT, 1969.
Demmel, Heath, and van der Vorst, "Parallel Numerical Linear Algebra", 1993, for a detailed survey of algorithms.
Agarwal, Gustavson, and Zubair, "A high-performance matrix-multiplication algorithm on a distributed-memory parallel computer, using overlapped communication", IBM Journal of Research and Development, which discusses issues of overlapping computation and communication in block matrix multiply.
See Fox, Johnson, Lyzenga, Otto, Salmon and Walker, "Solving Problems on Concurrent Processors", Vol. 1, 1988.
Broadcast sub-blocks of A along rows to form matrix T
Multiply sub-blocks and add into C
Roll B up one row (the picture shows the result after the roll, with the correct elements of B in place)
Broadcast the next sub-blocks of A along the rows
Multiply sub-blocks and add into C
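A hedged C/MPI sketch of this loop on one processor follows; it is not from the original foils, the names (fox_matmul, Alocal, rowcomm, ...) are illustrative, MPI_DOUBLE is used instead of the Fortran MPI_REAL, and the MPI calls it relies on are the ones described on the following foils.

    #include <mpi.h>

    /* One processor's part of the broadcast-multiply-roll algorithm.
       Alocal, Blocal, Clocal: this processor's m x m sub-blocks (Clocal zeroed on entry).
       l, k: this processor's coordinates in the sqrtN x sqrtN grid.
       rowcomm: communicator of the sqrtN processors in row l (from MPI_CART_SUB below);
       its ranks are assumed to equal the column coordinate, as MPI_Cart_sub provides.
       comm2d: the periodic 2D cartesian communicator (from MPI_CART_CREATE below). */
    void fox_matmul(int m, int sqrtN, int l, int k,
                    double *Alocal, double *Blocal, double *Clocal,
                    double *T, MPI_Comm rowcomm, MPI_Comm comm2d)
    {
        int above, below;
        MPI_Status status;

        /* neighbours one row up (destination of the roll) and one row down (source) */
        MPI_Cart_shift(comm2d, 0, -1, &below, &above);

        for (int step = 0; step < sqrtN; step++) {
            int j = (l + step) % sqrtN;       /* column of A broadcast along row l this step */
            if (k == j)
                for (int p = 0; p < m*m; p++) T[p] = Alocal[p];
            MPI_Bcast(T, m*m, MPI_DOUBLE, j, rowcomm);   /* broadcast A(l,j) along the row */

            /* Clocal += T * Blocal */
            for (int i = 0; i < m; i++)
                for (int p = 0; p < m; p++)
                    for (int q = 0; q < m; q++)
                        Clocal[i*m + q] += T[i*m + p] * Blocal[p*m + q];

            /* roll B up one row (with wraparound, since the topology is periodic) */
            MPI_Sendrecv_replace(Blocal, m*m, MPI_DOUBLE, above, 0, below, 0,
                                 comm2d, &status);
        }
    }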
With the new MPI cartesian topology, there are two functions which convert between the logical process coordinates, such as [i,j] in our 2D topology, and the (possibly new) 1D topology rank number required by the send and receive functions.
MPI_CART_COORDS(comm2d, myrank, 2, coords)
Takes a processor number myrank in a 2-dimensional communicator called comm2d, and returns the 2-element vector corresponding to the [i,j] position of this processor in the new system.
MPI_CART_RANK(comm2d, [i,j], myrank)
Takes a 2-element int array with the logical process coordinates i and j and returns the 1D processor rank in the variable myrank.
Actually, in our MatMul program nobody needs to call MPI_CART_COORDS, since each processor already knows its own [i,j] position, but each processor should call MPI_CART_RANK to get its (possibly new) myrank.
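For example, in C this pair of conversions might look as follows (a sketch, assuming comm2d has already been created as described later):

    int coords[2], myrank, rank_of_ij;

    /* my position (i,j) in the 2D topology, from my rank in comm2d */
    MPI_Comm_rank(comm2d, &myrank);
    MPI_Cart_coords(comm2d, myrank, 2, coords);   /* coords[0] = i, coords[1] = j */

    /* and back: the rank of the processor at logical position (i,j) */
    MPI_Cart_rank(comm2d, coords, &rank_of_ij);   /* rank_of_ij == myrank here */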
One can also define subgroups that form lower-dimensional cartesian subgrids. In our case, we want to define the concept of "rows", so that we can broadcast along rows.
MPI_CART_SUB(comm2d, [F,T], comm2drows)
Uses the 2-element boolean vector to determine whether the i-th dimension occurs in the subgrid (T for true) or not (F for false). This example returns a handle called comm2drows, which consists of sqrt(N) non-overlapping communicators, each containing the sqrt(N) processors of one row (one from each column).
Then when we call MPI_Bcast with the communicator comm2drows, the broadcast occurs within the individual row communicators.
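In C this might look like the following sketch (T, m and root_col are illustrative names, assumed to come from the surrounding program):

    int remain_dims[2] = {0, 1};     /* [F,T]: drop the row dimension, keep the column one */
    MPI_Comm comm2drows;
    MPI_Cart_sub(comm2d, remain_dims, &comm2drows);

    /* broadcast an m x m sub-block T from the processor in column root_col
       to the other processors of the same row */
    MPI_Bcast(T, m*m, MPI_DOUBLE, root_col, comm2drows);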
Now we can also use the CART functions to define processor neighbors in the 2D grid of processors.
MPI_CART_SHIFT(comm2d, dir=0, disp=-1, sourceproc, top_neighbor)
For each processor, this takes a direction (here 0, the dimension of the process grid that indexes the rows) and a displacement, here -1, so that the destination top_neighbor is the processor one row above and sourceproc is the processor one row below.
That is, it returns the two processor ranks for the source and destination of a communication function.
So here we can use a combined send and receive to achieve our upward roll:
MPI_SENDRECV_REPLACE(B, m*m, MPI_REAL, top_neighbor, 0, sourceproc, 0, comm2d, status)
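Put together in C (a sketch; MPI_DOUBLE replaces the Fortran MPI_REAL above, and B and m are assumed to be this processor's sub-block and block size):

    int sourceproc, top_neighbor;
    MPI_Status status;

    /* neighbours for an upward roll: destination is one row above, source one row below */
    MPI_Cart_shift(comm2d, 0, -1, &sourceproc, &top_neighbor);

    /* replace my block of B with the block arriving from below, sending mine upward */
    MPI_Sendrecv_replace(B, m*m, MPI_DOUBLE, top_neighbor, 0, sourceproc, 0,
                         comm2d, &status);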
In the case where N = 16, we choose the processor (sub-block) at position (2,1) in Fortran ((1,0) in C), and follow the algorithm through all 4 steps.
The default topology of nodes in MPI is one-dimensional, with processor numbers, as given by the MPI_Comm_rank function, ranging from 0 to n-1 and corresponding to the physical processors.
For this problem, we can use the MPI cartesian coordinate functions to define and use a different topology of virtual processors.
MPI_CART_CREATE(MPI_COMM_WORLD, ndim=2, dims, [T,T], T, comm2d)
where dims is a 2-element integer vector giving the number of processors in each of the 2 dimensions,
[T,T] is a 2-element logical vector giving True to wrap the topology around in both dimensions, and the final T specifies that it is o.k. to reorder the default ranks (from MPI_Comm_rank) of the processors in order to achieve a good embedding.
This returns a handle to the communicator for the new topology in the variable comm2d.
Assume that the main program has set up the 2-dimensional grid structure.
Assume that each processor has the array coords, which tells its position in the processor grid, and that a "rows" communicator group has been created:
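As a concrete illustration (not from the original foils), a minimal C main program for this set-up might look as follows; the names and the fixed block size are placeholders, and fox_matmul is the routine sketched earlier.

    #include <mpi.h>
    #include <math.h>
    #include <stdlib.h>

    void fox_matmul(int m, int sqrtN, int l, int k, double *Alocal, double *Blocal,
                    double *Clocal, double *T, MPI_Comm rowcomm, MPI_Comm comm2d);

    int main(int argc, char **argv)
    {
        int nprocs, myrank, sqrtN, m = 64;   /* m = M/sqrt(N); a fixed value for illustration */
        int dims[2], periods[2] = {1, 1}, coords[2], remain_dims[2] = {0, 1};
        MPI_Comm comm2d, comm2drows;

        MPI_Init(&argc, &argv);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
        sqrtN = (int)(sqrt((double)nprocs) + 0.5);   /* assume nprocs is a perfect square */
        dims[0] = dims[1] = sqrtN;

        /* 2D periodic grid of processors, and this processor's position in it */
        MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, 1, &comm2d);
        MPI_Comm_rank(comm2d, &myrank);
        MPI_Cart_coords(comm2d, myrank, 2, coords);

        /* "rows" communicator group: one communicator per row of the grid */
        MPI_Cart_sub(comm2d, remain_dims, &comm2drows);

        /* allocate the local m x m sub-blocks (initialisation of A and B omitted) */
        double *A = calloc((size_t)m * m, sizeof(double));
        double *B = calloc((size_t)m * m, sizeof(double));
        double *C = calloc((size_t)m * m, sizeof(double));
        double *T = calloc((size_t)m * m, sizeof(double));

        fox_matmul(m, sqrtN, coords[0], coords[1], A, B, C, T, comm2drows, comm2d);

        free(A); free(B); free(C); free(T);
        MPI_Finalize();
        return 0;
    }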
Matrix multiplication makes extensive use of broadcast operations as its communication primitives.
We can use this application to discuss three approaches to broadcast, which have different performance depending on message sizes and hardware architecture.
In our performance analysis, we assume t_comm is the time to send one data item, disregarding important but complicating issues such as latency.
Naive broadcast: The simplest way to broadcast a block of data from a source processor to a group of processors is to repeat a simple send message. In our case, the data is a sub-block of size m^2 and the group is the sqrt(N) processors of a row, so the total time needed is approximately (sqrt(N) - 1) m^2 t_comm.
Log broadcast: In many architectures such as the hypercube, it is more efficient to send the message via intermediate processors which copy the message and pass it along. This reduces the number of messages to the log of the number of processors.
Although we are not modelling differences in communication time here, it is also the case that on a hypercube network the processors can be ordered so that neighboring processors in the binary tree are only one hop away on the network.
In the case that the size of the message is large, other implementation optimizations are possible, since it will be necessary for the broadcast message to be broken into a sequence of smaller messages. The broadcast can set up a path (or paths) from the source processor that visits every processor in the group. The message is sent from the source along the path in a pipeline, where each processor receives a block of the message from its predecessor and sends it to its successor. This is a similar concept to wormhole routing, but where each intermediate processor also saves the message in addition to passing it along. (See diagram next page.)
The performance of this broadcast is then the time to send the message to the processor on the end of the path plus the overhead of starting and finishing the pipeline.
For sufficiently large grain size the pipe broadcast is better than the log broadcast, namely when the message size m^2 is large compared with the number of processors in the path, so that the pipeline start-up overhead is negligible compared with the transmission time.
Example showing the source processor sending a message in 9 pieces to 7 other processors
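A hedged C/MPI illustration of such a pipelined broadcast along a chain of processors (not the foils' implementation; in practice one would normally call the library's MPI_Bcast, and wraparound and uneven piece sizes are ignored here):

    #include <mpi.h>

    /* Pipelined broadcast of `total` doubles from rank 0 along the chain 0,1,...,p-1
       inside communicator comm, in pieces of `piece` doubles (assume piece divides total).
       Each intermediate rank receives a piece from its predecessor, keeps a copy in buf,
       and forwards it to its successor, so the pieces stream down the chain. */
    void pipe_bcast(double *buf, int total, int piece, MPI_Comm comm)
    {
        int rank, p;
        MPI_Status status;
        MPI_Comm_rank(comm, &rank);
        MPI_Comm_size(comm, &p);

        for (int off = 0; off < total; off += piece) {
            if (rank > 0)
                MPI_Recv(buf + off, piece, MPI_DOUBLE, rank - 1, 0, comm, &status);
            if (rank < p - 1)
                MPI_Send(buf + off, piece, MPI_DOUBLE, rank + 1, 0, comm);
        }
    }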
Time of matrix multiply: the parallel algorithm takes sqrt(N) steps, each consisting of an m by m block multiply (m^3 multiply-add pairs, i.e. time m^3 t_calc if a multiply-add takes t_calc) plus the communication (broadcast and roll) of m^2 words, so
    T(N) ~ sqrt(N) [ m^3 t_calc + (broadcast time for m^2 words) + m^2 t_comm ]
Efficiency is given by
    efficiency = T(1) / ( N T(N) ),   with T(1) = M^3 t_calc = N^(3/2) m^3 t_calc
Overhead f = 1/efficiency - 1, where, in terms of the grain size n = m^2:
    f is proportional to (t_comm / t_calc) / sqrt(n) = (t_comm / t_calc) / m
This is a matrix multiplication algorithm similar to the first, in that it assumes the block definition of matrix multiply,
and the calculation of each block C_lk is done in place (owner computes rule) by moving the pairs of blocks A_lj and B_jk to the processor where C_lk resides.
Cannon's algorithm differs in the order in which the pairs of blocks are multiplied. We first "skew" both the A and the B matrix so that we can "roll" both A and B -- A to the left and B to the top -- to circulate the blocks in row l and column k needed to calculate C_lk.
Reference:
Ho, Johnsson and Edelman
Matrices A and B after the initial skewing:
Multiply C = A B.
Roll A one to the left and B one upward:
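A C/MPI sketch of Cannon's algorithm on one processor (not from the foils; the names are illustrative, and comm2d is assumed to be the periodic 2D cartesian communicator described earlier):

    #include <mpi.h>

    /* Cannon's algorithm on the processor at grid position (l,k) of a sqrtN x sqrtN
       periodic grid.  Alocal, Blocal, Clocal are this processor's m x m sub-blocks,
       stored row-major; Clocal is zeroed on entry. */
    void cannon_matmul(int m, int sqrtN, int l, int k,
                       double *Alocal, double *Blocal, double *Clocal, MPI_Comm comm2d)
    {
        int left, right, up, down, src, dst;
        MPI_Status status;

        /* Initial skew: shift row l of A left by l positions, column k of B up by k. */
        if (l != 0) {
            MPI_Cart_shift(comm2d, 1, -l, &src, &dst);    /* dst: l columns to the left */
            MPI_Sendrecv_replace(Alocal, m*m, MPI_DOUBLE, dst, 0, src, 0, comm2d, &status);
        }
        if (k != 0) {
            MPI_Cart_shift(comm2d, 0, -k, &src, &dst);    /* dst: k rows above */
            MPI_Sendrecv_replace(Blocal, m*m, MPI_DOUBLE, dst, 0, src, 0, comm2d, &status);
        }

        /* Nearest neighbours for the rolls in the main loop */
        MPI_Cart_shift(comm2d, 1, -1, &right, &left);     /* A rolls one to the left */
        MPI_Cart_shift(comm2d, 0, -1, &down, &up);        /* B rolls one upward      */

        for (int step = 0; step < sqrtN; step++) {
            /* Clocal += Alocal * Blocal */
            for (int i = 0; i < m; i++)
                for (int p = 0; p < m; p++)
                    for (int q = 0; q < m; q++)
                        Clocal[i*m + q] += Alocal[i*m + p] * Blocal[p*m + q];

            MPI_Sendrecv_replace(Alocal, m*m, MPI_DOUBLE, left, 0, right, 0, comm2d, &status);
            MPI_Sendrecv_replace(Blocal, m*m, MPI_DOUBLE, up,   0, down,  0, comm2d, &status);
        }
    }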
Set-up: the initial skewing of A and B, in which each processor sends and receives one m^2 sub-block of A and one of B.
Time for each of the sqrt(N) iterations: m^3 t_calc for the block multiply plus 2 m^2 t_comm for rolling A and B.
Total time ~ sqrt(N) [ m^3 t_calc + 2 m^2 t_comm ] plus the set-up, i.e. M^3 t_calc / N plus the communication terms.
We wish to decompose the matrix A into the product LU, where L is a lower triangular matrix with 1's on the main diagonal, and U is an upper triangular matrix.
We assume A is a full M by M matrix.
In general, pivoting is necessary to ensure numerical stability.
LU decomposition is often used in the solution of systems of linear equations, A x = b. The equations can be written as two triangular systems:
    L y = b   and   U x = y
The first equation is solved for y by forward reduction, and the solution x is then obtained from the second equation by back substitution.
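A sequential C sketch of these two triangular solves (not from the foils; lu_solve is an illustrative name). It assumes A has been overwritten by its LU factors as described on the later foils, with the unit diagonal of L implicit, and that any row exchanges from pivoting have already been applied to b.

    /* Solve L y = b (forward reduction) and then U x = y (back substitution),
       where the M x M matrix lu holds L below the diagonal (unit diagonal implicit)
       and U on and above the diagonal, stored row-major as lu[i*M + j]. */
    void lu_solve(int M, const double *lu, const double *b, double *x)
    {
        double *y = x;                      /* y can share storage with x */

        /* forward reduction: y_i = b_i - sum_{j<i} L_ij y_j */
        for (int i = 0; i < M; i++) {
            double s = b[i];
            for (int j = 0; j < i; j++)
                s -= lu[i*M + j] * y[j];
            y[i] = s;
        }

        /* back substitution: x_i = ( y_i - sum_{j>i} U_ij x_j ) / U_ii */
        for (int i = M - 1; i >= 0; i--) {
            double s = y[i];
            for (int j = i + 1; j < M; j++)
                s -= lu[i*M + j] * x[j];
            x[i] = s / lu[i*M + i];
        }
    }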
The following papers deal with parallel algorithms for the LU decomposition of full matrices, and contain useful references to other work.
The algorithm proceeds in M steps.
At the start of step k, we identify the row, r, containing the largest value of |A_ik| for k <= i < M. If r != k then rows r and k are exchanged. This is called partial pivoting, and is done to improve the numerical stability. After the exchange, the element that is now A_kk is called the pivot.
At each step k, column number k of L and row number k of U are calculated:
    U_kj = A_kj   for j >= k
    L_ik = A_ik / A_kk   for i > k
Then the rows and columns > k are modified as follows:
    A_ij <- A_ij - L_ik U_kj   for i, j > k
After step k, the first k rows and columns of A are not used again. We can therefore overwrite A with the columns of L and the rows of U as we find them. The diagonal of L does not have to be explicitly stored since it is all 1's.
Using matrix A(0:M-1, 0:M-1)
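A sequential C sketch of these steps (not from the foils; lu_decompose is an illustrative name, and the parallel decomposition is discussed next). It overwrites A with L and U in place and records the row exchanges in the array pivot:

    #include <math.h>

    /* In-place LU decomposition with partial pivoting of the M x M row-major matrix a.
       On exit, a holds U on and above the diagonal and L (unit diagonal implicit) below it;
       pivot[k] records the row swapped into position k at step k. */
    void lu_decompose(int M, double *a, int *pivot)
    {
        for (int k = 0; k < M; k++) {
            /* find the row r with the largest |A(i,k)|, k <= i < M */
            int r = k;
            for (int i = k + 1; i < M; i++)
                if (fabs(a[i*M + k]) > fabs(a[r*M + k])) r = i;
            pivot[k] = r;

            if (r != k)                            /* exchange rows r and k */
                for (int j = 0; j < M; j++) {
                    double tmp = a[k*M + j]; a[k*M + j] = a[r*M + j]; a[r*M + j] = tmp;
                }

            /* column k of L, and update of the rows and columns > k:
               L(i,k) = A(i,k)/A(k,k);  A(i,j) -= L(i,k) * U(k,j) */
            for (int i = k + 1; i < M; i++) {
                a[i*M + k] /= a[k*M + k];
                for (int j = k + 1; j < M; j++)
                    a[i*M + j] -= a[i*M + k] * a[k*M + j];
            }
        }
    }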
We must choose a decomposition for the matrix A which is load balanced throughout the algorithm, and which minimizes communication.
Scattered (or cyclic) row decomposition?
Each processor gets a set of non-contiguous rows as shown. Work is approximately load balanced as the computational window moves down the diagonal.
The scattered load balancing works because there is no locality in the communication. We examine the details of the communication:
Set the block size m = M/N initially; it will change and become 1 at the end as rows and columns are eliminated.
Then the time for one iteration of the algorithm is approximately the time to find and broadcast the pivot row (proportional to the current row length times t_comm) plus the time for each processor to update its share of the remaining submatrix (proportional to the current row length times its number of active rows times t_calc).
Suppose that the matrix A is banded with bandwidth b and half-width w, where b = 2w - 1.
The sequential algorithm is similar to the full matrix case, except that at each stage only those elements within a computational "window" of w rows and w columns are updated.
Partial pivoting can cause the number of columns in the computational window to be greater than w. This necessitates some extra bookkeeping in both the sequential and parallel algorithms.
The parallel banded and full algorithms are similar, but use a different decomposition. To get better load balancing, a scattered decomposition over both rows and columns is used in the banded algorithm. In the full case, a scattered decomposition just over rows was used.
G. C. Fox, 1984, "LU Decomposition for Banded Matrices", Caltech Report C3P-99.
S. Lennart Johnsson, 1985, "Solving Narrow Banded Systems on Ensemble Architectures", ACM Trans. on Math. Software, 11:271.
Y. Saad and M. H. Schultz, 1985, "Parallel Direct Methods for Solving Banded Linear Systems", Yale Research Report YALEU/DCS/RR-387.
J. J. Dongarra and S. Lennart Johnsson, 1987, "Solving Banded Systems on a Parallel Processor", Parallel Computing, 5:219.
D. W. Walker, T. Aldcroft, A. Cisneros, G. C. Fox and W. Furmanski, 1988, "LU Decomposition of Banded Matrices and the Solution of Linear Systems on Hypercubes", in Proceedings of the Third Conference on Hypercube Concurrent Processors and Applications, ACM Press, New York.
Scattered decomposition of a 20 by 20 matrix with bandwidth b=11 for a 16 processor hypercube. A similar decomposition can be used for mesh-connected topologies. |