Given by Geoffrey C. Fox, Nancy J. McCracken at CPS615 Basic Simulation Track for Computational Science on Fall Semester 95. Foils prepared 6 December 96
Summary of Material
This CPS615 module covers basic full matrix parallel algorithms, with a discussion of matrix multiplication and LU decomposition, the latter covered for the banded as well as the true full case.
Matrix multiplication covers the approach given in "Solving Problems on Concurrent Processors" as well as Cannon's algorithm.
We review those applications -- especially computational electromagnetics and chemistry -- where full matrices are commonly used.
Of course sparse matrices are far more important than full matrices!
Geoffrey Fox
Nancy McCracken
NPAC
Syracuse University
111 College Place
Syracuse NY 13244-4100
We have studied partial differential equations, for example, and shown how they translate into matrix equations.
These matrices were "sparse": the operator only linked a few states (a few neighboring values, i.e. a few components of the solution vector).
"Full" matrices are those with "essentially all" elements nonzero. More precisely, it is not worth exploiting the zero's.
|
Full matrices come from "operators" which link several (~all) basis states used in expressing general state. |
Examples from chemistry:
The operator is the Hamiltonian H with matrix elements H_ij = <i|H|j>.
The states can be labelled by many quantities.
Often you want to find the eigenstates, i.e. a basis set with respect to which the operator H is diagonal. However, this is usually impossible.
Often one knows that H = H0 + V,
where, for a molecular compound, H0 is the "free" Hamiltonian for the isolated molecules
and V is the interaction, i.e. the forces between atoms.
Simple states diagonalize H0, but V_ij will be nonzero for "most" i and j.
According to a survey by Edelman, the dominant use of "large" matrix inversion on supercomputers is the Method of Moments for computational electromagnetics. This method of solving matrix equations was invented by Harrington at Syracuse University in about 1967.
Suppose we have an operator equation L phi = f, and we expand the unknown phi as
    phi = Sum_j x_j phi_j
where the phi_j are suitable expansion functions for which L phi_j can be calculated. Then
    Sum_j x_j L phi_j = f
The easiest method would be to use eigenfunctions, i.e. phi_j with L phi_j = lambda_j phi_j, so that the problem becomes diagonal.
This is a very important method, although often you can't find the eigenfunctions.
For example, in "scattering problems", which are usual in electromagnetics, the eigenvalues are continuous: for the Laplacian operator, for any wave vector k, the plane wave exp(i k.x) is an eigenfunction with the continuously varying eigenvalue -k^2.
So we look at problems where phi_j is not an eigenfunction,
and where we need L phi_j explicitly.
Choose a "suitable" set of weight functions w_i. Taking the inner product of the expanded equation with each w_i, the operator equation becomes the matrix equation A x = b with
    A_ij = (w_i, L phi_j)   and   b_i = (w_i, f)
Read the literature (for example, Computer Physics Communications, Nov. 1991) for choices of phi_j and w_i. Clearly one will take the phi_j to be functions for which L phi_j can be easily calculated.
Comments:
Eigenvalues/vectors -- to find "bound states", "equilibrium states" -- for example, using MOPAC
Equation Solutions -- for reasons similar to those just discussed
Multiplication -- to "change basis"
From the previous discussion, we see that matrix multiplication is actually very rarely used in scientific computing for large N. (Yet it's a favorite computer science algorithm!)
Full matrix equation solvers are sometimes used, but are not very common. Of course equation solvers are incredibly important for sparse matrices, because if the matrix is large one cannot afford to treat it as full.
In solving A x = b, formally x = A^(-1) b, but this is NOT normally the best numerical method.
If A is sparse, both A^(-1) and the better LU decomposition are NOT sparse.
Suppose we want to multiply the matrices A and B together to form the matrix C = A B.
We will assume all matrices are square -- the algorithm can be generalized to deal with rectangular matrices.
The input matrices, A and B, are decomposed into rectangular sub-blocks.
If we have N processors, we have sqrt(N) rows and sqrt(N) columns of sub-blocks, so N should be a perfect square. (Of course this condition can be relaxed -- it is used for simplicity of discussion.)
One sub-block is assigned to each processor, in a grid corresponding to the sub-block structure.
The algorithm ensures that the output matrix C has the same decomposition as A and B.
If A, B and C are M by M matrices, with M^2 = N m^2 and thus M = sqrt(N) m,
then each processor stores an m by m sub-block.
Processors are labelled by a two-vector (l,k), l,k = 1 ... sqrt(N).
Let A_lk be the sub-block at position (l,k); then the problem can be stated in block matrix form:
    C_lk = Sum_j A_lj B_jk ,   j = 1 ... sqrt(N)
Example of the sub-block structure of matrix A where N = 16 (sqrt(N) = 4):
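To make the block form concrete, here is a minimal sequential C sketch (not from the original foils; blocked_matmul is an illustrative name). It stores the full M by M matrices row-major and loops over the (l,k) sub-blocks.

    /* Sequential illustration of the block form C_lk = Sum_j A_lj B_jk.
       A, B, C are full M x M row-major matrices with M = sqrtN * m;
       the (l,k) sub-block occupies rows l*m..l*m+m-1 and columns k*m..k*m+m-1.
       C must be zeroed before the call. */
    void blocked_matmul(int M, int m, const double *A, const double *B, double *C)
    {
        int sqrtN = M / m;
        for (int l = 0; l < sqrtN; l++)
          for (int k = 0; k < sqrtN; k++)
            for (int j = 0; j < sqrtN; j++)
              /* C_lk += A_lj * B_jk : an m x m block multiply-add */
              for (int i = 0; i < m; i++)
                for (int p = 0; p < m; p++)
                  for (int q = 0; q < m; q++)
                    C[(l*m+i)*M + (k*m+q)] += A[(l*m+i)*M + (j*m+p)] * B[(j*m+p)*M + (k*m+q)];
    }

In the parallel algorithm, the processor at grid position (l,k) performs only the inner j loop for its own output block C_lk, with the needed blocks of A and B arriving by communication.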
Fox, Johnson, Lyzenga, Otto, Salmon and Walker, "Solving Problems on Concurrent Processors", Vol. 1, 1988, p. 167, for the broadcast, multiply, roll algorithm.
Cannon, L., Ph.D. thesis, Montana State University, Bozeman, MT, 1969.
Demmel, Heath, and van der Vorst, "Parallel Numerical Linear Algebra", 1993, for a detailed survey of algorithms.
Agarwal, Gustavson, and Zubair, "A high-performance matrix-multiplication algorithm on a distributed-memory parallel computer, using overlapped communication", IBM Journal of Research and Development, which discusses issues of overlapping computation and communication in block matrix multiply.
See Fox, Johnson, Lyzenga, Otto, Salmon and Walker, "Solving Problems on Concurrent Processors", Vol. 1, 1988.
Broadcast sub-blocks of A along rows to form matrix T
Multiply sub-blocks and add into C
Roll B up one row (the picture shows the result after the roll, with the correct elements of B in place)
Broadcast the next sub-blocks of A along the rows
Multiply sub-blocks and add into C
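A hedged C/MPI sketch of this loop on one processor follows; it is not from the original foils, the names (fox_matmul, Alocal, rowcomm, ...) are illustrative, MPI_DOUBLE is used instead of the Fortran MPI_REAL, and the MPI calls it relies on are the ones described on the following foils.

    #include <mpi.h>

    /* One processor's part of the broadcast-multiply-roll algorithm.
       Alocal, Blocal, Clocal: this processor's m x m sub-blocks (Clocal zeroed on entry).
       l, k: this processor's coordinates in the sqrtN x sqrtN grid.
       rowcomm: communicator of the sqrtN processors in row l (from MPI_CART_SUB below);
       its ranks are assumed to equal the column coordinate, as MPI_Cart_sub provides.
       comm2d: the periodic 2D cartesian communicator (from MPI_CART_CREATE below). */
    void fox_matmul(int m, int sqrtN, int l, int k,
                    double *Alocal, double *Blocal, double *Clocal,
                    double *T, MPI_Comm rowcomm, MPI_Comm comm2d)
    {
        int above, below;
        MPI_Status status;

        /* neighbours one row up (destination of the roll) and one row down (source) */
        MPI_Cart_shift(comm2d, 0, -1, &below, &above);

        for (int step = 0; step < sqrtN; step++) {
            int j = (l + step) % sqrtN;       /* column of A broadcast along row l this step */
            if (k == j)
                for (int p = 0; p < m*m; p++) T[p] = Alocal[p];
            MPI_Bcast(T, m*m, MPI_DOUBLE, j, rowcomm);   /* broadcast A(l,j) along the row */

            /* Clocal += T * Blocal */
            for (int i = 0; i < m; i++)
                for (int p = 0; p < m; p++)
                    for (int q = 0; q < m; q++)
                        Clocal[i*m + q] += T[i*m + p] * Blocal[p*m + q];

            /* roll B up one row (with wraparound, since the topology is periodic) */
            MPI_Sendrecv_replace(Blocal, m*m, MPI_DOUBLE, above, 0, below, 0,
                                 comm2d, &status);
        }
    }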
With the new MPI cartesian topology, there are two functions which convert between the logical process coordinates, such as [i,j] in our 2D topology, and the (possibly new) 1D topology rank number required by the send and receive functions.
MPI_CART_COORDS(comm2d, myrank, 2, coords)
Takes a processor number myrank in a 2-dimensional communicator called comm2d, and returns the 2-element vector corresponding to the [i,j] position of this processor in the new system.
MPI_CART_RANK(comm2d, [i,j], myrank)
Takes a 2-element int array with the logical process coordinates i and j and returns the 1D processor rank in the variable myrank.
Actually, in our MatMul program nobody needs to call MPI_CART_COORDS, since each processor already knows its own [i,j] position, but each processor should call MPI_CART_RANK to get its (possibly new) myrank.
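For example, in C this pair of conversions might look as follows (a sketch, assuming comm2d has already been created as described later):

    int coords[2], myrank, rank_of_ij;

    /* my position (i,j) in the 2D topology, from my rank in comm2d */
    MPI_Comm_rank(comm2d, &myrank);
    MPI_Cart_coords(comm2d, myrank, 2, coords);   /* coords[0] = i, coords[1] = j */

    /* and back: the rank of the processor at logical position (i,j) */
    MPI_Cart_rank(comm2d, coords, &rank_of_ij);   /* rank_of_ij == myrank here */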
One can also define subgroups that form lower-dimensional cartesian subgrids. In our case, we want to define the concept of "rows", so that we can broadcast along rows.
MPI_CART_SUB(comm2d, [F,T], comm2drows)
Uses the 2-element boolean vector to determine whether the i-th dimension occurs in the subgrid (T for true) or not (F for false). This example returns a handle called comm2drows, which consists of sqrt(N) non-overlapping communicators, each containing the sqrt(N) processors of one row (one from each column).
Then when we call MPI_Bcast with the communicator comm2drows, the broadcast occurs within the individual row communicators.
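In C this might look like the following sketch (T, m and root_col are illustrative names, assumed to come from the surrounding program):

    int remain_dims[2] = {0, 1};     /* [F,T]: drop the row dimension, keep the column one */
    MPI_Comm comm2drows;
    MPI_Cart_sub(comm2d, remain_dims, &comm2drows);

    /* broadcast an m x m sub-block T from the processor in column root_col
       to the other processors of the same row */
    MPI_Bcast(T, m*m, MPI_DOUBLE, root_col, comm2drows);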
Now we can also use the CART functions to define processor neighbors in the 2D grid of processors.
MPI_CART_SHIFT(comm2d, dir=0, disp=-1, sourceproc, top_neighbor)
For each processor, this takes a direction (here 0, the dimension of the process grid that indexes the rows) and a displacement, here -1, so that the destination top_neighbor is the processor one row above and sourceproc is the processor one row below.
That is, it returns the two processor ranks for the source and destination of a communication function.
So here we can use a combined send and receive to achieve our upward roll:
MPI_SENDRECV_REPLACE(B, m*m, MPI_REAL, top_neighbor, 0, sourceproc, 0, comm2d, status)
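Put together in C (a sketch; MPI_DOUBLE replaces the Fortran MPI_REAL above, and B and m are assumed to be this processor's sub-block and block size):

    int sourceproc, top_neighbor;
    MPI_Status status;

    /* neighbours for an upward roll: destination is one row above, source one row below */
    MPI_Cart_shift(comm2d, 0, -1, &sourceproc, &top_neighbor);

    /* replace my block of B with the block arriving from below, sending mine upward */
    MPI_Sendrecv_replace(B, m*m, MPI_DOUBLE, top_neighbor, 0, sourceproc, 0,
                         comm2d, &status);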
In the case where N = 16, we choose the processor (sub-block) at position (2,1) in Fortran ((1,0) in C), and follow the algorithm through all 4 steps.
The default topology of nodes in MPI is one-dimensional, with processor numbers, as given by the MPI_Comm_rank function, ranging from 0 to n-1 and corresponding to the physical processors.
For this problem, we can use the MPI cartesian coordinate functions to define and use a different topology of virtual processors.
MPI_CART_CREATE(MPI_COMM_WORLD, ndim=2, dims, [T,T], T, comm2d)
where dims is a 2-element integer vector giving the number of processors in each of the 2 dimensions,
[T,T] is a 2-element logical vector giving True to wrap the topology around in both dimensions, and the final T specifies that it is o.k. to reorder the default ranks (from MPI_Comm_rank) of the processors in order to achieve a good embedding.
This returns a handle to the communicator for the new topology in the variable comm2d.
Assume that the main program has set up the 2-dimensional grid structure.
Assume that each processor has the array coords, which tells its position in the processor grid, and that a "rows" communicator group has been created:
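As a concrete illustration (not from the original foils), a minimal C main program for this set-up might look as follows; the names and the fixed block size are placeholders, and fox_matmul is the routine sketched earlier.

    #include <mpi.h>
    #include <math.h>
    #include <stdlib.h>

    void fox_matmul(int m, int sqrtN, int l, int k, double *Alocal, double *Blocal,
                    double *Clocal, double *T, MPI_Comm rowcomm, MPI_Comm comm2d);

    int main(int argc, char **argv)
    {
        int nprocs, myrank, sqrtN, m = 64;   /* m = M/sqrt(N); a fixed value for illustration */
        int dims[2], periods[2] = {1, 1}, coords[2], remain_dims[2] = {0, 1};
        MPI_Comm comm2d, comm2drows;

        MPI_Init(&argc, &argv);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
        sqrtN = (int)(sqrt((double)nprocs) + 0.5);   /* assume nprocs is a perfect square */
        dims[0] = dims[1] = sqrtN;

        /* 2D periodic grid of processors, and this processor's position in it */
        MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, 1, &comm2d);
        MPI_Comm_rank(comm2d, &myrank);
        MPI_Cart_coords(comm2d, myrank, 2, coords);

        /* "rows" communicator group: one communicator per row of the grid */
        MPI_Cart_sub(comm2d, remain_dims, &comm2drows);

        /* allocate the local m x m sub-blocks (initialisation of A and B omitted) */
        double *A = calloc((size_t)m * m, sizeof(double));
        double *B = calloc((size_t)m * m, sizeof(double));
        double *C = calloc((size_t)m * m, sizeof(double));
        double *T = calloc((size_t)m * m, sizeof(double));

        fox_matmul(m, sqrtN, coords[0], coords[1], A, B, C, T, comm2drows, comm2d);

        free(A); free(B); free(C); free(T);
        MPI_Finalize();
        return 0;
    }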
Matrix multiplication makes extensive use of broadcast operations as its communication primitives.
We can use this application to discuss three approaches to broadcast, which have different performance depending on message sizes and hardware architecture.
In our performance analysis, we assume t_comm is the time to send one data item, disregarding important but complicating issues such as latency.
Naive broadcast: The simplest way to broadcast a block of data from a source processor to a group of processors is to repeat a simple send message. In our case, the data is a sub-block of size m^2 and the group is the sqrt(N) processors of a row, so the total time needed is approximately (sqrt(N) - 1) m^2 t_comm.
Log broadcast: In many architectures such as the hypercube, it is more efficient to send the message via intermediate processors which copy the message and pass it along. This reduces the number of messages to the log of the number of processors.
Although we are not modelling differences in communication time here, it is also the case that on a hypercube network the processors can be ordered so that neighboring processors in the binary tree are only one hop away on the network.
In the case that the size of the message is large, other implementation optimizations are possible, since it will be necessary for the broadcast message to be broken into a sequence of smaller messages. The broadcast can set up a path (or paths) from the source processor that visits every processor in the group. The message is sent from the source along the path in a pipeline, where each processor receives a block of the message from its predecessor and sends it to its successor. This is a similar concept to wormhole routing, but where each intermediate processor also saves the message in addition to passing it along. (See diagram next page.)
The performance of this broadcast is then the time to send the message to the processor on the end of the path plus the overhead of starting and finishing the pipeline.
For sufficiently large grain size the pipe broadcast is better than the log broadcast, namely when the message size m^2 is large compared with the number of processors in the path, so that the pipeline start-up overhead is negligible compared with the transmission time.
Example showing the source processor sending a message in 9 pieces to 7 other processors
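A hedged C/MPI illustration of such a pipelined broadcast along a chain of processors (not the foils' implementation; in practice one would normally call the library's MPI_Bcast, and wraparound and uneven piece sizes are ignored here):

    #include <mpi.h>

    /* Pipelined broadcast of `total` doubles from rank 0 along the chain 0,1,...,p-1
       inside communicator comm, in pieces of `piece` doubles (assume piece divides total).
       Each intermediate rank receives a piece from its predecessor, keeps a copy in buf,
       and forwards it to its successor, so the pieces stream down the chain. */
    void pipe_bcast(double *buf, int total, int piece, MPI_Comm comm)
    {
        int rank, p;
        MPI_Status status;
        MPI_Comm_rank(comm, &rank);
        MPI_Comm_size(comm, &p);

        for (int off = 0; off < total; off += piece) {
            if (rank > 0)
                MPI_Recv(buf + off, piece, MPI_DOUBLE, rank - 1, 0, comm, &status);
            if (rank < p - 1)
                MPI_Send(buf + off, piece, MPI_DOUBLE, rank + 1, 0, comm);
        }
    }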
Time of matrix multiply: the parallel algorithm takes sqrt(N) steps, each consisting of an m by m block multiply (m^3 multiply-add pairs, i.e. time m^3 t_calc if a multiply-add takes t_calc) plus the communication (broadcast and roll) of m^2 words, so
    T(N) ~ sqrt(N) [ m^3 t_calc + (broadcast time for m^2 words) + m^2 t_comm ]
Efficiency is given by
    efficiency = T(1) / ( N T(N) ),   with T(1) = M^3 t_calc = N^(3/2) m^3 t_calc
Overhead f = 1/efficiency - 1, where, in terms of the grain size n = m^2:
    f is proportional to (t_comm / t_calc) / sqrt(n) = (t_comm / t_calc) / m
This is a matrix multiplication algorithm similar to the first, in that it assumes the block definition of matrix multiply,
and the calculation of each block C_lk is done in place (owner computes rule) by moving the pairs of blocks A_lj and B_jk to the processor where C_lk resides.
Cannon's algorithm differs in the order in which the pairs of blocks are multiplied. We first "skew" both the A and the B matrix so that we can "roll" both A and B -- A to the left and B to the top -- to circulate the blocks in row l and column k needed to calculate C_lk.
Reference:
Ho, Johnsson and Edelman
Matrices A and B after the initial skewing:
Multiply C = A B.
Roll A one to the left and B one upward:
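A C/MPI sketch of Cannon's algorithm on one processor (not from the foils; the names are illustrative, and comm2d is assumed to be the periodic 2D cartesian communicator described earlier):

    #include <mpi.h>

    /* Cannon's algorithm on the processor at grid position (l,k) of a sqrtN x sqrtN
       periodic grid.  Alocal, Blocal, Clocal are this processor's m x m sub-blocks,
       stored row-major; Clocal is zeroed on entry. */
    void cannon_matmul(int m, int sqrtN, int l, int k,
                       double *Alocal, double *Blocal, double *Clocal, MPI_Comm comm2d)
    {
        int left, right, up, down, src, dst;
        MPI_Status status;

        /* Initial skew: shift row l of A left by l positions, column k of B up by k. */
        if (l != 0) {
            MPI_Cart_shift(comm2d, 1, -l, &src, &dst);    /* dst: l columns to the left */
            MPI_Sendrecv_replace(Alocal, m*m, MPI_DOUBLE, dst, 0, src, 0, comm2d, &status);
        }
        if (k != 0) {
            MPI_Cart_shift(comm2d, 0, -k, &src, &dst);    /* dst: k rows above */
            MPI_Sendrecv_replace(Blocal, m*m, MPI_DOUBLE, dst, 0, src, 0, comm2d, &status);
        }

        /* Nearest neighbours for the rolls in the main loop */
        MPI_Cart_shift(comm2d, 1, -1, &right, &left);     /* A rolls one to the left */
        MPI_Cart_shift(comm2d, 0, -1, &down, &up);        /* B rolls one upward      */

        for (int step = 0; step < sqrtN; step++) {
            /* Clocal += Alocal * Blocal */
            for (int i = 0; i < m; i++)
                for (int p = 0; p < m; p++)
                    for (int q = 0; q < m; q++)
                        Clocal[i*m + q] += Alocal[i*m + p] * Blocal[p*m + q];

            MPI_Sendrecv_replace(Alocal, m*m, MPI_DOUBLE, left, 0, right, 0, comm2d, &status);
            MPI_Sendrecv_replace(Blocal, m*m, MPI_DOUBLE, up,   0, down,  0, comm2d, &status);
        }
    }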
Set-up: the initial skewing of A and B, in which each processor sends and receives one m^2 sub-block of A and one of B.
Time for each of the sqrt(N) iterations: m^3 t_calc for the block multiply plus 2 m^2 t_comm for rolling A and B.
Total time ~ sqrt(N) [ m^3 t_calc + 2 m^2 t_comm ] plus the set-up, i.e. M^3 t_calc / N plus the communication terms.
We wish to decompose the matrix A into the product LU, where L is a lower triangular matrix with 1's on the main diagonal, and U is an upper triangular matrix.
We assume A is a full M by M matrix.
In general, pivoting is necessary to ensure numerical stability.
LU decomposition is often used in the solution of systems of linear equations, A x = b. The equations can be written as two triangular systems:
    L y = b   and   U x = y
The first equation is solved for y by forward reduction, and the solution x is then obtained from the second equation by back substitution.
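A sequential C sketch of these two triangular solves (not from the foils; lu_solve is an illustrative name). It assumes A has been overwritten by its LU factors as described on the later foils, with the unit diagonal of L implicit, and that any row exchanges from pivoting have already been applied to b.

    /* Solve L y = b (forward reduction) and then U x = y (back substitution),
       where the M x M matrix lu holds L below the diagonal (unit diagonal implicit)
       and U on and above the diagonal, stored row-major as lu[i*M + j]. */
    void lu_solve(int M, const double *lu, const double *b, double *x)
    {
        double *y = x;                      /* y can share storage with x */

        /* forward reduction: y_i = b_i - sum_{j<i} L_ij y_j */
        for (int i = 0; i < M; i++) {
            double s = b[i];
            for (int j = 0; j < i; j++)
                s -= lu[i*M + j] * y[j];
            y[i] = s;
        }

        /* back substitution: x_i = ( y_i - sum_{j>i} U_ij x_j ) / U_ii */
        for (int i = M - 1; i >= 0; i--) {
            double s = y[i];
            for (int j = i + 1; j < M; j++)
                s -= lu[i*M + j] * x[j];
            x[i] = s / lu[i*M + i];
        }
    }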
The following papers deal with parallel algorithms for the LU decomposition of full matrices, and contain useful references to other work.
The algorithm proceeds in M steps.
At the start of step k, we identify the row, r, containing the largest value of |A_ik| for k <= i < M. If r != k then rows r and k are exchanged. This is called partial pivoting, and is done to improve the numerical stability. After the exchange, the element that is now A_kk is called the pivot.
At each step k, column number k of L and row number k of U are calculated:
    U_kj = A_kj   for j >= k
    L_ik = A_ik / A_kk   for i > k
Then the rows and columns > k are modified as follows:
    A_ij <- A_ij - L_ik U_kj   for i, j > k
After step k, the first k rows and columns of A are not used again. We can therefore overwrite A with the columns of L and the rows of U as we find them. The diagonal of L does not have to be explicitly stored since it is all 1's.
Using matrix A(0:M-1, 0:M-1)
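A sequential C sketch of these steps (not from the foils; lu_decompose is an illustrative name, and the parallel decomposition is discussed next). It overwrites A with L and U in place and records the row exchanges in the array pivot:

    #include <math.h>

    /* In-place LU decomposition with partial pivoting of the M x M row-major matrix a.
       On exit, a holds U on and above the diagonal and L (unit diagonal implicit) below it;
       pivot[k] records the row swapped into position k at step k. */
    void lu_decompose(int M, double *a, int *pivot)
    {
        for (int k = 0; k < M; k++) {
            /* find the row r with the largest |A(i,k)|, k <= i < M */
            int r = k;
            for (int i = k + 1; i < M; i++)
                if (fabs(a[i*M + k]) > fabs(a[r*M + k])) r = i;
            pivot[k] = r;

            if (r != k)                            /* exchange rows r and k */
                for (int j = 0; j < M; j++) {
                    double tmp = a[k*M + j]; a[k*M + j] = a[r*M + j]; a[r*M + j] = tmp;
                }

            /* column k of L, and update of the rows and columns > k:
               L(i,k) = A(i,k)/A(k,k);  A(i,j) -= L(i,k) * U(k,j) */
            for (int i = k + 1; i < M; i++) {
                a[i*M + k] /= a[k*M + k];
                for (int j = k + 1; j < M; j++)
                    a[i*M + j] -= a[i*M + k] * a[k*M + j];
            }
        }
    }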
We must choose a decomposition for the matrix A which is load balanced throughout the algorithm, and which minimizes communication.
Scattered (or cyclic) row decomposition?
Each processor gets a set of non-contiguous rows as shown. Work is approximately load balanced as the computational window moves down the diagonal.
The scattered load balancing works because there is no locality in the communication. We examine the details of the communication:
Set the block size m = M/N initially; it will change and become 1 at the end as rows and columns are eliminated.
Then the time for one iteration of the algorithm is approximately the time to find and broadcast the pivot row (proportional to the current row length times t_comm) plus the time for each processor to update its share of the remaining submatrix (proportional to the current row length times its number of active rows times t_calc).
Suppose that the matrix A is banded with bandwidth b and half-width w, where b = 2w - 1.
The sequential algorithm is similar to the full matrix case, except that at each stage only those elements within a computational "window" of w rows and w columns are updated.
Partial pivoting can cause the number of columns in the computational window to be greater than w. This necessitates some extra bookkeeping in both the sequential and parallel algorithms.
The parallel banded and full algorithms are similar, but use a different decomposition. To get better load balancing, a scattered decomposition over both rows and columns is used in the banded algorithm. In the full case, a scattered decomposition just over rows was used.
G. C. Fox, 1984, "LU Decomposition for Banded Matrices", Caltech Report C3P-99.
S. Lennart Johnsson, 1985, "Solving Narrow Banded Systems on Ensemble Architectures", ACM Trans. on Math. Software, 11:271.
Y. Saad and M. H. Schultz, 1985, "Parallel Direct Methods for Solving Banded Linear Systems", Yale Research Report YALEU/DCS/RR-387.
J. J. Dongarra and S. Lennart Johnsson, 1987, "Solving Banded Systems on a Parallel Processor", Parallel Computing, 5:219.
D. W. Walker, T. Aldcroft, A. Cisneros, G. C. Fox and W. Furmanski, 1988, "LU Decomposition of Banded Matrices and the Solution of Linear Systems on Hypercubes", in Proceedings of the Third Conference on Hypercube Concurrent Processors and Applications, ACM Press, New York.
Scattered decomposition of a 20 by 20 matrix with bandwidth b=11 for a 16 processor hypercube. A similar decomposition can be used for mesh-connected topologies. |