Given by Geoffrey C. Fox at Delivered Lectures of CPS615 Basic Simulation Track for Computational Science on 5 December 96. Foils prepared 29 December 1996
Summary of Material
This lecture covers two distinct areas.
Firstly, a short discussion of Linear Programming -- what type of problems it is used for, what the equations look like, and the basic issues in the difficult use of parallel processing.
Then we give an abbreviated discussion of Full Matrix algorithms, covering where full matrices arise and the parallel block algorithms for matrix multiplication.
Geoffrey Fox
NPAC
Room 3-131 CST
111 College Place
Syracuse NY 13244-4100
We have studied partial differential equations, for example Poisson's equation, and shown how they translate into matrix equations A x = b.
These matrices were "sparse": the operator only linked a few states, i.e. each equation only involved a few neighboring values of the solution (a few components of x).
"Full" matrices are those with "essentially all" elements nonzero. More precisely, it is not worth exploiting the zero's.
|
Full matrices come from "operators" which link several (~all) of the basis states used in expressing a general state.
Examples from chemistry:
The operator is the Hamiltonian H with matrix elements H_ij = <i|H|j>.
States can be labelled by many quantities.
Often you want to find the eigenstates, i.e. the basis with respect to which the operator H is diagonal. However, this is usually impossible.
Often one knows that H = H0 + V;
for example, for a molecular compound, H0 is the "free" Hamiltonian for the isolated molecules
and V is the interaction, i.e. the forces between the atoms.
Simple states diagonalize H0, but V_ij will be nonzero for "most" i and j.
According to a survey by Edelman, the dominant use of "large" matrix inversion on supercomputers is the Method of Moments for Computational Electromagnetics. This method of solving matrix equations was invented by Harrington at Syracuse University in about 1967.
Suppose we have an equation L phi = f, where L is an (integro-differential) operator, f is known and phi is to be found.
Expand phi = sum_j a_j phi_j,
where the phi_j are suitable expansion functions for which L phi_j can be calculated. Then
sum_j a_j (L phi_j) = f.
The easiest method would be to use eigenfunctions, i.e. phi_j satisfying L phi_j = lambda_j phi_j.
This is a very important method although you often can't find the eigenfunctions.
For example, in "scattering problems", which are usual in electromagnetics, the eigenvalues are continuous. In the case L = -d^2/dx^2,
for any wavenumber k, e^{ikx} is an eigenfunction with eigenvalue k^2.
So we look at problems where the phi_j are not eigenfunctions.
We still write phi = sum_j a_j phi_j, where we need to determine the coefficients a_j.
Choose a "suitable" set of weight functions w_i.
Then L phi = f becomes the matrix equation sum_j A_ij a_j = b_i, with
A_ij = <w_i, L phi_j> and b_i = <w_i, f>.
Read the literature (for example, Computer Physics Communications, Nov. 1991) for choices of the w_i and phi_j. Clearly one will take the phi_j to be functions for which L phi_j can be easily calculated.
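As an illustration only, the following minimal C sketch assembles the moment matrix A_ij = <w_i, L phi_j> and right hand side b_i = <w_i, f> by midpoint-rule quadrature on [0,1]. The choices phi_j = sin((j+1) pi x) (so that L phi_j is known analytically for L = -d^2/dx^2), the Galerkin weights w_i = phi_i, and the right hand side f are illustrative assumptions, not choices recommended in the lecture.

#include <stdio.h>
#include <math.h>

#define NBASIS 4          /* number of expansion/weight functions (illustrative) */
#define NQUAD  400        /* quadrature points for the inner products */

static const double PI = 3.14159265358979323846;

/* phi_j(x) = sin((j+1) pi x); for L = -d^2/dx^2, L phi_j = ((j+1) pi)^2 phi_j. */
static double L_phi(int j, double x) { double kj = (j + 1) * PI; return kj * kj * sin(kj * x); }
static double weight(int i, double x) { return sin((i + 1) * PI * x); }   /* Galerkin: w_i = phi_i */
static double rhs(double x)           { return x * (1.0 - x); }           /* illustrative f */

int main(void)
{
    double A[NBASIS][NBASIS], b[NBASIS];
    double h = 1.0 / NQUAD;

    /* A_ij = <w_i, L phi_j> and b_i = <w_i, f>, via the midpoint rule */
    for (int i = 0; i < NBASIS; i++) {
        b[i] = 0.0;
        for (int j = 0; j < NBASIS; j++) A[i][j] = 0.0;
        for (int q = 0; q < NQUAD; q++) {
            double x = (q + 0.5) * h;
            b[i] += weight(i, x) * rhs(x) * h;
            for (int j = 0; j < NBASIS; j++)
                A[i][j] += weight(i, x) * L_phi(j, x) * h;
        }
    }

    /* The resulting small dense system A a = b is what a full-matrix solver attacks. */
    for (int i = 0; i < NBASIS; i++)
        printf("b[%d] = % .4f   A[%d][%d] = % .4f\n", i, b[i], i, i, A[i][i]);
    return 0;
}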
Comments:
Eigenvalues/vectors - to find "bound states", "equilibrium states" - for example, using MOPAC |
Equation Solutions - for reasons similar to those just discussed |
Multiplication to "change basis" |
From the previous discussion, we see that matrix multiplication is actually very rarely used in scientific computing for large N. (Yet it's a favorite computer science algorithm!) |
Full matrix equation solvers are sometimes used, but they are not very common. Of course equation solvers are incredibly important for sparse matrices, because if the matrix is large one must exploit the sparsity to be able to store it and solve the equations.
In solving A x = b, formally x = A^(-1) b, but forming the inverse is NOT normally the best numerical method.
If A is sparse, both A^(-1) and the (better) LU decomposition are NOT sparse.
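As a hedged illustration (assuming the LAPACKE C interface to LAPACK is available; the 3 by 3 system is made up), the sketch below solves a small dense A x = b by LU factorization with partial pivoting rather than by forming A^(-1):

#include <stdio.h>
#include <lapacke.h>   /* assumes the LAPACKE interface is installed */

int main(void)
{
    /* Small dense system A x = b, stored row-major for LAPACKE. */
    double A[3 * 3] = { 4.0, 1.0, 0.0,
                        1.0, 3.0, 1.0,
                        0.0, 1.0, 2.0 };
    double b[3]     = { 1.0, 2.0, 3.0 };
    lapack_int ipiv[3];

    /* dgesv performs an LU factorization with partial pivoting and overwrites
       b with the solution x; no explicit inverse is ever formed. */
    lapack_int info = LAPACKE_dgesv(LAPACK_ROW_MAJOR, 3, 1, A, 3, ipiv, b, 1);
    if (info != 0) { fprintf(stderr, "dgesv failed: %d\n", (int)info); return 1; }

    printf("x = %g %g %g\n", b[0], b[1], b[2]);
    return 0;
}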
Suppose we want to multiply the matrices A and B together to form the matrix C:
C = A B
We will assume all matrices are square - the algorithm can be generalized to deal with rectangular matrices. |
The input matrices, A and B, are decomposed into rectangular sub-blocks. |
If we have N processors we have sqrt(N) rows and sqrt(N) columns of sub-blocks. This means that N should be a perfect square. (Of course this condition can be relaxed -- it is used for simplicity of discussion.)
One sub-block is assigned to each processor in a grid corresponding to the sub-block structure. |
The algorithm ensures that the output matrix C has the same decomposition as A and B. |
If A, B and C are M by M matrices with M^2 = N m^2, and thus M = m sqrt(N),
then each processor stores an m by m sub-block.
Processors are labelled by a two-vector (l,k), l,k = 1 ... sqrt(N).
Let C_lk be the sub-block at position (l,k) (and similarly for A and B); then the problem can be stated in block matrix form:
C_lk = sum over j of A_lj B_jk,   j = 1 ... sqrt(N)
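To make the block form concrete, here is a small serial C sketch of C_lk = sum_j A_lj B_jk over m by m sub-blocks; the matrix size and block size are illustrative assumptions, and the parallel algorithms below simply distribute these sub-block multiplications over the processor grid.

#include <stdio.h>

#define SQRTN 2                    /* sqrt(N): number of block rows/columns (illustrative) */
#define MB    3                    /* m: size of each sub-block (illustrative) */
#define M     (SQRTN * MB)

/* C = A * B computed block by block: C_lk = sum_j A_lj * B_jk */
static void block_matmul(double A[M][M], double B[M][M], double C[M][M])
{
    for (int l = 0; l < SQRTN; l++)
        for (int k = 0; k < SQRTN; k++) {
            /* zero the (l,k) sub-block of C */
            for (int i = 0; i < MB; i++)
                for (int u = 0; u < MB; u++)
                    C[l * MB + i][k * MB + u] = 0.0;
            /* accumulate A_lj * B_jk over the block index j */
            for (int j = 0; j < SQRTN; j++)
                for (int i = 0; i < MB; i++)
                    for (int t = 0; t < MB; t++)
                        for (int u = 0; u < MB; u++)
                            C[l * MB + i][k * MB + u] +=
                                A[l * MB + i][j * MB + t] * B[j * MB + t][k * MB + u];
        }
}

int main(void)
{
    double A[M][M], B[M][M], C[M][M];
    for (int i = 0; i < M; i++)
        for (int j = 0; j < M; j++) { A[i][j] = i + j; B[i][j] = (i == j) ? 1.0 : 0.0; }
    block_matmul(A, B, C);                    /* B is the identity, so C should equal A */
    printf("C[2][5] = %g (expect %g)\n", C[2][5], A[2][5]);
    return 0;
}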
An example of the sub-block structure of matrix A where N = 16 (sqrt(N) = 4) is shown on the original foil.
Fox, Johnson, Lyzenga, Otto, Salmon and Walker, "Solving Problems on Concurrent Processors", Vol. 1, 1988, p. 167, for the broadcast, multiply, roll algorithm. |
Cannon, L., Ph.D. thesis, Montana State University, Bozeman, MT, 1969. |
Demmel, Heath, and van der Vorst, "Parallel Numerical Linear Algebra", 1993, for a detailed survey of algorithms. |
Agarwal, Gustavson, and Zubair, "A high-performance matrix-multiplication algorithm on a distributed-memory parallel computer, using overlapped communication", IBM Journal of Research and Development, which discusses issues of overlapping computing and communication in block matrix multiply. |
See Fox, Johnson, Lyzenga, Otto, Salmon and Walker, "Solving Problems on Concurrent Processors", Vol. 1, 1988 |
Broadcast sub-blocks of A along rows to form matrix T |
Multiply sub-blocks and add into C |
Roll B up one row (the picture shows the result after the roll, with the correct elements of B in place).
Broadcast the next sub-blocks of A along the rows.
Multiply sub-blocks and add into C |
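The following is a minimal C/MPI sketch of this broadcast-multiply-roll loop, not the book's code: the communicators comm2d and row_comm, the coords array, the block size MB and the function name fox_multiply are illustrative assumptions (the topology setup is sketched after the MPI discussion below), and ranks within row_comm are assumed to equal the column coordinate.

#include <mpi.h>
#include <string.h>

#define MB 64                        /* m: sub-block size (illustrative) */

/* One processor's part of the broadcast-multiply-roll algorithm.
   A and B are this processor's resident sub-blocks; C accumulates its result. */
void fox_multiply(double A[MB][MB], double B[MB][MB], double C[MB][MB],
                  MPI_Comm comm2d, MPI_Comm row_comm, const int coords[2], int sqrtN)
{
    double T[MB][MB];                /* buffer for the broadcast block of A */
    int up, down;

    /* Neighbours above and below in the processor column, for rolling B. */
    MPI_Cart_shift(comm2d, 0, -1, &down, &up);

    for (int i = 0; i < MB; i++)
        for (int j = 0; j < MB; j++)
            C[i][j] = 0.0;

    for (int step = 0; step < sqrtN; step++) {
        /* 1. Broadcast a sub-block of A along each processor row.  At step s,
              the source in row l is the processor in column (l + s) mod sqrt(N). */
        int source_col = (coords[0] + step) % sqrtN;
        if (coords[1] == source_col)
            memcpy(T, A, sizeof(T));
        MPI_Bcast(T, MB * MB, MPI_DOUBLE, source_col, row_comm);

        /* 2. Multiply the broadcast block of A by the resident block of B. */
        for (int i = 0; i < MB; i++)
            for (int k = 0; k < MB; k++)
                for (int j = 0; j < MB; j++)
                    C[i][j] += T[i][k] * B[k][j];

        /* 3. Roll B up one row: send our block up, receive a block from below. */
        MPI_Sendrecv_replace(B, MB * MB, MPI_DOUBLE, up, 0, down, 0,
                             comm2d, MPI_STATUS_IGNORE);
    }
}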
Under the new MPI cartesian topology, there are two functions which convert between the logical process coordinates, such as [i,j] in our 2D topology, and the (possibly new) 1D rank number required by the send and receive functions. |
MPI_CART_COORDS(comm2d, myrank, 2, coords) |
Takes a processor number myrank in a 2-dimensional communicator called comm2d, and returns the 2-element vector corresponding to the [i,j] position of this processor in the new system. |
MPI_CART_RANK(comm2d, [i,j], myrank) |
Takes a 2 element int array with logical process coordinates i and j and returns a 1D processor rank in the variable myrank. |
Assume that the main program has set up the 2-dimensional grid structure, that each processor has an array coords which tells its position in the processor grid, and that a "rows" communicator group has been created, as in the sketch below:
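A minimal sketch of that setup in C (the variable names comm2d, row_comm and coords match the ones used above; making the grid periodic so that the "roll" wraps around is an assumption of this sketch, and MPI_Cart_rank is shown only to illustrate the coordinate-to-rank conversion):

#include <mpi.h>
#include <math.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int nprocs, myrank, sqrtN;
    int dims[2], periods[2] = { 1, 1 };    /* periodic, so rolling B wraps around */
    int coords[2];
    MPI_Comm comm2d, row_comm;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
    sqrtN = (int)(sqrt((double)nprocs) + 0.5);   /* assume N is a perfect square */
    dims[0] = dims[1] = sqrtN;

    /* Create the sqrt(N) x sqrt(N) cartesian topology. */
    MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, 1, &comm2d);

    /* Find this processor's rank and its [i,j] position in the grid. */
    MPI_Comm_rank(comm2d, &myrank);
    MPI_Cart_coords(comm2d, myrank, 2, coords);

    /* The reverse conversion: the rank of the processor at a given [i,j]. */
    int same, ij[2] = { coords[0], coords[1] };
    MPI_Cart_rank(comm2d, ij, &same);            /* same == myrank */

    /* Create the "rows" communicator: keep dimension 1 (the column index), so
       all processors sharing a row index i end up in the same row_comm. */
    int remain[2] = { 0, 1 };
    MPI_Cart_sub(comm2d, remain, &row_comm);

    printf("rank %d is at [%d,%d]\n", myrank, coords[0], coords[1]);

    /* ... block matrix multiplication using comm2d, row_comm and coords ... */

    MPI_Comm_free(&row_comm);
    MPI_Comm_free(&comm2d);
    MPI_Finalize();
    return 0;
}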
Time of matrix multiply: each of the sqrt(N) steps involves communicating m by m blocks of A and B (broadcast and roll) and one m by m block multiplication, so roughly
T(N) ~ sqrt(N) [ 2 m^3 t_calc + 2 m^2 t_comm ].
Efficiency is given by
Efficiency = 1 / (1 + f_C).
Overhead, where n = m^2 is the grain size (the number of matrix elements stored per processor):
f_C ~ t_comm / (m t_calc) = t_comm / (sqrt(n) t_calc),
so the overhead falls off as the square root of the grain size, as is typical of two-dimensional problems.
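To show how these estimates behave, the short C program below evaluates the overhead and efficiency of this simple model for a few block sizes; the values of t_calc and t_comm are made-up illustrative numbers, not measurements:

#include <stdio.h>
#include <math.h>

int main(void)
{
    /* Illustrative (made-up) hardware parameters. */
    double t_calc = 0.1e-6;   /* seconds per floating point operation */
    double t_comm = 1.0e-6;   /* seconds per word communicated        */

    /* Overhead f_C ~ t_comm / (sqrt(n) t_calc); efficiency = 1 / (1 + f_C). */
    for (int m = 8; m <= 512; m *= 4) {
        double n   = (double)m * m;          /* grain size: elements per processor */
        double f_C = t_comm / (sqrt(n) * t_calc);
        printf("m = %4d  n = %8.0f  f_C = %6.3f  efficiency = %5.3f\n",
               m, n, f_C, 1.0 / (1.0 + f_C));
    }
    return 0;
}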
This is a matrix multiplication algorithm similar to the first in that it assumes the block definition of matrix multiply, C_lk = sum over j of A_lj B_jk,
and the calculation of each C_lk is done in place (owner computes rule) by moving the pairs of blocks A_lj and B_jk to the processor on which C_lk resides.
Cannon's algorithm differs in the order in which the pairs of blocks are multiplied. We first "skew" both the A and the B matrix so that we can "roll" both A and B -- A to the left and B to the top -- to circulate through row l and column k the blocks needed to calculate C_lk.
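A minimal C/MPI sketch of this skew-and-roll structure, under the same illustrative assumptions as the earlier sketches (periodic comm2d, coords holding this processor's (row, column), MB by MB resident blocks); it is an outline of the idea, not Cannon's original code:

#include <mpi.h>

#define MB 64                        /* m: sub-block size (illustrative) */

void cannon_multiply(double A[MB][MB], double B[MB][MB], double C[MB][MB],
                     MPI_Comm comm2d, const int coords[2], int sqrtN)
{
    int left, right, up, down;
    MPI_Cart_shift(comm2d, 1, -1, &right, &left);   /* roll A one step to the left */
    MPI_Cart_shift(comm2d, 0, -1, &down, &up);      /* roll B one step to the top  */

    /* Initial skew: shift row i of A left by i, and column j of B up by j. */
    if (coords[0] != 0) {
        int src, dst;
        MPI_Cart_shift(comm2d, 1, -coords[0], &src, &dst);
        MPI_Sendrecv_replace(A, MB * MB, MPI_DOUBLE, dst, 0, src, 0,
                             comm2d, MPI_STATUS_IGNORE);
    }
    if (coords[1] != 0) {
        int src, dst;
        MPI_Cart_shift(comm2d, 0, -coords[1], &src, &dst);
        MPI_Sendrecv_replace(B, MB * MB, MPI_DOUBLE, dst, 0, src, 0,
                             comm2d, MPI_STATUS_IGNORE);
    }

    for (int i = 0; i < MB; i++)
        for (int j = 0; j < MB; j++)
            C[i][j] = 0.0;

    /* sqrt(N) multiply-and-roll steps: every processor always multiplies the
       blocks it currently holds, then rolls both A (left) and B (up) by one. */
    for (int step = 0; step < sqrtN; step++) {
        for (int i = 0; i < MB; i++)
            for (int k = 0; k < MB; k++)
                for (int j = 0; j < MB; j++)
                    C[i][j] += A[i][k] * B[k][j];
        MPI_Sendrecv_replace(A, MB * MB, MPI_DOUBLE, left, 0, right, 0,
                             comm2d, MPI_STATUS_IGNORE);
        MPI_Sendrecv_replace(B, MB * MB, MPI_DOUBLE, up, 0, down, 0,
                             comm2d, MPI_STATUS_IGNORE);
    }
}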
Reference: |
Ho, Johnsson and Edelman |
We must choose a decomposition for the matrix A which is load balanced throughout the algorithm, and which minimizes communication.