Solution of very large problems requires out-of-core matrix filling and solution. The out-of-core matrix solution would be done using a vendor-specific out-of-core solver, such as ProSolver DES [80] on the Intel Paragon or the out-of-core CMSSL solver on the CM-5. Here, we only consider the development of out-of-core filling on the CM-5 system. The reason for developing an out-of-core filling algorithm is that the matrix is too large to be stored in the main memory of the system. We have to fill one portion of the matrix at a time and write it to a file, and then fill another portion of the matrix and write it out, and so on. Compare this with the in-core matrix fill algorithm, where the matrix is filled once then written out once. The main idea of designing an out-of-core algorithm is to modify the in-core filling algorithm structure and fill a portion of the matrix instead of the whole matrix.
Similarly, the issues involved with parallel out-of-core algorithm design are the problems of decomposition, load-balancing and communication. As with the in-core fill algorithm, the no communication approach is adopted here. The important question for out-of-core filling is how to decompose the problem into a set of small problems which can be fitted in in-core memory.
Before answering the question, we
assume that the matrix, with N rows and columns, contains bytes;
the system has
bytes of available memory attached on each node;
the number of processors
is
; and the total number of
out-of-core filling is
, where
has to satify the inequality
When the column
block decomposition is used, each in-core fill is to fill a slab of matrix. The
size of the out-of-core slab
is restricted by
and
where is the number of columns for each node to fill at the
out-of-core fill. At the last out-of-core fill (or
),
the number of unfilled columns is
. The job to fill the
columns is distributed into
nodes as evenly as possible. Let
and
. A node fills the
columns if its index is less
, otherwise it fills
columns.
The worst load balancing is when some nodes have one column more to fill than
the other nodes.
Now, we can discuss our out-of-core fill algorithm since the data
decomposition is decided. Based on the application and the particular
system architecture, one supplies and
which is the
maximum number of columns that can be held in each node's memory. Let each node
go through a loop of number of out-of-core procedures (from 1 to
). Each node calculates
for the
out-of-core fill and sets
the global upper and lower bound. For example, the global lower bound is
one and upper bound is
for
; and the global lower bound is
and upper bound is
and so on. Each node
fills a portion of the matrix in the same way as the in-core fill algorithm.
However, each node does not pay any attention to the columns which fall
outside
the fill bound. After every node has completed the desired filling, a
global synchronous write command is issued to write the portion of the matrix
into a file. Then, each node examined goes back to the loop.
A pseudo code of the out-of-core matrix fill is shown
in Figure 4.12.
We implemented this algorithm and tested it on the CM5 for the
case. The results are summarized in Table
. It is seen that
there is relatively little difference between out-of-core and in-core fill
times.