Implementations for the parallel block-diagonal-bordered sparse
Gauss-Seidel method have been developed in the C programming language
for the Thinking Machines CM-5 multi-processor using a host-node
paradigm with explicit message passing. Two versions of the parallel
block-diagonal-bordered iterative algorithm have been implemented: one
implementation uses low-latency active messages both to update values of $x$
in the last diagonal block using data in the lower border and to distribute
values of $x$ from the last diagonal block. The second implementation uses
conventional, higher-latency, non-blocking, buffered interprocessor
communications. The programming paradigms for these two implementations
differ significantly.
The programming paradigm we used with active messages is to calculate
a vector $\times$ vector product when updating the last diagonal block
and to send the resulting value immediately to the processor holding the
corresponding value of $x$. Likewise, when values of $x$ are calculated,
they are sent immediately, but only to those processors that require them.
The programming paradigm we used with buffered communications is to perform
the vector $\times$ vector products, store the results in separate buffers,
and then have each processor send its buffers to all other processors.
Likewise, when values of $x$ are calculated, they are stored in a buffer at
each processor and sent to all other processors. The
active message communications programming paradigm greatly simplified
development of the algorithm, and the empirical results, presented in
section 7.2, show that (not unexpectedly) the
low-latency, active message-based implementation is significantly
faster.
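
The distinction between the two paradigms can be pictured with a short sketch. The names below, including compute_partial(), am_send_update(), and send_buffer(), are illustrative placeholders standing in for the sparse vector $\times$ dense vector kernel and the CM-5 active-message and buffered communications primitives; they are not the actual interfaces used in the implementation.

```c
/* Sketch of the two communication paradigms for border updates.
 * All names are hypothetical placeholders, not the CM-5 library's
 * actual interfaces: am_send_update() stands for a low-latency
 * active-message send and send_buffer() for a non-blocking,
 * buffered send. */

extern double compute_partial(int i, const double *x); /* sparse vector x dense vector */
extern int    owner_pe(int i);      /* processor holding the matching x value */
extern int    target_row(int i);    /* row in the last diagonal block it updates */
extern void   am_send_update(int pe, int row, double partial);
extern void   send_buffer(int pe, const int *rows, const double *vals, int n);

/* Active-message paradigm: send each value the moment it is computed,
 * directly to the processor that owns the matching x value. */
void border_update_active(int nrows, const double *x)
{
    for (int i = 0; i < nrows; i++)
        am_send_update(owner_pe(i), target_row(i), compute_partial(i, x));
}

/* Buffered paradigm: stage everything for a color in per-processor
 * buffers, then send the buffers once the local work is finished.
 * The caller supplies the buffers and zero-initialized counts. */
void border_update_buffered(int nrows, const double *x, int npes,
                            int **rows, double **vals, int *count)
{
    for (int i = 0; i < nrows; i++) {
        int pe = owner_pe(i);
        rows[pe][count[pe]] = target_row(i);
        vals[pe][count[pe]] = compute_partial(i, x);
        count[pe]++;
    }
    for (int pe = 0; pe < npes; pe++)
        send_buffer(pe, rows[pe], vals[pe], count[pe]);
}
```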
The parallel block-diagonal-bordered sparse Gauss-Seidel algorithm
can be broken into three component parts, as defined in the derivation
of available parallelism in chapter : solving for $x$ in the independent
diagonal blocks, updating the values of $x$ in the last diagonal block with
data from the lower border, and solving for $x$ in the last diagonal block.
A version of the software is available that runs on a single processor on the CM-5 to provide empirical speed-up data to quantify multi-processor performance. This sequential software simply places all data into one diagonal block without a border and includes the capability to gather convergence-rate data. The parallel implementation has been developed as an instrumented proof-of-concept to examine the efficiency of each section of the code described above. The host processor is used to gather and tabulate statistics on the multi-processor calculations. Statistics are gathered at synchronization points, so there is no impact on total empirical measures of performance. Empirical performance data is presented in section 7.2 for varied numbers of processors solving real power systems sparse network matrices.
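
The claim that instrumentation does not perturb the timings rests on gathering statistics only at points where the nodes are already synchronized. A minimal sketch of this arrangement is given below; node_timer(), node_barrier(), and send_stats_to_host(), as well as the phase names, are hypothetical placeholders rather than the actual CM-5 host-node routines.

```c
/* Sketch of per-node instrumentation that reports only at a
 * synchronization point.  All routine and phase names are
 * illustrative placeholders. */

extern double node_timer(void);                      /* elapsed seconds   */
extern void   node_barrier(void);                    /* all-node sync     */
extern void   send_stats_to_host(const double *t, int n);

enum { PHASE_DIAG, PHASE_BORDER, PHASE_LAST_BLOCK, NPHASES };
static double phase_time[NPHASES];

/* Time one phase of the iteration and accumulate its elapsed time. */
static void timed_phase(int phase, void (*body)(void))
{
    double t0 = node_timer();
    body();
    phase_time[phase] += node_timer() - t0;
}

/* At the end of an iteration every node is already at a barrier, so
 * shipping the accumulated times to the host adds nothing to the
 * region being measured. */
static void report_iteration_stats(void)
{
    node_barrier();
    send_stats_to_host(phase_time, NPHASES);
}
```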
The algorithm section that updates the values of $x$ in the last diagonal
block calculates a sparse matrix $\times$ dense vector product by
calculating individual sparse vector $\times$ dense vector products for the
lower border rows. These partial sums must be distributed to the proper
processor holding the respective row data in the last diagonal block.
Separate sparse vector $\times$ dense vector products are performed for each
block of data on a processor. Only nonzero rows in the lower border are used
when calculating vector $\times$ vector products to generate the partial
sums required to update the values of $x$. Examining only nonzero values
significantly limits the amount of calculation in this phase.
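
As a sketch, the border-update phase on one processor might be organized as follows, assuming each block carries a compact list of its nonzero lower-border rows; the data structures and am_send_update() are illustrative placeholders rather than the implementation's actual data layout.

```c
/* Sketch of the border update on one processor: a separate sparse
 * vector x dense vector product is formed for each block it holds,
 * and only border rows that contain nonzeros are ever visited. */

typedef struct {          /* one nonzero row of the lower border */
    int     owner_pe;     /* processor holding the x value this row updates */
    int     target_row;   /* row index within the last diagonal block */
    int     nnz;          /* nonzeros in this border row */
    int    *col;          /* column indices */
    double *val;          /* nonzero values */
} border_row_t;

typedef struct {          /* the border rows associated with one block */
    int           n_rows; /* nonzero border rows only */
    border_row_t *rows;
} border_block_t;

extern void am_send_update(int pe, int row, double partial);  /* placeholder */

void border_products(const border_block_t *blk, int nblocks, const double *x)
{
    for (int b = 0; b < nblocks; b++)
        for (int i = 0; i < blk[b].n_rows; i++) {
            const border_row_t *r = &blk[b].rows[i];
            double partial = 0.0;
            for (int k = 0; k < r->nnz; k++)     /* sparse dot product */
                partial += r->val[k] * x[r->col[k]];
            am_send_update(r->owner_pe, r->target_row, partial);
        }
}
```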
There has been no attempt at parallel reduction of the partial sums of
updates from the borders. In the process of developing an implementation
with optimal performance, we discovered that any attempt to consolidate
updates to a value in the last diagonal block caused more overhead than
simply sending multiple update values to the same processor: more work is
required to sum the update data than to calculate the sparse vector
$\times$ dense vector products.
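
The receiving side can therefore be pictured as nothing more than an accumulation: a hypothetical active-message handler such as the one sketched below simply adds each arriving partial sum into the local update vector, so several messages destined for the same row cost only a few additions.

```c
/* Illustrative handler invoked once per arriving partial sum; the name
 * and storage are placeholders, not the implementation's own. */

static double *x_update;   /* accumulated border updates for the locally
                              owned rows of the last diagonal block */

void recv_update_handler(int local_row, double partial)
{
    x_update[local_row] += partial;   /* just accumulate; no sender-side
                                         consolidation is attempted */
}
```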
In the parallel Gauss-Seidel, differences in the programming paradigms
resulting from the different interprocessor communications capabilities are
nowhere more significant than when calculating new values of $x$ in the last
diagonal block. With active message communications, only those
communications that are required are performed. Lists of the processors that
require a particular value of $x$ are calculated a priori and used
repeatedly. Empirical data presented in section 7.2 clearly illustrates the
limited growth in the number of communications messages as the number of
processors increases. When using buffered communications, the programming
paradigm is to calculate all values on a processor within a color and then
broadcast those values to all other processors.
Development testing showed that, for this communications paradigm,
broadcasting to all processors is the most efficient way to implement the
algorithm; there is little to be gained by attempting to determine a subset
of destination processors. In almost all distributions of data to
processors, one or more of the values in a buffered message is required on
every processor. As a result, every processor must broadcast data to all
other processors after every color. While some communications overhead could
be saved by limiting the data sent to each processor, thereby shortening the
messages, it was determined that too much overhead would be required to sort
the data and produce separate communications buffers for each destination
processor.
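
The two ways of distributing the values of $x$ computed within a color can be sketched as follows; the destination lists, am_send_x(), and broadcast_buffer() are illustrative placeholders, with the a priori lists standing in for the precomputed sets of processors that require each value.

```c
/* Sketch of distributing the x values computed within one color.
 * All names are illustrative placeholders. */

typedef struct {
    int  n_dest;      /* processors that actually need this x value */
    int *dest;        /* computed a priori and reused every iteration */
} dest_list_t;

extern void am_send_x(int pe, int global_idx, double value);            /* placeholder */
extern void broadcast_buffer(const int *idx, const double *val, int n); /* placeholder */

/* Active messages: each value goes only to the processors on its
 * precomputed destination list. */
void distribute_color_active(const int *idx, const double *xval, int n,
                             const dest_list_t *lists)
{
    for (int i = 0; i < n; i++)
        for (int d = 0; d < lists[idx[i]].n_dest; d++)
            am_send_x(lists[idx[i]].dest[d], idx[i], xval[i]);
}

/* Buffered communications: every value computed in the color is packed
 * into one buffer and broadcast to all other processors. */
void distribute_color_buffered(const int *idx, const double *xval, int n)
{
    broadcast_buffer(idx, xval, n);
}
```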