
Parallel Block-Diagonal-Bordered Gauss-Seidel

Implementations of the parallel block-diagonal-bordered sparse Gauss-Seidel method have been developed in the C programming language for the Thinking Machines CM-5 multi-processor using a host-node paradigm with explicit message passing. Two versions of the parallel block-diagonal-bordered iterative algorithm have been implemented: one implementation uses low-latency active messages to update the last diagonal block with data from the lower border and to distribute values of x calculated in the last diagonal block. The second implementation uses conventional, higher-latency non-blocking, buffered interprocessor communications. The programming paradigms for these two implementations differ significantly.

The programming paradigm we used with active messages is to calculate a vector-vector product when updating the last diagonal block and immediately send the value to the processor holding the corresponding value of x. Likewise, when values of x are calculated in the last diagonal block, they are sent immediately, but only to those processors that require them. The programming paradigm we used with buffered communications is to perform the vector-vector products, store the results in separate buffers, and then have each processor send its buffers to all other processors. Likewise, when values of x are calculated, they are stored in a buffer at each processor and sent to all other processors. The active message programming paradigm greatly simplified development of the algorithm, and the empirical results presented in section 7.2 show that (not unexpectedly) the low-latency, active-message-based implementation is significantly faster.
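
To make the contrast concrete, the sketch below illustrates the two communication paradigms for the lower-border updates. It is a minimal sketch in C that uses MPI point-to-point sends and a collective gather as stand-ins for the CM-5 active-message and buffered communication layers; the data type and function names (update_t, send_update_active, send_updates_buffered) are illustrative assumptions, not taken from the original implementation.

    /* Sketch of the two border-update paradigms, with MPI standing in for
     * the CM-5 active-message and buffered communication layers.  All
     * names here are illustrative, not from the original code.          */
    #include <mpi.h>

    typedef struct { int row; double value; } update_t;  /* (row index, partial sum) */

    /* Active-message style: ship each partial sum as soon as it is computed,
     * directly to the processor that owns the corresponding row of the last
     * diagonal block.                                                        */
    void send_update_active(int row, double partial, int owner)
    {
        update_t u = { row, partial };
        MPI_Send(&u, (int) sizeof u, MPI_BYTE, owner, 1, MPI_COMM_WORLD);
    }

    /* Buffered style: accumulate the partial sums locally, then exchange the
     * complete buffers with every other processor in one collective step.   */
    void send_updates_buffered(const update_t *mine, int nmine,
                               update_t *all, const int *counts, const int *displs,
                               MPI_Datatype update_type)
    {
        MPI_Allgatherv(mine, nmine, update_type,
                       all, counts, displs, update_type, MPI_COMM_WORLD);
    }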

The parallel block-diagonal-bordered sparse Gauss-Seidel algorithm can be broken into three component parts, as defined in the derivation of available parallelism in the earlier chapter:

  1. solve for x in the diagonal blocks,
  2. update the last diagonal block by forming the matrix-vector products in parallel,
  3. solve for x in the last diagonal block.
Pseudo-code representations of each parallel algorithm section are presented separately in figures in the appendix. In particular, these figures correspond to the following algorithm sections:
  1. monitor convergence for the parallel Gauss-Seidel method,
  2. solve for x in the diagonal blocks and upper border,
  3. update the last diagonal block,
  4. solve for x in the last diagonal block,
  5. perform the convergence check.
The first of these figures provides the framework for the parallel block-diagonal-bordered sparse Gauss-Seidel implementation. In this implementation, multiple iterations are performed before convergence is checked: a convergence check has computational complexity nearly as great as solving for x itself, so the algorithm performs a predefined number of iterations between convergence checks. To better understand the algorithm, timing statistics were collected for each portion of the algorithm.
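
As an illustration of this framework, the following minimal C outline shows the outer loop performing a fixed number of iterations between convergence checks. The phase routines and constants (solve_diagonal_blocks, update_last_block, solve_last_block, check_convergence, ITERS_PER_CHECK) are hypothetical names introduced here for illustration; they stand in for the algorithm sections listed above rather than reproducing the original pseudo-code.

    /* Skeleton of the iteration framework: a predefined number of
     * Gauss-Seidel iterations is performed between (relatively expensive)
     * convergence checks.  All names are illustrative.                   */
    #include <math.h>

    #define ITERS_PER_CHECK 5       /* iterations between convergence checks */
    #define MAX_ITERS       500
    #define TOLERANCE       1.0e-6

    /* Hypothetical phase routines for one block-diagonal-bordered iteration. */
    void   solve_diagonal_blocks(double *x);   /* phase 1: diagonal blocks and upper border */
    void   update_last_block(const double *x); /* phase 2: lower-border partial sums        */
    void   solve_last_block(double *x);        /* phase 3: last diagonal block              */
    double check_convergence(const double *x); /* returns a norm of the latest update       */

    void parallel_gauss_seidel(double *x)
    {
        int iters = 0;
        double norm = INFINITY;

        while (iters < MAX_ITERS && norm > TOLERANCE) {
            /* Perform several iterations before paying for a convergence check. */
            for (int k = 0; k < ITERS_PER_CHECK; k++, iters++) {
                solve_diagonal_blocks(x);
                update_last_block(x);
                solve_last_block(x);
            }
            norm = check_convergence(x);
        }
    }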

A version of the software is available that runs on a single processor of the CM-5 to provide empirical speed-up data with which to quantify multi-processor performance. This sequential software simply places all data into one diagonal block without a border, and it includes the capability to gather convergence-rate data. The parallel implementation has been developed as an instrumented proof-of-concept to examine the efficiency of each section of the code described above. The host processor is used to gather and tabulate statistics on the multi-processor calculations. Statistics are gathered at synchronization points, so there is no impact on the total empirical measures of performance. Empirical performance data are presented in section 7.2 for varying numbers of processors solving sparse network matrices from real power systems.

The algorithm section that updates the last diagonal block calculates a sparse-matrix/dense-vector product by forming individual sparse-vector/dense-vector products for the rows of the lower border. These partial sums must be distributed to the processor holding the respective row data in the last diagonal block. Separate sparse-vector/dense-vector products are performed for each block of data on a processor. Only nonzero rows in the lower border are used when calculating the vector-vector products that generate the partial sums required to update the values of x in the last diagonal block. Examining only nonzero values significantly limits the amount of calculation in this phase.
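
The sketch below illustrates this phase under the assumption that the lower-border rows held on a processor are stored in a CSR-like structure: each nonzero border row contributes one sparse-vector/dense-vector product, and the resulting partial sum is shipped directly to the processor that owns the corresponding row of the last diagonal block. The structure border_block_t, the owner map owner_of, and the reuse of send_update_active from the earlier sketch are all hypothetical.

    /* Sketch of the lower-border update phase, assuming the nonzero border
     * rows held locally are stored in a CSR-like structure.  The structure
     * and names are illustrative only.                                     */
    typedef struct {
        int     nrows;     /* number of nonzero lower-border rows held locally   */
        int    *row_id;    /* global row index in the last diagonal block        */
        int    *row_ptr;   /* start of each row's nonzeros; row_ptr[nrows] = nnz */
        int    *col_idx;   /* column indices of the nonzeros                     */
        double *val;       /* nonzero values                                     */
    } border_block_t;

    extern int  owner_of(int global_row);                         /* hypothetical owner map  */
    extern void send_update_active(int row, double v, int owner); /* from the earlier sketch */

    /* Form one sparse-vector/dense-vector product per nonzero border row and
     * immediately send the partial sum to the owner of that row in the last
     * diagonal block (no local consolidation of partial sums).              */
    void update_last_block_from_border(const border_block_t *b, const double *x)
    {
        for (int r = 0; r < b->nrows; r++) {
            double partial = 0.0;
            for (int k = b->row_ptr[r]; k < b->row_ptr[r + 1]; k++)
                partial += b->val[k] * x[b->col_idx[k]];

            send_update_active(b->row_id[r], partial, owner_of(b->row_id[r]));
        }
    }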

There has been no attempt at a parallel reduction of the partial sums of the border updates. In the process of developing an implementation with optimal performance, we discovered that any attempt to consolidate updates to a value in the last diagonal block caused more overhead than simply sending multiple update values to the same processor: more work is required to sum the update data than to calculate the sparse-vector/dense-vector products themselves.

In the parallel Gauss-Seidel, differences in programming paradigms resulting from the different interprocessor communications capabilities are nowhere more significant than when calculating new values of x in the last diagonal block. With active message communications, only those communications that are required are performed. Lists of the processors that require a particular value of x are calculated a priori and used repeatedly. Empirical data presented in section 7.2 clearly illustrate the limited growth in the number of communication messages as the number of processors increases. When using buffered communications, the programming paradigm is to calculate all values of x on a processor within a color and then broadcast those values to all other processors. Development testing showed that this was the most efficient way to implement the algorithm for the buffered communications paradigm. There is little to be gained by attempting to determine a subset of destination processors: in almost all distributions of data to processors, one or more of the multiple values in a buffered message was required on every processor. As a result, every processor must broadcast data to all other processors after every color. While some communications overhead could be saved by limiting the data sent to each processor, thereby shortening the messages, it was determined that too much overhead would be required to sort the data and produce separate communication buffers for each destination processor.
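
For the buffered paradigm, the following sketch shows one way the per-color exchange could look: every processor computes its values of x for the current color, then all processors exchange their complete buffers in a single collective step. MPI_Allgatherv is used here purely as a stand-in for the CM-5 buffered communication calls, and the names (xval_t, exchange_color_values) are hypothetical.

    /* Sketch of the buffered, per-color exchange: each processor computes all
     * of its x values for the current color, then every processor broadcasts
     * its buffer to all others.  MPI stands in for the CM-5 communication
     * library; all names are illustrative.                                   */
    #include <mpi.h>
    #include <stdlib.h>

    typedef struct { int index; double value; } xval_t;  /* (global index, x value) */

    /* Exchange the locally computed x values for one color with every other
     * processor; returns the number of values received into *recv_buf.      */
    int exchange_color_values(const xval_t *local, int nlocal,
                              xval_t **recv_buf, MPI_Datatype xval_type)
    {
        int nprocs;
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        int *counts = malloc(nprocs * sizeof *counts);
        int *displs = malloc(nprocs * sizeof *displs);

        /* First learn how many values each processor will contribute ...      */
        MPI_Allgather(&nlocal, 1, MPI_INT, counts, 1, MPI_INT, MPI_COMM_WORLD);

        int total = 0;
        for (int p = 0; p < nprocs; p++) {
            displs[p] = total;
            total += counts[p];
        }

        /* ... then exchange the complete buffers in one collective operation. */
        *recv_buf = malloc(total * sizeof **recv_buf);
        MPI_Allgatherv(local, nlocal, xval_type,
                       *recv_buf, counts, displs, xval_type, MPI_COMM_WORLD);

        free(counts);
        free(displs);
        return total;
    }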





