Comparison to Amdahl's Law

These parallel algorithm implementations each have a substantial sequential portion. Amdahl's law [16] describes the theoretical limit that the sequential portion of an algorithm places on parallel performance. After reviewing the definition, we compare Amdahl's-law-based performance estimates to the measured performance of parallel block-diagonal-bordered sparse matrix factorization.

Definition (Amdahl's Law). Given $T_1$, the time to solve a problem on a single processor, then $T_P$, the time to solve the problem on $P$ processors, can be parameterized in $P$ by

$$T_P = \left( s + \frac{1 - s}{P} \right) T_1 ,$$

where $s$ is the inherently sequential fraction of the computations. This estimate of $T_P$ can be used to estimate the relative speedup [21] by

$$S_P = \frac{1}{s + \frac{1 - s}{P}} ,$$

where relative speedup is defined as

$$S_P = \frac{T_1}{T_P} .$$
Amdahl's Law can be used to estimate the maximum potential relative speedup by taking the inverse of the sequential fraction: as $P \rightarrow \infty$, $S_P \rightarrow 1/s$.
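Because the bound is a simple function of $s$ and $P$, it is easy to tabulate. The short C program below is a sketch added for illustration, not part of the original test codes; the function name amdahl_speedup is ours, and the value $s = 1/9$ anticipates the sequential fraction derived for the 32-block test matrix below. The final line prints the asymptotic limit $1/s$.

    #include <stdio.h>

    /* Speedup predicted by Amdahl's law for a computation whose
     * inherently sequential fraction is s, run on p processors. */
    double amdahl_speedup(double s, int p)
    {
        return 1.0 / (s + (1.0 - s) / (double) p);
    }

    int main(void)
    {
        double s = 1.0 / 9.0;  /* sequential fraction of the 32-block test matrix */
        int p;

        for (p = 1; p <= 32; p *= 2)
            printf("P = %2d   bound on speedup = %4.2f\n", p, amdahl_speedup(s, p));
        printf("P -> oo  bound on speedup = %4.2f\n", 1.0 / s);
        return 0;
    }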

It can be shown that the amount of processing required for the last diagonal block in the test matrices is approximately four times the processing required for each independent block. Independent diagonal blocks are divided into four sections, while the last block is divided into only two sections; the computational complexity of each matrix subsection depends on the dimensions of the corresponding submatrix. For the test matrix with 32 independent blocks, the sequential workload is therefore approximately

$$s \approx \frac{4}{32 + 4} = \frac{1}{9} \approx 11\% .$$
According to Amdahl's law, a task with $1/9$ (approximately $11\%$) of the operations being sequential limits speedup to no more than 9, regardless of the number of processors applied to the problem. The speedup obtained for parallel block-diagonal-bordered sparse matrix factorization for this matrix was 5.4, which is consistent with Amdahl's law because the measured speedup does not exceed this bound. For the largest matrix reported in this paper, the sequential portion of the processing would be only a small fraction of the total operations; consequently, according to Amdahl's law, speedup would not be limited by sequential calculations on a 32-processor CM-5. Additional computations in block-diagonal-bordered sparse matrix algorithms must nevertheless be performed in parallel for performance to be scalable on smaller matrices.
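Evaluating the finite-$P$ form of the bound at $P = 32$ with $s = 1/9$ makes the comparison concrete:

$$S_{32} \;\le\; \frac{1}{\frac{1}{9} + \frac{1 - 1/9}{32}} \;=\; \frac{1}{\frac{4}{36} + \frac{1}{36}} \;=\; \frac{36}{5} \;=\; 7.2 ,$$

so the measured speedup of 5.4 falls below both the 32-processor bound of 7.2 and the asymptotic bound of 9.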

While the parallel block-diagonal-bordered matrix algorithms described in this report perform a significant amount of sequential calculation in the last block, other versions of these algorithms are possible that significantly increase the amount of parallel work when processing this section of the matrix. In spite of the large amount of sequential calculation in the last block, this simple test implementation shows that parallel factorization performance is very promising. Nevertheless, additional operations in future algorithms must be performed in parallel for those algorithms to be scalable. Depending on the characteristics of the data, it may be possible to factor the last block using proven parallel dense matrix techniques. Meanwhile, state-of-the-art communications techniques can be employed to improve the communications performance of all three parallel algorithms: the same active message techniques that may improve parallel factorization performance may also reduce the bottlenecks in parallel forward reduction, and active messages may likewise improve the broadcast of $x$-vector values in the backward substitution phase.





