These parallel algorithm implementations each have a substantial sequential portion. Amdahl's law [16] describes theoretical limits on parallel performance due to the sequential portion of the algorithm. After reviewing the definition, Amdahl's law-based performance estimates are compared to measured performance for parallel block-diagonal-bordered sparse matrix factorization.
Definition (Amdahl's Law) Given $T(1)$, the time to solve a problem on a single processor, then $T(P)$, the time to solve a problem on $P$ processors, can be parameterized in the sequential fraction $s$ by
$$ T(P) = s \, T(1) + \frac{(1 - s) \, T(1)}{P} , $$
where $s$ is the inherently sequential fraction of the computations. The aforementioned estimate of $T(P)$ can be used when estimating the relative speedup $S(P)$ [21] by
$$ S(P) = \frac{1}{s + \frac{1 - s}{P}} , $$
where relative speedup is defined as
$$ S(P) = \frac{T(1)}{T(P)} . $$
Amdahl's Law can be used to estimate the maximum potential relative speedup by taking the inverse of the sequential portion of the parallel problem: as $P \rightarrow \infty$, $S(P) \rightarrow 1/s$.
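As a concrete illustration of the definition above, the short sketch below evaluates $S(P)$ and the bound $1/s$. It is plain Python, and the function and variable names (amdahl_speedup, amdahl_bound, seq_fraction) are illustrative only; they are not taken from any implementation described in this report.
\begin{verbatim}
def amdahl_speedup(seq_fraction, num_procs):
    """Relative speedup S(P) = 1 / (s + (1 - s)/P) predicted by Amdahl's law."""
    return 1.0 / (seq_fraction + (1.0 - seq_fraction) / num_procs)

def amdahl_bound(seq_fraction):
    """Maximum potential speedup: the limit of S(P) as P grows, i.e. 1/s."""
    return 1.0 / seq_fraction

if __name__ == "__main__":
    s = 1.0 / 9.0                      # roughly an 11% sequential fraction
    for p in (2, 4, 8, 16, 32):
        print("P = %2d: S(P) = %.2f" % (p, amdahl_speedup(s, p)))
    print("Upper bound 1/s = %.1f" % amdahl_bound(s))
\end{verbatim}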
It can be shown that the amount of processing required for the last diagonal block in the test matrices is approximately four times the processing required for each independent block. Independent diagonal blocks are divided into four sections, while the last block is divided into only two sections. Taking each independent block to contribute approximately the same computational workload, the sequential workload for the test matrix with 32 independent blocks is approximately
$$ \frac{4}{32 + 4} = \frac{1}{9} \approx 11\% $$
of the total operations.
According to Amdahl's law, a task with approximately 11% of the operations
being sequential limits speedup to no more than 9, regardless of the
number of processors applied to the problem. The speedup obtained for
parallel block-diagonal-bordered sparse matrix factorization for
this matrix was 5.4, which is consistent with Amdahl's law because the
measured speedup does not exceed the Amdahl's law bound of 9 on
speedup. For the largest matrix reported in this paper, the
sequential portion of the processing would be a small enough fraction of
the total operations (less than $1/32 \approx 3\%$) that, according to Amdahl's law,
speedup would not be limited by sequential calculations for a 32
processor CM-5. Additional computations in block-diagonal-bordered
sparse matrix algorithms must be performed in parallel for
performance to be scalable for smaller matrices.
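The arithmetic behind this estimate can be summarized in a few lines. The following sketch is plain Python; the block counts, the 4:1 workload ratio, and the measured speedup of 5.4 are those quoted above, while the variable names are illustrative.
\begin{verbatim}
# Amdahl's-law estimate for the test matrix with 32 independent blocks,
# where the last diagonal block requires roughly 4x the work of one
# independent block.
independent_blocks = 32
last_block_work = 4.0                   # in units of one independent block
total_work = independent_blocks * 1.0 + last_block_work

seq_fraction = last_block_work / total_work         # ~1/9, i.e. ~11%
max_speedup = 1.0 / seq_fraction                     # Amdahl bound: ~9
estimate_32 = 1.0 / (seq_fraction + (1.0 - seq_fraction) / 32)
measured = 5.4                                       # reported speedup

print("sequential fraction: %.3f" % seq_fraction)    # 0.111
print("Amdahl upper bound : %.1f" % max_speedup)     # 9.0
print("S(32) estimate     : %.1f" % estimate_32)     # 7.2
print("measured speedup   : %.1f" % measured)        # 5.4, below the bound
\end{verbatim}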
While the parallel block-diagonal-bordered matrix algorithms described in this report had significant amounts of sequential calculations in the last block, other versions of these algorithms are possible that significantly increase the amount of parallel operations when processing this section of the matrix. In spite of the large amount of sequential calculations in the last block, this simple test implementation illustrates that parallel factorization performance is very promising. Nevertheless, additional operations in future algorithms must be performed in parallel so that those algorithms are scalable. Depending on the characteristics of the data, it may be possible to factor the last block using proven parallel dense matrix techniques. Meanwhile, state-of-the-art communications techniques can also be employed to improve the communications performance of all three parallel algorithms. The same active message communications techniques that may improve parallel factorization performance may also reduce the bottlenecks in parallel forward reduction, and active messages may likewise improve the broadcast of x vector values in the backward substitution phase.
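To make the scalability argument concrete, the sketch below models what speedup on the 32-block test matrix could look like if the last block were also processed in parallel with some assumed efficiency. This is a hypothetical what-if model in plain Python, not a description of the implemented algorithms; the last_block_efficiency parameter is purely illustrative.
\begin{verbatim}
# Hypothetical model: speedup on the 32-block test matrix when the last
# (border) block is also processed in parallel.  The 4:1 workload ratio is
# taken from the discussion above; last_block_efficiency is an assumed,
# illustrative parameter (0.0 reproduces the sequential last-block case).
def modeled_speedup(procs, last_block_efficiency):
    independent_work = 32.0            # one unit per independent block
    last_block_work = 4.0
    total_work = independent_work + last_block_work
    parallel_time = independent_work / procs
    if last_block_efficiency > 0.0:
        parallel_time += last_block_work / (procs * last_block_efficiency)
    else:
        parallel_time += last_block_work   # last block stays sequential
    return total_work / parallel_time

for eff in (0.0, 0.25, 0.5, 1.0):
    print("efficiency %.2f: S(32) = %.1f" % (eff, modeled_speedup(32, eff)))
\end{verbatim}
Under this model, even modest parallel efficiency in the last block raises the attainable speedup well above the sequential-last-block figure, which is consistent with the observation that additional operations must be performed in parallel for these algorithms to scale.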