Solving sparse linear systems practically dominates scientific computing, but the performance of direct sparse matrix solvers has tended to trail that of their dense matrix counterparts [20]. Parallel sparse matrix solvers generally perform worse than comparable dense matrix solvers, even though sparse matrix algorithms have more inherent parallelism than dense matrix algorithms. This additional parallelism is often described by elimination trees [14,15,16,20,37,38,39,40,41,45], graphs that illustrate the dependencies in the calculations. Parallel sparse linear solvers can simultaneously factor entire groups of mutually independent contiguous blocks of columns or rows without communications, whereas dense linear solvers can update only a single block of contiguous columns or rows in each pipelined communication cycle. The limited success with efficient sparse matrix solvers is not surprising, because general sparse linear solvers require more complicated data structures and algorithms that must contend with irregular memory reference patterns. The irregular nature of many real-world sparse matrices has aggravated the task of implementing scalable sparse matrix solvers on vector or parallel architectures: efficient, scalable algorithms for these classes of machines require regularity both in available data vector lengths and in interprocessor communication patterns [8,18,32].
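To make the elimination-tree notion concrete, the following is a minimal sketch of the standard construction (path compression via an ancestor array, as in Liu's algorithm): column j's parent is the smallest k > j whose factor row touches j. The function name `etree` and the input format (a map from each row index k to the column indices i < k of its nonzeros) are illustrative choices, not from the source. Columns that end up as siblings in the tree are mutually independent and can be factored in parallel, which is the parallelism the paragraph above describes.

```python
def etree(n, nz):
    """Compute the elimination tree of an n-by-n sparse symmetric matrix.

    nz[k] lists the column indices i < k where row k of the (lower-
    triangular part of the) matrix has a nonzero. Returns parent[],
    where parent[j] = -1 marks a root. Siblings in this tree have no
    dependency between them and may be factored concurrently.
    """
    parent = [-1] * n    # parent[j]: first row above j that depends on j
    ancestor = [-1] * n  # path-compressed shortcut up the partial tree
    for k in range(n):
        for i in nz.get(k, []):
            # Walk from i toward the current root, compressing the path
            # so later walks are short; stop if we reach k's subtree.
            r = i
            while ancestor[r] != -1 and ancestor[r] != k:
                nxt = ancestor[r]
                ancestor[r] = k
                r = nxt
            if ancestor[r] == -1:
                ancestor[r] = k
                parent[r] = k
    return parent

# An "arrowhead" matrix (row 3 coupled to all earlier rows) yields a
# flat tree: columns 0, 1, 2 are siblings, so all three can be
# eliminated simultaneously before column 3.
print(etree(4, {3: [0, 1, 2]}))   # flat tree, wide parallelism
# A tridiagonal matrix yields a chain: no parallelism across columns.
print(etree(4, {1: [0], 2: [1], 3: [2]}))
```

The contrast between the two examples mirrors the point above: the tree's shape, not the solver, bounds how many columns can be factored without communication.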
We have focused on developing parallel linear solvers optimized for sparse matrices from the power systems community --- in particular, we have examined linear solvers for matrices arising from power distribution system networks. These matrices are among the sparsest encountered in real-life applications, and they are also irregular. Recently, [18,22] have reported scalable Choleski solvers, but these target matrices that have more rows/columns, have more nonzero elements per row/column, and are more regular than power systems matrices. When the scalability of sparse linear solvers is examined using real, irregular sparse matrices, the available parallelism in the sparse matrix and load-imbalance overhead can be as much the cause of poor parallel efficiency as the parallel algorithm or its implementation [26,32].