NPAC Technical Report SCCS-271c
On the parallelization of blocked LU factorization algorithms for distributed memory architectures
Geoffrey Fox, Gaber Mohamed, Gregor von Laszewski, Manish Parashar
Submitted September 01 1992
Abstract
Our experimental results showed that block-based algorithms
for numerically intensive applications are superior to their non-block
counterparts (SCCS94b). It is desirable to parallelize block-based
algorithms on distributed memory MIMD architectures, since many
scientific and engineering applications make use of these algorithms.
Our goal is to optimize sample applications from LAPACK, develop them
in Fortran 77D and Fortran 90D, and have them available as a scalable
compiler library. In this study, we show ways to parallelize
sequential block algorithms for LU factorization. The goal of
this paper is twofold.
On the one hand, since these algorithms are difficult to parallelize, they
will be included in a benchmarking suite for the Fortran 90D project.
We point out problems inherent in the sequential nature of the
block-based algorithms. We learn that it is not intuitively clear which
algorithm might perform best on a distributed memory architecture. The
problems described here will help to improve the design of a
source-to-source compiler applied to numerically intensive applications.
On the other hand, experiments performed on the iPSC hypercube show
which parallel block algorithm should be used, depending on the number
of available processors, the matrix size, and the block size. Three
algorithms for column-oriented Fortran are compared.
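
For orientation, the kind of sequential block algorithm referred to above can be sketched as follows. This is a minimal illustration, not necessarily one of the three algorithms compared in this report: it assumes the right-looking variant, omits pivoting (so every pivot must be nonzero), and is written in Python rather than Fortran for brevity. The names blocked_lu and nb are hypothetical.

```python
def blocked_lu(a, nb):
    """Right-looking blocked LU factorization without pivoting (sketch).

    a  : square matrix as a list of rows (not modified)
    nb : block size (hypothetical parameter name)
    Returns a new matrix holding U on and above the diagonal and the
    strictly lower part of the unit lower triangular factor L below it.
    """
    n = len(a)
    a = [row[:] for row in a]  # work on a copy
    for k in range(0, n, nb):
        e = min(k + nb, n)  # the current panel covers columns k..e-1
        # Step 1: factor the panel (rows k..n-1, columns k..e-1)
        # with ordinary unblocked elimination.
        for j in range(k, e):
            for i in range(j + 1, n):
                a[i][j] /= a[j][j]  # assumes a nonzero pivot
                for c in range(j + 1, e):
                    a[i][c] -= a[i][j] * a[j][c]
        # Step 2: block row of U, U12 = inverse(L11) * A12,
        # by forward substitution with the unit lower triangular L11.
        for j in range(k, e):
            for i in range(j + 1, e):
                for c in range(e, n):
                    a[i][c] -= a[i][j] * a[j][c]
        # Step 3: trailing submatrix update, A22 = A22 - L21 * U12.
        # This matrix-matrix product dominates the work.
        for i in range(e, n):
            for c in range(e, n):
                s = 0.0
                for j in range(k, e):
                    s += a[i][j] * a[j][c]
                a[i][c] -= s
    return a


if __name__ == "__main__":
    a = [[2.0, 1.0, 1.0, 0.0],
         [4.0, 3.0, 4.0, 1.0],
         [6.0, 4.0, 8.0, 2.0],
         [2.0, 3.0, 8.0, 5.0]]
    for row in blocked_lu(a, 2):
        print(row)
```

In this sketch each block step depends on the trailing update of the previous one, which illustrates the sequential nature mentioned above. The block size nb changes only the order of the operations, not the computed factors.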