NPAC Technical Report SCCS-271c

On the parallelization of blocked LU factorization algorithms for distributed memory architectures

Geoffrey Fox, Gaber Mohamed, Gregor von Laszewski, Manish Parashar

Submitted September 1, 1992

Abstract

Our experimental results showed that block-based algorithms for numerically intensive applications are superior to their non-blocked counterparts (SCCS94b). Because many scientific and engineering applications rely on these algorithms, it is desirable to parallelize them on distributed memory MIMD architectures. Our goal is to optimize sample applications from LAPACK, develop them in Fortran 77D and Fortran 90D, and make them available as a scalable compiler library. In this study, we show ways to parallelize sequential block algorithms for the LU factorization. The purpose of this paper is twofold. On the one hand, since these algorithms are difficult to parallelize, they will be included in a benchmarking suite for the Fortran 90D project. We point out problems inherent in the sequential nature of the block-based algorithms, and we find that it is not intuitively clear which algorithm will perform best on a distributed memory architecture. The problems described here will help to improve the design of a source-to-source compiler for numerically intensive applications. On the other hand, experiments on the iPSC hypercube show which parallel block algorithm should be used depending on the number of available processors, the matrix size, and the block size. Three algorithms for column-oriented Fortran are compared.