
LU Factor and Solve

This section presents a data-parallel implementation of the LU decomposition of the moment matrix and of the solution of the matrix equation.

As mentioned in Section 4.1, the factor/solve code is a completely independent program, linked to the rest of ParaMoM-MPP through an SDA file or a Data Vault file. CMSSL, a mathematical library on the CM-5 system, uses a data-parallel programming model. A program that calls a CMSSL function must be written in either CM Fortran or C-Star. These languages are extensions of the traditional sequential languages Fortran 77 and C, respectively.

The linear solver used for dense systems consists of Gaussian elimination (LU decomposition) followed by back substitution. For numerical stability we have chosen LU factorization with partial pivoting. The CMSSL LU routines combine blocking and load balancing. Blocking means that the routines operate on and transfer blocks of data rather than single data elements. Blocking reduces vector-vector operations and increases matrix-vector operations, which can give very high performance when the operations are local to a processing element. In ScaLAPACK, the partition is controlled by the user.

With a slab (contiguous block) decomposition, the elimination presents a load balancing problem: a processing element becomes idle as soon as all of its columns have been eliminated. CMSSL addresses this with cyclic ordering. Let the columns be processed in cyclic order (column 1 of the first column of processing elements, then column 1 of the second column of processing elements, and so on). Then, as columns are eliminated, the active subgrid on each processing element shrinks, but no processing element becomes completely inactive until the last column is eliminated from some column of processing elements. Cyclic ordering therefore increases the number of processing elements that are active at any time.
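The load balancing argument above can be checked with a small count. The following is a minimal sketch, not CMSSL code: it assumes a one-dimensional distribution of columns over `n_pe` columns of processing elements, and the function names are hypothetical. For each elimination step it counts how many processing element columns still own at least one uneliminated column.

```python
def owner_slab(col, n_cols, n_pe):
    """Owner of a global column when each PE column holds one contiguous slab."""
    slab = -(-n_cols // n_pe)  # ceiling division: columns per slab
    return col // slab


def owner_cyclic(col, n_cols, n_pe):
    """Owner of a global column when columns are dealt out round-robin."""
    return col % n_pe


def active_counts(owner, n_cols, n_pe):
    """For each step k, count the PE columns that still own a column >= k."""
    counts = []
    for k in range(n_cols):
        active = {owner(c, n_cols, n_pe) for c in range(k, n_cols)}
        counts.append(len(active))
    return counts


if __name__ == "__main__":
    n_cols, n_pe = 16, 4  # illustrative sizes only
    print("slab  :", active_counts(owner_slab, n_cols, n_pe))
    print("cyclic:", active_counts(owner_cyclic, n_cols, n_pe))
```

With the slab layout one processing element column drops out after every quarter of the elimination, whereas with the cyclic layout all four stay active until only the last few columns remain.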

The combination of these two strategies is called Block Cyclic Ordering. In Block Cyclic Ordering, a routine computes on a block of columns or rows as a unit: it performs a block update of a number of columns instead of a single column at a time. Thus, the load balancing scheme processes blocks, rather than single columns, in cyclic order. Assuming that the block size is chosen to be b, the LU factorization routine first eliminates columns 1 through b of processing element column 1, then columns 1 through b of processing element column 2, and so on.

After eliminating b columns from each column of processing elements, the routine returns to columns b+1 through 2b of processing element column 1, and so on.
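The block cyclic elimination order described above can be enumerated explicitly. This is an illustrative sketch only; the function name and the 1-based, inclusive indexing are assumptions made for readability, and the sketch ignores the row dimension.

```python
def block_cyclic_order(n_pe_cols, local_cols, b):
    """Return the elimination order as (pe_column, first_col, last_col) triples.

    n_pe_cols  -- number of columns of processing elements
    local_cols -- number of matrix columns held by each PE column
    b          -- block size
    Column indices are local to the PE column, 1-based and inclusive.
    """
    order = []
    for start in range(1, local_cols + 1, b):
        end = min(start + b - 1, local_cols)  # last block may be short
        for pe in range(1, n_pe_cols + 1):    # visit PE columns cyclically
            order.append((pe, start, end))
    return order


if __name__ == "__main__":
    for pe, first, last in block_cyclic_order(n_pe_cols=2, local_cols=4, b=2):
        print(f"PE column {pe}: local columns {first}..{last}")
```

For two processing element columns holding four columns each and b = 2, the order is columns 1..2 on PE column 1, columns 1..2 on PE column 2, then columns 3..4 on PE column 1, and finally columns 3..4 on PE column 2.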

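Letting a linear system be $AX = B$, the factorization $A = LU$ gives

$$LUX = B.$$

Let $Y = UX$, and then the forward elimination solves $LY = B$. Following the back substitution

$$UX = Y,$$

where $X$ and $B$ are $n \times m$ matrices, and $L$ and $U$ are the lower triangular and upper triangular matrices, respectively. When $m > 1$, the back substitution provides the opportunity to take advantage of the data-parallel paradigm in this implementation, since all $m$ right-hand sides can be processed at once.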
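A serial reference version of the factor and solve steps may help fix the notation. The sketch below is plain Python, not the CMSSL or ScaLAPACK routine; the names `lu_factor` and `lu_solve` are hypothetical. It factors A in place with partial pivoting, then performs forward elimination (L Y = P B) and back substitution (U X = Y) on all right-hand-side columns together, which is the loop over `c` that a data-parallel implementation would execute simultaneously.

```python
def lu_factor(a):
    """Factor a (list of rows) in place with partial pivoting.

    On return a holds U on and above the diagonal and the multipliers of the
    unit lower triangular L below it. Returns piv, where piv[i] is the
    original index of the row now in position i.
    """
    n = len(a)
    piv = list(range(n))
    for k in range(n):
        p = max(range(k, n), key=lambda i: abs(a[i][k]))  # pivot search
        if a[p][k] == 0:
            raise ValueError("matrix is singular")
        if p != k:
            a[k], a[p] = a[p], a[k]
            piv[k], piv[p] = piv[p], piv[k]
        for i in range(k + 1, n):
            a[i][k] /= a[k][k]          # multiplier, stored in place of L
            m = a[i][k]
            for j in range(k + 1, n):
                a[i][j] -= m * a[k][j]  # update trailing submatrix
    return piv


def lu_solve(lu, piv, b):
    """Solve A X = B for every column of b (an n x m list of rows)."""
    n, m = len(lu), len(b[0])
    x = [list(b[piv[i]]) for i in range(n)]  # apply the row permutation
    for i in range(n):                       # forward elimination: L Y = P B
        for k in range(i):
            for c in range(m):
                x[i][c] -= lu[i][k] * x[k][c]
    for i in range(n - 1, -1, -1):           # back substitution: U X = Y
        for k in range(i + 1, n):
            for c in range(m):
                x[i][c] -= lu[i][k] * x[k][c]
        for c in range(m):
            x[i][c] /= lu[i][i]
    return x


if __name__ == "__main__":
    a = [[0.0, 2.0, 1.0], [1.0, 1.0, 0.0], [2.0, 0.0, 1.0]]
    b = [[3.0, 1.0], [2.0, 0.0], [3.0, 2.0]]  # two right-hand sides
    lu = [row[:] for row in a]                # keep a for checking
    print(lu_solve(lu, lu_factor(lu), b))
```

The entries may be complex as well as real, since only `abs`, multiplication, subtraction and division are used.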

The CM Fortran Utility Library's parallel I/O is used to open an SDA or Data Vault file and read in the moment matrix and the right-hand side vectors. The `SO' I/O mode function is used (see [75]). Although the `SO' I/O mode is slower than the `FMS' I/O mode, it provides the most portability across CM configurations and execution models. The LU routine is called after the moment matrix and the right-hand side vectors have been read in. The lower and upper triangular factors overwrite the moment matrix to save memory. Back substitution is then applied to obtain the solution vector or vectors. Once again, the `SO' I/O mode is used to write the solution vectors to a storage device. The LU/SOLVE program reports the Mflops achieved for the LU decomposition and for the back substitution separately.
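The source does not state how the reported Mflops figures are computed. As a rough sketch under assumed conventions, the usual leading-order operation counts can be used: about 2n^3/3 real floating-point operations for the LU factorization and about 2n^2 per right-hand side for the two triangular solves, each multiplied by 4 for complex data (a complex multiply-add costs 8 real operations instead of 2). The function names are hypothetical.

```python
def lu_flops(n, complex_data=True):
    """Leading-order flop count for LU factorization of an n x n matrix."""
    real = 2.0 * n**3 / 3.0
    return 4.0 * real if complex_data else real


def solve_flops(n, nrhs, complex_data=True):
    """Leading-order flop count for forward plus back substitution."""
    real = 2.0 * n**2 * nrhs
    return 4.0 * real if complex_data else real


def mflops(flops, seconds):
    """Convert an operation count and a wall-clock time into Mflops."""
    return flops / seconds / 1.0e6


if __name__ == "__main__":
    n, nrhs = 300, 2                   # illustrative sizes only
    t_factor, t_solve = 2.0, 0.5       # placeholder timings in seconds
    print("LU    Mflops:", mflops(lu_flops(n), t_factor))
    print("solve Mflops:", mflops(solve_flops(n, nrhs), t_solve))
```

The timings in the usage example are placeholders, not measurements from any machine discussed here.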

As mentioned in the previous section, ScaLAPACK is used to factor the moment matrix and to perform the back substitution on the Intel machines and the IBM SP-1. The performance depends strongly on the data partition. Achieving good node-level performance on the IBM SP-1 is relatively easy, because each node is a general-purpose RISC processor and there are no node-level vectorization issues.
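To illustrate why the data partition matters, the sketch below counts how many rows or columns of one matrix dimension land on each process under a block cyclic distribution. It is a plain Python illustration with a hypothetical function name, and it assumes the first block is assigned to process 0; ScaLAPACK's NUMROC tool routine performs a calculation of this kind.

```python
def local_count(n, nb, iproc, nprocs):
    """Rows or columns owned by process iproc (0-based).

    n      -- global extent of the dimension
    nb     -- block size
    nprocs -- number of processes along this dimension
    Blocks of size nb are dealt out cyclically, starting at process 0.
    """
    nblocks = n // nb                    # number of full blocks
    count = (nblocks // nprocs) * nb     # full rounds received by everyone
    extra = nblocks % nprocs             # full blocks left after the rounds
    if iproc < extra:
        count += nb                      # one more full block
    elif iproc == extra:
        count += n % nb                  # the trailing partial block, if any
    return count


if __name__ == "__main__":
    n, nprocs = 1000, 8                  # illustrative sizes only
    for nb in (1, 16, 125):
        print(nb, [local_count(n, nb, p, nprocs) for p in range(nprocs)])
```

Small blocks spread the columns almost evenly, while a block size of n divided by the number of processes reproduces the slab layout, so the user's choice of block size trades load balance against the efficiency of block operations.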





xshen@
Sat Dec 3 17:51:03 EST 1994