
Moment Matrix Fill

In the matrix filling process there are two separate partitioning concerns: how the matrix data are distributed among the nodes, and how the fill computation itself is divided. An important point is that these two decompositions need not be the same. Our general approach may be summarized as follows:

MoM matrix filling is very well suited to MIMD parallel processing: each matrix element is completely independent of every other element. The simplest parallel algorithm is therefore to partition the work the same way the data are partitioned (a partitioning driven by the needs of the matrix solver, as mentioned above) and have every node compute its piece of the matrix independently. However, this is not necessarily the most efficient algorithm, as described below.

The most obvious way to fill the matrix in a sequential program is for the outermost loops to be indexed by the columns and rows of the matrix. Each column of the matrix corresponds to an expansion function (``source basis function") and each row corresponds to a testing function (``field basis function"). In practice, however, most sequential MoM codes fill the matrix by looping through surface patches. The reason for this is that each source or testing basis function domain generally extends over more than one patch; each basis function overlaps one or more neighboring basis functions.

In a typical triangular patch formulation such as that described in Chapters 2 and 3, each basis function is defined on the edge between two adjacent patches. Each patch is part of the domain of three different basis functions. A source-field pair of patches is involved in the computation of nine different source-field basis function interactions. Each such source-field basis function interaction corresponds to a single matrix element. Much of the information required to compute the interaction between the source and field basis functions is related only to the geometry (e.g., the distance between two points in the basis function domain). This information is the same regardless of which basis function interaction is currently considered. Because there are nine basis function interactions for a given pair of patches, the geometry information may be computed once and used nine times if the loops are indexed by patches rather than by basis functions.

A source patch is only half the domain of each of three basis functions (likewise for field patches); a patch-patch interaction produces contributions to nine different matrix elements. A matrix element is not completely computed until the two patches that make up the source basis function domain have interacted with the two patches that make up the testing function domain. Thus each matrix location is written to four times, with each new contribution added to the current value.
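A minimal sketch of this patch-by-patch fill is given below in C. The helper names (patch_pair_geometry, element_contribution, edge_of) and the geom_t record are illustrative placeholders, not the routines of the actual code; they only indicate where the geometry reuse and the fourfold accumulation occur.

    #include <complex.h>

    /* Hypothetical geometry record and element routine -- placeholders,
       not the actual routines of the code described in the text. */
    typedef struct { double dist; /* ... other patch-pair geometry ... */ } geom_t;
    geom_t patch_pair_geometry(int ps, int pf);
    double complex element_contribution(const geom_t *g,
                                        int ps, int ks, int pf, int kf);

    /* Sequential patch-by-patch fill (sketch).  Z is the N x N moment
       matrix stored column-major (Z[col*N + row]); edge_of[p][k] is the
       global edge (basis function) number of the k-th edge of patch p,
       or -1 if that edge carries no basis function. */
    void fill_matrix(double complex *Z, int N, int npatches, int edge_of[][3])
    {
        for (int i = 0; i < N * N; i++)
            Z[i] = 0.0;                      /* every element is accumulated */

        for (int ps = 0; ps < npatches; ps++)           /* source patches */
            for (int pf = 0; pf < npatches; pf++) {     /* field patches  */
                /* geometry computed once, reused for up to nine interactions */
                geom_t g = patch_pair_geometry(ps, pf);

                for (int ks = 0; ks < 3; ks++) {        /* source edges */
                    int col = edge_of[ps][ks];
                    if (col < 0) continue;
                    for (int kf = 0; kf < 3; kf++) {    /* field edges */
                        int row = edge_of[pf][kf];
                        if (row < 0) continue;
                        /* each element is touched four times in all
                           (2 source patches x 2 field patches)        */
                        Z[col * N + row] +=
                            element_contribution(&g, ps, ks, pf, kf);
                    }
                }
            }
    }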

The exact savings due to patch-by-patch looping depends on the specific formulation and computer system being used. However, a typical timing shows that 25% to 40% of the total fill time is spent computing the patch-patch geometry interactions. If each such interaction were to be calculated nine times, as is necessary when looping is done over columns and rows, the total computational cost for matrix filling would be 300% to 420% of the original.
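These figures follow from simple arithmetic: if a fraction f of the fill time is patch-pair geometry that would be recomputed nine times, the relative cost becomes

    (1 - f) + 9f = 1 + 8f
    f = 0.25:  1 + 8(0.25) = 3.0   (300% of the original fill time)
    f = 0.40:  1 + 8(0.40) = 4.2   (420% of the original fill time)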

On a sequential computer there is no question that the patch-looping algorithm is more efficient than looping over basis functions. On a parallel computer, if the work is partitioned by patches the patch interactions can again be computed once and used nine times; however, each resulting matrix contribution must then be sent to the node on which that matrix element resides, which in general is NOT the local node. The cost of this internode communication could outweigh the advantage of reusing the patch interactions.

On the other hand, if the work is partitioned according to matrix position, as mentioned above, no internode communication is required during the matrix filling loops. Although the computation time per node may then rise to 300% to 420% of the original as described above, the TOTAL performance still scales directly with the number of nodes p, giving speed-up in the range p/3 to p/4.2 relative to the single-node patch-looping algorithm. These are respectable speedup numbers and should be easy to achieve. We consider this partitioning an acceptable fallback position if performance with the more complicated patch-partitioned algorithm described in the next subsection is not as good as expected.
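A sketch of this fallback in C: each node walks all matrix positions, computes only the elements it owns, and needs no internode communication; the price is that the patch-pair geometry is recomputed inside the element routine for every element. All of the names below (owns_element, compute_element, store_local) are illustrative placeholders; the real ownership rule is whatever the matrix solver requires.

    #include <complex.h>

    /* Hypothetical placeholders. */
    int owns_element(int row, int col, int node_id, int num_nodes);
    double complex compute_element(int row, int col);       /* full element */
    void store_local(double complex z, int row, int col);   /* local store  */

    /* Fallback fill: work partitioned exactly like the matrix data. */
    void fill_by_matrix_position(int N, int node_id, int num_nodes)
    {
        for (int col = 0; col < N; col++)
            for (int row = 0; row < N; row++)
                if (owns_element(row, col, node_id, num_nodes))
                    store_local(compute_element(row, col), row, col);
    }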

Patch-Partitioned Parallel Algorithm
As described above, patch partitioning has certain advantages over edge (basis function) partitioning, and in this section a patch-partitioned parallel algorithm is presented. The question is how to design an efficient patch-partitioned algorithm.

A first thought is to select one node to compute the interaction for a given source-field patch pair and then send the result as messages to the nodes that need it. Up to eight nodes may need to receive this result, so as many as eight messages are required to replace eight possible recomputations. Communication is very expensive in a distributed-memory system, and although techniques such as asynchronous, collective, and active message passing can reduce the overhead, intensive communication is exactly what we are trying to avoid. This is therefore not an attractive idea.

If instead the partition is arranged so that, for a given source edge, all of the computations involving the field patches are performed by a single node, then a given patch-pair interaction is computed at most three times, once for each node that owns one of the source patch's three edges. From a data decomposition point of view, this is nothing but column decomposition.

Within column decomposition there is still a choice between scattered decomposition and slab decomposition, where a slab is a matrix block consisting of a number of adjacent columns. The decision rests on which decomposition is more efficient, and this in turn depends on the global numbering scheme used for the edges of the triangular patches in the geometry model. Each triangle has at least two edges with adjacent global numbers. For a source patch the global edge numbers are the column numbers of the moment matrix, while for a field patch they are the row numbers. This locality of the edge numbers within a patch argues for the block (slab) decomposition rather than the scattered one. The column block (or slab) decomposition can be described as follows.

Let N be the total number of edges (matrix columns) and p the number of nodes; each node is then assigned a slab of roughly N/p adjacent columns, and the partition is as presented in Figure 4.6. Only when a patch's edge columns straddle a slab boundary is its interaction computed twice; in other words, there is at most one redundant computation per patch interaction. If N/p is large, that is, for a coarse-grained problem, the probability is high that there is no redundant computation at all. Since this filling implementation requires no communication, the term `embarrassingly parallel' is used to describe such an algorithm in the parallel computing community.
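The two candidate column-to-node mappings can be sketched as follows in C (the numbers and edge values are illustrative only). With the slab mapping, a source patch whose three edges have nearly consecutive global numbers almost always lands on a single node, whereas a scattered mapping would spread them over three nodes.

    /* Column-to-owner mappings for N columns on p nodes (0-based columns). */
    int owner_scattered(int col, int N, int p) { return col % p; }

    int owner_slab(int col, int N, int p)
    {
        int slab = (N + p - 1) / p;   /* ceil(N/p) adjacent columns per node */
        return col / slab;
    }

    /* Example: N = 1200, p = 32, a source patch with edges 301, 302, 307.
       owner_slab      -> 7, 7, 8    : at most two nodes, one redundant computation
       owner_scattered -> 13, 14, 19 : three nodes, two redundant computations */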

The details of the filling algorithm are described in the rest of this subsection. In our algorithm, every node loops over all the source patches sequentially (in fact, the patches could be visited in any order). Each source patch has up to three edges that carry basis functions. For each such edge, every node determines the column number associated with the edge, the index of the node that should compute that column, and the local index of the column within that node's slab. A job counter is kept on each node and is incremented whenever the node finds that it is the one to compute the current column. After all (up to) three edges of the source patch have been examined, only the nodes that were selected fill the corresponding moment matrix elements; the remaining nodes move on to the next source patch and repeat the selection process. This selection process is the only overhead relative to the sequential algorithm, so the parallel algorithm can be expected to show good speed-up. Usually at most two nodes are selected for one source patch, and when two are selected both compute the patch-pair interactions, giving only one extra computation. The algorithm also provides nearly perfect load balance, because any processor holds at most one more column than any other. Figure 4.7 shows pseudocode for the matrix fill algorithm just described.
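The per-node selection loop can be sketched as follows in C (this is an illustration, not the Figure 4.7 pseudocode; edge_of, node_id, num_nodes, and fill_patch_pair are assumed names).

    /* Hypothetical helper: computes the interaction of source patch ps with
       field patch pf once and accumulates contributions into the locally
       stored columns listed in source_edges. */
    void fill_patch_pair(int ps, int pf, const int *source_edges, int nedges,
                         int node_id, int slab);

    /* Per-node selection loop of the patch-partitioned fill (sketch).
       Every node runs the same loop; only the owner of a column does the
       work for that column. */
    void parallel_fill(int npatches, int edge_of[][3], int N,
                       int node_id, int num_nodes)
    {
        int slab = (N + num_nodes - 1) / num_nodes;   /* columns per slab      */
        long jobs = 0;                                /* this node's job count */

        for (int ps = 0; ps < npatches; ps++) {       /* all source patches    */
            int mine[3], nmine = 0;

            for (int ks = 0; ks < 3; ks++) {          /* selection phase       */
                int col = edge_of[ps][ks];            /* global column number  */
                if (col < 0) continue;                /* no basis on this edge */
                if (col / slab == node_id) {          /* this node owns column */
                    mine[nmine++] = ks;               /* local index: col - node_id*slab */
                    jobs++;
                }
            }
            if (nmine == 0) continue;                 /* not selected: move on */

            for (int pf = 0; pf < npatches; pf++)     /* fill phase            */
                fill_patch_pair(ps, pf, mine, nmine, node_id, slab);
        }
        (void)jobs;                                   /* kept for bookkeeping  */
    }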

To write the moment matrix to a file, we use the CMMD global I/O mode called global synchronous sequential mode, in which a single global file is accessed by all nodes, each node reading (or writing) its own portion. Each node contributes a buffer of data to be written, and the buffers are written into the file in node order (node 0 first, node 1 next, and so on, up to the highest numbered node). The buffers may have different sizes on different nodes, and nodes that have no data to write may set their buffer length to zero, as shown in Figure 4.8.
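A rough sketch of this pattern in C is given below. The CMMD names (the header, CMMD_set_io_mode, and the CMMD_sync_seq mode constant) are written as we recall them from the CMMD documentation and should be checked against the CMMD Reference Manual; the file name is illustrative.

    #include <fcntl.h>      /* open         */
    #include <unistd.h>     /* write, close */
    #include <cm/cmmd.h>    /* CMMD header, as we recall it -- verify */

    /* Write this node's slab of the moment matrix under sync-seq mode.
       All nodes call this; the data lands in the file in node order.   */
    void write_matrix_slab(const void *slab_buf, size_t slab_bytes)
    {
        int fd = open("zmatrix.dat", O_WRONLY | O_CREAT, 0644);

        /* Put the global file into synchronous sequential mode (assumed
           CMMD call; see the Reference Manual for the exact interface). */
        CMMD_set_io_mode(fd, CMMD_sync_seq);

        /* A node with nothing to write passes a zero length. */
        write(fd, slab_buf, slab_bytes);

        close(fd);
    }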

The CMMD I/O library provides only UNIX-level I/O calls, so there are potential problems in using Fortran 77 read/write statements, which rely on buffered I/O. Buffered I/O calls are permitted on all nodes, but they are not guaranteed to work as expected in CMMD sync-seq mode: because CMMD relies on the C run-time library for the buffering, it has no control over when a buffer spills and actually performs a read or write.

Buffered I/O, by its nature, removes control from the user over when the underlying read or write occurs. It is difficult or impossible for the programmer to guarantee synchronous calls across the partition to the underlying I/O routines.

This restriction makes Fortran I/O difficult under this mode, because the only Fortran I/O mechanisms provided are built on top of a layer of buffering. To overcome it, CMMD provides a global write function that uses the UNIX write mechanism to perform the global write under sync-seq mode.

The algorithm implemented on the Intel machine is almost the same as the one described above, with two differences worth noting. First, while the slab partition is very good for the matrix fill, it may not be good for the factor/solve phase with ScaLAPACK, whose routines load-balance best with a block-cyclic distribution rather than a slab distribution. Second, the moment matrix does not need to be written out in the Intel and IBM implementations.
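As an aside (not from the original text), the load-balance point can be illustrated by comparing the two 1-D column-to-process mappings, with the block size nb as an assumed parameter:

    /* Owning process for column j (0-based) under the two distributions.    */
    int slab_owner(int j, int N, int p)          { return j / ((N + p - 1) / p); }
    int block_cyclic_owner(int j, int nb, int p) { return (j / nb) % p; }

    /* With a slab distribution, the trailing columns that dominate the late
       stages of LU factorization all live on the highest-numbered nodes, so
       most nodes sit idle; the block-cyclic mapping spreads them evenly.     */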




