In the matrix filling process, there are two separate decomposition or partitioning concerns. An important point is that these decompositions need not be the same. Our general approach may be summarized as follows:
The most obvious way to fill the matrix in a sequential program is for the outermost loops to be indexed by the columns and rows of the matrix. Each column of the matrix corresponds to an expansion function ("source basis function") and each row corresponds to a testing function ("field basis function"). In practice, however, most sequential MoM codes fill the matrix by looping through surface patches. The reason for this is that each source or testing basis function domain generally extends over more than one patch; each basis function overlaps one or more neighboring basis functions.
In a typical triangular patch formulation such as that described in Chapters 2 and 3, each basis function is defined on the edge between two adjacent patches. Each patch is part of the domain of three different basis functions. A source-field pair of patches is involved in the computation of nine different source-field basis function interactions. Each such source-field basis function interaction corresponds to a single matrix element. Much of the information required to compute the interaction between the source and field basis functions is related only to the geometry (e.g., the distance between two points in the basis function domain). This information is the same regardless of which basis function interaction is currently considered. Because there are nine basis function interactions for a given pair of patches, the geometry information may be computed once and used nine times if the loops are indexed by patches rather than by basis functions.
A source patch is only half the domain of each of three basis functions (likewise for field patches); a patch-patch interaction produces contributions to nine different matrix elements. A matrix element is not completely computed until both patches that make up the basis function domain have interacted with both patches that make up the testing function domain. Thus, each matrix location is written to four times; each time, the new contribution is added to the current value.
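The structure of the patch-by-patch fill can be sketched as follows. This is not taken from any particular code; the names num_patches, edge_of_patch, patch_pair_geometry, edge_contribution, the Geometry type, and the matrix Z are hypothetical placeholders. The sketch only illustrates the point made above: the geometry quantities are computed once per patch pair and reused for all nine basis-function interactions, and each matrix entry accumulates its value over several such contributions.

    /* Illustrative sketch of the sequential patch-by-patch fill.
     * All names are placeholders, not the original code.             */
    for (int ps = 0; ps < num_patches; ps++) {        /* source patch */
        for (int pf = 0; pf < num_patches; pf++) {     /* field patch  */
            /* Geometry data (distances, Green's function samples, ...)
             * is computed once per patch pair and reused for all nine
             * basis-function interactions.                            */
            Geometry g = patch_pair_geometry(ps, pf);
            for (int i = 0; i < 3; i++) {              /* field edges  */
                int row = edge_of_patch(pf, i);        /* global edge  */
                if (row < 0) continue;                 /* no basis fn  */
                for (int j = 0; j < 3; j++) {          /* source edges */
                    int col = edge_of_patch(ps, j);
                    if (col < 0) continue;
                    /* Each Z[row][col] receives four such contributions
                     * in total: one for each combination of the two
                     * patches forming the source edge with the two
                     * patches forming the field edge.                 */
                    Z[row][col] += edge_contribution(&g, pf, i, ps, j);
                }
            }
        }
    }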
The exact savings due to patch-by-patch looping depend on the specific formulation and computer system being used. However, typical timings show that 25% to 40% of the total fill time is spent computing the patch-patch geometry interactions. If each such interaction were to be calculated nine times, as is necessary when looping is done over columns and rows, the total computational cost for matrix filling would be 300% to 420% of the original.
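To see where these figures come from, let f denote the fraction of the fill time spent on the patch-pair geometry computations in the patch-looping code. If that work had to be redone for each of the nine interactions, the relative cost would be (1 - f) + 9f = 1 + 8f, which gives 3.0 (300%) for f = 0.25 and 4.2 (420%) for f = 0.40.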
On a sequential computer, there is no question that the patch-looping algorithm is more efficient than looping over basis functions. On a parallel computer, if the work is partitioned by patches, the patch interactions can again be computed once and used nine times. However, each resulting matrix contribution must then be distributed to the node on which it resides, which in general is not the local node. Therefore, the cost of communication between nodes could outweigh the advantage of reusing the patch interactions.
A valid point is that if the work is partitioned according to matrix position, as previously mentioned, no internode communication is required during the matrix-filling loops. Therefore, although the computation time per node may increase to 300% to 420% of the original time, as described above, the total performance should increase in direct proportion to the number of nodes p, giving a speedup in the range of p/3 to p/4.2 relative to the single-node patch-looping algorithm. These are very respectable speedup numbers, and they should be easy to achieve. We consider this partitioning to be an acceptable fallback position if the performance of the more complicated patch-partitioning algorithm described in the next subsection is not as good as expected.
Patch-Partitioned Parallel Algorithm
As described in the previous paragraphs, patch partitioning has certain advantages over edge partitioning, and in this subsection a patch-partitioned parallel algorithm is presented. The question is how to design an efficient patch-partitioning algorithm. A simple idea is to select one node to compute the patch-pair interaction for a given source-field patch pair and then send the result as a message to the nodes that need it. Up to eight nodes may receive this message, which means that up to eight messages are required to replace up to eight possible recomputations. Communication is very expensive in a distributed-memory system. Although techniques such as asynchronous message passing, collective message passing, and active messages can reduce communication overhead, intensive communication is exactly what we are trying to avoid.
We can conclude that this is not an attractive idea. If instead the partition is done in such a way that all of the computations involving the field patches for a given column are performed by one node, then the possible number of computations of each patch-pair interaction is reduced to three (at most one per edge of the source patch). From a data decomposition point of view, this is nothing but column decomposition. We then have the choice between scattered and slab decomposition; a slab is a matrix block consisting of a number of adjacent columns of the matrix.
The decision depends on which decomposition is more efficient. To answer this question, we must look at the global numbering scheme used to number the edges of the triangular patches in the geometry model. Each triangle has at least two edges with adjacent global numbers. If the triangle is a source patch, the global numbers of its edges correspond to column numbers in the moment matrix; if it is a field patch, they correspond to row numbers. This locality of the edge numbers within a patch suggests using the block decomposition rather than the scattered one. The column-block (or slab) decomposition can be described as follows.
Let N be the total number of columns (edges) and p the number of nodes; each node is then assigned a slab of approximately N/p adjacent columns, and the partition is as presented in Figure 4.6.
Except for source patches whose edges straddle a slab boundary between two nodes, each patch-pair interaction is computed only once; at a boundary it may be computed twice. In other words, there is at most one possible redundant computation per patch interaction. If the slab width N/p is large, that is, if the problem granularity is large, the probability is high that there is no redundant computation at all.
Since this filling implementation requires no communication, the term "embarrassingly parallel" is used in the parallel computing community to describe such an algorithm.
The details of the filling algorithm are described in the rest of this subsection. In our algorithm, every node loops over all the source patches sequentially (actually, this could be done in any order). Each source patch has up to three edges. For each edge, every node determines the column number associated with it, the index of the node that should compute that column, and the local index of the column within that node's slab. A job counter is assigned to each node, and a node's job counter is incremented by one whenever it finds itself to be the node responsible for the current column. After each node has gone through all three possible edges of a source patch, only the nodes selected to do the computation fill the corresponding moment matrix elements; the remaining nodes move on to the next source patch and repeat the selection process. One can see that this selection process performed by the nodes is the only overhead compared with the sequential algorithm, so one can expect the parallel algorithm to have a good speedup. Since usually at most two nodes are selected for one source patch, and both of them compute the patch-pair interaction, only one extra computation is involved. This fill algorithm allows for nearly total load balance because any processor will have at most one extra column compared to all other processors. Figure 4.7 shows pseudo code for the matrix fill algorithm just described.
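The pseudo code of Figure 4.7 is not reproduced here, but the selection logic described above might look like the following sketch. It reuses the hypothetical slab_owner mapping and the placeholder names (num_patches, edge_of_patch, patch_pair_geometry) from the earlier sketches; it is an illustration of the idea, not the original code.

    /* Sketch of the patch-partitioned parallel fill loop.
     * my_node : index of this node (0 .. p-1)
     * jobs    : per-node job counter described in the text            */
    int jobs = 0;
    for (int ps = 0; ps < num_patches; ps++) {   /* every node loops
                                                    over all source
                                                    patches             */
        int my_cols[3];
        int n_mine = 0;
        for (int j = 0; j < 3; j++) {            /* up to three edges   */
            int col = edge_of_patch(ps, j);      /* global column number */
            if (col < 0) continue;
            SlabIndex s = slab_owner(col, N, p);
            if (s.node == my_node) {             /* this node owns it    */
                my_cols[n_mine++] = s.local_col;
                jobs++;                          /* increment job counter */
            }
        }
        if (n_mine == 0) continue;               /* not selected: go to
                                                    the next source patch */
        /* Selected nodes compute the patch-pair interactions and fill
         * their local columns; at most two nodes are usually selected
         * per source patch, so at most one redundant geometry
         * computation occurs.                                          */
        for (int pf = 0; pf < num_patches; pf++) {
            Geometry g = patch_pair_geometry(ps, pf);
            fill_local_columns(&g, ps, pf, my_cols, n_mine);
        }
    }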
To write the moment matrix to a file, we use the CMMD global I/O mode called global synchronous sequential mode, in which a global file is accessed by all nodes simultaneously, with each node reading (or writing) its own portion of the file. Each node contributes a buffer of data to be written, and the data is written into the file sequentially by node number (node 0 first, node 1 next, and so on up to the highest-numbered node). This mode allows the amount of data written by each node to differ; the buffer on each node can have a different size, and nodes that have no data to write may set their buffer length to zero, as shown in Figure 4.8.
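CMMD's synchronous sequential global write is specific to the CM-5. Purely as a point of comparison, the same node-ordered write pattern can be expressed with the standard MPI-IO routine MPI_File_write_ordered, which writes each rank's buffer into a shared file in rank order and likewise allows the buffer size (possibly zero) to differ from rank to rank. The sketch below illustrates those semantics only; it is not the CMMD code, and the function and variable names other than the MPI calls are placeholders.

    /* Node-ordered sequential output using MPI-IO (illustrative only). */
    #include <mpi.h>

    void write_slab(const double *slab, int n_local, const char *fname)
    {
        MPI_File fh;
        MPI_File_open(MPI_COMM_WORLD, fname,
                      MPI_MODE_CREATE | MPI_MODE_WRONLY,
                      MPI_INFO_NULL, &fh);
        /* Each rank contributes n_local doubles; a rank with nothing
         * to write simply passes a count of zero.                     */
        MPI_File_write_ordered(fh, slab, n_local, MPI_DOUBLE,
                               MPI_STATUS_IGNORE);
        MPI_File_close(&fh);
    }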
The CMMD I/O library provides only UNIX-level I/O calls, so there are potential problems in using Fortran 77 read/write statements, which rely on buffered I/O. Buffered I/O calls are permitted on all nodes; however, they are not guaranteed to work as expected in CMMD-sync-seq mode. Since CMMD relies on the C run-time library to provide the buffered I/O code, CMMD has no control over when a buffer spills and an actual read or write is performed.
Buffered I/O, by its nature, removes control from the user over when the underlying read or write occurs. It is difficult or impossible for the programmer to guarantee synchronous calls across the partition to the underlying I/O routines.
This restriction makes Fortran I/O difficult under this mode, since the only Fortran I/O mechanisms provided are built on top of a layer of buffering. To overcome the restriction, CMMD provides a global write function that uses the UNIX write mechanism to perform a global write under CMMD-sync-seq mode.
The algorithm implemented on the Intel machine is almost the same as the one described above, with two points worth mentioning. First, the slab partition is very good for the matrix fill but may not be good for the factor/solve step using ScaLAPACK, since a slab partition may not load-balance well for ScaLAPACK. Second, the moment matrix does not need to be written out in the Intel and IBM implementations.
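The load-balancing concern can be made concrete: ScaLAPACK distributes the matrix block-cyclically, so that as the LU factorization proceeds the remaining active columns stay spread over all processes, whereas with a slab layout the processes owning the leading slabs fall idle once their columns have been eliminated. The two ownership patterns can be compared with the following small sketch (0-based columns and processes; the block size nb and all names are illustrative).

    /* Owner of global column 'col' under the two distributions.
     * Illustrates why a slab layout leaves processes idle during LU
     * factorization while a block-cyclic layout does not.             */
    int slab_owner_of_col(int col, int N, int p)    /* contiguous slabs */
    {
        int width = (N + p - 1) / p;                /* ceil(N / p)      */
        return col / width;
    }

    int cyclic_owner_of_col(int col, int nb, int p) /* block-cyclic     */
    {
        return (col / nb) % p;
    }

Under the slab mapping, once the first slab has been factored its owner does no further work; under the block-cyclic mapping every process continues to own active columns until the factorization is complete.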