Given by Nancy J. McCracken at CPS615 Basic Simulation Track for Computational Science in Fall Semester 95. Foils prepared 22 February 98
CPS615
November 1, 1995
Nancy McCracken
Geoffrey Fox
NPAC
Syracuse University
111 College Place
Syracuse NY 13244-4100
This talk assumes the material found in the CPS615 lecture notes, Introduction to Numerical Integration.
Going on from the static data decomposition for integration on page 27 of those notes, we examine further parallel algorithms for computing adaptive Romberg (recursive Simpson) integration.
The data decomposition and remapping techniques discussed are generally applicable to parallel adaptive algorithms for other applications.
Recall that when you repeatedly apply Simpson's rule, halving the interval each time, you can reuse old values of the function. Furthermore, there is a convenient pattern to the Simpson coefficients.
In our program, we start with an initial set of Xi's for which we calculate f values and save them in an array oldf. Then for each iteration of Simpson's rule, we collect the new values in an array newf. The result uses 2*oldf and 4*newf (with exceptions for the endpoints).
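To make the coefficient pattern concrete, here is a minimal serial sketch in C. The array names oldf and newf follow the foils; the example integrand f, the tolerance argument, and the table size MAXPTS are illustrative assumptions, not from the original program.

```c
#include <math.h>

#define MAXPTS 100000   /* table size -- an illustrative limit */

/* Example integrand -- stands in for whatever f is being integrated */
static double f(double x) { return sin(x); }

/* Iterated Simpson's rule on [a,b]: halve the spacing each pass,
   reusing previously computed f values (oldf) and adding only the new
   midpoints (newf), as described above. */
double simpson_iterate(double a, double b, double tol)
{
    static double oldf[MAXPTS], newf[MAXPTS];
    int n = 1;                   /* current number of subintervals */
    double h = b - a;            /* current subinterval width */
    double s, sold = 0.0;
    oldf[0] = f(a);
    oldf[1] = f(b);
    for (;;) {
        double sum4 = 0.0, sum2 = 0.0;
        int i;
        for (i = 0; i < n; i++) {          /* new points: midpoints */
            newf[i] = f(a + (i + 0.5) * h);
            sum4 += newf[i];
        }
        for (i = 1; i < n; i++)            /* old interior points */
            sum2 += oldf[i];
        /* Simpson with spacing h/2: endpoints * 1, newf * 4, oldf * 2 */
        s = (h / 2.0) / 3.0 * (oldf[0] + oldf[n] + 4.0 * sum4 + 2.0 * sum2);
        if (fabs(s - sold) < tol) return s;
        sold = s;
        if (2 * n >= MAXPTS) return s;     /* out of table space */
        /* merge: old values land on even indices, new on odd */
        for (i = n; i > 0; i--) oldf[2 * i] = oldf[i];
        for (i = 0; i < n; i++) oldf[2 * i + 1] = newf[i];
        n *= 2;
        h /= 2.0;
    }
}
```

Each pass reuses every previously computed f value: old points take coefficient 2 (except the endpoints, with coefficient 1), and the new midpoints take coefficient 4.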
maintains the same static data decomposition as Simpson's rule - see page 27 of the Numerical Integration notes.
program sketched using MPI, could also be done in HPF
all processors calculate the new overall sum and decide whether to continue with another iteration (a master/slave paradigm is not necessary; see the sketch after this list)
Variation: processors decide independently whether to continue. This saves the Allreduce communication.
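A hedged sketch of that convergence test in C with MPI: every processor contributes its partial sum, and the Allreduce makes the global sum, and hence the identical continue/stop decision, available everywhere. The function and variable names are illustrative, not from the foils.

```c
#include <mpi.h>
#include <math.h>

/* All processors contribute their partial Simpson sum; MPI_Allreduce
   delivers the global sum to everyone, so every processor makes the
   same continue/stop decision -- no master/slave needed. */
int converged(double local_sum, double *previous, double tol, MPI_Comm comm)
{
    double global_sum;
    MPI_Allreduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM, comm);
    int done = fabs(global_sum - *previous) < tol;
    *previous = global_sum;    /* remember this iteration's result */
    return done;
}
```

In the variation, each processor would instead test only its own contribution against a local criterion, at the cost of processors possibly stopping at different iterations.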
Instead of assigning subintervals or decomposing iterations among the processors, we treat each element of the array oldf as an interval and decompose intervals among the processors.
Adaptive: only those subintervals that do not yet meet the error criterion are iterated. Since this leads to load imbalance, the processors re-divide the intervals after each iteration.
Introduce a boolean array to keep track of which intervals need further computing. After each iteration, if K intervals need further computing, each processor assigns itself K/Nproc of them in a contiguous set.
(Figures in the original foils show the initial decomposition and the decomposition after an iteration.)
Note that each processor "owns" a contiguous set of indices and works on a subset; a sketch of the remapping follows.
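A minimal sketch of that remapping in C, assuming the boolean array is replicated on every processor; the function name my_share and its argument list are illustrative.

```c
/* After an iteration: given the boolean array 'active' of length nint
   (replicated on every processor), each processor claims a contiguous
   block of roughly K/Nproc of the K still-active intervals. */
void my_share(const int *active, int nint, int rank, int nproc,
              int *first, int *last)
{
    int K = 0, i, seen = 0;
    for (i = 0; i < nint; i++)
        K += active[i];                 /* K intervals still need work */
    int lo = rank * K / nproc;          /* my slice, in "active" numbering */
    int hi = (rank + 1) * K / nproc;
    *first = 0; *last = -1;             /* first > last means no work */
    for (i = 0; i < nint; i++) {
        if (!active[i]) continue;
        if (seen == lo) *first = i;     /* first active index of my slice */
        if (seen == hi - 1) *last = i;  /* last active index of my slice */
        seen++;
    }
}
```

A caller treats first > last as an empty assignment, which happens for some processors when K < Nproc.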
Instead of sending all data to all processors, each processor sends the number of intervals that it needs to compute on the next iteration to the processors on its left and right (the end processors have only one neighbor).
First, use an Allreduce to calculate the load average, the ideal number of intervals per processor. When a processor receives the interval count from its right neighbor, it compares it with its own. If the two counts differ by more than one, the more heavily loaded processor sends intervals to the other, averaging the pair's loads.
Iterate the averaging of intervals until a load balance is achieved - at the end of each pass, an Allreduce can test whether any processor is still load balancing (see the sketch below).
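A hedged sketch of one averaging pass in C with MPI. The red/black pairing of neighbors is one way to organize the exchange so each processor averages with one partner at a time; it is an illustrative choice, not necessarily the foils' scheme, and only the counts move here (shipping the intervals themselves would use matching sends and receives of the saved f values).

```c
#include <mpi.h>
#include <stdlib.h>

/* One red/black sweep of the neighbor averaging: in phase 0 the pairs
   (0,1),(2,3),... average their interval counts; in phase 1 the pairs
   (1,2),(3,4),... do the same.  Returns the updated count. */
int average_with_neighbors(int myload, MPI_Comm comm)
{
    int rank, nproc, phase;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &nproc);
    for (phase = 0; phase < 2; phase++) {
        /* phase 0: even ranks look right; phase 1: odd ranks look right */
        int partner = (rank % 2 == phase) ? rank + 1 : rank - 1;
        int hisload;
        if (partner < 0 || partner >= nproc)
            continue;                    /* end processor, no partner */
        MPI_Sendrecv(&myload, 1, MPI_INT, partner, phase,
                     &hisload, 1, MPI_INT, partner, phase,
                     comm, MPI_STATUS_IGNORE);
        if (abs(myload - hisload) > 1) {
            int transfer = (myload - hisload) / 2;   /* >0: I send */
            myload -= transfer;
            /* here the |transfer| intervals themselves would be shipped
               with matching sends/receives of the saved f values */
        }
    }
    return myload;
}
```

Repeating this sweep until an Allreduce reports that no processor transferred anything gives the termination test described above.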
We believe this converges: processors with below-average load stay fixed, and excess load flows "downhill" until it reaches them. This probably converges more slowly than necessary - there are many possible algorithms, including arbitrarily stopping after some number of iterations.
The load-balancing iterations may take up to Nproc sends and receives, each of only a small number of values. If there is no significant load imbalance, far fewer may be needed, so this method can involve less message passing than the CommAll algorithm.
Variation: in this application, and in general, the units of computation do not have to be contiguous. Each processor can keep a list of the indices it is working on and send load to neighbors with wraparound, or communicate load to other choices of processors.
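For completeness, a small sketch of the list-based variation; the wraparound partner choice and the function name ship_indices are illustrative assumptions.

```c
#include <mpi.h>

/* Variation: each processor keeps an explicit list of the interval
   indices it owns, so shipped work need not be contiguous.  Here the
   last n indices are sent to the wraparound neighbor, which would post
   a matching MPI_Recv and append them to its own list. */
void ship_indices(int *mylist, int *mycount, int n, MPI_Comm comm)
{
    int rank, nproc;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &nproc);
    int partner = (rank + 1) % nproc;          /* wraparound neighbor */
    MPI_Send(&mylist[*mycount - n], n, MPI_INT, partner, 2, comm);
    *mycount -= n;                             /* those indices are gone */
}
```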