Computing Techniques for Adaptive Integration
CPS615, November 1, 1995
Nancy McCracken and Geoffrey Fox
NPAC, Syracuse University, 111 College Place, Syracuse NY 13244-4100

This talk assumes the material found in the CPS615 lecture notes, Introduction to Numerical Integration. Going on from the static data decomposition for integration on page 27 of those notes, we examine further parallel algorithms for computing adaptive Romberg (recursive Simpson) integration. The algorithms are sketched in a C-like pseudo-code intended to give the general idea. Note that the programming details are not tested and will inevitably have errors. The data decomposition and remapping techniques discussed are generally applicable to parallel adaptive algorithms for other applications.

Sequential Romberg Integration

Recall that when you repeatedly apply Simpson's rule, halving the interval each time, you can reuse old values of the function. Furthermore, there is a convenient pattern to the Simpson coefficients. In our program, we start with an initial set of Xi's for which we calculate f values and save them in an array oldf. Then for each iteration of Simpson's rule, we collect the new values in an array newf. The result uses 2*oldf and 4*newf (with exceptions for the endpoints).

Program for Romberg Integration

Parallel Romberg Integration

This maintains the same static data decomposition as Simpson's rule - see page 27 of the Numerical Integration notes. The program is sketched using MPI; it could also be done in HPF.

Parallel Romberg Integration, Continued

All processors calculate the new overall sum and decide whether to continue another iteration (a master/slave paradigm is not necessary). Variation: the processors decide independently whether to continue, which saves the Allreduce communication.

Adaptive Parallel Romberg - AlltoAll Version

At each iteration, the data is decomposed again by all the processors.
Instead of assigning subintervals or decomposing iterations among the processors, we treat each element of the array oldf as an interval and decompose the intervals among the processors. Adaptive: only those subintervals which do not meet the error criterion are iterated. Since this leads to load imbalance, the processors divide up the intervals each time. We introduce a boolean array, compute, to keep track of which intervals need further computing. After each iteration, if K intervals need further computing, each processor assigns itself K/Nproc of them in a contiguous set. (Figures show the initial decomposition and the decomposition after an iteration.) Note that each processor "owns" a contiguous set of indices and works on a subset.

Adaptive Parallel Romberg - program initialization

Adaptive Parallel Romberg - compute integral iteration and communicate values to all processors

Adaptive Parallel Romberg - recompute decomposition

Adaptive Parallel Romberg - Neighbors Version

At each iteration, the processors average their load with their neighbors. Instead of sending all data to all processors, each processor sends the number of intervals that it needs to compute on the next iteration to the processors on its left and right (the end processors have only one neighbor). First, an Allreduce is used to calculate the load average, the ideal number of intervals per processor. When a processor receives the number of intervals from its right neighbor, it compares it with its own. If the two differ by more than one: if its own count is greater but below the load average, it does not adjust its load; otherwise, it averages the two numbers, adjusts endindex by half the difference (counting only those intervals for which compute is still true), and sends the new index to the right neighbor to use as its startindex. The same process is repeated with the left neighbor. The averaging of intervals is iterated until a load balance is achieved - at the end of each iteration, an Allreduce can test whether any processor is still load balancing.
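The pairwise rule just described can be sketched as a small decision function. This is our own reading of the rule, not the notes' tested code: the name balance_with_neighbor and its parameters mycount, neighborcount, and loadavg are illustrative. Given a processor's active-interval count, one neighbor's count, and the load average from the Allreduce, it returns how many intervals to shift across the boundary (positive: give to the neighbor; negative: take from it).

```c
#include <stdlib.h>

/* Sketch of one pairwise load-averaging decision in the Neighbors version. */
int balance_with_neighbor(int mycount, int neighborcount, int loadavg) {
    if (abs(mycount - neighborcount) <= 1)
        return 0;                 /* counts differ by at most one: balanced */
    if (mycount > neighborcount && mycount <= loadavg)
        return 0;                 /* heavier side still below average: leave it */
    return (mycount - neighborcount) / 2;   /* move half the difference */
}
```

The caller would then adjust its endindex (or startindex) by the returned amount, counting only intervals for which compute is still true, and send the new index to the neighbor.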
More discussion of the Neighbors version

We believe this converges: the below-average-load processors stay fixed, and the extra load flows "downhill" until it reaches them. This probably converges more slowly than necessary - there are many possible algorithms, including one that arbitrarily stops after some number of iterations. The load balancing iterations may take up to Nproc sends and receives of only a small number of values; if there is no significant load imbalance, it may take far fewer, so this method can involve less message-passing than the AlltoAll algorithm.

Variation: in this application, and in general, the units of computation do not have to be contiguous. Each processor can keep a list of the indices that it is working on and send load to its neighbors with wraparound, or communicate load to other choices of processors.
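Returning to the AlltoAll version, its recompute-decomposition step can also be sketched concretely. This is a plain-C illustration under our own assumptions: the name recompute_decomposition and the convention that the last processor takes the remainder of K/Nproc are ours. Each processor scans the boolean compute array and translates its contiguous share of the still-active intervals into startindex/endindex bounds.

```c
/* Sketch of the recompute-decomposition step: processor myrank claims a
   contiguous block of roughly K/Nproc of the intervals still marked true
   in compute[], returning its bounds via startindex/endindex. */
void recompute_decomposition(const int compute[], int nintervals,
                             int nproc, int myrank,
                             int *startindex, int *endindex) {
    int K = 0;                               /* intervals still needing work */
    for (int i = 0; i < nintervals; i++)
        if (compute[i]) K++;

    int share = K / nproc;                   /* K/Nproc active intervals each */
    int first = myrank * share;              /* position of our first active interval */
    int last  = (myrank == nproc - 1) ? K : first + share;  /* last rank takes the remainder */
    if (first >= last) {                     /* no active work for this processor */
        *startindex = *endindex = 0;
        return;
    }

    /* walk compute[], converting active-interval positions into index bounds */
    int seen = 0;
    for (int i = 0; i < nintervals; i++) {
        if (!compute[i]) continue;
        if (seen == first) *startindex = i;
        seen++;
        if (seen == last) { *endindex = i + 1; return; }
    }
}
```

In an MPI version, every processor would call this after the values are communicated to all, so all arrive at the same decomposition without any extra messages.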