This is a brief note on the parallel complexity analysis of 2D Multigrid, including communication costs. The multigrid method requires each grid point to be updated using values from as many as 8 neighbors (those to the N, E, S, W, NW, SW, SE and NE), and this dependence determines our communication costs.
We assume we have an n-by-n grid of data, where n = 2^m+1, and that the number of processors p = s^2 is a perfect square. The data is laid out on an s-by-s grid of processors, with each processor owning an ((n-1)/s)-by-((n-1)/s) subgrid. This is illustrated in the following diagram by the pink dotted line, which encircles the grid points owned by the corresponding processor. The grid points in the top processor row have been labeled (and color coded) by the grid number i of the problem P(i) in which they participate. There is exactly one point labeled 2 per processor. The only grid point in P(1) with a nonboundary value is owned by the processor above the pink one. In general, if p >= 16, there will be .5*log_2(p)-1 grids P(i) in which fewer than all processors own an active grid point.
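To make the layout concrete, here is a minimal Python sketch of the ownership mapping; the function name, the 0-based point indexing, and the choice to fold the extra boundary row and column into the last processor row and column are illustrative assumptions, not part of this note.

def owner(x, y, n, s):
    """Processor (row, col) owning grid point (x, y) of an n-by-n grid
    (n = 2^m + 1) laid out in ((n-1)/s)-by-((n-1)/s) blocks on an
    s-by-s grid of processors."""
    b = (n - 1) // s                      # subgrid dimension per processor
    return (min(x // b, s - 1), min(y // b, s - 1))

# Example: m = 5, so n = 33, and p = 16 processors (s = 4); each processor
# owns an 8-by-8 block of the 32-by-32 array of non-boundary points.
print(owner(10, 27, n=33, s=4))           # -> (1, 3)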
The processor indicated by the pink dotted line requires certain grid point data from its neighbors to execute the algorithm; this data has been indicated by the same color coded numbers as before. For example, to update its own (blue) grid points for P(5), the pink processor requires 8 blue grid point values from each of its N, S, E and W neighbors, as well as a single blue grid point value from each of its NW, SW, SE and NE neighbors. Similarly, updating the (red) grid points for P(4) requires 4 red grid point values from each of the N, S, E and W neighbors, and one each from the NW, SW, SE and NE neighbors. This pattern continues until each processor has only one grid point. After this, only some processors participate in the computation, requiring one value each from 8 other processors.
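As a rough check on this pattern, the following sketch counts the ghost values a processor must receive at a fine grid level i >= k, where p = 4^k as in the analysis below; the exact counts depend on details of the smoother that the note does not specify, so treat the constants as illustrative.

def ghost_words(i, k):
    """Words received per processor at fine level i >= k, where it owns a
    2^(i-k)-by-2^(i-k) block of active points of grid P(i): one boundary
    row/column from each N, S, E, W neighbor plus one corner value from
    each NW, NE, SW, SE neighbor."""
    edge = 2 ** (i - k)
    return 4 * edge + 4

# Assuming the diagram shows p = 16 processors (k = 2, as suggested by the
# single point labeled 2 per processor), grid P(5) needs 8 values from each
# side neighbor and 1 from each corner, consistent with the blue example:
print(ghost_words(5, 2))                  # -> 36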
This is enough to do a big-Oh analysis of the communication costs. For simplicity, let p=4^k=2^(2*k), so each processor owns a 2^(m-k)-by-2^(m-k) subgrid. Consider a V-cycle starting at level m. Then in levels m through k of the V-cycle, each processor will own at least one grid point. After that, in levels k-1 through 1, fewer than all processors will own an active grid point, and so some processors will remain idle. Let f be the time per flop, alpha the time for a message startup, and beta the time per word to send a message.
At level m >= i >= k, the time in the V-cycle will be
Time at level i = O( 4^(i-k) ) * f      ... number of flops, proportional to the
                                        ...    number of grid points per processor
                + O( 1 ) * alpha        ... send a constant number of messages
                                        ...    to neighbors
                + O( 2^(i-k) ) * beta   ... number of words sent

Summing all these terms from i=k to m yields
Time at levels k to m = O( 4^(m-k) )*f + O( m-k )*alpha + O( 2^(m-k) )*beta
                      = O( n^2/p )*f + O( log(n/sqrt(p)) )*alpha + O( n/sqrt(p) )*beta

Note that there is a "surface-to-volume effect" when i >> k: each processor communicates just its boundary data but computes with the many more grid points that it owns. We have seen this attractive phenomenon before in particle simulations, where it also meant that computation time was likely to dominate communication time.
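A back-of-the-envelope version of this model can be written down directly; the constants hidden inside the O(.)'s are set to 1, and the ghost counts are taken from the sketch above, so the numbers only show trends rather than predicted timings.

def fine_level_time(i, k, f, alpha, beta):
    """Model time at level i, for m >= i >= k (all hidden constants taken as 1)."""
    flops = 4 ** (i - k)                  # ~ number of grid points per processor
    msgs  = 8                             # one message to/from each neighbor
    words = ghost_words(i, k)             # O( 2^(i-k) ) boundary values received
    return flops * f + msgs * alpha + words * beta

def fine_levels_time(m, k, f, alpha, beta):
    """Sum over levels i = k..m: O(4^(m-k))*f + O(m-k)*alpha + O(2^(m-k))*beta."""
    return sum(fine_level_time(i, k, f, alpha, beta) for i in range(k, m + 1))

For m much larger than k the f term grows like 4^(m-k) while the beta term grows only like 2^(m-k), which is exactly the surface-to-volume effect just described.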
On levels k-1 >= i >= 1, this effect disappears, because each active processor needs to communicate about as many words as it performs floating point operations:
Time at level i = O( 1 ) * f       ... number of flops, proportional to the
                                   ...    number of grid points per processor
                + O( 1 ) * alpha   ... send a constant number of messages
                                   ...    to neighbors
                + O( 1 ) * beta    ... number of words sent

Summing all these terms yields
Time at levels 1 to k-1 = O( k-1 )*f + O( k-1 )*alpha + O( k-1 )*beta
                        = O( log(p) )*f + O( log(p) )*alpha + O( log(p) )*beta
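The coarse-level part of the model is just as simple; this is the levels-1-to-k-1 formula above with unit constants, continuing the sketch.

def coarse_level_time(f, alpha, beta):
    """Model time at a level i < k: one active grid point, O(1) messages and words."""
    return 1 * f + 8 * alpha + 8 * beta

def coarse_levels_time(k, f, alpha, beta):
    """Sum over levels i = 1..k-1: O(log p) * (f + alpha + beta)."""
    return (k - 1) * coarse_level_time(f, alpha, beta)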
The total time for a V-cycle starting at the finest level is therefore

Time = O( n^2/p + log(p) )*f + O( log(n) )*alpha + O( n/sqrt(p) + log(p) )*beta

Similarly, the total time for a V-cycle starting at level j, k <= j <= m, is
Time(j) = O( 4^j/p + log(p) )*f + O( j )*alpha + O( 2^j/sqrt(p) + log(p) )*beta

The total time for a V-cycle starting at level j < k is
Time(j) = O( j )*f + O( j )*alpha + O( j )*beta
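Combining the two regimes gives a model for Time(j), the cost of one V-cycle started at level j, reusing the per-level sketches above.

def v_cycle_time(j, k, f, alpha, beta):
    """Model cost of one V-cycle started at level j."""
    if j < k:                             # only the coarse regime is visited
        return j * coarse_level_time(f, alpha, beta)
    return (coarse_levels_time(k, f, alpha, beta)
            + sum(fine_level_time(i, k, f, alpha, beta) for i in range(k, j + 1)))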
Therefore, the total time for Full Multigrid, assuming n^2 >= p, so that each processor has at least 1 grid point, is

Time = sum_{j=1}^m Time(j)
     = O( n^2/p + log(p)*log(n) )*f + O( (log(n))^2 )*alpha + O( n/sqrt(p) + log(p)*log(n) )*beta
     = O( N/p + log(p)*log(N) )*f + O( (log(N))^2 )*alpha + O( sqrt(N/p) + log(p)*log(N) )*beta

where N=n^2 is the number of unknowns. Thus the parallel time for the floating point work, O( N/p + log(p)*log(N) )*f, yields nearly perfect speedup over the serial work O(N)*f when N >> p, but the time only drops to O( (log N)^2 ) when p=N (the PRAM model), i.e. the speedup is then only N/(log N)^2.
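Finally, Full Multigrid does one V-cycle starting at each level, so its model cost is just the sum of Time(j); the machine constants below (f = beta = 1, alpha = 10) are invented purely to expose the two regimes, not taken from any real machine.

def fmg_time(m, k, f, alpha, beta):
    """Model cost of Full Multigrid: one V-cycle started at each level j = 1..m."""
    return sum(v_cycle_time(j, k, f, alpha, beta) for j in range(1, m + 1))

# N >> p : n-1 = 2^12, p = 4^2 = 16   -> time ~ N/p (nearly perfect speedup)
print(fmg_time(12, 2, f=1.0, alpha=10.0, beta=1.0))
# p ~ N  : n-1 = 2^6,  p = 4^6 = 4096 -> time ~ (log N)^2 (PRAM-like regime)
print(fmg_time(6, 6, f=1.0, alpha=10.0, beta=1.0))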
In practice, the time spent on the coarsest grids, where each processor has one or zero active grid points, is negligible on a serial machine but significant on a parallel machine, enough to lower the efficiency noticeably.
For example, in the paper "Analysis of the Multigrid FMV Cycle on large scale parallel machines", R. Tuminaro and D. Womble, SIAM J. Sci. Comp., v. 14, n. 5, 1993, the authors consider several variations of multigrid on a 1024 processor nCUBE2. This machine has relatively low latency and high bandwidth compared to its floating point speed. On a 64-by-64 grid of unknowns, where there are only 4 unknowns per processor on the finest grid (quite small for such a large machine!), the efficiency of a V-cycle alone was .02, and for Full Multigrid it was .008. In this case, most of the time some processors are idle.
For a 1024-by-1024 grid, with 1024 unknowns per processor, the efficiencies were .7 and .42, respectively, which are quite reasonable, though Full Multigrid is noticeably less efficient. Still, Full Multigrid is 2.6 times faster than V-cycles alone at solving the problem. The authors explored variations on Full Multigrid in which they performed varying numbers of calls to MGV (a V-cycle) within the loop of FMG, using estimates of the convergence rate and parallel efficiency to pick the optimal number of MGV calls; this increased the efficiencies to .01 and .47, respectively. For the 1024-by-1024 grid, either efficiency (.42 or .47) is quite reasonable.