Date: Fri, 10 Oct 1997 10:24:25 -0800
To: cpande@hamlet.caltech.edu, gcf@npac.syr.edu
From: "James R. McGraw"
Subject: Referee's Report: C 370
Cc: jmcgraw@llnl.gov

Referee Report for Concurrency: Practice and Experience

Paper Id:    C 370
Paper Title: Adaptive Parallelism in Compiler-Parallelized Code
Authors:     Hall, Martonosi
Referee:     James R. McGraw

Address:     Lawrence Livermore National Laboratory
             P.O. Box 808, Mail Stop 561
             Livermore, CA 94551
             (510) 422-0541
             jmcgraw@llnl.gov

Referee Recommendation: Do not accept the paper. It requires a major revision before reconsideration.

Referee Comments for Editor only:

=====================================================================

Referee Report for Concurrency: Practice and Experience

Paper Id:    C 370
Paper Title: Adaptive Parallelism in Compiler-Parallelized Code
Authors:     Hall, Martonosi

Referee Comments for Author and Editor:

The focus of this research is to examine the potential benefit of dynamically controlling the number of processors assigned to an application during its execution. The argument is that, during execution, an application's ability to exploit concurrency can vary substantially; rather than having processors sit idle, it could be advantageous to reallocate them to other jobs. The paper examines a particular context for studying the question (compiler-parallelized code, i.e., loop-level analysis) and a particular type of allocation algorithm (mostly ad hoc, but with a rationale behind it). It performs a very limited initial study and concludes that there is some potential benefit from this line of work.

The analysis in the paper covers five "applications": four from the SPECfp95 benchmark suite and a sparse conjugate gradient solver from the NAS benchmark set. All are compiled by the Stanford SUIF compiler for the SGI PowerChallenge, using a maximum of 14 processors. Processors are allocated dynamically on a per-loop basis within each code, using processor utilization (speedup) as the criterion for adding or taking away processors. The policy algorithm tries not to overreact to changes in the speedup achieved by each different execution of a loop. For the particular test cases (running two different codes simultaneously), the total execution-time performance gain ranged from +33% to -22%, with most of the cases showing at least some improvement. The baseline for comparison was running the two codes one after the other.

My reactions to this work are mixed. At a high level, I am not optimistic about the cost/benefit tradeoff likely to accrue from this approach. My primary reason is memory usage. Implicit in this work is the assumption that two (or more) different jobs can reside on a processor at the same time. Most really large scientific applications tend to be even more space-limited per processor than compute-limited; each application code uses all of the available memory. Where is the space going to come from?
Second, we are discovering that one of the most critical aspects of performance on these processors is correct management of the caches within each processor. Shifting processors back and forth between two jobs will certainly push the issue of cache management outside the control of an application developer. As a result, I tend to prefer looking at what can be done beyond simple loop-level analysis within one code, to see whether there is greater concurrency that can be exploited to overlap the kinds of problems identified in the paper.

However, looking more closely at the general approach described in the paper, it appears to me that the strategy for dynamic processor allocation could be applied much more broadly than the paper suggests. In particular, if a code exploits both task parallelism and loop parallelism, then the concepts here could be relevant to dynamic scheduling within a single job; that might be very useful. It would also fit well with the approach the authors have taken, which was to jointly compile two different codes so they could share a common thread scheduler.

As currently presented, I do have a number of specific technical questions regarding the approach taken and the results presented. I believe that all of these issues need to be resolved in a favorable manner.

The combination of information in Table 1 and Figure 1 leads me to believe that I do not completely understand what the authors have done. Based on my interpretation of Figure 1, the speedups shown in Table 1 appear to be low by a significant amount. Let's take the Hydro2d case as an example (the concern applies to all five cases). It is hard to get an exact reading from the bar graph, but I think it is safe to say that:

    4% of seq. execution time has speedup  1
    2% of seq. execution time has speedup 12
   60% of seq. execution time has speedup 13
   34% of seq. execution time has speedup 14

Using these numbers, with T(s) denoting the sequential execution time, I can calculate the speedup for the entire code as:

                               T(s)
   Speedup = ---------------------------------------------------
             .04*T(s) + .02*T(s)/12 + .60*T(s)/13 + .34*T(s)/14

                             1
           = -----------------------------
             .04 + .02/12 + .6/13 + .34/14

           = 8.9

Even if I use a worst-case assumption that all of the buckets are rounded up, meaning I drop my speedup numbers to 11, 12, and 13, the result is still a projected speedup of 8.5. However, Table 1 indicates that the speedup for this code is only 7.8.

For the Su2cor case, my eyeball analysis came up with a projected speedup of 5.3, as opposed to your 2.9. (I used 9% @ 1, 4% @ 5, 3% @ 7, 21% @ 8, 4% @ 9, 19% @ 10, and 40% @ 11.) These differences may not appear to be that great, but I cannot see how you could possibly get down to a speedup of 2.9. As one point of reference, a code that is 25% sequential with the remaining 75% at a speedup of 8 gives an overall speedup of 2.9, and that profile is a lot worse than what Figure 1 shows for Su2cor. It seems to me that all of the speedup numbers in Table 1 are similarly low, based on this type of calculation. Is there some other non-trivial factor that I am missing, such as sequential time in the code that is not spent inside any loop? (Or perhaps the bar charts are not supposed to add up to 100%?) I have written this arithmetic out as a short script below, so my readings of Figure 1 can be checked directly.

On page 5, the authors state that "a loop is always allocated an even number of threads." But on page 6, there is the statement that "adapting from two processors to one processor is treated as a special case... ." Wouldn't the even-number requirement preclude going down to one processor?
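Here is that script. To be clear, it is only my back-of-envelope check: the (fraction of time, speedup) buckets are my own estimates read off the Figure 1 bar charts, not the authors' data, so the results are approximations.

# Projected whole-code speedup from (fraction of sequential time, speedup)
# buckets, as read from the Figure 1 bar charts.  The fractions in each
# list are my estimates and should sum to 1.

def projected_speedup(buckets):
    assert abs(sum(f for f, s in buckets) - 1.0) < 1e-6
    return 1.0 / sum(f / s for f, s in buckets)

hydro2d = [(0.04, 1), (0.02, 12), (0.60, 13), (0.34, 14)]
su2cor  = [(0.09, 1), (0.04, 5), (0.03, 7), (0.21, 8),
           (0.04, 9), (0.19, 10), (0.40, 11)]

print("Hydro2d: %.1f" % projected_speedup(hydro2d))  # about 8.9, vs. 7.8 in Table 1
print("Su2cor:  %.1f" % projected_speedup(su2cor))   # about 5.3, vs. 2.9 in Table 1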
With respect to the experiments and the experimental results, I am less convinced of their meaning and value than the authors are.

First, the program characteristics pose some minor problems because the execution times of the applications vary, in some cases (we are not told which), by as much as a factor of 2. So in the pair-wise matchings there are likely to be long periods when one of the two codes has already finished and the benefit of the overlap is lost. Moreover, the overall performance is clearly limited by the longer-running of the two applications. The authors simply do not give enough information to understand clearly what happened, with respect to the scheduling decisions, in any one run.

Second, I think the experiments also need to include some direct comparisons with what can be gained from a gang-scheduling strategy. The authors are not correct in saying in the conclusions that these results improve on gang scheduling; no real comparisons to gang scheduling are made. In particular, Su2cor and CGM never have any business asking for more than 10 or 11 processors, because they have inadequate concurrency (for the given problem size). How fast would both of these problems be solved if I gave 7 processors to each job on a dedicated basis? My bet is that at 7 processors both would show reasonable speedups and compete well with the 31% improvement the authors achieved.

Third, I get the feeling these are optimistic results, because the thread scheduling is being done within a single task (the two different codes are compiled together so that they use the same thread package). That has got to be more cost-effective than pushing the scheduling out to the operating system. How much more costly will things be in that case, and how will that affect performance? I think this is very hard to answer.

Fourth, I think it is important to see what impact is being made on each individual program's completion time. For example, in these tests, swim and multigrid have little, if anything, to gain in terms of efficient use of processors, but their execution could be slowed by the interactions with the other codes. How bad is that? Notice also that this scheduling strategy has a built-in disincentive to achieving high parallelism in all of a code's loops: the better a code developer does, the more the code is likely to be slowed down by other codes that need coverage for their slow sections.

The more I write, the more I come to the conclusion that I really do not believe the results given, or the conclusions that can be drawn from them. The speedup data in Table 1 is suspect. The varying lengths of the sequential and parallel programs can have a significant impact on the experimental results. And I am not sure it is even possible to run important jobs side by side, because of the memory requirements of large problems. As a result of all these concerns, the best I can recommend to the authors is a major rewrite of the paper to address my concerns. I do not recommend acceptance.