Date: Fri, 10 Oct 1997 10:24:25 -0800
To: cpande@hamlet.caltech.edu, gcf@npac.syr.edu
From: "James R. McGraw"
Subject: Referee's Report: C 370
Cc: jmcgraw@llnl.gov

Referee Report for Concurrency: Practice and Experience

Paper Id:    C 370
Paper Title: Adaptive Parallelism in Compiler-Parallelized Code
Authors:     Hall, Martonosi
Referee:     James R. McGraw

Address:     Lawrence Livermore National Laboratory
             P.O. Box 808, Mail Stop 561
             Livermore, CA 94551
             (510) 422-0541
             jmcgraw@llnl.gov

Referee Recommendation: Do not accept the paper. It requires a major revision before reconsideration.

Referee Comments for Editor only:

=====================================================================

Referee Report for Concurrency: Practice and Experience

Paper Id:    C 370
Paper Title: Adaptive Parallelism in Compiler-Parallelized Code
Authors:     Hall, Martonosi

Referee Comments for Author and Editor:

The focus of this research is to examine the potential benefit of dynamically controlling the number of processors assigned to an application during its execution. The argument is that, during execution, an application's ability to exploit concurrency can vary substantially; rather than having processors sit idle, it could be advantageous to reallocate them to other jobs. The paper examines a particular context for studying the question (compiler-parallelized code, i.e., loop-level analysis) and a particular type of allocation algorithm (mostly ad hoc, but with a rationale behind it). It performs a very limited initial study and concludes that there is some potential benefit from this line of work.

The analysis in the paper covers five "applications": four from the SPECfp95 benchmark suite and a sparse conjugate gradient solver from the NAS benchmark set. All are compiled by the Stanford SUIF compiler for the SGI PowerChallenge, using a maximum of 14 processors. Processors are allocated dynamically on a per-loop basis within each code, using processor utilization (speedup) as the criterion for adding or taking away processors. The policy algorithm tries not to overreact to changes in the speedup achieved by each different execution of a loop. For the particular test cases (running two different codes simultaneously), the total execution-time performance gain ranged from +33% to -22%, with most of the cases showing at least some improvement. The baseline for comparison was running the two codes one after the other.

My reactions to this work are mixed. At a high level, I am not optimistic about the cost/benefit tradeoff likely to accrue from this approach. My primary reason is memory usage. Implicit in this work is the assumption that two (or more) different jobs can reside on a processor at the same time. Most really large scientific applications tend to be even more space-limited per processor than compute-limited; each application code uses all of the available memory. Where is the space going to come from?
Second, we are discovering that one of the most critical aspects of performance on these processors is correct management of the caches within each processor. Shifting processors back and forth between two jobs will certainly push the issue of cache management outside the control of an application developer. As a result, I tend to prefer looking at what can be done beyond simple loop-level analysis within one code, to see whether there is greater concurrency that can be exploited to overlap the kinds of problems identified in the paper.

However, looking more closely at the general approach described in the paper, it appears to me that the strategy for dynamic processor allocation could be applied much more broadly than the paper suggests. In particular, if a code exploits both task parallelism and loop parallelism, then the concepts here could be relevant to dynamic scheduling within a single job; that might be very useful. It would also fit well with the approach the authors have taken, which was to jointly compile two different codes so they could share a common thread scheduler.

As currently presented, I do have a number of specific technical questions regarding the approach taken and the results presented. I believe that all of these issues need to be resolved in a favorable manner.

The combination of information in Table 1 and Figure 1 leads me to believe that I do not completely understand what the authors have done. Based on my interpretation of Figure 1, the speedups shown in Table 1 appear to be low by a significant amount. Let's take the Hydro2d case as an example (the concern applies to all five cases). It is hard to get an exact reading from the bar graph, but I think it is safe to say that:

    4% of seq. execution time has speedup  1
    2% of seq. execution time has speedup 12
   60% of seq. execution time has speedup 13
   34% of seq. execution time has speedup 14

Using these numbers, with T(s) denoting the sequential execution time, I can calculate the speedup for the entire code as:

                               T(s)
   Speedup = ---------------------------------------------------
             .04*T(s) + .02*T(s)/12 + .60*T(s)/13 + .34*T(s)/14

                             1
           = -----------------------------
             .04 + .02/12 + .6/13 + .34/14

           = 8.9

Even if I use a worst-case assumption that all of the buckets are rounded up, meaning I drop my speedup numbers to 11, 12, and 13, the result is still a projected speedup of 8.5. However, Table 1 indicates that the speedup for this code is only 7.8.

For the Su2cor case, my eyeball analysis came up with a projected speedup of 5.3, as opposed to your 2.9. (I used 9% @ 1, 4% @ 5, 3% @ 7, 21% @ 8, 4% @ 9, 19% @ 10, and 40% @ 11.) These differences may not appear to be that great, but I cannot see how you could possibly get down to a speedup of 2.9. As one point of reference, a code that is 25% sequential with the remaining 75% at a speedup of 8 gives an overall speedup of 2.9, and that profile is a lot worse than what Figure 1 shows for Su2cor. It seems to me that all of the speedup numbers in Table 1 are similarly low, based on this type of calculation. Is there some other non-trivial factor that I am missing, such as sequential time in the code that is not spent inside any loop? (Or perhaps the bar charts are not supposed to add up to 100%?) I have written this arithmetic out as a short script below, so my readings of Figure 1 can be checked directly.

On page 5, the authors state that "a loop is always allocated an even number of threads." But on page 6, there is the statement that "adapting from two processors to one processor is treated as a special case... ." Wouldn't the even-number requirement preclude going down to one processor?
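Here is that script. To be clear, it is only my back-of-envelope check: the (fraction of time, speedup) buckets are my own estimates read off the Figure 1 bar charts, not the authors' data, so the results are approximations.

# Projected whole-code speedup from (fraction of sequential time, speedup)
# buckets, as read from the Figure 1 bar charts.  The fractions in each
# list are my estimates and should sum to 1.

def projected_speedup(buckets):
    assert abs(sum(f for f, s in buckets) - 1.0) < 1e-6
    return 1.0 / sum(f / s for f, s in buckets)

hydro2d = [(0.04, 1), (0.02, 12), (0.60, 13), (0.34, 14)]
su2cor  = [(0.09, 1), (0.04, 5), (0.03, 7), (0.21, 8),
           (0.04, 9), (0.19, 10), (0.40, 11)]

print("Hydro2d: %.1f" % projected_speedup(hydro2d))  # about 8.9, vs. 7.8 in Table 1
print("Su2cor:  %.1f" % projected_speedup(su2cor))   # about 5.3, vs. 2.9 in Table 1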
With respect to the experiments and the experimental results, I am less convinced of their meaning and value than the authors are.

First, the program characteristics pose some minor problems because the execution times of the applications vary, in some cases (we are not told which), by as much as a factor of 2. So in the pair-wise matchings there are likely to be long periods when one of the two codes has already finished and the benefit of the overlap is lost. Moreover, the overall performance is clearly limited by the longer-running of the two applications. The authors simply do not give enough information to understand clearly what happened, with respect to the scheduling decisions, in any one run.

Second, I think the experiments also need to include some direct comparisons with what can be gained from a gang-scheduling strategy. The authors are not correct in saying in the conclusions that these results improve on gang scheduling; no real comparisons to gang scheduling are made. In particular, Su2cor and CGM never have any business asking for more than 10 or 11 processors, because they have inadequate concurrency (for the given problem size). How fast would both of these problems be solved if I gave 7 processors to each job on a dedicated basis? My bet is that at 7 processors both would show reasonable speedups and compete well with the 31% improvement the authors achieved.

Third, I get the feeling these are optimistic results, because the thread scheduling is being done within a single task (the two different codes are compiled together so that they use the same thread package). That has got to be more cost-effective than pushing the scheduling out to the operating system. How much more costly will things be in that case, and how will that affect performance? I think this is very hard to answer.

Fourth, I think it is important to see what impact is being made on each individual program's completion time. For example, in these tests, swim and multigrid have little, if anything, to gain in terms of efficient use of processors, but their execution could be slowed by the interactions with the other codes. How bad is that? Notice also that this scheduling strategy has a built-in disincentive to achieving high parallelism in all of a code's loops: the better a code developer does, the more the code is likely to be slowed down by other codes that need coverage for their slow sections.

The more I write, the more I come to the conclusion that I really do not believe the results given, or the conclusions that can be drawn from them. The speedup data in Table 1 is suspect. The varying lengths of the sequential and parallel programs can have a significant impact on the experimental results. And I am not sure it is even possible to run important jobs side by side, because of the memory requirements of large problems. As a result of all these concerns, the best I can recommend to the authors is a major rewrite of the paper to address my concerns. I do not recommend acceptance.