RESPONSE to the referees' comments for "Implementing a Dynamic Processor Allocation Policy for Multiprogrammed Parallel Applications in the Solaris Operating System," by Kelvin K. Yue and David J. Lilja. Our response follows each of the referees' comments, denoted by [RESPONSE: ].

REFEREE: (A)

IV. DETAILED COMMENTS

The reasons for the recommendation to reject this paper are that it makes a very small technical contribution. Specifically, 1) the LLPC design is simple, straightforward, and seems to be much inferior to other ideas that have appeared in the literature, and 2) the evaluation is minimal and unconvincing. The paper's only merit is that it reports on an implementation within the framework of a commercial system, but in my view this alone does not justify publication in an archival journal - maybe a note in OSR at best.

[RESPONSE: This referee seems to imply that, since LLPC is simple and straightforward, it is somehow inferior to other ideas that have appeared in the literature. In fact, it is this very simplicity that allows our approach to outperform the other proposed mechanisms, as we have shown in our previously published work on LLPC and in this paper. The primary contribution of the present paper, however, is that it describes an implementation of the LLPC mechanism in a commercially important operating system, Solaris, and its corresponding compiler. We feel that the description of this implementation certainly justifies publication in an archival journal.]

LLPC design: As you note, LLPC is a version of dynamic partitioning. Your algorithm for LLPC is essentially for the first application to grab all the processors, and then give other applications single processors. There is voluminous literature on dynamic partitioning showing that equipartitioning is better, that is, you should strive for having equal numbers of threads/processors for all applications. It seems trivial to add the number of applications to the number of threads returned by the scheduling control function, and this would allow the LLPC mechanism to change the number of threads used so as to approximate equipartitioning. References for equipartitioning include your [1,5] and:
Deng & Dymond, 8th SPAA, pp. 82-88, 6/96
Parsons & Sevcik, SIGMETRICS '96, pp. 57-67
McCann et al., TOCS 11(2), pp. 146-178, 5/93
Tucker & Gupta, 12th SOSP, pp. 159-166, 12/89

[RESPONSE: When an application starts running on an ideal system, it should be able to use all the available resources it requires because there is no other competition. When competition for resources arises later, the LLPC algorithm will automatically adjust the allocation dynamically to suit the requirements of each application. Statically limiting the number of CPUs an application can use will waste system resources until additional application programs actually begin competing for them. One of the points of this paper is to demonstrate the effectiveness of this dynamic allocation compared to static partitioning.]
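To make the per-loop adjustment described in this response concrete, the sketch below shows the kind of decision a master thread could make at the start of each parallel loop. All of the names here (llpc_choose_nthreads, query_system_load, NUM_CPUS) are hypothetical, and the allocation rule shown (use the processors not currently claimed by other work, never fewer than one) is a simplification made for this sketch; it is not the actual Solaris scheduling control interface or the exact LLPC policy evaluated in the paper.

    /* Hypothetical sketch of a per-loop allocation decision in the spirit
     * of LLPC.  The names below are illustrative only; they are not the
     * actual Solaris or LLPC interfaces described in the paper. */
    #include <stdio.h>

    #define NUM_CPUS 4                 /* processors in the machine (assumed) */

    /* Stand-in for the extended scheduling-control query that reports how
     * many runnable threads currently belong to other jobs. */
    static int query_system_load(void)
    {
        return 1;                      /* e.g., one competing sequential job */
    }

    /* Decide how many threads to use for the next parallel loop: take the
     * processors not claimed by other work, but never fewer than one. */
    static int llpc_choose_nthreads(void)
    {
        int free_cpus = NUM_CPUS - query_system_load();

        if (free_cpus < 1)        free_cpus = 1;
        if (free_cpus > NUM_CPUS) free_cpus = NUM_CPUS;
        return free_cpus;
    }

    int main(void)
    {
        printf("threads for this loop: %d\n", llpc_choose_nthreads());
        return 0;
    }

Because the decision is re-evaluated at the start of every parallel loop, an application that obtains all of the processors for one loop may receive only one for the next loop if other jobs have arrived in the meantime.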
In addition, your design wastes significant resources by leaving spinning threads in sequential parts of each application. This is an inherent deficiency of LLPC where changes are only made at the beginning of each loop. You should either try harder to solve it (e.g., by a light-weight block-resume mechanism) or at least evaluate its effect in great detail.

[RESPONSE: We disagree that our mechanism wastes significant resources. In fact, by dynamically allocating resources, we use the available resources more efficiently than the static partitioning mechanisms this referee appears to advocate above. If a CPU is idle, it executes an idle thread, which simply runs a loop checking for new work from the work queue. Instead of releasing the thread to the operating system, which would incur very high overheads, this approach keeps the CPU under our control so that it is ready for the next piece of work from this application.]
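As a rough illustration of the idle-thread behavior described in this response, the sketch below shows a worker that, when it has no loop iterations to execute, simply polls a per-application work queue rather than returning the processor to the operating system. The queue layout and all of the names are invented for this illustration; they are not the data structures of Sun's parallelizing-compiler runtime.

    /* Illustrative-only sketch of a spinning idle thread.  The work-queue
     * structure and all names are hypothetical; they are not the actual
     * runtime data structures of Sun's parallelizing compiler. */
    #include <stdatomic.h>

    struct work_queue {
        atomic_int pending;                /* loop chunks waiting to be run    */
        atomic_int shutdown;               /* set when the application exits   */
        void     (*run_chunk)(int chunk);  /* body of the current parallel loop */
    };

    /* An idle worker never blocks in the kernel: it spins, checking the
     * application's own work queue, so the CPU stays under the application's
     * control and can start the next parallel loop immediately. */
    void idle_worker(struct work_queue *q)
    {
        while (!atomic_load(&q->shutdown)) {
            int n = atomic_load(&q->pending);
            if (n > 0 && atomic_compare_exchange_weak(&q->pending, &n, n - 1))
                q->run_chunk(n - 1);       /* execute one chunk of the loop */
            /* otherwise: no work available yet, keep spinning */
        }
    }

The trade-off the referee raises is visible in the sketch: the spinning CPU performs no useful work while idle, but it avoids the kernel overhead of blocking and re-waking the thread between parallel loops.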
Evaluation: Real systems do not execute a fixed static workload of two chosen applications. In real systems, applications are submitted at random times, and then they terminate. At any given instant, you can take a snapshot and see 0 or 1 or 2 or 3 or x applications, but over time the system is highly dynamic. Such dynamics have a strong effect on processor allocation, especially so in a system that uses a "first app gets all" allocation rule. Therefore you MUST conduct such a dynamic execution in order to evaluate LLPC, basing the arrivals, job sizes, and runtimes on data from execution logs from real machines.

Sect 4/Fig 2: it seems that another thread was created at about time 3 and terminated at about time 13. Is this so? How was it arranged?

[RESPONSE: Although the total number of applications in our experiments is fixed and static, the parallel and sequential portions of those applications vary dynamically during the execution of the programs. This unpredictable variation produces a very dynamic workload, not the simple static workload that this referee suggests we measured.]

Table 1: can you also provide info on parallel/serial fractions of the application?

[RESPONSE: The parallel/serial fractions of the application programs are shown implicitly in terms of the speedup in the last column of Table 1.]

Introduction: I disagree with the sweeping comment that gang scheduling reduces utilization; see the following paper: Feitelson & Jette, LNCS 1291, pp. 238-261, 4/97.

[RESPONSE: While scheduling is an area of ongoing research, there are numerous published results that support this comment. See, for instance, the papers we referenced, including A. Tucker's Ph.D. thesis from Stanford University, and C. Severance et al., ICPP, pp. I:24-31, 1995, among others.]

Finally, signposting is always a problem, but a sentence like "The final section concludes the paper" is really redundant...

[RESPONSE: This seems to be a minor comment about stylistic differences.]

---------------------------------------------------------------------------

REFEREE: (B)

IV. DETAILED COMMENTS

In this paper, the authors describe a mechanism for supporting dynamic processor allocation policies for parallel applications on multiprogrammed SMPs running the Solaris operating system. This is not the first work on support for dynamic partitioning policies on SMPs. Previously, McCann et al. (ACM TOCS 11(2), 1993) and Andrew Tucker (in his 1993 dissertation, which the authors cite) have described how dynamic partitioning can be supported on SMPs. The authors' approach is similar to the approaches described in these papers. The main contribution of this work is extending the Solaris scheduling control mechanism and using it for implementing a dynamic processor allocation policy. As such, while the paper describes good work, I do not believe the research contribution is sufficient to justify publication in IEEE TPDS. I recommend that the authors re-submit the paper to a journal such as Software: Practice & Experience.

[RESPONSE: Which we have done, although Concurrency: Practice & Experience is a better choice.]

Some additional comments:

1) Threads are considered as being in three states: running, spinning and sleeping. How does the scheduling mechanism implemented handle thread blocking due to user I/O and page faults?

[RESPONSE: Running, spinning and sleeping are "higher level" states in that a thread is still considered to be in the running state by the OS when it is blocked due to system level activities, such as page faults. The OS handles this situation in the normal fashion. (An illustrative sketch of these states follows this list of comments.)]

2) How does the scheduling policy handle fairness in delivering resources to competing jobs?

[RESPONSE: One of the important issues in designing a processor allocation policy is the fairness of the policy. LLPC would seem to favor applications that reach their parallel section first by allowing them to create as many processes as they require, while later-arriving applications might not be allowed to create as many threads as needed to achieve their best performance. However, our experiments show that no single application ever monopolizes all of the processors. If an application can use only a single processor for its current parallel loop because all the other processors are being used, it is very likely that these processors will be available for the application's next parallel loop. In a subsequent study, we have shown that 80% of the time an application is able to obtain the best allocation for its parallel execution. Thus, fairness occurs automatically due to the dynamically varying workloads of the individual applications.]

3) There is a typo in Figure 1: destory --> destroy

[RESPONSE: Corrected. Also corrected the same typo in Figure 2.]
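As promised in the response to comment 1) above, the fragment below sketches the three higher-level thread states as the runtime might track them. The enum and its comments are illustrative assumptions only, not the actual data structures of the Solaris implementation.

    /* Illustrative only: the three "higher level" thread states discussed in
     * the response to comment 1).  Kernel-level events such as page faults or
     * I/O waits do not change this state; a thread that blocks in the kernel
     * is still considered RUNNING by the allocation policy. */
    enum llpc_thread_state {
        LLPC_RUNNING,   /* executing loop iterations (it may transparently
                           block in the kernel on page faults or I/O)       */
        LLPC_SPINNING,  /* idle, polling the work queue for the next loop   */
        LLPC_SLEEPING   /* parked by the allocation policy to free a CPU    */
    };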
---------------------------------------------------------------------------

I would strongly accept the paper: [RESPONSE: We agree!] TPDS 107468, "Implementing a Dynamic Processor Allocation Policy for Multiprogrammed Parallel Applications in the Solaris Operating System" by David Lilja.

The paper addresses one of the most important issues remaining in processor allocation, the modification of operating systems to support dynamic resource allocation. The paper is well written, compares their solution with other approaches, describes the algorithms used to dynamically allocate threads, and provides experimental results. The paper also excites a fair number of related questions:

- The choice for number of processors is made independently by each master thread. Can performance be improved if a mechanism is added to support global decisions? An example is to set the number of threads to wake up to leave free a number of threads equal to the number of jobs trying to run on the system. This allows room for other jobs to start.

[RESPONSE: We examined that strategy previously. However, our results indicated that the system resources were better used by allocating them when they are needed instead of saving some for later. This will maintain a high level of performance when the system is not busy.]

- How does the scheduler avoid collective behavior effects in which all master threads attempt to simultaneously acquire more threads? Would a third regime in which the system neither expands nor contracts the number of threads help overall performance?

[RESPONSE: Based on the data from our experiments, no single application ever monopolizes all of the processors. If an application can use only a single processor for its current parallel loop because all of the other processors are being used, it is very likely that these processors will be available for the next parallel loop. In a subsequent study, we have shown that 80% of the time an application is able to obtain the best allocation for its parallel execution.]

- The system partially hides the latency of waking a new thread by continuing to execute with the original number of threads during the wake-up time. Can this latency be further decreased by tying the thread execution to a particular cache location where the data is already loaded?

[RESPONSE: As other researchers have shown, it is possible to further hide the latency by tying threads to caches. However, LLPC does not have the ability to tie the execution of a thread to a particular cache location. It is up to the lower-level kernel to apply that optimization.]

- Is there a way to compare the expected CPU utilization with the Tera MTA approach? Effectively, the system provides a way to map workload onto the CPUs that allows reassignment of idle resources. The difference is that the latency required to start a sleeping thread is much greater than the context switch time of the MTA. If the granularity of the thread is large enough, the difference in latency can be amortized. What is the thread granularity (number of execution cycles) typically compared to the latency of starting a sleeping thread (number of execution cycles)?

[RESPONSE: A comparison with the Tera MTA approach would be a very interesting experiment. However, it is beyond the scope of the present paper. The thread granularity is quite large compared to the latency of starting a sleeping thread, which requires only a few instructions.]

There are a few arithmetic checks that can help the reader of the paper. In Table 3, the average number of threads per loop is shown to be about 3.26 when a sequential process is using on average 0.78 threads (18 sec / 23 sec). I would expect the total to add up to slightly less than 4 threads. Is the difference the start-up time for the sequential process?

[RESPONSE: The thread/loop ratio is 3.26, while the average number of threads for the sequential application is 0.78. That gives a total of 4.04 threads in the system. This value means that during certain times there are more threads than available CPUs. However, LLPC is able to adjust and keep the total close to the number of CPUs. In the time-sharing experiments, however, the total number of threads is 5.0.]
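Restating only the figures already quoted in this exchange, the arithmetic behind the response above is:

    \[ 0.78 \approx \tfrac{18\ \text{s}}{23\ \text{s}}, \qquad 3.26 + 0.78 = 4.04 . \]

That is, the total is slightly more than the 4 threads the referee expected, which is why the response notes that at certain times there are more threads than available CPUs.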
In Table 16, the average number of threads executed per parallel loop is greater than two for each application. This implies that each application is taking advantage of sequential execution that is occurring in the competing application. Can you quantify the amount of time expected in sequential execution and show that this is comparable to the increase in the number of threads for the parallel execution?

[RESPONSE: We did not keep track of the data in such fine detail. Consequently, we are unable to fulfill this request.]

---------------------------------------------------------------------------

REFEREE: (E)

IV. DETAILED COMMENTS

This paper does two things. First, it describes how LLPC -- a previously proposed dynamic processor allocation policy for shared-memory multiprocessors -- is implemented given Sun's operating system and parallelizing compiler. This took
- extending "scheduling control" to obtain overall load information (p. 7)
- adding the thread state "sleeping" (p. 9)
- tuning the LLPC algorithm: delaying the put-to-sleep and wake-up (p. 10)
Second, using synthetic and SPEC95 benchmarks, the resulting performance is compared to time-sharing (Table 3) and static partitioning (Figure 3).

This work is highly interesting. Whereas most work in the area of parallel job scheduling uses analytic or simulation modeling, this paper presents empirical results. The authors' access to and modification of Sun's source code is unique. However, my main points of criticism are
- the disadvantages of dynamic allocation are not even mentioned,
- the experiments are simplistic and somewhat biased, and
- some statements and explanations are poor or even plain wrong!

First, application code has to be malleable. Your code on p. 8 is, but you never tell the reader.

Second, you use one scenario (a parallel and a sequential application) when comparing LLPC to time-sharing and a different scenario (two parallel applications) when comparing LLPC to static partitioning. Thus, your experiments are biased rather than symmetric. I suggest showing results for LLPC, time-sharing and static partitioning for both scenarios. (What about other scenarios: more than two applications, a whole workload? In the best of all worlds, one would also compare to co-scheduling.)

Third and most important, the following must be changed:

* conclusion, second to last sentence: WRONG. LLPC performs better than time-sharing (see Section 4.4) because "parallelism on average" is LOWER, 3.25 instead of 4!!!

* abstract, last sentence: not "obtain a fair share" but "use otherwise idle". LLPC is not necessarily fair, remember "swim grabbing all processors," p. 16. (A discussion of fairness would be interesting, though.)

[RESPONSE: We agree that fairness is an interesting issue, and we examined it in some of our previous evaluations of LLPC (see the cited papers). In the application swim, for instance, the application did not grab all the processors. Rather, it was the other applications that grabbed most of the processors and put swim at a disadvantage. Even so, swim was able to get over 2 processors on average using LLPC.]

Also, p. 16 second paragraph and Fig. 3: you seem to mean AVERAGE slowdown. Make it clear. (But also mention makespan.)

[RESPONSE: The slowdown factor is not defined as the average slowdown. Rather, it is defined as the slowdown of each particular application. (A formula making this explicit follows this list of items.)]

* section 4, first paragraph, second to last sentence: indicate that between the 2nd and 8th parallel loops, the load is increased

[RESPONSE: It is not clear what the referee was referring to here.]

* p. 15, second paragraph: explain the reasons for the increase in execution time

[RESPONSE: The reason for the increase in execution time is the presence of the sequential application in the system. This has been clearly explained in the text.]

* Table 4, caption: change to "avg. # threads executed per parallel loop for application in row when using LLPC to execute simultaneously with application in column"

* typos: p. 4, third paragraph: two periods; section 4, third line: omit "to"; conclusion, last line: "than"; reference 1: "Feb."

[RESPONSE: These have been corrected.]
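As noted in the response to the slowdown comment above, one way to state the definition explicitly (the notation is introduced here for clarity and is not necessarily the notation used in the paper) is:

    \[ \text{slowdown}(A) \;=\; \frac{T_{\text{shared}}(A)}{T_{\text{dedicated}}(A)} , \]

where T_dedicated(A) is the execution time of application A when it runs alone on the machine and T_shared(A) is its execution time when it runs as part of the multiprogrammed workload. Figure 3 reports this ratio separately for each application rather than averaging over the applications.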
Further suggestions:
- p. 4: add a picture explaining dynamic partitioning
- no ideas for future work; how about considering alternatives to the even distribution of work (p. 7), further investigating the trade-off regarding wake-up overhead (p. 9), adding thread stealing (p. 10), or further investigating the problem with system daemons (p. 14)

[RESPONSE: We decided against incorporating these comments.]

---------------------------------------------------------------------------

REFEREE: (G)

IV. DETAILED COMMENTS

This paper describes a system (including a parallelizing compiler and modifications to a commercial operating system) that allows parallel applications running on a shared-memory multiprocessor to adjust their degree of parallelism dynamically, in response to changes in system load.

The greatest strength of the paper is its practicality. The system has been implemented in a commercial development environment (Solaris) and is apparently intended to be included in future releases, if it hasn't been already. Most companies with similar systems don't take the time to present their work in academic journals; this case is exceptional, and should be commended.

Another strength is that the presentation of introductory material is clear and complete (almost to a fault). Although the amount of background may be unnecessary for many readers, I think it is a good idea to err on the side of explaining too much, especially in a field where some terminology is still ambiguous.

The only weakness of the paper is that the discussion of related work (and the bibliography) is mostly limited to commercial systems, with relatively little discussion of academic and theoretical work. For example, it might be worth discussing the pros and cons of letting applications choose their own number of threads, compared to a centralized system. On page 16, the authors point out a failure mode in which one application gets a head start and allocates all the processors, imposing a significant slowdown on another application. This seems like a serious limitation; I think it deserves more discussion, especially in the context of existing research on centralized allocation strategies.

[RESPONSE: See the responses to this same comment above.]

The following are more specific comments, some of which are stylistic or typographical:

Abstract: Where LLPC is introduced, it would be good to make it clearer that this is a policy proposed by the same authors in prior work.

Introduction: The first sentence is a bit circular, and may not be meaningful. It is not clear what "fair" means in this context, and it is not necessarily true that fairness is necessary for minimal execution times or high throughput. In fact, the authors may not want to bring fairness up at all, since their policy does not address the issue, at least as I understand the word.

Section 2.2: The discussion of processor scheduling says that the scheduler chooses the thread from the beginning of the queue and puts it at the end. That's not necessarily true, since lots of systems either have multiple queues or priorities within the queue. It might be good to describe a more general system, especially since the system in question actually has more than one queue.

On page 4, there are two periods at the end of a sentence.

Last paragraph on page 4: I would remove the phrase "so that the system load is light".

On page 5, be sure to define SPMD.

On page 6, there is an n-dash (-- in LaTeX) where there should be an m-dash (---).

[RESPONSE: We have incorporated the majority of the above suggestions into the paper.]
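With respect to the Section 2.2 comment above, the more general scheduler the referee describes could be illustrated along the following lines. This is a generic sketch of a multi-level run queue written only for this response (only the dequeue side is shown); it does not reproduce the actual Solaris dispatcher, and all names and the number of priority levels are assumptions.

    /* Generic sketch of a multi-level run queue, of the kind the referee's
     * Section 2.2 comment alludes to.  Illustrative only; this is not the
     * actual Solaris dispatcher. */
    #include <stddef.h>

    #define NPRIO 8                        /* number of priority levels (assumed) */

    struct thread {
        struct thread *next;               /* link within one run queue */
        int            prio;               /* 0 = highest priority      */
    };

    struct runq {
        struct thread *head[NPRIO];        /* one FIFO queue per priority */
        struct thread *tail[NPRIO];
    };

    /* Pick the next thread to run: scan from the highest priority downward
     * and dequeue the first thread found, rather than always taking the
     * head of a single global queue. */
    struct thread *pick_next(struct runq *rq)
    {
        for (int p = 0; p < NPRIO; p++) {
            struct thread *t = rq->head[p];
            if (t != NULL) {
                rq->head[p] = t->next;
                if (rq->head[p] == NULL)
                    rq->tail[p] = NULL;
                return t;
            }
        }
        return NULL;                       /* nothing runnable */
    }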
On page 9, the discussion of the "new" thread state (sleeping) is confusing because I was under the impression that all systems supported a "waiting" or "blocked" state, and I didn't understand how "sleeping" was different from these. Also, there is another n-dash here.

[RESPONSE: Sleeping is similar to the waiting or blocked state. In OS terminology, however, a thread spins when it is in the waiting or blocked state. We use the term "sleeping" to avoid confusion with standard OS terminology.]

On page 12, I was a little surprised at the age of the system being used for testing. Has this paper been in the works for a while?

[RESPONSE: Yes, unfortunately, it has been in the review process for an unusually long time!]

Section 4.3: The explanation for the slowdown---sensitivity to background load caused by daemons---doesn't seem to explain the data. The number of threads per loop is only 0.1% below the maximum, which means that the program almost never runs on fewer than 4 processors. On the other hand, the run times increase by around 1.5%. There must be another source of overhead (although it is certainly not unacceptable).

[RESPONSE: The run times of the daemons are very small. The overhead of LLPC really is 1.5%.]

Section 4.4: The authors use a synthetic workload to evaluate the ability of LLPC to react dynamically to changing system load. The synthetic program runs for 18 seconds and sleeps for 5. The authors don't discuss where these numbers come from, but it is clear that the interval between load changes is VERY long compared to the granularity of the loops in the applications (for example, su2cor runs 191,000 loops in 241 seconds, or about 1 millisecond per loop). The number of load changes during the execution is about ten, which means that the program runs 19,100 loops between load changes (not counting load changes due to daemons). It would be useful for the authors to discuss these relationships, and maybe do some experiments to test the ability of the system to adapt to load changes with higher frequency.

[RESPONSE: The individual time for a loop may be as small as 1 msec, but loops can run consecutively for a few seconds. We agree that it would be nice to further elaborate these relationships, but it is no longer feasible to carry out the necessary experiments.]

As I already mentioned, page 16 reveals a failure mode in which processors are not allocated to competing applications fairly. This is a real problem with decentralized allocation, and should be discussed.

Section 5: I feel strongly that the Conclusion section should contain only conclusions (things we didn't know when we started the paper) and no repetition of introductory material. Thus, I think the first two sentences of this section should be removed. The rest of the section reads just fine without them.

[RESPONSE: We have a difference of opinion on this minor stylistic point.]