RESPONSE to the referees' comments for "Implementing a Dynamic Processor Allocation Policy for Multiprogrammed Parallel Applications in the Solaris Operating System," by Kelvin K. Yue and David J. Lilja. Our response follows each of the referees' comments, denoted by [RESPONSE: ].

REFEREE: (A)

IV. DETAILED COMMENTS

The reasons for the recommendation to reject this paper are that it makes a very small technical contribution. Specifically, 1) the LLPC design is simple, straightforward, and seems to be much inferior to other ideas that have appeared in the literature, and 2) the evaluation is minimal and unconvincing. The paper's only merit is that it reports on an implementation within the framework of a commercial system, but in my view this alone does not justify publication in an archival journal - maybe a note in OSR at best.

[RESPONSE: This referee seems to imply that, since LLPC is simple and straightforward, it is somehow inferior to other ideas that have appeared in the literature. In fact, it is this very simplicity that allows our approach to outperform the other proposed mechanisms, as we have shown in our previously published work on LLPC and in this paper. The primary contribution of the present paper, however, is that it describes an implementation of the LLPC mechanism in a commercially important operating system, Solaris, and its corresponding compiler. We feel that the description of this implementation certainly justifies publication in an archival journal.]

LLPC design: As you note, LLPC is a version of dynamic partitioning. Your algorithm for LLPC is essentially for the first application to grab all the processors, and then give other applications single processors. There is voluminous literature on dynamic partitioning showing that equipartitioning is better, that is, you should strive for having equal numbers of threads/processors for all applications. It seems trivial to add the number of applications to the number of threads returned by the scheduling control function, and this would allow the LLPC mechanism to change the number of threads used so as to approximate equipartitioning. References for equipartitioning include your [1,5] and:
Deng & Dymond, 8th SPAA, pp. 82-88, 6/96
Parsons & Sevcik, SIGMETRICS '96, pp. 57-67
McCann et al., TOCS 11(2), pp. 146-178, 5/93
Tucker & Gupta, 12th SOSP, pp. 159-166, 12/89

[RESPONSE: When an application starts running on an ideal system, it should be able to use all the available resources it requires because there is no other competition. When competition for resources arises later, the LLPC algorithm will automatically adjust the allocation dynamically to suit the requirements of each application. Statically limiting the number of CPUs an application can use will waste system resources until additional application programs actually begin competing for them. One of the points of this paper is to demonstrate the effectiveness of this dynamic allocation compared to static partitioning.]
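To make the per-loop adjustment described in this response concrete, the sketch below shows the kind of decision a master thread could make at the start of each parallel loop. All of the names here (llpc_choose_nthreads, query_system_load, NUM_CPUS) are hypothetical, and the allocation rule shown (use the processors not currently claimed by other work, never fewer than one) is a simplification made for this sketch; it is not the actual Solaris scheduling control interface or the exact LLPC policy evaluated in the paper.

    /* Hypothetical sketch of a per-loop allocation decision in the spirit
     * of LLPC.  The names below are illustrative only; they are not the
     * actual Solaris or LLPC interfaces described in the paper. */
    #include <stdio.h>

    #define NUM_CPUS 4                 /* processors in the machine (assumed) */

    /* Stand-in for the extended scheduling-control query that reports how
     * many runnable threads currently belong to other jobs. */
    static int query_system_load(void)
    {
        return 1;                      /* e.g., one competing sequential job */
    }

    /* Decide how many threads to use for the next parallel loop: take the
     * processors not claimed by other work, but never fewer than one. */
    static int llpc_choose_nthreads(void)
    {
        int free_cpus = NUM_CPUS - query_system_load();

        if (free_cpus < 1)        free_cpus = 1;
        if (free_cpus > NUM_CPUS) free_cpus = NUM_CPUS;
        return free_cpus;
    }

    int main(void)
    {
        printf("threads for this loop: %d\n", llpc_choose_nthreads());
        return 0;
    }

Because the decision is re-evaluated at the start of every parallel loop, an application that obtains all of the processors for one loop may receive only one for the next loop if other jobs have arrived in the meantime.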
In addition, your design wastes significant resources by leaving spinning threads in sequential parts of each application. This is an inherent deficiency of LLPC where changes are only made at the beginning of each loop. You should either try harder to solve it (e.g., by a light-weight block-resume mechanism) or at least evaluate its effect in great detail.

[RESPONSE: We disagree that our mechanism wastes significant resources. In fact, by dynamically allocating resources, we use the available resources more efficiently than the static partitioning mechanisms this referee appears to advocate above. If a CPU is idle, it executes an idle thread, which simply runs a loop checking for new work from the work queue. Instead of releasing the thread to the operating system, which would incur very high overheads, this approach keeps the CPU under our control so that it is ready for the next piece of work from this application.]
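As a rough illustration of the idle-thread behavior described in this response, the sketch below shows a worker that, when it has no loop iterations to execute, simply polls a per-application work queue rather than returning the processor to the operating system. The queue layout and all of the names are invented for this illustration; they are not the data structures of Sun's parallelizing-compiler runtime.

    /* Illustrative-only sketch of a spinning idle thread.  The work-queue
     * structure and all names are hypothetical; they are not the actual
     * runtime data structures of Sun's parallelizing compiler. */
    #include <stdatomic.h>

    struct work_queue {
        atomic_int pending;                /* loop chunks waiting to be run    */
        atomic_int shutdown;               /* set when the application exits   */
        void     (*run_chunk)(int chunk);  /* body of the current parallel loop */
    };

    /* An idle worker never blocks in the kernel: it spins, checking the
     * application's own work queue, so the CPU stays under the application's
     * control and can start the next parallel loop immediately. */
    void idle_worker(struct work_queue *q)
    {
        while (!atomic_load(&q->shutdown)) {
            int n = atomic_load(&q->pending);
            if (n > 0 && atomic_compare_exchange_weak(&q->pending, &n, n - 1))
                q->run_chunk(n - 1);       /* execute one chunk of the loop */
            /* otherwise: no work available yet, keep spinning */
        }
    }

The trade-off the referee raises is visible in the sketch: the spinning CPU performs no useful work while idle, but it avoids the kernel overhead of blocking and re-waking the thread between parallel loops.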
Evaluation: Real systems do not execute a fixed static workload of two chosen applications. In real systems, applications are submitted at random times, and then they terminate. At any given instant, you can take a snapshot and see 0 or 1 or 2 or 3 or x applications, but over time the system is highly dynamic. Such dynamics have a strong effect on processor allocation, especially so in a system that uses a "first app gets all" allocation rule. Therefore you MUST conduct such a dynamic execution in order to evaluate LLPC, basing the arrivals, job sizes, and runtimes on data from execution logs from real machines.

Sect 4/Fig 2: it seems that another thread was created at about time 3 and terminated at about time 13. Is this so? How was it arranged?

[RESPONSE: Although the total number of applications in our experiments is fixed and static, the parallel and sequential portions of those applications vary dynamically during the execution of the programs. This unpredictable variation produces a very dynamic workload, not the simple static workload that this referee suggests we measured.]

Table 1: can you also provide info on parallel/serial fractions of the application?

[RESPONSE: The parallel/serial fractions of the application programs are shown implicitly in terms of the speedup in the last column of Table 1.]

Introduction: I disagree with the sweeping comment that gang scheduling reduces utilization; see the following paper: Feitelson & Jette, LNCS 1291, pp. 238-261, 4/97.

[RESPONSE: While scheduling is an area of ongoing research, there are numerous published results that support this comment. See, for instance, the papers we referenced, including A. Tucker's Ph.D. thesis from Stanford University, and C. Severance et al., ICPP, pp. I:24-31, 1995, among others.]

Finally, signposting is always a problem, but a sentence like "The final section concludes the paper" is really redundant...

[RESPONSE: This seems to be a minor comment about stylistic differences.]

---------------------------------------------------------------------------

REFEREE: (B)

IV. DETAILED COMMENTS

In this paper, the authors describe a mechanism for supporting dynamic processor allocation policies for parallel applications on multiprogrammed SMPs running the Solaris operating system. This is not the first work on support for dynamic partitioning policies on SMPs. Previously, McCann et al. (ACM TOCS 11(2), 1993) and Andrew Tucker (in his 1993 dissertation, which the authors cite) have described how dynamic partitioning can be supported on SMPs. The authors' approach is similar to the approaches described in these papers. The main contribution of this work is extending the Solaris scheduling control mechanism and using it for implementing a dynamic processor allocation policy. As such, while the paper describes good work, I do not believe the research contribution is sufficient to justify publication in IEEE TPDS. I recommend that the authors re-submit the paper to a journal such as Software: Practice & Experience.

[RESPONSE: Which we have done, although Concurrency: Practice & Experience is a better choice.]

Some additional comments:

1) Threads are considered as being in three states: running, spinning and sleeping. How does the scheduling mechanism implemented handle thread blocking due to user I/O and page faults?

[RESPONSE: Running, spinning and sleeping are "higher level" states in that a thread is still considered to be in the running state by the OS when it is blocked due to system level activities, such as page faults. The OS handles this situation in the normal fashion. (An illustrative sketch of these states follows this list of comments.)]

2) How does the scheduling policy handle fairness in delivering resources to competing jobs?

[RESPONSE: One of the important issues in designing a processor allocation policy is the fairness of the policy. LLPC would seem to favor applications that reach their parallel section first by allowing them to create as many processes as they require, while later-arriving applications might not be allowed to create as many threads as needed to achieve their best performance. However, our experiments show that no single application ever monopolizes all of the processors. If an application can use only a single processor for its current parallel loop because all the other processors are being used, it is very likely that these processors will be available for the application's next parallel loop. In a subsequent study, we have shown that 80% of the time an application is able to obtain the best allocation for its parallel execution. Thus, fairness occurs automatically due to the dynamically varying workloads of the individual applications.]

3) There is a typo in Figure 1: destory --> destroy

[RESPONSE: Corrected. Also corrected the same typo in Figure 2.]
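As promised in the response to comment 1) above, the fragment below sketches the three higher-level thread states as the runtime might track them. The enum and its comments are illustrative assumptions only, not the actual data structures of the Solaris implementation.

    /* Illustrative only: the three "higher level" thread states discussed in
     * the response to comment 1).  Kernel-level events such as page faults or
     * I/O waits do not change this state; a thread that blocks in the kernel
     * is still considered RUNNING by the allocation policy. */
    enum llpc_thread_state {
        LLPC_RUNNING,   /* executing loop iterations (it may transparently
                           block in the kernel on page faults or I/O)       */
        LLPC_SPINNING,  /* idle, polling the work queue for the next loop   */
        LLPC_SLEEPING   /* parked by the allocation policy to free a CPU    */
    };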
---------------------------------------------------------------------------

I would strongly accept the paper: [RESPONSE: We agree!] TPDS 107468, "Implementing a Dynamic Processor Allocation Policy for Multiprogrammed Parallel Applications in the Solaris Operating System" by David Lilja.

The paper addresses one of the most important issues remaining in processor allocation, the modification of operating systems to support dynamic resource allocation. The paper is well written, compares their solution with other approaches, describes the algorithms used to dynamically allocate threads, and provides experimental results. The paper also excites a fair number of related questions:

- The choice for number of processors is made independently by each master thread. Can performance be improved if a mechanism is added to support global decisions? An example is to set the number of threads to wake up to leave free a number of threads equal to the number of jobs trying to run on the system. This allows room for other jobs to start.

[RESPONSE: We examined that strategy previously. However, our results indicated that the system resources were better used by allocating them when they are needed instead of saving some for later. This will maintain a high level of performance when the system is not busy.]

- How does the scheduler avoid collective behavior effects in which all master threads attempt to simultaneously acquire more threads? Would a third regime in which the system neither expands nor contracts the number of threads help overall performance?

[RESPONSE: Based on the data from our experiments, no single application ever monopolizes all of the processors. If an application can use only a single processor for its current parallel loop because all of the other processors are being used, it is very likely that these processors will be available for the next parallel loop. In a subsequent study, we have shown that 80% of the time an application is able to obtain the best allocation for its parallel execution.]

- The system partially hides the latency of waking a new thread by continuing to execute with the original number of threads during the wake-up time. Can this latency be further decreased by tying the thread execution to a particular cache location where the data is already loaded?

[RESPONSE: As other researchers have shown, it is possible to further hide the latency by tying threads to caches. However, LLPC does not have the ability to tie the execution of a thread to a particular cache location. It is up to the lower-level kernel to apply that optimization.]

- Is there a way to compare the expected CPU utilization with the Tera MTA approach? Effectively, the system provides a way to map workload onto the CPUs that allows reassignment of idle resources. The difference is that the latency required to start a sleeping thread is much greater than the context switch time of the MTA. If the granularity of the thread is large enough, the difference in latency can be amortized. What is the thread granularity (number of execution cycles) typically compared to the latency of starting a sleeping thread (number of execution cycles)?

[RESPONSE: A comparison with the Tera MTA approach would be a very interesting experiment. However, it is beyond the scope of the present paper. The thread granularity is quite large compared to the latency of starting a sleeping thread, which requires only a few instructions.]

There are a few arithmetic checks that can help the reader of the paper. In Table 3, the average number of threads per loop is shown to be about 3.26 when a sequential process is using on average 0.78 threads (18 sec / 23 sec). I would expect the total to add up to slightly less than 4 threads. Is the difference the start-up time for the sequential process?

[RESPONSE: The thread/loop ratio is 3.26, while the average number of threads for the sequential application is 0.78. That gives a total of 4.04 threads in the system. This value means that during certain times there are more threads than available CPUs. However, LLPC is able to adjust and keep the total close to the number of CPUs. In the time-sharing experiments, however, the total number of threads is 5.0.]
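Restating only the figures already quoted in this exchange, the arithmetic behind the response above is:

    \[ 0.78 \approx \tfrac{18\ \text{s}}{23\ \text{s}}, \qquad 3.26 + 0.78 = 4.04 . \]

That is, the total is slightly more than the 4 threads the referee expected, which is why the response notes that at certain times there are more threads than available CPUs.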
In Table 16, the average number of threads executed per parallel loop is greater than two for each application. This implies that each application is taking advantage of sequential execution that is occurring in the competing application. Can you quantify the amount of time expected in sequential execution and show that this is comparable to the increase in the number of threads for the parallel execution?

[RESPONSE: We did not keep track of the data in such fine detail. Consequently, we are unable to fulfill this request.]

---------------------------------------------------------------------------

REFEREE: (E)

IV. DETAILED COMMENTS

This paper does two things. First, it describes how LLPC -- a previously proposed dynamic processor allocation policy for shared-memory multiprocessors -- is implemented given Sun's operating system and parallelizing compiler. This took
- extending "scheduling control" to obtain overall load information (p. 7)
- adding the thread state "sleeping" (p. 9)
- tuning the LLPC algorithm: delaying the put-to-sleep and wake-up (p. 10)
Second, using synthetic and SPEC95 benchmarks, the resulting performance is compared to time-sharing (Table 3) and static partitioning (Figure 3).

This work is highly interesting. Whereas most work in the area of parallel job scheduling uses analytic or simulation modeling, this paper presents empirical results. The authors' access to and modification of Sun's source code is unique. However, my main points of criticism are
- the disadvantages of dynamic allocation are not even mentioned,
- the experiments are simplistic and somewhat biased, and
- some statements and explanations are poor or even plain wrong!

First, application code has to be malleable. Your code on p. 8 is, but you never tell the reader.

Second, you use one scenario (a parallel and a sequential application) when comparing LLPC to time-sharing and a different scenario (two parallel applications) when comparing LLPC to static partitioning. Thus, your experiments are biased rather than symmetric. I suggest showing results for LLPC, time-sharing and static partitioning for both scenarios. (What about other scenarios: more than two applications, a whole workload? In the best of all worlds, one would also compare to co-scheduling.)

Third and most important, the following must be changed:

* conclusion, second to last sentence: WRONG. LLPC performs better than time-sharing (see Section 4.4) because "parallelism on average" is LOWER, 3.25 instead of 4!!!

* abstract, last sentence: not "obtain a fair share" but "use otherwise idle". LLPC is not necessarily fair, remember "swim grabbing all processors," p. 16. (A discussion of fairness would be interesting, though.)

[RESPONSE: We agree that fairness is an interesting issue, and we examined it in some of our previous evaluations of LLPC (see the cited papers). In the application swim, for instance, the application did not grab all the processors. Rather, it was the other applications that grabbed most of the processors and put swim at a disadvantage. Even so, swim was able to get over 2 processors on average using LLPC.]

Also, p. 16 second paragraph and Fig. 3: you seem to mean AVERAGE slowdown. Make it clear. (But also mention makespan.)

[RESPONSE: The slowdown factor is not defined as the average slowdown. Rather, it is defined as the slowdown of each particular application. (A formula making this explicit follows this list of items.)]

* section 4, first paragraph, second to last sentence: indicate that between the 2nd and 8th parallel loops, the load is increased

[RESPONSE: It is not clear what the referee was referring to here.]

* p. 15, second paragraph: explain the reasons for the increase in execution time

[RESPONSE: The reason for the increase in execution time is the presence of the sequential application in the system. This has been clearly explained in the text.]

* Table 4, caption: change to "avg. # threads executed per parallel loop for application in row when using LLPC to execute simultaneously with application in column"

* typos: p. 4, third paragraph: two periods; section 4, third line: omit "to"; conclusion, last line: "than"; reference 1: "Feb."

[RESPONSE: These have been corrected.]
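As noted in the response to the slowdown comment above, one way to state the definition explicitly (the notation is introduced here for clarity and is not necessarily the notation used in the paper) is:

    \[ \text{slowdown}(A) \;=\; \frac{T_{\text{shared}}(A)}{T_{\text{dedicated}}(A)} , \]

where T_dedicated(A) is the execution time of application A when it runs alone on the machine and T_shared(A) is its execution time when it runs as part of the multiprogrammed workload. Figure 3 reports this ratio separately for each application rather than averaging over the applications.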
Further suggestions:
- p. 4: add a picture explaining dynamic partitioning
- no ideas for future work; how about considering alternatives to the even distribution of work (p. 7), further investigating the trade-off regarding wake-up overhead (p. 9), adding thread stealing (p. 10), or further investigating the problem with system daemons (p. 14)

[RESPONSE: We decided against incorporating these comments.]

---------------------------------------------------------------------------

REFEREE: (G)

IV. DETAILED COMMENTS

This paper describes a system (including a parallelizing compiler and modifications to a commercial operating system) that allows parallel applications running on a shared-memory multiprocessor to adjust their degree of parallelism dynamically, in response to changes in system load.

The greatest strength of the paper is its practicality. The system has been implemented in a commercial development environment (Solaris) and is apparently intended to be included in future releases, if it hasn't been already. Most companies with similar systems don't take the time to present their work in academic journals; this case is exceptional, and should be commended.

Another strength is that the presentation of introductory material is clear and complete (almost to a fault). Although the amount of background may be unnecessary for many readers, I think it is a good idea to err on the side of explaining too much, especially in a field where some terminology is still ambiguous.

The only weakness of the paper is that the discussion of related work (and the bibliography) is mostly limited to commercial systems, with relatively little discussion of academic and theoretical work. For example, it might be worth discussing the pros and cons of letting applications choose their own number of threads, compared to a centralized system. On page 16, the authors point out a failure mode in which one application gets a head start and allocates all the processors, imposing a significant slowdown on another application. This seems like a serious limitation; I think it deserves more discussion, especially in the context of existing research on centralized allocation strategies.

[RESPONSE: See the responses to this same comment above.]

The following are more specific comments, some of which are stylistic or typographical:

Abstract: Where LLPC is introduced, it would be good to make it clearer that this is a policy proposed by the same authors in prior work.

Introduction: The first sentence is a bit circular, and may not be meaningful. It is not clear what "fair" means in this context, and it is not necessarily true that fairness is necessary for minimal execution times or high throughput. In fact, the authors may not want to bring fairness up at all, since their policy does not address the issue, at least as I understand the word.

Section 2.2: The discussion of processor scheduling says that the scheduler chooses the thread from the beginning of the queue and puts it at the end. That's not necessarily true, since lots of systems either have multiple queues or priorities within the queue. It might be good to describe a more general system, especially since the system in question actually has more than one queue.

On page 4, there are two periods at the end of a sentence.

Last paragraph on page 4: I would remove the phrase "so that the system load is light".

On page 5, be sure to define SPMD.

On page 6, there is an n-dash (-- in LaTeX) where there should be an m-dash (---).

[RESPONSE: We have incorporated the majority of the above suggestions into the paper.]
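With respect to the Section 2.2 comment above, the more general scheduler the referee describes could be illustrated along the following lines. This is a generic sketch of a multi-level run queue written only for this response (only the dequeue side is shown); it does not reproduce the actual Solaris dispatcher, and all names and the number of priority levels are assumptions.

    /* Generic sketch of a multi-level run queue, of the kind the referee's
     * Section 2.2 comment alludes to.  Illustrative only; this is not the
     * actual Solaris dispatcher. */
    #include <stddef.h>

    #define NPRIO 8                        /* number of priority levels (assumed) */

    struct thread {
        struct thread *next;               /* link within one run queue */
        int            prio;               /* 0 = highest priority      */
    };

    struct runq {
        struct thread *head[NPRIO];        /* one FIFO queue per priority */
        struct thread *tail[NPRIO];
    };

    /* Pick the next thread to run: scan from the highest priority downward
     * and dequeue the first thread found, rather than always taking the
     * head of a single global queue. */
    struct thread *pick_next(struct runq *rq)
    {
        for (int p = 0; p < NPRIO; p++) {
            struct thread *t = rq->head[p];
            if (t != NULL) {
                rq->head[p] = t->next;
                if (rq->head[p] == NULL)
                    rq->tail[p] = NULL;
                return t;
            }
        }
        return NULL;                       /* nothing runnable */
    }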
On page 9, the discussion of the "new" thread state (sleeping) is confusing because I was under the impression that all systems supported a "waiting" or "blocked" state, and I didn't understand how "sleeping" was different from these. Also, there is another n-dash here.

[RESPONSE: Sleeping is similar to the waiting or blocked state. In OS terminology, however, a thread spins when it is in the waiting or blocked state. We use the term "sleeping" to avoid confusion with standard OS terminology.]

On page 12, I was a little surprised at the age of the system being used for testing. Has this paper been in the works for a while?

[RESPONSE: Yes, unfortunately, it has been in the review process for an unusually long time!]

Section 4.3: The explanation for the slowdown---sensitivity to background load caused by daemons---doesn't seem to explain the data. The number of threads per loop is only 0.1% below the maximum, which means that the program almost never runs on fewer than 4 processors. On the other hand, the run times increase by around 1.5%. There must be another source of overhead (although it is certainly not unacceptable).

[RESPONSE: The run times of the daemons are very small. The overhead of LLPC really is 1.5%.]

Section 4.4: The authors use a synthetic workload to evaluate the ability of LLPC to react dynamically to changing system load. The synthetic program runs for 18 seconds and sleeps for 5. The authors don't discuss where these numbers come from, but it is clear that the interval between load changes is VERY long compared to the granularity of the loops in the applications (for example, su2cor runs 191,000 loops in 241 seconds, or about 1 millisecond per loop). The number of load changes during the execution is about ten, which means that the program runs 19,100 loops between load changes (not counting load changes due to daemons). It would be useful for the authors to discuss these relationships, and maybe do some experiments to test the ability of the system to adapt to load changes with higher frequency.

[RESPONSE: The individual time for a loop may be as small as 1 msec, but loops can run consecutively for a few seconds. We agree that it would be nice to further elaborate these relationships, but it is no longer feasible to carry out the necessary experiments.]

As I already mentioned, page 16 reveals a failure mode in which processors are not allocated to competing applications fairly. This is a real problem with decentralized allocation, and should be discussed.

Section 5: I feel strongly that the Conclusion section should contain only conclusions (things we didn't know when we started the paper) and no repetition of introductory material. Thus, I think the first two sentences of this section should be removed. The rest of the section reads just fine without them.

[RESPONSE: We have a difference of opinion on this minor stylistic point.]