I enclose three referee reports on your paper. We would be pleased to accept it; could you please send me a new version before November 5, 1999? Please also send a memo describing any suggestions of the referees that you did not address. Ignore any aggressive remarks you don't think appropriate, but please tell me. I trust you! Thank you for your help in writing and refereeing papers!

Referee 1 *******************************************************

Paper Number: C424
Title: Performance Measurement of Dynamically Compiled Java Executions
Authors: Tia Newhall and Barton P. Miller

The paper describes Paradyn-J, a Java performance measuring tool. It explains how Paradyn-J copes with dynamically compiled executions, correlating the performance of the bytecode and native code versions of the same method, and also accounting for the dynamic compilation costs. Constructs that limit the effectiveness of dynamic compilation are identified. Using performance data from Paradyn-J, the authors achieved speedups of about 10% using a simulated JIT compiler (i.e., the native code was produced by hand, rather than by a real JIT compiler). Both synthetic benchmarks and a neural net application are used to generate results.

Recommendation: Major revision. The authors do not show how the results they obtained with a simulated JIT compiler relate to any real JIT compiler, and the results may be significantly better than could be obtained with a real JIT compiler (see major comment 1 below) or statistically insignificant (see major comment 2 below).

Comments for the authors
------------------------

Major comments:

1. The major flaw I see with this paper is the use of a simulated JIT compiler. This is justified on page 6 by saying that no source code was available for ExactVM or HotSpot. However, this is not sufficient justification. Source code *is* available for Sun's Reference Implementation of the JDK, which includes a JIT compiler. Also, Sun is not the only producer of JIT compilers; the Kaffe JVM, from Transvirtual, includes a JIT and is available under the GNU General Public License. Furthermore, simulating the JIT compiler rather than using a real one introduces factors that need to be controlled for, and the paper offers no evidence that this was done. These factors include:

   a) The quality of the generated code. The paper seems to indicate that the compiled code was written and tuned by hand. JIT compilers are not known for producing high-quality native code. Hence, the results may have been skewed in the authors' favor by running native code that was significantly better than a JIT compiler would produce.

   b) Memory usage. The simulator, as described, will use less memory than a real JIT compiler, since the JIT will end up with both the bytecode and native code versions AND also need space for holding information during the compile. This, again, may skew the results in the authors' favor.

   c) Dynamic library costs. The simulator loads the precompiled code from a dynamic library, whereas a JIT would attach the native code to the class file using attributes (a hypothetical sketch of the dynamic-library route follows this comment). The costs of managing the linking of a dynamic library versus looking up the compiled code in the class file are unknown.

   All of this makes me suspect that the results obtained are better than one could expect with a real JIT, but the paper gives insufficient information to tell for certain. Also, this means that the statement on lines 3-4 of page 11 is false: it was demonstrated for a simulated JIT, but we don't know whether Paradyn-J can associate performance data in that way with a real JIT.
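   To make the distinction in (c) concrete, the dynamic-library route looks roughly like the following at the source level. This is a hypothetical sketch, not code from the paper; the class name, library name, and workload are invented.

    // Hypothetical sketch of a "simulated JIT": the hand-compiled native version
    // of a hot method is shipped in a dynamic library and bound through JNI,
    // rather than being produced and managed by a real JIT at run time.
    public class HotMethod {
        static {
            // Loading and linking the dynamic library is a cost of the simulated approach.
            System.loadLibrary("hotmethod");
        }

        // Hand-compiled native version, resolved from the dynamic library.
        public native double computeNative(int n);

        // Original bytecode version, executed by the interpreter.
        public double compute(int n) {
            double sum = 0.0;
            for (int i = 1; i <= n; i++) {
                sum += 1.0 / i;
            }
            return sum;
        }
    }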
2. The footnotes on pages 4 and 5 refer to validating experiments. It would be good if Table 1 (or some other table) included an explicit control experiment, expanding on those noted in the footnotes. Show me the normal case, so I can understand how well your efforts have paid off. Also, I don't see any confidence intervals or standard deviations. If the standard deviations are high, the differences between the two columns may not be statistically significant.

3. The word "simulated" on page 1 appears to refer to the programs, so I was expecting performance results for benchmark-type programs rather than real applications. Then on page 3, the word "simulated" is used again; this time, it appears to refer to the execution of the programs, and now I was expecting performance results from a simulated Java system. Finally, on page 6 I learn that the word "simulated" refers to the dynamic (JIT) compiler. Make the referent of the word "simulated" clear.

4. I don't understand the last paragraph on page 10. Does this imply that your technique only works on one method at a time?

5. In section 3.3 (page 7), why not use a custom class loader to insert the instrumentation when the class is loaded (a sketch of this approach follows these numbered comments)? That way you could avoid copying the entire method to the heap, saving on memory costs. This approach would also remove the problems mentioned at the bottom of page 8. Also, note that the objections to such preinstrumentation raised on page 16 are not insurmountable: one could certainly leave space for instrumentation in each loaded class file (with NOPs) and do the instrumentation later, for example. Only the last objection on page 16 is really an objection to preinstrumentation.

6. Also on page 16, preinstrumentation by certain tools is criticized in a footnote, which declares that these tools are not thread-aware. But then on page 17, the authors admit that Paradyn-J does not properly handle multithreaded programs either! If you meant that the inability to handle multithreading only matters for preinstrumenting JIT compilers, then say so, but be prepared to justify it.

7. Section 2 seems too long. Is it really necessary to divide cases 1 and 2 after distinguishing that VM interactions are a sort of runtime library? For example, code for I/O and object creation can both be seen as external to the program and do not appear to need separate cases.

8. In Section 3, it would have been nice to see recommended JVMPI extensions.
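For the record, the load-time approach suggested in comment 5 amounts to something like the following minimal sketch. The bytecode-rewriting step (insertProbes) is a hypothetical placeholder for whatever actually inserts the entry/exit instrumentation or NOP padding.

    import java.io.ByteArrayOutputStream;
    import java.io.IOException;
    import java.io.InputStream;

    // Custom class loader that patches each class as it is loaded, instead of
    // copying whole methods to the heap at instrumentation time.
    public class InstrumentingClassLoader extends ClassLoader {
        protected Class findClass(String name) throws ClassNotFoundException {
            try {
                byte[] original = readClassBytes(name);
                byte[] patched = insertProbes(original);   // hypothetical rewriter
                return defineClass(name, patched, 0, patched.length);
            } catch (IOException e) {
                throw new ClassNotFoundException(name);
            }
        }

        private byte[] readClassBytes(String name) throws IOException {
            InputStream in = getResourceAsStream(name.replace('.', '/') + ".class");
            if (in == null) throw new IOException("class file not found: " + name);
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buf = new byte[4096];
            int n;
            while ((n = in.read(buf)) > 0) {
                out.write(buf, 0, n);
            }
            return out.toByteArray();
        }

        // Placeholder: insert entry/exit probes, or just NOP slots to be filled later.
        private byte[] insertProbes(byte[] classBytes) {
            return classBytes;
        }
    }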
Minor comments:

Replace all instances of "byte-code" with "bytecode", which is the way the word is written in Sun's documentation.
The first part of the paper uses the phrase "dynamic compiler" but not "JIT compiler"; the last part uses "JIT compiler" but not "dynamic compiler". Some kind of consistency would be good.
There are several run-on sentences. Most are the result of using a comma where a semicolon is needed. (For example, the third sentence on page 6 has this problem.)
Abstract, 1st sentence: the word "promises" is quite strong, too strong in my opinion.
Abstract, last sentence: "*The* results of our work are *a* guide ... as *to* what type of ..."
Page 2, line -7: "... of *the* JDK [19]."
Page 5, line 5: "... 100,000 iteration case is due *to* two factors."
Page 5, paragraph beginning "Case 3:", last line: "... three small methods *would* not result ..." (tense agreement).
Page 6, section 3.2: What if the instrumented function begins with a loop? Then there will be a branch back to the instrumentation jump later in the function. What about cache effects? [I realize that none of this concerns the main topic of the paper; the authors should find a way to succinctly answer such concerns, or avoid raising them in the first place.]
Page 7, last full paragraph: While method call costs are unavoidable, since CPU time must be measured, those costs CAN be quantified!
Page 8, line 6: This is the first mention of the SPARC architecture. Mention it earlier, probably somewhere in the Introduction. Also, why did you choose the SPARC architecture instead of the x86 architecture? Why not both?
Page 8, last paragraph: The name do_baseTramp is meaningless to me, and seems to be a weird mixture of C-style underscore-based names and Java-style title-case names.
Page 8, last paragraph: Under what circumstances is it unsafe to instrument a method?
Page 9, 3rd paragraph: Why is this a problem? The first use of the do_baseTramp method will automatically load the class, won't it?
Page 9, 3rd paragraph, line 1: "... Paradyn-J has to get the VM [omit "has"] to load the ..."
Page 10, line 1: "... the class'*s* methods."
Page 10, 2nd full paragraph, line 1: omit the first comma.
Page 10, 1st paragraph of section 4, line 4: "... neural network application*, thereby* improving ..."
Page 11 and later: Choose one of "CPUtime", "CPUTime", and "CPU time" instead of using all three. I suggest the last one.
Page 11, 1st paragraph, line -4: choose one of "each" and "all".
Page 13, Table 2: How many trials were performed? What is the standard deviation or confidence interval?
Page 14, Table 3: Why is there a blank space for case 3? Was the cost 0? Was it nonzero but negligible?
Page 15, line 1: "... (4.9 us to 6.7 us per call)*;* however, ..."
Page 15, line 15: omit the apostrophe in "developer's".
Page 16, last full paragraph: Why no company name for JProbe, but a company name for OptimizeIt? Be consistent.
References: check capitalization of everything; it is inconsistent.
References: what is the significance of the dates on some of the references that contain only URLs?
Reference 6: There should be an umlaut over the 'o' in Holzle.
Reference 11: There is a missing 's' somewhere in that title.
Reference 15 is useless as a reference. This should probably be a footnote or parenthetical remark.

Referee 2 ************************************************************

Subject: C424 JGSI Review

a) Overall: Good

b) Because quantitative profile information is especially important for runtime optimization such as JIT compilation, your profiling tool is very serviceable. Profile data on the costs of dynamic compilation and native code execution is also very valuable for VM implementers. However, the possible optimizations presented in the paper for application programmers, such as reducing method calls, I/O, and object creation, do not seem to be specific to dynamic compilation. Also, the effectiveness of such optimizations is well known, at least qualitatively. Of course, your tool can detect which methods have to be optimized, in an understandable quantitative form, which is very helpful for application programmers. But the paper would be stronger if you could show more examples that are specific to dynamically compiled code.
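As a purely hypothetical illustration of the kind of application-level tuning Referee 2 is describing (reducing object creation in a hot loop), a before/after example might look like this; the class and method names are invented for the example.

    // Hypothetical example of the well-known "reduce object creation" tuning:
    // the second loop reuses a single StringBuffer instead of allocating a new
    // one on every iteration.
    public class ObjectCreationExample {
        static void process(String s) { /* stand-in for real work */ }

        static void slow(String[] records) {
            for (int i = 0; i < records.length; i++) {
                StringBuffer buf = new StringBuffer();   // one allocation per iteration
                buf.append(records[i]);
                process(buf.toString());
            }
        }

        static void fast(String[] records) {
            StringBuffer buf = new StringBuffer();       // allocated once
            for (int i = 0; i < records.length; i++) {
                buf.setLength(0);                        // reuse instead of reallocate
                buf.append(records[i]);
                process(buf.toString());
            }
        }
    }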
Referee 3 ***************************************************************

Subject: C424 JGSI Review

1. Overall
********

An interesting, illuminating, and important work. The approach to measuring the performance of interpreted versus compiled execution is unique and thoughtful. The examples of the circumstances in which compiling does or does not improve execution are very well chosen. Good additional comments on using performance data to isolate performance problems independent of compilation (such as frequent object creation, heavy I/O, and many small methods). The paper is short on the numbers, statistics, and methodology needed to support some of its specific timing claims (for example, Table 2, Table 3, and the object creation time of 1.57 seconds in paragraph 3, page 14).

2. Comments for Author(s)
*********************

The conversational style is fine, but the frequent use of possessives (e.g., "the class' constantpool" rather than "the constantpool of the class") makes for occasionally awkward sentence structure.

Two important points were big question marks until page 10 and should have been made much earlier: 1) The instrumentation points are method entry, method exit, and method calls. 2) One very good reason for simulating JIT compilation is the possibility that the compiler will reorder or optimize away the instrumentation code.

Two additional issues with replacing the compilation step with a set wait: 1) What statistics were used in determining a valid wait time? 2) What are the timing differences between a stand-alone compilation and a JIT compilation in terms of resource contention, memory utilization, and pre-compilation by the interpreter? It should be simple to set up a test that would compare the simulated compiler with a real JIT and produce interesting results.
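One possible way to set up such a comparison, sketched with an arbitrary placeholder workload: run the same program once with the JIT enabled and once with it disabled, and compare elapsed times. The -Xint flag is assumed to be available on the JVM being tested.

    // Minimal timing harness for comparing a real JIT against pure interpretation.
    // Run as:   java HotLoopTimer          (JIT enabled, the default)
    //           java -Xint HotLoopTimer    (interpreted only, on JVMs that accept -Xint)
    // The workload below is an arbitrary placeholder for a hot application method.
    public class HotLoopTimer {
        static double work(int iterations) {
            double sum = 0.0;
            for (int i = 1; i <= iterations; i++) {
                sum += Math.sqrt(i);
            }
            return sum;
        }

        public static void main(String[] args) {
            long start = System.currentTimeMillis();
            double result = work(10000000);
            long elapsedMillis = System.currentTimeMillis() - start;
            System.out.println("result = " + result + ", elapsed = " + elapsedMillis + " ms");
        }
    }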
* Missing article - The last sentence of the Abstract should be "Results of our work are a guide..."
* Spelling error and number disagreement - The last sentence of bullet #3 on page 2 should be "...help a programmer better understand an application's execution."
* Missing comma - The caption of Figure 1 should be "...compiled executions, methods may be..."
* Run-on sentence - Last sentence on page 2, beginning "Results from our study demonstrate..."
* Time-line and acting-character confusion - Top of page 8: it is unclear which actor changes which instructions, and when.
* Repeated phrase - Top of page 9, the double indirection "...SPARC instrumentation code to generate instrumentation code for..." If this is intentional, it's confusing.
* Extra word - First sentence, para. 3, page 9, should be "...get the VM to load..."
* Incorrect article - Third sentence, para. 3, page 9, should be "An alternative is to find the..."
* Confusing tense and word order - Fifth sentence onward, paras. 3 and 4, page 9. It is unclear why one alternative is preferred, which alternative was actually selected and why, and what function the "type tags associated with AP and VM resources" have, given the "main method" alternative over the "VM" alternative. It is unclear whether the VM was instrumented or not.
* Incorrect term - Last two sentences, para. 1, page 11: both instances of "execution time" should be replaced with "CPU time". In fact, according to the graph, the execution time actually increased with the compiled version. The graph implies that the interpreted code completed 1.05 (1.18 peak) iterations per second, while the compiled code completed 0.523 (0.823 peak) iterations per second. Why is this? JNI overhead? Worth some discussion.
* Missing units - Table 2, page 13: what are the units?
* Missing units - Table 3, page 14: what are the units?
* Missing source - Table 3, page 14, and top of page 15: the MethodCallTime of 2.5 microseconds doesn't seem to come from anywhere.
* Missing source - First sentence, para. 3, page 14: the number 1.57 doesn't seem to come from anywhere.

3. Comments for Editor(s)
********************

None.