I enclose three referee reports on your paper. We would be pleased to accept it; could you please send me a new version before November 5, 1999? Please also send a memo describing any referee suggestions that you did not address. Ignore any aggressive remarks you do not think appropriate, but please tell me. I trust you! Thank you for your help in writing and refereeing papers!

Referee 1
************************************************************

Subject: C434 JGSI Review
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Overall Recommendation: Accept
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

This paper describes the content of, and the design philosophy behind, a benchmark suite for Java Grande applications, and presents the results of benchmarking a half-dozen hardware/JVM combinations. The paper is well written, interesting, and highly relevant to the Java Grande community.

General Comments
~~~~~~~~~~~~~~~~

I have only fairly minor comments on the paper's content.

1. The last paragraph of section 2 touches on "a feature peculiar to Java benchmarking, which is that it is possible to distribute the benchmark without revealing the source code." The importance of this point is lessened by the wide availability of Java decompilers, which produce quite readable Java source code.

2. At the top of the right column on page 2, it is stated that I/O components of the benchmarks have been removed, presumably because they are not considered relevant to Java Grande applications. Yet in the first paragraph of the introduction, large network requirements are considered one of the hallmarks of Grande applications (and perhaps disk usage should be added to that list). Perhaps the paragraph in section 3 meant to refer to "*terminal* I/O"?

3. In the discussion in section 4.1 on the meaning of the temporal and relative performance metrics, I think it would be worth adding an explicit statement that bigger values under both metrics indicate *better* performance.

4. The last paragraph of section 7 (describing how to obtain the benchmarks) does not constitute future work; it should probably be moved to the very beginning of section 5.

Detailed Comments
~~~~~~~~~~~~~~~~~

These comments are just minor nitpicks.

1. The acronyms EPCC and MPI should be spelled out on the first use of each.

2. In section 5, the descriptions of some benchmarks begin with a verb (e.g., "measures", "tests"), while others begin with "This benchmark measures..." or "This benchmark tests...". For uniformity, they should all be changed to follow the same convention. Similarly, some descriptions say "Performance units are..." while others say "Results are reported in...". Finally, some say something like "This kernel/benchmark exercises...", while others simply say something like "Memory and integer intensive."

3. There are a couple of places where a comma appears immediately before a parenthetical remark; these commas should be moved after the closing right parenthesis.

4. Pet peeve: there are many, many instances of the word "which" that should be replaced by "that". For a guide to the correct usage, see the topic on "Which-Hunting" in "A Handbook for Scholars", Mary-Claire van Leunen, Oxford University Press, 1992.

5. Typos and suggested improvements:

Pg 1, col 1, section 1, line 3: "...well outside its original design specifications." -> "...well outside its original design goals."

Pg 1, col 1, line -3 (counted from bottom): "Show that real large scale codes can be..." -> "Show that real, large-scale codes can be..."
Pg 1, col 1, line -3: "...can be written and provide..." -> "...can be written, and provide..."

Pg 1, col 2, line 2: "...execution environments thus allowing..." -> "...execution environments, thus allowing..."

Pg 1, col 2, line 6: "...to Grande Applications and in doing so encourage..." -> "...to Grande Applications, and in doing so encourage..."

Pg 1, col 2, section 2, line 6: "...a number of benchmarks [] ..." -> "...a number of micro-benchmarks [] ..."

Pg 1, col 2, last line: "These are useful in that they can be representative..." -> "These are useful in that they are representative..."

Pg 2, col 1, "Robust" item: "The performance of suite ..." -> "The performance of the suite ..."
  Regarding the Robustness criterion, I am dubious as to whether it is possible to eliminate hardware factors (such as cache size) from a performance measurement.

Pg 2, col 1, "Portable" item: "...a variety of Java environments as possible." -> "...a variety of Java platforms as possible."

Pg 2, col 1, line -12: "..., we provide three types of benchmark, ..." -> "..., we provide three benchmark types, ..."

Pg 2, col 1, line -4: "...of real applications running under the Java environment." -> "...of real Java applications."

Pg 2, col 2, line 12: "We also choose the kernels..." -> "We also chose the kernels..."

Pg 2, col 2, line -24: "..., as well as ensuring adherence to..." -> "..., as well as to ensure adherence to..."

Pg 3, col 1, line 14: "Relative performance is the ration of temporal performance ..." -> "Relative performance is the ratio of temporal performance ..."

Pg 3, col 1, line 15: "... that is a chosen JVM/operating system/hardware combination." -> "... that is, a chosen JVM/operating system/hardware combination."

Pg 3, col 1, line -16: "Accessing benchmark methods as class methods." -> "Accessing benchmark methods as static methods." (This ain't Smalltalk ;-)

Pg 3, col 1, line -8: "We can force compliance to common structure..." -> "We can force compliance to a common structure..."

Pg 3, col 2, line -14: "...to different types of variable." -> "...to different types of variables."

Pg 5, col 1, line 9: "This performs a one-dimensional forward transform..." -> "This performs a one-dimensional Fourier transform..."

Pg 5, col 1, Sparse: The first and third sentences can be merged as follows: "Multiplies an N x N sparse matrix stored in compressed-row format with a prescribed sparsity structure by a dense vector 200 times."
  "This kernel exercises indirection-addressing and..." -> "This kernel exercises indirect-addressing and..."

Pg 5, col 1, Search: "... using a alpha-beta pruned search technique." -> "... using an alpha-beta pruned search technique."

Pg 5, col 2, section 6.1, line 3: "Also of interest is language comparisons, comparing the performance of Java versus other programming languages such as Fortran, C and C++." -> "Also of interest are language comparisons, that is, comparing the performance of Java to other programming languages such as Fortran, C and C++."

Pg 5, col 2, section 6.1, line 6: "Currently, the LUFact and MolDyn benchmarks, allow..." -> "Currently, the LUFact and MolDyn benchmarks allow..."

Pg 5, col 2, section 6.1, lines 7-10: "It is intended, however, that the parallel part of the suite will contain versions of well-known Fortran and C parallel benchmarks, ..." -> "However, we intend the parallel part of the suite to contain versions of well-known Fortran and C parallel benchmarks, ..."
Pg 5, col 2, section 6.1, lines 11-17: "Measurements have been taken for the Linpack Benchmark (on a 1000 x 1000 problem size) and the Molecular Dynamics benchmark (2048 particles), using Java (Sun JDK 1.2.1 02 production version, and Sun JDK 1.2.1 reference version + Hotspot 1.0), Fortran and C on a 250MHz Sun Ultra Enterprise 3000 with 1Gb of RAM and the results are shown in Figure 3." -> "Measurements have been taken for the LUFact benchmark (on a 1000 x 1000 problem size) and the MolDyn benchmark (2048 particles) using Java (Sun JDK 1.2.1 02 production version, and Sun JDK 1.2.1 reference version + Hotspot 1.0), Fortran and C on a 250MHz Sun Ultra Enterprise 3000 with 1Gb of RAM. The results are shown in Figure 3."

Pg 8, col 2, line 1: "Consideration of these issues has lead us to decide ..." -> "Consideration of these issues has led us to decide ..."

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Referee 2
********************************************************************

C434 JGSI Review

Paper: A Benchmark Suite for High Performance Java
Authors: J M Bull, L A Smith, M D Westhead, D S Henty and R A Davey
Number: C434

a) Overall Recommendation

Scale used: 1 (trivial) to 5 (outstanding)

Recommendation: 3, accept
Technical Contribution: 3
Technical Content: 3
Presentation: 3

Accept it.

b) Words suitable for authors

I recommend "weak acceptance". The work presented does not yet make a significant contribution to the area of Java benchmarking, but it summarizes well the effort that this research group and others are putting into defining benchmarks for Java Grande applications. The main problem with this work is that, in this reviewer's opinion, the results in the paper do not test the methodology the authors propose for Java benchmarking.

1) Releasing the Java benchmarks as source code rather than as bytecode programs is an important design decision. I strongly support source code release. Much performance analysis depends on program semantics that, although recoverable from bytecodes, is much easier to recover from source code. Releasing source also removes any dependence of the benchmarks on a particular Java front-end/bytecode compiler, leaving the JVM performance analyst an open choice of compilers to use.

2) Section 5, Current Status, subsection 5.1, Low Level Operations

The cast operation tests should include dynamic type checking tests for casts of object (reference) types. Casts of primitive data types are not the only ones of relevance. JVMs have to perform many implicit dynamic type checks, and it would be interesting to see how different Java engines optimize these checks.

3) Section 6.1, Programming Language Comparison

Are the C and Fortran code versions 100% equivalent to the Java versions? Have they been modified to include Java's implicit run-time checks? If not, the results reflect not only the differences in language implementation but also the overhead of the extra security policies imposed by Java. The authors should point this detail out.

4) Section 6.2, JVM Comparison

The whole purpose of constructing this benchmark suite was to point out where the differences among JVMs lie. However, the results presented only permit one to say whether a certain JVM performs better or worse than the others; they provide no more detailed insight. So these results fall short of the initial goal of this project.
How can more detailed performance comparison information be extracted from the benchmark suite as constructed so far?

One problem pointed out by the authors relates to how to force JIT compilation of methods in different JVM engines. I do not see that as an issue. The performance analyst has to understand that there are different technologies for executing Java, and that these technologies yield different performance improvements and have different system requirements. So, when comparing technologies, one has to be aware that different systems exist and that comparisons across systems may not be fair. Overall, the technologies for executing Java fall into the following groups:

Interpreters
JIT compilers
 - simple, baseline compilers
 - more optimizing compilers, but still quick
 - dynamic optimizing and re-optimizing compilers

c) Words for the committee, if necessary

I recommend "weak acceptance". The work presented does not yet make a significant contribution to the area of Java benchmarking, but it summarizes well the effort that this research group and others are putting into defining benchmarks for Java Grande applications. The main problem with this work is that, in this reviewer's opinion, the results in the paper do not test the methodology the authors propose for Java benchmarking.

Referee 3
********************************************************

Subject: C434 JGSI Review

a) Overall: Good

b) I think this benchmark suite is very valuable. In particular, the "transparency" described in section 3 makes it so. The results from the benchmark tests are very interesting, but if possible, explaining the reasons behind the results (e.g., why is NT + HotSpot good at Search but bad at MolDyn?) would help programmers to write efficient code. I am eagerly looking forward to the parallel version of the JGF benchmark.
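
P.S. On Referee 2's point about forcing JIT compilation before timing: one common approach is simply to run each kernel untimed for a warm-up phase, so that a JIT-based JVM has a chance to compile it, and only then start the clock. The sketch below is purely illustrative; the class name, kernel, and iteration counts are invented here and are not taken from the JGF suite.

public class WarmupTimer {

    // Kernel under test: any repeatable, self-contained computation will do.
    static double kernel(int n) {
        double sum = 0.0;
        for (int i = 1; i <= n; i++) {
            sum += 1.0 / i;
        }
        return sum;
    }

    public static void main(String[] args) {
        final int size = 1000000;         // problem size (illustrative)
        final int warmupIterations = 50;  // untimed runs, to encourage JIT compilation
        final int timedIterations = 50;   // runs that are actually measured

        double sink = 0.0;                // consume results so the work is not optimized away

        // Warm-up phase: a JIT-based JVM should compile kernel() during these runs;
        // an interpreter simply executes it a few extra times.
        for (int i = 0; i < warmupIterations; i++) {
            sink += kernel(size);
        }

        // Timed phase.
        long start = System.currentTimeMillis();
        for (int i = 0; i < timedIterations; i++) {
            sink += kernel(size);
        }
        long elapsed = System.currentTimeMillis() - start;

        // Report work per second for the timed phase only.
        double operations = (double) size * timedIterations;
        System.out.println("elapsed ms : " + elapsed);
        System.out.println("ops/second : " + operations / (elapsed / 1000.0));
        System.out.println("(checksum  : " + sink + ")");
    }
}

On an interpreter the warm-up phase just costs a few extra runs, so the same harness can be used unchanged across the execution technologies Referee 2 lists.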