Subject: Request to review paper C499
From: Jarek Nieplocha
Date: Tue, 19 Jun 2001 12:25:44 -0700
To: fox@csit.fsu.edu
CC: Jarek Nieplocha

Geoffrey,

Here is the review. Knowing Nikos, I hoped for the review to turn out better than it did. However, the paper does need more work. Sorry again for the delay,

Jarek

REFEREE'S REPORT
Concurrency and Computation: Practice and Experience
---------------------------------------------------------------------------

A: General Information

Please return to:

Geoffrey C. Fox (electronically preferred: fox@csit.fsu.edu)
Concurrency and Computation: Practice and Experience
Computational Science and Information Technology
Florida State University
400 Dirac Science Library
Tallahassee, Florida 32306-4130
Office FAX: 850-644-0098
Office Phone: 850-644-4587 (but best is cell phone: 315-254-6387)

Please fill in Summary Conclusions (Sec. C) and details as appropriate in Secs. D, E, and F.

B: Refereeing Philosophy

We encourage a broad range of readers and contributors. Please judge papers on their technical merit and separate comments on this from those on style and approach. Keep in mind the strong practical orientation that we are trying to give the journal. Note that the forms attached provide separate pages for comments that you wish only the editor to see and for those that both the editor and author receive. Your identity will of course not be revealed to the author.

C: Paper and Referee Metadata

* Paper Number: C499
* Date: June 18, 2001
* Paper Title: Data movement and control substrate for parallel adaptive applications
* Author(s): J. Dobbelaere, K. Barker, N. Chrisochoides, D. Nave, K. Pingali
* Referee: Jarek Nieplocha
* Address: Pacific Northwest National Laboratory

Referee Recommendations. Please indicate overall recommendations here, and details in following sections.

1. publish as is
2. accepted provided changes suggested are made <<<<<<<<<<<<<<<<<<<<<<
3. reject

D: Referee Comments (For Editor Only)

E: Referee Comments (For Author and Editor)

General Comments:

The subject of the paper is very interesting. However, the paper has some serious shortcomings and does not deliver a consistent and convincing message regarding the usefulness, performance, and portability of DMCS. In addition, the references to related work are quite old, pointing mostly to inactive projects, and incomplete; the only recent papers referenced are those written by the authors themselves. Why is the described system more appropriate for its targeted applications than other well-known interfaces such as Generic Active Messages, Nexus, or MPI-2? Readers of the paper will certainly be interested in this issue.

Specific Comments:

1. The point made in the Abstract about DMCS being easy to port and maintained by non-experts (!) appears to be inconsistent with the description of the porting complexities in Sect. 5.1.

2. Several references to the LAPI functionality are not accurate. For example:
   * On page 8, only two LAPI threads are mentioned. In fact, LAPI uses up to three threads: the user thread, a second thread that can run the header handler in interrupt mode, and a third thread that runs the completion handler.
   * The LAPI function names mentioned in the paper are mostly incorrect. For example, lapi_poll() in Fig. 2 does not exist in LAPI. The closest function is LAPI_Probe(), and it has different semantics. The names of other LAPI operations referenced in the paper, such as Send() and wait(), are incorrect as well.
   * Figure 3, representing LAPI operation, is incorrect; please refer to the paper by G. Shah et al. in IPDPS'98.
   * Reference [2] (a confidential report from 1996) should be updated to point to the IBM PSSP documentation, the paper by Shah, or even a recent IBM "Redbook" that includes a section on that commercial product interface.

3. The role of polling in a one-sided programming interface like DMCS is rather controversial and unfortunately not well addressed in the paper.
How do you address the difficulty of inserting polling calls into the application code? What happens if the application spends a substantial amount of time in a precompiled library call (e.g., BLAS)? The discussion and microbenchmark performance results in Sect. 6 ignore the fact that the polling frequency and response delay, determined by the placement of the polling calls, are on the critical path of the get operation. There are a fair number of publications discussing that issue and proposing techniques for performance improvements.

4. Some of the performance results in Section 6 are rather confusing.
   * How are the "send times" in Tables 2-5 defined and measured? They seem to correspond to only a portion of the data transfer cost.
   * How do you explain that operations that transfer more data, e.g., dmcs_async_rsr4, execute faster than ones that move less data, like dmcs_async_rsr1?
   * Why are timing results reported only for starting up operations that complete remotely (with most of the processing done off the critical path), such as put and remote-service-request? An informative indicator of DMCS performance would be results for the blocking get operation, as it would reveal the processing costs incurred by the local and remote processes.
   * Reporting the overhead of DMCS over the underlying communication protocol does not fully clarify how that protocol is used by DMCS or whether the overall implementation is efficient. For example, there are many ways to use MPI, and they can lead to different performance characteristics of an "application". In the context of your GQDT application running on Sun/Ethernet (Table 9), how can the reader be sure that the DMCS approach corresponds to the optimal utilization of MPI and the network? Perhaps a pure MPI implementation based on a different set of MPI operations (e.g., persistent, nonblocking communication) than those used by the DMCS port would improve the performance of this application over the reported results.

5. The selection of hardware configurations for the application performance studies does not help evaluate the performance and scalability merits of DMCS to the full extent. It would be desirable to run the application and report performance results on a larger system with a faster network. The IBM SP used in this work has only two nodes, and the Ethernet network in the Sun cluster seems to be insufficient for this application to scale regardless of the communication substrate used. Since the paper acknowledges funding support from NSF, it is surprising that the authors did not use the more adequate computational resources that should be available to them at NSF centers, for example the teraflop-class IBM SP at SDSC.

6. The idea of acknowledgment variables described in Section 7 does not appear to be original. In fact, the IBM LAPI interface uses three counter variables for the same purpose as DMCS.

7. How does DMCS handle the multiple communication protocols available on SMP clusters?

8. The references to related projects should be updated and extended. For example, you should reference and compare the current work to other similar portable systems such as MPI-2, Madeleine from U. Lyon, ARMCI from Pacific Northwest National Laboratory, and Ironman/ZPL from U. Washington.

F: Presentation Changes

Please change the colors and/or line styles in Figures 6, 7, and 8 to improve readability. Only two out of three curves are visible.