Referee 1
*********************************************************************

Overall, the paper presents an interesting approach to inter-process communication in dynamic, adaptive, irregular applications. The authors clearly state the design objectives of this communication system and also investigate the design tradeoffs involved in achieving these objectives. The paper presents interesting and valuable practical experience in the area of parallel processing. The remainder of these comments focuses on the referee's opinion of some of the weak aspects of the reviewed paper and the changes the paper should undergo in order to reach the level necessary for publication in the journal. The comments are divided into three groups: (i) justification for the presented communication system (DMCS), (ii) performance evaluation, and (iii) style and presentation.

(i) Justification for DMCS

DMCS is introduced as a system that meets the requirements of dynamic, adaptive, irregular parallel algorithms and is also easy to use by application developers. The major requirements of the communication model needed to adequately support the algorithms presented in the paper are specified as asynchronous, one-sided communication and dynamic process management. The paper reviews MPI and LAPI as existing distributed communication interfaces and points out their inability to meet all of the listed requirements. However, the paper provides an incomplete description of these interfaces and fails to convince the reader that DMCS provides substantial improvements over either or both of them. Several points support this impression:

- LAPI is described as a high-performance interface but one with a low level of usability (p. 2). As described, the DMCS API contains 13 functions; the LAPI interface contains 14 functions. In terms of size, LAPI and DMCS are clearly much smaller than MPI (128 functions) and especially MPI-2. However, the paper fails to show that the DMCS API is much easier to use than the one provided by LAPI. The model provided by LAPI is in fact very close to the one offered by DMCS. Also, LAPI's API provides alternative modes for remote and local completion, namely interrupt-based and polling.

- Except for the fact that DMCS handlers are executed in polling mode while LAPI handlers are executed in interrupt mode (which is not necessarily an advantage), the paper does not clearly show how DMCS is better than LAPI; in this respect the justification of the DMCS effort is viewed as insufficient.

- One of the deficiencies of MPI is described as a lack of support for one-sided communication (with the exception of the footnote on p. 2). The paper does not discuss the one-sided communication model provided by MPI-2 in any depth, e.g., whether this model is adequate for the goals of the presented system and, if it is not, how DMCS is better than MPI-2's one-sided model (a brief sketch of that model appears after this list). Also, there is no comparison between LAPI and MPI-2's one-sided models, or any other one-sided models for that matter (for instance, an important communication system with features similar to the ones described in the paper is Princeton VMMC).

- MPI is described as "not intended to be a target of run-time support systems software needed by compilers and problem solving environments". The authors do not show how DMCS can become such a target or why DMCS is better than MPI in this regard.

- The paper states that MPI does not support dynamic process management and intra-node concurrency (p. 2). However, the MPI-2 specification, dated 1998, has provided explicit support in both of these areas. Moreover, thread safety of MPI has been suggested even by the MPI-1 standard (1994). Thread-safe MPI implementations have existed for many years now and have been used successfully in hybrid modes (MPI + threads or MPI + OpenMP).
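For concreteness, the MPI-2 one-sided model referred to in the third bullet looks roughly as follows. This is a minimal, generic sketch (window creation, put, fence synchronization) with placeholder sizes and data; it is not a claim about how DMCS or the paper's applications would use MPI-2.

```c
#include <mpi.h>

int main(int argc, char **argv)
{
    int i, rank;
    double local[100], window_buf[100];
    MPI_Win win;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    for (i = 0; i < 100; i++)
        local[i] = (double) rank;

    /* Expose window_buf for one-sided access by every process. */
    MPI_Win_create(window_buf, 100 * sizeof(double), sizeof(double),
                   MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    MPI_Win_fence(0, win);
    if (rank == 1)   /* write directly into process 0's window; no receive call */
        MPI_Put(local, 100, MPI_DOUBLE, 0, 0, 100, MPI_DOUBLE, win);
    MPI_Win_fence(0, win);   /* completes all outstanding puts on this window */

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}
```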
Further, the justification of DMCS does not make clear whether this system is intended for use by a broader class of applications or only by the ones described in the paper. If it is the former, the paper does not provide a study of how the assumptions and design of DMCS will affect the performance of classes of algorithms with different communication patterns (e.g., ones that are regular in time and space and exchange medium- or large-size messages). If it is the latter, then the paper focuses on only a narrow class of applications, and designing and implementing a whole new communication interface may not be well justified, especially in light of other efforts (such as MPI-2) that have already addressed the same or similar issues and are viewed as de facto standards.

(ii) Performance evaluation

One of the fundamental design decisions of the presented system is that user handlers are executed within the context of the user thread in polling fashion, and as a result DMCS provides low communication overhead. However, the paper does not address how this design affects communication bandwidth or processor overhead. In other words, the paper does not address performance in the more complex space of factors such as latency, bandwidth, and processor overhead. The paper focuses on only one of these performance factors, and although the authors indicate that this particular factor is the most important for the class of algorithms presented in the paper, a broader discussion (or at least a mention of the other factors) is necessary for understanding the performance implications of DMCS.

The comparison between the overheads of MPI/LAPI on one side and DMCS on the other does not provide an accurate picture of the relative communication overheads, because DMCS does not effectively perform any data movement: it is layered on top of MPI or LAPI. Only the overhead of a DMCS implementation that provides its own data movement and completion synchronization can be fairly compared with MPI or LAPI. Studying the DMCS implementations provided in the paper, the only valid conclusion that can be drawn is that DMCS may not be significantly worse than MPI or LAPI in terms of communication overhead.

The paper does not specify how DMCS performs on classes of algorithms other than the ones it addresses. In particular, it does not indicate whether the presented system will perform sufficiently well for medium- to coarse-grain regular data-parallel algorithms that application developers might choose DMCS instead of, for example, MPI.
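To make the distinction concrete, the sketch below (plain MPI, not DMCS) times only the cost of initiating non-blocking transfers. A figure obtained this way characterizes send-side processor overhead and by itself says nothing about end-to-end latency, delivered bandwidth, or the work induced on the receiving processor.

```c
#include <mpi.h>

/* Sketch: average cost of initiating a non-blocking send, measured at the
   sender.  The function and argument names are illustrative. */
double isend_overhead(void *buf, int nbytes, int peer, int reps)
{
    int i;
    MPI_Request req;
    double t0 = MPI_Wtime();

    for (i = 0; i < reps; i++) {
        MPI_Isend(buf, nbytes, MPI_BYTE, peer, 0, MPI_COMM_WORLD, &req);
        MPI_Wait(&req, MPI_STATUS_IGNORE);  /* local completion only; the data
                                               may still be in flight */
    }
    return (MPI_Wtime() - t0) / reps;       /* CPU time spent initiating sends;
                                               bounds neither latency nor
                                               bandwidth */
}
```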
(iii) Style and presentation

- Inaccurate, incomplete, or insufficiently justified statements:

(a) The paper states that advances in network technology have not kept up with processor performance (p. 6). The paper does not mention network technologies such as Myrinet, Giganet, or Quadrics that provide bandwidth surpassing the capabilities of the peripheral buses of the host systems. In this regard, the statement should have been that memory subsystem and/or peripheral bus technologies have not kept up with processor speeds and have prevented high-speed networks (such as the ones listed above) from delivering their maximum performance to the application processes. Technologies such as RapidIO and InfiniBand (also not mentioned) are presently being developed to address the communication bottlenecks of computer I/O subsystems.

(b) The paper identifies the lack of portable thread models as a factor in the decision to eliminate threads from the design of DMCS. The paper does not mention pthreads, which were created exactly for reasons of portability, or explain why pthreads are insufficient for the goals of the paper.

(c) The paper states (p. 3 and p. 10) that the thread context switch depends on the "hardware architecture of the underlying processor" (p. 10). Clearly, the major factor that determines thread context-switch semantics (preemptive or cooperative; mapped to kernel threads or to full-fledged processes) as well as the overhead is primarily the architecture of the operating system, not the architecture of the processor. For example, on the same Intel PC platform, with the same hardware and processor architecture, there are at least three different operating systems, namely Windows, Linux, and Solaris, with substantially different thread semantics and overheads.

- References: (a) there are missing references in the text (e.g., Cluster Controller is mentioned on p. 4 but no reference or explanation is given); (b) some references are missing years or publication venues; (c) the only reference for LAPI, [2], is a confidential IBM draft report from 1996; there are newer literature sources from 1999 and 2000 as well as many on-line resources, and they should be cited too.

- Visual materials: (a) Figure 1 is hard to read in black and white; (b) it would be better to present the latency and overhead numbers in Tables 2, 3, 4, 6, and 7 in microseconds instead of seconds.

- Large number of blank lines (e.g., p. 23 and p. 25); the text flow should be improved.

- Network speed should be specified in Mb/s, not in Mb.

Presentation Changes

The referee suggests that the authors review all of the provided comments and address the proposed changes in the section related to style and presentation.

Referee 2
*********************************************************************

General Comments:

The subject of the paper is very interesting. However, the paper has some serious shortcomings and does not deliver a consistent and convincing message regarding the usefulness, performance, and portability of DMCS. In addition, the references to other related work are quite old, pointing mostly to inactive projects, and are incomplete. The recent referenced papers are only those written by the authors themselves. Why is the described system more appropriate for its targeted applications than other well-known interfaces such as Generic Active Messages, Nexus, or MPI-2? Readers of the paper will certainly be interested in this issue.

Specific Comments:

1. The point about DMCS being easy to port and maintain by non-experts (!) made in the Abstract appears to be inconsistent with the description of the porting complexities in Sect. 5.1.

2. Several references to the LAPI functionality are not accurate. For example:

* On page 8 only two LAPI threads are mentioned. In fact, LAPI uses up to three threads: the user thread, another thread that can run the header handler in interrupt mode, and a third thread that runs the completion handler.

* The LAPI function names mentioned in the paper are mostly incorrect. For example, lapi_poll() in Fig. 2 does not exist in LAPI. The closest function is LAPI_Probe(), and it has different semantics. The names of other LAPI operations referenced in the paper, such as Send() and wait(), are incorrect too.

* Figure 3, representing LAPI operation, is incorrect; please refer to the paper by G. Shah et al. in IPDPS'98.

* Reference [2] (a confidential report from 1996) should be updated to point to the IBM PSSP documentation, the paper by Shah, or even a recent IBM "Redbook" that includes a section on that commercial product interface.
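For reference, the polling and completion calls that do exist look roughly as follows. This is sketched from memory; exact prototypes should be verified against the IBM LAPI documentation. The counter shown is one of the (up to three) counters LAPI attaches to each transfer, which also relates to comment 6 below.

```c
#include <lapi.h>

/* Sketch only: argument lists reproduced from memory, not verified against
   the current LAPI documentation. */
void wait_for_origin_completion(lapi_handle_t hndl, lapi_cntr_t *org_cntr)
{
    int cur;
    LAPI_Probe(hndl);                        /* polling-mode progress: process
                                                any messages that have arrived */
    LAPI_Waitcntr(hndl, org_cntr, 1, &cur);  /* wait until the origin counter
                                                reaches 1, then decrement it */
}
```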
3. The role of polling in a one-sided programming interface like DMCS is rather controversial and, unfortunately, not well addressed in the paper. How do you address the difficulty of inserting polling calls into the application code? What happens if the application spends a substantial amount of time in a precompiled library call (e.g., BLAS)? The discussion and the microbenchmark performance results in Sect. 6 ignore the fact that the polling frequency and response delay, determined by the placement of the polling calls, are on the critical path of the get operation. There is a fair number of publications discussing this issue and proposing techniques for performance improvement.

4. Some of the performance results in Section 6 are rather confusing.

* How are the "send times" in Tables 2-5 defined and measured? They seem to correspond to only a portion of the data transfer cost.

* How do you explain that operations that transfer more data, e.g., dmcs_async_rsr4, execute faster than ones that move less data, like dmcs_async_rsr1?

* Why are only timing results reported for starting up operations that complete remotely (with most of the processing done off the critical path), such as put and remote-service-request? An informative indicator of DMCS performance would be results for a blocking get operation, as it would reveal the processing costs on both the local and the remote process.

* Reporting the overhead of DMCS over the underlying communication protocol does not fully clarify how that protocol is used by DMCS or whether the overall implementation is efficient. For example, there are many ways to use MPI, and they can lead to different performance characteristics of an "application". In the context of your GQDT application running on Sun/Ethernet (Table 9), how can the reader be sure that the DMCS approach corresponds to an optimal utilization of MPI and the network? Perhaps a pure MPI implementation based on a different set of MPI operations (e.g., persistent, nonblocking communication) than those the DMCS port is using would improve the performance of this application over the reported results?
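As one hedged illustration of the alternative suggested in the last bullet, a pure-MPI code with a fixed communication partner could set up persistent requests once and restart them each iteration. The function and variable names below are placeholders, not code from the paper.

```c
#include <mpi.h>

/* Sketch of MPI persistent, nonblocking communication; names are placeholders. */
void exchange_loop(double *buf, int count, int peer, int n_iters)
{
    int iter;
    MPI_Request req;

    /* Build the send once... */
    MPI_Send_init(buf, count, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD, &req);

    for (iter = 0; iter < n_iters; iter++) {
        MPI_Start(&req);                   /* ...and restart it every iteration */
        /* overlap computation here */
        MPI_Wait(&req, MPI_STATUS_IGNORE);
    }
    MPI_Request_free(&req);
}
```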
5. The selection of hardware configurations for the application performance studies does not help evaluate the performance and scalability merits of DMCS to the full extent. It would be desirable to run the application and report performance results on a larger system with a faster network. The IBM SP used in this work has only two nodes, and the Ethernet network in the Sun cluster appears insufficient for this application to scale regardless of the communication substrate used. Since the paper acknowledges funding support from NSF, it is surprising that the authors did not use more adequate computational resources that should be available to them in NSF centers, for example the teraflop-class IBM SP at SDSC.

6. The idea of acknowledgment variables described in Section 7 does not appear to be original. In fact, the IBM LAPI interface uses three counter variables for the same purpose as DMCS.

7. How does DMCS handle the multiple communication protocols available on SMP clusters?

8. The references to related projects should be updated and extended. For example, you should reference and compare the current work to other similar portable systems such as MPI-2, Madeleine from U. Lyon, ARMCI from Pacific Northwest National Laboratory, and Ironman/ZPL from U. Washington.

Presentation Changes

Please change the colors and/or line styles in Figures 6, 7, and 8 to improve readability. Only two of the three curves are visible.

Referee 3
*********************************************************************

Overall this is a good paper, and the work addresses an important aspect of high-performance communication for parallel applications and machines. However, I believe that it is lacking a few important pieces that need to be added in order to make it complete.

Most of the complexity in message passing occurs on the receiving side. This is reinforced in the paper by the descriptions of how to implement DMCS on top of MPI and LAPI. Most of the important issues, such as how the receive side handles message ordering, message buffering, and polling strategies, are described in detail. However, the performance results do not include any data specifically related to the receive side.

The paper makes several references to a DMCS polling operation that the main application thread must call in order for DMCS handlers to execute. The polling operation is an important part of the DMCS implementation, but one cannot tell from the paper exactly how it should be used. Does the main thread simply call the polling function continuously, as in an event-driven X program? Or is it something that needs to be called periodically by the application? There should be something that describes the polling function and its use in greater detail (a sketch of the two usage patterns in question appears at the end of these comments).

It is also unclear from the performance measurements exactly what the receive side is doing. The overhead of the asynchronous send-side operations is interesting, but I would think that ultimately the delivery of data and/or remote function invocation are important things to measure as well. It seems like the LAPI implementation would certainly perform better than the MPI probing implementation on the receive side, but there should be some performance numbers to back that up.

The other piece that is missing is a related work section. There are several programming models and systems based on one-sided communication -- Cray SHMEM, Global Arrays, Cooperative Data Sharing, MPI-2, etc. An overview of how DMCS compares and contrasts with these systems is needed. It appears as though DMCS pre-dates some of these and specifically addresses areas that they do not. It would be beneficial for this paper to clarify and qualify what distinguishes DMCS from these other projects.
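For illustration, the two usage patterns in question might look roughly as follows; dmcs_poll() and the other names are hypothetical placeholders rather than the paper's documented API.

```c
/* Hypothetical sketch; dmcs_poll(), refine_element(), and POLL_INTERVAL are
   placeholder names, not the paper's documented API. */
extern void dmcs_poll(void);
extern void refine_element(int i);
#define POLL_INTERVAL 64

/* Pattern 1: a dedicated progress loop, as in an event-driven X program. */
void progress_loop(volatile int *done)
{
    while (!*done)
        dmcs_poll();             /* run handlers for any arrived requests */
}

/* Pattern 2: polling calls interleaved with the computation. */
void refine_all(int n_elements)
{
    int i;
    for (i = 0; i < n_elements; i++) {
        refine_element(i);       /* application work */
        if (i % POLL_INTERVAL == 0)
            dmcs_poll();         /* bounds the delay before requests from
                                    other processes are serviced */
    }
}
```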
Presentation Changes

In Figures 6 and 7, the line plotting "DMCS time" is hardly visible; likewise in Figure 8 for "LAPI time". For Figures 6 and 7, the nearly invisible line plot and the scale of the graph make them hard to decipher. The tables of data would be much clearer and easier to understand if they were presented in microseconds or milliseconds rather than in seconds using scientific notation. The statement regarding the "Cluster Controller" on page 4 should cite a reference and possibly describe what exactly it is. On page 7, MPI is referred to as a "common message passing software package", which is an incorrect description. Care should be taken when discussing MPI as a standard as opposed to MPI as an implementation.