Subject: Re: Request to review a paper C499
From: "Rossen Dimitrov"
Date: Sat, 9 Jun 2001 18:36:23 -0500
To: fox@csit.fsu.edu

Dr. Fox,

Below is my report for paper C499. Thank you for the opportunity to contribute to your effort for issuing the journal.

Rossen Dimitrov, Ph.D.

> REFEREE'S REPORT
>
> Concurrency and Computation: Practice and Experience
>
> ---------------------------------------------------------------------------
>
> A: General Information
>
> Please return to:
> Geoffrey C. Fox
> Electronically Preferred fox@csit.fsu.edu
> Concurrency and Computation: Practice and Experience
> Computational Science and Information Technology
> Florida State University
> 400 Dirac Science Library
> Tallahassee Florida 32306-4130
> Office FAX 850-644-0098
> Office Phone 850-644-4587 but best is cell phone 3152546387
>
> Please fill in Summary Conclusions (Sec. C) and details as appropriate in
> Secs. D, E and F.
>
> B: Refereeing Philosophy
>
> We encourage a broad range of readers and contributors. Please judge papers
> on their technical merit and separate comments on this from those on style
> and approach. Keep in mind the strong practical orientation that we are
> trying to give the journal. Note that the forms attached provide separate
> paper for comments that you wish only the editor to see and those that both
> the editor and author receive. Your identity will of course not be revealed
> to the author.
>
> C: Paper and Referee Metadata
>
> * Paper Number Cnnn: C499
>
> * Date: June 9, 2001
>
> * Paper Title: Data Movement and Control Substrate for Parallel Adaptive Applications
>
> * Author(s): Jeffrey Dobbelaere, Kevin Barker, Nikos Christochoides, Demian Nave, and Keshav Pingali
>
> * Referee: Rossen Dimitrov, Ph.D.
>
> * Address: 101 S. Lafayette Str, Suite 33 Strakville, MS 39759 (662) 320-4300 x 21
>
> Referee Recommendations. Please indicate overall recommendations here, and
> details in following sections.
>
> 1. publish as is
> 2. 
accepted provided changes suggested are made
> 3. reject

accepted provided changes suggested are made
>
> D: Referee Comments (For Editor Only)

The paper presents an interesting communication system that addresses specific needs of a class of parallel algorithms. This, together with the experience that it shares, makes the paper relevant to the practical orientation and the objectives of the journal.

However, I have the impression that the paper may have been submitted in a similar form to other publications and has not necessarily been updated in important areas. Examples are the insufficient coverage of MPI-2, completed in 1998 (only a footnote in the paper), and the old reference to LAPI, a confidential IBM draft report from 1996, while newer, open LAPI sources from 1999 and 2000 exist. MPI-2 and LAPI are fundamental concepts for the reviewed paper; in fact, the paper presents a communication system that arguably addresses certain deficiencies of MPI and LAPI. One would therefore expect the paper to pay more attention to the evolution of MPI and LAPI as well.

There are several inaccurate statements in the paper. I have listed them in the following section of this report, subsection (iii).

I would recommend acceptance of the paper only after substantial changes in the areas that are commented on in section E below.
>
> E: Referee Comments (For Author and Editor)
>

Overall, the paper presents an interesting approach to inter-process communication in dynamic, adaptive, irregular applications. The authors clearly state the design objectives of this communication system and also investigate the design tradeoffs for achieving these objectives. This paper presents interesting and valuable practical experience in the area of parallel processing.

The remainder of these comments is focused on the referee's opinion about some of the weak aspects of the reviewed paper and the changes that this paper should undergo in order to reach the level necessary for publishing in the journal.
The comments are divided into three groups: (i) justification for the presented communication system (DMCS), (ii) performance evaluation, and (iii) style and presentation.

(i) Justification for DMCS

DMCS is introduced as a system that meets the requirements of dynamic, adaptive, irregular parallel algorithms and also as a system that is easy for application developers to use. The major requirements of the communication model needed to adequately support the algorithms presented in the paper are specified as asynchronous, one-sided communication and dynamic process management. The paper reviews MPI and LAPI as existing distributed communication interfaces and points out their inability to meet all of the listed requirements. However, the paper provides an incomplete description of these interfaces and fails to convince that DMCS provides substantial improvements over either or both of the mentioned interfaces. Here are several points that support this impression:

- LAPI is described as a high-performance interface but one with a low level of usability (p. 2). As described, the DMCS API contains 13 functions. The LAPI interface contains 14 functions. In terms of size, clearly LAPI and DMCS are much smaller than MPI (128 functions) and especially MPI-2. However, the paper fails to show that the API of DMCS is much easier to use than the one provided by LAPI. The model provided by LAPI is in fact very close to the one offered by DMCS. Also, with its API LAPI provides alternative modes for remote and local completion, namely interrupt-based and polling.

- Except for the fact that DMCS handlers are executed in polling mode while LAPI handlers are executed in interrupt mode (which is not necessarily an advantage), the paper does not clearly show how DMCS is better than LAPI. In this respect the justification of the DMCS effort is viewed as insufficient.

- One of the deficiencies of MPI is described as a lack of support for one-sided communication (with the exception of the footnote on p. 2). The paper does not discuss the one-sided communication model provided by MPI-2 in any depth, e.g., whether this model is adequate for the goals of the presented system and, if it is not, how DMCS is better than MPI-2's one-sided model. Also, there is no comparison between LAPI and MPI-2's one-sided models or any other one-sided models for that matter (for instance, an important communication system with features similar to the ones described in the paper is Princeton VMMC).

- MPI is described as "not intended to be a target of run-time support systems software needed by compilers and problem solving environments". The authors of the paper do not show how DMCS can become such a target and why DMCS is better than MPI in this regard.

- The paper states that MPI does not support dynamic process management and intra-node concurrency (p. 2). However, the MPI-2 specification, dated 1998, has provided explicit support in both of these areas. Moreover, thread safety of MPI has been suggested even by the MPI-1 standard (1994). Thread-safe MPI implementations have existed for many years now and they have been used successfully in hybrid modes (MPI + threads or MPI + OpenMP).

Further, the justification of DMCS does not make clear whether this system is intended for use by a broader class of applications or only for the ones that are described in the paper. If it is the former case, the paper does not provide a study of how the assumptions of DMCS and its design will affect the performance of classes of algorithms that suggest different communication patterns (e.g., they are regular in time and space, and also use exchange of medium or large-size messages).
If it is the latter case, then the paper focuses only on a narrow class of applications and it seems that designing and implementing a whole new communication interface may not be well justified, especially in the light of the existence of other efforts (such as MPI-2) that have already addressed the same or similar issues and are viewed as de facto standards.

(ii) Performance evaluation

One of the fundamental design decisions of the system presented in the paper is that user handlers are executed within the context of the user thread in polling fashion and, as a result, DMCS provides low communication overhead. However, the paper does not address how this design affects the communication bandwidth or the processor overhead. In other words, the paper does not address performance in a more complex space made up of factors such as latency, bandwidth, and processor overhead. The paper focuses on only one of these performance factors, and although the authors indicate that this particular factor is the most important for the class of algorithms presented in the paper, a broader discussion (or at least a mention of the other factors) is necessary for understanding the performance implications of DMCS.

The comparison between the overheads of MPI/LAPI on one side and DMCS on the other does not provide an accurate picture of the relative communication overheads, because DMCS does not effectively perform any data movement since it is layered on top of MPI or LAPI. Only the overhead of an implementation of DMCS that provides its own data movement and completion synchronization can be fairly compared with MPI or LAPI. Studying the DMCS implementations provided in the paper, the only valid conclusion that can be made is that DMCS may not be significantly worse than MPI or LAPI in terms of communication overhead.

The paper does not specify how DMCS performs on classes of algorithms other than the ones addressed in the paper.
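
As a rough illustration of why latency, bandwidth, and processor overhead need to be considered together, the following sketch uses a simple linear cost model. The model itself, the names `LinkModel`, `transfer_time`, and `bandwidth_fraction`, and all parameter values are assumptions made for this illustration; none of them come from the paper or from measurements of DMCS, MPI, or LAPI.

```python
from dataclasses import dataclass


@dataclass
class LinkModel:
    """Hypothetical parameters of a point-to-point link (illustrative only)."""
    latency_s: float       # one-way network latency, in seconds
    bandwidth_Bps: float   # sustained bandwidth, in bytes per second
    overhead_s: float      # CPU time spent per message on each side, in seconds


def transfer_time(m: LinkModel, nbytes: int) -> float:
    """Estimated time for one message: send overhead + latency + wire time + receive overhead."""
    return 2 * m.overhead_s + m.latency_s + nbytes / m.bandwidth_Bps


def bandwidth_fraction(m: LinkModel, nbytes: int) -> float:
    """Share of the estimated transfer time that is due to the bandwidth term."""
    return (nbytes / m.bandwidth_Bps) / transfer_time(m, nbytes)


# Assumed values: 10 us latency, 100 MB/s bandwidth, 5 us per-side overhead.
link = LinkModel(latency_s=10e-6, bandwidth_Bps=100e6, overhead_s=5e-6)

if __name__ == "__main__":
    for n in (64, 64 * 1024, 4 * 1024 * 1024):
        t_us = transfer_time(link, n) * 1e6
        share = bandwidth_fraction(link, n)
        print(f"{n:>8} bytes: {t_us:10.2f} us, bandwidth share {share:.3f}")
```

Under these assumed values, the fixed latency and overhead terms dominate for small messages, while the bandwidth term dominates for large ones. This is why overhead figures measured on fine-grain traffic alone say little about medium- or coarse-grain workloads.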
The paper does not indicate whether the presented system will perform sufficiently well for medium to coarse grain regular data-parallel algorithms so that application developers might choose to use DMCS instead of MPI, for example.

(iii) Style and presentation

- Inaccurate, incomplete, or insufficiently justified statements:

  (a) The paper states that advances in network technology have not kept up with processor performance (p. 6). The paper does not mention network technologies such as Myrinet, Giganet, or Quadrics, which provide bandwidth that surpasses the capabilities of the peripheral buses of the host systems. In this regard, the statement should have been that the memory subsystem and/or peripheral bus technologies have not kept up with processor speeds and have prevented high-speed networks (such as the ones listed above) from delivering their maximum performance to the application processes. Technologies such as RapidIO and Infiniband (also not mentioned) are presently being developed to address the communication bottlenecks of computer I/O subsystems.

  (b) The paper identifies the lack of portable thread models as a factor in the decision to eliminate threads from the design of DMCS. The paper does not mention "pthreads," which were created precisely for reasons of portability, nor does it explain why pthreads are insufficient for the goals of the paper.

  (c) The paper states (p. 3 and p. 10) that the thread context switch depends on the "hardware architecture of the underlying processor" (p. 10). Clearly, the thread context switch semantics (preemptive or cooperative; mapped to kernel threads or to full-fledged processes) as well as the overhead depend primarily on the architecture of the operating system and not on the architecture of the processor.
For example, on the same Intel PC platform with the same hardware and processor architecture there are at least three different operating systems, namely Windows, Linux, and Solaris, with substantially different thread semantics and overheads.

- References: (a) there are missing references in the text (e.g., Cluster Controller is mentioned on p. 4 but no reference or explanation is given); (b) the years or publication venues of some references are missing; (c) the only reference for LAPI, specifically [2], is a confidential IBM draft report from 1996. There are newer literature sources from 1999 and 2000 as well as many on-line resources, and they should have been cited too.

- Visual materials: (a) figure 1 is hard to read in black and white; (b) it would be better to present the latency and overhead numbers in tables 2, 3, 4, 6, and 7 in microseconds instead of seconds.

- Large number of blank lines (e.g., p. 23 and p. 25) - the text flow should be improved.

- Network speed should be specified in Mb/s, not in Mb.
>
> F: Presentation Changes

The referee suggests that the authors of the paper review all of the provided comments and address the proposed changes in the section related to style and presentation.