Referee 1
*********************************************************************

Overall, the paper presents an interesting approach to inter-process communication in dynamic, adaptive, irregular applications. The authors clearly state the design objectives of this communication system and also investigate the design tradeoffs involved in achieving these objectives. The paper presents interesting and valuable practical experience in the area of parallel processing. The remainder of these comments focuses on the referee's opinion of some of the weak aspects of the reviewed paper and the changes the paper should undergo in order to reach the level necessary for publication in the journal. The comments are divided into three groups: (i) justification for the presented communication system (DMCS), (ii) performance evaluation, and (iii) style and presentation.

(i) Justification for DMCS

DMCS is introduced as a system that meets the requirements of dynamic, adaptive, irregular parallel algorithms and is also easy to use by application developers. The major requirements of the communication model needed to adequately support the algorithms presented in the paper are specified as asynchronous, one-sided communication and dynamic process management. The paper reviews MPI and LAPI as existing distributed communication interfaces and points out their inability to meet all of the listed requirements. However, the paper provides an incomplete description of these interfaces and fails to convince the reader that DMCS provides substantial improvements over either or both of them. Several points support this impression:

- LAPI is described as a high-performance interface but one with a low level of usability (p. 2). As described, the DMCS API contains 13 functions; the LAPI interface contains 14 functions. In terms of size, LAPI and DMCS are clearly much smaller than MPI (128 functions) and especially MPI-2. However, the paper fails to show that the DMCS API is much easier to use than the one provided by LAPI. The model provided by LAPI is in fact very close to the one offered by DMCS. Also, LAPI's API provides alternative modes for remote and local completion, namely interrupt-based and polling.

- Except for the fact that DMCS handlers are executed in polling mode while LAPI handlers are executed in interrupt mode (which is not necessarily an advantage), the paper does not clearly show how DMCS is better than LAPI; in this respect the justification of the DMCS effort is viewed as insufficient.

- One of the deficiencies of MPI is described as a lack of support for one-sided communication (with the exception of the footnote on p. 2). The paper does not discuss the one-sided communication model provided by MPI-2 in any depth, e.g., whether this model is adequate for the goals of the presented system and, if it is not, how DMCS is better than MPI-2's one-sided model (a brief sketch of that model appears after this list). Also, there is no comparison between LAPI and MPI-2's one-sided models, or any other one-sided models for that matter (for instance, an important communication system with features similar to the ones described in the paper is Princeton VMMC).

- MPI is described as "not intended to be a target of run-time support systems software needed by compilers and problem solving environments". The authors do not show how DMCS can become such a target or why DMCS is better than MPI in this regard.

- The paper states that MPI does not support dynamic process management and intra-node concurrency (p. 2). However, the MPI-2 specification, dated 1998, has provided explicit support in both of these areas. Moreover, thread safety of MPI has been suggested even by the MPI-1 standard (1994). Thread-safe MPI implementations have existed for many years now and have been used successfully in hybrid modes (MPI + threads or MPI + OpenMP).
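For concreteness, the MPI-2 one-sided model referred to in the third bullet looks roughly as follows. This is a minimal, generic sketch (window creation, put, fence synchronization) with placeholder sizes and data; it is not a claim about how DMCS or the paper's applications would use MPI-2.

```c
#include <mpi.h>

int main(int argc, char **argv)
{
    int i, rank;
    double local[100], window_buf[100];
    MPI_Win win;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    for (i = 0; i < 100; i++)
        local[i] = (double) rank;

    /* Expose window_buf for one-sided access by every process. */
    MPI_Win_create(window_buf, 100 * sizeof(double), sizeof(double),
                   MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    MPI_Win_fence(0, win);
    if (rank == 1)   /* write directly into process 0's window; no receive call */
        MPI_Put(local, 100, MPI_DOUBLE, 0, 0, 100, MPI_DOUBLE, win);
    MPI_Win_fence(0, win);   /* completes all outstanding puts on this window */

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}
```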
Further, the justification of DMCS does not make clear whether this system is intended for use by a broader class of applications or only by the ones described in the paper. If it is the former, the paper does not provide a study of how the assumptions and design of DMCS will affect the performance of classes of algorithms with different communication patterns (e.g., ones that are regular in time and space and exchange medium- or large-size messages). If it is the latter, then the paper focuses on only a narrow class of applications, and designing and implementing a whole new communication interface may not be well justified, especially in light of other efforts (such as MPI-2) that have already addressed the same or similar issues and are viewed as de facto standards.

(ii) Performance evaluation

One of the fundamental design decisions of the presented system is that user handlers are executed within the context of the user thread in polling fashion, and as a result DMCS provides low communication overhead. However, the paper does not address how this design affects communication bandwidth or processor overhead. In other words, the paper does not address performance in the more complex space of factors such as latency, bandwidth, and processor overhead. The paper focuses on only one of these performance factors, and although the authors indicate that this particular factor is the most important for the class of algorithms presented in the paper, a broader discussion (or at least a mention of the other factors) is necessary for understanding the performance implications of DMCS.

The comparison between the overheads of MPI/LAPI on one side and DMCS on the other does not provide an accurate picture of the relative communication overheads, because DMCS does not effectively perform any data movement: it is layered on top of MPI or LAPI. Only the overhead of a DMCS implementation that provides its own data movement and completion synchronization can be fairly compared with MPI or LAPI. Studying the DMCS implementations provided in the paper, the only valid conclusion that can be drawn is that DMCS may not be significantly worse than MPI or LAPI in terms of communication overhead.

The paper does not specify how DMCS performs on classes of algorithms other than the ones it addresses. In particular, it does not indicate whether the presented system will perform sufficiently well for medium- to coarse-grain regular data-parallel algorithms that application developers might choose DMCS instead of, for example, MPI.
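To make the distinction concrete, the sketch below (plain MPI, not DMCS) times only the cost of initiating non-blocking transfers. A figure obtained this way characterizes send-side processor overhead and by itself says nothing about end-to-end latency, delivered bandwidth, or the work induced on the receiving processor.

```c
#include <mpi.h>

/* Sketch: average cost of initiating a non-blocking send, measured at the
   sender.  The function and argument names are illustrative. */
double isend_overhead(void *buf, int nbytes, int peer, int reps)
{
    int i;
    MPI_Request req;
    double t0 = MPI_Wtime();

    for (i = 0; i < reps; i++) {
        MPI_Isend(buf, nbytes, MPI_BYTE, peer, 0, MPI_COMM_WORLD, &req);
        MPI_Wait(&req, MPI_STATUS_IGNORE);  /* local completion only; the data
                                               may still be in flight */
    }
    return (MPI_Wtime() - t0) / reps;       /* CPU time spent initiating sends;
                                               bounds neither latency nor
                                               bandwidth */
}
```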
(iii) Style and presentation

- Inaccurate, incomplete, or insufficiently justified statements:

(a) The paper states that advances in network technology have not kept up with processor performance (p. 6). The paper does not mention network technologies such as Myrinet, Giganet, or Quadrics that provide bandwidth surpassing the capabilities of the peripheral buses of the host systems. In this regard, the statement should have been that memory subsystem and/or peripheral bus technologies have not kept up with processor speeds and have prevented high-speed networks (such as the ones listed above) from delivering their maximum performance to the application processes. Technologies such as RapidIO and InfiniBand (also not mentioned) are presently being developed to address the communication bottlenecks of computer I/O subsystems.

(b) The paper identifies the lack of portable thread models as a factor in the decision to eliminate threads from the design of DMCS. The paper does not mention pthreads, which were created exactly for reasons of portability, or explain why pthreads are insufficient for the goals of the paper.

(c) The paper states (p. 3 and p. 10) that the thread context switch depends on the "hardware architecture of the underlying processor" (p. 10). Clearly, the major factor that determines thread context-switch semantics (preemptive or cooperative; mapped to kernel threads or to full-fledged processes) as well as the overhead is primarily the architecture of the operating system, not the architecture of the processor. For example, on the same Intel PC platform, with the same hardware and processor architecture, there are at least three different operating systems, namely Windows, Linux, and Solaris, with substantially different thread semantics and overheads.

- References: (a) there are missing references in the text (e.g., Cluster Controller is mentioned on p. 4 but no reference or explanation is given); (b) some references are missing years or publication venues; (c) the only reference for LAPI, [2], is a confidential IBM draft report from 1996; there are newer literature sources from 1999 and 2000 as well as many on-line resources, and they should be cited too.

- Visual materials: (a) Figure 1 is hard to read in black and white; (b) it would be better to present the latency and overhead numbers in Tables 2, 3, 4, 6, and 7 in microseconds instead of seconds.

- Large number of blank lines (e.g., p. 23 and p. 25); the text flow should be improved.

- Network speed should be specified in Mb/s, not in Mb.

Presentation Changes

The referee suggests that the authors review all of the provided comments and address the proposed changes in the section related to style and presentation.

Referee 2
*********************************************************************

General Comments:

The subject of the paper is very interesting. However, the paper has some serious shortcomings and does not deliver a consistent and convincing message regarding the usefulness, performance, and portability of DMCS. In addition, the references to other related work are quite old, pointing mostly to inactive projects, and are incomplete. The recent referenced papers are only those written by the authors themselves. Why is the described system more appropriate for its targeted applications than other well-known interfaces such as Generic Active Messages, Nexus, or MPI-2? Readers of the paper will certainly be interested in this issue.

Specific Comments:

1. The point about DMCS being easy to port and maintain by non-experts (!) made in the Abstract appears to be inconsistent with the description of the porting complexities in Sect. 5.1.

2. Several references to the LAPI functionality are not accurate. For example:

* On page 8 only two LAPI threads are mentioned. In fact, LAPI uses up to three threads: the user thread, another thread that can run the header handler in interrupt mode, and a third thread that runs the completion handler.

* The LAPI function names mentioned in the paper are mostly incorrect. For example, lapi_poll() in Fig. 2 does not exist in LAPI. The closest function is LAPI_Probe(), and it has different semantics. The names of other LAPI operations referenced in the paper, such as Send() and wait(), are incorrect too.

* Figure 3, representing LAPI operation, is incorrect; please refer to the paper by G. Shah et al. in IPDPS'98.

* Reference [2] (a confidential report from 1996) should be updated to point to the IBM PSSP documentation, the paper by Shah, or even a recent IBM "Redbook" that includes a section on that commercial product interface.
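For reference, the polling and completion calls that do exist look roughly as follows. This is sketched from memory; exact prototypes should be verified against the IBM LAPI documentation. The counter shown is one of the (up to three) counters LAPI attaches to each transfer, which also relates to comment 6 below.

```c
#include <lapi.h>

/* Sketch only: argument lists reproduced from memory, not verified against
   the current LAPI documentation. */
void wait_for_origin_completion(lapi_handle_t hndl, lapi_cntr_t *org_cntr)
{
    int cur;
    LAPI_Probe(hndl);                        /* polling-mode progress: process
                                                any messages that have arrived */
    LAPI_Waitcntr(hndl, org_cntr, 1, &cur);  /* wait until the origin counter
                                                reaches 1, then decrement it */
}
```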
3. The role of polling in a one-sided programming interface like DMCS is rather controversial and, unfortunately, not well addressed in the paper. How do you address the difficulty of inserting polling calls into the application code? What happens if the application spends a substantial amount of time in a precompiled library call (e.g., BLAS)? The discussion and the microbenchmark performance results in Sect. 6 ignore the fact that the polling frequency and response delay, determined by the placement of the polling calls, are on the critical path of the get operation. There is a fair number of publications discussing this issue and proposing techniques for performance improvement.

4. Some of the performance results in Section 6 are rather confusing.

* How are the "send times" in Tables 2-5 defined and measured? They seem to correspond to only a portion of the data transfer cost.

* How do you explain that operations that transfer more data, e.g., dmcs_async_rsr4, execute faster than ones that move less data, like dmcs_async_rsr1?

* Why are only timing results reported for starting up operations that complete remotely (with most of the processing done off the critical path), such as put and remote-service-request? An informative indicator of DMCS performance would be results for a blocking get operation, as it would reveal the processing costs on both the local and the remote process.

* Reporting the overhead of DMCS over the underlying communication protocol does not fully clarify how that protocol is used by DMCS or whether the overall implementation is efficient. For example, there are many ways to use MPI, and they can lead to different performance characteristics of an "application". In the context of your GQDT application running on Sun/Ethernet (Table 9), how can the reader be sure that the DMCS approach corresponds to an optimal utilization of MPI and the network? Perhaps a pure MPI implementation based on a different set of MPI operations (e.g., persistent, nonblocking communication) than those the DMCS port is using would improve the performance of this application over the reported results?
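As one hedged illustration of the alternative suggested in the last bullet, a pure-MPI code with a fixed communication partner could set up persistent requests once and restart them each iteration. The function and variable names below are placeholders, not code from the paper.

```c
#include <mpi.h>

/* Sketch of MPI persistent, nonblocking communication; names are placeholders. */
void exchange_loop(double *buf, int count, int peer, int n_iters)
{
    int iter;
    MPI_Request req;

    /* Build the send once... */
    MPI_Send_init(buf, count, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD, &req);

    for (iter = 0; iter < n_iters; iter++) {
        MPI_Start(&req);                   /* ...and restart it every iteration */
        /* overlap computation here */
        MPI_Wait(&req, MPI_STATUS_IGNORE);
    }
    MPI_Request_free(&req);
}
```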
5. The selection of hardware configurations for the application performance studies does not help evaluate the performance and scalability merits of DMCS to the full extent. It would be desirable to run the application and report performance results on a larger system with a faster network. The IBM SP used in this work has only two nodes, and the Ethernet network in the Sun cluster appears insufficient for this application to scale regardless of the communication substrate used. Since the paper acknowledges funding support from NSF, it is surprising that the authors did not use more adequate computational resources that should be available to them in NSF centers, for example the teraflop-class IBM SP at SDSC.

6. The idea of acknowledgment variables described in Section 7 does not appear to be original. In fact, the IBM LAPI interface uses three counter variables for the same purpose as DMCS.

7. How does DMCS handle the multiple communication protocols available on SMP clusters?

8. The references to related projects should be updated and extended. For example, you should reference and compare the current work to other similar portable systems such as MPI-2, Madeleine from U. Lyon, ARMCI from Pacific Northwest National Laboratory, and Ironman/ZPL from U. Washington.

Presentation Changes

Please change the colors and/or line styles in Figures 6, 7, and 8 to improve readability. Only two of the three curves are visible.

Referee 3
*********************************************************************

Overall this is a good paper, and the work addresses an important aspect of high-performance communication for parallel applications and machines. However, I believe that it is lacking a few important pieces that need to be added in order to make it complete.

Most of the complexity in message passing occurs on the receiving side. This is reinforced in the paper by the descriptions of how to implement DMCS on top of MPI and LAPI. Most of the important issues, such as how the receive side handles message ordering, message buffering, and polling strategies, are described in detail. However, the performance results do not include any data specifically related to the receive side.

The paper makes several references to a DMCS polling operation that the main application thread must call in order for DMCS handlers to execute. The polling operation is an important part of the DMCS implementation, but one cannot tell from the paper exactly how it should be used. Does the main thread simply call the polling function continuously, as in an event-driven X program? Or is it something that needs to be called periodically by the application? There should be something that describes the polling function and its use in greater detail (a sketch of the two usage patterns in question appears at the end of these comments).

It is also unclear from the performance measurements exactly what the receive side is doing. The overhead of the asynchronous send-side operations is interesting, but I would think that ultimately the delivery of data and/or remote function invocation are important things to measure as well. It seems like the LAPI implementation would certainly perform better than the MPI probing implementation on the receive side, but there should be some performance numbers to back that up.

The other piece that is missing is a related work section. There are several programming models and systems based on one-sided communication -- Cray SHMEM, Global Arrays, Cooperative Data Sharing, MPI-2, etc. An overview of how DMCS compares and contrasts with these systems is needed. It appears as though DMCS pre-dates some of these and specifically addresses areas that they do not. It would be beneficial for this paper to clarify and qualify what distinguishes DMCS from these other projects.
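For illustration, the two usage patterns in question might look roughly as follows; dmcs_poll() and the other names are hypothetical placeholders rather than the paper's documented API.

```c
/* Hypothetical sketch; dmcs_poll(), refine_element(), and POLL_INTERVAL are
   placeholder names, not the paper's documented API. */
extern void dmcs_poll(void);
extern void refine_element(int i);
#define POLL_INTERVAL 64

/* Pattern 1: a dedicated progress loop, as in an event-driven X program. */
void progress_loop(volatile int *done)
{
    while (!*done)
        dmcs_poll();             /* run handlers for any arrived requests */
}

/* Pattern 2: polling calls interleaved with the computation. */
void refine_all(int n_elements)
{
    int i;
    for (i = 0; i < n_elements; i++) {
        refine_element(i);       /* application work */
        if (i % POLL_INTERVAL == 0)
            dmcs_poll();         /* bounds the delay before requests from
                                    other processes are serviced */
    }
}
```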
Presentation Changes

In Figures 6 and 7, the line plotting "DMCS time" is hardly visible; likewise in Figure 8 for "LAPI time". For Figures 6 and 7, the nearly invisible line plot and the scale of the graph make them hard to decipher. The tables of data would be much clearer and easier to understand if they were presented in microseconds or milliseconds rather than in seconds using scientific notation. The statement regarding the "Cluster Controller" on page 4 should cite a reference and possibly describe what exactly it is. On page 7, MPI is referred to as a "common message passing software package", which is an incorrect description. Care should be taken when discussing MPI as a standard as opposed to MPI as an implementation.