Subject: Re: Request to review a paper C499
From: "Rossen Dimitrov" <rossen@MPI-Softtech.Com>
Date: Sat, 9 Jun 2001 18:36:23 -0500
To: fox@csit.fsu.edu

Dr. Fox,

Below is my report for paper C499.  Thank you for the opportunity to contribute
to the journal.

Rossen Dimitrov, Ph.D.


> REFEREE'S REPORT
> > Concurrency and Computation: Practice and Experience
> > ---------------------------------------------------------------------------
> > A: General Information
> > Please return to:
> Geoffrey C. Fox
> Electronically Preferred fox@csit.fsu.edu
> Concurrency and Computation: Practice and Experience
> Computational Science and Information Technology
> Florida State University
> 400 Dirac Science Library
> Tallahassee Florida 32306-4130
> Office FAX 850-644-0098
> Office Phone 850-644-4587 but best is cell phone 3152546387
> > Please fill in Summary Conclusions (Sec. C) and details as appropriate in
> Secs. D, E and F.
> > B: Refereeing Philosophy
> > We encourage a broad range of readers and contributors. Please judge papers
> on their technical merit and separate comments on this from those on style
> and approach. Keep in mind the strong practical orientation that we are
> trying to give the journal. Note that the forms attached provide separate
> paper for comments that you wish only the editor to see and those that both
> the editor and author receive. Your identity will of course not be revealed
> to the author.
> > C: Paper and Referee Metadata
> > * Paper Number Cnnn:

 C499


> > * Date:

 June 9, 2001


> > * Paper Title:

 Data Movement and Control Substrate for Parallel Adaptive Applications


> > * Author(s):

 Jeffrey Dobbelaere, Kevin Barker, Nikos Chrisochoides, Demian Nave, and Keshav
Pingali


> > * Referee:

 Rossen Dimitrov, Ph.D.


> > * Address:

 101 S. Lafayette St.,
Suite 33
Starkville, MS 39759
(662) 320-4300 x 21


> > Referee Recommendations. Please indicate overall recommendations here, and
> details in following sections.
> > 1. publish as is
> 2. accepted provided changes suggested are made
> 3. reject

 accepted provided changes suggested are made


> > D: Referee Comments (For Editor Only)

 The paper presents an interesting communication system that addresses specific
needs of a class of parallel algorithms. This, together with the experience that
it shares, makes the paper relevant to the practical orientation and the
objectives of the journal.

However, I have the impression that the paper may have been submitted in a
similar form to other publications and may not have been updated in important
areas. For example, the coverage of MPI-2, completed in 1998, is insufficient
(only a footnote in the paper), and the only LAPI reference is a confidential
IBM draft report from 1996, although newer, open LAPI sources from 1999 and 2000
exist. MPI-2 and LAPI are fundamental to the reviewed paper; in fact, the paper
presents a communication system that arguably addresses certain deficiencies of
MPI and LAPI. One would therefore expect the paper to pay more attention to the
evolution of MPI and LAPI as well.
There are several inaccurate statements in the paper. I have listed them in the
following section of this report, subsection (iii).

I would recommend acceptance of the paper only after substantial changes in the
areas commented on in section E below.


> > E: Referee Comments (For Author and Editor)
> 

 Overall the paper presents an interesting approach to inter-process
communication in dynamic, adaptive, irregular applications. The authors clearly
state the design objectives of this communication system and also investigate
the design tradeoffs for achieving these objectives. This paper presents an
interesting and valuable practical experience in the area of parallel
processing.

The remainder of these comments is focused on the referee's opinion about some
of the weak aspects of the reviewed paper and the changes that this paper should
undergo in order to reach the level necessary for publishing in the journal. The
comments are divided into three groups: (i) justification for the presented
communication system (DMCS), (ii) performance evaluation, and (iii) style and
presentation.


(i) Justification for DMCS

DMCS is introduced as a system that meets the requirements of dynamic, adaptive,
irregular parallel algorithms and also as a system that is easy to use by the
application developers. The major requirements of the communication model needed
to support adequately the algorithms presented in the paper are specified as
asynchronous, one-sided communication and dynamic process management. The paper
reviews MPI and LAPI as existing distributed communication interfaces and points
out their inability to meet all of the listed requirements. However, the paper
provides an incomplete description of these interfaces and fails to convince
the reader that DMCS provides substantial improvements over either or both of
the mentioned interfaces. Here are several points that support this impression:

-         LAPI is described as a high-performance interface but as one that has
a low level of usability (p. 2). As described, the DMCS API contains 13
functions. The LAPI interface contains 14 functions. In terms of size, clearly
LAPI and DMCS are much smaller than MPI (128 functions) and especially MPI-2.
However, the paper fails to show that the API of DMCS is much easier to use than
the one provided by LAPI. The model provided by LAPI is in fact very close to
the one offered by DMCS. Also, with its API LAPI provides alternative modes for
remote and local completion, namely interrupt-based and polling.

-         Except for the fact that DMCS handlers are executed in polling mode
while LAPI handlers are executed in interrupt mode (which is not necessarily an
advantage), the paper does not clearly show how DMCS is better than LAPI. In
this respect the justification of the DMCS effort is viewed as insufficient.

-         One of the deficiencies of MPI is described as a lack of support for
one-sided communication (with the exception of the footnote on p. 2). The paper
does not discuss the one-sided communication model provided by MPI-2 in any
depth, e.g., whether this model is adequate for the goals of the presented
system and, if it is not, how DMCS is better than MPI-2's one-sided model. Also,
there is no comparison between LAPI and MPI-2's one-sided models or any other
one-sided models for that matter (for instance, an important communication
system with features similar to the ones described in the paper is Princeton
VMMC).

-         MPI is described as "not intended to be a target of run-time support
systems software needed by compilers and problem solving environments". The
authors of the paper do not show how DMCS can become such a target and why DMCS
is better than MPI in this regard.

-         The paper states that MPI does not support dynamic process management
and intra-node concurrency (p. 2). However, the MPI-2 specification, dated 1998,
has provided explicit support in both of these areas. Moreover, thread-safety of
MPI has been suggested even by the MPI-1 standard (1994). Thread-safe MPI
implementations have existed for many years now and they have been used
successfully in hybrid modes (MPI + threads or MPI + OpenMP).


Further, the justification of DMCS does not make clear if this system is
intended for use by a broader class of applications or only for the ones that
are described in the paper. If it is the former case, the paper does not provide
a study of how the assumptions of DMCS and its design will affect the performance
of classes of algorithms that suggest different communication patterns (e.g.,
they are regular in time and space, and also use exchange of medium or
large-size messages). If it is the latter case, then the paper focuses only on a
narrow class of applications and it seems that designing and implementing a
whole new communication interface may not be well justified, especially in the
light of the existence of other efforts (such as MPI-2) that have already
addressed the same or similar issues and are viewed as de-facto standards.

(ii) Performance evaluation

One of the fundamental design decisions of the system presented in the paper is
that user handlers are executed within the context of the user thread in polling
fashion and as a result DMCS provides low communication overhead. However, the
paper does not address issues related to how this design affects the
communication bandwidth or the processor overhead. In other words, the paper
does not address performance in the broader space defined by factors such
as latency, bandwidth, and processor overhead. The paper focuses only on one of
these performance factors and although the authors indicate that this particular
performance factor is the most important for the class of algorithms presented
in the paper, a broader discussion (or at least mention of the other factors) is
necessary for understanding the performance implications of DMCS.


The comparison between the overheads of MPI/LAPI on one side and DMCS on the
other does not provide an accurate picture of the relative communication
overheads because DMCS does not effectively perform any data movement since it
is layered on top
of MPI or LAPI. Only the overhead of an implementation of DMCS that provides its
own data movement and completion synchronization can be fairly compared with MPI
or LAPI. Studying the DMCS implementations provided in the paper, the only valid
conclusion that can be made is that DMCS may not be significantly worse than MPI
or LAPI in terms of communication overhead.


The paper does not specify how DMCS performs on classes of algorithms other than
the ones addressed in the paper. The paper does not indicate if the presented
system will perform sufficiently well for medium to coarse grain regular
data-parallel algorithms so that application developers might choose to use DMCS
instead of MPI, for example.


(iii) Style and presentation

-         Inaccurate, incomplete, or insufficiently justified statements: (a)
The paper states that advances in network technology have not kept up with the
processor performance (p. 6). The paper does not mention network technologies
such as Myrinet, Giganet, or Quadrics, which provide bandwidth that surpasses
the capabilities of the peripheral buses of the host systems. In this regard,
the statement should have been that the memory subsystem and/or peripheral bus
technologies have not kept up with the processor speeds and have prevented
high-speed networks (such as the ones listed above) from delivering their
maximum performance to the application processes. Technologies such as RapidIO
and Infiniband (also not mentioned) are presently being developed to address the
communication bottlenecks of the computer I/O subsystems. (b) The paper
identifies the lack of portable thread models as a factor in the decision to
eliminate threads from the design of DMCS. The paper does not mention
"pthreads," which were created exactly for reasons of portability, nor does it
explain why pthreads are insufficient for the goals of the paper. (c) The paper
states (p. 3 and p. 10) that the thread context switch depends on the "hardware
architecture of the underlying processor" (p. 10). Clearly, the thread context
switch semantics (preemptive or cooperative; mapped to kernel threads or to
full-fledged processes) as well as the overhead depend primarily on the
architecture of the operating system and not on the architecture of the
processor. For example, on the same Intel PC platform with the same hardware
and processor architecture there are at least three different operating
systems, namely Windows, Linux, and Solaris, with substantially different thread
semantics and overheads.

-         References: (a) there are missing references in the text (e.g.,
Cluster Controller is mentioned on p. 4 but no reference or explanation is
given); (b) the year or publication venue is missing from some references; (c)
the only reference for LAPI, specifically [2], is a confidential IBM draft
report from 1996. There are newer literature sources from 1999 and 2000 as well
as many on-line resources, and they should have been cited too.

-         Visual materials: (a) figure 1 is hard to read in black and white;
(b) it would be better to present the latency and overhead numbers in
tables 2, 3, 4, 6, and 7 in microseconds instead of seconds.

-         Large number of blank lines (e.g., p. 23 and p. 25) - the text flow
should be improved.

-         Network speed should be specified in Mb/s, not in Mb.


> F: Presentation Changes

 The referee suggests that the authors of the paper review all of the provided
comments and address the proposed changes in the section related to style and
presentation.