Subject: C443 JGFSI Reviews
Resent-Date: Fri, 12 Nov 1999 08:37:46 -0500
Resent-From: Geoffrey Fox <gcf@npac.syr.edu>
Resent-To: p_gcf@npac.syr.edu
Date: Fri, 5 Nov 1999 18:36:47 -0800 (PST)
From: "Michael O. Neary" <neary@cs.ucsb.edu>
To: Geoffrey Fox <gcf@npac.syr.edu>

Dear Geoffrey,

Please find attached a new version of our Javelin++ paper.

We have addressed some of the referees' comments with minor
changes to the paper. The main change is a totally revised
section on the discussion of Java applets vs applications,
where we try to clarify that we are *not* abandoning applets,
but just not focusing on their development anymore.
This is mainly due to the ongoing browser incompatibilities
when it comes to implementing RMI. Also, with the advent of
JDK 1.2, there is no fundamental distinction between applets
and applications w.r.t. security anymore.

All 3 reviewers emphasized the issue of applets vs. applications.
Since this issue is not the focus of our paper, we sought to clarify
that point: We added/modified several lines in the
abstract, introduction, and conclusion in order to emphasize and
clarify the focus of our research: scalability and fault tolerance.

Below, we address each comment specifically.

Best regards,

Mike Neary & Pete Cappello

Geoffrey Fox writes:
 > I enclose 3 Referee reports on your paper.
 > We would be pleased to accept it and could you please send me
 > a new version before November 5 99
 > Please send a memo describing any suggestions of the
 > referees that you did not address
 > Ignore any aggressive remarks you don't think appropriate but
 > please tell me. I trust you!
 >
 > Thank you for your help in writing and refereeing papers!
 >
 >
 > Referee 1 **********************************************************
 >
 > C443: Javelin++:   Scalability issues in global computing
 > Neary, et al.
 >
 > This paper presents an extension of Javelin in support of scalable
 > global computing. This is a good experiment work. My major concern is
 > its originality. As it is indicated
 > in Section 2 that Javelin++ improved Javelin in three aspects:
 >  -- Java RMI implementation instead of TCP sockets;
 >  -- Java applications instead of applets;
 >  -- distributed broker instead of a centralized broker.

The paper mentions the changes above because they are indeed changes to the
architecture.  However, the switch from TCP to RMI is a technical detail
and not relevant to the scalability issue. Also, as indicated above,
although we use applications primarily, we do not rule out the use of
applets, a point which we have tried to make more clear.
On the other hand, the distributed broker is an architectural change that
is relevant to our quest for a scalable architecture.
To improve the clarity of our intent, we have de-emphasized
the first two points and stressed the third in the new abstract/intro.

The originality lies in Javelin++'s scalability and fault tolerance.
Our distributed work stealing provides scalability.  The smoothly
integrated distributed deterministic eager scheduler provides fault
tolerance.  Additionally, Javelin++:
1. detects and replaces hosts that have failed or retreated from the
   computation;
2. distributes classes over the broker network via the Java class loader,
   which we extended for this purpose.
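
For readers unfamiliar with work stealing, the general idea can be sketched
as follows. This is a minimal, single-JVM illustration of the textbook
technique (the owner takes its newest task, a thief steals the oldest), not
the distributed algorithm in the paper. The class name StealableQueue and
its methods are hypothetical and do not come from Javelin++.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/**
 * Hypothetical sketch of a per-host task queue for work stealing.
 * Not the Javelin++ implementation; names are illustrative only.
 */
public class StealableQueue<T> {
    private final Deque<T> tasks = new ArrayDeque<>();

    /** The owning host adds newly split work at the tail. */
    public synchronized void push(T task) {
        tasks.addLast(task);
    }

    /** The owning host takes its newest task (tail); null if empty. */
    public synchronized T take() {
        return tasks.pollLast();
    }

    /** An idle host steals the oldest task (head); null if empty. */
    public synchronized T steal() {
        return tasks.pollFirst();
    }
}
```

Stealing from the opposite end keeps owner and thief from competing for the
same task. In a recursively split computation the oldest task also tends to
be the largest piece of work, which is one common reason for this choice.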

 > While some of the extensions may not be so trivial in implementations,
 > overall I am doubting if extensions are significant enough to justify
 > for another journal publication. RMI provides an alternative communication
 > mechansim to TCP. It simplifies programming, in particular, for
 > workstealing type of parallel applications. Expectedly, RMI is slower than
 > TCP. By how much in Javelin++?
 >

The choice of RMI vs. TCP is not relevant to the question of scalability.
We can always plug in a faster communication subsystem. KaRMI, the
Karlsruhe RMI presented at Java Grande, comes to mind.

On the other hand, the features that provide scalability and fault tolerance,
mentioned above, are original and non-trivial.  In the revision, we
have endeavored to communicate our focus and original contributions
more clearly to the reader.

 > Javelin++ supports Java standalone applications, instead of Java applets.
 > More
 > applications can be adapted to Javelin++. The authors listed four drawbacks
 > with Java applets. The authors are expected to show some examples in
 > experiments that beyond the capability of Javelin. Unfortunately, the paper
 > just simply repeats the experiment of Javelin using the same example on a
 > cluster of PC/workstations.
 >

Indeed, more applications would be desirable. However, by using the
same apps as in the previous system, we provide performance data on
Javelin++ that is _comparable_ to the performance data on the original
Javelin prototype.  We noted this virtue in the paper's Experimental
Results section.

 > Javelin++ extends Javelin with distributed brokers for scalability.
 > Associated with the distributed brokers, the paper implements a number of
 > interesting scheduling algorithms, including work stealing for task
 > distribution and eager scheduling for  fault tolerance. With more
 > work on the distributed scheduling strategies with some analysis of their
 > scalability and overhead, it might be more appropriate to present the
 > work as a scheduling paper.
 >

We agree that we could have written this as a scheduling paper.
But, for us, scheduling is of interest only in so
far as it affects scalability and fault tolerance, which we regard
as fundamental issues---hence the focus of our paper.

 >     ---------------------------------------------------------------------
 > Referee 2 *****************************************************************
 >
 > Referee report for Javelin++
 >
 > Seems like a reasonable paper, but the presentation could be cleaned up a
 > bit.
 >
 > This paper describes a distributed work-stealing system, focusing mostly on
 > two
 > work-stealing algorithms.  The introduction is rambling and
 > repetitious,

We were disappointed by this comment; we tried hard to compactly identify
the fundamental issues, to place this paper's focus in context.  We have
changed the wording slightly to convey this context-setting intention.
The length of text to accomplish this is less than 1/2 of a page, and we
are hard-pressed to cut further.

 > talking about a number of important issues, most of which are only touched
 > upon
 > in the body of the paper.  This paper is about load balancing, which is

The paper is about scalability and fault tolerance.  Load balancing comes
as a pleasant side-effect of our eager scheduler, whose purpose is to
provide fault tolerance.  Again, we have made wording changes to
clarify our focus.

 > fine,
 > but you have to get pretty far through the paper to realize that the other
 > issues are dealt with only in passing.  I would rather see an introduction
 > that
 > says ``there are n fundamental issues, but here we focus on scalable work
 > stealing and load balancing''.
 >

What we did is not that different from what is being suggested.  Instead
of saying "there are 5 fundamental issues, but here we focus on scalability
and fault tolerance," we describe the 5 issues in a manner that,
we believe, is more concise than previously accomplished.  In fact, the
compactness of this characterization is, in our opinion, a _contribution_
of this paper.  Based on the reviewer's comment, we explicitly mention
the focus of the paper in several places, including the Introduction, where
we say, "The focus of this paper is on the fundamental issues of scalability
and fault tolerance."  The organization of the Introduction is:
1) state the goal/problem; 2) characterize the issues (which include
scalability and fault tolerance); 3) note that previous work has not
fully achieved scalability; 4) state our focus: scalability and
fault tolerance.  They are related: The latter is required to realize
the former.


 > The paper would benefit from a more complete discussion of the kinds of
 > applications Javelin++ is intended to support.  The only example considered
 > is
 > rendering, which is the very model of an ``embarrasingly parallel''
 > application.  I have no quarrel with this, but I think a discussion of how
 > other kinds of applications (or even other applications) fit into their
 > splittable interface would be helpful in understanding what this system can
 > and
 > can't do.  Are there requirements about determinism (what happens if you do
 > the
 > same piece twice, perhaps as a result of fault-tolerance?).  What if pieces

In the Introduction, after we state our focus, we place in a separate
paragraph a discussion of our computational model:
the piecework model of computation.  There we describe the nature of this
model, giving examples of architectures that support it, and applications
that are natural to it.

The question of what happens when a piece is computed twice (for reasons
of fault tolerance) is a good one.  In our original submission,
we failed to say that our system currently passes the first result to the
client, discarding the rest.
We explicitly state this in our revision.  Of course, other policies are
possible.  The important point: the client has a simple programming
model in which results are _never_ missing or duplicated.
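
The first-result policy described above can be sketched roughly as follows.
This is a minimal illustration under our own assumptions: pieces are
identified by an integer id and results are plain strings. The class name
FirstResultFilter and its methods are hypothetical and are not part of the
Javelin++ API.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/**
 * Hypothetical sketch: forwards only the first result reported for each
 * piece and discards later duplicates. Not the Javelin++ implementation.
 */
public class FirstResultFilter {
    private final Set<Integer> delivered = new HashSet<>();
    private final List<String> clientInbox = new ArrayList<>();

    /**
     * Reports a result for a piece. Returns true if the result was passed
     * on to the client, false if it was a duplicate and was discarded.
     */
    public synchronized boolean report(int pieceId, String result) {
        if (!delivered.add(pieceId)) {
            return false; // a result for this piece was already delivered
        }
        clientInbox.add(result);
        return true;
    }

    /** Returns a copy of the results the client has received so far. */
    public synchronized List<String> clientInbox() {
        return new ArrayList<>(clientInbox);
    }
}
```

With a filter of this kind between the hosts and the client, a piece that is
recomputed for fault tolerance still reaches the client exactly once.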

 > need to share data?  Can concurrent pieces communicate with one
 > another?

Pieces are communicationally autonomous, apart from scheduling work and
sending results (like Cilk threads).  This is explicitly stated in our
description of the computational model. The API section also makes
this clear.

 >
 > When loading classes dynamically, does Javelin++ ensure that all the classes
 > needed by an application have been loaded before the computation starts?  I
 > can
 > imagine that pausing an application in the middle to load a user-defined
 > class
 > could be disruptive.
 >

Good point. We have added an explanation in the paper of our
load-on-demand strategy. It might be better to preload classes in
the future. But again, this issue has nothing to do with scalable code
distribution. (Btw, Sun's JavaSpaces also implements demand-driven loading).

 > What was the broker structure for the test application?  Are there any
 > performance numbers for broker lookup, reconfiguration, etc?
 >

We did not address this comment. The broker network is static at the
moment, as described in the paper.  Actual lookup time, which is
typically a few ms, is not relevant to the broker network's
_scalability_.


 > Referee 3 ****************************************************************
 > Subject: C443 JGSI Review
 >
 >
 > a) publish
 >
 > b) this paper describes an improved version of the Javelin system, which
 > provides a Java-based infrastructure for global computing. The
 > impovements include replacing low level TCP-based communication by Java
 > RMI, replacing host applets with host applications (a host is a CPU
 > provider), and finally introducing distributed brokers supporting at
 > least two distinct scheduling algorithms.
 > Going for a highier level communication protocol is definitely the right
 > thing to do, while abandoning applets seems to me controversial. Indeed,
 > it simplifies implementation at the cost of software installation and
 > the idea of running the jevelin host as a screen server is cool.
 > However, I am not convinced by the author arguments that the jevelin
 > host cannot be run as an applet. This would require implementation of
 > proxies for distributed brokers, as some other project do. For enabling
 > access to the local resources, for example, a signed applet can be
 > used.

Yes, the host could be made to run as an applet.  We are more clear about
this in the new section on applets.

 > By the way, the paper does not describe the security aspects of the
 > system. An interesting issue here is how to build trust between the host
 > and the client, with a chain of brokers serving as the

Agreed.  However, security, while fundamental, is not the focus at the
moment.  We have made this more clear in our Introduction.

 > medeworkers.Finally, the idea of distributed workers is a good begining
 > for a scalable, and fault tolerant system.
 >
 > The paper is nicely written, clearly describing the architecure of the
 > system, its APIs, and scheduling mechanism, and is well illustarted with
 > an example application and performance analysis.

    ---------------------------------------------------------------------
                Name: top.ps
   top.ps       Type: Postscript Document (application/postscript)
            Encoding: base64
         Description: Javelin++ final version.