Subject: C443 JGFSI Reviews
Resent-Date: Fri, 12 Nov 1999 08:37:46 -0500
Resent-From: Geoffrey Fox
Resent-To: p_gcf@npac.syr.edu
Date: Fri, 5 Nov 1999 18:36:47 -0800 (PST)
From: "Michael O. Neary"
To: Geoffrey Fox

Dear Geoffrey,

Please find attached a new version of our Javelin++ paper. We have addressed some of the referees' comments with minor changes to the paper. The main change is a totally revised section on the discussion of Java applets vs. applications, where we try to clarify that we are *not* abandoning applets, but just not focusing on their development anymore. This is mainly due to the ongoing browser incompatibilities when it comes to implementing RMI. Also, with the advent of JDK 1.2, there is no longer any distinction in principle between applets and applications w.r.t. security.

All 3 reviewers emphasized the issue of applets vs. applications. Since this issue is not the focus of our paper, we sought to clarify that point: we added/modified several lines in the abstract, introduction, and conclusion in order to emphasize and clarify the focus of our research: scalability and fault tolerance.

Below, we address each comment specifically.

Best regards,
Mike Neary & Pete Cappello

Geoffrey Fox writes:
> I enclose 3 Referee reports on your paper.
> We would be pleased to accept it and could you please send me
> a new version before November 5 99
> Please send a memo describing any suggestions of the
> referees that you did not address
> Ignore any aggressive remarks you don't think appropriate but
> please tell me. I trust you!
>
> Thank you for your help in writing and refereeing papers!
>
>
> Referee 1 **********************************************************
>
> C443: Javelin++: Scalability issues in global computing
> Neary, et al.
>
> This paper presents an extension of Javelin in support of scalable
> global computing. This is a good experiment work. My major concern is
> its originality.
> As it is indicated in Section 2 that Javelin++ improved Javelin in
> three aspects:
> -- Java RMI implementation instead of TCP sockets;
> -- Java applications instead of applets;
> -- distributed broker instead of a centralized broker.

The paper mentions the changes above because they are indeed changes to the architecture. However, the switch from TCP to RMI is a technical detail and not relevant to the scalability issue. Also, as indicated above, although we use applications primarily, we do not rule out the use of applets, a point which we have tried to make clearer. On the other hand, the distributed broker is an architectural change that is relevant to our quest for a scalable architecture. To improve the clarity of our intent, we have de-emphasized the first two points and stress the third in the new abstract/intro.

The originality lies in Javelin++'s scalability and fault tolerance. Our distributed work stealing provides scalability. The smoothly integrated distributed deterministic eager scheduler provides fault tolerance. Additionally, Javelin++:
1. detects and replaces hosts that have failed or retreated from the computation;
2. distributes classes over the broker network via the Java class loader, which we extended for this purpose.

> While some of the extensions may not be so trivial in implementations,
> overall I am doubting if extensions are significant enough to justify
> for another journal publication. RMI provides an alternative communication
> mechanism to TCP. It simplifies programming, in particular, for
> workstealing type of parallel applications. Expectedly, RMI is slower than
> TCP. By how much in Javelin++?
>

The choice of RMI vs. TCP is not relevant to the question of scalability. We can always plug in a faster communication subsystem. KaRMI, the Karlsruhe RMI presented at Java Grande, comes to mind. On the other hand, the features that provide scalability and fault tolerance, mentioned above, are original and non-trivial.
In the revision, we have endeavored to communicate our focus and original contributions more clearly to the reader.

> Javelin++ supports Java standalone applications, instead of Java applets.
> More applications can be adapted to Javelin++. The authors listed four
> drawbacks with Java applets. The authors are expected to show some examples
> in experiments that beyond the capability of Javelin. Unfortunately, the
> paper just simply repeats the experiment of Javelin using the same example
> on a cluster of PC/workstations.
>

Indeed, more applications would be desirable. However, by using the same applications as in the previous system, we provide performance data on Javelin++ that is _comparable_ to the performance data on the original Javelin prototype. We noted this virtue in the paper's Experimental Results section.

> Javelin++ extends Javelin with distributed brokers for scalability.
> Associated with the distributed brokers, the paper implements a number of
> interesting scheduling algorithms, including work stealing for task
> distribution and eager scheduling for fault tolerance. With more
> work on the distributed scheduling strategies with some analysis of their
> scalability and overhead, it might be more appropriate to present the
> work as a scheduling paper.
>

We agree that we could have written this as a scheduling paper. But, for us, scheduling is of interest only insofar as it affects scalability and fault tolerance, which we regard as fundamental issues, hence the focus of our paper.

> ---------------------------------------------------------------------
> Referee 2 *****************************************************************
>
> Referee report for Javelin++
>
> Seems like a reasonable paper, but the presentation could be cleaned up a
> bit.
>
> This paper describes a distributed work-stealing system, focusing mostly on
> two work-stealing algorithms.
> The introduction is rambling and repetitious,

We were disappointed by this comment; we tried hard to compactly identify the fundamental issues, to place this paper's focus in context. We have changed the wording slightly to convey this context-setting intention. The text that accomplishes this is less than half a page long, and we are hard-pressed to cut further.

> talking about a number of important issues, most of which are only touched
> upon in the body of the paper. This paper is about load balancing, which is

The paper is about scalability and fault tolerance. Load balancing comes as a pleasant side effect of our eager scheduler, whose purpose is to provide fault tolerance. Again, we have made wording changes to clarify our focus.

> fine, but you have to get pretty far through the paper to realize that the
> other issues are dealt with only in passing. I would rather see an
> introduction that says ``there are n fundamental issues, but here we focus
> on scalable work stealing and load balancing''.
>

What we did is not that different from what is being suggested. Instead of saying "there are 5 fundamental issues, but here we focus on scalability and fault tolerance," we describe the 5 issues in a manner that, we believe, is more concise than previously accomplished. In fact, the compactness of this characterization is, in our opinion, a _contribution_ of this paper. Based on the reviewer's comment, we explicitly mention the focus of the paper in several places, including the Introduction, where we say, "The focus of this paper is on the fundamental issues of scalability and fault tolerance."

The organization of the Introduction is:
1) state the goal/problem;
2) characterize the issues (which include scalability and fault tolerance);
3) note that previous work has not fully achieved scalability;
4) state our focus: scalability and fault tolerance. They are related: the latter is required to realize the former.
> The paper would benefit from a more complete discussion of the kinds of
> applications Javelin++ is intended to support. The only example considered
> is rendering, which is the very model of an ``embarrassingly parallel''
> application. I have no quarrel with this, but I think a discussion of how
> other kinds of applications (or even other applications) fit into their
> splittable interface would be helpful in understanding what this system can
> and can't do. Are there requirements about determinism (what happens if you
> do the same piece twice, perhaps as a result of fault-tolerance?). What if
> pieces

In the Introduction, after we state our focus, we place in a separate paragraph a discussion of our computational model: the piecework model of computation. There we describe the nature of this model, giving examples of architectures that support it and applications that are natural to it.

The question of what happens when a piece is computed twice (for reasons of fault tolerance) is a good one. In our original submission, we failed to say that our system currently passes the first result to the client, discarding the rest. We explicitly state this in our revision. Of course, other policies are possible. The important point: the client has a simple programming model in which results are _never_ missing or duplicated.

> need to share data? Can concurrent pieces communicate with one
> another?

Pieces are autonomous with respect to communication, apart from scheduling work and sending results (like Cilk threads). This is explicitly stated in our description of the computational model. The API section also makes this clear.

>
> When loading classes dynamically, does Javelin++ ensure that all the classes
> needed by an application have been loaded before the computation starts? I
> can imagine that pausing an application in the middle to load a user-defined
> class could be disruptive.
>

Good point.
We have added an explanation in the paper of our load-on-demand strategy. It might be better to preload classes in the future. But again, this issue has nothing to do with scalable code distribution. (By the way, Sun's JavaSpaces also implements demand-driven loading.)

> What was the broker structure for the test application? Are there any
> performance numbers for broker lookup, reconfiguration, etc?
>

We did not address this comment. The broker network is static at the moment, as described in the paper. Actual lookup time, which is typically a few ms, is not relevant to the broker network's _scalability_.

> Referee 3 ****************************************************************
> Subject: C443 JGSI Review
>
>
> a) publish
>
> b) this paper describes an improved version of the Javelin system, which
> provides a Java-based infrastructure for global computing. The
> improvements include replacing low level TCP-based communication by Java
> RMI, replacing host applets with host applications (a host is a CPU
> provider), and finally introducing distributed brokers supporting at
> least two distinct scheduling algorithms.
> Going for a higher level communication protocol is definitely the right
> thing to do, while abandoning applets seems to me controversial. Indeed,
> it simplifies implementation at the cost of software installation and
> the idea of running the Javelin host as a screen server is cool.
> However, I am not convinced by the author arguments that the Javelin
> host cannot be run as an applet. This would require implementation of
> proxies for distributed brokers, as some other project do. For enabling
> access to the local resources, for example, a signed applet can be
> used.

Yes, the host could be made to run as an applet. We are clearer about this in the new section on applets.

> By the way, the paper does not describe the security aspects of the
> system.
> An interesting issue here is how to build trust between the host
> and the client, with a chain of brokers serving as the

Agreed. However, security, while fundamental, is not the focus at the moment. We have made this clearer in our Introduction.

> medeworkers. Finally, the idea of distributed workers is a good beginning
> for a scalable, and fault tolerant system.
>
> The paper is nicely written, clearly describing the architecture of the
> system, its APIs, and scheduling mechanism, and is well illustrated with
> an example application and performance analysis.

---------------------------------------------------------------------
Name: top.ps
Type: Postscript Document (application/postscript)
Encoding: base64
Description: Javelin++ final version.