Subject: C530 Paper Review
From: Matt Welsh
Date: Wed, 31 Oct 2001 10:27:25 -0800
To: gcf@indiana.edu

Here is my review of C530. I hope you don't think I am too negative on my paper reviews for CCPE (I am setting the bar fairly high). If you think I should recalibrate let me know ...

Cheers,
Matt

Review form - C530: The Virtual Service Grid: An Architecture for Delivering High-End Network Services
Authors: Jon B. Weissman and Byoung-Dai Lee

A: General Information

Please return to:
Geoffrey C. Fox
Electronically Preferred gcf@indiana.edu
Concurrency and Computation: Practice and Experience
Computer Science Department
228 Lindley Hall
Bloomington Indiana 47405
Office Phone 8128567977(Lab), 8128553788(CS) but best is cell phone 3152546387
FAX 8128567972

Please fill in Summary Conclusions (Sec. C) and details as appropriate in Secs. D, E and F.

B: Refereeing Philosophy

We encourage a broad range of readers and contributors. Please judge papers on their technical merit and separate comments on this from those on style and approach. Keep in mind the strong practical orientation that we are trying to give the journal. Note that the forms attached provide separate paper for comments that you wish only the editor to see and those that both the editor and author receive. Your identity will of course not be revealed to the author.

C: Paper and Referee Metadata

Paper Number Cnnn: C530
Date: Oct 30, 2001
Paper Title: The Virtual Service Grid
Author(s): Weissman and Lee
Referee: Matt Welsh
Address: mdw@cs.berkeley.edu

Referee Recommendations. Please indicate overall recommendations here, and details in following sections.

reject

D: Referee Comments (For Editor Only)

None; see Section E and F.

E: Referee Comments (For Author and Editor)

This paper presents a system called the Virtual Service Grid, which manages replication and load-balancing of network services in the wide area. VSG automatically creates and destroys service replicas in response to client demand, and performs replica selection based on performance prediction for each service node. The paper suggests that this approach can be applied to a wide range of services, including high-end scientific computing applications as well as commodity Web-based services. Performance results are given on a distributed testbed demonstrating the viability of the VSG approach on a small scale.

I have a number of reservations about this paper that could perhaps be addressed by the authors through a round of (heavy) revisions. Therefore, I am recommending a marginal rejection for this paper, as I think these revisions would warrant re-review once completed.

The first concern is that this paper appears to draw very closely from three other papers by the same authors on similar topics, and I am worried that this might be the "least publishable unit". Only one of the three other papers is cited here, and it is not placed in context. What are the new results presented in this paper, and how does it build upon the previous work you have done?

There are several technical concerns about this work which the authors acknowledge as areas for future work, but which seriously limit the applicability of the approach here. First, VSG assumes that services are stateless - a term that is not defined in this paper but presumably means that services can be started or stopped (or fail) at any time, have no side effects, are idempotent, and are homogeneous. In the real world these assumptions are very rarely (if ever) true. Even purely computational services generally produce a great deal of local state, and reuse of intermediate results is an important aspect of pipelining and optimization in this regime. (This fact is even relied upon in the performance results presented in section 4 of the paper.) I have a hard time believing that this issue is really orthogonal to replication management and selection.

The introduction claims that this paper addresses fault transparency, supposedly by automatically failing over to a backup replica in the event of a service failure; however, the paper does not actually address this issue, which certainly complicates most of the techniques described here. If the paper does not address failures, then fault transparency should not be claimed in the introduction.

My most important reservation about this paper is that it does not appear to address scalability.
The mechanisms presented here rely upon (a) a centralized replica manager (RM) for each service; (b) active network probing by individual clients or groups of clients; and (c) an apparently static configuration in which clients are bound to groups. The performance results given here are for a very small testbed (only around 16 clients with 14 replica hosting machines), and obviously in this environment one can get away with these kinds of assumptions. However, in general, network services must support extremely bursty demands from potentially many tens of thousands of clients with no pre-defined administrative hierarchy or configuration. The authors claim that using a recent history of replica performance is adequate; however, it is not at all clear that this is true, given that sudden bursts can throw off performance estimates by orders of magnitude. The paper does not describe the overhead of the monitoring and probing mechanisms or the frequency with which probes are made. For a prototype the authors may be justified in choosing this design, but it is unfortunate that they discuss neither its implications nor how they would overcome these limitations.

I am concerned about the issues that arise with tuning the VSG system. The "P-Q Algorithm" (which looks like a classic discrete PID controller in disguise) has a number of complex parameters, and it is not clear how a system administrator would go about setting them. The choice seems to depend upon the hardware, network, application, client load, and many other factors - not all of which can be determined a priori. The paper presents a particular setting for these parameters, but nothing is said about how they were derived.

The related work section here is somewhat weak and could do a better job describing load balancing and replication schemes that are in wide use in other domains.
The Oceano project from IBM Research as well as recent work in SOSP'01 by Jeff Chase come to mind; although these deal mainly with replication and load management in local area environments, it is important to draw parallels to this prior work. Akamai has done considerable work in the area of wide-area replication for (static) services, and relies upon complex techniques for replica selection based on performance criteria; however, little is said here about their approach.

My specific suggestions would be to reframe the introduction to state more precisely what problems this paper actually addresses, to spend less space on the myriad performance numbers (many of which do not contribute to a better understanding of the system), and to discuss in some depth how you could extend this approach to cover some of the limitations described above. As the paper stands it is not obvious that VSG could be extended to address scalability, fault tolerance, or stateful services.

F: Presentation Changes

Overall the paper is well-written. A few minor points:

Figure 6 is very difficult to understand. It would be helpful for the various acronyms (GM, RM, etc.) to be defined in the caption. The meaning of the various boxes, circles, and lines is not at all clear.

There seems to be a problem with fonts, especially in the mathematical formulae.

The caption on Figure 7 is truncated.

I'm not sure that publishing the hostnames of your testbed machines is a good idea, lest you invite a denial-of-service attack.

Section 4 degenerates into a large number of figures, and I'm not sure what high-level information the reader is supposed to derive from them. This is especially true of the scatterplots in Figures 17, 21, 24, and 26; smoothed averages or histograms/CDFs would have been more helpful. I would also be interested to see the number of replicas created/destroyed/utilized in the discussion of replica creation and destruction.
The figures only show the aggregate response time, but this is a second-order effect.
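
To make the suggestion about smoothed averages and CDFs concrete, here is a minimal Python sketch of how raw per-request response times could be summarized instead of being shown as a scatterplot. It is only an illustration: the function names and the sample values are hypothetical and are not taken from the paper, and a trailing moving average is just one possible choice of smoothing.

```python
from bisect import bisect_right


def moving_average(samples, window):
    """Trailing moving average; early points average whatever is available."""
    if window < 1:
        raise ValueError("window must be >= 1")
    smoothed = []
    running = 0.0
    for i, value in enumerate(samples):
        running += value
        if i >= window:
            # Drop the sample that just left the window.
            running -= samples[i - window]
        smoothed.append(running / min(i + 1, window))
    return smoothed


def empirical_cdf(samples):
    """Return a function F where F(t) is the fraction of samples <= t."""
    ordered = sorted(samples)
    count = len(ordered)
    if count == 0:
        raise ValueError("need at least one sample")

    def cdf(t):
        return bisect_right(ordered, t) / count

    return cdf


# Hypothetical response times in seconds, made up purely for illustration.
response_times = [1.0, 3.0, 2.0, 10.0, 2.0, 4.0]

smoothed = moving_average(response_times, window=3)
cdf = empirical_cdf(response_times)

print("smoothed:", smoothed)
print("fraction of requests within 2 s:", cdf(2.0))
```

A CDF of this kind shows at a glance what fraction of requests finished within a given time, including the tail caused by bursts, which a dense scatterplot tends to hide.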