Subject:
C530 Paper Review
From:
Matt Welsh <mdw@cs.berkeley.edu>
Date:
Wed, 31 Oct 2001 10:27:25 -0800
To:
gcf@indiana.edu

Here is my review of C530. I hope you don't think I am too negative
on my paper reviews for CCPE (I am setting the bar fairly high).
If you think I should recalibrate let me know ...

Cheers,
Matt


Review form - 

C530: The Virtual Service Grid: An Architecture for Delivering High-End 
Network Services
Authors: Jon B. Weissman and Byoung-Dai Lee

A: General Information

Please return to:
Geoffrey C. Fox  Electronically Preferred gcf@indiana.edu

Concurrency and Computation: Practice and Experience
Computer Science Department
228 Lindley Hall
Bloomington
Indiana 47405
Office Phone 8128567977(Lab), 8128553788(CS)
but best is cell phone 3152546387
FAX 8128567972

Please fill in Summary Conclusions (Sec. C) and details as appropriate in Secs.
D, E and F.

B: Refereeing Philosophy
We encourage a broad range of readers and contributors. Please judge papers on
their technical merit and separate comments on this from those on style and
approach. Keep in mind the strong practical orientation that we are trying to
give the journal. Note that the forms attached provide separate paper for
comments that you wish only the editor to see and those that both the editor and
author receive. Your identity will of course not be revealed to the author.

C: Paper and Referee Metadata
             Paper Number Cnnn: C530
             Date: Oct 30, 2001
             Paper Title: The Virtual Service Grid
             Author(s): Weissman and Lee
             Referee: Matt Welsh
             Address: mdw@cs.berkeley.edu

Referee Recommendations. Please indicate overall recommendations here, and
details in following sections.
             reject

D: Referee Comments (For Editor Only)

None; see Sections E and F.

E: Referee Comments (For Author and Editor)

This paper presents a system called the Virtual Service Grid, which manages
replication and load-balancing of network services in the wide area. VSG
automatically creates and destroys service replicas in response to client
demand, and performs replica selection based on performance prediction
for each service node. The paper suggests that this approach can be applied to
a wide range of services, including high-end scientific computing applications 
as well as commodity Web-based services. Performance results are given on a
distributed testbed demonstrating the viability of the VSG approach on a small
scale. 

I have a number of reservations about this paper that could perhaps be
addressed by the authors through a round of (heavy) revisions. Therefore, I am
recommending a marginal rejection for this paper as I think these revisions
would warrant re-review once completed.

The first concern is that this paper appears to draw very closely from 3 other
papers by the same authors on similar topics, and I am worried that this might
be the "least publishable unit". Only one of the 3 other papers is cited
here and it is not placed in context. What are the new results presented 
in this paper, and how does it build upon the previous work you have done?

There are several technical concerns about this work that seriously limit the
applicability of the approach, even though the authors acknowledge them as
areas for future work. First, VSG assumes that services are
stateless - a term that is not defined in this paper but presumably means that
services can be started or stopped (or fail) at any time, have no side
effects, are idempotent, and are homogeneous. In the real world these
assumptions are very rarely (if ever) true. Even purely computational
services generally produce a great deal of local state, and reuse of
intermediate results is an important aspect of pipelining and optimization
in this regime. (This fact is even relied upon in the performance results
presented in section 4 of the paper.) I have a hard time believing that this
issue is really orthogonal to replication management and selection.

The introduction claims that this paper addresses fault transparency, 
supposedly by automatically failing over to a backup replica in the event 
of a service failure; however, the paper does not actually address this 
issue, which certainly complicates most of the techniques described here. 
If the paper does not address failures then it should not be claimed in
the introduction.

My most important reservation about this paper is that it does not appear to
address scalability. The mechanisms presented here rely upon (a) a centralized
replica manager (RM) for each service; (b) active network probing by
individual clients or groups of clients; and (c) an apparently static
configuration in which clients are bound to groups. The performance results
given here are for a very small testbed (only around 16 clients with 14
replica hosting machines), and obviously in this environment one can
get away with these kinds of assumptions. However, in general network services
must support extremely bursty demands with potentially many tens of thousands
of clients with no pre-defined administrative hierarchy or configuration.
The authors claim that using a recent history of replica performance is 
adequate; however, it is not at all clear that this is true, given that 
sudden bursts can throw off performance estimates by orders of magnitude. 
The paper does not describe the overhead of the monitoring and probing 
mechanisms or the frequency with which probes are made. As far as 
a prototype goes the authors may be justified in choosing this design, 
but it is unfortunate that they do not discuss the implications nor how 
they would overcome these limitations. 
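
To make the concern about recent-history estimates concrete, here is a minimal
sketch of how a windowed average reacts to a sudden burst. It is not the
estimator used in the paper: the function name, the window size, and the
response times are all assumptions chosen for illustration.

```python
from collections import deque


def windowed_mean_predictions(samples, window=5):
    """Predict each sample as the mean of the previous `window` samples.

    Hypothetical stand-in for a "recent history" estimator; the paper's
    actual predictor may differ.
    """
    history = deque(maxlen=window)
    predictions = []
    for value in samples:
        # The prediction is made before the new sample is observed.
        predictions.append(sum(history) / len(history) if history else None)
        history.append(value)
    return predictions


# Made-up response times in seconds: steady load, then a sudden burst.
response_times = [0.1] * 10 + [10.0] * 3
predictions = windowed_mean_predictions(response_times, window=5)

for actual, predicted in zip(response_times, predictions):
    if predicted is not None:
        ratio = actual / predicted
        print(f"actual={actual:6.2f}s predicted={predicted:6.2f}s ratio={ratio:6.1f}x")
```

With these made-up numbers the first request of the burst is underestimated by
roughly two orders of magnitude, and the estimate needs a full window of
samples to catch up.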

I am concerned about the issues that arise with tuning the VSG system. The
"P-Q Algorithm" (which looks like a classic discrete PID controller in
disguise) has a number of complex parameters, and it is not clear how a system
administrator would go about setting them. The choice seems to depend upon the
hardware, network, application, client load, and many other factors - not 
all of which can be determined a priori. The paper presents a particular
setting for these parameters but nothing is said about how they were derived.
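
For reference, the following is a minimal sketch of a textbook discrete PID
controller, which is what I am comparing the P-Q Algorithm to. It is not the
authors' algorithm; the class name, the gains, the target response time, and
the measurements are assumptions made up for illustration. Even this generic
form exposes three gains, a setpoint, and a sampling interval, all of which
have to be tuned.

```python
from dataclasses import dataclass


@dataclass
class DiscretePID:
    """Generic discrete PID controller (illustrative, not the P-Q Algorithm)."""

    kp: float        # proportional gain
    ki: float        # integral gain
    kd: float        # derivative gain
    setpoint: float  # target value, e.g. a desired response time
    dt: float = 1.0  # sampling interval
    integral: float = 0.0
    prev_error: float = 0.0

    def update(self, measurement):
        # Sign chosen so that a positive output means "add capacity"
        # when the measurement exceeds the target.
        error = measurement - self.setpoint
        self.integral += error * self.dt
        derivative = (error - self.prev_error) / self.dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative


# Hypothetical gains and measured response times (seconds).
controller = DiscretePID(kp=2.0, ki=0.5, kd=1.0, setpoint=1.0)
for measured in [1.0, 3.0, 3.0, 1.5, 1.0]:
    print(f"measured={measured:.1f}s control_output={controller.update(measured):+.2f}")
```

How the control output would map onto a number of replicas to create or
destroy is a further design decision that the sketch deliberately leaves open.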

The related work section here is somewhat weak and could do a better job
describing load balancing and replication schemes that are in wide use in
other domains. The Oceano project from IBM Research as well as recent work in
SOSP'01 by Jeff Chase come to mind; although these deal mainly with
replication and load management in local area environments it is important to
draw parallels to this prior work. Akamai has done considerable work in the
area of wide-area replication for (static) services, and relies upon complex
techniques for replica selection based on performance criteria; however,
little is said here about their approach.

My specific suggestions would be to reframe the introduction to state more
precisely what problems this paper actually addresses, to spend less space on
the myriad performance numbers (many of which do not contribute to a better
understanding of the system), and discuss in some depth how you could extend
this approach to cover some of the limitations described above. As the paper
stands it is not obvious that VSG could be extended to address scalability,
fault tolerance, or stateful services. 

F: Presentation Changes

Overall the paper is well-written. A few minor points:

Figure 6 is very difficult to understand. It would be helpful for the 
various acronyms (GM, RM, etc.) to be defined in the caption. The 
meaning of the various boxes, circles, and lines is not at all clear.

There seems to be a problem with fonts, especially in the mathematical
formulae.

The caption on Figure 7 is truncated.

I'm not sure that publishing the hostnames of your testbed machines is a
good idea, lest you invite a denial-of-service attack.

Section 4 degenerates into a large number of figures, and I'm not sure what
high-level information the reader is supposed to derive from them. This is
especially true of the scatterplots in Figures 17, 21, 24, and 26; more
helpful would have been smoothed averages or histograms/CDFs. Also, in the
discussion of replica creation and destruction I would like to see the number
of replicas created, destroyed, and utilized. The figures only show the
aggregate response time, but this is a second-order effect.
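
As an illustration of the summaries I have in mind in place of the raw
scatterplots, here is a minimal sketch that computes an empirical CDF and a
moving average from a list of response times. The function names and the
sample values are assumptions; the paper's actual data would be substituted.

```python
def empirical_cdf(samples):
    """Return (value, fraction of samples <= value) pairs in sorted order."""
    ordered = sorted(samples)
    n = len(ordered)
    return [(x, (i + 1) / n) for i, x in enumerate(ordered)]


def moving_average(samples, window):
    """Smooth a series with a trailing window (shorter at the start)."""
    if window < 1:
        raise ValueError("window must be at least 1")
    smoothed = []
    running = 0.0
    for i, x in enumerate(samples):
        running += x
        if i >= window:
            # Drop the sample that just fell out of the window.
            running -= samples[i - window]
        smoothed.append(running / min(i + 1, window))
    return smoothed


# Made-up response times in seconds, in arrival order.
response_times = [0.8, 1.2, 0.9, 4.0, 1.1, 1.0, 3.5, 0.7]

for value, fraction in empirical_cdf(response_times):
    print(f"{fraction:5.1%} of requests completed within {value:.1f}s")

print(moving_average(response_times, window=3))
```

Either summary makes it far easier to compare two configurations than a cloud
of individual points does.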

