Referee 1 **************************************************

This paper presents a system called the Virtual Service Grid (VSG), which manages replication and load-balancing of network services in the wide area. The VSG automatically creates and destroys service replicas in response to client demand, and performs replica selection based on performance prediction for each service node. The paper suggests that this approach can be applied to a wide range of services, including high-end scientific computing applications as well as commodity Web-based services. Performance results on a distributed testbed demonstrate the viability of the VSG approach on a small scale.

I have a number of reservations about this paper that could perhaps be addressed by the authors through a round of (heavy) revisions. I am therefore recommending a marginal rejection, as I think these revisions would warrant re-review once completed.

My first concern is that this paper appears to draw very closely from three other papers by the same authors on similar topics, and I am worried that this might be the "least publishable unit". Only one of the three other papers is cited here, and it is not placed in context. What are the new results presented in this paper, and how does it build upon the previous work you have done?

There are several technical concerns about this work that, while the authors acknowledge them as areas for future work, seriously limit the applicability of the approach. First, the VSG assumes that services are stateless - a term that is not defined in this paper but presumably means that services can be started or stopped (or fail) at any time, have no side effects, are idempotent, and are homogeneous. In the real world these assumptions are very rarely (if ever) true. Even purely computational services generally produce a great deal of local state, and reuse of intermediate results is an important aspect of pipelining and optimization in this regime.
(This fact is even relied upon in the performance results presented in section 4 of the paper.) I have a hard time believing that this issue is really orthogonal to replication management and selection.

The introduction claims that this paper addresses fault transparency, supposedly by automatically failing over to a backup replica in the event of a service failure; however, the paper does not actually address this issue, which certainly complicates most of the techniques described here. If the paper does not address failures, then that should not be claimed in the introduction.

My most important reservation is that the paper does not appear to address scalability. The mechanisms presented here rely upon (a) a centralized replica manager (RM) for each service; (b) active network probing by individual clients or groups of clients; and (c) an apparently static configuration in which clients are bound to groups. The performance results are for a very small testbed (only around 16 clients with 14 replica-hosting machines), and obviously in this environment one can get away with these kinds of assumptions. In general, however, network services must support extremely bursty demand from potentially many tens of thousands of clients with no pre-defined administrative hierarchy or configuration. The authors claim that using a recent history of replica performance is adequate; it is not at all clear that this is true, given that sudden bursts can throw off performance estimates by orders of magnitude. The paper also does not describe the overhead of the monitoring and probing mechanisms or the frequency with which probes are made. For a prototype the authors may be justified in choosing this design, but it is unfortunate that they discuss neither the implications nor how they would overcome these limitations.

I am also concerned about the issues that arise in tuning the VSG system.
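To make the tuning concern concrete, consider a textbook discrete PID controller, which the paper's algorithm resembles. The sketch below is my own notation for illustration (kp, ki, kd, dt are generic gains and a time step, not the parameters actually used in the paper):

```python
# Sketch of a textbook discrete (positional) PID controller.
# All names (kp, ki, kd, dt) are illustrative notation,
# not the parameters actually used by the paper's algorithm.

class DiscretePID:
    def __init__(self, kp, ki, kd, dt):
        self.kp, self.ki, self.kd, self.dt = kp, ki, kd, dt
        self.integral = 0.0       # accumulated error over time
        self.prev_error = 0.0     # error from the previous step

    def step(self, setpoint, measurement):
        """One control step: returns the actuation signal
        (e.g. how strongly to add or remove replicas)."""
        error = setpoint - measurement
        self.integral += error * self.dt
        derivative = (error - self.prev_error) / self.dt
        self.prev_error = error
        return (self.kp * error
                + self.ki * self.integral
                + self.kd * derivative)
```

Even this simplified controller has four interacting constants whose good values depend on the system being controlled (a poor choice of gains oscillates or overshoots), and the same difficulty applies to the parameters in the paper.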
The "P-Q Algorithm" (which looks like a classic discrete PID controller in disguise) has a number of complex parameters, and it is not clear how a system administrator would go about setting them. The choice seems to depend upon the hardware, network, application, client load, and many other factors - not all of which can be determined a priori. The paper presents a particular setting for these parameters, but nothing is said about how it was derived.

The related work section is somewhat weak and could do a better job of describing load-balancing and replication schemes in wide use in other domains. The Oceano project from IBM Research, as well as recent work in SOSP '01 by Jeff Chase, come to mind; although these deal mainly with replication and load management in local-area environments, it is important to draw parallels to this prior work. Akamai has done considerable work on wide-area replication for (static) services and relies upon complex techniques for replica selection based on performance criteria, yet little is said here about their approach.

My specific suggestions would be to reframe the introduction to state more precisely what problems this paper actually addresses, to spend less space on the myriad performance numbers (many of which do not contribute to a better understanding of the system), and to discuss in some depth how this approach could be extended to cover some of the limitations described above. As the paper stands, it is not obvious that the VSG could be extended to address scalability, fault tolerance, or stateful services.

F: Presentation Changes

Overall the paper is well written. A few minor points:

Figure 6 is very difficult to understand. It would be helpful for the various acronyms (GM, RM, etc.) to be defined in the caption; the meaning of the various boxes, circles, and lines is not at all clear. There also seems to be a problem with fonts, especially in the mathematical formulae. The caption on Figure 7 is truncated.
I'm not sure that publishing the hostnames of your testbed machines is a good idea, lest you invite a denial-of-service attack.

Section 4 degenerates into a large number of figures, and I'm not sure what high-level information the reader is supposed to derive from them. This is especially true of the scatterplots in Figures 17, 21, 24, and 26; smoothed averages or histograms/CDFs would have been more helpful. I would also be interested to see the number of replicas created, destroyed, and utilized during the discussion of replica creation and destruction; the figures show only the aggregate response time, which is a second-order effect.

Referee 2 **************************************************

This is an interesting paper addressing an important problem, and it is clearly written. I was not convinced by the particular applications chosen - matrix multiplication and full-matrix Jacobi iteration are hardly central computational science applications. I think some of the motivating examples in the beginning were much more to the point - remote data and visualization are very clear cases. Further, how does this work compare to that used in Akamai and other commercial settings? As the authors say, this basic issue stretches from commercial to technical computing.

So although this paper is not the "answer", it is a useful, well-written contribution. I recommend publication after some discussion of issues such as those raised above.