The general principle behind document caching is moving the document
closer to the end user. This is achieved by transparently storing
the document on servers closer to the user. These servers are
typically called proxy cache servers. Document caching can also
be viewed as a main memory cache scheme. The concept behind main
memory caching is to move data from a slower memory to a faster
memory.
25.2.1 Caching schemes/hierarchy
Data caching is a common method to improve performance in computer
systems. CPU caches in static memory improve performance by avoiding
the need to retrieve data from slower dynamic memory. Memory
disk caches perform the same function for the slower disk drive.
The general goal is to move data closer to where it is used.
Unsurprisingly, Web document caching is an active research area
with several projects underway or completed.
The user access model is an important consideration when designing
caching schemes. The model attempts to characterize how the user will
access Web documents. By knowing the user access pattern, better caching
predictions
can be made as to what documents will be requested.
For a particular model, cache performance may be excellent;
but if the model does not accurately reflect the user's actual
access pattern, then performance can degrade substantially. User
models for hypermedia research are discussed in Catledge (1994)
and Valdez (1988).
The user model used by a specific scheme to increase performance
must be suitable to the task. Designing for worst-case user
behaviors may create a suboptimal design for 99 percent
of user access patterns. For example, assume that a server is
expecting an average of 100 connections per minute. One hundred
processes are started to handle the expected connections. However,
if only a dozen requests are made, then substantial server resources
are wasted. This implies that how users are characterized and
the performance criteria that result from the characterization
are issues that need to be considered.
Other ways of providing the basis for caching include heuristics
similar to the above, statistical methods, and graph theory.
Statistical methods generally use user access information as a
probability estimate for future accesses. Concepts from graph
theory can be used to convert the hypertext space into graphs
that can be used to determine what documents to cache. A hypertext
system typically can be mapped into a hierarchical graph (Botafogo, 1992). Some researchers combine the various methods. A detailed
discussion can be found in Lee (1996).
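As a rough illustration of the graph-theoretic approach, the sketch below (all document names invented) maps a hypertext space into a directed graph and finds the documents reachable from a root, which a graph-based scheme might treat as caching candidates:

```python
from collections import defaultdict

def build_link_graph(pages):
    """Map each document to the documents it links to (a directed graph)."""
    graph = defaultdict(list)
    for doc, links in pages.items():
        graph[doc].extend(links)
    return dict(graph)

def reachable(graph, root):
    """Documents reachable from `root` by following links."""
    seen, stack = set(), [root]
    while stack:
        doc = stack.pop()
        if doc not in seen:
            seen.add(doc)
            stack.extend(graph.get(doc, []))
    return seen

site = {
    "index.html": ["about.html", "papers.html"],
    "papers.html": ["caching.html"],
}
graph = build_link_graph(site)
docs = reachable(graph, "index.html")
```

A real scheme would weight the edges (for example by link position or past accesses) rather than treat all reachable documents equally.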
The general caching hierarchy runs from the user's local cache,
through one or more proxy cache servers, to caching at the origin
server itself.
Each of the caching systems above will be discussed below. All
caching systems can either fetch a document before a request
is made (pre-fetch) or fetch the document only after a request
has been made (post-fetch). Post-fetch systems cache a document
once it has been requested, so that later requests can be served
from the cache. Pre-fetch systems attempt to cache a document
before any user requests it.
25.2.2 Metrics
Several metrics are commonly used when evaluating caching systems,
such as the cache hit rate (the fraction of requests served from
the cache).
Individual caching algorithms may have various parameters (such
as cache size and how long to cache a document) that can significantly
affect the performance of the cache.
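A minimal sketch of two such metrics, the hit rate and its byte-weighted variant (function names and figures are illustrative, not from the text):

```python
def hit_rate(hits, requests):
    """Fraction of requests served from the cache."""
    return hits / requests if requests else 0.0

def byte_hit_rate(cached_bytes, total_bytes):
    """Fraction of bytes served from the cache (rewards caching large documents)."""
    return cached_bytes / total_bytes if total_bytes else 0.0

r = hit_rate(40, 100)                      # 40 of 100 requests hit the cache
b = byte_hit_rate(3_000_000, 10_000_000)   # 3 MB of 10 MB served from cache
```

The two can diverge: a cache holding many small documents may show a high hit rate but a low byte hit rate.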
25.2.3 User caches
User caches can be described
as caches on the local system that a particular user works on.
Almost every browser implements some form of user caching. User
caches vary from storing every document retrieved for some specified
criteria (typically time based) to just storing the documents
retrieved in the current session.
Luotonen (1994) recommends against using user caches primarily
because there tends to be a significant amount of duplicated material
in each individual user cache. He also notes that cache servers
may be able to obtain better performance than a user cache.
25.2.4 Cache servers
Several experiments have been conducted using proxy caching servers
(Abrams, 1995; Glassman, 1994; Luotonen, 1994; Pitkow and Recker,
1994). These are generally based on the concept that a browser
obtains documents through a proxy server, as opposed to retrieving
them directly. If the proxy server does not have a document in its
cache, it can request it from the origin server. Thus, if a group
of users with the same interests accesses the same proxy cache,
they will most likely access a cached document.
Multiple proxy caches may exist in the access route. The recommended
organization of these caches is to use them in a hierarchical
system, as shown in Figure 4. For example, cslab.vt.edu
would have a cache server. If the document was not in the cache
server, cslab.vt.edu
would attempt to retrieve it from a cache server at the cs.vt.edu
level. If the document was not found again, cs.vt.edu
would attempt to retrieve it from a cache server at the vt.edu
level. And if the vt.edu
cache server could not find it, it would retrieve it from the
origin server.
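The hierarchical lookup just described can be sketched as follows; each level checks its own cache, then defers to its parent, and the top level falls back to the origin server (the classes and the dictionary standing in for the origin server are illustrative):

```python
class CacheServer:
    """One level of a hierarchical proxy cache."""
    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.cache = {}

    def fetch(self, url, origin):
        if url in self.cache:
            return self.cache[url]          # hit at this level
        if self.parent is not None:
            doc = self.parent.fetch(url, origin)
        else:
            doc = origin[url]               # top level: ask the origin server
        self.cache[url] = doc               # cache on the way back down
        return doc

vt = CacheServer("vt.edu")
cs = CacheServer("cs.vt.edu", parent=vt)
cslab = CacheServer("cslab.vt.edu", parent=cs)
origin = {"http://example.com/doc.html": "<html>...</html>"}
doc = cslab.fetch("http://example.com/doc.html", origin)
```

Note that after one miss the document is cached at every level, so a later request from any machine under cs.vt.edu is served without reaching the origin server.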
Most organizations are hierarchically organized, so this solution
closely matches how people work. However, the proxy caches
may be arranged in any manner desired. Typical commercial sites
have a security mechanism known as a firewall and HTTP requests
from machines inside the firewall are sent through a proxy located
at that firewall.
Difficulty in maintaining consistent copies is one of the obstacles
in developing caching systems. HTTP 1.1 provides methods (expiration
models and cache validation) to alleviate this problem. Another
problem in using caches is that many browsers do not allow automatic
alternative routes to be taken if a cache server fails. This
can cause unnecessary service outages until the proxy is fixed
or the browser is manually reconfigured. Note that browsers
may allow proxy access to be configured on a per-domain basis.
This is useful since obtaining a document on the local subnet
through a cache typically takes longer than directly accessing
the origin server.
Caching research has indicated a wide variety of hit rates, typically
within the range of 30-50% (Glassman, 1994). Due to the differing
implementation methods and study limitations, it is hard to draw
general conclusions. Many of the limitations come from the fact
that the composition of Web documents and user populations vary
significantly.
25.2.5 Server caching
Server caching is best described as (origin) server site caching.
A large site with a large number of users, such as NCSA, requires
multiple servers to properly serve all users. The NCSA design
will be discussed in detail; however, other sites may vary.
The NCSA (Kwan, 1995) system has several Andrew file servers on
a high speed fiber network (FDDI). These file servers are not
accessible by the outside world, but a set of intermediate origin
servers are available and accessible. The intermediate servers,
each of which has a unique IP address and host name, are
assigned an alias to a generic host name. The system is set up
so that, over a period of time, the generic host name points to
a different intermediate server. This round-robin DNS allows
load balancing on the intermediate servers to occur. Servers
may still become overloaded, in which case they may forward
requests to other servers.
Since the intermediate servers do not have the actual documents,
they are typically set up to cache documents to improve performance.
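The round-robin rotation can be sketched as below; successive resolutions of the generic host name return a different intermediate server each time (all host names here are invented, not NCSA's actual ones):

```python
from itertools import cycle

class RoundRobinDNS:
    """Toy round-robin DNS: lookups of the generic name rotate
    through the intermediate servers; other names resolve to themselves."""
    def __init__(self, generic_name, servers):
        self.generic_name = generic_name
        self._rotation = cycle(servers)

    def resolve(self, name):
        if name == self.generic_name:
            return next(self._rotation)      # a different server each time
        return name

dns = RoundRobinDNS("www.example.edu",
                    ["srv1.example.edu", "srv2.example.edu"])
first = dns.resolve("www.example.edu")
second = dns.resolve("www.example.edu")
```

Real round-robin DNS rotates the order of address records in each response rather than returning a single address, but the load-spreading effect is the same.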
25.2.6 Pre-fetch caching
Lee (1996) and Padmanabhan and Mogul (1996) explore the issue
of document prefetching. Document prefetching consists
of determining which documents will be accessed in the future
and preloading them into the cache. This cache can be at
any level in the cache hierarchy. Lee's research shows that a
typical user appears to spend about one minute reading each document.
This time, in which no network activity generally occurs, can
be used to preload documents. There are several methods
to perform document prefetching; they generally fall into
those that are based on Web graphs and those that are based on
statistical methods.
Prefetching based on Web graphs (briefly discussed in section
25.2.1) uses the graphical nature of hypertext links to determine
the possible paths through a hypertext system. These links can be
weighted on a statistical or heuristic scale. The links can
be unweighted as well, which is effectively a unity weight
on all links. The prefetching is done based on the weights
of the links: the higher the weight (or access probability),
the more likely the user is to read that document. For example,
preloading the links from the current document is a simple
form of Web graph-based prefetching: the links within the
current document are preloaded so that the user's response
time is reduced when the user decides to view a document.
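Weighted-link prefetching of this kind reduces to ranking the outgoing links by weight and preloading the top few. A sketch, with invented document names and weights standing in for estimated access probabilities:

```python
def choose_prefetch(weighted_links, limit=2):
    """Return up to `limit` linked documents, highest weight first."""
    ranked = sorted(weighted_links.items(), key=lambda kv: kv[1], reverse=True)
    return [url for url, _ in ranked[:limit]]

# Weights stand in for the estimated access probability of each link;
# an unweighted scheme would give every link the same weight.
links = {"a.html": 0.1, "b.html": 0.7, "c.html": 0.2}
to_prefetch = choose_prefetch(links)
```

The `limit` parameter (an assumption here, not from the text) caps how much bandwidth is spent on documents the user may never view.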
The other method of prefetching is based solely on statistical
methods (the graph-based method may also use statistical
methods). Cache and origin servers are typically configured to
record access statistics. The statistics gathered on a cache
server generally reflect the interests of a pool of users at a
site. The statistics gathered on an origin server tend to reflect
the interests of users on a more global scale. The performance
of prefetching-based systems depends on the accuracy of the
statistical model and on real-life access patterns.
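The core of the statistical approach is to turn recorded access counts into probability estimates for future requests, as in this sketch (the log entries are invented):

```python
from collections import Counter

def access_probabilities(access_log):
    """Estimate each document's access probability from past requests."""
    counts = Counter(access_log)
    total = sum(counts.values())
    return {url: n / total for url, n in counts.items()}

log = ["index.html", "index.html", "papers.html", "index.html"]
probs = access_probabilities(log)
best = max(probs, key=probs.get)     # the document to prefetch first
```

On a cache server the log covers one site's pool of users; on an origin server it covers a global population, so the resulting estimates differ accordingly.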
Another form of statistical prefetching is to use heuristics
to categorize documents. For example, the root document
of any Web server is generally an index of some type. One would
assume that indices have a high access probability associated with
them. Instead of using server access statistics, the statistics
are generated from the assumed document type. Problems with this
method include significant administrative overhead and inaccuracies
with the categorization method.
The downside to prefetching is that additional bandwidth is
consumed by the act of preloading unviewed documents; thus, an
active research goal is to develop better algorithms for prefetch
caching. Another problem with prefetching is how to get prefetch
information from the server to the client (or to another cache
server). This can be dealt with by adding HTTP or HTML extensions
to send the information to the client.
Copyright © 1996 David C. Lee, All Rights Reserved
David C. Lee
<dlee@vt.edu>
Last modified: Mon Nov 18 15:22:36 EST 1996