WWW: Beyond the Basics

25. Methods for Web Bandwidth and Response Time Improvement

25.2. Document caching strategies

The general principle behind document caching is moving the document closer to the end user. This is achieved by transparently storing the document on servers closer to the user. These servers are typically called proxy cache servers. Document caching can also be viewed as a main memory cache scheme. The concept behind main memory caching is to move data from a slower memory to a faster memory.

25.2.1 Caching schemes/hierarchy

Data caching is a common method of improving performance in computer systems. CPU caches in static memory improve performance by avoiding the need to retrieve data from slower dynamic memory. Main-memory disk caches perform the same function for the slower disk drive. The general goal is to move data closer to where it is used. Unsurprisingly, Web document caching is an active research area with several projects underway or completed.

The user access model is an important consideration when designing caching schemes. The model attempts to characterize how the user will access Web documents. By knowing the user access pattern, better predictions can be made as to which documents will be requested. For a particular model, cache performance may be excellent. But if the model does not accurately follow the user's actual access pattern, performance can degrade substantially. User models from hypermedia research are as follows (Catledge, 1994; Valdez, 1988).

  1. Directed search browsing. The directed search model implies that users spend time searching for information within a specific locality of the Web. The user is deliberately seeking specific information.
  2. General-purpose browsing. Users behaving according to the general-purpose model spend some time within one Web locality and then move to another, repeating this throughout the session. The user has some unformulated idea of what they want to find and is looking around for it.
  3. Random browsing. The random model is the "window shopping" model. The user has no real objective in mind and is briefly looking at many documents.

The user model used by a specific scheme to increase performance must be suitable to the task. Designing for worst-case user behaviors may create a sub-optimal design for 99 percent of user access patterns. For example, assume that a server is expecting an average of 100 connections per minute. One hundred processes are started to handle the expected connections. However, if only a dozen requests are made, then substantial server resources are wasted. This implies that how users are characterized, and the performance criteria that result from that characterization, are issues that need to be considered.

Other ways of providing the basis for caching include heuristics similar to the above, statistical methods, and graph theory. Statistical methods generally use user access information as a probability estimate for future accesses. Concepts from graph theory can be used to convert the hypertext space into graphs that can be used to determine what documents to cache. A hypertext system typically can be mapped into a hierarchical graph (Botafogo, 1992). Some researchers combine the various methods. A detailed discussion can be found in Lee (1996).
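As a rough sketch of the graph-theoretic approach, assume the hypertext system has already been mapped into a link graph; a simple heuristic can then rank documents by how many other documents link to them. The function name, the example graph, and the in-degree ranking rule below are illustrative assumptions, not the specific algorithms of Botafogo (1992) or Lee (1996).

```python
# Sketch: rank documents in a hypertext link graph as caching candidates.
# Documents linked to from many places are assumed (for illustration) to
# be good candidates to keep in the cache.

def rank_by_indegree(links):
    """links maps each document to the list of documents it links to.
    Return all documents sorted by in-degree, highest first."""
    indegree = {}
    for src, targets in links.items():
        indegree.setdefault(src, 0)       # ensure every document appears
        for t in targets:
            indegree[t] = indegree.get(t, 0) + 1
    return sorted(indegree, key=lambda d: indegree[d], reverse=True)

links = {
    "index.html": ["a.html", "b.html"],
    "a.html": ["b.html"],
    "b.html": ["index.html"],
}
print(rank_by_indegree(links))  # b.html has the highest in-degree
```

A real system would weight links by observed traversal frequency rather than treating every link equally, as discussed under pre-fetching below.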

The general caching hierarchy is as follows.

  1. User caches, which are also called browser or client caches,
  2. Proxy caches, which are also called cache or proxy cache servers,
  3. Server caches, which are also called site caches.

Each of these caching systems is discussed below. All caching systems can either fetch the document before a request is made (pre-fetch) or fetch the document after a request has been made (post-fetch). Post-fetch systems are distinguished by the fact that they already have, in cache, the document being requested. Pre-fetch systems attempt to request a document to cache before any user requests it.
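The post-fetch case can be sketched as a simple demand-driven cache. The `fetch_origin` function below is a stand-in for an HTTP retrieval and, like the other names here, is an assumption of this example rather than any particular implementation.

```python
# Sketch of a post-fetch (demand-driven) cache: a document is stored
# only after some user has requested it; later requests for the same
# document are served from the cache.

def make_cache(fetch_origin):
    store = {}
    def get(url):
        if url in store:          # post-fetch: cached by a prior request
            return store[url], True
        doc = fetch_origin(url)   # miss: go to the origin server
        store[url] = doc
        return doc, False
    return get

get = make_cache(lambda url: "<html>%s</html>" % url)
doc, hit = get("/index.html")   # first request: miss
doc, hit = get("/index.html")   # second request: hit
```

A pre-fetch system differs only in when `fetch_origin` is invoked: it would call it speculatively, before any user request arrives.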

25.2.2 Metrics

Several metrics are commonly used when evaluating caching systems; the most widely reported is the hit rate, the fraction of requests that can be served directly from the cache.

Individual caching algorithms may have various parameters (such as cache size and how long to cache a document) that can significantly affect the performance of the cache.
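Two standard metrics can be computed from a request log: the hit rate (fraction of requests served from the cache) and the byte hit rate (fraction of bytes served from the cache), which matters when documents vary greatly in size. The log format below is an illustrative assumption.

```python
# Sketch: computing hit rate and byte hit rate from a request log.
# Each log entry is (was_hit, size_in_bytes); the format is invented
# for this example.

def hit_rates(log):
    hits = sum(1 for h, _ in log if h)
    hit_bytes = sum(s for h, s in log if h)
    total_bytes = sum(s for _, s in log)
    return hits / len(log), hit_bytes / total_bytes

rate, byte_rate = hit_rates([(True, 1000), (False, 4000), (True, 1000)])
# two of three requests hit, but only 2000 of 6000 bytes came from cache
```

The example shows why the two metrics can diverge: small documents may hit often while a few large misses dominate the bandwidth.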

25.2.3 User caches

User caches can be described as caches on the local system that a particular user works on. Almost every browser implements some form of user caching. User caches vary from storing every document retrieved for some specified criteria (typically time based) to just storing the documents retrieved in the current session.
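A time-based user cache of the kind described above can be sketched as follows. The one-hour lifetime is an arbitrary assumption, and `fetch` stands in for a network retrieval.

```python
# Sketch of a time-based user (browser) cache: documents are kept for a
# fixed number of seconds and refetched once they are older than that.
import time

class UserCache:
    def __init__(self, fetch, max_age=3600):
        self.fetch, self.max_age = fetch, max_age
        self.store = {}            # url -> (document, time_stored)

    def get(self, url, now=None):
        now = time.time() if now is None else now
        entry = self.store.get(url)
        if entry and now - entry[1] < self.max_age:
            return entry[0]        # fresh copy in the local cache
        doc = self.fetch(url)      # stale or absent: retrieve again
        self.store[url] = (doc, now)
        return doc
```

A session-only cache, the other variant mentioned above, is the degenerate case in which the store is simply discarded when the browser exits.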

Luotonen (1994) recommends against using user caches primarily because there tends to be a significant amount of duplicated material in each individual user cache. He also notes that cache servers may be able to obtain better performance than a user cache.

25.2.4 Cache servers

Several experiments have been conducted using proxy caching servers (Abrams, 1995; Glassman, 1994; Luotonen, 1994; Pitkow and Recker, 1994). These are generally based on the concept that a browser obtains documents through a proxy server rather than retrieving them directly. If the proxy server does not have a document in its cache, it can request it from the origin server. Thus, if a group of users with the same interests accesses the same proxy cache, they will most likely access a cached document.

Multiple proxy caches may exist in the access route. The recommended organization of these caches is a hierarchical system, as shown in Figure 4. For example, cslab.vt.edu would have a cache server. If the document was not in that cache server, cslab.vt.edu would attempt to retrieve it from a cache server at the cs.vt.edu level. If the document was again not found, cs.vt.edu would attempt to retrieve it from a cache server at the vt.edu level. And if the vt.edu cache server could not find it, it would retrieve it from the origin server.

[IMAGE]
Figure 4. A Cache Hierarchy
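The lookup sequence just described can be sketched as a chain of cache levels, each consulting its parent on a miss and caching the answer on the way back down. The class and the `fetch_origin` stand-in are illustrative assumptions, using the host names from the example above.

```python
# Sketch of a hierarchical proxy cache lookup: each level checks its own
# cache and, on a miss, asks its parent; the topmost level contacts the
# origin server.

class CacheLevel:
    def __init__(self, name, parent=None, fetch_origin=None):
        self.name, self.parent, self.fetch_origin = name, parent, fetch_origin
        self.store = {}

    def get(self, url):
        if url in self.store:
            return self.store[url]
        if self.parent is not None:
            doc = self.parent.get(url)     # e.g. cslab.vt.edu asks cs.vt.edu
        else:
            doc = self.fetch_origin(url)   # top of hierarchy: origin server
        self.store[url] = doc              # cache at this level on the way down
        return doc

vt = CacheLevel("vt.edu", fetch_origin=lambda u: "doc:" + u)
cs = CacheLevel("cs.vt.edu", parent=vt)
cslab = CacheLevel("cslab.vt.edu", parent=cs)
```

A side effect worth noting: after one miss at the bottom level, every level of the hierarchy holds a copy, so later requests from sibling subnets can be served without reaching the origin server.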

Most organizations are hierarchically organized, so this solution closely matches how people work. However, the proxy caches may be arranged in any manner desired. Typical commercial sites have a security mechanism known as a firewall, and HTTP requests from machines inside the firewall are sent through a proxy located at that firewall.

Difficulty in maintaining consistent copies is one of the obstacles in developing caching systems. HTTP 1.1 provides methods (expiration models and cache validation) to alleviate this problem. Another problem in using caches is that many browsers do not allow automatic alternative routes to be taken if a cache server fails. This can cause unnecessary service outages until the proxy is fixed or the browser is manually reconfigured. Note that browsers may allow domain-based proxy access, which is useful since obtaining a local-subnet document through a cache typically takes longer than accessing the origin server directly.
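The two HTTP 1.1 consistency mechanisms mentioned above can be sketched as follows: an expiration check against the document's Expires time, and validation with an If-Modified-Since request, to which the server answers 304 Not Modified if the cached copy is still good. The `server` function and entry layout here are simplified stand-ins, not a real HTTP implementation.

```python
# Sketch of HTTP 1.1 cache consistency: expiration check plus
# If-Modified-Since validation.

def is_fresh(entry, now):
    """Expiration model: the copy may be used without contacting the server."""
    return now < entry["expires"]

def revalidate(entry, server, now):
    """Cache validation: ask the server whether the copy has changed."""
    status, body = server(if_modified_since=entry["stored"])
    if status == 304:              # Not Modified: keep the cached body
        return entry["body"]
    entry["body"], entry["stored"] = body, now
    return body
```

The expiration check costs nothing on the network; validation costs one small round trip but guarantees the copy is current, which is why the two are used together.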

Caching research has indicated a wide variety of hit rates, typically within the range of 30-50% (Glassman, 1994). Due to the differing implementation methods and study limitations, it is hard to draw general conclusions. Many of the limitations come from the fact that the composition of Web documents and user populations vary significantly.

25.2.5 Server caching

Server caching is best described as (origin) server site caching. A large site with a large number of users, such as NCSA, requires multiple servers to properly serve all users. The NCSA design will be discussed in detail; however, other sites may vary.

The NCSA (Kwan, 1995) system has several Andrew file servers on a high-speed fiber network (FDDI). These file servers are not accessible by the outside world, but a set of intermediate origin servers is available and accessible. The intermediate servers, each of which has a unique IP address and host name, are aliased to a generic host name. The system is set up so that, over a period of time, the generic host name points to a different intermediate server. This round-robin DNS allows load balancing across the intermediate servers. Servers may still become overloaded, in which case they may forward requests to other servers.

Since the intermediate servers do not have the actual documents, they are typically set up to cache documents to improve performance.
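The round-robin DNS rotation can be sketched as follows; the intermediate server names are invented for illustration (they are not NCSA's actual host names), and `itertools.cycle` stands in for the name server rotating its answer over time.

```python
# Sketch of round-robin DNS: successive lookups of the generic host name
# resolve to different intermediate servers, spreading the load.
import itertools

servers = ["www1.example.edu", "www2.example.edu", "www3.example.edu"]
resolver = itertools.cycle(servers)

def resolve(_generic_name):
    return next(resolver)   # each lookup yields the next server in turn
```

Note that this balances only the number of connections sent to each server, not the actual work each connection causes, which is why overloaded servers may still need to forward requests.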

25.2.6 Pre-fetch caching

Lee (1996) and Padmanabhan and Mogul (1996) explore the issue of document pre-fetching. Document pre-fetching consists of determining which documents will be accessed in the future and pre-loading them into the cache. This cache can be at any level in the cache hierarchy. Lee's research shows that a typical user appears to spend about one minute reading each document. This time, in which no network activity generally occurs, can be used to pre-load documents. There are several methods of performing document pre-fetching, and they generally fall into those based on Web graphs and those based on statistical methods.

Web-graph-based pre-fetching (graphs were briefly discussed in section 25.2.1) uses the graphical nature of hypertext links to determine the possible paths through a hypertext system. These links can be weighted on a statistical or heuristic scale. The links can also be non-weighted, which is effectively a unity weight on all links. The pre-fetching is done based on the weights of the links: the higher the weight (or access probability), the more likely the user is to read that document. For example, pre-loading the links from the current document is a form of Web-graph-based pre-fetching: the linked documents are pre-loaded so that the user response time is reduced when the user decides to view one of them.
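Selecting which links to pre-load during the user's reading time can be sketched as follows. The link weights, the limit of two documents, and the function name are illustrative assumptions.

```python
# Sketch of weighted Web-graph pre-fetching: the outgoing links of the
# current document carry weights (statistical or heuristic), and only
# the highest-weighted targets are pre-loaded.

def prefetch_targets(out_links, k=2):
    """out_links maps target URL -> weight; return the k most likely."""
    return sorted(out_links, key=lambda u: out_links[u], reverse=True)[:k]

links = {"a.html": 0.6, "b.html": 0.3, "c.html": 0.1}
print(prefetch_targets(links))  # the two highest-weighted links
```

Setting every weight equal reduces this to the non-weighted (unity-weight) case mentioned above, in which the choice of what to pre-load is arbitrary.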

The other method of pre-fetching is based solely on statistical methods (the graph-based method may also use statistics). Cache and origin servers are typically configured to record access statistics. The statistics gathered on a cache server generally reflect the interests of a pool of users at a site, while the statistics gathered on an origin server tend to reflect the interests of users on a more global scale. The performance of pre-fetching systems depends on the accuracy of the statistical model and the real-life access patterns.
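Turning such access records into probability estimates can be sketched as follows. The log format, the relative-frequency estimate, and the cutoff are illustrative assumptions, not a specific published algorithm.

```python
# Sketch: estimating access probabilities from a server access log and
# selecting documents worth pre-fetching.

def access_probabilities(log):
    """log is a list of requested URLs; return url -> relative frequency."""
    counts = {}
    for url in log:
        counts[url] = counts.get(url, 0) + 1
    total = len(log)
    return {url: n / total for url, n in counts.items()}

def worth_prefetching(probs, cutoff=0.2):
    """Pre-fetch only documents whose estimated probability meets the cutoff."""
    return [u for u, p in probs.items() if p >= cutoff]
```

The cutoff embodies the bandwidth trade-off discussed below: lowering it pre-loads more documents, raising the chance of a useful pre-fetch but also the volume of documents fetched in vain.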

Another form of statistical pre-fetching uses heuristics to categorize documents. For example, the root document of any Web server is generally an index of some type, and one would assume that indices have a relatively high access probability. Instead of using server access statistics, the statistics are generated from the assumed document type. Problems with this method include significant administrative overhead and inaccuracies in the categorization method.
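The heuristic approach can be sketched as a lookup from assumed document type to assumed access probability. The categories, the URL rules, and the probability values below are invented for illustration; a real deployment would have to maintain them by hand, which is the administrative overhead noted above.

```python
# Sketch of heuristic categorization: each document gets an assumed
# access probability from its (guessed) type rather than from measured
# statistics.

ASSUMED_PROBABILITY = {"index": 0.8, "article": 0.3, "image": 0.1}

def categorize(url):
    if url.endswith("/") or url.endswith("index.html"):
        return "index"             # root/index documents: assumed popular
    if url.endswith((".gif", ".jpg")):
        return "image"
    return "article"

def assumed_probability(url):
    return ASSUMED_PROBABILITY[categorize(url)]
```

Miscategorization is the obvious failure mode: a rarely read page named index.html would be pre-fetched aggressively for no benefit.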

The downside to pre-fetching is that additional bandwidth is consumed by pre-loading documents that are never viewed. Thus, much research activity goes into developing better pre-fetch algorithms. Another problem with pre-fetching is how to get the pre-fetch information from the server to the client (or another cache server). This can be dealt with by adding HTTP or HTML extensions to send the information to the client.


Copyright © 1996 David C. Lee, All Rights Reserved

David C. Lee <dlee@vt.edu>
Last modified: Mon Nov 18 15:22:36 EST 1996