WWW: Beyond the Basics

2. Demographics and Demographic Tools

2.2 Collecting Web Demographics

According to the Webster dictionary, "demography" is the "statistical study of human population, especially with reference to size and density, distribution, and vital statistics" (Webster dictionary). For the World Wide Web, demography is the study of the characteristics of the users, their tastes, preferences, and behaviour on the Web.

Why collect demographics? Aren't you curious about who is accessing your web pages? For web service providers and other information content providers, demographics help improve web page content and services. All web service providers would like to know what information users want and how they want it presented and delivered. Knowing the tastes, preferences and demographics of the users who access their pages, providers can offer quality content and present it effectively. For businesses and commercial institutions, demographics help determine effective ways of marketing their products. Most of the information on the Web is currently free, provided to users by users. Because of this, users tend to expect information on the Web to be free. Businesses that wish to sell their information on the Web face the challenge of breaking through this barrier; knowledge of user behaviour and characteristics will help in this. Businesses would also like to know what products sell best, and how to produce effective, attractive advertisements on the Web. Web service providers, too, would like to analyze traffic flow to ensure a good balance between load and service to the users.

Web demographics are essentially unique; people's behaviour in private is usually different from their behaviour in public. The World Wide Web is a world where users may be nameless, faceless, and even voiceless, and where users have a large degree of control over identity information. The flow of electronic data on the Web is inherently difficult to track because it is ephemeral and intangible. Thus, those who collect demographics have to be creative in their ways of capturing that information. The fact that the World Wide Web is evolving fast without central control also makes collecting demographics more difficult. The current and proposed methods of collecting web demographics are presented here.

2.2.1 Current Methods

The current methods of collecting information on the Web are HTTP logging, surveys and guestbooks, user/password authentication, and cookies.

HTTP Logging

The first method of collecting information, HTTP logging, extracts information regarding user/machine requests. "HTTP" (Hypertext Transfer Protocol) is the communication protocol used in the World Wide Web. HTTP logs contain information about document requests, i.e. the HTTP GET, POST, and HEAD messages received, the client and server machines, the date and time of the request, the URL of the request, browser information, and possibly a user ID. (For further information regarding the HTTP protocol, refer to Chapter 16.) User requests on the Web can therefore be recorded to a certain extent. Note that this type of information can be collected by the information provider without the user's knowledge. This method is often employed by network administrators and web service providers to collect information regarding network load (number of bytes requested and transferred), web page requests, and response time, in order to provide better service to the users. Advertisers may use this method to determine the effectiveness of an advertisement.
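As a rough illustration of what can be extracted from such a log, the sketch below parses entries and counts requests per URL and total bytes transferred. It assumes the server writes the NCSA Common Log Format (host, ident, user, timestamp, request line, status, bytes); the text above does not name a log format, and the sample entries, host names and function names are invented for the example.

```python
import re
from collections import Counter

# Assumed layout (Common Log Format):
# host ident authuser [timestamp] "METHOD url protocol" status bytes
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<method>[A-Z]+) (?P<url>\S+)(?: (?P<protocol>[^"]+))?" '
    r'(?P<status>\d{3}) (?P<bytes>\d+|-)'
)

def parse_line(line):
    """Return a dict of fields for one log entry, or None if it does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    record = match.groupdict()
    record["status"] = int(record["status"])
    # A "-" in the bytes field means no body was sent (e.g. a HEAD request).
    record["bytes"] = 0 if record["bytes"] == "-" else int(record["bytes"])
    return record

def summarize(lines):
    """Count requests per URL and add up the bytes transferred."""
    hits = Counter()
    total_bytes = 0
    for line in lines:
        record = parse_line(line)
        if record is None:
            continue  # skip malformed entries
        hits[record["url"]] += 1
        total_bytes += record["bytes"]
    return hits, total_bytes

# Hypothetical log entries, invented for illustration.
SAMPLE_LOG = [
    'client1.example.com - - [26/Oct/1996:13:15:51 -0400] "GET /index.html HTTP/1.0" 200 2048',
    'client2.example.com - alice [26/Oct/1996:13:16:02 -0400] "GET /survey.html HTTP/1.0" 200 512',
    'client1.example.com - - [26/Oct/1996:13:17:10 -0400] "HEAD /index.html HTTP/1.0" 200 -',
]
```

Calling summarize(SAMPLE_LOG) yields per-URL hit counts and a byte total, which is the kind of load and page-request information described above. Note that the user field is filled in only when the server has authenticated the user.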

Logging may occur at four levels: client, server, proxy and network (Abrams and Williams, 96). At the client level, the client machine collects information about the users' web page requests that are sent out to different servers. The possible advantage of client logging is that the actions of each user may be identified. At the server level, WWW requests from clients may be logged. A proxy is a machine, usually on the client network, that filters information flowing in and out of the network. The advantage of proxy logging is that HTTP requests from different machines may be logged at the same time. Network level logging is HTTP logging by an independent machine (neither the client, the server nor the proxy machine). In their paper "Complementing Surveying and Demographics with Automated Network Monitoring", Abrams and Williams propose logging at the network level to collect information on web usage, using tcpdump (Abrams and Williams, 96). The advantages of network logging are that it is secure, since only the owner of the network monitoring device can access the data, and that it does not affect the performance of clients, proxies, servers or networks. One problem that may occur with logging is caching; caching at client, server or network levels may distort information collected from HTTP logs. Caching usually occurs at the client level to reduce network traffic and retrieval time.

Problems that result from caching are:

These problems may be solved by specifying a short time-to-live period for each document retrieved. However, this defeats the purpose of a cache proxy, which is to reduce network traffic and improve user response time. Mirroring also distorts log information collected at the server level.
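The trade-off between caching and accurate logs can be seen in a small simulation. The sketch below is a toy model, not a real proxy: the class names, the time units and the request list are all assumptions made for the example. It counts how many requests reach the origin server's log for different time-to-live values.

```python
class OriginServer:
    """Toy server that logs every request it actually receives."""
    def __init__(self):
        self.log = []

    def get(self, url, client):
        self.log.append((client, url))
        return f"contents of {url}"

class CachingProxy:
    """Toy cache proxy: serves a stored copy until its time-to-live expires."""
    def __init__(self, server, ttl):
        self.server = server
        self.ttl = ttl
        self.cache = {}  # url -> (body, expiry time)

    def get(self, url, client, now):
        entry = self.cache.get(url)
        if entry is not None and now < entry[1]:
            return entry[0]  # cache hit: the server never sees this request
        # Cache miss: the server sees the proxy, not the original client.
        body = self.server.get(url, client="proxy")
        self.cache[url] = (body, now + self.ttl)
        return body

def server_hits(ttl, requests):
    """Number of entries in the server log after replaying the requests."""
    server = OriginServer()
    proxy = CachingProxy(server, ttl)
    for client, url, now in requests:
        proxy.get(url, client, now)
    return len(server.log)

# Three different users fetch the same page at times 0, 10 and 20.
REQUESTS = [
    ("alice", "/index.html", 0),
    ("bob", "/index.html", 10),
    ("carol", "/index.html", 20),
]
```

With a long time-to-live, server_hits(3600, REQUESTS) records a single request for three page views. With a time-to-live of zero, server_hits(0, REQUESTS) records all three, but every request then travels to the server and the cache saves nothing.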

Other limitations of logging are:

These limitations create difficulty in making generalizations of user behaviour. However, there are benefits to logging, such as:

User Surveys

The second form of collecting demographics, complementary to the first, is through user surveys. Personal information such as gender, age, preferences, beliefs, lifestyles, and opinions can be collected through this method. Businesses often hire or sponsor organizations to conduct these surveys. Some organizations also conduct surveys to record the evolution of the Web and to provide this information on-line for the general public. These surveys are conducted on the Internet, through forms on web browsers and e-mail. Surveys are also conducted by phone, postal mail and personal interview. The on-line surveys I have seen are typically extremely long (at least several pages), and require many types of reply formats, including paragraph-long replies! They range from the typical demographic surveys to surveys on net addiction and surveys for sea surfers. However, there are limitations to surveys. Respondents select themselves for the surveys; hence these surveys may not accurately represent the web population as a whole. Furthermore, respondents may provide inaccurate or false information, and survey results may be affected by the type of questions used and by the order and phrasing of the questions. Forms-based surveys limit their respondents to those who use forms-supported browsers. Despite these limitations, surveys today remain the main source of demographic information because of privacy concerns.

Web users value their anonymity highly and are very concerned with having control of demographic information. Surveys provide the simplest solution to user concerns at the moment. They are also a way to gain information regarding web users and their usage patterns that cannot be collected from HTTP logs. Unlike logging, surveys need not be limited to a specific site or network of users. Guestbooks are similar to surveys except that they are placed on a specific set of web pages in which the guestbook author is interested.

Cookies and User/Password Authentication

Netscape introduced a way to collect some limited information, yet protect user anonymity to a certain extent, through Netscape cookies. Now also available in Microsoft Internet Explorer, cookies provide a way for servers to save a limited set of information in the user's files. The information may include the server domain, the unique value of the cookie, and a field for a server-defined code to identify the machine. When the user accesses the web site again, the server gets access to the information it had originally saved. The limitations of cookies are that they are browser-dependent and provide only limited information about the web user.
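The cookie exchange described above can be sketched with Python's standard http.cookies module. This is only an illustration of the mechanism: the cookie name visitor_id, the domain and the helper function names are assumptions, not part of any particular server.

```python
import secrets
from http.cookies import SimpleCookie

COOKIE_NAME = "visitor_id"  # hypothetical cookie name chosen for this example

def make_set_cookie(domain, visitor_id=None):
    """Build the value of a Set-Cookie header carrying a server-defined code."""
    if visitor_id is None:
        visitor_id = secrets.token_hex(8)  # random code, not tied to a real identity
    cookie = SimpleCookie()
    cookie[COOKIE_NAME] = visitor_id
    cookie[COOKIE_NAME]["domain"] = domain  # only this domain gets the cookie back
    cookie[COOKIE_NAME]["path"] = "/"
    return cookie[COOKIE_NAME].OutputString()

def read_visitor_id(cookie_header):
    """Extract the saved code from the Cookie header of a later request."""
    cookie = SimpleCookie()
    cookie.load(cookie_header)
    morsel = cookie.get(COOKIE_NAME)
    return morsel.value if morsel is not None else None
```

On the first visit the server would send the string from make_set_cookie("www.example.com") in a Set-Cookie header. On later visits the browser returns the name and value in a Cookie header, and read_visitor_id recovers the code, so repeat visits from the same browser can be linked without the server learning who the user is.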

Another method of collecting information that is widely used is through user/password authentication. Many users are turned off by this feature because it lacks security: any other user with knowledge of a user's ID and password may pretend to be that user.

Demographics from Collaborative Filtering

An interesting source of demographic data is collaborative filtering environments on the Web. An example of such a site is Firefly. This site lets users view sites that are recommended by other members, and recommend different sites to each other. Members can also provide information about themselves through web pages. Although this service is free, users provide some general information to the site and access the site through alias/password authentication. Such collaborative systems are a gold mine of information: not only do they provide user profiles, they also show users' interests, tastes and preferences in web sites.

2.2.2 Proposed Methods

A few ideas have been proposed as alternatives or complements to user surveys.

A few suggestions have been made by the World Wide Web Consortium (W3C) for gathering consumer demographics (World Wide Web Consortium). One proposal is that the user agent, i.e. a client, sends a randomly generated sessionID at the beginning of each HTTP session. That sessionID is then monotonically increased at each request in that session. This allows the server to track the information that is being requested by the client. Another suggestion is a business card record, where an HTML user agent maintains a user profile on behalf of the user. The profile includes the user's full name, email address, home URL, affiliation, postal address, and business phone number. When a form is processed, the initial values of the matching form fields are filled in from the user profile. Security is an important issue here, so users must have control over access to their own profile.
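The sessionID proposal can be sketched as follows. The text does not specify the size of the ID, the size of the increase, or how the server links requests together, so this example assumes a 64-bit starting value, an increase of one per request, and a server that attaches a request to the session whose next expected ID it carries. All class and field names are hypothetical.

```python
import secrets

class UserAgent:
    """Client side: random starting sessionID, increased by one per request."""
    def __init__(self, start=None):
        # A fixed start may be passed for reproducible demonstrations.
        self.session_id = secrets.randbits(64) if start is None else start

    def request(self, url):
        message = {"session_id": self.session_id, "url": url}
        self.session_id += 1  # assumed step size of one
        return message

class SessionTracker:
    """Server side: chain requests whose IDs follow one another."""
    def __init__(self):
        self.next_expected = {}  # next expected ID -> list of URLs so far
        self.sessions = []       # one URL list per observed session

    def record(self, message):
        sid = message["session_id"]
        path = self.next_expected.pop(sid, None)
        if path is None:
            # Unknown ID: treat it as the start of a new session.
            path = []
            self.sessions.append(path)
        path.append(message["url"])
        self.next_expected[sid + 1] = path
```

If two user agents interleave their requests, the tracker still reconstructs one path per session, because each request carries the ID that follows the previous request of the same session. The server learns the order of pages visited in a session, but the random starting value tells it nothing about who the user is.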

Another suggested tool is anonymous authentication. This works in a similar manner to the Netscape cookie mentioned earlier, without being browser-dependent. A 128-bit random number is chosen for the user for each site server, and saved in the user's own file. This unique number is sent when the user agent requests information from that server.
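A minimal sketch of the per-site number, assuming the user's file is a small JSON file mapping site names to hexadecimal tokens. The file layout and the function name are assumptions made for the example; the proposal itself only states that a 128-bit random number is kept per site.

```python
import json
import os
import secrets

def token_for_site(site, store_path):
    """Return the user's 128-bit token for a site, creating it on first use."""
    tokens = {}
    if os.path.exists(store_path):
        with open(store_path) as f:
            tokens = json.load(f)
    if site not in tokens:
        # 16 random bytes = 128 bits, written as 32 hexadecimal digits.
        tokens[site] = secrets.token_hex(16)
        with open(store_path, "w") as f:
            json.dump(tokens, f)
    return tokens[site]
```

Because a different number is generated for each site, a server can recognize a returning user, but two servers cannot compare their numbers to discover that they are dealing with the same person.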

Companies are employing other methods to collect demographic information. Companies like I/PRO, NetCount and WebTrac market tools to monitor and analyze Web sites. Some of the problems currently faced by all these demographic tools are firewalls, cache proxies, and site mirroring. However, these tools provide a useful means of understanding consumer online behaviour.


Copyright © 1996 Mei See Yeoh, All Rights Reserved

Mei See Yeoh <myeoh@vt.edu>
Last modified: Sat Oct 26 13:15:51 1996