2. Demographics and Demographic Tools
2.2 Collecting Web Demographics
According to the
Webster dictionary,
"demography" is the "statistical
study of human population, especially with
reference to size and density, distribution, and
vital statistics"
(Webster dictionary).
For the World Wide Web,
demography is the study of the characteristics
of the users, their tastes, preferences, and behaviour
on the Web.
Why collect demographics?
Aren't you curious about who is accessing your
web pages? For web service providers and
other information content providers, demographics
help improve web page content and services.
All web service providers would like to know what
information users would like to have and how they
want this information presented and delivered.
Knowing the tastes, preferences and demographics
of the users who access their web pages,
providers can offer quality content and an effective
presentation of the information.
For businesses and commercial institutions, demographics
help determine effective ways of marketing
their products. Most of the information on the
Web is currently free, provided to users by users.
Because of this, users tend to expect
information on the Web to be free. Businesses
that wish to sell their information on
the Web face the challenge of breaking through this barrier;
knowledge of user behaviour and characteristics
will help in this.
Businesses would like to know what products sell
best, and how to produce effective, attractive
advertisements on the Web. Web service providers
too would like to analyze the traffic flow to ensure
a good balance between load and service to the
users.
Web demographics are essentially unique;
people's behaviour in private is usually different from
public behaviour. The World Wide Web is a world
where users may be nameless, faceless, and even
voiceless, and where they have a great deal of control over
their identity information.
The flow of electronic data in the Web is
inherently difficult to track because
it is ephemeral and intangible.
Thus, those who collect
demographics have to be creative in their ways of
capturing that information. The fact that
the World Wide Web is evolving fast without
central control also makes collecting
demographics more difficult.
The current and proposed methods of collecting
web demographics are presented here.
2.2.1 Current Methods
The current methods of collecting information on
the Web are through HTTP logging, surveys and guestbooks,
user/password authentication and cookies.
HTTP Logging
The first method of collecting
information, HTTP logging, extracts
information regarding user/machine requests.
"HTTP" (Hypertext Transfer Protocol)
is the communication protocol used in the
World Wide Web.
HTTP logs contain information about document requests,
i.e. the HTTP
GET, POST, and HEAD messages received, the client
and server machine,
the date and time of the request, the URL of the request,
browser information, and possibly user ID.
(For further information regarding the
HTTP protocol, refer to Chapter 16.)
Therefore, user requests on the Web can be
recorded to a certain extent. Note that
this type of information can be collected by
the information provider without the user's
knowledge.
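As an illustration of the fields listed above, here is a minimal sketch
that parses one server log entry. It assumes the log is written in the
NCSA Common Log Format, which this chapter does not specify; the sample
line, host name and URL are invented for the example.

```python
import re
from datetime import datetime

# Assumption: one entry per line in NCSA Common Log Format:
#   host ident user [time] "METHOD url protocol" status bytes
CLF_PATTERN = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<method>[A-Z]+) (?P<url>\S+)(?: (?P<protocol>[^"]+))?" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
)

def parse_log_line(line):
    """Return the fields of one log entry as a dict, or None if it does not match."""
    match = CLF_PATTERN.match(line)
    if match is None:
        return None
    entry = match.groupdict()
    entry["time"] = datetime.strptime(entry["time"], "%d/%b/%Y:%H:%M:%S %z")
    entry["status"] = int(entry["status"])
    # A "-" in the size field means no bytes were reported.
    entry["size"] = 0 if entry["size"] == "-" else int(entry["size"])
    return entry

# Invented sample entry; "-" in the user field means no user ID was supplied.
sample = ('client.example.edu - - [26/Oct/1996:13:15:51 -0400] '
          '"GET /book/index.html HTTP/1.0" 200 2326')
entry = parse_log_line(sample)
print(entry["host"], entry["method"], entry["url"], entry["size"])
```

Note that the user field is usually empty, which is why such logs tend
to identify machines rather than people.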
This method is often employed by network administrators
and web service providers to collect information
regarding
network load (number of bytes requested and
transferred), web page requests, and response time
to provide better service to the users.
Advertisers may use these methods to determine
the effectiveness of an advertisement.
Logging may occur at four levels:
client, server, proxy and network level
(Abrams and Williams,
96).
At the client level, the client machine collects
users' web page request information that is sent
out to different servers. The possible advantage
of client logging is that actions of each user may
be identified. At the server level, WWW requests from
clients may be logged. A proxy is a machine,
usually at the client network, that filters information
that flows in and out of the network. HTTP requests
from different machines may be logged at this level.
The advantage of proxy logging is that HTTP requests
from different machines may be logged at the same
time. Network level logging is HTTP logging by an
independent machine (neither the client, the server nor
the proxy machine).
In their paper
Complementing Surveying and Demographics with
Automated Network Monitoring,
Abrams and Williams propose logging
at the network level to collect information on
web usage, using tcpdump
(Abrams and Williams,
96).
The advantages of network logging are that
it is secure (only the owner of the network
monitoring device can access the data) and that it
does not affect
the performance of clients, proxies, servers or networks.
One problem that may occur with logging is caching;
caching at client, server or network levels
may distort information
collected from HTTP logs. Caching usually occurs
at the client level to reduce network traffic and
retrieval time.
Problems that result from caching are:
- Web users may access
web pages many times without the client
requesting information directly from the server.
- More than one user may request the same pages
but this won't be reflected at the server log.
These problems may be solved by specifying
a short time-to-live period for each document
retrieved. However, this will defeat the purpose
of a cache proxy, which is to reduce network traffic and
improve user response time. Mirroring also
distorts log information collected at the
server level.
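The effect of caching on a server log, and the trade-off of a short
time-to-live, can be sketched with a small simulation. This is only an
illustration under simple assumptions: a single shared cache sits between
all users and the server, times are in seconds, and the user names and
URLs are invented.

```python
def server_log_with_cache(requests, ttl_seconds):
    """Return the requests that reach the server when a shared cache with
    the given time-to-live sits between the users and the server."""
    fetched_at = {}   # url -> time the cached copy was fetched
    server_log = []
    for time, user, url in requests:
        last = fetched_at.get(url)
        if last is not None and time - last < ttl_seconds:
            continue  # served from the cache; the server never sees it
        fetched_at[url] = time
        server_log.append((time, user, url))
    return server_log

requests = [
    (0, "alice", "/index.html"),
    (10, "bob", "/index.html"),
    (20, "alice", "/index.html"),
    (30, "bob", "/news.html"),
]

long_ttl = server_log_with_cache(requests, ttl_seconds=3600)
no_caching = server_log_with_cache(requests, ttl_seconds=0)
print(len(requests), len(long_ttl), len(no_caching))
```

With the long time-to-live, the server sees only two of the four requests,
and the second user's visit to the first page is never recorded. With a
time-to-live of zero, the log is complete, but every request travels to
the server again.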
Other limitations of logging are:
- Logging is limited to collecting
information at each site, usually at the client
or server site. It cannot collect
information about user behaviour
at other sites, and unless web administrators
pool their resources, only limited inferences
can be made about user actions.
- Logs at the server level can only reveal a
user's behaviour for one session, i.e. when a
user accesses pages from one web site at one sitting.
Logs usually cannot identify users
over different sessions unless the user-id is
specified in the HTTP requests. This too limits
the usefulness of the data.
These limitations make it difficult
to generalize about user behaviour.
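To make the notion of a session concrete, here is a minimal sketch of how
server log entries might be grouped into sessions. It assumes that a
session is a run of requests from the same machine with no gap longer than
30 minutes; that threshold, the host names and the URLs are all
assumptions made for the example, not part of any standard.

```python
from collections import defaultdict

def sessionize(entries, idle_timeout=1800):
    """Group (host, timestamp, url) entries into per-host sessions.

    A new session starts whenever a host has been idle for longer than
    idle_timeout seconds (an assumed threshold).
    """
    by_host = defaultdict(list)
    for host, timestamp, url in sorted(entries, key=lambda e: (e[0], e[1])):
        sessions = by_host[host]
        if sessions and timestamp - sessions[-1][-1][0] <= idle_timeout:
            sessions[-1].append((timestamp, url))
        else:
            sessions.append([(timestamp, url)])
    return dict(by_host)

entries = [
    ("hostA", 0, "/a.html"),
    ("hostA", 60, "/b.html"),
    ("hostA", 7200, "/a.html"),   # two hours later: counted as a new session
    ("hostB", 30, "/a.html"),
]
sessions = sessionize(entries)
print({host: len(s) for host, s in sessions.items()})
```

Nothing in the log says whether the two sessions from hostA belong to the
same person, which is exactly the limitation described above.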
However, there are benefits to logging, such as:
- User anonymity is usually preserved.
Only the machines that users use, and not their
user IDs, are logged.
- Information can be collected automatically
using scripts.
- Since participation is automatic,
users don't select themselves.
User Surveys
The second form of collecting demographics,
complementary to the first, is through
user surveys. Personal information such as
gender, age, preferences, beliefs,
lifestyles, and opinions
can be collected through this method.
Businesses often hire or sponsor organizations
to conduct these surveys. Some organizations
also conduct surveys to record the evolution
of the Web and to provide this information on-line
for the general public.
These surveys are conducted on the Internet,
through forms on web browsers and e-mail.
Surveys are also conducted through phone
surveys, postal mail and personal interviews.
The on-line surveys I have seen
are typically extremely long (at least several pages),
and require many types of reply formats, including
paragraph-long replies! These
surveys range from the typical demographic surveys
to surveys on net addiction and surveys for sea
surfers.
However, there are limitations to surveys.
Respondents select themselves for the surveys;
hence these surveys
may not accurately represent the general web population
as a whole. Furthermore, respondents may provide
inaccurate or false
information; survey results may be affected by
the type of questions used, and the order and phrasing of
the questions. Forms-based surveys limit
their respondents to those who use forms-supported
browsers.
However, despite their limitations, surveys today remain
the main source of demographic information because
of privacy concerns.
Web users value their anonymity highly and
are very concerned with having control of demographic
information. Surveys provide the simplest solution
to user concerns at the moment.
They are also a way to gain information
regarding web users and their usage patterns
that cannot be collected from HTTP logs.
Surveys need not be limited to a specific
site or network of users, as logging is.
Guestbooks are similar to surveys except that
they are placed on a specific set of web pages
in which the guestbook author is interested.
Cookies and User/Password Authentication
Netscape introduced a way to collect some
limited information,
yet protect user anonymity to a certain extent, through
Netscape cookies.
Now also available in Microsoft
Internet Explorer, cookies provide a way for servers to
save a limited set of information in the user's files.
The information may include the server's domain,
the unique value of the cookie, and a field for a
server-defined code to identify the machine.
When the user accesses the web site again, the server
gets access to the information it had originally
saved. One of the limitations of cookies is
that they are browser-dependent and provide
only limited information about the web user.
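The exchange described above can be sketched with Python's standard
http.cookies module. This is a minimal illustration, not a description of
any particular server: the cookie name visitor_id, the helper function
names and the domain are all invented for the example.

```python
import secrets
from http.cookies import SimpleCookie

def make_set_cookie_value(domain):
    """Build the value of a Set-Cookie header carrying a server-defined code."""
    cookie = SimpleCookie()
    cookie["visitor_id"] = secrets.token_hex(8)   # hypothetical cookie name
    cookie["visitor_id"]["domain"] = domain
    cookie["visitor_id"]["path"] = "/"
    return cookie["visitor_id"].OutputString()

def read_visitor_id(cookie_header):
    """Extract the code from the Cookie header sent back on a later visit."""
    cookie = SimpleCookie()
    cookie.load(cookie_header)
    morsel = cookie.get("visitor_id")
    return morsel.value if morsel is not None else None

# First visit: the server sends the cookie to the browser.
set_cookie = make_set_cookie_value("www.example.edu")
print("Set-Cookie:", set_cookie)

# Later visit: the browser returns only the name=value pair.
returned = set_cookie.split(";")[0]
print("Recognized visitor:", read_visitor_id(returned))
```

The server learns only that the same browser has returned; it learns
nothing about who the user is unless the user supplies that information
some other way.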
Another method of collecting information
that is widely used is through user/password
authentication. Many users are turned off by
this feature because it lacks security: any
other user with knowledge of a user's ID and password
may pretend to be that user.
Demographics from Collaborative Filtering
An interesting source of demographic data is
collaborative filtering environments on the Web.
An example of such a site is
firefly.
This site provides ways that users can view sites
that are recommended by other members, and
recommend different sites to each other.
Members can also provide information about themselves
through web pages.
Although this service is free, users
provide some general information to the site and
access the site through
an alias/password authentication.
Such collaborative systems are a gold mine of
information: not only do they provide user profiles,
they also show users' interests, tastes and
preferences in web sites.
2.2.2 Proposed Methods
A few ideas have been proposed as alternatives
or complements to user surveys.
A few
suggestions
have been made by the World Wide Web Consortium (W3C)
for gathering consumer demographics
(World Wide Web Consortium).
One proposal is that
the user agent, i.e. a client, sends
a randomly generated sessionID at the beginning
of each HTTP session. That sessionID is then
monotonically increased
at each request in that session. This will allow
the server to track the information that is being requested by
the client.
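A minimal sketch of this sessionID idea follows. The proposal as
summarized here does not say how the number is generated or transmitted,
so the class name, the size of the random range and the server-side
grouping function are assumptions made for illustration.

```python
import secrets

class SessionIdSource:
    """Client side: pick a random starting ID, then increase it by one
    for each request in the session."""

    def __init__(self):
        self._next_id = secrets.randbelow(2**31)   # assumed range

    def next_id(self):
        current = self._next_id
        self._next_id += 1
        return current

def group_into_sessions(observed_ids):
    """Server side: treat each run of consecutive IDs as one session."""
    sessions = []
    for session_id in sorted(observed_ids):
        if sessions and session_id == sessions[-1][-1] + 1:
            sessions[-1].append(session_id)
        else:
            sessions.append([session_id])
    return sessions

source = SessionIdSource()
ids_sent = [source.next_id() for _ in range(3)]
print(ids_sent)

# IDs from two different sessions, as a server might see them.
print(group_into_sessions([5001, 100, 101, 5000, 102]))
```

The grouping would be wrong if two sessions happened to start at adjacent
numbers; a large random range makes that unlikely but not impossible.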
Another suggestion is using
a business card record, where
an HTML user agent maintains a user profile on behalf
of the user. The profile includes
the user's full name, email address,
home url, affiliation, postal address, and business
phone number. When a form is processed, the initial values
of matching form fields are filled in from the
user profile. Security is an important issue here, so
users must have control over access to their own profile.
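The form-filling behaviour might look like the following sketch. The
field names, the sample profile and the idea of an explicit set of
permitted fields are assumptions for illustration; the proposal itself
only requires that users control access to their profile.

```python
def prefill_form(form_fields, profile, allowed_fields):
    """Return initial values for a form, filling only the fields that the
    profile contains and that the user has agreed to release."""
    return {
        field: profile[field] if field in allowed_fields and field in profile else ""
        for field in form_fields
    }

# Invented profile held by the user agent on behalf of the user.
profile = {
    "full_name": "A. User",
    "email": "user@example.edu",
    "business_phone": "555-0100",
}

# The user has agreed to release only the name to this site.
initial_values = prefill_form(
    ["full_name", "email", "comments"], profile, allowed_fields={"full_name"}
)
print(initial_values)
```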
Another tool that is suggested is anonymous authentication.
This works in a similar manner to Netscape's cookie mentioned earlier, without
being browser-dependent. A 128-bit random number is
chosen for a user for each site server, and saved in the user's own file.
This unique number
is accessed when the user agent requests information from a server.
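A minimal sketch of this scheme is shown below. It assumes the numbers
are kept in a JSON file belonging to the user; the file name, the function
name and the site names are invented for the example.

```python
import json
import secrets
import tempfile
from pathlib import Path

def site_identifier(site, store_path):
    """Return the user's identifier for a site, creating a new 128-bit
    random number the first time the site is visited."""
    store_path = Path(store_path)
    ids = json.loads(store_path.read_text()) if store_path.exists() else {}
    if site not in ids:
        ids[site] = secrets.token_hex(16)   # 16 random bytes = 128 bits
        store_path.write_text(json.dumps(ids))
    return ids[site]

with tempfile.TemporaryDirectory() as tmp:
    store = Path(tmp) / "site_ids.json"    # hypothetical user-owned file
    first = site_identifier("www.example.edu", store)
    again = site_identifier("www.example.edu", store)
    other = site_identifier("www.example.org", store)

print(first == again, first == other)
```

Because each site receives a different number, two sites cannot combine
their records by comparing identifiers, while each site can still
recognize a returning user.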
There are other methods that are being employed by companies
to collect demographic information. Companies like I/PRO,
NetCount and WebTrac market tools to monitor and
analyze Web sites. Some of the problems currently faced by
all these demographic tools are firewalls, cache proxies, and site mirroring.
However, these tools provide a useful technique for understanding
consumer online behaviour.
Copyright © 1996 Mei See Yeoh, All Rights Reserved
Mei See Yeoh <myeoh@vt.edu>
Last modified: Sat Oct 26 13:15:51 1996