wwwstat manual
NAME
wwwstat - summarize WWW server (httpd) access statistics
SYNOPSIS
wwwstat [-F system_config] [-f user_config] [options...] [--] [ summary | logfile | + | - ]...
DESCRIPTION
wwwstat reads a sequence of
httpd common logfile format (CLF) access_log files
and/or prior wwwstat output summary files and/or the standard input
and outputs a summary of the access statistics in HTML.
Since
wwwstat does not make any changes to the input files or write any files in the
server directories, it can be run by any user with read access to the input
logfile(s) and summary file(s). This allows people other than the webmaster
to run specialized analyses of just the things they are interested in
summarizing.
wwwstat provides World Wide Web (WWW) access statistics, which do not necessarily
correspond to statistics on individual users. It counts the number of
HTTP requests received by the server
and the number of bytes transmitted in response to those requests,
according to what is in the logfile(s), and outputs those counts
as tables broken down by category of request.
wwwstat output summaries can be read by
gwstat to produce fancy graphs of the summarized statistics.
The splitlog program can be used to split a large logfile into separate
files by entry prefix or URL path.
wwwstat is a
perl script, which means you need to have a
perl interpreter to run the program. It has been tested with
perl versions 4.036 and 5.002.
Output Sections
wwwstat's output consists of a set of cross-reference links,
the sum totals and averages for the processed data, and
a sequence of amount-by-category tables partitioned into sections.
The section categories are based on the characteristics evident from
the access request, as provided by the
common logfile format. These include:
- Request Date
- e.g., "Feb 2 1996"
- Request Hour
- e.g., "00" through "23"
- Client Domain
- The Fully-Qualified Domain Name (FQDN) suffix that corresponds to
an organization type or country name.
- Reversed Subdomain
- The FQDN, usually minus the first (machine name) component,
and reversed so that it is easier to read when sorted.
- URL/Archive
- Grouping based on Request-URI or non-success status code.
- Identity
- The user identity based on IdentityCheck token or Authorization field.
Each section can be enabled/disabled using the configuration files or
command-line options (see
Section Display Options).
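The Reversed Subdomain category described above can be illustrated with a minimal Python sketch. This is not wwwstat's own code (wwwstat is a perl script); the function name reversed_subdomain and the rule of keeping the machine name when the hostname has only two components are assumptions made for the example.

```python
def reversed_subdomain(hostname, strip_machine=True):
    """Reverse the dot-separated labels of a hostname.

    strip_machine drops the first (machine name) label, mirroring the
    "usually minus the first component" wording in the manual. Leaving
    two-label names such as "uci.edu" intact is an assumption of this sketch.
    """
    labels = hostname.split(".")
    if strip_machine and len(labels) > 2:
        labels = labels[1:]
    return ".".join(reversed(labels))


print(reversed_subdomain("kiwi.ics.uci.edu"))
print(reversed_subdomain("kiwi.ics.uci.edu", strip_machine=False))
```

Because the most significant label comes first, a plain alphabetical sort of the reversed names keeps hosts of the same organization next to each other.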
Output Table Format
Inside each section, the statistics are presented as a preformatted table.
%Reqs %Byte  Bytes Sent  Requests   category-type
----- ----- ------------ -------- |---------------
NN.NN NN.NN NNNNNNNNNNNN NNNNNNNN | category-value
100.0 100.0 NNNNNNNNNNNN NNNNNNNN | category-value
- Requests
- Requests received for this category-value.
- Bytes Sent
- Bytes transmitted for this category-value.
- %Reqs
- (<Requests>/<Total Requests>)*100.
- %Byte
- (<Bytes Sent>/<Total Bytes>)*100.
The table can be sorted by category-value
(-sort key), number of requests received
(-sort req), or number of bytes transmitted
(-sort byte). It can also be limited to the
-top N entries.
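As a rough illustration of how the four columns and the -sort and -top options relate, the following Python sketch builds such a table from (category-value, bytes) pairs. It is an assumption-laden model, not wwwstat's implementation: the function name summarize, the input shape, and the choice to sort req and byte in descending order are all hypothetical.

```python
from collections import defaultdict


def summarize(entries, sort="key", top=None):
    """Build table rows from (category_value, bytes_sent) pairs, one per request.

    Each row is (%Reqs, %Byte, Bytes Sent, Requests, category-value).
    Descending order for "req" and "byte" is an assumption of this sketch.
    """
    reqs = defaultdict(int)
    sent = defaultdict(int)
    for value, nbytes in entries:
        reqs[value] += 1
        sent[value] += nbytes
    total_reqs = sum(reqs.values())
    total_bytes = sum(sent.values())
    rows = []
    for value in reqs:
        pct_reqs = 100.0 * reqs[value] / total_reqs if total_reqs else 0.0
        pct_byte = 100.0 * sent[value] / total_bytes if total_bytes else 0.0
        rows.append((pct_reqs, pct_byte, sent[value], reqs[value], value))
    if sort == "req":
        rows.sort(key=lambda r: r[3], reverse=True)
    elif sort == "byte":
        rows.sort(key=lambda r: r[2], reverse=True)
    else:
        rows.sort(key=lambda r: r[4])
    return rows[:top] if top else rows


# Hypothetical sample: two requests in hour "00", one in hour "13".
sample = [("00", 500), ("00", 1500), ("13", 2000)]
for row in summarize(sample, sort="req"):
    print("%5.2f %5.2f %12d %8d | %s" % row)
```

Limiting the table with top only makes sense after sorting by req or byte, which matches the note under the -top option.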
OPTIONS
Configuration Options
These options define how
wwwstat should establish defaults and interpret the command-line.
- -F filename
- Get system configuration defaults from the given file. If used, this
must be the first argument on the command-line, since it needs to be
interpreted before the other command options. The file wwwstat.rc
is included with the distribution as an example of this file; it contains
perl source code which directly sets the control and display options
provided by wwwstat.
If filename is not a pathname,
the include path (see FILES) is searched for
filename. An empty string as filename will disable this feature.
[-F "wwwstat.rc"]
- -f filename
- Get user configuration defaults from the given file. If used, this
must be the first argument on the command-line after
-F (if any). The file is the same format as for the -F
option (see wwwstat.rc).
If filename is not a pathname,
the include path (see FILES) is searched for
filename. An empty string as filename will disable this feature.
[-f ".wwwstatrc"]
- --
- Last option (the remaining arguments are treated as input files).
Diagnostic Options
These options provide information about
wwwstat usage or about some unusual aspects of the logfile(s) being processed.
- -h
- Help - display usage information to STDERR and then exit.
- -v
- Verbose display to STDERR of each log entry processed.
- -x
- Display to STDERR all requests resulting in HTTP error responses.
- -e
- Display to STDERR all invalid log entries. Invalid log entries can occur
if the server is miswriting or overwriting its own log, if the request is
made by a broken client or proxy, or if a malicious attacker is trying to
gain privileged access to your system. For the latter reason, the webmaster
should run
wwwstat with this option on a regular basis.
Display Options
These options modify the output format.
- -H string
- Use the given string as the HTML title and heading for output.
- -X string
- Use the given string as the cross-reference URL to the last summary output.
Any occurrence of the characters "%M" or "%Y" is replaced by the month
and year, respectively, of the month prior to the first log entry date.
The empty string will exclude any cross-reference.
- -R
- Display the daily stats table sorted in reverse. This option is primarily
for use with the
gwstat program for producing graphs of the output.
- -l
- -L
- Do
(-l) or don't
(-L) display the full DNS hostname of clients in your local domain (which is
determined by the configured value of $AppendToLocalhost) in the section
on subdomain statistics. The default
[-L] is to strip the machine name from local addresses.
- -o
- -O
- Do
(-o) or don't
(-O) display the full DNS hostname of clients outside your local domain
in the section on subdomain statistics. The default
[-O] is to strip the machine name from outside addresses.
- -u
- -U
- Do
(-u) or don't
(-U) display the IP address of clients with unresolved domain names in the section
on subdomain statistics. The
-dns option can be used to resolve some names, but not all IP hosts have
a DNS name (SLIP/PPP connections) and sometimes a host's DNS service
is inaccessible. The default
[-U] is to group all such addresses under the category "Unresolved".
- -dns
- -nodns
- Do
(-dns) or don't
(-nodns) use the system's hostname lookup facilities to find the DNS hostname
associated with any unresolved IP addresses. Looking up a DNS name may be
very slow, particularly when the results are negative (no DNS name),
which is why a caching capability is included as well.
[-nodns]
- -cache filename
- Use the given DBM database as the read/write persistent DNS cache
(the .dir and .pag extensions are appended automatically). Cached entries
(including negative results) are removed after the time configured for
$DNSexpires [two months]. No caching is performed if
filename is the empty string, which may be needed if your system does not support
DBM or NDBM functionality. Running
-dns without a persistent cache is not recommended.
[-cache "dnscache"]
- -trunc N
- Truncate the URLs listed in the archive section after the
Nth hierarchy level. This option is commonly used to reduce the output size
and memory requirements of
wwwstat by grouping the requests by directory tree instead of listing every URL.
The default
[-trunc 0] is to display every requested URL.
- -files
- -nofiles
- Do
(-files) or don't
(-nofiles) include the last component of a URL (usually the filename) in the
archive section. This option is commonly used to reduce the output size
and memory requirements of
wwwstat by grouping the requests by directory instead of listing every URL.
The default
[-files] is to display the entire requested URL.
- -link
- -nolink
- Do
(-link) or don't
(-nolink) add a hypertext link around each archive URL. This option is useful for
local maintenance, but it is not recommended for publication of the HTML
results (it often results in links to temporary or nonexistent resources,
and leads people/robots to resources that might not be publicly available).
[-nolink]
- -cgi
- -nocgi
- Do
(-cgi) or don't
(-nocgi) prefix the summary output with CGI header fields appropriate for use
with the HTTP common gateway interface. Using
wwwstat as a CGI script is not recommended - it is usually better to simply
run the wwwstat program periodically and serve the static output file.
[-nocgi]
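The grouping effect of the -trunc and -nofiles options listed above can be pictured with a short Python sketch. The exact way wwwstat counts hierarchy levels is not spelled out in this manual, so the interpretation here (level N keeps the first N directory components and ends the name with a slash) is an assumption, as is the function name archive_name.

```python
def archive_name(url, trunc=0, files=True):
    """Return the archive-section name under which a URL would be grouped.

    trunc=0 keeps every level (the documented default). files=False drops
    the last component, usually the filename. How levels are counted is an
    assumption of this sketch, not documented wwwstat behavior.
    """
    parts = url.split("/")  # "/a/b/c.html" -> ["", "a", "b", "c.html"]
    if not files and len(parts) > 1 and parts[-1] != "":
        parts[-1] = ""  # drop the filename but keep the trailing slash
    if trunc > 0 and len(parts) > trunc + 1:
        parts = parts[:trunc + 1] + [""]
    return "/".join(parts)


print(archive_name("/pub/websoft/wwwstat/index.html"))
print(archive_name("/pub/websoft/wwwstat/index.html", files=False))
print(archive_name("/pub/websoft/wwwstat/index.html", trunc=2))
```

Either option shrinks the number of distinct names that have to be held in memory, which is why the manual recommends them for reducing output size.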
These options change the display of entire sections (as opposed to the
entries within those sections). They allow the user to enable or disable
an entire section, set the sorting method for that section, and limit the
number of displayed entries for that section. These options are
context-sensitive and processed in the order given.
- -all
- -noall
- Include
(-all) or exclude
(-noall) all of the display sections. The
-noall option is commonly used just prior to one or more of the other section
options, such that only the listed sections are displayed.
- -daily
- -nodaily
- Include
(-daily) or exclude
(-nodaily) the section of statistics by
request date
and set the scope for later
-sort and -top options to this section.
- -hourly
- -nohourly
- Include
(-hourly) or exclude
(-nohourly) the section of statistics by
request hour
and set the scope for later
-sort and -top options to this section.
- -domain
- -nodomain
- Include
(-domain) or exclude
(-nodomain) the section of statistics by
the client's Internet domain
and set the scope for later
-sort and -top options to this section.
- -subdomain
- -nosubdomain
- Include
(-subdomain) or exclude
(-nosubdomain) the section of statistics by
the client's Internet subdomain (reversed for display)
and set the scope for later
-sort and -top options to this section.
- -archive
- -noarchive
- Include
(-archive) or exclude
(-noarchive) the section of statistics by
requested URL/archive
and set the scope for later
-sort and -top options to this section.
- -r
- -ident
- -noident
- Include
(-r or -ident) or exclude
(-noident) the section of statistics by
the identity of the user (if IdentityCheck is ON) or the authentication
userid (if supplied)
and set the scope for later
-sort and -top options to this section.
DO NOT PUBLISH this information, as that would reveal security-related identities
and be a violation of privacy. This option is provided for administrative
purposes only.
- -sort (key|byte|req)
- Sort this section by its primary key, the number of bytes transmitted,
or the number of requests received. [-sort key]
- -top N
- Display only the top N entries for this section. This option assumes that
the
-sort option has been set to either byte or req.
- -both
- Display both the top N entries for this section [10, sorted by requests],
and then the full section (all entries) sorted by key.
Search Options
These options are used to limit the analysis to requests matching a
pattern. The pattern is supplied in the form of a
perl regular expression, except that the characters "+" and "." are escaped automatically
unless the
-noescape option is given.
Enclose the pattern in single-quotes to prevent the command shell
from interpreting some special characters.
Multiple occurrences of the same option result in an OR-ing of the
regular expressions. Search options are only applied to logfile entries;
any summary files input must have been created with the same search options.
- -a regexp
- -A regexp
- Include
(-a) or exclude
(-A) all requests containing a hostname/IP address
matching the given perl regular expression.
- -c regexp
- -C regexp
- Include
(-c) or exclude
(-C) all requests resulting in an
HTTP status code
matching the given perl regular expression.
- -d regexp
- -D regexp
- Include
(-d) or exclude
(-D) all requests occurring on a date (e.g., "Feb 2 1994")
matching the given perl regular expression.
- -t regexp
- -T regexp
- Include
(-t) or exclude
(-T) all requests occurring during the hour (e.g., "23" is 11pm - 12am)
matching the given perl regular expression.
- -m regexp
- -M regexp
- Include
(-m) or exclude
(-M) all requests using an HTTP method (e.g., "HEAD")
matching the given perl regular expression.
- -n regexp
- -N regexp
- Include
(-n) or exclude
(-N) all requests on a URL (archive name)
matching the given perl regular expression.
- -noescape
- Do not escape the special characters ("+" and ".") in the remaining
search options.
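The automatic escaping and the OR-ing of repeated search options can be sketched in Python as follows. Python's re module stands in for perl regular expressions here, and the helper names prepare_pattern and matches_any are hypothetical; the sketch only models the behavior described above and is not taken from wwwstat's source.

```python
import re


def prepare_pattern(pattern, noescape=False):
    """Escape "+" and "." so they match literally, unless noescape is set."""
    if noescape:
        return pattern
    return re.sub(r"[+.]", lambda m: "\\" + m.group(0), pattern)


def matches_any(patterns, text, noescape=False):
    """True if any of the patterns matches (repeated options are OR-ed)."""
    return any(re.search(prepare_pattern(p, noescape), text) for p in patterns)


print(prepare_pattern("^kiwi.ics.uci.edu$"))
print(matches_any([".com$"], "www.example.com"))
print(matches_any([".com$"], "telecom"))
print(matches_any([".com$"], "telecom", noescape=True))
```

The last two calls show why the escaping is the default: without it, the "." in '.com$' matches any character, so a name that merely ends in "com" would be counted as a commercial domain.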
INPUT
After parsing the options, the remaining arguments on the command-line
are treated as input arguments and are read in the order given.
If no input arguments are given, the configured default logfile is read
[+].
- -
- Read from standard input (STDIN).
- +
- Read the default logfile. [as configured]
- filename...
- Read the given file and determine from the first line whether it is a
previous output summary or a CLF logfile. If the
filename's extension indicates that it is compressed (gz|z|Z), then pipe it through
the configured decompression program
[gunzip -c] first. Summary files must have been created with the same (or similar)
configuration and command-line options as the currently running program;
if not, weird things will happen.
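The extension check for compressed input can be sketched in Python. The regular expression and the function name open_command are assumptions made for illustration; only the extension list (gz|z|Z) and the default program [gunzip -c] come from this manual.

```python
import re
import shlex

# Documented default; wwwstat lets this be changed in its configuration.
DECOMPRESS = "gunzip -c"


def open_command(filename):
    """Return the argv list for a decompression pipe, or None to read directly."""
    if re.search(r"\.(gz|z|Z)$", filename):
        return shlex.split(DECOMPRESS) + [filename]
    return None


print(open_command("access_log.gz"))
print(open_command("access_log"))
```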
USAGE
wwwstat is used for many purposes:
- as a diagnostic utility for measuring server activity, finding incorrect
URL references, and detecting attempted misuse of the server;
- as a public relations tool for measuring technology or information transfer
(i.e., Is the message getting out? To the right people?);
- as an archival tool for tracking web usage over time without storing
the entire logfile; and,
- most often, as an easy mechanism for justifying all the hard work that went
into creating the web content that people out there are requesting.
In most cases,
wwwstat is run on a periodic basis (nightly, weekly, and/or monthly) by a wrapper
program as a
crontab entry shortly after midnight, typically in conjunction
with rotating the current logfile. The output is usually directed
to a temporary file which can later be moved to a published location.
The temporary file is necessary to avoid erasing your published file
during wwwstat's processing (which would look very odd if someone tried
to GET it from your web).
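A wrapper that follows the temporary-file advice above might look like the Python sketch below. It is one possible approach, not a script shipped with wwwstat: the function name publish is hypothetical, and the file mode 0644 is an assumption about what a web server needs in order to read the result.

```python
import os
import subprocess
import tempfile


def publish(command, published_path):
    """Run command, capture its standard output in a temporary file in the
    same directory as published_path, then rename it into place.

    The rename is atomic on POSIX filesystems, so a client never sees a
    half-written summary.
    """
    directory = os.path.dirname(os.path.abspath(published_path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as tmp:
            subprocess.run(command, stdout=tmp, check=True)
        os.chmod(tmp_path, 0o644)  # mkstemp creates the file as 0600
        os.replace(tmp_path, published_path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise


# Hypothetical invocation (paths are placeholders):
# publish(["wwwstat", "-H", "Monthly Statistics", "access_log"],
#         "/path/to/published/stats.html")
```

If wwwstat exits with an error, the temporary file is removed and the previously published summary is left untouched.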
wwwstat can be run as a CGI script
(-cgi), but that is not recommended unless the input logfile is very small.
All of the command-line options, and a few options that are not available
from the command-line, can be changed within the user and system configuration
files (see
wwwstat.rc). These files are actually
perl library modules which are executed as part of the program's initialization.
The example provided with the distribution includes complete documentation
on what variables can be set and their range of values.
Perl Regular Expressions
The Search Options and many of the configuration file settings
allow for full use of perl regular expressions
(with the exception that the -a, -A, -n and -N options treat '+' and '.'
characters as normal alphabetic characters unless they are preceded by the
-noescape option). Most people only need to know the following special characters:
- ^
- at start of pattern, means "starts with pattern".
- $
- at end of pattern, means "ends with pattern".
- (...)
- groups pattern elements as a single element.
- ?
- matches preceding element zero or one times.
- *
- matches preceding element zero or more times.
- +
- matches preceding element one or more times.
- .
- matches any single character.
- [...]
- denotes a class of characters to match. [^...] negates the class.
Inside a class, '-' indicates a range of characters.
- (A|B|C)
- matches if A or B or C matches.
Depending on your command shell, some special characters may need to be
escaped on the command line or enclosed in single-quotes to avoid shell
interpretation.
EXAMPLES
- Summarize requests from commercial domains.
- wwwstat -a '.com$'
- Summarize requests from the host kiwi.ics.uci.edu
- wwwstat -a '^kiwi.ics.uci.edu$'
- Summarize requests not from kiwi.ics.uci.edu
- wwwstat -A '^kiwi.ics.uci.edu$'
- Summarize requests resulting in temporary redirects
- wwwstat -c '302'
- Summarize requests resulting in server errors
- wwwstat -c '^5'
- Summarize unsuccessful requests
- wwwstat -C '^2' -C '304'
- Summarize requests in first week of the month
- wwwstat -d ' [1-7] '
- Summarize requests in second week of the month
- wwwstat -d ' ([89]|1[0-4]) '
- Summarize requests in third week of the month
- wwwstat -d ' (1[5-9]|2[01]) '
- Summarize requests in fourth week of the month
- wwwstat -d ' 2[2-8] '
- Summarize requests in leftover days of the month
- wwwstat -d ' (29|30|31) '
- Summarize requests in February
- wwwstat -d 'Feb'
- Summarize requests in year 1994
- wwwstat -d '1994'
- Summarize requests not in April
- wwwstat -D 'Apr'
- Summarize requests between midnight and 1am
- wwwstat -t '00'
- Summarize requests not received between noon and 1pm
- wwwstat -T '12'
- Summarize requests with a gif extension
- wwwstat -n '.gif$'
- Summarize requests under user's URL
- wwwstat -n '^/~user/'
- Summarize requests not under "hidden" paths
- wwwstat -N '/hidden/'
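The week-of-month patterns in the examples above depend on the spaces around the day number. The Python sketch below checks them against date strings of the form shown in this manual ("Feb 2 1996"); the single-space date layout and the helper name week_of are assumptions, and Python's re module is used in place of perl.

```python
import re

# Patterns copied from the EXAMPLES list; the surrounding spaces matter.
WEEK_PATTERNS = {
    "first": r" [1-7] ",
    "second": r" ([89]|1[0-4]) ",
    "third": r" (1[5-9]|2[01]) ",
    "fourth": r" 2[2-8] ",
    "leftover": r" (29|30|31) ",
}


def week_of(date):
    """Return the names of all week patterns that match the date string."""
    return [name for name, pat in WEEK_PATTERNS.items() if re.search(pat, date)]


for day in (2, 14, 21, 28, 29):
    date = "Feb %d 1996" % day
    print(date, week_of(date))
```

Without the leading and trailing spaces, a pattern such as '[1-7]' would also match digits inside two-digit days and inside the year.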
ENVIRONMENT
- HOME
- Location of user's home directory, placed on INC path.
- LOGDIR
- Used instead of HOME if latter is undefined.
- PERLLIB
- A colon-separated list of directories in which to look for
include and configuration files.
FILES
Unless a pathname is supplied, the configuration files are
obtained from the current directory, the user's home
directory (HOME or LOGDIR), the standard library path
(PERLLIB), and the directory indicated by the command
pathname (in that order).
- .wwwstatrc
- User configuration file.
- wwwstat.rc
- System configuration file.
- domains.pl
- Mapping of Internet domain to country or organization.
- dnscache.dir
- dnscache.pag
- DBM files for persistent DNS cache.
SEE ALSO
crontab(1), gwstat(1), httpd(1m), perl(1),
splitlog(1)
- More info and the latest version of wwwstat
can be obtained from
- http://www.ics.uci.edu/pub/websoft/wwwstat/
ftp://www.ics.uci.edu/pub/websoft/wwwstat/
If you have any suggestions, bug reports, fixes, or enhancements,
please join the <wwwstat-users@ics.uci.edu> mailing list by sending
e-mail with "subscribe" in the subject of the message to the request
address <wwwstat-users-request@ics.uci.edu>. The list is archived at
the above address.
More About HTTP
- HTTP/1.1 Proposed Standard
- R. Fielding, J. Gettys, J. C. Mogul, H. Frystyk, and T. Berners-Lee.
"Hypertext Transfer Protocol -- HTTP/1.1", U.C. Irvine, DEC, MIT/LCS,
August 1996.
http://www.ics.uci.edu/pub/ietf/http/
More About Perl
- The Perl Language Home Page
- http://www.perl.com/perl/index.html
- Johan Vromans' Perl Reference Guide
- http://www.xs4all.nl/~jvromans/perlref.html
DIAGNOSTICS
See also the
Diagnostic Options above.
- "[none] to [none]" dates
- wwwstat did not find any matching data to summarize. If you get such an empty
summary, it means that either:
1) there was no valid data (the input files are all invalid or empty), or
2) none of the data matched the search options given. Try using the
-e option to show invalid data.
- 100% unresolved
- If the subdomain section indicates that all of the client requests come
from unresolved hostnames (IP addresses), this probably means that your
server is running without DNS resolution (common for very busy sites).
You can use the
-dns option to have
wwwstat perform the hostname lookups. If 100% of the hosts are still unresolved
with the
-dns option in effect, then it may be that all of the clients accessing your
server are doing so from temporary SLIP/PPP addresses without DNS names, or
it may be a problem with wwwstat's DNS cache (delete the cache files),
with your system's DNS software (contact your system administrator),
or with your network connection.
Hits vs Requests vs Visitors
wwwstat counts HTTP requests received by the server. When a request is successful,
it is often referred to as a "hit". Retrieving a single image is one GET
request. Retrieving an HTML page is also one GET request, but that does not
include the separate requests made for in-line images or related objects.
Checking to see if a cached image is still valid (a HEAD or conditional GET)
is also one request.
In all sections except the archive section,
wwwstat shows the statistics for all requests (successful or not). In the archive
section, it normally shows all non-successful requests under a special category
for the status code and only successful requests (hits) under the URL or
archive tree associated with the request. However, this grouping of
non-successful requests is disabled when
wwwstat is used with the search options
-n, -c, and -C, since those options are normally used for finding error conditions.
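The archive-section grouping just described can be modeled with a small Python sketch. Treating 2xx and 304 responses as successful is inferred from the "unsuccessful requests" entry in EXAMPLES; the label format "Code NNN" and the function name archive_category are hypothetical.

```python
def archive_category(status, url, group_errors=True):
    """Pick the archive-section category for one request.

    group_errors=False models the case where the -n, -c, or -C search
    options are in use and non-successful requests stay under their URL.
    """
    successful = status.startswith("2") or status == "304"
    if group_errors and not successful:
        return "Code " + status  # hypothetical label for the status category
    return url


print(archive_category("200", "/index.html"))
print(archive_category("404", "/missing.html"))
print(archive_category("404", "/missing.html", group_errors=False))
```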
wwwstat does not count "visitors" -- individual people or programs making the
requests. HTTP does not, by default, provide any information that can be
accurately correlated to an individual person, though it is possible
(in an unreliable manner) to use HTTP extensions and request profiles
as a means of tracking individual client programs. Such tracking
requires extensive resources (memory and diskspace) and is often considered
a violation of privacy.
With the exception of the ident section,
wwwstat does not reveal information about the individual people making requests.
Unless the output is limited to a specific URL or a specific hostname,
wwwstat's output does not connect the requester to the URL being requested.
The httpd common logfile format (CLF) was defined in early 1994 as the
result of discussions among server and access_log analyzer developers
(Roy Fielding, John Franks, Kevin Hughes, Ari Luotonen, Rob McCool,
and Tony Sanders) on how to make it easier for analysis tools to be
used across multiple servers. The format is:
remote_host ident authuser [date-time zone] "Request-Line" Status-Code bytes
where:
- remote_host
- Client DNS hostname or IP address
- ident
- Identity check token or "-"
- authuser
- Authorization user-id or "-"
- date-time
- dd/Mmm/yyyy:hh:mm:ss
- zone
- +dddd or -dddd
- Request-Line
- The first line of the HTTP request, which normally includes the
method, URL, and HTTP-version.
- Status-Code
- Response status from server or "-"
- bytes
- Size of Entity-Body transmitted or "-"
with each field separated by a single space (it turns out that problems
occur if the ident token contains a space, which was not anticipated
by the original designers).
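A CLF entry in the format above can be picked apart with a single regular expression. The Python sketch below is an illustration only and does not reproduce wwwstat's perl parsing code; the pattern, the function name parse_clf, and the sample log line are assumptions made for the example.

```python
import re

# One named group per CLF field. Trailing text after the bytes field is
# tolerated, since some servers append extra fields to the CLF.
CLF = re.compile(
    r'^(?P<remote_host>\S+) (?P<ident>\S+) (?P<authuser>\S+) '
    r'\[(?P<date_time>\d{2}/[A-Za-z]{3}/\d{4}:\d{2}:\d{2}:\d{2}) '
    r'(?P<zone>[+-]\d{4})\] '
    r'"(?P<request_line>.*)" (?P<status>\d{3}|-) (?P<bytes>\d+|-)(?:\s|$)'
)


def parse_clf(line):
    """Return a dict of CLF fields, or None if the line is not valid CLF."""
    m = CLF.match(line)
    if m is None:
        return None
    entry = m.groupdict()
    entry["bytes"] = 0 if entry["bytes"] == "-" else int(entry["bytes"])
    return entry


# Hypothetical sample entry.
line = ('kiwi.ics.uci.edu - - [02/Feb/1996:00:15:42 -0800] '
        '"GET /pub/websoft/wwwstat/ HTTP/1.0" 200 4521')
print(parse_clf(line))
```

An ident token containing a space would make this pattern fail, which is the same weakness of the format noted above.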
LIMITATIONS
wwwstat cannot be more accurate than its input.
The common logfile format does not include the number of bytes transferred
in HTTP header fields and in error responses.
wwwstat attempts to estimate those bytes based on the response code. Although
the built-in estimates will suffice for most applications, your results
will be more accurate if the estimates are customized for the particular
server software that generated the logfile.
Modern httpd servers have extended the CLF to include additional fields
(Referer and User-Agent) or to make the entire format configurable.
Although
wwwstat is able to read logfiles which append information to the CLF, it
will not make use of that additional information. However,
wwwstat is written in
perl, so if you want to parse a different format all you have to do is change
the parsing code.
wwwstat does not do anything with Referer [sic] or User-Agent information that may be
present in extended logfiles. In order to do anything interesting with
Referer, the program would have to build a Request-URI x Referer x Count
table, which would require huge gobs of memory and is better done using
a separate program with a persistent database. Naturally, this is easy
to do once you learn
perl.
AUTHOR
Roy Fielding (fielding@ics.uci.edu), University of California, Irvine.
Please do not send questions or requests to the author, since the number
of requests has long since overwhelmed his ability to reply, and all
future support will be through the mailing list.
wwwstat was originally based on a multi-server statistics program called
fwgstat-0.035 by Jonathan Magid (jem@sunsite.unc.edu) which, in turn, was heavily based on
xferstats (packaged with version 17 of the Wuarchive FTP daemon)
by Chris Myers (chris@wugate.wustl.edu).
This work has been sponsored in part by the Defense Advanced Research Projects
Agency under Grant Numbers MDA972-91-J-1010 and F30602-94-C-0218.
This software does not necessarily reflect the position or policy of the
U.S. Government and no official endorsement should be inferred.