Full HTML for

Basic foilset Summary of Database Storage and Search of Web Logs at ARL Review

Given by Jake Kim, Deepak Ramanathan at ARL End of Year 2 Review Aberdeen Sheraton 4 Points on July 29-30 98. Foils prepared August 2 98
Summary of Material


This system logs Web accesses and provides both batch and interactive views of the archive.
We describe a system that maps IP addresses to names.

Table of Contents for full HTML of Summary of Database Storage and Search of Web Logs at ARL Review

1 Web Logs Database
2 Objectives
3 Anatomy
4 Host Identification
5 Unknown IP Addresses
6 IP Resolution - I
7 IP Resolution - II
8 The IP Database
9 The IP Database - II
10 Schematic of IP query
11 Visualization
12 Databases
13 Mining
14 Other Tools

HTML version of Basic Foils prepared August 2 98

Foil 1 Web Logs Database

From Summary of Database Storage and Search of Web Logs at ARL Review ARL End of Year 2 Review Aberdeen Sheraton 4 Points -- July 29-30 98. *
Storage, Data Mining and Visualization
ARL Year 2 Review Aberdeen July 29-30 98
Jake Kim and Deepak Ramanathan
NPAC Syracuse University
111 College Place
Syracuse NY 13244-4100

Foil 2 Objectives

The primary objective of the project is to create a set of tools that permit visualization of accesses to a Web site. Besides visualization, these tools also serve as instruments for performing data mining operations on the log data.
The database backend is provided by Informix/Illustra and Oracle.

Foil 3 Anatomy

Access Data Standardization - Superfluous access entries need not be considered when loading the logs into the database. Only entries with MIME type text/html are treated as valid accesses. This is not binding on the user, however: an option to choose the excluded types will be provided.
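
The filtering step above can be sketched as follows. This is a minimal illustration, not the project's loader: it assumes the MIME type is inferred from the requested path (the common log format does not record it), and it models the user option as a configurable set of allowed types. The function name `is_valid_access` and the sample paths are hypothetical.

```python
import mimetypes
from urllib.parse import urlsplit

# Assumption: by default only text/html counts as a valid access.
DEFAULT_ALLOWED = frozenset({"text/html"})


def is_valid_access(request_path, allowed=DEFAULT_ALLOWED):
    """Return True if the requested resource has an allowed MIME type."""
    path = urlsplit(request_path).path  # drop any query string
    if path.endswith("/"):
        # Assumption: directory requests are answered with an HTML index page.
        return "text/html" in allowed
    mime_type, _ = mimetypes.guess_type(path)
    return mime_type in allowed


# Hypothetical request paths taken from a log.
requests = ["/index.html", "/images/logo.gif", "/projects/", "/paper.ps"]
kept = [r for r in requests if is_valid_access(r)]
```

Passing a larger `allowed` set, for example `{"text/html", "image/gif"}`, relaxes the default restriction.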

Foil 4 Host Identification

Since the Web server logs the hostname of the client, decomposing this hostname reveals geographic and/or organizational information about the client.
Thus a hostname such as saturn5.sun.com reveals that the client originates from Sun Microsystems.
A client from Canada would have a hostname like ts2-08.edm.istar.ca, where the .ca suffix identifies the country as Canada.
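
A minimal sketch of such a decomposition is shown below. The lookup tables are deliberately tiny and purely illustrative, and the function name `decompose_hostname` is an assumption; treating the last two labels as the organization's domain is a simplification that does not hold for every country code.

```python
# Illustrative lookup tables (assumptions, not a complete registry).
GENERIC_TLDS = {
    "com": "commercial",
    "edu": "educational",
    "gov": "government",
    "org": "organization",
    "net": "network",
    "mil": "military",
}
COUNTRY_TLDS = {"ca": "Canada", "uk": "United Kingdom", "de": "Germany", "jp": "Japan"}


def decompose_hostname(hostname):
    """Split a hostname into its top-level domain and organization domain."""
    labels = hostname.lower().rstrip(".").split(".")
    tld = labels[-1]
    # Simplification: take the last two labels as the organization's domain.
    domain = ".".join(labels[-2:]) if len(labels) >= 2 else tld
    return {
        "tld": tld,
        "domain": domain,
        "category": GENERIC_TLDS.get(tld),  # None for country-code domains
        "country": COUNTRY_TLDS.get(tld),   # None for generic domains
    }


sun = decompose_hostname("saturn5.sun.com")
istar = decompose_hostname("ts2-08.edm.istar.ca")
```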

Foil 5 Unknown IP Addresses

Roughly 30% of accesses to the NPAC site come from hosts whose hostname is logged as an IP address. This accounts for nearly 1200 entries a day.
Rather than record these IP addresses as unknown, we decided to create a database of resolved IP addresses with their organizational information.
This permits us to identify, with a moderate degree of certainty, the network location of an IP address.
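
Detecting which log entries fall into this category could look like the sketch below. It assumes the host field of each entry is available as a string; the function names and the sample host list are hypothetical.

```python
import ipaddress


def is_unresolved_host(host):
    """Return True if the logged host field is a bare IP address."""
    try:
        ipaddress.ip_address(host)
    except ValueError:
        return False  # not an IP literal, so a hostname was logged
    return True


def unresolved_fraction(hosts):
    """Fraction of log entries whose host field is an IP address."""
    if not hosts:
        return 0.0
    return sum(is_unresolved_host(h) for h in hosts) / len(hosts)


# Sample host fields, using addresses and hostnames quoted in these foils.
hosts = ["saturn5.sun.com", "128.230.21.133", "ts2-08.edm.istar.ca", "209.21.23.155"]
fraction = unresolved_fraction(hosts)
```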

Foil 6 IP Resolution - I

To understand the temporal and geographic patterns of WWW server access, we developed a set of heuristics for mapping IP addresses to their geographic location. These heuristics rely on the domain names and the InterNIC whois database. The whois database contains information on domains, hosts, networks, and other Internet administrators. The information usually, though not always, includes a postal address.

Foil 7 IP Resolution - II

For example, querying the InterNIC database for the IP address 128.230.21.133 returns the following information: Syracuse University (NET-SYR-UNIV-NET) Syracuse, NY 13244
IP addresses can thus be mapped to a geographic location. For domains outside the US, the IP address is mapped to the address of the domain provider.
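
As an illustration, a one-line summary in the form quoted above can be split into organization, network handle, and location with a regular expression. This is a sketch under the assumption that the record has already been condensed to that single-line form; real whois output is multi-line and varies, and the names `SUMMARY_RE` and `parse_whois_summary` are hypothetical.

```python
import re

# Assumed summary shape: "<organization> (<NETWORK-HANDLE>) <location>".
SUMMARY_RE = re.compile(
    r"^(?P<org>.+?)\s+\((?P<handle>[A-Z0-9-]+)\)\s+(?P<location>.+)$"
)


def parse_whois_summary(line):
    """Parse a one-line whois summary; return None if it does not match."""
    match = SUMMARY_RE.match(line.strip())
    if match is None:
        return None
    return match.groupdict()


record = parse_whois_summary(
    "Syracuse University (NET-SYR-UNIV-NET) Syracuse, NY 13244"
)
```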

Foil 8 The IP Database

We have created a database that stores these resolved IP addresses with their organizational information.
This allows us to query our database locally before querying the whois database, thus saving bandwidth.
The database currently holds 20,000 entries for IP addresses that have accessed the NPAC Web site.

Foil 9 The IP Database - II

Initially the query frequency to the whois database was high, but as our database has grown, 50% of queries are now answered locally.
This reflects a "repeat clientele" visiting the NPAC site, whose organizational identities have already been established locally.
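
The local-first lookup and the resulting hit rate can be sketched as below. This is not the project's implementation: an in-memory SQLite table stands in for the Informix/Illustra or Oracle backend, the remote whois query is injected as a plain function, and the names `IPResolver`, `ip_org`, and `fake_whois` are hypothetical.

```python
import sqlite3


class IPResolver:
    """Resolve IPs against a local table first, falling back to whois."""

    def __init__(self, whois_lookup):
        self.whois_lookup = whois_lookup  # callable: ip -> organization name
        self.db = sqlite3.connect(":memory:")  # stand-in for the real backend
        self.db.execute(
            "CREATE TABLE ip_org (ip TEXT PRIMARY KEY, org TEXT NOT NULL)"
        )
        self.local_hits = 0
        self.remote_queries = 0

    def resolve(self, ip):
        row = self.db.execute(
            "SELECT org FROM ip_org WHERE ip = ?", (ip,)
        ).fetchone()
        if row is not None:
            self.local_hits += 1
            return row[0]
        # Not known locally: ask whois once and remember the answer.
        org = self.whois_lookup(ip)
        self.remote_queries += 1
        self.db.execute("INSERT INTO ip_org (ip, org) VALUES (?, ?)", (ip, org))
        return org

    def local_hit_rate(self):
        total = self.local_hits + self.remote_queries
        return self.local_hits / total if total else 0.0


def fake_whois(ip):
    """Stand-in for a real whois query, using results quoted in these foils."""
    known = {
        "209.21.23.155": "West Coast Online",
        "128.230.21.133": "Syracuse University",
    }
    return known.get(ip, "unknown")


resolver = IPResolver(fake_whois)
for ip in ["209.21.23.155", "128.230.21.133", "209.21.23.155", "128.230.21.133"]:
    resolver.resolve(ip)
rate = resolver.local_hit_rate()
```

In this toy run each address is seen twice, so the second visit of each is answered locally.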

Foil 10 Schematic of IP query

[Schematic of IP query. Components: NPAC Web Server, InterNIC Whois Database, Local Database. Example shown: remote host IP 209.21.23.155; query "whois 209.21.23.155"; result "West Coast Online", shown for both the whois query and the Local Database.]

Foil 11 Visualization

Datasets are visualized using the SciVis application.
The database is queried using the Statistic Java Application Tool, and the retrieved data is then visualized using SciVis.
Databases are queried using Java Database Connectivity (JDBC) routines.

Foil 12 Databases

Since the application is written in Java with JDBC, we demonstrate the capabilities of the Statistics Package across multiple databases.
The databases being queried in this case are Illustra/Informix and Oracle.
They are Unix and NT based, respectively.
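
The portability argument can be illustrated with a small sketch. It swaps in Python's DB-API for JDBC, which plays a comparable role: query code written against the common interface accepts any conforming connection. Two in-memory SQLite databases stand in for the Unix and NT backends, and the table `access_log`, its columns, and the sample rows are hypothetical.

```python
import sqlite3


def hits_per_domain(conn):
    """Count accesses per domain on any DB-API connection."""
    cur = conn.cursor()
    cur.execute(
        "SELECT domain, COUNT(*) FROM access_log GROUP BY domain ORDER BY domain"
    )
    return cur.fetchall()


def make_demo_db(rows):
    """Build an in-memory database holding a hypothetical access_log table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE access_log (host TEXT, domain TEXT)")
    conn.executemany("INSERT INTO access_log VALUES (?, ?)", rows)
    return conn


# Two separate databases queried by the same, unchanged function.
first_db = make_demo_db([
    ("saturn5.sun.com", "sun.com"),
    ("saturn5.sun.com", "sun.com"),
    ("ts2-08.edm.istar.ca", "istar.ca"),
])
second_db = make_demo_db([("ts2-08.edm.istar.ca", "istar.ca")])

first_counts = hits_per_domain(first_db)
second_counts = hits_per_domain(second_db)
```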

Foil 13 Mining

Owing to the large size of the datasets (~3 MB of data a day), the logs provide ample scope for mining applications, including pattern and correlation analysis.
SciVis has built-in tools that permit us to perform some of these functions.

Foil 14 Other Tools

Scripts analyze the referrer log and the agent log to extract the relevant information.
Referrer logs record the "point of entry" of a Web client into the site.
They also reveal the keywords visitors used to find the site in the various Internet search engines and directories. The major search engines covered are Alta Vista, Infoseek, Yahoo, and Excite.
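
Keyword extraction from a referrer URL might be sketched as follows. The query-parameter name assumed for each engine and the example URLs are illustrative assumptions that should be checked against real referrer entries; `extract_keywords` is a hypothetical name, not one of the project's scripts.

```python
from urllib.parse import parse_qs, urlsplit

# Assumed query-parameter names per engine; verify against real referrer data.
SEARCH_PARAMS = {
    "altavista": "q",
    "infoseek": "qt",
    "yahoo": "p",
    "excite": "search",
}


def extract_keywords(referrer):
    """Return (engine, keywords) for a search-engine referrer, else None."""
    parts = urlsplit(referrer)
    host = (parts.hostname or "").lower()
    for engine, param in SEARCH_PARAMS.items():
        if engine in host:
            values = parse_qs(parts.query).get(param)
            if values:
                return engine, values[0].split()
    return None


# Hypothetical referrer entries.
hit = extract_keywords("http://www.altavista.com/cgi-bin/query?q=parallel+computing")
miss = extract_keywords("http://www.npac.syr.edu/index.html")
```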

© Northeast Parallel Architectures Center, Syracuse University, npac@npac.syr.edu

Page produced by wwwfoil on Sat Nov 28 1998