Full HTML for Basic Foilset: Summary of Database Storage and Search of Web Logs

Given by Jake Kim and Deepak Ramanathan at Tutorial: ITEA HPCC Conference, Aberdeen, Md., on July 13, 1998. Foils prepared July 15, 1998.
Outside Index Summary of Material


This system logs Web accesses and provides batch and interactive views of the archive.
We describe a system that maps IP addresses to names.

Table of Contents for full HTML of Summary of Database Storage and Search of Web Logs

1 Web Logs Database
2 Objectives
3 Anatomy
4 IP Resolution - 1
5 IP Resolution - 2
6 Schematic of IP query
7 Locally stored IP addresses
8 Visualization
9 Databases
10 Mining
11 Other Tools

HTML version of Basic Foils prepared July 15 1998

Foil 1 Web Logs Database

From Summary of Database Storage and Search of Web Logs Tutorial: ITEA HPCC Conference Aberdeen Md. -- July 13 98. *
Full HTML Index
Storage, Data Mining, and Visualization
Jake Kim and Deepak Ramanathan
NPAC Syracuse University
111 College Place
Syracuse NY 13244-4100

Foil 2 Objectives

The primary objective of the project is to create a set of tools that permit visualization of accesses to a Web site. Besides visualization, these tools also serve as instruments for performing data mining operations on this data.
The database back end is provided by Informix/Illustra and Oracle, and the user interface connects to it through JDBC.

Foil 3 Anatomy

Access Data Standardization - Superfluous access statistics need not be considered while loading the logs into the database. By default, only entries with MIME type text/html are treated as valid accesses. This is not binding on the user, however: an option to choose which types to exclude will be provided.
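
The standardization step might be sketched as follows. This is a minimal Python sketch under stated assumptions, not the project's loader: it assumes each log entry is reduced to its request path, infers the MIME type from the file extension with the standard mimetypes module, and counts directory requests as text/html. The names guess_mime, filter_entries and excluded_types are hypothetical.

```python
import mimetypes


def guess_mime(path):
    """Guess a MIME type from a request path (extension-based heuristic)."""
    path = path.split("?", 1)[0]          # drop any query string
    if path.endswith("/"):
        return "text/html"                # assumption: directory index pages are HTML
    mime, _ = mimetypes.guess_type(path)
    return mime or "application/octet-stream"


def filter_entries(paths, excluded_types=None):
    """Keep the entries that count as valid accesses.

    With no exclusion set, only text/html entries are kept (the default
    described above). With an exclusion set, every entry whose guessed
    type is not in that set is kept.
    """
    if excluded_types is None:
        return [p for p in paths if guess_mime(p) == "text/html"]
    return [p for p in paths if guess_mime(p) not in excluded_types]
```

Calling filter_entries(paths) applies the default text/html rule, while filter_entries(paths, excluded_types={"image/gif"}) models a user who only wants to drop GIF requests.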

Foil 4 IP Resolution - 1

To understand the temporal and geographic patterns of WWW server access, we developed a set of heuristics for mapping IP addresses to their geographic locations. These heuristics rely on domain names and the InterNIC whois database. The whois database contains information on domains, hosts, networks, and Internet administrators. This information usually, though not always, includes a postal address.

Foil 5 IP Resolution - 2

Thus, querying the InterNIC database for the IP address 128.230.21.133 returns the following information: Syracuse University (NET-SYR-UNIV-NET) Syracuse, NY 13244
IP addresses can now be mapped to a geographic location. For domains outside the US, the IP address is mapped to the address of the domain provider.
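
As an illustration of turning such a result into a location, the sketch below parses the one-line summary quoted above into its parts. It assumes the result has exactly the shape "Organization (HANDLE) City, ST ZIP"; real whois output is multi-line and varies, so the pattern and the name parse_whois_summary are hypothetical and cover only this form.

```python
import re

# Pattern for the one-line form quoted in the foil:
#   Organization (NETWORK-HANDLE) City, ST 12345
WHOIS_SUMMARY = re.compile(
    r"^(?P<org>.+?)\s*\((?P<handle>[A-Z0-9-]+)\)\s*"
    r"(?P<city>[^,]+),\s*(?P<state>[A-Z]{2})\s+(?P<zip>\d{5}(?:-\d{4})?)$"
)


def parse_whois_summary(line):
    """Return a dict with org, handle, city, state and zip, or None if the line does not match."""
    match = WHOIS_SUMMARY.match(line.strip())
    return match.groupdict() if match else None
```

A result without a US-style postal address simply returns None, which matches the caveat that whois records do not always include one.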

Foil 6 Schematic of IP query

[Schematic] NPAC Web Server (remote host IP 209.21.23.155) -> whois 209.21.23.155 -> InterNIC Whois Database -> result: West Coast Online -> Local Database (result: West Coast Online)

Foil 7 Locally stored IP addresses

Once the whois query results have been obtained, the data is stored locally in our database. Currently, we have 20,000 such IP addresses and their resolved query results stored in our database.
Thus, before launching a query on the InterNIC database, we first check whether the IP address already resides in our database, saving precious bandwidth.
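
The check-locally-first logic can be sketched as a small cache in front of the remote lookup. This is a minimal Python sketch, not the project's code: it uses the standard sqlite3 module as a stand-in for the Informix/Illustra or Oracle back end, and the table name ip_cache, the function make_resolver and the whois_lookup callable are all hypothetical.

```python
import sqlite3


def make_resolver(conn, whois_lookup):
    """Build a resolve(ip) function that consults the local table before the remote lookup.

    whois_lookup is any callable taking an IP string and returning the owner
    string; it stands in for the real InterNIC whois query.
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS ip_cache (ip TEXT PRIMARY KEY, owner TEXT)"
    )

    def resolve(ip):
        row = conn.execute(
            "SELECT owner FROM ip_cache WHERE ip = ?", (ip,)
        ).fetchone()
        if row is not None:
            return row[0]                 # cache hit: no network traffic
        owner = whois_lookup(ip)          # cache miss: one remote query
        conn.execute(
            "INSERT INTO ip_cache (ip, owner) VALUES (?, ?)", (ip, owner)
        )
        conn.commit()
        return owner

    return resolve
```

With resolve = make_resolver(sqlite3.connect(":memory:"), my_lookup), a second call for the same address is answered from the table and my_lookup is not invoked again.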

Foil 8 Visualization

Visualization of datasets is done using NPAC's Java visualization package, SciVis.
The database is queried using a statistical analysis Java application tool, and the retrieved data is then visualized using the SciVis application.
Databases are queried using Java Database Connectivity (JDBC) routines.
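
To give a feel for the query-then-visualize flow, the sketch below runs one aggregate query whose rows could be handed to a plotting tool. It is written in Python with the standard sqlite3 module purely as a stand-in for the JDBC routines; the access_log table, its columns, and the sample rows are made up for illustration and do not describe the project's actual schema or data.

```python
import sqlite3


def hits_per_day(conn):
    """Return (day, hits) rows ordered by day."""
    return conn.execute(
        "SELECT day, COUNT(*) FROM access_log GROUP BY day ORDER BY day"
    ).fetchall()


def demo_connection():
    """Create an in-memory database with a few illustrative rows (not real data)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE access_log (host TEXT, day TEXT)")
    conn.executemany(
        "INSERT INTO access_log (host, day) VALUES (?, ?)",
        [
            ("128.230.21.133", "1998-07-13"),
            ("209.21.23.155", "1998-07-13"),
            ("128.230.21.133", "1998-07-14"),
        ],
    )
    return conn
```

Because the query is plain SQL, the same statement could in principle be issued through JDBC against any of the back ends named in this foilset.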

Foil 9 Databases

Since the application is built on Java/JDBC, we demonstrate the capabilities of the Statistics Package across multiple databases.
The databases queried in this case are Illustra/Informix and Oracle.
The databases are Unix and NT based, respectively.

Foil 10 Mining

Because the datasets are large (about 3 MB of data a day), they provide ample scope for mining applications, including pattern and correlation analysis.
SciVis has built-in tools that permit us to perform some of these functions.

Foil 11 Other Tools

Scripts analyze the referrer log and the agent log to extract the relevant information and store it in the database.
Referrer logs contain information about the "point of entry" of a Web client into the site.
They also reveal the keywords visitors used to find the site in the various Internet search engines and directories. The major search engines included are Alta Vista, Infoseek, Yahoo, and Excite.
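
Keyword extraction from a referrer URL might look like the following minimal Python sketch. The mapping from search engine to query-string parameter is an assumption made for illustration (the foils do not list the parameters), and the names QUERY_PARAMS and extract_keywords are hypothetical.

```python
from urllib.parse import urlparse, parse_qs

# Assumed query-string parameter per engine; adjust to the URLs actually
# seen in the referrer log.
QUERY_PARAMS = {
    "altavista": "q",
    "infoseek": "qt",
    "yahoo": "p",
    "excite": "search",
}


def extract_keywords(referrer):
    """Return the search keywords found in a referrer URL, or an empty list."""
    parsed = urlparse(referrer)
    host = parsed.netloc.lower()
    for engine, param in QUERY_PARAMS.items():
        if engine in host:
            values = parse_qs(parsed.query).get(param, [])
            # parse_qs decodes "+" and %-escapes, so splitting on whitespace is enough
            return values[0].split() if values else []
    return []                             # not a recognized search engine
```

Referrers from hosts outside the table yield an empty list, so the same function can be applied to every line of the referrer log.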

© Northeast Parallel Architectures Center, Syracuse University, npac@npac.syr.edu

Page produced by wwwfoil on Sat Jul 18 1998