Given by Jake Kim, Deepak Ramanathan at Tutorial: ITEA HPCC Conference Aberdeen Md. on July 13 98. Foils prepared July 15 1998
Outside Index
Summary of Material
This system logs Web accesses and provides batch and interactive views of the archive |
We describe a system to map IP addresses to names |
Storage DataMining and Visualization |
Jake Kim and Deepak Ramanathan |
NPAC Syracuse University |
111 College Place |
Syracuse NY 13244-4100 |
The primary objective of the project is to create a set of tools that permit visualization of accesses to a Web site. Besides visualization, these tools also serve to perform data mining operations on this data. |
The database backend is provided by Informix/Illustra and Oracle; the user interface reaches it through a JDBC interface. |
Access Data Standardization - Superfluous access statistics need not be considered while loading the logs into the database. By default, only entries of MIME type text/html are considered valid access entries. However, this is not binding on the user, and an option to choose exclusion types will be given. |
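The standardization rule above can be sketched as a small filter. This is a minimal illustration, not the project's loader: the AccessEntry record, its mime_type field, and the excluded_types option name are all hypothetical, and it assumes the MIME type of each entry is already known when the log is loaded.

```python
from dataclasses import dataclass


@dataclass
class AccessEntry:
    host: str       # remote host IP or name
    path: str       # requested URL path
    mime_type: str  # hypothetical field: MIME type of the served document


def filter_entries(entries, excluded_types=None):
    """Return the entries that count as valid accesses.

    With no user choice, only text/html entries are kept (the default
    described above). If the user supplies exclusion types, every entry
    whose type is not in that list is kept instead.
    """
    if excluded_types is None:
        return [e for e in entries if e.mime_type.lower() == "text/html"]
    excluded = {t.lower() for t in excluded_types}
    return [e for e in entries if e.mime_type.lower() not in excluded]


sample_entries = [
    AccessEntry("128.230.21.133", "/index.html", "text/html"),
    AccessEntry("128.230.21.133", "/logo.gif", "image/gif"),
    AccessEntry("209.21.23.155", "/paper.ps", "application/postscript"),
]
```

Calling filter_entries(sample_entries) keeps only the HTML page, while passing excluded_types=["image/gif"] keeps everything except the image.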
To understand the temporal and geographic patterns of WWW server access, we developed a set of heuristics for mapping IP addresses to their geographic location. These heuristics rely on the domain names and the InterNIC whois database. The whois database contains information on domains, hosts, networks, and other Internet administrators. The information usually, though not always, includes a postal address. |
For example, querying the InterNIC database for the IP address 128.230.21.133 returns the following information: Syracuse University (NET-SYR-UNIV-NET) Syracuse, NY 13244 |
IP addresses can now be mapped to a geographic location. For domains outside the US, the IP address is mapped to the address of the domain provider. |
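As an illustration of turning such a whois summary into a location, the sketch below parses the one-line form shown in the Syracuse example. The regular expression and the function name parse_whois_summary are assumptions made for this example; real whois responses span several lines, vary in layout, and do not always include a postal address, so a production heuristic would need more cases than this.

```python
import re

# Assumed layout, modeled on the single example above:
#   <organization> (<network handle>) <city>, <state> <ZIP>
WHOIS_SUMMARY = re.compile(
    r"^(?P<org>.+?)\s+\((?P<handle>[A-Z0-9-]+)\)\s+"
    r"(?P<city>.+?),\s*(?P<state>[A-Z]{2})\s+(?P<zip>\d{5}(?:-\d{4})?)$"
)


def parse_whois_summary(text):
    """Return a dict of org, handle, city, state, zip, or None if no match."""
    normalized = " ".join(text.split())  # collapse newlines and repeated spaces
    match = WHOIS_SUMMARY.match(normalized)
    return match.groupdict() if match else None


record = parse_whois_summary(
    "Syracuse University (NET-SYR-UNIV-NET) Syracuse, NY 13244"
)
```

Entries that do not follow this US-style layout return None, which is where the fallback to the domain provider address would apply.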
Figure: whois lookup flow. The NPAC Web server logs the remote host IP 209.21.23.155; the query "whois 209.21.23.155" against the InterNIC whois database returns "West Coast Online"; the same result is stored in the local database. |
Once the whois query results have been obtained, the data is stored locally in our database. Currently, we have 20,000 such IP addresses and their resolved queries stored there.
Before launching a query on the InterNIC database, we first check whether the IP address already resides in our database, thus saving precious bandwidth.
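The check-local-first strategy can be sketched as follows. This is a minimal illustration under stated assumptions: an in-memory SQLite database stands in for the project's local database, the table name ip_cache is hypothetical, and the actual whois query is passed in as a function (whois_lookup) so the sketch makes no network calls.

```python
import sqlite3


def open_cache(path=":memory:"):
    """Open the local store of resolved IP addresses (SQLite stand-in)."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS ip_cache ("
        "ip TEXT PRIMARY KEY, owner TEXT NOT NULL)"
    )
    return conn


def resolve(conn, ip, whois_lookup):
    """Return the owner of an IP, querying whois only on a cache miss."""
    row = conn.execute(
        "SELECT owner FROM ip_cache WHERE ip = ?", (ip,)
    ).fetchone()
    if row is not None:
        return row[0]  # already resolved: no remote query needed
    owner = whois_lookup(ip)  # remote query, performed once per IP
    conn.execute("INSERT INTO ip_cache (ip, owner) VALUES (?, ?)", (ip, owner))
    conn.commit()
    return owner
```

With this arrangement, a second call to resolve for 209.21.23.155 is answered from the local table and the whois function is not invoked again.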
Visualization of datasets is done using NPAC's Java visualization package, SciVis
The database is queried using a statistical analysis Java application tool, and the retrieved data is then visualized using the SciVis application
Databases are queried using Java Database Connectivity (JDBC) routines |
Since the application is written in Java with JDBC, we demonstrate the capabilities of the statistics package across multiple databases
Databases that are being queried in this case are Illustra/Informix and Oracle |
The databases are Unix and NT based respectively. |
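The idea of running one statistics query unchanged against several databases can be sketched with Python's DB-API, which plays a role similar to JDBC. This is not the project's Java code: two in-memory SQLite databases stand in for the Illustra/Informix and Oracle servers, and the access_log table, its columns, and the function names are hypothetical.

```python
import sqlite3

# The same SQL text is sent to every database.
HITS_PER_HOST = (
    "SELECT host, COUNT(*) FROM access_log GROUP BY host ORDER BY host"
)


def make_demo_db(rows):
    """Build a stand-in database holding (host, path) access records."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE access_log (host TEXT, path TEXT)")
    conn.executemany("INSERT INTO access_log VALUES (?, ?)", rows)
    return conn


def hits_per_host(connections):
    """Run the query on each named connection and collect the results."""
    results = {}
    for name, conn in connections.items():
        cursor = conn.cursor()
        cursor.execute(HITS_PER_HOST)
        results[name] = cursor.fetchall()
    return results


databases = {
    "illustra_standin": make_demo_db([
        ("128.230.21.133", "/"),
        ("128.230.21.133", "/a.html"),
        ("209.21.23.155", "/"),
    ]),
    "oracle_standin": make_demo_db([("209.21.23.155", "/")]),
}
report = hits_per_host(databases)
```

Because only the connection objects differ, swapping a stand-in for a real database driver would leave the query code untouched, which is the portability the foils attribute to JDBC.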
Owing to the large size of the datasets (~3 MB of data a day), this provides a comprehensive range for mining applications, including pattern and correlation analysis
SciVis has built-in tools that permit us to perform some of these functions.
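A simple instance of temporal pattern analysis is counting accesses per hour of day. The sketch below is an illustration only and is not a SciVis function: it assumes timestamps in the Common Log Format style (for example 13/Jul/1998:14:05:09 -0400), which the foils do not confirm, and the function name hits_per_hour is hypothetical.

```python
from collections import Counter
from datetime import datetime

# Assumed timestamp layout (Common Log Format style); parsing the month
# abbreviation with %b assumes an English or C locale.
TIMESTAMP_FORMAT = "%d/%b/%Y:%H:%M:%S %z"


def hits_per_hour(timestamps):
    """Return a dict mapping hour of day (0-23) to number of accesses."""
    counts = Counter(
        datetime.strptime(ts, TIMESTAMP_FORMAT).hour for ts in timestamps
    )
    return dict(sorted(counts.items()))


hourly = hits_per_hour([
    "13/Jul/1998:14:05:09 -0400",
    "13/Jul/1998:14:59:00 -0400",
    "13/Jul/1998:15:00:01 -0400",
])
```

The resulting hour-to-count table is the kind of small dataset that could then be handed to a visualization tool for plotting.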
Scripts analyze the referrer log and the agent log to extract the relevant log information and load it into the database.
Referrer Logs contain information as to the "point of entry" of a web client into the site. |
Referrer logs also reveal the keywords used by visitors to find the site in the various Internet search engines and directories. The major search engines included are Alta Vista, Infoseek, Yahoo, and Excite.
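Keyword extraction from a referrer URL can be sketched as below. This is an assumption-laden illustration, not the project's script: the query parameter name assigned to each engine (q, qt, p, search) and the example URLs are assumptions about how those engines encoded searches, and the function name search_keywords is hypothetical.

```python
from urllib.parse import parse_qs, urlsplit

# Assumed query-string parameter carrying the search terms for each engine.
SEARCH_PARAMS = {
    "altavista": "q",
    "infoseek": "qt",
    "yahoo": "p",
    "excite": "search",
}


def search_keywords(referrer):
    """Return (engine, [keywords]) for a search-engine referrer, else None."""
    parts = urlsplit(referrer)
    host = (parts.hostname or "").lower()
    for engine, param in SEARCH_PARAMS.items():
        if engine in host:
            values = parse_qs(parts.query).get(param)
            if values:
                # parse_qs already decodes '+' and %xx escapes
                return engine, values[0].split()
    return None
```

For a referrer such as http://www.altavista.com/cgi-bin/query?q=parallel+computing, the function reports the engine together with the two search terms; referrers from ordinary pages return None and can be treated as plain points of entry.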