Given by Jake Kim, Deepak Ramanathan at ARL End of Year 2 Review Aberdeen Sheraton 4 Points on July 29-30 98. Foils prepared August 2 98
Summary of Material
This system logs Web accesses and provides batch and interactive views of the archive |
We describe a system that maps IP addresses to names |
Storage DataMining and Visualization |
ARL Year 2 Review Aberdeen July 29-30 98 |
Jake Kim and Deepak Ramanathan |
NPAC Syracuse University |
111 College Place |
Syracuse NY 13244-4100 |
The primary objective of the project is to create a set of tools that permit visualization of accesses to a Web site. Besides visualization, these tools also serve as instruments for performing data mining operations on the log data. |
The database backend is provided by Informix/Illustra and Oracle. |
Access Data Standardization - Superfluous access statistics need not be considered while loading the logs into the database. Only entries of MIME type text/html are considered valid access entries. However, this is not binding on the user; an option to choose which types to exclude will be given. |
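The filtering step can be sketched as follows. This is a minimal illustration, not the project's loader: it assumes the MIME type is inferred from the extension of the requested path, that a directory request is served as an HTML index page, and that the user option is expressed as a set of types to keep. The function name `is_valid_access` and the sample paths are hypothetical.

```python
import mimetypes

def is_valid_access(path, include_types=("text/html",)):
    """Return True if the requested path appears to be one of the kept MIME types.

    Assumption: the type is guessed from the file extension, since a plain
    access log line does not carry the MIME type itself.
    """
    # Drop any query string before looking at the extension.
    clean_path = path.split("?", 1)[0]
    if clean_path.endswith("/"):
        # Assumption: a directory request is answered with an HTML index page.
        return "text/html" in include_types
    guessed, _encoding = mimetypes.guess_type(clean_path)
    return guessed in include_types

# Hypothetical request paths taken from an access log.
requests = ["/index.html", "/images/logo.gif", "/projects/", "/paper.ps"]
valid = [p for p in requests if is_valid_access(p)]
print(valid)
```

Passing a different `include_types` tuple changes which entries are loaded, which is one way the user-selectable exclusion option could be exposed.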
Since the Web Server logs the hostname of the client, a decomposition of this hostname would reveal geographic and/or organizational information about the client. |
Thus a hostname such as saturn5.sun.com reveals that the client originates from Sun Microsystems. |
A client from Canada would have a hostname like ts2-08.edm.istar.ca, where the .ca suffix identifies the country as Canada. |
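The decomposition described above might look like the following sketch. The lookup tables are a small illustrative subset, and the function name `decompose_hostname` and the returned field names are assumptions made for this example.

```python
# Illustrative subsets only; a real table would cover all country codes.
COUNTRY_CODES = {"ca": "Canada", "uk": "United Kingdom", "de": "Germany"}
GENERIC_TLDS = {
    "com": "commercial", "edu": "educational", "gov": "government",
    "org": "organization", "net": "network", "mil": "military",
}

def decompose_hostname(hostname):
    """Split a hostname into its registered domain and top-level domain."""
    labels = hostname.lower().rstrip(".").split(".")
    tld = labels[-1]
    # Last two labels, e.g. "sun.com" or "istar.ca".
    domain = ".".join(labels[-2:]) if len(labels) >= 2 else tld
    if tld in COUNTRY_CODES:
        kind, meaning = "country", COUNTRY_CODES[tld]
    elif tld in GENERIC_TLDS:
        kind, meaning = "organization type", GENERIC_TLDS[tld]
    else:
        kind, meaning = "unknown", None
    return {"domain": domain, "tld": tld, "kind": kind, "meaning": meaning}

sun = decompose_hostname("saturn5.sun.com")
istar = decompose_hostname("ts2-08.edm.istar.ca")
print(sun["domain"], sun["meaning"])
print(istar["domain"], istar["meaning"])
```

Taking the last two labels is a simplification: it would misread country domains that use a second-level category, such as names ending in .co.uk.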
Roughly 30% of accesses to the NPAC site are from hosts whose hostname is an IP address. This accounts for nearly 1200 entries a day. |
Rather than record these IP addresses as unknown, we decided to create a database of resolved IP addresses with their organizational information. |
This permits us to identify with a moderate degree of certainty the network location of an IP address |
To understand the temporal and geographic patterns of WWW server access, we developed a set of heuristics for mapping IP addresses to their geographic location. These heuristics rely on the domain names and the InterNIC whois database. The whois database contains information on domains, hosts, networks, and other Internet administrators. The information usually, though not always, includes a postal address. |
Thus, querying the InterNIC database for the IP address 128.230.21.133 returns the following information: Syracuse University (NET-SYR-UNIV-NET) Syracuse, NY 13244 |
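A result of this shape can be split into organization and postal fields with a short parser. The sketch below assumes the one-line layout shown in the example above (organization, network handle in parentheses, then city, state, and ZIP code); real whois output varies and, as noted, does not always include a postal address, so the function returns None when the pattern does not match. The name `parse_whois_summary` is hypothetical.

```python
import re

# Assumed layout: "<organization> (<handle>) <city>, <state> <zip>"
WHOIS_SUMMARY = re.compile(
    r"^(?P<organization>.+?)\s+\((?P<handle>[A-Z0-9-]+)\)\s+"
    r"(?P<city>.+?),\s*(?P<state>[A-Z]{2})\s+(?P<zip>\d{5}(?:-\d{4})?)$"
)

def parse_whois_summary(text):
    """Return a dict of fields, or None if the text is not in the assumed layout."""
    match = WHOIS_SUMMARY.match(text.strip())
    return match.groupdict() if match else None

record = parse_whois_summary(
    "Syracuse University (NET-SYR-UNIV-NET) Syracuse, NY 13244"
)
print(record)
```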
IP addresses can now be mapped to a geographic location. For domains outside the US, the IP address is mapped to the address of the domain provider. |
We have created a database that stores these resolved IP addresses with their organizational information. |
This allows us to query our database locally before querying the whois database, thus saving precious bandwidth. |
The database currently holds 20,000 entries for IP addresses that have accessed the NPAC Web site. |
Initially the query frequency to the Whois database was high, but with the growth of our database, 50% of queries are now answered locally. |
This reflects a "repeat clientele" of visitors to the NPAC site whose organizational identities have already been established locally. |
Diagram: NPAC Web Server, InterNIC Whois Database, Local Database
Remote host IP 209.21.23.155 is looked up with whois 209.21.23.155; the result, West Coast Online, is returned by the InterNIC Whois Database and by the Local Database.
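The local-first lookup can be sketched as a small cache in front of the whois query. This is an illustration under stated assumptions, not the project's code: sqlite3 stands in for the actual database, the table `resolved_ip` and its columns are hypothetical, and `fake_whois` is a canned stand-in for a real whois query so that the example makes no network calls.

```python
import sqlite3

def make_resolver(conn, whois_lookup):
    """Build a resolver that consults the local table before calling whois."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS resolved_ip "
        "(ip TEXT PRIMARY KEY, organization TEXT)"
    )
    stats = {"local": 0, "remote": 0}

    def resolve(ip):
        row = conn.execute(
            "SELECT organization FROM resolved_ip WHERE ip = ?", (ip,)
        ).fetchone()
        if row is not None:
            stats["local"] += 1          # answered without a whois query
            return row[0]
        organization = whois_lookup(ip)  # remote query, costs bandwidth
        stats["remote"] += 1
        conn.execute(
            "INSERT INTO resolved_ip (ip, organization) VALUES (?, ?)",
            (ip, organization),
        )
        conn.commit()
        return organization

    return resolve, stats

def fake_whois(ip):
    # Hypothetical stand-in returning the example result from the diagram.
    return {"209.21.23.155": "West Coast Online"}.get(ip, "unknown")

conn = sqlite3.connect(":memory:")
resolve, stats = make_resolver(conn, fake_whois)
first = resolve("209.21.23.155")   # resolved remotely, then stored
second = resolve("209.21.23.155")  # answered from the local table
print(first, second, stats)
```

The `stats` counters give the share of lookups answered locally, which is the quantity behind the 50% figure quoted above.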
Visualization of datasets is done using the SciVis application
The database is queried using the Statistics Java Application Tool, and the retrieved data is then visualized using the SciVis application
Databases are queried using Java Database Connectivity (JDBC) routines |
Since the application is built on Java/JDBC, we demonstrate the capabilities of the Statistics Package across multiple databases
The databases being queried in this case are Illustra/Informix and Oracle
The databases are Unix and NT based respectively. |
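The benefit of a driver-neutral interface is that the same query code runs unchanged against different backends. The project uses Java/JDBC for this; the sketch below shows the same idea with Python's DB-API instead, using sqlite3 as a stand-in for Illustra/Informix or Oracle. The table `access_log`, its columns, and the sample rows are hypothetical.

```python
import sqlite3

def accesses_per_domain(conn):
    """Count accesses per domain using only driver-neutral DB-API calls."""
    cur = conn.cursor()
    cur.execute(
        "SELECT domain, COUNT(*) FROM access_log "
        "GROUP BY domain ORDER BY domain"
    )
    return cur.fetchall()

# Stand-in database with a few illustrative rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE access_log (host TEXT, domain TEXT)")
conn.executemany(
    "INSERT INTO access_log VALUES (?, ?)",
    [
        ("saturn5.sun.com", "sun.com"),
        ("ts2-08.edm.istar.ca", "istar.ca"),
        ("mars.sun.com", "sun.com"),  # invented host for illustration
    ],
)
counts = accesses_per_domain(conn)
print(counts)
```

Only the connection setup is backend-specific; `accesses_per_domain` accepts any DB-API connection whose database holds a table of this shape.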
Owing to the large size of the datasets (~3 MB of data a day), this provides a comprehensive range for data mining applications, including pattern and correlation analysis
SciVis has built-in tools that permit us to perform some of these functions.
Scripts analyze the referrer log and the agent log to extract the relevant information. |
Referrer Logs contain information as to the "point of entry" of a web client into the site. |
Referrer logs also reveal the keywords visitors used to find the site in the various Internet search engines and directories. The major search engines included are Alta Vista, Infoseek, Yahoo, and Excite.
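Keyword extraction from a referrer entry can be sketched as follows. The mapping from search engine to query-parameter name is an assumption made for illustration and would need to be checked against real referrer entries; the function name `extract_keywords` and the example URLs are hypothetical.

```python
from urllib.parse import urlparse, parse_qs

# Assumed parameter names, for illustration only; verify against real logs.
SEARCH_PARAMS = {
    "altavista": "q",
    "infoseek": "qt",
    "yahoo": "p",
    "excite": "search",
}

def extract_keywords(referrer):
    """Return (engine, keywords) for a search-engine referrer, else None."""
    parsed = urlparse(referrer)
    host = parsed.netloc.lower()
    for engine, param in SEARCH_PARAMS.items():
        if engine in host:
            values = parse_qs(parsed.query).get(param)
            # parse_qs decodes "+" and percent-escapes in the query string.
            return engine, (values[0].split() if values else [])
    return None

hit = extract_keywords("http://altavista.example/query?pg=q&q=parallel+computing")
miss = extract_keywords("http://example.org/links.html")
print(hit, miss)
```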