Abstract
Technological improvements have enhanced the education experience and introduced new terms such as distance learning. However, today's education systems lack assessment tools that keep pace with technology-based instruction. Emerging topics in Computer Science, notably Data Mining, offer technologies well suited to fill this gap. This project proposes an architecture that combines data mining and commodity technologies in an education environment.
Recent technological innovations have profoundly affected education systems, and distance learning has become a common technique. These technologies have been adapted to many academic courses, such as the ones offered by Syracuse University to Jackson State students. These courses use collaboration tools, mailing lists, and bulletin boards, and present students with a rich set of online materials and reference links on the World Wide Web.
Both synchronous and asynchronous techniques and resources are subject to continuous improvement. The computational science education group at the Northeast Parallel Architectures Center (NPAC) has developed a large repository of online course material, which includes lectures, tutorials, and programming examples in various languages. To provide regular synchronous interaction among students, teachers, and other learners, in addition to the asynchronous learning materials, the TANGO system was used to deliver CSC 499 over the Internet.
These improvements in technology-based education bring new needs. Our teaching experience has shown that this new mode of learning lacks the assessment tools that, in traditional systems, were provided through human interaction. Data Mining, a recent focus of Computer Science, opens the way to building automated assessment tools for technology-based education.
Traditional educational assessment uses student work such as exams and homework to assess student performance and uses questionnaires, interviews, etc. to assess the effectiveness of a particular course (with respect to other courses).
Many types of distance education involve the student's remote access to on-line course materials. This opens up additional possibilities for assessment based on access patterns to the materials. Analyzing how students use a site is critical for assessing both the students and the course materials. A new architecture for assessment on the World Wide Web is necessary for today's innovative distance education technologies. This architecture should cover the collection of students' access data to web resources, the transformation of the collected data into assessment information, the integration of students' personal registration information into the final evaluations, and the processing of all available data and information with recent data mining techniques.
The purpose of this project is to design an architecture that uses Data Mining tools and commodity technologies for the assessment of distance education. The way students use the online resources asynchronously reflects their learning behavior, yet current educational practice has no way to evaluate students' informal responses to the given materials. The proposed system is based on collecting access information for the web materials, i.e., web mining, and discovering users' access patterns with data mining tools. The results will be presented to the teacher and students as necessary, combined with online analytical processing tools. Other applications, such as Student Records, will also be involved in the analysis.
The student record database, the NPAC Grading System, holds useful information about the users of the Web resources. It stores all user registration information and the students' performance records, and it also keeps the students' access logs. Since the proposed system involves the same users, combining the two data sets will open opportunities for new kinds of analysis. Other systems usually have no user information in their web logs, or no registration information even when they have authentication mechanisms, so their analyses offer no opportunity to assess individual users. Furthermore, following a user over the long term is a rather new topic addressed in this proposed project.
Two other projects at NPAC are doing related work in collecting data about user access of materials. The first is the Smart Desk project by Jiangang Guo of the Pulsar group. In this project, Guo is collecting information about how the user is interacting with the Smart Desk. In addition to data from the web server logs, her data includes such things as mouse clicks and mouse trajectory time. She stores the data in a database, but has not yet done significant analysis. See http://www.pulsar.org/stats.html.
The other web access project is Deepak's collection of data from the NPAC web server logs. He analyzes the log files and stores the resulting processed data in a database. He also does some analysis and visualization of the data. See http://www.npac.syr.edu/stats
Our project would be significantly different from both of these in that we would want to track how individuals access materials. This data is not available from web server logs, and we would have to make significant software changes to obtain the data.
This project is also related to another NPAC project in its use of a database to keep information about people and the groups they belong to. Our project keeps information on students, both personal and performance data, and on the courses that they take. In the Tango project, information about people is kept in two ways. First, the WebWisdom NT authoring system keeps information about the authors. Second, in the collaborative system, a prototype database has been started for grouping participants into communities. After a brief discussion with Marek, we propose to collaborate on a unified design in which courses are treated as communities. We might call this Database Design for Distance Learning and Collaboration (or perhaps Object Design for ...). No reference is currently available for the work on collaboration communities.
As the popularity of the Web has exploded, many organizations have invested a tremendous amount of capital to operate sites on the Web. These Web sites provide communications and services to their employees, customers, and suppliers. With money invested in these sites, there is a strong desire to understand the effectiveness of such investments and to find ways to realize the potential opportunities provided by the Internet. As a result, it has become important to understand user surfing behavior. [IBM_ST]
Several Web server log analysis tools have been implemented. However, it is difficult to perform user-oriented data mining and analysis directly on the server log files because they tend to be ambiguous and incomplete. Some of these tools are very simple and do not attempt to identify individual user sessions; they are simply mechanisms through which a Web master can view the raw Web server statistics, such as hit counts and distributions based on geographic regions and the times or time intervals of visits. Examples of this type of tool include wwwstat (http://www.ics.uci.edu/pub/websoft/wwwstat), Analog (http://www.statslab.cam.ac.uk/~sret1/analog) [IBM_ST], Open Market Web Reporter (http://www.openmarket.com), WebTrends (http://www.webtrends.com), and net.analysis desktop (http://www.netgen.com). Mostly, these analysis packages are designed to handle low- to moderate-traffic servers, and they usually provide little or no analysis of the data relationships among the accessed files and directories within the Web space. [WebMiner]
To provide user-oriented Web usage analysis, user sessions must first be identified. More sophisticated analysis packages identify user sessions with some or all of the following three mechanisms. First, if the Web server provides cookies, it is a trivial task to formulate the session: every access to the Web server with the same cookie value belongs to a single session. Second, if the server does not provide cookies, it may require a log-in ID from each browser, and the analysis tool can use the log-in IDs to identify sessions. WEBMINER, a data mining tool, was developed on the assumption that log-in IDs are available; without log-in IDs it cannot perform its intended functions, and in fact most log records do not contain log-in IDs. Lastly, if the Web server provides neither cookies nor user IDs, the analyzer identifies sessions by host address. All accesses to the Web server from a given host address are considered one session until a predefined amount of time has passed between accesses. As mentioned previously, the use of a proxy or firewall causes all browsers behind a given proxy or firewall to be treated as a single user, so an identified session may in fact contain many independent user sessions. Several Web analyzers use only the host address to identify sessions, such as SurfReport (http://software.bienlogic.com/SurfReport) and NetTracker (http://www.sane.com/products/NetTracker). Other tools use a combination of methods to identify sessions, such as the Usage Analyst by Microsoft Corporation (previously Interse, http://www.interse.com) and WebTrends (http://www.webtrends.com).
WEBMINER automatically discovers association rules and sequential patterns from server access logs. Using different kinds of algorithms for discovering reference sequences, as in [CPY96], it performs various types of user traversal path analysis, such as identifying the most traversed paths through a Web locality. [WEBMINER]
There are a number of projects with innovative algorithms that try to identify user sessions by reconstructing user traversal paths. IBM's SpeedTracer, a World Wide Web usage mining and analysis tool, was developed to understand user surfing behavior by exploring the Web server log files with data mining techniques. With the popular Web log file formats, it is difficult to understand user surfing behavior or to perform user-oriented data mining and analysis directly on the server log files, because they tend to be ambiguous and incomplete. In SpeedTracer, IBM used new heuristic algorithms to identify users from the ambiguous Web log files. The basic idea is to analyze user traversal paths using the time intervals, IP addresses, URLs and their referrers, and the agent fields of successive accesses. This approach does not require "cookies" or user registration for session identification, so user privacy is completely protected. Once user sessions are identified, data mining algorithms are applied to discover the most common traversal paths and the groups of pages frequently visited together. Important user browsing patterns are manifested through the frequent traversal paths and page groups, helping the understanding of user surfing behavior. Three types of reports are prepared: user-based reports, path-based reports, and group-based reports. [IBM_ST]
Once user sessions are identified, statistics related to user behavior can be obtained, and data mining algorithms can be applied to discover the most common traversal paths and the groups of pages frequently visited together. Interesting user-based statistics include the top N referrers to a Web site, the top N pages most frequently visited by users, the top N pages from or into which users most frequently exit or enter a Web site, the top N browsers most frequently used, the top N IP hosts from which most users come, the demographics (by organization or by country) of users, the distribution of user session durations, the distribution of the number of pages visited during a user session, and the distribution of the depth or breadth of a user session. [IBM_ST]
To understand how students use the Web resources and to improve the assessment tools, the Web server log files should be analyzed. The draft architecture of the system is as follows.
The system starts by processing the web server logs; the fields that a typical server log records for each request are listed below.
The well-known techniques for solving the problem of proxy servers or firewalls masking user IPs generally require either user registration, log-ins, or the use of "cookies" between the Web server and the client browsers. With log-ins or cookies, a Web server can identify the distinct requests made by an individual user through a token carried between the user's browser and the server.
Some tools, like IBM's SpeedTracer, try to analyze user-oriented behavior from the regular server log files without requiring cookies or registration. Although such heuristic solutions may be satisfactory in many settings, they may not give the most reliable user identification, and they lack user identification in the long run: a user can be identified for a single session, or possibly a few sessions, by assuming connections from the same machine and accepting growing uncertainty about who is who. [SILK] proposed an alternative approach to reconstructing user traversal paths. Instead of using the referrer page, information about the topology (i.e., hyperlink structure) of a Web site, together with other heuristics, was used to identify legitimate traversals. A software agent was first used to perform an exhaustive breadth-first traversal of the pages within the Web site in order to construct the topology. However, the topology is not really needed if referrer information is available, which it is most of the time. The remaining possible solutions are cookies and user registration. Using cookies gives flexibility to the users but causes difficulties for the system. First, cookies may not persist: a user can always delete a cookie or refuse to accept it. Second, it is not easy to match user identities from cookies when other user data is considered in a combined analysis of student behavior.
As a result, the best solution to the problem is user registration, even though it is less flexible for the students. The differences between this system and others should be kept in mind in this regard: (i) students' long-term web resource usage is an issue, and (ii) the web log patterns will be used in combination with other student records, such as NPAC Grading System records, in the proposed assessment tools.
The current web server stores the following data for each request: client host Internet Protocol (IP) address, time stamp, method, URL (uniform resource locator) of the requested document, HTTP (HyperText Transfer Protocol) version, user identifier, return code (status of the request, i.e., success or error codes), bytes transferred, referrer page URL, and agent (browser and client operating system).
In the web logs, several fields may be missing for some records. Also, a "gif" or "jpg" file embedded in an HTML document generally does not reflect user browsing behavior. After the data collection issues are solved, data reduction and cleaning may be necessary.
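As an illustration of this first processing step, the following Java sketch parses one line of a combined-format access log into the fields listed above; the regular expression and class name are assumptions made for illustration rather than part of the implemented system.

import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative only: one record of the combined access log format described above.
public class LogRecord {
    // host, ident, user, [time], "method url protocol", status, bytes, "referrer", "agent"
    private static final Pattern LINE = Pattern.compile(
        "^(\\S+) (\\S+) (\\S+) \\[([^\\]]+)\\] \"(\\S+) (\\S+) (\\S+)\" (\\d{3}) (\\S+) \"([^\"]*)\" \"([^\"]*)\"$");

    public String host, user, time, method, url, protocol, referrer, agent;
    public int status;
    public long bytes;

    // Returns null when a line does not match (e.g., records with missing fields).
    public static LogRecord parse(String line) {
        Matcher m = LINE.matcher(line);
        if (!m.matches()) return null;
        LogRecord r = new LogRecord();
        r.host = m.group(1);
        r.user = m.group(3);          // "-" unless registration or log-in is enforced
        r.time = m.group(4);
        r.method = m.group(5);
        r.url = m.group(6);
        r.protocol = m.group(7);
        r.status = Integer.parseInt(m.group(8));
        r.bytes = "-".equals(m.group(9)) ? 0 : Long.parseLong(m.group(9));
        r.referrer = m.group(10);
        r.agent = m.group(11);
        return r;
    }
}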
The next step is to identify the user sessions. One way to do this is to use time stamps and user IDs: for example, when the gap between two consecutive accesses by the same user exceeds a predefined time interval, the accesses are assigned to two separate sessions.
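A minimal Java sketch of this timeout rule is given below; the 30-minute gap and the record fields are assumptions chosen for illustration.

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative sessionizer: consecutive accesses by the same user that are more than
// TIMEOUT_MS apart are split into separate sessions.
public class Sessionizer {
    static final long TIMEOUT_MS = 30L * 60 * 1000;    // assumed 30-minute gap

    public static class Access {                         // one parsed log record
        public final String userId;
        public final String url;
        public final long timeMillis;
        public Access(String userId, String url, long timeMillis) {
            this.userId = userId; this.url = url; this.timeMillis = timeMillis;
        }
    }

    // Input is assumed to be ordered by time; output maps each user to a list of sessions.
    public static Map<String, List<List<Access>>> sessions(List<Access> accesses) {
        Map<String, List<List<Access>>> byUser = new HashMap<>();
        Map<String, Long> lastSeen = new HashMap<>();
        for (Access a : accesses) {
            List<List<Access>> userSessions =
                byUser.computeIfAbsent(a.userId, k -> new ArrayList<>());
            Long prev = lastSeen.get(a.userId);
            if (prev == null || a.timeMillis - prev > TIMEOUT_MS) {
                userSessions.add(new ArrayList<>());      // gap too large: start a new session
            }
            userSessions.get(userSessions.size() - 1).add(a);
            lastSeen.put(a.userId, a.timeMillis);
        }
        return byUser;
    }
}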
When evaluating the user logs, we are only interested in forward traversal subpaths. As a result, all maximal forward paths in each user session should be found first, and then all common subpaths among the maximal forward paths of the user sessions should be discovered. A maximal forward path is a maximal sequence of connected pages in a Web presentation in which no page is visited twice.
The need for inferring backward access pairs in session identification due to proxy or client caching can be illustrated with the following example. Note that any access pair (x - - > y) can be the result of clicking a hyperlink to page y on page x or by clicking on the backward button on the Web browser to page y after the viewer has looked at page x. The user session represented by the connected traversal path can be described as follows: (- - - > a), (a - - > b), (b - - > c), (c - - > d), (d - - > c), (c - - > b), (b - - > e), (e - - > b), (b - - > a), and (a - - > f). However, since browsers usually cache recently visited pages, some of the actual traversal steps may be missing from the server log files. For example, (d - - > c), (c - - > b), (e - - > b), and (b - - > a) may be missing. These missing traversal steps may need to be inferred in order to identify traversal paths and user sessions.
Each identified user session can be mapped into a transaction. Once transactions are identified, the next issue is to combine these data with the other logs and the user registration information. This process will become clearer during the analysis of the records.
The best way to store the combined data is in a database. The half-processed data is then ready for low-level SQL querying. Through a web interface, it will be possible to obtain interesting user-based statistics: for example, the top N external referrers to a page, the top N pages most frequently visited by users, the top N pages through which users most often enter and exit the site, the top N hosts from which most users come to the site, the distribution of user session durations, and the number of pages visited per session.
From the collected data and user sessions, statistics related to student behavior can be extracted. Some popular user-based statistics include the top N referrers to a Web site, the top N pages most frequently visited by users, the top N pages from or into which users most frequently exit or enter a Web site, the top N most frequented user traversal paths and the top N groups of pages most frequently visited together, the top N browsers most frequently used, the top N IP hosts from which most users come, the demographics (by organization or by country) of users, the distribution of user session durations, the distribution of numbers of pages visited during a user session, and the distribution of depth or breadth of a user session.
With user sessions, data mining techniques can be applied to obtain above user browsing patterns. Data mining has recently been used to discover customer buying patterns by many retailers and other service corporations. One of the most important data mining problems concerns mining association rules. Given a set of transactions, where each transaction is a set of items, an association rule is an expression of the form X ==> Y, where X and Y are sets of items. An example of an association rule is: "30 percent of transactions that contain bread and butter also contain milk; 2 percent of all transactions contain both of these items." Here 30 percent is called the confidence of the rule, and 2 percent the support of the rule. The thrust of mining association rules is to find all association rules that satisfy user-specified minimum support and minimum confidence constraints. In mining association rules, the most important problem is to generate all combinations of items that have the minimal support. These combinations of items are called large item-sets. [IBM_ST]
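For concreteness, the following Java sketch computes the support and confidence of a single rule X ==> Y over a list of transactions; the transaction representation (a set of item or page names) is an assumption made for illustration.

import java.util.List;
import java.util.Set;

// Illustrative computation of support and confidence for one rule X ==> Y.
public class RuleStats {
    // support(X ==> Y)    = |{T : X and Y both contained in T}| / |D|
    // confidence(X ==> Y) = |{T : X and Y both contained in T}| / |{T : X contained in T}|
    public static double[] supportAndConfidence(List<Set<String>> transactions,
                                                Set<String> x, Set<String> y) {
        int both = 0, xOnly = 0;
        for (Set<String> t : transactions) {
            if (t.containsAll(x)) {
                xOnly++;
                if (t.containsAll(y)) both++;
            }
        }
        double support = (double) both / transactions.size();
        double confidence = xOnly == 0 ? 0.0 : (double) both / xOnly;
        return new double[] { support, confidence };
    }
}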
Discovering the top N most frequented user traversal paths and the top N groups of pages most frequently visited together are to some extent similar to finding the top N large item-sets for traversal paths and groups of pages. But specific differences exist. A traversal path is a collection of consecutive URL pages in a Web presentation, where one URL is referred to by the immediately preceding URL. The URLs in a traversal path are connected in the Web presentation graph. In contrast, the pages in a group are not necessarily connected among themselves. A frequently visited group of pages may contain two or more disjoint traversal paths. By examining the traversal paths and groups of pages, valuable user browsing patterns can be obtained to improve the organization and linkage of the Web presentation.[IBM_ST]
Note that finding frequent traversal paths is also to some extent similar to the problem of mining sequential patterns. However, the results of sequential-pattern mining may contain sequences that do not represent a traversal path in the Web presentation graph. The reason is that there may be many backward traversal steps in a user session, so pages on two different paths may be recognized as part of a sequential pattern. [IBM_ST]
Another task is mining the groups of pages most frequently visited. Frequent traversal paths identify pages that are on the same forward path in a Web presentation; these pages appear as consecutive subsequences in the maximal forward paths of user sessions. However, there may be groups of pages that are not on the same traversal path but are frequently visited together by users. By examining both the frequented traversal paths and the frequently visited groups, valuable information can be obtained to improve the organization of the Web foils. For example, foils a, b, e, and f may be visited most frequently by users even though these four pages are not on the same path in the Web presentation. Thus, it may be better to provide an HTTP link from page e to page f so that most users do not have to traverse backwards from page e to b, and then to page a, before they can go to page f.
To mine the frequently visited groups from user sessions, we need the distinct pages in each session, so any duplication of pages caused by backward traversals is first eliminated in each session. Unlike traversal paths, where the ordering of pages in a sequence is important, there is no ordering within a group of pages.
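To make this group-mining step concrete, a minimal Java sketch that counts how often each pair of distinct pages is visited together across sessions is given below; restricting groups to pairs and the minimum-support threshold are simplifications for illustration, not part of the actual mining algorithm.

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Illustrative group mining: counts co-occurrences of page pairs across sessions.
// Each session is first reduced to its set of distinct pages (ordering ignored).
public class PageGroups {
    public static Map<String, Integer> frequentPairs(List<Set<String>> sessions, int minSupport) {
        Map<String, Integer> counts = new HashMap<>();
        for (Set<String> session : sessions) {
            List<String> pages = new ArrayList<>(new TreeSet<>(session)); // distinct, sorted
            for (int i = 0; i < pages.size(); i++)
                for (int j = i + 1; j < pages.size(); j++)
                    counts.merge(pages.get(i) + " , " + pages.get(j), 1, Integer::sum);
        }
        counts.values().removeIf(c -> c < minSupport);   // keep only frequent pairs
        return counts;
    }
}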
Although frequent traversal paths and groups of pages visited together may be enough for many tool users in industry, they may not be satisfactory for the NPAC education program. First, common analysis tools like SpeedTracer assume constant interests: users are expected to keep checking the same pages, and different users with different interests do not change this assumption. In our case a user, a student, is expected to change interests as course topics are updated in class. Second, these tools assume the set of web pages is mostly constant in size, i.e., that the site is not growing over time, at least until the results of the analyzer are seen. Reorganizing the pages with respect to user traversal paths is a common goal for most of the tools, but NPAC course materials have a dynamic structure and grow continuously. Therefore, this system has two responsibilities: one, common to the other tools, is site reorganization; the other is user assessment on a dynamic site with dynamic interests. Applying other types of data mining, such as clustering pages, clustering users, or classification, is also a goal of this project.
Other results can be obtained with data mining algorithms, such as clustering and classification of students based on their web resource usage. The correlation between students' grades and their study of the web resources can be examined from both an assessment perspective and a resource improvement perspective.
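As one concrete example of such a check, the sketch below computes the Pearson correlation between students' grades and a simple usage measure such as page-view counts; the input arrays are hypothetical and would be filled from the Grading System records and the web logs.

// Illustrative check of the grade/usage relationship: Pearson correlation between
// each student's grade and a simple usage measure (e.g., number of page views).
public class GradeUsageCorrelation {
    public static double pearson(double[] grades, double[] usage) {
        int n = grades.length;                       // assumes both arrays have length n
        double meanG = 0, meanU = 0;
        for (int i = 0; i < n; i++) { meanG += grades[i]; meanU += usage[i]; }
        meanG /= n; meanU /= n;
        double cov = 0, varG = 0, varU = 0;
        for (int i = 0; i < n; i++) {
            double dg = grades[i] - meanG, du = usage[i] - meanU;
            cov += dg * du; varG += dg * dg; varU += du * du;
        }
        return cov / Math.sqrt(varG * varU);         // in [-1, 1]; near 0 means no linear relation
    }
}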
The Web usage patterns discovered by the techniques described earlier do not, by themselves, reach the final users; they are not very useful unless there are mechanisms and tools to help an analyst understand them. Hence, in addition to developing techniques for mining usage patterns from Web logs, the system should produce understandable and applicable reports for the analyst. Presenting the results is expected to draw upon a number of fields, including statistics, graphics and visualization, usability analysis, and database querying. In this system, the reports can be divided primarily into two areas: user (student) assessment, and web resource reports such as visit paths and the most popular groups of pages. It may also be possible to model a student's interests in the web resources. From the instructor's perspective, the questions to ask may be: How are the students using the resources? Which pages are being accessed most frequently? Who is visiting which documents? What are the frequency and most recent use of each hyperlink? Is student performance correlated with resource usage? How should the resources be organized in a more useful manner? What are the long-term tendencies of the students? Ideally, the answers to such queries would allow the system to assess an individual student and return a grading suggestion for that student.
The regular web server logs usually contain more data than necessary, and some of these data may mislead the analysis process. Garbage data can be eliminated by excluding image accesses within a page, such as JPEG and GIF file accesses. Data filtering does not only mean removing irrelevant data; it also means converting special accesses into the regular access format. For example, a script in the HTML files may report the URL of an access when the HTML is loaded from the browser cache.
These kinds of processes naturally fit Perl script programming, so initially a Perl script is planned to handle this step. Later, it may be handled by a Java program to allow real-time analysis of the records from the front-end application interfaces.
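Assuming the filter operates on the parsed log records sketched earlier, a minimal Java version of the cleaning step could look as follows; the list of excluded file extensions is illustrative.

import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Illustrative filter: drops embedded-image requests and failed requests,
// which reflect page layout and errors rather than user browsing behavior.
public class LogFilter {
    private static final List<String> EXCLUDED = Arrays.asList(".gif", ".jpg", ".jpeg", ".png");

    public static boolean keep(LogRecord r) {
        if (r == null || r.status >= 400) return false;       // parse failures and error codes
        String url = r.url.toLowerCase();
        return EXCLUDED.stream().noneMatch(url::endsWith);     // skip in-page image accesses
    }

    public static List<LogRecord> filter(List<LogRecord> records) {
        return records.stream().filter(LogFilter::keep).collect(Collectors.toList());
    }
}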
The same kinds of technologies are appropriate for data filtering on the NPAC Grading System logs.
In order to go beyond simple access statistics, it is necessary to find the user sessions in the access logs. The page references should be grouped together into user sessions, i.e., transactions. Although the definition of a transaction may change depending on the criteria, currently a user session is taken to be a transaction.
A general approach to transactions is defined by [CPY96], which introduces the concept of a maximal forward reference in order to identify transactions. Each transaction is defined to be the set of pages on the path from the first page in the log for a user up to the page before a backward reference is made. A new transaction is started when the next forward reference is made. A forward reference is defined to be a page not already in the set of pages for the current transaction; similarly, a backward reference is a page that is already contained in the set of pages for the current transaction. For example, the access sequence A B C D C B E F E G would be broken into three transactions: A B C D, A B E F, and A B E G.
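A minimal Java sketch of this splitting rule is given below; applied to the example sequence A B C D C B E F E G it yields the three transactions listed above. The class and method names are illustrative only.

import java.util.ArrayList;
import java.util.List;

// Illustrative maximal-forward-reference splitting of one user session, following
// the rule described above: a backward reference closes the current transaction.
public class ForwardReferences {
    public static List<List<String>> transactions(List<String> session) {
        List<List<String>> result = new ArrayList<>();
        List<String> path = new ArrayList<>();
        boolean movingForward = true;
        for (String page : session) {
            int pos = path.indexOf(page);
            if (pos >= 0) {                                    // backward reference
                if (movingForward) result.add(new ArrayList<>(path));
                path.subList(pos + 1, path.size()).clear();    // truncate path back to the page
                movingForward = false;
            } else {                                           // forward reference
                path.add(page);
                movingForward = true;
            }
        }
        if (movingForward && !path.isEmpty()) result.add(path);
        return result;
    }

    // Example: A B C D C B E F E G  ->  [A B C D], [A B E F], [A B E G]
    public static void main(String[] args) {
        System.out.println(transactions(List.of("A","B","C","D","C","B","E","F","E","G")));
    }
}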
Java modules are the likely technology for this step because of its more sophisticated requirements.
The pre-processed data in the database is suitable for SQL-level querying, which allows more technical queries on the data. A front end behind the web interface may offer arbitrary SQL querying of the data. Such an interface is mainly useful for preparing straightforward statistics on the collected data, as is common in the non-intelligent analysis tools. The results of these queries may lead to more sophisticated front-end implementations.
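As a sketch of how such a front end might issue one of the straightforward statistics queries over JDBC, consider the following; the table name, column names, and connection URL are assumptions made for illustration and would follow the actual database design.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

// Illustrative JDBC front-end query: the top N pages most frequently visited.
// The table access_log(user_id, url, access_time, ...) and the JDBC URL are assumed.
public class TopPagesQuery {
    public static void printTopPages(String jdbcUrl, int n) throws Exception {
        String sql = "SELECT url, COUNT(*) AS hits FROM access_log "
                   + "GROUP BY url ORDER BY hits DESC LIMIT ?";
        try (Connection con = DriverManager.getConnection(jdbcUrl);
             PreparedStatement ps = con.prepareStatement(sql)) {
            ps.setInt(1, n);
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    System.out.println(rs.getString("url") + "\t" + rs.getInt("hits"));
                }
            }
        }
    }
}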
The technologies involved may be HTML, JavaScript, Java-JDBC, Database-SQL.
Reference[IBM_ST]:
SpeedTracer: A Web usage mining and analysis tool, http://www.almaden.ibm.com/journal/sj/371/wu.txt
Reference[Pitkow]:
J. Pitkow, "In Search of Reliable Usage Data on the WWW," Proceedings of Sixth International World Wide Web Conference (1997).
Reference[Mobasher]:
B. Mobasher et al., Web Mining: Pattern Discovery from World Wide Web Transactions, Technical Report 96-050, Department of Computer Science, University of Minnesota, Minneapolis (September 1996).
Web Mining Survey & WEBMINER HTML documentation at http://www-users.cs.umn.edu/~mobasher/webminer/survey/
Reference[SILK]:
P. Pirolli, R. Rao, and J. Pitkow, "Silk from a Sow's Ear: Extracting Usable Structures from the Web," Proceedings of 1996 Conference on Human Factors in Computing Systems (1996), pp. 118-125.
Reference[CPY96]:
M. S. Chen, J. S. Park, and P. S. Yu, "Data Mining for Path Traversal Patterns in a Web Environment," Proceedings of the 16th International Conference on Distributed Computing Systems (1996), pp. 385-392.