



Abstract




Technological improvements have enhanced the education experience and introduced new terms such as distance learning. Today's education systems need to be supported by new architectures that manage course, instruction, student, and performance records. We have developed a 3-tier architecture integrating backend tools and systems with commodity interfaces. We implemented an asynchronous system supporting curriculum management, student databases, statistical data collection and presentation, assignment submission and performance evaluation, personalized data access on the Web, security, and multiple user levels with access lists. The resulting system has reached a large audience: more than 350 users, in both on-campus and online distance courses, have used it. This paper presents a technical overview of the architecture and discusses the functionality gained from this experience.












1. Introduction

Recent technological innovations have had a tremendous effect on education systems. Distance learning has become a common technique in education. These technologies have been adapted to many academic courses, such as the ones offered by Syracuse University to Jackson State students. In these courses, collaboration tools, mailing lists, and bulletin boards are used, and a rich set of online materials and reference links on the WWW is presented to students.

Both synchronous and asynchronous techniques and resources are subject to continuous improvement. The computational science education group at the Northeast Parallel Architectures Center (NPAC) has developed a large repository of online course material, which includes lectures, tutorials, and programming examples in various languages. To provide regular synchronous interaction among students, teachers, and other learners, in addition to the asynchronous learning materials, the TANGO system was used to deliver CSC 499 over the Internet.

All these improvements in technology-based education methods bring new needs. Our education experience showed that this new form of learning lacks online course-related tools for both students and teaching staff: registration, password assignment, automated mailing lists, recording of student data, monitoring of students, preparation and presentation of performance records, long-term reference to a particular student's work, and assessment tools, all tasks that used to be handled through human interaction in traditional systems.

In a distributed environment, the system provides access from different locations through a complete security mechanism and a user authentication interface integrated with commodity interfaces. Student records and performance records are kept continuously in a database, which provides a base for virtual university student services. A student can access his performance page at any time from anywhere with Web access, and can include it with his resume in applications, which allows more reliable evaluations.

Besides the convenience of delivering grades to students, the NPAC Grading System helps instructors as an assessment tool. Graders can view various statistics about the students at any time while grading from a Web browser. Preparing class surveys through the Grading System makes it possible to track student performance, understand course quality, and adapt to student needs. Customization options bring more flexibility to grading. In addition to seeing a simple grade, students can follow their own performance through the semester, including the class average, the expected grade, and grader comments and suggestions.

The main functionalities of the system, which supports collaboration among supervisors, instructors, co-instructors, TAs, and students through Web browsers, are:

  Course record manipulation: add, update, and delete courses and surveys
  Student record manipulation
  Assignment record manipulation
  Customized grading
  Survey preparation
  Grading and statistics
  System user records: defining new users and managing user access lists
  System services such as backups and class list construction
  Students' access menu: personal information and grades
 



2. Architecture

2.1 Student Records Database

The Student Records Database of the NPAC Grading System holds useful information about the users of the Web resources. It stores all user registration information and the students' performance records. The system also keeps access logs of the students. Since the proposed system involves the same users, combining the two groups of data provides opportunities for new kinds of analyses. Other systems usually have no user information attached to their Web logs, or no registration information even when they have an authentication mechanism, so their analyses offer no opportunity to assess individual users. Furthermore, following a user over the long term is a rather new topic addressed in this project.

To understand how students use the Web resources and to improve the assessment tools, the Web server log files are analyzed. The draft architecture of the system is shown in the figure below.

[Figure: draft architecture of the system]

The system starts by processing the Web server logs. A typical server log records, for each request, the client host address, a time stamp, the requested URL, the HTTP status code, the number of bytes transferred, and, when available, the referrer page and the browser (agent) identification.
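As an illustration, the short Java sketch below parses one such log entry. It assumes the widely used NCSA "combined" log format; the sample entry, the class name LogLineParser, and the field layout are illustrative assumptions, not a description of the actual NPAC server configuration.

    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    // Illustrative parser for one NCSA "combined" log entry (an assumed format;
    // the real server's fields may differ).
    public class LogLineParser {
        // host ident authuser [date] "request" status bytes "referrer" "agent"
        private static final Pattern COMBINED = Pattern.compile(
            "^(\\S+) (\\S+) (\\S+) \\[([^\\]]+)\\] \"([^\"]*)\" (\\d{3}) (\\S+) \"([^\"]*)\" \"([^\"]*)\"$");

        public static void main(String[] args) {
            String line = "128.230.1.1 - jdoe [10/Oct/1998:13:55:36 -0500] "
                + "\"GET /cps616/lecture1.html HTTP/1.0\" 200 2326 "
                + "\"http://www.npac.syr.edu/cps616/index.html\" \"Mozilla/4.0\"";
            Matcher m = COMBINED.matcher(line);
            if (m.matches()) {
                System.out.println("host:     " + m.group(1));
                System.out.println("user:     " + m.group(3));
                System.out.println("time:     " + m.group(4));
                System.out.println("request:  " + m.group(5));
                System.out.println("status:   " + m.group(6));
                System.out.println("bytes:    " + m.group(7));
                System.out.println("referrer: " + m.group(8));
                System.out.println("agent:    " + m.group(9));
            }
        }
    }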

In general, users tend to value privacy and to remain as anonymous as possible, which forces many Web servers not to ask for registration or not to use cookies. Since a user identifier is usually not available in the log file, the analyzer identifies sessions by host address: all accesses to the Web server from a given host address are considered one session until a predefined amount of time has passed between accesses. Because of the proxy servers used by Internet Service Providers (ISPs) and the firewalls used by corporate gateways, true client IP addresses are often not available to the Web server. Instead of distinct client IPs, the same proxy server or firewall IP is recorded in the server log files, representing the requests of different users who reach the Web site through that proxy or firewall as if they came from a single user. As a result, an identified session may in fact contain many independent user sessions, which creates ambiguity in the log records. Furthermore, some Web pages are cached by local clients or by proxy servers, or both, in order to reduce network traffic; log records are then missing for the corresponding accesses to the cached pages, resulting in an incomplete log. [Pitkow] gives a more complete discussion of the problems of obtaining usable, user-oriented Web usage data. [IBM_ST]

Well-known techniques for solving the problem of proxy servers or firewalls masking user IPs generally require either user registration (log-ins) or the use of "cookies" between the Web server and client browsers. With log-ins or cookies, a Web server can identify the distinct requests made by an individual user through a token carried between the user's browser and the server.

Some tools, such as IBM's SpeedTracer, try to analyze user-oriented behavior from the regular server log files without requiring cookies or registration. Although this simple solution may be satisfactory in some cases, it does not give the most reliable user identification, and it cannot identify a user in the long run: a user can be identified only for one session, or possibly a few sessions, by assuming that connections come from the same machine and accepting increasing doubt about who is who. [SILK] proposed an alternative approach to reconstructing user traversal paths. Instead of using the referrer page, information about the topology (i.e., the hyperlink structure) of a Web site, together with other heuristics, was used to identify legitimate traversals. A software agent first performs an exhaustive breadth-first traversal of the pages within the Web site in order to construct the topology. However, the topology is not really needed if referrer information is available, which it is most of the time. The remaining possible solutions are cookies or user registration. Cookies are more flexible for users but cause difficulties for the system. First, cookies may not persist: a user can always delete a cookie or refuse to accept it. Second, it is not easy to match user identities obtained from cookies against other user data when a combined analysis of student behavior is needed.

As a result, the best solution to the problem is user registration, even though it is less flexible for the students. The differences between this system and the others should be kept in mind in this regard: (i) the long-term Web resource usage of the students is an issue, and (ii) the Web log patterns will be used in combination with other student records, such as the NPAC Grading System records, in the proposed assessment tools.

The current Web server stores the following data from the users' browsing of the home pages:

In the Web logs, several fields may be missing for some records. Also, a "gif" or "jpg" file embedded in an HTML document generally does not reflect user browsing behavior. After the data collection issues are solved, data reduction and cleaning may therefore be necessary.

* * *

The next step is to identify the user sessions. One way to identify user sessions is to use time stamps and user ids: for example, when the time between two consecutive accesses of a user exceeds a predefined interval, the accesses are treated as belonging to two separate sessions.
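A minimal Java sketch of this timeout-based grouping is given below. The 30-minute threshold, the Access record, and the sample entries are illustrative assumptions; the actual interval used in the system may differ.

    import java.util.*;

    // Sketch of timeout-based session identification: accesses from the same
    // user id are grouped into one session until the gap between two
    // consecutive requests exceeds a predefined interval.
    public class SessionIdentifier {
        static final long TIMEOUT_MS = 30 * 60 * 1000;  // assumed 30-minute gap

        record Access(String user, long timeMs, String url) {}

        static List<List<Access>> sessions(List<Access> log) {
            log.sort(Comparator.comparing(Access::user).thenComparingLong(Access::timeMs));
            List<List<Access>> result = new ArrayList<>();
            List<Access> current = null;
            Access prev = null;
            for (Access a : log) {
                boolean newSession = prev == null
                    || !prev.user().equals(a.user())
                    || a.timeMs() - prev.timeMs() > TIMEOUT_MS;
                if (newSession) {
                    current = new ArrayList<>();
                    result.add(current);
                }
                current.add(a);
                prev = a;
            }
            return result;
        }

        public static void main(String[] args) {
            List<Access> log = new ArrayList<>(List.of(
                new Access("student1", 0L, "/cps616/index.html"),
                new Access("student1", 5 * 60 * 1000L, "/cps616/lecture1.html"),
                new Access("student1", 90 * 60 * 1000L, "/cps616/assignment1.html"),
                new Access("student2", 10 * 60 * 1000L, "/cps616/index.html")));
            System.out.println(sessions(log).size() + " sessions identified");  // prints 3
        }
    }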

When evaluating the user logs, we are interested only in forward traversal subpaths. As a result, all maximum forward paths in each user session should be found first, and then all common subpaths among the maximum forward paths of the user sessions can be discovered. A maximum forward path is a maximal sequence of connected pages in a Web presentation in which no page has been previously visited.

The need to infer backward access pairs during session identification, due to proxy or client caching, can be illustrated with the following example. Note that an access pair (x - - > y) can result either from clicking a hyperlink to page y on page x or from clicking the backward button on the Web browser to return to page y after viewing page x. Consider a user session whose connected traversal path is: (- - - > a), (a - - > b), (b - - > c), (c - - > d), (d - - > c), (c - - > b), (b - - > e), (e - - > b), (b - - > a), and (a - - > f). Since browsers usually cache recently visited pages, some of the actual traversal steps may be missing from the server log files; for example, (d - - > c), (c - - > b), (e - - > b), and (b - - > a) may be missing. These missing traversal steps may need to be inferred in order to identify traversal paths and user sessions.

Each identified user session can be mapped into a transaction. Once transactions are identified, the next issue is to combine this data with the other logs and the user registration information. This process will become clearer during the analysis of the records.

The best way to store the combined data is in a database. The preprocessed data is then ready for low-level SQL querying. Through a Web interface, it will be possible to obtain some interesting user-based statistics, for example: the N most frequent external referrers to a page, the N pages most frequently visited by users, the N pages through which users most often enter and exit the site, the top N hosts from which most users come to visit the site, the distribution of user session durations, and the number of pages visited in a session.

From the collected data and user sessions, statistics related to student behavior can be extracted. Popular user-based statistics include the top N referrers to a Web site, the top N pages most frequently visited by users, the top N pages through which users most frequently enter or exit a Web site, the top N most frequented user traversal paths and the top N groups of pages most frequently visited together, the top N browsers most frequently used, the top N IP hosts from which most users come, the demographics (by organization or by country) of users, the distribution of user session durations, the distribution of the number of pages visited during a user session, and the distribution of the depth or breadth of a user session.

With user sessions, data mining techniques can be applied to obtain the above user browsing patterns. Data mining has recently been used by many retailers and other service corporations to discover customer buying patterns. One of the most important data mining problems concerns mining association rules. Given a set of transactions, where each transaction is a set of items, an association rule is an expression of the form X ==> Y, where X and Y are sets of items. An example of an association rule is: "30 percent of transactions that contain bread and butter also contain milk; 2 percent of all transactions contain all of these items." Here 30 percent is called the confidence of the rule, and 2 percent the support of the rule. The thrust of mining association rules is to find all association rules that satisfy user-specified minimum support and minimum confidence constraints. In mining association rules, the most important problem is to generate all combinations of items that have the minimal support. These combinations of items are called large item-sets. [IBM_ST]
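The following small Java example works through the support and confidence computation for one rule over a handful of page-visit transactions. The page names and transaction contents are invented for illustration; they are not NPAC data.

    import java.util.*;

    // Worked sketch of the support/confidence computation behind an
    // association rule X ==> Y over a set of page-visit transactions.
    public class RuleMetrics {
        static double fractionContaining(List<Set<String>> txns, Set<String> items) {
            long hits = txns.stream().filter(t -> t.containsAll(items)).count();
            return (double) hits / txns.size();
        }

        public static void main(String[] args) {
            List<Set<String>> txns = List.of(
                Set.of("lecture1", "lecture2", "assignment1"),
                Set.of("lecture1", "assignment1"),
                Set.of("lecture2", "solutions"),
                Set.of("lecture1", "lecture2", "assignment1", "solutions"));

            Set<String> x  = Set.of("lecture1", "lecture2");                 // antecedent X
            Set<String> xy = Set.of("lecture1", "lecture2", "assignment1");  // X union Y

            double support = fractionContaining(txns, xy);
            double confidence = support / fractionContaining(txns, x);
            System.out.printf("support = %.2f, confidence = %.2f%n", support, confidence);
            // X union Y appears in 2 of 4 transactions (support 0.50) and in
            // 2 of the 2 transactions that contain X (confidence 1.00).
        }
    }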

Discovering the top N most frequented user traversal paths and the top N groups of pages most frequently visited together is to some extent similar to finding the top N large item-sets for traversal paths and groups of pages. But specific differences exist. A traversal path is a collection of consecutive URL pages in a Web presentation, where each URL is referred to by the immediately preceding URL; the URLs in a traversal path are connected in the Web presentation graph. In contrast, the pages in a group are not necessarily connected among themselves, and a frequently visited group of pages may contain two or more disjoint traversal paths. By examining the traversal paths and groups of pages, valuable user browsing patterns can be obtained to improve the organization and linkage of the Web presentation. [IBM_ST]

Note that finding frequent traversal paths is also to some extent similar to the problem of mining sequential patterns. However, the results from sequential-pattern mining may contain sequences that do not represent a traversal path in the Web presentation graph. The reason is that there may be many backward traversal steps involved in a user session, and pages on two different paths may then be recognized as part of a sequential pattern. [IBM_ST]


Mining the groups of pages most frequently visited is also of interest. Frequent traversal paths identify pages that are on the same forward path in a Web presentation; these pages appear as consecutive subsequences in the maximum forward paths of user sessions. However, there may be groups of pages that are not on the same traversal path but are frequently visited together by users. By examining both the frequent traversal paths and the frequently visited groups, valuable information can be obtained to improve the presentation of the Web foils. For example, foils a, b, e, and f may be visited most frequently by users, yet these four pages are not on the same path in the Web presentation. It may then be better to provide an HTTP link from page e to page f so that most users would not have to traverse backwards from page e to page b and then to page a before they can go to page f.

To mine the frequently visited groups from user sessions, we need the distinct pages in each session. Thus, any duplication of pages caused by backward traversals is first eliminated in each session. Unlike traversal paths, where the ordering of pages in a sequence is important, there is no ordering within a group of pages.
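A minimal sketch of this group-mining step is shown below: each session is first collapsed to its set of distinct pages (removing the duplications caused by backward traversals), and then every unordered pair of pages in that set is counted once per session. The sessions and page names are invented, and a real implementation would count larger item-sets as well, not only pairs.

    import java.util.*;

    // Sketch of counting page pairs most frequently visited together:
    // each session is reduced to its distinct pages, then every unordered
    // pair in that set is counted once per session.
    public class FrequentPairs {
        public static void main(String[] args) {
            List<List<String>> sessions = List.of(
                List.of("a", "b", "e", "b", "a", "f"),   // backward steps repeat b and a
                List.of("a", "b", "f"),
                List.of("a", "e", "f"));

            Map<String, Integer> pairCounts = new HashMap<>();
            for (List<String> session : sessions) {
                List<String> pages = new ArrayList<>(new TreeSet<>(session)); // distinct, ordered
                for (int i = 0; i < pages.size(); i++)
                    for (int j = i + 1; j < pages.size(); j++)
                        pairCounts.merge(pages.get(i) + "," + pages.get(j), 1, Integer::sum);
            }
            pairCounts.entrySet().stream()
                .sorted(Map.Entry.<String, Integer>comparingByValue().reversed())
                .limit(3)
                .forEach(e -> System.out.println(e.getKey() + " : " + e.getValue()));
        }
    }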

Although frequent traversal paths and groups of pages visited together may be enough for many tool users in industry, they are not sufficient for the NPAC education program. First of all, common analysis tools such as SpeedTracer assume constant interests: the users keep checking the same pages, and different users with different interests do not change this assumption. In our case a user, a student, is expected to change his interests as the course subjects change over the semester. Second, these tools assume that the set of Web pages is mostly constant in size, i.e., that the site does not grow over time, at least until the results of the analyzer are seen; restructuring the pages with respect to user traversal paths is the common goal of most of the tools. NPAC course materials, however, have a dynamic structure and grow continuously. Therefore, this system has two responsibilities: one, common to the other tools, is site reconstruction; the other is user assessment in a dynamic site with dynamic interests. Applying other types of data mining, such as clustering pages, clustering users, or classification, is among the goals of this project.

* * *

Other possible results can be obtained using data mining algorithms such as clustering and classification of students based on their Web resource usage. The correlation between students' grades and their study of the Web resources can be examined from both an assessment and a resource improvement perspective.
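As one concrete illustration of such a check, the sketch below computes the Pearson correlation between students' page-access counts and their final grades. Both arrays contain invented numbers rather than actual NPAC records, and the correlation coefficient is only one of many possible measures.

    // Sketch of the kind of check mentioned above: the Pearson correlation
    // between counts of course-page accesses and final grades (made-up data).
    public class GradeUsageCorrelation {
        static double pearson(double[] x, double[] y) {
            int n = x.length;
            double sx = 0, sy = 0, sxx = 0, syy = 0, sxy = 0;
            for (int i = 0; i < n; i++) {
                sx += x[i]; sy += y[i];
                sxx += x[i] * x[i]; syy += y[i] * y[i]; sxy += x[i] * y[i];
            }
            double cov = sxy - sx * sy / n;
            double vx = sxx - sx * sx / n;
            double vy = syy - sy * sy / n;
            return cov / Math.sqrt(vx * vy);
        }

        public static void main(String[] args) {
            double[] pageAccesses = {12, 45, 30, 70, 25};
            double[] finalGrades  = {61, 85, 74, 92, 70};
            System.out.printf("r = %.2f%n", pearson(pageAccesses, finalGrades));
        }
    }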

* * *

The discovery of Web usage patterns by the techniques described earlier is not, by itself, addressed to the final users; the patterns would not be very useful without mechanisms and tools that help an analyst understand them. Hence, in addition to techniques for mining usage patterns from Web logs, the system should produce more understandable and applicable reports for the analyst. Presenting the results is expected to draw on a number of fields, including statistics, graphics and visualization, usability analysis, and database querying. In this system the reports can be divided into two areas: user (student) assessment, and Web resource reports such as visit paths and the most popular groups of pages. It may also be possible to model a student's interest in the Web resources. From the instructor's perspective, the questions to ask may be: How are the students using the resources? Which pages are being accessed most frequently? Who is visiting which documents? What are the frequency and most recent use of each hyperlink? Is student performance correlated with resource usage? How should the resources be organized in a more useful manner? What are the long-term tendencies of the students? Ideally, as a result of such queries, the system could assess an individual student and return a grading suggestion for that student.




5. Near Term Implementation Issues

Preprocessing the Data

Data Filtering

Regular Web server logs usually contain more data than necessary, and some of this data may mislead the analysis process. Garbage data can be eliminated by excluding image accesses within a page, such as JPEG and GIF file accesses. Data filtering does not only mean removing irrelevant data, but also converting special accesses into the regular access format. For example, a script in an HTML file may report the URL of an access when the HTML page is loaded from the browser cache.

These kinds of processes fit naturally with Perl scripting. Initially, a Perl script is planned to handle this step; later it may be handled by a Java program for the purpose of real-time analysis of the records from the frontend application interfaces. A sketch of such a filter is given below.
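This minimal Java sketch assumes the filter reads raw log lines from standard input and writes the cleaned lines to standard output, so it can be used as a pipeline stage much like the planned Perl script. The image-extension list is an illustrative choice; the conversion of special cached accesses mentioned above is not shown.

    import java.io.*;
    import java.util.regex.Pattern;

    // Minimal sketch of the filtering step: requests for inline images are
    // dropped from the raw log before further analysis.
    public class LogFilter {
        private static final Pattern IMAGE = Pattern.compile(
            "\"(?:GET|POST) \\S+\\.(?:gif|jpe?g|png)\\b", Pattern.CASE_INSENSITIVE);

        public static void main(String[] args) throws IOException {
            BufferedReader in = new BufferedReader(new InputStreamReader(System.in));
            for (String line; (line = in.readLine()) != null; ) {
                if (IMAGE.matcher(line).find()) continue;  // skip image accesses
                System.out.println(line);                  // keep everything else
            }
        }
    }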

The same kind of technology is appropriate for data filtering on the NPAC Grading System logs.

Transaction Identification

In order to perform analyses that go beyond simple access statistics, it is necessary to find the user sessions in the access logs. The page references should be grouped together to match a user session, i.e., a transaction. Although the definition of a transaction may change depending on the criteria, a user session is currently accepted as a transaction.

A general approach for transactions is defined by [CPY96], which introduces the concept of the maximal forward reference in order to identify transactions. Each transaction is defined to be the set of pages on the path from the first page in the log for a user up to the page before a backward reference is made. A new transaction is started when the next forward reference is made. A forward reference is defined to be a page not already in the set of pages for the current transaction; similarly, a backward reference is a page that is already contained in the set of pages for the current transaction. For example, an access sequence A B C D C B E F E G would be broken into three transactions: A B C D, A B E F, and A B E G.
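The following Java sketch reproduces this maximal-forward-reference splitting on the example sequence above; it is an illustrative implementation of the idea from [CPY96], not code from that paper or from the NPAC system.

    import java.util.*;

    // Sketch of maximal-forward-reference transaction splitting [CPY96]:
    // a backward reference (a page already on the current path) closes the
    // current maximal forward path, which is emitted as one transaction.
    public class MaxForwardReference {
        static List<List<String>> transactions(List<String> session) {
            List<List<String>> result = new ArrayList<>();
            List<String> path = new ArrayList<>();
            boolean movingForward = true;
            for (String page : session) {
                int pos = path.indexOf(page);
                if (pos == -1) {                       // forward reference
                    path.add(page);
                    movingForward = true;
                } else {                               // backward reference
                    if (movingForward) result.add(new ArrayList<>(path));
                    path.subList(pos + 1, path.size()).clear(); // pop back to the page
                    movingForward = false;
                }
            }
            if (movingForward && !path.isEmpty()) result.add(new ArrayList<>(path));
            return result;
        }

        public static void main(String[] args) {
            List<String> session = List.of("A","B","C","D","C","B","E","F","E","G");
            System.out.println(transactions(session));
            // Prints: [[A, B, C, D], [A, B, E, F], [A, B, E, G]]
        }
    }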

Java modules are the likely technology for this step because of its more sophisticated requirements.

Data Combination

The purpose of this stage is to prepare the data from the three sources, namely the Web logs, the Grading System logs, and the student registration records, in proper formats. All three data sources should be placed in a database and correlated for future analyses.

Knowledge Discovery

SQL-Based Querying on the Combined Data

The preprocessed data in the database is suitable for SQL-level querying, which allows more technical queries on the data. A front end through the Web interface may provide arbitrary SQL querying of the data. Such an interface is useful mainly for preparing straightforward statistics on the collected data, the kind of output common in non-intelligent analysis tools. The results of these queries may guide more sophisticated front-end implementations.

The technologies involved may be HTML, JavaScript, Java with JDBC, and database SQL. A sketch of such a query is given below.
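The sketch below shows what such a JDBC query might look like for the "top N pages most frequently visited" statistic. The table name access_log, its columns, and the connection details are placeholder assumptions rather than the actual NPAC schema.

    import java.sql.*;

    // Illustrative JDBC front-end query for the "top N pages most frequently
    // visited" statistic. Table and connection details are placeholders.
    public class TopPagesQuery {
        public static void main(String[] args) throws SQLException {
            String sql = "SELECT url, COUNT(*) AS hits "
                       + "FROM access_log GROUP BY url ORDER BY hits DESC";
            try (Connection con = DriverManager.getConnection(
                     "jdbc:subprotocol://dbhost/grading",  // placeholder, driver-specific
                     "user", "password");
                 Statement st = con.createStatement();
                 ResultSet rs = st.executeQuery(sql)) {
                int n = 0;
                while (rs.next() && n++ < 10) {            // report the top 10 pages
                    System.out.println(rs.getString("url") + "\t" + rs.getLong("hits"));
                }
            }
        }
    }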

Applying Data Mining Algorithms

The integrated data is then subject to discovery techniques and analysis methods. Examples of these methods include path analysis, to determine the most frequently visited paths in the site; association rules, to find correlations between the pages accessed by the students, i.e., cases where the presence of one set of resource visits implies the presence of another; sequential pattern discovery, to find common characteristics of the students within a time period; and clustering and classification, to draw profiles of the students and group them. However, such data mining mechanisms may require a different data format, so before these techniques are applied the data in the database should be transformed into a more suitable format. Also, after the techniques are applied, the results may need to be stored in correlation with the transformed data, both for user look-ups and for future analyses based on the current status. The technologies involved in this process include Java, JDBC, SQL, and recent data mining tools.

Analysis of knowledge

Presenting The Results

The raw technical results alone would not help the user understand the overall phenomenon, so various data presentation and visualization techniques are necessary. The technologies planned for interpreting the results are Java, JavaScript, and HTML, together with OLAP techniques where they are usable and compatible with the system. [Dyr97] has shown that the analysis needs of Web usage data have much in common with those of a data warehouse, and hence OLAP techniques are quite applicable. The planned technologies need to cover the areas of statistics, graphics and visualization, usability analysis, and database querying.

Current Status

The current stage of the project is collecting data from students' accesses to the resources. The log mechanisms for the Web resources and the NPAC Grading System have been constructed and implemented. Students are registered through the NPAC Grading System, and the authentication passwords for the Web resources can be changed through the Grading System.

Further Study

Many types of distance education involve the student's remote access to online course materials. This opens up additional possibilities for assessment based on access patterns to the materials. Analyzing how students use a site is critical for assessing both the students and the course materials. A new architecture for assessment on the World Wide Web is necessary for today's innovative distance education technologies. This architecture should involve the collection of students' access data to Web resources, the transformation of the collected data into assessment information, the integration of students' personal registration information into the final evaluations, and the processing of all available data and information with recent data mining techniques.

The purpose of this project is to design an architecture using data mining tools and commodity technologies for the assessment of distance education. Basically, asynchronous use of online resources reflects the students' learning abilities, and current education practice has no way to evaluate the students' informal responses to the given materials. The main architecture of the proposed system is based on collecting access information for the Web materials, i.e., Web mining, and discovering users' access patterns with data mining tools. The results will be presented to the teacher, and to the students as necessary, combined with online analytical processing tools. Other applications such as the Student Records will also be involved in the analysis.




6. References

[IBM_ST] SpeedTracer: A Web Usage Mining and Analysis Tool, http://www.almaden.ibm.com/journal/sj/371/wu.txt

[Pitkow] J. Pitkow, "In Search of Reliable Usage Data on the WWW," Proceedings of the Sixth International World Wide Web Conference (1997).

[Mobasher] B. Mobasher et al., Web Mining: Pattern Discovery from World Wide Web Transactions, Technical Report 96-050, Department of Computer Science, University of Minnesota, Minneapolis (September 1996). Web Mining Survey and WEBMINER HTML documentation at http://www-users.cs.umn.edu/~mobasher/webminer/survey/

[SILK] P. Pirolli, R. Rao, and J. Pitkow, "Silk from a Sow's Ear: Extracting Usable Structures from the Web," Proceedings of the 1996 Conference on Human Factors in Computing Systems (1996), pp. 118-125.

[CPY96] M.-S. Chen, J. S. Park, and P. S. Yu, "Data Mining for Path Traversal Patterns in a Web Environment," Proceedings of the 16th International Conference on Distributed Computing Systems (1996), pp. 385-392.