Introduction

Problems addressed: information search in large, distributed systems like the WWW
- Large number of resources available on the Web
- Manual traversal of the Web is virtually impossible
- Automatic traversal cannot locate and track exactly what we are looking for
- No pre-defined data set or fields for a document on the Web
- Improperly maintained Web sites produce dead links
- Contents of the resources change, and so do the hyperlinks
- The information resources and Web servers keep increasing in number daily, in an exponential fashion

WebRobots - A Solution

What are they?
- Programs that traverse the Web hypertext structure automatically
- Also called Spiders, Web Worms or Web Wanderers
- Do some useful work while they traverse the Web, such as retrieving documents recursively

WebRobots Uses
- Statistical Analysis - count the number of Web servers, the average number of documents per server, etc.
- Maintenance - assist authors in locating dead links, and help maintain the content, structure, etc. of a document
- Mirroring - cache an entire sub-space of a Web server, allowing load sharing, faster access, etc.
- Resource Discovery - summarize large parts of the Web, and provide access to them

WebRobots Implementation Issues - Past
- Browsing - a small list was maintained at CERN for browsing through all available servers
- Listing - lists of references to resources on the Web, such as the HCC NetServices List and the NCSA MetaIndex
- Searching - searchable databases such as the GNA Meta Library

WebRobots Implementation Issues - Present
- Automatic Collection - automatically retrieve a fixed set of documents that the robot has been programmed to parse regularly, like the NCSA What's New list
- Automatic Discovery - exploit automation: analyze and store the documents encountered
- Automatic Indexing - automatically index the set of documents cached

WebRobots Implementation Issues - Future
- Natural Language Processing (NLP)
- Distributed Queries (DQ)
- Meta-Data Format (MDF)
- Artificial Intelligence (AI)
- Client-Side Data Caching/Mining (CDCM)
- A combination of the above technologies

WebRobots Future Implementations - NLP
- Robots should be user-friendly and allow natural language expressions
- Documents should be treated as a linear list of words for searching (e.g. WAIS)
- Improved accuracy and a richer representation of the search output

WebRobots Future Implementations - DQ
- Conventional robots take a lot of time to traverse the Web and locate information
- DQ enables faster processing and low response times
- Larger amounts of data are processed in a short time
- Lets the user run a distributed search across several databases at once
- This is also called "the Webants approach"

WebRobots Future Implementations - MDF
- Relational attributes pave the way for NLP and SQL queries
- A MetaData Format along the following lines is most suitable (an example record follows the field list)
- Subject: the topic addressed
- Title: the name of the object
- Author: the person primarily responsible for the content
- Publisher: the agent who made the object available
- Other Agent: editors, transcribers, etc.
- Date: the date of publication
- Object Type: novel, poem, dictionary, etc.
- Form: PostScript file, Windows text file, etc.
- Identifier: a unique identifier (number) for the object
- Relation: relationship to other objects
- Source: objects from which this one is derived
- Language: the language of the intellectual content
- Coverage: spatial locations and temporal durations characteristic of the object
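To make the format concrete, here is one way such a record could be represented. This is a sketch only: the document described, all of the field values, and the use of a Python dictionary are invented for illustration; only the field names come from the list above.

    # Hypothetical metadata record using the fields listed above.
    # The document and every value shown are invented for illustration.
    record = {
        "Subject":     "Web robots and resource discovery",
        "Title":       "Robots in the Web",
        "Author":      "J. Doe",
        "Publisher":   "Example University",
        "Other Agent": "A. Smith (editor)",
        "Date":        "1995-05-01",
        "Object Type": "article",
        "Form":        "PostScript file",
        "Identifier":  "0042",
        "Relation":    "revision of Identifier 0017",
        "Source":      "technical report TR-94-07",
        "Language":    "English",
        "Coverage":    "the WWW, 1993-1995",
    }

    # With records like this, a query front end can match on attributes
    # instead of raw text, which is what opens the door to SQL-style queries.
    english_docs = [r for r in [record] if r["Language"] == "English"]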
WebRobots Future Implementations - AI
- User preferences - keep track of a user's personalised preferences and adapt to their needs
- User feedback - let users mark which documents are relevant or irrelevant, and build new queries based on this information
- Dynamic link creation - add links at the bottom of a document to related documents, or to discussions on a phrase in the current document

WebRobots Future Implementations - CDCM
- Cache/mirror data for faster access and for indexing
- When - when the user requests it, or at the initiative of the maintainer of the cache
- What - frequently used documents, and those which aren't already cached
- Who has access to the copies - the proper cache is selected according to user preferences
- How to present the copies - users shouldn't realize that they are looking at cached copies

Costs & Dangers

Network resource & server load
- Robots require considerable bandwidth over the Internet, disadvantaging corporate users
- Heavy load on a server results in a lower service rate and frequent server crashes
- Remote parts of the network are also affected
- The continuous requests that robots usually make ("rapid fire") are even more dangerous
- The current HTTP protocol cannot cope efficiently with such robot traffic
- A bad implementation of the robot's traversal algorithm may lead to the robot running in loops

Updating overhead
- No efficient change-control mechanism exists for documents that change after they have been cached

Client-side robot disadvantages
- Once distributed, bugs cannot be fixed
- No knowledge of problem areas can be added
- No new, more efficient facilities can be taken advantage of, as not everyone will upgrade to the latest version

Writing WebRobots Guidelines

Reconsider
- Make sure you really need a robot for your task, weighing all the disadvantages above

Be accountable
- Identify your WebRobot
- Identify yourself to the user community, in case the robot abuses a server
- Announce the robot to the target server
- Be informative to the target server
- Monitor the robot continuously, and stop it if you are not present
- Notify your authorities

Test locally
- Run the robot repeatedly on local servers before going off-site

Don't hog resources
- Slow the robot down and be friendly to remote servers (don't put a lot of strain on them)
- Use HEAD instead of GET where possible (see the fetch-loop sketch at the end of this document)
- Ask only for what you want (use the Accept field)
- Check URLs for correctness, including semantic correctness
- Check the results returned by the server
- Don't loop or repeat
- Run at opportune times (some servers indicate preferred timings)
- Don't run the robot often; frequent runs put too much load on servers
- Don't try queries (i.e. ones using ISINDEX)

Stay with it
- Log all data pertaining to errors, successes, etc.
- Be interactive with your robot
- Be prepared to respond to all sorts of enquiries
- Be understanding; your needs might not be others' needs

Share results
- Keep your results; you might save others time and resources
- Report errors and make them available on the Web
- Set up an FTP or Web site for obtaining the raw and polished results of running the robot
- Support Robot Exclusion

Writing WebRobots - Robot Exclusion Principle
- Proposed because of the need for an established mechanism by which WWW servers can indicate to robots which of their documents may be accessed

Method
- Create a file on the server which specifies an access policy for robots
- This file must be accessible via HTTP at the local URL "/robots.txt"
- The file mainly contains 'User-agent' and 'Disallow' fields

User-agent
- The value of this field is the name of a robot
- Additional User-agent fields define access policies for additional robots
- A robot should be liberal in interpreting this field
- A value of '*' means all robots

Disallow
- The value specifies a partial URL that is not to be visited
- An empty value implies that all URLs can be retrieved

Example "/robots.txt" file
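The file below is a representative sketch rather than one taken from a real server; the robot name and directory names are invented. It shuts one misbehaving robot out of the whole server and keeps all other robots out of two directories, while leaving everything else open to them:

    # Hypothetical /robots.txt - robot name and paths are invented for illustration
    User-agent: badrobot
    Disallow: /

    User-agent: *
    Disallow: /tmp/
    Disallow: /private/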
Tools

libwww-perl
- A set of Perl 4 packages which provides a simple and consistent programming interface to the WWW
- Contains packages for supporting Robot Exclusion, testing URLs, and other basic robot tasks
- Available freely

Tools based on libwww-perl
- MOMSpider - the Multi-Owner Maintenance Spider
- w3new - a hotlist sorter and modifier
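To close, the guidelines and the Robot Exclusion Principle can be tied together in a short sketch. It is written in Python with the standard library rather than libwww-perl, and the robot name, contact address, starting URL and delay are invented for illustration: the loop reads "/robots.txt" first, identifies itself in every request, asks only for HTML via the Accept field, checks the result of a HEAD request before issuing a GET, and pauses between requests to avoid rapid fire.

    # Minimal sketch of a "polite" robot fetch, following the guidelines above.
    # The robot name, contact address, start URL and delay are invented.
    import time
    import urllib.robotparser
    from urllib.parse import urljoin
    from urllib.request import Request, urlopen

    USER_AGENT = "ExampleRobot/0.1 (+mailto:robot-admin@example.org)"  # identify your robot
    START_URL = "http://www.example.org/"
    DELAY_SECONDS = 60  # be friendly: leave a long pause between requests

    # Support Robot Exclusion: read /robots.txt before fetching anything else.
    robots = urllib.robotparser.RobotFileParser(urljoin(START_URL, "/robots.txt"))
    robots.read()

    def polite_fetch(url):
        """Fetch one URL, or return None if the guidelines say not to."""
        if not robots.can_fetch(USER_AGENT, url):
            return None  # the server has excluded robots from this URL
        headers = {"User-Agent": USER_AGENT, "Accept": "text/html"}
        # Use HEAD instead of GET first, and check the server's answer.
        with urlopen(Request(url, method="HEAD", headers=headers)) as response:
            content_type = response.headers.get("Content-Type", "")
            if not content_type.startswith("text/html"):
                return None  # not a document we want; skip the GET entirely
        with urlopen(Request(url, headers=headers)) as response:
            body = response.read()
        time.sleep(DELAY_SECONDS)  # no rapid fire
        return body

    if __name__ == "__main__":
        page = polite_fetch(START_URL)
        print("fetched" if page else "skipped", START_URL)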