Introduction

Problems addressed: information search in large, distributed systems like the WWW
- Large number of resources available on the Web
- Manual traversal of the Web is virtually impossible
- Automatic traversal cannot locate and track exactly what we are looking for
- No pre-defined data set or fields for a document on the Web
- Improperly maintained Web sites produce dead links
- Contents of the resources change, and so do the hyperlinks
- The information resources and Web servers keep increasing in number daily, in an exponential fashion

WebRobots - A Solution

What are they?
- Programs that traverse the Web hypertext structure automatically
- Also called Spiders, Web Worms or Web Wanderers
- Do some useful work while they traverse the Web, such as retrieving documents recursively

WebRobots Uses
- Statistical Analysis - count the number of Web servers, the average number of documents per server, etc.
- Maintenance - assist authors in locating dead links, and help maintain the content, structure, etc. of a document
- Mirroring - cache an entire sub-space of a Web server, allowing load sharing, faster access, etc.
- Resource Discovery - summarize large parts of the Web, and provide access to them

WebRobots Implementation Issues - Past
- Browsing - a small list was maintained at CERN for browsing through all available servers
- Listing - lists of references to resources on the Web, such as the HCC NetServices List and the NCSA MetaIndex
- Searching - searchable databases such as the GNA Meta Library

WebRobots Implementation Issues - Present
- Automatic Collection - automatically retrieve a fixed set of documents that the robot has been programmed to parse regularly, like the NCSA What's New list
- Automatic Discovery - exploit automation: analyze and store the documents encountered
- Automatic Indexing - automatically index the set of documents cached

WebRobots Implementation Issues - Future
- Natural Language Processing (NLP)
- Distributed Queries (DQ)
- Meta-Data Format (MDF)
- Artificial Intelligence (AI)
- Client-Side Data Caching/Mining (CDCM)
- A combination of the above technologies

WebRobots Future Implementations - NLP
- Robots should be user-friendly and allow natural language expressions
- Documents should be treated as a linear list of words for searching (e.g. WAIS)
- Improved accuracy and a richer representation of the search output

WebRobots Future Implementations - DQ
- Conventional robots take a lot of time to traverse the Web and locate information
- DQ enables faster processing and low response times
- Larger amounts of data are processed in a short time
- Lets the user run a distributed search across several databases at once
- This is also called "the Webants approach"

WebRobots Future Implementations - MDF
- Relational attributes pave the way for NLP and SQL queries
- A MetaData Format along the following lines is most suitable (an example record follows the field list)
- Subject: the topic addressed
- Title: the name of the object
- Author: the person primarily responsible for the content
- Publisher: the agent who made the object available
- Other Agent: editors, transcribers, etc.
- Date: the date of publication
- Object Type: novel, poem, dictionary, etc.
- Form: PostScript file, Windows text file, etc.
- Identifier: a unique identifier (number) for the object
- Relation: relationship to other objects
- Source: objects from which this one is derived
- Language: the language of the intellectual content
- Coverage: spatial locations and temporal durations characteristic of the object
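To make the format concrete, here is one way such a record could be represented. This is a sketch only: the document described, all of the field values, and the use of a Python dictionary are invented for illustration; only the field names come from the list above.

    # Hypothetical metadata record using the fields listed above.
    # The document and every value shown are invented for illustration.
    record = {
        "Subject":     "Web robots and resource discovery",
        "Title":       "Robots in the Web",
        "Author":      "J. Doe",
        "Publisher":   "Example University",
        "Other Agent": "A. Smith (editor)",
        "Date":        "1995-05-01",
        "Object Type": "article",
        "Form":        "PostScript file",
        "Identifier":  "0042",
        "Relation":    "revision of Identifier 0017",
        "Source":      "technical report TR-94-07",
        "Language":    "English",
        "Coverage":    "the WWW, 1993-1995",
    }

    # With records like this, a query front end can match on attributes
    # instead of raw text, which is what opens the door to SQL-style queries.
    english_docs = [r for r in [record] if r["Language"] == "English"]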
WebRobots Future Implementations - AI
- User preferences - keep track of a user's personalised preferences and adapt to their needs
- User feedback - let users mark which documents are relevant or irrelevant, and build new queries based on this information
- Dynamic link creation - add links at the bottom of a document to related documents, or to discussions on a phrase in the current document

WebRobots Future Implementations - CDCM
- Cache/mirror data for faster access and for indexing
- When - when the user requests it, or at the initiative of the maintainer of the cache
- What - frequently used documents, and those which aren't already cached
- Who has access to the copies - the proper cache is selected according to user preferences
- How to present the copies - users shouldn't realize that they are looking at cached copies

Costs & Dangers

Network resource & server load
- Robots require considerable bandwidth over the Internet, disadvantaging corporate users
- Heavy load on a server results in a lower service rate and frequent server crashes
- Remote parts of the network are also affected
- The continuous requests that robots usually make ("rapid fire") are even more dangerous
- The current HTTP protocol cannot cope efficiently with such robot traffic
- A bad implementation of the robot's traversal algorithm may lead to the robot running in loops

Updating overhead
- No efficient change-control mechanism exists for documents that change after they have been cached

Client-side robot disadvantages
- Once distributed, bugs cannot be fixed
- No knowledge of problem areas can be added
- No new, more efficient facilities can be taken advantage of, as not everyone will upgrade to the latest version

Writing WebRobots Guidelines

Reconsider
- Make sure you really need a robot for your task, weighing all the disadvantages above

Be accountable
- Identify your WebRobot
- Identify yourself to the user community, in case the robot abuses a server
- Announce the robot to the target server
- Be informative to the target server
- Monitor the robot continuously, and stop it if you are not present
- Notify your authorities

Test locally
- Run the robot repeatedly on local servers before going off-site

Don't hog resources
- Slow the robot down and be friendly to remote servers (don't put a lot of strain on them)
- Use HEAD instead of GET where possible (see the fetch-loop sketch at the end of this document)
- Ask only for what you want (use the Accept field)
- Check URLs for correctness, including semantic correctness
- Check the results returned by the server
- Don't loop or repeat
- Run at opportune times (some servers indicate preferred timings)
- Don't run the robot often; frequent runs put too much load on servers
- Don't try queries (i.e. ones using ISINDEX)

Stay with it
- Log all data pertaining to errors, successes, etc.
- Be interactive with your robot
- Be prepared to respond to all sorts of enquiries
- Be understanding; your needs might not be others' needs

Share results
- Keep your results; you might save others time and resources
- Report errors and make them available on the Web
- Set up an FTP or Web site for obtaining the raw and polished results of running the robot
- Support Robot Exclusion

Writing WebRobots - Robot Exclusion Principle
- Proposed because of the need for an established mechanism by which WWW servers can indicate to robots which of their documents may be accessed

Method
- Create a file on the server which specifies an access policy for robots
- This file must be accessible via HTTP at the local URL "/robots.txt"
- The file mainly contains 'User-agent' and 'Disallow' fields

User-agent
- The value of this field is the name of a robot
- Additional User-agent fields define access policies for additional robots
- A robot should be liberal in interpreting this field
- A value of '*' means all robots

Disallow
- The value specifies a partial URL that is not to be visited
- An empty value implies that all URLs can be retrieved

Example "/robots.txt" file
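The file below is a representative sketch rather than one taken from a real server; the robot name and directory names are invented. It shuts one misbehaving robot out of the whole server and keeps all other robots out of two directories, while leaving everything else open to them:

    # Hypothetical /robots.txt - robot name and paths are invented for illustration
    User-agent: badrobot
    Disallow: /

    User-agent: *
    Disallow: /tmp/
    Disallow: /private/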
Tools

libwww-perl
- A set of Perl 4 packages which provides a simple and consistent programming interface to the WWW
- Contains packages for supporting Robot Exclusion, testing URLs, and other basic robot tasks
- Available freely

Tools based on libwww-perl
- MOMSpider - the Multi-Owner Maintenance Spider
- w3new - a hotlist sorter and modifier
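To close, the guidelines and the Robot Exclusion Principle can be tied together in a short sketch. It is written in Python with the standard library rather than libwww-perl, and the robot name, contact address, starting URL and delay are invented for illustration: the loop reads "/robots.txt" first, identifies itself in every request, asks only for HTML via the Accept field, checks the result of a HEAD request before issuing a GET, and pauses between requests to avoid rapid fire.

    # Minimal sketch of a "polite" robot fetch, following the guidelines above.
    # The robot name, contact address, start URL and delay are invented.
    import time
    import urllib.robotparser
    from urllib.parse import urljoin
    from urllib.request import Request, urlopen

    USER_AGENT = "ExampleRobot/0.1 (+mailto:robot-admin@example.org)"  # identify your robot
    START_URL = "http://www.example.org/"
    DELAY_SECONDS = 60  # be friendly: leave a long pause between requests

    # Support Robot Exclusion: read /robots.txt before fetching anything else.
    robots = urllib.robotparser.RobotFileParser(urljoin(START_URL, "/robots.txt"))
    robots.read()

    def polite_fetch(url):
        """Fetch one URL, or return None if the guidelines say not to."""
        if not robots.can_fetch(USER_AGENT, url):
            return None  # the server has excluded robots from this URL
        headers = {"User-Agent": USER_AGENT, "Accept": "text/html"}
        # Use HEAD instead of GET first, and check the server's answer.
        with urlopen(Request(url, method="HEAD", headers=headers)) as response:
            content_type = response.headers.get("Content-Type", "")
            if not content_type.startswith("text/html"):
                return None  # not a document we want; skip the GET entirely
        with urlopen(Request(url, headers=headers)) as response:
            body = response.read()
        time.sleep(DELAY_SECONDS)  # no rapid fire
        return body

    if __name__ == "__main__":
        page = polite_fetch(START_URL)
        print("fetched" if page else "skipped", START_URL)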