Web Search Tutorial
Presentation at HPDC95, Pentagon City, August 1, 1995
Gang Cheng, Srinivas Polisetty
Presented by Geoffrey Fox
NPAC -- Syracuse University, 111 College Place, Syracuse NY 13244-4100

Abstract of Web Search Presentation
This tutorial was prepared for the HPDC-4 conference. It starts with motivation and the identification of the four components of a Web search system -- information gathering and filtering, indexing, searching, and the user interface. Web robots (gatherers) are reviewed, followed by a detailed discussion of three examples: Lycos, freeWAIS and Harvest; the associated demonstrations also include Oracle free-text search. We end with a discussion of future technologies, including natural-language front ends, distributed queries, metadata, caching and artificial intelligence.

Motivations
- Information discovery: locate relevant sources with reasonable effort and time
- Cache/replicate information to alleviate regional network and server overload
- A unified Web search interface to many different information resources, e.g. HTTP, FTP, Gopher, WAIS, Usenet newsgroups and various on-line databases

Challenges
Data volume:
- Estimated total size of the Web: 30 GB, 5 million documents, growing daily
- Requires more sophisticated search interfaces than browsing and organizing hyperlinks
Data diversity:
- The Web is a huge distributed database of unstructured, non-relational, hierarchical (multimedia) information entities in various data formats: MIME -- HTML, plain text, PostScript, LaTeX, images, audio/video clips, etc.
- Web repositories are heterogeneous, inconsistent and incomplete
User base:
- Different requirements in query patterns, search topics and response time
- Rapid growth in the number of users and in daily search requests

Major components in a Web search system
An information gathering and filtering subsystem:
- Gathers source data from remote/local Web repositories into a local indexing database
- Sources (searchable text-based information on the Internet): Web space, including information on HTTP, FTP, Gopher, WAIS and other servers; Usenet newsgroups; various on-line databases, e.g. periodicals
- Performs information conversion and extraction
- Deals with updates of the sources
- Usually implemented by a "Web robot"
An indexer:
- Its meta-data determine the scope and accuracy of the search engine and the size of the indexed database
- Common indexing schemes: object names, selected keywords, full text
- Pre-computed indexing; automatic indexers

Major components in a Web search system (cont.)
A search engine:
- Built on the indexed database
- Client-based vs. server-based
- Common search capabilities: keyword-based, attribute-based, content-based, concept-based and NLP (Natural Language Processing)-based search, SQL queries (with an RDBMS server), etc.
A search interface:
- HTML form based
- CGI gateway from the HTTP server to the search engine
- User-customizable (for forming queries and presenting query results)

Web Robots (also called Spiders, Web Worms or Web Wanderers)
Definition: Web robots are a class of software programs that traverse network hosts gathering information from and about resources -- lists, information and collections. (A minimal robot sketch follows this slide.)
Problems addressed:
- There is no pre-defined data set or fields for a document on the Web
- Manual traversal of the Web is virtually impossible
- The contents of resources change, and so do their hyperlinks
- Improperly maintained Web sites produce dead links
Limitation: information generated "on the fly" (e.g. by CGI scripts) cannot be retrieved by Web robots
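As a tutorial aside, the robot idea above can be made concrete with a short sketch. This is a minimal, illustrative single-host robot in Python (Python is used for all sketches in these notes, purely for exposition -- it is not the implementation of any system discussed); the seed URL, page limit and delay are assumptions, and a real robot should also honor the robots.txt exclusion convention.

    # Minimal web robot sketch: breadth-first traversal of one host.
    # Illustrative only -- seed URL, page limit and delay are made up.
    import time
    import urllib.request
    from collections import deque
    from html.parser import HTMLParser
    from urllib.parse import urljoin, urlparse

    class LinkParser(HTMLParser):
        """Collect href targets from anchor tags."""
        def __init__(self):
            super().__init__()
            self.links = []
        def handle_starttag(self, tag, attrs):
            if tag == "a":
                for name, value in attrs:
                    if name == "href" and value:
                        self.links.append(value)

    def crawl(seed, max_pages=20, delay=1.0):
        """Visit up to max_pages URLs on the seed's host, politely."""
        host = urlparse(seed).netloc
        seen, queue, pages = {seed}, deque([seed]), {}
        while queue and len(pages) < max_pages:
            url = queue.popleft()
            try:
                with urllib.request.urlopen(url, timeout=10) as resp:
                    html = resp.read().decode("utf-8", errors="replace")
            except OSError:
                continue                  # dead link: skip (or report it)
            pages[url] = html
            parser = LinkParser()
            parser.feed(html)
            for href in parser.links:
                absolute = urljoin(url, href)
                if urlparse(absolute).netloc == host and absolute not in seen:
                    seen.add(absolute)
                    queue.append(absolute)
            time.sleep(delay)             # avoid "rapid fire" requests
        return pages

The seen set guards against the looping danger, and the per-request delay avoids the "rapid fire" behavior, both discussed on the following slides.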
Major uses of Web Robots
- Resource discovery: summarize large parts of the Web and provide access to them
- Statistical analysis: count the number of Web servers, the average number of documents per server, etc.
- Mirroring: cache an entire sub-space of a Web server, allowing load sharing, faster access, etc.
- Maintenance: assist authors in locating dead links, and help maintain the content, structure, etc. of a document

Implementation Issues of Web Robots
Past implementations:
- Browsing: a small list was maintained at CERN for browsing through all available servers
- Listing: lists of references to resources on the Web, such as Yahoo, the HCC NetServices List, the NCSA MetaIndex, etc.
- Searching: searchable databases such as the GNA Meta Library
Present implementations:
- Automatic collection: automatically retrieve a fixed set of documents that the robot has been programmed to parse regularly, like the NCSA What's New list
- Automatic discovery: exploit automation; analyze and store the documents encountered
- Automatic indexing: automatically index the set of documents cached
A Web robot can be a standalone program plugged into a search system, or a built-in gathering component of an indexing/search engine

Costs & Dangers of Using Web Robots
Network resource & server load:
- Robots require considerable bandwidth over the Internet, disadvantaging corporate users
- Heavy load on a server results in a lower service rate and frequent server crashes
- Remote parts of the network are also affected
- Continuous requests, which robots usually make ("rapid fire"), are even more dangerous
- The current HTTP protocol has no efficient defense against such robot attacks

Costs & Dangers of Using Web Robots (cont'd)
- A bad implementation of the robot traversal algorithm may lead to the robot running in loops
Updating overhead:
- No efficient change-control mechanism for URL documents that change after caching
Client-side robot disadvantages:
- Once distributed, bugs cannot be fixed
- No knowledge of problem areas can be added
- No new, more efficient facilities can be taken advantage of, as not everyone will upgrade to the latest version

Examples of Web Search Systems
- Lycos
- FreeWAIS
- Harvest

Web Search Demos
The demo set has three different search systems searching the same database, G. Fox's newest book "Parallel Computing Works!":
- Harvest
- WWWWAIS
- Oracle Text*Retrieval

Lycos
- Developed at Carnegie Mellon's Center for Machine Translation in the School of Computer Science
- Hardware architecture: 7 workstations used as server engines, independently accessing the same index database using replicated data on local disks
- Gathering software: written in Perl; uses C and libwww to fetch URLs
- SCOUT indexer: full boolean retrieval with spelling correction and approximate phonetic matching
- Search capability: document title, heading, link and keyword; approximate matching and probabilistic retrieval

Total Volume and Data Capture in Lycos
- Descriptors (title, heading and first 20 lines) of 4.24 million (large base) or 505k (small base) URLs
- Keywords from 767,000 documents (more than 5.4 GB of text), updated each day, 5,000 new documents/day
- 16 million queries answered so far, 175,000 hits/week
- Total of 1,493,787 URLs from 23,550 unique HTTP servers
- Average size of a downloaded text file: 7,920 characters
- Current total size of the index database: 737 MB
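As a tutorial aside: SCOUT itself is proprietary, but the keyword-indexing scheme it represents can be sketched. Below is a toy inverted index with boolean AND retrieval and a crude term-frequency ranking; the sample documents and the tokenization rule are assumptions, not Lycos internals.

    # Toy inverted index: keyword-based indexing and boolean AND search.
    # A real indexer would also keep titles, headings and positions.
    import re
    from collections import defaultdict

    def tokenize(text):
        return re.findall(r"[a-z0-9]+", text.lower())

    def build_index(docs):
        """Map each term to {doc_id: term frequency}."""
        index = defaultdict(dict)
        for doc_id, text in docs.items():
            for term in tokenize(text):
                index[term][doc_id] = index[term].get(doc_id, 0) + 1
        return index

    def search(index, query):
        """Boolean AND over query terms, ranked by summed term frequency."""
        postings = [index.get(term, {}) for term in tokenize(query)]
        if not postings:
            return []
        hits = set.intersection(*(set(p) for p in postings))
        return sorted(hits, key=lambda d: -sum(p.get(d, 0) for p in postings))

    docs = {"u1": "parallel computing works", "u2": "web search and computing"}
    idx = build_index(docs)
    print(search(idx, "computing works"))   # -> ['u1']

A production engine layers spelling correction, phonetic matching and field-aware weighting on top of this core structure.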
Data Content in Lycos
- Web URLs, including HTTP, FTP and Gopher servers. Not included: WAIS, Usenet news, Telnet services, email, CGI scripts
- For each URL: title, headings, subheadings, the 100 most "weighty" words, first 20 lines, size in bytes, number of words
- Performance-related info: response time is frequently > 20 sec., even with 7-10 servers running full time
- Limitations: poor response time (or outright rejection) under high query load, due to lack of scale-up, no parallelization, and the replicated-database approach

FreeWAIS - Wide Area Information Server
- Internet indexing and search engine suitable for use on any set of textual documents
- Client-server model allows effective remote and local access
- X client available (xwais)
- WWW cgi-bin gateway interface available (wwwwais) as a form-based Web interface
- The waisindex command builds index data for any Unix directory tree of text documents

Indexing in FreeWAIS
- Index data can be bulky: pure text documents require almost the same file space again for their index data
- The index process can be run as a daemon, e.g. running nightly
- A document + index set can be registered with the central WAIS registry, making the information available to the Internet community
- The docroot of a WWW server can be indexed or partially indexed
- A disadvantage of freeWAIS compared with Harvest, for example, is that it needs access to the directory of Unix files to build index data. Although remote documents/index data can be accessed, you cannot index someone else's WWW server with freeWAIS, unlike Harvest

Search in FreeWAIS
- Queries are lists of keywords. freeWAIS does not allow complex query formulation, but the commercial WAIS product is Z39.50 compliant and does allow Boolean and other complex query formulation
- freeWAIS is entirely based on free-text search. It has no ability to recognize special data fields in documents. This can be a problem if documents do not have meaningful filenames, as no title can be returned by the search process
- Search results are scored according to a matching criterion and can be ordered by search score
- All these deficiencies are addressed by the commercial WAIS product

Harvest System
Major components:
- Efficient distributed gathering architecture with customizable content extraction in each Gatherer
- Topic/community-focused Brokers
- Plug-and-play index/search engine in each Broker
- Network-aware caching and replication for scalable access
- Structured content-summary interchange protocol

Harvest Architecture

Harvest Overview
- Gatherer: collects info from resources available at Provider sites
- Broker: retrieves indexing info from Gatherers, suppresses duplicate info, indexes the collected info, and provides a WWW interface to it
- Replicator: replicates Brokers around the Internet
- Users retrieve located info through the Cache

Harvest Gatherer
- Collection specified as enumerated URLs, with stop lists and other parameters
- Actions performed: enumeration, retrieval, unnesting (compression, tar, etc.), type-specific summarizing, content-summary serving
- Customized content extraction (Essence)
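As a tutorial aside, here is a rough sketch of what "type-specific summarizing" means for one type, HTML: reduce a raw object to a compact content summary. The choice of fields (title, headings, first 20 lines, size) mirrors the Lycos descriptors above; the regular expressions are illustrative assumptions, not Essence's actual extraction scripts.

    # Sketch of type-specific summarizing, in the spirit of Essence:
    # reduce a raw HTML object to a small content summary for indexing.
    import re

    def summarize_html(url, html, first_lines=20):
        """Extract title, headings and leading text from one HTML page."""
        title = re.search(r"<title[^>]*>(.*?)</title>", html,
                          re.IGNORECASE | re.DOTALL)
        headings = re.findall(r"<h[1-6][^>]*>(.*?)</h[1-6]>", html,
                              re.IGNORECASE | re.DOTALL)
        text = re.sub(r"<[^>]+>", " ", html)       # crude tag stripping
        lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
        return {
            "url": url,
            "title": title.group(1).strip() if title else "",
            "headings": [h.strip() for h in headings],
            "first-lines": lines[:first_lines],
            "size-in-bytes": len(html.encode("utf-8")),
        }

    page = "<html><title>Demo</title><h1>Intro</h1><p>Hello world</p></html>"
    print(summarize_html("http://example.org/demo.html", page))

A Gatherer would stream such summaries to a Broker rather than print them.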
Summary Object Interchange Format (SOIF)
- A type of meta-data
- The Gatherer-Broker protocol
- A structured object-summary stream
- Arbitrary binary data & extensible fields
- Type + URL + set of attribute-value pairs (an illustrative record sketch appears after the Harvest implementation slides below)
- Easy interoperation with other standards for communication/data exchange

Harvest Broker
- Fewer than a dozen routines define the index/search API
- Current engines: freeWAIS, Glimpse, Nebula, Verity, WAIS
- Current query interface: WWW
- Customizable result formatting
- MD5-based duplicate removal (MD5: an algorithm for computing a 128-bit "digest" of arbitrary-length data, such that any alteration of the data is reflected in the digest)
- Provides data for indexing from other Brokers as well as from itself

Distributed Gatherer-Broker Arrangement

Index & Search in Harvest
- General Broker-Indexer interface to accommodate a variety of index/search engines
- Can use many backends (WAIS, Ingres, etc.)
- Implemented using two specialized backends: Glimpse for space efficiency & flexible interactive queries; Nebula for time efficiency & complex standing queries

Cache in Harvest
- Alleviates bottlenecks on Web servers caused by popular objects
- Improves gathering efficiency when fetching data from the file system
- Two modes: a standalone HTTPD accelerator, and an object-cache hierarchy using the Mosaic proxy interface

Performance of Harvest -- Gatherer
Server load efficiency:
- Scans objects periodically & builds a cache, allowing info to be retrieved in a single stream
- Retrieving data via the Harvest Gatherer's cache imposes 6,600 times less server load than retrieving it via the local file system
Network traffic efficiency:
- Supports incremental updates
- Transmitting only the content summary is 59 times faster than transmitting full-text indexes

Performance of Search in Harvest -- Glimpse
- Booleans, regular expressions, approximate searches
- Incremental indexing
- Searches take several seconds
- Indexing ~6,000 files (~70 MB) on a SparcStation IPC: incremental indexing in 2-4 min., complete indexing in 20 min.

Implementations of Harvest
- Standalone: a complete information resource discovery and access system
- Archie, Veronica, WWW, etc.: Gatherer configuration + Essence extraction script
- Content Router: WAIS enumerator + Essence extraction script

Implementations of Harvest with Other Systems -- Continued
- WAIS: Essence full-text extraction + ranking search engine
- WebCrawler: Gatherer configuration + Essence full-text extraction
- WHOIS++: Gatherer configuration + Essence extraction script for centroids; query front end to search SOIF records
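As a tutorial aside, the SOIF and MD5 ideas above can be combined in a short sketch: serialize an object summary as Type + URL + attribute-value pairs, and detect duplicate objects by digest. The record syntax shown merely approximates the description on the SOIF slide; it is not the normative SOIF grammar.

    # Sketch: write a SOIF-like record (Type + URL + attribute-value pairs)
    # and flag duplicate objects by MD5 digest, as a Harvest Broker does.
    import hashlib

    def soif_record(template_type, url, attributes):
        """Serialize one object summary in an approximated SOIF layout."""
        lines = ["@%s { %s" % (template_type, url)]
        for name, value in attributes.items():
            # Each value carries its length, so values may hold any data.
            lines.append("%s{%d}:\t%s" % (name, len(value), value))
        lines.append("}")
        return "\n".join(lines)

    def is_duplicate(seen_digests, content):
        """Flag objects already seen, using MD5's 128-bit content digest."""
        digest = hashlib.md5(content.encode("utf-8")).hexdigest()
        if digest in seen_digests:
            return True
        seen_digests.add(digest)
        return False

    summary = {"title": "Parallel Computing Works!", "file-size": "1048576"}
    print(soif_record("FILE", "http://example.org/pcw.html", summary))

    seen = set()
    print(is_duplicate(seen, "same body"), is_duplicate(seen, "same body"))
    # -> False True

Because any alteration of the data changes the digest, identical digests are a reliable signal of duplicate content.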
Future Technologies in Web Search
- Natural Language Processing (NLP)
- Distributed Queries (DQ)
- Meta-Data Format (MDF)
- Artificial Intelligence (AI)
- Client-Side Data Caching/Mining (CDCM)
- A combination of the above technologies

Future Technologies in Web Search - NLP
- The Web interface and the Web robot should be more user-friendly, allowing natural-language expressions
- Documents should no longer be treated as a mere linear list of words for search, as they are in e.g. WAIS
- Improved accuracy and richer representation of search output

Future Technologies in Web Search - DQ
- Conventional Web robots take a long time to traverse the Web and locate information
- Distributed queries enable faster processing and low response time
- Larger volumes of data processed in a short time
- Let the user search several databases at once; this is also called "the WebAnts approach"

Future Technologies in Web Search - MDF
- Relational attributes pave the way for NLP and SQL queries
- A metadata format along these lines is most suitable:
  Subject: the topic addressed
  Title: the name of the object
  Author: the person primarily responsible for the content
  Publisher: the agent who made the object available
  Other Agent: editors, transcribers, etc.
  Date: the date of publication
  Object Type: novel, poem, dictionary, etc.
  Form: PostScript file, Windows text file, etc.
  Identifier: a unique identifier (number) for the object
  Relation: relationship to other objects
  Source: objects from which this one is derived
  Language: the language of the intellectual content
  Coverage: spatial locations and temporal durations characteristic of the object

Future Technologies in Web Search - AI
- User preferences: keep track of a user's personalized preferences and adapt to his/her needs
- User feedback: let users mark which documents are relevant/irrelevant and build new queries from this information
- Dynamic link creation: links to related documents at the bottom of a page, and links to discussions of a phrase in the current document

Future Technologies in Web Search - CDCM
- Cache/mirror data for faster access and indexing (a minimal cache sketch follows the robot survey below)
- When: at the user's request, or by the maintainer of the cache
- What: frequently used documents, and those which aren't already cached
- Who has access to copies: the proper cache is selected according to user preferences
- How to present the copies: users shouldn't realize that they are looking at cached copies

More about Web Search Systems and Web Robots -- Yahoo
- Uses listing/browsing as a simplified and restricted (yet more structured) way of indexing/searching Web sites
- Hierarchical lists of URLs by topic: thousands of increasingly refined topic-specific areas
- Search by browsing the predefined links; keyword-based title search is available on the local Web site
- Information gathering by manual categorization or user submission (it is not clear how Yahoo discovers all the listed Web sites)
- Yahoo is a Web site, not a search service

More about Web Search Systems and Web Robots
- InfoSeek: a popular commercial net search provider
- WebCrawler: gathers indexes of the full contents of documents, as well as URLs and titles
- WWW Worm: gathers information about titles and URLs from Web servers
- WebAnts: a project for cooperating, distributed Web spiders
- JumpStation: indexes the titles and headers of documents on the Web
- MOMspider: a spider that you can install on your own system (Unix/Perl)
- NIKOS: allows a topic-oriented search of a spider database
- RBSE URL: a database of URL references, with full WAIS indexing of document contents
- SG-Scout: a robot for finding Web servers
- Wandex: an index from the World Wide Web Wanderer
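As promised under the CDCM slide, here is a minimal client-side document cache sketch: all fetches go through the cache, so users need not know whether they are reading a cached copy. The capacity, the least-recently-used eviction rule and the fetch callback are illustrative assumptions.

    # Minimal client-side cache sketch for the CDCM idea above: serve
    # frequently used documents locally, evicting the least recently used.
    from collections import OrderedDict

    class DocumentCache:
        def __init__(self, fetch, capacity=100):
            self.fetch = fetch            # callback that retrieves a URL
            self.capacity = capacity
            self.store = OrderedDict()    # url -> document, in LRU order

        def get(self, url):
            """Return a document, transparently using the cached copy."""
            if url in self.store:
                self.store.move_to_end(url)      # mark as recently used
                return self.store[url]
            document = self.fetch(url)           # cache miss: go remote
            self.store[url] = document
            if len(self.store) > self.capacity:
                self.store.popitem(last=False)   # evict least recently used
            return document

    cache = DocumentCache(fetch=lambda url: "<html>%s</html>" % url, capacity=2)
    cache.get("http://example.org/a")    # remote fetch
    cache.get("http://example.org/a")    # served from the cache

Note that get never exposes whether the copy was local, matching the presentation goal stated on the CDCM slide.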
Recent Capital Ventures in the Web Search Business
- America Online ---> (acquisition) WebCrawler (U. of Washington), WAIS (WAIS Inc.)
- InfoSeek ---> (the first commercialized Web search provider) net search service charged on a pay-per-document basis
- Sequoia Capital ---> (capital venture) Yahoo (Stanford U.)
- CMG@Ventures ---> (capital venture) Lycos (Carnegie Mellon U.)
- Microsoft Inc. ---> (licensed) Lycos (CMU)
- Kleiner Perkins Caufield & Byers and Institutional Venture Partners ---> (capital venture) Architext Inc. (formed by six Stanford graduates)