Web Search

Motivations:
- Information discovery: locate relevant sources with reasonable effort/time
- Cache/replicate information to alleviate regional network and server overhead
- A unified Web search interface to many different information resources,
  e.g. HTTP, FTP, Gopher, WAIS, Usenet newsgroups, and various on-line databases

Challenges:
- Data Volume
  . estimated total Web size: 30 GB, 5 million documents, growing daily
  . requires more sophisticated information search interfaces than browsing and
    organizing hyperlinks
- Data Diversity
  . the Web is a huge distributed database: unstructured, non-relational, hierarchical
  . (multi-media) information entities in various data formats (MIME):
    HTML, plain text, PostScript, LaTeX, images, audio/video clips, etc.
  . Web repositories are heterogeneous, inconsistent, and incomplete
- User Base
  . different requirements in query patterns, search topics, and response time
  . rapid growth in the number of users and in daily search requests

Major components in a Web search system:
- an information gathering and filtering subsystem
  . gathers source data from remote/local Web repositories into a local indexing database
  . sources (searchable text-based information on the Internet):
      the Web space, including information on HTTP, FTP, Gopher, and WAIS servers
      Usenet newsgroups
      various on-line databases, e.g. periodicals
  . information conversion and extraction
  . deals with updates of the sources
  . implemented by a Web Robot
- an indexer
  . produces meta-data
  . determines the scope and accuracy of the search engine, and the size of the
    indexed database
  . common indexing schemes: object names, selected keywords, full text
  . pre-computed indexing; automatic indexers
- a search engine
  . built on the indexed database
  . client-based vs. server-based
  . common search capabilities: keyword-based search, content-based search,
    NLP (Natural Language Processing)-based search, attribute-based search
- a search interface
  . HTML-form based
  . CGI gateway from the HTTP server to the search engine
  . user customizable (for forming queries and presenting query results)

Web Robots (also called Spiders, Web Worms, or Web Wanderers)
- Definition: spiders are a class of software programs that traverse network hosts,
  gathering information from and about resources -- lists, information, and collections
- Problems addressed
  . manual traversal of the Web is virtually impossible
  . there is no pre-defined data set or set of fields for a document on the Web
  . the contents of resources change, as do the hyperlinks
  . improperly maintained Web sites produce dead links
- Major uses
  . Resource Discovery: summarize large parts of the Web and provide access to them
  . Statistical Analysis: count the number of Web servers, the average number of
    documents per server, etc.
  . Mirroring: cache an entire sub-space of a Web server, allowing load sharing,
    faster access, etc.
  . Maintenance: assist authors in locating dead links, and help maintain the content,
    structure, etc. of a document
- Implementation issues
  . Past implementations
      Browsing - a small list was maintained at CERN for browsing through all
        available servers
      Listing - lists of references to resources on the Web, such as the HCC
        NetServices List, the NCSA Meta-Index, etc.
      Searching - searchable databases such as the GNA Meta Library
  . Present
      Automatic Collection - automatically retrieve a fixed set of documents that the
        robot has been programmed to parse regularly, such as the NCSA What's New list
      Automatic Discovery - exploit automation: analyze and store the documents
        encountered (a minimal traversal sketch follows this list)
      Automatic Indexing - automatically index the set of documents cached
  . Future
      Natural Language Processing (NLP)
      Distributed Queries (DQ)
      Meta-Data Format (MDF)
      Artificial Intelligence (AI)
      Client-Side Data Caching/Mining (CDCM)
      a combination of the above technologies
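The automatic collection, discovery, and indexing steps above amount to a breadth-first
traversal of hyperlinks. The following Python sketch shows one minimal way such a robot
loop could look; the seed URL, the document limit, and the regex-based link extraction
are illustrative assumptions, not the implementation of any robot discussed here.

    # Minimal breadth-first Web-robot sketch (illustrative only, not a production crawler).
    # Assumes a reachable seed URL and naive regex-based HTML link extraction.
    import re
    import urllib.request
    from collections import deque

    HREF_RE = re.compile(r'href="(http[^"]+)"', re.IGNORECASE)

    def crawl(seed_url, max_docs=10):
        """Traverse hyperlinks breadth-first, returning {url: document text}."""
        seen, queue, collected = {seed_url}, deque([seed_url]), {}
        while queue and len(collected) < max_docs:
            url = queue.popleft()
            try:
                with urllib.request.urlopen(url, timeout=10) as resp:
                    text = resp.read().decode("utf-8", errors="replace")
            except OSError:
                continue                        # dead link: skip (cf. Maintenance use above)
            collected[url] = text               # Automatic Collection: keep the document
            for link in HREF_RE.findall(text):  # Automatic Discovery: follow new links
                if link not in seen:
                    seen.add(link)
                    queue.append(link)
        return collected

A real robot would add politeness delays, duplicate suppression, and persistent storage of
what it gathers; those aspects are discussed in the following sections.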
- Future Implementations - NLP
  . robots should be user-friendly and allow natural-language expressions
  . documents should be treated as a linear list of words for search (e.g. WAIS)
  . improved accuracy and a richer representation of the search output
- Future Implementations - DQ
  . conventional robots take a lot of time to traverse and locate information
  . DQ enables faster processing and low response time
  . larger volumes of data can be processed in a short time
  . lets the user run a distributed search over several databases at once
  . this is also called the "WebAnts" approach
  . (see figure ?)
- Future Implementations - MDF
  . relational attributes pave the way for NLP and SQL queries
  . most suitable meta-data format:
      Subject: the topic addressed
      Title: the name of the object
      Author: the person primarily responsible for the content
      Publisher: the agent who made the object available
      Other Agent: editors, transcribers, etc.
      Date: the date of publication
      Object Type: novel, poem, dictionary, etc.
      Form: PostScript file, Windows text file, etc.
      Identifier: a unique identifier (number) for the object
      Relation: relationship to other objects
      Source: objects from which this one is derived
      Language: the language of the intellectual content
      Coverage: spatial locations and temporal durations characteristic of the object
- Future Implementations - AI
  . user preferences: keep track of a user's personalized preferences and adapt to
    the user's needs
  . user feedback: let users mark which documents are relevant/irrelevant and build new
    queries based on this information
  . dynamic link creation: link to related documents at the bottom of a page; links to
    discussions on a phrase in the current document
- Future Implementations - CDCM
  . cache/mirror data for faster access and indexing
  . when: at the user's request, or by the maintainer of the cache
  . what: frequently used documents, and those which aren't already cached
  . who has access to the copies: the proper cache is selected according to user preferences
  . how to present the copies: users shouldn't realize that they are looking at cached copies
- Costs & Dangers
  . Network resources & server load
      robots require considerable bandwidth over the Internet, disadvantaging corporate users
      heavy load on a server results in a lower service rate and frequent server crashes
      remote parts of the network are also affected
      continuous requests, which robots usually make ("rapid fire"), are even more dangerous
      (a sketch of one counter-measure follows this list)
      the current HTTP protocol cannot efficiently cope with such robot "attacks"
      a badly implemented traversal algorithm may lead to the robot running in loops
  . Updating overhead
      there is no efficient change-control mechanism for URL documents that change
      after caching
  . Client-side robot disadvantages
      once distributed, bugs cannot be fixed
      no knowledge of problem areas can be added
      no new, more efficient facilities can be taken advantage of, since not everyone
      will upgrade to the latest version
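One simple counter-measure to the "rapid fire" danger noted above is to enforce a minimum
delay between successive requests to the same host. The Python sketch below illustrates the
idea; the 30-second interval and the helper names are arbitrary assumptions rather than
values taken from any of the systems described here.

    # Sketch of a "polite" fetch loop: never hit the same host more than once per interval.
    # The 30-second delay and the function name are illustrative assumptions.
    import time
    import urllib.request
    from urllib.parse import urlparse

    MIN_DELAY = 30.0          # seconds between requests to one host
    _last_hit = {}            # host -> time of the last request

    def polite_fetch(url):
        host = urlparse(url).hostname
        wait = MIN_DELAY - (time.monotonic() - _last_hit.get(host, float("-inf")))
        if wait > 0:
            time.sleep(wait)  # back off instead of "rapid firing" the server
        _last_hit[host] = time.monotonic()
        with urllib.request.urlopen(url, timeout=10) as resp:
            return resp.read()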
Examples of Web Search Systems
- Lycos
- Harvest
- FreeWAIS

Web Search Demos
The demo set is designed so that three different search systems search the same database:
G. Fox's newest book "Parallel Computing Works!".
- Harvest
- WWWWAIS
- Oracle Text*Retrieval

FreeWAIS
- Wide Area Information Server
- an Internet search engine suitable for use on any set of textual documents
- the client-server model allows effective remote and local access
- an X client is available (xwais)
- a WWW cgi-bin gateway interface is available (wwwwais) as a forms interface
- the waisindex command allows construction of index data for any Unix directory tree of
  text documents
- the index data can be bulky: pure text documents will require almost the same file space
  again for the index data
- the indexing process can be run as a daemon, e.g. running nightly
- a document + index set can be registered with the central WAIS registry, making the
  information available to the Internet community
- the docroot of a WWW server can be indexed or partially indexed
- queries are lists of keywords; freeWAIS does not allow complex query formulation, but the
  commercial WAIS product is Z39.50 compliant and does allow Boolean and other complex
  query formulation
- a disadvantage of freeWAIS compared with Harvest, for example, is that it needs access to
  the directory of Unix files to build the index data; although remote documents/index data
  can be accessed, you cannot index someone else's WWW server with freeWAIS, unlike Harvest
- freeWAIS is entirely based on free-text search; it has no ability to recognize special
  data fields in documents, which can be a problem if documents do not have meaningful
  filenames, as no title can then be returned by the search process
- search results are scored according to a matching criterion and can be ordered by score
- all these deficiencies are addressed by the commercial WAIS product

Lycos
- developed at Carnegie Mellon's Center for Machine Translation in the School of
  Computer Science
- Hardware architecture: 7 workstations used as server engines, independently accessing the
  same index database using replicated data on local disks
- Gathering software: written in Perl; uses C and libwww to fetch URLs
- Search capability: document title, heading, link, and keyword approximate matching with
  probabilistic retrieval; SCOUT indexer: full Boolean retrieval with spelling correction
  and approximate phonetic matching
- Total volume and data capture:
  . descriptors (title, headings, and first 20 lines) of 4.24 million (large base) or
    505,000 (small base) URLs
  . keywords from 767,000 documents (more than 5.4 GB of text)
  . updated each day, 5,000 new documents/day
  . 16 million queries answered so far
  . 175,000 hits/week
  . 1,493,787 URLs in total from 23,550 unique HTTP servers
  . average size of a downloaded text file: 7,920 characters
  . current total size of the index database: 737 MB
- Data content
  . Web URLs, including HTTP, FTP, and Gopher servers
    not included: WAIS, Usenet News, Telnet services, Email, CGI scripts
  . for each URL: title, headings, subheadings, the 100 most "weighty" words, the first
    20 lines, size in bytes, and number of words (illustrated in the sketch after
    this section)
- Performance-related info: with 7-10 servers running full time, the response time is
  frequently > 20 sec
- Limitations: poor response time (or outright rejection) under high query load, because
  the system does not scale up: there is no parallelization, and the replicated-database
  approach adds overhead
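To make the Lycos descriptor concrete, the following Python sketch builds a comparable
per-document record: title, the first 20 lines, the 100 "weightiest" words, size in bytes,
and word count. Plain term frequency with a small stop list stands in for Lycos's actual
weighting scheme, which is not specified above, so treat the weights as an assumption.

    # Sketch of a Lycos-style descriptor: title, first 20 lines, top-100 weighted words.
    # Term frequency is an assumed stand-in weight; the real Lycos metric is not given here.
    import re
    from collections import Counter

    STOP_WORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "for"}

    def make_descriptor(title, text, n_words=100, n_lines=20):
        words = re.findall(r"[a-z]+", text.lower())
        weights = Counter(w for w in words if w not in STOP_WORDS)
        return {
            "title": title,
            "first_lines": text.splitlines()[:n_lines],
            "weighty_words": [w for w, _ in weights.most_common(n_words)],
            "size_bytes": len(text.encode("utf-8")),
            "num_words": len(words),
        }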
Harvest System
- Major components
  . efficient distributed Gathering architecture
  . topic/community-focused Brokers
  . customizable content extraction in each Gatherer
  . plug-and-play index/search engine in each Broker
  . network-aware caching and replication for scalable access
  . structured content-summary interchange protocol
- Overview (see figure ?)
  . the Gatherer collects info from resources available at Provider sites
  . the Broker retrieves indexing info from Gatherers, suppresses duplicate info, indexes
    the collected info, and provides a WWW interface to it
  . the Replicator replicates Brokers around the Internet
  . users retrieve located info through the Cache
- Gatherer
  . the collection is specified as enumerated URLs with stop lists and other parameters
  . actions performed:
      enumeration
      retrieval
      unnesting (compression, tar, etc.)
      type-specific summarizing
      content-summary serving
  . customized content extraction (Essence) (see figure ?)
- Summary Object Interchange Format (SOIF)
  . the Gatherer-Broker protocol
  . a structured object-summary stream
  . arbitrary binary data & extensible fields
  . type + URL + set of attribute-value pairs (a sketch appears after this section)
  . easy interoperation with other standards for communication/data exchange
- Broker
  . fewer than a dozen routines define the index/search API
  . current engines: freeWAIS, Glimpse, Nebula, Verity, WAIS Inc.
  . current query interface: WWW
  . customizable result formatting
  . MD5-based duplicate removal
  . provides data for indexing from other Brokers as well as from itself
- Distributed Gatherer-Broker arrangement (see figure ?)
- Index & Search
  . a general Broker-Indexer interface accommodates a variety of index/search engines;
    many backends can be used (WAIS, Ingres, etc.)
  . implemented using two specialized backends:
      Glimpse for space efficiency & flexible interactive queries
      Nebula for time efficiency & complex standing queries
- Cache (see figure ?)
  . alleviates bottlenecks caused by popular objects
  . uses the Mosaic proxy interface
- Replicator (see figure ?)
  . alleviates bottlenecks caused by popular servers
  . minimizes network traffic and server workload
- Performance
  . Gatherer
      server-load efficiency: scans objects periodically & builds a cache, allowing info
      to be retrieved in a single stream; local gatherers give a 6,600-fold saving in
      server load compared with unco-ordinated gathering
      network-traffic efficiency: supports incremental updates; local gathering is
      59 times more efficient than unco-ordinated gathering
  . Glimpse
      Booleans, regular expressions, approximate searches
      incremental indexing
      searches take several seconds
      indexing ~6,000 files (~70 MB) on a SparcStation IPC: incremental indexing in
      2-4 min., complete indexing in 20 min.
- Implementation
  . standalone
  . implementing other systems in terms of Harvest:
      Archie, Veronica, WWW, etc.: Gatherer configuration + Essence extraction script
      Content Router: WAIS enumerator + Essence extraction script
      WAIS: Essence full-text extraction + ranking search engine
      WebCrawler: Gatherer configuration + Essence full-text extraction
      WHOIS++: Gatherer configuration + Essence extraction script for centroids;
        query front-end to search SOIF records
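To illustrate the SOIF idea of "type + URL + attribute-value pairs", here is a Python
sketch that serializes a content summary in a SOIF-like form. The exact syntax shown
(attribute name, value size in braces, tab, value) is only loosely modelled on SOIF, and
the example URL and attribute names are made up for illustration.

    # Sketch of emitting a SOIF-style summary record: a template type, a URL, and
    # attribute-value pairs with explicit value sizes. Loosely modelled on the format
    # described above; the example document and attributes are illustrative.
    def soif_record(template_type, url, attributes):
        lines = ["@%s { %s" % (template_type, url)]
        for name, value in attributes.items():
            data = value.encode("utf-8")
            lines.append("%s{%d}:\t%s" % (name, len(data), value))
        lines.append("}")
        return "\n".join(lines)

    print(soif_record("FILE", "http://example.org/doc.html", {
        "Title": "Parallel Computing Works!",
        "Type": "HTML",
        "Keywords": "parallel computing applications",
    }))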
==============================================================================================

A list of Web Search systems and Web Spiders

Web Spiders (http://web.nexor.co.uk/mak/doc/robots/robots.html)
  --- World Wide Web Robots, Wanderers, and Spiders, by Martijn Koster
Yahoo List Reference (http://www.yahoo.com/Reference/Searching_the_Web/Robots__Spiders__etc_/)
  --- Searching the Web: Robots, Spiders, etc., from Yahoo
WebAnts (http://thule.mt.cs.cmu.edu:8001/webants/)
  --- a project for cooperating, distributed Web spiders
Harvest (http://harvest.cs.colorado.edu)
  --- tools to gather, extract, organize, search, cache, and replicate information
      available over the Net
WebCrawler (http://webcrawler.cs.washington.edu/WebCrawler/WebQuery.html)
  --- gathers indexes of the total contents of documents, as well as URLs and titles,
      by Brian Pinkerton
JumpStation (http://www.stir.ac.uk/jsbin/js/)
  --- indexes the titles and headers of documents on the Web, by Jonathon Fletcher
Lycos (http://lycos.cs.cmu.edu)
  --- uses information metrics to record the 100 most important words in a document, along
      with the first 20 lines, so that users can often determine the value of a WWW
      document without retrieving it
MOMspider (http://www.ics.uci.edu/WebSoft/MOMspider/)
  --- a spider that you can install on your own system (Unix/Perl)
NIKOS (http://www.rns.com/cgi-bin/nikos)
  --- allows a topic-oriented search of a spider database
RBSE URL (http://rbse.jsc.nasa.gov/eichmann/urlsearch.html)
  --- a database of URL references, with full WAIS indexing of the contents of the
      documents, by David Eichmann
SG-Scout (http://www-swiss.ai.mit.edu/~ptbb/SG-Scout.html)
  --- a robot for finding Web servers
Wandex (http://www.mit.edu:8001/cgi/wandex/index)
  --- an index built from the World Wide Web Wanderer, by Matthew Gray
Worm (http://www.cs.colorado.edu/home/mcbryan/WWWW.html)
  --- gathers information about titles and URLs from Web servers, by Oliver McBryan

Other meta-indexes of Web search tools:
  Searching the Web (http://www.yahoo.com/Reference/Searching_the_Web/)
  http://cuiwww.unige.ch/meta-index.html
  http://www.albany.net/~wcross/all1srch.html
  http://www.amdahl.com/internet/meta-index.html
  http://condor.bcm.tmc.edu/search.html
  http://vatech.lib.vt.edu/lib/search.html
  http://ccwf.cc.utexas.edu/~sbartels/bookmarks.html
  http://www.verity.com/library.html