Web Search Tutorial
Presentation at HPDC95, Pentagon City, August 1, 1995
Gang Cheng, Srinivas Polisetty
Presented by Geoffrey Fox
NPAC -- Syracuse University, 111 College Place, Syracuse NY 13244-4100

Abstract of Web Search Presentation
This tutorial was prepared for the HPDC-4 conference. It starts with motivation and the identification of the four components of a Web search system -- information gathering and filtering, indexing, searching, and the user interface. Web robots (gatherers) are reviewed, followed by a detailed discussion of three examples: Lycos, freeWAIS and Harvest; the associated demonstrations also include Oracle free-text search. We end with a discussion of future technologies, including natural-language front ends, distributed queries, metadata, caching and artificial intelligence.

Motivations
- Information discovery: locate relevant sources with reasonable effort and time
- Cache/replicate information to alleviate regional network and server overload
- A unified Web search interface to many different information resources, e.g. HTTP, FTP, Gopher, WAIS, Usenet newsgroups and various on-line databases

Challenges
Data volume:
- Estimated total size of the Web: 30 GB, 5 million documents, growing daily
- Requires more sophisticated search interfaces than browsing and organizing hyperlinks
Data diversity:
- The Web is a huge distributed database of unstructured, non-relational, hierarchical (multimedia) information entities in various data formats: MIME -- HTML, plain text, PostScript, LaTeX, images, audio/video clips, etc.
- Web repositories are heterogeneous, inconsistent and incomplete
User base:
- Different requirements in query patterns, search topics and response time
- Rapid growth in the number of users and in daily search requests

Major components in a Web search system
An information gathering and filtering subsystem:
- Gathers source data from remote/local Web repositories into a local indexing database
- Sources (searchable text-based information on the Internet): Web space, including information on HTTP, FTP, Gopher, WAIS and other servers; Usenet newsgroups; various on-line databases, e.g. periodicals
- Performs information conversion and extraction
- Deals with updates of the sources
- Usually implemented by a "Web robot"
An indexer:
- Its meta-data determine the scope and accuracy of the search engine and the size of the indexed database
- Common indexing schemes: object names, selected keywords, full text
- Pre-computed indexing; automatic indexers

Major components in a Web search system (cont.)
A search engine:
- Built on the indexed database
- Client-based vs. server-based
- Common search capabilities: keyword-based, attribute-based, content-based, concept-based and NLP (Natural Language Processing)-based search, SQL queries (with an RDBMS server), etc.
A search interface:
- HTML form based
- CGI gateway from the HTTP server to the search engine
- User-customizable (for forming queries and presenting query results)

Web Robots (also called Spiders, Web Worms or Web Wanderers)
Definition: Web robots are a class of software programs that traverse network hosts gathering information from and about resources -- lists, information and collections. (A minimal robot sketch follows this slide.)
Problems addressed:
- There is no pre-defined data set or fields for a document on the Web
- Manual traversal of the Web is virtually impossible
- The contents of resources change, and so do their hyperlinks
- Improperly maintained Web sites produce dead links
Limitation: information generated "on the fly" (e.g. by CGI scripts) cannot be retrieved by Web robots
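As a tutorial aside, the robot idea above can be made concrete with a short sketch. This is a minimal, illustrative single-host robot in Python (Python is used for all sketches in these notes, purely for exposition -- it is not the implementation of any system discussed); the seed URL, page limit and delay are assumptions, and a real robot should also honor the robots.txt exclusion convention.

    # Minimal web robot sketch: breadth-first traversal of one host.
    # Illustrative only -- seed URL, page limit and delay are made up.
    import time
    import urllib.request
    from collections import deque
    from html.parser import HTMLParser
    from urllib.parse import urljoin, urlparse

    class LinkParser(HTMLParser):
        """Collect href targets from anchor tags."""
        def __init__(self):
            super().__init__()
            self.links = []
        def handle_starttag(self, tag, attrs):
            if tag == "a":
                for name, value in attrs:
                    if name == "href" and value:
                        self.links.append(value)

    def crawl(seed, max_pages=20, delay=1.0):
        """Visit up to max_pages URLs on the seed's host, politely."""
        host = urlparse(seed).netloc
        seen, queue, pages = {seed}, deque([seed]), {}
        while queue and len(pages) < max_pages:
            url = queue.popleft()
            try:
                with urllib.request.urlopen(url, timeout=10) as resp:
                    html = resp.read().decode("utf-8", errors="replace")
            except OSError:
                continue                  # dead link: skip (or report it)
            pages[url] = html
            parser = LinkParser()
            parser.feed(html)
            for href in parser.links:
                absolute = urljoin(url, href)
                if urlparse(absolute).netloc == host and absolute not in seen:
                    seen.add(absolute)
                    queue.append(absolute)
            time.sleep(delay)             # avoid "rapid fire" requests
        return pages

The seen set guards against the looping danger, and the per-request delay avoids the "rapid fire" behavior, both discussed on the following slides.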
Major uses of Web Robots
- Resource discovery: summarize large parts of the Web and provide access to them
- Statistical analysis: count the number of Web servers, the average number of documents per server, etc.
- Mirroring: cache an entire sub-space of a Web server, allowing load sharing, faster access, etc.
- Maintenance: assist authors in locating dead links, and help maintain the content, structure, etc. of a document

Implementation Issues of Web Robots
Past implementations:
- Browsing: a small list was maintained at CERN for browsing through all available servers
- Listing: lists of references to resources on the Web, such as Yahoo, the HCC NetServices List, the NCSA MetaIndex, etc.
- Searching: searchable databases such as the GNA Meta Library
Present implementations:
- Automatic collection: automatically retrieve a fixed set of documents that the robot has been programmed to parse regularly, like the NCSA What's New list
- Automatic discovery: exploit automation; analyze and store the documents encountered
- Automatic indexing: automatically index the set of documents cached
A Web robot can be a standalone program plugged into a search system, or a built-in gathering component of an indexing/search engine

Costs & Dangers of Using Web Robots
Network resource & server load:
- Robots require considerable bandwidth over the Internet, disadvantaging corporate users
- Heavy load on a server results in a lower service rate and frequent server crashes
- Remote parts of the network are also affected
- Continuous requests, which robots usually make ("rapid fire"), are even more dangerous
- The current HTTP protocol has no efficient defense against such robot attacks

Costs & Dangers of Using Web Robots (cont'd)
- A bad implementation of the robot traversal algorithm may lead to the robot running in loops
Updating overhead:
- No efficient change-control mechanism for URL documents that change after caching
Client-side robot disadvantages:
- Once distributed, bugs cannot be fixed
- No knowledge of problem areas can be added
- No new, more efficient facilities can be taken advantage of, as not everyone will upgrade to the latest version

Examples of Web Search Systems
- Lycos
- FreeWAIS
- Harvest

Web Search Demos
The demo set has three different search systems searching the same database, G. Fox's newest book "Parallel Computing Works!":
- Harvest
- WWWWAIS
- Oracle Text*Retrieval

Lycos
- Developed at Carnegie Mellon's Center for Machine Translation in the School of Computer Science
- Hardware architecture: 7 workstations used as server engines, independently accessing the same index database using replicated data on local disks
- Gathering software: written in Perl; uses C and libwww to fetch URLs
- SCOUT indexer: full boolean retrieval with spelling correction and approximate phonetic matching
- Search capability: document title, heading, link and keyword; approximate matching and probabilistic retrieval

Total Volume and Data Capture in Lycos
- Descriptors (title, heading and first 20 lines) of 4.24 million (large base) or 505k (small base) URLs
- Keywords from 767,000 documents (more than 5.4 GB of text), updated each day, 5,000 new documents/day
- 16 million queries answered so far, 175,000 hits/week
- Total of 1,493,787 URLs from 23,550 unique HTTP servers
- Average size of a downloaded text file: 7,920 characters
- Current total size of the index database: 737 MB
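As a tutorial aside: SCOUT itself is proprietary, but the keyword-indexing scheme it represents can be sketched. Below is a toy inverted index with boolean AND retrieval and a crude term-frequency ranking; the sample documents and the tokenization rule are assumptions, not Lycos internals.

    # Toy inverted index: keyword-based indexing and boolean AND search.
    # A real indexer would also keep titles, headings and positions.
    import re
    from collections import defaultdict

    def tokenize(text):
        return re.findall(r"[a-z0-9]+", text.lower())

    def build_index(docs):
        """Map each term to {doc_id: term frequency}."""
        index = defaultdict(dict)
        for doc_id, text in docs.items():
            for term in tokenize(text):
                index[term][doc_id] = index[term].get(doc_id, 0) + 1
        return index

    def search(index, query):
        """Boolean AND over query terms, ranked by summed term frequency."""
        postings = [index.get(term, {}) for term in tokenize(query)]
        if not postings:
            return []
        hits = set.intersection(*(set(p) for p in postings))
        return sorted(hits, key=lambda d: -sum(p.get(d, 0) for p in postings))

    docs = {"u1": "parallel computing works", "u2": "web search and computing"}
    idx = build_index(docs)
    print(search(idx, "computing works"))   # -> ['u1']

A production engine layers spelling correction, phonetic matching and field-aware weighting on top of this core structure.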
Data Content in Lycos
- Web URLs, including HTTP, FTP and Gopher servers. Not included: WAIS, Usenet news, Telnet services, email, CGI scripts
- For each URL: title, headings, subheadings, the 100 most "weighty" words, first 20 lines, size in bytes, number of words
- Performance-related info: response time is frequently > 20 sec., even with 7-10 servers running full time
- Limitations: poor response time (or outright rejection) under high query load, due to lack of scale-up, no parallelization, and the replicated-database approach

FreeWAIS - Wide Area Information Server
- Internet indexing and search engine suitable for use on any set of textual documents
- Client-server model allows effective remote and local access
- X client available (xwais)
- WWW cgi-bin gateway interface available (wwwwais) as a form-based Web interface
- The waisindex command builds index data for any Unix directory tree of text documents

Indexing in FreeWAIS
- Index data can be bulky: pure text documents require almost the same file space again for their index data
- The index process can be run as a daemon, e.g. running nightly
- A document + index set can be registered with the central WAIS registry, making the information available to the Internet community
- The docroot of a WWW server can be indexed or partially indexed
- A disadvantage of freeWAIS compared with Harvest, for example, is that it needs access to the directory of Unix files to build index data. Although remote documents/index data can be accessed, you cannot index someone else's WWW server with freeWAIS, unlike Harvest

Search in FreeWAIS
- Queries are lists of keywords. freeWAIS does not allow complex query formulation, but the commercial WAIS product is Z39.50 compliant and does allow Boolean and other complex query formulation
- freeWAIS is entirely based on free-text search. It has no ability to recognize special data fields in documents. This can be a problem if documents do not have meaningful filenames, as no title can be returned by the search process
- Search results are scored according to a matching criterion and can be ordered by search score
- All these deficiencies are addressed by the commercial WAIS product

Harvest System
Major components:
- Efficient distributed gathering architecture with customizable content extraction in each Gatherer
- Topic/community-focused Brokers
- Plug-and-play index/search engine in each Broker
- Network-aware caching and replication for scalable access
- Structured content-summary interchange protocol

Harvest Architecture

Harvest Overview
- Gatherer: collects info from resources available at Provider sites
- Broker: retrieves indexing info from Gatherers, suppresses duplicate info, indexes the collected info, and provides a WWW interface to it
- Replicator: replicates Brokers around the Internet
- Users retrieve located info through the Cache

Harvest Gatherer
- Collection specified as enumerated URLs, with stop lists and other parameters
- Actions performed: enumeration, retrieval, unnesting (compression, tar, etc.), type-specific summarizing, content-summary serving
- Customized content extraction (Essence)
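As a tutorial aside, here is a rough sketch of what "type-specific summarizing" means for one type, HTML: reduce a raw object to a compact content summary. The choice of fields (title, headings, first 20 lines, size) mirrors the Lycos descriptors above; the regular expressions are illustrative assumptions, not Essence's actual extraction scripts.

    # Sketch of type-specific summarizing, in the spirit of Essence:
    # reduce a raw HTML object to a small content summary for indexing.
    import re

    def summarize_html(url, html, first_lines=20):
        """Extract title, headings and leading text from one HTML page."""
        title = re.search(r"<title[^>]*>(.*?)</title>", html,
                          re.IGNORECASE | re.DOTALL)
        headings = re.findall(r"<h[1-6][^>]*>(.*?)</h[1-6]>", html,
                              re.IGNORECASE | re.DOTALL)
        text = re.sub(r"<[^>]+>", " ", html)       # crude tag stripping
        lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
        return {
            "url": url,
            "title": title.group(1).strip() if title else "",
            "headings": [h.strip() for h in headings],
            "first-lines": lines[:first_lines],
            "size-in-bytes": len(html.encode("utf-8")),
        }

    page = "<html><title>Demo</title><h1>Intro</h1><p>Hello world</p></html>"
    print(summarize_html("http://example.org/demo.html", page))

A Gatherer would stream such summaries to a Broker rather than print them.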
Summary Object Interchange Format (SOIF)
- A type of meta-data
- The Gatherer-Broker protocol
- A structured object-summary stream
- Arbitrary binary data & extensible fields
- Type + URL + set of attribute-value pairs (an illustrative record sketch appears after the Harvest implementation slides below)
- Easy interoperation with other standards for communication/data exchange

Harvest Broker
- Fewer than a dozen routines define the index/search API
- Current engines: freeWAIS, Glimpse, Nebula, Verity, WAIS
- Current query interface: WWW
- Customizable result formatting
- MD5-based duplicate removal (MD5: an algorithm for computing a 128-bit "digest" of arbitrary-length data, such that any alteration of the data is reflected in the digest)
- Provides data for indexing from other Brokers as well as from itself

Distributed Gatherer-Broker Arrangement

Index & Search in Harvest
- General Broker-Indexer interface to accommodate a variety of index/search engines
- Can use many backends (WAIS, Ingres, etc.)
- Implemented using two specialized backends: Glimpse for space efficiency & flexible interactive queries; Nebula for time efficiency & complex standing queries

Cache in Harvest
- Alleviates bottlenecks on Web servers caused by popular objects
- Improves gathering efficiency when fetching data from the file system
- Two modes: a standalone HTTPD accelerator, and an object-cache hierarchy using the Mosaic proxy interface

Performance of Harvest -- Gatherer
Server load efficiency:
- Scans objects periodically & builds a cache, allowing info to be retrieved in a single stream
- Retrieving data via the Harvest Gatherer's cache imposes 6,600 times less server load than retrieving it via the local file system
Network traffic efficiency:
- Supports incremental updates
- Transmitting only the content summary is 59 times faster than transmitting full-text indexes

Performance of Search in Harvest -- Glimpse
- Booleans, regular expressions, approximate searches
- Incremental indexing
- Searches take several seconds
- Indexing ~6,000 files (~70 MB) on a SparcStation IPC: incremental indexing in 2-4 min., complete indexing in 20 min.

Implementations of Harvest
- Standalone: a complete information resource discovery and access system
- Archie, Veronica, WWW, etc.: Gatherer configuration + Essence extraction script
- Content Router: WAIS enumerator + Essence extraction script

Implementations of Harvest with Other Systems -- Continued
- WAIS: Essence full-text extraction + ranking search engine
- WebCrawler: Gatherer configuration + Essence full-text extraction
- WHOIS++: Gatherer configuration + Essence extraction script for centroids; query front end to search SOIF records
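As a tutorial aside, the SOIF and MD5 ideas above can be combined in a short sketch: serialize an object summary as Type + URL + attribute-value pairs, and detect duplicate objects by digest. The record syntax shown merely approximates the description on the SOIF slide; it is not the normative SOIF grammar.

    # Sketch: write a SOIF-like record (Type + URL + attribute-value pairs)
    # and flag duplicate objects by MD5 digest, as a Harvest Broker does.
    import hashlib

    def soif_record(template_type, url, attributes):
        """Serialize one object summary in an approximated SOIF layout."""
        lines = ["@%s { %s" % (template_type, url)]
        for name, value in attributes.items():
            # Each value carries its length, so values may hold any data.
            lines.append("%s{%d}:\t%s" % (name, len(value), value))
        lines.append("}")
        return "\n".join(lines)

    def is_duplicate(seen_digests, content):
        """Flag objects already seen, using MD5's 128-bit content digest."""
        digest = hashlib.md5(content.encode("utf-8")).hexdigest()
        if digest in seen_digests:
            return True
        seen_digests.add(digest)
        return False

    summary = {"title": "Parallel Computing Works!", "file-size": "1048576"}
    print(soif_record("FILE", "http://example.org/pcw.html", summary))

    seen = set()
    print(is_duplicate(seen, "same body"), is_duplicate(seen, "same body"))
    # -> False True

Because any alteration of the data changes the digest, identical digests are a reliable signal of duplicate content.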
Future Technologies in Web Search
- Natural Language Processing (NLP)
- Distributed Queries (DQ)
- Meta-Data Format (MDF)
- Artificial Intelligence (AI)
- Client-Side Data Caching/Mining (CDCM)
- A combination of the above technologies

Future Technologies in Web Search - NLP
- The Web interface and the Web robot should be more user-friendly, allowing natural-language expressions
- Documents should no longer be treated as a mere linear list of words for search, as they are in e.g. WAIS
- Improved accuracy and richer representation of search output

Future Technologies in Web Search - DQ
- Conventional Web robots take a long time to traverse the Web and locate information
- Distributed queries enable faster processing and low response time
- Larger volumes of data processed in a short time
- Let the user search several databases at once; this is also called "the WebAnts approach"

Future Technologies in Web Search - MDF
- Relational attributes pave the way for NLP and SQL queries
- A metadata format along these lines is most suitable:
  Subject: the topic addressed
  Title: the name of the object
  Author: the person primarily responsible for the content
  Publisher: the agent who made the object available
  Other Agent: editors, transcribers, etc.
  Date: the date of publication
  Object Type: novel, poem, dictionary, etc.
  Form: PostScript file, Windows text file, etc.
  Identifier: a unique identifier (number) for the object
  Relation: relationship to other objects
  Source: objects from which this one is derived
  Language: the language of the intellectual content
  Coverage: spatial locations and temporal durations characteristic of the object

Future Technologies in Web Search - AI
- User preferences: keep track of a user's personalized preferences and adapt to his/her needs
- User feedback: let users mark which documents are relevant/irrelevant and build new queries from this information
- Dynamic link creation: links to related documents at the bottom of a page, and links to discussions of a phrase in the current document

Future Technologies in Web Search - CDCM
- Cache/mirror data for faster access and indexing (a minimal cache sketch follows the robot survey below)
- When: at the user's request, or by the maintainer of the cache
- What: frequently used documents, and those which aren't already cached
- Who has access to copies: the proper cache is selected according to user preferences
- How to present the copies: users shouldn't realize that they are looking at cached copies

More about Web Search Systems and Web Robots -- Yahoo
- Uses listing/browsing as a simplified and restricted (yet more structured) way of indexing/searching Web sites
- Hierarchical lists of URLs by topic: thousands of increasingly refined topic-specific areas
- Search by browsing the predefined links; keyword-based title search is available on the local Web site
- Information gathering by manual categorization or user submission (it is not clear how Yahoo discovers all the listed Web sites)
- Yahoo is a Web site, not a search service

More about Web Search Systems and Web Robots
- InfoSeek: a popular commercial net search provider
- WebCrawler: gathers indexes of the full contents of documents, as well as URLs and titles
- WWW Worm: gathers information about titles and URLs from Web servers
- WebAnts: a project for cooperating, distributed Web spiders
- JumpStation: indexes the titles and headers of documents on the Web
- MOMspider: a spider that you can install on your own system (Unix/Perl)
- NIKOS: allows a topic-oriented search of a spider database
- RBSE URL: a database of URL references, with full WAIS indexing of document contents
- SG-Scout: a robot for finding Web servers
- Wandex: an index from the World Wide Web Wanderer
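As promised under the CDCM slide, here is a minimal client-side document cache sketch: all fetches go through the cache, so users need not know whether they are reading a cached copy. The capacity, the least-recently-used eviction rule and the fetch callback are illustrative assumptions.

    # Minimal client-side cache sketch for the CDCM idea above: serve
    # frequently used documents locally, evicting the least recently used.
    from collections import OrderedDict

    class DocumentCache:
        def __init__(self, fetch, capacity=100):
            self.fetch = fetch            # callback that retrieves a URL
            self.capacity = capacity
            self.store = OrderedDict()    # url -> document, in LRU order

        def get(self, url):
            """Return a document, transparently using the cached copy."""
            if url in self.store:
                self.store.move_to_end(url)      # mark as recently used
                return self.store[url]
            document = self.fetch(url)           # cache miss: go remote
            self.store[url] = document
            if len(self.store) > self.capacity:
                self.store.popitem(last=False)   # evict least recently used
            return document

    cache = DocumentCache(fetch=lambda url: "<html>%s</html>" % url, capacity=2)
    cache.get("http://example.org/a")    # remote fetch
    cache.get("http://example.org/a")    # served from the cache

Note that get never exposes whether the copy was local, matching the presentation goal stated on the CDCM slide.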
Recent Capital Ventures in the Web Search Business
- America Online ---> (acquisition) WebCrawler (U. of Washington), WAIS (WAIS Inc.)
- InfoSeek ---> (the first commercialized Web search provider) net search service charged on a pay-per-document basis
- Sequoia Capital ---> (capital venture) Yahoo (Stanford U.)
- CMG@Ventures ---> (capital venture) Lycos (Carnegie Mellon U.)
- Microsoft Inc. ---> (licensed) Lycos (CMU)
- Kleiner Perkins Caufield & Byers and Institutional Venture Partners ---> (capital venture) Architext Inc. (formed by six Stanford graduates)