Given by Geoffrey C. Fox at HPDC95 Pentagon City on August 1, 1995. Foils prepared July 28, 1995
This was prepared for a tutorial at the HPDC-4 Conference |
It starts with motivation and identification of the four components of a Web search system -- information gathering and filtering, indexing, searching, and the user interface |
Web robots (gatherers) are reviewed, followed by |
a detailed discussion of three examples -- Lycos, freeWAIS and Harvest; the associated demonstrations also include Oracle free-text search |
We end with a discussion of future technologies, including natural-language front ends, distributed queries, metadata, caching and artificial intelligence |
Tutorial Presentation at HPDC95 |
Pentagon City |
August 1, 1995 |
Gang Cheng, Srinivas Polisetty |
Presented by Geoffrey Fox |
NPAC -- Syracuse University |
111 College Place |
Syracuse NY 13244-4100 |
Information discovery: locate relevant sources with reasonable effort/time |
Cache/Replicate information to alleviate regional network and server overhead |
A unified Web search interface to many different information resources, e.g. HTTP, FTP, Gopher, WAIS, Usenet newsgroups and various on-line databases |
Data Volume
|
Data Diversity
|
User Base
|
an information gathering and filtering subsystem
|
an indexer
|
a search engine
|
a search interface
|
Definition: web robots are a class of software programs that traverse network hosts gathering information from and about resources -- lists, information and collections |
Problems Addressed
|
Limitation: information generated "on-the-fly" (e.g. by CGI scripts) cannot be retrieved by Web robots |
Resource Discovery - Summarize large parts of the Web, and provide access to them |
Statistical Analysis - Count the number of Web servers, average number of documents per server, etc. |
Mirroring - Cache an entire sub-space of a Web server, and allow load sharing, faster access, etc. |
Maintenance - Assist authors in locating dead links, and help maintain the content, structure, etc. of a document |
Past Implementations
|
Present Implementations
|
Network Resource & Server Load
|
Continuous requests, which robots often make ("rapid fire"), are even more dangerous |
The current HTTP protocol handles such robot request streams inefficiently |
A badly implemented traversal algorithm may send the robot into endless loops |
Updating Overhead |
No efficient change-control mechanism for documents that change after being cached |
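The looping and rapid-fire hazards above can both be avoided with a visited set and a per-request delay, as in this sketch (`fetch_links` is a stand-in for real HTTP fetching and link extraction, not part of any real robot):

```python
import time
from collections import deque

def crawl(start_url, fetch_links, delay=1.0, max_pages=100):
    """Breadth-first robot traversal sketch.

    fetch_links(url) -> list of URLs linked from that page.
    A visited set prevents the looping failure mode, and a fixed
    delay between requests avoids "rapid fire" against servers.
    """
    visited = set()
    queue = deque([start_url])
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        if url in visited:
            continue
        visited.add(url)
        for link in fetch_links(url):
            if link not in visited:
                queue.append(link)
        time.sleep(delay)   # be polite: never hammer the server
    return visited
```

With `delay=0` and a fake link graph this terminates even when pages link back to each other, which an unguarded recursive traversal would not.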
Client-side robot disadvantages
|
Lycos |
FreeWAIS |
Harvest |
Web Search Demos - The demo set is designed for three different search systems to search in the same database: G. Fox's newest book "Parallel Computing Works!"
|
developed at Carnegie Mellon's Center for Machine Translation in the School of Computer Science |
Hardware Architecture: 7 workstations used as server engines, independently accessing the same index database using replicated data on local disks |
Gathering software: Written in Perl; uses C and libwww to fetch URLs |
SCOUT Indexer: full boolean retrieval with spelling correction and approximate phonetic matching |
Search Capability: document title, heading, link and keyword, approximate matching and probabilistic retrieval |
descriptors (title, heading and first 20 lines) of 4.24 million (large base) or 505k (small base) URLs |
keywords from 767,000 documents (for more than 5.4GB of text), update each day, 5000 new documents/day |
so far 16 million queries answered, 175,000 hits/week |
a total of 1,493,787 URLs from 23,550 unique HTTP servers |
average size of a downloaded text file: 7,920 characters.
Current total size of the index database: 737 MB |
Web URLs, including HTTP, FTP, Gopher servers. Not included: WAIS, UseNet News, Telnet Services, Email, CGI scripts |
for each URL: Title, Headings, Subheadings, 100 most "weighty" words, First 20 lines, Size in bytes, Number of words |
Performance Related Info:
|
Internet indexing and search engine suitable for use on any set of textual documents |
client-server model allows effective and local access
|
index data can be bulky - pure-text documents require almost as much file space for the index data as for the documents themselves |
the index process can be run as a daemon, running nightly |
the document + index set can be registered with the central WAIS registry, making the information available to the Internet community |
the docroot of a WWW server can be fully or partially indexed.
a disadvantage of freeWAIS relative to Harvest, for example, is that you need access to the directory of Unix files to build the index data. Although remote documents/index data can be accessed, you cannot index someone else's WWW server with freeWAIS, unlike Harvest |
Queries are lists of keywords. freeWAIS does not allow complex query formulation, but the commercial WAIS product is Z39.50-compliant and does allow Boolean and other complex query formulation |
freeWAIS is entirely based on free-text search. It has no ability to recognize special data fields in documents. This can be a problem if documents do not have meaningful filenames, as no title can be returned by the search process |
search results are scored according to a matching criterion and can be ordered by search score |
all these deficiencies are addressed by the commercial WAIS product |
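The score-ordered free-text retrieval described above can be sketched with a simple term-frequency ranker. This is illustrative only; the actual freeWAIS scoring formula differs:

```python
def score_documents(docs, query):
    """Rank documents by a naive term-frequency score.

    docs: {doc_id: text}.  Illustrates score-ordered free-text
    retrieval; real freeWAIS scoring is more sophisticated.
    """
    terms = [t.lower() for t in query.split()]
    results = []
    for doc_id, text in docs.items():
        words = text.lower().split()
        score = sum(words.count(t) for t in terms)
        if score > 0:                     # drop non-matching docs
            results.append((score, doc_id))
    results.sort(reverse=True)            # highest score first
    return [(doc_id, score) for score, doc_id in results]
```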
Major components
|
Gatherer collects info from resources available at Provider sites |
Broker retrieves indexing info from gatherers, suppresses duplicate info, indexes collected info, and provides WWW interface to it |
Replicator replicates Brokers around the Internet |
Users retrieve located info through the Cache |
Collection specified as enumerated URLs with stop lists and other parameters |
Actions performed
|
A type of Meta-Data |
Gatherer-Broker protocol |
Structured object summary stream |
Arbitrary binary data & extensible fields |
Type + URL + set of attribute-value pairs |
Easy interoperation with other standards for communication/data exchange |
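A record in this "Type + URL + attribute-value pairs" style can be emitted as below. This is only a sketch of a SOIF-like summary; consult the Harvest documentation for the exact grammar:

```python
def soif_record(obj_type, url, attributes):
    """Emit a SOIF-style structured object summary (a sketch).

    Each attribute is written as Name{byte-count}:<TAB>value, so
    arbitrary binary values can be carried safely because the
    reader knows exactly how many bytes to consume.
    """
    lines = ["@%s { %s" % (obj_type, url)]
    for name, value in attributes:
        data = value.encode("utf-8")
        lines.append("%s{%d}:\t%s" % (name, len(data), value))
    lines.append("}")
    return "\n".join(lines)

# Illustrative record; the URL and attributes are made up.
record = soif_record("FILE", "http://www.npac.syr.edu/doc.html",
                     [("Title", "Parallel Computing Works"),
                      ("Author", "G. Fox")])
print(record)
```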
Less than a dozen routines define index/search API |
Current Engines: freeWAIS, Glimpse, Nebula, Verity, WAIS |
Current query interface: WWW |
Customizable result formatting |
MD5-based duplicate removal
|
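The MD5-based duplicate suppression above can be sketched in a few lines:

```python
import hashlib

def remove_duplicates(summaries):
    """Drop object summaries whose content hashes to an
    already-seen MD5 digest, keeping the first copy -- the
    same idea as the Broker's duplicate suppression."""
    seen = set()
    unique = []
    for summary in summaries:
        digest = hashlib.md5(summary.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(summary)
    return unique
```

Hashing the content rather than the URL means the same document gathered from two mirrors is still recognized as one object.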
Provide data for indexing from other brokers as well as from themselves |
General Broker-Indexer interface to accommodate a variety of index/search engines |
Can use many backends (WAIS, Ingres etc.) |
Index & Search Implemented using two specialized backends:
|
alleviates bottlenecks on Web servers caused by popular objects |
improves gathering efficiency when fetching data from the file system |
two modes: a standalone HTTPD accelerator and an object cache hierarchy using Mosaic proxy interface |
Server load efficiency:
|
Network traffic efficiency:
|
booleans, reg. expressions, approximate searches |
incremental indexing |
searches take several seconds |
indexing ~6000 files (~70MB) on SparcStation IPC:
|
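Glimpse's approximate search can be illustrated with a brute-force sketch. Glimpse's actual algorithm is far faster; this only shows the matching criterion (at most k edit errors):

```python
def edit_distance(a, b):
    """Classic dynamic-programming Levenshtein distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def approx_search(words, pattern, k=1):
    """Return words within k edits of the pattern -- the kind of
    approximate match Glimpse offers, done the slow way."""
    return [w for w in words if edit_distance(w, pattern) <= k]
```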
A complete information resource discovery and access system
Archie, Veronica, WWW etc.
|
Content Router
|
WAIS
|
WebCrawler
|
WHOIS++
|
Natural Language Processing (NLP) |
Distributed Queries (DQ) |
Meta-Data Format (MDF) |
Artificial Intelligence (AI) |
Client-Side Data Caching/Mining (CDCM) |
A combination of the above technologies |
Web interface and Web robot should be more user-friendly and allow natural-language expressions |
Documents should not be treated as a mere linear list of words for search (e.g. as in WAIS) |
Improved accuracy and richer representation of search output |
Conventional web robots take a lot of time to traverse and locate the information |
Distributed queries enable faster processing and lower response times |
Larger data volumes processed in a short time |
Let user do distributed search at several databases at once |
This is also called "the WebAnts approach" |
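The distributed-query idea can be sketched as a fan-out/merge over several search backends. The stub backends here are placeholders; a real WebAnts-style system would query remote servers:

```python
from concurrent.futures import ThreadPoolExecutor

def distributed_search(query, search_fns):
    """Fan one query out to several backends in parallel and
    merge their results -- the WebAnts idea in miniature.

    search_fns: list of functions query -> list of (url, score).
    """
    with ThreadPoolExecutor(max_workers=len(search_fns)) as pool:
        futures = [pool.submit(fn, query) for fn in search_fns]
        merged = []
        for f in futures:
            merged.extend(f.result())
    # De-duplicate by URL, keeping the best score, then rank.
    best = {}
    for url, score in merged:
        if url not in best or score > best[url]:
            best[url] = score
    return sorted(best.items(), key=lambda x: -x[1])
```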
Relational attributes pave the way for NLP and SQL queries |
MetaData Format most suitable
|
User preferences - Keep track of a user's personalized preferences and adapt to his/her needs |
User feedback - Let users mark what documents are relevant/irrelevant and build new queries based on this information |
Dynamic link creation -
|
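The user-feedback idea above can be sketched as naive query expansion, a much-simplified cousin of the classical Rocchio relevance-feedback method. All names are illustrative:

```python
from collections import Counter

def expand_query(query, relevant_docs, n_terms=3):
    """Naive relevance feedback: add the most frequent terms
    from documents the user marked relevant to the next query."""
    counts = Counter()
    for text in relevant_docs:
        counts.update(text.lower().split())
    for term in query.lower().split():
        counts.pop(term, None)        # don't re-add existing terms
    extra = [t for t, _ in counts.most_common(n_terms)]
    return query.lower().split() + extra
```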
Cache/Mirror data for faster access, and indexing
|
use listing/browsing as a simplified and restricted (yet more structured) way for indexing/searching web sites |
hierarchical lists of URLs by topics - thousands of increasingly refined topic-specific areas |
search by browsing the predefined links, while keyword-based title-search is available on the local web site |
information gathering by manual categorization or user submissions (not clear how Yahoo discovers all the listed web sites) |
Yahoo is a web site, not a search service |
InfoSeek -- A popular commercial net search provider |
WebCrawler -- gathers indexes of the total contents of documents, as well as URLs and titles |
WWW Worm -- gathers information about titles and URLs from Web servers |
WebAnts -- a project for cooperating, distributed Web spiders |
JumpStation -- indexes the titles and headers of documents on the Web |
MOMspider -- a spider that you can install on your system (Unix/Perl) |
NIKOS -- allows a topic-oriented search of a spider database |
RBSE URL -- a database of URL references, with full WAIS indexing of the contents of the documents |
SG-Scout -- a robot for finding Web servers |
Wandex -- index from the World Wide Web Wanderer |
America Online ---> (acquisition) WebCrawler (U. of Washington), WAIS (WAIS Inc.) |
InfoSeek ---> (the first commercialized web search provider) net search service charged on a pay-per-document basis |
Sequoia Capital ---> (capital venture) Yahoo (Stanford U.) |
CMG@Ventures ---> (capital venture) Lycos (Carnegie Mellon U.) |
Microsoft Inc. ---> (licensed) Lycos (CMU) |
KPC & Byers and Institutional Venture Partners ---> (capital venture) Architext Inc. (formed by six Stanford graduates) |