Review of Current Web Search Systems (draft, internal distribution only)
Gang Cheng, Piotr Sokolowski and Geoffrey Fox, NPAC, Syracuse University

1. Introduction

A web search system is a facility to locate/discover relevant information represented by Web pages. The search domain can be either the whole WWW information space on the Internet (including HTTPD, FTP, Gopher, WAIS and USENET news servers) or the WWW space of individual web sites; the same technology applies to both the global and the local WWW information space. A web search system generally has four major components (subsystems): a gathering subsystem, an indexing subsystem, a search engine and a web search interface. For each of these subsystems, this review compares Oracle's approach with the most advanced one known to us. Because of the potential commercial value of these web systems, it is very difficult to obtain the internal techniques and details of the exact technology they use. We will mostly use Lycos as the major comparison example in our discussion, because Lycos is among the most popular web search systems on the Internet, and because it started as a research project in an academic environment and was later commercialized as a major Internet information provider and service.

The most popular current web search systems are (they can be found from the 'net search' page in the Netscape web browser):

. OpenText - backend engine for Yahoo, free service. Has the largest indexed web database. See http://www.opentext.com:8080/omw/f-faq.html. Oracle uses some of its technology in developing Oracle's full-text search product.
. Lycos - free service, early web search system.
. Infoseek - free service for the web database, paid service for newsgroups and other on-line databases.
. WebCrawler - free service.
. Yahoo - catalog database, not truly a web search system; catalog search is provided by the OpenText system.

Fig. 1 The general architecture of a web search system

remote/local web sites <-> gathering subsystem -> indexing subsystem <-> search engine
                                                                              ^
                                                                              |
clients with web browsers -- web search interface ---- web server(s) ---- cgi

2. Web Gathering Subsystem

This subsystem's major function is to gather web pages from either remote or local web sites. It deals with the whole Internet 'web space', which may include information on HTTPD, FTP, Gopher, WAIS and USENET news servers. The gathering task is usually carried out automatically by a small program called a 'web robot', 'web spider' or 'web agent', whose job can be briefly described as follows:

1. Starting from a single URL, it requests the web page from the remote web site of that URL. Upon receiving the full page, it parses the page and appends all URLs found in it to a queue for further gathering.
2. For the web page from (1), according to the indexing rules predefined in the indexing subsystem and the search engine, it parses the page and passes on only the information needed by the indexing subsystem.
3. After (2) is done, it takes a new URL from the queue, checks whether that URL has already been indexed, and repeats (1) to continue gathering.

In summary, a web robot continuously requests files from remote web sites, then parses and filters each file for indexing and further gathering. It is this subsystem that decides which, what and when information will be cached/indexed/updated into the database for web search. It determines the information content and volume, the accuracy, and to some extent the performance, of a web search system.
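To make the robot loop above concrete, here is a minimal sketch in Python (the language InfoSeek reportedly used for its whole system). It is only an illustration under simplifying assumptions - the starting URL, the link-extraction regular expression and the index_page() placeholder are ours, not the logic of Lycos or any other system discussed here:

    # Minimal web-robot sketch: fetch a page, hand it to the indexer, queue
    # any unseen URLs found in it, and repeat until a page limit is reached.
    import re
    from collections import deque
    from urllib.parse import urljoin
    from urllib.request import urlopen

    HREF_RE = re.compile(r'href\s*=\s*"([^"]+)"', re.IGNORECASE)

    def index_page(url, text):
        # Placeholder for the indexing subsystem: record only the title here.
        m = re.search(r"<title>(.*?)</title>", text, re.IGNORECASE | re.DOTALL)
        print(url, "->", m.group(1).strip() if m else "(no title)")

    def gather(start_url, max_pages=50):
        queue = deque([start_url])      # URLs waiting to be visited
        seen = {start_url}              # URLs already queued or indexed
        while queue and len(seen) <= max_pages:
            url = queue.popleft()
            try:
                text = urlopen(url, timeout=10).read().decode("latin-1", "replace")
            except OSError:
                continue                # unreachable page: skip it
            index_page(url, text)       # step (2): pass content to the indexer
            for link in HREF_RE.findall(text):   # step (1): collect new URLs
                absolute = urljoin(url, link)
                if absolute.startswith("http") and absolute not in seen:
                    seen.add(absolute)
                    queue.append(absolute)

    if __name__ == "__main__":
        gather("http://www.example.com/")   # hypothetical starting point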
Different web systems use different rules and approaches. The major issues are:

1). Which to gather
    choice 1: all servers, or only HTTPD, FTP, Gopher, WAIS or news servers
    choice 2: for a single server, all URLs or only the first N levels of URLs visited
    choice 3: .html, .ps, .txt or .* files for a single server
2). What to index
    choice 1: all or only a subset of 'significant' keywords in a file
    choice 2: attributes of a file, such as last update date, title, subtitles, size, outline etc.
    choice 3: controlled by size/lines, e.g. the first 10 lines of a text file
3). When to gather/index/update
    choice 1: the same indexing database as the one currently used by the web search interface, or a separate one
    choice 2: in real time during the day (i.e. online indexing) or during the night
4). Performance of a gatherer - the number of files that can be gathered/indexed per day/hour

It is difficult to know all of the above choices made by other web systems. One thing is clear, though: due to significant performance and space requirements, all current web systems share some common approaches:

1) at most 3 levels of a web site are indexed; HTTPD and FTP servers are the major targets
2) only a small proportion of the words in a file (usually 10% - 20%) is indexed, together with common attributes including title, size, date, subtitle and the first 10 lines (as an outline)
3) indexing is not real time; updating is usually done during the night

Known facts:

Yahoo - several Indy workstations and Intel Pentium PCs.
Lycos - 1.178 million unique URLs; 8.7 GB processed; 1.8 GB summary text (attribute text) stored; 1.08 GB inverted index in the database.
InfoSeek - 8 SUN1000s, possibly using an RDBMS or OODBMS.
  . paid service: USENET newsgroups, 4-week expiry period, 15,000 newsgroups, updated nightly, 2 million news articles, 7 GB total.
  . free service: web pages - 400K URLs, 2 GB total, updated weekly for old URLs and daily for newly added ones. Full-text indexing; can update the whole database in 48 hours; adds new pages once a week; updates/revisits each indexed page once a month; case sensitive; also indexes numbers and symbols; supports "phrase search" and some limited proximity search; uses one-level control; does not support automatic word expansion. Robot performance: 10,000 pages/hour. The whole system is written in Python.
Inktomi Search Engine (UC Berkeley) - 1.3 million URLs, partitioned in parallel among 4 SUN10s; the largest number of indexed URLs. Full or partial indexing: unknown. Poor search interface, keyword only.
Open Text - the Web Index contains about 985 million words of text and 15,436,712 hyperlinks (URLs visited, not necessarily indexed in the database). In the most recent Open Text Web Index update, 22,106 pages were removed, of which 11,102 are no longer on the Web and 11,004 were replaced with changed versions; 16,930 other pages were revisited but had not changed; 74,443 new pages, not previously in the Index, were added. These statistics were computed Mon Oct 16 23:55:26 EDT 1995.

Current estimate of the web pages on the whole Internet: 0.1 - 1 TB.

3. Indexing Subsystem

This subsystem handles how the free text of all documents/files is internally stored and managed in the database, so that searching can be supported efficiently and effectively. It covers the logical data structure of the text entities and the physical organization/layout of the data in the database. The indexing subsystem is tightly coupled with the search engine subsystem, as its sole purpose is to speed up the free-text search/query process conducted by the search engine. The common approach is to build an 'inverted index' of all keywords in a document.
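To make the 'inverted index' idea concrete, the sketch below builds a small in-memory index mapping each keyword to the set of documents (URLs) containing it. The sample documents and tokenizer are illustrative assumptions; it shows only the data structure, not the storage scheme of any of the systems listed above:

    # Minimal inverted-index sketch: keyword -> set of document identifiers.
    import re
    from collections import defaultdict

    def tokenize(text):
        # Lowercase word tokens; real systems also apply stop lists, stemming etc.
        return re.findall(r"[a-z0-9]+", text.lower())

    def build_index(documents):
        index = defaultdict(set)
        for doc_id, text in documents.items():
            for word in tokenize(text):
                index[word].add(doc_id)
        return index

    documents = {
        "url1": "Parallel database servers for web search",
        "url2": "A web robot gathers pages for the indexing subsystem",
        "url3": "Full-text search with an inverted index",
    }
    index = build_index(documents)
    print(sorted(index["web"]))             # documents containing 'web'
    print(index["web"] & index["search"])   # simple AND of two keywords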
The major approaches/issues differ between web systems:

(1). The compression scheme used to store the text and its indexes. It is not clear exactly what technology other systems are using. WAIS is the typical technology invented for indexing/searching free text, and its major drawback is the size overhead of the indexes - it compresses neither the index nor the text, resulting in a roughly 1:1 ratio of original text to indexes. Lycos seems to use similar technology for indexing.

(2). Keeping both the original text and the indexes in the database, or just the indexes. In order to support certain advanced search capabilities (discussed in the search engine section) such as phrase search and proximity search, the original text must be stored together with its indexes. This almost doubles the space requirement of a web search system. To our knowledge, currently only InfoSeek stores both; that is also the reason InfoSeek keeps a very small URL database (400K).

(3). Index modes: real-time indexing, batch indexing and incremental indexing. With real-time indexing, updates, inserts or deletes of text in/from the existing text database at runtime automatically modify the associated indexes. Batch indexing only updates the index in batch mode after a bulk of text has been modified. Incremental indexing allows the indexing process to be done incrementally, so adding new text does not require reindexing all the previous text. It is not clear which indexing modes are allowed in other web search systems; batch and incremental indexing are common capabilities.

(4). Case sensitivity and symbols. Including case sensitivity and symbols in the index increases the space requirement for the index.

Oracle's approach: the index facility used in the RDBMS server is only good for locating data entities with well-structured attributes and is not usable for a free-text database. For this reason, Oracle developed a separate product called Context Server (previously called SQL*Text*Retrieval) to build another logical/software layer on top of its RDBMS engine. There are some penalties in using the RDBMS search engine to mimic a text search engine, which we will not detail here. Oracle uses some quite unique and smart techniques to deal with this issue. Compression schemes are used for both text (run-length compression) and indexes (bitstrings) to reduce space usage and speed up text queries. The typical compression ratio for text is 4:1; for indexes it depends on the number of documents indexed. It also supports all the index modes, case sensitivity etc. mentioned above.

It is relatively hard to compare the indexing techniques used in different web search systems, but there are basically two major measurements: space consumption and search response time. A good index scheme should have small space use yet short query processing time. The performance of a web search system largely depends on how the indexes and text are partitioned across the server. Parallel Oracle is the best choice so far for handling data partitioning, parallel I/O and caching efficiently to achieve optimized query performance. A web search system basically has OLTP-type search characteristics, with one important feature: 99% of the transactions are read-only queries, with the remaining 1% being updating transactions running periodically in the background. A parallel server is an ideal candidate to support this type of information system - almost all of the speedup can be achieved because the workload is 100% embarrassingly parallel.
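As a rough illustration of why a bitstring representation of an index both saves space and speeds up queries, the sketch below encodes each keyword as one bit per document, so that 'and', 'or' and 'not' become single bitwise operations. This is a generic illustration of the idea only, not Oracle's actual index format or compression scheme:

    # Minimal bitstring-index sketch: one bit per document per keyword,
    # so boolean queries reduce to bitwise operations on integers.
    # The documents and tokenizer are illustrative assumptions.
    import re

    def tokenize(text):
        return re.findall(r"[a-z0-9]+", text.lower())

    def build_bit_index(docs):
        """docs is an ordered list of (doc_id, text); bit i stands for docs[i]."""
        bits = {}
        for i, (_, text) in enumerate(docs):
            for word in set(tokenize(text)):
                bits[word] = bits.get(word, 0) | (1 << i)
        return bits

    def matching_docs(mask, docs):
        return [doc_id for i, (doc_id, _) in enumerate(docs) if mask & (1 << i)]

    docs = [
        ("url1", "parallel database server"),
        ("url2", "web search with a parallel server"),
        ("url3", "full text search engine"),
    ]
    bits = build_bit_index(docs)
    all_docs = (1 << len(docs)) - 1

    print(matching_docs(bits["parallel"] & bits["server"], docs))          # AND
    print(matching_docs(bits["search"] | bits["database"], docs))          # OR
    print(matching_docs(bits["search"] & (all_docs & ~bits["web"]), docs)) # AND NOT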
4. Search Engine Subsystem

The search engine subsystem accepts queries from the web search interface and schedules, partitions and executes them against the indexed database to locate the URLs, and their associated attributes, which satisfy the query criteria. It also deals with performance-related system activities such as I/O, caching, assembling and sorting. Based on what content is indexed and on the scheme used in the indexing subsystem, it implements/supports different search algorithms to provide the search capabilities of a text retrieval system. Together with the previous two subsystems, it determines what basic and advanced search capabilities a system can support and how efficient they are. Most web search systems are keyword-based. The major search functions are:

Basic:
(1). keywords with logical operators and their combinations, including 'and', 'or', 'not'
(2). regular expressions of keywords, including single/multiple-wildcard matching
(3). keyword expansion: stemming, fuzzy matching, soundex, expansion based on a thesaurus (such as synonyms)
(4). ranking of query results
(5). case sensitive/insensitive search

Advanced:
(1). summarizing - summarize a document
(2). similarity search - find documents similar to a particular document
(3). phrase search
(4). proximity search - specify the allowed distance between keywords

Almost all web search systems support the 'basic' search capabilities. Lycos does not have any of the advanced ones, while InfoSeek only supports 'phrase search' and some limited 'proximity search'. OpenText has all the functions. Oracle's system has all except 'similarity search'.

5. Web Search Interface

Any full-text information retrieval system has the two core subsystems discussed above - 'indexing' and 'search engine'. They represent core technology developed in the information processing literature and are used directly in a web search system. What is unique and different in a web search system, compared with traditional text retrieval systems, lies in the 'web gathering' and 'web search interface' subsystems. These two subsystems are also the major development work we are undertaking to build a web search system, as we rely on Oracle's RDBMS technology for the 'search engine' and its full-text retrieval technology for the 'indexing' subsystem. Unlike most web search systems such as OpenText, Lycos, InfoSeek and WebCrawler, which were built almost entirely from scratch - i.e. all four subsystems were developed with proprietary technology - our approach is to use Oracle's well-established RDBMS and full-text retrieval technology as the core engine, while developing our own 'web robot' and 'web search interface', which are integrated into a web search system and make the best use of the core Oracle technology.

A text retrieval system usually has its own search interface and also allows developers to build new search interfaces to access the text database. Most text retrieval systems use the client-server model as the fundamental model for providing an access interface to end users. The client-server model used in the WWW therefore makes it seamless and natural to integrate a text retrieval system into a web search system, where a web client becomes a client of the text retrieval system. Due to the stateless (sessionless) nature inherent in the web client-server model, a web search interface requires new approaches/techniques in an information retrieval system which are quite different from those of a traditional search interface design.
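As a sketch of the transaction path in Fig. 1 (web browser -> web server -> CGI -> search engine), the following minimal CGI script reads the query string passed by the web server, forwards the keywords to a backend engine and writes an HTML result page back to the client. The run_query() placeholder and the form field name 'q' are assumptions for illustration; in our design the call would be a query against the Oracle engine:

    #!/usr/bin/env python3
    # Minimal CGI sketch: the web server hands this script the form's query
    # string; the script calls the backend search engine and returns HTML.
    import os
    from urllib.parse import parse_qs
    from html import escape

    def run_query(keywords):
        # Placeholder: a real system would query the indexed database here.
        return [("http://www.example.com/doc1.html", "Sample matching page")]

    def main():
        form = parse_qs(os.environ.get("QUERY_STRING", ""))
        keywords = form.get("q", [""])[0]
        hits = run_query(keywords)

        print("Content-Type: text/html")
        print()                              # blank line ends the CGI headers
        print("<html><body><h1>Search results for: %s</h1><ul>" % escape(keywords))
        for url, title in hits:
            print('<li><a href="%s">%s</a></li>' % (escape(url), escape(title)))
        print("</ul></body></html>")

    if __name__ == "__main__":
        main()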
The major issues in developing a web search interface are:

1). Integration of the web server and the backend search engine

web server <-> search engine, i.e. how to efficiently handle the single transaction path: query request -> query processing -> query results. While the interaction between a web client and a web server is handled by the common HTTP protocol, the way a web server interacts with a search engine may differ considerably between web search systems. Because a web search system is usually designed to handle a large number of simultaneous requests, and a heavy load is assumed on the server side, it has to rely on high-performance server technology. Another consideration is how the system handles simultaneous searching and database updates.

Most current web search systems use some very limited 'parallel processing' techniques and replication technology to handle performance issues. A common approach, used by Lycos, InfoSeek, OpenText etc., is to use more than one workstation, with each machine running a web server and a local database. The local database on each machine is completely identical across all the machines, replicated from another machine which is dedicated to gathering/indexing new web pages. The replication (and thus the update) of the local databases is usually done once a day during the night (for some systems once a week or month). Currently only the 'Inktomi Search Engine' at UC Berkeley uses a truly parallel approach, in which the text database is partitioned across disks among several machines and a single physical database is shared by (can be accessed from) all the machines used to support web access. But the approach used in this system is very primitive in terms of parallel processing, from both the software and the hardware architecture. A workstation cluster is used, and the parallel software supporting concurrent I/O, CPU use etc. was developed from scratch, as the whole retrieval system was developed from scratch (actually based on a course project). Although the CPU processing is scalable, the parallel I/O in this system is not.

2). Sessionless -> session-oriented transactions in the web search interface

In many situations a session-oriented 'memory' function is required in a web search activity. For example, one query may generate thousands of results, which are too costly or impossible to send back for display in the client's web browser, due to the bandwidth requirements on the network and the cache/memory requirements on the client machine. One approach is to store all the results on the server and provide navigation buttons (next, prev etc.) to guide the user through the results one subset at a time. Because each user may issue a different query, and the navigation requires more than one web transaction, a web search system must 'remember' all the results of the first query for each current user. This means a user's web session includes the first query and the successive navigation transactions. In the early stage of web search systems, the only solution to this problem was to provide a 'max hits' setting with the first query, so that the system restricts the maximum number of query results returned to the user. Currently such session-oriented transactions are common practice in most web search systems. In our approach, this is a natural function of Oracle's text retrieval system: unlike other systems, where query results may be stored outside the search engine, our approach allows us to store the 'hitlist' (together with the query itself) inside the database and search engine. A minimal sketch of this kind of server-side result caching is given after this item.
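The sketch below illustrates the server-side 'hitlist' idea in its simplest form: the full result list is cached on the server under a session id, and each navigation request only retrieves one page of it. All names are illustrative, and an in-memory dictionary stands in for the database where our approach would actually keep the hitlist:

    # Minimal sketch of session-oriented result navigation: the full hitlist
    # is kept on the server, keyed by a session id, and the client only ever
    # receives one page of results per request.
    import uuid

    PAGE_SIZE = 10
    SESSIONS = {}            # session_id -> (query, full hitlist)

    def run_query(query):
        # Placeholder for the search engine: return many fake hits.
        return ["http://www.example.com/doc%d.html" % i for i in range(137)]

    def start_session(query):
        session_id = uuid.uuid4().hex
        SESSIONS[session_id] = (query, run_query(query))
        return session_id

    def fetch_page(session_id, page):
        query, hits = SESSIONS[session_id]
        start = page * PAGE_SIZE
        return query, hits[start:start + PAGE_SIZE], len(hits)

    sid = start_session("parallel database")
    query, first_page, total = fetch_page(sid, 0)     # first screen of results
    print("%d hits for '%s', showing %d" % (total, query, len(first_page)))
    query, third_page, _ = fetch_page(sid, 2)         # a later 'next' request
    print(third_page[0])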
3). Query refining

Query refining means that, after a previous query whose search domain may have been the whole indexed database, the new query uses the results returned by the previous query as its search domain, so that the search can be progressively refined until the targets are narrowed down. Because this capability requires advanced algorithms in the search engine, so far no web search system other than ours has implemented it.

4). Support for 'natural language like' queries

Most web search systems only support keywords and their regular expressions to form a query; a good web search system should also allow users to type in queries in natural language. This function requires additional processing of the query input in the search interface. No current web search system supports this capability yet, and we believe we can develop such a capability in our approach (not yet implemented, but technically doable).

6. Conclusion

Compared with other web search systems, we believe Oracle's approach has the following three major advantages:

1). Server performance and data volume. Because a parallel Oracle RDBMS server and a parallel machine are used as the server technology, our system can be truly scalable in server performance, supporting fast response times for a large number of concurrent users. It also allows us to index almost all of the 'web space' in the world without much concern about degradation of server performance.

2). Oracle's RDBMS technology as the core engine. No other web search system uses RDBMS technology, and it has many added values beyond the search engine itself. A good example is the newsgroups archive system, where we have not even used the full-text search capability yet. Server management, rapid prototyping, ease of SQL-based code development, and rich programming tools and development environments are among the top benefits of using an RDBMS (compared with typical full-text retrieval systems like WAIS and Glimpse). With Oracle's Context Server technology, well-structured information entities are combined with full-text search/indexing capability.

3). Simultaneous online updates and search. Due to performance considerations and the sophisticated consistency requirements in the indexing subsystem and search engine, no other system can support updating (add, modify, delete) of the database and web search at the same time, which means the database a user is searching may not be the most up to date. An RDBMS is designed from the ground up to support this kind of concurrent activity, and the parallel RDBMS server keeps the performance degradation small.

-----------------------------------------------------------------------------------------

Evaluation of the newly announced web search system "Alta Vista" by DEC

1. Summary

In summary, Alta Vista has the following capabilities, common to most web systems:

. A keyword search interface supporting regular expressions of keywords and phrases. The 'and', 'or' and 'not' operators are supported, and both word and phrase searching are supported. A limited 'proximity search' is supported (allowing up to 10 words between two specified words). Limited 'word stemming' (wildcard matching) is supported. It also supports case sensitivity of the search keywords/phrases.
. In addition to keyword search, it also supports attribute-related search, including the title of a web page, the hostname or URL string of a URL, and URL links in a web page.

Alta Vista has the following major advantages over other web search systems:

. Fast search response. It is not clear what search and indexing software is behind the system.
It is very fast for both word search and phrase search. From the news release attached at the bottom of this report, it seems that a DEC 64-bit Alpha cluster is used to host the search system, i.e. some parallel architecture may be employed to achieve this performance. Another possibility is that very large RAM is used in the hardware configuration, so that all the indexes are cached in core memory to sustain the high query performance.

. Bigger index database. Compared with other web search systems it provides more coverage (16 million pages), while the current largest are Lycos (10 million) and OpenText (millions). Alta Vista claims that a very fast 'web robot' is used to index web pages. Sustaining such fast query performance while providing fuller coverage of 'web space' is the most visible strength of Alta Vista.

. Full-text search of USENET newsgroups. In addition to full-text search across all 13,000 newsgroups, it also provides some attribute-related search, including sender, newsgroup name and subject.

. Some limited ranking mechanism for query results.

We would say that Alta Vista is better than all the current web search systems in terms of coverage, query performance and accuracy. There are at least two features that Alta Vista is missing:

. Query refining. This capability requires some sophisticated development on the server to remember the results of the previous query, i.e. to make the sessionless web client-server interaction session-oriented. We believe that soon all the current web search systems will provide this capability.

. Showing highlighted keywords in the query results. This is indicated by the FAQ item below:

"Alta Vista shows the first few words of the documents it finds, but I would like it to show some context so I could tell more quickly whether the document is one I want to look at. Why not provide some context? The words and phrases that Alta Vista uses to match the document may occur scattered anywhere in it from the beginning to the end, and may occur multiple times. In general, there is no canonical way of deciding which lines to show as context for a matching query."

It seems that Alta Vista only keeps the indexes of web pages. We plan to keep both the indexes and the original pages (in compressed form), which allows us to support this function.

2. Detailed Evaluation

Alta Vista is a complex web search system provided by Digital. The evaluation covers four categories - interface, search scope, search capabilities and performance.

2.1 Interface

The interface of Alta Vista consists of two parts: one intended for simplified quick search, and one that provides a sophisticated set of options for more complex queries. The simplified interface is very efficient, though it is not self-explanatory and a first-time user must read the help page to discover the options. There is only one input line, where the user types both words and options. A single check box controls the way documents are displayed: the user can choose between a Compact and a Detailed display. The advanced query interface requires the user to enter a complex query with logical operators such as 'AND', 'OR', 'NOT' and 'NEAR'. The form also allows specifying the ranking key and a time period for the documents.

2.2 Search scope

Alta Vista claims to have over 16 million documents, which would rank it above the other search systems. However, it probably has some duplicated documents, as it informs the user about removing duplicates. It is difficult, however, to say anything about its gathering performance.
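The limited proximity search noted in the summary (two words at most 10 words apart, also examined in 2.3 below) can be supported by a positional index, as the following generic sketch shows; it is an illustration of the technique, not Alta Vista's actual implementation:

    # Minimal proximity-search sketch: a positional index maps each word to a
    # list of (doc_id, position) pairs; two words satisfy a NEAR query when
    # they occur in the same document at most max_gap words apart.
    # The sample documents are illustrative assumptions.
    import re
    from collections import defaultdict

    def build_positional_index(documents):
        index = defaultdict(list)
        for doc_id, text in documents.items():
            for pos, word in enumerate(re.findall(r"[a-z0-9]+", text.lower())):
                index[word].append((doc_id, pos))
        return index

    def near(index, word1, word2, max_gap=10):
        hits = set()
        for doc1, pos1 in index.get(word1, []):
            for doc2, pos2 in index.get(word2, []):
                if doc1 == doc2 and abs(pos1 - pos2) <= max_gap:
                    hits.add(doc1)
        return hits

    documents = {
        "url1": "the parallel server runs the web search engine on one cluster",
        "url2": "search technology is discussed at the very beginning while the "
                "word engine only appears near the end of this longer example",
    }
    index = build_positional_index(documents)
    print(near(index, "search", "engine"))  # {'url1'}: in url2 the two words
                                            # are 11 words apart, beyond the gap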
2.3 Search capabilities

Alta Vista has a wide set of search functions. It allows searching for words, phrases, and words within a distance of 10 words of each other. It can also find documents that do not contain a given word. The user can also give only the beginning of a word, for example string*, which matches any word beginning with 'string'. It also ranks documents, and the user can supply a separate list of words for ranking. Alta Vista operates in both a case-sensitive and a case-insensitive manner: if a typed word contains uppercase characters it is treated as case sensitive, otherwise it is evaluated case insensitively. In general it has a medium range of search options. Alta Vista is better than most search systems in terms of phrase searching and proximity searching, but it does not perform any advanced word approximation.

2.4 Performance

It is extremely fast in retrieval, especially for phrase search. It seems that some kind of index is used for phrase search as well as for normal word search. One thing that points to this is that Alta Vista does not recognize punctuation characters and treats them all as spaces.

3. Conclusions and comparison with Oracle TextServer

Alta Vista is a very fast and powerful search system. However, it does not do any of the special text processing that would be possible when using Oracle TextServer with Context. Oracle TextServer certainly gives much better general capabilities; especially when integrated with Context, it would give all the capabilities of Alta Vista and many more. TextServer allows, for example, synonym search, soundex expansion and pattern matching, and Context would allow theme search and document summaries. TextServer also offers a much more flexible way of refining a search: Alta Vista only allows simple additions of new options to a query, whereas TextServer allows issuing a completely new query over only the previous query's results. The most important advantage of Alta Vista compared with other search systems is its speed. From our earlier experience, the Oracle system tends to be slow; those tests were, however, performed on an older version of the product (TextRetrieval), which is improved in the new Context Server release.

4. Text from the news release of AltaVista - URL: http://www.digital.com:80/info/flash/

Digital Equipment Corporation today introduced the Internet's first `super spider' software, as part of the most advanced information search and indexing technology available for the World Wide Web. Blazing fast, it conducts the most comprehensive search of the entire Web text at speeds up to 100 times faster than spiders used in conventional information search services. Under development at Digital's Corporate Research Group here, the technology promises to surpass the limitations of current information services by delivering the most complete, precise, and up-to-date information of the Web's entire text. The technology's super spider and super indexer employ next-generation software and advanced networking, powered by the highest-performing 64-bit Alpha computers. The advanced technology is set to undergo its toughest testing starting today as Digital makes a `beta' version available to tens of millions of Web users worldwide.