Use Oracle's RDBMS and SQL*TextRetrieval Technology to 
Build a Web Search System (draft)

Gang Cheng and Piotr Sokolowski, NPAC, Syracuse University

1. Introduction

A web search system generally has four major components (subsystems):
a gathering subsystem, an indexing subsystem, a search engine and a web 
search interface.
In each of those subsystems, I will compare our approaches with the most advanced one
known to me. Because of the potential commercial value of all the web systems,
it is very difficult to learn the internal techniques and details of exactly what
technology those web systems use. I will mostly use Lycos as the major
comparison example in my discussion, because Lycos is one of the most popular
web search systems on the Internet, and because it started as a research project
in an academic environment and was later commercialized by industry as a major
Internet information provider and service.

The most popular web search systems at present are (you can find them from
the 'net search' page in the Netscape web browser):

. OpenText - this will be our major competitor. It is the backend engine
for Yahoo, is a free service, and has the largest indexed web database.
See http://www.opentext.com:8080/omw/f-faq.html.
Oracle uses some of its technology in developing Oracle's full-text
search product.
. Lycos - free service.
. Infoseek - free service for the web database, paid service for newsgroups
and other on-line databases.
. WebCrawler - free service.
. Yahoo - a catalog database, not truly a web search system; catalog
search is provided by the OpenText system.

From the discussion and comparisons detailed in the following sections,
it will become evident that our approach - using Oracle's RDBMS and its
SQL*TextRetrieval technology, together with our unique advantages in parallel
server technology and infrastructure - puts us in a favorable position to become
a major Internet information provider and service center, competing with the other
web search systems currently popular in the Internet community. I will show that our
approach provides better solutions than most of the other web systems in ALL four subsystems.
My conclusions and comparisons are based on our experience of building the 'USENET newsgroups
archive' and the 'NPAC books search' prototype systems.


		Fig. 1 The general architecture of a web search system


remote/local web sites <-> gathering subsystem -> indexing subsystem <-> search engine
						        	            ^
						        			    |
	clients with web browsers -- web search interfaces ------ web server(s) -- cgi

2. Web Gathering Subsystem

This subsystem's major function is to gather web pages from either
remote or local web sites. It deals with the whole Internet 'web space',
which may include information in httpd, ftp, gopher, wais and USENET news servers.
This gathering task is usually carried out automatically by a small program, usually
called a 'web robot', 'web spider' or 'web agent', whose job can be briefly described as follows:
1. Starting from a single URL, it requests the web page from the remote web site
of the URL. Upon receiving the full page, it parses the page and adds all the URLs
found in it to a queue for further gathering.
2. For the web page from (1), according to the indexing rules predefined
in the indexing subsystem and the search engine, it parses the page and passes on only
the information needed by the indexing subsystem.
3. After (2) is done, it gets a new URL from the queue, checks whether the new URL is already
indexed, and returns to (1) to continue gathering.

In summary, a web robot continuously requests files from remote web sites, and parses
and filters each file for indexing and further gathering.
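
The gathering loop described above can be sketched as follows. This is only a
minimal illustration in Python, not our Perl robot: the fetch() function, the
in-memory PAGES table and the href pattern are assumptions made so that the
sketch runs without a network.

```python
import re
from collections import deque

# Hypothetical in-memory "web" so the sketch runs without network access.
PAGES = {
    "http://a/": '<a href="http://a/1">one</a> <a href="http://a/2">two</a>',
    "http://a/1": '<a href="http://a/">home</a> text about oracle',
    "http://a/2": "text about parallel servers",
}

HREF = re.compile(r'href="([^"]+)"')

def fetch(url):
    """Stand-in for an HTTP request; returns the page text or None."""
    return PAGES.get(url)

def gather(start_url, index_page):
    """Breadth-first gathering: fetch a page, queue its links, pass it to the indexer."""
    queue = deque([start_url])
    seen = {start_url}                     # URLs already queued or gathered
    while queue:
        url = queue.popleft()              # step 3: take the next URL from the queue
        page = fetch(url)                  # step 1: request the page
        if page is None:
            continue
        for link in HREF.findall(page):    # step 1: keep every URL found in the page
            if link not in seen:
                seen.add(link)
                queue.append(link)
        index_page(url, page)              # step 2: hand the page to the indexing subsystem
    return seen

indexed = []
gather("http://a/", lambda url, page: indexed.append(url))
print(indexed)
```

The 'seen' set plays the role of the "already indexed" check in step 3; in a
database-backed robot that check would be a lookup in the database instead.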

It is this subsystem that decides which information is cached, indexed and updated
in the database for web search, what is kept from it, and when. It determines the information
content and volume, and to some extent the performance, of a web search system. It also
determines the accuracy of a web search system. Different web systems use different
rules and approaches. Major issues are:

1). Which to gather
  choice 1: all or only HTTPD,FTP,GOPHER,WAIS,NEWS server
  choice 2: for a single server, all or first N levels of URLs visited
  choice 3: .html, .ps, .txt, .* for a single server
2). What to index
  choice 1: all or parts of 'significant' keywords in a file
  choice 2: attributes of a file, such as last update date, title, subtitles, size, outline etc.
  choice 3: controlled by size/lines, eg. first 10 lines of a text file
3). When to gather/index/update
  choice 1: index into the same database that the web search interface currently uses, or
            into a separate one
  choice 2: in real time during the day (i.e. online indexing) or at night
4). Performance of a gatherer -- number of files that can be gathered/indexed per day/hour
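
The "which to gather" choices above amount to a filter applied to every URL
before it is requested. A minimal sketch in Python follows; the POLICY table,
its field names and the should_gather() function are hypothetical names used
only for illustration, and the values in it are examples, not our settings.

```python
import posixpath
from urllib.parse import urlparse

# Hypothetical gathering policy covering choices 1-3 under "Which to gather".
POLICY = {
    "schemes": {"http", "ftp"},           # choice 1: which server types to visit
    "max_depth": 3,                       # choice 2: first N levels of URLs
    "extensions": {".html", ".txt", ""},  # choice 3: file types ("" = directory-style URL)
}

def should_gather(url, depth, policy=POLICY):
    """Return True when the URL passes the server, depth and file type rules."""
    parts = urlparse(url)
    ext = posixpath.splitext(parts.path)[1].lower()
    return (parts.scheme in policy["schemes"]
            and depth <= policy["max_depth"]
            and ext in policy["extensions"])

print(should_gather("http://www.npac.syr.edu/index.html", 1))
print(should_gather("http://www.npac.syr.edu/paper.ps", 1))
```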

It is difficult to know which of the above choices other web systems have made. One thing is
clear, though: because of the significant performance and space requirements,
all current web systems share some common approaches:

1). At most 3 levels of a web site are indexed. HTTPD and FTP servers are the major
    targets.
2). Only a small proportion of the words in a file (usually 10% - 20%) is indexed, together with
    common attributes including title, size, date, subtitle and the first 10 lines (as an outline).
3). Indexing is not done in real time; updating is usually done at night.

Our approach, given the resources and performance of a parallel Oracle 7 server and huge disk space:
1). Currently HTTPD servers (and NEWS servers); in the future, all servers. All levels of a
web server are gathered (controllable). Our recent test in indexing www.npac.syr.edu shows there are about
30,000 html URLs, while a search of Lycos found fewer than 100 of them, which
means Lycos uses a very restrictive URL level control for gathering.
2). Using Oracle's SQL*TextRetrieval technology, we index all significant words, where
significance is defined by a stoplist. All words not in the stoplist are indexed. Words in
the stoplist are common words such as 'I', 'the', etc.
3). Using the parallel server and the insert/indexing options of a parallel Oracle server,
we can achieve online indexing (just as the newsgroups system achieved online archiving).
For example, given a four-node SP2, because parallel Oracle 7 uses the 'shared-disk'
architecture on the SP2, one Oracle instance can be dedicated to gathering/updating while
the other 3 SP2 nodes are dedicated to web searching. This supports simultaneous querying
and gathering without performance degradation, which is not possible on web systems
that do not employ RDBMS technology. Data consistency is a big plus in an RDBMS-based web
search system.
4). Piotr is developing a high-performance web agent that takes advantage of the RDBMS engine
for indexing. It uses multiple gathering processes sharing the same database server to make
full use of the network bandwidth available to the gathering process. This approach is unique
to an RDBMS-based system. The current rate
for indexing local NPAC web sites is about 1 page/second. Note that in our approach indexing a new
page takes the same time as updating an already indexed page.

In summary, our approach of using an RDBMS server has the following major advantage
over the current web systems in the aspect of indexing:
1). Full coverage and indexing of the 'web space' with no performance degradation,
if implemented on a parallel Oracle server.
The major reason other web search systems do not provide full coverage
is the performance/space requirement, which cannot be met on
a uniprocessor server (single CPU, no parallel I/O). A web search system demands
high-performance server technology, and we are in a unique position to
compete with other web search systems precisely because of our parallel server
infrastructure at NPAC. (The current Lycos indexed database is less than 2 GB.)

Our current web robot is written in Perl4 because of the significant parsing requirement.
Because we take advantage of an RDBMS server in developing this robot, the performance
bottleneck in network bandwidth usually found in other web robot
systems is no longer a limiting factor in our approach. Our experiments
show that the current performance bottleneck of our robot is in the pattern matching part
of the Perl program. Further enhancement will require the program to be rewritten in C,
which technically is not a problem.

Facts and data:

Yahoo - several Indy workstations and Intel Pentium PCs
Lycos : 1.178 unique URLs
   . 8.7 GB processed. 1.8 GB summary text (attribute text) stored
   . 1.08 GB inverted index in the database

InfoSeek - 8 SUN1000, may use an RDBMS or OODBMS
   . paid service: USENET newsgroups, 4-week expiry period,
                   15000 newsgroups, updated nightly, 2 million news articles, 7 GB total
   . free service: web pages - 400K URLs, 2 GB total, updated weekly for old URLs and daily for adding new ones,
                   full-text indexing, can update the whole database in 48 hrs, adds new pages once a week,
                   updates/revisits each indexed page once a month,
                   case sensitive, also indexes numbers and symbols,
                   supports "phrase search" and some limited proximity search,
                   uses one-level control, doesn't support automatic word expansion,
                   robot performance: 10000 pages/hour, whole system written in Python.

Inktomi Search Engine (UC Berkeley): 1.3M URLs, partitioned in parallel among 4 SUN10s, largest number of indexed URLs.
                   Full or partial index: unknown. Poor search interface, keyword only.
Open Text: the Web Index contains about 985 million words of text,
and 15,436,712 hyperlinks (URLs visited, not necessarily indexed in the db). 

In the most recent Open Text Web Index update: 
      22,106 pages were removed, of which 11,102 are no longer on the Web and 11,004
      were replaced with changed versions. 
      16,930 other pages were revisited but had not changed. 
      74,443 new pages, not previously in the Index, were added.
      These statistics were computed Mon Oct 16 23:55:26 EDT 1995.

Current estimate of the number of web pages on the whole Internet: 4 - 5 million

3. Indexing Subsystem

This subsystem handles how the free text in all documents/files is internally stored and managed
in the database so that searching can be supported efficiently and effectively. It includes the logical
data structure of text entities and the physical organization/layout of data in the database.
The indexing subsystem is tightly coupled with the search engine subsystem, as its sole purpose
is to speed up the free text search/query process conducted by the search engine.
The common approach is to build an 'inverted index' for all keywords in a document.
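
As a small illustration of an inverted index built with a stoplist of the kind
mentioned in section 2, consider the following Python sketch. The STOPLIST
contents and the function names are assumptions for the example; this shows the
general technique, not Oracle's internal index structure.

```python
import re
from collections import defaultdict

STOPLIST = {"i", "the", "a", "of", "and", "is"}   # illustrative stoplist only

def significant_words(text):
    """Lower-case the text, split it into words, and drop the stoplist words."""
    return [w for w in re.findall(r"[a-z0-9]+", text.lower()) if w not in STOPLIST]

def build_inverted_index(docs):
    """Map each significant word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in significant_words(text):
            index[word].add(doc_id)
    return index

docs = {
    1: "The parallel server is fast",
    2: "I index the web",
    3: "Parallel index of the web",
}
index = build_inverted_index(docs)
print(sorted(index["parallel"]))   # ids of documents containing "parallel"
print(sorted(index["index"]))      # ids of documents containing "index"
```

A keyword query then becomes a lookup in 'index' rather than a scan of every
document, which is the whole purpose of the indexing subsystem.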

Major approaches/issues on which web systems differ:

(1). Compression scheme used to store the text and its indexes.
    It is not clear exactly what technology other systems are using. WAIS is the typical technology
    invented to deal with indexing/searching free text, and its major drawback is the size overhead
    of its indexes - it compresses neither index nor text, which results in a roughly 1:1 ratio of
    original text to indexes. Lycos seems to use a similar technology for indexing.
(2). Keep both original text and indexes in the database, or keep just the indexes.
    In order to support certain advanced search capabilities (to be discussed in the
    search engine section) such as phrase search and proximity search, the original
    text must be stored together with its indexes. This almost doubles the space
    requirement of a web search system. To my knowledge, currently only InfoSeek stores
    both. That is also the reason InfoSeek keeps a very small URL database (400K).
(3). Index modes: real-time indexing, batch indexing, and incremental indexing.
    With real-time indexing, updates, inserts or deletes of text in the existing
    text database at runtime automatically modify the associated indexes.
    Batch indexing only updates the index in a batch after a bulk of text has been
    modified. Incremental indexing allows the indexing process to be done incrementally:
    adding new text does not require reindexing all the previous text.
    It is not clear what indexing modes other web search systems allow. Batch and
    incremental indexing are common capabilities.
(4). Case sensitivity, symbols.
    Including case sensitivity and symbols in the index increases
    the space required for indexing.

Oracle's approach:
	The index facility used in the RDBMS server is only good for locating data
entities with well-structured attributes, and is not usable for a
free text database. For this reason, Oracle developed a separate product
called Context Server (previously called SQL*TextRetrieval) that builds
another logical/software layer on top of its RDBMS engine. There is
some penalty in using an RDBMS search engine to mimic a text search engine,
which I will not detail here. Oracle uses some quite unique and
smart techniques to deal with this issue. Compression schemes
are used for both text (run-length compression) and indexes (bitstrings)
to reduce space usage and speed up text queries. The typical compression
ratio for text is 4:1, and for indexes it depends on the number
of documents indexed.
It also supports all the index modes, case sensitivity, etc. mentioned
above.
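
To show why a bitstring is a convenient index representation, here is a minimal
Python sketch in which each keyword owns one bitstring with one bit per
document. It is an assumption-laden illustration of the general idea only; the
details of Oracle's actual bitstring format are not described here, and the
function names are hypothetical.

```python
def build_bitmaps(docs):
    """For each word, build an integer whose bit i is set when document i contains the word."""
    bitmaps = {}
    for doc_id, text in enumerate(docs):
        for word in set(text.lower().split()):
            bitmaps[word] = bitmaps.get(word, 0) | (1 << doc_id)
    return bitmaps

def doc_ids(bitmap):
    """Decode a bitstring back into a sorted list of document numbers."""
    return [i for i in range(bitmap.bit_length()) if (bitmap >> i) & 1]

docs = ["oracle parallel server", "web search", "parallel web search", "oracle web"]
bm = build_bitmaps(docs)

both = bm["parallel"] & bm["web"]       # 'and' query: one bitwise AND
either = bm["oracle"] | bm["search"]    # 'or' query: one bitwise OR
print(doc_ids(both), doc_ids(either))
```

Logical operators on keywords turn into single bitwise operations, and the
size of each bitstring grows with the number of documents indexed.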

It is relatively hard to compare the indexing techniques used in different web
search systems, but basically there are two major measures:
space consumption and search response time. A good index scheme should
use little space yet give short query processing times.

The performance of a web search system depends largely on how the indexes and text are
partitioned across the server. Parallel Oracle is the best choice so far for handling
data partitioning, parallel I/O and caching efficiently to achieve optimized
query performance. Basically a web search system has OLTP-type search characteristics,
with one important feature: 99% of transactions are read-only queries, with 1% updating
transactions running periodically in the background. A parallel server is an ideal candidate
to support this type of information system - almost all of the speedup can be
achieved through 100% embarrassingly parallel processing.
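
The read-only, partitioned query pattern can be sketched in Python as below.
The PARTITIONS data and function names are hypothetical, and threads merely
stand in for server nodes; the point is only that each partition is searched
independently and the partial results are merged at the end.

```python
from concurrent.futures import ThreadPoolExecutor

# Four hypothetical index partitions, e.g. one per server node.
PARTITIONS = [
    {"oracle": [1, 5], "web": [2]},
    {"web": [11, 14]},
    {"oracle": [23], "search": [21]},
    {"web": [30], "search": [31]},
]

def search_partition(partition, word):
    """Read-only lookup in one partition; no coordination with other partitions."""
    return partition.get(word, [])

def parallel_search(word):
    """Run the same query on every partition at once and merge the hits."""
    with ThreadPoolExecutor(max_workers=len(PARTITIONS)) as pool:
        parts = list(pool.map(search_partition, PARTITIONS, [word] * len(PARTITIONS)))
    return sorted(hit for part in parts for hit in part)

print(parallel_search("web"))
```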


3. Search Engine Subsystem

The search engine subsystem accepts queries from the web search interface and
schedules, partitions and executes them against the indexed database to locate
URLs and their associated attributes which satisfy the query criteria.
It also deals with performance-related system activity such as I/O, caching, assembling and sorting.
Based on what content is indexed and the scheme used in the indexing
subsystem, it implements/supports different search algorithms to provide the different
search capabilities of a text retrieval system. Together with the previous two subsystems,
it determines what basic and advanced search capabilities a system can support, and how
efficiently.

Most web search systems are keyword-based search systems. Major search functions
are:

Basic:

(1). keywords with logical operators and their combinations, including 'and', 'or', 'not'.
(2). regular expression of keywords, including single/multiple wildcard matching.
(3). keyword expansion: stemming, fuzzy, soundex, expansion based on a thesaurus (such
     as synonyms).
(4). ranking of query results
(5). case sensitive/insensitive

Advanced:
(1). summarizing - summarize a document
(2). similarity search - search similar documents to a particular document
(3). phrase search 
(4). proximity search - specify the distance in words between keywords

Almost all web search systems support the 'basic' search capabilities.
Lycos doesn't have any of the advanced ones, while InfoSeek only supports
'phrase search' and some limited 'proximity search'. OpenText has all the functions.
Oracle's system has all except 'similarity search'.
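
Phrase and proximity search both need to know where words occur, not just
whether they occur. One way to illustrate this is to keep word positions with
each index entry, as in the Python sketch below. This positional index is an
assumption chosen for the example; the systems discussed here may instead scan
the stored original text, and the function names are hypothetical.

```python
from collections import defaultdict

def build_positional_index(docs):
    """Map word -> {doc_id: [positions]} so that word order and distance are searchable."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for pos, word in enumerate(text.lower().split()):
            index[word][doc_id].append(pos)
    return index

def near(index, w1, w2, max_gap):
    """Documents where w2 occurs 1..max_gap words after w1 (max_gap=1 is a two-word phrase)."""
    hits = []
    for doc_id, p1 in index.get(w1, {}).items():
        p2 = index.get(w2, {}).get(doc_id, [])
        if any(0 < b - a <= max_gap for a in p1 for b in p2):
            hits.append(doc_id)
    return sorted(hits)

docs = {
    1: "parallel web search system",
    2: "search the web in parallel",
    3: "web based text search",
}
pindex = build_positional_index(docs)
print(near(pindex, "web", "search", 1))   # phrase "web search"
print(near(pindex, "web", "search", 3))   # "search" within 3 words after "web"
```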

4. Web Search Interface

Any full-text information retrieval system has the two core subsystems discussed above -
"indexing" and "search engine". They represent core technology developed in the information
processing literature and are used directly in a web search system. What is unique and different
in a web search system, compared to traditional text retrieval systems, lies in the 'web gathering'
and 'web search interface' subsystems. These two subsystems are also the major development
work we are undertaking to build a web search system, as we rely on Oracle's RDBMS technology
for the 'search engine' and its full-text retrieval technology for the 'indexing' subsystem.
Most web search systems, such as OpenText, Lycos, InfoSeek and WebCrawler, were built
almost entirely from scratch, i.e. their builders developed all four subsystems using
their own proprietary technology. Our approach, in contrast, is to use Oracle's well-established
RDBMS and full-text retrieval technology as the core engine while developing our own 'web robot'
and 'web search interface', which are integrated into a web search system and make the best
use of the core Oracle technology.

A text retrieval system usually has its own search interface and also allows developers to build
new search interfaces to access the text database.
Most text retrieval systems use the client-server model as the fundamental model for providing
an access interface to end-users. The client-server model used in the WWW therefore makes it seamless and
natural to integrate a text retrieval system into a web search system, where a web client becomes a client of
the text retrieval system. Because of the stateless, or sessionless, nature
inherent in the web client-server model, a web search interface requires new approaches/techniques
in an information retrieval system, quite different from those of traditional search interface design.

Major issues in developing a web search interface are:

1). Integration of the web server and the backend search engine (web server <-> search engine), i.e.
how to efficiently support the single transaction path query request -> query processing -> query results.
While interaction between a web client and a web server is handled by the common http protocol, the way
a web server interacts with a search engine may be quite different in different
web search systems. Because a web search system is usually designed to handle a large number of simultaneous
requests, and heavy load is assumed on the server side, it has to rely on high performance server
technology. Another consideration is how the system handles simultaneous searches and database updates.
Most current web search systems use some very limited 'parallel processing' techniques and replication
technology to handle performance issues. A common approach, used in Lycos, InfoSeek, OpenText etc.,
is to use more than one workstation, each machine running a web server and a local database. The local
database is completely identical across all the machines, replicated from another machine
which is dedicated to gathering/indexing new web pages. The replication (and thus the updating) of the
local databases is usually done once a day at night (in some systems once a week or month).

Currently only the 'Inktomi Search Engine' at UC Berkeley uses a truly parallel approach, in which
the text database is partitioned across the disks of several machines and a single physical database is shared
(can be accessed) by all the machines used to support web access. But the approach used in this system
is very primitive in terms of parallel processing, in both its software and hardware architectures.
A workstation cluster is used in this system, and the parallel software used to support concurrent I/O, CPU
etc. was developed from scratch, as was the whole retrieval system (which is actually
based on a course project). Although the CPU processing is scalable in this system, the parallel I/O is not.

We believe our approach of employing the best technology in parallel databases and parallel machines
in our web search system is unique and will eventually outperform all the other web systems,
which are built on uniprocessor technology.

2). Sessionless -> session-oriented transactions in the web search interface.
In many situations a session-oriented memory function is required in a web search activity. For example,
one query may generate thousands of results, which are too costly or impossible to send back for display in
a client's web browser because of the network bandwidth required and the cache/memory required
on the client machine. One approach is to store all the results on the server and provide
navigation buttons (next, prev etc.) to guide the user through the results one subset at a time.
Because each user may issue a different query and the navigation requires more than one web transaction,
a web search system must 'remember' all the results of the first query for each current user. This means
a user's web session includes the first query and the successive navigation transactions.
In the early stages of all web search systems, the only solution to this problem was
a 'max hits' button for the first query, so that the system restricted the maximum number
of query results returned to the user. Session-oriented transactions have now become
common practice in most web search systems. In our approach, this is a natural function of
Oracle's text retrieval system. In fact, unlike other systems where query results may be
stored outside the search engine, our approach allows us to store the 'hitlist' (together with
the query itself) inside the database and search engine.
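
The idea of keeping the hitlist in the database and paging through it with
independent web requests can be sketched as follows. SQLite is used here only
as a stand-in so the example is self-contained; the table name, column names
and functions are hypothetical and are not Oracle's hitlist schema.

```python
import sqlite3

db = sqlite3.connect(":memory:")   # SQLite stands in for the Oracle server
db.execute("CREATE TABLE hitlist (query_id INTEGER, pos INTEGER, url TEXT)")

def store_hits(query_id, urls):
    """Save the full result list of one query, in ranked order."""
    db.executemany(
        "INSERT INTO hitlist VALUES (?, ?, ?)",
        [(query_id, pos, url) for pos, url in enumerate(urls)],
    )

def page(query_id, page_no, page_size=10):
    """Return one page of a stored hitlist; each call is an independent web request."""
    rows = db.execute(
        "SELECT url FROM hitlist WHERE query_id = ? ORDER BY pos LIMIT ? OFFSET ?",
        (query_id, page_size, page_no * page_size),
    )
    return [url for (url,) in rows]

store_hits(1, ["http://site/%d" % i for i in range(25)])
print(page(1, 0)[:2])    # first two hits of the first page
print(len(page(1, 2)))   # number of hits on the last page
```

In a web interface the query_id would have to travel with each 'next'/'prev'
request (for example in the URL or a hidden form field), since the http
protocol itself keeps no session.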

3). Query refining.
Query refining means that, after a previous query whose search domain may have been the whole indexed
database, the new query uses the results returned by the previous query as its search domain,
so that you can keep refining your search and finally narrow it down to your targets. Because this
capability requires advanced algorithms in the search engine, so far no web search system
except ours has implemented it.
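
In its simplest form, query refining restricts each new query to the result set
of the previous one. The Python sketch below shows only that principle; the
INDEX data and the search() function are hypothetical, and a real engine would
work against the stored hitlist rather than in-memory sets.

```python
# Hypothetical inverted index: word -> set of document ids.
INDEX = {
    "oracle": {1, 2, 5, 8},
    "parallel": {2, 3, 5},
    "web": {1, 2, 3, 4, 5},
}

def search(word, domain=None):
    """Search the whole index, or only inside 'domain' when refining."""
    hits = INDEX.get(word, set())
    return hits if domain is None else hits & domain

first = search("web")                       # search domain: whole database
second = search("oracle", domain=first)     # refine within the first results
third = search("parallel", domain=second)   # refine again
print(sorted(third))
```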

4). Support for 'natural language like' queries.
Most web search systems only support keywords and regular expressions over them to form a query.
A good web search system should also allow users to type in queries in natural language. This
function requires additional processing of the query input from the search interface. No current
web search system supports this capability yet, and we believe we can develop
such a capability in our approach (not yet implemented but technically doable).

5. Conclusion

Compared with other web search systems, I believe our approach has the following three major
advantages:
1). Server performance and data volume advantages. Because a parallel Oracle RDBMS server and a parallel machine
are used as the server technology, our system can be truly scalable in server performance,
supporting fast response times for a large number of concurrent users. It also allows us to
index almost all the 'web space' in the world without much concern about degradation of server performance.
No other organization currently in this 'web search' business has our unique position in
the 'parallel processing' business as far as technology and infrastructure go.
2). Oracle's RDBMS technology as the core engine. Not many other web search systems
use RDBMS technology. It adds much value beyond the search engine. A good example
is the newsgroups archive system, where we have not even used the full-text search capability yet.
Server management, rapid prototyping and ease of SQL-based code development, and rich programming
tools and development environments are among the top benefits of using an RDBMS system (rather than a
typical full-text retrieval system like WAIS). In addition, we are among the leading groups in the
nation in the research and applications of web-RDBMS integration.
3). Simultaneous online updates and search. Because of performance considerations and the sophisticated
consistency requirements in the indexing system and search engine, no other system can support updating
(add, modify, delete) of the database and web search at the same time, which means the database
a user is searching may not be the most up to date. An RDBMS system is born to support this kind
of concurrent activity, and the parallel RDBMS server lets it do so with little performance degradation.

Our major disadvantages are:
1. A serious web search system must run in production mode and requires quite a significant investment.
Most current web search systems are run by commercial companies with dedicated professional staff and resources.
Their ultimate goal in running a web search system is 'profit', and in this case 'money' and
'fame' go together: usually, in order to be 'rich', such a system must be 'famous'. Our current staff and
facilities are not trained for or run in such a production environment. Our effort may fail simply because of
this 'service' kind of requirement, even though we have the best technology/expertise/system.
2. Stability of the parallel machine and maturity of the parallel Oracle RDBMS. Until we have it working,
we cannot be sure that all the fancy technologies used in this system are mature/reliable enough to
support such a full production system. New technology is always risky at first.
3. Marketing. The current 'web search' systems (there are about 14 in total) have already established
their images and customer bases. We will be a latecomer. Without significant marketing
effort, it may be difficult to compete with the existing systems even if we have better capability.

Comparison conclusion:

In terms of technology, OpenText is our major competitor, with the full capability of what we are
doing. Other systems are less competitive and have quite large limitations, but they have better
marketing images and user bases.