Developing WWW Search Systems Using SQL*TextRetrieval

Gang Cheng, Piotr Sokolowski, Marek Podgorny and Geoffrey Fox
[gcheng,piotr,marek,gcf]@npac.syr.edu
Northeast Parallel Architectures Center at Syracuse University, NY 13244

Abstract

We describe our effort in developing Web search systems using Oracle's SQL*TextRetrieval. In a prototype system covering on-line books in HTML and a web site, SQL*TextRetrieval is used to index the significant words in the 'web space' and provides an efficient engine for free-text search, while the Web enables global access to, and maximum information sharing with, this hypertext-based text retrieval system. Using Oracle's ConText server technology, various search options are implemented, including basic word stemming, phrase, fuzzy and soundex searching, as well as more advanced proximity search and concept search. For the concept search option, we converted a public-domain thesaurus, the Roget Thesaurus, into our text search system to support synonym expansion. In combination with the Web-Oracle-Web (WOW) package, the system also makes openly available, under the web search interface, listings of both the indexed words with their occurrence frequencies and the stoplist words. An advanced search mechanism that recursively refines the search domain via the web is also described. The prototype system can be found at URL .

1. Introduction

The World Wide Web (WWW) has become a common and popular way for both corporate and Internet users to publish and exchange information over networks. Its base components are HTML, the markup language; HTTP, the network transport protocol; and the HTTP server and web browser, which act as server and client. The WWW can be viewed as a gigantic distributed database (or data repository) in which the data entities are HTML pages stored on web servers, mostly unstructured or poorly structured text documents, and in which the logical relationships among web pages are represented by hyperlinks in HTML. The interactions between a web browser and (remote) web servers via HTTP can be seen as the network-based access interface and transactions between database client and server. Browsing, i.e., clicking through hyperlinks, is the simplest and most primitive search/query activity in this database, requiring no sophisticated search engine beyond the file system on the web server. Depending on the client's browsing scope, the content of this distributed database can be as small as a single web site or as large as all the web servers on the whole Internet.

With the daily growth of the WWW space, it has become impossible to rely solely on browsing to quickly and accurately locate and discover relevant information on the WWW. One natural solution is to build web search systems. For the distributed WWW database, such a web search system can be viewed as a centralized meta-index database in which text indexes of the distributed data repository are built and stored in a central place to speed up the browsing process. There are many approaches to building such a system. For example, Yahoo [5] uses a 'Yellow Pages' approach to catalog web servers by predefined criteria. While many WWW indexing services like Yahoo adopt such approaches, close to WWW technology itself, in this paper we are most interested in employing well-established database and information retrieval technology to build web search systems. This approach is widely used in the current mainstream web search systems, including Alta Vista [1], OpenText [2], InfoSeek [3] and Lycos [4].
This paper is organized as follows. Section 2 discusses the system configuration and major components found in a typical web search system. Oracle's full-text indexing and search system, SQL*TextRetrieval, is described in Section 3. Section 4 describes our experience in developing a prototype web search system using SQL*TextRetrieval. The last section outlines our current and planned research and development in this area.

2. Components of a WWW search system

A web search system is a facility to locate and discover relevant information represented by WWW URLs. The search domain can be either the whole WWW information space on the Internet (including httpd, ftp, gopher, wais and USENET news servers) or the HTML documents on a single web site. As shown in Figure 1, a web search system generally has four major components: a gathering subsystem, an indexing subsystem, a search engine and a WWW-based search interface.

Fig. 1 The general architecture of a web search system

    remote/local web sites <-> gathering subsystem -> indexing subsystem <-> search engine
                                                                                  ^
                                                                                  |
    clients with web browsers -- web search interfaces ------ web server(s) --- cgi

2.1 Web Gathering Subsystem

This subsystem's major function is to gather web pages and files from remote or local WWW servers. It may cover the whole Internet 'web space', including information on httpd, ftp, gopher, wais and USENET news servers. The gathering task is usually carried out automatically by a small program often called a 'web robot', 'web spider' or 'web agent', whose job can be described briefly as follows:

(1) Starting from a single URL, it requests the web page from the remote web site of that URL. Upon receiving the full page, it parses it and adds all the URLs found in the page to a queue for further gathering.
(2) For the web page from (1), it parses the page according to the indexing rules predefined in the indexing subsystem and the search engine, and keeps only the information needed by the indexing subsystem.
(3) After (2) is done, it takes a new URL from the queue, checks whether that URL has already been indexed, and repeats the gathering process from (1).

In summary, a web robot continuously requests files from remote web sites, then parses and filters each file for indexing and further gathering; a minimal sketch of the bookkeeping behind this loop is given after the list of issues below. It is this subsystem that decides which, what and when information is cached, indexed and updated in the text database used for web search. It determines the information content and volume, and to some extent the performance, of a web search system, and it can also have a significant impact on its accuracy. Different web search systems use different rules and approaches. The major issues in this subsystem are:

(1) Which to gather
    choice 1: all servers, or only HTTPD, FTP, GOPHER, WAIS or NEWS servers
    choice 2: for a single server, all URLs or only the first N levels of the 'hyperlink tree'
    choice 3: web pages in a single text format or in multiple formats such as HTML, plain text and LaTeX
(2) What to index
    choice 1: all words or only 'significant' keywords in a web page, i.e., full text or keywords only
    choice 2: attributes of a file, such as last update date, title, subtitles, size, outline, etc.
    choice 3: controlled by size or line count, e.g., the first 10 lines of a text file
(3) When to gather/index/update
    choice 1: the same indexing database as the one currently used by the web search interface, or a separate one
    choice 2: in real time (i.e., online indexing) or in batch during the night
(4) Performance of a gatherer: the number of files that can be gathered and indexed per hour or per day
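As a deliberately simplified picture of the bookkeeping behind the robot loop above, the duplicate check of step (3) can be expressed against a single URL table. The table and column names below are invented for illustration only; a real robot may equally keep this state in memory or in flat files rather than in a database.

    -- Hypothetical bookkeeping table: one row per URL the robot has ever seen.
    CREATE TABLE url_queue (
        url     VARCHAR2(1024) PRIMARY KEY,
        status  VARCHAR2(10)            -- e.g. NEW, FETCHED, INDEXED
    );

    -- Step (3): a link parsed out of a page (:parsed_url is a bind variable
    -- supplied by the robot) is queued only if it has not been seen before,
    -- so the same page is never gathered twice.
    INSERT INTO url_queue (url, status)
    SELECT :parsed_url, 'NEW' FROM dual
    WHERE NOT EXISTS (SELECT 1 FROM url_queue WHERE url = :parsed_url);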
2.2 Indexing Subsystem

This subsystem determines how the text gathered from remote web sites is stored and managed internally in a text database so that searching can be supported efficiently and effectively. It involves the logical data structure of the text entities and the physical organization and layout of the text indexes in the database. The indexing subsystem is tightly coupled with the search engine subsystem, since its sole purpose is to speed up the free-text search and query process carried out by the search engine. The common approach is to build an 'inverted index' over all keywords in a document. The major approaches and issues that differ among web search systems are:

(1) The compression scheme used to store the text and its indexes.
(2) Whether to keep both the original text and the indexes in the database, or only the indexes. To support certain advanced search capabilities (discussed in Section 2.3), such as phrase search and proximity search, the original text must be stored together with its indexes, which increases the space consumption of the text database.
(3) Indexing modes: real-time, batch or incremental indexing. Real-time indexing allows text in the existing text database to be updated, inserted or deleted at any time, automatically modifying the associated indexes without affecting simultaneous retrieval against the index database. Batch indexing updates the index only in batch mode, after a bulk of text has been modified or loaded. Incremental indexing allows the indexing process to be done incrementally: only documents that have changed since the last indexing run are indexed, so adding new text does not affect the existing indexes in the database.
(4) Case sensitivity, symbols and numbers. Including case sensitivity and symbols in the index increases the space requirement.

The performance of a web search system depends largely on how good an index scheme is used in the index database. A good index scheme should use little space yet give short query processing times.

2.3 Search Engine Subsystem

The search engine subsystem accepts queries from a WWW-based search interface and schedules, partitions and executes them against the indexed database to locate the URLs, and their associated attributes, that satisfy the query criteria. It also deals with performance-related system activities such as I/O, caching, assembling and sorting. Based on what content is indexed and on the scheme used by the indexing subsystem, it implements different search algorithms to provide the search capabilities of a text retrieval system. Together with the previous two subsystems, it determines which basic and advanced search capabilities a system can support, and how efficiently. Most web search systems are keyword-based. The major search functions include:

Basic search capabilities:
(1) keywords with logical operators and their combinations, including 'and', 'or' and 'not'
(2) regular expressions of keywords, including single and multiple wildcard matching
(3) keyword expansion: stemming, fuzzy and soundex matching, and expansions based on a thesaurus (such as synonyms)
(4) ranking of query results
(5) case-sensitive or case-insensitive matching

Advanced search capabilities:
(1) summarizing - producing a summary of a document
(2) similarity search - finding documents 'similar' to a given document
(3) phrase search
(4) proximity search - specifying the allowed distance between keywords

2.4 A WWW-based Search Interface

Any information retrieval system has the two core subsystems discussed above - "indexing" and "search engine".
They represent base technology developed in the information processing literature and are used directly in a web search system. What is unique and different in a web search system, compared to a traditional text retrieval system, lies in the 'web gathering' and 'web search interface' subsystems. These two subsystems are also the main development work we undertake to build a web search system, since we rely on Oracle's RDBMS and SQL*TextRetrieval technology for the 'search engine' and 'indexing' subsystems.

A conventional text retrieval system usually has its own search interface and also allows developers to build new search interfaces to access the text database. For example, SQL*TextRetrieval provides both Form-based and C-based APIs. Most text retrieval systems use the client-server model as the fundamental model for building the access interface for end users. The client-server model used in the WWW therefore makes it seamless and natural to integrate a text retrieval system into a web search system, where a web client becomes a client of the text retrieval system. However, because of the stateless, or sessionless, nature of the web client-server model, a web search interface requires new techniques that are quite different from those of traditional search interface design. The major issues in developing a web search interface are:

(1) Integration of a web server and the backend search engine, i.e., the interaction between the web server and the search engine, to efficiently support the single transaction path: query request -> query processing -> query results. While the interaction between a web client and a web server is handled by the common HTTP protocol, the interaction between a web server and a search engine may be handled quite differently in different web search systems. Because a web search system is designed to handle a large number of simultaneous requests (essentially read-only, OLTP-style transactions) and a heavy load is assumed on the server side, high-performance server technology is usually required to provide scalable CPU and I/O processing. Another consideration is how the system handles simultaneous searching and database updating/indexing in real time. Most current web search systems use rather limited 'parallel processing' techniques and replication technology to address scalability. For instance, a common approach used by Lycos, InfoSeek, OpenText and others is to run several workstations, each with a web server and a local database. The local database on each machine is identical across all machines, replicated from the machine dedicated to gathering and indexing new web pages. The replication (and thus the update) of the local databases is usually done once a day at night (in some systems once a week or month).

(2) Sessionless web connections versus session-oriented database transactions at the web search interface. In many situations, a web search activity requires session-oriented memory. For example, one query may generate thousands of results, which are too costly, or impossible, to send back in full for display in the client's web browser, because of network bandwidth constraints and cache/memory requirements on the client machine. One approach is to store all the results on the server and provide navigation buttons ('next', 'previous', etc.) that guide the user through the query results one subset at a time.
Because each user may issue a different query, and the navigation requires more than one web transaction, the web search system must 'remember' all the results of the first query for each concurrent client; that is, a user's web session consists of the first query plus the successive navigation transactions. In the early days of web search systems, the only solution to this problem was a 'max hits' setting on the first query, restricting the maximum number of query results returned to the client. Today this kind of session-oriented transaction is common practice in most web search systems.

(3) Query refining. After a query has been issued, whose search domain may be the whole indexed database, a new query can use the results of the previous query as its search domain, so that the user can keep refining the search and finally narrow down to the desired targets.

(4) Support for 'natural-language-like' queries. Most web search systems only support keywords and their regular expressions to form a query; a good web search system should also allow users to type queries in natural language. This function requires additional processing and analysis of the query input from the search interface.

3. Oracle SQL*TextRetrieval

Traditional RDBMS systems were developed to deal with well-structured information entities such as business data. Information in the WWW space is usually in free format, with little or no predefined structure: HTML pages, program source code and plain-ASCII articles. Because HTML is a very loosely defined markup language, an HTML page can be loosely or even badly formatted; for example, not every page has a title (enclosed by the <TITLE> and </TITLE> tags in HTML), yet such a page is still legal and accepted by any web browser. Although documents in free-text form can be stored in the tables of an RDBMS server, searching the content of the documents without predefined search terms (words, phrases or concepts) requires a serial scan of the table and exact pattern matching on the text column. The performance of such a search is not acceptable.
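For instance, finding every page that mentions a given phrase would come down to something like the query below (a sketch over a hypothetical web_pages table whose body column holds the document text); the LIKE predicate forces a scan and pattern match over every row.

    -- Serial scan with exact pattern matching on the text column: every
    -- row's body must be examined, so response time grows with table size.
    SELECT url, title
    FROM   web_pages
    WHERE  body LIKE '%parallel computing%';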
The standard indexing techniques in the Oracle RDBMS cannot help in this case, since the full-text documents are stored in one or more table columns of character type. A new indexing facility, together with a text search utility, must be developed to solve the performance problem of free-text searching. SQL*TextRetrieval (SQL*TR) is one such solution developed by Oracle. SQL*TR indexes any text contained in multimedia documents, as well as textual descriptions of the data, and allows users to perform free-text searches that return the set of documents or database rows matching the search criteria. Because SQL*TR is a software layer built on top of the core Oracle7 RDBMS engine, it uses SQL queries as the standard method and query language for retrieving both textual information and structured data, and it can take full advantage of an Oracle7 server: concurrency control, recovery, data integrity, security, backup and server management. Users can combine searches on structured information (such as the title, author, size and date of a web page) with free-text searches. Among the many utilities provided in the SQL*TR package [6] for building a fully functional, traditional information retrieval system, the indexing facility and the text retrieval API (in C) are the most relevant to building a WWW search system. We briefly outline the major parts and functionality of these two components in this section; readers are referred to [6][7] for details of SQL*TR.

3.1 Text Indexing in SQL*TextRetrieval

SQL*TR uses a fully inverted index method. The index is a proprietary bit-mapped location index rather than a conventional pointer system. It is highly compressed and can be cached in memory; its compact size can achieve a ratio of as little as 4:1 between the original text and its index. This small index has two direct benefits, fast indexing and fast retrieval, since fewer disk I/O operations are required. As illustrated in Figure 2, the text index in SQL*TR consists of two tables: a word list and a location list.

There are two indexing modes in SQL*TR that control what is indexed: STOP mode and PASS mode. In STOP mode, all words are indexed except those previously entered in the word list, such as 'and', 'or' and 'the', which have little or no information retrieval value. In PASS mode, the words to be indexed are put in the word list before indexing occurs and flagged as pass words; during indexing, all other words in the documents are ignored. This provides a useful server-side means (as opposed to the web gathering side) of filtering out common words in a WWW text database, further reducing space consumption.

By default, the lexical delimiters in SQL*TR that indicate the boundaries of words and numbers are all non-alphanumeric characters for words, and all non-numeric characters for numbers, with the exception of the decimal point or equivalent radix character. A user can exclude specified characters from the list of delimiters for a particular text table. This proves to be another very useful control mechanism in the indexing subsystem of a web search system: for example, we can define the opening and closing HTML tag symbols ('<' and '>') as two delimiters so that the text enclosed by them is not indexed, which effectively prevents all HTML tags in a web page from being indexed.

The scope of an indexing run can be global (batch) or incremental, both performed at run time while searching continues; the Oracle RDBMS engine handles the concurrency control needed to support simultaneous text indexing and retrieval. Updated text can be re-indexed either in real time or through a later incremental indexing run. The advantage of real-time indexing for a web search system is that updated text can be queried immediately: in real-time indexing, the text index is updated in the same DBMS transaction in which the text is added, changed or deleted. All these features are needed by an online web search system in which a single system supports simultaneous gathering/indexing and retrieval, because the WWW is a dynamic world in which web pages are created, updated and removed all the time; a web search system must reflect such changes within a reasonable time scale. SQL*TR also provides a utility for loading and indexing large amounts of data. The data can be a combination of structured data and free-format text, and data can be appended to tables, so initial loads can be performed in several runs and large additional batches can be added while applications are running.
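Purely as an illustration of the word-list/location-list structure mentioned at the start of this subsection, the inverted index can be pictured as two relational tables; the actual SQL*TR index is a proprietary, compressed bit-mapped structure, not plain tables like these.

    -- Conceptual sketch only: one row per distinct indexed word ...
    CREATE TABLE word_list (
        word_id  NUMBER PRIMARY KEY,
        word     VARCHAR2(64)
    );

    -- ... and one row per occurrence, recording which document (textkey)
    -- and where in that document the word appears.
    CREATE TABLE location_list (
        word_id   NUMBER,
        textkey   NUMBER,
        word_pos  NUMBER
    );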
3.2 Text Retrieval in SQL*TextRetrieval

SQL*TR's underlying retrieval language is based on an extended SQL: a comprehensive yet simple set of additions to SQL that enable searches to be carried out jointly on structured data and on the content of documents. The major syntactic extensions are the 'CONTAINS' clause (similar to 'LIKE') and a set of built-in functions that express different keyword search conditions. The language goes beyond mere keyword search: a query can be made up of words, phrases, terms and structured data operations, and standard Boolean operations using AND, OR and NOT can be used to build queries.

Text tables in SQL*TR are base Oracle tables with extended text columns, where a text column holds part of or the whole document in free-text form. There are several text column types, with maximum sizes ranging from 255 bytes to unlimited. Every text table must have a numeric column holding a unique key, known as the textkey, used to locate specific rows in the table. Each row of a text table can store the free text of one document as well as any other structured information associated with the text (in separate columns). For a web search system, documents gathered from remote web sites are first loaded into a text table, which is then used to build the index database. After indexing has finished, the original text data can be discarded, or it can be kept in the text table for the more advanced search purposes discussed in Section 4. By nature, unstructured text documents use a large amount of storage; to save space, SQL*TR can optionally compress the text stored in text tables. Reflecting the trade-off between the space saved by compression and the time taken to decompress the text, there are two compression modes, with savings of 40-60% and 60-70% respectively.

Besides the SQL query language, SQL*TR provides two APIs for developers building retrieval interfaces: a Form API and a Pro*C API. Ideally, for a web search system, a PL/SQL-based API for SQL*TR would be the best gateway between the text search engine built on SQL*TR (running on an Oracle7 RDBMS server) and the web search interface accessed through an HTTPD server, as demonstrated by the Web-Oracle-Web (WOW) package [9] and the Oracle WebSystem [10]. At present, a set of Common Gateway Interface (CGI) programs written with the Pro*C API is the only way to link a web server and an Oracle text server.
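As a rough illustration of the kind of combined query this enables, the sketch below mixes a structured condition with a content condition on the hypothetical web_pages text table used earlier. The CONTAINS syntax shown is only an approximation of SQL*TR's extended SQL, whose exact clause and built-in function names differ; treat it as a sketch rather than the product's literal syntax.

    -- One statement restricts by a structured attribute (last update date)
    -- and by document content (a Boolean combination of keywords).
    SELECT textkey, url, title
    FROM   web_pages
    WHERE  doc_date >= TO_DATE('01-JAN-1996', 'DD-MON-YYYY')
    AND    CONTAINS(body, 'parallel AND (computing OR processing)') > 0;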
4. A prototype WWW search system

Our approach to building web search systems is to develop a high-performance, intelligent web gathering subsystem and a flexible, customizable HTML-form-based web search interface, while employing Oracle RDBMS technology for the search engine and server management and SQL*TextRetrieval for the full-text indexing engine. Because the SQL*TR package was not originally developed for WWW information retrieval, much work has also gone into interfacing the gathering subsystem with SQL*TR and into integrating the web search interface with the Oracle7 RDBMS server.

Figure 3 shows a system diagram of the web gathering subsystem developed in our web search system. It consists of three major components, a loader, a checker and one or more gatherers, each of which is an individual process. The fundamental philosophy behind this design is that the gathering program must be both high-performance -- the limited network bandwidth over the wide area network should not become the bottleneck of the gathering process, and the full advantages of an RDBMS server in concurrency control and caching should be used to build an RDBMS-based web agent -- and intelligent -- no redundant information is gathered while the gathered documents remain up to date and complete, the web robot convention [11] is followed, and the design is extensible towards gathering domain- or topic-specific information.

During gathering, the loader and the checker are connected to the RDBMS server where the tables making up the text and index database are stored. A set of temporary tables is also created to store the URLs and hosts used to identify URLs that are new (to be gathered), pending, being processed, or to be updated during the gathering process by the gatherers and the checker. Concurrent access to and updating of those tables by the checker and the loader is managed by the RDBMS server. To sustain the maximum network bandwidth, more than one gatherer can be started, interacting with web servers simultaneously via HTTP to fetch bulk HTML documents. Each lightweight gatherer parses a document to identify URLs to be considered for further gathering, together with a set of attributes associated with the document, including URL, size, date, title, number of URLs and inline images, outline (first 256 characters), etc.

The major functions of the loader are: (1) to fetch new URLs to be gathered or updated from the Oracle server and pass them to the gatherers upon request; (2) to invoke the SQL*TR 'rtload' process to load/update and/or incrementally index documents as they become available (in batches).

The checker is responsible for checking URLs:
(1) validating URLs to be gathered against several filtering conditions:
    . URLs disallowed by the 'robots.txt' file on a host
    . URLs whose attributes (size, title, outline, etc.) are all identical but which have different hostnames (aliases) or paths (symbolic links)
    . URLs found to be non-existent or removed on previous visits by a gatherer (dead links)
    . URLs currently being processed by the loader or by one of the gatherers
(2) filtering out URLs not to be gathered, according to predefined conditions, for example URLs for server statistics;
(3) filtering in only URLs to be gathered, according to predefined conditions, for example the online HTML books (http://www.infomall.org/npac/pcw/*), the NPAC web server (http://www.npac.syr.edu/*) or any WWW server (http://*, ftp://*, etc.); a sketch of such include/exclude rules is given below.

Currently, the only information used by the checker for this filtering is the URL itself and the URL/host tables in the database. To make the gathering subsystem more intelligent, for instance able to gather and index domain-specific information, the architecture is designed to be extensible and open. For example, we plan to add a 'filter' module between the gatherer and the loader that filters documents (not URLs) based on the content of the gathered documents. To enable 'phrase search' and 'proximity search' in SQL*TR, the text of the indexed documents must be kept in the text table.
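The checker's include/exclude rules and the loader's hand-off of work to the gatherers can be pictured roughly as follows, reusing the hypothetical url_queue table sketched in Section 2.1; the system's real temporary URL and host tables are internal and certainly differ in detail.

    -- Reject URLs that match an exclusion rule (e.g. server statistics
    -- pages) or fall outside the configured gathering domain.
    UPDATE url_queue
    SET    status = 'REJECTED'
    WHERE  status = 'NEW'
    AND   (url LIKE '%/statistics/%'
           OR url NOT LIKE 'http://www.npac.syr.edu/%');

    -- Hand the next batch of accepted URLs to the gatherers on request.
    SELECT url
    FROM   url_queue
    WHERE  status = 'NEW'
    AND    ROWNUM <= 50;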
Figure 4: Main search interface of the NPAC web search system
Figure 5: Display of search results and the interface for navigation and refining

Figure 4 shows the main search interface of the prototype system, in which one of the indexed WWW databases is the official NPAC web server. The other databases currently indexed include three online HTML books on parallel computing. Built directly on the text retrieval facilities of SQL*TR, this HTML-form-based interface offers users the following search options:

(1) Logical combinations of keywords - AND/OR ('all'/'any'), NOT ('contain'/'not contain').
(2) Word expansions:
    . word stemming - expanding a keyword to the indexed words sharing its stem (wildcard matching at either end);
    . fuzzy or soundex - words with similar spelling or similar pronunciation;
    . synonyms - words with the same or nearly the same meaning, as predefined in the same synonym ring of a thesaurus; we converted the public-domain Roget Thesaurus [8] into our database, which has about ??? synonym rings;
    . treating search keywords as case sensitive or case insensitive.
(3) Phrase search - exact matching of a phrase.
(4) Proximity search - two words within a given distance of each other.

Also accessible from the interface are a gateway, implemented with the WOW package, for viewing all the indexed words and stoplist words currently in the database, and a link to the source Roget Thesaurus.

As shown in Figure 5, search results are returned to the client's web browser one subset per page. Each result page includes:

(1) A set of navigation buttons ('next', 'previous', 'top' and 'end') that let the user browse through all the query results page by page. SQL*TR's Stored Query Expressions (SQE), together with a sequence, are used first to store the query and its hitlist (i.e., the identification of the matching documents) and later to restore them when those buttons are activated to request additional query results in successive web client-server connections. The use of SQE prevents the initial query from being re-run. Note that the hitlist stored by an SQE is kept in a highly compressed bitmap form, so the temporary space consumed should not be much of an overhead on the server.
(2) A message area indicating the search keywords and their expansions, and the total number of matching documents found.
(3) The entries for the matching documents that contain the search keywords, each represented by a set of attributes associated with the document. The title of each document is hyperlinked back to the original URL for quick browsing of the full document source, while the URL attribute links to another query that pulls the document text out of the database and displays it with the search keywords highlighted.
(4) A similar search interface that lets the user refine the previous query, so that the search domain of the new query is the document list generated by the previous query rather than the whole database. This is again implemented with SQL*TR's SQE.

By using word expansions we attempt to broaden the search domain, improving on the limited accuracy of plain keyword search, while 'query refining' is used to gradually narrow down a search domain that may have been over-relaxed by the word expansions.
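The refinement step can be pictured as running the new search condition only over the previous hitlist. The sketch below fakes that hitlist with an ordinary table of textkeys (prev_hits), and again uses the approximate CONTAINS syntax from Section 3.2; in the running system SQL*TR's Stored Query Expressions play this role, so the stored query is never re-executed.

    -- Hypothetical refine step: the new keyword condition is applied only
    -- to documents whose textkeys were saved from the previous results.
    SELECT p.url, p.title
    FROM   web_pages p
    WHERE  p.textkey IN (SELECT textkey FROM prev_hits)
    AND    CONTAINS(p.body, 'decomposition') > 0;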
To conclude our current work, we should point out that although SQL*TR provides an excellent free-text retrieval facility and a test bed for rapidly prototyping web search systems and investigating the major issues of WWW-based information systems holding both well-structured and unstructured entities, we found quite a few performance problems in the current SQL*TR package. For example, the indexing information appears not to be used by 'phrase search' and 'proximity search', so these suffer from very slow query response. We have recently started working with the new Oracle TextServer3 with ConText in our ongoing web search development, in the hope that the new software brings significant improvements in functionality and performance.

5. Current and planned work at NPAC

The Northeast Parallel Architectures Center (NPAC) at Syracuse University has been engaged in research and development of advanced information systems, in particular enterprise information systems using advanced parallel processing and WWW technologies. Our current development work is focused on two major areas:

(1) Investigating the major development and performance issues in
    . integrating conventional RDBMS technology into WWW-based systems, and
    . building production-level information systems using RDBMS and Web technologies.
Our current technical strategy is to use the Oracle7 RDBMS server to build mail-based databases, covering USENET newsgroups, mailing lists and personal mailboxes, and to use this system as the main vehicle and testbed for experimenting with the major issues in Web and RDBMS integration. A production system offering a searchable archive of USENET newsgroups and mailing lists can be found at URL .

(2) Web search systems using RDBMS and full-text indexing technology. As described in this paper, our approach is to use the Oracle7 RDBMS server as the search engine and Oracle's ConText server as the full-text indexing engine to build the backend web search engine, while developing our own web agents and web search interfaces. Sponsored by Oracle Corp., we are also conducting research and development on building large-scale web search and information systems on parallel machines. For example, we are using parallel Oracle7, with the parallel server and parallel query options, to build WWW-based systems on an IBM Scalable POWERparallel (IBM SP2) system.

6. References

[1] Alta Vista, http://altavista.digital.com/.
[2] Open Text, http://www.opentext.com/omw/f-omw.html.
[3] InfoSeek, http://guide.infoseek.com/.
[4] Lycos, http://www.lycos.com/.
[5] Yahoo, http://www.yahoo.com/.
[6] Oracle Corp., SQL*TextRetrieval Administrator's Guide V2.0, July 1992.
[7] Oracle Corp., SQL*TextRetrieval Developer's Guide V2.0, July 1992.
[8] Roget Thesaurus, http://www2.thesaurus.com/thesaurus/roget/.
[9] Web-Oracle-Web, http://dozer.us.oracle.com:8080/sdk10/wow/.
[10] Oracle WebSystem, http://www.oracle.com/mainEvent/webSystem/webSystemOverview.html.
[11] Web Robot Convention,