Use Oracle's RDBMS and SQL*TextRetrieval Technology to Build a Web Search System (draft)

Gang Cheng and Piotr Sokolowski, NPAC, Syracuse University

1. Introduction

A web search system generally has four major components (subsystems): a gathering subsystem, an indexing subsystem, a search engine and a web search interface. For each of those subsystems, I will compare our approach with the most advanced one known to me. Because of the potential commercial value of web search systems, it is very difficult to learn the internal techniques and the exact technology they use. I will mostly use Lycos as the main comparison example in my discussion, because Lycos is one of the most popular web search systems on the Internet, and because it started as a research project in an academic environment and was later commercialized as a major Internet information provider and service.

The most popular web search systems at present are (you can find them from the 'net search' page in the Netscape web browser):

. OpenText - this will be our major competitor. Backend engine for Yahoo, free service. Has the largest indexed web database. See http://www.opentext.com:8080/omw/f-faq.html. Oracle uses some of its technology in developing Oracle's full-text search product.
. Lycos - free service.
. Infoseek - free service for the web database, paid service for newsgroups and other on-line databases.
. WebCrawler - free service.
. Yahoo - catalog database, not truly a web search system; catalog search is provided by the OpenText system.
From the discussion and comparisons detailed in the following sections, it will become evident that our approach - using Oracle's RDBMS and its SQL*TextRetrieval technology, together with our unique advantages in parallel server technology and infrastructure - puts us in a favorable position to become a major Internet information provider and service center, competing with the web search systems currently popular in the Internet community. I will show that our approach provides better solutions than most other web search systems in all four subsystems. My conclusions and comparisons are based on our experience of building the 'USENET newsgroups archive' and the 'NPAC books search' prototype systems.

Fig. 1 The general architecture of a web search system

remote/local web sites <-> gathering subsystem -> indexing subsystem <-> search engine
                                                                             ^
                                                                             |
clients with web browsers -- web search interfaces ------- web server(s) -- cgi

2. Web Gathering Subsystem

This subsystem's major function is to gather web pages from either remote or local web sites. It deals with the whole Internet 'web space', which may include information on httpd, ftp, gopher, wais and USENET news servers. This gathering task is usually carried out automatically by a small program usually called a 'web robot', 'web spider' or 'web agent', whose job can be briefly described as follows:

1. Starting from a single URL, it requests the web page from the remote web site of the URL. Upon receiving the full page, it parses the page and adds all the URLs found in it to a queue for further gathering.
2. For the web page from (1), according to the indexing rules predefined in the indexing subsystem and the search engine, it parses the page and passes on only the information needed by the indexing subsystem.
3. After (2) is done, it gets a new URL from the queue, checks whether the new URL is already indexed, and returns to (1) to continue gathering.
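The three robot steps above can be sketched in a few lines of Python. This is only an illustration of the queue-driven loop, not the NPAC robot itself (which is written in Perl): the names gather, fetch and index_page are hypothetical, fetch is supplied by the caller so that the loop can be tried without any network access, and index_page stands in for the hand-off to the indexing subsystem.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag


class LinkExtractor(HTMLParser):
    """Collects the href targets of <a> tags found in one page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def gather(start_url, fetch, index_page, max_pages=100):
    """Queue-driven gathering loop (hypothetical sketch).

    fetch(url) returns the page text, or None if the page is unavailable.
    index_page(url, text) hands the page over to the indexing subsystem.
    Returns the URLs that were indexed, in visiting order.
    """
    queue = deque([start_url])
    indexed = set()
    order = []
    while queue and len(order) < max_pages:
        url = queue.popleft()            # step 3: take the next URL from the queue
        if url in indexed:               # step 3: skip pages already indexed
            continue
        text = fetch(url)                # step 1: request the page
        if text is None:
            continue
        parser = LinkExtractor()
        parser.feed(text)
        for href in parser.links:        # step 1: queue every URL found in the page
            absolute, _fragment = urldefrag(urljoin(url, href))
            queue.append(absolute)
        index_page(url, text)            # step 2: pass the content to the indexer
        indexed.add(url)
        order.append(url)
    return order


if __name__ == "__main__":
    # Three in-memory pages stand in for a remote web site.
    pages = {
        "http://a.example/": '<a href="/b.html">b</a> <a href="c.html#top">c</a>',
        "http://a.example/b.html": '<a href="/">home</a>',
        "http://a.example/c.html": "no links here",
    }
    print(gather("http://a.example/", pages.get, lambda url, text: None))
```

The max_pages limit plays the role of the gathering controls discussed below; a limit on URL depth per server could be added in the same place.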
In summary, a web robot continuously requests files from remote web sites, and parses and filters each file for indexing and further gathering. It is this subsystem that decides which information will be cached, indexed and updated in the database for web search, and when. It determines the information content and volume of a web search system and, to some extent, its performance. It also determines the accuracy of a web search system. Different web systems use different rules and approaches. Major issues are:

1). Which to gather
choice 1: all, or only some, of the HTTPD, FTP, GOPHER, WAIS and NEWS servers
choice 2: for a single server, all levels or only the first N levels of URLs visited
choice 3: .html, .ps, .txt, .* for a single server
2). What to index
choice 1: all or part of the 'significant' keywords in a file
choice 2: attributes of a file, such as last update date, title, subtitles, size, outline etc.
choice 3: controlled by size/lines, e.g. the first 10 lines of a text file
3). When to gather/index/update
choice 1: index into the same database as the one currently used by the web search interface, or into a separate one
choice 2: in real time during the day (i.e. online indexing) or at night
4). Performance of a gatherer - the number of files that can be gathered/indexed per day/hour

It is difficult to know all the above choices made by other web systems. One thing is clear, though: due to significant performance and space requirements, all current web systems share some common approaches:

1) At most 3 levels of a web site are indexed. HTTPD and FTP servers are the major targets.
2) Only a small proportion of the words in a file (usually 10% - 20%) is indexed, together with common attributes including title, size, date, subtitle and the first 10 lines (as an outline).
3) Not real time; updating is usually done at night.

Our approach, given the resources/performance of a parallel Oracle 7 server and huge disk space:

1). Currently HTTPD servers (and NEWS servers), in the future all servers. All levels of a web server (controllable).
Our recent test in indexing www.npac.syr.edu shows there are about 30,000 html URLs, while a search on Lycos found fewer than 100 URLs, which means Lycos uses very restrictive URL level control for gathering.
2). Using the Oracle SQL*TextRetrieval technology, we index all significant words, where significance is defined by a stoplist. All words not in the stoplist are indexed. Words in the stoplist are common words such as 'I', 'the', etc.
3). Using the parallel server and the insert/indexing options of a parallel Oracle server, we can achieve online indexing (just as the newsgroups system achieved online archiving). For example, on a four-node SP2, because parallel Oracle 7 uses the 'shared-disk' architecture on the SP2, one Oracle instance can be dedicated to gathering/updating while the other three SP2 nodes are dedicated to web searching. This supports simultaneous querying and gathering without performance degradation, which is not possible on web systems that do not employ RDBMS technology. Data consistency is a big plus in an RDBMS-based web search system.
4). Piotr is developing a web agent that takes advantage of the RDBMS engine for indexing. It is a high-performance web robot that uses multiple gathering processes sharing the same database server to make maximum use of the network bandwidth available for gathering. This approach is unique to an RDBMS-based system. The current rate for indexing local NPAC web sites is about 1 page/second. Note that in our approach indexing a new page takes the same time as updating an already indexed page.

In summary, our approach of using an RDBMS server has the following major advantages over the current web systems in the aspect of indexing:

1). Full coverage and indexing of the 'web space', if implemented on a parallel Oracle server, with no performance degradation. The major reason why other web search systems do not provide full coverage is the performance/space requirement, which cannot be met on a uniprocessor server (single CPU, no parallel I/O).
A web search system demands high-performance server technology, and we are in a unique position to compete with other web search systems precisely because of our parallel server infrastructure at NPAC. (The current Lycos indexed database is less than 2 GB.)

Our current web robot is written in Perl4 because of the significant parsing requirement. Because we take advantage of an RDBMS server in developing this robot, the network bandwidth bottleneck usually found in other web robot systems is no longer a limiting factor in our approach. Our experiments show that the current performance bottleneck of our robot is in the pattern matching part of the Perl program. Further enhancement will require the program to be rewritten in C, and technically this is not a problem.

Facts data:

Yahoo - several Indy workstations and Intel Pentium PCs
Lycos:
. 1.178 unique URLs
. 8.7 GB processed
. 1.8 GB summary text (attribute text) stored
. 1.08 GB inverted index in the database
InfoSeek - 8 SUN1000, may use an RDBMS or OODBMS
. paid service: USENET newsgroups, 4 weeks expire period, 15000 newsgroups, updated every night, 2 million news articles, 7 GB total
. free service: web pages - 400K URLs, total 2 GB, updated weekly for old URLs and daily for adding new ones. Full-text indexing, can update the whole database in 48 hrs, adds new pages once a week, updates/revisits each indexed page once a month, case sensitive, also indexes numbers and symbols, supports "phrase search" and some limited proximity search, uses one-level control, does not support automatic word expansion, robot performance: 10000 pages/hour, whole system written in Python.
Inktomi Search Engine (UC Berkeley): 1.3M URLs, parallel partition among 4 SUN10s, largest indexed URLs. Full or partial index: unknown. Poor search interface, keyword only.
Open Text: the Web Index contains about 985 million words of text, and 15,436,712 hyperlinks (URLs visited, not necessarily indexed in the db).
In the most recent Open Text Web Index update: 22,106 pages were removed, of which 11,102 are no longer on the Web and 11,004 were replaced with changed versions. 16,930 other pages were revisited but had not changed. 74,443 new pages, not previously in the Index, were added. These statistics were computed Mon Oct 16 23:55:26 EDT 1995.

Current estimate of the number of web pages on the whole Internet: 4 - 5 million.

3. Indexing Subsystem

This subsystem handles how the free text in all documents/files is internally stored and managed in the database so that searching can be supported efficiently and effectively. It includes the logical data structure of text entities and the physical organization/layout of data in the database. The indexing subsystem is tightly coupled with the search engine subsystem, as its sole purpose is to speed up the free text search/query process conducted by the search engine. The common approach is to build an 'inverted index' for all keywords in a document. Web systems differ in the following major approaches/issues:

(1). Compression scheme used to store the text and its indexes. It is not clear exactly what technology other systems are using. WAIS is the typical technology invented to deal with indexing/searching free text, and its major drawback is the size overhead of its indexes: it compresses neither index nor text, which results in a roughly 1:1 ratio of original text to indexes. Lycos seems to use a similar technology for indexing.
(2). Keep both the original text and the indexes in the database, or keep just the indexes. In order to support certain advanced search capabilities (to be discussed in the search engine section) such as phrase search and proximity search, the original text must be stored together with its indexes. This almost doubles the space requirement of a web search system. To my knowledge, currently only InfoSeek stores both. That is also the reason InfoSeek keeps a very small URL database (400K).
(3).
Index modes: real-time indexing, batch indexing, and incremental indexing. With real-time indexing, updates, inserts or deletes of text in the existing text database at runtime automatically modify the associated indexes. Batch indexing only updates the index in batch mode, after a bulk of texts has been modified. Incremental indexing allows the indexing process to be done incrementally: adding new text does not require reindexing all the previous text. It is not clear which indexing modes other web search systems allow. Batch and incremental indexing are common capabilities.
(4). Case sensitivity and symbols. Including case sensitivity and symbols in the index increases the space required for the index.

Oracle's approach: The index facility used in an RDBMS server is only good for locating data entities with well-structured attributes, and is not usable for a free text database. For this reason, Oracle developed a separate product called Context Server (previously called SQL*TextRetrieval) that builds another logical/software layer on top of its RDBMS engine. There is some penalty in using an RDBMS search engine to mimic a text search engine, which I will not detail here. Oracle uses some quite unique and smart techniques to deal with this issue. Compression schemes are used for both text (run-length compression) and indexes (bitstrings) to reduce space usage and speed up text queries. The typical compression ratio for text is 4:1; for indexes it depends on the number of documents indexed. It also supports all the index modes, case sensitivity etc. mentioned above.

It is relatively hard to compare the indexing techniques used in different web search systems. Basically there are two major measurements: space consumption and search response time. A good index scheme should use little space yet give short query processing time. The performance of a web search system will largely depend on how the indexes and text are partitioned across the server.
Parallel Oracle is the best choice so far for efficiently handling data partitioning, parallel I/O and caching to achieve optimized query performance. Basically, a web search system has OLTP-type search characteristics, with one important feature: 99% of the transactions are read-only queries, with the remaining 1% being update transactions that run periodically in the background. A parallel server is an ideal candidate to support this type of information system, since the workload is almost 100% embarrassingly parallel and nearly all of the possible speedup can be achieved.

3. Search Engine Subsystem

The search engine subsystem accepts queries from the web search interface and schedules, partitions and executes them against the indexed database to locate the URLs, and their associated attributes, that satisfy the query criteria. It also deals with I/O, caching, assembling, sorting and other performance-related system activity. Based on what content is indexed and the scheme used in the indexing subsystem, it implements/supports different search algorithms to provide the different search capabilities of a text retrieval system. Together with the previous two subsystems, it determines which basic and advanced search capabilities a system can support, and how efficiently. Most web search systems are keyword-based search systems. Major search functions are:

Basic:
(1). keywords with logical operators and their combinations, including 'and', 'or', 'not'
(2). regular expressions of keywords, including single/multiple wildcard matching
(3). keyword expansion: stemming, fuzzy, soundex, expansion based on a thesaurus (such as synonyms)
(4). ranking of query results
(5). case sensitive/insensitive

Advanced:
(1). summarizing - summarize a document
(2). similarity search - search for documents similar to a particular document
(3). phrase search
(4). proximity search - specify the distance in words between keywords

Almost all web search systems support the 'basic' search capabilities.
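To make the basic functions concrete, here is a minimal Python sketch of a stoplist-filtered inverted index with 'and', 'or', 'not' and wildcard matching. It is a toy in-memory illustration under stated assumptions, not a description of how Oracle or any of the other systems implement their engines: the stoplist contents, the tokenizer and all function names are made up for this example.

```python
import re
from collections import defaultdict
from fnmatch import fnmatchcase

# Illustrative stoplist only; a real one would be much longer.
STOPLIST = {"i", "the", "a", "an", "of", "and", "or", "to", "in", "is"}


def tokenize(text):
    """Lower-case word tokens; a stand-in for a real lexer."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Inverted index: map every non-stoplist word to the set of doc ids."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in tokenize(text):
            if word not in STOPLIST:
                index[word].add(doc_id)
    return index


def docs_with(index, word):
    """Doc ids containing the word (case-insensitive), possibly empty."""
    return set(index.get(word.lower(), set()))


def search_and(index, *words):
    sets = [docs_with(index, w) for w in words]
    return set.intersection(*sets) if sets else set()


def search_or(index, *words):
    return set().union(*(docs_with(index, w) for w in words))


def search_not(index, all_ids, word):
    return set(all_ids) - docs_with(index, word)


def search_wild(index, pattern):
    """Wildcard match ('*' and '?') against the indexed vocabulary."""
    hits = set()
    for word, ids in index.items():
        if fnmatchcase(word, pattern.lower()):
            hits |= ids
    return hits


if __name__ == "__main__":
    docs = {
        1: "The parallel Oracle server",
        2: "Oracle text retrieval",
        3: "A web robot in Perl",
    }
    index = build_index(docs)
    print(search_and(index, "oracle", "parallel"))
    print(search_wild(index, "r*"))
```

Ranking, keyword expansion and the advanced functions are left out on purpose; they need more information than a plain word-to-document mapping carries.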
Lycos doesn't have any of the advanced ones, while InfoSeek only supports 'phrase search' and some limited 'proximity search'. OpenText has all the functions. Oracle's system has all except 'similarity search'.

4. Web search interface

Any full-text information retrieval system has the two core subsystems discussed above - "indexing" and "search engine". They represent core technology developed in the information processing literature and are used directly in a web search system. What is unique and different in a web search system, compared to traditional text retrieval systems, lies in the 'web gathering' and 'web search interface' subsystems. These two subsystems are also the major development work we are undertaking to build a web search system, as we rely on Oracle's RDBMS technology for the 'search engine' and on its full-text retrieval technology for the 'indexing' subsystem. Most web search systems, such as OpenText, Lycos, InfoSeek and WebCrawler, were built almost entirely from scratch, i.e. all four subsystems were developed using proprietary technology. Our approach, in contrast, is to use Oracle's well-established RDBMS and full-text retrieval technology as the core engine, while developing our own 'web robot' and 'web search interface', which are integrated into a web search system and make the best use of the core Oracle technology.

A text retrieval system usually has its own search interface and also allows developers to build new search interfaces to access the text database. Most text retrieval systems use the client-server model as the fundamental model for providing an access interface to end-users. The client-server model used in the WWW therefore makes it seamless and natural to integrate a text retrieval system into a web search system, where a web client becomes a client of the text retrieval system.
Due to the stateless, or sessionless, nature inherent in the web client-server model, a web search interface requires new approaches/techniques in an information retrieval system, quite different from those of a traditional search interface design. Major issues in developing a web search interface are:

1). Integration of a web server and the backend search engine (web server <-> search engine), i.e., how to efficiently support the single transaction path: query requests -> query processing -> query results. While the interaction between a web client and a web server is handled by the common http protocol, the interaction between a web server and a search engine may be quite different in different web search systems. Because a web search system is usually designed to handle a large number of requests simultaneously and a heavy load is assumed on the server side, it has to rely on high-performance server technology. Another consideration is how the system handles simultaneous searches and database updates. Most current web search systems use some very limited 'parallel processing' techniques and replication technology to handle performance issues. A common approach, used in Lycos, InfoSeek, OpenText etc., is to use more than one workstation, each machine running a web server and a local database. The local database is completely identical across all the machines, replicated from another machine which is dedicated to gathering/indexing new web pages. The replication (and thus the update) of the local database is usually done once a day at night (in some systems once a week or month). Currently only the 'Inktomi Search Engine' at UC Berkeley uses a truly parallel approach, in which the text database is partitioned across disks on several machines and a single physical database is shared (can be accessed) by all the machines used to support web access. But the approach used in this system is very primitive in terms of parallel processing, in both its software and hardware architectures.
A workstation cluster is used in this system, and the parallel software that supports concurrent I/O, CPU etc. was developed from scratch, as was the whole retrieval system (which is actually based on a course project). Although the CPU processing is scalable in this system, parallel I/O is not. We believe our approach of employing the best parallel database and parallel machine technology in our web search system is unique and will eventually outperform all the other web systems, which are built on uniprocessor technology.

2). Sessionless -> session-oriented transactions in the web search interface. In many situations, a session-oriented memory function is required in a web search activity. For example, one query may generate thousands of results, which are too costly or impossible to send back for display in a client's web browser because of the network bandwidth required and the cache/memory required on the client machine. One approach is to store all the results on the server and provide navigation buttons (next, prev etc.) that guide the user through the results one subset at a time. Because each user may issue a different query and the navigation requires more than one web transaction, a web search system must 'remember' all the results of the first query for each current user. This means a user's web session includes the first query and the successive navigation transactions. In the early stages of all web search systems, the only solution to this problem was a 'max hits' button for the first query, so that the system restricted the maximum number of query results returned to the user. Currently this session-oriented transaction has become common practice in most web search systems. In our approach, this is a natural function of Oracle's text retrieval system.
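The idea of remembering a query's results on the server and paging through them can be sketched as follows. This is a hypothetical illustration only: SQLite (from the Python standard library) stands in for the database server, and the table name hitlist, its columns and the functions open_store, save_hits and get_page are assumptions made for this example, not part of Oracle's product. The session id returned by save_hits is what each 'next'/'prev' request would have to carry back to the server.

```python
import sqlite3
import uuid


def open_store():
    """Create an in-memory database with one table for stored result lists."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE hitlist ("
        " session_id TEXT NOT NULL,"
        " pos INTEGER NOT NULL,"
        " url TEXT NOT NULL,"
        " PRIMARY KEY (session_id, pos))"
    )
    return db


def save_hits(db, urls):
    """Store the ranked results of a first query; return a new session id."""
    session_id = uuid.uuid4().hex
    db.executemany(
        "INSERT INTO hitlist (session_id, pos, url) VALUES (?, ?, ?)",
        [(session_id, pos, url) for pos, url in enumerate(urls)],
    )
    db.commit()
    return session_id


def get_page(db, session_id, page, page_size=10):
    """Return one page (0-based) of a stored result list, in rank order."""
    rows = db.execute(
        "SELECT url FROM hitlist WHERE session_id = ?"
        " ORDER BY pos LIMIT ? OFFSET ?",
        (session_id, page_size, page * page_size),
    )
    return [url for (url,) in rows]


if __name__ == "__main__":
    db = open_store()
    sid = save_hits(db, ["http://example.org/%d" % i for i in range(25)])
    print(get_page(db, sid, 0))   # first ten hits
    print(get_page(db, sid, 2))   # last five hits
```

A production system would also need to expire old sessions; that is omitted here.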
In fact, unlike other systems, where query results may be stored outside the search engine, our approach allows us to store the 'hitlist' (together with the query itself) inside the database and search engine.

3). Query refining. Query refining means that, after a previous query whose search domain may have been the whole indexed database, a new query uses the results returned by the previous query as its search domain, so that you can keep refining your search and finally narrow it down to your targets. Because this capability requires advanced algorithms in the search engine, so far no web search system other than ours has implemented it.

4). Support for 'natural language like' queries. Most web search systems only support keywords and their regular expressions to form a query; a good web search system should also allow users to type in queries in natural language. This function requires additional processing of the query input from the search interface. No current web search system supports this capability yet, and we believe we can develop such a capability in our approach (not yet implemented but technically doable).

5. Conclusion

Compared with other web search systems, I believe our approach has the following three major advantages:

1). Server performance and data volume advantages. Because a parallel Oracle RDBMS server and a parallel machine are used in the server technology, our system can be truly scalable in server performance, supporting fast response times for a large number of concurrent users. It also allows us to index almost all the 'web space' in the world without much concern about degradation of server performance. No other organization currently in the 'web search' business has our unique position in the 'parallel processing' business as far as technology and infrastructure go.

2). Oracle's RDBMS technology as the core engine. Not many other web search systems use RDBMS technology. It adds a great deal of value beyond the search engine itself.
A good example is the newsgroups archive system, where we have not even used the full-text search capability yet. Server management, rapid prototyping, ease of SQL-based code development, and rich programming tools and development environments are among the top benefits of using an RDBMS (rather than a typical full-text retrieval system like WAIS). In addition, we are among the leading groups in the nation in the research and applications of web-RDBMS integration.

3). Simultaneous online updates and search. Due to performance considerations and the sophisticated consistency requirements in the indexing system and search engine, no other system can support updating (add, modify, delete) the database and web search at the same time, which means the database a user is searching may not be the most up to date. An RDBMS is designed to support this kind of concurrent activity, and the parallel RDBMS server does so with little performance degradation.

Our major disadvantages are:

1. A serious web search system must run in production mode and requires quite significant investment. Most current web search systems are run by commercial companies with dedicated professional staff and resources. Their ultimate goal in running a web search system is 'profit'; in this case 'money' and 'fame' go together, and usually, in order to be 'rich', such a system must be 'famous'. Our current staff are not trained for, and our facility is not run as, such a production environment. Our effort may fail simply because of this 'service' kind of requirement, even though we have the best technology/expertise/system.

2. Stability of the parallel machine and maturity of the parallel Oracle RDBMS. Until we have it working, we cannot be sure that all the fancy technologies used in this system are mature/reliable enough to support such a full production system. New technology is always risky at first.

3. Marketing.
The current 'web search' systems (there are about 14 in total) have already established their images and customer bases. We will be a latecomer. Without significant marketing effort, it may be difficult to compete with the existing systems even if we have better capabilities.

Comparison conclusion: In terms of technology, OpenText is our major competitor, with the full capability of what we are doing. Other systems are less competitive and have quite large limitations, but they have better marketing images and user bases.