Review of Current Web Search Systems (draft, internal distribution only)
Gang Cheng, Piotr Sokolowski and Geoffrey Fox, NPAC, Syracuse University

1. Introduction

A web search system is a facility to locate/discover relevant information represented by Web pages. The search domain can be either the whole WWW information space on the Internet (including HTTPD, FTP, Gopher, WAIS and USENET news servers) or the WWW space of individual web sites; the same technology applies to both the global and the local WWW information space. A web search system generally has four major components (subsystems): a gathering subsystem, an indexing subsystem, a search engine and a web search interface. For each of these subsystems, this review compares Oracle's approach with the most advanced one known to us. Because of the potential commercial value of these web systems, it is very difficult to obtain the internal techniques and details of the exact technology they use. We will mostly use Lycos as the major comparison example in our discussion, because Lycos is among the most popular web search systems on the Internet, and because it started as a research project in an academic environment and was later commercialized as a major Internet information provider and service.

The most popular current web search systems are (they can be found from the 'net search' page in the Netscape web browser):

. OpenText - backend engine for Yahoo, free service. Has the largest indexed web database. See http://www.opentext.com:8080/omw/f-faq.html. Oracle uses some of its technology in developing Oracle's full-text search product.
. Lycos - free service, early web search system.
. Infoseek - free service for the web database, paid service for newsgroups and other on-line databases.
. WebCrawler - free service.
. Yahoo - catalog database, not truly a web search system; catalog search is provided by the OpenText system.

Fig. 1 The general architecture of a web search system

remote/local web sites <-> gathering subsystem -> indexing subsystem <-> search engine
                                                                              ^
                                                                              |
clients with web browsers -- web search interface ---- web server(s) ---- cgi

2. Web Gathering Subsystem

This subsystem's major function is to gather web pages from either remote or local web sites. It deals with the whole Internet 'web space', which may include information on HTTPD, FTP, Gopher, WAIS and USENET news servers. The gathering task is usually carried out automatically by a small program called a 'web robot', 'web spider' or 'web agent', whose job can be briefly described as follows:

1. Starting from a single URL, it requests the web page from the remote web site of that URL. Upon receiving the full page, it parses the page and appends all URLs found in it to a queue for further gathering.
2. For the web page from (1), according to the indexing rules predefined in the indexing subsystem and the search engine, it parses the page and passes on only the information needed by the indexing subsystem.
3. After (2) is done, it takes a new URL from the queue, checks whether that URL has already been indexed, and repeats (1) to continue gathering.

In summary, a web robot continuously requests files from remote web sites, then parses and filters each file for indexing and further gathering. It is this subsystem that decides which, what and when information will be cached/indexed/updated into the database for web search. It determines the information content and volume, the accuracy, and to some extent the performance, of a web search system.
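To make the robot loop above concrete, here is a minimal sketch in Python (the language InfoSeek reportedly used for its whole system). It is only an illustration under simplifying assumptions - the starting URL, the link-extraction regular expression and the index_page() placeholder are ours, not the logic of Lycos or any other system discussed here:

    # Minimal web-robot sketch: fetch a page, hand it to the indexer, queue
    # any unseen URLs found in it, and repeat until a page limit is reached.
    import re
    from collections import deque
    from urllib.parse import urljoin
    from urllib.request import urlopen

    HREF_RE = re.compile(r'href\s*=\s*"([^"]+)"', re.IGNORECASE)

    def index_page(url, text):
        # Placeholder for the indexing subsystem: record only the title here.
        m = re.search(r"<title>(.*?)</title>", text, re.IGNORECASE | re.DOTALL)
        print(url, "->", m.group(1).strip() if m else "(no title)")

    def gather(start_url, max_pages=50):
        queue = deque([start_url])      # URLs waiting to be visited
        seen = {start_url}              # URLs already queued or indexed
        while queue and len(seen) <= max_pages:
            url = queue.popleft()
            try:
                text = urlopen(url, timeout=10).read().decode("latin-1", "replace")
            except OSError:
                continue                # unreachable page: skip it
            index_page(url, text)       # step (2): pass content to the indexer
            for link in HREF_RE.findall(text):   # step (1): collect new URLs
                absolute = urljoin(url, link)
                if absolute.startswith("http") and absolute not in seen:
                    seen.add(absolute)
                    queue.append(absolute)

    if __name__ == "__main__":
        gather("http://www.example.com/")   # hypothetical starting point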
Different web systems use different rules and approaches. The major issues are:

1). Which to gather
    choice 1: all servers, or only HTTPD, FTP, Gopher, WAIS or news servers
    choice 2: for a single server, all URLs or only the first N levels of URLs visited
    choice 3: .html, .ps, .txt or .* files for a single server
2). What to index
    choice 1: all or only a subset of 'significant' keywords in a file
    choice 2: attributes of a file, such as last update date, title, subtitles, size, outline etc.
    choice 3: controlled by size/lines, e.g. the first 10 lines of a text file
3). When to gather/index/update
    choice 1: the same indexing database as the one currently used by the web search interface, or a separate one
    choice 2: in real time during the day (i.e. online indexing) or during the night
4). Performance of a gatherer - the number of files that can be gathered/indexed per day/hour

It is difficult to know all of the above choices made by other web systems. One thing is clear, though: due to significant performance and space requirements, all current web systems share some common approaches:

1) at most 3 levels of a web site are indexed; HTTPD and FTP servers are the major targets
2) only a small proportion of the words in a file (usually 10% - 20%) is indexed, together with common attributes including title, size, date, subtitle and the first 10 lines (as an outline)
3) indexing is not real time; updating is usually done during the night

Known facts:

Yahoo - several Indy workstations and Intel Pentium PCs.
Lycos - 1.178 million unique URLs; 8.7 GB processed; 1.8 GB summary text (attribute text) stored; 1.08 GB inverted index in the database.
InfoSeek - 8 SUN1000s, possibly using an RDBMS or OODBMS.
  . paid service: USENET newsgroups, 4-week expiry period, 15,000 newsgroups, updated nightly, 2 million news articles, 7 GB total.
  . free service: web pages - 400K URLs, 2 GB total, updated weekly for old URLs and daily for newly added ones. Full-text indexing; can update the whole database in 48 hours; adds new pages once a week; updates/revisits each indexed page once a month; case sensitive; also indexes numbers and symbols; supports "phrase search" and some limited proximity search; uses one-level control; does not support automatic word expansion. Robot performance: 10,000 pages/hour. The whole system is written in Python.
Inktomi Search Engine (UC Berkeley) - 1.3 million URLs, partitioned in parallel among 4 SUN10s; the largest number of indexed URLs. Full or partial indexing: unknown. Poor search interface, keyword only.
Open Text - the Web Index contains about 985 million words of text and 15,436,712 hyperlinks (URLs visited, not necessarily indexed in the database). In the most recent Open Text Web Index update, 22,106 pages were removed, of which 11,102 are no longer on the Web and 11,004 were replaced with changed versions; 16,930 other pages were revisited but had not changed; 74,443 new pages, not previously in the Index, were added. These statistics were computed Mon Oct 16 23:55:26 EDT 1995.

Current estimate of the web pages on the whole Internet: 0.1 - 1 TB.

3. Indexing Subsystem

This subsystem handles how the free text of all documents/files is internally stored and managed in the database, so that searching can be supported efficiently and effectively. It covers the logical data structure of the text entities and the physical organization/layout of the data in the database. The indexing subsystem is tightly coupled with the search engine subsystem, as its sole purpose is to speed up the free-text search/query process conducted by the search engine. The common approach is to build an 'inverted index' of all keywords in a document.
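To make the 'inverted index' idea concrete, the sketch below builds a small in-memory index mapping each keyword to the set of documents (URLs) containing it. The sample documents and tokenizer are illustrative assumptions; it shows only the data structure, not the storage scheme of any of the systems listed above:

    # Minimal inverted-index sketch: keyword -> set of document identifiers.
    import re
    from collections import defaultdict

    def tokenize(text):
        # Lowercase word tokens; real systems also apply stop lists, stemming etc.
        return re.findall(r"[a-z0-9]+", text.lower())

    def build_index(documents):
        index = defaultdict(set)
        for doc_id, text in documents.items():
            for word in tokenize(text):
                index[word].add(doc_id)
        return index

    documents = {
        "url1": "Parallel database servers for web search",
        "url2": "A web robot gathers pages for the indexing subsystem",
        "url3": "Full-text search with an inverted index",
    }
    index = build_index(documents)
    print(sorted(index["web"]))             # documents containing 'web'
    print(index["web"] & index["search"])   # simple AND of two keywords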
The major approaches/issues differ between web systems:

(1). The compression scheme used to store the text and its indexes. It is not clear exactly what technology other systems are using. WAIS is the typical technology invented for indexing/searching free text, and its major drawback is the size overhead of the indexes - it compresses neither the index nor the text, resulting in a roughly 1:1 ratio of original text to indexes. Lycos seems to use similar technology for indexing.

(2). Keeping both the original text and the indexes in the database, or just the indexes. In order to support certain advanced search capabilities (discussed in the search engine section) such as phrase search and proximity search, the original text must be stored together with its indexes. This almost doubles the space requirement of a web search system. To our knowledge, currently only InfoSeek stores both; that is also the reason InfoSeek keeps a very small URL database (400K).

(3). Index modes: real-time indexing, batch indexing and incremental indexing. With real-time indexing, updates, inserts or deletes of text in/from the existing text database at runtime automatically modify the associated indexes. Batch indexing only updates the index in batch mode after a bulk of text has been modified. Incremental indexing allows the indexing process to be done incrementally, so adding new text does not require reindexing all the previous text. It is not clear which indexing modes are allowed in other web search systems; batch and incremental indexing are common capabilities.

(4). Case sensitivity and symbols. Including case sensitivity and symbols in the index increases the space requirement for the index.

Oracle's approach: the index facility used in the RDBMS server is only good for locating data entities with well-structured attributes and is not usable for a free-text database. For this reason, Oracle developed a separate product called Context Server (previously called SQL*Text*Retrieval) to build another logical/software layer on top of its RDBMS engine. There are some penalties in using the RDBMS search engine to mimic a text search engine, which we will not detail here. Oracle uses some quite unique and smart techniques to deal with this issue. Compression schemes are used for both text (run-length compression) and indexes (bitstrings) to reduce space usage and speed up text queries. The typical compression ratio for text is 4:1; for indexes it depends on the number of documents indexed. It also supports all the index modes, case sensitivity etc. mentioned above.

It is relatively hard to compare the indexing techniques used in different web search systems, but there are basically two major measurements: space consumption and search response time. A good index scheme should have small space use yet short query processing time. The performance of a web search system largely depends on how the indexes and text are partitioned across the server. Parallel Oracle is the best choice so far for handling data partitioning, parallel I/O and caching efficiently to achieve optimized query performance. A web search system basically has OLTP-type search characteristics, with one important feature: 99% of the transactions are read-only queries, with the remaining 1% being updating transactions running periodically in the background. A parallel server is an ideal candidate to support this type of information system - almost all of the speedup can be achieved because the workload is 100% embarrassingly parallel.
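As a rough illustration of why a bitstring representation of an index both saves space and speeds up queries, the sketch below encodes each keyword as one bit per document, so that 'and', 'or' and 'not' become single bitwise operations. This is a generic illustration of the idea only, not Oracle's actual index format or compression scheme:

    # Minimal bitstring-index sketch: one bit per document per keyword,
    # so boolean queries reduce to bitwise operations on integers.
    # The documents and tokenizer are illustrative assumptions.
    import re

    def tokenize(text):
        return re.findall(r"[a-z0-9]+", text.lower())

    def build_bit_index(docs):
        """docs is an ordered list of (doc_id, text); bit i stands for docs[i]."""
        bits = {}
        for i, (_, text) in enumerate(docs):
            for word in set(tokenize(text)):
                bits[word] = bits.get(word, 0) | (1 << i)
        return bits

    def matching_docs(mask, docs):
        return [doc_id for i, (doc_id, _) in enumerate(docs) if mask & (1 << i)]

    docs = [
        ("url1", "parallel database server"),
        ("url2", "web search with a parallel server"),
        ("url3", "full text search engine"),
    ]
    bits = build_bit_index(docs)
    all_docs = (1 << len(docs)) - 1

    print(matching_docs(bits["parallel"] & bits["server"], docs))          # AND
    print(matching_docs(bits["search"] | bits["database"], docs))          # OR
    print(matching_docs(bits["search"] & (all_docs & ~bits["web"]), docs)) # AND NOT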
4. Search Engine Subsystem

The search engine subsystem accepts queries from the web search interface and schedules, partitions and executes them against the indexed database to locate the URLs, and their associated attributes, which satisfy the query criteria. It also deals with performance-related system activities such as I/O, caching, assembling and sorting. Based on what content is indexed and on the scheme used in the indexing subsystem, it implements/supports different search algorithms to provide the search capabilities of a text retrieval system. Together with the previous two subsystems, it determines what basic and advanced search capabilities a system can support and how efficient they are. Most web search systems are keyword-based. The major search functions are:

Basic:
(1). keywords with logical operators and their combinations, including 'and', 'or', 'not'
(2). regular expressions of keywords, including single/multiple-wildcard matching
(3). keyword expansion: stemming, fuzzy matching, soundex, expansion based on a thesaurus (such as synonyms)
(4). ranking of query results
(5). case sensitive/insensitive search

Advanced:
(1). summarizing - summarize a document
(2). similarity search - find documents similar to a particular document
(3). phrase search
(4). proximity search - specify the allowed distance between keywords

Almost all web search systems support the 'basic' search capabilities. Lycos does not have any of the advanced ones, while InfoSeek only supports 'phrase search' and some limited 'proximity search'. OpenText has all the functions. Oracle's system has all except 'similarity search'.

5. Web Search Interface

Any full-text information retrieval system has the two core subsystems discussed above - 'indexing' and 'search engine'. They represent core technology developed in the information processing literature and are used directly in a web search system. What is unique and different in a web search system, compared with traditional text retrieval systems, lies in the 'web gathering' and 'web search interface' subsystems. These two subsystems are also the major development work we are undertaking to build a web search system, as we rely on Oracle's RDBMS technology for the 'search engine' and its full-text retrieval technology for the 'indexing' subsystem. Unlike most web search systems such as OpenText, Lycos, InfoSeek and WebCrawler, which were built almost entirely from scratch - i.e. all four subsystems were developed with proprietary technology - our approach is to use Oracle's well-established RDBMS and full-text retrieval technology as the core engine, while developing our own 'web robot' and 'web search interface', which are integrated into a web search system and make the best use of the core Oracle technology.

A text retrieval system usually has its own search interface and also allows developers to build new search interfaces to access the text database. Most text retrieval systems use the client-server model as the fundamental model for providing an access interface to end users. The client-server model used in the WWW therefore makes it seamless and natural to integrate a text retrieval system into a web search system, where a web client becomes a client of the text retrieval system. Due to the stateless (sessionless) nature inherent in the web client-server model, a web search interface requires new approaches/techniques in an information retrieval system which are quite different from those of a traditional search interface design.
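As a sketch of the transaction path in Fig. 1 (web browser -> web server -> CGI -> search engine), the following minimal CGI script reads the query string passed by the web server, forwards the keywords to a backend engine and writes an HTML result page back to the client. The run_query() placeholder and the form field name 'q' are assumptions for illustration; in our design the call would be a query against the Oracle engine:

    #!/usr/bin/env python3
    # Minimal CGI sketch: the web server hands this script the form's query
    # string; the script calls the backend search engine and returns HTML.
    import os
    from urllib.parse import parse_qs
    from html import escape

    def run_query(keywords):
        # Placeholder: a real system would query the indexed database here.
        return [("http://www.example.com/doc1.html", "Sample matching page")]

    def main():
        form = parse_qs(os.environ.get("QUERY_STRING", ""))
        keywords = form.get("q", [""])[0]
        hits = run_query(keywords)

        print("Content-Type: text/html")
        print()                              # blank line ends the CGI headers
        print("<html><body><h1>Search results for: %s</h1><ul>" % escape(keywords))
        for url, title in hits:
            print('<li><a href="%s">%s</a></li>' % (escape(url), escape(title)))
        print("</ul></body></html>")

    if __name__ == "__main__":
        main()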
The major issues in developing a web search interface are:

1). Integration of the web server and the backend search engine

web server <-> search engine, i.e. how to efficiently handle the single transaction path: query request -> query processing -> query results. While the interaction between a web client and a web server is handled by the common HTTP protocol, the way a web server interacts with a search engine may differ considerably between web search systems. Because a web search system is usually designed to handle a large number of simultaneous requests, and a heavy load is assumed on the server side, it has to rely on high-performance server technology. Another consideration is how the system handles simultaneous searching and database updates.

Most current web search systems use some very limited 'parallel processing' techniques and replication technology to handle performance issues. A common approach, used by Lycos, InfoSeek, OpenText etc., is to use more than one workstation, with each machine running a web server and a local database. The local database on each machine is completely identical across all the machines, replicated from another machine which is dedicated to gathering/indexing new web pages. The replication (and thus the update) of the local databases is usually done once a day during the night (for some systems once a week or month). Currently only the 'Inktomi Search Engine' at UC Berkeley uses a truly parallel approach, in which the text database is partitioned across disks among several machines and a single physical database is shared by (can be accessed from) all the machines used to support web access. But the approach used in this system is very primitive in terms of parallel processing, from both the software and the hardware architecture. A workstation cluster is used, and the parallel software supporting concurrent I/O, CPU use etc. was developed from scratch, as the whole retrieval system was developed from scratch (actually based on a course project). Although the CPU processing is scalable, the parallel I/O in this system is not.

2). Sessionless -> session-oriented transactions in the web search interface

In many situations a session-oriented 'memory' function is required in a web search activity. For example, one query may generate thousands of results, which are too costly or impossible to send back for display in the client's web browser, due to the bandwidth requirements on the network and the cache/memory requirements on the client machine. One approach is to store all the results on the server and provide navigation buttons (next, prev etc.) to guide the user through the results one subset at a time. Because each user may issue a different query, and the navigation requires more than one web transaction, a web search system must 'remember' all the results of the first query for each current user. This means a user's web session includes the first query and the successive navigation transactions. In the early stage of web search systems, the only solution to this problem was to provide a 'max hits' setting with the first query, so that the system restricts the maximum number of query results returned to the user. Currently such session-oriented transactions are common practice in most web search systems. In our approach, this is a natural function of Oracle's text retrieval system: unlike other systems, where query results may be stored outside the search engine, our approach allows us to store the 'hitlist' (together with the query itself) inside the database and search engine. A minimal sketch of this kind of server-side result caching is given after this item.
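The sketch below illustrates the server-side 'hitlist' idea in its simplest form: the full result list is cached on the server under a session id, and each navigation request only retrieves one page of it. All names are illustrative, and an in-memory dictionary stands in for the database where our approach would actually keep the hitlist:

    # Minimal sketch of session-oriented result navigation: the full hitlist
    # is kept on the server, keyed by a session id, and the client only ever
    # receives one page of results per request.
    import uuid

    PAGE_SIZE = 10
    SESSIONS = {}            # session_id -> (query, full hitlist)

    def run_query(query):
        # Placeholder for the search engine: return many fake hits.
        return ["http://www.example.com/doc%d.html" % i for i in range(137)]

    def start_session(query):
        session_id = uuid.uuid4().hex
        SESSIONS[session_id] = (query, run_query(query))
        return session_id

    def fetch_page(session_id, page):
        query, hits = SESSIONS[session_id]
        start = page * PAGE_SIZE
        return query, hits[start:start + PAGE_SIZE], len(hits)

    sid = start_session("parallel database")
    query, first_page, total = fetch_page(sid, 0)     # first screen of results
    print("%d hits for '%s', showing %d" % (total, query, len(first_page)))
    query, third_page, _ = fetch_page(sid, 2)         # a later 'next' request
    print(third_page[0])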
3). Query refining

Query refining means that, after a previous query whose search domain may have been the whole indexed database, the new query uses the results returned by the previous query as its search domain, so that the search can be progressively refined until the targets are narrowed down. Because this capability requires advanced algorithms in the search engine, so far no web search system other than ours has implemented it.

4). Support for 'natural language like' queries

Most web search systems only support keywords and their regular expressions to form a query; a good web search system should also allow users to type in queries in natural language. This function requires additional processing of the query input in the search interface. No current web search system supports this capability yet, and we believe we can develop such a capability in our approach (not yet implemented, but technically doable).

6. Conclusion

Compared with other web search systems, we believe Oracle's approach has the following three major advantages:

1). Server performance and data volume. Because a parallel Oracle RDBMS server and a parallel machine are used as the server technology, our system can be truly scalable in server performance, supporting fast response times for a large number of concurrent users. It also allows us to index almost all of the 'web space' in the world without much concern about degradation of server performance.

2). Oracle's RDBMS technology as the core engine. No other web search system uses RDBMS technology, and it has many added values beyond the search engine itself. A good example is the newsgroups archive system, where we have not even used the full-text search capability yet. Server management, rapid prototyping, ease of SQL-based code development, and rich programming tools and development environments are among the top benefits of using an RDBMS (compared with typical full-text retrieval systems like WAIS and Glimpse). With Oracle's Context Server technology, well-structured information entities are combined with full-text search/indexing capability.

3). Simultaneous online updates and search. Due to performance considerations and the sophisticated consistency requirements in the indexing subsystem and search engine, no other system can support updating (add, modify, delete) of the database and web search at the same time, which means the database a user is searching may not be the most up to date. An RDBMS is designed from the ground up to support this kind of concurrent activity, and the parallel RDBMS server keeps the performance degradation small.

-----------------------------------------------------------------------------------------

Evaluation of the newly announced web search system "Alta Vista" by DEC

1. Summary

In summary, Alta Vista has the following capabilities, common to most web systems:

. A keyword search interface supporting regular expressions of keywords and phrases. The 'and', 'or' and 'not' operators are supported, and both word and phrase searching are supported. A limited 'proximity search' is supported (allowing up to 10 words between two specified words). Limited 'word stemming' (wildcard matching) is supported. It also supports case sensitivity of the search keywords/phrases.
. In addition to keyword search, it also supports attribute-related search, including the title of a web page, the hostname or URL string of a URL, and URL links in a web page.

Alta Vista has the following major advantages over other web search systems:

. Fast search response. It is not clear what search and indexing software is behind the system.
It is very fast for both word search and phrase search. From the news release attached at the bottom of this report, it seems that a DEC 64-bit Alpha cluster is used to host the search system, i.e. some parallel architecture may be employed to achieve this performance. Another possibility is that very large RAM is used in the hardware configuration, so that all the indexes are cached in core memory to sustain the high query performance.

. Bigger index database. Compared with other web search systems it provides more coverage (16 million pages), while the current largest are Lycos (10 million) and OpenText (millions). Alta Vista claims that a very fast 'web robot' is used to index web pages. Sustaining such fast query performance while providing fuller coverage of 'web space' is the most visible strength of Alta Vista.

. Full-text search of USENET newsgroups. In addition to full-text search across all 13,000 newsgroups, it also provides some attribute-related search, including sender, newsgroup name and subject.

. Some limited ranking mechanism for query results.

We would say that Alta Vista is better than all the current web search systems in terms of coverage, query performance and accuracy. There are at least two features that Alta Vista is missing:

. Query refining. This capability requires some sophisticated development on the server to remember the results of the previous query, i.e. to make the sessionless web client-server interaction session-oriented. We believe that soon all the current web search systems will provide this capability.

. Showing highlighted keywords in the query results. This is indicated by the FAQ item below:

"Alta Vista shows the first few words of the documents it finds, but I would like it to show some context so I could tell more quickly whether the document is one I want to look at. Why not provide some context? The words and phrases that Alta Vista uses to match the document may occur scattered anywhere in it from the beginning to the end, and may occur multiple times. In general, there is no canonical way of deciding which lines to show as context for a matching query."

It seems that Alta Vista only keeps the indexes of web pages. We plan to keep both the indexes and the original pages (in compressed form), which allows us to support this function.

2. Detailed Evaluation

Alta Vista is a complex web search system provided by Digital. The evaluation covers four categories - interface, search scope, search capabilities and performance.

2.1 Interface

The interface of Alta Vista consists of two parts: one intended for simplified quick search, and one that provides a sophisticated set of options for more complex queries. The simplified interface is very efficient, though it is not self-explanatory and a first-time user must read the help page to discover the options. There is only one input line, where the user types both words and options. A single check box controls the way documents are displayed: the user can choose between a Compact and a Detailed display. The advanced query interface requires the user to enter a complex query with logical operators such as 'AND', 'OR', 'NOT' and 'NEAR'. The form also allows specifying the ranking key and a time period for the documents.

2.2 Search scope

Alta Vista claims to have over 16 million documents, which would rank it above the other search systems. However, it probably has some duplicated documents, as it informs the user about removing duplicates. It is difficult, however, to say anything about its gathering performance.
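The limited proximity search noted in the summary (two words at most 10 words apart, also examined in 2.3 below) can be supported by a positional index, as the following generic sketch shows; it is an illustration of the technique, not Alta Vista's actual implementation:

    # Minimal proximity-search sketch: a positional index maps each word to a
    # list of (doc_id, position) pairs; two words satisfy a NEAR query when
    # they occur in the same document at most max_gap words apart.
    # The sample documents are illustrative assumptions.
    import re
    from collections import defaultdict

    def build_positional_index(documents):
        index = defaultdict(list)
        for doc_id, text in documents.items():
            for pos, word in enumerate(re.findall(r"[a-z0-9]+", text.lower())):
                index[word].append((doc_id, pos))
        return index

    def near(index, word1, word2, max_gap=10):
        hits = set()
        for doc1, pos1 in index.get(word1, []):
            for doc2, pos2 in index.get(word2, []):
                if doc1 == doc2 and abs(pos1 - pos2) <= max_gap:
                    hits.add(doc1)
        return hits

    documents = {
        "url1": "the parallel server runs the web search engine on one cluster",
        "url2": "search technology is discussed at the very beginning while the "
                "word engine only appears near the end of this longer example",
    }
    index = build_positional_index(documents)
    print(near(index, "search", "engine"))  # {'url1'}: in url2 the two words
                                            # are 11 words apart, beyond the gap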
2.3 Search capabilities

Alta Vista has a wide set of search functions. It allows searching for words, phrases, and words within a distance of 10 words of each other. It can also find documents that do not contain a given word. The user can also give only the beginning of a word, for example string*, which matches any word beginning with 'string'. It also ranks documents, and the user can supply a separate list of words for ranking. Alta Vista operates in both a case-sensitive and a case-insensitive manner: if a typed word contains uppercase characters it is treated as case sensitive, otherwise it is evaluated case insensitively. In general it has a medium range of search options. Alta Vista is better than most search systems in terms of phrase searching and proximity searching, but it does not perform any advanced word approximation.

2.4 Performance

It is extremely fast in retrieval, especially for phrase search. It seems that some kind of index is used for phrase search as well as for normal word search. One thing that points to this is that Alta Vista does not recognize punctuation characters and treats them all as spaces.

3. Conclusions and comparison with Oracle TextServer

Alta Vista is a very fast and powerful search system. However, it does not do any of the special text processing that would be possible when using Oracle TextServer with Context. Oracle TextServer certainly gives much better general capabilities; especially when integrated with Context, it would give all the capabilities of Alta Vista and many more. TextServer allows, for example, synonym search, soundex expansion and pattern matching, and Context would allow theme search and document summaries. TextServer also offers a much more flexible way of refining a search: Alta Vista only allows simple additions of new options to a query, whereas TextServer allows issuing a completely new query over only the previous query's results. The most important advantage of Alta Vista compared with other search systems is its speed. From our earlier experience, the Oracle system tends to be slow; those tests were, however, performed on an older version of the product (TextRetrieval), which is improved in the new Context Server release.

4. Text from the news release of AltaVista - URL: http://www.digital.com:80/info/flash/

Digital Equipment Corporation today introduced the Internet's first `super spider' software, as part of the most advanced information search and indexing technology available for the World Wide Web. Blazing fast, it conducts the most comprehensive search of the entire Web text at speeds up to 100 times faster than spiders used in conventional information search services. Under development at Digital's Corporate Research Group here, the technology promises to surpass the limitations of current information services by delivering the most complete, precise, and up-to-date information of the Web's entire text. The technology's super spider and super indexer employ next-generation software and advanced networking, powered by the highest-performing 64-bit Alpha computers. The advanced technology is set to undergo its toughest testing starting today as Digital makes a `beta' version available to tens of millions of Web users worldwide.