WWW: Beyond the Basics

15. Searching and Databases on the Web

15.3. Indexer

An indexer takes the information collected by the gatherer, creates index records, and enters them into the index (or database). This process is called indexing. In some search systems the gatherer and indexer are combined into one component, such as a robot gatherer, which traverses the Web and creates records at the same time.

Indexing is the key component of a search system. An effective indexing process yields high-quality records that accurately represent the collection of resources and describe the individual documents well. Searching a high-quality index leads to precise identification and correct retrieval of resources.

In fact, the principle of indexing for a Web search system is virtually the same as for a traditional database. In a traditional database, a record is a set of values treated as a unit. For example, in a library catalog, a record for a book may consist of title, author, publisher, subject, and so on.

Three sample records in such a database are shown below:

     Record: 012
     ------------
     Author:     Lincoln D. Stein
     Title:      How to Set Up and Maintain a World Wide Web Site
     Publisher:  Addison-Wesley Publishing Company, Inc. (1995)
     Subject:    Web, Netscape, HTML
     ISBN:       0-201-63389-2

     Record: 345
     ------------
     Author:     David Flanagan
     Title:      Java in a Nutshell
     Publisher:  O'Reilly & Associates, Inc., (1996)
     Subject:    Java 
     ISBN:       1-56592-183-6

     Record: 678
     ------------   
     Author:     W. Richard Stevens
     Title:      TCP/IP Illustrated, Volume 3.
     Publisher:  Addison-Wesley Publishing Company, Inc. (1996)
     Subject:    TCP/IP, Networks, HTTP, NNTP.
     ISBN:       0-201-63495-3
 

Similarly, for a search system, the record of a Web resource or document is a URC (Uniform Resource Characteristic). A URC is a set of attributes that describes the resource or document. For example, the simplest type of URC is a hotlist entry, which consists only of a document title and a document location (URL, Uniform Resource Locator), such as:

    Record: 210
    Title:  Table of Contents
    URL:    http://www.netscape.com/search/

    Record: 543
    Title:  Virginia Tech Computer Science Department Home Page
    URL:    http://www.cs.vt.edu/

    Record: 876
    Title:  CS6204: WWW:Beyond the Basics
    URL:    http://ei.cs.vt.edu/~wwwbtb/
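
As a rough illustration of how an indexer might turn gathered pages into hotlist-style URCs like those above, here is a minimal Python sketch. It assumes the gatherer hands over (URL, HTML source) pairs; the names TitleParser, make_hotlist_record, and gathered are hypothetical, and the page sources are made up for the example.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collects the text found inside <title> elements."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def make_hotlist_record(url, html):
    """Build a hotlist-style URC: only a title and a URL."""
    parser = TitleParser()
    parser.feed(html)
    return {"title": parser.title.strip(), "url": url}


# Hypothetical gatherer output: (URL, page source) pairs.
gathered = [
    ("http://www.cs.vt.edu/",
     "<html><head><title>Virginia Tech Computer Science Department "
     "Home Page</title></head><body></body></html>"),
    ("http://ei.cs.vt.edu/~wwwbtb/",
     "<html><head><title>CS6204: WWW:Beyond the Basics</title></head>"
     "<body></body></html>"),
]

hotlist = [make_hotlist_record(url, html) for url, html in gathered]
for record in hotlist:
    print(record["title"], "->", record["url"])
```

A richer URC would simply add more fields to the dictionary returned by make_hotlist_record.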

The most important property of a database is that it lets you find information by knowing something about the information you want, without necessarily knowing the details of how it is stored.

15.3.1 Primary index

A database is a set of records. A record of a document is a set of attributes that describe that document. Each record in a database can be uniquely identified by a key value called the primary key. A database organized by its primary key may also be called a primary index, and it can be searched on that key. For example, the ISBN (International Standard Book Number) is unique for each book, so the ISBN can be used as the primary key, and books can be looked up by ISBN in a database of books. Similarly, the URL could be used as the primary key in a database of Web documents.
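
A primary index can be pictured as a mapping from primary key to record. The sketch below is one possible representation, assuming an in-memory Python dictionary keyed by ISBN; the names primary_index, lookup, and add_record are hypothetical, and the data comes from the sample book records shown earlier.

```python
# Primary index: each record is stored under its unique primary key (ISBN).
primary_index = {
    "0-201-63389-2": {
        "author": "Lincoln D. Stein",
        "title": "How to Set Up and Maintain a World Wide Web Site",
        "subject": ["Web", "Netscape", "HTML"],
    },
    "1-56592-183-6": {
        "author": "David Flanagan",
        "title": "Java in a Nutshell",
        "subject": ["Java"],
    },
    "0-201-63495-3": {
        "author": "W. Richard Stevens",
        "title": "TCP/IP Illustrated, Volume 3",
        "subject": ["TCP/IP", "Networks", "HTTP", "NNTP"],
    },
}


def lookup(isbn):
    """Return the record stored under this primary key, or None."""
    return primary_index.get(isbn)


def add_record(isbn, record):
    """Insert a record, refusing to overwrite an existing primary key."""
    if isbn in primary_index:
        raise KeyError(f"duplicate primary key: {isbn}")
    primary_index[isbn] = record


print(lookup("1-56592-183-6")["title"])
```

The same shape would work for Web documents by using the URL as the dictionary key. Note that this structure answers only one kind of question: "which record has this key?"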

15.3.2 Inverted index

Searching only on the primary key, however, is inconvenient and limits the capability of the database. Sometimes we do not know the primary key but want to search on other keys. For example, in a library database we may want to search for books by author, by subject, or by title. Searching for information on the Web by heading or document title in a Web indexing database is another example. This can be done by creating another database organized by that key. The database that supports searching via that key is called an inverted index or secondary index, and the key is called the inverted key or secondary key. For example, we may create another database, derived from the primary index, that is organized by document title. That database is an inverted index which supports searching via document title.

Usually a database has several inverted indexes (or secondary indexes). An inverted index is derived from the primary index, that is, from the original database of records. Building secondary indexes is an important technique for speeding up the search process and enhancing the capabilities of the database. Indexing criteria and strategies differ mainly in how the inverted indexes are created. In effect, an inverted index stores in advance the answers to all the searches that the chosen strategy supports.

Inverted indexes may be created for some or all of the terms in the database. When all the terms are indexed, the result is called a full index. A full index is fast and effective, especially when the user knows only one piece of the information in the record.
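
The derivation of an inverted index from a primary index can be sketched as follows. This is only an illustration under simple assumptions: records live in an in-memory dictionary keyed by ISBN, the function name build_inverted_index is hypothetical, and lowercasing the secondary key is a design choice of this example rather than something the text prescribes.

```python
from collections import defaultdict

# Primary index: the sample book records, keyed by ISBN.
records = {
    "0-201-63389-2": {
        "author": "Lincoln D. Stein",
        "subject": ["Web", "Netscape", "HTML"],
    },
    "1-56592-183-6": {
        "author": "David Flanagan",
        "subject": ["Java"],
    },
    "0-201-63495-3": {
        "author": "W. Richard Stevens",
        "subject": ["TCP/IP", "Networks", "HTTP", "NNTP"],
    },
}


def build_inverted_index(primary, field):
    """Map each value of `field` to the set of primary keys that carry it."""
    inverted = defaultdict(set)
    for key, record in primary.items():
        values = record[field]
        if isinstance(values, str):
            values = [values]  # treat a single value like a one-item list
        for value in values:
            inverted[value.lower()].add(key)
    return dict(inverted)


# One inverted (secondary) index per secondary key.
by_subject = build_inverted_index(records, "subject")
by_author = build_inverted_index(records, "author")

print(by_subject["http"])
print(by_author["david flanagan"])
```

A search on a secondary key becomes a two-step lookup: the inverted index yields the primary keys, and the primary index yields the full records.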

15.3.3 Fulltext index

Fulltext indexing is another widespread use of the inverted index. Indexes can be built from the words in the document, that is, from the full text of the resource. Because the entire resource is parsed, or scanned, to derive the information that makes up the record, this technique is called fulltext indexing. It allows the user to search for particular words or phrases, or combinations of words, across documents.
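
A minimal sketch of fulltext indexing, assuming documents are already available as plain text and are identified by URL. The example.org URLs, the sample texts, and the function names are hypothetical. The sketch supports single words and combinations of words; phrase search would additionally require storing word positions, which is omitted here.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Split text into lowercase alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_fulltext_index(documents):
    """Map every word to the set of URLs whose text contains it."""
    index = defaultdict(set)
    for url, text in documents.items():
        for word in tokenize(text):
            index[word].add(url)
    return index


def search_all(index, query):
    """Return the URLs of documents containing every word of the query."""
    words = tokenize(query)
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result


# Hypothetical document collection.
documents = {
    "http://example.org/a.html": "Indexing the Web with robots",
    "http://example.org/b.html": "Robots gather documents; the indexer builds records",
    "http://example.org/c.html": "Searching a library catalog",
}

index = build_fulltext_index(documents)
print(search_all(index, "robots web"))
```

Because every word of every document becomes a key, the index grows quickly with the size of the collection.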

Documents on the Web vary greatly in both type and size, and they reside on widely distributed Web servers. Many data types and formats exist on the Web, such as text, pictures, audio, video, multimedia, newsgroups, and network services. Some resources are small files, and some are huge databases containing thousands of entries. Indexing the Web therefore raises new challenges, and in practice a wide variety of classification schemes are used.

A common way to derive a record, that is, URC (Uniform Resource Characteristic) data, from a resource or document on the Web is to create a fulltext index of that document. The practical disadvantage of fulltext indexing, however, is that it requires a great amount of storage: parsing a resource and pulling out all the words creates a huge volume of data. To reduce a URC from a fulltext index down to a reasonable size, some operations may be applied to the fulltext index data. This is exactly what the RBSE and Lycos indexers do. For example, to create a URC describing the document, a parser can be used to derive the most important keywords, such as the title of the document, headings and subheadings (words in the section titles), the most heavily weighted words (weighted by how often a word occurs in the document relative to how often it occurs across all documents), descriptive words around clickable hypertext links, and so on. Which of these are kept depends on the strategies or criteria the search system chooses for indexing.
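
The reduction step can be sketched as keeping only the n most heavily weighted words of a document. The sketch below is an illustration, not the actual RBSE or Lycos algorithm. It assumes that weight = term frequency / document frequency, where term frequency is taken to mean the number of times a word occurs in the given document and document frequency the number of documents in the collection that contain the word; that reading, the function names, and the tiny sample collection are all assumptions of this example.

```python
import re
from collections import Counter


def tokenize(text):
    """Split text into lowercase alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def word_weights(documents, doc_id):
    """Weight each word of one document: term frequency / document frequency."""
    # Document frequency: in how many documents does each word appear?
    doc_freq = Counter()
    for text in documents.values():
        doc_freq.update(set(tokenize(text)))
    # Term frequency: how often does each word appear in this document?
    term_freq = Counter(tokenize(documents[doc_id]))
    return {word: term_freq[word] / doc_freq[word] for word in term_freq}


def top_weighted_words(documents, doc_id, n):
    """Return the n highest-weighted words (ties broken alphabetically)."""
    weights = word_weights(documents, doc_id)
    ranked = sorted(weights, key=lambda word: (-weights[word], word))
    return ranked[:n]


# Hypothetical three-document collection.
documents = {
    "doc1": "the spider crawls the web and the spider indexes the web",
    "doc2": "the library catalog lists the books",
    "doc3": "the web server serves the pages",
}

weights = word_weights(documents, "doc1")
print(top_weighted_words(documents, "doc1", 1))
```

In this collection "spider" occurs twice in doc1 and in no other document, so its weight is 2.0, while "the" occurs four times in doc1 but appears in all three documents, so its weight is only 4/3. Dividing by document frequency is what pushes common words down the ranking.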

The use of a robot indexer is a successful approach to deriving URCs. There are a few early robot indexers, such as Jumpstation, the World Wide Web Worm, and the RBSE Project Spider. The following table (from Yeager et al.'s book [Yeager96], pp. 238-239) shows an example of different URC formats for Web robots with different strategies:

                URC format for Web robots
      ---------------------------------------------------------------
           Robot           |               URC
      ---------------------+-----------------------------------------
      Jumpstation          |  Title of HTML document, words in 
                           |     section titles (level headers).
                           |
      World Wide Web Worm  |  Title of HTML document, full pathname
                           |     (URL), descriptive words around
                           |     clickable hypertext links that
                           |     lead to the document.
                           |
      RBSE                 |  Title of HTML document, 20 most weighted
                           |     words* (derived via WAIS fulltext
                           |     indexing).
                           |
      Lycos                |  Title, heading, subheadings, 100 most
                           |     weighted words* (derived via fulltext
                           |     indexing), first 20 lines, size in
                           |     bytes, number of words.
                           |
      CUI W3               |  Title, location (URL), documents that
                           |     link to and from this document,
                           |     keywords, abstracts.
      ---------------------------------------------------------------
      *weight of word = term frequency / document frequency
       term frequency = frequency of occurrence of that word
            in a given document
       document frequency = frequency of occurrence of that word over
            all documents in the document collection

The most popular and successful Web search services today, Lycos and Yahoo, create indexes from robot gatherers.


Copyright © 1996 Aixiang (I Song) Yao, All Rights Reserved

Aixiang (I Song) Yao<ayao@csgrad.cs.vt.edu>
Last modified: November 21, 1996