WWW: Beyond the Basics

15. Searching and Databases on the Web

15.2. Gatherer

A gatherer traverses the World Wide Web as a spider traverses a spider web, collecting the resources or documents to be indexed. It starts from a single document, usually the one at the top of the Web server's home directory, and selects the next document to index by following a hypertext link.
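To follow hypertext links, a gatherer first has to find them in each document it fetches. The following is a minimal sketch of that step, assuming the documents are HTML and using only the Python standard library. The class name LinkExtractor, the function extract_links, and the example.com URLs are hypothetical and chosen for illustration only.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects the href targets of <a> tags found in an HTML document."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the document's own URL.
                self.links.append(urljoin(self.base_url, value))


def extract_links(base_url, html_text):
    """Return the absolute URLs of all hypertext links in html_text."""
    parser = LinkExtractor(base_url)
    parser.feed(html_text)
    return parser.links


# Hypothetical page content; no network access is needed for this sketch.
page = '<a href="/docs/a.html">A</a> <a href="b.html">B</a>'
print(extract_links("http://www.example.com/index.html", page))
```

A real gatherer would fetch each page over HTTP before parsing it; that part is left out here so the example stays self-contained.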

The resources or documents on the Web are complex, widely distributed, and dynamic, because anybody who has access to a Web server can add documents to the Web or delete them from it. In fact, it is impossible to collect every document on the Web, because there are simply too many. Furthermore, there is no guarantee that a document on the Web will be maintained for any specific period of time.

Some search systems use robot gatherers. Gatherers locate Web servers and collect resources. Typically, a gatherer fetches documents by traversing the hypertext links in each document. Because the traversal could otherwise go on indefinitely, a gatherer should also have a policy for limiting it. There are two main strategies for limiting the traversal of a Web server: breadth first and depth first.

15.2.1 Breadth first

A breadth-first strategy first traverses all the hypertext links in the original document and gathers the documents they point to, then examines the gathered documents for further links to follow. The breadth-first gatherer does a wide and shallow traversal.
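The traversal order can be sketched with a queue. This is only an illustration under stated assumptions: the Web is replaced by a small in-memory dictionary (LINKS, hypothetical) that maps each document to the documents it links to, and the function name gather_breadth_first is also made up for this example.

```python
from collections import deque

# Hypothetical link graph standing in for a Web server:
# each document maps to the documents it links to.
LINKS = {
    "home": ["a", "b"],
    "a": ["c", "d"],
    "b": ["d"],
    "c": [],
    "d": ["home"],
}


def gather_breadth_first(start, links):
    """Gather documents level by level, starting from the start document."""
    gathered = []
    seen = {start}          # documents already queued, so none is fetched twice
    queue = deque([start])  # first in, first out gives the breadth-first order
    while queue:
        doc = queue.popleft()
        gathered.append(doc)
        for target in links.get(doc, []):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return gathered


print(gather_breadth_first("home", LINKS))
# prints ['home', 'a', 'b', 'c', 'd']
```

Both documents linked from "home" are gathered before any document one level further down, which is the wide and shallow behavior described above.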

15.2.2 Depth first

The depth-first strategy traverses one link and gathers the document to index, then follows one of the links in the gathered document, and so on to the end of that path before turning to another path. Usually there is a maximum number of links to follow along any one path; otherwise the gatherer might follow links down forever. The depth-first gatherer does a narrow but deep traversal.
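A depth limit of this kind can be sketched with a short recursive function. As an assumption for illustration, the Web is again replaced by a small in-memory dictionary (LINKS, hypothetical), and the names gather_depth_first and max_depth are invented for this example.

```python
# Hypothetical link graph standing in for a Web server:
# each document maps to the documents it links to.
LINKS = {
    "home": ["a", "b"],
    "a": ["c", "d"],
    "b": ["d"],
    "c": [],
    "d": ["home"],
}


def gather_depth_first(start, links, max_depth):
    """Gather documents path by path, following at most max_depth links."""
    gathered = []
    seen = set()

    def visit(doc, depth):
        seen.add(doc)
        gathered.append(doc)
        if depth == max_depth:
            return  # limit reached: do not follow links any deeper
        for target in links.get(doc, []):
            if target not in seen:
                visit(target, depth + 1)

    visit(start, 0)
    return gathered


print(gather_depth_first("home", LINKS, max_depth=2))
# prints ['home', 'a', 'c', 'd', 'b']
```

Here the path through "a" is followed to its end before "b" is gathered. Lowering max_depth to 1 stops the gatherer after the documents linked directly from "home".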

The gatherer fetches and collects documents and submits them to an indexer. The indexer then assigns each document or resource a unique identifier (called the primary key) and its storage location, and also creates a record, a set of values or related terms describing the document. That is indexing. Now let us turn to the indexer.
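Before moving on, the hand-off from gatherer to indexer can be sketched as follows. This is a rough illustration and not a description of any particular search system: the names Record and Indexer are hypothetical, the primary key is assumed to be a simple counter, and the describing terms are assumed to be the distinct lowercase words of the document.

```python
from dataclasses import dataclass, field
from itertools import count


@dataclass
class Record:
    primary_key: int   # unique identifier assigned by the indexer
    location: str      # where the document is stored
    terms: list = field(default_factory=list)  # terms describing the document


class Indexer:
    def __init__(self):
        self._next_key = count(1)  # assumption: keys are 1, 2, 3, ...
        self.records = {}

    def add(self, location, text):
        """Create a record for one gathered document and return its key."""
        key = next(self._next_key)
        # Assumption: the describing terms are the distinct lowercase words.
        terms = sorted(set(text.lower().split()))
        self.records[key] = Record(key, location, terms)
        return key


indexer = Indexer()
key = indexer.add("http://www.example.com/a.html", "Web spiders gather Web documents")
print(key, indexer.records[key].terms)
# prints 1 ['documents', 'gather', 'spiders', 'web']
```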

Copyright © 1996 Aixiang (I Song) Yao, All Rights Reserved

Aixiang (I Song) Yao <ayao@csgrad.cs.vt.edu>
Last modified: November 21, 1996