Information gathering and filtering
-
This is done by web "robots": programs which automatically connect to servers, retrieve documents, and follow the links they contain, usually up to a certain "depth" of links, such as 4.
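A minimal sketch of such a depth-limited robot in Python. The breadth-first strategy and the names crawl and index_document are illustrative assumptions, not any particular engine's implementation:

    from collections import deque
    from html.parser import HTMLParser
    from urllib.parse import urljoin
    from urllib.request import urlopen

    class LinkParser(HTMLParser):
        """Collects the href targets of <a> tags."""
        def __init__(self):
            super().__init__()
            self.links = []

        def handle_starttag(self, tag, attrs):
            if tag == "a":
                for name, value in attrs:
                    if name == "href" and value:
                        self.links.append(value)

    def index_document(url, page):
        """Stub: a real robot would return keywords etc. to the search index."""
        print(f"indexed {url} ({len(page)} characters)")

    def crawl(start_url, max_depth=4):
        """Breadth-first crawl, following links only up to max_depth."""
        visited = set()
        frontier = deque([(start_url, 0)])
        while frontier:
            url, depth = frontier.popleft()
            if url in visited or depth > max_depth:
                continue
            visited.add(url)
            try:
                page = urlopen(url).read().decode("utf-8", errors="replace")
            except OSError:
                continue                      # dead link or unreachable server
            index_document(url, page)
            parser = LinkParser()
            parser.feed(page)
            for link in parser.links:
                frontier.append((urljoin(url, link), depth + 1))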
-
For each document, the robot returns keywords and other information to the search index. For example, Lycos returns: the title, any headings and subheadings, the 100 most "weighty" words, the first 20 lines, the size in bytes, and the number of words.
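A sketch of how such a per-document record might be built. The fields follow the Lycos list above, but scoring "weighty" words by plain frequency is an assumption; Lycos's actual weighting scheme is not given here:

    import re
    from collections import Counter

    def summarize(url, page):
        """Build a Lycos-style summary record for one HTML document."""
        text = re.sub(r"<[^>]+>", " ", page)          # crude tag stripping
        words = re.findall(r"[a-z']+", text.lower())
        title = re.search(r"<title>(.*?)</title>", page, re.I | re.S)
        headings = re.findall(r"<h[1-6][^>]*>(.*?)</h[1-6]>", page, re.I | re.S)
        return {
            "url": url,
            "title": title.group(1).strip() if title else "",
            "headings": [h.strip() for h in headings],
            # frequency as a stand-in for "weight" (assumption)
            "weighty_words": [w for w, _ in Counter(words).most_common(100)],
            "first_lines": text.splitlines()[:20],
            "size_bytes": len(page.encode("utf-8")),
            "word_count": len(words),
        }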
-
Problems with information gathering:
-
Information update: documents change or disappear after they are indexed, so the index goes stale between visits (see the update-check sketch after this list).
-
Information generated dynamically, such as the output of CGI scripts, is not available: the robot can only follow static links, not fill in forms.
-
Resource intensive: robots repeatedly connect to a site, consuming its bandwidth and processing. Informal conventions, such as the Robots Exclusion Protocol (robots.txt) and spacing out requests, try to prevent "rapid fire" requests or a "robot attack" (see the politeness sketch after this list).
-
Preventing robot loops when links are circular (see the loop-prevention sketch below).
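On the update problem: one standard, if partial, remedy is a conditional HTTP request, so a revisit only transfers a page when it has changed since the last crawl. The If-Modified-Since mechanism is standard HTTP; the revisit bookkeeping around it is an illustrative assumption:

    from urllib.error import HTTPError
    from urllib.request import Request, urlopen

    def fetch_if_changed(url, last_seen):
        """Re-fetch url only if it changed since last_seen.

        last_seen is the HTTP date string saved from the previous visit's
        Last-Modified header; returns the new body, or None if unchanged.
        """
        request = Request(url, headers={"If-Modified-Since": last_seen})
        try:
            return urlopen(request).read()
        except HTTPError as err:
            if err.code == 304:     # 304 Not Modified: index entry still valid
                return None
            raise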
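On resource use: a polite robot checks the site's robots.txt exclusion file and spaces out its requests to each host. A sketch using Python's standard robotparser; the user-agent name and the one-second delay are arbitrary illustrative values:

    import time
    from urllib.parse import urlsplit
    from urllib.robotparser import RobotFileParser

    AGENT = "example-robot"     # hypothetical user-agent name
    last_visit = {}             # host -> time of our previous request

    def polite_to_fetch(url, min_delay=1.0):
        """True if robots.txt allows url and we are not rapid-firing the host."""
        parts = urlsplit(url)
        # a real robot would cache the parsed rules per host
        rules = RobotFileParser(f"{parts.scheme}://{parts.netloc}/robots.txt")
        rules.read()
        if not rules.can_fetch(AGENT, url):
            return False
        elapsed = time.time() - last_visit.get(parts.netloc, 0.0)
        if elapsed < min_delay:
            time.sleep(min_delay - elapsed)   # wait instead of "rapid fire"
        last_visit[parts.netloc] = time.time()
        return True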
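On circular links: the usual defence is to normalize each URL and keep a set of pages already visited, so a cycle of links is followed at most once. The normalization rules shown are common choices, not a complete canonicalization:

    from urllib.parse import urlsplit, urlunsplit

    visited = set()

    def normalize(url):
        """Reduce trivially different URLs to one canonical form."""
        parts = urlsplit(url)
        return urlunsplit((
            parts.scheme.lower(),
            parts.netloc.lower(),
            parts.path or "/",
            parts.query,
            "",                  # drop the #fragment: same document
        ))

    def should_visit(url):
        """True the first time a (normalized) URL is seen, False on loops."""
        key = normalize(url)
        if key in visited:
            return False
        visited.add(key)
        return True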