Information gathering and filtering
- This is done by web "robots": programs which automatically connect to all servers and retrieve some number of documents from each, usually following links up to a certain "depth", such as 4.
- For each document, the robot returns keywords and other information to the search index. For example, Lycos returns: the title, any headings and subheadings, the 100 most "weighty" words, the first 20 lines, the size in bytes, and the number of words.
- Problems with information gathering:
- Information update: documents change after they are indexed, so the index must be refreshed.
- Information resulting from CGI scripts is not available.
- Resource intensive: robots repeatedly connect to a site; informal protocols try to prevent "rapid fire" or "robot attack" behavior.
- Preventing robot loops when links are circular.
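
The points above (a depth limit, a per-document record sent to the index, a pause between requests, and loop prevention) can be illustrated with a minimal sketch. It is not how Lycos or any real robot is implemented. The in-memory PAGES table stands in for the web, and the names PAGES, summarize, crawl and delay_seconds are hypothetical. Raw word frequency is used as a crude stand-in for "weighty" words, since the notes do not say how weight is computed.

```python
from collections import Counter, deque
import re
import time

# Hypothetical in-memory "web": URL -> (title, body text, outgoing links).
PAGES = {
    "a": ("Home", "robots index the web and the web grows", ["b", "c"]),
    "b": ("About", "robots follow links", ["a"]),  # circular link back to "a"
    "c": ("Deep", "deep page", ["d"]),
    "d": ("Deeper", "too deep to reach", []),
}


def summarize(url, title, text, top_n=3):
    """Build the record a robot might return to the search index."""
    words = re.findall(r"[a-z]+", text.lower())
    return {
        "url": url,
        "title": title,
        # Assumption: most frequent words approximate "weighty" words.
        "top_words": [w for w, _ in Counter(words).most_common(top_n)],
        "size_bytes": len(text.encode("utf-8")),
        "word_count": len(words),
    }


def crawl(start, max_depth, fetch=PAGES.get, delay_seconds=0.0):
    """Breadth-first crawl from start, following links up to max_depth."""
    seen = {start}                # URLs already queued: prevents robot loops
    queue = deque([(start, 0)])   # (url, depth) pairs still to visit
    index = []
    while queue:
        url, depth = queue.popleft()
        page = fetch(url)
        if page is None:          # unknown URL: nothing to index
            continue
        title, text, links = page
        index.append(summarize(url, title, text))
        time.sleep(delay_seconds)  # pause between requests to avoid "rapid fire"
        if depth == max_depth:    # depth limit reached: do not follow links
            continue
        for link in links:
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return index
```

Calling crawl("a", 1) indexes pages a, b and c but never reaches d, and the circular link from b back to a is skipped because a is already in the seen set. A real robot would replace fetch with an HTTP request and use a non-zero delay_seconds.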