Global HTML version of Foils prepared 22 January 1996

Foil 26 The Web Gathering Subsystem

From Web Technology Overview, CPS616 Basic Information Track for Computational Science -- Winter-Spring Semester 96, by Geoffrey Fox

Gather WWW pages/files from remote web servers and filter them into an indexed text database
Use 'Web Robot' or 'Web Agent' technology - a class of programs that automatically traverse network hosts and bring back information via various network protocols (e.g. HTTP)
Major issues - each has direct impact on database size, search coverage and performance
  • which files to gather (HTTP, FTP, GOPHER, WAIS, USENET NEWS, etc.)
  • what to index (full text, partial text, file attributes, etc.)
  • when to gather/index/update (real time, once a day/week/month, etc.)
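
The gather-and-index loop described above can be sketched roughly as follows. This is a minimal illustration, not the subsystem's actual implementation: the names `gather`, `build_index`, `needs_update` and `FAKE_WEB` are hypothetical, the pages are an in-memory stand-in for remote servers, and only full-text indexing of HTML reached over HTTP is shown. The `allowed_schemes` and `interval_seconds` parameters are assumed knobs for the "which files" and "when to update" issues.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse
import re


class LinkAndTextParser(HTMLParser):
    """Collect href targets and text content from one HTML page."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

    def handle_data(self, data):
        self.text.append(data)


def gather(start_url, fetch, allowed_schemes=("http",), max_pages=100):
    """Breadth-first robot: follow links from start_url, return {url: text}.

    fetch(url) is supplied by the caller and returns HTML or None.
    allowed_schemes is the (assumed) 'which files to gather' policy.
    """
    seen = {start_url}
    queue = deque([start_url])
    pages = {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        parser = LinkAndTextParser()
        parser.feed(html)
        parser.close()
        pages[url] = " ".join(parser.text)
        for href in parser.links:
            target = urljoin(url, href)  # resolve relative links
            if urlparse(target).scheme in allowed_schemes and target not in seen:
                seen.add(target)
                queue.append(target)
    return pages


def build_index(pages):
    """Full-text inverted index: word -> set of URLs containing it."""
    index = {}
    for url, text in pages.items():
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index.setdefault(word, set()).add(url)
    return index


def needs_update(last_gathered, now, interval_seconds):
    """'When to gather' policy: re-gather once the interval has elapsed."""
    return now - last_gathered >= interval_seconds


# Hypothetical two-page web used in place of real servers.
FAKE_WEB = {
    "http://a.example/": '<a href="/b.html">B</a> '
                         '<a href="ftp://a.example/file">F</a> parallel computing',
    "http://a.example/b.html": '<a href="/">home</a> web robots',
}

pages = gather("http://a.example/", FAKE_WEB.get)
index = build_index(pages)
```

Passing `fetch` in as a parameter keeps the traversal separate from the network protocol, so the same loop could be pointed at a real HTTP client. A production robot would also need politeness rules (for example honoring robots.txt and rate limits), which this sketch omits.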


Northeast Parallel Architectures Center, Syracuse University, npac@npac.syr.edu


Page produced by wwwfoil on Tue Feb 18 1997