HTML version of Foils prepared February 11, 1996

Foil 49 The Web Gathering Subsystem

From IBM Tutorial on Web Technology for HPCC, IBM Poughkeepsie, February 7, 1996, by Geoffrey Fox

Gather WWW pages and files from remote web servers and filter them into an indexed text database
Use 'Web Robot' or 'Web Agent' technology: a class of programs that automatically traverse network hosts and bring back information via various network protocols (e.g. HTTP)
Major issues, each with a direct impact on database size, search coverage and performance:
  • which files to gather (HTTP, FTP, Gopher, WAIS, Usenet news, etc.)
  • what to index (full text, partial text, file attributes, etc.)
  • when to gather/index/update (in real time, or once a day/week/month, etc.)
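
The gather-and-filter loop described above can be illustrated with a minimal Python sketch. It is not the subsystem from the tutorial. It assumes a breadth-first traversal, a full-text inverted index (word to set of URLs), and a caller-supplied `fetch` function that returns a page body or `None`. The names `gather`, `PageParser`, `fetch`, `allowed_schemes`, `max_pages` and `FAKE_WEB` are hypothetical and chosen for illustration only. The `allowed_schemes` parameter stands in for the "which files to gather" decision and `max_pages` bounds database size.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse
import re


class PageParser(HTMLParser):
    """Collects link targets and text from one HTML page."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

    def handle_data(self, data):
        self.text.append(data)


def gather(start_url, fetch, allowed_schemes=("http",), max_pages=100):
    """Traverse pages breadth-first from start_url and return an inverted
    index mapping each lowercase word to the set of URLs containing it.

    fetch is an assumed callable: URL -> page body (str) or None.
    """
    index = {}
    seen = {start_url}
    queue = deque([start_url])
    pages = 0
    while queue and pages < max_pages:
        url = queue.popleft()
        body = fetch(url)
        if body is None:          # unreachable or missing page: skip it
            continue
        pages += 1
        parser = PageParser()
        parser.feed(body)
        parser.close()
        # "What to index": here, every word of the page text (full text).
        words = re.findall(r"[a-z0-9]+", " ".join(parser.text).lower())
        for word in words:
            index.setdefault(word, set()).add(url)
        # "Which files to gather": follow only links with an allowed scheme.
        for href in parser.links:
            target = urljoin(url, href)
            if urlparse(target).scheme in allowed_schemes and target not in seen:
                seen.add(target)
                queue.append(target)
    return index


# A two-page in-memory "web" stands in for real servers (illustrative data).
FAKE_WEB = {
    "http://example.org/": '<a href="/a.html">A</a> parallel computing',
    "http://example.org/a.html": '<a href="ftp://example.org/f.txt">F</a> web robots',
}
index = gather("http://example.org/", FAKE_WEB.get)
print(sorted(index["robots"]))
```

Passing the fetcher as a parameter keeps the traversal independent of any one protocol, so a real robot could substitute an HTTP client with rate limiting. The "when to gather" question would be handled outside this function, for example by a scheduler that reruns it daily or weekly.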


Northeast Parallel Architectures Center, Syracuse University, npac@npac.syr.edu

Page produced by wwwfoil on Tue Feb 18 1997