A Public Service provided by NEXOR
-------------------------------------------------------------------------------
List of Robots

This is a list of Web Wanderers. See also the World Wide Web Wanderers, Spiders and Robots page. If you know of any that aren't on this list, please let me know. If you're just looking for search engines, you might try CUSI.
-------------------------------------------------------------------------------
The JumpStation Robot

Run by Jonathon Fletcher. Version I has been in development since September 1993, and has been run on several occasions; the last run was between February the 8th and February the 21st. More information, including access to a searchable database of titles, can be found on The JumpStation.
Identification: It runs from pentland.stir.ac.uk, has "JumpStation" in the User-agent field, and sets the From field. Version II is under development.
-------------------------------------------------------------------------------
Repository Based Software Engineering Project Spider

Run by Dr. David Eichmann. For more information see the Repository Based Software Engineering Project. Consists of two parts:

Spider
A program that creates an Oracle database of the Web graph, traversing links to a specifiable depth (defaults to 5 links) beginning at a URL passed as an argument. Only URLs having ".html" suffixes, or tagged as "http:" and ending in a slash, are probed. Unsuccessful attempts and leaves are logged into a separate table to prevent revisiting. This is effectively, then, a limited-depth breadth-first traversal of only the HTML portions of the Web; we err on the side of missing non-obvious HTML documents in order to avoid material we're not interested in. A third table provides a list of pruning points: hierarchies to avoid because of discovered complexity, or hierarchies not wishing to be probed. (A sketch of this kind of limited-depth traversal appears after the NorthStar entry below.)

Indexer
A script that pulls HTML URLs out of the database and feeds them to a modified freeWAIS waisindex, which retrieves each document and indexes it. Retrieval support is provided by a front page and a CGI script driving a modified freeWAIS waissearch. The separation of concerns allows the spider to be a lightweight assessor of Web state, while still providing the value added to the general community of the URL search facility.

Identification: It runs from rbse.jsc.nasa.gov (192.88.42.10), requests "GET /path RBSE-Spider/0.1", and uses "RBSE-Spider/0.1a" in the User-Agent field. Seems to retrieve documents more than once.
-------------------------------------------------------------------------------
The WebCrawler

Run by Brian Pinkerton.
Identification: It runs from webcrawler.cs.washington.edu, and uses WebCrawler/0.00000001 in the HTTP User-agent field. It does a breadth-first walk, and indexes content as well as URLs etc. For more information see its description.
-------------------------------------------------------------------------------
The NorthStar Robot

Run by Fred Barrie and Billy Barron. More information, including a search interface, is available on the NorthStar Database. Recent runs (26 April) will concentrate on textual analysis of the Web versus GopherSpace (from the Veronica data) as well as indexing. Run from frognot.utdallas.edu, possibly other sites in utdallas.edu, and from cnidir.org. Now uses HTTP From fields, and sets User-agent to NorthStar.
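As an aside: the RBSE Spider's two-table scheme above amounts to a depth-limited breadth-first traversal that only follows links which look like HTML, and that remembers failures so they are never retried. The following is a rough, minimal sketch in Python of that general technique, not the project's actual code; the class and function names and the User-Agent string are invented for the example.

    # Illustrative sketch only: a depth-limited, breadth-first walk that follows
    # only URLs ending in ".html" or "/", and logs failures so they are never
    # revisited.
    from collections import deque
    from html.parser import HTMLParser
    from urllib.parse import urljoin, urldefrag
    from urllib.request import Request, urlopen


    class LinkCollector(HTMLParser):
        """Collect href targets from anchor tags."""
        def __init__(self):
            super().__init__()
            self.links = []

        def handle_starttag(self, tag, attrs):
            if tag == "a":
                for name, value in attrs:
                    if name == "href" and value:
                        self.links.append(value)


    def probe_worthy(url):
        # Only probe URLs that look like HTML documents or directory listings.
        return url.startswith("http:") and (url.endswith(".html") or url.endswith("/"))


    def crawl(start_url, max_depth=5, user_agent="ExampleSpider/0.1 (illustrative)"):
        seen, failed = set(), set()          # 'failed' plays the role of the log table
        queue = deque([(start_url, 0)])      # breadth-first frontier: (url, depth)
        graph = {}                           # url -> list of outgoing links
        while queue:
            url, depth = queue.popleft()
            if url in seen or url in failed or depth > max_depth:
                continue
            seen.add(url)
            try:
                req = Request(url, headers={"User-Agent": user_agent})
                with urlopen(req, timeout=10) as resp:
                    body = resp.read().decode("latin-1", errors="replace")
            except OSError:
                failed.add(url)              # unsuccessful attempt: never retry
                continue
            parser = LinkCollector()
            parser.feed(body)
            children = [urldefrag(urljoin(url, href))[0] for href in parser.links]
            graph[url] = children
            for child in children:
                if probe_worthy(child) and child not in seen and child not in failed:
                    queue.append((child, depth + 1))
        return graph

The separate failed set mirrors the spider's table of unsuccessful attempts: recording dead ends is what keeps a repeated run from hammering the same broken links.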
-------------------------------------------------------------------------------
W4 (the World Wide Web Wanderer)

Run by Matthew Gray. Run initially in June 1993, its aim is to measure the growth of the Web. See the details and the list of servers.
User-agent: WWWWanderer v3.0 by Matthew Gray
-------------------------------------------------------------------------------
The Fish Search

Run by people using the version of Mosaic modified by Paul De Bra. It is a spider built into Mosaic. There is some documentation online.
Identification: Modifies the HTTP User-agent field. (Awaiting details.)
-------------------------------------------------------------------------------
The Python Robot

Written by Guido van Rossum. Written in Python. See the overview.
-------------------------------------------------------------------------------
html_analyzer-0.02

Run by James E. Pitkow. Its aim is to check the validity of Web servers. I'm not sure if it has ever been run remotely.
-------------------------------------------------------------------------------
MOMspider

Written by Roy Fielding. Its aim is to assist maintenance of distributed infostructures (HTML webs). It has its own page.
-------------------------------------------------------------------------------
HTMLgobble

Maintained by Andreas Ley. A mirroring robot. It is configured to stay within a directory and sleeps between requests, and the next version will use HEAD to check whether the entire document needs to be retrieved.
Identification: Uses "User-Agent: HTMLgobble v2.2", and sets the From field. Usually run by the author, from tp70.rz.uni-karlsruhe.de. Source is now available (but unmaintained).
-------------------------------------------------------------------------------
WWWW - the WORLD WIDE WEB WORM

Maintained by Oliver McBryan. Another indexing robot, for which more information is available. Actually has quite flexible search options. Awaiting identification information (run from piper.cs.colorado.edu?).
-------------------------------------------------------------------------------
W3M2 Robot

Run by Christophe Tronche. It has its own page. Supposed to be compliant with the proposed standard for robot exclusion.
Identification: Run from hp20.lri.fr, with User-Agent W3M2/0.02, and the From field set.
-------------------------------------------------------------------------------
Websnarf

Maintained by Charlie Stross. A WWW mirror designed for off-line browsing of sections of the Web.
Identification: Run from ruddles.london.sco.com.
-------------------------------------------------------------------------------
The Webfoot Robot

Run by Lee McLoughlin. First spotted in mid-February 1994.
Identification: It runs from phoenix.doc.ic.ac.uk. Further information unavailable.
-------------------------------------------------------------------------------
Lycos

Owned by Dr. Michael L. Mauldin at Carnegie Mellon University. This is a research program in providing information retrieval and discovery in the WWW, using a finite memory model of the Web to guide intelligent, directed searches for specific information needs. You can search the Lycos database of WWW documents, which currently has information about 390,000 documents in 87 megabytes of summaries and pointers. More information is available on its home page.
Identification: User-agent "Lycos/x.x", run from fuzine.mt.cs.cmu.edu. Lycos also complies with the latest robot exclusion standard.
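Several of the robots listed here (W3M2 and Lycos above, and others below) state that they comply with the proposed standard for robot exclusion, under which a robot fetches /robots.txt from a server and skips any paths disallowed for its User-agent. Below is a minimal sketch of such a check using Python's standard urllib.robotparser; it is an illustration only, not the code of any robot on this list, and the USER_AGENT name and function name are invented.

    # Illustrative sketch: honour a server's /robots.txt before fetching a page.
    from urllib.parse import urljoin, urlsplit
    from urllib.robotparser import RobotFileParser
    from urllib.request import Request, urlopen

    USER_AGENT = "ExampleRobot/0.1"       # invented name for the example

    def fetch_if_allowed(url):
        """Return the page body, or None if the exclusion standard forbids it."""
        root = "{0.scheme}://{0.netloc}/".format(urlsplit(url))
        rules = RobotFileParser(urljoin(root, "robots.txt"))
        rules.read()                      # fetch and parse /robots.txt
        if not rules.can_fetch(USER_AGENT, url):
            return None                   # disallowed for this User-agent: stay away
        req = Request(url, headers={"User-Agent": USER_AGENT})
        with urlopen(req, timeout=10) as resp:
            return resp.read()

The robots.txt file itself is just a list of User-agent and Disallow lines; an empty Disallow value leaves the whole site open to that robot.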
-------------------------------------------------------------------------------
ASpider (Associative Spider)

Written and run by Fred Johansen. Currently under construction, this spider is a CGI script that searches the Web for keywords given by the user through a form.
Identification: User-Agent "ASpider/0.09", with the From field set to "fredj@nova.pvv.unit.no".
-------------------------------------------------------------------------------
SG-Scout

Introduced by Peter Beebee. Run since 27 June 1994, for an internal XEROX research project, with some information being made available on SG-Scout's home page. Does a "server-oriented" breadth-first search in a round-robin fashion, with multiple processes.
Identification: User-Agent "SG-Scout", with the From field set to the operator. Complies with standard Robot Exclusion. Run from beta.xerox.com.
-------------------------------------------------------------------------------
EIT Link Verifier Robot

Written by Jim McGuire. Announced on 12 July 1994; see their page. A combination of an HTML form and a CGI script that verifies links from a given starting point (with some controls to prevent it going off-site or running without limit). Seems to run at full speed...
Identification: Version 0.1 sets no User-Agent or From field. From version 0.2 up, the User-Agent is set to "EIT-Link-Verifier-Robot/0.2". Can be run by anyone from anywhere.
-------------------------------------------------------------------------------
ANL/MCS/SIGGRAPH/VROOM Walker

Owned/maintained by Bob Olson. This robot is gathering data to do a full-text glimpse and provide a Web interface for it. The index, and further information, will appear on ANL's server.
Identification: Sets User-agent to "ANL/MCS/SIGGRAPH/VROOM Walker", and From to "olson.anl.gov". Now follows the exclusion protocol, and doesn't perform rapid-fire searches.
-------------------------------------------------------------------------------
WebLinker

Written and run by James Casey. It is a tool called 'WebLinker' which traverses a section of the Web, doing URN->URL conversion. It will be used as a post-processing tool on documents created by automatic converters such as LaTeX2HTML or WebMaker. More information is on its home page. At the moment it works at full speed, but is restricted to local sites. External GETs will be added, but these will run slowly. WebLinker is meant to be run locally, so if you see it elsewhere let the author know!
Identification: User-agent is set to 'WebLinker/0.0 libwww-perl/0.1'.
-------------------------------------------------------------------------------
Emacs w3-search

Written by William M. Perry. This is part of the w3 browser mode for Emacs, and half-implements a client-side search for use in batch processing. There is no interactive access to it. For more info see the Searching section in the Emacs-w3 User's Manual. I don't know if this is ever actually used by anyone...
-------------------------------------------------------------------------------
Arachnophilia

Run by Vince Taluskie. The purpose of this run (undertaken by HaL Software) was to collect approximately 10k HTML documents for testing automatic abstract generation. This program honors the robot exclusion standard and waits 1 minute between requests to a given server.
Identification: Sets User-agent to 'Arachnophilia', runs from halsoft.com.
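Arachnophilia's one-minute gap between requests to a given server (and Harvest's one-second pause, below) are instances of the same politeness idea: remember when each host was last visited and sleep out the difference before hitting it again. A small sketch of such a per-host throttle follows; the names are invented for the example and are not taken from any robot listed here.

    # Illustrative sketch: enforce a minimum delay between requests to one host.
    import time
    from urllib.parse import urlsplit


    class HostThrottle:
        """Remember when each host was last contacted and wait out the gap."""

        def __init__(self, min_gap_seconds=60.0):
            self.min_gap = min_gap_seconds
            self.last_visit = {}          # host -> time.monotonic() of last request

        def wait_for(self, url):
            host = urlsplit(url).netloc
            now = time.monotonic()
            previous = self.last_visit.get(host)
            if previous is not None:
                remaining = self.min_gap - (now - previous)
                if remaining > 0:
                    time.sleep(remaining)
            self.last_visit[host] = time.monotonic()


    # Usage: call throttle.wait_for(url) immediately before each request.
    throttle = HostThrottle(min_gap_seconds=60.0)   # an Arachnophilia-style 1-minute gap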
-------------------------------------------------------------------------------
Mac WWWWorm

Written by Sebastien Lemieux. This is a French keyword-searching robot for the Mac, written in HyperCard. The author has decided not to release this robot to the public. Awaiting identification details.
-------------------------------------------------------------------------------
churl

Maintained by Justin Yunke. A URL checking robot, which stays within one step of the local server; see further information. Awaiting identification details.
-------------------------------------------------------------------------------
tarspider

Run by Olaf Schreck (can be fingered at chakl@bragg.chemie.fu-berlin.de or olafabbe@w255zrz.zrz.tu-berlin.de). Sets User-Agent to "tarspider ", and From to "chakl@fu-berlin.de".
-------------------------------------------------------------------------------
The Peregrinator

Run by Jim Richardson. This robot, in Perl V4, commenced operation in August 1994 and is being used to generate an index called MathSearch of documents on Web sites connected with mathematics and statistics. It ignores off-site links, so it does not stray from a list of servers specified initially.
Identification: The current version sets User-Agent to Peregrinator-Mathematics/0.7. It also sets the From field. The robot follows the exclusion standard, and accesses any given server no more often than once every several minutes. A description of the robot is available.
-------------------------------------------------------------------------------
Checkbot

Currently maintained by Hans de Graaff. This program verifies links within a given domain, and all external links out of this domain. External links are checked with a HEAD request only. Additional information is available through the Checkbot information page. Sets User-agent to 'checkbot.pl/x.x libwww-perl/x.x' and sets the From field.
-------------------------------------------------------------------------------
web-walk

Written by Rich Testardi. A World Wide Web maintenance robot. Sets User-agent and the From field.
-------------------------------------------------------------------------------
Harvest

Run by hardy@bruno.cs.colorado.edu. A Resource Discovery Robot, part of the Harvest Project. Runs from bruno.cs.colorado.edu, sets the User-agent and From fields. Pauses 1 second between requests (by default). Note that Harvest's motivation is to index community- or topic-specific collections, rather than to locate and index all HTML objects that can be found. Also, Harvest allows users to control the enumeration in several ways, including stop lists and depth and count limits. Therefore, Harvest provides a much more controlled way of indexing the Web than is typical of robots.
-------------------------------------------------------------------------------
Katipo

Run by Michael Newbery. "I've written a robot to retrieve specific WWW pages." A Mac WWW robot that periodically (typically once a day) walks through the global history file produced by some browsers (Netscape and NCSA Mosaic, for example), checking for documents that have changed since last visited. See its information page.
The robot is called Katipo and identifies itself as
User-Agent: Katipo/1.0
From: Michael.Newbery@vuw.ac.nz
It emits _only_ HEAD queries, and prefers to work through proxy (caching) servers.
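Katipo's HEAD-only change checking (and Checkbot's HEAD verification of external links) can be done without transferring any document body: issue a HEAD request and compare the Last-Modified header against the time of the previous visit. Here is a rough Python sketch of that technique, again with invented names rather than either robot's real code.

    # Illustrative sketch: use a HEAD request to see whether a page has changed
    # since it was last visited, without downloading the body.
    from email.utils import parsedate_to_datetime
    from urllib.request import Request, urlopen
    from urllib.error import URLError


    def changed_since(url, last_seen, user_agent="ExampleChecker/0.1"):
        """Return True if the page appears newer than last_seen, False if
        unchanged, or None if the check itself failed (a broken link).
        last_seen should be a timezone-aware datetime, since HTTP dates are GMT."""
        req = Request(url, method="HEAD", headers={"User-Agent": user_agent})
        try:
            with urlopen(req, timeout=10) as resp:
                stamp = resp.headers.get("Last-Modified")
        except URLError:
            return None                   # a link verifier would report this URL
        if stamp is None:
            return True                   # no date given: assume it may have changed
        return parsedate_to_datetime(stamp) > last_seen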
-------------------------------------------------------------------------------
InfoSeek Robot 1.0

By Steve Kirsch. Its purpose is to collect information for use in a "WWW Pages" database in InfoSeek's information retrieval service (for more information on InfoSeek, please send a blank e-mail to info@infoseek.com). "The Robot follows all the guidelines listed in 'Guidelines for Robot Writers' and we try to run it in off hours. We will be updating the WWW database daily with new pages and re-load from scratch no more frequently than once per month (probably even longer). Most sites won't get more than 20 requests a month from us since there are only about 100,000 pages in the database."
-------------------------------------------------------------------------------
GetURL

Written and maintained by James Burton. A robot written in REXX, which downloads a hierarchy of documents with a breadth-first search. Example usage:
    geturl http://info.cern.ch/ -recursive -host info.cern.ch -path /hypertext/#?
would restrict the search to the specified host and path. Source and documentation are available. The User-Agent field is set to 'GetUrl.rexx v1.0 by burton@cs.latrobe.edu.au'.
-------------------------------------------------------------------------------
Open Text Corporation Robot

Run by Tim Bray.
-------------------------------------------------------------------------------
The TkWWW Robot

The TkWWW Robot is described in a paper presented at the WWW94 Conference in Chicago. It is designed to search Web neighborhoods to find pages that may be logically related. The Robot returns a list of links that looks like a hot list. The search can be by keyword, or all links at a distance of one or two hops may be returned. For more information see The TkWWW Home Page.
-------------------------------------------------------------------------------
Martijn Koster