WebCopy v0.97b2 95/07/01

Copyright 1994, 1995 by Víctor Parada (vparada@inf.utfsm.cl).

Copy files (recursively) via HTTP protocol.

NEWS!!! Current version is Beta2.

It is not yet available at the home FTP site (the FTPmaster is on holidays). You can temporarily pick up this version at httpd://www.inf.utfsm.cl/~vparada/archive/webcopy-0.97b2.tar.gz while this message is here. Use the link in this paragraph to get the software; the other links still deliver Beta1.

Description:

WebCopy is a Perl program that retrieves the URL specified in a Unix-like command line. If requested with an option, it can also recursively retrieve any file that an HTML file references, i.e. inlined images and/or anchors.

It can be used as a "mirror" program to retrieve a tree of documents from a remote site, and put them on-line immediately through the local server.

By default, only the document pointed to by the URL in the command line is retrieved. Many switches can be specified in the command line, and each one enables one type of reference to follow.

To avoid endless recursion, only files at one site can be retrieved with one command. WebCopy never follows links to files that are not on the same host, port number and protocol (only HTTP is supported) as the first document retrieved. A list of discarded URLs is logged for future reference.
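
The same-site rule above can be illustrated with a short Python sketch. This is not WebCopy's own Perl code; the function names `site_key` and `same_site` are hypothetical, and defaulting to port 80 when the URL omits the port is an assumption.

```python
from urllib.parse import urlsplit

def site_key(url):
    """Return (scheme, host, port) for a URL, assuming port 80 when none is given."""
    parts = urlsplit(url)
    port = parts.port if parts.port is not None else 80
    return (parts.scheme.lower(), (parts.hostname or "").lower(), port)

def same_site(first_url, candidate_url):
    """True only if the candidate is HTTP and shares host and port with the first URL."""
    candidate = site_key(candidate_url)
    return candidate[0] == "http" and candidate == site_key(first_url)
```

A reference for which `same_site` returns false would be one of the discarded URLs that end up in the log.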

This program does not comply with the Robot Exclusion Standard, since it retrieves essentially what the user specifies in the command line.

The user must know what kind of server and documents they want to access. WebCopy does everything it can to stop at CGI-generated files (virtual documents).
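
As a rough illustration of such a check, here is a minimal Python sketch (not WebCopy's actual Perl logic). The `/cgi-bin/` test comes from the description of the -c option; treating a query string as a sign of a generated document is an extra assumption, and the function name `looks_like_cgi` is hypothetical.

```python
from urllib.parse import urlsplit

def looks_like_cgi(url):
    """Guess whether a URL points at a CGI-generated (virtual) document."""
    parts = urlsplit(url)
    # "/cgi-bin/" in the path is the example given for option -c.
    # A non-empty query string is an additional assumption of this sketch.
    return "/cgi-bin/" in parts.path or parts.query != ""
```

A heuristic like this can only guess, which is why the user still needs to know what kind of server is being accessed.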

What's New:

Syntax:

webcopy [options] http://host:port/path/file [http://proxy:port]

Options (can be combined):

-o
output through stdout.
You can redirect the output to another filename or pipe it to a program. Use it with the -s option. You cannot recurse HTML files or use the -v or -q options in this mode.
-v
operates in verbose mode.
Displays every URL to fetch. -vv is "very verbose" and outputs every header line the server sends with the file.
-q
query each URL to transfer.
Use it to select the files to transfer. Enter 'n' to skip file, 'y' to transfer, 'a' to transfer all the remaining files, and 'q' to quit immediately. If you don't say 'y' to the first file (the one specified in the command line), no recursion is made.
-s
do not log in 'W.log'.
This file is stored in the root of the working directory. It can be parsed to get a list of every file (NOT) transferred.
-tdelay
set 'delay' seconds between transfers.
This option changes the default delay of 30 seconds before every connection. The delay is there to avoid overloading the server.
-wpath
set working directory to 'path'.
WebCopy stores files in the current working directory. Use this option to force WebCopy to use another directory.
-xfile
set default index to 'file'.
When a directory index is required, this is the filename that is used to store the output. Defaults to index.html.
-zfile
post 'file' or query string if omitted.
You can send some URL-encoded form data using the POST method to a CGI script. If the filename is omitted, the data is taken from the query string specified in the URL after a "?". The data in the file must be in URL-encoded format, and spaces are suppressed.
-r
recurse HTML documents.
Same as -il.
-i
include inlined images.
Retrieves the files referenced in <IMG> and <FIG> tags.
-l
follow hypertext links.
Recurse through hypertext references in .html documents.
Warning: Never leave WebCopy unattended if you don't know what you are recursively retrieving.
-m
allow imagemaps.
NOT available yet.
-c
allow links to CGI scripts.
By default, WebCopy discards references that seem to be CGI scripts (e.g. /cgi-bin/ in the path). Use this option if you want to retrieve the output of a CGI script. If the base path is other than the current one, you'll also require options -paf.
-a
allow absolute references to the same host.
References like /path/to/file.html, where the path is the current one (the one that was specified in the command line) are not rejected when this option is specified. If other paths are required, also use option -p.
-f
allow full URL references to the same host.
Complete http: URLs are accepted only if this option is specified in the command line and the host and port remain the same as the current ones. URLs with a path other than the current one are still rejected unless option -p is also specified.
-p
allow paths other than current.
References like /images/some.gif, where the path is not the current one, are accepted. Use this option to allow references to CGI scripts. To keep the same document structure as the server and to avoid document name collisions, option -d is recommended.
Warning: This option can cause WebCopy to retrieve all the data on a server if it finds a reference to the server root in some document while using recursion. Never leave WebCopy unattended if you don't know what you are recursively retrieving.
-d
keep directory path in URL for local file.
The default behaviour of WebCopy is to treat the working directory as the equivalent of the document directory specified in the command line's URL. With this option, WebCopy treats the working directory as the equivalent of the root directory of the server, so directories in the path are also created in the working directory. If you start using this option after some documents have already been transferred, you'll have to create the subdirectories yourself and move the retrieved files from the working directory to the subdirectory, or you will get duplicated files.
-u
use local copy of file if exists.
Before making a request to a server, WebCopy checks for the file in the working directory and sends file information to the server. The new version is retrieved only if the file has changed since the last access. This option forces WebCopy to use the local copy of the file if it exists, without checking whether the file has changed on the server.
-n
don't use defined PROXY.
If the http_proxy environment variable is defined, this option makes WebCopy ignore it. It also ignores a PROXY specified in the command line.
-h
help.
WebCopy displays a brief help, ignores any other options specified, and exits.
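
The way options -a, -f and -p combine can be summarised in a small Python sketch. It is only an interpretation of the descriptions above, not WebCopy's Perl code. The function name `accept_reference` and the set-of-letters representation of the options are hypothetical, and requiring -p for a relative reference such as ../images/x.gif that leaves the current path is an assumption.

```python
from urllib.parse import urljoin, urlsplit

def accept_reference(base_url, ref, flags):
    """Decide whether a reference found in a document would be followed.

    base_url: the URL given in the command line.
    ref:      the reference as written in the HTML file.
    flags:    set of single-letter options, e.g. {"a", "p"}.
    """
    if urlsplit(ref).scheme:            # full URL, needs -f
        if "f" not in flags:
            return False
    elif ref.startswith("/"):           # absolute reference, needs -a
        if "a" not in flags:
            return False

    base = urlsplit(base_url)
    target = urlsplit(urljoin(base_url, ref))
    # Host, port and protocol must match the first document (port 80 assumed by default).
    if (target.scheme, target.hostname, target.port or 80) != \
       (base.scheme, base.hostname, base.port or 80):
        return False

    # "/path/page.html" -> "/path/"; anything outside this directory needs -p.
    base_dir = base.path.rsplit("/", 1)[0] + "/"
    if not target.path.startswith(base_dir):
        return "p" in flags
    return True
```

With `http://www.host/path/page.html` as the base, `/images/some.gif` is accepted only when both "a" and "p" are in `flags`, which matches the advice to combine -a with -p.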

Note: Some options conflict with others. For example, you cannot use -v and -o at the same time because both require STDOUT.
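
The conflicts involving -o can be checked before running a command. The following Python sketch is illustrative only. The function name `option_conflicts` is hypothetical, and counting -r, -i and -l as "recursion" is an interpretation of the -o description.

```python
def option_conflicts(flags):
    """Return the sorted option letters that cannot be combined with -o.

    flags: set of single-letter options, e.g. set("vso").
    """
    if "o" not in flags:
        return []
    # -v and -q are named in the -o description; -r, -i and -l are taken
    # to be the recursion options (an assumption of this sketch).
    incompatible = {"v", "q", "r", "i", "l"}
    return sorted(flags & incompatible)
```

For example, `option_conflicts(set("vso"))` reports `["v"]`, while `option_conflicts(set("so"))` reports no conflict.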

Examples:

  1. To retrieve a single file and store it with some name in current directory:
    webcopy -so http://www.host/images/icon.gif > logo.gif
  2. To retrieve a page and some of the inlined images without delay:
    webcopy -vsiqt0 http://www.host/page.html
    and press RETURN on each file NOT to transfer.
  3. To mirror a group of files in some other directory:
    webcopy -rwpub/mirror/name http://www.host/intro.html
  4. To retrieve the output of a form:
    1. Get the form:
      webcopy -so http://www.host/form.html > form.html
    2. Using an editor, change:
      <FORM METHOD=POST ACTION="http://www.host/cgi-bin/proc">
      tag into:
      <FORM METHOD=POST ACTION="mailto:yourself@yourdomain">
    3. Using a WWW browser, read the modified file, fill the form and press "OK" button.
    4. Wait for your own mail to arrive. It should contain the posted URL-encoded data in the body.
    5. Save the mail in a file (post.dat) without the mail headings.
    6. Post the data:
      webcopy -so -zpost.dat http://www.host/cgi-bin/proc > result.html
    If you are smart enough, you can write your own files of data and just do step 6, or use the following:
    webcopy -so -z http://www.host/cgi-bin/proc?postdata > result.html
  5. To verbosely retrieve html documents and icons that are not in the same directory of the server:
    webcopy -vvrpafd http://www.host/path/page.html
  6. To retrieve a file using a PROXY, overriding the default http_proxy environment variable:
    webcopy http://www.host/path/page.html http://otherproxy
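
For example 4, a data file for the -z option can also be written by hand or by a short script. The Python sketch below shows one way to produce URL-encoded data. The field names and the function name `write_post_data` are hypothetical, and writing the file without a trailing newline is an assumption.

```python
from urllib.parse import urlencode

def write_post_data(fields, filename="post.dat"):
    """Write form fields to a file in URL-encoded format and return the encoded string.

    fields: list of (name, value) pairs, in the order the form would send them.
    """
    data = urlencode(fields)  # spaces become "+", reserved characters become %XX
    with open(filename, "w", encoding="ascii") as out:
        out.write(data)
    return data

if __name__ == "__main__":
    # Hypothetical form fields, for illustration only.
    print(write_post_data([("name", "Ana Maria"), ("comment", "a&b")]))
```

The resulting post.dat could then be passed to step 6 of example 4 with -zpost.dat.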

License Agreement and Lack of Warranty:

If you (want to) use this program, please send e-mail to the author. He will try to notify you of any updates made to it.

System Requirements:

Down-loading and Setting-Up:

  1. Make sure you meet the System Requirements above.
  2. Get the latest version of WebCopy from its home FTP server: ftp://ftp.inf.utfsm.cl/pub/utfsm/perl/webcopy.tgz
    This is a gzip'ed tar archive.
  3. Untar the file with the command:
    tar -xzvf webcopy.tgz
    (GNU version of tar).
  4. Make sure you got the following files in a subdir called webcopy-0.97:
  5. Read the License Agreement and Lack of Warranty in webcopy.txt, or in webcopy.html using an HTML browser.
  6. Edit the first line of webcopy if your perl interpreter is not located at /usr/local/bin/perl.
  7. Move webcopy to a suitable directory.
  8. Use it at your own risk!
  9. Register yourself (it's free) and send feed-back!

If you cannot do gunzip or tar, please send e-mail to the author. He will try to send you a shar'ed copy of it :-)


Document last modified on 1995/07/16 by Víctor Parada