WebCopy v0.97b2 95/07/01
Copyright 1994, 1995 by
Víctor Parada
(vparada@inf.utfsm.cl).
Copy files (recursively) via the HTTP protocol.
NEWS!!! The current version is Beta2.
It is not yet available at the home FTP site (the FTPmaster is on
holidays). You can pick up this version temporarily at
http://www.inf.utfsm.cl/~vparada/archive/webcopy-0.97b2.tar.gz
while this message is here.
Use the link in this paragraph to get the software; the other links get Beta1.
Description:
WebCopy is a perl program that retrieves the URL specified in a
Unix-like command line.
It can also retrieve recursively any file that an HTML file references,
i.e. inlined images and/or anchors, if specified with an option.
It can be used as a "mirror" program to retrieve
a tree of documents from a remote site, and put them on-line immediately
through the local server.
By default, only the document pointed to by the URL in the command line
is retrieved.
Many switches can be specified in the command line, and each option enables
one type of reference to follow.
To avoid endless recursion,
only files at one site can be retrieved with one command.
WebCopy never follows links to files that are not on the same host, port number,
and protocol (only HTTP is supported) as the first document retrieved.
A list of discarded URLs is logged for future reference.
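The same-host rule can be sketched in Python (WebCopy itself is written in
perl; the function name and URLs below are purely illustrative):

```python
from urllib.parse import urlsplit

def same_site(base_url, candidate_url):
    """Accept candidate_url only if its protocol, host, and port
    all match those of base_url. This is an illustrative sketch,
    not WebCopy's actual perl implementation."""
    base, cand = urlsplit(base_url), urlsplit(candidate_url)
    return (base.scheme == cand.scheme == "http"
            and base.hostname == cand.hostname
            and (base.port or 80) == (cand.port or 80))

# A link to the same host (default port 80) passes; others are discarded:
print(same_site("http://www.host/a.html", "http://www.host:80/b.html"))  # True
print(same_site("http://www.host/a.html", "http://other.host/b.html"))   # False
```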
This program does not comply with the
Robot
Exclusion Standard,
since it retrieves almost exactly what the user specifies in the command line.
The user must know what kind of server and documents he wants to access.
WebCopy does everything it can to stop at CGI-generated files
(virtual documents).
What's New:
- Slightly improved code :-)
- More restrictive, to avoid endless recursion.
- Added PROXY support. An HTTP proxy must be specified in the command line
or through the http_proxy environment variable.
- It can POST data to CGI scripts (but cannot recurse on output).
- Added delay time between connections, to avoid overload on the server.
Duration can be specified by an option in the command line.
Defaults to 30 seconds.
- Small bug (a misnamed variable was massively rejecting links as CGI scripts)
fixed since version 0.97b.
Syntax:
webcopy [options] http://host:port/path/file [http://proxy:port]
Options (can be combined):
-o - output through stdout.
You can redirect the output to another filename or pipe it to a program.
Use it with the -s option.
You cannot recurse HTML files or use the -v or -q options in this mode.
-v - operate in verbose mode.
Displays every URL to fetch. -vv is "very verbose" and
outputs every header line the server sends with the file.
-q - query each URL to transfer.
Use it to select the files to transfer.
Enter 'n' to skip a file, 'y' to transfer it,
'a' to transfer all the remaining files,
and 'q' to quit immediately.
If you don't say 'y' to the first file (the one specified
in the command line), no recursion is made.
-s - do not log in 'W.log'.
This file is stored in the root of the working directory.
It can be parsed to get a list of every file (NOT) transferred.
-t delay - set 'delay' seconds between transfers.
This option is used to change the default 30-second delay before
every connection. This delay helps avoid server overload.
-w path - set working directory to 'path'.
WebCopy stores files in the current working directory. Use this
option to force WebCopy to use another directory.
-x file - set default index to 'file'.
When a directory index is required, this is the filename that is used
to store the output. Defaults to index.html.
-z file - post 'file', or the query string if omitted.
You can send some URL-encoded form data using the POST method to
a CGI script. If the filename is omitted, the data is taken from the
query string specified in the URL after a "?".
The data in the file must be in URL-encoded format, and spaces are
suppressed.
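Such a URL-encoded data file can be produced with, for example, Python's
standard library (a sketch only; the field names and the post.dat filename
are invented for this illustration):

```python
from urllib.parse import urlencode

# Build URL-encoded form data suitable for a '-z' data file.
# The field names and values here are made up for the example.
fields = {"name": "Victor Parada", "lang": "perl"}
data = urlencode(fields)  # spaces become '+', pairs are joined by '&'
print(data)  # name=Victor+Parada&lang=perl

# Write it out without mail headings or extra newlines,
# ready for: webcopy -zpost.dat http://www.host/cgi-bin/proc
with open("post.dat", "w") as f:
    f.write(data)
```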
-r - recurse HTML documents.
Same as -il.
-i - include inlined images.
Retrieves the files referenced in <IMG> and <FIG> tags.
-l - follow hypertext links.
Recurse through hypertext references in .html documents.
Warning: Never leave WebCopy unattended if you don't
know what you are recursively retrieving.
-m - allow imagemaps.
NOT available yet.
-c - allow links to CGI scripts.
By default, WebCopy discards references that seem to be CGI scripts
(e.g. /cgi-bin/ in the path).
Use this option if you want to retrieve the output of a CGI script.
If the base path is other than the current one, you'll also require
options -paf.
-a - allow absolute references to the same host.
References like /path/to/file.html, where the path is the
current one (the one that was specified in the command line), are not
rejected when this option is specified.
If other paths are required, also use option -p.
-f - allow full URL references to the same host.
Complete http: URLs are accepted only if this option is
specified in the command line and the host and port remain the same as
the current ones, but they are still rejected unless option -p is also
specified.
-p - allow paths other than the current one.
References like /images/some.gif, where the path is not the
current one, are accepted. Use this option to allow references to CGI scripts.
To keep the same document structure as the server and to avoid document-name
collisions, option -d is recommended.
Warning: This option can cause WebCopy to retrieve the
whole contents of a server if it finds a reference to the server
root in some document while using recursion.
Never leave WebCopy unattended if you don't know what you are recursively
retrieving.
-d - keep directory path in URL for local file.
The default behaviour of WebCopy is to make the working directory the
equivalent of the document directory specified in the command line's URL.
With this option, WebCopy makes the working directory the equivalent of
the root directory of the server, so directories in the path are also
created in the working directory.
If you want to specify this option after some documents have already been
transferred, you'll have to create the subdirectories yourself and move the
retrieved files from the working directory into them, or you will get
duplicated files.
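The difference can be sketched in Python (illustrative only; WebCopy's real
logic is in perl and the helper name and URLs below are assumptions):

```python
from posixpath import dirname
from urllib.parse import urlsplit

def local_path(url, base_url, keep_dirs):
    """Map a document URL to a local file path.
    keep_dirs mimics option -d: keep the server's directory tree
    under the working directory. Sketch only, not WebCopy's code."""
    path = urlsplit(url).path.lstrip("/")
    if keep_dirs:
        return path  # mirror the path from the server root
    # Without -d, files land relative to the start document's directory.
    base_dir = dirname(urlsplit(base_url).path).lstrip("/")
    return path[len(base_dir):].lstrip("/") if path.startswith(base_dir) else path

print(local_path("http://h/docs/a/x.html", "http://h/docs/intro.html", True))   # docs/a/x.html
print(local_path("http://h/docs/a/x.html", "http://h/docs/intro.html", False))  # a/x.html
```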
-u - use the local copy of a file if it exists.
Before making a request to a server, WebCopy checks for the file in the
working directory and sends the file's information to the server. The new
version is retrieved only if the file has changed since the last access.
This option forces WebCopy to use the local copy of the file if it exists,
without checking whether the file has changed on the server.
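This timestamp check behaves like an HTTP conditional request. A rough
Python sketch of the idea (the exact header is an assumption about the
mechanism; WebCopy's actual perl code may differ):

```python
import os
from email.utils import formatdate

def if_modified_since_header(local_file):
    # Turn the local copy's modification time into an HTTP date;
    # a server can answer "304 Not Modified" to such a request,
    # so the file is transferred only when it actually changed.
    mtime = os.path.getmtime(local_file)
    return "If-Modified-Since: " + formatdate(mtime, usegmt=True)
```

With -u, no request is made at all when the local copy exists.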
-n - don't use the defined PROXY.
If the http_proxy environment variable is defined, this option
makes WebCopy ignore it. It also ignores a PROXY specified in the command
line.
-h - help.
WebCopy displays a brief help message, ignores any other options specified,
and exits.
Note: Some options conflict with others.
For example, you cannot use -v and -o at the
same time because both require STDOUT.
Examples:
- To retrieve a single file and store it with some name in current directory:
webcopy -so http://www.host/images/icon.gif > logo.gif
- To retrieve a page and some of the inlined images without delay:
webcopy -vsiqt0 http://www.host/page.html
and press RETURN for each file you do NOT want to transfer.
- To mirror a group of files in some other directory:
webcopy -rwpub/mirror/name http://www.host/intro.html
- To retrieve the output of a form:
- Get the form:
webcopy -so http://www.host/form.html > form.html
- Using an editor, change:
<FORM METHOD=POST ACTION="http://www.host/cgi-bin/proc">
tag into:
<FORM METHOD=POST ACTION="mailto:yourself@yourdomain">
- Using a WWW browser, read the modified file, fill in the form, and
press the "OK" button.
- Wait for your own mail to arrive. It should contain the posted
URL-encoded data in the body.
- Save the mail in a file (post.dat) without
the mail headings.
- Post the data:
webcopy -so -zpost.dat http://www.host/cgi-bin/proc > result.html
If you are smart enough, you can write your own data files and just do
the last step, or use the following:
webcopy -so -z http://www.host/cgi-bin/proc?postdata > result.html
- To verbosely retrieve HTML documents and icons that are not in the same
directory on the server:
webcopy -vvrpafd http://www.host/path/page.html
- To retrieve a file using a PROXY, overriding the default
http_proxy environment variable:
webcopy http://www.host/path/page.html http://otherproxy
License Agreement and Lack of Warranty:
- The author of this program is Victor Parada <vparada@inf.utfsm.cl>.
- This program is "Freeware", not "Public Domain".
- This program must be distributed for free, and cannot be included in
commercial packages without prior written permission from the author.
- This program cannot be distributed if modified in any way.
- This program can be used by anyone if the copyright and this notice
remains intact in every file.
- If you modify this program, please e-mail patches to the author.
- This is a Beta version of the program. You have been warned!
- This program is provided ``AS IS'', without any warranty.
- This program can cause huge file transfers and all the related effects.
- This program can fill data disks without notice.
- Neither the author nor UTFSM is responsible for the use of this program.
- Bug reports, comments, questions and suggestions are welcome! But
please check first that you have the latest version!
If you (want to) use this program, please send
e-mail to the author.
He will try to notify you of any updates made to it.
System Requirements:
- A perl interpreter (either 4.036 or 5.000 or later) with the
perl library (sys/socket.ph and timelocal.pl).
- A hostname program or script to get the current host's name.
- A TCP/IP connection and sockets.
- Space on disk.
- A machine with all the above available.
Down-loading and Setting-Up:
- Make sure you have the previous System Requirements.
- Get the latest version of WebCopy from its home FTP server:
ftp://ftp.inf.utfsm.cl/pub/utfsm/perl/webcopy.tgz
This is a gzip'ed tar archive.
- Untar the file with the command:
tar -xzvf webcopy.tgz
(GNU version of tar).
- Make sure you got the following files in a subdir called webcopy-0.97:
webcopy
webcopy.html
webcopy.txt
- Read the License Agreement and Lack of Warranty in
webcopy.txt, or in webcopy.html using an HTML browser.
- Edit the first line of webcopy if your perl
interpreter is not located at /usr/local/bin/perl.
- Move webcopy to a suitable directory.
- Use it at your own risk!
- Register yourself (it's free)
and send feed-back!
If you cannot do gunzip or tar, please
send e-mail to the author.
He will try to send you a shar'ed copy of it :-)
Document last modified on 1995/07/16 by Víctor Parada