Copyright 1994, 1995, 1996 by Víctor Parada (vparada@inf.utfsm.cl).
Copy files (recursively) via HTTP protocol.
WebCopy is a perl program that retrieves the URL specified in a unix-like command line.
It can also recursively retrieve any file that an HTML file references,
i.e. inlined images and/or anchors, if specified with an option.
It can be used as a "mirror" program to retrieve a tree of documents from a remote site, and put them on-line immediately through the local server.
By default, only the document pointed to by the URL in the command line is retrieved. Many switches can be specified in the command line, and each option enables one type of reference to follow.
To avoid endless recursion, only files at one site can be retrieved with one command.
WebCopy never follows links to files not in the same host, port number and protocol (only HTTP/1.x supported) of the first document retrieved.
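The same-site rule can be sketched as follows. This is an illustrative Python sketch, not WebCopy's own Perl code: the names `site_key` and `same_site` are hypothetical, and treating a missing port as 80 is an assumption.

```python
from urllib.parse import urlsplit

def site_key(url):
    """Return (protocol, host, port) for a URL; a missing port is assumed to be 80."""
    parts = urlsplit(url)
    return (parts.scheme.lower(), (parts.hostname or "").lower(), parts.port or 80)

def same_site(first_url, candidate_url):
    """True only if the candidate is an HTTP URL on the same host and port."""
    first, cand = site_key(first_url), site_key(candidate_url)
    return cand[0] == "http" and first == cand

start = "http://www.host/path/page.html"
for link in ("http://www.host/images/icon.gif",
             "http://www.host:8080/other.html",
             "ftp://www.host/file.txt",
             "http://other.host/page.html"):
    verdict = "follow" if same_site(start, link) else "discard"
    print(link, verdict)
```

Only the first link in the loop passes the check; the others differ in port, protocol, or host.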
A list of discarded URLs is logged for future reference in a file called "W.log".
This program does not comply with the Robot Exclusion Standard, since it retrieves almost exactly what the user specifies in the command line.
The user must know what kind of server and documents they want to access. WebCopy does all it can to stop at CGI-generated files (virtual documents).
.map
and is not recognized as HTML doc
by the server.
Makefile is provided for easy installation.
webcopy [options] http://host:port/path/file [http://proxy:port]
-o
-g
and -s
option.
You cannot recurse HTML files or use the -v or -q options in this mode.
-v
-vv is "very verbose" and outputs every header line the server sends with the file.
-q
'n' to skip file, 'y' to transfer, 'a' to transfer all the remaining files, and 'q' to quit immediately.
If you don't say 'y' to the first file (the one specified in the command line), no recursion is made.
-s
W.log
'.
-t
delay
-w
path
-x
file
index.html
.
-z
file
?
".
The data in the file must be in URL-encoded format, and spaces are
suppressed.
-y
userid:password
-v
was also specified
in the command line.
-Y
userid:password
-y
except that an older authentication method
is used. Try this if the other one does not work.
-k
ext1:ext2:ext3:...
-K
ext1:ext2:ext3:...
-r
depth
-il
.
<A>, <FRAME>, and <AREA> tags.
-i
<IMG>, <FIG>, <BODY>, <TABLE>, and <BGSOUND> tags.
-l
.html
documents.
-r
instead.
-m
.map file specified as a URL in the command line. Only NCSA- and CERN-compatible formats are scanned.
To know which URL you must give, you must guess the location of that file in the server.
Sometimes, you may remove "/cgi-bin/imagemap" from the URL available in the page which contains the map.
This is not done automagically by WebCopy, because some HTTP servers (like Spinner) recognize the .map extension and start the CGI over that file, using the (given?) coordinates instead of transferring the file.
-c
/cgi-bin/
in the path).
Use this option if you want to retrieve the output of a CGI script.
If the base path is other than current, you'll also require options -paf.
-a
/path/to/file.html, where the path is the current one (the one that was specified in the command line) are not rejected when this option is specified.
If other paths are required, also use option -p.
-f
http: URLs are accepted only if this option is specified in the command line and the host and port remain the same as the current ones, but are still rejected unless option -p is also specified.
-p
/images/some.gif, where the path is not the current one, are accepted. Use this option to allow references to CGI scripts.
To keep the same document structure as the server and to avoid document name collisions, option -d is recommended.
-d
-u
-g
-o option and redirecting the output to a file with the same name.
It may also force a PROXY or server with cache to refresh the file via the no-cache pragma.
-n
If the http_proxy environment variable is defined, this option makes WebCopy ignore it. It also ignores a PROXY specified in the command line.
-h
--dump
undump command and you want to speed up initialization/compile time, WebCopy generates a big core file if this option is the only one in the command line.
Note: Some options conflict with others.
For example, you cannot use -v and -o at the same time because both require STDOUT.
webcopy -so http://www.host/images/icon.gif > logo.gif
If you use the same name for both source and destination file, you must also specify the -g option, or you will probably get an empty file.
webcopy -vsiqt0 http://www.host/page.html
and press RETURN on each file NOT to transfer.
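The answers accepted in query mode can be modelled with a small decision loop. The Python sketch below is only an illustration of the behaviour described for -q, not WebCopy's code: the function name `select_files` is hypothetical, an empty answer (RETURN) is treated as "skip" as in the example above, and whether 'a' also covers the file being asked about is an assumption.

```python
def select_files(urls, answers):
    """Decide which URLs to transfer from a sequence of one-letter answers.

    'y' transfers the file, 'n' or an empty answer (RETURN) skips it,
    'a' transfers this and all remaining files (assumption: including the
    current one), and 'q' quits immediately.
    """
    chosen = []
    answers = iter(answers)
    for i, url in enumerate(urls):
        reply = next(answers, "").strip().lower()
        if reply == "q":
            break
        if reply == "a":
            chosen.extend(urls[i:])
            break
        if reply == "y":
            chosen.append(url)
        elif i == 0:
            # The first file was not accepted, so no recursion is made.
            break
    return chosen

files = ["page.html", "logo.gif", "photo.gif"]
print(select_files(files, ["y", "", "y"]))
```

Here the first and third files are kept and the second is skipped, because RETURN was pressed for it.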
webcopy -vr1 http://www.host/page.html
Check in W.log for ignored links.
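To see which references a page offers before recursing, the tags listed in the option descriptions can be scanned with a small parser. This Python sketch is an illustration only: the tag names come from the option list, but the attribute names (`href`, `src`, `background`) and the names `ReferenceCollector` and `collect` are assumptions made for the example.

```python
from html.parser import HTMLParser

# Tag -> attribute holding the reference. The tag names come from the option
# descriptions; the attribute names are assumptions made for this sketch.
LINK_TAGS = {"a": "href", "frame": "src", "area": "href"}
IMAGE_TAGS = {"img": "src", "fig": "src", "body": "background",
              "table": "background", "bgsound": "src"}

class ReferenceCollector(HTMLParser):
    """Collect link-type and image-type references from an HTML page."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.images = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)  # HTMLParser already lowercases tag and attribute names
        if tag in LINK_TAGS and attrs.get(LINK_TAGS[tag]):
            self.links.append(attrs[LINK_TAGS[tag]])
        if tag in IMAGE_TAGS and attrs.get(IMAGE_TAGS[tag]):
            self.images.append(attrs[IMAGE_TAGS[tag]])

def collect(html):
    """Return (links, images) found in the given HTML text."""
    parser = ReferenceCollector()
    parser.feed(html)
    parser.close()
    return parser.links, parser.images

page = '<BODY BACKGROUND="bg.gif"><A HREF="next.html">next</A><IMG SRC="logo.gif"></BODY>'
links, images = collect(page)
print("links:", links)
print("images:", images)
```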
webcopy -rwpub/mirror/name http://www.host/intro.html
webcopy -so http://www.host/form.html > form.html
<FORM METHOD=POST ACTION="http://www.host/cgi-bin/proc">
tag into:
<FORM METHOD=POST ACTION="mailto:yourself@yourdomain">
post.dat
) without
the mail headings.
webcopy -so -zpost.dat http://www.host/cgi-bin/proc > result.html
webcopy -so -z http://www.host/cgi-bin/proc?postdata > result.html
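The data passed with -z must be URL-encoded. As a rough illustration, the Python sketch below builds such a string from form fields and strips literal whitespace from it. The field names are made up, `prepare_post_data` is a hypothetical helper, and reading "spaces are suppressed" as "literal whitespace in the file is dropped" is an assumption.

```python
from urllib.parse import urlencode, parse_qsl

def prepare_post_data(text):
    """Drop literal whitespace (blanks, line breaks) from URL-encoded data.

    Assumption: this is what "spaces are suppressed" means for the -z file.
    """
    return "".join(text.split())

# Hypothetical form fields; real names depend on the form being submitted.
fields = {"name": "Aladdin", "comment": "open sesame & more"}
body = urlencode(fields)  # blanks become '+', '&' becomes %26
print(body)

# Data saved from a mail body may be wrapped over several lines.
wrapped = "name=Aladdin&\ncomment=open+sesame+%26+more\n"
print(prepare_post_data(wrapped))
```

Both print statements produce the same single-line string, which could then be saved as post.dat.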
webcopy -vvrpafd http://www.host/path/page.html
http_proxy environment variable:
webcopy http://www.host/path/page.html http://otherproxy
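The interplay between the http_proxy variable, a proxy given on the command line, and the -n option can be sketched as below. This is a Python illustration, not WebCopy's code: `choose_proxy` is a hypothetical name, and letting the command-line proxy take precedence over http_proxy is an assumption drawn from the example above.

```python
import os

def choose_proxy(cli_proxy=None, ignore_proxy=False, environ=None):
    """Return the proxy URL to use, or None for a direct connection."""
    environ = os.environ if environ is None else environ
    if ignore_proxy:
        # -n: ignore http_proxy and any proxy given on the command line.
        return None
    if cli_proxy:
        # Assumption: a proxy on the command line overrides http_proxy.
        return cli_proxy
    return environ.get("http_proxy") or None

env = {"http_proxy": "http://proxy:8080"}
print(choose_proxy(environ=env))
print(choose_proxy("http://otherproxy", environ=env))
print(choose_proxy("http://otherproxy", ignore_proxy=True, environ=env))
```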
"Aladdin" and password "open sesame", you must quote embedded blanks:
webcopy "-yAladdin:open sesame" http://www.host/path/page.html
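For reference, a userid:password pair is commonly sent to an HTTP server with the Basic authentication scheme, where the pair is base64-encoded into an Authorization header. The Python sketch below shows that encoding. That -y uses this exact scheme is an assumption, and `basic_auth_header` is a hypothetical helper name.

```python
import base64

def basic_auth_header(credentials):
    """Build a Basic Authorization header line from 'userid:password'."""
    token = base64.b64encode(credentials.encode("latin-1")).decode("ascii")
    return "Authorization: Basic " + token

# The embedded blank is part of the password and is encoded like any other byte.
print(basic_auth_header("Aladdin:open sesame"))
# prints: Authorization: Basic QWxhZGRpbjpvcGVuIHNlc2FtZQ==
```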
.html or .txt or .lst or .htm files:
webcopy -r -khtml:txt:lst:htm http://www.host/path/page.html
or
webcopy -r -khtml -ktxt -klst -khtm http://www.host/path/page.html
or even
webcopy -r -k.html.txt.lst.htm http://www.host/path/page.html
webcopy -r -Kps:doc:rtf http://www.host/path/page.html
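The three equivalent ways of writing an extension list shown above can be handled by one small parser. The Python sketch below is illustrative only: the helper names are hypothetical, and the filtering semantics (-k keeps only the listed extensions, -K rejects the listed ones) are an assumption inferred from the examples.

```python
import posixpath
import re
from urllib.parse import urlsplit

def parse_ext_args(args):
    """Merge extension lists written as 'a:b:c', '.a.b.c', or repeated options."""
    exts = set()
    for arg in args:
        exts.update(e.lower() for e in re.split(r"[:.]", arg) if e)
    return exts

def extension(url):
    """Return the lowercase file extension of a URL's path, without the dot."""
    return posixpath.splitext(urlsplit(url).path)[1].lstrip(".").lower()

def wanted(url, keep=None, reject=None):
    """Assumed semantics: 'keep' models -k, 'reject' models -K."""
    ext = extension(url)
    if keep and ext not in keep:
        return False
    if reject and ext in reject:
        return False
    return True

keep = parse_ext_args(["html:txt:lst:htm"])
print(wanted("http://www.host/path/page.html", keep=keep))
print(wanted("http://www.host/path/paper.ps", keep=keep))
```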
If you (want to) use this program, please send e-mail to the author. He will try to notify you of any updates made to it.
perl interpreter and library, either 4.036 (with sys/socket.ph and timelocal.pl) or 5.000 or later (with Socket.pm and timelocal.pl).
hostname program or script to get current host's name.
gzip'ed tar archive.
tar -xzvf webcopy.tgz
(GNU version of tar).
webcopy-0.98b7
:
Makefile
webcopy.src
webcopy.html
webcopy.html using an HTML browser.
Makefile file and select or change the path and filename for PERL and DESTINATION macros as required, and select the version of perl for IGNORE macro.
Makefile
file:
make
or, to force perl 4 code in WebCopy:
make perl4
or, to force perl 5 code in WebCopy:
make perl5
If you cannot do make, copy or move webcopy.src to webcopy, edit webcopy and change "%PERL%" in the first 2 lines into the location of your perl interpreter, for example: "/usr/local/bin/perl".
"#P5" if your interpreter is perl4, or "#P4" if you are using perl5 (yes, this is OK. #P5 are lines with Perl 5 code).
"#UNDEFINED". If something goes wrong when you run WebCopy, uncomment some of them and place the required code.
hostname program. If it doesn't, create your own script:
#!/bin/sh
/bin/uname -n
or (if you run WebCopy in the same host every time):
#!/bin/sh
echo "myhostname"
or something like that, then make it executable:
chmod 755 hostname
and place it in a directory available in the PATH.
./webcopy -h
It should display a help menu.
./webcopy -vv http://www/
It should display some status and create two files:
W.log
a log file where you can find useful info about the last run.
index.html
the home page retrieved from the server at www.
webcopy to a suitable directory.
This can also be done with:
make install
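If make is not available, the manual "%PERL%" edit described earlier can also be scripted. The Python sketch below shows the substitution on the first two lines only. The helper name `set_perl_path` and the sample text are hypothetical and do not reproduce the real contents of webcopy.src.

```python
def set_perl_path(source_text, perl_path="/usr/local/bin/perl"):
    """Replace %PERL% in the first two lines only, leaving the rest untouched."""
    lines = source_text.split("\n")
    for i in range(min(2, len(lines))):
        lines[i] = lines[i].replace("%PERL%", perl_path)
    return "\n".join(lines)

# Hypothetical stand-in for the top of webcopy.src.
sample = "#!%PERL%\n# run with %PERL%\n# later lines are not changed: %PERL%"
print(set_perl_path(sample))
```

The result would then be written to a file named webcopy and made executable.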
If you cannot do gunzip or tar, please send e-mail to the author.
He will try to send you a shar'ed copy of it :-)