Splinterweb Web Site Extraction and Mapping
Splinterweb is a Perl program designed to extract a subset of a complete Web site in a consistent fashion, so that all URLs linking to the original site are correctly mapped, either internally to the new Web site or externally to the original one. Splinterweb analyzes all URLs in all files, either specified directly or found recursively in a directory tree. URLs may be relative or absolute and are converted as necessary in each case. There are special options to support the WebWisdom delivery system.

Splinterweb produces a log that includes information on missing links discovered during the conversion process. This log can be accumulated over several runs, and analysis of missing links and similar problems can be performed on the combined results. Splinterweb defines a "splinter" of the original Web site and can either produce it in multiple runs or selectively update selected parts of it.

Splinterweb simply copies files unless they have a .HTML or .HTM extension (in any case). In those cases it reads the file and converts any URLs discovered. URLs discovered in known tags are carefully converted, with relative addressing if necessary. Other URLs are converted using a simple pattern match, which can miss a relative URL that appears outside a tag.
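The core mapping step described above can be sketched in Perl as follows. This is an illustrative sketch only: the subroutine name map_url, the stem variables, and all URLs are assumptions for the example, not splinterweb's actual internals.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical oldbaseurlstem and newbaseurlstem values.
my $old_stem = 'http://original.site';
my $new_stem = 'http://new.site/splinter';

# Map a URL found in an HTML tag: if it points inside the original
# site, rewrite it under the new stem; otherwise leave it untouched so
# it still refers to the original (external) location.
sub map_url {
    my ($url) = @_;
    if ( index( $url, $old_stem ) == 0 ) {
        my $tail = substr( $url, length($old_stem) );
        return $new_stem . $tail;
    }
    return $url;    # external link: keep pointing at the original site
}

print map_url('http://original.site/users/doc.html'), "\n";
print map_url('http://elsewhere.org/page.html'),      "\n";
```

In splinterweb itself the same idea is applied to both absolute and relative URLs, with relative addressing used in the output where possible.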
Command Line
splinterweb optionfile wisdomtemplate readmetemplate
- optionfile
contains options defining the features of the conversion.
- wisdomtemplate
is a special file defining the master template for a file to be produced that allows WebWisdom to map URLs correctly. You can use the default file of this name in the distribution directory.
- readmetemplate
is a special file defining the master template used to produce a header file describing the converted Web site. You can use the default file of this name in the distribution directory.
The distribution directory has a sample runsplinter shell file.
Note that there are several critical files defined in the optionfile, in particular the directory where log files are to be written and the directory where the new Web image is to be constructed.
Optionfile
This defines a set of parameters input in a simple name:value syntax, where the allowed attribute names are given below. Any attribute can be continued with contd:morestuff syntax, where morestuff is concatenated directly onto the previous value, or with CONTD:morestuff syntax, where it is concatenated onto the previous value with a newline separating them. Apart from this special CONTD attribute, case is ignored in attribute names. (M) below indicates that the attribute is an array and can appear many times to specify separate array elements; otherwise multiple entries overwrite each other and only the last entry is kept. All options are read before any are processed.
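As a small illustration of the continuation syntax, the following hypothetical fragment (the attribute value is invented for the example) sets selectiontext in three lines; the contd line is joined directly onto the previous value, while the CONTD line is joined with a newline in between:

```
selectiontext:This splinter contains the tutorial
contd: pages only.
CONTD:Links to other material point back to the original site.
```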
- Convertfile (M):
A file in the splinter
- Convertdir (M):
The files found by recursive descent starting at this directory are to be placed in the splinter.
- Doconvertdir (M):
This directory should be converted on this run. Any such value must either be one of the values defined by a convertdir directive or a subdirectory of such a named directory. If no doconvertdir names are specified, ALL of the directories defined in convertdir entries are converted. If a single doconvertdir attribute with the value no or none is specified, no directories are converted.
- Doconvertfile (M):
This file should be converted on this run. It should either be named in a convertfile directive or be contained in a convertdir directory or one of its subdirectories. Files specified in convertfile or doconvertfile are in addition to those found recursively via the convertdir/doconvertdir attributes. If no doconvertfile names are specified, ALL of the files defined in convertfile entries are converted. If a single doconvertfile attribute with the value no or none is specified, no separate files are converted.
- Skip (M):
do not copy a file with this name under any circumstances.
- Map (M):
oldstring, newstring maps oldstring into newstring in places where URLs are possible. This is only necessary for mappings that are not implied by the lists given in convertfile and convertdir. There is a special form oldstring, newstring, no which is only used in WebWisdom file generation.
- Newbasedir:
the directory where splinterweb starts writing the converted copy of the splinter Web site.
- Copyover:
the default is 1, which implies that splinterweb copies a file even if it already exists in the output directory. If copyover is 0, splinterweb will ignore files that have already been copied. Note that splinterweb does not remake existing directories; you must remove files that exist in target directories from an earlier run but are no longer needed. Splinterweb will create any needed directories.
- Oldbaseurlstem:
the URL of the effective root of the original Web site; this is typically the document root but need not be. One usually runs splinterweb from this directory. Using the true document root ensures that all relative addresses within the original site are correctly interpreted. Note that splinterweb converts all URLs to the same form, whether they were originally absolute or relative.
- Newbaseurlstem:
the URL to be used in the new splinter Web site to replace that given in oldbaseurlstem. It need not be a document root.
- Userelative:
if 1, use relative addresses wherever possible; if 0, use absolute addresses.
- Selectionfile:
if non-null, create a file of this name in the base directory of the new file system, documenting the conversions performed. This file uses the readmetemplate specified as the third argument of the splinterweb calling sequence and the attributes selectionname and selectiontext. See the example readmetemplate for a simple way to specify this file.
- Selectionname:
the header name of the selection file describing the splinter.
- Selectiontext:
the main text describing the splinter. Note that with a good readmetemplate one automatically gets a good listing of the files and directories converted.
- Logfile:
if non-null, this must be of the form logdirectory/filename, where logdirectory must already exist. Splinterweb creates several log files obtained by appending 1, 2, ... 10 to the basic logfile specified. It also writes a file logdirectory/filenamelabel that describes each of the log files. This log directory also holds a copy of the input data, the WebWisdom mapping files defined by the second command line argument, and the selection description file defined in selectionfile.
- Summary (M):
previous log files, specified as a directory and typical file name as in logfile, which are to be concatenated with the results of this run for analysis of missing links etc.
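Putting the attributes above together, an optionfile might look like the following sketch. Every path, URL, and value here is a hypothetical example, not taken from a real configuration:

```
oldbaseurlstem:http://original.site
newbaseurlstem:http://new.site/splinter
newbasedir:/export/splinter
userelative:1
copyover:1
convertdir:tutorial
convertdir:examples
convertfile:index.html
doconvertdir:tutorial
skip:private.html
map:/cgi-bin/old.cgi, /cgi-bin/new.cgi
selectionfile:README.html
selectionname:Tutorial Splinter
selectiontext:Pages extracted from the original site.
logfile:/export/logs/splinterlog
```

With this optionfile, only the tutorial directory would be converted on this run (because of the doconvertdir line), while the splinter itself is defined by both convertdir entries and the convertfile entry.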
Bugs and Features
- We should remove the second and third command line arguments and make them optionfile attributes.
- The algorithm used to identify URLs outside HTML tags could be improved. Such URLs are typically either arguments of JavaScript functions or just parts of the text.
- No attempt is made at efficiency. In particular, log information is stored in Perl hash variables, and although Perl imposes no limit on their size, the system runs particularly slowly if they need to be paged to disk. You can speed up processing by running a large site in chunks, using doconvertdir to convert selectively.