Splinterweb Web Site Extraction and Mapping

Splinterweb is a Perl program that extracts a subset of a complete Web site in a consistent fashion, so that every URL linking into the original site is correctly mapped either internally to the new Web site or externally to the original Web site. Splinterweb analyzes all URLs in all files that are either specified directly or found recursively in a directory tree. URLs can be relative or absolute and are converted where necessary in each case. There are some special options to support the WebWisdom delivery system. Splinterweb produces a log that includes information on missing links discovered during conversion. This log can be accumulated over several runs, so that missing links and similar problems can be analyzed across the combined runs. Splinterweb defines a "splinter" of the original Web site and can then either produce it over multiple runs or selectively update parts of it. Splinterweb simply copies files unless they have an .HTML or .HTM extension (in any case). Files with those extensions are read and any URLs discovered in them are converted. URLs found in known tags are converted carefully, using relative addressing where necessary. Other matched URLs are converted using a simple pattern match, which misses relative URLs that appear outside a tag.
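The mapping rule described above can be illustrated with a short sketch. It is written in Python for brevity and is not Splinterweb's Perl code; the names needs_conversion, map_link and splinter_urls, and the example.org addresses, are assumptions made for this illustration. It shows one plausible way to decide whether a link stays internal (rewritten as a relative path) or points back at the original site (kept as an absolute URL).

```python
import posixpath
from urllib.parse import urljoin, urlsplit


def needs_conversion(filename):
    """True for .html/.htm files in any case; other files are just copied."""
    return filename.lower().endswith((".html", ".htm"))


def map_link(page_url, href, splinter_urls):
    """Rewrite one link found on page_url (hypothetical helper).

    page_url      -- absolute URL of the page on the original site
    href          -- the link as written in the page, relative or absolute
    splinter_urls -- set of absolute URLs that are part of the splinter
    Query strings and #fragments are not handled in this sketch.
    """
    target = urljoin(page_url, href)  # resolve relative links against the page
    if target not in splinter_urls:
        return target  # external: point at the original site
    # Internal: express the target relative to the page's own directory.
    page_dir = posixpath.dirname(urlsplit(page_url).path)
    return posixpath.relpath(urlsplit(target).path, start=page_dir)


page = "http://example.org/course/intro/index.html"
splinter = {page, "http://example.org/course/notes/ch1.html"}
inside = map_link(page, "http://example.org/course/notes/ch1.html", splinter)
outside = map_link(page, "/other/page.html", splinter)
```

Here inside becomes a relative link within the splinter, while outside becomes an absolute URL on the original site. How the real program builds its list of splinter files is controlled by the optionfile and is not modelled here.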

 

Command Line

splinterweb optionfile wisdomtemplate readmetemplate

 

 

The distribution directory contains a sample runsplinter shell script.

Note that several critical locations are defined in optionfile, in particular the directory where log files are to be written and the directory where the new Web image is to be constructed.

Optionfile

This defines a set of parameters, which are entered in a simple name:value syntax; the allowed attribute names are given below. Any attribute can be continued with the contd:morestuff syntax, where morestuff is concatenated with the previous value, or with the CONTD:morestuff syntax, where morestuff is concatenated with the previous value with a newline separating them. Apart from this special CONTD attribute, case is ignored in attribute names. (M) below indicates that an attribute is an array and can appear many times to specify separate array elements. Otherwise, multiple entries overwrite each other and only the last entry is kept. All options are read before any are processed.
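The parsing rules above can be sketched as follows. This is a Python illustration, not Splinterweb's own Perl parser. The function parse_options, its array_names parameter, and the attribute names logdir, include and note are all hypothetical. The sketch also makes assumptions the text does not settle: lines without a colon are skipped, only the exact upper-case spelling CONTD inserts a newline, and a continuation after an array attribute extends its most recent element.

```python
def parse_options(lines, array_names=()):
    """Parse name:value lines into a dict (illustrative sketch only)."""
    array_names = {n.lower() for n in array_names}
    options = {}
    last = None  # key of the attribute most recently set
    for raw in lines:
        line = raw.rstrip("\n")
        if ":" not in line:
            continue  # assumption: skip lines that are not name:value
        name, value = line.split(":", 1)  # split on the first colon only
        name = name.strip()
        if name.lower() == "contd":
            if last is None:
                continue  # nothing to continue yet
            sep = "\n" if name == "CONTD" else ""  # CONTD joins with a newline
            if last in array_names:
                options[last][-1] += sep + value
            else:
                options[last] += sep + value
            continue
        key = name.lower()  # attribute names are case-insensitive
        if key in array_names:
            options.setdefault(key, []).append(value)  # (M): keep every entry
        else:
            options[key] = value  # scalar: the last entry wins
        last = key
    return options


sample = [
    "LogDir:/tmp/logs",
    "logdir:/var/log/splinter",
    "include:a",
    "Include:b",
    "contd:c",
    "note:first",
    "CONTD:second",
]
opts = parse_options(sample, array_names=["include"])
```

Because all options are read before any are processed, a parser of this kind can return the complete dictionary first and leave validation, such as checking that the log directory exists, to a later step.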

 

Bugs and Features