PERL 4 OVERVIEW

Overview of Perl4 (and some Perl5) -- a Module for CPS616 -- Technologies for the Information Age
Original: February 19, 1995. Updated: March 1996
Geoffrey Fox
NPAC, Syracuse University, 111 College Place, Syracuse NY 13244-4100

Abstract of PERL4 Overview for CPS616

This simple discussion of PERL4 describes the essential features needed to get going for general purpose programming, i.e. it does not describe the special concerns needed for systems programming but is aimed at what you need for writing CGI programs.
A few Perl5 points are made when appropriate.
We reference in detail the Llama Book: Learning PERL by Randal L. Schwartz, published by O'Reilly and Associates. ISBN: 1-56592-042-2
More detailed is the recently updated Camel Book: Programming PERL by Larry Wall, Tom Christiansen and Randal L. Schwartz, also published by O'Reilly and Associates. ISBN: 1-56592-149-6. This is one of the few authoritative Perl5 discussions.
Another useful book, which lies between the Llama and Camel books in completeness, is PERL by Example by Ellie Quigley, Prentice Hall. ISBN 0-13-122839-0

General Remarks on PERL

PERL is an interpreter and is a cross between C, the UNIX Shell, sed and awk.
I certainly consider it easier than all four of these: even when these other approaches work, PERL produces clearer code (especially compared with the UNIX Shell) which is easier to write and debug.
We describe PERL4, which can be invoked by putting PERL statements in an executable file (chmod +x on a UNIX file) and making certain that the first line is (at NPAC)
    #!/usr/local/bin/perl
We will later describe PERL5, which was released in late 1994 and is analogous to C++ in the same way that PERL4 is analogous to C.
C, being compiled, will be more efficient than PERL. We use PERL for those tedious high level things which take a long time to write but don't take much execution time.
Computationally intensive loops should be coded in C (or equivalent) and called from PERL.
PERL is comparable to C for I/O and UNIX system calls but can be thousands of times slower than C for arithmetic.

Scalar Data I -- Numbers (Chapter 2 of Llama Book)

Scalars are either numbers or strings of characters, as in C, although in both cases there are significant if "second-order" differences.
Numbers and strings are not typed separately.
Perl is "safer" than JavaScript in this regard, and it is very rare that numbers and strings are confused.
Numbers are stored internally as integers if this represents them adequately -- otherwise as double precision numbers.
Perl and the runtime system make certain that this is transparent to the user.
Wolfram's SMP (the forerunner of Mathematica) at Caltech (which I worked on) made the more extreme choice of everything being double precision.
For example 1, 5.0, 4.5E23 and 7.45E-15 are all numbers.
Octal and hexadecimal numbers are allowed: 0377 (initial zero) is assumed to be octal and so is equal to 255 decimal.
An initial 0x or 0X represents hexadecimal, with letters A to F corresponding to numbers 10 to 15 as normal. 0xFF is hex FF, which is also 255 decimal.

Scalar Data II -- Single Quoted Strings (Chapter 2 of Llama Book)

There are two types of strings, defined by a stream of characters inside either single quotes ' and ' or double quotes " and ".
Single quoted strings are the simplest but probably the least often used.
Inside such strings ALL characters, including newlines, are treated as they look, except that ' must be represented as \' and \ as \\.
\n does NOT represent a newline.
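The rules above for numeric literals and single quoted strings can be checked with a short script. This is only a sketch: the variable names are made up for illustration, and it uses my together with the strict and warnings pragmas, which are Perl5 features.

```perl
use strict;
use warnings;

# Octal (leading 0) and hexadecimal (leading 0x) literals both denote 255.
my $octal = 0377;
my $hex   = 0xFF;

# Inside single quotes only \' and \\ are special; \n stays as two characters.
my $quoted  = 'don\'t';
my $literal = 'a\nb';

print "octal=$octal hex=$hex\n";
print "length of quoted string: ", length($quoted), "\n";
print "length of literal string: ", length($literal), "\n";
```

The second length is 4 because the backslash and the n are kept as two ordinary characters.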
Example: 'don\'t' is the 5 character string don't
Strings may contain any character, including the zero byte: PERL records the length of each string rather than relying on a C-style terminating zero byte, so '' and "" are both the empty string of length 0.

Scalar Data III -- Double Quoted Strings (Chapter 2 of Llama Book)

Double quoted strings are very similar to those in C, with a lot of special characters given in table 2-1 of the Llama book and the online PERL man page. (See later in foils)
Examples: \n is newline, \t is tab and \cC is Control-C
Example: 'Hello World' written with a real newline (rather than a space) between the two words is equivalent to "Hello\nWorld"
\L instructs PERL that all following characters until a \E are to be interpreted as lower case. \U ... \E is similar but the intervening characters are upper case.
A critical feature of double quoted strings is that they can include variables, designated by $ as initial character. For this reason use \$ to denote a real dollar sign in a double quoted string.
Variables are NEVER interpolated in single quoted strings.

Scalar Variables and Statements/Comments (Chapter 2 of Llama Book)

Scalar variables are named variables holding numeric or string scalars or both.
There are NO types (integer, float, char) of variables. The interpretation of a scalar is always determined by context and not by the "type" of the variable.
$cps616_num is a scalar variable representing the number of students in CPS616.
Variables are set by the conventional equal sign
    $cps616_num = 15;
Statements are ended by semicolons -- not by newlines -- and # can be used to denote a comment
    $Instructor = "Fox"; # Not really true as Wojtek did all the work
Comments can follow a statement or be on a line of their own. They are terminated by newlines.

Operators for Numbers and Strings I (Chapter 2 of Llama Book)

Conventional arithmetic operators are available in PERL:
+, -, *, /, ** (the last is raise to the power of) mean what you think
    2+3 # is 5
10/3 is 3.33333 (and not 3) as numbers are floating point if necessary.
One important string operator is . for concatenation
    "Hello" . " World" # is identical to "Hello World"
Less used is x (times), used to replicate string data
    "hip " x 3 . "hurrah" # is "hip hip hip hurrah"
Note that
    (3+2) x (3+1) # is NOT a number but rather the string "5555"

Operators for Numbers and Strings II -- Comparison (Chapter 2 of Llama Book)

There are six basic comparison operators which are DIFFERENT for numeric and string comparisons. Here PERL is using operators to create the context that defines type.
The numeric forms are == != < > <= >= and the corresponding string forms are eq ne lt gt le ge.
There is a complex set of precedence rules but I always use parentheses and do not remember the rules!
    "CPS" . (12*50) # is string "CPS600"

Operators for Numbers and Strings III -- Binary Assignment (Chapter 2 of Llama Book)

Assignments are
    $Next_Course = "CPS615";
    $Funding = $Funding + $Contract;
The latter can be written, as in C, as
    $Funding += $Contract;
Similarly one can write for strings:
    $Name = "Geoffrey"; $Name .= " Fox"; # Sets $Name to "Geoffrey Fox"
Example:
    $A = 6; $B = ($A += 2); # sets $A = $B = 8
Autoincrement and Autodecrement: as in C
    $a = $a + 1; $a += 1; and ++$a; # are the same and increment $a by 1
++ and -- are both allowed and can be used BEFORE (prefix) or AFTER (suffix) the variable (operand). Both forms change the operand in the same way, but in the suffix form the result, if used in an expression, is the value BEFORE the variable was incremented.
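The string operators and binary assignments described on the foils above can be tried out as follows. This is a minimal sketch with illustrative variable names. It assumes a Perl5 interpreter, hence my, strict and warnings.

```perl
use strict;
use warnings;

my $greeting = "Hello" . " World";      # concatenation
my $cheer    = "hip " x 3 . "hurrah";   # replication binds tighter than .
my $fives    = (3 + 2) x (3 + 1);       # the string "5555", not a number
my $course   = "CPS" . (12 * 50);       # the number 600 is used as a string

my $name = "Geoffrey";
$name .= " Fox";                        # binary assignment for strings

my $total  = 6;
my $result = ($total += 2);             # an assignment has a value, here 8

print "$greeting\n$cheer\n$fives\n$course\n$name\n$total $result\n";
```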
    $a=3; $b = (++$a); # sets $a and $b to 4
    $a=3; $b = ($a++); # sets $a to 4 and $b to 3

Interpolation of Scalars into Strings (Chapter 2 of Llama Book)

We can use scalar variables in strings
    $h = "World"; $hw = "Hello $h"; # sets $hw to "Hello World"
    $h = "World"; $hw = "\UHello $h"; # sets $hw to "HELLO WORLD"
showing how \U and similarly \L operate on interpolated variables.
As mentioned, there is NO interpolation for single quoted strings.
There is also no recursion, as illustrated below:
    $fred = "You over there"; $x = '$fred'; $y = "Hey $x"; # sets $y as "Hey $fred" with no interpolation
Use \$ to ensure no interpolation where you need a real $ character
    $fred = "You over there"; $y = "Hey \$fred"; # sets $y as "Hey $fred" with no interpolation
whereas:
    $fred = "You over there"; $y = "Hey $fred"; # sets $y as "Hey You over there" with interpolation used
Use ${var} to remove ambiguity as in
    $y = "Hey ${fred}followed by more characters";

Some Simple Scalar I/O Capabilities (Chapter 2 of Llama Book)

STDIN is a File Handle or Pointer, and <STDIN> is a scalar representing the next line from the input stream.
    $line = <STDIN>; # sets variable $line to next line read from standard input
Unusually, this line always includes the terminating newline and so we have a special function to remove the last character of a string
    chop($line); # removes last character in $line and returns scalar string value of this character
    $nl = chop($line); # should set $nl to "\n" and remove newline from $line
Perl5 adds chomp, which removes a trailing newline only if one is present and is normally used instead of chop.
We can also print scalars with
    print $line;
print is more powerful and we learn about it later, as its argument can be a scalar but is normally a list or array.

Logical Operators

Logical and: && as in $x && $y;
    If $x is true, evaluate $y and return $y
    If $x is false, return $x ($y is not evaluated)
Logical or: || as in $x || $y;
    If $x is true, return $x ($y is not evaluated)
    If $x is false, evaluate $y and return $y
Logical not: ! as in ! $x;
    Return not $x
In Perl5, and is the same as &&, or is the same as || and not is the same as ! but with lower (indeed lowest) precedence.

Arithmetic Operators

+ Addition as in $x + $y;
- Subtraction as in $x - $y;
* Multiplication as in $x * $y;
/ Division as in $x / $y;
% Modulus as in $x % $y; # 10%3 is 1 etc.
** Power as in $x**$y;

Bitwise Logical Operators -- Definitions and Examples
[table not reproduced]

Truth Table for Bitwise Operators
[table not reproduced]

Arrays and Lists of Scalars(data) I (Chapter 3 of Llama Book)

PERL has four types of variables distinguished by their initial character:
Scalars with $ as initial character
File Handles with nothing special as their initial character and conventionally represented by names which are all capitalized
Simple arrays with @ as the initial character
Associative arrays or dictionaries with % as initial character
Arrays, represented by @fred, are defined by a comma-separated list of scalars
    @fred = (1, "second entry", $hw); # is an array with three entries
Array entries can include scalar variables and more generally expressions. These are evaluated each time the statement containing the list is executed and the resulting values are stored, so later changes to the variables do not change the array -- this differs from formulas in a spreadsheet, which are recomputed when used.
    @fred = (1, $hw . " more", $a+$b); # is an example of this

Arrays and Lists of Scalars(data) II -- Construction (Chapter 3 of Llama Book)

There is a list construction operator .. which provides a list of values incremented by 1
    @fred = (1..4); # is list (1,2,3,4)
and
    @fred = (2,3, $a..$b); # is a list starting with 2 and 3 and continuing with the integers from $a to $b, using the values of $a and $b when the statement is executed
We can also use the usual assignment operator = in flexible ways
    @fred = @jeff; # copies list @jeff into @fred
while
    @fred = (4,5,@jeff); # defines a list @fred with two more entries than @jeff

Arrays and Lists of Scalars(data) III -- Construction (Chapter 3 of Llama Book)

More complicatedly we can set constructed lists equal to each other
    ($a, @fred) = @fred; # sets $a to first element of @fred and removes this first element from @fred
    ($a,$b,$c) = (1,2,3); # sets $a=1, $b=2, $c=3
Curiously, setting a scalar equal to an array returns the length of the array
    $a = @fred; # returns $a as length of @fred
whereas the function length returns the number of characters in a string
    $a = length($fred[0]); # returns length in characters of first entry in @fred
(In Perl5, length(@fred) evaluates @fred in scalar context and so gives the number of digits in the element count, not the length of the first entry.)
    ($a) = @fred; # sets two lists equal to each other, but as the LHS has only one element, this instruction sets $a to be the first entry of @fred

Arrays and Lists of Scalars(data) IV -- Element Access (Chapter 3 of Llama Book)

Note that @fred and $fred are totally different variables -- one an array and the other a scalar.
However the elements of @fred are scalars and, as in C, are labelled starting at 0, not 1 (as in Fortran) -- such elements are referenced by $ NOT @
    $a = $fred[0]; # is first element in @fred
which we can set explicitly by
    $fred[0] = "First element of \@fred"; # note we get variable interpolation for arrays as well as scalars and so one must "escape (protect)" @ in a string

Arrays and Lists of Scalars(data) V -- Element Access (Chapter 3 of Llama Book)

One can use slices as in Fortran90
    @fred = (0..10); # defines an array with 11 entries
and
    @jeff = @fred[1..3]; # creates an array @jeff = (1,2,3)
Indices can (of course) be expressions, as in most languages, and these can be any legal scalar
    @fred = (0..10); $a = 2; $b = $fred[$a-1]; # sets $b equal to 1

Arrays and Lists of Scalars(data) VI -- Undefined (Chapter 3 of Llama Book)

When variables are undefined or set to undefined as in
    $a = $b; # and $b has not been defined
they are given the special value undef, which typically behaves the same as the null (character string) or zero (numeric) value.
<STDIN> returns undef when End of File is hit.
    @fred = (0,1,2,3); $a = $fred[6]; # sets $a equal to undef
    @fred = (0,1,2,3); $fred[6] = 7; $a = $fred[5]; # leaves $a, $fred[4] and $fred[5] undefined
    $index = $#fred; # sets $index as index value of last entry in @fred
    $a = @fred; $b = $#fred; # imply that $b = $a-1
Useful functions defined() and exists() will be discussed in Perl5 notes -- they allow precise tests on defined variables.

Arrays and Lists of Scalars(data) VII -- Printing (Chapter 3 of Llama Book)

The argument of print is just a list or array and so
    print "Hello", " Rest", "of World"; # or
    print @fred; # are legal
One can read a whole file into an array, where each entry of the array is one line of the file.
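The array construction and element access rules from the preceding foils can be exercised with a short script. This is a sketch under the assumption of a Perl5 interpreter (it uses my, strict and warnings), and the array names @slice, @rest and @sparse are illustrative only.

```perl
use strict;
use warnings;

my @fred  = (0 .. 10);        # eleven entries, indices 0 to 10
my @slice = @fred[1 .. 3];    # slice: (1, 2, 3)
my $count = @fred;            # array in scalar context gives its length
my $last  = $#fred;           # index of the last entry, always $count - 1

my ($first, @rest) = @fred;   # list assignment peels off the first entry

my @sparse = (0, 1, 2, 3);
$sparse[6] = 7;               # entries 4 and 5 now exist but are undef

print "slice: @slice\n";
print "count=$count last=$last first=$first\n";
print "rest has ", scalar(@rest), " entries\n";
print "entry 5 is ", (defined $sparse[5] ? "defined" : "undef"), "\n";
```

Assigning to $sparse[6] extends the array to seven entries, which is why $#sparse becomes 6 even though only five values were ever stored.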
All of standard input can be read into @file, one line per entry, by:
    @file = <STDIN>; # We will learn how to do this for arbitrary files
As mentioned, double quoted strings use variable interpolation for arrays as in
    $string = "This is a full list @fred \n";
    $string = " First value in fred list is $fred[0] \n";
Slices and variable indices can also be used in variable interpolation.

Arrays and Lists of Scalars(data) VIII -- Operators on Arrays (Chapter 3 of Llama Book)

push adds information at the end of a list (array)
    push(@stack,$new); # is equivalent to @stack = (@stack, $new);
One can also use a list for the second argument(s) in push as in
    push(@stack,6,"next",@anotherlist);
pop is the inverse operator to push: it removes the last element of its argument and returns the value of this last element.
Note chop(@stack) removes the last character of each entry of the list -- unlike pop, which removes the last entry of the list.
unshift is identical to push except that it works on the left (lowest indices) of the list -- not on the end of the list.
shift is identical to pop except that it works on the left (lowest indices) of the list -- not on the end of the list.
reverse(@list) leaves @list unaltered but returns the reversed list.
sort(@list) leaves @list unaltered but returns the sorted list.

Control Structures -- if,else,unless,elsif (Chapter 4 of the Llama Book)

Statement blocks are sets of semicolon separated statements enclosed in curly braces {} as in C
    if ( TESTEXPRESSION ) {
        statement block-true;
    } else {
        statement block-false;
    }
One can leave off the else part, but if the statement block-true branch is null, then it is most elegant to use unless instead
    unless ( TESTEXPRESSION ) {
        statement block-false;
    } else {
        statement block-true;
    }
where again else is optional.
Both if and unless constructs can use elsif constructs between the if/unless and else blocks -- note the spelling of elsif!
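The list operators and the if/elsif/else and unless constructs from the two foils above can be combined in one small script. This is a sketch only: the names @stack, $verdict and $note are illustrative, and my, strict and warnings assume Perl5.

```perl
use strict;
use warnings;

my @stack = (1, 2);
push(@stack, 3, "next");          # @stack is now (1, 2, 3, "next")
my $top = pop(@stack);            # removes and returns "next"
unshift(@stack, 0);               # adds at the left: (0, 1, 2, 3)
my $front = shift(@stack);        # removes and returns 0

my @backwards = reverse(@stack);  # (3, 2, 1); @stack itself is unchanged

my $size = @stack;
my $verdict;
if ($size == 0) {
    $verdict = "empty";
} elsif ($size < 3) {
    $verdict = "short";
} else {
    $verdict = "long";
}

my $note = "stack has entries";
unless (@stack) {                 # an empty array tests as false
    $note = "stack is empty";
}

print "top=$top front=$front backwards=@backwards\n";
print "$verdict: $note\n";
```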
Control Structures -- What is true and false (Chapter 4 of the Llama Book)

In PERL all TESTEXPRESSIONs are converted to strings and:
Either the null string or the string '0' (same as "0") is evaluated as FALSE
Everything else evaluates as TRUE
Results of comparison operators are what you expect
    if ( $age < 18 ) # evaluates as TRUE iff the numeric value of $age is less than 18
Note the number 0 is converted to "0" and is FALSE, as is the numeric computation 1-1.
The string "0.000" evaluates as TRUE.

Control Structures -- while,until (Chapter 4 of the Llama Book)

The simplest iterations are while and until
    while ( TESTEXPRESSION ) {
        some statement block; # Execute if TESTEXPRESSION true
    }
is illustrated by
    while ( <STDIN> ) {
        Process Current Input line; # until end-of-file, when <STDIN> returns undef (which tests as false)
    }
Conversely we can wait for a signal to stop something
    until ( TESTEXPRESSION ) {
        some statement block; # Execute if TESTEXPRESSION false
    }

Control Structures -- for Statement (Chapter 4 of the Llama Book)

    for ( beginning-expression; endtest; doeachloop ) {
        FOR statement-block;
    }
is just like C and equivalent to:
    beginning-expression;
    while ( endtest ) {
        FOR statement-block;
        doeachloop;
    }
For example, we can print the numbers 1 through 10 with:
    for ($index=1; $index <= 10; $index++ ) {
        print $index,"\n";
    }

Control Structures -- foreach Statement (Chapter 4 of the Llama Book)

foreach is similar to the statement of this name in the C-Shell
    foreach $index (@some_list returned by an expression perhaps) {
        statement-block for each value of $index;
    }
$index is local to this construct and is returned to any value it had before the foreach loop executed.
An example that also prints 1 to 10 is
    @back = (10,9,8,7,6,5,4,3,2,1);
    foreach $num (reverse(@back)) {
        print $num,"\n";
    }
In the above case one can write more cryptically (a pathological addiction of UNIX programmers)
    foreach (reverse(@back)) { # plain sort(@back) would NOT do here: it compares as strings and puts 10 before 2
        print $_,"\n"; # If an expected variable ($num here) is omitted PERL uses $_ by default
    }

Associative Arrays -- Definition (Chapter 5 of the Llama Book)

An associative array is a "software implemented" associative memory where you can fetch values by names or attributes or, technically, keys.
An associative array is a set of pairs (key,value). The whole array is referred to as %dict and is typically set with instructions like
    $dict{keyname} = value; # NOTE curly braces {} show that the array is associative
The values can be used in ordinary arithmetic such as
    $math{pi} = 3.14; $math{pi} += .0016; # sets $math{pi}=3.1416; pi or "pi" is allowed for specifying key
If key pimisspelt has not been defined then $math{pimisspelt} returns undef as value and so one can easily see if a particular key has been set.
Alternatively, the function exists($math{pimisspelt}) returns false unless key pimisspelt has been set.

Associative Arrays -- Examples (Chapter 5 of the Llama Book)

One can think of an associative array as a simple relational database with two columns and rows labelled by keys.
For example, they can be used to keep data defined by a MIME or HTTP format message, as these protocols are defined in terms of a set of header statements
    keyname: keyvalue
with for example
    Content-type: text/plain # corresponding to
    $mime{"Content-type"} = "text/plain"; # and so on; the quotes are needed because of the hyphen in the key
Similarly this data-type can be used to store values read in arguments of a UNIX command as these are either of form
    -keyname value # or
    -keyname # just to indicate option set (value = yes or no)

Associative Arrays -- Storage and Access (Chapter 5 of the Llama Book)

The order of storage of pairs in an associative array is arbitrary and nonreproducible.
One cannot push or pop an associative array.
    @listmime = %mime; # produces a list of form (key1,value1,key2,value2 ...)
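The associative array ideas above (setting values by key, testing whether a key is set, and flattening to a list) can be sketched as follows. The script assumes Perl5, where my and exists are available, and the data is only illustrative.

```perl
use strict;
use warnings;

my %math;
$math{pi} = 3.14;
$math{pi} += 0.0016;            # values take part in ordinary arithmetic

my $has_pi   = exists($math{pi})         ? 1 : 0;
my $has_typo = exists($math{pimisspelt}) ? 1 : 0;

# A key containing a hyphen must be quoted.
my %mime = ("Content-type", "text/plain");

my @listmime = %mime;           # flatten to (key1, value1, ...)

print "pi is $math{pi}\n";
print "pi set: $has_pi, pimisspelt set: $has_typo\n";
print "flattened list: @listmime\n";
```

With a single pair the flattened list has a fixed order. With several pairs the keys may come out in any order, although each value still follows its own key.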
The list produced by @listmime = %mime can be manipulated like any list.
One can also create an associative array by defining such a list where adjacent elements are paired, so that in the above example
    %newmime = @listmime; # creates an associative array identical to %mime
One can delete specific pairs by the delete command, so for example:
    %fred = (key1, "one", key2, "two"); # Quotes on key1 optional
    delete $fred{key1}; # leaves %fred with one pair (key2,"two")

Associative Arrays -- Operators: keys, values, each (Chapter 5 of the Llama Book)

keys(%dict) returns a list (conventional array) of the keys in %dict (in arbitrary order). This can be used with the foreach construct
    foreach (keys(%mime)) { # $_ will run through keys
        print "In dictionary we have key $_ as $mime{$_}\n";
    }
values(%dict) is typically less useful. It returns a list of the values (which may be repeated) in associative array %dict, in any order.
each(%dict) returns a single two element list containing the "next" (key,value) pair in %dict. Each call to each(%dict) returns a new such pair until all are cycled through. Finally each will return a null (empty) list. After this, the next call to each will start the cycle again through the entire list of pairs in %dict.

Basic Input (Chapter 6 of the Llama Book)

We have already seen how to read from standard input with
    $line = <STDIN>; # returning next line INCLUDING terminal newline
    @file = <STDIN>; # returning whole file with one line stored in each element of list @file
We can also easily access the arguments of a PERL program.
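The two ways of reading input recalled on this foil (one line in scalar context, all remaining lines in list context) can be tried without a terminal. In this sketch a filehandle opened on a string stands in for STDIN. That device, like my and chomp, assumes a reasonably recent Perl5, and the sample text is made up.

```perl
use strict;
use warnings;

# Stand-in for standard input: a filehandle opened on a string.
my $text = "first line\nsecond line\nthird line\n";
open(my $in, '<', \$text) or die "cannot open in-memory file: $!";

my $line = <$in>;    # scalar context: the next line, newline included
chomp($line);        # remove the trailing newline
my @rest = <$in>;    # list context: all remaining lines, one per element
close($in);

print "first: '$line'\n";
print "remaining lines: ", scalar(@rest), "\n";
```

In a real script the same two statements would read from STDIN in place of $in.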
Suppose you invoke a PERL program makePHD with
    makePHD file1 file2 file3
Then we will see later how to access the individual files file1, file2, file3 using standard argument conventions in UNIX. However the convention <> (Diamond Operator) will access the concatenation of the three files, with all being read into array @argfiles with
    @argfiles = <>;

Basic Output (Chapter 6 of the Llama Book)

We have already seen how to use print to output lists
    print @argfiles; # or you can use parentheses
    print(@argfiles);
One can, as in C, obtain format control with printf, whose first argument is a special purpose format string:
    printf("%10s %6d %10.2f\n", $string, $decimal, $float);
prints three variables with $string in a 10 character string field, $decimal in a 6 character integer field and $float in a ten character field with two decimal places.

Regular Expressions -- Analogy with grep (Chapter 7 of the Llama Book)

Regular expressions should be familiar as they are used in many UNIX commands, with grep as the best known
    grep pattern file # Prints out each line of file containing pattern
The rules for pattern are rich and we will discuss them later -- consider here the simple pattern Fox.
Then we can write the PERL version of grep as follows:
    $line = 0;
    while (<>) {
        if ( /Fox/ ) { # Generalize to /Pattern/ to test positive if Pattern in $_
            print $line, "$_";
        } # $_ is current line by default
        $line++;
    }
Another familiar operator should be s in sed (the batch or stream line editor) where
    s/Pattern1/Pattern2/; # substitutes Pattern1 by Pattern2 in each line
The same command can be used in PERL, with substitution again occurring on $_.

Regular Expressions -- Patterns (Chapter 7 of the Llama Book)

Simple single-character patterns are:
A single explicit character, e.g. a
A dot . which matches ANY character except newline \n
A character class, which is a single-character pattern represented as a set [c1c2c3...cN] which matches any one of the listed characters
    [ABCDE] matches A B C D or E
    [0-9] is the same as [0123456789]
    [a-zA-Z] matches any lower or upper case letter
A negated character class is represented by a caret ^ after the left [ square bracket
    [^0-9] matches any character which is NOT a digit 0 1 2 3 4 5 6 7 8 9

Backslash Escapes (Chapter 2 of the Llama book)
[table not reproduced]

Predefined Character Classes in Regular Expressions (Chapter 7 of the Llama Book)

\d digits [0-9]
\D NOT digits [^0-9]
\w word characters [a-zA-Z0-9_] -- really anything legal in a PERL variable name after the $ % @ indicators of a variable
\W NOT word characters [^a-zA-Z0-9_]
\s white space [ \r\t\n\f]
\S NOT white space [^ \r\t\n\f]
Note \b indicates a break between \w and \W

Grouping Patterns in Regular Expressions (Chapter 7 of the Llama Book)

Sequence is c1c2c3.. -- a sequence of single characters
* or {0,} is "zero or more" of previous character
+ or {1,} is "one or more" of previous character
? or {0,1} is "zero or one" of previous character
All matching is greedy -- the multipliers maximize the number of characters "eaten up", starting with the leftmost match.
In Perl5 one can follow a multiplier with ? to instruct Perl5 to find the smallest match (first occurrence) so that .*?: matches up to the first : in the line while .*: matches up to the last : in the line.
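The difference between greedy and minimal matching can be seen on a line containing two colons. This sketch uses parentheses to pull out the text matched by .* (parentheses as "memory" are covered on a later foil), and the header line is made up for illustration. The .*? form needs Perl5.

```perl
use strict;
use warnings;

my $header = "Subject: Re: PERL foils";

# Greedy: .* takes as much as possible, so the match ends at the LAST colon.
my ($greedy) = $header =~ /^(.*):/;

# Minimal (Perl5 only): .*? stops at the FIRST colon.
my ($minimal) = $header =~ /^(.*?):/;

print "greedy:  '$greedy'\n";
print "minimal: '$minimal'\n";
```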
Curly Brace Notation:
    c{n1,n2} means from n1 to n2 instances of character c
    c{n1,} means n1 or more instances of character c
    c{n1} means exactly n1 instances of character c
    c{0,n2} means n2 or fewer instances of character c

Anchoring and Alternation in Regular Expressions (Chapter 7 of the Llama Book)

For single characters, alternates can be specified by square brackets with [abc] meaning a or b or c.
For general strings one can use | to represent or, so that the above example can also be written
    a|b|c means a or b or c
but this operator can be generalized to longer sequences so that the 1995 CPS616 instructor can be written
    Fox|Furmanski
or if we can't spell Polish names
    Fox|Furmansk(i|y|ie) # See later for use of parentheses
Patterns can be anchored in four ways:
/^Keyname:/ matches Keyname: ONLY if it starts the string -- ^ only has this special meaning at the start of a regular expression
/Quit$/ matches Quit ONLY if it ends the string -- $ only has this meaning if at the end of a regular expression
\b matches a word (PERL/C variable) boundary so that /Variable\b/ matches Variable but not Variables (inside a [] construct, \b means a backspace as described earlier)
\B matches NOT a word boundary so that /Variable\B/ matches Variables but not Variable

Parentheses in Regular Expressions (Chapter 7 of the Llama Book)

Parentheses can be used as "memory" for relating different parts of a match or for relating a substitution to a match.
If a part of a regular expression is enclosed in parentheses, the MATCHED value is stored in temporary variables \1 \2 .. for the first, second .. set of parentheses.
/Geoffrey(.*)Fox/ when matched to Geoffrey Charles Fox stores \1 = ' Charles ' which can be transferred to the replacement part of a substitution, as in s/Geoffrey(.*)Fox/Geoffrey ($1) Fox/, for the result Geoffrey ( Charles ) Fox
Note ONLY use \1 \2 etc. inside the pattern.
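The anchoring, alternation and parenthesis rules above can be checked against a few sample strings. This is a sketch assuming Perl5 (for my and the copy-then-substitute idiom). The test strings other than the name are invented for illustration.

```perl
use strict;
use warnings;

my $name = "Geoffrey Charles Fox";

# \1 inside the pattern: the same word must appear twice in a row.
my $doubled = ("the the foils" =~ /\b(\w+) \1\b/) ? 1 : 0;

# The remembered text is reused in the replacement part as $1.
(my $bracketed = $name) =~ s/Geoffrey(.*)Fox/Geoffrey ($1) Fox/;

# Alternation with both anchors: the whole string must be one of the names.
my $is_instructor = ("Furmansky" =~ /^(Fox|Furmansk(i|y|ie))$/) ? 1 : 0;

# \b needs a word boundary, and there is none between "Variable" and "s".
my $plural_hit = ("Variables" =~ /Variable\b/) ? 1 : 0;

print "doubled=$doubled instructor=$is_instructor plural=$plural_hit\n";
print "$bracketed\n";
```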
Use $1 $2 outside the pattern.
Parentheses can also be used to clarify the meaning of a regular expression by defining the precedence of a set of operations and so distinguish for instance /(a|b)*/ from /a|(b*)/
There is a definite convention for precedence but as usual I recommend using parentheses, and in Perl5 (later) we will see how to distinguish the use of parentheses for clarification from their use for defining matched groups.

The Matching Operator in Regular Expressions - I ( =~, m) (Chapter 7 of the Llama Book)

We have finally finished the study of regular expressions and have illustrated this for the substitution operator (s) acting on the default variable $_. We can generalize this operation in many ways.
The result of ( Variable Name =~ /Regular Expression/ ) is true if and only if the value of Variable Name matches Regular Expression.
For example
    if ( <STDIN> =~ /^(T|t)(O|o):/ ) { # the line read from <STDIN> is matched directly
        ..; # Process to: field of mail
    } # matches if current input line contains to: with any case at start of line
There is an implied match operator above which we can make explicit with m
    $line =~ m/^(T|t)(O|o):/
and we can use m to change the delimiter from / to any character
    $line =~ m%^(T|t)(O|o):% # is equivalent to previous statement
Note m/^to:/i is equivalent to the above, as the modifier i instructs the pattern match to ignore case.

The Matching Operator in Regular Expressions - II -- Variable Interpolation; i,g options; general substitution (Chapter 7 of the Llama Book)

Variables may be used in regular expressions and are interpolated as in usual double quoted strings. Use \$ to represent a real dollar, except at the end of the pattern where an unescaped $ safely represents the end of string anchor.
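The forms of the match operator described above can be compared side by side. In this sketch the sample strings and variable names are hypothetical, and my, strict and warnings assume Perl5.

```perl
use strict;
use warnings;

my $line = "To: CPS616 class";   # hypothetical mail header line

my $explicit  = ($line =~ m/^(T|t)(O|o):/) ? 1 : 0;   # explicit m
my $delimited = ($line =~ m%^(T|t)(O|o):%) ? 1 : 0;   # % as delimiter
my $nocase    = ($line =~ /^to:/i)         ? 1 : 0;   # i ignores case

# Variables are interpolated into the pattern before matching.
my $field = "to";
my $interpolated = ($line =~ /^$field:/i) ? 1 : 0;

# \$ is a literal dollar; the final unescaped $ anchors the end of string.
my $price  = "cost \$5";
my $dollar = ($price =~ /\$5$/) ? 1 : 0;

print "$explicit $delimited $nocase $interpolated $dollar\n";
```

All five tests succeed, which shows that the explicit m, the alternative delimiter and the i modifier are interchangeable ways of writing the same match.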
In the match /regexp/i, the i instructs one to ignore case in the match.
In the substitution s/regexp1/regexp2/g, the g instructs substitution to occur at all possible places in the string -- normally only the first match in a string is replaced.
i and g can be used together.
    $line =~ s/regexp1/regexp2/; # Illustrates how we use substitution s on a general variable
As with m, s can use any delimiter and so
    $line =~ s#regexp1#regexp2#; # is an equivalent form

The Matching Operator in Regular Expressions - III -- \1 $1 $` $& and $' etc. (Chapter 7 of the Llama Book)

We have defined \1, \2, \3 .. as variables set by parentheses and used internally to a match. These variables are available outside the regular expression operation with conventional PERL names $1 $2 $3 etc. Use the latter even in the replacement part of a substitution.
In a string matched in part by a regular expression, we can identify three parts
    $` is the variable holding the part of the string BEFORE the matched part
    $& is the variable holding the part of the string matched by the regular expression
    $' is the variable holding the part of the string AFTER the matched part
So the string is the concatenation $` . $& . $'

Some Regular Expression Examples

/\s0(1+)/ matches "white space", followed by zero and 1 or more ones -- the set of ones is stored in \1 ($1)
/[0-9]\.0\D/ matches "the answer is 1.0 exactly" but not "The answer is 1.00". In the first case $` is "the answer is ", $& is "1.0 " and $' is "exactly"
/a.*c.*d/ matches "axxxxcxxxxcdxxxxd" with $` and $' as null and $& as the full string
/(a.*b)c.*d/ matches "axxxxbcxxxxbd" with \1 as "axxxxb" -- note backtracking, as the greedy (a.*b) first matches "axxxxbcxxxxb" but then tries again when the following c.*d fails to match

Split and Join Operators (Chapter 7 of the Llama Book)

split takes a line and splits it into parts which are separated by a delimiter defined as any regular expression. For example
    @fields = split(/\s/,$line); # splits string $line into several fields stored in $fields[0] $fields[1] etc., where these fields were separated by white space (\s) in $line
join inverts the operation, although the join string must now be an ordinary single or double quoted string and not a regular expression, as no matching is occurring!
    $line = join(" \t", @fields); # rebuilds $line with space and tab as separator

index and rindex (Chapter 15 of Llama Book)

    $loc = index($string,$substr); # returns in $loc the location (first character in $string is location 0) of the first occurrence of $substr in $string. If $substr is not located, -1 is returned
    $loc = index($string,$substr,$firstloc); # will return $loc which is at least as large as $firstloc
Use this to find multiple occurrences, setting $firstloc as 1 + previously found location.
rindex($string,$substr,$lastloc) is identical to index except that scanning starts at the right (end) of the string and not at the start. All locations still count from the left, but if you give a third argument $lastloc, the returned $loc will be at most $lastloc in value.

substr (Chapter 15 of Llama Book)

    $partstring = substr($string,$start,$length); # returns in $partstring the partial string starting at position $start (=0 for first character in $string) and of at most $length characters
Missing out $length, or giving a huge value for $length, returns all characters from the starting position to the end of $string.
Negative values of $start count backwards from the end character in $string
    $endchar = substr($string,-1,1); # returns last character in $string
    substr($string,$start,$length) = $new; # replaces the extracted substring with the characters in $new, which need not be of the same length as the original
    $class = "CPS600"; substr($class,3) = "616"; # leads to $class="CPS616"

Functions or Subroutines - I (Chapter 8 of the Llama Book)

Functions are defined by the sub construct
    sub itsname {
        statements;
        expression defining returned result;
    }
They are invoked by the & construct
    $sum = &add; # a simple routine with no arguments
    sub add {
        $a1+$a2+$a3; # Sum three global variables returning this expression
    }
Note & becomes optional in PERL5 -- see advanced foilset

Functions or Subroutines - II (Chapter 8 of the Llama Book)

Arguments can be used and a comma separated list can be used as the calling sequence.
One can write for subroutines or functions either (replace subname by any Perl function)
    subname LIST
or
    subname(LIST); # Parentheses are optional
This list can be accessed in the function using the array (list) @_ with elements $_[0], $_[1] etc.
    $sum = &add($a1,$a2,$a3); # a similar routine with arguments (can be variable in number as a list)
    sub add {
        $_[0]+$_[1]+$_[2]; # sum three arguments
    }

Functions or Subroutines - III -- The local and my constructs (Chapter 8 of the Llama Book)

The construct local defines variables which are local (private) to a particular function.
For example the routine on the following foil, invoked by
    @new = &bigger_than(100,@list);
returns in @new all entries in @list which are bigger than 100.
local() is an executable statement -- not a declaration!
The first two statements in bigger_than can be replaced by:
    local($test,@values) = @_; # local() returns an assignable list
In Perl5, my tends to replace local, as the scope of my is confined to the routine, but local extends scope to any function called from the block in which the local statement is defined.
Note one can use my/local in any block (not just a function) enclosed in { ... } to define temporary variables of limited scope.

Functions or Subroutines - IV -- An Example (Chapter 8 of the Llama Book)

    sub bigger_than {
        local($test,@values); # Create local variables for test number and original list
        ($test,@values) = @_; # Split argument list and give nicer names -- see previous foil for nicer notation!
        local(@result); # A place to store and return result
        foreach $val (@values) { # Step through argument list
            if( $val > $test ) { # Should we add this value
                push(@result,$val); # add to result list
            }
        }
        @result; # Required to specify what is to be returned
    } # Could be pedantic and write return @result

cmp <=> Binary Equality Operators (Chapter 15 of Llama Book)
We have already seen the equality operators
    == , != for numerically equal, unequal
    eq , ne for stringwise equal, not equal
$a <=> $b returns -1,0,1 depending on whether $a is respectively numerically less than, equal to or greater than $b
$a cmp $b returns -1,0,1 depending on whether $a is respectively stringwise less than, equal to or greater than $b

Sorting with various criteria (Chapter 15 of Llama Book)
sort() is a builtin PERL function with three modes:
    @result = sort @array; # equivalent to sort { $a cmp $b } @array; sorts the values in @array using stringwise comparisons, returning them in @result
    @result = sort BLOCK @array; # where the statement BLOCK enclosed in {} curly brackets returns -1, 0 or 1 given values of $a and $b
    @result = sort { $age{$a} <=> $age{$b} } @array; # sorts by age if entries in @array are keys to associative array %age which holds numeric age for each key
    @result = sort SUBNAME @array; # uses a subroutine (which can be specified as the value of a scalar variable) to perform sorting
    sub backsort { $b <=> $a; } # Reverse order for integers
    @result = sort backsort @array; # sorts in numerically decreasing order

The tr translation operator (Chapter 15 of Llama Book)
tr/ab/XY/ translates a to X and b to Y in string $_
As for m and s, one can apply tr to a general string with =~
    $string =~ tr/a-z/A-Z/; # translates letters from lower to upper case in $string
Note the use of - to specify a range as in regular expressions, although tr does NOT use regular expressions
tr can count and returns the number of characters matched
    $numatoz = tr/a-z//; # $numatoz holds number of lower case letters in $_
if the final string is empty, no
substitutions are made
if the second string is shorter than the first, the last character in the second string is repeated
    tr/a-z/A?/; # replaces a by A and all other lower case letters by ?
if the d option is used, matched characters with no corresponding translation are deleted
    tr/a-z//d; # deletes all lower case letters
the c option complements the characters in the initial string
    tr/a-zA-Z/_/c; # translates ALL nonletters into _
the s option squeezes multiple consecutive copies of any letter in the final string and replaces them by a single copy.

Additional Control Flow Constructs I (Chapter 9 of the Llama Book)
last, next and redo provide simple ways to alter execution flow in loops
    while (something is tested) {
        # redo goes to here
        somecalc1;
        if (somecondition) {
            somecalc2;
        }
        somecalc3;
        # next goes to here
    }
    # last jumps to here
These commands control the innermost enclosing for, foreach or while loop
    last jumps out of the loop
    next jumps over the rest of the loop
    redo jumps back to the start of the loop

Additional Control Flow Constructs II -- Statement Labels and next,last,redo (Chapter 9 of the Llama Book)
next, redo and last can jump to labelled statements with commands such as
    next LABEL1; # or
    last LABEL2; # or
    redo LABEL3;
    LABEL1: This is start of any statement block; # is typical statement label
This is typically used to allow you to jump out over several nests of loops

Additional Control Flow Constructs III -- Accelerated Tests (Chapter 9 of the Llama Book)
There is a set of ways of doing simple tests which need fewer curly braces and other punctuation
    expr1 if testexp; # is equivalent to
    if (testexp) { expr1; }
last, redo and next can be followed by such tests e.g.
    last DOREALWORK if userendofinitializationhit;
There are similar abbreviations for unless, while, until
    dothisexpression unless conditionholds;
    dostandardstuff while normalconditionholds;
    dostandardstuff until specialconditionseen; # should be self explanatory

Additional Control Flow Constructs IV -- && || and ?
(Chapter 9 of the Llama Book)
    thatcommand if thiscondition; # is equivalent to
    thiscondition && thatcommand;
because PERL will not continue with && (logical and) if it finds a false condition. So if thiscondition is false, thatcommand is not executed
Similarly:
    thatcommand unless thiscondition; # is equivalent to
    thiscondition || thatcommand;
Note that in Perl5 one can use and instead of && and or instead of ||; not (instead of !) and xor (logical exclusive or) are also allowed
We can use a C like expression
    expression ? Truecalc : Falsecalc; # which is equivalent to
    if (expression) { Truecalc; } else { Falsecalc; }

FileHandles I -- open close die (Chapter 10 of Llama Book)
Filehandles, like statement labels, are designated by a string without a special initial character. It is recommended that you use all capitals in such labels
STDIN STDOUT STDERR (and diamond <> null name) have been introduced and correspond to UNIX stdin, stdout and stderr (and concatenation of argument files for the <> operator)
Filehandles allow you to address general files and the syntax is similar to UNIX standard I/O (stdio.h) support
    open(FILEHANDLE,"unixname"); # opens file unixname for reading -- can use <
    open(FILEHANDLE,">unixname"); # opens file unixname for writing
    open(FILEHANDLE,">>unixname"); # opens file unixname in append mode
    close(FILEHANDLE); # closes file
Errors can be handled with the die construct
    open(FH,'>'.$criticalfile) || die("Print an error message if file can't be opened\n"); # Note how we add '>' (or '<', '>>') to file name stored in Perl variable

Using FileHandles and Testing Files (Chapter 10 of Llama Book)
As illustrated, <FILEHANDLE> reads either a single line or the full file depending on whether one stores it in a scalar or an array
    print FILEHANDLE list; # writes list onto FILEHANDLE
and the simple
    print list; # is equivalent to print STDOUT list;
There is a whole set of test operators which act on file NAMES not FileHANDLES
    -e $filename returns true if $filename EXISTS
    -r $filename returns true if $filename is READABLE
    -w
$filename returns true if $filename is WRITABLE
    -x $filename returns true if $filename is EXECUTABLE

The Perl EOF Syntax
Very convenient for text output is the syntax
    print FILEHANDLE <<"EOF";
    $title

    $title

    EOF
Here EOF is an arbitrary string to denote end of data
Note that variables are interpolated in this syntax, which is equivalent to a "" form which is less clear!
    print FILEHANDLE "\n$title\n\n"; # etc.

PERL as a Practical Extraction and Report Language
PERL is designed to produce simple reports with a format that allows close control over what appears where on the output page. The general syntax is reminiscent of Fortran!
We need to describe format definitions, the write function that outputs data through a format, and a rather peculiar way of associating formats with files
The syntax is a little cleaner in PERL5 where filehandles are almost true objects with methods. Here specify
    use English; # to specify that both long understandable and short cryptic names are allowed
    use FileHandle; # which allows you to set variables associated with FILEHANDLE by
    method FILEHANDLE EXPR; # to set method=EXPR or alternatively use
    FILEHANDLE->method(EXPR); # where allowed methods are many such as open, close, format_name, seek ......

Format Definitions (Chapter 11 of Llama Book)
    format FORMATNAME =
    fieldline (called picture line in Perl Manual)
    value1, value2, value3 ...
    fieldline
    value1, value2, value3 ...
    etc
    .
The terminal dot as first character of a line terminates the format definition
FORMATNAME is the label of this format and in the simplest case one uses a format label which is identical to that of the FILEHANDLE on which we wish to output
fieldlines specify fixed text as well as places and formats to print data which are listed as Perl variable names on the following valueline. Clearly white space is significant in a fieldline but not in the associated value line.
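The simplest case described above, a format whose name is identical to that of its filehandle, can be sketched as a complete script. This is only an illustration under stated assumptions: the handle name REPORT, the variables $item and $count and the sample values are invented here, and opening a filehandle on a scalar needs a Perl5 interpreter (5.8 or later), not Perl4.

```perl
use strict;
use warnings;

# Hypothetical report data; 'our' makes the variables visible to the format.
our ($item, $count);
my $buffer = '';

# Assumption: Perl 5.8+ lets us open a filehandle on a scalar,
# which keeps the sketch self-contained (no file on disk).
open(REPORT, '>', \$buffer) || die("cannot open in-memory handle\n");

# The format has the same name as the filehandle, so write uses it by default.
format REPORT =
Item: @<<<<<<<<< Count: @>>>
$item,           $count
.

($item, $count) = ("llama", 3);
write(REPORT);                 # one output line per write
($item, $count) = ("camel", 12);
write(REPORT);
close(REPORT);

print $buffer;
```

White space in the picture line is copied to the output, whereas the spacing between $item and $count on the value line is only for readability.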
Example of a Format Definition (Chapter 11 of Llama Book)
    $~ = "ADDRESSLABEL"; # sets format for current FILEHANDLE to ADDRESSLABEL
    $FORMAT_NAME = "ADDRESSLABEL"; # if use English
    FILEHANDLE->format_name("ADDRESSLABEL"); # if use FileHandle
    format ADDRESSLABEL =
    ====================================
    | @<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<< |
    $name
    | @<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<< |
    $address
    | @<<<<<<<<<<<<<<<<, @< @<<<<<<<<<<<<<< |
    $city, $state,$zip
    ====================================
    .
@ followed by N <'s specifies a left justified field with N+1 characters in it.
    write; # outputs current values of $name,$address,$city,$state,$zip into 5 line template on currently selected file.

Basic Text and Numeric Fieldholders (Chapter 11 of Llama Book)
Text fields are specified in "picture lines" by
    @<<<<<< Left Justified characters
    @>>>>>> Right Justified Characters
    @||||||||||| Centered Characters
Fields that are too long are truncated, those too short padded
Character count includes the @ sign (@ on its own is one character long)
Numeric fields are @####.### with . specifying position of decimal point.

Multiline Format Fields and Expressions (Chapter 11 of Llama Book)
The multiline text field designator @* will print a stream of characters with newlines output wherever they are in the specified variable
Note here and in other variable specifications you can replace the scalar variable by any expression returning a scalar value. This could involve calling an optimized subroutine indicated by &subname
Here one can use sprintf(formatstring,value1,value2,..), which returns a formatted string using the format familiar from C, to output numeric data in specialized fashion
    @>>>>>>>>
    sprintf("%8.2f",$val)
    .
is equivalent to
    @#####.##
    $val
    .

Filled Fields (Chapter 11 of Llama Book)
A filled field is designed for text which is to be broken at word boundaries but could be of arbitrary length.
Here such text is specified by an initial caret:
    ^<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
    $text
    ^<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
    $text
    ^<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
    $text
    .
where the number of <'s plus one is the maximum size of the field on each line
Note how one can repeat the input variable so that Perl will process $text in the above example over three output lines. Characters output on one line are removed from $text and the remaining part is passed to the next line
The special ~~ construct
    ~~ ^<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
    $text
    .
will repeat the pictureline containing ~~ until the associated variable (here $text) is exhausted

Top of Page and its Format (Chapter 11 of Llama Book)
Each write statement starts a new page (outputs top of form -- see later) if its output won't fit on the remainder of the current page
You can print page numbers because $% or $FORMAT_PAGE_NUMBER holds the current page number
Further you can specify a special format, defaulted to FILEHANDLE_TOP, to be used for top of page. This could be:
    format STDOUT_TOP =
    This is a new page of my award winning CPS616 project @<<
    $%
    .
$^ or $FORMAT_TOP_NAME holds the name of the format to be used at top of page

Default Filehandles and Formats (Chapter 11 of Llama Book)
print and write without specific file handles output on the currently selected filehandle.
select(FILEHANDLE_NEW) sets the new filehandle FILEHANDLE_NEW and returns the old (current) filehandle.
    $oldhandle = select(NEWHANDLE);
    $~ = "Newformat"; # or $FORMAT_NAME instead of $~ if use English
    select($oldhandle); # ingeniously sets "Newformat" to be format associated with NEWHANDLE
One can have lots of FILEHANDLES and lots of formats but each FILEHANDLE is only associated with one format at a time.

Page Limits and Positions (Chapter 11 of Llama Book)
$= or $FORMAT_LINES_PER_PAGE is the number of lines to be output on each page
It can be changed in the same way already described for $~.
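As a hedged sketch of that select idiom, the following script sets both $~ and $= for a handle that is not currently selected and then restores the original selection. The handle OUT, the format name SUMMARY, the variables and the page length of 40 are all invented for illustration, and the in-memory filehandle assumes Perl 5.8 or later.

```perl
use strict;
use warnings;

our ($label, $value);          # hypothetical variables used by the format
my $buffer = '';

# Assumption: in-memory filehandle (Perl 5.8+) so nothing is written to disk.
open(OUT, '>', \$buffer) || die("cannot open in-memory handle\n");

format SUMMARY =
@<<<<<<<<< @##.##
$label,    $value
.

my $oldhandle = select(OUT);   # OUT becomes the currently selected handle
$~ = "SUMMARY";                # format name, set for OUT only
$= = 40;                       # lines per page, set for OUT only
select($oldhandle);            # restore the previous handle (normally STDOUT)

($label, $value) = ("total", 12.5);
write(OUT);                    # uses SUMMARY although the names differ
close(OUT);

print $buffer;
```

Because $~ and $= are stored per filehandle, the settings made while OUT was selected stay attached to OUT after the original handle is restored.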
The FileHandle module can be invoked so that
    $= = 66; # is equivalent to
    format_lines_per_page FILEHANDLE 66; # if FILEHANDLE is currently selected
$- or $FORMAT_LINES_LEFT is the number of lines left on the page for the currently selected output channel
Reset $- to zero to force top of form
Note $^L or $FORMAT_FORMFEED, which defaults to \f and is the formfeed to use at top of page

Some Special Capabilities in formatted writes
Note $| or $OUTPUT_AUTOFLUSH, which if nonzero forces a flush after every write or print on the current output channel. Default is 0.
Note $: or $FORMAT_LINE_BREAK_CHARACTERS, which is the set of characters on which to break when processing filled continuation lines (caret ^ format); the default is " \n-" to break on space, newline or hyphen
Related is $/ or $INPUT_RECORD_SEPARATOR or $RS (defaulted to newline), which is very useful when processing HTML where newlines are irrelevant: you set $/ to say < or > to scan to the next tag or end of tag and ignore newlines. This is valid in conventional syntax
$^A or $ACCUMULATOR is an accumulator which holds the results of the write command. This is emptied after write finishes a format, but you can access it directly through the formline() function described in the PERLFUNC and PERLFORM manpages. The syntax is
    formline(PICTURELINE,LIST);
which takes a LIST of variables and outputs them into $^A according to PICTURELINE

Globbing (Chapter 12 of Llama Book)
We use * notation in the shell to match sets of files -- this is NOT the same as a regular expression, as the glob * corresponds to the regular expression .* except that normally files beginning with . are not matched by a simple glob
Presumably glob is "short" for globalize not globular
    @a = <PATTERN>; # returns a list (one file per element of @a) of files matching the globbed specification PATTERN
For example
    @a = <*cps616*>; # returns all files in current directory with string cps616 somewhere in their name.
Variable interpolation is allowed in globbing e.g.
    $home="~gcf"; # gcf's home directory is ~gcf
    @a = <$home/*>; # returns all non initial .
files in gcf's home directory

Directory Access (Chapter 12 of Llama Book)
    chdir($name); # transfers to directory specified in $name
    mkdir($name, MODE); # makes directory with given name $name and MODE (typically three octal digits written with a leading zero, such as 0755)
    opendir(DIRHANDLE,$name); # opens directory with directory handle DIRHANDLE.
Such names can be assigned independently of all other names and are in particular not connected with FILEHANDLEs
    closedir(DIRHANDLE); # closes directory associated with handle DIRHANDLE
    readdir(DIRHANDLE); # returns file names (including . and ..) in directory with handle DIRHANDLE
If scalar result, readdir returns the "next" file name
If array result, readdir returns all file names in the directory

Execution of UNIX Commands -- system (Chapter 15 of Llama Book)
    system("shellscript"); # dispatches shellscript to be executed by /bin/sh, and anything allowed by the shell is allowed in the argument
system returns the code returned by shellscript
    system("date > tempfil"); # executes UNIX command date, sending standard output from date to file tempfil in current directory
    system("rm *") && die ("not allowed\n"); # terminates if error in system call, as shell programs return nonzero on failure (opposite of open and most PERL commands)
Variable interpolation is done in double quoted arguments and so one can include Perl variables in arguments of system
    $prog="nobel.c"; system("cc -o $prog"); # (I) is equivalent here to
    $ccompiler="cc"; system($ccompiler,"-o","nobel.c"); # (II)
but in general these are not identical, as in the first form (I) the shell interprets the command list but in the second form (II) the arguments are handed directly to the command given in the first entry of the list given to system

Processing the Environment %ENV (Chapter 15 of Llama Book)
%ENV is set to the shell environment in which the Perl program was invoked
Any UNIX processes invoked by system, fork, backquotes or open inherit an environment specified by %ENV at invocation of the child process.
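The inheritance of %ENV by child processes can be checked with a short sketch like the one below. It assumes a UNIX-like system with /bin/sh, uses backquotes (one of the mechanisms just listed) to run the child, and the variable name CPS616_DEMO is invented purely for the demonstration.

```perl
use strict;
use warnings;

# Hypothetical variable name used only for this demonstration.
$ENV{"CPS616_DEMO"} = "hello from parent";

# The backquoted command is run by /bin/sh, which inherits the current %ENV.
# \$ stops Perl from interpolating, so the shell expands the variable itself.
my $seen = `echo "\$CPS616_DEMO"`;
chomp($seen);
print "child saw: $seen\n";

# Deleting the key removes it from the environment of later children.
delete $ENV{"CPS616_DEMO"};
my $after = `echo "\$CPS616_DEMO"`;
chomp($after);
print "after delete: '$after'\n";
```

The environment is copied when each child starts, so a child never sees changes made to %ENV after it was launched.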
One can change %ENV in the same way as any associative array
    %ENVIN = %ENV; $oldpath = $ENV{"PATH"}; # saves input environment
    $ENV{"PATH"} = $oldpath . ":/web/cgi"; # resets PATH to include an extra directory to be used by child process -- later we run
    %ENV = %ENVIN; # Restores original environment
One can see what has been passed in %ENV by using the Perl keys function
    foreach $key (sort keys %ENV ) {
        print "$key=$ENV{$key}\n"; # both $key and $ENV{$key} are interpolated
    }

Execution of UNIX Commands -- backquotes (Chapter 15 of Llama Book)
    $now = "Todays date: " . `date`; # sets $now to be the specified label followed by result of shell's date invocation
`who` would naturally return a set of lines and the result can be stored into an array -- one array entry for each output line
Both the system and backquote mechanisms invoke a shell command which normally shares standard input, standard output and standard error with the Perl program (except that backquotes capture the command's standard output)
This can be reset as for instance in
    `rm fred 2>&1`; # using shell syntax to send standard error to same place as standard output

Execution of UNIX Commands -- Filehandle Mechanism (Chapter 15 of Llama Book)
    open(WHOHANDLE, "who|"); # opens WHOHANDLE for reading output of system call to who
the | at right means we will be able to treat the output of who as though we were reading it as a file
    @whosaid = <WHOHANDLE>; # defines an array whosaid holding output of who command
    open(LPRHANDLE,"|lpr -Pgcf"); # with | at left opens lpr process so that if we write to filehandle LPRHANDLE it is as though we handed file to input of lpr
    print LPRHANDLE "This is a test\n"; # for example
    close(LPRHANDLE); # waits until lpr command has finished and closes handle

Execution of UNIX Commands -- fork and exec (Chapter 15 of Llama Book)
This is the most powerful method, with fork creating two identical copies of the program -- parent and child
    unless (fork) { child code; } # child indicated by fork=0
    parent code; # otherwise fork=child process number for parent
The child program typically invokes exec which replaces
child original by the argument of exec. Meanwhile the parent should wait until this exec is complete and the child has gone away.
    unless (fork) {
        exec("date"); # child process becomes date command sharing environment with parent
    }
    wait; # parent process waits until date is complete
The child process need not be replaced by another program as with exec(); if the child code was for instance
    print FILEHANDLE @hugefile; # in parallel with parent
    exit; # is required, else child will continue with parent's code whereas we wanted parent and child to work in parallel on separate jobs

Signals, Interrupt Handlers, kill (Chapter 15 of Llama Book)
The associative array %SIG is used to define signal handlers (subroutines) used for various signals.
The keys of %SIG are the UNIX names with the initial SIG removed. For instance, to set handler() as the routine that will handle SIGINT interrupts do something like:
    $SIG{'INT'} = 'handler';
    sub handler { # First argument is signal name
        local($sig) = @_;
        print("Signal $sig received -- shutting down\n");
        exit(0);
    }
    kill $signum, $child1, $child2; # sends interrupt $signum to process numbers stored in $child1 and $child2
$signum is the NUMERICAL label (2 for SIGINT) and $child1,2 the child process numbers as returned by fork or open(PROCESSHANDLE,..) to the parent

The eval Function and Indexed Arrays of Associative Arrays
As in many interpreters, PERL allows you to generate a line of code and hand it to the interpreter using the eval function (JavaScript is similar)
Suppose you had two arrays $fred[$index] and $jim[$index] and you wanted to load them given the value of $index and an ascii string $name (which could have been read in) taking value 'fred' or 'jim'. This can be achieved by:
    eval('$' . $name . '[' . $index . '] = $value;');
eval returns the result of evaluating (executing) its argument as PERL script and continues
In this case, you can achieve the same results with indexed associative arrays:
    $options[$index]{$name} = $value;
using the multidimensional array notation introduced in PERL5
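Both approaches can be compared side by side in a small Perl5 sketch. The sample values, the guard on $name and the error check on $@ are additions made here for illustration. The whole assignment is placed inside the evaluated string, since the value returned by eval cannot itself be assigned to.

```perl
use strict;
use warnings;

our (@fred, @jim);             # the two arrays named in the text
my ($name, $index, $value) = ("fred", 2, "hello");   # invented sample inputs

# Guard (an addition): only accept expected names before building code.
die("unexpected name $name\n") unless $name =~ /^(fred|jim)$/;

# Build the text  $fred[2] = $value;  and execute it. $value stays inside
# single quotes, so eval sees the variable itself, not its contents.
eval('$' . $name . '[' . $index . '] = $value;');
die("eval failed: $@") if $@;

# Perl5 alternative: an array whose elements are associative arrays.
my @options;
$options[$index]{$name} = $value;

print "eval form:  $fred[2]\n";
print "array form: $options[2]{fred}\n";
```

The second form avoids compiling code at run time, so it is usually the safer choice when $name comes from user input such as a CGI form.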