Larry Wall’s Practical Extraction and Reporting Language (PERL) was originally developed in 1987 as a general-purpose Unix scripting language that borrowed features from C, sh, awk, sed, BASIC, and LISP. In the late 1990s, before PHP became more popular, PERL was commonly used for CGI scripting. PERL is still the go-to tool for many sysadmins who need something more powerful than sed or awk when writing complex parsing and automation scripts. It has a somewhat high learning curve due to its dense notation. But a recent survey indicates that PERL developers earn 54 per cent more than the average developer. So it may still be a worthwhile language to learn.

PERL is far too complex to cover in any significant detail in this magazine. But this short series of articles will attempt to demonstrate a few of the most basic features of the language so that you can get a sense of what the language is like and the kind of things it can do.

An example PERL program

PERL was originally a language optimized for scanning arbitrary text files, extracting information from those text files, and printing reports based on that information. To demonstrate how this core feature of PERL works, a very simple Tic-Tac-Toe game is provided below. The below program scans a textual representation of a Tic-Tac-Toe board, extracts and manipulates the numbers on the board, and prints the modified result to the console.

00 #!/usr/bin/perl 01 02 use feature 'state'; 03 04 use constant MARKS=>[ 'X', 'O' ]; 05 use constant BOARD=>' 06 ┌───┬───┬───┐ 07 │ 1 │ 2 │ 3 │ 08 ├───┼───┼───┤ 09 │ 4 │ 5 │ 6 │ 10 ├───┼───┼───┤ 11 │ 7 │ 8 │ 9 │ 12 └───┴───┴───┘ 13 '; 14 15 sub get_mark { 16 my $game = shift; 17 my @nums = $game =~ /[1-9]/g; 18 my $indx = (@nums+1) % 2; 19 20 return MARKS->[$indx]; 21 } 22 23 sub put_mark { 24 my $game = shift; 25 my $mark = shift; 26 my $move = shift; 27 28 $game =~ s/$move/$mark/; 29 30 return $game; 31 } 32 33 sub get_move { 34 return (<> =~ /^[1-9]$/) ? $& : '0'; 35 } 36 37 PROMPT: { 38 state $game = BOARD; 39 40 my $mark; 41 my $move; 42 43 print $game; 44 45 last PROMPT if ($game !~ /[1-9]/); 46 47 $mark = get_mark $game; 48 print "$mark\'s move?: "; 49 50 $move = get_move; 51 $game = put_mark $game, $mark, $move; 52 53 redo PROMPT; 54 }

To try out the above program on your PC, you can copy-and-paste the above text into a plain text file and save and run it. The line numbers will have to be removed before the program will work. Of course, the command that one uses to perform that sort of textual extraction and reporting is perl.

Assuming that you have saved the above text to a file named game.txt, the following command can be used to strip the leading numbers from all the lines and write the modified version to a new file named game:

$ cat game.txt | perl -npe 's/...//' > game

The above command is a very small PERL script and it is an example of what is called a one-liner.

Now that the line numbers have been removed, the program can be run by entering the following command:

$ perl game

How it works

PERL is a procedural programming language. A program written in PERL consists of a series of commands that are executed sequentially. With few exceptions, most commands alter the state of the computer’s memory in some way.

Line 00 in the Tic-Tac-Toe program isn’t technically part of the PERL program and it can be omitted. It is called a shebang (the letter e is pronounced soft as it is in the word shell). The purpose of the shebang line is to tell the operating system what interpreter the remaining text should be processed with if one isn’t specified on the command line.

Line 02 isn’t strictly necessary for this program either. It makes available an advanced command named state. The state command creates a variable that can retain its value after it has gone out of scope. I’m using it here as a way to avoid declaring a global variable. It is considered good practice in computer programming to avoid using global variables where possible because they allow for action at a distance. If you didn’t follow all of that, don’t worry about it. It’s not important at this point.

PERL scopes, blocks and subroutines

Scope is a very important concept that one needs to be familiar with when reading and writing procedural programs. In PERL, scope is often delineated by a pair of curly brackets. Within the global scope, the above Tic-Tac-Toe program defines four sub-scopes on lines 15-21, 23-31, 33-35 and 37-54. The first three scopes are prefixed with subroutine declarations and the last scope is prefixed with the label PROMPT.

Scopes serve multiple purposes in programming languages. One purpose of a scope is to group a set of commands together as a unit so that they can be called repeatedly with a single command rather than having to repeat several lines of code each time in the program. Another purpose is to enhance the readability of the program by denoting a restricted area where the value of a variable can be updated.

Within the scope that is labeled PROMPT and defined on lines 37-54 of the above Tic-Tac-Toe program, a variable named mark is created using the my keyword (line 40). After it is created, it is assigned a value by calling the get_mark subroutine (line 47). Later, the put_mark subroutine is called (line 51) to change the value in the square that was chosen by the get_move subroutine on line 50.

Hopefully it is obvious that the mark that put_mark is setting is meant to be the same mark that get_mark retrieved earlier. As a programmer though, how do I know that the value of mark wasn’t changed when the get_move subroutine was called? This example program is small enough that every line can be examined to make that determination. But most programs are much larger than this example and having to know exactly what is going on at all points in the program’s execution can be overwhelming and error-prone. Because mark was created with the my keyword, its value can only be accessed and modified within the scope that it was created (or a sub-scope). It doesn’t matter what subroutines at parallel or higher scopes do; even if they change variables with the same name in their own scopes. This property of scopes — restricting the range of lines on which the value of a variable can be updated — improves the readability of the code by allowing the programmer to focus on a smaller section of the program without having to be concerned about what is happening elsewhere in the program.

Lines 04 and 05 define the MARKS and BOARD variables, respectively. Because they are not within any curly bracket pairing, they exist in the global scope. It is permissible to create constant variables in the global scope because they are read-only and therefore not subject to the action at a distance concern. In PERL, it is traditional to name constants in all upper case letters.

Notice that scopes can be nested such that variables defined in outer scopes can be accessed and modified from within inner scopes. This is why the MARKS and BOARD variables can be accessed within the get_mark subroutine and PROMPT block respectively — they are sub-scopes of the global scope.

The statements in the program are executed in order from top to bottom and left to right. Each statement is terminated with a semi-colon (;). The semi-colon can be omitted from the last statement in any scope and from after the last block of many statements that define the flow of the program such as sub, if and while.

In PERL nomenclature, scopes are called blocks. Scope is the more general term that is typically used in online references like Wikipedia, but the remainder of this article will use the more perlish term blocks.

The statements within the first three blocks are not immediately executed as the program is evaluated from top to bottom. Rather, they are associated with the subroutine name preceding the block. This is the function of the sub keyword — it associates a subroutine name with a block of statements so that they can be called as a unit elsewhere in the program. The three subroutines get_mark (lines 15-21), put_mark (lines 23-31), and get_move (lines 33-35) are called on lines 47, 51 and 50 respectively.

The PROMPT block is not associated with a subroutine definition or other flow-control statement, so the statements within it are immediately executed in sequence when the program is run.

PERL regular expressions

If there is one feature that is more central to PERL than any other it is regular expressions. Notice that in the example Tic-Tac-Toe program every block contains a =~ (or !~) operator followed by some text surrounded with forward slashes (/). The text within the forward slashes is called a regular expression and the operator binds the regular expression to a variable or data stream.

It is important to note that there are different regular expression syntaxes. Some editors and command-line tools (for example, grep) allow the user to select which regular expression syntax they prefer to use. PERL-Compatible Regular Expressions (PCRE) are by far the most powerful.

Regular expressions used in matching operations

The result of applying the regular expression to a variable or data stream is usually a value that, when used in a flow-control statement such as if or while, will evaluate to true or false depending on whether or not the match succeeded. There are modifiers that can be appended to the closing slash of a regular expression to change its return value.

Line 45 of the Tic-Tac-Toe program provides a typical example of how a regular expression is used in a PERL program. The regular expression [1-9] is being applied to the variable game which holds the in-memory representation of the Tic-Tac-Toe game board. The expression is a character class that matches any character in the range from 1 to 9 (inclusive). The result of the regular expression will be true only if a character from 1 to 9 is present in what is being evaluated. On line 45, the !~ operator applies the regular expression to the game variable and negates its sense such that the result will be true only if none of the characters from 1 to 9 are present. Because the regular expression is embedded within the conditional clause of the if statement modifier, the statement last PROMPT is only executed if there are no characters in the range from 1 to 9 left on the game board.

The last statement is one of a few flow-control statements in PERL that allow the program execution sequence to jump from the current line to another line somewhere else in the program. Other flow-control statements that work in a similar fashion include next, continue, redo and goto (the goto statement should be avoided whenever possible because it allows for spaghetti code).

In the example Tic-Tac-Toe program, the last PROMPT statement on line 45 causes program execution to resume just after the PROMPT block. Because there are no more statements in the program, the program will terminate.

The label PROMPT was chosen arbitrarily. Any label (or none at all) could have been used.

The redo PROMPT statement at the end of the PROMPT block causes program execution to jump back to the beginning of the PROMPT block.

Notice that the state keyword like the my keyword creates a variable that can only be accessed or modified within the block that it is created (or a nested sub-block if any exist). Unlike the my keyword, variables created with the state keyword keep their former value when the blocks they are in are called repeatedly. This is the behavior that is needed for the game variable because it is being updated incrementally each time the PROMPT block is run. The mark and move variables are meant to be different on each iteration of the PROMPT block, so they do not need to be created with the state keyword.

Regular expressions used for input validation

Another common use of regular expressions is for input validation. Line 34 of the example Tic-Tac-Toe program provides an example of a regular expression being used for input validation. The expression on line 34 is similar to the one on line 45. It is also checking for characters from 1 to 9. However, it is performing the check against the null filehandle (<>); it is using the =~ operator; and it is prefixed and suffixed with the zero-width assertions ^ and $ respectively.

The null filehandle, when accessed as it is on line 34, will cause the program to pause until one line of input is provided. The regular expression will evaluate to true only if the line contains one character in the range from 1 to 9. The assertions ^ and $ do not match any characters. Rather, they match the beginning and end positions, respectively, on the line. The regular expression effectively reads: “Begin (^) with one character in the range from 1 to 9 ([1-9]) and end ($)”.

Because it is embedded in the conditional clause of the ternary operator, line 34 will return either what was matched ($&) if the match succeeded or the character zero (0) if it failed. If the input were not validated in this way, then the user could submit their opponent’s mark rather than a number on the board.

Regular expressions used for filtering data

Line 17 demonstrates using the global modifier (g) on a regular expression. With the global modifier, the regular expression will return the number of matches instead of true or false. In list context, it returns a list of all the matched substrings.

Line 17 uses a regular expression to copy all the numbers in the range from 1 to 9 from the game variable into the array named nums. Line 18 then uses the modulo operator with the integer 2 as its second argument to determine whether the length of the nums array is even or odd. The formula on line 18 will result in 0 if the length of nums is odd and 1 if the length of nums is even. Finally, the computed index (indx) is used to access an element of the MARKS array and return it. Using this formula, the get_mark function will alternately return X or O depending on whether there are an odd or even number of positions left on the board.

Regular expressions used for substituting data

Line 28 demonstrates yet another common use of regular expressions in PERL. Rather than being used in a match operator (m), the regular expression on line 28 is being used in a substitution operator (s). If the value in the move variable is found in the game variable, it will be substituted with the value of the mark variable.

PERL sigils and data types

The last things of note that are used in the example Tic-Tac-Toe program are the sigils ($ and @) that are placed before the variable names. When creating a variable, the sigil indicates the type of variable being created. It is important to note that a different sigil can be prefixed to the variable name when it is accessed to indicate whether one or many items should be returned from the variable.

There are three built-in data types in PERL: scalars ($), arrays (@) and associative arrays (%). Scalars hold a single data item such as a number, character or string of characters. Arrays are numerically indexed sets of scalars. Associative arrays are arrays that are indexed by scalars rather than numbers.