
1 Regular Expressions and perl

perl and Bioinformatics Dan Rochowiak perlRegEx.ppt



2 Regular Expressions What is a regular expression?

A regular expression is simply a string that describes a pattern. Patterns are in common use these days; examples are the patterns typed into a search engine to find web pages and the patterns used to list files in a directory, e.g., ls *.txt or dir *.*. In Perl, the patterns described by regular expressions are used to search strings, extract desired parts of strings, and to do search and replace operations.



3 Regular Expressions In regular expressions, the following character set symbols each match a single character. So, to handle arbitrary blank space you must use \s+. These symbols may also be used inside a character set. So, when looking for a hex digit you could use [a-fA-F\d].

Symbol  Equiv            Description
\w      [a-zA-Z0-9_]     A "word" character (alphanumeric plus "_")
\W      [^a-zA-Z0-9_]    Match a non-word character
\s      [ \t\n\f\r]      Match a whitespace character
\S      [^\s]            Match a non-whitespace character
\d      [0-9]            Match a digit character
\D      [^0-9]           Match a non-digit character
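As a small illustration of the symbols in the table, the following sketch splits a line on runs of whitespace with \s+ and tests strings against the hex class [a-fA-F\d]. The sample strings and variable names are made up for the example, and the ^ and $ anchors (covered on a later slide) are used only so that the whole string has to be hex.

```perl
use strict;
use warnings;

# Hypothetical sample line with mixed spaces, a tab, and a newline.
my $line   = "alpha \t beta\n  gamma";

# \s+ matches one or more whitespace characters of any kind.
my @fields = split /\s+/, $line;

# [a-fA-F\d] matches a single hex digit; + repeats it,
# and ^...$ forces the entire string to consist of hex digits.
my $is_hex  = ("1aF9" =~ /^[a-fA-F\d]+$/) ? 1 : 0;
my $not_hex = ("1aG9" =~ /^[a-fA-F\d]+$/) ? 1 : 0;

print join("|", @fields), "\n";             # alpha|beta|gamma
print "is_hex=$is_hex not_hex=$not_hex\n";  # is_hex=1 not_hex=0
```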



4 Regular Expressions Perl has the standard regex quantifiers or closures, where r is any regular expression.

r*    Zero or more occurrences of r (greedy match).
r+    One or more occurrences of r (greedy match).
r?    Zero or one occurrence of r (greedy match).
r*?   Zero or more occurrences of r (match minimal).
r+?   One or more occurrences of r (match minimal).
r??   Zero or one occurrence of r (match minimal).

Let q be a regex with a quantifier. If there are many ways for q to match some text, a greedy quantifier will match (or "eat up") as much text as possible; a minimal quantifier does the opposite and matches as little as possible. If a regex contains more than one quantifier, the quantifiers are "fed" left to right.
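The difference between greedy and minimal matching is easiest to see on a string that offers more than one way to match. The sketch below is only an illustration; the sample text and variable names are invented for it.

```perl
use strict;
use warnings;

# Hypothetical sample text with two pairs of angle brackets.
my $text = "<b>bold</b> and <i>italic</i>";

# Greedy: .+ grabs as much as it can, so the match runs to the LAST '>'.
my ($greedy)  = $text =~ /<(.+)>/;

# Minimal: .+? grabs as little as it can, so the match stops at the FIRST '>'.
my ($minimal) = $text =~ /<(.+?)>/;

print "greedy:  $greedy\n";   # b>bold</b> and <i>italic</i
print "minimal: $minimal\n";  # b
```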



5 Regular Expressions The two main regex operations are searching/finding and substituting. In searching, we test whether a string contains a match for a regular expression. In substituting, we replace part of the original string with a new string; the new string is often based on the original. Both of these operations use the regular expression operator =~, which consists of two characters. This operator is not related to either equals (=) or tilde (~).



6 Searching To determine if the string $line contains a recent year such as 1983, use the search operator =~ /.../. The slashes '/' delimit the beginning and the end of the regular expression.

    if ($line =~ /19[89]\d/) {
        # found a year in $line
    }

In general, to determine if string $var contains the regular expression re, use any of the following forms. If the regular expression itself contains a slash '/', then use the mXreX form, where each X is the same single character not appearing in re. In mX...X, the m stands for "match".

    if ($var =~ /re/) { ... }
    if ($var =~ m:re:) { ... }     # can replace ':' with any other character
    while ($var =~ m/re/) { ... }  # can replace '/' with any other character
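A minimal sketch of the search forms in use, assuming a few made-up input lines: the first search picks out lines containing a year from 1980 to 1999, and the second uses the m:...: form so that the slashes in the pattern need no escaping.

```perl
use strict;
use warnings;

# Hypothetical input lines.
my @lines = (
    "Born in 1983.",
    "Published 2004.",
    "Founded in 1776, revised 1999.",
);

# Keep only the lines that contain 198x or 199x.
my @hits = grep { $_ =~ /19[89]\d/ } @lines;

# The pattern contains '/', so use ':' as the delimiter instead.
my $path    = "/usr/bin/perl";
my $has_bin = ($path =~ m:/bin/:) ? 1 : 0;

print scalar(@hits), " line(s) with a recent year\n";  # 2 line(s) ...
print "has_bin=$has_bin\n";                            # has_bin=1
```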



7 Substrings To access the substring in $var matched by part of the regular expression re, put that part of re in parentheses. The matched text is accessible via the variables $1, $2, ..., $k, where $k holds the text matched by the k-th parenthesized part of the regular expression. For example, to break up an address of the form user@machine in $line we could do

    if ($line =~ /(\S+)\@(\S+)/) {   # \S = any non-space character
        my($user, $machine) = ($1, $2);
        ...
    }

The submatch variables $1, $2, ... $k are updated after each successful regex operation, which wipes out the previous values.
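The following sketch shows parenthesized submatches on a concrete string. It assumes the address has the form user@machine; the sample address and variable names are hypothetical, and the values are copied out of $1 and $2 right away because a later successful match would overwrite them.

```perl
use strict;
use warnings;

# Hypothetical line; single quotes keep '@' from being interpolated.
my $line = 'Contact: jdoe@example.org today';

my ($user, $machine);
if ($line =~ /(\S+)\@(\S+)/) {
    # $1 and $2 hold the text matched by the first and second (...) groups.
    ($user, $machine) = ($1, $2);
}

print "user=$user machine=$machine\n";  # user=jdoe machine=example.org
```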



8 Substrings Use \k, not $k, in the regular expression itself to refer to a previously matched substring. For example, to search for identical beginning and ending HTML tags <xyz> ... </xyz> on a single line $line, use

    if ($line =~ m|<(.*)>(.*)</\1>|) {   # search for: <xyz>stuff</xyz>
        my($stuff) = $2;
        ...
    }
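As a quick check of how \1 behaves, this sketch applies the tag pattern to one line with matching tags and one with mismatched tags. The sample strings are invented; real HTML is usually better handled by a parser than by a regex.

```perl
use strict;
use warnings;

my ($tag, $stuff);

# Matching tags: \1 must repeat whatever the first group captured.
if ('<b>bold text</b>' =~ m|<(.*)>(.*)</\1>|) {
    ($tag, $stuff) = ($1, $2);
}

# Mismatched tags: no capture for group 1 lets </\1> match, so this fails.
my $mismatch = ('<b>bold text</i>' =~ m|<(.*)>(.*)</\1>|) ? 1 : 0;

print "tag=$tag stuff=$stuff\n";  # tag=b stuff=bold text
print "mismatch=$mismatch\n";     # mismatch=0
```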



9 Substitution To replace or substitute text in $var matching the regular expression old with the text new, use the following form.

    $var =~ s/old/new/;                # replace old with new
    if ($var =~ s:old:new:) { ... }    # can replace ':' with any other character

To use part of the actual text matched by the old regex, the replacement text new can use the $k variables. Taking our previous example involving years, to replace the year 19xy with xy, use

    $line =~ s/19(\d\d)/$1/;
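Here is a short sketch of the year substitution on a made-up sentence. The plain form replaces only the first match; the /g modifier, which the slide does not cover, is a standard Perl option that repeats the substitution for every match, and s/// returns the number of replacements made.

```perl
use strict;
use warnings;

# Hypothetical input line.
my $line = "Released in 1987 and again in 1993.";

# Copy, then substitute in the copy: only the first year is shortened.
(my $first_only = $line) =~ s/19(\d\d)/$1/;

# With /g every matching year is shortened; the return value is the count.
my $all   = $line;
my $count = ($all =~ s/19(\d\d)/$1/g);

print "$first_only\n";        # Released in 87 and again in 1993.
print "$all ($count)\n";      # Released in 87 and again in 93. (2)
```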



10 Examples: Simple Case The simplest regexp is simply a word, or string of characters. A regexp consisting of a word matches any string that contains that word:

    "Hello World" =~ /World/;  # matches

"Hello World" is a double-quoted string. World is the regular expression, and the // enclosing /World/ tells perl to search a string for a match. The operator =~ associates the string with the regexp match and produces a true value if the regexp matched, or false if it did not match. In this case, World matches the second word in "Hello World", so the expression is true.



11 Examples: More Simple Cases

There are useful variations. The sense of the match can be reversed by using the !~ operator:

    if ("Hello World" !~ /World/) {
        print "It doesn't match\n";
    }
    else {
        print "It matches\n";
    }

The literal string in the regexp can be replaced by a variable:

    $greeting = "World";
    if ("Hello World" =~ /$greeting/) {
        print "It matches\n";
    }



12 Examples: Special Cases

If you're matching against the special default variable $_, the $_ =~ part can be omitted:

    $_ = "Hello World";
    if (/World/) {
        print "It matches\n";
    }
    else {
        print "It doesn't match\n";
    }

And finally, the // default delimiters for a match can be changed to arbitrary delimiters by putting an 'm' out front:

    "Hello World" =~ m!World!;   # matches, delimited by '!'
    "Hello World" =~ m{World};   # matches, note the matching '{}'

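The default variable is most useful inside loops, where each element is placed in $_ automatically. The sketch below assumes a small made-up list of strings and also exercises the two alternative delimiter forms.

```perl
use strict;
use warnings;

# Hypothetical list of strings to search.
my @greetings = ("Hello World", "Hello Perl", "World Cup");

my @matched;
for (@greetings) {                 # each element is aliased to $_
    push @matched, $_ if /World/;  # same as: $_ =~ /World/
}

# Alternative delimiters behave exactly like /World/.
my $bang_ok  = ("Hello World" =~ m!World!) ? 1 : 0;
my $brace_ok = ("Hello World" =~ m{World}) ? 1 : 0;

print join(", ", @matched), "\n";  # Hello World, World Cup
```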


13 Examples: Exact Match Consider how different regexps would match "Hello World":

    "Hello World" =~ /world/;   # doesn't match
    "Hello World" =~ /o W/;     # matches
    "Hello World" =~ /oW/;      # doesn't match
    "Hello World" =~ /World /;  # doesn't match

The first regexp, world, doesn't match because regexps are case-sensitive. The second regexp matches because the substring 'o W' occurs in the string "Hello World". The space character ' ' is treated like any other character and is needed to match in this case. The lack of a space character is the reason the third regexp doesn't match. The fourth regexp doesn't match because there is a space at the end of the regexp, but not at the end of the string. Regexps must match a part of the string exactly in order for the statement to be true.



14 Examples: Metacharacters

With respect to character matching, there is another point to consider. Not all characters can be used 'as is' in a match. Some characters, called metacharacters, are reserved for use in regexp notation. The metacharacters are

    {}[]()^$.|*+?\

A metacharacter can be matched by putting a backslash before it:

    "2+2=4" =~ /2+2/;    # doesn't match, + is a metacharacter
    "2+2=4" =~ /2\+2/;   # matches, \+ is treated like an ordinary +
    "The interval is [0,1)." =~ /[0,1)./;      # is a syntax error!
    "The interval is [0,1)." =~ /\[0,1\)\./;   # matches
    "/usr/bin/perl" =~ /\/usr\/bin\/perl/;     # matches

In the last regexp, the forward slash '/' is also backslashed, because it is used to delimit the regexp.
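The escaping rules can be checked directly. This sketch runs the valid examples and, for the path, uses the m{...} delimiter form as one possible way to avoid backslashing every slash; the variable names are invented for the example.

```perl
use strict;
use warnings;

# Unescaped '+' is a quantifier: /2+2/ needs one or more '2' followed by '2'.
my $plain    = ("2+2=4" =~ /2+2/)  ? 1 : 0;

# Escaped '+' is an ordinary plus sign.
my $escaped  = ("2+2=4" =~ /2\+2/) ? 1 : 0;

# '[', ')' and '.' all need a backslash to be matched literally.
my $interval = ("The interval is [0,1)." =~ /\[0,1\)\./) ? 1 : 0;

# With braces as delimiters, '/' no longer needs escaping.
my $path     = ("/usr/bin/perl" =~ m{/usr/bin/perl}) ? 1 : 0;

print "plain=$plain escaped=$escaped interval=$interval path=$path\n";
# plain=0 escaped=1 interval=1 path=1
```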



15 Matching and Variables

Similar escape sequences are used in double-quoted strings, and the regexps in Perl are mostly treated as double-quoted strings. This means that variables can be used in regexps. The values of the variables in the regexp will be substituted in before the regexp is evaluated for matching purposes. So:

    $foo = 'house';
    'housecat' =~ /$foo/;       # matches
    'cathouse' =~ /cat$foo/;    # matches
    'housecat' =~ /${foo}cat/;  # matches



16 Anchors

So far, if the regexp matched anywhere in the string it was considered a match. Sometimes we'd like to specify where in the string the regexp should try to match. To do this, use the anchor metacharacters ^ and $. The anchor ^ means match at the beginning of the string and the anchor $ means match at the end of the string, or before a newline at the end of the string. So:

    "housekeeper" =~ /keeper/;     # matches
    "housekeeper" =~ /^keeper/;    # doesn't match
    "housekeeper" =~ /keeper$/;    # matches
    "housekeeper\n" =~ /keeper$/;  # matches

The second regexp doesn't match because ^ constrains keeper to match only at the beginning of the string, but "housekeeper" has keeper starting in the middle. The third regexp does match, since the $ constrains keeper to match only at the end of the string. The fourth also matches, because $ can match before a newline at the end of the string.

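The anchor examples can be run as written. This sketch also adds one case that is not on the slide: using ^ and $ together, which requires the regexp to account for the entire string.

```perl
use strict;
use warnings;

my $anywhere  = ("housekeeper"   =~ /keeper/)   ? 1 : 0;  # matches in the middle
my $at_start  = ("housekeeper"   =~ /^keeper/)  ? 1 : 0;  # keeper is not at the start
my $at_end    = ("housekeeper"   =~ /keeper$/)  ? 1 : 0;  # keeper is at the end
my $before_nl = ("housekeeper\n" =~ /keeper$/)  ? 1 : 0;  # $ also matches before a final newline

# Both anchors together: the whole string must be exactly 'keeper'.
my $whole_yes = ("keeper"        =~ /^keeper$/) ? 1 : 0;
my $whole_no  = ("housekeeper"   =~ /^keeper$/) ? 1 : 0;

print "$anywhere $at_start $at_end $before_nl $whole_yes $whole_no\n";
# 1 0 1 1 1 0
```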


17 Character Classes A character class allows a set of possible characters, rather than just a single character, to match at a particular point in a regexp. Character classes are denoted by brackets [...], with the set of characters to be possibly matched inside. So:

    /cat/;               # matches 'cat'
    /[bcr]at/;           # matches 'bat', 'cat', or 'rat'
    /item[0123456789]/;  # matches 'item0' or ... or 'item9'
    "abc" =~ /[cab]/;    # matches 'a'

In the last statement, even though 'c' is the first character in the class, 'a' matches because the first character position in the string is the earliest point at which the regexp can match.



18 Character Classes

    /[yY][eE][sS]/;  # match 'yes' in a case-insensitive way:
                     # 'yes', 'Yes', 'YES', etc.

This regexp displays a common task: performing a case-insensitive match. Perl provides a way of avoiding all those brackets by simply appending an 'i' to the end of the match. Then /[yY][eE][sS]/; can be rewritten as /yes/i;. The 'i' stands for case-insensitive and is an example of a modifier of the matching operation. There are more modifiers.
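To confirm that the bracketed form and the /i modifier select the same strings, this sketch filters a small made-up list of answers both ways.

```perl
use strict;
use warnings;

# Hypothetical user answers.
my @answers = ("yes", "Yes", "YES", "no");

# Spelled-out character classes for each letter.
my @by_class    = grep { /[yY][eE][sS]/ } @answers;

# The same match using the case-insensitive modifier.
my @by_modifier = grep { /yes/i } @answers;

print "@by_class\n";     # yes Yes YES
print "@by_modifier\n";  # yes Yes YES
```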



19 Character Class - Ranges

The special character '-' acts as a range operator within character classes, so that a contiguous set of characters can be written as a range. With ranges, the unwieldy [0123456789] and [abc...xyz] become the svelte [0-9] and [a-z]. So:

    /item[0-9]/;     # matches 'item0' or ... or 'item9'
    /[0-9bx-z]aa/;   # matches '0aa', ..., '9aa',
                     # 'baa', 'xaa', 'yaa', or 'zaa'
    /[0-9a-fA-F]/;   # matches a hexadecimal digit
    /[0-9a-zA-Z_]/;  # matches an alphanumeric character or '_',
                     # like those in a perl variable name

If '-' is the first or last character in a character class, it is treated as an ordinary character; [-ab], [ab-] and [a\-b] are all equivalent.
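A brief sketch of ranges in action, using made-up test strings: it filters with two of the range classes above and checks that the three ways of writing a literal '-' in a class all match a lone dash.

```perl
use strict;
use warnings;

# 'item' followed by any single digit (more digits may follow).
my @items = grep { /item[0-9]/ } ("item7", "itemX", "item42");

# A digit, 'b', or 'x' through 'z', followed by 'aa'.
my @aa = grep { /[0-9bx-z]aa/ } ("5aa", "baa", "yaa", "caa");

# '-' first, last, or escaped is an ordinary character in a class.
my $dash_first   = ("-" =~ /[-ab]/)  ? 1 : 0;
my $dash_last    = ("-" =~ /[ab-]/)  ? 1 : 0;
my $dash_escaped = ("-" =~ /[a\-b]/) ? 1 : 0;

print "@items\n";  # item7 item42
print "@aa\n";     # 5aa baa yaa
print "$dash_first $dash_last $dash_escaped\n";  # 1 1 1
```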



20 Character Classes - Negation

The special character ^ in the first position of a character class denotes a negated character class, which matches any character but those in the brackets. Both [...] and [^...] must match a character, or the match fails. So:

    /[^a]at/;  # doesn't match 'aat' or 'at', but
               # matches all other 'bat', 'cat', '0at', '%at', etc.
    /[^0-9]/;  # matches a non-numeric character
    /[a^]at/;  # matches 'aat' or '^at'; here '^' is ordinary
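The two roles of ^ inside brackets can be compared on one list. In this sketch the word list is invented; the first filter uses the negated class and the second uses ^ as an ordinary character.

```perl
use strict;
use warnings;

# Hypothetical candidate strings (single-quoted to keep them literal).
my @words = ('bat', 'cat', 'aat', 'at', '%at', '^at');

# Negated class: some character other than 'a' must precede 'at'.
my @negated = grep { /[^a]at/ } @words;

# '^' not in first position: the class is just the two characters 'a' and '^'.
my @literal = grep { /[a^]at/ } @words;

print join(" ", @negated), "\n";  # bat cat %at ^at
print join(" ", @literal), "\n";  # aat ^at
```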

