Introduction to POSIX Regular Expressions

What are regular expressions?

A regular expression is a sequence of characters, some literal and some symbolic, that describes a pattern of text. For example, the regular expression ^hello. matches any line that starts with the text "hello" followed by any single character. In this series we'll go over how to construct such regular expressions.

What is the POSIX standard?

POSIX, an acronym for Portable Operating System Interface, is a set of standards defined by the IEEE Computer Society for maintaining compatibility between operating systems. The standard was created to make software portable between variations of Unix and other operating systems.

How did POSIX affect Unix regular expressions?

Regular expressions vary with the language they are implemented in and with the tools that use them. On the command line, there used to be three different commands for regular expressions. grep supported basic regular expressions (BRE), while extended grep, egrep, supported additional notation that made it more powerful at the cost of efficiency. That collection of features is known as extended regular expressions (ERE). Additionally, there was fast grep, fgrep, which matched one or more fixed strings. These variations were merged into grep by POSIX in 1992.

Now, this merging of basic regular expressions (BRE) and extended regular expressions (ERE) did not mean their notations were combined. When using grep , the default is set to BRE notation, but you can easily switch to ERE with the -E option.

Did I confuse you enough yet? Hopefully not... We'll go over which regex notations need the -E option, so no need to be lost!
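To make the BRE/ERE switch concrete, here is a minimal sketch, assuming a POSIX-compatible grep. The three sample lines are invented for illustration. The same pattern is run once in the default BRE mode and once with -E:

```shell
# Three sample lines, invented for illustration.
sample=$(printf 'a\naa\na+\n')

# BRE (default): '+' is an ordinary character, so only the literal line "a+" matches.
bre_matches=$(printf '%s\n' "$sample" | grep '^a+$')

# ERE (-E): '+' means "one or more of the preceding element".
ere_matches=$(printf '%s\n' "$sample" | grep -E '^a+$')

printf 'BRE:\n%s\n' "$bre_matches"
printf 'ERE:\n%s\n' "$ere_matches"
```

The pattern text is identical in both runs; only the -E option changes how the plus sign is interpreted.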

Regular Expressions with grep

On the command line, regular expressions are most commonly used through the grep command, whose name comes from "global regular expression print".

Why learn regex and grep?

Knowing grep and how to construct regular expressions is extremely powerful. For example, you can perform text manipulations such as searches and substitutions in a single line. These can be incorporated into shell scripts to automate your workflow for fast and easy text processing. Regular expressions can also be used in text editors such as Vim and Emacs, file viewers such as less and man, and programming languages such as awk, Python, and Perl.

A sample test

Let's try out a simple grep command to get started. We won't be using any special characters, just literal values.

$ ls /usr/bin | grep 'zip'

bunzip2 bzip2 bzip2recover funzip gunzip gzip unzip unzipsfx zip zipcloak zipdetails zipdetails5.16 zipdetails5.18 zipgrep zipinfo zipnote zipsplit

Here, we can see all the commands in our /usr/bin folder whose names contain zip. Remember that the bin folder is where our commands are stored as executable files.

Avoid Parsing ls

This article makes a good case for why you shouldn't parse the output of ls. In this tutorial, we'll be parsing the contents of our /usr/bin folder purely as an example. However, be sure to give the article a quick read, and understand that parsing the output of ls may give unexpected results.

Options with grep

Here is a list of useful options you can use with the grep command.

-c  Print the number of matching lines.
-E  Use extended regular expressions.
-e  Specify a pattern; may be repeated to search for several patterns, and lines matching any of them are returned.
-F  Treat patterns as fixed strings (ignore special characters).
-f  Read patterns from a newline-separated file.
-h  Suppress the output of file names.
-i  Ignore case.
-L  Print the names of files that contain no matches.
-l  List the names of files that match the pattern (instead of printing matched lines).
-n  Prefix each matching line with its line number within the file.
-q  Print nothing; report whether a match was found through the exit status.
-r  Search recursively through the specified folder.
-s  Suppress error messages about nonexistent or unreadable files.
-v  Print the lines that don't match any of the patterns.
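A few of these options can be tried on a small piece of text. This is only a sketch, assuming a POSIX-compatible grep; the sample lines are invented for illustration and stand in for any file or pipeline output:

```shell
# Hypothetical sample text.
text=$(printf 'Zip\nzipinfo\ngzip\ntar\n')

count=$(printf '%s\n' "$text" | grep -c 'zip')      # number of matching lines
nocase=$(printf '%s\n' "$text" | grep -i 'zip')     # ignore case
inverted=$(printf '%s\n' "$text" | grep -v 'zip')   # lines that do not match
numbered=$(printf '%s\n' "$text" | grep -n 'zip')   # prefix each match with its line number

# -q prints nothing; the exit status tells us whether anything matched.
if printf '%s\n' "$text" | grep -q 'tar'; then found=yes; else found=no; fi

printf '%s\n' "$count" "$nocase" "$inverted" "$numbered" "$found"
```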

Now that you have an understanding of regular expressions, POSIX standards, and grep, let's learn about the two types of characters in regex.

Literal and Special Characters

There are two types of characters you should look out for when reading a regular expression - literal and special characters.

Literal Characters

Literal characters are exactly what they sound like: they stand for themselves. These include the letters of the alphabet; for example, the regular expression testing matches every occurrence of the sequence t, e, s, t, i, n, g (in that order) in the text being searched.

Using just literal characters is limiting. We want our search queries to be dynamic and expansive so that we can find more than just a literal word. For example, how can we search for all character groupings containing the YYYY/MM/DD format? This is where special characters come into play.

Special Characters

Special characters , also known as metacharacters , hold special meanings and are usually symbolized by punctuation.

The following is a list of all the metacharacters we'll look at. We'll start with basic regular expressions (BRE), then move on to the extended group (ERE). Glance over them, and we'll go into detail in the following sections.

Basic Regular Expressions (BRE)

Basic regular expressions are the grep default. No need to apply any options when using the following:

\        Take the literal value of the following character.
.        Match any single character (exactly one; it cannot match nothing).
*        Match the preceding element zero or more times.
^        Anchor that specifies "starts with" some pattern.
$        Anchor that specifies "ends with" some pattern.
[...]    Match any one of the enclosed characters.
[^...]   Match any one character not among those enclosed.
\{n,m\}  Match at least n occurrences, but at most m.
\{n,\}   Match at least n occurrences.
\{n\}    Match the preceding element exactly n times.
\(\)     Group parts of a regular expression together.

Note that in BRE, parentheses and braces must be preceded by a backslash to act as metacharacters.

Extended Regular Expressions (ERE)

When using extended regular expressions, we must pass the -E option. This group adds more metacharacters to BRE. Furthermore, braces and parentheses don't need a backslash before them (for example, {n,m}).

{n,m}  Same as BRE, but without the backslashes.
(...)  Grouping.
+      Match one or more instances of the preceding character or grouping.
?      Match zero or one instance of the preceding character or grouping.
|      OR. Match the expression that comes before or after the pipe.

Done skimming through each metacharacter? Great! Let's move on to see some examples and get a more detailed look at how each special character works.

Matching single characters

Using literal characters

Let's start by matching a single character. The most obvious way would be to use a literal character for alphabetic letters. If there is a metacharacter that you would like to match, you can escape it with the backslash ( \ ).

$ ls /usr/bin | grep 'c\.d'

creatbyproc.d filebyproc.d newproc.d pidpersec.d runocc.d syscallbyproc.d syscallbysysc.d

Here's a simple regex example. We are looking for text that has a c, followed by a period (.), then a d.

From this we can notice two things:

We need to use the backslash to escape metacharacters (the period).

Order matters.

This isn't a "fuzzy search", where cat can match something like combat .

Using dot metacharacter

The dot ( . ) metacharacter matches any single character.

$ ls /usr/bin/ | grep 'c.t'

bzcat codesign_allocate colcrt cut gencat gzcat locate policytool tccutil xmlcatalog ypcat zcat

Notice how a character must be present between c and t. The dot must match exactly one character; it cannot match nothing, so ct on its own would not match.
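The difference between the dot metacharacter and an escaped, literal dot can be seen side by side. This is a minimal sketch, assuming a POSIX-compatible grep; the three sample lines are invented for illustration:

```shell
# Hypothetical sample lines.
text=$(printf 'c.d\ncad\ncd\n')

# Unescaped dot: any single character may sit between c and d.
any_char=$(printf '%s\n' "$text" | grep 'c.d')

# Escaped dot: only a literal period may sit between c and d.
literal_dot=$(printf '%s\n' "$text" | grep 'c\.d')

printf 'c.d matches:\n%s\n' "$any_char"
printf 'c\\.d matches:\n%s\n' "$literal_dot"
```

The line cd is matched by neither pattern, because both require exactly one character between the c and the d.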

In the next section, we'll learn how to use bracket expressions, which will give you even more power in selecting single characters.

Bracket Expressions

Brackets allow you to specify a single character from a group . For example, if you wanted any single vowel, you can use [aeiou] .

$ ls /usr/bin | grep 'b[aeiou]t'

batch bitesize.d smbutil

Negating Bracket Expressions

To negate a bracket expression, place a caret (^) immediately after the opening bracket.

$ ls /usr/bin | grep 'b[^aeiou]t'

rwbytype.d

This matches text that has a single character other than a, e, i, o, or u between b and t.
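Here is a small sketch of a bracket expression and its negation on the same input, assuming a POSIX-compatible grep. The sample words are invented for illustration:

```shell
# Hypothetical sample words.
words=$(printf 'bat\nbit\nbyt\nbt\n')

# A vowel between b and t.
vowel=$(printf '%s\n' "$words" | grep 'b[aeiou]t')

# Any single character that is NOT a vowel between b and t.
non_vowel=$(printf '%s\n' "$words" | grep 'b[^aeiou]t')

printf 'b[aeiou]t:\n%s\n' "$vowel"
printf 'b[^aeiou]t:\n%s\n' "$non_vowel"
```

Note that bt matches neither pattern: a negated bracket expression still has to match one character.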

Simplifying with a range

Inside brackets, a hyphen between two characters specifies a range of letters or digits.

$ ls /usr/bin | grep '[a-d][e-g][h-l]'

afhash afida afinfo cancel git-receive-pack ldapdelete mdfind snmpdelta

With this command, we selected names that contain three consecutive characters: the first from a, b, c, or d, the second from e, f, or g, and the third from h, i, j, k, or l. Notice how this sequence of three letters can appear anywhere in the name.

Portability conflicts with range

A severe downside to using the - metacharacter for range is that it's not portable due to different character collation orders. To explain this, we need to learn a bit of history.

Unix was first developed with just ASCII characters. These were the canonical English characters, numbered from 0 to 127, including control codes and printable characters such as upper- and lowercase letters, digits, and punctuation marks. The letters were ordered like this:

ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz

As other countries began adopting Unix, they had to make room for more characters. They had to include special characters such as an e with an accent over it, or a c with a squiggly line beneath. Thus, some collations arose with an ordering like this:

aAbBcCdDeEfFgGhHiIjJkKlLmMnNoOpPqQrRsStTuUvVwWxXyYzZ

You can probably imagine the problem already. An expression such as [A-Z] would capture all uppercase letters under the first ordering, but every letter except lowercase a under the second.

Thus, try not to use the range character too much. You can instead rely on Character Classes (below), which are POSIX standard.

Checking your locale

To check your current locale, print the $LANG variable.

$ echo $LANG

en_US.UTF-8
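One common way to sidestep collation surprises is to force the C locale for a single command by setting the standard LC_ALL environment variable, or to use a character class instead of a range. The sketch below assumes a POSIX-compatible grep and uses sample letters invented for illustration:

```shell
# Hypothetical sample lines: two lowercase and two uppercase letters.
letters=$(printf 'a\nB\nz\nZ\n')

# In the C locale, [A-Z] follows plain ASCII order, so it covers uppercase letters only.
c_locale=$(printf '%s\n' "$letters" | LC_ALL=C grep '[A-Z]')

# A character class expresses the same intent without relying on a range.
upper_class=$(printf '%s\n' "$letters" | grep '[[:upper:]]')

printf '%s\n' "$c_locale" "$upper_class"
```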

Character Classes

Because of the discrepancies in collation ordering, POSIX defines several character classes in order to make shell scripts more portable. Here is a list of the character classes:

[:alnum:]   Alphanumeric characters
[:alpha:]   Alphabetic characters
[:blank:]   Space and tab
[:cntrl:]   Control characters
[:digit:]   Digits
[:graph:]   Visible characters (printable, excluding space)
[:lower:]   Lowercase letters
[:print:]   Printable characters (including space)
[:punct:]   Punctuation
[:space:]   Whitespace
[:upper:]   Uppercase letters
[:xdigit:]  Hexadecimal digits

Character classes must be placed inside a bracket expression, which is why they appear with double brackets, as in [[:digit:]].

$ ls /usr/bin/ | grep '[[:digit:]][[:alpha:]][[:digit:]]'

a2p5.16 a2p5.18 s2p5.16 s2p5.18

This matched any files whose names contain a digit, then a letter, followed by another digit.
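The same pattern can be tried without depending on the contents of /usr/bin. A minimal sketch, assuming a POSIX-compatible grep, with sample names invented for illustration:

```shell
# Hypothetical sample names.
names=$(printf 'a2p5\nabc\n123\nx9y9\n')

# digit, letter, digit: each class sits inside its own bracket expression.
dld=$(printf '%s\n' "$names" | grep '[[:digit:]][[:alpha:]][[:digit:]]')

printf '%s\n' "$dld"
```

Here a2p5 matches on 2p5 and x9y9 matches on 9y9, while abc and 123 contain no digit-letter-digit sequence.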

Using metacharacters within brackets

When metacharacters are placed within brackets, they lose their special meaning.

The following code would match any listings with a minus symbol ( - ), a period ( . ) or the letter x .

$ ls /usr/bin/ | grep '[-.x]'

... weblatency.d wish8.4 wish8.5 xar xargs xattr xattr-2.6 xattr-2.7 xcode-select ...

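If you want to match a closing bracket ( ] ), place it first in the list (right after the ^ if the list is negated). If you want to match the minus symbol ( - ), place it first or last in the list.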
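The following sketch, assuming a POSIX-compatible grep, shows metacharacters being treated literally inside brackets, including a closing bracket placed first in the list. The sample lines are invented for illustration:

```shell
# Hypothetical sample lines, each with a different character between a and b.
text=$(printf 'a-b\na.b\naxb\na]b\na*b\nazb\n')

# Minus first, so it is literal; the dot is literal inside brackets as well.
dash_dot_x=$(printf '%s\n' "$text" | grep '[-.x]')

# Closing bracket first, so it is part of the list; the asterisk is literal here.
bracket_star=$(printf '%s\n' "$text" | grep '[]*]')

printf '%s\n' "$dash_dot_x" "$bracket_star"
```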

Non-English Environment

In some languages, two letters in sequence may be treated as a single unit, known as a collating element.

For example, in a locale that treats the characters 'ts' as one unit, we could match that unit by enclosing it in brackets and periods inside a bracket expression: [[.ts.]].

Furthermore, we can match all the variations of a character, such as those with an accent mark or tilde, using an equivalence class. Inside a bracket expression, [=a=] stands for all variations of the letter a that the locale defines as equivalent, so [[=a=]] can match à, á, â, and ã as well as a.

Matching multiple characters

With an Asterisk

The asterisk ( * ) allows you to match zero or more occurrences of the preceding element.

$ ls /usr/bin | grep 'bc*t'

btmmdiagnose ibtool libtool pfbtops

In this example, we are matching commands within our bin folder that have a b , any number of c 's (including none), then a t .
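The same pattern can be checked against a few hand-picked strings. This is a minimal sketch, assuming a POSIX-compatible grep; the sample lines are invented for illustration:

```shell
# Hypothetical sample lines.
text=$(printf 'bt\nbct\nbcct\nbat\n')

# b, then zero or more c's, then t.
star=$(printf '%s\n' "$text" | grep 'bc*t')

printf '%s\n' "$star"
```

The line bt matches because zero c's are allowed, while bat does not, because a is not a c.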

Although convenient, the asterisk may be too powerful for your purposes. To have more control over the number of repeating elements, we can use braces.

Using Braces to get more specific

With braces ( {...} ), we are able to better control how many times an element occurs. In BRE, remember that we must escape the brace characters with a backslash.

\{n\}    Match exactly n occurrences of the preceding regex.
\{n,\}   Match at least n occurrences of the preceding regex.
\{n,m\}  Match between n and m occurrences of the preceding regex.

$ ls /usr/bin | grep '[[:digit:]]\{2\}'

perlthanks5.16 perlthanks5.18 piconv5.16 piconv5.18 pl2pm5.16 pl2pm5.18 pod2html5.16 pod2html5.18 pod2latex5.16 pod2latex5.18 pod2man5.16 pod2man5.18 pod2readme5.16 pod2readme5.18

The above code searches for lines of text that contain two consecutive digits.
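The three interval forms can be compared on the same input. A minimal sketch, assuming a POSIX-compatible grep in its default BRE mode (hence the backslashes); the sample lines are invented for illustration:

```shell
# Hypothetical sample lines with one, two, and three digits.
text=$(printf 'v1\nv12\nv123\n')

# Exactly two digits in a row, anywhere in the line.
two=$(printf '%s\n' "$text" | grep '[[:digit:]]\{2\}')

# At least three digits in a row.
three_plus=$(printf '%s\n' "$text" | grep '[[:digit:]]\{3,\}')

# Between one and two digits, anchored so the whole line must fit.
one_to_two=$(printf '%s\n' "$text" | grep '^v[[:digit:]]\{1,2\}$')

printf '%s\n' "$two" "$three_plus" "$one_to_two"
```

Without anchors, \{2\} also matches v123, since that line contains two consecutive digits somewhere within it.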

RE_DUP_MAX

Note that the values for n and m must be between 0 and RE_DUP_MAX . This variable signifies the largest number of repetitions you are allowed in regular expressions.

To check your system's setting for this value, use the command getconf .

$ getconf RE_DUP_MAX

255

Backreferences and Anchors

Backreferences

When constructing a regular expression, you may need to reference some previously matched regex term. To do so, we may use backreferences .

The parts of the pattern that you want to reference must be enclosed in parentheses ( \(...\) ). To refer back to one, use \digit, where \1 represents the first parenthesized group, \2 the second, and so on.

For example, a regex such as \(foo\)ber\(buzz\)*\2\1 behaves like (foo)ber(buzz)*buzzfoo. Since the asterisk allows zero or more repetitions of the preceding group, this would match text such as fooberbuzzbuzzfoo, fooberbuzzbuzzbuzzfoo, and so on.
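Two simpler backreference patterns may help illustrate the idea. This is a sketch, assuming a POSIX-compatible grep in BRE mode; the sample words are invented for illustration:

```shell
# Hypothetical sample words.
words=$(printf 'hello\nworld\nbook\n')

# \1 must repeat whatever single letter the group matched: a doubled letter.
doubled=$(printf '%s\n' "$words" | grep '\([[:alpha:]]\)\1')

# \1 must repeat the exact text "ab" that the first group matched.
repeated=$(printf 'abcab\nabcba\n' | grep '\(ab\)c\1')

printf '%s\n' "$doubled" "$repeated"
```

A backreference matches the text that the group actually matched, not the group's pattern, which is why hello (ll) and book (oo) match but world does not.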

Begins with and Ends with Anchors

So far we have learned how to use regex to search within lines, with no restrictions. Thus we can use zip to find lines that contain the text 'zip' anywhere within them. This is great, but what if we want to search for strings that start or end with specific letters? To do precisely this, we use the ^ and $ symbols.

$ ls /usr/bin/ | grep '^zip' # files that start with zip

zip zipcloak zipdetails zipdetails5.16 zipdetails5.18 zipgrep zipinfo zipnote zipsplit

$ ls /usr/bin | grep 'zip$' # files that end with zip

funzip gunzip gzip unzip zip

$ ls /usr/bin | grep '^zip$' # files that start and end with zip

zip

Matching empty lines

The pattern ^$ matches empty lines, and may be used as grep -v '^$' to filter out all empty lines.
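As a quick sketch of the empty-line pattern, assuming a POSIX-compatible grep, the helper function below (a hypothetical name, with sample text invented for illustration) prints five lines, two of which are empty:

```shell
# Hypothetical helper that prints sample text containing two empty lines.
sample() { printf 'zip\n\ngzip\n\nunzip\n'; }

# Count the empty lines.
empty_count=$(sample | grep -c '^$')

# Drop the empty lines and keep everything else.
non_empty=$(sample | grep -v '^$')

printf '%s\n' "$empty_count" "$non_empty"
```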

Using ^ within text

In BRE, when ^ appears anywhere other than the beginning of the pattern, or $ anywhere other than the end, it loses its anchoring role and is treated as a literal character. For example, [ab^cd] means the letters a, b, ^, c, or d.

The two uses of the caret ^

The ^ is used to signify that a line begins with some regex. However, remember that if it appears as the first element in brackets, its meaning changes to negation. Thus, [^abcd] is different from ^[abcd]. The first matches any character other than a, b, c, or d. The second searches for text that starts with a, b, c, or d.
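The three roles of the caret can be compared on one input. A minimal sketch, assuming a POSIX-compatible grep; the sample lines are invented for illustration:

```shell
# Hypothetical sample lines.
text=$(printf 'apple\nkiwi\nbcd\nx^y\n')

# Anchor: the line starts with a, b, c, or d.
starts_with=$(printf '%s\n' "$text" | grep '^[abcd]')

# Negation: the line contains at least one character other than a, b, c, or d.
has_other=$(printf '%s\n' "$text" | grep '[^abcd]')

# Literal: the caret is not first in the brackets, so it is an ordinary list member.
caret_literal=$(printf '%s\n' "$text" | grep '[ab^cd]')

printf '%s\n' "$starts_with" "$has_other" "$caret_literal"
```

The line bcd is missing from the negated result because every one of its characters is in the excluded set.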

Extended Regular Expressions

Up until now we have covered all the characters used under basic regular expressions (BRE). Here, we'll go over extended regular expressions (ERE), which can be used with grep's -E option.

Similarities and Differences

Here is a list of similarities and differences between EREs and BREs:

POSIX does not define backreferences for EREs, although parentheses still have a special meaning.

The notation for matching single characters is the same.

There is no need to escape braces ( {n,m} ) or parentheses with a backslash.

Parentheses

Rather than marking text for backreferences, parentheses in ERE are used for grouping. This lets you treat a sequence of text as a single unit, so that a following metacharacter applies to the whole group rather than to a single character.

For example, (toy)* would match the sequence toy zero or more times.
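To see what the grouping changes, the sketch below compares the pattern with and without parentheses. It assumes a POSIX-compatible grep with the -E option; the sample lines are invented for illustration, and anchors are added so the whole line must fit the pattern:

```shell
# Hypothetical sample lines.
text=$(printf 'toy\ntoytoy\ntoyy\n')

# Grouped: the asterisk repeats the whole sequence "toy".
grouped=$(printf '%s\n' "$text" | grep -E '^(toy)*$')

# Ungrouped: the asterisk repeats only the final "y".
ungrouped=$(printf '%s\n' "$text" | grep -E '^toy*$')

printf '%s\n' "$grouped" "$ungrouped"
```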

Extended features

There are three extended metacharacters that ERE offers.

Question mark (?)

The ? matches zero or one of the preceding regex.
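A small sketch of the question mark, assuming a POSIX-compatible grep with the -E option. The sample spellings are invented for illustration:

```shell
# Hypothetical sample lines.
text=$(printf 'color\ncolour\ncolouur\n')

# The u is optional: zero or one occurrence, never two.
optional_u=$(printf '%s\n' "$text" | grep -E '^colou?r$')

printf '%s\n' "$optional_u"
```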

Plus sign (+)

The plus symbol ( + ) is used to match one or more of the preceding regex. It's very similar to the asterisk ( * ), but requires at least one occurrence.

$ ls /usr/bin | grep -E '[[:alpha:]][[:digit:]]+[[:alpha:]]'

enc2xs5.16 enc2xs5.18 eqn2graph find2perl find2perl5.16 find2perl5.18 grap2graph h2ph h2ph5.16 h2ph5.18 h2xs h2xs5.16 h2xs5.18 hdxml2manxml headerdoc2html ip2cc

This would select all text with a letter, followed by one or more digits, then another letter. We could have gotten the same result with [[:alpha:]][[:digit:]][[:digit:]]*[[:alpha:]], but the previous example looks much cleaner.

Alternations (|)

Alternation gives you the flexibility of choosing between two or more regular expressions.
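Here is a minimal sketch of alternation, assuming a POSIX-compatible grep with the -E option. The sample names are invented for illustration:

```shell
# Hypothetical sample names.
names=$(printf 'gzip\nbzip2\nunzip\ntar\n')

# Alternation inside a group: names starting with gzip or bzip.
g_or_b=$(printf '%s\n' "$names" | grep -E '^(g|b)zip')

# Alternation at the top level: names ending in zip, or exactly tar.
zip_or_tar=$(printf '%s\n' "$names" | grep -E 'zip$|^tar$')

printf '%s\n' "$g_or_b" "$zip_or_tar"
```

Without parentheses, the pipe splits the entire pattern in two, so the anchors in the second example belong to one alternative each.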