Theory: Regular languages

Many tools for searching and sculpting text rely on a pattern language known as regular expressions.

The theory of regular languages underpins regular expressions.

(Caveat: Some modern "regular" expression systems can describe irregular languages, which is why the term "regex" is preferred for these systems.)

Regular languages are a class of formal language equivalent in power to those recognized by deterministic finite automata (DFAs) and nondeterministic finite automata (NFAs).

[See my post on converting regular expressions to NFAs.]

In formal language theory, a language is a set of strings.

For example, { "foo" } and { "foo" , "foobar" } are formal (if small) languages.

(Mathematicians don't typically put quotes around a string, preferring to let the fixed-width typewriter font distinguish it as one, but I'm guessing that programmers are more comfortable with the quotes around strings.)

In regular language theory, there are two atomic languages:

$\epsilon$ -- the null language, which contains the string of length zero; and

$\emptyset$ -- the empty language, which contains no strings at all.

In almost every programming language, the null string is written "" .

Mathematicians are often sloppy with the notation for the null language, using $\epsilon$ to represent both the null language, { "" }, and the null string, "" .

For each character c in the alphabet, there is a corresponding one-character primitive language, { "c" }.

(The alphabet is a set of characters, usually denoted $\Sigma$ or $A$.)

Once again, mathematicians are often sloppy in their notation, using the character c to mean the language { "c" }.

Regular languages are those that can be obtained by unrestricted composition of the operations union, concatenation and Kleene star on the atomic and primitive languages:

The union of languages $L_1$ and $L_2$, written $L_1 \cup L_2$, is set union: \[ L_1 \cup L_2 = \{ x \mathrel{|} x \in L_1 \text{ or } x \in L_2 \} \text. \]

The concatenation of two languages $L_1$ and $L_2$, written $L_1 \circ L_2$, is akin to Cartesian product: \[ L_1 \circ L_2 = \{ \mathtt{"}xy\mathtt{"} \mathrel{|} \mathtt{"}x\mathtt{"} \in L_1 \text{ and } \mathtt{"}y\mathtt{"} \in L_2 \} \text. \] Concatenation is often written as juxtaposition: $L_1 L_2 = L_1 \circ L_2$.

The language $L$ to the $n$th power, written $L^n$, is the language containing every concatenation of $n$ strings from $L$: \[ L^n = \{ \mathtt{"}x_1\cdots x_n\mathtt{"} \mathrel{|} \mathtt{"}x_i\mathtt{"} \in L \text{ for all } i \text{ between } 1 \text{ and } n \} \text. \] Of course, $L^0 = \epsilon$.

The Kleene star (the "possibly empty repetition") of a language $L$, written $L^\star$, contains every concatenation of zero or more strings from $L$; that is, it is the union of all powers of $L$: \[ L^\star = \bigcup_{i=0}^\infty\; L^i \text. \]

For example, the set $((\mathtt{a} \circ \mathtt{b}) \cup \mathtt{c})^\star$ contains strings like "" , "ab" , "c" , "abab" , "ababc" and "cab" .
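As a rough sanity check, the membership claims above can be tested from the shell. This is a minimal sketch that assumes a grep supporting the -E (extended syntax), -x, -c and -v flags; the candidate list and the helper name candidates are made up for illustration, and (ab|c)* is the pattern spelling of the set above.

```shell
# Candidate strings, one per line; the first candidate is the empty string.
candidates() { printf '%s\n' '' ab c abab ababc cab ba abca; }

# -x requires the pattern to match the whole line; -c counts matching lines.
count=$(candidates | grep -E -x -c '(ab|c)*')

# -v inverts the match, leaving the strings that are outside the language.
rejected=$(candidates | grep -E -x -v '(ab|c)*')

echo "members: $count"
echo "non-members:" $rejected
```

The six strings listed in the text are accepted, while ba and abca are rejected because neither can be split into copies of ab and c.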

There are also a few common non-primitive regular operations:

The non-empty repetition of a language $L$, written $L^+$, is the same as Kleene star, but at least one copy of $L$ must be matched: \[ L^+ = L \circ L^\star \text. \]

The option of a language $L$, written $L^?$, is either $L$ or the null string: \[ L^? = L \cup \epsilon \]

The bounded repetition of a language $L$, written $L^{[n,m]}$, consists of between $n$ and $m$ occurrences of a language: \[ L^{[n,m]} = \bigcup_{i=n}^m\; L^i \text. \]
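The three derived operations can be tried out directly. The following is a small sketch, assuming a grep with -E and -x; the one-letter language { "a" } and the helper name inputs are illustrative choices, not part of the theory.

```shell
# Strings of one to four a's, one per line.
inputs() { printf '%s\n' a aa aaa aaaa; }

# L^+ : every line of one or more a's matches, so all four lines count.
plus_count=$(inputs | grep -E -x -c 'a+')

# L^{[2,3]} : only the lines with two or three a's survive.
bounded=$(inputs | grep -E -x 'a{2,3}')

# L^? : the leading a is optional, but a second a is not allowed.
optional=$(printf '%s\n' b ab aab | grep -E -x 'a?b')

echo "$plus_count"
echo "$bounded"
echo "$optional"
```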

The theory of regular languages provides algorithms and techniques to answer questions like:

Given a string $s$ and a language $L$, is $s$ in $L$?

Given a string $s$ and a language $L$, which substrings of $s$ are in $L$?

Given a language $L$, is it regular?

Regular expressions in code

In code, regular expressions describe matchable patterns over text.

They are often used to describe locations in text (e.g. all lines that match this pattern) and to transform text (e.g. transform text matching a pattern into different text).

There is no standard for regular expressions in code, but most languages employ a dialect from a common ancestor.

The three major dialects every programmer should know are:

basic regular expressions (BRE);

extended regular expressions (ERE); and

Perl-compatible regular expressions (PCRE).

Since this article is an introduction, it covers BRE and ERE. (PCRE is largely an extension of ERE).

The notation used in all regular expression implementations is inspired by the mathematical formalism.

The following table describes a generic regular expression pattern language:

Math                            Pattern              Pattern meaning
$\emptyset$                     no equivalent
$\epsilon$                      no character at all  matches ""
c                               c                    matches "c"
$L_1 \circ L_2$                 p1p2                 matches p1 then p2
$L_1 \cup L_2$                  p1|p2                matches p1 or p2
$L^\star$                       p*                   matches "" or p repeated
$L^+$                           p+                   matches p repeated, but not ""
$L^?$                           p?                   matches p or ""
$L^n$                           p{n}                 matches p repeated n times
$L^{[n,m]}$                     p{n,m}               matches p repeated n to m times
$\Sigma$                        .                    matches any character
$\{c_1,\ldots,c_n\}$            [c_1...c_n]          matches $c_1$ or $c_2$ or ... or $c_n$
$\Sigma - \{c_1,\ldots,c_n\}$   [^c_1...c_n]         matches any char but $c_1$ or ... or $c_n$
$(L)$                           (p)                  matches p, remembers submatch
no equivalent                   \n                   matches string from nth submatch
no equivalent                   \b                   matches a word boundary
no equivalent                   \w                   matches a word character, e.g., alphanumeric
no equivalent                   \W                   matches a nonword character, e.g., punctuation
no equivalent                   \s                   matches a whitespace character, e.g., space, tab, return
no equivalent                   \S                   matches a non-whitespace character, e.g., alphanumeric, punctuation
no equivalent                   \d                   matches a digit character, i.e., 0-9
no equivalent                   \D                   matches a non-digit character, e.g., alphanumeric, punctuation
no equivalent                   ^                    matches start of line/string
no equivalent                   $                    matches end of line/string
no equivalent                   [c_1-c_2]            matches $c_1$ through $c_2$

Backreferences are numbered by left parentheses: the $n$th left parenthesis denotes the $n$th submatch.
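The numbering rule is easiest to see with nested groups. Here is a brief sketch using sed with the -E flag (extended syntax); the input abc and the 1=/2=/3= labels are made up purely to display which submatch is which.

```shell
# Left parentheses, counted left to right: ((a)(b)) is 1, (a) is 2, (b) is 3.
numbered=$(printf '%s\n' 'abc' | sed -E 's/((a)(b))c/1=\1 2=\2 3=\3/')

echo "$numbered"
```

The outer group opens first, so it becomes submatch 1 even though it closes last.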

The sections ahead discussing individual tools will note individual differences for dialects like BRE and ERE.

grep: POSIX basic regular expressions

The tool grep can filter a file, line by line, against a pattern.

The command grep pattern file prints each line of file which contains a match for pattern. Given no file, it reads from the standard input.

The equally useful command grep -v pattern file prints each line of the file file which does not contain a match for pattern.
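A minimal sketch of both commands reading from standard input follows; the three log lines and the helper name log_lines are invented for the example.

```shell
# A tiny stand-in for a log file.
log_lines() { printf '%s\n' 'ok: start' 'error: disk full' 'ok: done'; }

# Lines that contain a match for the pattern.
matching=$(log_lines | grep 'error')

# Lines that do not contain a match.
others=$(log_lines | grep -v 'error')

echo "$matching"
echo "$others"
```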

By default, grep uses basic regular expressions (BRE).

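BRE differs syntactically from the generic pattern language above in several key ways. Specifically, the operators {} , () , + , | and ? must be escaped with \ , and many of the character class shortcuts have names instead. (The escaped forms \+ , \? and \| are GNU extensions rather than part of POSIX BRE.)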

Math                            BRE                  Pattern meaning
$\emptyset$                     no equivalent
$\epsilon$                      no character at all  matches ""
c                               c                    matches "c"
$L_1 \circ L_2$                 p1p2                 matches p1 then p2
$L_1 \cup L_2$                  p1\|p2               matches p1 or p2
$L^\star$                       p*                   matches "" or p repeated
$L^+$                           p\+                  matches p repeated, but not ""
$L^?$                           p\?                  matches p or ""
$L^n$                           p\{n\}               matches p repeated n times
$L^{[n,m]}$                     p\{n,m\}             matches p repeated n to m times
$\Sigma$                        .                    matches any character
$\{c_1,\ldots,c_n\}$            [c_1...c_n]          matches $c_1$ or $c_2$ or ... or $c_n$
$\Sigma - \{c_1,\ldots,c_n\}$   [^c_1...c_n]         matches any char but $c_1$ or ... or $c_n$
$(L)$                           \(p\)                matches p, remembers submatch
no equivalent                   \n                   matches string from nth submatch
no equivalent                   \b                   matches a word boundary
no equivalent                   [[:alnum:]]          matches an alphanumeric character
no equivalent                   [[:space:]]          matches a whitespace character, e.g., space, tab, return
no equivalent                   [[:digit:]]          matches a digit character, i.e., 0-9
no equivalent                   [[:xdigit:]]         matches a hex digit character, i.e., A-F, a-f, 0-9
no equivalent                   [[:upper:]]          matches an uppercase character
no equivalent                   [[:lower:]]          matches a lowercase character
no equivalent                   ^                    matches start of line/string
no equivalent                   $                    matches end of line/string
no equivalent                   [c_1-c_2]            matches $c_1$ through $c_2$

A common use case for grep is command | grep word , which will dump out the lines from the output of command containing the word.

For instance, ps u | grep matt will dump out processes run by the user matt (and possibly a few others that happen to have matt on the line).

A fun way to learn how to use grep is to run it against the dictionary file, /usr/share/dict/words .

Suppose you're doing a crossword, and you know a word is seven letters long, with a for its second letter and x for its sixth. Get a hint:

$ grep '^.a...x.$' /usr/share/dict/words
cachexy
carboxy
martext
panmixy

We can use submatch backreferences to print out words that repeat themselves:

$ grep '^\(.*\)\1$' /usr/share/dict/words
aa
adad
akeake
anan
arar
atlatl
baba
barabara
benben
beriberi
bibi
...

The \1 refers back to the string matched by the first parenthesized submatch. In this case, that's \(.*\) .

Recall that the $n$th left parenthesis denotes the $n$th submatch.

(Technically, backreferences break the regularity of grep.)

We could also find strings that consist of two repeated strings, one after the other:

$ grep '^\(.\+\)\1\(.\+\)\2$' /usr/share/dict/words
susurr

Apparently, there's only one match in my dictionary!

The start-of-line and end-of-line markers were necessary here. Without them, we get words that contain a substring that repeats itself:

$ grep '\(.\+\)\1' /usr/share/dict/words
aa
aal
aalii
aam
aardvark
aardwolf
abactinally
abaff
abaissed
abandonee

In this case, changing the * to \+ also became necessary, since .* matches even the null string, which every string trivially contains.

If you need to find a specific IP address, say 1.10.3.20, in a log file, you can do that by escaping the dots:

$ grep '\b1\.10\.3\.20\b' log

The word-boundary pattern \b is necessary to prevent lines containing text like 101.10.3.20 from matching.

Useful grep flags

-v inverts the match.

--color colors the matched text.

-F interprets the pattern as a literal string.

-H, -h print (or don't print) the matched filename.

-i matches case insensitively.

-l prints names of files that match instead.

-n prints the line number.

-w forces the pattern to match an entire word.

-x forces patterns to match the whole line.
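A few of these flags are easier to understand side by side. The sketch below assumes a grep that supports -n, -i, -w, -F, -x and -c (GNU and BSD grep both do); the five input lines and the helper name text are invented for the demonstration.

```shell
# Five sample lines; line numbers matter for the -n example.
text() { printf '%s\n' 'The cat sat.' 'concatenate' 'a.c' 'abc' 'CAT'; }

# -n prefixes line numbers, -i ignores case, -w demands a whole-word match,
# so the cat inside concatenate is skipped.
words=$(text | grep -n -i -w 'cat')

# As a regex, the dot in a.c matches any character: both a.c and abc match.
regex_count=$(text | grep -c 'a.c')

# With -F the pattern is a literal string, so only a.c itself matches.
literal=$(text | grep -F 'a.c')

# -x requires the pattern to cover the entire line.
whole=$(text | grep -x 'abc')

echo "$words"
echo "$regex_count"
echo "$literal"
echo "$whole"
```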

egrep: POSIX extended regular expressions

The tool egrep (equivalently, grep -E ) is identical to grep, except that it uses POSIX extended regular expressions.

POSIX extended regular expressions are identical to basic regular expressions, but the operators {} , () , + , | and ? should not be escaped.

This change substantially unclutters complex expressions, such as the double word example:

$ egrep '^(.*)\1$' /usr/share/dict/words
aa
adad
akeake
anan
arar
atlatl
baba
barabara
...

Consider a search for all words that have an oo at least one letter before an ee , or an ee at least one character before an oo :

$ egrep 'oo.+ee|ee.+oo' /usr/share/dict/words
beechwood
beechwoods
beefwood
beetroot
beetrooty
bloodweed
bookkeeper
bookkeeping
bootee
brookweed
...

Consider a search for words that contain between 5 and 7 vowels:

$ egrep '^([^aieou]*[aieou]){5,7}[^aieou]*$' /usr/share/dict/words
abacinate
abacination
abaisance
abalienate
abalienation
abandonable
abandonee
abarticular
abarticulation
abastardize
...

Warning: Due to strangeness with grep's handling of Unicode, the previous example only worked with the environment variable LANG=C set.

The power of backreferences: Prime-finding

Backreferences, as noted, break the regularity of the pattern language.

There's a famous regex which uses backreferences to match composite (non-prime) numbers in unary form:

^(11+)(\1)+$

Thus, egrep -v '^(11+)\1+$' will print out only lines of prime length, as long as every input line contains at least two 1s (a lone 1 is not matched by the regex, yet 1 is not prime):

$ egrep -v '^(11+)\1+$' <<EOF
11
111
1111
11111
111111
1111111
11111111
111111111
1111111111
11111111111
EOF
11
111
11111
1111111
11111111111

Most variants of this regex use the Perl-extended (11+?) in place of (11+) .

The +? means try the minimal match first, which directs the backtracking to be a little more intelligent in the order that it searches.

But, for correctness, minimal-match-first is not necessary.

If there exists a match at all, then the number is not prime.
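To see the claim at a slightly larger scale, the unary inputs can be generated rather than typed. This is only a sketch: the helper name unary and the upper bound of 20 are arbitrary choices, and it assumes a grep -E that accepts backreferences (GNU and BSD grep do, though POSIX does not require it for extended syntax).

```shell
# Generate the unary numbers 2 through 20, one per line (1s only).
unary() {
  awk 'BEGIN {
    for (n = 2; n <= 20; n++) {
      s = ""
      for (i = 0; i < n; i++) s = s "1"
      print s
    }
  }'
}

# Drop every line the composite-matching regex accepts, then report lengths.
primes=$(unary | grep -E -v '^(11+)\1+$' | awk '{ print length($0) }')

# Unquoted on purpose: word splitting joins the lengths onto one line.
echo $primes
```

The greedy (11+) finds the same divisors that a minimal match would; only the order in which candidate divisors are tried differs.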

For more discussion of this (and related) regexen and its limits, see Andrei Zmievski's write-up.

According to the lore, Abigail created this regex.

sed

sed is a "stream editor."

It reads a file line-by-line, conditionally applying a sequence of operations to each line and (possibly) printing the result.

By default, sed uses POSIX basic regular expression syntax. To use the (more comfortable) extended syntax, supply the flag -E .

Most sed programs consist of a single sed command: substitute.

For example, to replace instances of the regular expression [ch]at with ball , use:

$ sed 's/[ch]at/ball/g' < in > out

A proper sed program is a sequence of sed commands.

Most sed commands have one of three forms:

operation -- apply this operation to the current line.

address operation -- apply this operation to the current line if at the specified address.

address_1,address_2 operation -- apply this operation to the current line if between the specified addresses.

Numeric addresses

The simplest address is a line number.

For example, to print the first 12 lines, use sed '12q' . The command q quits sed. So, this program quits right after it prints the 12th line.

To print only the fourth line, use sed -n '4p' . The flag -n suppresses the default printing behavior, while the command p prints the line.

For convenience, the address $ refers to the last line.
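The numeric address forms can be compared on a small input. A minimal sketch, with a made-up five-line input and helper name five:

```shell
# Five numbered words standing in for a file.
five() { printf '%s\n' one two three four five; }

# Print lines until the quit command fires on line 2.
first_two=$(five | sed '2q')

# With -n, only lines explicitly printed by p appear.
fourth=$(five | sed -n '4p')

# A range address: lines 2 through 3 inclusive.
middle=$(five | sed -n '2,3p')

# The $ address selects the last line (single quotes keep the shell away).
last=$(five | sed -n '$p')

echo "$first_two"
echo "$fourth"
echo "$middle"
echo "$last"
```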

Pattern addresses

Addresses can be regular expressions in the form of /pattern/ .

For example, to extract the text between <body> and </body> in a file use the following sed program:

#!/usr/bin/sed -E -n -f
/<body>/,/<\/body>/ p

But, this also prints out the body tags.

A group command { ... } helps here:

#!/usr/bin/sed -E -n -f
/<body>/,/<\/body>/ {
  /<body>/b
  /<\/body>/b
  p
}

In this case, the b command skips to the next line.

But, this will miss text on the same line as the opening and closing tags.

Using substitute commands to strip out the tags fixes this problem:

#!/usr/bin/sed -E -n -f
/<body>/,/<\/body>/ {
  s/^.*<body>//
  s/<\/body>.*$//
  p
}

But, this breaks in the (rare) case of a body tag being on one line, as in:

<body> hello world </body>

The problem is that ranges cannot start and end on the same line.

To get around this, add a special case to catch it:

#!/usr/bin/sed -E -n -f
/<body>.*<\/body>/ {
  s/<body>(.*)<\/body>/\1/
  p
  q
}
/<body>/,/<\/body>/ {
  s/^.*<body>//
  s/<\/body>.*$//
  p
}

But, this script still breaks if there are nested body tags in the document.

If nesting in a pattern matters, it's probably time to switch to a formalism more powerful than regular languages, such as context-free languages.

Useful operations

The group operation { operation_1 ; ... ; operation_n } executes all of the specified operations, in order, on the given address.

The operation s/pattern/replacement/arguments replaces instances of pattern with replacement according to the arguments in the current line. In the replacement, \n stands for the nth submatch, while & represents the entire match.

The operation b branches to a label, and if none is specified, then sed skips to processing the next line. Think of this as a break operation.

The operation y/from/to/ transliterates the characters in from to their corresponding character in to.

The operation q quits sed.

The operation d deletes the current line.

The operation w file writes the current line to the specified file.

Common arguments to the substitute operation

The most common argument to the substitute command is g , which means "globally" replace all matches on the current line, instead of just the first.

Sometimes, other arguments are useful:

n tells sed to replace the nth match only, instead of the first.

p prints out the result if there is a substitution.

i ignores case during the match.

w file writes the current line to file.
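The effect of the more portable arguments shows up clearly on a single line of input. The following sketch uses an invented sample line and sticks to the default, g, numeric and p arguments plus the & replacement; the i argument is left out because support for it varies between sed implementations.

```shell
line='cat hat Cat'

# No argument: only the first match on the line is replaced.
first=$(printf '%s\n' "$line" | sed 's/[ch]at/ball/')

# g: every match on the line is replaced.
all=$(printf '%s\n' "$line" | sed 's/[ch]at/ball/g')

# 2: only the second match is replaced.
second=$(printf '%s\n' "$line" | sed 's/[ch]at/ball/2')

# & in the replacement stands for the entire match.
bracketed=$(printf '%s\n' "$line" | sed 's/[ch]at/[&]/g')

# -n plus the p argument prints only the lines where a substitution happened.
changed=$(printf '%s\n' 'dog' 'cat' | sed -n 's/[ch]at/ball/p')

printf '%s\n' "$first" "$all" "$second" "$bracketed" "$changed"
```

Cat with a capital C is never touched, since [ch] is case sensitive.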

Useful flags

-n suppresses automatic printing of each result; to print a result, use command p.

-f sedfile uses sedfile as the sed program.

Examples

Strip comment lines starting with # :

$ sed '/^#/d'

Delete C++-style // comments:

$ sed 's/\/\/.*$//'

Encrypt with the Caesar cipher:

$ sed 'y/abcdefghijklmnopqrstuvwxyz/defghijklmnopqrstuvwxyzabc/'

Decrypt with the Caesar cipher:

$ sed 'y/defghijklmnopqrstuvwxyzabc/abcdefghijklmnopqrstuvwxyz/'

Change names from "Last, First [Middle/Middle Initial.]" to "First [Middle/Middle Initial.] Last":

$ sed -E 's/([A-Z][a-z]*), ([A-Z][a-z]*( [A-Z][a-z]*[.]?)?)/\2 \1/g'
Might, Matthew B.
Matthew B. Might

Next steps with sed

sed is much more powerful than this summary suggests.

There are label ( : ) and branching commands ( b , t ) that allow loops, and in theory, arbitrary (Turing-equivalent) computation.

sed keeps track of both a pattern space (the current line) and hold space, and there are commands to manipulate both of them, e.g., g , G , h and H .

That said, you should probably never use these commands!

If you find yourself tempted to use these more advanced constructs, it's a sign that you want to use a tool like awk or Perl instead.

AWK

The awk command provides a more traditional programming language for text processing than sed .

Those accustomed to seeing only hairy awk one-liners might not even realize that AWK is a real programming language. For example, here's a comprehensible AWK program that prints the factorial of each line:

#!/usr/bin/awk -f
{ print factorial($0); }

function factorial(n) {
  if (n == 0) return 1;
  else return n*factorial(n-1);
}

Of course, AWK can be terse and obtuse too. Here's a popular one-liner that prints out the unique lines of a file:

awk '!a[$0]++' file

The major difference in philosophy between AWK and sed is that AWK is record-oriented rather than line-oriented.

Each line of the input to AWK is treated like a delimited record.

The AWK philosophy melds well with the Unix tradition of storing data in ad hoc line-oriented databases, e.g., /etc/passwd .

That is, where sed sees a file like this:

line 1
line 2
line 3
...

awk sees a file like this:

record 1
record 2
record 3
...

where each record is:

field 1 field 2 field 3 ...

The command line parameter -F regex sets the regular expression regex to be the field delimiter.

For instance, awk -F "," sees each record as:

field 1,field 2,field 3,...

To print out the account name and uid from /etc/passwd , use:

$ awk -F : '/^[^#]/ { print $1, $3 }' /etc/passwd
nobody -2
root 0
daemon 1
...

AWK programs

An AWK program consists of pattern-action pairs:

pattern { statements }

followed by an (optional) sequence of function definitions.

In fact, an action is optional, and a pattern by itself is equivalent to:

pattern { print }

As each record is read, each pattern is checked in order, and if it matches, then the corresponding action is executed.
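Both forms, a pattern with an action and a pattern alone, can be sketched on a tiny whitespace-delimited input. The names and ages below, and the helper name people, are made up.

```shell
# Three records with two fields each: name and age.
people() { printf '%s\n' 'alice 30' 'bob 17' 'carol 45'; }

# Pattern with an action: print the first field when the second is at least 18.
adults=$(people | awk '$2 >= 18 { print $1 }')

# Pattern alone: the implied action prints the whole matching record.
b_names=$(people | awk '/^b/')

echo "$adults"
echo "$b_names"
```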

Function definition

The form for function definition is:

function name(arg_1,...,arg_n) { statements }

As in C, a return statement returns the result of the function.

Patterns

The most common one-line pattern in AWK is the blank pattern, which matches every line.

The other pattern forms include:

/regex/, which matches if the regex matches something on the line;

expression, which matches if expression is nonzero or non-null;

p1, p2, which matches all records (inclusive) between p1 and p2;

BEGIN, which matches before the first line is read; and

END, which matches after the last line is read.

Some implementations of awk , like gawk , provide additional patterns:

BEGINFILE, which matches before a new file is read; and

ENDFILE, which matches after a file is read.
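The portable pattern forms combine naturally. Here is a short sketch using only BEGIN, END and a range pattern, which every awk should support; the inputs are invented, and the gawk-only BEGINFILE and ENDFILE are deliberately left out.

```shell
# BEGIN runs before any input, the blank pattern runs per record,
# and END runs after the last record.
total=$(printf '%s\n' 3 4 5 | awk 'BEGIN { sum = 0 } { sum += $1 } END { print sum }')

# A range pattern selects every record from the first match of /start/
# through the next match of /end/, inclusive.
between=$(printf '%s\n' a start b end c | awk '/start/,/end/')

echo "$total"
echo "$between"
```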

Expressions

AWK expressions appear in both patterns and in statements.

A basic AWK expression is either:

a special variable, e.g., $1 or NF;

a regular variable, e.g., foo;

a string literal, e.g., "foobar";

a numeric constant, e.g., 3, 3.1; or

a regex constant, e.g., /foo|bar/.

A regex constant can be passed as a first-class value to a function.

AWK supports a match expression form, exp1 ~ exp2 , where the assumption is that exp1 will evaluate to a string, exp2 will evaluate to a regex, and the result of matching is returned.

A lone regex constant in a conditional is implicitly equivalent to a match against the current record; that is, /regex/ becomes $0 ~ /regex/ .

For example, to filter lines that contain both foo and bar :

$ awk '/foo/ && /bar/ { print }'

or just:

$ awk '/foo/ && /bar/'

AWK brings the expected C-like arithmetic (like + ), comparison (like == ) and Boolean operators (like && ).

As in C, variable assignment is an expression rather than a statement.

For example, to print account names from /etc/passwd where the account number is 500 , use:

$ awk -F : '$3 == 500 { print $1 }' /etc/passwd

String concatenation is simply juxtaposition. As a result, it may be necessary to surround strings to be concatenated with parentheses, e.g., ("bar = " bar ".") .

Abutting a name with parentheses indicates function call; for example, the following program surrounds every line of input with curly braces:

#!/usr/bin/awk -f
{ print f($0) }

function f(line) {
  return ("{" line "}") ;
}

Arrays

AWK supports both scalars and arrays.

Arrays in AWK are associative, much like objects in JavaScript.

To reference an index in an array, use the C-style subscript notation, variable-name[index] , where index can be any expression that evaluates to a scalar value.

There is no need to create an array explicitly: just assign into an index in an undefined variable name.

To check for the existence of an index, use the in operator: index in variable-name .
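Implicit creation and the in operator can be sketched together. The array name seen and the fruit names are illustrative; note that testing with in does not create the entry, whereas referencing seen["kiwi"] directly would.

```shell
report=$(printf '%s\n' apple pear apple | awk '
  { seen[$1]++ }                       # assigning creates the array entry
  END {
    if ("apple" in seen) print "apple", seen["apple"]
    if (!("kiwi" in seen)) print "kiwi absent"
  }')

echo "$report"
```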

For example, to print the account name with the highest uid run the following on /etc/passwd :

#!/usr/bin/awk -F : -f
/^#/ { next ; }
{ users[$3] = $1 ; }
END {
  max = 0 ;
  for (i in users) {
    if ((i+0) > (max+0)) max = i ;
  }
  print users[max];
}

The (i+0) and (max+0) are necessary to forcibly convert them to numerics. Otherwise, > compares them lexically as strings.

Arrays have a split first-/second-class status in AWK.

Arrays are passed as parameters to procedures by reference.

But, it is not possible to assign an array to a variable.

#!/usr/bin/awk -f
BEGIN {
  arr[0] = 1 ;
  print 0, arr[0] ;    # prints 0 1
  modify_array(arr) ;  # ok
  print 0, arr[0] ;    # prints 0 2
  brr = arr ;          # error
  exit ;
}

function modify_array(array) {
  for (k in array) {
    array[k]++ ;
  }
}

Arrays may not be returned from procedures either.

Special variables

There are several special variables in AWK:

Variable   Meaning
$0         text of the matched record
$n         the nth entry in the current record
FILENAME   name of current file
NR         number of records seen thus far
FNR        number of records thus far in this file
NF         number of fields in current record
FS         input field delimiter, defaults to whitespace
RS         record delimiter, defaults to newline
OFS        output field delimiter, defaults to space
ORS        output record delimiter, defaults to newline

These special variables can be used in patterns.

For instance, one could print the even lines:

$ awk 'NR % 2 == 0 { print }'

Special variables like OFS can also be assigned as the program executes.

Technically, $n is not a variable.

In fact, $ is a special pseudoarray applied to the expression on its right.

For example, $(0) is an expression, as are $i and $(a[i]) .

And, by extension, $NF is the last field.
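Because $ applies to any expression, fields can be addressed relative to the end of the record. A one-line sketch with a made-up four-field input:

```shell
# NF is the field count, $NF the last field, $(NF-1) the one before it.
fields=$(printf '%s\n' 'a b c d' | awk '{ print NF, $NF, $(NF-1) }')

echo "$fields"
```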

Statements

AWK is a small language, with only a handful of forms for statements.

The man page lists all of them:

if (expression) statement [ else statement ]
while (expression) statement
for (expression; expression; expression) statement
for (var in array) statement
do statement while (expression)
break
continue
{ [ statement ... ] }
expression
print [ expression-list ] [ > expression ]
printf format [ , expression-list ] [ > expression ]
return [ expression ]
next
nextfile
delete array[expression]
delete array
exit [ expression ]

The most common statement is print , which is equivalent to print $0 .

If arguments to print are comma-separated, then they are spliced together with OFS .

For example:

$ echo foo bar | awk '{ OFS="::" ; print $1, $2 ; exit }'
foo::bar

Most of these statements should be familiar to programmers, and some look eerily similar to those found in JavaScript.

The delete statement deletes an index from an array, or alternately, the entire array.

Control statements

AWK supports C-style control constructs like if , for and while .

It also supports a special for form for iterating over the keys in an associative array:

for (var in array-name) statement

The control statements next and nextfile skip to the next line of input and the next file respectively.
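A common use of next is to discard a record early so that later pattern-action pairs never see it. A minimal sketch with an invented three-line input:

```shell
# The first rule skips comment lines; the second rule only runs on the rest.
names=$(printf '%s\n' '# comment' 'x 1' 'y 2' | awk '/^#/ { next } { print $1 }')

echo "$names"
```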

Built-in functions

AWK comes with a large set of built-in functions.

These are also listed in the AWK man page.

Perhaps the most useful is gensub(regex, replacement, params [ , input ]) , which returns roughly the result of sed 's s/regex/replacement/params run on input or $0 by default.

For example, to change C++-style // comments to C-style comments:

$ awk '{ print gensub(/\/\/[ ]?(.*)/, "/* \\1 */", "g" ) }'

Not all AWK implementations support gensub , so you might have to use the specializations sub and gsub instead.
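One way to get by without gensub is sketched below. It relies only on gsub, match, substr and the RSTART and RLENGTH variables, all of which are standard awk; the sample inputs are made up, and unlike the gensub version it rewrites only the first // on each line.

```shell
# gsub replaces every match in $0 and returns the number of replacements.
dashes=$(printf '%s\n' 'a-b-c' | awk '{ n = gsub(/-/, "+"); print n, $0 }')

# sub and gsub have no submatches in the replacement, so locate the
# comment marker with match() and rebuild the line with substr().
converted=$(printf '%s\n' 'x = 1; // set x' | awk '{
  if (match($0, /\/\/ ?/)) {
    $0 = substr($0, 1, RSTART - 1) "/* " substr($0, RSTART + RLENGTH) " */"
  }
  print
}')

echo "$dashes"
echo "$converted"
```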

Useful flags

-f filename uses the provided file as the AWK program.

-F regex sets the input field separator.

-v var=value sets a global variable. Multiple -v flags are allowed.
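The -F and -v flags are often used together to parameterize a one-liner from the shell. In this sketch the variable name min and the colon-delimited input are hypothetical.

```shell
# -F : splits each record on colons; -v min=5 sets min before any input is read.
above=$(printf '%s\n' 'a:1' 'b:7' 'c:9' | awk -F : -v min=5 '$2 >= min { print $1 }')

echo "$above"
```

Passing values with -v avoids splicing shell variables into the AWK program text.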

vim and emacs

Text editors in the Unix tradition excel at manipulating text.

If you haven't yet taken the (brief) tutorial for both editors, do so at your earliest convenience.

You can apply the knowledge from this article inside vim and emacs , which have their own rich regex-based search-and-replace systems:

Command   vim           emacs
search    /pattern      C-M-s pattern RET
replace   :s/pat/new/   M-x replace-regexp RET pat RET new RET

Both editors default to a BRE-like syntax.

In both, the escape \n expands into the nth submatch.

In emacs, the escape \& expands into the matched text, while just the character & expands into the matched text in vim.

You can also direct both editors to interact with sed and AWK, or any other shell command for that matter:

Command                     vim               emacs
insert output of command    :r!command        M-1 M-! command
pipe selection to command   :'<,'>!command    M-1 M-| command RET

Related posts and further reading