I blog for a variety of reasons. To practice my writing and communications skills. To vent frustration in a safe place. For public intellectual exhibitionism.

Today I blog to remember something I had forgotten.

A design pattern, generally speaking, is a common pattern that we use to solve a problem. Unfortunately that observation has been reduced by most programmers I know into an exercise in name calling–and like the practice in high school of teaching history as a dry list of dates and event names, has a way of turning a wonderful observation as to how problems can be solved using common themes into mindless recital. (The next time someone tells me “why didn’t you use a fubar pattern here?” I’m going to smack them.)

Design patterns, however, are not just a dry list of names and UML class diagrams, just as history is not a dry list of dates and event names. And some design patterns from yesterday are nearly forgotten–at least I had nearly forgotten them–which as far as I’m concerned is the same thing.

All this came about because I was reading the paper Regular Expression Matching Can Be Simple And Fast. Because I’ve spent the past couple of months locked in meetings and other brain-scrambling exercises, I figured it would be good to exercise that most important muscle, the brain, on an interesting problem: writing my own regular expression matcher engine in Java, and do it without looking at the code in the paper. (Yes, I know: I can just use java.util.regex, or compile the code in the paper–but the point here was not to mindlessly go “oook! regular expression!” like a brainless monkey. The idea here was to engage in a mental exercise. Do you think Sudoku puzzles are pulled out of thin air without first computing a solution then working backwards to create the puzzle?)

The first design pattern comes about in the discussion of the Regular Expression paper; this is the State Machine. While I’ve linked the Wikipedia paper, the key thing to remember is that a state machine consists of a set of states S and a set of actions A, such that each state {s 1 } triggers action a 1 and transitions to a new state s 2 depending on the action’s outcome.

Typically states are represented as an integer, and your state machine code often looks like:

int state = 1; boolean done = false; while (!done) { switch (state) { case 1: state = doAction1(); break; case 2: state = doAction2(); break; ... } }

What’s great about state machines is twofold: first, it helps break a computational problem into small chunks of code. This can be useful if you need to write code that fits into a small footprint (like an embedded processor you’d find in a remote control or a watch or calculator), and is especially useful in this context if you are emulating a multi-threaded process in an environment where you don’t have a multi-tasking operating system to rely upon. (I once built a print spool server that ran on a 6809 embedded processor without any operating system by building the entire network stack as a sequence of states in a state machine. I could then monitor up to 8 separate incoming network connections at the same time simply by having my program walk through 8 separate chunks of memory which held the state for each incoming connection and running the state machine in each: update state in slot 1, then update state in lot 2, then…, then update state in slot 8, then repeat with slot 1, etc.)

The second thing that is great about state machines is that it allows you to represent a dynamic program which is built from some input string (such as a language specification in BNF form or a regular expression parser); in this case, each state is not an integer in a switch statement, but an object in a linked list of objects. It’s great; rather than hard-code how your program works, you load the string, build the state machine, and your execution code simply works the state machine until it’s done.

The second design pattern that came about when creating the regular expression parser–and I’m kicking myself because it took me four hours to notice this last night (because of the brain-scrambling effects of managerial meetings) is the shunting yard algorithm, which is used to convert a statement in infix notation into a parse tree which can then be turned into a state machine or solved or turned into assembly language code.

Typically the shunting yard algorithm is used to solve algebraic equations in infix notation, so we can convert the string “34 + 4*5 – (-3 + 4) * 8” into a parse tree (shown here in LISP style notation) (- (+ 34 (* 4 5)) (* (+ (- 3) 4) 8)), which can eventually be solved:

(- (+ 34 (* 4 5)) (* (+ (- 3) 4) 8)) => (- (+ 34 20) (* (+ -3 4) 8)) => (- 54 (* 1 8)) => (- 54 8 ) => 46

The nice thing about the shunting yard algorithm is that it works for any string which uses a series of prefix, postfix and infix operators: we could use it to solve, for example, -(1+2) + 3!, where the first minus sign is a negation operator, and the “!” is a factorial operator.

The algorithm itself is rather simple: you need an operator stack (a push/pop stack which stores the operators you’re using, such as ‘+’, ‘*’, etc), and an ‘eval’ stack, which can be as simple as a push/pop stack of numbers or a stack of parse trees or whatever intermediate representation you’re using to store the equation. Then it’s just a matter of reading the string as a series of tokens: each value is pushed onto the eval stack (popping prefix operators as needed after the push, updating the eval stack as needed), and each operator is pushed onto the stack, popping and evaluating operators that are of lower precedence. The details are in the Wikipedia article linked above.

What took me four hours to remember–and why I’m upset at myself today–is that this algorithm is not just good for solving math equations, but is useful for any sort of notation which is expressed as a series of prefix, infix and postfix operators.

Such as regular expressions.

When you get right down to it, a regular expression is a series of values (letters or characters you’re matching against), with a series of implicit and explicit operators.

‘Concatenation’ is an implicit operator here. Using # to represent the concatenation operator, it’s easy to see that /ABC/ (which matches the substring “ABC” in DABCDEF) can be represented as ‘A’ (i.e. “match letter ‘A'”) # ‘B’ # ‘C’.

The ‘|’ (or ‘or’ operator), which for the expression /ABC|DEF/ matches either “ABC” or “DEF”, is an explicit infix operator: the regular expression /ABC|DEF/ can be represented by the parse tree (in LISP notation): (| (# A B C) (# D E F)).

And the ‘*’ (match 0 or more), ‘+’ (match 1 or more) and ‘?’ (match 0 or 1) are all postfix operators with higher precedence than the concatenation operator: ABC* matches the string ABCCCCCCCC or AB and all in between.

The point is that all of this (including the ‘[…]’ character group, which is simply a single value object matching multiple letters, and the ‘(…)’, which operates in the same way as the parenthesis in an algebraic statement) can all be parsed using the shunting yard algorithm.

As an aside, what’s really spifilicious about the regular expression algorithm in the paper is the observation that a state doesn’t have to be a single integer value, but can actually be an array of values, with each value representing a specific action to carry out–and a single action can spawn two or more values representing two or more actions to carry out at the same time.

For example if you have a regular expression ‘a*a’, essentially meaning “match 0 or more ‘a’ characters, followed by a single ‘a’ character, and you’re matching against “aaa”, then is the second ‘a’ in that string part of the ‘a*’ (match 0 or more ‘a’ characters), or part of the terminal ‘a’ (match a single ‘a’) character? In the case of the algorithm outlined in the regular expression paper at the top of this article, the answer can be ‘both!’ In other words, you can maintain a state which is an array of vectors, and if we represent our regular expression ‘a*a’ with ‘a*’ as 1, ‘a’ as 2, and ‘done’ as 3, we could have our state go from {1} (read first ‘a’) to {1,2} (read second ‘a’) to {1,3,2} (read third ‘a’) to {1,3,3,2} (read end of string) to {3,3,3}: that is, we can simultaneously be doing multiple things at once.

This, by the way, is the point of the paper: by maintaining multiple “states” at once, or rather, by representing our current match state as a vector rather than backtracking (that is, rewinding the matcher back so we can try an alternate matching strategy), we can solve degenerate matching cases without having the program grind to a halt.

I’m blogging this so I can remember. And what I want to remember is this: that the shunting yard algorithm and the state machine are cornerstone design patterns nearly a half century old. For a regular expression, they both can be brought to bare on the problem: the shunting yard algorithm can be used to build a parse tree from a regular expression, and the parse tree can then be used to build an optimal state machine for matching strings against the regular expression.

And more importantly, they are two forgotten tools in a very old toolbox which are a hell of a lot more powerful than the ‘singleton design pattern’, which, to be honest, barely deserves to be named outside of the halls where computer scientists engage in taxonomy for the purpose of obtaining tenure.