I hate the Pumping Lemma for regular languages. It’s a complicated way to express an idea that is fundamentally very simple, and it isn’t even a very good way to prove that a language is not regular.

Here it is, in all its awful majesty: for every regular language L, there exists a positive whole number p such that every string w∈L that has p characters or more can be broken down into three substrings xyz, where y is not the empty string and the total length of xy is at most p, and for every natural number i the string $xy^iz$ is also in L.



Did you understand that statement? I’d be willing to bet you didn’t, if you haven’t seen it before. It has a ferociously intimidating logical structure, with no fewer than five alternating quantifiers: “for every L … there exists p … such that for every w … there exist x, y and z … such that for every i …”. Beginning students of analysis are apt to struggle with the definition of continuity, because it takes a while to get used to having two nested quantifiers: for every epsilon > 0 there exists a delta > 0, etc. If two are a struggle, five is cruelty.

The real insult is that the actual underlying idea, and its proof, are shockingly simple. It is essentially the pigeonhole principle: the principle that if you put more than n pigeons into n holes then there must be a hole with more than one pigeon in. Take the regular language L, and express it as a deterministic finite automaton with p states. Any string in L determines a path through the automaton; a string with p or more characters passes through at least p+1 states (counting the start state), so it must visit some state twice, forming a loop. This looped part can be repeated any number of times, or skipped altogether, to produce other strings in L.
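The loop argument can be sketched in a few lines of Python. This is only an illustration under assumptions of mine: the automaton is given as a dictionary `delta` from (state, character) pairs to states, and the names `find_pump`, `accepts` and the even-number-of-a's automaton `DELTA` are hypothetical, not part of any library.

```python
def find_pump(delta, start, w):
    """Split w into (x, y, z) so that reading y traces a loop in the DFA.

    Returns None if no state repeats, which can only happen when w is
    shorter than the number of states.
    """
    seen = {start: 0}  # state -> position at which it was first reached
    state = start
    for pos, ch in enumerate(w, start=1):
        state = delta[(state, ch)]
        if state in seen:
            first = seen[state]
            return w[:first], w[first:pos], w[pos:]
        seen[state] = pos
    return None


def accepts(delta, start, accepting, w):
    """Run the DFA on w and report whether it ends in an accepting state."""
    state = start
    for ch in w:
        state = delta[(state, ch)]
    return state in accepting


# Hypothetical two-state automaton: strings over {a, b} with an even number of a's.
DELTA = {
    ("even", "a"): "odd", ("even", "b"): "even",
    ("odd", "a"): "even", ("odd", "b"): "odd",
}

x, y, z = find_pump(DELTA, "even", "abba")
pumped = [x + y * i + z for i in range(5)]
```

Every string in `pumped` is accepted by the same automaton, because repeating `y` just goes round the loop again.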

If you understand the idea, it is easy to write down the incomprehensible formal statement of it. If you do not, the formal statement is not likely to lead you to enlightenment.

But the world is full of ways to express simple ideas in complicated ways. Why is this particular one foisted upon Computer Science students? Because it is used to show that some languages are not regular, which is important both in theory and in practice.

This leads to the second reason for my distaste for the Accursed Lemma: there are better ways to prove non-regularity, which are more powerful, give more insight into the nature of regularity, and are more straightforward to state. They are omitted from the typical undergraduate curriculum, presumably because by the time the poor students have understood the hideous Pumping Lemma, there is no time left.

Here is my favourite: one might call it the Myhill-Nerode Theorem à la Brzozowski. Brzozowski invented the marvellous idea of differentiating formal languages. Let L be a language: not necessarily a regular language, just any set of strings; let w be a string, not necessarily in L. Then the derivative d/dw (L) is the language { v | wv ∈ L }.

For example, if L is the set of English words then d/d‘w’ (L) is the words that start with ‘w’, with the ‘w’ removed. So it contains such strings as “ord” and “hy” and “ibble”, and even “anker”.
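For a finite set of strings the derivative is a one-liner, which may help make the definition concrete. The sketch below assumes the language is a Python set; the function name `derivative` and the word list `WORDS` are illustrative choices of mine.

```python
def derivative(language, w):
    """Derivative of a finite language by w: { v | w + v is in language }."""
    return {v[len(w):] for v in language if v.startswith(w)}


# A tiny stand-in for "the set of English words".
WORDS = {"word", "why", "wibble", "cat", "w"}

print(sorted(derivative(WORDS, "w")))  # ['', 'hy', 'ibble', 'ord']
```

Note that the empty string appears in the result because "w" itself is in the set: a derivative contains the empty string exactly when the string you differentiated by is in the language.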

It’s easy enough to see that any derivative of a regular language is again regular: taking a derivative just corresponds to changing the start state in a deterministic automaton. By the same argument, any regular language has only a finite number of different derivatives.

The really good thing is that it works the other way round as well: if a language has only a finite number of different derivatives, then it is regular; if it has infinitely many, it is not. Again the proof is easy: given a language L with a finite number of different derivatives, form a deterministic automaton with a state for each derivative. Put an edge labelled ‘x’ from A to B just when d/d‘x’ (A) = B. The start state is L itself, and any derivative that contains the empty string is marked as an accepting state.
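The construction can be tried out on a finite language, where derivatives are easy to compute and there are certainly only finitely many of them. This is a sketch under that assumption, not a general implementation: the names `build_dfa` and `accepts` and the sample language are mine.

```python
from collections import deque


def derivative(language, w):
    """Derivative of a finite language by w, as a hashable frozenset."""
    return frozenset(v[len(w):] for v in language if v.startswith(w))


def build_dfa(language, alphabet):
    """Build a DFA with one state per distinct derivative of `language`."""
    start = frozenset(language)
    states = {start}
    transitions = {}
    queue = deque([start])
    while queue:
        a = queue.popleft()
        for ch in alphabet:
            b = derivative(a, ch)      # edge labelled ch goes from A to d/d ch (A)
            transitions[(a, ch)] = b
            if b not in states:
                states.add(b)
                queue.append(b)
    # A derivative is accepting exactly when it contains the empty string.
    accepting = {s for s in states if "" in s}
    return start, transitions, accepting


def accepts(dfa, w):
    """Run the derivative automaton on w."""
    start, transitions, accepting = dfa
    state = start
    for ch in w:
        state = transitions[(state, ch)]
    return state in accepting


DFA = build_dfa({"ab", "abb", "b"}, "ab")
```

The empty set turns up as a state too: it is the derivative by any string that is not a prefix of a word in the language, and it serves as the dead state.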

So this condition completely characterises the regular languages, unlike the Pumping Lemma. There are non-regular languages that are nevertheless pumpable, but they will still have infinitely many different derivatives.

In practice it is usually easy to use, as well. Take the classic example of $L = \{ a^i b^i \mid i \in \mathbb{N} \}$. Clearly $d/d(a^n)(L) = \{ a^i b^{n+i} \mid i \in \mathbb{N} \}$ for all n, and these are all different (the string $b^n$ belongs to the n-th one and to no other), so there are infinitely many derivatives and L cannot be regular.
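The claim that these derivatives are pairwise different can be checked mechanically for small n. The sketch below does not compute the (infinite) derivatives themselves; it only tests membership, using the fact that v is in the derivative by w exactly when wv is in L. The function names and the bound `N` are assumptions of mine.

```python
def in_L(s):
    """Membership test for L = { a^i b^i : i >= 0 }."""
    half, rem = divmod(len(s), 2)
    return rem == 0 and s == "a" * half + "b" * half


def in_derivative(n, v):
    """v is in the derivative of L by a^n exactly when a^n v is in L."""
    return in_L("a" * n + v)


N = 6
# table[n][m] records whether b^m belongs to the derivative of L by a^n.
table = [[in_derivative(n, "b" * m) for m in range(N)] for n in range(N)]
```

The table is True only on the diagonal, so each of the first N derivatives contains a string that none of the others do.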

Just to rub in how this is better than the Pumping Lemma, let’s look at the example from Wikipedia of a language where the Pumping Lemma fails. It is L = A∪B∪C, over the alphabet {0,1,2,3}, where A is the strings that contain a doubled digit (00, 11, 22, 33), B is the strings that somewhere in them have the same digit twice with another digit in between (010, 020, 030, 101, 121, etc.); and C is the strings precisely 1/7 of whose digits are 3s.

We can express C as $\{ w \in \{0,1,2,3\}^* \mid n_{\{0,1,2\}}(w) = 6\,n_{\{3\}}(w) \}$. Here I am writing $n_S(w)$ to mean the number of characters of w that belong to S, where S is a subset of the alphabet.

Now for every natural number i, let’s compute $d/d((0123)^i)(C)$: it is

$\{ w \in \{0,1,2,3\}^* \mid n_{\{0,1,2\}}((0123)^i w) = 6\,n_{\{3\}}((0123)^i w) \}$

$= \{ w \in \{0,1,2,3\}^* \mid 3i + n_{\{0,1,2\}}(w) = 6i + 6\,n_{\{3\}}(w) \}$

$= \{ w \in \{0,1,2,3\}^* \mid n_{\{0,1,2\}}(w) = 3i + 6\,n_{\{3\}}(w) \}$

These languages are all different, because for example for each i you can see that $d/d((0123)^i)(C)$ contains the string $(012)^i$, but none of the others do. What is more, none of the strings $(012)^i$ are in any $d/d((0123)^k)(A \cup B)$, therefore $d/d((0123)^i)(L)$ contains $(012)^j$ just when i=j, hence L has infinitely many derivatives and is not regular.
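The last step can be sanity-checked by brute force for small i and j. This is a sketch that encodes A, B and C directly from the descriptions above, with strings of digit characters standing for words over {0,1,2,3}; the function names are mine, and checking a finite range is of course only a spot check, not a proof.

```python
def in_A(s):
    """Contains a doubled digit such as 00 or 33."""
    return any(s[k] == s[k + 1] for k in range(len(s) - 1))


def in_B(s):
    """Contains the same digit twice with a different digit in between."""
    return any(s[k] == s[k + 2] != s[k + 1] for k in range(len(s) - 2))


def in_C(s):
    """Precisely 1/7 of the digits are 3s."""
    return 7 * s.count("3") == len(s)


def in_L(s):
    return in_A(s) or in_B(s) or in_C(s)


BOUND = 6
# hits[i][j] records whether (012)^j is in the derivative of L by (0123)^i,
# that is, whether (0123)^i (012)^j is in L.
hits = [[in_L("0123" * i + "012" * j) for j in range(BOUND)] for i in range(BOUND)]
```

Within this range `hits[i][j]` is True exactly when i equals j, matching the argument.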

PS. I don’t have anything against the pumping lemma for context-free languages.

PPS. This is not a complaint about my own education. I was taught finite automata by an enlightened lecturer who explained the simple idea behind the Pumping Lemma clearly and briefly, and had the good taste not to state it formally. It is driven rather by compassion for those less fortunate.

Update: For a dissenting view, see this post by Lance Fortnow.