I don’t often use RegExp. But when I do, it’s a variation of this pattern.

Note: This blog post uses JavaScript as an example, but is not JavaScript-specific.

Every now and then you find yourself wanting to extract quoted strings, HTML tags or something in-between curly braces from a bigger string of text. While it would be more robust, maintainable and readable to write proper parser, regular expressions (or RegExp for short) are often chosen because you can just search for a ready-made RegExp and use it. The RegExp-based solutions you find on the internet are often suboptimal and it’s hard to understand why they work.

The naïve way

Let’s consider the quoted string example: Your input might be a piece of JavaScript where you are looking to extract a quoted string:

Input: console.log("Hello world!"); console.log("Hello back!");

Desired matches: ['"Hello world!"', '"Hello back!"']

A person who’s new to RegExps might be thinking that /".*"/gus does the job. Let’s dissect what this RegExp means:

/…/gus — In JavaScript (and many other languages) RegExp are delimited by slashes, followed by mode flags. g means that there may be multiple matches in the same string and we are interested in all of them. u enables Unicode mode which generally makes more sense and s allows . to also match

.

— In JavaScript (and many other languages) RegExp are delimited by slashes, followed by mode flags. means that there may be multiple matches in the same string and we are interested in all of them. enables Unicode mode which generally makes more sense and allows to also match . " — Expect a " character

— Expect a character . — Expect any character

— Expect any character * — Repeat the last operator 0 or more times

— Repeat the last operator 0 or more times " — Expect a " character (again)

Note: In the context of JavaScript, both s and u are fairly new flags and might not be supported in all browsers.

Running this RegExp on our input string gives an unexpected (or undesired) result:

"Hello world!"); console.log("Hello back!"

This is one of those cases where the computer is “technically correct” — the string does have quotes on either end and a series of arbitrary characters in-between — but not actually what we were trying to achieve.

The solution I see most often here is people switching to the “non-greedy” version of * :

/".*?"/gus

This RegExp is the same one as above but tells the * operator from above to “consume” as little as possible, giving us the desired result.

Personally I have trust issues when it comes to non-greedy matchers, but more critically: What happens when we run our RegExp on console.log("Hello \\"world\\"!"); ?

Oh noes.

The trick

The backslash has betrayed us! So what now? This is where the trick I promised comes in. Imma throw my RegExp at you and then I’ll tell you how I got there:

/"([^"\\]|\\.)*"/gus

Yeah, we just made one of those RegExps. Isn’t it beautiful?

The first realization to have is that while /".*?"/gus kinda works, it doesn’t really express what you actually mean. We don’t want to accept any character between our double quotes. We want anything but a double quote. How do we do that? In RegExps you can use character groups to match against an entire set of characters:

[abc] — Expect the character a , b or c

— Expect the character , or [a-z] — Expect any letter between a and z

— Expect any letter between and [^abc] — Expect any character but a , b or c .

With this in mind, we can write our original RegExp without a non-greedy matcher:

/"[^"]*"/gus

This, however, still doesn’t solve our backslash issue. For this we need to augment our statement above: Between our double quotes, we want to accept anything but double quotes but if it’s a backslash we don’t care what the next character is.

And that brings us back to the cryptic original RegExp (with some added spaces for grouping):

/" ( [^"\\] | \\. )* "/gus

(...|...) – Expect any of the listed alternatives, delimited by | (there’s only 2 alternatives here).

– Expect any of the listed alternatives, delimited by (there’s only 2 alternatives here). [^"\\] – Expect anything but a double quote or a backslash.

– Expect anything but a double quote or a backslash. \\. – Expect a backslash and then any character.

The trick here is to offer alternatives that are mutually exclusive. The first alternative cannot match a double quote. If a string with a double quote is supposed to match this RegExp the second alternative has to be the one matching it. And that can only happen, if it’s preceded by a backslash.

…and lo and behold, this RegExp matches strings even with escaped quotes!

Pretty cool, right?

Bonus: HTML tags

HTML tags are a funny one because the (infinite number of) escape sequences for “ > ” don’t contain the character “ > ” (like > ). Because of this most simple RegExps like /<[^>]*>/gus work just fine.

...until you use “ > ” in an attribute value!

Input: <a href="javascript:alert('>');" target="_blank">lol</a>

Desired output: <a href="javascript:alert('>');" target="_blank">

To handle this case, we have to use our new trick twice. The only way a closing tag “ > ” can appear in an HTML tag is inside a string. So our first alternative will accept anything that is not a closing tag or a double quote. The second alternative is our string RegExp from before:

/<([^>"]| string RegExp )*>/gus

or fully written out:

/<([^>"]|"([^"\\]|\\.)*")*>/gus

Oof, quite the mouthful, isn’t it? But it works!

Parting advice

There’s a reason Jamie Zawinski’s quote is so famous:

Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems.

RegExps tend to build into an unmaintainable mess of crypticism very quickly. Use them sparingly and with care. And remember that most language grammars can’t be parsed with RegExp. But most importantly: It’s often better to not try to look smart and rather just do string manipulation with simple method calls.