Introduction

Regular expressions (regexes) are a handy tool for text processing. People use them for things like input validation. For example, the question “Does the string the user entered look like an email?” can be answered using a regex.

Since regexes are generally deemed pretty useful, they are built into most modern programming languages. Each programming language has its own regex engine that accepts a regex and a string and decides whether or not the regex matches the string.

As it turns out, it can be expensive to determine whether or not a regex matches a string. Regex matches in most programming languages have worst-case exponential complexity and can also have polynomial complexity (quadratic, cubic, etc.).

Regular Expression Denial of Service (ReDoS)

When one of these expensive regex matches can be triggered by a malicious party, it may cause Regular Expression Denial of Service (ReDoS). If computational resources can be diverted to an expensive regex match instead of to legitimate users, this will deny service to legitimate users.

This type of expensive behavior has been discussed in the academic community since the 1970s, and as far as I know its relevance to web security issues was first noted in 2003. However, there have been no large-scale studies of ReDoS in the wild and I had always thought of it more as an academic parlor trick than as a serious problem in practice. However, during my Sense of Time project I was surprised to find that about 10% of the security vulnerabilities reported in npm were ReDoS vulnerabilities. This made me wonder if ReDoS might be a serious problem in practice.

ReDoS regex example 1: Exponential usernames from Microsoft

During some exploratory work we found this regex in a Microsoft project. It is used to match Windows usernames.

/^[a-zA-Z0–9]+([._]?[a-zA-Z0–9]+)*$/

In many languages, this regex is exponential on the input ‘aaa…aaa!’. Here’s an example you can try in Node.js:

/^[a-zA-Z0–9]+([._]?[a-zA-Z0–9]+)*$/.exec('a'.repeat(100) + '!')

ReDoS regex example 2: Quadratic emails in Django

We found this regex in Django. It was used to match emails.

/^\S+@\S+\.\S+$/

In many languages, this regex is quadratic on the input ‘@@@…@@@’. Here’s an example you can try in Node.js:

/^\S+@\S+\.\S+$/.exec('@'.repeat(100000))

Why do these regex matches take so long?

Most regex engines use a naive backtracking algorithm that works something like this:

Match characters like you would expect.

Any time you have a choice, try one path. If it doesn’t work, try the other one later.

If you find a match, return success.

This algorithm means that if there are more than a linear number of possible paths to try, then the regex engine will take more than linear time.

Here is an example regex with worst-case exponential behavior. It is a simplified version of the Microsoft regex for usernames shown above.

Railroad diagram for /^(?:a+)+$/. Generated by https://regexper.com

Let’s think about its behavior on the string “aaa!”.

Every time the regex engine sees an “a” it can take either the inner loop or the outer loop from the “a” vertex.

This means that every “a” it sees will double the number of paths it will try on a mismatch: “all of the paths it was already going to try, followed by either the inner loop or the outer loop on this particular ‘a’”.

the number of paths it will try on a mismatch: “all of the paths it was already going to try, followed by either the inner loop or the outer loop on this particular ‘a’”. The final “!” ensures that the regex engine will not find a match along any of these paths.

So with three “a”s we have 2*2*2 paths to try. If we add another “a” we will have (2*2*2)*2 paths to try. For inputs of this form the behavior will be exponential in the number of “a”s.

Collecting regexes

As I mentioned earlier, I was suspicious about the possibility of lots of ReDoS regexes in the wild after my previous study on denial of service attacks in npm modules. To check how common ReDoS regexes are in the wild, we started out by collecting all of the regexes we could get our hands on.

This figure shows how we collected regexes for our analysis.

Get the list of all modules in the npm (JavaScript) and pypi (Python) registries.

Keep the ones that have repositories we can clone.

Statically extract regexes from JavaScript (npm) and Python (pypi) files. We found that regexes were used pretty widely: 45% of the npm modules we studied, and 35% of the pypi modules we studied, included at least one regex.

Reduce to (giant) lists of unique regexes.

Regex collection process

Identifying ReDoS Regexes

This figure shows how we identified ReDoS regexes from our giant lists of regexes (“corpuses”).

Apply ReDoS regex detectors (see here for drivers).

Dynamically validate the regexes they think are dangerous in the language of interest (npm → Node.js; pypi → Python).

Determine how many modules contained those regexes in their GitHub/etc. project.

We found similar results in npm and pypi.

How we identified ReDoS regexes

Prominent places we found ReDoS regexes

We found ReDoS regexes (eek!) in:

MongoDB

Hapi

Django

We also found them (double eek!) in:

Node.js core libraries

Python core libraries

Happily, all of these have now been fixed.

How vulnerable are ReDoS regexes?

We calculated the time it took the regex engines to perform regex matches on these ReDoS regexes, for varying-sized inputs. Most of the regexes had quadratic complexity. A few had exponential or cubic.

Again, we found similar results in npm and pypi.

Degree of vulnerability of ReDoS regexes

How do people fix ReDoS regexes?

We wanted to understand how people fix ReDoS regexes after they identify them in their code. We looked at the fixes for the 37 ReDoS vulnerabilities reported in the CVE and Snyk.io databases.

Developers used one of three strategies:

Trim Don’t touch the regex. Just trim the input so it’s not too long, thus bounding the worst-case behavior. This works OK provided that (1) legitimate input shouldn’t be “too long” (e.g. 500 characters?); and (2) the worst-case behavior is not exponential.

Don’t touch the regex. Just trim the input so it’s not too long, thus bounding the worst-case behavior. This works OK provided that (1) legitimate input shouldn’t be “too long” (e.g. 500 characters?); and (2) the worst-case behavior is not exponential. Revise Tweak the regex so it is no longer vulnerable to ReDoS but it still matches the same language. Or similar. Be careful — we saw several cases where a revised regex became safe to one input but vulnerable to a new style of input.

Tweak the regex so it is no longer vulnerable to ReDoS but it still matches the same language. Or similar. Be careful — we saw several cases where a revised regex became safe to one input but vulnerable to a new style of input. Replace Replace the regex with string operations.

How developers fix ReDoS regexes

We then sent vulnerability disclosures to 284 projects describing possible ReDoS regexes within them. 48 of the projects responded with a fix. “Revise” remained the most popular option. Interestingly, many of these revised regexes did not match the same set of strings that the original regex did.

Conclusions

We believe our study shows that