Rosie Pattern Language: Improving on 50-Year-Old Regular Expression Technology

Regular expressions are everywhere, including in the inner loops of most data mining code. But they don't scale! Almost every implementation uses backtracking, which can take exponential time on some inputs. That can stall mining of big data, where input format anomalies are likely. And building collections of regex is fraught, because they don't compose. Perhaps most importantly, regex don't scale to teams of people, because they are famously hard to read, understand, and maintain.
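To make the backtracking cost concrete, here is a minimal sketch in Python. It is a toy model, not the code of any real regex engine: the function name `count_backtracking_steps` and the pattern `(a|aa)*b` are chosen purely for illustration. It counts how many attempts a naive backtracking matcher makes when the input is a run of `a` characters with no closing `b`, so every way of splitting the run has to be tried before the match fails.

```python
def count_backtracking_steps(text: str) -> int:
    """Count the attempts a naive backtracking matcher makes for the
    pattern (a|aa)*b, anchored at both ends of `text`.

    Illustrative toy model only; real engines differ in detail.
    """
    steps = 0

    def match_from(pos: int) -> bool:
        nonlocal steps
        steps += 1
        # Try to finish the match: exactly one 'b' must remain.
        if text[pos:] == "b":
            return True
        # First alternative of the group: consume a single 'a'.
        if text.startswith("a", pos) and match_from(pos + 1):
            return True
        # Second alternative: back up and consume 'aa' instead.
        if text.startswith("aa", pos) and match_from(pos + 2):
            return True
        return False

    match_from(0)
    return steps


if __name__ == "__main__":
    # Inputs with no closing 'b' force every split of the run to be tried.
    for n in (5, 10, 15, 20):
        print(n, count_backtracking_steps("a" * n))
```

In this model the step count roughly follows the Fibonacci sequence as the input grows, so adding a few characters to a malformed record multiplies the work rather than adding to it.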

The Rosie Pattern Language (RPL) addresses all of these scale challenges: big data is processed in linear time in the input size; packages of composable patterns are easily shared; and it has a readable syntax, with named patterns, flexible whitespace, and comments, like a programming language.

You can see the advantage of named patterns even on the command line. To extract network addresses from a file, would you rather type this:

egrep -o '(([0-9]{1,3})([.][0-9]{1,3}){2}|\w+([.]\w+)+)'

(hoping you got it right, and keeping in mind that this pattern won't match IPv6 addresses), or would this be easier:

rosie grep -o subs net.any

RPL is based on Parsing Expression Grammars, which are more powerful than regular expressions but share many of their features. (Unlike regex, PEGs can match recursively defined data, like HTML/XML, JSON, and more.) RPL will look familiar to people who use regex, and is meant to be easy to use for everyone. It is open source software, under the MIT license.
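As a rough illustration of what "recursively defined data" means, the sketch below hand-codes a single PEG-style rule, `nested <- "[" nested* "]"`, in Python. It is not RPL syntax and not how Rosie is implemented; the names `match_nested` and `is_balanced` are hypothetical. The rule refers to itself, which is what lets it follow brackets nested to any depth, something a classical regular expression cannot express.

```python
from typing import Optional


def match_nested(text: str, pos: int = 0) -> Optional[int]:
    """Match the PEG-style rule  nested <- "[" nested* "]"  at `pos`.

    Returns the index just past the match, or None if there is no match.
    """
    if not text.startswith("[", pos):
        return None
    pos += 1
    # nested*: greedily match inner groups. As in a PEG, the repetition
    # never gives back input it has already consumed.
    while True:
        end = match_nested(text, pos)
        if end is None:
            break
        pos = end
    if text.startswith("]", pos):
        return pos + 1
    return None


def is_balanced(text: str) -> bool:
    """True if the whole string is one balanced bracket group."""
    return match_nested(text) == len(text)


if __name__ == "__main__":
    for sample in ("[]", "[[][[]]]", "[[]", "[]]"):
        print(repr(sample), is_balanced(sample))
```

Each position in the input is examined a bounded number of times here, so this particular matcher runs in time linear in the input length.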

Jamie Jennings

IBM

Jamie received her Ph.D. in Computer Science from Cornell University, and has held positions in academia and industry. As a Senior Technical Staff Member at IBM, she led the creation of an international technical standard for over-the-air updates of cell phone software, a standard in wide use by the mobile industry today. She has been chief architect for a number of IBM products, and has also done protocol design and security engineering work in her 20-year career as a technologist. These days, she is part of an Advanced Technology team in IBM's Cloud division. In her spare time she plays ice hockey and writes compilers.