Date Thu 30 July 2015 Tags python / regex

A while back, I needed to parse some strings that were created using Python's string formatter. There's actually a package called parse that's intended to do just that: reuse the string formatting syntax to extract data from a string. Unfortunately, there were a couple of reasons that prevented me from using that package, so I rolled a quick version of my own using regular expressions (i.e. "regexes").

Note that this article assumes you're already familiar with the basic syntax of regexes. If you're not familiar with regexes, http://regexone.com/ offers a good interactive tutorial.

Named regex patterns

For this use case, I'm going to be using named regex patterns (a.k.a. named capturing groups). Unfortunately, the syntax for named regex patterns hurts my eyes. To fix that, let's create a really simple name_regex function that makes it more readable:

    def name_regex(name, pattern):
        """Return regex string as a named capture group."""
        return r'(?P<{name}>{pattern})'.format(name=name, pattern=pattern)

To clarify what's going on, you can just pass in two strings:

    print(name_regex('myname', 'Tony'))

which gives:

    (?P<myname>Tony)

This string can be used as a regex to find the desired pattern ('Tony') and store the result as a named group ('myname').

For a more interesting example, suppose you want to extract a price from some text. You could look for a dollar sign, followed by numbers and decimal points.

    # This isn't a great regex pattern for a price because any number of decimal
    # points and digits are accepted, but let's keep this simple.
    rx_price = name_regex('price', r'\$[\d.]+')

You can just use this like any other regex pattern with Python's built-in regex package, re:

    import re

    match = re.search(rx_price, "All yours for only $9.95!")
    print(match.groupdict()['price'])

which prints:

    $9.95

That extracted the text we wanted, but saving a named regex isn't that useful if you're looking for a single value.
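As an aside on the caveat in the price example's comment: the loose pattern will happily swallow a trailing period at the end of a sentence. Here is a minimal sketch of a tighter alternative. The stricter pattern and the sample text are my own assumptions for illustration (dollars, then an optional two-digit cents part); real price formats may need more than this.

```python
import re


def name_regex(name, pattern):
    """Return regex string as a named capture group (same helper as above)."""
    return r'(?P<{name}>{pattern})'.format(name=name, pattern=pattern)


# Loose pattern: any run of digits and dots after the dollar sign.
rx_loose = name_regex('price', r'\$[\d.]+')
# Stricter, hypothetical alternative: digits, then optionally '.' and two digits.
rx_strict = name_regex('price', r'\$\d+(?:\.\d{2})?')

text = "It costs $9.95."  # made-up sample sentence ending in a period

loose = re.search(rx_loose, text).group('price')
strict = re.search(rx_strict, text).group('price')

print(loose)   # includes the sentence's trailing period
print(strict)  # stops after the cents
```

The loose pattern returns '$9.95.' (with the final period), while the stricter one returns '$9.95'.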

Named regexes with string formatting

Instead of creating a single named regex, let's create a dictionary with (name, pattern) pairs:

    def named_regexes(**names_and_patterns):
        """Return dictionary with regexes transformed into named capture groups."""
        return {k: name_regex(k, p) for k, p in names_and_patterns.items()}

If that looks a bit strange, we're just packing arbitrary keyword arguments into a dictionary and applying a dictionary comprehension to that dictionary. We can use this to create regexes for parts of a timestamp:

    # Note: [A-Za-z] rather than [A-z], which would also match the
    # punctuation characters that sit between 'Z' and 'a' in ASCII.
    rx_letters = r'[A-Za-z]+'
    rx_patterns = named_regexes(
        month=rx_letters,            # any letters
        day=r'\d{1,2}',              # 1 or 2 digits
        time=r'\d{2}:\d{2}:\d{2}',   # 3 pairs of digits separated by ':'
        year=r'\d{4}',               # 4 digits
    )

The result looks like:

    from pprint import pprint
    pprint(rx_patterns)

    {'day': '(?P<day>\\d{1,2})',
     'month': '(?P<month>[A-Za-z]+)',
     'time': '(?P<time>\\d{2}:\\d{2}:\\d{2})',
     'year': '(?P<year>\\d{4})'}

That's not really readable (pprint shows the repr of each string, so backslashes appear doubled), but the point is to actually use it. For example, let's consider the following timestamp:

    timestamp = "Date: Apr 12 09:51:23 2015 -0500"

We can parse data from it with a format string and the dictionary of regex patterns that we just defined:

    rx_timestamp = "Date: {month} {day} {time} {year}".format(**rx_patterns)
    print(re.search(rx_timestamp, timestamp).groupdict())

    {'month': 'Apr', 'year': '2015', 'day': '12', 'time': '09:51:23'}

Success! We've extracted the data we wanted in a form that's easy to use.
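If the same pattern has to be applied to many strings, it may be worth compiling it once and checking for failed matches, since re.search returns None when a line doesn't fit. The following is only a sketch: the extra input lines are invented for illustration, and the pattern is written out by hand (with the same group names as above) to keep the block self-contained.

```python
import re

# Hypothetical input lines, made up for this example.
lines = [
    "Date: Apr 12 09:51:23 2015 -0500",
    "Date: Jul 4 18:00:00 2014 +0000",
    "Not a timestamp at all",
]

# Same named groups as the timestamp example, compiled once for reuse.
rx_timestamp = re.compile(
    r"Date: (?P<month>[A-Za-z]+) (?P<day>\d{1,2}) "
    r"(?P<time>\d{2}:\d{2}:\d{2}) (?P<year>\d{4})"
)

parsed = []
for line in lines:
    match = rx_timestamp.search(line)
    if match is None:
        # search() returns None when the line does not fit the pattern.
        continue
    parsed.append(match.groupdict())

print(len(parsed))
```

Only the first two lines match, so parsed ends up with two dictionaries. Note that every extracted value is still a string; converting 'day' or 'year' to an integer is left to the caller.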

Putting it all together

Let's wrap this up into a single function that takes a string containing the data, a template string, and named regexes, and returns a dictionary of the interesting data:

    # Message shown when the pattern fails to match; it includes everything
    # needed to debug the regex.
    error_message = """
    string: {string}
    template: {template}
    pattern: {pattern}
    """

    def match_regex_template(string, template, **keys_and_patterns):
        """Return dictionary of matches.

        Parameters
        ----------
        string : str
            String containing desired data.
        template : str
            Template string with named fields.
        **keys_and_patterns : str
            Regexes for each field in the template.
        """
        named_patterns = named_regexes(**keys_and_patterns)
        pattern = template.format(**named_patterns)
        match = re.search(pattern, string)
        if match is None:
            raise RuntimeError(error_message.format(string=string,
                                                    template=template,
                                                    pattern=pattern))
        return match.groupdict()

All this really does is combine the pieces that we discussed above. Inevitably, you'll run into errors when writing regexes, so there's also a bit of error handling to help with debugging.

To test this out, let's do a roundtrip. First, we take a template string, plus some data, and produce an output string.

    greeting_template = "Hey {name}! Welcome to {site}!"
    input_attrs = dict(name='you', site='tonysyu.github.io')
    greeting = greeting_template.format(**input_attrs)
    print(greeting)

    Hey you! Welcome to tonysyu.github.io!

Then let's take the output string and extract the data using match_regex_template.

    rx_anything = '.+'
    attrs = match_regex_template(greeting, greeting_template,
                                 name=rx_anything, site=rx_anything)
    print(attrs)

    {'name': 'you', 'site': 'tonysyu.github.io'}

Success!
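One caveat with this approach: the template's literal text is itself interpreted as a regex, so characters like parentheses or dots in the template take on their regex meaning. The greeting template above happens to avoid them. A possible workaround, sketched below, is to escape the literal parts of the template before formatting. The escape_template helper is hypothetical (it is not part of the code above), and it assumes plain named fields without format specs or conversions; it relies only on the standard library's string.Formatter.parse and re.escape.

```python
import re
from string import Formatter


def escape_template(template):
    """Escape regex metacharacters in a template's literal text.

    Hypothetical helper for illustration. Assumes plain named fields
    such as {price}, without format specs or conversions.
    """
    parts = []
    for literal, field, _spec, _conversion in Formatter().parse(template):
        # Escape the literal text, then double any braces so that a later
        # str.format call does not mistake them for fields.
        escaped = re.escape(literal)
        parts.append(escaped.replace('{', '{{').replace('}', '}}'))
        if field is not None:
            parts.append('{' + field + '}')
    return ''.join(parts)


rx_price = r'(?P<price>\$\d+(?:\.\d{2})?)'
template = "Total (USD): {price}"  # made-up template containing parentheses
text = "Total (USD): $9.95"

# Unescaped, "(USD)" is read as a regex group, so the literal parentheses
# in the text are never matched and the search fails.
unescaped_match = re.search(template.format(price=rx_price), text)

# Escaped, the parentheses are matched literally.
pattern = escape_template(template).format(price=rx_price)
escaped_match = re.search(pattern, text)
print(escaped_match.groupdict())
```

Here the unescaped search returns None, while the escaped version prints {'price': '$9.95'}. Whether escaping belongs inside match_regex_template or should stay a separate step is a design choice; keeping it separate leaves room for templates that deliberately contain regex syntax.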