Regular Expressions Explained

A gentle introduction to regular expressions. Learn the main concepts, common patterns, and functions, with examples.

I bet you have all encountered regular expressions at some point. They are powerful tools supported on many platforms, including programming languages such as Python, R, Java, SQL, and Scala.

As a data scientist or developer, a solid understanding of regex makes many data munging and text mining tasks much easier. Personally, I use them for all sorts of things, mostly when I work with text data or on Natural Language Processing projects.

Regular expressions can seem intimidating at first, but they are very rewarding once you grasp the basics and apply them to your work properly.

What is a Regular Expression (RegEx)?

A regular expression or regex is a text string that defines a search pattern.

"\w+" # this is a regex

Typically, these patterns are used for four main tasks:

Find text within a larger body of text

Validate that a string conforms to a desired format

Replace or insert text

Split strings
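To make the four tasks above concrete, here is a minimal sketch using Python's built-in re module. The sample text and the date pattern are made up for illustration; re.fullmatch() and re.sub() are standard re functions that this article does not cover in detail.

```python
import re

# Made-up sample text, used only for illustration
text = "Order 66 shipped on 2021-03-04"

# 1. Find: locate the first run of digits in the text
found = re.search(r"\d+", text).group()

# 2. Validate: check that a whole string looks like YYYY-MM-DD
is_date = re.fullmatch(r"\d{4}-\d{2}-\d{2}", "2021-03-04") is not None

# 3. Replace: mask every digit with '#'
masked = re.sub(r"\d", "#", text)

# 4. Split: break the text on runs of whitespace
parts = re.split(r"\s+", text)

print(found)    # 66
print(is_date)  # True
print(masked)   # Order ## shipped on ####-##-##
print(parts)    # ['Order', '66', 'shipped', 'on', '2021-03-04']
```

Note that the date check only validates the shape of the string, not whether it is a real calendar date.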

Let’s take a quick look at some common regex patterns before we apply them in our code.

Common Patterns

Earlier, we saw this regex example:

\w+

“w” here stands for word character, and “+” means one or more. The backslash character “\” is the escape character for regular expressions; here it turns the plain letter “w” into the special class \w. So this regex matches one or more word characters, that is, letters, digits, or underscores. Now, suppose we want to match the first word in a string using Python. First, we import the re module.

import re

Then we define a pattern and use the function re.match() to match the word at the start of the string:

# define a regex pattern (a raw string keeps the backslash intact)
word_regex = r'\w+'
re.match(word_regex, 'hello world!') # matches the word at the start of the string

>>><re.Match object; span=(0, 5), match='hello'>

Some common patterns:

\w matches word characters

\d matches digits, while \D matches non-digit characters

\s matches whitespace characters, while \S matches non-whitespace characters

. (dot) matches any character except a newline (wildcard)

[a-z] matches any lowercase letter; square brackets [] match any one of the characters listed inside them

Use () to define a group

Use [] to define explicit character ranges
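Here is a small sketch that tries each of these patterns on one string. The sample string is made up for illustration, and re.findall() (covered later in this article) is used only because it makes the matches easy to see.

```python
import re

# Made-up sample string, used only for illustration
sample = "Room 42, floor 3"

digits = re.findall(r"\d+", sample)       # runs of digits
non_digits = re.findall(r"\D+", sample)   # runs of non-digit characters
words = re.findall(r"\w+", sample)        # runs of word characters
spaces = len(re.findall(r"\s", sample))   # count single whitespace characters
lower = re.findall(r"[a-z]+", sample)     # runs of lowercase letters
wild = re.findall(r"f...r", sample)       # 'f', any three characters, then 'r'

# Parentheses define groups that can be read back separately
label, number = re.search(r"(\w+) (\d+)", sample).groups()

print(digits)         # ['42', '3']
print(non_digits)     # ['Room ', ', floor ']
print(words)          # ['Room', '42', 'floor', '3']
print(spaces)         # 3
print(lower)          # ['oom', 'floor']
print(wild)           # ['floor']
print(label, number)  # Room 42
```

Notice that [a-z]+ returns 'oom' rather than 'Room', because the uppercase 'R' is outside the range.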

Now that you have some regex patterns in hand, let’s move on to some important functions.

The match() function

This function tries to match the pattern at the beginning of the string. It returns a match object on success and None on failure.

re.match(r'\w+', 'hello world!')

>>><re.Match object; span=(0, 5), match='hello'>
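As a quick sketch of how the returned match object can be used, and of what failure looks like, consider the following. The variable names are my own; group() and span() are standard methods of match objects.

```python
import re

m = re.match(r"\w+", "hello world!")
first_word = m.group()  # the matched text
position = m.span()     # (start, end) indexes of the match

# match() only tries at the very beginning of the string,
# so a leading space is enough to make it fail
no_match = re.match(r"\w+", " hello world!")

print(first_word)  # hello
print(position)    # (0, 5)
print(no_match)    # None
```

Because a failed match returns None, it is usually worth checking the result before calling group() on it.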

The findall() function

This function returns a list of all non-overlapping matches of the pattern in the string, in the order they are found when scanning from left to right.

re.findall(r'[A-Z]\w+', 'hello World!')

>>>['World']
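One detail worth knowing: when the pattern contains groups, findall() returns the groups instead of the whole match. A minimal sketch, using made-up addresses and a deliberately simplified email pattern (not a real email validator):

```python
import re

# Made-up addresses, used only for illustration
text = "alice@example.com, bob@test.org"

# No groups: each item is the whole match
whole = re.findall(r"\w+@\w+\.\w+", text)

# Two groups: each item is a tuple of (user, domain)
pairs = re.findall(r"(\w+)@(\w+\.\w+)", text)

print(whole)  # ['alice@example.com', 'bob@test.org']
print(pairs)  # [('alice', 'example.com'), ('bob', 'test.org')]
```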

The search() function

The search() function scans through the string, looking for the first location where the pattern matches. It returns a match object on success and None on failure. It is like the match() function, but it looks anywhere in the string rather than only at the beginning. See search() vs match() for more details.

re.search('ef', 'abcdef')

>>><re.Match object; span=(4, 6), match='ef'>
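The difference between the two functions is easiest to see side by side. A small sketch, with variable names of my own choosing:

```python
import re

text = "abcdef"

by_search = re.search(r"ef", text)  # finds 'ef' at the end of the string
by_match = re.match(r"ef", text)    # fails: the string does not start with 'ef'

# search() stops at the first hit, even if there are more
first_digit = re.search(r"\d", "a1b2").group()

print(by_search.span())  # (4, 6)
print(by_match)          # None
print(first_digit)       # 1
```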

The split() function

This function splits the string at each occurrence of the pattern.

re.split(r'\s+', 'hello world this is andre')

>>>['hello', 'world', 'this', 'is', 'andre']
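Splitting on a pattern is more flexible than str.split() because the separator itself can vary. A short sketch with made-up input; maxsplit is a standard optional argument of re.split():

```python
import re

# Split on a comma or semicolon, followed by any amount of whitespace
items = re.split(r"[,;]\s*", "a, b;c,  d")

# maxsplit limits the number of splits; the rest of the string stays intact
limited = re.split(r"\s+", "hello world this is andre", maxsplit=2)

print(items)    # ['a', 'b', 'c', 'd']
print(limited)  # ['hello', 'world', 'this is andre']
```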

Random Thoughts

Who else loves Regex?

I love regular expressions. However, it is important to remember that while regexes are very powerful tools, they are extraordinarily easy to overuse. A few things to note: