Posted May 17, 2011

Beyond the LIKE Operator: Advanced Pattern Matching with MySQL

By Rob Gravelle

Don't get me wrong; the LIKE operator is great for finding words or phrases within strings. I've used it for those times that you need to match complex string patterns in MySQL. But you'll be happy to learn that it provides a form of advanced pattern matching that is based on extended regular expressions used by Unix utilities such as vi, grep and sed.

The LIKE Operator

With LIKE, you can test for simple patterns using wildcards. It returns 1 if the expression matches the pattern; otherwise it returns 0. If either the expression or pattern is NULL, the result is also NULL. You can assign your own escape character instead of the backslash character (\). Here is the syntax:



expression LIKE pattern [ESCAPE 'escape_char']



LIKE recognizes the following two wildcard characters:

_ matches one character

% matches any number of characters, including zero characters

I recently used the LIKE statement to check email contact display names where users who were members of either the local network (LAN) or a partner network. The local contacts were in the format of "LastName, FirstName." Those who belonged to the other department's LAN were displayed as "LastName.FirstName." To match both these cases, the following statement was used:



SELECT user FROM user_data WHERE email_display LIKE 'Gravelle%Robert';



That single statement would return rows where the email_display was formatted as either "Gravelle, Robert" or Gravelle.Robert@GravelleConsulting.com.

Introducing the REGEXP operator

The regular expression, or regexp, is well known across many programming languages. Languages that support the regexp include C, C++, .NET, Java, JavaScript, PHP, Perl and many others. It is largely based on the powerful UNIX vi, grep, sed search tools.

MySQL's implementation of regular expressions is based on the work of Henry Spencer. It works in much the same way that the LIKE operator does, except that it adds a lot of extra pattern matching capability. In fact, it even has an alias of RLIKE! Here's the syntax using both statements:



expression REGEXP pattern

OR

expression RLIKE pattern



Matching with Wildcard Characters

Regular expression wildcards differ slightly from those of the LIKE statement. Rather than matching any one character (_) or more characters (%), regular expressions match specific patterns called elements. You can still emulate the behavior of the LIKE statement, but the REGEXP can do a whole lot more.

To match any one character, use the period (.). You could say that it replaces the underscore character (_). However, when positioned before an asterisk (*), it then matches any number of characters, including none; that is to say, zero to N characters. We could therefore reformulate the LIKE expression as the following REGEXP:



'Gravelle, Robert' REGEXP 'Gravelle.*Robert'



The only caveat to using such a general pattern is that regular expressions are greedy! They will match as many characters as possible. Case in point, the following text would not match the 'Gravelle%Robert' LIKE expression because it matches against the entire string. However it would for a REGEX, because it doesn't care where in the string the pattern matches:

Dear Mr. Gravelle,

I would like to thank you for your fan mail to Robert Pattinson.

You could say that the REGEX behaves more like '%Gravelle%Robert%'.

For that reason, it's prudent to avoid such general patterns. If we know that there will be some delimiter in between the names, we can match using the plus sign (+). It matches the preceding element one or more times. Hence, there has to be at least one character there for the expression to match:



'Gravelle, Robert' REGEXP 'Gravelle.+Robert'



Better still, if we know how many characters may be used to delimit the names, we can specify this in the pattern as well, using curly braces ({m,n}), where it specifies m through n instances of the preceding element:



... RLIKE 'Gravelle.{1,2}Robert'



That will match 'Gravelle, Robert' or 'Gravelle.Robert', but it will still match too many other patterns for my liking.

Getting into Specifics

Let's get down to brass tacks and outline exactly what we want. The first character should either be a period or a comma. In OR situations such as these, it's best to define a valid character group by enclosing them between square brackets ([]). There may also be a space. To define a zero or one condition, follow the element by a question mark (?). This statement will really narrow down the field!



... REGEXP 'Gravelle[.,] ?Robert'



Many characters have equivalent named constants that you can use to make your code more readable. To use a named constant, enclose it between square brackets and periods. For instance, replacing the space in the above expression with its constant would yield the following:



... REGEXP 'Gravelle[.,][.space.]?Robert'



You can make the grouped characters exclusive, meaning that they are excluded from a match, by adding a caret (^) after the opening square bracket. The following expression will match as long as the "R" in "Robert" does not immediately follow the period or comma:



... REGEXP 'Gravelle[.,][^R]?Robert'



Grouping Elements

Of course an element can contain more than one character. To create a multi-character element, enclose them in parentheses (()). This construct would allow us to separate our conditions into a period (.) and a comma, followed by a space (, ). As you recall, the period is reserved to match against any character. Therefore, we need to escape it. This is accomplished by preceding it by a backslash (\). To create an OR condition with multi-character elements, use the vertical bar (|) character:



... REGEXP 'Gravelle(\.|(,[.space.]))Robert'



Position Markers



If that still isn't precise enough for you, you can also denote positions within the string: namely the start (the caret (^)) and end (the dollar sign ($)). Obviously, these characters must be placed at the very start or end of the pattern! Since we know that our pattern matches the start of the string, we can include the caret:



... REGEXP '^Gravelle\.|(,[.space.])Robert'



And Now for Some Bad News



The extended POSIX.2 Regular Expressions on which the MySQL REGEX is based does not support backreferences, which are temporary variables that are created whenever parentheses are used. These can be referenced further along in the expression by using a numeric identifier such as $1, $2,…$n or \1, \2,…

. Therefore, we can't reference the first name again to make up the second part of the OR condition:



... REGEXP '^Gravelle(\.(Robert))|(,[.space.]$1@GravelleConsulting.com)$'





The $1 above refers to the "Robert" that is enclosed in the innermost parentheses.

Instead, we have to rewrite it:



... REGEXP '^Gravelle(\.(Robert))|(,[.space.]Robert@GravelleConsulting.com)$'

Conclusion



As long as you bear in mind that the MySQL implementation of Regular Expressions is not all-inclusive, I'm sure that you'll agree that it's leagues above the LIKE statement. Also, be prepared to accept the associated learning curve, because Regular Expressions really are a language all to themselves. Here's some links that will give you the full run down on the MySQL REGEXP statement:



See all articles by Rob Gravelle