Introduction to REGEX

REGEX is short for regular expressions: patterns of characters used to search within a string. In the Python programming language, regular expression matching is provided by the built-in re module. A pattern can describe simple words, phone numbers, email addresses, or any number of other formats. For example, if you search for the letter “f” in the sentence “For the love of all that is good, finish the job,” the goal is to find every occurrence of the character “f” in the sentence. This is the most basic application of regular expressions. You can also look for only alphabetic characters in strings that mix letters, numbers, and special characters: in a string that reads “a2435?#@s560” you could choose to look only for the letters. You could also look through text specifically for phone numbers (###-###-####). The format of a phone number is a very specific pattern of digits and hyphens rather than a single character, and it relies on the general syntax we’ll discuss next.
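The three searches described above can be sketched with Python's built-in re module. This is only a minimal illustration: the bracket and brace notation is explained later in the text, and the phone numbers are made-up sample data rather than anything from the original.

```python
import re

# The sentence and the mixed string come from the text; the phone numbers are invented.
sentence = "For the love of all that is good, finish the job"
f_matches = re.findall("f", sentence)   # lowercase "f" only; the capital "F" is skipped
print(f_matches)

mixed = "a2435?#@s560"
letters = re.findall("[a-z]", mixed)    # keep only the lowercase letters
print(letters)

contact = "Call 555-123-4567 or 555-987-6543"
phones = re.findall("[0-9]{3}-[0-9]{3}-[0-9]{4}", contact)
print(phones)
```

Here re.findall returns a list of every non-overlapping match, which makes it convenient for counting occurrences.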

First, it should be noted that regex is case-sensitive by default: the letter “a” and the letter “A” are considered separate characters. Also, when dealing with numbers, each character in a pattern matches only one digit at a time, since there isn’t a single character that represents anything beyond 0 through 9. Let’s go through some of the important meta-characters used to write the patterns we need to look for. Patterns are written as ordinary strings, so in Python they start and end with quotation marks (“”). So let’s say you’re looking for occurrences of the letter “e”: you can write exactly “e”. If you’re looking for a phrase, a part of a word, or a whole word such as “was”, then you can write exactly “was”. For literal text like this, writing a regular expression is no different from writing a regular string.
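As a rough illustration of case sensitivity and literal patterns, the sketch below uses Python's re module on a sample string chosen for this example. The re.IGNORECASE flag is one way to relax case sensitivity if that is what you need.

```python
import re

text = "Adam was away"  # illustrative sample string, not from the original text

lower_a = re.findall("a", text)                       # lowercase "a" only
upper_a = re.findall("A", text)                       # uppercase "A" only
any_a = re.findall("a", text, flags=re.IGNORECASE)    # both cases
whole_word = re.findall("was", text)                  # a literal word is its own pattern

# A number is matched one digit at a time.
digits = re.findall("[0-9]", "42")

print(len(lower_a), len(upper_a), len(any_a))
print(whole_word, digits)
```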

Using characters to create indentations

Now let’s get into something special: the period (.) represents any single character other than a newline (the character that creates a line break). Let’s say the pattern you’re looking for is “h.s”: this means any one character, whether a letter, a number, or a special character, can sit between the “h” and the “s”. Next, we have two characters that reference the specific position of a pattern.
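To see the period in action, here is a small sketch using re.findall on a made-up string of candidates. Note that “hs” fails because the period needs exactly one character, and the last candidate fails because that character is a newline.

```python
import re

# Illustrative candidates: the last two should not match "h.s".
words = "has his h5s h#s hs h\ns"

dot_matches = re.findall("h.s", words)
print(dot_matches)
```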

The caret (^) looks for a pattern that starts the string or text. So if you had the sentence “This looks like a tree” and you look for the pattern “^This”, it will successfully match since “This” is at the beginning. The caret must be the first character of the pattern.

On the opposite end of the spectrum, we have the dollar sign ($), which indicates the pattern must be at the end. So taking the previous example, if the pattern is “tree$”, the match succeeds since the word “tree” ends the string. The dollar sign must always conclude the pattern.
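The caret and dollar sign can be tried out with re.search, which returns a match object on success and None otherwise. This is a minimal sketch using the sentence from the text; the third pattern is an extra case added here to show a failed match.

```python
import re

sentence = "This looks like a tree"

starts = re.search("^This", sentence)     # "This" begins the string
ends = re.search("tree$", sentence)       # "tree" ends the string
not_start = re.search("^looks", sentence) # "looks" is present, but not at the start

print(starts is not None, ends is not None, not_start is not None)
```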

The next couple of meta-characters refer to the number of times part of a pattern occurs in a string.

The asterisk (*) checks for zero or more occurrences of the character or group immediately before it. This means that whether or not that character occurs, the asterisk itself never causes a match to fail. For example, if we had the pattern “abc*”, then any string containing “ab” will pass. The “c” can occur or not and the string will still meet the requirements. So the strings “ab”, “abc”, and “abccc” all match the pattern.
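A quick way to check the asterisk example is re.fullmatch, which requires the whole string to fit the pattern. The sketch below tests the three strings from the text plus one extra string, “a”, added here as a counterexample.

```python
import re

pattern = "abc*"  # "ab" followed by zero or more "c"

# Map each candidate string to whether the entire string fits the pattern.
star_results = {s: bool(re.fullmatch(pattern, s)) for s in ["ab", "abc", "abccc", "a"]}
print(star_results)
```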

The plus sign (+) checks for one or more occurrences of the preceding character or group. This means that as long as it appears at least once, a successful match has been made. No occurrence means that the match was unsuccessful. You can also use braces ({}) and enter between them the specific number of occurrences you are looking for. All of these meta-characters follow the part of the pattern they apply to.
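The difference between the plus sign and braces can be sketched as follows. The patterns “abc+” and “abc{2}” are examples chosen here, reusing the strings from the asterisk discussion.

```python
import re

# "abc+" needs at least one "c"; "abc{2}" needs exactly two.
plus_results = {s: bool(re.fullmatch("abc+", s)) for s in ["ab", "abc", "abccc"]}
brace_results = {s: bool(re.fullmatch("abc{2}", s)) for s in ["abc", "abcc", "abccc"]}

print(plus_results)
print(brace_results)
```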

The vertical bar (|), much like in programming languages, represents “or”. If you had the sentence “I’m departing from Miami at six o’clock” and the regex is “go|departing”, the match would be successful because even though “go” isn’t present, “departing” is.
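The “or” example above can be checked with re.search. In this small sketch, group() returns whichever alternative was actually found.

```python
import re

sentence = "I'm departing from Miami at six o'clock"

match = re.search("go|departing", sentence)
found = match.group() if match else None  # the alternative that matched, if any
print(found)
```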

Sets in REGEX

Next, we’ll discuss sets created by brackets ([]). A set expands the possibilities for making patterns, and represents exactly one character. For example, if you have the pattern “abc”, then that means you’re literally looking for “abc”. However, when the pattern is “[abc]”, you’re looking for occurrences of “a”, “b”, or “c”. Similarly, “0123” means you are literally looking for “0123”. If you have “[0123]”, then you’re looking for occurrences of 0, 1, 2, or 3.
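The contrast between a literal pattern and a set shows up clearly with re.findall. The sample strings in this sketch are invented for illustration.

```python
import re

text = "cab abc"  # illustrative sample string

literal = re.findall("abc", text)      # the exact sequence "abc"
any_of = re.findall("[abc]", text)     # any single "a", "b", or "c"
some_digits = re.findall("[0123]", "2024-13")  # any single 0, 1, 2, or 3

print(literal)
print(any_of)
print(some_digits)
```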

A hyphen (-) between two letters or characters means that any character in the range between the two, endpoints included, is a match. So “[0-9]” refers to all numerical digits while “[a-zA-Z]” refers to all alphabetical characters whether they are lower case or upper case. You can also limit the characters: for example, “[4-7]” or “[p-v]” are perfectly acceptable as well.
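Here is a short sketch of the four ranges mentioned above, run against a made-up string. Matches are returned in the order they appear in the text.

```python
import re

text = "Room 47, Level p9"  # illustrative sample string

all_digits = re.findall("[0-9]", text)
all_letters = "".join(re.findall("[a-zA-Z]", text))
four_to_seven = re.findall("[4-7]", text)
p_to_v = re.findall("[p-v]", text)   # lowercase p through v only

print(all_digits, all_letters, four_to_seven, p_to_v)
```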

The function of a caret (^) changes within a set. Placed first inside the brackets, the caret matches everything except the characters you entered. So if you have “[^abc]”, you want to match any character except “a”, “b”, or “c”. Apart from the caret, the hyphen, and the backslash, meta-characters in sets have no special function. That means that “[+]” literally looks for occurrences of the character “+”, which is no longer treated as a meta-character. If you want to apply meta-characters to sets, you use them outside the set, like “[0-9]*” or “[G-N]$”. You can make many different patterns by combining sets like “[v-z][a-g]”. This is also how you find numbers with multiple digits: “[0-9][0-9]” searches for a two-digit number.
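The set behaviours described above can be sketched as below. All of the sample strings are assumptions made for this example.

```python
import re

not_abc = re.findall("[^abc]", "aXbYc")           # everything except a, b, c
literal_plus = re.findall("[+]", "1+2+3")         # "+" is an ordinary character in a set
two_digits = re.findall("[0-9][0-9]", "7 apples and 42 pears")
only_digits = bool(re.fullmatch("[0-9]*", "2024"))  # quantifier applied outside the set
ends_g_to_n = re.search("[G-N]$", "BERLIN")       # last character must be G through N

print(not_abc, literal_plus, two_digits, only_digits, ends_g_to_n.group())
```

Note that the lone “7” is skipped by “[0-9][0-9]” because the pattern needs two digits side by side.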

Special sequences using the backslash

Lastly, we’ll briefly discuss special sequences. Special sequences are introduced by another meta-character not previously discussed, the backslash (\), followed by a particular letter depending on the sequence. They work much like the other meta-characters in that they perform special functions, and some of them share the same function as a meta-character. The sequences “\A”, “\b”, and “\B” refer to the specific position of the characters, just like the caret and the dollar sign.

The “\A” sequence checks if the pattern matches the beginning of the string. For example, if we had the pattern “\AThe” and we had the string “The Tree”, then the pattern matches. However, if we had the string “Find The Tree”, then there is no match because “The” does not begin the string.
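The “\A” example can be reproduced with re.search. The sketch below writes the pattern as a raw string (the r prefix), which is the usual way to keep Python from interpreting the backslash itself.

```python
import re

pattern = r"\AThe"  # raw string so the backslash reaches the regex engine unchanged

at_start = re.search(pattern, "The Tree")
not_at_start = re.search(pattern, "Find The Tree")

print(at_start is not None, not_at_start is not None)
```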

The sequence “\b” indicates that a pattern either begins or ends a word within the string.

If you would like to see if a word begins with “eb”, the pattern would look like “\beb”.

If you would like to see if a word ends with “eb”, the pattern would look like “eb\b”.

If we had the word “celeb”, it will not match the pattern “\beb” since it does not start with “eb”.

The word “celeb” will match the pattern “eb\b” since the word ends with “eb”. The sequence “\B” is implemented the same way as “\b” but has the exact opposite meaning. The sequence “\B” matches as long as a word does not begin or end with the pattern. Let’s look at the previous example again. If we have the word “celeb” and the pattern “\Beb”, then the pattern matches since “eb” does not start the word. If we have the pattern “eb\B”, the word would not match the pattern since “eb” ends the word.
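The four “celeb” cases can be verified with a short sketch. Raw strings matter here: in an ordinary Python string, “\b” means a backspace character rather than a word boundary. The word “ebony” is an extra example added here to show a successful “\beb” match.

```python
import re

word = "celeb"

starts_eb = re.search(r"\beb", word)      # "eb" at the start of a word
ends_eb = re.search(r"eb\b", word)        # "eb" at the end of a word
not_start_eb = re.search(r"\Beb", word)   # "eb" not at the start of a word
not_end_eb = re.search(r"eb\B", word)     # "eb" not at the end of a word
ebony = re.search(r"\beb", "ebony")       # a word that does begin with "eb"

print(starts_eb, ends_eb, not_start_eb, not_end_eb, ebony)
```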

Many of the other sequences are meant to separate specific types of characters. For example, “\d” returns a match for any character that is a digit and “\D” returns matches for anything but a digit. For this reason, special sequences are suited to very broad searches. If you just want to search for all numbers, all letters, or anything just as broad, special sequences are more convenient. Otherwise, the other meta-characters are recommended.
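As a final sketch, “\d” and “\D” can be applied to the mixed string from the introduction. Together the two lists account for every character in the string.

```python
import re

mixed = "a2435?#@s560"

digit_chars = re.findall(r"\d", mixed)      # every digit
non_digit_chars = re.findall(r"\D", mixed)  # everything that is not a digit

print(digit_chars)
print(non_digit_chars)
```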