Part 1: What are Regular Expressions?

1. Syntax

1.1. Character Classes

.     Dot, any character (may or may not match line terminators, read on)
\d    A digit: [0-9]
\D    A non-digit: [^0-9]
\s    A whitespace character: [ \t\n\x0B\f\r]
\S    A non-whitespace character: [^\s]
\w    A word character: [a-zA-Z_0-9]
\W    A non-word character: [^\w]

String pattern = "\\d \\D \\W \\w \\S \\s";
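To see these classes in action, here is a minimal sketch. The class name `CharacterClassDemo` and the helper `matchesWhole` are illustrative and not part of the original article. It relies on `String.matches`, which only succeeds when the entire string fits the pattern, and it shows the doubled backslashes that a Java string literal requires.

```java
class CharacterClassDemo {
    // Hypothetical helper: true when the whole input fits the pattern.
    static boolean matchesWhole(String regex, String input) {
        return input.matches(regex);
    }

    public static void main(String[] args) {
        // Each regex backslash is written twice inside a Java string literal.
        System.out.println(matchesWhole("\\d", "7"));   // true: a digit
        System.out.println(matchesWhole("\\D", "x"));   // true: a non-digit
        System.out.println(matchesWhole("\\s", "\t"));  // true: a tab is whitespace
        System.out.println(matchesWhole("\\S", "\t"));  // false: a tab is whitespace
        System.out.println(matchesWhole("\\w", "_"));   // true: underscore is a word character
        System.out.println(matchesWhole("\\W", "-"));   // true: a dash is not a word character
    }
}
```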

1.2. Quantifiers

*       Match 0 or more times
+       Match 1 or more times
?       Match 1 or 0 times
{n}     Match exactly n times
{n,}    Match at least n times
{n,m}   Match at least n but not more than m times
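The quantifiers listed above can be tried out with a short sketch. The class `QuantifierDemo`, the helper `matchesWhole`, and the sample strings are assumptions made for illustration. The last three lines show that a quantifier applies only to the item on its immediate left unless a group is used.

```java
class QuantifierDemo {
    // Hypothetical helper: true when the whole input fits the pattern.
    static boolean matchesWhole(String regex, String input) {
        return input.matches(regex);
    }

    public static void main(String[] args) {
        System.out.println(matchesWhole("a*", ""));            // true: zero occurrences are allowed
        System.out.println(matchesWhole("a+", ""));            // false: at least one 'a' is required
        System.out.println(matchesWhole("colou?r", "color"));  // true: the 'u' is optional
        System.out.println(matchesWhole("\\d{3}", "1234"));    // false: exactly three digits
        System.out.println(matchesWhole("\\d{3,}", "1234"));   // true: three or more digits
        System.out.println(matchesWhole("\\d{2,3}", "1234"));  // false: at most three digits

        // A quantifier binds to the item on its immediate left.
        System.out.println(matchesWhole("ab+", "abbb"));       // true: only 'b' repeats
        System.out.println(matchesWhole("ab+", "abab"));       // false
        System.out.println(matchesWhole("(ab)+", "abab"));     // true: the whole group repeats
    }
}
```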

1.3. Meta-characters

\     Escape the next meta-character (it becomes a normal/literal character)
^     Match the beginning of the line
.     Match any character (except newline)
$     Match the end of the line (or before newline at the end)
|     Alternation (‘or’ statement)
()    Grouping
[]    Custom character class
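As a rough illustration of the meta-characters above, the sketch below searches a few sample strings with `Matcher.find()`, which looks for the pattern anywhere in the input. The class `MetaCharacterDemo`, the helper `found`, and the sample strings are hypothetical.

```java
import java.util.regex.Pattern;

class MetaCharacterDemo {
    // Hypothetical helper: true when the pattern is found anywhere in the input.
    static boolean found(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args) {
        System.out.println(found("a.c", "abc"));          // true: '.' stands for any character
        System.out.println(found("a\\.c", "abc"));        // false: the escaped dot is a literal '.'
        System.out.println(found("a\\.c", "a.c"));        // true
        System.out.println(found("^cat", "cat nap"));     // true: 'cat' is at the beginning
        System.out.println(found("^cat", "a cat"));       // false: 'cat' is not at the beginning
        System.out.println(found("cat$", "a cat"));       // true: 'cat' is at the end
        System.out.println(found("cat|dog", "hot dog"));  // true: either alternative may match
        System.out.println(found("[abc]x", "bx"));        // true: custom class of a, b, or c
        System.out.println(found("gr(a|e)y", "grey"));    // true: the group limits the alternation
    }
}
```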

Visual Regex Tester

2. Examples

2.1. Basic Expressions

“I lost my wallet”

wallet

"I lost my \\w+"

“\w”

“+”

“I lost my sablefish”

“I lost my parrot”

“I lost my: trooper”

":"

"I lost my:? \\w+"

":"

2.2. Basic Grouping

|     Alternation (‘or’ statement)
()    Grouping

()

"I lost my:? (wallet|car|cell phone|marbles)"

“I lost my”

":"

"|"

"I lost my wallet"        matches
"I lost my wallets"       matches (the ‘s’ is not needed, so it is ignored)
"I lost my: car"          matches
"I lost my- car"          doesn’t match (‘-’ is not allowed in our pattern)
"I lost my: cell"         doesn’t match (all of ‘cell phone’ is needed)
"I lost my: cell phone"   matches
"I lost my cell phone"    matches
"I lost my marbles"       matches

Quiz: Can you figure out all possible matches for this pattern?

"I lost my:? (wallet|car|cell phone|marbles)"

Answer: This is a trick question! Because this regular expression is unanchored (it has no beginning `^` and no ending `$` meta-characters to terminate the match), the pattern we’ve created will actually match any string containing one of the results below. In short, there are nearly infinite possible matches. However, if we did want to limit our pattern to just these results, we could add the required terminators to our pattern, like so:

"^I lost my:? (wallet|car|cell phone|marbles)$"

"I lost my wallet"
"I lost my: wallet"
"I lost my car"
"I lost my: car"
"I lost my cell phone"
"I lost my: cell phone"
"I lost my marbles"
"I lost my: marbles"
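The difference between the unanchored and the anchored pattern can be checked with a small sketch. The class `GroupingDemo`, the constants `LOOSE` and `STRICT`, the helper `matchedText`, and the sample sentences are assumptions made for illustration. The helper prints the exact text that matched, which makes it visible that the trailing ‘s’ in "wallets" is simply left out of the match.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GroupingDemo {
    // The pattern from this section, without and with the ^ and $ terminators.
    static final Pattern LOOSE =
        Pattern.compile("I lost my:? (wallet|car|cell phone|marbles)");
    static final Pattern STRICT =
        Pattern.compile("^I lost my:? (wallet|car|cell phone|marbles)$");

    // Hypothetical helper: the text the pattern matched, or null when nothing matched.
    static String matchedText(Pattern pattern, String input) {
        Matcher m = pattern.matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String[] samples = {
            "I lost my wallet",
            "I lost my wallets",
            "I lost my- car",
            "I lost my: cell",
            "I lost my: cell phone",
            "Yesterday I lost my marbles again"
        };
        for (String sample : samples) {
            System.out.println("\"" + sample + "\""
                + " loose: " + matchedText(LOOSE, sample)
                + ", strict: " + matchedText(STRICT, sample));
        }
    }
}
```

With the terminators in place, only strings that consist of nothing but one of the eight listed phrases are accepted.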

2.3. Matching/Validating

Sample code:

    import java.util.ArrayList;
    import java.util.List;

    public class ValidateDemo {
        public static void main(String[] args) {
            List<String> input = new ArrayList<String>();
            input.add("123-45-6789");
            input.add("9876-5-4321");
            input.add("987-65-4321 (attack)");
            input.add("987-65-4321 ");
            input.add("192-83-7465");

            // String.matches() only succeeds when the entire string fits the pattern.
            for (String ssn : input) {
                if (ssn.matches("^(\\d{3}-?\\d{2}-?\\d{4})$")) {
                    System.out.println("Found good SSN: " + ssn);
                }
            }
        }
    }

Found good SSN: 123-45-6789
Found good SSN: 192-83-7465

Dissecting the pattern:

"^(\\d{3}-?\\d{2}-?\\d{4})$"

^        match the beginning of the line
()       group everything within the parentheses as group 1
\d{n}    match n digits, where n is a number equal to or greater than zero
-?       optionally match a dash
$        match the end of the line
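`String.matches` compiles its pattern on every call, so when many values are validated it can help to compile once and reuse the result. The following is only a sketch under that assumption; the class `SsnValidator` and the method `isValid` are hypothetical names. `Matcher.matches()` already requires the whole input to fit, so the `^` and `$` are kept here only to mirror the pattern above.

```java
import java.util.regex.Pattern;

class SsnValidator {
    // Same pattern as the sample above, compiled a single time.
    private static final Pattern SSN = Pattern.compile("^(\\d{3}-?\\d{2}-?\\d{4})$");

    // Hypothetical helper: true when the whole input has the SSN shape.
    static boolean isValid(String candidate) {
        return candidate != null && SSN.matcher(candidate).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValid("123-45-6789"));           // true
        System.out.println(isValid("123456789"));             // true: both dashes are optional
        System.out.println(isValid("9876-5-4321"));           // false: digits are grouped wrongly
        System.out.println(isValid("987-65-4321 (attack)"));  // false: trailing text
        System.out.println(isValid("987-65-4321 "));          // false: trailing space
    }
}
```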

2.4. Extracting/Capturing

Sample code:

    import java.util.ArrayList;
    import java.util.List;
    import java.util.regex.*;

    public class ExtractDemo {
        public static void main(String[] args) {
            String input = "I have a cat, but I like my dog better.";

            Pattern p = Pattern.compile("(mouse|cat|dog|wolf|bear|human)");
            Matcher m = p.matcher(input);

            // find() advances to the next match each time it is called.
            List<String> animals = new ArrayList<String>();
            while (m.find()) {
                System.out.println("Found a " + m.group() + ".");
                animals.add(m.group());
            }
        }
    }

Found a cat.
Found a dog.

Dissecting the pattern:

"(mouse|cat|dog|wolf|bear|human)"

()       group everything within the parentheses as group 1
mouse    match the text ‘mouse’
|        alternation: match any one of the sections of this group
cat      match the text ‘cat’
//...and so on
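Extraction becomes more useful when individual groups are read separately. As a hedged illustration, the sketch below pulls name and value pairs out of a line of text using two groups. The class `KeyValueExtractDemo`, the method `extractPairs`, and the pattern `(\\w+)=(\\d+)` are assumptions made for this example and do not come from the article's sample code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class KeyValueExtractDemo {
    // Group 1 captures the name, group 2 captures the numeric value.
    private static final Pattern PAIR = Pattern.compile("(\\w+)=(\\d+)");

    // Hypothetical helper: returns every "name -> value" pair found in the text.
    static List<String> extractPairs(String text) {
        List<String> pairs = new ArrayList<String>();
        Matcher m = PAIR.matcher(text);
        while (m.find()) {
            pairs.add(m.group(1) + " -> " + m.group(2));
        }
        return pairs;
    }

    public static void main(String[] args) {
        String input = "User clientId=23421. Some more text clientId=33432. This clientNum=100";
        // Prints one line per pair, in the order the pairs appear in the input.
        for (String pair : extractPairs(input)) {
            System.out.println(pair);
        }
    }
}
```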

2.5. Modifying/Substitution

‘clientId=’

Sample code:

    import java.util.regex.*;

    public class ReplaceDemo {
        public static void main(String[] args) {
            String input = "User clientId=23421. Some more text clientId=33432. This clientNum=100";

            Pattern p = Pattern.compile("(clientId=)(\\d+)");
            Matcher m = p.matcher(input);

            StringBuffer result = new StringBuffer();
            while (m.find()) {
                System.out.println("Masking: " + m.group(2));
                // Keep group 1 ("clientId=") and swap the digits for the mask.
                m.appendReplacement(result, m.group(1) + "***masked***");
            }
            // Copy whatever follows the last match.
            m.appendTail(result);
            System.out.println(result);
        }
    }

Masking: 23421
Masking: 33432
User clientId=***masked***. Some more text clientId=***masked***. This clientNum=100

Dissecting the pattern:

"(clientId=)(\\d+)"

(clientId=)   group everything within the parentheses as group 1
clientId=     match the text ‘clientId=’
(\\d+)        group everything within the parentheses as group 2
\\d+          match one or more digits

Often unknown, or dismissed as confusing, regular expressions (regex) have defined the standard for powerful text manipulation and search. Without them, many of the applications we know today would not function. This two-part series explores the basics of regular expressions in Java, and provides tutorial examples in the hopes of spreading love for our pattern-matching friends. (Read part two.)

Regular expressions are a language of string patterns built into most modern programming languages, including Java 1.4 onward; they can be used for searching, extracting, and modifying text. This chapter will cover basic syntax and use.

Regular expressions, by definition, are string patterns that describe text. These descriptions can then be used in a nearly unlimited number of ways. The basic language constructs include character classes, quantifiers, and meta-characters.

Character classes are used to define the content of the pattern, e.g. what should the pattern look for? Notice, however, that in Java you will need to “double escape” these backslashes: a Java string literal treats the backslash as its own escape character, so the regex \d is written "\\d" in source code.

Quantifiers can be used to specify the number or length that part of a pattern should match or repeat. A quantifier will bind to the expression group to its immediate left.

Meta-characters are used to group, divide, and perform special operations in patterns.

To get a more visual look into how regular expressions work, try our visual Java regex tester.

Every plain string of literal characters is a regular expression. For example, the string “I lost my wallet” is a regular expression that will match the text “I lost my wallet” and will ignore everything else. What if we want to be able to find more things that we lost? We can replace wallet with a character class expression that will match any word: "I lost my \\w+".

As you can see, this pattern uses both a character class and a quantifier. “\w” says match a word character, and “+” says match one or more. So when combined, the pattern says “match one or more word characters.” Now the pattern will match any word in place of “wallet”, e.g. “I lost my sablefish” or “I lost my parrot”. It will not match “I lost my: trooper”, because the expression expects a space directly after ‘my’; as soon as it finds the ":" character instead, it will stop matching. If we want the expression to be able to handle this situation, then we need to make a small change: "I lost my:? \\w+". Now the expression will allow an optional ":" directly after the word ‘my’.

An important feature of regular expressions is the ability to group sections of a pattern, and provide alternate matches. The | and () meta-characters are core parts of flexible regular expressions. For instance, in the first example we lost our wallet. What if we knew exactly which types of objects we had lost, and we wanted to find those objects but nothing else? We can use a group, (), with an ‘or’ meta-character in order to specify a list of expressions to allow in our match: "I lost my:? (wallet|car|cell phone|marbles)".

The new expression will now match the beginning of the string, “I lost my”, then an optional ":", and then any one of the expressions in the group, which are separated by alternators, "|". Any one of ‘wallet’, ‘cell phone’, ‘car’, or our ‘marbles’ would be a match.

As the match list in section 2.2 shows, the combinations for matches quickly become very large. It is not the complete set, as there are several more phrases that would match our simple pattern.

Regular expressions make it possible to test whether text matches a certain pattern, returning a Boolean value that says whether the pattern was found. This can be used to validate input such as phone numbers, social security numbers (SSNs), email addresses, and web form data, to scrub data, and much more. E.g. if a pattern describes an SSN and the whole string matches it, then the string has the form of an SSN.

Specific values can be selected out of a large, complex body of text. These values can then be used in the application.

Values in text can be replaced with new values. For example, you could replace all instances of the word ‘clientId=’, followed by a number, with a mask to hide the original text (see the sample code in section 2.5). For sanitizing log files, URI strings and parameters, and form data, this can be a useful method of filtering sensitive information. A simple, reusable utility class can be used to encapsulate this into a more streamlined method.
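One possible shape for such a utility class is sketched below. It is an assumption and not the author's implementation: the class `LogMasker` and the method `maskClientIds` are hypothetical names, and `Matcher.replaceAll` is swapped in for the `appendReplacement` loop used in the sample code.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class LogMasker {
    private static final Pattern CLIENT_ID = Pattern.compile("(clientId=)(\\d+)");
    private static final String MASK = "***masked***";

    private LogMasker() { }

    // Hypothetical helper: replaces the digits after every "clientId=" with a fixed mask.
    static String maskClientIds(String text) {
        // "$1" re-inserts group 1 ("clientId="); quoteReplacement keeps the mask literal.
        return CLIENT_ID.matcher(text).replaceAll("$1" + Matcher.quoteReplacement(MASK));
    }

    public static void main(String[] args) {
        String input = "User clientId=23421. Some more text clientId=33432. This clientNum=100";
        // Only the two clientId values are masked; clientNum is left alone.
        System.out.println(maskClientIds(input));
    }
}
```

Keeping the compiled pattern in a constant means callers only need a single method call per line of text.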

Notice how groups begin numbering at 1, and increment by one for each new group. However, groups may contain groups, in which case numbering follows the order of the opening parentheses from left to right: the outer group is group one, and the first group inside it is group two. When referencing group 0, you will be given the entire chunk of text that matched the regex.

( ( ) ( ( ) ( ) ) ) ( )   //and so on
1 2   3 4   5       6     //0 = everything the pattern matched
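The numbering in the diagram can be confirmed with a pattern of the same nesting shape. This is a minimal sketch: the class `GroupNumberingDemo`, the helper `groupText`, and the pattern `((a)((b)(c)))(d)` are made up for illustration, with one letter placed in each innermost group.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GroupNumberingDemo {
    // Same nesting shape as the diagram above, with one letter per innermost group.
    static final Pattern NESTED = Pattern.compile("((a)((b)(c)))(d)");

    // Hypothetical helper: text captured by group n, or null when the input does not match.
    static String groupText(String input, int n) {
        Matcher m = NESTED.matcher(input);
        return m.matches() ? m.group(n) : null;
    }

    public static void main(String[] args) {
        Matcher m = NESTED.matcher("abcd");
        if (m.matches()) {
            // groupCount() excludes group 0, so the loop runs from 0 through 6.
            for (int i = 0; i <= m.groupCount(); i++) {
                System.out.println("group " + i + " = " + m.group(i));
            }
        }
    }
}
```

For the input "abcd", group 0 is the whole match "abcd", group 1 is "abc", group 2 is "a", group 3 is "bc", group 4 is "b", group 5 is "c", and group 6 is "d".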

3. Conclusion & Next Steps

Wrapping up, regular expressions are not difficult to master – in fact, they are quite easy. My strategy, whenever building a new regular expression, is to start with the simplest, most general match possible. From there, I continuously add more and more complexity until I have matched, substituted, or inserted exactly what I need.

Don’t be afraid to “express” yourself! When you’ve got the hang of these techniques, or need something a little fancier, read part two for more information on lookaheads, lookbehinds, and configuring the matching engine.