The road to hell is paved with regular expressions

- Ilian Iliev

1. Dot is a special char

Simple elegant, you read it, you understand it

# process all ruby files process_ruby_file(file_name) if file_name.match(/.rb/) # and profit

But ouch it matches also sample.erb or paris_rb.html

You know what? A regular expression is a program in itself and it deserves tests!

Want to test it?

Take the matching code and depending on your OO taste, make a class : class RubyFileMatcher method : def ruby_file?(file_name) or Regex.ruby_source_file_regex constant RUBY_FILE_REGEXP (perhaps too ugly)

Take your favorite testing framework and throw valid/invalid string to verify the behavior.

processor.ruby_file?('sample/app.rb').must_equal true processor.ruby_file?('views/paris_rb.html').must_equal false processor.ruby_file?('views/paris_rb').must_equal false processor.ruby_file?('views/paris.erb').must_equal false

How can I fix it?

You can fix the regular expression by escaping the . and matching the end of the string with:

/\.rb$/

In fact you don't really need a regex. Let's drop the regex and use

File.extname(file) == '.rb'

to detect all ruby files like rubocop's TargetFinder

More on the subject:

small ads for the Paris.rb meetup

File.extname

GitLab repository/user regexp are centralized in a utility class Regex and can be reused in validations, routes,...

2. Carriage return in content

You have a Git repository model and want to validate the fullname ( rails/rails )

class Repository < ActiveRecord::Base ... validates(:full_name, allow_blank: false, format: { with: /^[A-Za-z0-9\-_\.]*\/[A-Za-z0-9\-_\.]*$/ } ) end

But when we provide a multiline content

"rails/rails

alert('pwned!');"

... it's still valid!

You can catch that trap with Brakeman

Repository | Format Validation | Insufficient validation for 'full_name' using /^[A-Za-z0-9\-_\.]*\/[A-Za-z0-9\-_\.]*$/. Use \A and \z as anchors near line 41

You can fix the code by using the anchors with:

/\A[A-Za-z0-9\-_\.]*\/[A-Za-z0-9\-_\.]*\z/

Shameless Plug: in fact no need to launch Brakeman, PullReview will detect it as you push to GitHub, BitBucket or your GitLab instance.

More on the subject:

3. Denial of Service by regexp

Some input can cause pathological lookup and performance degradation

require 'bm' ANCHORED_VERSION_PATTERN = /\A\s*(#{VERSION_PATTERN})*\s*\z/ Benchmark.measure do '2.3422222.222.222222222.22222.ads0as.dasd0.ddd2222.2.qd3e.' =~ ANCHORED_VERSION_PATTERN end

55 seconds later, you finally got the result

=> #<Benchmark::Tms:0xb7f203fc @label="", @real=55.452385805, @cstime=0.0, @cutime=0.0, @stime=0.010000000000000009, @utime=55.30000000000007, @total=55.310000000000066>

That issue was reported in Bundler regexp validate version string, removing the backtracking helped the performance

ANCHORED_VERSION_PATTERN = /\A\s*(#{VERSION_PATTERN})?\s*\z/

You use that regexp in an online form, you could face a Denial of Service when "well crafted" content is submitted. With a few HTTP posts all your Unicorm/Passenger process would get exhausted.

As a corollary, don't let user submit regexp in your app content.

More on the subject:

4. Non escaping of special chars

You have a Cucumber step like this one

Then /^I should see "(.*)"$/ do |text| assert_match /#{text}/, @response.body end

What if I pass the following text to the step to check my HTML content

Security (2)

It won't work as expected... due to the parentheses interpreted as a regexp group.

The fix is to escape the text

Then /^I should see "(.*)"$/ do |text| assert_match /#{Regexp.escape(text)}/, @response.body end

Remember escaping white spaces (\t,...) and other special chars like "(.*/) ?+"

More on the subject:

Escape ruby regular expressions by Josh Nichols

use char class for white spaces

should have perhaps used assert string.include?(text) instead of assert_match ;)

5. Over specifying

You want to parse a line like this one and extract each attributes of the soldier:

line = 'Name: Some Soldier, Rank: Major, Serial: 426879824, Boots: black'; line.match /(\w*): (\w*\s*\w*), (\w*): (\w*), (\w+): (\d{9}), (\w*): (\w*)/ => #<MatchData "Name: Some Soldier, Rank: Major, Serial: 426879824, Boots: black" 1:"Name" 2:"Some Soldier" 3:"Rank" 4:"Major" 5:"Serial" 6:"426879824" 7:"Boots" 8:"black">

Small variation in format (a new attribute Gender), different spacing may easily break your code, your are doing too much with a regexp, over-specifying the format.

It's often better to identify pattern in a pattern with scan or split method:

line = 'Name: Some Soldier, Rank: Major, Serial: 426879824, Boots: black'; data = line.split(',').map do |label_value| match_data = label_value.match(/\s?(.*):\s*(.*)/) [match_data[1],match_data[2]] end Hash[data] => {"Name"=>"Some Soldier", "Rank"=>"Major", "Serial"=>"426879824", "Boots"=>"black"}

More on the subject:

Common Regex Gotchas (perl)

regex crossword

again in this sample, you can replace the last regexp with a split(':') and a strip of extra white spaces.

and a of extra white spaces. Want to parse known format like CSV, HTML, XML, JSON, ...? Don't roll your own regular expression use a dedicated Gem or stdlib !

6. Validating an URI with a regexp.

Please don't use a home made/googled regexp

validates( :website, allow_blank: true, uri: { format: /(^$)|(^(http|https):\/\/[a-z0-9]+([\-\.]{1}[a-z0-9]+)*\.[a-z]{2,5}(([0-9]{1,5})?\/.*)?$)/ix } )

The URI scheme is tricky to implement with a single regexp.

Relying on the same validation as your network library is probably safer if you want to restrict the url to https:// and https:// use the following code.

require 'uri' def valid?(url) uri = URI.parse(url) uri.kind_of?(URI::HTTP) rescue URI::InvalidURIError false end

More on the subject:

7. Write once debug everywhere DSL

As explained by Brandon Beacher in his article

Regexp in ruby can be cryptic:

/(\d{3})-(\d{3})-(\d{4})/.match("678-248-2440") if $~ PhoneNumber.new( area_code: $1, prefix: $2, line_number: $3 ) end

Named group might help keeping the intent clearer

match_data = /(?<area_code>\d{3})-(?<prefix>\d{3})-(?<line_number>\d{4})/.match("678-248-2440") if match_data PhoneNumber.new( area_code: match_data[:area_code], prefix: match_data[:prefix], line_number: match_data[:line_number] ) end

Named group helps on readability and make the regexp "self-documenting". In other user cases this would also keep the code more flexible in case of small variation in the output, you receive a provider name before the phone number.

match_data = /(?<provider_name>.*)\s(?<area_code>\d{3})-(?<prefix>\d{3})-(?<line_number>\d{4})/.match("BelgiumTelecom 678-248-2440")

The phone number parsing stays the same, as the code doesn't depend on the position/index, you just add a provided named group to extract the provider.

Closing words

Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems.

- Jamie Zawinski

First regular expressions are powerful, their simplicity hides often complex processing. Solving your text parsing/validation problems in one line is great but this line needs way much more testing than the other lines in your codebase. You don’t have a PhD in Regular Expression (and most of your colleague either), so don’t leave them blind when they’ll have to adapt your regexp. Leave a few unit tests to show what is an expected valid/invalid content.

Second "when you have a hammer, everything looks like a nail", most of the time, you can drop your homemade expression and favor a well tested implementation (eg File.extname ).

Want to learn more about ruby and regexp?