TRE: A Regex Engine with Approximate Matching

November 4, 2018

TRE is a regex engine that allows for approximate matching. It does this by calculating the Levenshtein distance (the number of insertions, deletions, or substitutions it would take to make the strings equal) as it searches for a match.
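
To make the distance itself concrete, here is a minimal pure-Perl sketch of the textbook dynamic-programming Levenshtein distance between two whole strings. The function name `levenshtein` is my own, and this is not how TRE computes things internally (TRE matches approximately while scanning for a substring); it only illustrates what is being counted.

```perl
use strict;
use warnings;
use List::Util qw(min);

# Textbook edit distance between two whole strings (illustration only).
sub levenshtein {
    my ($source, $target) = @_;
    my @s = split //, $source;
    my @t = split //, $target;

    # Distances from the empty source prefix to every target prefix.
    my @prev = (0 .. scalar @t);

    for my $i (1 .. scalar @s) {
        my @curr = ($i);    # reducing $i source chars to "" takes $i deletions
        for my $j (1 .. scalar @t) {
            my $sub_cost = $s[$i - 1] eq $t[$j - 1] ? 0 : 1;
            push @curr, min(
                $prev[$j] + 1,                # deletion
                $curr[$j - 1] + 1,            # insertion
                $prev[$j - 1] + $sub_cost,    # substitution (free if chars match)
            );
        }
        @prev = @curr;
    }
    return $prev[-1];
}

print levenshtein('perl', 'prjl'), "\n";         # prints 2 (two substitutions)
print levenshtein('GATTACA', 'GATACA'), "\n";    # prints 1 (one deletion)
```

A distance of 2 for 'perl' versus 'prjl' is why a cost limit of 2 is enough for that pair to count as a match.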

re::engine::TRE is a Perl wrapper around the TRE engine. It swaps out the default Perl regex engine for the TRE engine within the lexical scope in which it is used.

Before we dive into the nitty-gritty, here are a few examples.

S=Substitution, I=Insertion, D=Deletion

Fuzzy matching:

    0 > use re::engine::TRE max_cost => 2
    1 > say $& if 'pesarl' =~ /perl/x
    pesa
    2 > say $& if 'prjl' =~ /perl/x
    prjl

Fuzzy captures:

    0 > use re::engine::TRE max_cost => 6
    1 > say "\$1 = $1, \$2 = $2" if 'Fussy wuzzy has a beer' =~ /(fuzzy) wuzzy was a (bear)/xi
    $1 = Fussy, $2 = beer

Perl regex variables returned (this is so handy):

    0 > use re::engine::TRE max_cost => 2
    1 > say "$1 starts at $-[1] and ends at $+[1]" if 'GATACA' =~ /GA(TCC)A/x
    TAC starts at 2 and ends at 5

Implementation details

So now that you’ve seen the basic flavor of what TRE is, let’s clear up some of the details that I found really confusing when I started with it.

The ERE vs the BRE Syntax

The syntax that TRE uses can be switched between the POSIX Extended Regular Expression syntax (ERE) and the Basic Regular Expression syntax (BRE). The docs page, however, does not make this clear. The re::engine::TRE implementation looks for the /x flag in order to switch on ERE, which is almost certainly what you want.

The Approximate match specifier

TRE also mixes in some syntax of its own. The ‘Approximate Match Settings’ are the primary example of this. They allow you to set the number of mismatches allowed for an atom, either in total or per type. They also allow you to set the cost of each type of mismatch for that atom.

EX:

    0 > use re::engine::TRE max_cost => 6
    1 > say "$1 starts at $-[1] and ends at $+[1]" if 'GATTACA' =~ /GA(TCC){ #1+1~2 }A/x
    TTAC starts at 2 and ends at 6
    2 > say "$1 starts at $-[1] and ends at $+[1]" if 'GATGGGGA' =~ /GA(TACAC){ #1+1-1~2 }A/x

The last example says that my capture group can have at most 1 insertion, 1 deletion, and 1 substitution, but that it can’t have more than 2 errors in total. This brings up one of the gotchas though. It is sometimes hard to ‘see’ the edits that lead to a match. For example, GATCCCCA would have matched and given ‘TCCC’ as my capture, which is 1 substitution and 1 deletion in the capture group, and 1 insertion outside of it.
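
Since the specifier is terse, here is a small pure-Perl sketch that spells out the count limits in one. `describe_approx_spec` is a hypothetical helper of my own, not part of TRE or re::engine::TRE. It assumes the symbols mean what the TRE syntax docs say (`+` insertions, `-` deletions, `#` substitutions, `~` total errors), only handles limits with an explicit number, and deliberately refuses the cost-equation form.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of TRE): decode the count limits inside
# an approximate match specifier such as '{ #1+1-1~2 }'.
sub describe_approx_spec {
    my ($spec) = @_;

    # Cost equations (anything containing '<') are out of scope for this sketch.
    return undef if $spec =~ /</;

    my %name_for = (
        '+' => 'max_insertions',
        '-' => 'max_deletions',
        '#' => 'max_substitutions',
        '~' => 'max_total_errors',
    );

    my %limits;
    # Each count limit is one symbol followed by a number.
    while ($spec =~ /([-+#~])\s*(\d+)/g) {
        $limits{ $name_for{$1} } = $2 + 0;
    }
    return \%limits;
}

my $limits = describe_approx_spec('{ #1+1-1~2 }');
print "$_ = $limits->{$_}\n" for sort keys %$limits;
```

For the first example's specifier, `{ #1+1~2 }`, the returned hash simply has no `max_deletions` key, because that limit is not set in the specifier.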

Pragmas

When importing re::engine::TRE, you can set various ‘global’ pragmas, such as ‘max_cost’, which specify the overall settings for TRE to use. The defaults can be found in the re::engine::TRE docs. If you set the global max cost to 2, but have an atom allowing up to 3 errors, the global cost will overrule that and no match will be returned.

I should note that I don’t know the details of how Perl’s use statement works well enough to speak to what kind of overhead throwing use re::engine::TRE max_cost => 2 into a function results in. I think that use is a compile-time thingy, and therefore you can put it in a function and have low to no overhead from the ‘import’ each time the function is called. But I could be totally wrong about that.
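
On the `use` question: perlfunc documents `use Module LIST` as exactly equivalent to `BEGIN { require Module; Module->import(LIST); }`, so the import happens once, while the enclosing code is being compiled. Here is a minimal sketch that checks this with a hypothetical stand-in module, `My::CountingPragma`, which I made up for the demo; it does nothing except count how often its `import` is called. It says nothing about any per-match cost inside TRE itself.

```perl
use strict;
use warnings;

# Hypothetical stand-in for a pragma-like module; it only counts import calls.
BEGIN {
    package My::CountingPragma;
    our $imports = 0;
    sub import { $imports++ }
    # Mark the package as loaded so `use` does not go looking for a file.
    $INC{'My/CountingPragma.pm'} = __FILE__;
}

sub match_something {
    use My::CountingPragma;    # runs once, while this sub is compiled
    return 1;
}

# Calling the function repeatedly does not re-run the import.
match_something() for 1 .. 1000;

print "import ran $My::CountingPragma::imports time(s)\n";    # prints: import ran 1 time(s)
```

So putting the `use` line inside a function only narrows the lexical scope it applies to; it does not add an import on every call.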

Other places TRE can be found

tre-agrep

agrep (not exactly the same)

Other Fuzzy Matching Options

Excellent Articles

How to Write a Spelling Corrector by Peter Norvig

SymSpell by Wolf Garbe

A Bunch of Example Test Cases

Some day I will hopefully PR more test cases into re::engine::TRE