#stringmetric

String metrics and phonetic algorithms for Scala. The library provides facilities to perform approximate string matching, measurement of string similarity/distance, indexing by word pronunciation, and sounds-like comparisons. In addition to the core library, each metric and algorithm has a command line interface.

Metrics and algorithms

Depending upon

SBT:

    libraryDependencies += "com.rockymadden.stringmetric" %% "stringmetric-core" % "0.27.4"

Gradle:

    compile 'com.rockymadden.stringmetric:stringmetric-core_2.10:0.27.4'

Maven:

    <dependency>
      <groupId>com.rockymadden.stringmetric</groupId>
      <artifactId>stringmetric-core_2.10</artifactId>
      <version>0.27.4</version>
    </dependency>

Similarity package

Useful for approximate string matching and measurement of string distance. Most metrics calculate the similarity of two strings as a double with a value between 0 and 1, where 0 means the strings are completely different and 1 means they are completely similar.

Dice / Sorensen Metric:

    DiceSorensenMetric(1).compare("night", "nacht") // 0.6
    DiceSorensenMetric(1).compare("context", "contact") // 0.7142857142857143

Note you must specify the size of the n-gram you wish to use.
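To make the role of the n-gram size concrete, here is a minimal, self-contained sketch of a Dice / Sorensen coefficient over character n-grams. It is not the library's implementation; `DiceSketch`, `ngrams`, and `similarity` are hypothetical names, and counting shared n-grams as a multiset intersection is an assumption that happens to reproduce the unigram values shown above.

```scala
object DiceSketch {
  // Split a string into overlapping character n-grams; empty if the string is too short.
  def ngrams(s: String, n: Int): Seq[String] =
    if (n <= 0 || s.length < n) Seq.empty else s.sliding(n).toSeq

  // Dice coefficient: twice the shared n-grams divided by the total n-gram count.
  // Returns None when either string yields no n-grams (an assumption of this sketch).
  def similarity(n: Int)(a: String, b: String): Option[Double] = {
    val ga = ngrams(a, n)
    val gb = ngrams(b, n)
    if (ga.isEmpty || gb.isEmpty) None
    else Some(2.0 * ga.intersect(gb).length / (ga.length + gb.length))
  }
}
```

With `n = 1`, "night" and "nacht" share the unigrams n, h, and t, so the score is 2 * 3 / (5 + 5).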

Hamming Metric:

    HammingMetric.compare("toned", "roses") // 3
    HammingMetric.compare("1011101", "1001001") // 2

Note that this metric is an exception: it returns an integer rather than a double.
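For readers unfamiliar with the Hamming distance, the following sketch counts the positions at which two equal-length strings differ. It is an illustration only, not the library's code; `HammingSketch` is a hypothetical name, and returning `None` for strings of different lengths is an assumption of this sketch.

```scala
object HammingSketch {
  // Number of positions where the characters differ.
  // The distance is undefined for strings of unequal length, so None is returned.
  def distance(a: String, b: String): Option[Int] =
    if (a.length != b.length) None
    else Some(a.zip(b).count { case (x, y) => x != y })
}
```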

Jaccard Metric:

    JaccardMetric(1).compare("night", "nacht") // 0.3
    JaccardMetric(1).compare("context", "contact") // 0.35714285714285715

Note you must specify the size of the n-gram you wish to use.

Jaro Metric:

    JaroMetric.compare("dwayne", "duane") // 0.8222222222222223
    JaroMetric.compare("jones", "johnson") // 0.7904761904761904
    JaroMetric.compare("fvie", "ten") // 0.0

Jaro-Winkler Metric:

    JaroWinklerMetric.compare("dwayne", "duane") // 0.8400000000000001
    JaroWinklerMetric.compare("jones", "johnson") // 0.8323809523809523
    JaroWinklerMetric.compare("fvie", "ten") // 0.0

Levenshtein Metric:

    LevenshteinMetric.compare("sitting", "kitten") // 3
    LevenshteinMetric.compare("cake", "drake") // 2

Note that this metric is an exception: it returns an integer rather than a double.

N-Gram Metric:

    NGramMetric(1).compare("night", "nacht") // 0.6
    NGramMetric(2).compare("night", "nacht") // 0.25
    NGramMetric(2).compare("context", "contact") // 0.5

Note you must specify the size of the n-gram you wish to use.

Overlap Metric:

    OverlapMetric(1).compare("night", "nacht") // 0.6
    OverlapMetric(1).compare("context", "contact") // 0.7142857142857143

Note you must specify the size of the n-gram you wish to use.

Ratcliff/Obershelp Metric:

    RatcliffObershelpMetric.compare("aleksander", "alexandre") // 0.7368421052631579
    RatcliffObershelpMetric.compare("pennsylvania", "pencilvaneya") // 0.6666666666666666

Weighted Levenshtein Metric:

    WeightedLevenshteinMetric(10, 0.1, 1).compare("book", "back") // 2
    WeightedLevenshteinMetric(10, 0.1, 1).compare("hosp", "hospital") // 0.4
    WeightedLevenshteinMetric(10, 0.1, 1).compare("hospital", "hosp") // 40

Note you must specify the weight of each operation, in this order: delete, insert, and substitute. Although a double is returned, it can fall outside the range of 0 to 1, depending on the weights used.
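The effect of the three weights can be seen in a small dynamic-programming sketch. This is not the library's implementation; `WeightedLevenshteinSketch` and `distance` are hypothetical names, and the sketch assumes the cost is that of transforming the first string into the second, which is consistent with the asymmetric "hosp" / "hospital" results above.

```scala
object WeightedLevenshteinSketch {
  // Cheapest cost of turning `a` into `b` using weighted delete, insert, and substitute operations.
  def distance(delete: Double, insert: Double, substitute: Double)(a: String, b: String): Double = {
    // d(i)(j) holds the cheapest cost of turning the first i chars of `a` into the first j chars of `b`.
    val d = Array.ofDim[Double](a.length + 1, b.length + 1)
    for (i <- 1 to a.length) d(i)(0) = d(i - 1)(0) + delete // drop every char of the prefix of `a`
    for (j <- 1 to b.length) d(0)(j) = d(0)(j - 1) + insert // build the prefix of `b` from nothing
    for (i <- 1 to a.length; j <- 1 to b.length) {
      val sub = d(i - 1)(j - 1) + (if (a(i - 1) == b(j - 1)) 0.0 else substitute)
      d(i)(j) = math.min(sub, math.min(d(i - 1)(j) + delete, d(i)(j - 1) + insert))
    }
    d(a.length)(b.length)
  }
}
```

With weights of 10, 0.1, and 1, going from "hosp" to "hospital" needs four cheap inserts, while the reverse direction needs four expensive deletes. Setting all three weights to 1 gives the ordinary Levenshtein distance.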

Phonetic package

Useful for indexing by word pronunciation and performing sounds-like comparisons. All metrics return a boolean value indicating whether the two strings sound the same, per the algorithm used. Each metric has an algorithm counterpart, which provides the means to index by word pronunciation.
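The relationship between an algorithm and its metric can be sketched with a plain Soundex encoder: the algorithm maps a word to a code, and the metric reports whether two words map to the same code. This is an illustrative sketch of classic American Soundex, not the library's implementation; `SoundexSketch` is a hypothetical name, and the lowercase output and `Option` return types are assumptions made for this example.

```scala
object SoundexSketch {
  // Soundex digit for a lowercase letter; '0' marks letters that carry no code (vowels, h, w, y).
  private def code(c: Char): Char = c match {
    case 'b' | 'f' | 'p' | 'v' => '1'
    case 'c' | 'g' | 'j' | 'k' | 'q' | 's' | 'x' | 'z' => '2'
    case 'd' | 't' => '3'
    case 'l' => '4'
    case 'm' | 'n' => '5'
    case 'r' => '6'
    case _ => '0'
  }

  // Algorithm: first letter followed by three digits, or None if the input has no letters.
  def compute(s: String): Option[String] = {
    val letters = s.toLowerCase.filter(c => c >= 'a' && c <= 'z')
    if (letters.isEmpty) None
    else {
      val sb = new StringBuilder
      sb.append(letters.head)
      var last = code(letters.head)
      letters.tail.foreach { c =>
        if (sb.length < 4) {
          val k = code(c)
          // Skip uncoded letters and repeats of the previous code.
          if (k != '0' && k != last) sb.append(k)
          // h and w do not separate letters that share a code; vowels do.
          if (c != 'h' && c != 'w') last = k
        }
      }
      while (sb.length < 4) sb.append('0')
      Some(sb.toString)
    }
  }

  // Metric: two words "sound the same" when their codes are equal.
  def compare(a: String, b: String): Option[Boolean] =
    for (x <- compute(a); y <- compute(b)) yield x == y
}
```

The codes returned by `compute` can serve as index keys, which is what makes the algorithm half useful for indexing by pronunciation.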

Metaphone Metric:

    MetaphoneMetric.compare("merci", "mercy") // true
    MetaphoneMetric.compare("dumb", "gum") // false

Metaphone Algorithm:

    MetaphoneAlgorithm.compute("dumb") // tm
    MetaphoneAlgorithm.compute("knuth") // n0

NYSIIS Metric:

    NysiisMetric.compare("ham", "hum") // true
    NysiisMetric.compare("dumb", "gum") // false

NYSIIS Algorithm:

    NysiisAlgorithm.compute("macintosh") // mcant
    NysiisAlgorithm.compute("knuth") // nnat

Refined NYSIIS Metric:

    RefinedNysiisMetric.compare("ham", "hum") // true
    RefinedNysiisMetric.compare("dumb", "gum") // false

Refined NYSIIS Algorithm:

    RefinedNysiisAlgorithm.compute("macintosh") // mcantas
    RefinedNysiisAlgorithm.compute("westerlund") // wastarlad

Refined Soundex Metric:

    RefinedSoundexMetric.compare("robert", "rupert") // true
    RefinedSoundexMetric.compare("robert", "rubin") // false

Refined Soundex Algorithm:

    RefinedSoundexAlgorithm.compute("hairs") // h093
    RefinedSoundexAlgorithm.compute("lambert") // l7081096

Soundex Metric:

    SoundexMetric.compare("robert", "rupert") // true
    SoundexMetric.compare("robert", "rubin") // false

Soundex Algorithm:

    SoundexAlgorithm.compute("rupert") // r163
    SoundexAlgorithm.compute("lukasiewicz") // l222

Convenience objects

StringAlgorithm:

    StringAlgorithm.computeWithMetaphone("abcdef")
    StringAlgorithm.computeWithNysiis("abcdef")

StringMetric:

    StringMetric.compareWithJaccard(1)("abcdef", "abcxyz")
    StringMetric.compareWithJaroWinkler("abcdef", "abcxyz")

Decorating

It is possible to decorate algorithms and metrics with additional functionality, which you can mix and match. Decorations include:

withMemoization: Computations and comparisons are cached. Future calls made with identical arguments will be looked up, rather than computed.

withTransform: Transform arguments prior to computation/comparison. A handful of pre-built transforms are located in the transform module.

Non-decorated:

    MetaphoneAlgorithm.compute("abcdef")
    MetaphoneMetric.compare("abcdef", "abcxyz")

Using memoization:

    (MetaphoneAlgorithm withMemoization).compute("abcdef")

Using a transform so that we only examine alphabetical characters:

    (MetaphoneAlgorithm withTransform filterAlpha).compute("abcdef")
    (MetaphoneMetric withTransform filterAlpha).compare("abcdef", "abcxyz")

Using a functionally composed transform so that we only examine alphabetical characters, but the case will not matter:

    val composedTransform = (filterAlpha andThen ignoreAlphaCase)
    (MetaphoneAlgorithm withTransform composedTransform).compute("abcdef")
    (MetaphoneMetric withTransform composedTransform).compare("abcdef", "abcxyz")

Making your own transform:

    val myTransform: StringTransform = (ca) => ca.filter(_ == 'x')
    (MetaphoneAlgorithm withTransform myTransform).compute("abcdef")
    (MetaphoneMetric withTransform myTransform).compare("abcdef", "abcxyz")

Using memoization and a transform:

    ((MetaphoneAlgorithm withMemoization) withTransform filterAlpha).compute("abcdef")
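To show the idea behind these decorations without depending on the library, here is a minimal sketch that treats an algorithm as a plain function and wraps it. Everything in it is hypothetical (`DecorationSketch`, `Algorithm`, `Transform`, `memoize`, `transformed`, `lowerCase`); its `filterAlpha` only mirrors the name of the library transform. It works on `String` for simplicity, and it is not how the library implements `withMemoization` or `withTransform`.

```scala
import scala.collection.mutable

object DecorationSketch {
  // Hypothetical stand-ins: an algorithm encodes a string, a transform rewrites its input.
  type Algorithm = String => Option[String]
  type Transform = String => String

  // Memoization: cache results so repeated calls with the same argument skip the computation.
  def memoize(f: Algorithm): Algorithm = {
    val cache = mutable.Map.empty[String, Option[String]]
    s => cache.getOrElseUpdate(s, f(s))
  }

  // Transformation: rewrite the argument before handing it to the wrapped algorithm.
  def transformed(f: Algorithm, t: Transform): Algorithm = s => f(t(s))

  // Two simple transforms; because they are functions, they compose with andThen.
  val filterAlpha: Transform = _.filter(_.isLetter)
  val lowerCase: Transform = _.toLowerCase
}
```

Because both decorators return the same function type they accept, they can be stacked in either order, for example `memoize(transformed(algorithm, filterAlpha andThen lowerCase))`.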

Building the CLIs

    $ git clone https://github.com/rockymadden/stringmetric.git
    $ cd stringmetric
    $ sbt clean package
    $ ./project/build.sh
    $ ./target/cli/jarometric abc xyz

Using the CLIs

Get help:

    $ metaphonemetric --help
    Compares two strings to determine if they are phonetically similarly, per the Metaphone algorithm.

    Syntax:
      metaphonemetric [Options] string1 string2...

    Options:
      -h, --help
        Outputs description, syntax, and options.

Get comparison value with metrics:

    $ jarowinklermetric dog dawg
    0.75

Get representation value with phonetic algorithms:

    $ metaphonealgorithm dog
    tk

License