SimString

A fast and simple algorithm for approximate string matching/retrieval

Introduction

SimString is a simple library for fast approximate string retrieval. Approximate string retrieval finds strings in a database whose similarity with a query string is no smaller than a threshold. Because it finds similar as well as identical strings, approximate string retrieval has various applications, including spelling correction, flexible dictionary matching, duplicate detection, and record linkage. SimString supports the cosine, Jaccard, Dice, and overlap coefficients as similarity measures, and uses letter n-grams as features for computing string similarity.

SimString has the following features:

Fast algorithm for approximate string retrieval. For example, SimString can find strings in Google Web1T unigrams (13,588,391 strings) that have cosine similarity ≥ 0.7 in 1.10 ms per query (on an Intel Xeon 5140 2.33 GHz CPU).

100% exact retrieval. Although some algorithms allow misses (false negatives) in exchange for faster query response, SimString is guaranteed to achieve 100% correct retrieval with fast query response.

Unicode (wchar_t) support. For languages using multi-byte characters, developers can use Unicode characters (wchar_t) instead of single-byte characters (char) as a character representation.

Implementation in C++ header files. Developers can add the functionality of approximate string retrieval to C++ programs just by including a header file.

Python and Ruby bindings via SWIG. Developers can easily perform approximate string retrieval in scripting languages.
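To make the similarity measures named in the introduction concrete, here is a minimal Python sketch. It is not SimString's implementation and does not use the SimString module. It assumes each string is represented as a set of letter trigrams with no padding at the string boundaries, and it retrieves by brute force, which is slow but shows exactly which strings an exact retrieval must return. The names `ngrams`, `similarity`, and `retrieve` are hypothetical and chosen only for this illustration.

```python
import math


def ngrams(s, n=3):
    """Return the set of letter n-grams of s (no padding; an assumption)."""
    if len(s) < n:
        # Strings shorter than n are treated as a single feature.
        return {s} if s else set()
    return {s[i:i + n] for i in range(len(s) - n + 1)}


def similarity(x, y, measure="cosine", n=3):
    """Set-based similarity between the n-gram sets of x and y."""
    X, Y = ngrams(x, n), ngrams(y, n)
    if not X or not Y:
        return 0.0
    common = len(X & Y)
    if measure == "cosine":
        return common / math.sqrt(len(X) * len(Y))
    if measure == "jaccard":
        return common / len(X | Y)
    if measure == "dice":
        return 2 * common / (len(X) + len(Y))
    if measure == "overlap":
        return common / min(len(X), len(Y))
    raise ValueError("unknown measure: " + measure)


def retrieve(database, query, threshold, measure="cosine", n=3):
    """Brute-force reference: all strings with similarity >= threshold."""
    return [s for s in database
            if similarity(s, query, measure, n) >= threshold]


if __name__ == "__main__":
    db = ["abcd", "abce", "wxyz"]
    print(retrieve(db, "abcd", 0.5))
```

A fast exact algorithm must return the same set of strings as this scan for any query and threshold; only the running time may differ.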

Download

The current release is SimString version 1.0.

Source code

SimString is distributed under the modified BSD license. Please use the following BibTeX entry when you cite SimString in your papers.

    @InProceedings{Okazaki:Coling2010,
      author    = {Okazaki, Naoaki and Tsujii, Jun'ichi},
      title     = {Simple and Efficient Algorithm for Approximate Dictionary Matching},
      booktitle = {Proceedings of the 23rd International Conference on Computational Linguistics (Coling 2010)},
      month     = {August},
      year      = {2010},
      address   = {Beijing, China},
      pages     = {851--859},
      url       = {http://www.aclweb.org/anthology/C10-1096}
    }

Change log

SimString 1.0 (2010-03-07)
Initial release.

How to build

Building simstring utility

To build the simstring utility, follow the usual configure-and-make procedure. The source code of the simstring utility is located at frontend/main.cpp. Running "make" builds the binary frontend/simstring.

    $ ./configure
    $ make
    # To install the SimString header files
    $ make install

Using the SimString library from C++ programs

Add the include directory in the distribution to the include path when compiling your C++ program. Please refer to the sample program in C++, the source code of the simstring utility (frontend/main.cpp), and the SimString C++ API Documentation. The API is simple enough that a developer can use it just by looking at the sample program.

Building a Python/Ruby binding of SimString

Please refer to the sample program of the SimString module and the SimString SWIG module documentation. The API is simple enough that a developer can use it just by looking at the sample program. Currently, build instructions are available only for the Python and Ruby modules, but it should be easy to build modules for other languages via SWIG. (It would be very helpful if you could submit a sample program in other scripting languages; I'm not so familiar with scripting languages other than Python.)

Building a Python module

Build simstring.py and _simstring.so, and install them.

    $ ./configure
    $ cd swig/python
    $ ./prepare.sh
    $ python setup.py build_ext
    $ python setup.py install

Adding the "--inplace" option to the build_ext command builds simstring.py and _simstring.so in the current directory. If these files are placed in a directory on Python's module path (e.g., the current directory where the Python process is started), one can try the SimString module without installing it.

Building a Ruby module

Build and install simstring.so.

    $ ./configure
    $ cd swig/ruby
    $ ./prepare.sh
    $ ruby extconf.rb
    $ make
    $ make install

Running "make" builds simstring.so in the current directory. If the file is placed in a directory on Ruby's module path (e.g., the current directory where the Ruby process is started), one can try the SimString module without installing it.
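For the Python module, one quick way to confirm that the built files are on the module path, whether installed or built in place, is to ask Python whether it can locate the module. The sketch below uses only the standard library; the helper name `module_available` is hypothetical, and the check only locates the module without importing it or calling any SimString function.

```python
import importlib.util


def module_available(name):
    """Return True if Python can locate a top-level module called `name`."""
    return importlib.util.find_spec(name) is not None


if __name__ == "__main__":
    # Run this from the directory that holds simstring.py and _simstring.so
    # (or from anywhere, once the module has been installed).
    if module_available("simstring"):
        print("simstring is on the module path")
    else:
        print("simstring was not found; check the build or the module path")
```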