:: libTextCat ::


What is it?

Libtextcat is a library with functions that implement the classification technique described in Cavnar & Trenkle, "N-Gram-Based Text Categorization" [1]. It was primarily developed for language guessing, a task on which it is known to perform with near-perfect accuracy.

The central idea of the Cavnar & Trenkle technique is to calculate a "fingerprint" of a document with an unknown category, and compare it with the fingerprints of a number of documents whose categories are known. The categories of the closest matches are output as the classification. A fingerprint is a list of the most frequent n-grams occurring in a document, ordered by frequency. Fingerprints are compared with a simple out-of-place metric. See the article for more details.
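To make the idea concrete, here is a minimal sketch of fingerprinting and the out-of-place metric in Python. It is not libtextcat's code or API. The function names, the underscore word padding, the n-gram lengths (1 to 5), the fingerprint size (400) and the penalty for a missing n-gram are all assumptions chosen for illustration, loosely following the article.

```python
from collections import Counter


def fingerprint(text, max_n=5, top=400):
    """Return the `top` most frequent character n-grams, most frequent first.

    Assumptions for this sketch: words are lowercased, split on whitespace
    and padded with "_" on both sides; n runs from 1 to max_n.
    """
    counts = Counter()
    for word in text.lower().split():
        padded = "_" + word + "_"
        for n in range(1, max_n + 1):
            for i in range(len(padded) - n + 1):
                counts[padded[i:i + n]] += 1
    # Sort by descending frequency; break ties alphabetically so the
    # result is deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [gram for gram, _ in ranked[:top]]


def out_of_place(doc_fp, cat_fp):
    """Sum, over the document's n-grams, of how far each one's rank is
    from its rank in the category fingerprint.

    An n-gram missing from the category fingerprint gets a fixed maximum
    penalty (here, the length of the category fingerprint; an assumption).
    """
    rank = {gram: i for i, gram in enumerate(cat_fp)}
    max_penalty = len(cat_fp)
    return sum(
        abs(i - rank[gram]) if gram in rank else max_penalty
        for i, gram in enumerate(doc_fp)
    )


def classify(text, category_fps):
    """Return the category whose fingerprint is closest to the text's."""
    doc_fp = fingerprint(text)
    return min(category_fps, key=lambda c: out_of_place(doc_fp, category_fps[c]))
```

A category fingerprint would be built once from a sample text per category, for example `{"en": fingerprint(english_sample), "nl": fingerprint(dutch_sample)}` (hypothetical variable names), and then passed to `classify` together with the unknown document. Real language models are built from much larger samples than a toy example would use.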

Considerable effort went into making this implementation fast and efficient. The language guesser processes over 100 documents/second on a simple PC, which makes it practical for many uses. It was developed for use in our webcrawler and search engine software, in which it handles millions of documents a day.

News

Dec 5, 2003 - version 2.2. Long overdue version with autotools config (thanks to Jeff Johnson)

May 20, 2003 - version 2.1. Now includes Gertjan van Noord's language models (with Gertjan's explicit consent). Much cleaner makefile.

May 15, 2003 - unleashed the code

Download

The library is released under the BSD License, which basically states that you can do anything you like with it as long as you mention us and make it clear that this library is covered by the BSD License. It also exempts us from any liability, should this library eat your hard disc, kill your cat or classify your attorney's e-mails as spam.

The current version is 2.2.

At the moment there is no development version.

Previous releases

Documentation

We have some snippets of documentation online. These should be enough to get you started.

References

[1] The document that started it all: William B. Cavnar & John M. Trenkle (1994), "N-Gram-Based Text Categorization"

[2] The Perl implementation by Gertjan van Noord (code + language models): downloadable from his website

Related Links

JTextCat - A Java interface to libTextCat by Patrick Debois

Contact

Praise and flames may be directed at us through libtextcat AT wise-guys.nl. If there is enough interest, we'll whip up a mailing list. The current project maintainer is Frank Scheelen.

© 2003 WiseGuys Internet B.V.