The `Bow' Toolkit

Bow: A Toolkit for Statistical Language Modeling, Text Retrieval, Classification and Clustering

Bow (or libbow) is a library of C code useful for writing statistical text analysis, language modeling and information retrieval programs. The current distribution includes the library, as well as front-ends for document classification (rainbow), document retrieval (arrow) and document clustering (crossbow).

The library and its front-ends were designed and written by Andrew McCallum, with contributions from several graduate and undergraduate students.

The name of the library rhymes with `low', not `cow'.

About the library

The library provides facilities for:

Recursively descending directories, finding text files.

Finding `document' boundaries when there are multiple documents per file.

Tokenizing a text file, according to several different methods.

Including N-grams among the tokens.

Mapping strings to integers and back again, very efficiently.

Building a sparse matrix of document/token counts.

Pruning vocabulary by word counts or by information gain.

Building and manipulating word vectors.

Setting word vector weights according to Naive Bayes, TFIDF, and several other methods.

Smoothing word probabilities according to Laplace (Dirichlet uniform), M-estimates, Witten-Bell, and Good-Turing.

Scoring queries for retrieval or classification.

Writing all data structures to disk in a compact format.

Reading the document/token matrix from disk in an efficient, sparse fashion.

Performing test/train splits, and automatic classification tests.

Operating in server mode, receiving and answering queries over a socket.
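Two of the facilities above, mapping strings to integers and back and Laplace smoothing, can be illustrated with a minimal sketch. This is not libbow's API: the type `vocab_t` and the functions `vocab_index`, `vocab_word`, `vocab_add_text` and `vocab_laplace` are hypothetical names invented for this illustration, and the linear scan stands in for whatever efficient lookup the library really uses.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define MAX_WORDS 1024            /* fixed capacity keeps the sketch short */

/* Hypothetical vocabulary table (not libbow's): index <-> string. */
typedef struct {
  char *words[MAX_WORDS];         /* index -> string */
  int counts[MAX_WORDS];          /* index -> occurrence count */
  int size;                       /* number of distinct words */
  int total;                      /* number of tokens counted */
} vocab_t;

/* Return the index of WORD, adding it if absent; -1 if no room. */
int vocab_index (vocab_t *v, const char *word)
{
  assert (v != NULL && word != NULL);
  for (int i = 0; i < v->size; i++)   /* linear scan; a real library would hash */
    if (strcmp (v->words[i], word) == 0)
      return i;
  if (v->size == MAX_WORDS)
    return -1;
  char *copy = malloc (strlen (word) + 1);
  if (copy == NULL)
    return -1;
  strcpy (copy, word);
  v->words[v->size] = copy;
  v->counts[v->size] = 0;
  return v->size++;
}

/* Map an index back to its string, or NULL if out of range. */
const char *vocab_word (const vocab_t *v, int index)
{
  return (index >= 0 && index < v->size) ? v->words[index] : NULL;
}

/* Split TEXT on whitespace and count every token. */
void vocab_add_text (vocab_t *v, const char *text)
{
  const char *delims = " \t\r\n";
  char token[64];
  while (*text != '\0')
    {
      text += strspn (text, delims);         /* skip whitespace */
      size_t len = strcspn (text, delims);   /* length of next token */
      if (len == 0)
        break;
      /* Over-long tokens are truncated to fit the buffer. */
      size_t n = len < sizeof token - 1 ? len : sizeof token - 1;
      memcpy (token, text, n);
      token[n] = '\0';
      int i = vocab_index (v, token);
      if (i >= 0)
        {
          v->counts[i]++;
          v->total++;
        }
      text += len;
    }
}

/* Laplace (add-one) estimate: (count + 1) / (total + vocabulary size).
   An out-of-range index is treated as a word with zero count. */
double vocab_laplace (const vocab_t *v, int index)
{
  if (v->size == 0)
    return 0.0;
  int count = (index >= 0 && index < v->size) ? v->counts[index] : 0;
  return (count + 1.0) / (v->total + v->size);
}
```

With a zero-initialized `vocab_t`, calling `vocab_add_text` on "the cat saw the dog" yields four distinct words and five tokens, so the smoothed estimate for `the` is (2 + 1) / (5 + 4) = 1/3. The add-one term is what keeps words with a zero count from receiving zero probability.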

The library does not:

Have English parsing or part-of-speech tagging facilities.

Do smoothing across N-gram models.

Claim to be finished.

Have good documentation.

Claim to be bug-free.

It is known to compile on most UNIX systems, including Linux, Solaris, SunOS, IRIX and HP-UX. Over a year ago, it compiled on Windows NT (with a GNU build environment); it no longer does, but probably could with small fixes. Patches to the code are most welcome. It is developed on a Linux system.

The code conforms to the GNU coding standards. It is released under the GNU Library General Public License (LGPL).

Citation

McCallum, Andrew Kachites. "Bow: A toolkit for statistical language modeling, text retrieval, classification and clustering." http://www.cs.cmu.edu/~mccallum/bow. 1996.

Here is a BibTeX entry:

@unpublished{McCallumLibbow,
  author = "Andrew Kachites McCallum",
  title = "Bow: A toolkit for statistical language modeling, text retrieval, classification and clustering",
  note = "http://www.cs.cmu.edu/~mccallum/bow",
  year = 1996
}

Obtaining the Source

Source code for the library can be downloaded from this directory. Versions are identified by eight-digit sequences giving the year, month and day (YYYYMMDD), so the most recent version is the one with the largest number.
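Because the digits run from year to month to day, comparing two version strings character by character gives the same answer as comparing the dates. Here is a minimal sketch of picking the newest version that way; the function names `is_date_version` and `newest_version` are hypothetical, and nothing here is part of the distribution.

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

/* Return 1 if S is exactly eight decimal digits, else 0. */
int is_date_version (const char *s)
{
  assert (s != NULL);
  if (strlen (s) != 8)
    return 0;
  for (int i = 0; i < 8; i++)
    if (!isdigit ((unsigned char) s[i]))
      return 0;
  return 1;
}

/* Return the newest of N version strings, or NULL if none is valid.
   Fixed-width digit strings sort the same way as the dates they encode,
   so strcmp is enough. */
const char *newest_version (const char *const versions[], int n)
{
  const char *best = NULL;
  for (int i = 0; i < n; i++)
    {
      if (!is_date_version (versions[i]))
        continue;                       /* skip anything that is not a date */
      if (best == NULL || strcmp (versions[i], best) > 0)
        best = versions[i];
    }
  return best;
}
```

Given the made-up list "19980527", "20020213", "notes", "19961022", `newest_version` returns "20020213" and ignores the entry that is not eight digits.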

Unfortunately, I do not have time to help rainbow's many users with all their compilation and usage problems. Feel free to send me mail asking for help, but please do not count on my having time to respond. Bug reports accompanied by fixes are most appreciated.

Bow Library Front-Ends