WhatLanguage: Ruby Library To Detect The Language Of A Text

By Peter Cooper

WhatLanguage is a library by Peter Cooper (disclaimer: yes, that's me) that makes it quick and easy to determine what language a supplied text is written in. It's pretty accurate on anything from a short sentence up to several paragraphs in all of the languages supplied with the library (Dutch, English, Farsi, Russian, French, German, Portuguese, Spanish, Pinyin) and adding languages of your own choosing isn't difficult.

The library works by checking each word of the text against bloom filters built from a dictionary for each language. We've covered bloom filters on Ruby Inside before, but essentially they're probabilistic data structures built by hashing a large set of content. They're ideal when you want to check set membership and the risk of false positives is acceptable in return for significant memory savings (and a 250KB bloom filter is a lot nicer to deal with than a 14MB+ dictionary).
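To make the idea concrete, here is a minimal sketch of the technique, not WhatLanguage's actual implementation. The TinyBloomFilter class, the DICTIONARIES and FILTERS constants, the language_scores method, the MD5-based hashing and the tiny word lists are all assumptions made up for illustration; the real library builds its filters from full dictionaries.

```ruby
require 'digest'

# Hypothetical minimal bloom filter: a fixed-size bit array plus a few
# seeded hash functions. It can report false positives, never false negatives.
class TinyBloomFilter
  def initialize(size_in_bits = 8192, hash_count = 3)
    @size = size_in_bits
    @hash_count = hash_count
    @bits = Array.new(@size, false)
  end

  # Set every bit position that the word hashes to.
  def add(word)
    positions(word).each { |i| @bits[i] = true }
    self
  end

  # A word is "probably present" only if all of its bit positions are set.
  def include?(word)
    positions(word).all? { |i| @bits[i] }
  end

  private

  # Derive hash_count positions by salting an MD5 digest with a seed.
  def positions(word)
    (0...@hash_count).map do |seed|
      Digest::MD5.hexdigest("#{seed}:#{word.downcase}").to_i(16) % @size
    end
  end
end

# Toy dictionaries, invented for this example only.
DICTIONARIES = {
  english: %w[the is a man i am this test of],
  french:  %w[je suis un homme le la est ceci]
}

# One filter per language, filled from that language's word list.
FILTERS = DICTIONARIES.transform_values do |words|
  words.each_with_object(TinyBloomFilter.new) { |w, f| f.add(w) }
end

# Count, per language, how many words of the text its filter claims to contain.
def language_scores(text)
  words = text.downcase.scan(/[[:alpha:]]+/)
  FILTERS.transform_values { |filter| words.count { |w| filter.include?(w) } }
end
```

Calling language_scores("Je suis un homme") returns a hash of per-language hit counts, and the language with the highest count is the best guess. Because a filter can occasionally answer yes for a word it never stored, a count can be slightly inflated, which is the accuracy traded away for the memory savings.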

WhatLanguage is available from GitHub (and can be installed as a gem from there with gem install peterc-whatlanguage) or from RubyForge with a simpler gem install whatlanguage. Once installed, usage is simple:

    require 'whatlanguage'

    "Je suis un homme".language

    wl = WhatLanguage.new(:all)
    wl.language("Je suis un homme")
    wl.process_text("this is a test of whatlanguage's great language detection features")

I initially wrote the library a year ago but have only just made it available for public use, so if there are unforeseen bugs to fix or things that really need to be added, fork it on GitHub and get playing.