

OSBF-Lua

Text classification module for the Lua Programming Language

and a production-class anti-spam filter in Lua using the module

#1 in the CEAS 2008 Spam Filter Live Challenge · Winner of TREC's Spam Track 2006




Overview

OSBF-Lua (Orthogonal Sparse Bigrams with confidence Factor) is a Lua C module for text classification. It is a port of the OSBF classifier implemented in the CRM114 project. This implementation aims to focus on the classification task itself by using Lua as the scripting language, a powerful yet light-weight and fast language, which makes it easier to build and test more elaborate filters and training methods.

The OSBF algorithm is a typical Bayesian classifier, enhanced with two techniques that I originally developed for the CRM114 project: Orthogonal Sparse Bigrams (OSB), for feature extraction, and Exponential Differential Document Count (EDDC, a.k.a. Confidence Factor), for automatic feature selection. Combined, these two techniques produce a highly accurate classifier. OSBF was developed with two classes in mind, SPAM and NON-SPAM, so the performance with more than two classes may not be the same.

spamfilter.lua is an anti-spam filter written in Lua using the OSBF-Lua module. It takes special advantage of EDDC to introduce TONE-HR, a highly effective training method. The combination of OSB, EDDC and TONE-HR to enhance a classical Bayesian classifier resulted in the best spam filtering performance in TREC's Spam Track 2006 and in the CEAS 2008 Live Challenge.

The Confidence Factor was officially introduced in the paper "Exponential Differential Document Count - A Feature Selection Factor for Improving Bayesian Filters Accuracy", presented at the MIT Spam Conference 2006, after being in experimental use for more than a year in both projects, CRM114 and OSBF-Lua. The conference slides are also available.

The OSB technique was officially announced in the paper "Combining Winnow with Orthogonal Sparse Bigrams for Incremental Spam Filtering", a work headed and presented by Christian Siefkes at the 8th European Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD), in September 2004.

The CRM114 implementation of OSBF was one of the classifiers submitted to TREC's Spam Track 2005 by the CRM114 team, but its first results were not good because of a bug. Later, the bug was fixed and the OSBF-Lua version was submitted to the track coordinator, Prof. Gordon Cormack, for an extra evaluation. The new results were comparable to those of the best participants, with the advantage of being 5 to 10 times faster. Our notebook paper comments on the results of the four filters submitted by the CRM114 team: OSBF, Winnow, OSB Unique and OSB.

OSBF-Lua is free software and is released under the GPL version 2. You can get a copy of the license at GPL. This distribution includes a copy of the license in the file gpl.txt.





What's new

February 2007 - Article on OSBF-Lua published in Virus Bulletin.

Fidelis Assis, OSBF-Lua, February 2007, Virus Bulletin. Copyright is held by Virus Bulletin Ltd, but the article is made available on this site for personal use, free of charge, by permission of Virus Bulletin.





14/Jan/2007 - Version 2.0.4





Changes to osbf module



Removed unnecessary linking of liblua.a, which caused segfaults on IRIX 6.5.30. This fix also reduced the size of the module by a factor of 5. Problem detected and fixed by Holger Weiss;





Fixed the number of args returned by osbf.classify in case of error;







Changes to spamfilter.lua



Added --help option;




Extended syntax to read from file passed as arg in command line. If no file is given, it reads from standard input as usual;





Better error handling;





Fixed optind in getopt.lua.




Fixed a date parsing error in cache_report.lua, caused mainly by ill-formed date fields in spam messages;



The scripts classify.lua and train.lua were renamed to classify.sample and train.sample, because they are meant more as samples, or starting points for customized scripts, than for real use. spamfilter.lua should be used for real classification and training;



Added the file HISTORY_AND_AGREEMENT which states the dual-license agreement between Fidelis Assis and William Yerazunis.

See the full CHANGES





Download and Contributions

The sources can be downloaded from LuaForge.

Alessandro Martins maintains OSBF-Lua and Lua packages for Slackware.

Cassiano Aquino maintains OSBF-Lua packages for Debian: i386, amd64 and sparc.

Also available via apt-get: deb http://sendmail.com.br/ unstable free.

Steve Pellegrin maintains a SquirrelMail plugin for OSBF-Lua.

Christian Siefkes wrote a Perl script that makes the training process much easier. He also has a tutorial on how to install the filter locally.

Marlon Cabrera shows how to integrate OSBF-Lua and Exim.

Diego Aguirre has a patch for Openwebmail to add training buttons.

For a general-purpose text classifier based on OSBF-Lua, see Christian Siefkes' Moonfilter.

Holger Weiss wrote train_osbf, a script to create trained databases that reads messages directly from mbox folders. He also wrote a handy script for resizing the databases.





Installation

OSBF-Lua requires Lua 5.1 installed with dynamic loading enabled. OSBF-Lua was developed and tested under the Lua 5.1 work, alpha, beta and final 5.1 versions. It probably won't work with previous versions.

Installation steps:

Install Lua with dynamic loading enabled: For Linux, execute "make linux" and "make install". For other operating systems, read the instructions in the INSTALL file and your OS documentation on how to create shared libs. You might want to change the occurrences of the -O2 flag in CFLAGS to -O3, in all makefiles, for increased speed.



Install the OSBF-Lua module:

$ tar xvzf osbf-lua-x.y.z.tar.gz

$ cd osbf-lua-x.y.z

<edit the "config" file to suit your platform (not necessary for Linux) or to change the default installation PREFIX dir (/usr/local)>

$ make

$ make install





If you want to install in the default dir, you must be root to do the "make install" step. If you don't have root access, you may set PREFIX to point to a dir you have write access to, for instance $HOME/lib. In that case, you need to add the new installation dir to LUA_CPATH so that the Lua loader can find osbf.so.

Ex: Installing in $HOME/lib



<edit config and set PREFIX to $HOME/lib>

$ mkdir $HOME/lib

$ make install

$ export LUA_CPATH="$HOME/lib/?.so;${LUA_CPATH:-;}" # Lua separates path templates with ';', and ';;' expands to the default path





After osbf module is properly installed, you may want to install the spamfilter, a Lua script that uses the OSBF-Lua module to classify and tag messages as spam or non-spam (ham) according to the score they get, or to the white/blacklists, if any:



make install_spamfilter



The spamfilter files are installed in /usr/local/osbf-lua. If the dir doesn't exist, it will be created.

The next step is to configure your email account to use the spamfilter:

do the following steps under your account, not as root

create your local osbf-lua dir: mkdir $HOME/osbf-lua

create your log and cache dirs:

mkdir $HOME/osbf-lua/log

mkdir $HOME/osbf-lua/cache



Note: Old messages in the cache dir should be deleted regularly, typically from a cron job, to preserve disk space. Check Christian Siefkes' trainfilter for his clean-up script.

copy the spamfilter config file to your dir: cp /usr/local/osbf-lua/spamfilter_config.lua $HOME/osbf-lua

edit spamfilter_config.lua to set your password

change the current dir to your osbf-lua dir and create the spamfilter databases (spam.cfc and nonspam.cfc):

cd $HOME/osbf-lua

lua /usr/local/osbf-lua/create_databases.lua # change '/usr/local' to your PREFIX

add the following lines to your .procmailrc

# set OSBF_LUA_DIR to where spamfilter.lua, spamfilter_command.lua etc were installed

OSBF_LUA_DIR=/usr/local/osbf-lua # change '/usr/local' to your PREFIX

OSBF_LUA_USER_DIR=$HOME/osbf-lua

# let the Lua interpreter find the "osbf" module.

# uncomment if you installed a local copy of the osbf module (e.g. no root access)

#LUA_CPATH="$HOME/lib/?.so;;"





:0fw: .msgid.lock

* < 350000 # don't check messages greater than 350000 bytes

| $OSBF_LUA_DIR/spamfilter.lua --udir $OSBF_LUA_USER_DIR

Note: The "osbf-lua" dir and all files and dirs under it must be writable by the user or group that procmail runs under.
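The regular clean-up of the cache dir mentioned in the steps above could be sketched as a small shell function run from cron. This is only an illustration, not the clean-up script referred to above: the function name clean_cache and the 14-day retention period are assumptions, and only the cache path comes from the steps in this section.

```shell
#!/bin/sh
# Sketch: delete cached messages older than a given number of days.
# clean_cache and the 14-day default are hypothetical, chosen for
# illustration; they are not part of OSBF-Lua.

clean_cache() {
    cache_dir=$1
    max_age_days=${2:-14}
    # Refuse to run on a missing directory rather than guess.
    [ -d "$cache_dir" ] || { echo "no such dir: $cache_dir" >&2; return 1; }
    # -type f: regular files only.
    # -mtime +N: last modified more than N days ago.
    find "$cache_dir" -type f -mtime +"$max_age_days" -exec rm -f {} +
}

# Example call, e.g. from a script started daily by cron:
# clean_cache "$HOME/osbf-lua/cache" 14
```

Messages removed from the cache can no longer be recovered for training, so the retention period should be longer than the time you usually take to send training commands.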

Check your installation by sending a message to yourself with the following command in the subject line:

help <your password>

You should receive a message with help on the spamfilter. Then, send another command in the subject line to verify that the databases were created correctly:

stats <your password>

You should get a statistics report on the newly created databases.

From now on, all messages you receive that are smaller than the maximum size specified in the procmail recipe will be classified and tagged according to the score they get:





Tag    Meaning
[--]   almost sure it's spam - score <= -20
[-]    probably spam (reinforcement zone) - score < 0 and > -20
[+]    probably not spam (reinforcement zone) - score >= 0 and < 20
[++]   almost sure it's not spam - score >= 20. This tag is here just for symmetry; it's not used. An empty tag is used in place of it so as not to pollute the messages.
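The score ranges above can be expressed as a short shell function. This is a sketch of the mapping only, not the code spamfilter.lua uses: the name score_to_tag is hypothetical, and the limit of 20 is simply the default reinforcement zone described in this section.

```shell
#!/bin/sh
# Sketch: map a classification score to the subject tag described above.
# score_to_tag is a hypothetical helper, not part of OSBF-Lua.
# $1 = score (may be fractional), $2 = zone limit (default 20).

score_to_tag() {
    awk -v score="$1" -v limit="${2:-20}" 'BEGIN {
        if (score <= -limit)     print "[--]"   # almost sure it is spam
        else if (score < 0)      print "[-]"    # probably spam
        else if (score < limit)  print "[+]"    # probably not spam
        else                     print ""       # [++] is shown as an empty tag
    }'
}

# Example: score_to_tag -3.7
```

awk does the comparison because plain sh arithmetic is integer-only, while scores may be fractional.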

If the classification is wrong, you must train the filter by replying to the message (use "Reply", not "Forward"), addressing the reply to yourself and replacing the subject with the corresponding training command:

learn <password> spam

or

learn <password> nonspam

The body of the message may be erased; it's not required. The original message, temporarily saved on the server by the spamfilter, will be recovered through the SFID (Spam Filter ID), a special mark added to the header when the message arrived at the server.

If you make a mistake, you should undo the training with the command "unlearn". Ex:

unlearn <password> spam

if wrongly trained as spam.

Training when the classification is wrong is essential for accuracy. Training when the score falls in the reinforcement zone, called reinforcement, is highly recommended for raising accuracy and keeping it high. After you have a well-trained filter, say 99% accuracy or better, you may want to narrow the reinforcement zone, e.g. to [-10, 10], so as not to do many reinforcements a day. You may change the reinforcement zone, tags, etc., by editing the file spamfilter_config.lua.
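The training rule just described (train on errors, reinforce inside the zone) can be written down as a small decision function. This is a sketch under stated assumptions: the name training_action and its output words are hypothetical, and it assumes, as in the tag ranges of this section, that a negative score means "spam" and a score of zero or more means "nonspam".

```shell
#!/bin/sh
# Sketch: decide whether a message calls for training.
# training_action is a hypothetical helper, not part of OSBF-Lua.
# $1 = score, $2 = true class ("spam" or "nonspam"),
# $3 = reinforcement zone limit (default 20).

training_action() {
    awk -v score="$1" -v truth="$2" -v limit="${3:-20}" 'BEGIN {
        predicted = (score < 0) ? "spam" : "nonspam"
        if (predicted != truth)
            print "train"        # wrong classification: training is essential
        else if (score > -limit && score < limit)
            print "reinforce"    # right, but inside the reinforcement zone
        else
            print "none"         # confident and correct: nothing to do
    }'
}

# Example: training_action -5 spam
```

In both the "train" and "reinforce" cases the command sent to the filter is the same learn command; the distinction only explains why the training is done.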



As of version 2.0.2, there's a new feedback mechanism, based on an HTML form report sent to the user, which completely removes the nuisance of sending training commands in the subject line for each message. The training report shows a table with the messages in the cache dir with scores between -20 and +20, containing Date, From, Subject and a list of actions for each message, up to 50 messages per report. The user selects the proper action for each message and clicks the "Send Actions" button to generate a pre-formatted training message, ready to be sent.

The training report can be sent to the users at scheduled times, from a cron job. After 10 learn operations in each class, the training report suggests the proper training action for each message, based on its score, and colors each line of the table in red when the suggested action is "Train as Spam", or in blue for "Train as Non-spam". This makes the training process even easier because, for well-trained databases, most of the time the user has to do nothing more than click the "Send Actions" button. See the script cache_report.lua for details.





Credits

The OSBF-Lua lib and spamfilter.lua were designed and implemented by Fidelis Assis, who holds the primary copyright.

The OSB technique, as well as the OSBF classify and learn code, are based on the OSB and OSBF I originally developed for the CRM114 project, as a derivative work based on Bill Yerazunis' CRM114 Markovian classifier. Bill Yerazunis holds the secondary copyright on the OSBF-Lua lib.





Contact

For more information please email me. Comments are welcome!