Spam is a kind of messaging where the cost of sending is usually negligible, while the receiver and the ISP pay the cost in bandwidth usage.

An example of a manual approach to detecting spam is knowledge engineering. When you know what is spam and what is not, you can usually filter it by creating a set of rules such as:

If the subject line of an email contains the words ‘Buy viagra’, it's spam

Any email from a certain address or from a pattern of addresses is spam

A comment on a blog containing a link to a certain website is spam

These rules can be configured by the user or by the email provider, and if correctly thought out and executed, this technique can be used effectively to combat spam. This is a blog post about one such implementation. However, a manual rule-based approach doesn't scale, because active human spammers work around any fixed set of rules. Therefore a machine learning approach is necessary.
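The three example rules above could be sketched roughly as follows. This is only an illustration: the phrase list, the sender pattern, the blocked domain, and the function names are all made-up assumptions, not taken from any real filter.

```python
import re

# Hypothetical rule data, invented for illustration only.
SUBJECT_PHRASES = ["buy viagra"]
SENDER_PATTERN = re.compile(r".*@spam-domain\.example$", re.IGNORECASE)
BLOCKED_LINK_DOMAINS = ["bad-site.example"]


def is_spam_email(subject, sender):
    """Apply the subject-line rule and the sender-address rule."""
    subject_lower = subject.lower()
    # Rule 1: the subject contains a known spam phrase.
    if any(phrase in subject_lower for phrase in SUBJECT_PHRASES):
        return True
    # Rule 2: the sender matches a blocked address pattern.
    if SENDER_PATTERN.match(sender):
        return True
    return False


def is_spam_comment(comment):
    """Rule 3: the blog comment mentions a blocked website."""
    comment_lower = comment.lower()
    return any(domain in comment_lower for domain in BLOCKED_LINK_DOMAINS)
```

Every new spam trick needs another hand-written entry in these lists, which is exactly the scaling problem described above.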

Machine Learning

Machine learning based approaches don’t require you to specify rules explicitly. Instead, you need a decent amount of data pre-classified as spam and not spam, and you use specific algorithms to learn rules that classify the data.

An important problem in using machine learning algorithms is that most of them can only classify numerical objects such as vectors. To overcome this, we usually convert text data into vectors of numbers expressing certain features of the message. Note that the features you choose, more than the algorithm being used, determine the success and failure rate of the filter to a large extent.
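One common way to turn text into a vector is to count how often each known word appears in the message. The sketch below assumes a very simple tokenizer (lowercase runs of letters, digits, and apostrophes); the function names are hypothetical, and real filters typically use richer features.

```python
import re
from collections import Counter


def tokenize(text):
    """Split text into lowercase word tokens (a deliberately simple choice)."""
    return re.findall(r"[a-z0-9']+", text.lower())


def build_vocabulary(messages):
    """Map every distinct token in the training messages to a vector index."""
    tokens = sorted({tok for message in messages for tok in tokenize(message)})
    return {tok: index for index, tok in enumerate(tokens)}


def vectorize(text, vocabulary):
    """Return a count vector for text; tokens outside the vocabulary are ignored."""
    counts = Counter(tokenize(text))
    vector = [0] * len(vocabulary)
    for tok, count in counts.items():
        index = vocabulary.get(tok)
        if index is not None:
            vector[index] = count
    return vector
```

For example, with a vocabulary built from "buy cheap pills" and "meeting at noon", the message "buy buy pills today" becomes a six-element vector with a 2 in the position for "buy" and a 1 in the position for "pills"; "today" is dropped because it was never seen in training.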

Paul Graham articulated the need for a machine learning approach in his seminal essay "A Plan for Spam", a great essay in which he explains his spam filtering technique. The approach used Bayesian classifiers on bag-of-words features to classify the text. Naive Bayesian techniques correlate the tokens in an email with spam and ham mail and use Bayes' formula to calculate the probability that a received email is spam. The tokens can be anything: words in the mail, words in the headers and HTML of the mail, or phrases in the text. Gary Robinson further improved on Paul Graham's algorithm.
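To make the idea concrete, here is a minimal sketch of a textbook multinomial naive Bayes classifier with add-one smoothing. It is not Paul Graham's or Gary Robinson's exact scoring scheme; the tokenizer, the function names, and the "spam"/"ham" labels are assumptions chosen for illustration, and the training data must contain at least one message of each class.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Split text into lowercase word tokens (a deliberately simple choice)."""
    return re.findall(r"[a-z0-9']+", text.lower())


def train(messages):
    """messages is a list of (text, label) pairs, with label 'spam' or 'ham'."""
    token_counts = {"spam": Counter(), "ham": Counter()}
    doc_counts = Counter()
    for text, label in messages:
        doc_counts[label] += 1
        token_counts[label].update(tokenize(text))
    vocabulary = set(token_counts["spam"]) | set(token_counts["ham"])
    return token_counts, doc_counts, vocabulary


def spam_probability(text, model):
    """Estimate P(spam | text) with Bayes' formula and add-one smoothing."""
    token_counts, doc_counts, vocabulary = model
    total_docs = sum(doc_counts.values())
    log_scores = {}
    for label in ("spam", "ham"):
        # Start from the class prior, then add one log-likelihood per token.
        score = math.log(doc_counts[label] / total_docs)
        total_tokens = sum(token_counts[label].values())
        for tok in tokenize(text):
            if tok not in vocabulary:
                continue  # tokens never seen in training carry no evidence
            smoothed = (token_counts[label][tok] + 1) / (total_tokens + len(vocabulary))
            score += math.log(smoothed)
        log_scores[label] = score
    # Normalize in a numerically stable way to get a probability.
    top = max(log_scores.values())
    spam = math.exp(log_scores["spam"] - top)
    ham = math.exp(log_scores["ham"] - top)
    return spam / (spam + ham)
```

A message would then be flagged as spam when the returned probability exceeds some threshold, for instance 0.5 or a stricter value if false positives are costly.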

Another simple method is the k nearest neighbours classifier, where a text is classified as spam or not spam based on the majority vote of its k nearest neighbours. The algorithm requires pre-classified feature vectors, which are simply stored. To classify a new text document, its feature vector is extracted, its distance from every vector in the training set is calculated, and it is assigned to the class of the majority of its k nearest neighbours. If k=1 it is simply assigned to the class of the nearest neighbour, and if k=3 it is assigned to the majority class among its 3 nearest neighbours. Here is an example of how this algorithm can be used to check for spam, as outlined in this paper:

Given a message x, determine its k nearest neighbors among the messages in the training set. If there are more spams among these neighbors, classify given message as spam. Otherwise classify it as legitimate mail.
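The procedure quoted above can be sketched in a few lines. This version assumes the messages have already been converted to equal-length numeric feature vectors and uses Euclidean distance; the distance measure, the function names, and the "spam"/"ham" labels are illustrative assumptions rather than anything prescribed by the paper.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_classify(vector, training_set, k=3):
    """training_set is a list of (feature_vector, label) pairs.

    Returns 'spam' if spam messages outnumber the others among the k
    nearest neighbours, otherwise 'ham' (legitimate mail).
    """
    # Sort the stored vectors by distance to the new one and keep the closest k.
    neighbours = sorted(training_set, key=lambda item: euclidean(vector, item[0]))[:k]
    spam_votes = sum(1 for _, label in neighbours if label == "spam")
    return "spam" if spam_votes > len(neighbours) - spam_votes else "ham"
```

Note that nothing is learned ahead of time: all the work happens at classification time, so the cost of each prediction grows with the size of the training set.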

Research Papers on Spam Detection

Tools that Use Machine Learning Capabilities for Spam Detection: