How I built an effective blog comment spam blocker

Mention comment spam and most people, in particular those crazy WordPress users, mention Akismet. It's a great tool and I have nothing against it, but I wanted to build my own and avoid the external call to the Akismet service. What has been interesting to see is just how effective my own filter is. Turns out, my spammers are quite obvious.

As you might see, I don't use CAPTCHAs and I don't use JavaScript detection. I just use a number of rules that validate each comment on the server. Oh, and I don't use nofollow.

Points System

I use a points system, an idea I got from Movable Type, whose spam protection is also based on points. For everything in a comment that I like, you get a point. For everything I don't like, you lose a point (or two, or three). If you get a 1 or higher, you've made it on the site as a valid comment. If you get a 0, it's set for moderation and I'll take a look at it. If it's below 0, it's marked as spam and I'll never see it (although I check every couple of weeks just in case a legitimate comment needs to be unflagged). If it falls below -10, I don't even bother saving it to the database since it is so obviously spam.
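As a rough illustration, the thresholds above could be expressed as a small function. This is only a sketch in Python; the function name and the status labels are made up for the example and are not taken from the actual implementation.

```python
def classify(score: int) -> str:
    """Map a comment's point total to an action.

    The status labels are hypothetical; only the thresholds come from the post.
    """
    if score >= 1:
        return "approved"   # shown on the site straight away
    if score == 0:
        return "moderate"   # held for a manual look
    if score >= -10:
        return "spam"       # saved, but flagged as spam
    return "discard"        # below -10: not even saved to the database
```

A comment that totals exactly -10 is still saved as spam here, since only scores that fall below -10 are thrown away.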

Types of Spam

There are two main types of spam: automated and manual.

Automated spam is the most obvious. Automated spammers try to pull a number of tricks, and their work stands out when you see the same message posted a dozen times within seconds of each other. Automated spam is also the easiest to catch. It is so insanely simple that just a few rules would catch about 95% of all comment spam hitting a server. (That percentage may even be higher... I'm just guessing.)

Manual spam, on the other hand, is more devious. People actually try and respond to the article at hand, which makes it slightly harder to catch. I say slightly because the vast majority of manual spammers do such a poor job at leaving a comment that they stand out like a sore thumb. The remaining few are usually the ones you end up filtering by hand.

Quick Solution

The quickest way to reduce the amount of comment spam you get, one that doesn't require any server-side programming and is built into almost all blogging tools, is to simply turn off the comments on a post after a certain amount of time. It works quite well and here are the two major reasons why:

Automated spammers have a database of pages to which they try to submit. If the form is no longer there then you don't get spam. Spammers are forced to discover new pages to spam.

Manual spam often tries to hit pages that have higher page ranks. There are plenty of search engine tools to help people look this information up. (I'd actually see referrers from these search tools, followed shortly by a new blog comment.) Higher page ranks will happen on older and popular posts. By shutting down the comment form, manual spammers are left to target newer pages in the hopes of getting missed until the page gets a higher ranking.

I've had old posts where I left the comments open for years and would still see users come across them and add to the discussion in meaningful ways. I loved that. However, that almost never happens now. So, I finally gave in and now just close comments.

The Rules

In a blog comment, there are 5 fields and I test each one separately and in various combinations for various rules. The fields are: body, email, author name, url, and ip.

Here now are my rules for filtering blog comments.

How many links are in the body
    More than 2: -1 point per link
    Less than 2: +2 points

How long is the body
    More than 20 characters and there's no links: +2 points
    Less than 20 characters: -1 point

Number of previous comments from email
    Approved comments: +1 point per
    Marked as spam: -1 point per

Keyword search
    Levitra, viagra, casino, etc.: -1 point per

URLs that have certain words or characters in them
    .html, .info, ?, & or free: -1 point per

URLs that have certain TLDs
    .de, .pl, or .cn (sorry guys): -1 point

URL length
    More than 30 characters: -1 point

Body starts with...
    Interesting, Sorry, Nice or Cool: -10 points

Author name has http:// in it
    -2 points per

Body used in previous comment
    -1 point per

Random character match
    5 consonants: -1 point per
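To give a feel for how a few of these rules might translate into code, here is a minimal Python sketch covering the link count, body length, keyword, author name and comment history rules. It is not the actual implementation: the function names are invented, a "link" is assumed to be anything starting with http:// or https://, the keyword list only contains the three examples given above, and the history counts are assumed to come from database lookups that aren't shown.

```python
import re

# Assumption: a link is counted as any occurrence of http:// or https://.
LINK_RE = re.compile(r"https?://", re.IGNORECASE)

# Only the examples from the rules above; the full keyword list is not given.
KEYWORDS = ("levitra", "viagra", "casino")


def score_content(body: str, author: str) -> int:
    """Score the body and author name (hypothetical helper)."""
    points = 0

    links = len(LINK_RE.findall(body))
    if links > 2:
        points -= links          # -1 point per link
    elif links < 2:
        points += 2              # exactly 2 links scores nothing either way

    if len(body) > 20 and links == 0:
        points += 2
    elif len(body) < 20:
        points -= 1

    lowered = body.lower()
    points -= sum(1 for word in KEYWORDS if word in lowered)

    # -2 points per http:// found in the author name
    points -= 2 * author.lower().count("http://")
    return points


def score_history(approved: int, spam: int, duplicates: int) -> int:
    """Score past behaviour for this email.

    The three counts are assumed to be looked up elsewhere: previously
    approved comments, comments marked as spam, and earlier comments
    with an identical body.
    """
    return approved - spam - duplicates
```

The individual scores are simply added together, and the total is what gets compared against the thresholds in the points system.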

Once you have a database of spam messages, you can observe certain patterns. In checking some information from time to time, I discovered some interesting stats:

Body length

Write something of consequence. If it's less than 20 characters, you obviously don't have much to say.

URL matches

Most people who include a URL usually have a top level domain or a subdomain that they use. They're not using querystring parameters or any other crazy URL structures. And I'm sorry to all the German, Polish and Chinese readers, but a few of your fellow countrymen aren't being very nice.

URL Length

URLs that are longer than 30 characters are almost always spam. This ties in with the last filter. If you've got a URL, it's short, sweet and sexy. It's not crazy long — although I have seen some crazy long, perfectly legitimate URLs.
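The URL checks could look something like the following Python sketch. The function name is hypothetical, the substring test is deliberately crude (it is an assumption about how "certain words or characters" gets matched), and the TLD is read from the hostname using the standard library's urlsplit.

```python
from urllib.parse import urlsplit

SUSPICIOUS_PARTS = (".html", ".info", "?", "&", "free")
SUSPICIOUS_TLDS = (".de", ".pl", ".cn")


def score_url(url: str) -> int:
    """Score the URL field of a comment (hypothetical helper)."""
    if not url:
        return 0

    points = 0
    lowered = url.lower()

    # -1 point for each suspicious word or character that appears
    points -= sum(1 for part in SUSPICIOUS_PARTS if part in lowered)

    # -1 point if the hostname ends in one of the flagged TLDs
    host = urlsplit(lowered).hostname or ""
    if host.endswith(SUSPICIOUS_TLDS):
        points -= 1

    # -1 point for URLs longer than 30 characters
    if len(url) > 30:
        points -= 1
    return points
```

A plain address such as http://example.com scores 0, while something like http://free-pills.info/buy.html?id=1&ref=2 trips every substring check plus the length check.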

Body matches

It may seem like I'm being overly severe on people who start their comments with words like Interesting, Sorry, Nice or Cool, but it's a very specific pattern that I'm matching. I was getting 10 to 20 hits of the same message coming in. It was just easier to match the messages and essentially ban them.
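The exact pattern isn't spelled out here, so the following Python sketch falls back on a plain prefix test, which is broader than the specific match described above. The function name and the case-insensitive comparison are assumptions.

```python
BANNED_OPENERS = ("interesting", "sorry", "nice", "cool")


def score_opening(body: str) -> int:
    """Return -10 if the body starts with a banned opener (hypothetical helper)."""
    start = body.lstrip().lower()
    return -10 if start.startswith(BANNED_OPENERS) else 0
```

A -10 on its own is enough to push an otherwise neutral comment firmly into spam territory, which is why it works as an effective ban.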

Random character matches

The other thing I noticed was email addresses or author names that were just a random string of characters. If there are no vowels, sure, you might be Polish, but it's more likely that you're spam. Rarely do even the Polish have 5 consonants in a row!
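One way to sketch the consonant check in Python is a regular expression that looks for runs of 5 consonants. The names are hypothetical, and treating "y" as a vowel is an assumption made here to cut down on false positives.

```python
import re

# Five consecutive consonants; "y" is treated as a vowel (an assumption).
CONSONANT_RUN = re.compile(r"[bcdfghjklmnpqrstvwxz]{5}", re.IGNORECASE)


def score_random_chars(*fields: str) -> int:
    """Subtract 1 point per run of 5 consonants found in the given fields."""
    return -sum(len(CONSONANT_RUN.findall(field)) for field in fields)
```

It would typically be called with the email address and author name, for example score_random_chars(email, author).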

Effective?

How effective has it been? These days, I only see a new spam message get through maybe once every week or two. It's usually a message that somebody has hand-typed to be relevant to the page, but the comment is near useless and the author name is quite evidently spam.

I've also reworded the disclaimer text under the submit box to let people know that I'm actively on the lookout for spam and that even legitimate comments will get edited or marked as spam if their authors plan to abuse the system. This lets those people know (like those who like to leave signatures on a blog comment or who like to use their company name as their author name) that being underhanded will not be rewarded.

Despite my past frustration with spam, things are at a point now where I'm happy to leave comments open on recent posts for a couple weeks and then just close them up and never have to worry about them again. It certainly isn't the death of comments I thought it might need to come to.