I've been running blogfile for a fair while now. I've got over 20 blog posts here, some of which have gotten (and continue to get) reasonable amounts of attention. The result is that I've also had several thousand comments.

If someone were to go through the posts though, they may be tempted to call me a liar. "I can only count about 100 comments!" they might say. "Several thousand sounds like a massive stretch!"

Well, it's not. But luckily for you, you can't actually see most of the comments. That's because they're spam.

"You must be very vigilant, quickly flagging or deleting these spam comments, Sam!"

Nope. I'm pretty lazy. I prefer to just let the spam comments catch themselves. I also know that other people are pretty lazy, and I don't want to discourage discussion by forcing users to fill in stupid captchas.

"So you have to analyse the text and figure out what is and isn't spam? That's a road that leads to false positives, and general confusion."

Yes, it is, which is why I don't do that. What I use instead is a captcha-less honey pot. It works on a couple of basic assumptions:

1. Spammers are running simple HTML-scraping scripts to find comment fields.

2. Once the form fields are captured, spammers will often make posts without using the actual loaded form.

3. Regular users don't pay attention to field names, or the underlying HTML content.

The simple rundown is this. A bot scrapes the HTML and comes to the comment form. The first fields it sees are 'name', 'email', and 'url'. Astute observers will notice at this point that I don't actually ask for an email address to leave a comment. In fact, I don't ask regular users for any of these fields. They're not even visible to you. They're hidden with CSS, though not with visibility: hidden; or display: none;, but by using/abusing overflow rules. Any content posted to these fields automatically tells me that whoever posted it was looking at the HTML, not the browser.
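Here's a minimal sketch of what that kind of check might look like. It assumes the submitted form arrives as a plain dict, and everything apart from the three field names is hypothetical: the function name, the markup, and the particular overflow trick are just one plausible way to do it, not necessarily what runs on this blog.

```python
# Decoy field names taken from the post; real visitors never see these inputs.
HONEYPOT_FIELDS = ("name", "email", "url")

# Hypothetical markup: a zero-height wrapper with overflow hidden keeps the
# decoys out of sight without using display: none or visibility: hidden.
HONEYPOT_MARKUP = """
<div style="height: 0; overflow: hidden;">
  <input type="text" name="name" tabindex="-1" autocomplete="off">
  <input type="text" name="email" tabindex="-1" autocomplete="off">
  <input type="text" name="url" tabindex="-1" autocomplete="off">
</div>
"""


def tripped_honeypot(form):
    """Return True if any decoy field came back with non-blank content."""
    return any(form.get(field, "").strip() for field in HONEYPOT_FIELDS)
```

A submission that fills in any decoy gets flagged, and one that leaves them all blank (or omits them) passes on to the next check.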

Next, I pre-fill a couple of other hidden 'verification' fields which capture a unique value that is generated for the user. If I can't find these in the post, or can't re-generate the same value based on user details that I've received in the post, then the user probably didn't visit the actual page where they supposedly posted the comment from.
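One common way to build a value like that is a keyed hash (HMAC) over details the server can see again when the comment is posted. The sketch below assumes those details are the post ID, the visitor's IP address, and their user agent, and that the hidden field is called `verify_token`. All of those names, and the secret key, are assumptions for illustration rather than the actual scheme used here.

```python
import hashlib
import hmac

# Hypothetical server-side secret; it must never appear in the page itself.
SECRET_KEY = b"replace-with-a-real-secret"


def make_token(post_id, ip_address, user_agent):
    """Derive the value to pre-fill into the hidden verification field."""
    message = "|".join((post_id, ip_address, user_agent)).encode("utf-8")
    return hmac.new(SECRET_KEY, message, hashlib.sha256).hexdigest()


def token_is_valid(form, post_id, ip_address, user_agent):
    """Re-generate the token from the request details and compare."""
    submitted = form.get("verify_token", "")
    expected = make_token(post_id, ip_address, user_agent)
    # Compare as bytes so unexpected non-ASCII input cannot raise an error.
    return hmac.compare_digest(submitted.encode("utf-8"),
                               expected.encode("utf-8"))
```

A script that posts straight to the form handler without loading the page has no token to send, and a token lifted from one page won't validate against a different post or a different client.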

Finally, I keep a (hashed) record of spammers, and check against it. This is used as an internal "karma" measure to catch repeat offenders.
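The post doesn't say what gets hashed or how the "karma" is scored, so treat the following as a guess at the general shape. It assumes the identifier is something like an IP address, that strikes are kept in memory, and that three strikes marks a repeat offender. The salt, the threshold, and every function name are hypothetical.

```python
import hashlib
from collections import Counter

# Hypothetical values; a real setup would load the salt from configuration.
SALT = b"replace-with-a-real-salt"
KARMA_THRESHOLD = 3

# Maps a hashed identifier to the number of spam attempts seen from it.
spam_strikes = Counter()


def fingerprint(identifier):
    """Hash the identifier so the raw value is never stored."""
    return hashlib.sha256(SALT + identifier.encode("utf-8")).hexdigest()


def record_spam(identifier):
    """Add one strike against whoever sent a caught spam comment."""
    spam_strikes[fingerprint(identifier)] += 1


def is_repeat_offender(identifier):
    """Return True once the identifier has reached the strike threshold."""
    return spam_strikes[fingerprint(identifier)] >= KARMA_THRESHOLD
```

Each submission caught by the earlier checks would call `record_spam`, and later submissions from the same source could be rejected up front once `is_repeat_offender` returns True.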

So how well does it work? Very well. I've not yet had a single spam comment get through the filters, nor a single false positive.

Is this long-term viable? That depends on a few factors. If my blog eventually gets to a level of popularity where people would specifically write scripts to target my comment fields, then no. It probably wouldn't work any more. That's not to say that all is lost, because I have plenty of other tricks up my sleeve, and if it's that popular then I can afford to spend some time on refining my system.

What about accessibility? To be honest, this plan may not fully work with screen readers, which can still announce fields that are only hidden visually, but I'm sure that there are some pretty simple changes that I could make to fix that. I don't think that I get much traffic from people with screen readers, though, so for the moment the time and cost of making the adaptations isn't worth the payoff.

So that's how I've been protecting my blog from spam. It's worked very well, and I think it will continue to work for a fair while into the future. When it stops working, then some trivial changes should make it effective again.