SpamAssassin is back

The SpamAssassin 3.4.2 release was the first from that project in well over three years. At the 2018 Open Source Summit Europe, Giovanni Bechis talked about that release and those that will be coming in the near future. It would seem that, after an extended period of quiet, the SpamAssassin project is back and has rededicated itself to the task of keeping junk out of our inboxes.

Bechis started by noting that spam filtering is hard because everybody's spam is different. It varies depending on which languages you speak, what your personal interests are, which social networks you use, and so on. People vary, so results vary; he knows a lot of Gmail users who say that its spam filtering works well, but his Gmail account is full of spam. Since Google knows little about him, it is unable to train itself to properly filter his mail.

Just like Gmail, SpamAssassin isn't the perfect filter for everybody right out of the box; it's really a framework that can be used to create that filter. Getting the best out of it can involve spending some time to write rules, for example. Most of the current rule base is aimed at English-language spam, which isn't helpful for people whose spam comes in other languages. Another useful thing to do is to participate in the MassCheck project, which can quickly evaluate the effectiveness of new rules on a large body of spam. In particular, MassCheck performs a nightly run to check the hit rate of rules to determine how those rules are performing in real installations. It can also check for overlap; if two rules always trigger on the same messages, there isn't really a need for both of them. This information feeds into the RuleQA database to give a picture of how the rules are working overall.
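The overlap check described above can be illustrated with a small sketch. This is not MassCheck's actual implementation; the function name, the rule names, and the use of Jaccard similarity as the overlap measure are all assumptions made for illustration.

```python
from itertools import combinations


def rule_overlaps(hits, threshold=1.0):
    """Return pairs of rules whose hit sets overlap at or above threshold.

    hits maps a rule name to the set of message IDs that rule fired on.
    Overlap is measured as Jaccard similarity: |A & B| / |A | B|.
    """
    redundant = []
    for (name_a, set_a), (name_b, set_b) in combinations(sorted(hits.items()), 2):
        union = set_a | set_b
        if not union:
            continue  # neither rule ever fired; nothing to compare
        similarity = len(set_a & set_b) / len(union)
        if similarity >= threshold:
            redundant.append((name_a, name_b, similarity))
    return redundant


# Hypothetical rule names and message IDs, for illustration only.
hits = {
    "RULE_A": {1, 2, 3},
    "RULE_B": {1, 2, 3},
    "RULE_C": {3, 4},
}
print(rule_overlaps(hits))
```

With the default threshold of 1.0, only rules that fire on exactly the same messages are reported; a lower threshold would also flag rules that are merely close to redundant.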

SpamAssassin is not just for email filtering, Bechis said; some sites are using it to detect spam submitted in web forms, for example.

So what is new in SpamAssassin? There has been a lot of work by the project's system administration team, he said, to update the infrastructure. That has resulted in the rebuilding of the MassCheck implementation from scratch. The 3.4.2 release contained fixes for four security bugs, and also an important workaround for a Perl bug that was only triggered on Red-Hat-based distributions. Startup time has been improved, and SSLv3 support has been removed. The "freemail antiforge" mechanism, which seeks to detect forged Gmail messages, has been improved. The geo-aware scoring system can adjust scores based on which continent the mail came from. The URILocalBL plugin, which can blacklist URLs based on information like where they are hosted, has seen a number of improvements.

3.4.2 also saw the addition of the HashBL plugin, which can be used to block email addresses from domains that cannot be blocked wholesale. There is a new anti-phishing plugin that can filter on URLs commonly found in phishing emails. The new ResourceLimits plugin can put limits on the amount of CPU and memory used by SpamAssassin. And the FromNameSpoof plugin tries to detect attempts to confuse users about the source of an email using the full-name field.
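To make the full-name trick concrete, here is a minimal sketch of one kind of check in that spirit. It is not how FromNameSpoof is implemented; the function name, the regular expression, and the "different domain" heuristic are assumptions chosen for illustration.

```python
import re
from email.utils import parseaddr

# Loose pattern for something that looks like an email address (an assumption,
# not a full RFC 5322 parser).
ADDRESS_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def name_spoofs_address(from_header):
    """Return True if the display name shows an address whose domain
    differs from the domain of the real sender address."""
    display_name, real_address = parseaddr(from_header)
    match = ADDRESS_RE.search(display_name)
    if not match or "@" not in real_address:
        return False
    shown_domain = match.group(0).rsplit("@", 1)[1].lower()
    real_domain = real_address.rsplit("@", 1)[1].lower()
    return shown_domain != real_domain


# Hypothetical headers, for illustration only.
print(name_spoofs_address('"support@bank.example" <attacker@evil.example>'))
print(name_spoofs_address("Alice Example <alice@example.org>"))
```

Many mail clients show only the display name, so a header like the first one can look as if it came from the bank's address even though the real sender is elsewhere.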

Some future plugins include a couple that are aimed at detecting Microsoft Office attachments containing macros. There is one for checking URLs from URL-shortening services; it will filter based on the final destination of those URLs. The KAM.cf ruleset is an unofficial addition that can allow sites to respond more quickly to new spam campaigns, but at a cost of more false positive results. Also coming is a set of international channels that will carry signed rulesets designed for different parts of the planet.
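The idea of filtering on the final destination of a shortened URL can be sketched as follows. This is only an illustration under stated assumptions, not the plugin's code: the function names, the `next_hop` callable (which stands in for a real HTTP redirect lookup), the hop limit, and all hostnames are hypothetical.

```python
from urllib.parse import urlsplit


def final_destination(url, next_hop, max_hops=5):
    """Follow redirects reported by next_hop until none remain,
    a loop is detected, or max_hops is reached."""
    seen = {url}
    for _ in range(max_hops):
        target = next_hop(url)
        if target is None or target in seen:
            break  # no further redirect, or a redirect loop
        seen.add(target)
        url = target
    return url


def is_blocked(url, next_hop, blocked_hosts):
    """Check the host of the final destination against a blocklist."""
    host = urlsplit(final_destination(url, next_hop)).hostname
    return host in blocked_hosts


# Hypothetical redirect table standing in for real network lookups.
redirects = {
    "https://short.example/abc": "https://short.example/def",
    "https://short.example/def": "https://bad.example/landing",
}
print(is_blocked("https://short.example/abc", redirects.get, {"bad.example"}))
```

Passing the lookup in as a callable keeps the sketch testable without network access; a real filter would also need timeouts and some limit on how many URLs it resolves per message.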

The SpamAssassin 4.0 release can be expected around January, Bechis said. It will include full UTF-8 support that has been completely rewritten, with better detection of east-Asian languages. The TxRep plugin, which applies scores to messages depending on the reputation of the sender, is being improved and will be able to use PostgreSQL 10. The Office macro and URL shortener plugins will be in this release, but another new plugin to check for suspicious URLs inside attachments will have to wait until 4.1.

Further in the future, the project plans to update its approach to machine learning. The current code is getting old, and there is interest in applying deep-learning techniques to the spam-detection problem. There was a Google Summer of Code project that attempted to make progress in that area, but it didn't succeed, so more work is needed.

When asked about whether the SpamAssassin project had really slowed down as much as its release history suggests, Bechis conceded that it had. A number of people had left the project, and there were infrastructure problems that blocked the rule-generation process. But the situation has since improved, he said. The project has picked up a new set of developers and is moving forward again. Certainly the world can only benefit from better spam filtering.

The slides from this talk [PDF] are available.

[Thanks to the Linux Foundation, LWN's travel sponsor, for supporting my travel to the event.]

