This is an impressive-looking AUC!

But the AUC of a classifier is only a very generic measure of performance. For a specific problem like spam filtering, we're better off using a performance metric that truly matches our intuition about what a good spam filter ought to be. Namely, a good spam filter should almost never flag a legitimate email as spam, while keeping your inbox as spam-free as possible. This requirement is what should guide the choice of the classifier's threshold, and then the measurement of its performance.

So instead of the AUC (which doesn't pick a specific threshold but aggregates over all of them), let's use as our performance metric the best $F_{0.05}$ score over all thresholds. With $\beta = 0.05$, the $F_\beta$ score gives $1/\beta = 20$ times more importance to precision than to recall. In other words, this metric represents the fact that classifying as spam only what is really spam is 20 times more important than finding all the spam.
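To make the metric concrete: for precision $P$ and recall $R$, the $F_\beta$ score is $F_\beta = (1 + \beta^2)\,\frac{P R}{\beta^2 P + R}$, and "best" means the maximum of that value over all candidate thresholds. Here is a minimal pure-Python sketch of that search. The function name `best_f_beta` and the toy labels and scores are made up for illustration, and the sketch assumes that label 1 means spam and that an email is flagged when its score is at or above the threshold.

```python
def best_f_beta(y_true, scores, beta=0.05):
    """Return (best F-beta score, threshold achieving it).

    Assumes label 1 = spam, and that an email is flagged as spam
    when its score is >= threshold (illustrative convention).
    """
    b2 = beta ** 2
    n_spam = sum(y_true)
    # Sort by decreasing score: each prefix of this list is the set
    # of emails flagged as spam at some threshold.
    ranked = sorted(zip(scores, y_true), key=lambda pair: -pair[0])
    best_score, best_threshold = 0.0, None
    tp = fp = 0
    for i, (score, label) in enumerate(ranked):
        if label == 1:
            tp += 1
        else:
            fp += 1
        # Wait until all emails tied at this score have been counted.
        if i + 1 < len(ranked) and ranked[i + 1][0] == score:
            continue
        if tp == 0:
            continue  # precision and recall are both 0, so F-beta is 0
        precision = tp / (tp + fp)
        recall = tp / n_spam
        f = (1 + b2) * precision * recall / (b2 * precision + recall)
        if f > best_score:
            best_score, best_threshold = f, score
    return best_score, best_threshold


# Toy data, invented for illustration: 3 spam and 3 legitimate emails.
y_true = [1, 1, 0, 1, 0, 0]
scores = [0.95, 0.9, 0.8, 0.6, 0.3, 0.1]

best, threshold = best_f_beta(y_true, scores)
print(f"best F_0.05 = {best:.4f} at threshold {threshold}")
```

On this toy data the best threshold is 0.9: it catches only two of the three spam emails, but flags no legitimate email. Lowering the threshold to 0.6 would catch all the spam at the cost of one false positive, and the $F_{0.05}$ score drops sharply, which is exactly the behavior we want from the metric.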

Let's see how we are doing with that metric.