Intuitively, stemming should improve the performance of a classifier when the number of training examples is small. For instance, if "cat" is a relevant word for a category, "cats" probably is too, and we'll collect better statistics for word occurrence if we group them together. On the other hand, this assumption might not always hold: it could be that "CAT" bulldozers are what's relevant. Also, most stemmers do some violence to the text by committing occasional errors.
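To make the idea concrete, here's a deliberately crude stemming sketch (real stemmers like Porter's use far more rules; the function name and suffix list are my own illustration):

```python
def crude_stem(word):
    """Strip a few common English plural suffixes.

    Just enough to show "cat" and "cats" collapsing to one feature.
    Note it also illustrates the failure mode: lowercasing first
    would conflate "CAT" the bulldozer brand with "cat" the animal.
    """
    for suffix in ("ies", "es", "s"):
        # Require a reasonably long remaining stem so short words
        # like "is" or "gas" aren't mangled.
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word
```

With this, `crude_stem("cats")` and `crude_stem("cat")` both yield `"cat"`, so their occurrence statistics pool into a single count.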

Another possible improvement would be to ignore certain words. For instance, there is little value in words that occur only once or a few times, since whether they show up in relevant or irrelevant documents is largely a matter of chance. Words that occur very often (say "the") are also unlikely to be predictive and are frequently treated as stop words.
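A sketch of that vocabulary pruning, with an illustrative frequency threshold and stop list (both would need tuning in practice):

```python
from collections import Counter

def filter_vocab(docs, min_count=2, stop_words=frozenset({"the", "a", "of"})):
    """Keep only words worth using as features.

    docs is a list of tokenized documents. Words seen fewer than
    min_count times across the corpus are dropped as unreliable;
    words in the stop list are dropped as unlikely to be predictive.
    """
    counts = Counter(word for doc in docs for word in doc)
    return {w for w, c in counts.items() if c >= min_count and w not in stop_words}
```

For example, over the corpus `[["the", "cat", "sat"], ["the", "cat", "ran"]]`, only `"cat"` survives: "sat" and "ran" are too rare, and "the" is a stop word.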

Some ML algorithms do poorly when confronted with a very large number of features; others fare better. Logistic regression, the algorithm I use, holds up pretty well, particularly if what we want is predictions rather than insights.
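For the curious, logistic regression is simple enough to sketch from scratch. This is a minimal stochastic-gradient version over dense feature vectors, not what I'd use in production (real implementations add regularization, which matters precisely when features are numerous):

```python
import math

def train_logreg(X, y, lr=0.5, epochs=200):
    """Fit logistic regression by per-example gradient descent.

    X is a list of feature vectors, y a list of 0/1 labels.
    Returns the learned weights and bias.
    """
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            z = b + sum(wj * xj for wj, xj in zip(w, xi))
            p = 1.0 / (1.0 + math.exp(-z))  # predicted probability
            err = p - yi                    # gradient of log loss w.r.t. z
            b -= lr * err
            w = [wj - lr * err * xj for wj, xj in zip(w, xi)]
    return w, b

def predict(w, b, x):
    """Probability that x belongs to the positive class."""
    z = b + sum(wj * xj for wj, xj in zip(w, x))
    return 1.0 / (1.0 + math.exp(-z))
```

On a toy separable dataset, e.g. `X = [[1, 0], [1, 0], [0, 1], [0, 1]]` with `y = [1, 1, 0, 0]`, the trained model assigns probability above 0.5 to `[1, 0]` and below 0.5 to `[0, 1]`.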

Sticking with the bag-of-words model, I could also weight words based on how often they occur in the title (so that "The Global Fund to Fight AIDS, TB and Malaria: A Response to Global Threats, a Part of a Global Future" would count the word "Global" three times). I could also weight words less if they occur frequently in the corpus, as well as normalize the length of the title vector. Put those together and I'd get a tf-idf vector.
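Those three ingredients combine like this (tf-idf formulas vary; the smoothed idf below is one common variant, and the function name is mine):

```python
import math
from collections import Counter

def tfidf_vector(title_words, corpus):
    """Build a length-normalized tf-idf vector for a tokenized title.

    Term frequency comes from the title itself; each word is then
    weighted down by how many documents in the corpus contain it
    (idf), and the whole vector is scaled to unit length.
    """
    n_docs = len(corpus)
    tf = Counter(title_words)  # "global" would count 3 in the example title
    vec = {}
    for word, count in tf.items():
        df = sum(1 for doc in corpus if word in doc)
        idf = math.log((1 + n_docs) / (1 + df)) + 1  # smoothed idf
        vec[word] = count * idf
    norm = math.sqrt(sum(v * v for v in vec.values())) or 1.0
    return {w: v / norm for w, v in vec.items()}
```

A word that appears three times in the title and rarely in the corpus ends up with a much larger component than a once-occurring corpus-wide word, which is exactly the intended weighting.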

The case for the tf-idf vector would be stronger if I were using whole documents, but my experience and reading of the literature suggest that these choices don't usually make a big difference in classifier performance.