A significant fraction of the data currently being generated and stored is unstructured text. Researchers use topic-modeling algorithms to automatically classify text documents into topics in text recommendation systems, digital image analysis, spam filtering, and high-dimensional data mining, with the goal of recording and extracting relevant information. To enable intelligent data searches, we study synthetic and real data whose topics are known and evaluate the performance of state-of-the-art algorithms. We show that current optimization techniques often fail to yield accurate and reproducible results, particularly when the topics are heterogeneously distributed. By borrowing methods from graph clustering, we propose a novel optimization method that achieves high accuracy and reproducibility with no added computational overhead.

One state-of-the-art algorithm for classifying text-based data is latent Dirichlet allocation, a means of assigning topics to documents. We measure its performance on synthetic data with known topics and find that it is unable to recover the ground-truth topics because of the roughness of the likelihood landscape. We propose a novel technique that builds a network of co-occurring words and identifies topics starting from clusters of words in that network. We show that our method is more accurate, both on synthetic data and on a real-world corpus of documents from the journal Science.
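As an illustration only, the co-occurrence-network idea can be sketched as follows. This is a minimal toy version, not the paper's actual method: the function names (`cooccurrence_graph`, `connected_components`, `assign_topics`) are hypothetical, and connected components stand in for a proper graph-clustering (community-detection) step, which the real approach would use.

```python
from collections import defaultdict
from itertools import combinations

def cooccurrence_graph(docs, min_count=1):
    # Count pairs of distinct words that co-occur in the same document.
    edges = defaultdict(int)
    for doc in docs:
        words = sorted(set(doc.split()))
        for a, b in combinations(words, 2):
            edges[(a, b)] += 1
    return {e: c for e, c in edges.items() if c >= min_count}

def connected_components(edges):
    # Union-find over the word graph; each component is a candidate topic.
    # (A real method would apply community detection instead.)
    parent = {}
    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x
    for a, b in edges:
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[ra] = rb
    comps = defaultdict(set)
    for w in parent:
        comps[find(w)].add(w)
    return list(comps.values())

def assign_topics(docs, topics):
    # Label each document with the word cluster it shares the most words with.
    labels = []
    for doc in docs:
        words = set(doc.split())
        labels.append(max(range(len(topics)),
                          key=lambda i: len(words & topics[i])))
    return labels
```

On a toy corpus such as `["cat dog pet", "dog pet animal", "stock market trade", "market trade price"]`, the word graph splits into two components (an "animals" cluster and a "finance" cluster), and documents are labeled accordingly.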

We expect that our results will yield new ways of automatically classifying text documents, a process that is particularly relevant given the growth of electronic, searchable data.