Jun. 12, 2015

Summer vacations! The wonderful time when you realize that while you have been teaching classes and grading scripts, science has been moving on and now you must scramble to catch up.

Not so long ago you would go to the library and browse through the journals to quickly see what you had missed out on. Doing the digital equivalent of this by visiting journal websites and looking at tables of contents takes too many clicks for my taste. Thankfully the RePEc project provides bibliographic data for Economics working papers and journal articles in machine-readable format. Add to that some pre-packaged open-source machine learning software and you can do what was not possible in the stack-walking days: cluster articles according to topic.

I ran the standard Latent Dirichlet Allocation clustering algorithm on the titles and abstracts of articles to group them into twenty topics. Based on my interests I chose articles published in 2014 and 2015 in the leading general interest economics journals and the leading field journals in macroeconomics and game theory.
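To give a feel for what the algorithm does under the hood, here is a minimal sketch of Latent Dirichlet Allocation fitted by collapsed Gibbs sampling in plain Python. It is not the code I ran (that used a pre-packaged library), and the toy documents, the function names `lda_gibbs` and `top_words`, and the values of `alpha`, `beta` and `n_iter` are all assumptions made up for illustration.

```python
import random


def lda_gibbs(docs, n_topics, n_iter=200, alpha=0.1, beta=0.01, seed=0):
    """Fit LDA by collapsed Gibbs sampling; `docs` is a list of token lists."""
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    n_words = len(vocab)
    topics = list(range(n_topics))

    doc_topic = [[0] * n_topics for _ in docs]             # tokens of doc d in topic k
    topic_word = [[0] * n_words for _ in range(n_topics)]  # word v assigned to topic k
    topic_total = [0] * n_topics                           # all tokens in topic k
    assignments = []

    # Start from a random topic for every token.
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            k = rng.randrange(n_topics)
            zs.append(k)
            doc_topic[d][k] += 1
            topic_word[k][word_id[w]] += 1
            topic_total[k] += 1
        assignments.append(zs)

    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                v = word_id[w]
                k = assignments[d][i]
                # Remove the token from the counts ...
                doc_topic[d][k] -= 1
                topic_word[k][v] -= 1
                topic_total[k] -= 1
                # ... and resample its topic given all the other tokens.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][v] + beta)
                    / (topic_total[t] + n_words * beta)
                    for t in topics
                ]
                k = rng.choices(topics, weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][v] += 1
                topic_total[k] += 1

    return vocab, doc_topic, topic_word


def top_words(vocab, topic_word, n=5):
    """Return the n most frequently assigned words of each topic."""
    return [
        [vocab[v] for v in sorted(range(len(vocab)), key=lambda v: -row[v])[:n]]
        for row in topic_word
    ]


# Made-up, already-stemmed toy documents (not real RePEc records).
docs = [
    "T:monetari T:polici inflat rate central bank polici".split(),
    "T:inflat T:target central bank rate monetari".split(),
    "T:repeat T:game equilibrium player strategi payoff".split(),
    "T:equilibrium T:select game player payoff strategi".split(),
]
vocab, doc_topic, topic_word = lda_gibbs(docs, n_topics=2)
for k, words in enumerate(top_words(vocab, topic_word)):
    print(k, words)
```

Each row of `doc_topic` says how many of a document's tokens ended up in each topic, so the largest entry gives the cluster the article would be filed under. On a corpus of real size you would want a library implementation rather than this loop.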

The clustering code is on GitHub and here are the clusters for your enjoyment. Following the link are the most probable words for the cluster. The prefix T: indicates the occurrence of the word in the title. The words have been stemmed so that for example ‘calibration’ and ‘calibrated’ have both been reduced to ‘calibr’.
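The feature construction just described can be sketched as follows. This is an illustration under assumptions, not the code in the repository: `crude_stem` is a hypothetical suffix-stripper standing in for a real stemmer such as Porter's, and `features` is a made-up helper name.

```python
import re

# Suffixes tried longest-first; a crude stand-in for a real stemming algorithm.
SUFFIXES = ("ations", "ation", "ating", "ated", "ate", "ing", "ed", "s")


def crude_stem(word):
    """Strip the first matching suffix, keeping at least three letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def tokenize(text):
    """Lower-case the text and keep runs of letters only."""
    return re.findall(r"[a-z]+", text.lower())


def features(title, abstract):
    """Stem every word; mark the ones that come from the title with 'T:'."""
    title_tokens = ["T:" + crude_stem(w) for w in tokenize(title)]
    abstract_tokens = [crude_stem(w) for w in tokenize(abstract)]
    return title_tokens + abstract_tokens


print(features("Calibrated models", "We study calibration."))
```

Keeping the title words as separate `T:` tokens lets the model treat a word in the title as a different, and possibly stronger, signal than the same word in the abstract.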

This was my first time running any kind of machine-learning algorithm. The most important lesson for me was that, as in any other statistical work, having clean data is very important. Some of the titles have HTML and LaTeX markup, or copyright notices with the publisher’s name, and the model latches on to these as important features even though they are not.
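A cleaning pass for the kinds of debris mentioned above might look like the sketch below. The regular expressions are my own guesses at what such markup looks like, not rules taken from the RePEc data, and `clean_text` is a hypothetical helper name.

```python
import html
import re

TAG_RE = re.compile(r"<[^>]+>")                # HTML tags such as <b> or </i>
LATEX_CMD_RE = re.compile(r"\\[a-zA-Z]+\*?")   # LaTeX commands such as \emph
LATEX_CHARS_RE = re.compile(r"[{}$]")          # leftover braces and math delimiters
# Assumption: a copyright notice starts at a © sign or at "Copyright <year>"
# and runs to the end of the field.
COPYRIGHT_RE = re.compile(
    r"(?:©|copyright\s+(?:\(c\)\s*)?\d{4}).*$", re.IGNORECASE | re.DOTALL
)


def clean_text(raw):
    """Remove HTML, simple LaTeX markup and a trailing copyright notice."""
    text = html.unescape(raw)          # &amp; -> &
    text = TAG_RE.sub(" ", text)
    text = COPYRIGHT_RE.sub(" ", text)
    text = LATEX_CMD_RE.sub(" ", text)
    text = LATEX_CHARS_RE.sub(" ", text)
    return " ".join(text.split())      # collapse runs of whitespace


print(clean_text(
    "<b>Sticky</b> prices and \\emph{optimal} policy &amp; trade. "
    "© 2015 Elsevier Inc. All rights reserved."
))
```

Rules this blunt have costs: dropping a LaTeX command also drops any symbol it stood for, so it is worth eyeballing a sample of the cleaned titles before fitting the model.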

I also saw the value of the IPython notebook format over writing scripts. You get inline graphs. Being able to restart the code from somewhere in the middle is a boon when some parts of the code are expensive to run. But I got bitten a few times by variables left over from previous runs affecting results, or by not being able to restart computation in the middle because, say, a previous run had closed a database connection.
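Two habits that would probably have saved me from these problems are sketched below: save the output of an expensive step to disk instead of relying on a variable that happens to be in memory, and open a fresh database connection inside each function that needs one. The helper names, the use of SQLite, and the `articles` table with a `title` column are assumptions for illustration only.

```python
import os
import pickle
import sqlite3
from contextlib import closing


def cached(path, compute):
    """Return the pickled result stored at `path`, computing it only if absent."""
    if os.path.exists(path):
        with open(path, "rb") as f:
            return pickle.load(f)
    result = compute()
    with open(path, "wb") as f:
        pickle.dump(result, f)
    return result


def load_titles(db_path):
    """Open and close a connection on every call, so no cell depends on an old one."""
    with closing(sqlite3.connect(db_path)) as conn:
        return [row[0] for row in conn.execute("SELECT title FROM articles")]
```

A cell such as `titles = cached("titles.pkl", lambda: load_titles("papers.db"))` (file names hypothetical) can then be re-run from anywhere in the notebook, at the price of having to delete the cache file by hand when the underlying data changes.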