I am not an academic.

In fact, as far as computer science goes I have never been an academic. My background is in mathematics.

Nevertheless, I try to read a lot of papers. I’m interested (albeit currently only in an armchair way) in programming language implementation and theory. I occasionally feel the need to implement a cool data structure for fun or profit. And, most importantly, I need to for my job.

I work on SONAR. As such a great deal of what I do is natural language processing and data mining. It’s a fairly heavily studied field in academia and it would be madness for me to not draw on that fact. Eventually I’d like to feed back into it too – we do a bunch of interesting things in SONAR and I’m reasonably confident that we can isolate enough of it into things we can share without giving the entire game away (we need to sell the product after all, so giving away the entire workings of it is non ideal).

I’ve got to say. It’s a pretty frustrating process.

Part of the problem is what I’ve come to refer to as the “Academic firewall of doom”. Everything lives behind paid subscriptions. Moreover, it lives behind exhorbitantly priced paid subscriptions. Even worse: None of the ludicrous quantities of money I’d be paying for these subscriptions if I choose to pay for them would go to the authors of the papers. The problems of academic publishing are well known. So assuming I don’t want to pay the cripplingly high fees for multiple journals, I have to either rely on the kindness of authors in publishing their papers on their own websites (a practice I’m not certain of the legal status of but am exceedingly grateful for), or if they have not done so pay close to $100 for a copy of the paper. Before I know if the paper actually contains the information I need.

Another unrelated problem which is currently being discussed on reddit in response to an older article by Bill Mill is that very few of the papers I depend on for ideas come with reference implementations of the algorithms they describe. I understand the reasons for this. But nevertheless, it’s very problematic for me. The thing is: Most of these papers don’t work nearly as universally as one might like to believe. They frequently contain algorithms which have been… let’s say, empirically derived. Take for example the punkt algorithm which I’m currently implementing (which has an implementation in NLTK, which at least gives me confidence that it works. I’m trying not to look at it too much though due to its GPLed nature). It starts out with a completely legitimate and theoretically justified statistic to do with colocations. That’s fine. Except it turns out that it doesn’t work that well. So they introduce a bunch of additional weightings to fuzz the numbers into something that does work and then through experimentation arrive at a test statistic which does a pretty good job (or so they claim).

That’s fine. It’s how I often work too. I’m certainly not judging them for that. But it does mean that I’m potentially in for a lot of work to reproduce the results of the paper (hopefully only a couple of days of work. But that’s a lot compared to firing up the reference implementation) for very uncertain benefits. And additionally it means that if it doesn’t work I don’t know whose fault it is – mine or the algorithm’s. So then if it doesn’t work I have to make a decision whether to spend time debugging or to cut my losses and run.

It should also be noted that negative results are, for whatever reason, the overwhelming majority. I’ve written enough similar things that actually work (at least some of which are based on papers) that I choose to believe that at least not all of these are due to my incompetence.

So, to recap, I’m in a situation where in order to find out if any given paper is useful to me I need to a) Spend near to $100 if I’m unlucky and b) In the event that the paper actually does contain what I want, take several days of work to implement its results and c) In the common case get negative results and even then not be certain about the validity of them so potentially spend more time debugging.

So the common case is designed in such a way as to cost me a significant amount of money and effort and is unlikely to yield positive results. You can understand why I might find this setup a bit problematic.