You’ve probably already heard about the interesting paper by Kiri Wagstaff, titled “Machine Learning that Matters”, presented at this years ICML conference which is just underway in Edinburgh.

The basic argument in her paper is that machine learning might be in danger of losing its impact because the community as a whole has become quite self-referential. People are probably solving real-world problems using ML methods, but there is little sharing of these results within the community. Instead, people focus on existing benchmarks which might have originally had some connection to real-world problems which has been long forgotten, however.

She proposes a number of tasks like $100M solved through ML based decision making or a human life saved through a diagnosis or an intervention recommended by an ML system to get ML back on track.

The paper already generated quite some buzz. There is a heated discussion thread on reddit and many people have responded to it on their blogs, for example, Cheng Soon Ong on the mloss.org blog. I also just found out that she built a website called mlimpact.com which hosts a discussion forum around these topics.

I partly agree with her assessment, however, I think it is wrong to take the main ML conferences and journals as a reference to how much application work is going on. Our group at TU Berlin has always had a strong focus on applications, including neuroscience (in particular brain-computer interfaces), bioinformatics, and computer security, and eventually, you have to start to publish in the conferences and journals of the application field, not in pure ML conferences.

I also think that it is perfectly okay that pure ML is somewhat removed and abstract from real applications. After all, being able to formulate methods abstractly is one of the main reasons there is a discipline like machine learning at all. If everything were always very application specific, it would be very hard to transfer knowledge between people working on different applications.

On the other hand, it is true that there is little return of information from the applications into the pure ML domain, partly because it is very hard to publish application related papers at ML conferences unless they have a significant methodological contribution. I think we’re missing out a lot of interesting insights into the capabilities and limitations of the learning methods we have developed that way.

But I think there are also other problems. The hype around Big Data and Data Science is pretty big right now. As I’ve discussed in a previous post I think machine learners are one of three groups who can potentially contribute a lot to this field (the others being data ming people, and computational statisticians). Still, from talking to my colleagues and other people in ML I get the feeling that we’re losing the race to get our share of the cake, mostly to data mining people who have much better expertise on the technological side, but often lack the methodological depth of machine learners.

I think the main reason why this is happening is that machine learning has been a bit too successful in finding an abstract mathematical language in which to formulate their problems, which mostly statistics and linear algebra. If faced with a concrete problem, the typical machine learner goes through a very painful stage where he tries to get the data, convert it into matrices, cleanse it, so that he can finally load it into matlab, scipy or R. Now he can relax and finally feel at home. Honestly, many of my colleagues consider databases as just another file format, a way to store and retrieve data.

The problem with this, however is that as a data scientist, you also need to be able to put your stuff into production, which means dealing with all kinds of enterprise level technology like web services, databases, messaging middleware, and questions of stability and scalability. Also, as opposed to the batch processing mode which one often uses to get results for papers (load data, grind data for a few hours, write out the results), you have to run your analyses in a much more tightly knit fashion, for example, by hooking up your algorithms to a web services and doing all the communication over a network.

Of course, acquiring this kind of expertise also takes a lot of time, and it’s also not strictly required for an academic career in machine learning. However, in particular for many web related tasks, this is exactly what it takes to make an impact in businesses and on the world for machine learning. Instead, I often get the impression that people consider this extra work as merely “programming”, and something which is outside of the scope of a machine learner.

I’m not sure how to change this. In the end, everyone has to decide for himself what pieces of technology to learn. If you want to apply ML beyond academia, you certainly have to learn about Java, databases, NoSQL, Hadoop, and all this stuff at some point.

Actually, I’m not even convinced that it would make sense to have more technical contributions at core ML conferences. However, I definitely think it should become more common place that you know how to implement an algorithm in an enterprise environment as opposed to a ML-friendly matrix based language such as matlab or R.

So in terms of teaching, I think there is a lot of room for improvement. Students should learn a lot more about how to bridge the gap between a purely mathematical version of an ML algorithm and how you would implement it in the real-world. People have been doing this kind of work at Google, Yahoo, and all the other data driven companies for quite some time, but it’s something different than having students implement SMO in matlab.

Thanks to Johnny D. Edwards for originally telling me about the paper at the Berlin Buzzwords conference.

Posted by Mikio L. Braun at 2012-06-27 21:10:00 +0200

Disqus