Why our machine learning platform supports Python, not R

Machine learning engineering is maturing

Disclaimer: The following is based on my observations — not an academic survey of the industry. For context, I’m a contributor to Cortex, an open source machine learning platform (the “our” in this article’s title).

There are dozens of articles written comparing the relative merits of Python and R for data science, and this isn’t one of them.

Instead, this is an article about the divergence of data analysts and machine learning engineers, and their differing needs in a programming language.

The simple version is that machine learning engineers are, fundamentally, software engineers, and they use programming languages designed for software engineering—not statistics.

This may sound fairly obvious, but it represents a change in the machine learning ecosystem, one that is worth diving into further.

Python and R are both suited for data analysis

Comparisons of R and Python often highlight perceived advantages of either language that are, at best, marginal and subjective. While some believe R’s out-of-the-box statistical functions provide an advantage over Python, which requires the use of third-party libraries like NumPy, those differences aren’t that impactful.

The simple truth is that R and Python are both completely adequate for the analysis of data.

For example, say you want to run a simple linear regression model on some data, like housing prices. In R, it would look something like this:

square_feet <- c(1000, 1300, 942, 1423, 2189)
price <- c(300000, 299000, 240000, 420000, 600322)
correlation <- lm(price~square_feet)
new_house <- data.frame(square_feet = 1100)
new_house_price <- predict(correlation, new_house)
print(new_house_price)

Here it is in Python:

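import pandas as pd
import statsmodels.api as sm

data = {'square_feet': [1000, 1300, 942, 1423, 2189], 'price': [300000, 299000, 240000, 420000, 600322]}
housing_data = pd.DataFrame(data=data)

# Add an intercept column so the fit matches R's lm(), which includes one by default
features = sm.add_constant(housing_data[['square_feet']])
model = sm.OLS(housing_data['price'], features).fit()

new_data = {'square_feet': [1400]}
new_housing_data = pd.DataFrame(data=new_data)
# has_constant='add' forces the intercept column even when there is only one row
print(model.predict(sm.add_constant(new_housing_data, has_constant='add')))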

The differences aren’t dramatic. Some people may feel particularly attached to the syntax of one language, or may prefer R’s most popular plotting library (ggplot2) over Matplotlib or other Python options. Others will point out that Python is more performant than R.

The reality is, if all you want to do is analyze data, either language will get the job done fine.
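To make concrete how little machinery this kind of analysis needs, here is a minimal sketch of the same one-variable regression using only Python's standard library. The helper name fit_line is hypothetical and not part of any library; it applies the textbook closed-form least-squares formulas to the housing data from the examples above.

```python
from statistics import mean


def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    x_bar, y_bar = mean(xs), mean(ys)
    # slope = covariance(x, y) / variance(x); the 1/n factors cancel out
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    # The fitted line always passes through the point (x_bar, y_bar)
    intercept = y_bar - slope * x_bar
    return slope, intercept


square_feet = [1000, 1300, 942, 1423, 2189]
price = [300000, 299000, 240000, 420000, 600322]

slope, intercept = fit_line(square_feet, price)
print(f"price ~ {intercept:.0f} + {slope:.2f} * square_feet")
print(f"predicted price for 1100 sq ft: {slope * 1100 + intercept:.0f}")
```

Libraries like statsmodels add standard errors, diagnostics, and multi-variable models on top of this, but the core computation is the same in either language.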

But machine learning engineering is about software—not business intelligence

The needs of a company that analyzes data to learn about its business (business intelligence, in other words) are different from those of a company for which machine learning is an actual part of the product.

As Adam Waksman, Head of Core Technology at Foursquare, explains:

“A lot of times when companies say they have a “data science team”, they mean they have an analytics support function. At Foursquare, where machine learning models are a big chunk of our product…. we think of data science as part of our product development team”

Waksman continues to explain that at Foursquare, “We don’t have a data science department — we have an engineering department that cuts across a lot of functions.”

The needs of machine learning engineers are different. Let’s look at a real example.

To build a customer service bot for your company, you’d probably deploy your model as a microservice, which would take customer input and return a response to be rendered within the bot’s frontend.

In building this API, you’d need to:

Load your model. Regardless of which framework you used to train it, that framework almost certainly has native Python bindings.

Use a framework for serving your API. Python has several options, Flask being the most popular, while R’s options are far more limited, with Plumber being the best known.

Worry about things like parsing user input and, potentially, communicating with other services. This is more easily done in a general-purpose scripting language like Python.

In other words, machine learning engineers have to deal with engineering concerns, where Python is the better choice.
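The three steps above might look roughly like the following Flask sketch. Everything named here is an assumption for illustration: EchoModel is a hypothetical stand-in for a real trained model, and the /respond route and the "message" field are invented for this example rather than taken from any particular product.

```python
class EchoModel:
    """Hypothetical stand-in for a trained model. A real service would
    load weights from disk using its ML framework's Python API."""

    def predict(self, text: str) -> str:
        return f"You said: {text}"


def load_model() -> EchoModel:
    # Step 1: load the model once at startup, not on every request.
    return EchoModel()


def parse_message(payload) -> str:
    """Step 3: validate the request body and return the cleaned message."""
    if not isinstance(payload, dict):
        raise ValueError("request body must be a JSON object")
    message = payload.get("message")
    if not isinstance(message, str) or not message.strip():
        raise ValueError("'message' must be a non-empty string")
    return message.strip()


def create_app(model=None):
    """Step 2: wrap the model in an HTTP API."""
    # Flask is imported here so the helpers above stay usable without it.
    from flask import Flask, jsonify, request

    app = Flask(__name__)
    model = model or load_model()

    @app.route("/respond", methods=["POST"])
    def respond():
        try:
            message = parse_message(request.get_json(silent=True))
        except ValueError as err:
            # Reject malformed input with a 400 instead of crashing.
            return jsonify(error=str(err)), 400
        return jsonify(response=model.predict(message))

    return app
```

Calling create_app().run(port=5000) would start a local development server. Note that most of the code is input validation and HTTP plumbing rather than modeling, which is the point of the list above.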

Machine learning is both a research and an engineering discipline

To understand the emergence of machine learning engineering, it is useful to look at what happened in a related field, web development.

In 2000, there was only one product that relied on asynchronous communication between the client and server—Outlook Web Access. The team at Microsoft working on Outlook Web Access was the same team that invented XMLHTTP, the technology that made background HTTP requests possible.

In other words, the only people who could build asynchronous apps were the people who invented the technology that enabled them.

Not long ago, the same was true of machine learning. The only companies building products with machine learning also had sizable machine learning research teams, like Google, Facebook, and Netflix.

However, it didn’t take long for the web development field to split into researchers and practitioners. While researchers still work on new technologies and frameworks—typically while employed by larger organizations—practitioners mostly use these inventions to build products.

A similar trend is happening in machine learning. Machine learning engineers are emerging as practitioners who build ML-powered products using state-of-the-art models and frameworks produced by large companies and research labs.

For example, Nick Walton built AI Dungeon, an ML-driven choose-your-own-adventure game, at a hackathon using a fine-tuned version of OpenAI’s GPT-2.

Just as most web developers don’t design their own database or framework, Walton did not invent his own model architecture. Instead, he used the outputs of machine learning researchers to build a new product.

Practitioners like Walton, who are focused on building software, need to work in a language suited to building software, not dashboards.

Machine learning is moving out of the lab and into products—and that means Python

Business intelligence and data analysis will always exist, and within those communities, R will remain a popular choice. ML engineering, however, has moved on.

More and more, we are seeing teams like Foursquare, for whom data science and machine learning are matters of product development and engineering. The people responsible for them aren’t data analysts, they’re engineers (in terms of responsibilities, not titles), and they use tools and languages familiar to software engineers—like Python.

R will always be a valid tool for generating dashboards and reports. Building a predictive ETA feature for your ridesharing app, a content recommendation engine for your streaming service, or a face recognizer for your photo app, however, is a job for machine learning engineers and Python.

We built Cortex for machine learning engineers because we, originally, were software engineers who wanted to use machine learning. Our concerns had less to do with designing new models, and more to do with engineering problems, like:

What is the best language for integrating with popular ML frameworks? Nearly every popular framework has native Python bindings.

What language is best suited to writing request-processing code? A general-purpose language like Python.

What is the simplest microservice framework we could use for wrapping models in APIs? Flask, which of course is Python.

In other words, we built a platform for machine learning engineers, not data analysts, and that meant supporting Python over R.