We’ve all heard about big data; over the past few years, many companies have invested in Hadoop, NoSQL, and data warehouses to collect and store massive volumes of new data. Even when based on open source platforms like Hadoop, these investments can easily run into the millions of dollars for large companies, once you count new hardware, new staff, and untold person-hours spent implementing new systems and procedures.

Now it’s time for that investment to pay off.

The way to do that is with data science, the extraction of knowledge from data. It’s more than just tabulating and reporting on the data; data science combines computer science, statistical analysis, and a keen understanding of business needs to separate correlation from causation, and to forecast future outcomes and risk. According to TheNextWeb, data scientists are “changing the face of business intelligence.” The increased availability of data has also made data science crucial to product development, and to creating and managing innovations that are too complex for automated systems, especially in a world where privacy concerns are paramount.

As a result, companies are hiring data scientists at a massive rate. Job postings for data scientists have skyrocketed since early 2011, according to data from job-tracker Indeed.com, although in recent months much of the growth has been in data science skills generally, as data scientists take on specialized job titles. Meanwhile, data scientists still command impressive salaries: a median of $98,000 worldwide and $144,000 in the US, according to the latest Data Science Salary Survey by O’Reilly Media.

With such strong demand and such high salaries to offer, it’s no surprise that competition for hiring data scientists is intense. As a result, companies that previously relied on legacy proprietary platforms for statistical analysis are now adopting a new alternative: open source R. So far, R has been chosen by more than two million data scientists and statisticians around the world.

R is an open source software platform for statistical data analysis. The R project began in 1993 as a project by two statisticians in New Zealand, Ross Ihaka and Robert Gentleman, to create a new platform for research in statistical computing. Since then the project leadership has grown to include more than 20 leading statisticians and computer scientists from around the world.

Largely because of its open source nature, R was rapidly adopted by statistics departments in universities around the world, which were attracted by its extensibility as a platform for academic research. Being free in cost certainly played a role as well. And it wasn’t long before researchers in statistics, data science, and machine learning started to publish papers in academic journals along with R code implementing their new methods. R makes this process very easy: anyone can publish an R package to CRAN (the “Comprehensive R Archive Network”) and make it available to everyone. As of this writing, thousands of R users have contributed more than 6,100 packages to CRAN, extending R’s capabilities in fields as diverse as econometrics, clinical trials analysis, social sciences, and web-based data. And one can easily search for R applications by topic or keyword at MRAN.

While the core R project is maintained by the R Foundation (a non-profit based in Vienna, Austria), other companies and organizations are extending R as well. The BioConductor Project has created an additional 900+ packages making R the leading software for genomic and genetic data analysis. RStudio has created an excellent open-source interactive development environment for the R language, further boosting the productivity of R users everywhere. And Revolution Analytics has boosted the performance of R with Revolution R Open and made it easy to embed R into other applications with DeployR.

With R’s widespread use in the academic sector, it wasn’t long before it started being used in the commercial sector as well. A front-page article in The New York Times technology section in January 2009 spurred a lot of new interest, and Revolution Analytics has been very active, offering technical support, services, and big-data capabilities. Today, R is ranked as the 9th most popular language by IEEE Spectrum, and it is consistently ranked the most popular language for data science. Thousands of companies are using R for data science applications.

Here are just a few examples:

Google uses R to calculate the ROI on advertising campaigns.

Ford uses R to improve the design of its vehicles.

Twitter uses R to monitor user experience.

The US National Weather Service uses R to predict severe flooding.

The Rockefeller Institute of Government uses R to develop models for simulating the finances of public pension funds.

The Human Rights Data Analysis Group uses R to quantify the impact of war.

The New York Times frequently uses R to create infographics and interactive data journalism applications.

These organizations have adopted R because it’s the platform their data scientists prefer to use. And, crucially, given that data scientists are a limited resource, it’s also the platform that makes data scientists the most productive. Unlike proprietary systems that provide only constrained point-and-click tools or black-box procedures, R is a fully fledged programming language. All of the functions needed for a typical data science application are included in the base language: functions for data access and preparation, data visualization, statistical modeling, and forecasting. Complete data analyses can often be represented in just a few lines of code. And because data scientists using R produce code, not just reports, it’s easier for them to collaborate, to replicate results (particularly in automated production environments), and to reuse code from other projects to get tasks done faster.
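To give a sense of what "a few lines of code" can look like, here is a minimal sketch that uses only base R and the mtcars example dataset that ships with it. The choice of dataset, the model formula, and the hypothetical 3,000 lb car are all assumptions made for illustration; none of it is drawn from the organizations listed above.

```r
# Data access: mtcars is an example dataset bundled with base R
data(mtcars)

# Data preparation: turn the 0/1 transmission code into a labeled category
mtcars$am <- factor(mtcars$am, levels = c(0, 1),
                    labels = c("automatic", "manual"))

# Visualization: fuel economy against weight, colored by transmission type
plot(mpg ~ wt, data = mtcars, col = mtcars$am, pch = 19,
     xlab = "Weight (1000 lbs)", ylab = "Miles per gallon")

# Statistical modeling: linear regression of mpg on weight and transmission
fit <- lm(mpg ~ wt + am, data = mtcars)
summary(fit)

# Prediction: expected mpg for a hypothetical 3,000 lb manual car
new_car <- data.frame(wt = 3.0,
                      am = factor("manual", levels = levels(mtcars$am)))
predicted_mpg <- predict(fit, newdata = new_car)
predicted_mpg
```

Because the whole analysis is a script rather than a sequence of menu clicks, a colleague can rerun it, review it, or adapt it to a different dataset by changing a few names.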

R’s open source nature also gives companies a boost when it comes to innovation. This is incredibly important in today’s data-centric world, where even a tiny edge in being able to predict customer needs or financial returns better than your competitors can mean the difference between success and failure. Because most cutting-edge research in statistics and machine learning is done in R, the latest techniques are usually available first as a package for R, years and sometimes decades before they appear in proprietary systems.

So, with data science as a top business priority according to Gartner, the popularity of R is set to grow even further. And if you’re looking to expand your career potential, and you have data analysis skills, you could do a lot worse than getting to know the R language.
