Contributed by John King and Roger Magoulas

If there’s one thing folks in the data community are good at, it’s using their analytic skills to find high-paying jobs. Our data science survey showed a mean salary of $98,000 (US).

For the last three years, O’Reilly Media has run an anonymous Data Science Salary Survey, in concert with the Strata + Hadoop World conference, to look at which factors most affect the salaries of data analysts and data engineers. The survey is open to the public, and more than 800 people who work in and around the data space, from 53 countries and 41 US states, have responded. For more detail, you can download the full 2014 Data Science Salary Survey report for free. You can also take the survey yourself.

O’Reilly promotes the survey through a number of its public channels and events. As a result, the survey population reflects the O’Reilly audience more than the general population of data workers. Survey respondents likely use more open source tools and more new technologies than the general population, and more of them hold technical lead and managerial positions.

Those responding work in a variety of industries and come from a variety of backgrounds – not all strictly technical. While almost all the respondents had some technical duties and backgrounds, fewer than half work in individual contributor roles. Respondents’ top roles/duties include Analysts (who code), Statisticians, Software Developers, Technical Leads, and Managers.

The 40 questions in the survey covered demographic, tool usage, and compensation topics. Looking at which tools best correlate with the highest salaries helps identify market conditions where demand for skills is greater than the supply of workers. Geography also plays a role in supply and demand: we see the highest salaries in tech-intensive California, Texas, the Northwest, and the Northeast (MA to VA). Tool choice in general, and the clusters of tool usage detailed below, show what the data workers in our survey use to get the job done.

One key finding: the more you learn, the more you earn. Respondents who used the widest variety of tools tended to earn more. One likely cause is the many tools that characterize the coalescing Hadoop ecosystem, where new tools fill new niche needs such as real-time/streaming data, log processing, and in-memory data management. That doesn’t diminish the value of traditional tools like RDBMS: higher salaries went to folks who know both Hadoop and RDBMS than to those who use either in isolation. Plus, SQL remains the most commonly used tool among all respondents.

On the analytics side, the most used tool is SQL, followed by Excel and R (tied), then Python and Tableau. For data management, SQL, MySQL, MS SQL Server, Oracle, Hadoop, and Hive are most commonly used.

We wanted to look further to see how tool usage correlated for our survey cohort. Using a clustering algorithm, we discovered five clusters. Few respondents used only the tools in a single cluster, and most respondents used tools in four of the five clusters. The clusters do show that using one tool in a cluster increases the probability of using another tool in the cluster, and many respondents tended towards one or two clusters, using few tools from the others.

Tool use clusters

Cluster 1 – Proprietary Analytics: SQL, Excel, MS SQL Server, Oracle, Oracle BI, SAS, SPSS, MicroStrategy, Business Objects, C#, PowerPivot

Cluster 2 – Hadoop Ecosystem: Hadoop (all distributions), Spark, Java, Scala, Pig, HBase, Hive, Cassandra, MongoDB, Storm, AWS/Elastic MapReduce, Redis, Pentaho, Splunk, Mahout

Cluster 3 – Data Science: Python, R, Matlab, Natural Language Processing, Network/Social Graph, Continuum (NumPy, SciPy), Weka, libsvm

Cluster 4 – Presentation Layer: Mac, JavaScript, D3, MySQL, Postgres, Google Chart Tools, SQLite, Ruby

Cluster 5 – Data Munging: Unix, C/C++, Perl

We had one outlier tool, Tableau, with relatively high usage but no good fit with a single cluster. Tableau did correlate with Clusters 1 and 2.
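
The write-up does not say which clustering algorithm produced these groups. As a rough illustration of how tools can be grouped by who uses them together, here is a minimal Python sketch that assumes average-linkage agglomerative clustering on the Jaccard similarity between the sets of respondents using each tool. The respondent data, the function names, and the choice of three clusters are all invented for the example and are not taken from the survey.

```python
from itertools import combinations

# Hypothetical toy data: each set is the tools one respondent reported using.
respondents = [
    {"SQL", "Excel", "SAS"},
    {"SQL", "Excel", "Oracle"},
    {"Excel", "SAS", "Oracle"},
    {"Hadoop", "Hive", "Pig"},
    {"Hadoop", "Hive", "Spark"},
    {"Hive", "Pig", "Spark"},
    {"Python", "R", "SQL"},
    {"Python", "R", "Weka"},
]


def users_of(tool, respondents):
    """Indices of the respondents who use the given tool."""
    return {i for i, tools in enumerate(respondents) if tool in tools}


def jaccard(a, b):
    """Share of users two tools have in common: size of overlap / size of union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster_tools(respondents, n_clusters):
    """Group tools by shared users with average-linkage agglomerative clustering."""
    tools = sorted(set().union(*respondents))
    users = {t: users_of(t, respondents) for t in tools}
    clusters = [[t] for t in tools]  # start with every tool on its own

    def similarity(c1, c2):
        # Average pairwise Jaccard similarity between two groups of tools.
        pairs = [(a, b) for a in c1 for b in c2]
        return sum(jaccard(users[a], users[b]) for a, b in pairs) / len(pairs)

    # Repeatedly merge the two most similar groups until n_clusters remain.
    while len(clusters) > n_clusters:
        i, j = max(
            combinations(range(len(clusters)), 2),
            key=lambda ij: similarity(clusters[ij[0]], clusters[ij[1]]),
        )
        clusters[i] = sorted(clusters[i] + clusters[j])
        del clusters[j]
    return clusters


for group in cluster_tools(respondents, 3):
    print(group)
```

In this toy data SQL is used alongside both the spreadsheet-style tools and Python/R, which mirrors the point above that few respondents stay inside a single cluster; the algorithm still assigns each tool to the group it overlaps with most.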

That Cluster 2 (Hadoop Ecosystem) contains a large set of tools reflects two characteristics of the cluster. First, respondents in the cluster use the highest number of tools (18-19, almost double what we see for users of some of the tools in Cluster 1). Second, many of the tools in the cluster are complementary and single-purpose, reflecting the evolving and coalescing nature of the Hadoop ecosystem.

When we look at how the clusters fit into the regression model we built from the data, respondents in Cluster 1 (Proprietary Analytics) have slightly lower salaries, and respondents in Cluster 2 (Hadoop Ecosystem) and Cluster 3 (Data Science) slightly higher salaries, relative to the mean.
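
The article does not describe the regression model itself. One simple way to relate salary to cluster usage, sketched here under that caveat, is ordinary least squares on 0/1 indicators for whether a respondent uses tools from each cluster. Everything in the example is hypothetical: the salaries (in thousands of dollars) are generated from made-up effects of -5, +6, and +4 around a baseline of 100, and `solve` and `ols` are illustrative helper names, not part of any survey tooling.

```python
def solve(a, b):
    """Solve the linear system a*x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        # Swap in the row with the largest pivot for numerical stability.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back-substitution on the upper-triangular system.
    x = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def ols(rows, y):
    """Ordinary least squares with an intercept, via the normal equations."""
    x = [[1.0] + list(r) for r in rows]  # prepend the intercept column
    k = len(x[0])
    xtx = [[sum(xi[a] * xi[b] for xi in x) for b in range(k)] for a in range(k)]
    xty = [sum(xi[a] * yi for xi, yi in zip(x, y)) for a in range(k)]
    return solve(xtx, xty)


# Hypothetical respondents: (uses Cluster 1, uses Cluster 2, uses Cluster 3).
rows = [
    (1, 0, 0), (0, 1, 0), (0, 0, 1), (1, 1, 0),
    (0, 1, 1), (1, 0, 1), (0, 0, 0), (1, 1, 1),
]
# Made-up salaries in $K, built from invented effects so the fit is easy to check.
salaries = [100 - 5 * c1 + 6 * c2 + 4 * c3 for c1, c2, c3 in rows]

intercept, b1, b2, b3 = ols(rows, salaries)
print(round(intercept, 2), round(b1, 2), round(b2, 2), round(b3, 2))
```

A negative coefficient on the Cluster 1 indicator and positive coefficients on Clusters 2 and 3 would correspond to the pattern described above. A model fit to real survey data would also include the demographic and geographic variables the survey collected, and its coefficients would carry uncertainty that this noise-free toy example ignores.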

Results from the salary survey can provide some guidance on which tools are in greatest demand, information you can use to help set your learning path and career choices. However, in the dynamic data space, you can expect change. We know from looking at the Strata conference proposal submissions that more data professionals want to share what they know about topics like real-time/streaming and ElasticSearch. The next Data Science Salary Survey, due in the fall, should provide additional guidance.