How to Level Up as a Data Scientist (Part 1)

Part 1: Table Stakes

A little while ago I wrote an article about methodologies as vanity metrics. The point I was trying to get across was that methods are a means, not an end: it may be enjoyable to learn them, but as data scientists we earn our keep by solving valuable problems. So when you think about skill building, you should focus on the skills you need to drive value. Reflecting on that post, I realized that it’s easy to make such a statement but hard to execute on, especially if you’re early in your career and haven’t yet built up many reps. So I decided to write a (set of) companion post(s) with more actionable advice. My goal here is to leave you with two things:

first, an understanding of the core skills beyond the (obvious) technical tools that make an excellent data scientist, and

second, some specific steps you can take to develop or improve on those skills.

This is long, so I’ve decided to break it into component parts. In this post I’m going to cover the table stakes: the skills you need to get yourself a seat at the table, and the ones you should be working on first if you don’t already have them. In subsequent posts I’ll cover three areas where strong practitioners excel and specific things you can do to build that skill set.

Table Stakes

Table stakes are the foundation on top of which everything else sits. They represent the core skills that a data scientist wields to generate value for their company.

SQL

For better or for worse (mostly for better), SQL is the lingua franca for dealing with data. Most technology products are backed by a SQL database, and even those that are not usually have a SQL flavor built into them (e.g., Hive for Hadoop). At any rate, since the starting point of most problems you’ll deal with is “get data”, and the data is usually in a SQL store, you need to be an expert here. Fortunately, there are plenty of free resources; my favorite is Mode’s SQL School.
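To make that concrete, here’s a minimal sketch of the bread-and-butter pattern you’ll write constantly: filter, aggregate, group, order. I’m using Python’s built-in sqlite3 so the example is self-contained; the table and column names are invented, not from any real schema.

```python
import sqlite3

# An in-memory SQLite database standing in for your company's warehouse.
# The orders table and its columns are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id INTEGER, user_id INTEGER, amount REAL, created_at TEXT);
    INSERT INTO orders VALUES
        (1, 10, 25.0, '2023-01-05'),
        (2, 10, 40.0, '2023-02-11'),
        (3, 11, 15.0, '2023-02-20');
""")

# The everyday shape of a "get data" query: filter, aggregate, group, order.
query = """
    SELECT user_id,
           COUNT(*)    AS n_orders,
           SUM(amount) AS total_spend
    FROM orders
    WHERE created_at >= '2023-02-01'
    GROUP BY user_id
    ORDER BY total_spend DESC;
"""
for row in conn.execute(query):
    print(row)  # e.g. (10, 1, 40.0)
```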

Random Variables and Conditional Probability

This is the primary mathematical tool you’ll use, because it is the most appropriate for answering questions like “Is this thing better than that thing?” or “Did the thing we just did work?” Most people jump to statistics to make headway on those problems. In my experience, statistical methods have so many assumptions baked in that they tend to be pretty useless on real-world problems (some simple cases that can be mapped to binomials are the exception), and I haven’t really trusted the output of a statistical test for a decision in about 10 years. But the stuff that underlies them, random variables and the probability rules that generate them, is really useful, since you can use it to make your assumptions explicit and run scenarios to figure out how likely or unlikely your results in the data actually are. So I recommend a strong foundation there, which you can get from any good textbook or these Khan Academy courses on probability and random variables.
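Here’s a minimal sketch of the kind of scenario-running I mean, with made-up numbers: instead of reaching for a canned test, simulate the random variable directly and ask how often chance alone produces a result at least as good as yours.

```python
import numpy as np

# Hypothetical scenario: a new checkout flow converted 66 of 500 visitors,
# and the historical baseline conversion rate is 10%. How surprising is that
# if nothing actually changed? Simulate the binomial directly.
rng = np.random.default_rng(seed=42)

n_visitors = 500
baseline_rate = 0.10
observed_conversions = 66

# Draw 100,000 "parallel universes" where the true rate is the baseline.
simulated = rng.binomial(n=n_visitors, p=baseline_rate, size=100_000)

# How often does chance alone do at least as well as what we observed?
p_chance = (simulated >= observed_conversions).mean()
print(f"P(>= {observed_conversions} conversions under the baseline): {p_chance:.4f}")
```

The assumptions (independent visitors, a fixed baseline rate) are sitting right there in the code, which is exactly the point.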

Basics of Python- or R-Based Machine Learning and Stochastic Modeling

ML and stochastic modeling are the first separation between a data scientist and an analyst, since they are tools that enable scale. By “basics” I mean: know what methods are available (e.g., trees vs. regressions vs. deep models), what they are meant to do (e.g., classify vs. project), how they work at a high level (e.g., loss minimization, regularization, etc.), which libraries have them implemented, how to have those libraries produce a result, and how you would validate that result (e.g., AUROC, precision vs. recall, etc.).
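As a toy illustration of that loop, here’s a minimal sketch with scikit-learn: have the library produce a result, then validate it with AUROC on held-out data. The dataset is synthetic; in practice the “get data” step is the SQL work above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a real classification problem.
X, y = make_classification(n_samples=2_000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# "Have the library produce a result": fit, then predict probabilities.
model = RandomForestClassifier(n_estimators=200, random_state=0)
model.fit(X_train, y_train)
scores = model.predict_proba(X_test)[:, 1]

# "How you would validate it": AUROC on data the model never saw.
print(f"AUROC: {roc_auc_score(y_test, scores):.3f}")
```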

For this I recommend “Data Science from Scratch”, which runs through Python implementations of the most common algorithms from first principles. Whatever you choose, make sure you use Python or R and not something you have to pay for (e.g., SAS) or something that isn’t code (e.g., a GUI-based tool like RapidMiner). This is because:

The free tools are winning. And are free. As in you don’t have to pay any money. And you can use that money for other things, like coffee or beer, if you’re so inclined.

The evidence points to code as the optimal abstraction for software. And your goal is to be able to implement your solutions in software because that gives you the highest point of leverage (and at the moment the best career prospects as a data person).

This step also gets you exposure to the data stack (pandas, numpy, and matplotlib in Python; dplyr and the like in R), which will help as you move up the value chain. (As a side note, I’m starting to personally advocate for Python rather than R at this step. Python offers a cleaner path to production environments and is a full-featured programming language suitable for software development as well, and partly for these reasons it is rapidly gaining market share. You can be productive and valuable in R, but if you want to truly scale, Python is becoming the better choice. We’ll come back to this later.)
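If you want a quick feel for that stack, here’s a minimal sketch (the data and column names are invented) chaining pandas, numpy, and matplotlib together in the everyday transform-aggregate-plot pattern:

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Invented example data: 90 days of noisy daily signups.
dates = pd.date_range("2023-01-01", periods=90, freq="D")
rng = np.random.default_rng(seed=0)
df = pd.DataFrame({
    "date": dates,
    "signups": rng.poisson(lam=50, size=len(dates)),
})

# The everyday pattern: transform, aggregate, plot.
weekly = df.set_index("date")["signups"].resample("W").sum()
weekly.plot(kind="bar", title="Weekly signups (synthetic data)")
plt.tight_layout()
plt.show()
```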