First off, let’s talk about what I think you, an aspiring data scientist, need to know, and how to go about learning it.

Topic 1: Statistical Learning

Statistical learning methods are going to top the list. From the standpoint of “topics to learn”, there’s a laundry list one could write: all of the ML methods in scikit-learn, neural networks, statistical inference methods, and more. It’s very tempting to go through that laundry list of terms, learn how they work underneath, and call it a day. I think that’s all fine, but only if the material is learned in the service of picking up the meta-skill of statistical thinking. This includes:

- Thinking about data as being sampled from a generative model parameterized by probability distributions (my Bayesian fox tail is revealed!),
- Identifying biases in the data and figuring out how to use sampling methods to help correct those biases (e.g. bootstrap resampling, downsampling; see the sketch below), and
- Figuring out when your data are garbage enough that you shouldn’t proceed with inference and should instead think about experimental design.
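
To make the resampling idea concrete, here is a minimal sketch of bootstrap resampling in Python. The dataset is made up for illustration, and the details (sample size, number of resamples) are arbitrary choices rather than recommendations.

```python
# A minimal sketch of bootstrap resampling (assumptions: NumPy available, and
# `observations` is a made-up example dataset, not real data).
import numpy as np

rng = np.random.default_rng(42)
observations = rng.normal(loc=5.0, scale=2.0, size=200)  # hypothetical sample

# Resample the data with replacement many times, computing the statistic of
# interest (here, the mean) on each resample.
n_bootstrap = 10_000
boot_means = np.empty(n_bootstrap)
for i in range(n_bootstrap):
    resample = rng.choice(observations, size=len(observations), replace=True)
    boot_means[i] = resample.mean()

# A 95% percentile interval for the mean, taken straight from the bootstrap draws.
lower, upper = np.percentile(boot_means, [2.5, 97.5])
print(f"mean = {observations.mean():.2f}, 95% CI = ({lower:.2f}, {upper:.2f})")
```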

That meta-skill of statistical thinking can only come with practice. Some only need a few months, some need a few years. (I needed about a year’s worth of self-directed study during graduate school to pick it up.) Having a project that involves this is going to be key! A good introduction to statistical thinking for data science can be found in a SciPy 2015 talk by Chris Fonnesbeck, and working through the two-part computational statistics tutorial by him and Allen Downey (Part 1, Part 2) helped me a ton.

Recommendation & Personal Story: Nothing beats practice. This means finding ways to apply statistical learning methods to projects that you already work on, or else coming up with new projects to try. I did this in graduate school: my main thesis project was not a machine learning-based project. However, I found a great PLoS Computational Biology paper implementing Random Forests to identify viral hosts from protein sequence, and it was close enough in research topic that I spent two afternoons re-implementing it using scikit-learn , and presenting it during our lab's Journal Club session. I then realized the same logic could be applied to predicting drug resistance from protein sequence, and re-implemented a few other HIV drug resistance papers before finally learning and applying a fancier deep learning-based method that had been developed at Harvard to the same problem.

Topic 2: Software Engineering

Software engineering (SWE), to the best of my observation, is about three main things: (a) learning how to abstract and organize ideas in a way that is logical and humanly accessible, (b) writing good code that is well-tested and documented, and (c) being familiar with the ever-evolving ecosystem of packages. SWE matters for a data scientist because models that make predictions are often put into production systems and used by more than just the data scientist who built them.

Now, I don’t think a data scientist has to be a seasoned software engineer, as most companies have SWE teams that a data scientist can interface with. However, having some experience building a software product can be very helpful for lubricating the interaction between DS and SWE teams. Having a logical structure to your code, writing basic tests for it, and providing sufficiently detailed documentation are all things that SWE types will very much appreciate, and it’ll make their lives much easier when it comes to deploying the code and helping with maintenance. (Aside: I strongly believe a DS should take primary responsibility for maintenance, not the SWE team, and should only rely on the SWE team as a fallback, say, when people are sick or on vacation.)
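
To illustrate the kind of basic hygiene I mean, here is a hypothetical example: a small, documented function plus two pytest-style tests. The function and its behaviour are invented for illustration; the point is the structure, the docstring, and the tests.

```python
# A made-up example of the hygiene described above: a small function with a
# docstring, plus pytest-style tests that double as documentation of behaviour.
import pytest


def normalize_counts(counts: dict[str, int]) -> dict[str, float]:
    """Convert raw counts into proportions that sum to 1.

    Raises ValueError if the counts sum to zero, so callers fail loudly
    instead of silently dividing by zero.
    """
    total = sum(counts.values())
    if total == 0:
        raise ValueError("counts must sum to a positive number")
    return {key: value / total for key, value in counts.items()}


def test_normalize_counts_sums_to_one():
    result = normalize_counts({"a": 2, "b": 2})
    assert abs(sum(result.values()) - 1.0) < 1e-9
    assert result["a"] == 0.5


def test_normalize_counts_rejects_empty_input():
    with pytest.raises(ValueError):
        normalize_counts({})
```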

Recommendation & Personal Story: Again, nothing beats practice here. Working on your own projects, whether work-related or not, will help you get a feel for these things. I learned my software engineering concepts from participating in open source contributions. The first was a contribution to matplotlib documentation, where I first got to use Git (a version control system) and Travis CI (a continuous integration system). It was there that I also got my first taste of software testing. The next year, I quickly followed it up with a small contribution to bokeh , and then decided at SciPy 2016 to build nxviz for my Network Analysis Made Simple tutorials. nxviz became my first independent software engineering project, and also my "capstone" project for that year of learning. All-in-all, getting practice was instrumental for my learning process.

Topic 3: Industry-Specific Business Cases

This is something I learned during my time at Insight, and it’s non-negotiable. Data Science does not exist in a vacuum; it is primarily in the service of solving business problems. At Insight, Fellows get exposure to business case problems from a variety of industries, thanks to the Program Directors’ efforts in collecting feedback from Insight alumni who are already Data Scientists in industry.

I think business cases show up in interviews as a test of a candidate’s imaginative capacity and/or experience: can the candidate demonstrate (a) the creativity needed to solve tough business problems, and (b) the passion for solving those problems? Neither of these is easy to fake when confronted with a well-designed business case. In my case, it was tough for me to get excited about data science at an advertising technology firm, and I was promptly rejected right after an on-site business case.

It’s important to note that these business cases are very industry-specific. Retail firms will have distinct needs from marketing firms, and both will be very different from healthcare and pharmaceutical companies.

Recommendation & Personal Story: For aspiring data scientists, I recommend prioritizing the general industry area that you’re most interested in targeting. After that, start going to meet-ups and talking with people about the kinds of problems they’re solving — for example, I started going to a Quantitative Systems Pharmacology meet-up to learn more about quantitative problems in the pharma research industry; I also presented a talk & poster at a conference organized by Applied BioMath, where I knew lots of pharma scientists would be present. I also started reading through scientific journals (while I still had access to them through the MIT Libraries), and did a lot of background reading on the kinds of problems being solved in drug discovery.

Topic 4: CS Fundamentals

CS fundamentals really means things like algorithms and data structures. I didn’t do much to prepare for this. The industry I was targeting doesn’t have a strong CS legacy or tradition, unlike most other technology firms doing data science (think the Facebooks, Googles, and Amazons). Thus, I think CS fundamentals are mostly important for cracking interviews; while problems involving them certainly can show up at work, they probably won’t be a central focus of data science roles for a long time, unless something changes.
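
For a flavor of what those interview problems look like, here is one classic toy example I picked as a stand-in (not drawn from any particular interview): find two numbers in a list that sum to a target, where using a hash map turns a quadratic brute-force search into a single linear pass.

```python
# A classic toy problem of this flavor (chosen as an illustrative stand-in):
# return the indices of two entries that sum to `target`. A dict (hash map)
# gives a single O(n) pass instead of an O(n^2) brute-force double loop.

def two_sum(numbers: list[int], target: int) -> tuple[int, int] | None:
    """Return indices of two entries summing to `target`, or None if none exist."""
    seen: dict[int, int] = {}  # value -> index where it was first seen
    for i, value in enumerate(numbers):
        complement = target - value
        if complement in seen:
            return seen[complement], i
        seen[value] = i
    return None


print(two_sum([2, 7, 11, 15], 9))  # -> (0, 1)
```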

Recommendation & Personal Story: As I don’t really like “studying to the test”, I didn’t bother with this — but that also meant I was rejected from tech firms that I did apply to (e.g. I didn’t pass Google Brain’s phone interview). Thus, if you’re really interested in those firms, you’ll probably have to spend a lot of time getting into the core data structures in computer science (not just Python). Insight provided a great environment for us Fellows to learn these topics; that said, it’s easy to over-compensate and neglect the other topics. Prioritize accordingly — based on your field/industry of experience.