Some Important Data Science Tools that aren’t Python, R, SQL or Math

A few necessary, important, or just cool data science tools you might want to be familiar with

If you ask any data scientist what you need to know to succeed in the field, they’ll likely tell you some combination of the above. Every single DS job description mentions Python or R (sometimes even Java, lol), SQL, and math, with some Spark and AWS/cloud experience mixed in, topped off with a healthy portion of buzzwords.


While these are undeniably must-haves, the reality is that most data scientists don’t exist in a vacuum: it’s unlikely you’ll be handed perfect data, build a model on your local machine in your own environment, and then just save the weights and call it a day. The modern data scientist needs the CS skills to make their work usable and accessible without relying on other engineers to do it for them. You aren’t going to email your boss a Jupyter notebook to showcase your work.

These may not hold true for every position, but in my experience and opinion these tools can be just as important (or just as convenient) as your bread & butter Python (although yeah, Python is definitely the most important). I’m not going to explain what any of these things are in depth, but I will explain why you would want to know them to become a good data scientist, one capable of building production-ready applications as opposed to messy, exploratory notebooks on your local machine.

The Command Line

This should go without saying. It blows my mind how many data scientists are unfamiliar with the command line. Bash scripting is one of the most fundamental tools in computer science, and since such a large percentage of data science is programmatic, this skill is of the utmost importance.

It’s almost a given that your code will be developed and deployed on Linux, so I would encourage using the command line whenever possible. Like data science, Python doesn’t exist in a vacuum either, and you’ll almost certainly have to deal with package/framework management, environment variables, your $PATH and a number of other things via some command-line interface.

Git

This also should go without saying. Most data scientists know git but don’t really know git. Because data science is so vaguely defined, there are a lot of us who don’t follow good software development practices. I didn’t even know what unit testing was for far too long.

When coding in a team, knowing git becomes huge. You’ll need to know what to do when team members make conflicting commits, or when you need to cherry-pick portions of code for bug fixes, updates, etc. Pushing your code to a hosted repository like GitHub (open source or private) also lets you plug in tools like Coveralls for test coverage, and there are other frameworks that conveniently deploy code to production on commit. Occasionally committing to a repo that only you use barely scratches the surface of the functionality offered by git and hosted repositories.

REST APIs

So you’ve trained a model. Now what? No one wants to see your Jupyter notebook or some kind of lame, interactive shell program. Additionally, unless you trained it in a shared environment, your model is still only available to you. Just having the model isn’t enough, and this is where a ton of data scientists hit a wall.

To actually serve predictions from the model, it’s good practice to make it available via a standard API call or something else conducive to application development. Services like Amazon SageMaker have gained huge popularity for their ability to seamlessly expose models in a production-ready way. You can also build one yourself using something like Flask in Python.
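To make that concrete, here’s a minimal sketch of a Flask app that wraps a model behind a /predict endpoint. The model file, feature layout and port are placeholders, and in production you’d want a real WSGI server (gunicorn, uWSGI) rather than Flask’s dev server:

```python
# app.py: a minimal sketch of serving a pickled model over REST with Flask.
# "model.pkl" and the feature format are hypothetical placeholders.
import pickle

from flask import Flask, jsonify, request

app = Flask(__name__)

with open("model.pkl", "rb") as f:  # assumes a previously trained, pickled model
    model = pickle.load(f)


@app.route("/predict", methods=["POST"])
def predict():
    payload = request.get_json()      # e.g. {"features": [5.1, 3.5, 1.4, 0.2]}
    features = [payload["features"]]  # most sklearn-style models expect a 2-D array
    prediction = model.predict(features)
    return jsonify({"prediction": prediction.tolist()})


if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)  # dev server only
```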

Finally, there are many Python packages that make API calls on the backend, so an understanding of what an API is and how it is used in development makes for a more competent data scientist.
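For what it’s worth, consuming an endpoint like the hypothetical one sketched above is only a few lines with a package like requests:

```python
import requests

# Call the hypothetical /predict endpoint from the Flask sketch above
resp = requests.post(
    "http://localhost:5000/predict",
    json={"features": [5.1, 3.5, 1.4, 0.2]},
)
print(resp.json())  # e.g. {"prediction": [0]}
```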

Docker & Kubernetes

Two of my personal favorites. Docker lets you build a production-ready application environment without having to intensively configure a production server for every single service that needs to run on it. As opposed to a heavy virtual machine that installs a full OS, Docker containers run on the same kernel as the host and are much more lightweight.

Think of a Docker container like Python’s venv with way more functionality. More advanced machine learning libraries like Google’s TensorFlow require specific configurations that can be difficult to troubleshoot on certain hosts, so Docker is frequently used with TensorFlow to guarantee a development-ready environment in which to train models.
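As a small illustration, and assuming you have Docker installed with the daemon running, the docker Python SDK (pip install docker) can pull the official tensorflow/tensorflow image and run code inside it without touching your host’s setup; the command here is purely illustrative:

```python
import docker

client = docker.from_env()  # connects to the local Docker daemon

# Run a throwaway container from the official TensorFlow image and print
# the TensorFlow version it ships with (illustrative command only).
logs = client.containers.run(
    "tensorflow/tensorflow:latest",
    'python -c "import tensorflow as tf; print(tf.__version__)"',
    remove=True,  # clean up the container once the command exits
)
print(logs.decode())
```

The same idea works from the plain docker CLI, of course; the SDK just makes it easy to script from Python.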

As the market trends toward microservices and containerized applications, Docker is a huge skill to have and is only growing in popularity. Docker isn’t only good for training models, but also for deployment. If you think of your models as services, you can containerize them so they have the environment they need to run, and they can then interact seamlessly with the other services in your application. This makes your models both scalable and portable.

Kubernetes (K8s) is a platform for managing and deploying containerized services at scale across multiple hosts. Essentially, this means you can easily manage and deploy your Docker containers across a horizontally scalable cluster.


As Google was already using Kubernetes to manage its TensorFlow containers (among other things), it took things one step further and developed Kubeflow, an open-source project for building and deploying machine learning workflows on Kubernetes. Containerized development and production are becoming increasingly integrated with machine learning and data science, and I believe these skills will be huge for the data scientists of 2019.

Apache Airflow

Now we’re getting slightly more niche, but this one is cool. Airflow is a Python platform to programmatically author, schedule and monitor workflows using directed acyclic graphs (DAGs).


This basically just means you can easily set your Python or bash scripts to run when you want, as often as you want. As opposed to a cron job, which is less convenient and less customizable, Airflow gives you control over your scheduled jobs in a user-friendly GUI. Super dope.
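A minimal sketch of what a DAG looks like in code (the DAG name, schedule and retrain_model function are made up, and the operator import paths differ slightly between Airflow 1.x and 2.x):

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash_operator import BashOperator      # airflow.operators.bash in 2.x
from airflow.operators.python_operator import PythonOperator  # airflow.operators.python in 2.x


def retrain_model():
    # Placeholder for whatever training logic you would otherwise run by hand
    print("retraining...")


# Run every night at 2am; the Airflow UI then shows the status of every run
dag = DAG(
    dag_id="nightly_retrain",
    start_date=datetime(2019, 1, 1),
    schedule_interval="0 2 * * *",
    catchup=False,
)

pull_data = BashOperator(task_id="pull_data", bash_command="python pull_data.py", dag=dag)
retrain = PythonOperator(task_id="retrain", python_callable=retrain_model, dag=dag)

pull_data >> retrain  # pull the data first, then retrain
```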

Elasticsearch

Just as niche. This one is a bit particular, and depends on whether or not you have search/NLP use cases. However, I can tell you from working at a Fortune 50 company that we have a ton of search use cases, and this is one of the most important frameworks in our stack. As opposed to building something from scratch in Python, Elastic provides everything you need, along with a convenient Python client.

Elasticsearch lets you easily index and search documents in a fault-tolerant and scalable manner: the more data you have, the more nodes you spin up, and your queries keep executing quickly. Elastic scores results with the Okapi BM25 algorithm, which is functionally pretty similar to TF-IDF (term frequency-inverse document frequency, which Elastic used before switching to BM25). It has a ton of other bells and whistles, and even supports custom plugins for things like multilingual analyzers.

Since it essentially performs a similarity comparison between the query and the documents in your index, it can also be used to compare documents to each other. Before you even import TF-IDF from scikit-learn, I highly recommend seeing whether Elasticsearch gives you everything you need right out of the box.
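Here’s a rough sketch with the official elasticsearch Python client, assuming a local single-node cluster on the default port and a made-up "articles" index (these calls match the pre-8.x client; newer versions replace body= with keyword arguments):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumes a local, single-node cluster

# Index a couple of made-up documents
docs = [
    {"title": "Serving models with Flask", "text": "Expose predictions behind a REST endpoint."},
    {"title": "Scheduling jobs with Airflow", "text": "Author and monitor DAGs of batch tasks."},
]
for i, doc in enumerate(docs):
    es.index(index="articles", id=i, body=doc)
es.indices.refresh(index="articles")  # make the new documents searchable right away

# Full-text search, scored under the hood with BM25
results = es.search(index="articles", body={"query": {"match": {"text": "model predictions"}}})
for hit in results["hits"]["hits"]:
    print(hit["_score"], hit["_source"]["title"])

# Document-to-document similarity with a more_like_this query
similar = es.search(
    index="articles",
    body={
        "query": {
            "more_like_this": {
                "fields": ["text"],
                "like": [{"_index": "articles", "_id": 0}],
                "min_term_freq": 1,
                "min_doc_freq": 1,
            }
        }
    },
)
```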

Homebrew

Ubuntu has apt-get, Red Hat has yum, and Windows 10 even has OneGet (as I was informed by a comment below). These package managers help you install things via the CLI, manage dependencies, and automatically update your $PATH. While macOS comes with nothing comparable out of the box, Homebrew can be installed with a single terminal command, and it’s in this article essentially because it’s one of my favorite things.

Say you want to install Apache Spark locally for some counterintuitive reason. You could go to https://spark.apache.org/downloads.html, download it, unzip it and add the spark-shell command to your $PATH yourself, or you could just type brew install apache-spark in a terminal and be on your way (note you’d also need Java and Scala installed to use Spark). I know this is macOS-specific, but I love it so much I couldn’t not include it.

And Many More….

I probably download a couple of new packages or frameworks every day (much to my computer’s chagrin). The field is changing so rapidly that it’s tough to keep on top of what’s new, and even harder to tell what’s actually useful. In my opinion, however, everything above is here to stay, with more to come. Hope you enjoyed it, and feel free to reach out to me on LinkedIn. Later.