Now it’s time for the good stuff. A glossary of machine learning tools as they stand in early 2019. I’ve organized this list from lowest level and most customizable, to highest level and most abstracted.

On your laptop

Scientific Python

If you’re new to Python you’re going to need to get familiar with the Scientific Python stack. You’ll rely on these packages for data exploration, preprocessing, debugging, prototyping, and visualization.

Numpy

Numpy is the foundation of scientific Python. It defines N-dimensional array objects and an entire world of methods to manipulate them. Everything from multiplying matrices to Fourier transforms is included. The linear algebra is implemented in C/C++ and Fortran, providing great performance. You’ll want to avoid writing loops by vectorizing your code, which can be a little tricky at first. Link
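The loop-avoidance point is easier to see side by side. A minimal sketch: the same sum of squares, once with a Python-level loop and once vectorized so the work happens in Numpy's compiled routines.

```python
import numpy as np

a = np.arange(1_000_000, dtype=np.float64)

# Slow way: an explicit Python loop over every element.
def sum_of_squares_loop(arr):
    total = 0.0
    for x in arr:
        total += x * x
    return total

# Vectorized way: one expression, no Python-level loop,
# so the iteration happens in compiled C/Fortran code.
def sum_of_squares_vectorized(arr):
    return float(np.dot(arr, arr))

# Both give the same answer; the vectorized one is dramatically faster.
assert abs(sum_of_squares_loop(a[:1000]) - sum_of_squares_vectorized(a[:1000])) < 1e-6
```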

Scipy

Whereas Numpy provides basic data structures and linear algebra operations, Scipy contains a broader set of scientific computing algorithms covering probability, statistics, optimization, integration, interpolation, and more. Again, C / C++ and Fortran implementations are under the hood. Link
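A small taste of those algorithms, assuming Scipy is installed: minimizing a simple convex function with the general-purpose optimizer.

```python
from scipy import optimize

# Minimize f(x) = (x - 3)^2. Scipy handles the numerics;
# the minimum should land at x = 3.
result = optimize.minimize_scalar(lambda x: (x - 3) ** 2)
```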

Pandas

Pandas is by far the most popular data exploration and manipulation package in Python. At its core is the concept of a DataFrame which acts as an in-memory relational database. You can do basic operations on columns and rows as well as aggregations on arbitrary groups. Basic time series routines are also included. Pandas is a bit of a memory hog, so be careful. Link
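The relational-database comparison in practice: a toy DataFrame with a column aggregation over arbitrary groups.

```python
import pandas as pd

# A toy table of sales records.
df = pd.DataFrame({
    "region": ["east", "west", "east", "west"],
    "units":  [10, 5, 3, 8],
})

# Aggregate over an arbitrary group, here total units per region,
# much like a SQL GROUP BY.
totals = df.groupby("region")["units"].sum()
```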

Matplotlib

Matplotlib is the de facto plotting library for scientific Python. It contains a large number of predefined plot types, but also exposes low level APIs you could use to build GUIs if you wanted to. Link
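A minimal example of one of those predefined plot types, using the non-interactive Agg backend so it renders without a display:

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen, no display needed
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.plot([0, 1, 2, 3], [0, 1, 4, 9], label="y = x^2")
ax.set_xlabel("x")
ax.set_ylabel("y")
ax.legend()
fig.savefig("squares.png")  # write the figure to disk
```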

NetworkX

NetworkX is an easy to use, well supported network package. It has support for almost any type of graph and has methods for computing almost every graph metric under the sun. The one downside to NetworkX is that it’s implemented in pure Python and can be slow and a big memory hog. Link
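A quick sketch of the API, assuming NetworkX is installed: build a small graph, then compute a couple of the metrics mentioned above.

```python
import networkx as nx

# A small undirected graph.
G = nx.Graph()
G.add_edges_from([("a", "b"), ("b", "c"), ("c", "d"), ("a", "d"), ("a", "c")])

path = nx.shortest_path(G, "b", "d")  # a shortest path from 'b' to 'd'
degree_of_a = G.degree("a")           # number of edges touching 'a'
```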

Graph-tool is a faster, C++ based alternative, but isn’t as well supported and is a pain to install on some systems. Link

Jupyter Notebooks

Jupyter notebooks are interactive documents you can use to write, execute, and share code. You can think of them as an interactive Python shell in your browser. All of the code and output is saved locally as JSON. Notebooks have become the default way for data scientists and machine learning engineers to share prototypes and examples. With GitHub now rendering notebooks natively, you’ll be sure to run into them. Link

General Machine Learning

Before neural networks came back into vogue, it was all about boring old machine learning. Support vector machines, random forests, even linear regression are still useful techniques you should try first before jumping straight into deep learning. Many of these tools will use Numpy and Scipy routines under the hood and all of them come with a bunch of algorithms right out of the box.

Sklearn

Sklearn is the most popular general machine learning package for Python. It contains high level APIs for training a huge variety of models such as linear regressions, GLMs, Random Forests, and SVMs, just to name a few. There are also plenty of utilities to compute confusion matrices, ROC and AUC, and other statistics related to your model. Sklearn is supported by default on all of the major cloud-based machine learning platforms as well as Apple’s mobile-friendly Core ML format. Link
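The canonical sklearn workflow, load a dataset, split it, fit, score, looks like this:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Load a built-in toy dataset and hold out a test set.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# Fit a model and measure accuracy on the held-out data.
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
accuracy = accuracy_score(y_test, model.predict(X_test))
```

Swapping in a RandomForestClassifier or an SVM is a one-line change, which is much of sklearn's appeal.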

XGBoost

XGBoost is an open source implementation of distributed Gradient Boosting available on nearly every data science and big data platform. Gradient Boosting is an extremely powerful technique that combines multiple “weak” models into a large ensemble. Though each individual model may not be very accurate, combining the predictions of all models in the ensemble often produces great results that generalize well to the broader sets of data you find in production. XGBoost has been used to win more Kaggle competitions than any other modeling technique, but can be a bit of a black box. Link
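To illustrate the weak-learners-into-an-ensemble idea without requiring XGBoost itself, here is sklearn's reference gradient boosting implementation; XGBoost's sklearn-style wrapper (`xgboost.XGBClassifier`) is used in almost exactly the same way.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# 50 shallow trees: each one is a weak learner, and boosting fits
# every new tree to the errors of the ensemble so far.
booster = GradientBoostingClassifier(n_estimators=50, max_depth=2, random_state=0)
booster.fit(X_train, y_train)
score = booster.score(X_test, y_test)
```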

Statsmodels

Statsmodels is a high level API used to construct and fit statistical models like multivariate regression, GLMs, and more. Statsmodels will feel familiar to those who have defined multivariate linear models in R or used modeling software like SPSS or STATA. The library also has a bunch of useful utility functions for performing hypothesis testing, regression diagnostics, and specification tests. Statsmodels doesn’t have a converter to mobile formats like Core ML so if that’s your end destination, it’s best to use Sklearn. Link

Pomegranate

Pomegranate is a new kid on the block providing a high level API for building probabilistic models from data. This includes things like mixture models, Bayes classifiers, and Markov models. Link

Spark / Hadoop

Two years ago you couldn’t talk about machine learning without hearing about tools like Hadoop and Spark. Now, they barely get any mention at all as deep learning and neural networks dominate the hype cycle. Plenty of problems can be solved with traditional machine learning algorithms in Spark’s MLLib and you should absolutely consider one of these solutions before jumping straight to TensorFlow. Many companies are already using these tools to populate data warehouses and dashboards making it easier to get results than starting from scratch with another service to manage. Link

SystemML

SystemML is a former IBM, now Apache project aimed at making it easier to create optimized machine learning and analytics pipelines. It comes with a bunch of ML algorithms out of the box that can be used for classification or regression problems, plus high level APIs and primitives that let users write their own. SystemML uses Spark as a backend by default, but can be used with others like Hadoop. SystemML is most closely related to the MLLib that is included with Spark, but adds some additional abstractions and optimizations. Link

Deep Learning

If you’ve just got to try the latest and greatest, deep learning is it. Convolutional neural networks have proven incredibly capable when it comes to tasks like image recognition, object detection, and language translation. There is no shortage of tools to get started with.

TensorFlow

TensorFlow is one of the fastest growing and most popular open source software projects of all time. You’ve probably heard of TensorFlow in association with neural networks and deep learning, but it is a general framework for executing numeric operations using data flow graphs.

In other words, you can perform almost any mathematical operation on matrix-like data. More practically, beginners should be aware that TensorFlow has a deferred execution model that can be tricky. You execute code in two parts: first build your computation graph, then feed data through it. This can take some getting used to, and it can make it hard to track down errors.
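The build-then-run split is easier to see in miniature. This is not TensorFlow itself, just a plain-Python sketch of the deferred model: building nodes only records operations, and nothing computes until you run the graph with concrete inputs.

```python
# A toy deferred-execution graph.
class Node:
    def __init__(self, fn, inputs):
        self.fn, self.inputs = fn, inputs

    def run(self, feed):
        # A placeholder looks itself up in the feed dict;
        # every other node recursively evaluates its inputs first.
        if self.fn is None:
            return feed[self]
        return self.fn(*(node.run(feed) for node in self.inputs))

def placeholder():
    return Node(None, [])

def add(a, b):
    return Node(lambda x, y: x + y, [a, b])

def mul(a, b):
    return Node(lambda x, y: x * y, [a, b])

# Phase 1: build the graph for (a + b) * b. No math happens here.
a, b = placeholder(), placeholder()
out = mul(add(a, b), b)

# Phase 2: feed data through the graph.
result = out.run({a: 2, b: 3})  # (2 + 3) * 3 = 15
```

Notice that a bug in the graph only surfaces in phase 2, far from the line that built the bad node, which is exactly why debugging deferred models can be painful.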

TensorFlow was originally implemented in C++ and Python, but there are now official bindings for Go and Java and unofficial support for JavaScript through other open source projects like TensorFire. There are also two (for some reason) mobile flavors, TensorFlow Mobile and TensorFlow Lite, available for deploying models in apps or IoT devices. Link

Keras

Keras is a high level API for building neural networks. Keras makes it simple to chain together various predefined layers or write your own. Actual computation is done using one of two supported backends: TensorFlow or the now defunct Theano. If you’re just getting started with neural networks, don’t need to do anything crazy, and want to stay within the TensorFlow family, Keras is your best bet. Link

Caffe (1 and 2)

Caffe is a deep learning framework originally developed at Berkeley. It was an early favorite with some heavyweights like Facebook, but lately it’s lost some luster as other entrants like TensorFlow have gained popularity. Caffe APIs are available in C++ and Python. Facebook has continued development with Caffe 2, adding better support for distributed training and inference, newer hardware, and mobile deployment. When you’re looking through Caffe projects, be aware of the version number (1 or 2). While converters work fairly well, there’s no guarantee that you can translate freely between the two. Link

PyTorch

PyTorch is a relative newcomer to the deep learning scene, celebrating its second birthday in January 2019. In that short time, though, it’s gained considerable traction and is the framework of choice for popular deep learning courses like fast.ai. The biggest difference between PyTorch and other frameworks is that PyTorch is imperative. When you execute a line of code, it’s actually executed. Models don’t get compiled and evaluated later. Many programmers find this a lot more intuitive. As you may have guessed from its name, PyTorch is Python first and only (for now). Link
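The imperative style in a few lines, assuming PyTorch is installed: every statement runs the moment it executes, and autograd tracks the operations so gradients are available immediately.

```python
import torch

# Each line runs right away -- no graph compilation, no session.
x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
y = (x ** 2).sum()   # computed here and now
y.backward()         # d(sum of x^2)/dx = 2x

grad = x.grad
```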

Theano

Theano is an older framework similar to TensorFlow in that it builds and executes arbitrary computation graphs for Tensors. It was one of the first packages that made transparent use of GPUs rather than having to write GPU calculations yourself. Shortly after releasing version 1.0 in late 2017, maintainers announced that active development would cease as it had become clear that TensorFlow had won. Theano is Python only. Link

Neon

Neon is a deep learning framework developed by Intel. It’s unclear if it offers anything at the API level that would warrant switching from something you are currently running on, but they offer a lot of optimization for Intel chips (big surprise) so if that’s a big concern for you, it might be worth checking out. Link

MXNet

MXNet is (yet another) deep learning framework, this time with support from Apache and Amazon. Though you won’t find quite as many examples written with MXNet as with other frameworks, you get the best of both worlds with support for both a deferred computation model and an imperative option (called Gluon). MXNet also boasts an impressively large set of supported languages including Python, C++, JavaScript, Go, Scala, Julia, R, and Matlab. Because it’s used by Amazon, it plays nice with everything AWS. Link

Turi Create

Turi Create is the highest level unmanaged deep learning tool available. Open sourced after Apple’s acquisition of its namesake, Turi Create makes it extremely easy to train custom models that do very specific tasks like image recognition. Turi Create is ten times easier to use than Keras, which is ten times easier to use than raw TensorFlow. If you just want to get a model that does something, anything, for a project, start here. Turi Create only supports Python right now, but uses MXNet under the hood so you can leverage GPUs and port your model to other platforms later. Check out this guide on making a model to identify hotdogs and not hotdogs to see just how easy it is to use. Link

Create ML

Released by Apple in the summer of 2018, Create ML is a high level Swift-based toolkit for training Core ML models directly in Xcode. A combination of notebook style programming via Swift playgrounds and a drag and drop GUI makes it very easy for beginners to get started. Models are mobile-first, designed to be used directly in apps. Link