Epistemic status / motivation: My file / folder naming and management is all over the place and I hate it. I want to standardize my workflow. Since I am a student, this means that, as usual, I am under-qualified and over-opinionated.

TL;DR: My machine learning workflow, from simple to sophisticated, including the tools and resources I use at each step. Look at the pictures.

Simple, isn’t it?

Machine learning is very simple. Just grab some data, pick an algorithm, and run it. I personally use Python. The very basic libraries you need are:

https://numpy.org/ (Arrays and fast numerical computing.)

https://pandas.pydata.org/ (DataFrames for tabular data.)

And one of the following for visualization:

https://matplotlib.org/ (The most basic, but hard to make pretty.)

https://seaborn.pydata.org/ (Has some pretty cool templates.)

https://altair-viz.github.io/ (The best in every way, except for performance and the 5,000-row limit.)

or a combination thereof.
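A minimal sketch of the import boilerplate I mean (the toy DataFrame is made up, just to confirm the whole stack is installed):

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# A made-up toy DataFrame, just to check everything works end to end
df = pd.DataFrame({"x": np.arange(10), "y": np.arange(10) ** 2})
sns.lineplot(data=df, x="x", y="y")  # Seaborn draws on top of Matplotlib
plt.show()
```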

I really hate environment setup, even with Anaconda around. I prefer, and recommend, Google Colab. It is like Google Drive, but for Python. The best part: you can run your code on Google's servers, on a Tesla K80 GPU, for free, for up to 12 hours. Although in my experience, I usually get disconnected much earlier than that.
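If you want to confirm what GPU Colab actually gave you, a quick sanity check in a notebook cell (the `!` prefix runs a shell command on the runtime):

```python
# In a Colab/Jupyter cell: the ! prefix runs a shell command
!nvidia-smi  # lists the attached GPU, e.g. a Tesla K80
```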

There are many open datasets for practice, for example the UCI Machine Learning Repository (the Iris dataset that shows up later in this post comes from there) and Kaggle.

Once you've picked a dataset, all you need next is an algorithm. You don't have to write your own; just use a ready-made library like scikit-learn. If you are really into deep learning, you can pick up Keras. And if Keras is still too restrictive for you, you can jump straight to TensorFlow or PyTorch, which are what researchers use.
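A minimal sketch of what "just use scikit-learn" looks like, using the Iris data that ships with the library and a random forest (my choices for illustration, not a prescription):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

model = RandomForestClassifier(random_state=0)  # default hyperparameters
model.fit(X_train, y_train)
print(model.score(X_test, y_test))  # mean accuracy on held-out data
```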

But if you really want to be a real programmer, you'd better code your own module yourself. Jokes aside, I did code my first neural network in 11 lines of Python by following this tutorial, and I learned a great deal from it. If you are a student studying this, not just someone who wants to deploy quick and dirty, those 11 lines are really worth typing, just as a learning experience.
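I won't reproduce the tutorial verbatim, but a tiny network in that same spirit, a two-layer sigmoid net trained with bare-hands backpropagation on toy XOR-style data (assumed here, not the tutorial's exact code), looks roughly like this:

```python
import numpy as np

X = np.array([[0, 0, 1], [0, 1, 1], [1, 0, 1], [1, 1, 1]])  # toy inputs
y = np.array([[0, 1, 1, 0]]).T                              # XOR-style targets
syn0 = 2 * np.random.random((3, 4)) - 1  # weights: input -> hidden
syn1 = 2 * np.random.random((4, 1)) - 1  # weights: hidden -> output
for _ in range(60000):
    l1 = 1 / (1 + np.exp(-X.dot(syn0)))    # hidden layer (sigmoid)
    l2 = 1 / (1 + np.exp(-l1.dot(syn1)))   # output layer (sigmoid)
    l2_delta = (y - l2) * (l2 * (1 - l2))  # output error * sigmoid gradient
    l1_delta = l2_delta.dot(syn1.T) * (l1 * (1 - l1))  # backpropagated error
    syn1 += l1.T.dot(l2_delta)             # weight updates
    syn0 += X.T.dot(l1_delta)
print(l2.round(2))  # should be close to y after training
```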

Each algorithm (I'm going to call it a model from now on) comes with its own set of numbers that you have to tune. For example, how many neurons in a neural network? These numbers are called hyperparameters. For now, use the defaults.
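To make "hyperparameters" concrete: in scikit-learn they are just constructor arguments (the specific numbers below are arbitrary, not recommendations):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier

rf = RandomForestClassifier(n_estimators=100, max_depth=5)  # trees and depth
nn = MLPClassifier(hidden_layer_sizes=(32, 16))  # neurons per hidden layer

# Leaving the arguments out keeps the (usually sensible) defaults:
rf_default = RandomForestClassifier()
```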

Evaluate your prediction

Okay, so you are able to make predictions (or clusters). But how good are they? scikit-learn has you covered; it provides more metrics than I will ever need. The standard way to start is to use mean squared error for regression and the F1 score for classification.
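A sketch of both calls (the toy labels are invented just to show the API):

```python
from sklearn.metrics import mean_squared_error, f1_score

# Invented values, standing in for a real model's output
y_true_reg, y_pred_reg = [3.0, 2.5, 4.0], [2.8, 2.7, 3.6]
y_true_cls, y_pred_cls = [0, 1, 1, 0], [0, 1, 0, 0]

print(mean_squared_error(y_true_reg, y_pred_reg))  # regression: MSE
print(f1_score(y_true_cls, y_pred_cls))            # classification: F1
```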

As I hope this post has made obvious, I am a very visual person. When I'm doing regression, I plot my predictions against the label (actual) values, as well as a residual plot, to get a better sense of what my model is doing (or not doing). For classification, the equivalent plot is a confusion matrix.

How is my prediction doing? (Plot from https://towardsdatascience.com/random-forest-in-python-24d0893d51c0)

Learn more about residual plots to understand your model better.
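A rough sketch of both plots, using invented numbers in place of a real model's output:

```python
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import confusion_matrix

# Invented predictions, standing in for a real model's output
y_true_reg, y_pred_reg = [3.0, 2.5, 4.0, 5.1], [2.8, 2.7, 3.6, 5.0]
y_true_cls, y_pred_cls = [0, 1, 1, 0, 1], [0, 1, 0, 0, 1]

# Residual plot: prediction error against predicted value
residuals = [t - p for t, p in zip(y_true_reg, y_pred_reg)]
plt.scatter(y_pred_reg, residuals)
plt.axhline(0, color="grey")
plt.xlabel("predicted")
plt.ylabel("residual")
plt.show()

# Confusion matrix drawn as a heatmap
cm = confusion_matrix(y_true_cls, y_pred_cls)
sns.heatmap(cm, annot=True, fmt="d")
plt.xlabel("predicted")
plt.ylabel("actual")
plt.show()
```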

For the prediction vs. actual plot, I prefer Seaborn's jointplot, as it nicely includes a histogram on both axes and shows the density of overlapping data points. I use kind="kde". Notice how they use the UCI Iris dataset.

I usually use Seaborn's jointplot for the prediction vs. actual plot.
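Something like this, with fake noisy "predictions" generated on the spot:

```python
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Fake model output: predictions are the actuals plus noise
rng = np.random.RandomState(0)
actual = rng.normal(size=200)
predicted = actual + rng.normal(scale=0.3, size=200)
df = pd.DataFrame({"actual": actual, "predicted": predicted})

# kind="kde" shows the density of overlapping points
sns.jointplot(x="actual", y="predicted", data=df, kind="kde")
plt.show()
```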

Try different models

Okay, so you've evaluated your predictions and gained a lot of insight from them. What can you do now? Well, try a bunch of different models and see which one is best (a comparison sketch follows the list). Models to try:

Linear regression

Polynomial regression (add some regularization)

Trees & forests

Neural Network and Deep Learning

Support vector machines
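A sketch of what "try a bunch of models" can look like in scikit-learn, here using classifier counterparts of the list above (my mapping, for illustration), compared by 5-fold cross-validation on Iris:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
models = {
    "linear (logistic)": LogisticRegression(max_iter=1000),
    "tree": DecisionTreeClassifier(),
    "forest": RandomForestClassifier(),
    "neural network": MLPClassifier(max_iter=2000),
    "support vector": SVC(),
}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5)  # 5-fold cross-validation
    print(f"{name}: mean accuracy {scores.mean():.3f}")
```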

Data are usually not clean. Remember, garbage in, garbage out.

Anyone who has worked with raw data knows how ugly it gets. Sometimes your data are spread across many different tables that have to be unified into one. Then you will have missing values (null and NaN), or a bunch of zeros, or some kind of Null Island, or numbers so stupidly large they must be a mistake. There are so many ways things can go wrong, and most of the time, they do. Data preprocessing is one of the biggest time consumers in any data science / machine learning project. You can google these terms (a pandas sketch follows the list):

Data preprocessing

Data munging

Data wrangling

Data cleansing

Data preparation
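Most of those searches lead back to pandas. A small sketch of the usual moves, on hypothetical tables invented for illustration:

```python
import numpy as np
import pandas as pd

# Two hypothetical raw tables that belong together
users = pd.DataFrame({"id": [1, 2, 3], "age": [25, np.nan, 170]})
scores = pd.DataFrame({"id": [1, 2, 3], "score": [0.9, 0.0, 0.7]})

df = users.merge(scores, on="id")                 # unify the tables into one
df["age"] = df["age"].fillna(df["age"].median())  # fill missing values (NaN)
df = df[df["age"] < 120]                          # drop stupidly large values
print(df)
```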

Visualize your data

Again, I love to get some kind of "feel" for my data, so I never fail to make a scatterplot matrix. That plot was done using Seaborn. Matplotlib doesn't have a built-in function for scatterplot matrices, and Altair requires a few extra lines. I find Seaborn best for this. Again, the UCI Iris dataset; it is everywhere.
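For reference, the scatterplot matrix is a single call in Seaborn (load_dataset fetches the Iris data over the network):

```python
import matplotlib.pyplot as plt
import seaborn as sns

iris = sns.load_dataset("iris")    # the ubiquitous UCI Iris dataset
sns.pairplot(iris, hue="species")  # scatterplot matrix, colored by class
plt.show()
```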