The Learning Path

First, you need an overview of what ML is, and some popular algorithms to put in your toolkit. You already finished this part if you made it this far in my series! You can now follow these remaining steps (explained in greater detail further on) to push this further and become an expert.

1. Weka. Start building models without any programming. Quickly experiment with the various models you have seen, and explore and preprocess data in a graphical interface.
2. Python. Learn a bit of programming to move beyond the graphical interface. It’s easy to get started and learn the basics, and you won’t need the more advanced bits for ML.
3. Scikit-learn. Build models in as little as 5 lines of code! With its consistent interface, you can try out many different models, tune hyperparameters, evaluate results, and do a lot more in very little time.
4. More fundamentals. Learn the things that help you get better results in the real world: feature engineering and the bias-variance trade-off. You should also learn more about neural networks and the choice of layers, activations, optimizers, etc., because they are a pretty diverse family of models.
5. Rest of the stack. Learn Pandas, Numpy and Matplotlib, which form the basic AI stack for processing data, performing descriptive and inferential statistical computations, and visualizing data respectively. These are all Python libraries that help you work more efficiently and easily.
6. Keras. Implement your own Deep Neural Networks, including CNNs and RNNs, to build models for complex problems. Keras models compile into an object that can be called through the same Scikit-learn-style interface, and specifying the layers is really simple with its API, so there’s not much to take in at all!
7. [Optional] If you are more interested in a specific domain like Vision or Language, learn libraries specific to that domain, like OpenCV or NLTK respectively.

Depending on your prior experience with programming and math, your time to completion may vary, but on average, if you spend just 5 hours a week on this, you should expect to get a basic grasp of everything in 8 weeks. Of course, taking that basic grasp to expertise is a matter of practice (I’ll share some suggestions on that later!).

After all this, if you want to achieve an even deeper understanding, I strongly recommend you take up Andrew Ng’s Machine Learning course. It is the best, it is free, and it is highly valuable for understanding the math and intuition behind the concepts you use. Some high-school math may be a prerequisite, and at times it may seem challenging, but trust me, it’s totally worth it! Also, you may want to learn TensorFlow along the way, more for the paradigm than for the results (you can do almost anything practical without dealing with some of TensorFlow’s pains).

Step 1: Weka

Weka is a graphical machine learning tool. This means you need no prior programming experience to use it effectively. You can easily load in any Excel or CSV file (even databases!). Then, you can preprocess that data, e.g. by deleting columns, converting string fields to numerical fields, and more! You can also see descriptive statistics like the mean or standard deviation, and visualize how different parameters correlate with one another. Most importantly, however, you can go to the “Classify” tab and train a model with one of the various classification / regression algorithms that you have seen!

There’s plenty of tutorials on YouTube about Weka, but I also recommend you check out Jason Brownlee’s tutorial here.

Step 2: Python

Programming is an important and useful skill to learn, and not only for ML. There are limits to how much (and how quickly) you can do with Weka, so eventually you will need to start writing code. Python is an amazing programming language for beginners to learn because it’s concise, intuitive and simple.

This page lists a huge variety of interactive resources for beginners to learn Python. You might also enjoy Codecademy which has free Python courses. If you have time and are willing to dive into further detail and build something cool (a search engine!) while at it, this Udacity course is awesome!

Step 3: Scikit-learn

Scikit-learn is an extremely powerful library that provides many ML models right out of the box, as well as a ton of other useful features like test datasets, evaluation functions, and more. You can check out the full list of features here (caution: it may be overwhelming). All you need to do is find the models you already know about, like Logistic Regression or Neural Networks (known there as Multi-Layer Perceptrons). Clicking a model’s class name opens its documentation, which lists all the parameters you can (but don’t have to!) tune, along with their descriptions. That said, most models run fine with their default settings, without any configuration parameters at all.

Screenshot from SkLearn docs

The workflow is simple:

1. Load data. You can use an inbuilt test dataset, or load your own from an Excel/CSV file. To do so, you will need to import the pandas library and call the read_csv or read_excel function. Pandas has a lot of other useful features, but you can learn them later (Step 5)! Here’s some basic background on creating, saving and loading DataFrames (a fancy name for tables). Usually, you also split your data into training and testing datasets here.
2. Instantiate your model. Once you import the relevant module from sklearn, you can instantiate your model, e.g. cla = RandomForestClassifier() from sklearn.ensemble, supplying any parameters as necessary.
3. Train your model. Now that you have an instantiated model, train it on your dataset by calling the fit method on your object.
4. Test your model. Finally, test your model on the testing dataset by calling predict on your object.

All of these steps are covered in this getting started guide, and sketched in code below!
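To make that concrete, here is a minimal sketch of those four steps using scikit-learn’s bundled iris dataset and a random forest; your own CSV loaded with pandas would slot in the same way.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# 1. Load data (or read your own file with pandas) and split it.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# 2. Instantiate the model, supplying any parameters you want to tune.
cla = RandomForestClassifier(n_estimators=100)

# 3. Train the model on the training set.
cla.fit(X_train, y_train)

# 4. Test the model on the held-out test set.
predictions = cla.predict(X_test)
print("Accuracy:", accuracy_score(y_test, predictions))
```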

Step 4: More Fundamentals

From the first three steps, you will have a fair idea of how to build your own awesome models. As you practice and play around, you will want to apply them to real-world data, and sometimes you will realize that your models don’t perform as well as you would expect them to. You then start thinking about what went wrong, and these fundamentals will help.

The first things you need to recall are the concepts of underfitting, known as high bias, and overfitting, known as high variance. Generally, if the training error (error on training data) is not low enough, and the test error (error on test data) is about the same, we have high bias. This means that the model isn’t complex or sophisticated enough to fit even the training data well. On the other hand, if the training error is much lower than the test error, we have high variance, because the training data is fit too well. The training error will usually be lower than the test error, because it is measured on data the model has already seen. Also, the training error usually increases with more data, because it’s easy to fit a few points well but increasingly harder to fit many points well. On the other hand, the test error usually decreases with more data, because the model has seen more examples of right answers.

Together, these let us use the learning curve, the plot of training and test error against the amount of training data, to diagnose whether our model suffers from high bias or high variance. Once we have diagnosed the problem, we can solve it more easily.
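As a rough sketch of how you might plot one, here is an example using scikit-learn’s learning_curve helper; the digits dataset and logistic regression model are just stand-ins for your own data and model.

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

X, y = load_digits(return_X_y=True)
model = LogisticRegression(max_iter=1000)

# Train on increasingly large subsets of the data, cross-validating each time.
train_sizes, train_scores, test_scores = learning_curve(
    model, X, y, cv=5, train_sizes=np.linspace(0.1, 1.0, 5)
)

# Plot error (1 - accuracy) against the amount of training data.
plt.plot(train_sizes, 1 - train_scores.mean(axis=1), label="training error")
plt.plot(train_sizes, 1 - test_scores.mean(axis=1), label="validation error")
plt.xlabel("Number of training examples")
plt.ylabel("Error")
plt.legend()
plt.show()
```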

High variance can be addressed by adding more data, adding regularization, or sometimes by performing dimensionality reduction. High bias can be addressed by making the model more complex, for example by adding layers to a neural network or increasing the maximum tree depth in a decision-tree-based model. Also, adding meaningful features by transforming existing ones, which is known as feature engineering, even when done manually, can significantly improve the performance of your model. This may sometimes require domain knowledge about the problem you are modelling. For example, I am currently working on a model that takes in player statistics from PUBG (a video game) to predict a player’s rank in a given match. Since it is a team game, giving the model the mean statistics of the player’s team, for example, rather than only the individual player’s, may help it perform better. Adding such a feature would enhance the model’s predictive power and thereby its performance.
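For illustration, here is a small sketch of that kind of group-level feature using pandas; the column names (match_id, team_id, kills) are made up for the example and not taken from any real PUBG dataset.

```python
import pandas as pd

# Hypothetical per-player match data; the column names are made up.
df = pd.DataFrame({
    "match_id": [1, 1, 1, 1, 2, 2],
    "team_id":  [1, 1, 2, 2, 1, 1],
    "kills":    [3, 1, 0, 5, 2, 4],
})

# Engineered feature: each player's team-average kills in that match,
# so the model sees team-level context rather than only individual stats.
df["team_mean_kills"] = df.groupby(["match_id", "team_id"])["kills"].transform("mean")
print(df)
```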

Step 5: Rest of the Stack

Ask any data scientist what the four most important libraries for the job are, and they will say pandas, numpy, matplotlib and sklearn. Ideally, you would learn the first three before sklearn, but you don’t need to. Personally, being able to actually build something first matters more to me than building it perfectly, and with sklearn alone you can build awesome things. However, to polish your work, you might need to do a bunch of other things, and that’s what these libraries provide.

Pandas is a data processing and file I/O library. You can read and write datasets, compute statistics about them, modify them by adding fields in complex ways, and a lot more! It’s an extremely powerful library, and it might take months to truly master it. For the purposes of Machine Learning, I think you’re most likely to use it for feature engineering, as described above, and for data exploration. Pandas allows for some really complex operations, like grouping, aggregating and joining over multiple subsets of data, combining various tables, and so on, which makes feature engineering a lot easier. The predefined helper functions for computing statistics also make exploring the data a breeze.
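As a rough illustration of the kind of thing pandas makes easy, here is a short sketch; the file and column names (players.csv, matches.csv, country, score, match_id) are hypothetical.

```python
import pandas as pd

# Hypothetical files and column names, for illustration only.
players = pd.read_csv("players.csv")   # one row per player per match
matches = pd.read_csv("matches.csv")   # one row per match

# Descriptive statistics for every numeric column.
print(players.describe())

# Group and aggregate: average score per country.
print(players.groupby("country")["score"].mean())

# Join two tables on a shared key.
combined = players.merge(matches, on="match_id")
```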

Numpy is a library for linear algebra computation. You will likely not need to use it directly for ML, because many of the libraries you will be using (like sklearn or tensorflow) use it under the hood for you, but you should at least know of it, because it really is integral to how your code runs. Essentially, a lot of ML is working with vectors and matrices, and plain Python isn’t fast at dealing with them. Because numpy wraps optimized routines written in lower-level languages, it can perform vectorized computations much faster than ordinary Python loops. If you want to pursue the more general field of data science, be prepared to use it a lot. Also, libraries like tensorflow expose a lot of numpy-like features, so the skills transfer directly.
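A tiny sketch of why this matters: the two versions below compute the same element-wise product, but the numpy one avoids the slow Python-level loop.

```python
import numpy as np

a = np.random.rand(1_000_000)
b = np.random.rand(1_000_000)

# Plain Python: loops over a million elements one at a time.
slow = [x * y for x, y in zip(a, b)]

# NumPy: the same element-wise product in one vectorized call,
# executed by optimized compiled code under the hood.
fast = a * b
```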

Matplotlib, as the name might suggest, is a plotting tool. You can build really cool charts and graphs quickly and easily, and it is certainly one of the most heavily used libraries. Another library, seaborn, adds features on top of matplotlib; among other things, it makes your charts prettier than the default matplotlib look and feel. You will most likely use matplotlib for exploring data and finding insights such as correlations between features, so you can decide which model would work best. A lot of ML is about intuition: nothing works well in every case, and figuring out which model to use is a developed ability. As you look through different charts, you can apply your intuition to reason about which techniques would offer the best results.
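Here is a small sketch of the kind of exploratory plotting you might do; the file and column names are hypothetical, and the heatmap assumes the columns are numeric.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical dataset and column names, for illustration only.
df = pd.read_csv("data.csv")

# A quick scatter plot of two features with matplotlib.
plt.scatter(df["feature_a"], df["feature_b"])
plt.xlabel("feature_a")
plt.ylabel("feature_b")
plt.show()

# A correlation heatmap with seaborn (assumes all columns are numeric).
sns.heatmap(df.corr(), annot=True)
plt.show()
```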

While you’re at it, you also want to learn Jupyter. It is a runtime, not a library, and it might take some getting used to. After doing some ML, you might find it very inefficient to run the whole script from top to bottom even though you only changed one small thing. Sometimes the loading and preprocessing steps take several minutes, and it really disturbs your workflow if you need to wait that long after every small change. In a Jupyter notebook, you have individual blocks of code that you can run separately, while preserving the overall state of the environment. You can also use Google Colab, which provides free cloud GPU time to train larger models!

Step 6: Keras

Keras is a Deep Learning library that’s really easy to use. You should read the quickstart guide on their page, which shows you how to build your first model in as little as 30 seconds! The best part? Once you set up and compile the model, you can call fit and predict on it just like sklearn models! You can also save models to files and load them back, and tune all the hyperparameters. It has a variety of activation functions, layer types and virtually everything you need to build sophisticated DNNs!
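Here is a minimal sketch of that pattern; the random data and layer sizes are arbitrary placeholders, not a recipe for any particular problem.

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

# Placeholder data: 1000 samples with 20 features each, binary labels.
X = np.random.rand(1000, 20)
y = np.random.randint(0, 2, size=(1000,))

# Define the network layer by layer.
model = keras.Sequential([
    keras.Input(shape=(20,)),
    layers.Dense(64, activation="relu"),
    layers.Dense(1, activation="sigmoid"),
])

# After compiling, the model exposes fit/predict much like an sklearn estimator.
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(X, y, epochs=5, batch_size=32)
predictions = model.predict(X[:10])
```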

You should also learn more about the intuition behind Convolutional Neural Nets and Recurrent Neural Nets here and here respectively (he’s a great YouTuber to learn ML!).

Getting Help: More Resources!

There’s plenty of ML resources out there, but not all of them have the same target audience or prioritize the same things! So if you think a certain resource is too advanced for your liking, don’t quit! That said, the best sources would be:

- Email newsletters. These are super-helpful and regular, so you can stay updated and gradually progress towards being a pro. My favorites are ML Mastery (who, by the way, has tutorials about everything I discussed!) and PyImageSearch (if you’re interested in computer vision). Find a newsletter that you like, and they’ll deliver byte-sized tutorials to you every week so you can keep sharpening your skills!
- Andrew Ng’s Course. This is, I believe, the third time I am mentioning him, because his course is AMAZING if you want to delve deep and really understand everything going on under the hood. It is less relevant if you only want to apply ML, but even then, at some point you’ll want to understand how something works in order to use it best. Please do consider going through it.
- Reddit. Sub-reddits like this one are a great place to find other students as well as mentors who can help you through the problems you run into, and to stay updated with articles about the latest developments in the field. I know researchers who use Reddit to keep track of what’s new and find research papers worth reading!
- Medium. From reinforcement learning to tensorflow to anything else you can think of, there are a ton of Medium bloggers sharing their experiences and writing tutorials.
- Other Coursera/edX/Udacity MOOCs. To be honest, I’m not a fan of a lot of the courses out there. Some are genuinely well taught (like Andrew Ng’s course), while others are taught in a dry fashion, or aimed at people with a strong math background and discuss only theory instead of practice. However, if you ever think of pursuing ML at university, these certainly give an equivalent experience, so definitely go for them.

Practicing

It is unsurprising that you are only going to get better with practice. Join clubs, pick up side projects, and just keep building stuff until you figure out what works and what doesn’t. A lot of it is about building solid intuition about how you can achieve certain things. In each step, even if it is just Weka, try to really spend time training the models inside your brain, so that they deliver optimal performance!

Your best practice will come from applying ML to a real-world problem that you, your company or your clients face. Real data is messy and noisy, and it takes a lot of work to build an entire pipeline around it. Your next best option is Kaggle. It has hundreds of interesting, clean datasets and competitions on which to sharpen your skills. Remember, basic models like Random Forests or XGBoost will often produce good enough results! Try not to go all out and use neural networks where you don’t need them just because they sound cool!

Summary

In summary, this article showed you a learning path with the tools you need for doing ML well, and LOTS of resources to learn them. You saw what they are, why you need them, and how you can do a lot in as little as 5 lines of code (or even zero lines, with Weka). You also got some more fundamentals, as well as learning and practicing tips. Feel free to leave any questions you have in the comments below!

Hope you liked this series! I do plan to add more ML tutorials in the future based on the response I get for this one, but for now this is it for this series. Let me know if you have any specific requests for the future, and I’ll try to keep that in mind. Show your support by following, clapping, sharing and commenting!