Data Science Resources



Useful data science resources and recommended study routes. Updated occasionally.

Online Courses

| Title | Author | Thoughts | Level |
| --- | --- | --- | --- |
| Coursera: Machine Learning | Andrew Ng | General overview, not much detail. Labs are in MATLAB, which is not desirable. | Introductory |
| Stanford Statistical Learning | Trevor Hastie, Robert Tibshirani | Online lectures following the text Introduction to Statistical Learning. Example R modules, minimal self-evaluations. | Introductory |
| Stanford CS229 Machine Learning | Andrew Ng, John Duchi | A broad, technical overview of Machine Learning. Written problem sets on ML theory. | Average |
| Coursera: Neural Networks for Machine Learning | Geoffrey Hinton | Wide overview of several neural network models, including non-standard ones, such as Hopfield nets and Restricted Boltzmann Machines. Labs are in MATLAB, which is not desirable. | Average |
| Stanford CS231n Convolutional Neural Networks for Visual Recognition | Fei-Fei Li, Andrej Karpathy, Justin Johnson | Well-written online modules, video lectures on YouTube. Completed the assignments, in which you write neural network architecture in Python. Strongly recommend modules + assignments for understanding NNs, CNNs, RNNs. | Average - Advanced |

All course content should be available for free. The paid Coursera certification is not really important.

Texts

| Title | Author | Thoughts | Level |
| --- | --- | --- | --- |
| An Introduction to Statistical Learning | Gareth James, Daniela Witten, Trevor Hastie, Robert Tibshirani | Good introductory book on machine learning for those with a statistical background. Includes R modules. | Introductory |
| Mining of Massive Datasets | Jure Leskovec, Anand Rajaraman, Jeffrey D. Ullman | Practical knowledge about data mining and machine learning, with real-life applications. | Average |
| The Elements of Statistical Learning | Trevor Hastie, Robert Tibshirani, Jerome Friedman | Advanced version of An Introduction to Statistical Learning. Includes R modules. | Advanced |
| Pattern Recognition and Machine Learning | Christopher M. Bishop | Have not read in detail. | - |

Other

| Title | Author | Thoughts | Level |
| --- | --- | --- | --- |
| HackOn(Data) Workshop Material | Armando Benitez | Great notebooks to learn Apache Spark on Databricks, Machine Learning with Spark. Adapted from edX Spark labs. | Introductory - Average |
| TensorFlow Tutorial and Examples for Beginners | Aymeric Damien | Well-constructed Jupyter notebooks for learning TensorFlow. | Introductory |

Recommended Study Routes

Prerequisites

Make sure you have sufficient theoretical background in statistics, linear algebra, and multivariable calculus. Most university students should be adequately prepared after second-year classes in these subjects.

Acquire a basic background in Python, including the following libraries: NumPy, Matplotlib, SciPy, and pandas. There are many resources available online. I particularly like this one for NumPy, Matplotlib, and SciPy.
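
As a rough gauge of what a "basic background" in NumPy might look like, here is a minimal sketch of array creation, an axis-wise reduction, broadcasting, and a matrix-vector product. The array values and the names `X`, `w`, and `scores` are made up for illustration and do not come from any of the resources listed here.

```python
import numpy as np

# A small 2-D array: 3 samples (rows) with 2 features (columns).
X = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])

# Column means, computed without an explicit Python loop.
col_means = X.mean(axis=0)

# Broadcasting subtracts the length-2 mean vector from every row.
X_centered = X - col_means

# Matrix-vector product: one score per sample for a hypothetical weight vector.
w = np.array([0.5, -1.0])
scores = X @ w

print(col_means, X_centered.mean(axis=0), scores)
```

If each line here reads naturally, the NumPy-based assignments in the courses above should be approachable.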

It is also useful to know R and Scala (for Apache Spark).





Machine Learning

Start off with the canonical Coursera Machine Learning course by Andrew Ng. It will give you a high-level overview of machine learning that is not too technical. You can stop this course once you feel you have developed sufficient intuition for machine learning.

If you have a statistical background, opt for the Stanford Statistical Learning course and study An Introduction to Statistical Learning. Otherwise, read the lecture notes for Stanford CS229 Machine Learning for a more technical introduction to machine learning.
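
To give a feel for the kind of material introductory machine learning courses typically start with, here is a minimal sketch of linear regression fitted by batch gradient descent, in plain Python. It is not taken from any of the courses above; the function name `fit_line`, the learning rate, the step count, and the toy data are all assumptions chosen for illustration.

```python
def fit_line(xs, ys, lr=0.05, steps=2000):
    """Fit y ~ w*x + b by batch gradient descent on mean squared error."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        # Residuals of the current line on every training point.
        errors = [w * x + b - y for x, y in zip(xs, ys)]
        # Gradients of (1/n) * sum(error^2) with respect to w and b.
        grad_w = (2.0 / n) * sum(e * x for e, x in zip(errors, xs))
        grad_b = (2.0 / n) * sum(errors)
        # Step against the gradient.
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


# Toy data generated from y = 2x + 1 (illustrative only).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]
w, b = fit_line(xs, ys)
print(round(w, 3), round(b, 3))
```

On this noiseless data the fitted slope and intercept should land close to 2 and 1. With real data, the learning rate and number of steps usually need tuning.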





Neural Networks & Deep Learning

For a general theoretical overview of neural networks, complete the Coursera Neural Networks for Machine Learning course by Geoffrey Hinton.

For a deeper and more technical understanding of neural networks, read the modules for Stanford CS231n Convolutional Neural Networks for Visual Recognition and complete the assignments. It is important that you complete the assignments, in which you will actually write neural network layers.

Afterwards, begin learning computational frameworks for deep learning, such as TensorFlow or Theano (I recommend TensorFlow), as well as deep learning libraries, such as Keras and Caffe. Then start building your own neural networks, and figure out how to train them on GPUs.
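
To illustrate what "writing a neural network layer" can mean, here is a minimal sketch of a fully connected (affine) layer and a ReLU, each with a forward and a backward pass, in plain Python. This is not the CS231n assignment code. The function names, the single-vector interface, and the toy numbers are assumptions for illustration, and real assignments would typically use NumPy and batches of inputs.

```python
def affine_forward(x, W, b):
    """Compute out[j] = sum_i x[i] * W[i][j] + b[j] for one input vector."""
    out = [sum(x[i] * W[i][j] for i in range(len(x))) + b[j]
           for j in range(len(b))]
    return out, (x, W)  # cache the inputs needed by the backward pass


def affine_backward(dout, cache):
    """Given the upstream gradient dout, return gradients for x, W and b."""
    x, W = cache
    dx = [sum(W[i][j] * dout[j] for j in range(len(dout)))
          for i in range(len(x))]
    dW = [[x[i] * dout[j] for j in range(len(dout))] for i in range(len(x))]
    db = list(dout)
    return dx, dW, db


def relu_forward(z):
    """Elementwise max(0, z); the input is cached for the backward pass."""
    return [max(0.0, v) for v in z], z


def relu_backward(dout, cache):
    """Pass the gradient through only where the input was positive."""
    return [d if v > 0 else 0.0 for d, v in zip(dout, cache)]


def loss(x, W, b):
    """Toy scalar loss: sum of the ReLU activations."""
    out, _ = affine_forward(x, W, b)
    act, _ = relu_forward(out)
    return sum(act)


# Toy values (illustrative only).
x = [1.0, -2.0]
W = [[0.5, -1.0, 2.0],
     [1.5, 0.5, -0.5]]
b = [0.1, 0.2, 0.3]

# Analytic gradient of the loss with respect to x via backpropagation.
out, affine_cache = affine_forward(x, W, b)
act, relu_cache = relu_forward(out)
dact = [1.0] * len(act)  # d(sum)/d(act[j]) = 1
dx, dW, db = affine_backward(relu_backward(dact, relu_cache), affine_cache)

# Finite-difference estimate of the same gradient, as a sanity check.
eps = 1e-6
numeric_dx = []
for i in range(len(x)):
    x_plus = list(x)
    x_minus = list(x)
    x_plus[i] += eps
    x_minus[i] -= eps
    numeric_dx.append((loss(x_plus, W, b) - loss(x_minus, W, b)) / (2 * eps))

print(dx, numeric_dx)
```

Comparing the backpropagated gradient against a finite-difference estimate, as in the last lines, is a common way to catch mistakes in a hand-written backward pass.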





Big Data

Familiarize yourself with cloud computing services. I recommend beginning with AWS, which offers a free tier. I don’t think there is a need to take an entire course on cloud computing, as you will learn a lot by doing. Try to launch your own virtual machines and use them to run your models. Also try integrating them with AWS storage services.

Learn the basics of Apache Spark, a distributed computing engine designed for big data. I did this through the HackOn(Data) Workshops, but there are plenty of other resources available. Then, try launching a Spark cluster on the cloud, either through a service like AWS EMR or Azure HDInsight, or by bootstrapping your own cluster (My Guide).

As for the rest, learn as you need.





Note: Keep in mind that you can only learn so much through reading. Data science is about doing! Try Kaggle competitions, or fool around with fun datasets.



