With this article, we, OpenDataScience, launch an open Machine Learning course. We are not trying to develop yet another comprehensive introductory course on machine learning or data analysis, so this is not a substitute for fundamental education, online/offline courses and specializations, or books. The purpose of this series of articles is to quickly refresh your knowledge and help you find topics for further study. Our approach is similar to that of the authors of the Deep Learning book, which starts off with a review of mathematics and the basics of machine learning: short, concise, and with many references to other resources.

UPD: YouTube playlist with video lectures

The course is designed to balance theory and practice; therefore, each topic is followed by an assignment due in one week. You can also take part in several Kaggle InClass competitions held during the course.

All materials are available as a Kaggle Dataset and in a GitHub repo.

The course is going to be actively discussed in the OpenDataScience Slack team. Please fill in this form to be invited. The next session of the course will start on October 1, 2018. Invitations will be sent in September.

Article outline

1. About the course

2. Assignments

3. Demonstration of main Pandas methods

4. First attempt on predicting telecom churn

5. Assignment #1

6. Useful resources

1. About the course

Syllabus

Community

One of the biggest advantages of our course is its active community. If you join the OpenDataScience Slack team, you’ll find the authors of articles and assignments right there in the same channel (#eng_mlcourse_open), eager to help you. This can help a lot when you are taking your first steps in any discipline. Fill in this form to be invited. The form will ask you several questions about your background and skills, including a few easy math questions.

We chat informally and enjoy humor and emoji. Not every MOOC can boast such a lively community.

Prerequisites

The prerequisites are the following: basic concepts from calculus, linear algebra, probability theory and statistics, and Python programming skills. If you need to catch up, good resources are Part I of the “Deep Learning” book and various online math and Python courses (for Python, Codecademy will do). More info is available on the corresponding Wiki page.

What software you’ll need

For now, you’ll only need Anaconda (built with Python 3.6) to reproduce the code in the course. Later in the course, you’ll have to install other libraries such as XGBoost and Vowpal Wabbit.

You can also resort to the Docker container with all necessary software already installed. More info is available on the corresponding Wiki page.

2. Assignments

Each article comes with an assignment in the form of a Jupyter notebook. The task will be to fill in the missing code snippets and to answer questions in a Google Quiz form;

Each assignment is due in a week with a hard deadline;

Please discuss the course content (articles and assignments) in the #eng_mlcourse_open channel of the OpenDataScience Slack team or here in the comments to articles on Medium;

The solutions to assignments will be sent to those who have submitted the corresponding Google form.

3. Demonstration of main Pandas methods

Well... There are dozens of cool tutorials on Pandas and visual data analysis. If you are familiar with these topics, just wait for the 3rd article in the series, where we get into machine learning.

The following material is better viewed as a Jupyter notebook and can be reproduced locally with Jupyter if you clone the course repository.

Pandas is a Python library that provides extensive means for data analysis. Data scientists often work with data stored in table formats like .csv, .tsv, or .xlsx. Pandas makes it very convenient to load, process, and analyze such tabular data using SQL-like queries. In conjunction with Matplotlib and Seaborn, Pandas provides a wide range of opportunities for visual analysis of tabular data.
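To make "SQL-like queries" a bit more concrete, here is a minimal sketch. The tiny table, its column names (state, total_day_minutes, churn), and the variable names are made up for illustration and are not the course dataset; the text is read from an in-memory string so the snippet runs without any file on disk.

```python
import io

import pandas as pd

# Hypothetical toy data standing in for a .csv file on disk.
csv_text = """state,total_day_minutes,churn
KS,265.1,False
OH,161.6,False
NJ,243.4,True
OH,299.4,True
"""

# read_csv accepts a file path or any file-like object.
df = pd.read_csv(io.StringIO(csv_text))

# Rough analogue of:
#   SELECT * FROM df WHERE total_day_minutes > 200
heavy_users = df[df["total_day_minutes"] > 200]

# Rough analogue of:
#   SELECT state, AVG(total_day_minutes) FROM df GROUP BY state
mean_by_state = df.groupby("state")["total_day_minutes"].mean()

print(heavy_users)
print(mean_by_state)
```

With a real file you would pass its path to pd.read_csv instead of the StringIO object; for tab-separated files, the sep="\t" argument should do the job.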

The main data structures in Pandas are the Series and DataFrame classes. The former is a one-dimensional indexed array of some fixed data type. The latter is a two-dimensional data structure - a table - where each column contains data of the same type. You can think of it as a dictionary of Series instances. DataFrames are great for representing real data: rows correspond to instances (objects, observations, etc.), and columns correspond to features for each of the instances.
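The "dictionary of Series" view can be sketched in a few lines. The values, index labels, and column names below are invented for illustration only.

```python
import pandas as pd

# A Series: one-dimensional, indexed, with a single data type.
minutes = pd.Series([265.1, 161.6, 243.4], index=["a", "b", "c"])
calls = pd.Series([110, 123, 114], index=["a", "b", "c"])

# A DataFrame built from a dictionary of Series:
# keys become column names, and each column keeps its own dtype.
df = pd.DataFrame({"minutes": minutes, "calls": calls})

# Selecting a column returns a Series again.
calls_column = df["calls"]

# Selecting a row by its index label returns the features of one instance.
row_b = df.loc["b"]

print(df.dtypes)
print(row_b)
```

Here rows a, b, and c play the role of instances, and minutes and calls are their features.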