Python tutorial May 10, 2016

Easier data analysis in Python with pandas (video series)

Summary: If you're working with data in Python, learning pandas will make your life easier! I love teaching pandas, and so I created a video series targeted at beginners. There are currently 37 videos in the series.

Why learn pandas?

pandas is a powerful, open source Python library for data analysis, manipulation, and visualization. If you're working with data in Python and you're not using pandas, you're probably working too hard!

There are many things to like about pandas: It's well-documented, has a huge amount of community support, is under active development, and plays well with other Python libraries (such as matplotlib, scikit-learn, and seaborn).

There are also things you might not like: pandas has an overwhelming amount of functionality (so it's hard to know where to start), and it provides too many ways to accomplish the same task (so it's hard to figure out the best practices).

That's why I created this series. I've been using and teaching pandas for a long time, and so I know how to explain pandas in a way that is understandable to novices.

About the video series

You don't need to have any pandas experience to benefit from this series, but you do need to know the basics of Python.

In each video, I answer a question from one of my students using a real dataset. Since I've posted the data online, and pandas can read files directly from a URL, you can follow along with every video at home!

Every video in the series is embedded below. There are currently 37 videos in the series, but more may be added in the future. (Subscribe on YouTube for notifications.)

There's also a well-commented Jupyter notebook containing the code from every video, and a GitHub repository containing all of the datasets.

Do you have a question about pandas, or a task you would like to accomplish? Let me know in the comments section!

List of videos

Embedded videos with descriptions

pandas is a full-featured Python library for data analysis, manipulation, and visualization. This video series is for anyone who wants to work with data in Python, regardless of whether you are brand new to pandas or have some experience. Each video will answer a student question about pandas using a real dataset, which is available online so you can follow along!

"Tabular data" is just data that has been formatted as a table, with rows and columns (like a spreadsheet). You can easily read a tabular data file into pandas, even directly from a URL! In this video, I'll walk you through how to do that, including how to modify some of the default arguments of the read_table function to solve common problems.

DataFrames and Series are the two main object types in pandas for data storage: a DataFrame is like a table, and each column of the table is called a Series. You will often select a Series in order to analyze or manipulate it. In this video, I'll show you how to select a Series using "bracket notation" and "dot notation", and will discuss the limitations of dot notation. I'll also demonstrate how to create a new Series in a DataFrame.

To access most of the functionality in pandas, you have to call the methods and attributes of DataFrame and Series objects. In this video, I'll discuss some common methods and attributes, and show you how to tell the difference between them. (Hint: It's all about the parentheses!)

You will often want to rename the columns of a DataFrame so that their names are descriptive, easy to type, and don't contain any spaces. In this video, I'll demonstrate three different strategies for renaming columns so that you can choose the best strategy to fit your particular situation.

If you have DataFrame columns that you're never going to use, you may want to remove them entirely in order to focus on the columns that you do use. In this video, I'll show you how to remove columns (and rows), and will briefly explain the meaning of the "axis" and "inplace" parameters.

pandas allows you to sort a DataFrame by one of its columns (known as a "Series"), and also allows you to sort a Series alone. The sorting API changed in pandas version 0.17, so in this video, I'll demonstrate both the "old way" and the "new way" to sort. I'll also show you how to sort a DataFrame by multiple columns at once!

Let's say that you only want to display the rows of a DataFrame which have a certain column value. How would you do it? pandas makes it easy, but the notation can be confusing and thus difficult to remember. In this video, I'll work up to the solution step-by-step using regular Python code so that you can truly understand the logic behind pandas filtering notation.

Let's say that you want to filter the rows of a DataFrame by multiple conditions. In this video, I'll demonstrate how to do this using two different logical operators. I'll also explain the special rules in pandas for combining filter criteria, and end with a trick for simplifying chained conditions!

In this video, I'm answering a few of the pandas questions I've received in the YouTube comments:

When reading from a file, how do I read in only a subset of the columns or rows?

How do I iterate through a Series or a DataFrame?

How do I drop all non-numeric columns from a DataFrame?

How do I know whether I should pass an argument as a string or a list?

When performing operations on a pandas DataFrame, such as dropping columns or calculating row means, it is often necessary to specify the "axis". But what exactly is an axis? In this video, I'll help you to build a mental model for understanding the axis parameter so that you will know when and how to use it.

pandas includes powerful string manipulation capabilities that you can easily apply to any Series of strings. In this video, I'll show you how to access string methods in pandas (along with a few examples), and then end with two bonus tips to help you maximize your efficiency.

Have you ever tried to do math with a pandas Series that you thought was numeric, but it turned out that your numbers were stored as strings? In this video, I'll demonstrate two different ways to change the data type of a Series so that you can fix incorrect data types. I'll also show you the easiest way to convert a boolean Series to integers, which is useful for creating dummy/indicator variables for machine learning.

The pandas "groupby" method allows you to split a DataFrame into groups, apply a function to each group independently, and then combine the results back together. This is called the "split-apply-combine" pattern, and is a powerful tool for analyzing data across different categories. In this video, I'll explain when you should use a groupby and then demonstrate its flexibility using four different examples.

When you start working with a new dataset, how should you go about exploring it? In this video, I'll demonstrate some of the basic tools in pandas for exploring both numeric and non-numeric data. I'll also show you how to create simple visualizations in a single line of code!

Most datasets contain "missing values", meaning that the data is incomplete. Deciding how to handle missing values can be challenging! In this video, I'll cover all of the basics: how missing values are represented in pandas, how to locate them, and options for how to drop them or fill them in.

The DataFrame index is core to the functionality of pandas, yet it's confusing to many users. In this video, I'll explain what the index is used for and why you might want to store your data in the index. I'll also demonstrate how to set and reset the index, and show how that affects the DataFrame's shape and contents.

In part two of our discussion of the index, we'll switch our focus from the DataFrame index to the Series index. After discussing index-based selection and sorting, I'll demonstrate how automatic index alignment during mathematical operations and concatenation enables us to easily work with incomplete data in pandas.

Have you ever been confused about the "right" way to select rows and columns from a DataFrame? pandas gives you an incredible number of options for doing so, but in this video, I'll outline the current best practices for row and column selection using the loc, iloc, and ix methods.

We've used the "inplace" parameter many times during this video series, but what exactly does it do, and when should you use it? In this video, I'll explain how "inplace" affects methods such as "drop" and "dropna", and why it is always False by default.

Are you working with a large dataset in pandas, and wondering if you can reduce its memory footprint or improve its efficiency? In this video, I'll show you how to do exactly that in one line of code using the "category" data type, introduced in pandas 0.15. I'll explain how it works, and how to know when you shouldn't use it.

Have you been using scikit-learn for machine learning, and wondering whether pandas could help you to prepare your data and export your predictions? In this video, I'll demonstrate the simplest way to integrate pandas into your machine learning workflow, and will create a submission for Kaggle's Titanic competition in just a few lines of code!

In this video, I'm answering a few of the pandas questions I've received in the YouTube comments:

Could you explain how to read the pandas documentation?

What is the difference between ufo.isnull() and pd.isnull(ufo)?

Why are DataFrame slices inclusive when using .loc, but exclusive when using .iloc?

How do I randomly sample rows from a DataFrame?

If you want to include a categorical feature in your machine learning model, one common solution is to create dummy variables. In this video, I'll demonstrate three different ways you can create dummy variables from your existing DataFrame columns. I'll also show you a trick for simplifying your code that was introduced in pandas 0.18.

Let's say that you have dates and times in your DataFrame and you want to analyze your data by minute, month, or year. What should you do? In this video, I'll demonstrate how you can convert your data to "datetime" format, enabling you to access a ton of convenient attributes and perform datetime comparisons and mathematical operations.

During the data cleaning process, you will often need to figure out whether you have duplicate data, and if so, how to deal with it. In this video, I'll demonstrate the two key methods for finding and removing duplicate rows, as well as how to modify their behavior to suit your specific needs.

If you've been using pandas for a while, you've likely encountered a SettingWithCopyWarning. The proper response is to modify your code appropriately, not to turn off the warning! In this video, I'll show you two common scenarios in which this warning arises, explain why it's occurring, and then demonstrate how to address it.

Have you ever wanted to change the way your DataFrame is displayed? Perhaps you needed to see more rows or columns, or modify the formatting of numbers? In this video, I'll demonstrate how to change the settings for five common display options in pandas.

Have you ever needed to create a DataFrame of "dummy" data, but without reading from a file? In this video, I'll demonstrate how to create a DataFrame from a dictionary, a list, and a NumPy array. I'll also show you how to create a new Series and attach it to the DataFrame.

Have you ever struggled to figure out the differences between apply, map, and applymap? In this video, I'll explain when you should use each of these methods and demonstrate a few common use cases. Watch the end of the video for three important announcements!

One of the most powerful features in pandas is multi-level indexing (or "hierarchical indexing"), which allows you to add extra dimensions to your Series or DataFrame objects. But when should you use a MultiIndex, and how do you create, slice, and merge MultiIndexed objects?

If you want to combine multiple datasets into a single pandas DataFrame, you'll need to use the "merge" function. In this video, you'll learn exactly what happens during a merge operation, as well as how to use the four different types of joins. By the end of the video, you'll be fully prepared to merge your own DataFrames!

In the last 20 months, the pandas library has been updated 10 times, introducing hundreds of new features, bug fixes, and API changes. In this video, I'll show you 4 new pandas tricks that will make your life easier!

In the last 20 months, the pandas library has been updated 10 times, introducing hundreds of new features, bug fixes, and API changes. In this video, I'll explain 5 changes to the pandas API that you need to know in order to stay current with the library.

You're about to learn 25 tricks that will help you to work faster, write better pandas code, and impress your friends. These are the BEST tricks I've learned from 5 years of teaching Python's pandas library.

The pandas library is a powerful tool for multiple phases of the data science workflow, including data cleaning, visualization, and exploratory data analysis. However, the size and complexity of the pandas library makes it challenging to discover the best way to accomplish any given task.

In this tutorial, you'll use pandas to answer questions about a real-world dataset. Through each exercise, you'll learn important data science skills as well as "best practices" for using pandas. By the end of the tutorial, you'll be more fluent at using pandas to correctly and efficiently answer your own data science questions.

During this two-hour webcast, I answered 45 viewer questions about pandas, the leading Python library for data analysis, exploration, and manipulation. View the complete list of questions on Crowdcast.

P.S. Want to be the first to know when I release a new pandas video? Subscribe to the Data School newsletter: