Series is the backbone of DataFrame

Agenda

The two central data structures of Pandas are Series and DataFrame. This post is an attempt to have a proper understanding of Pandas series.

The foundation of a DataFrame is a Series. The docstring of DataFrame defines a DataFrame as:

Can be thought of as a dict-like

container for Series objects

Many operations on dataframe return series instance. It is thus essential that we have a solid understanding of Series.

What is a series

A Pandas series can be conceptualized in two ways. It can be envisioned as a single column of tabular data. It can also be envisioned as a single row of tabular data.

Let’s assume there is a database table called accounting which stores revenue and expenses across different years.

year revenue expense

2017 1000 800

2018 1200 900

2019 1500 1100

The collection of all the revenues is a Series , i.e [1000, 1200, 1500] is a series. The series is a collection of same attribute of different objects/rows/instances.

Series can also be thought of as a single row of a the table, i.e [2017, 1000, 800] is a series too. In this case the series is a collection of different attributes of single object.

Creating a series

In a real world scenario, you will rarely create a series. You will probably be provided with a csv or some other data source from where you will read the data. pandas has utilities to read data from these data sources.

The read data will be a pandas dataframe. You will extract some series out of the dataframe and operate on the series.

This is an easy task though. For now, let’s explicitly create a series

A Series can be created using pandas.Series .

Let’s try it on ipython shell.

In [2]: revenue = pd.Series([1000, 1200, 1500, 800, 900], index=['2014', '2015', '2016', '2017', '2018'])

index keyword argument assigns label to different values of series. These labels come handy in retrieving data from series.

Indexing a series

Let’s get information out of this series. Let’s retrieve the revenue for they year 2016.

In [8]: revenue['2016']

Out[8]: 1500

Series also has two methods for indexing, they are .loc and .iloc .

.loc does label based indexing.

In [9]: revenue.loc['2016']

Out[9]: 1500

.iloc does integer based indexing. 2016 is the third entry, i.e indexed 2.

In [10]: revenue.iloc[2]

Out[10]: 1500

revenue.iloc['2016'] would fail. revenue.loc[2] would fail too.

Slicing a series

We can slice a series using the same syntax as list slicing. We want to get a slice from 2014 to 2016.

In [17]: revenue['2014': '2016']

Out[17]:

2014 1000

2015 1200

2016 1500

dtype: int64

Notice that the returned data structure is a pandas.Series too.

In [18]: type(revenue['2014': '2016'])

Out[18]: pandas.core.series.Series

If you want to use integer positions for slicing, you will have to use .iloc . Integer positions can only be used with an iloc . It cannot be used with loc .

In [20]: revenue.iloc[0:3]

Out[20]:

2014 1000

2015 1200

2016 1500

dtype: int64

The returned object is again a Series.

If we want information for multiple years, we will have to pass a list to the indexing syntax. Let’s get revenue for years 2014 and 2017.

In [23]: revenue[['2014', '2017']]

Out[23]:

2014 1000

2017 800

dtype: int64

We could have used revenue.loc[['2014', '2017']] too.

2014 is the first entry and 2017 is the fourth entry in the series. We could also use revenue.iloc[[0, 3]] .

In [25]: revenue.iloc[[0, 3]]

Out[25]:

2014 1000

2017 800

dtype: int64

As you should have noticed, slicing a series always returns another series.

Getting labels of the series

We want a list of all the years for which we have revenue.

In [93]: revenue.index

Out[93]: Index(['2014', '2015', '2016', '2017', '2018'], dtype='object') In [94]: list(revenue.index)

Out[94]: ['2014', '2015', '2016', '2017', '2018']

As we saw a Series object has an attribute called index which returns the labels.

Filtering a series

We want to get all years where revenue is greater than 1000.

In [50]: revenue[revenue > 1000]

Out[50]:

2015 1200

2016 1500

dtype: int64

We will understand how it works under the hood in next section.

Filtering using a list of booleans

List of booleans is the underlying mechanism which enables filtering on series.

We want to get revenue for 2014 and 2017. 2014 is positioned 0 and 2017 is positioned 3.

We will have to create a list of same length as revenue and set 0th and 3rd element as True.

In [75]: l = [True, False, False, True, False] # Set 0th and 3rd element True In [76]: revenue[l]

Out[76]:

2014 1000

2017 800

dtype: int64

We passed a list of booleans to indexing syntax and it filtered the series to return another series. The returned series only contains values where the corresponding element was True in the boolean list.

Instead of list, we could create a series of length 5 with boolean value and use the boolean series with revenue .

In [78]: boolean_series = pd.Series([True, False, False, True, False], index=['2014', '2015', '2016', '2017', '2018']) In [79]: revenue[boolean_series]

Out[79]:

2014 1000

2017 800

dtype: int64

Since we wanted to use the boolean series with revenue, we had to ensure that the indexes of both series match.

Let’s see what revenue > 1000 give us:

In [80]: revenue > 1000

Out[80]:

2014 False

2015 True

2016 True

2017 False

2018 False

dtype: bool

It returns us a series which is similar to boolean_series we explicitly created.

Since it’s a series, we could directly use this with revenue and do revenue[revenue > 1000] . This is exactly what we did in the last section on filtering.

More filtering

We want to get all the years where revenue is greater than 1000 but less than 1300.

In [68]: revenue[(revenue > 1000) & (revenue < 1300)]

Out[68]:

2015 1200

dtype: int64

Arithmetic operation on series

Suppose we realise that each year’s revenue was off by 100. We want to increase each year’s revenue by 100 to make up for it.

In [81]: revenue + 100

Out[81]:

2014 1100

2015 1300

2016 1600

2017 900

2018 1000

dtype: int64

Notice how each value of the series increased by 100.

Any arithmetic operation on series is applied to all the values of the series.

Ordering on series

We want to sort the revenues in ascending order.

In [87]: revenue.sort_values()

Out[87]:

2017 800

2018 900

2014 1000

2015 1200

2016 1500

dtype: int64

We want to sort the revenues in descending order.

In [88]: revenue.sort_values(ascending=False)

Out[88]:

2016 1500

2015 1200

2014 1000

2018 900

2017 800

dtype: int64

We want to know the year which had maximum revenue.

In [92]: revenue.sort_values(ascending=False).index[0]

Out[92]: '2016'

revenue.sort(ascending=False) returns an ordered series. And since a series has attribute index , we could use it and get the label of year with maximum revenue.

There is a better method to achieve this though. We will see idxmax() in next section.

Aggregation on series

Let’s find out total revenue across all the years.

In [96]: revenue.sum()

Out[96]: 5400

Series has utilities to easily find the average, maximum and minimum too.

In [102]: revenue.mean()

Out[102]: 1080.0 In [103]: revenue.max()

Out[103]: 1500 In [104]: revenue.min()

Out[104]: 800

We want to know the year which had maximum revenue.

In [106]: revenue.idxmax()

Out[106]: '2016'

Grouping and aggregation on series

Let’s assume there could be multiple rows for revenue for a particular year.

In [2]: revenue_with_multiple_entries = pd.Series([1000, 1200, 1500, 800, 900, 500], index=['2014', '2015', '2016', '2017', '2018', '2017']) In [3]: print(revenue_with_multiple_entries)

2014 1000

2015 1200

2016 1500

2017 800

2018 900

2017 500

dtype: int64

We have multiple rows for revenue of 2017. We want to know the total revenue across different years.

We can achieve it by doing a groupby and then doing a sum .

In [4]: revenue_with_multiple_entries.groupby(revenue_with_multiple_entries.index).sum()

Out[4]:

2014 1000

2015 1200

2016 1500

2017 1300

2018 900

dtype: int64

Most common value of series

Let’s create a series of names of students in a class.

In [6]: names = pd.Series(['steve jobs', 'bill gates', 'mark twain', 'charles darwin', 'charles dickens', 'mark zuckerberg', 'charles darwin']) In [7]: print(names)

0 steve jobs

1 bill gates

2 mark twain

3 charles darwin

4 charles dickens

5 mark zuckerberg

6 charles darwin

You should have noticed that we did not provide explicit index to the series. So an integer label was automatically assigned.

Let’s find the most common name.

In [10]: names.value_counts()

Out[10]:

charles darwin 2

bill gates 1

mark zuckerberg 1

mark twain 1

steve jobs 1

charles dickens 1

String operation on series

Let’s find all the names which start with mark .

In [16]: names[names.str.startswith('mark')]

Out[16]:

2 mark twain

5 mark zuckerberg

dtype: object

names.str.startswith('mark') returns a boolean array which we are passing to indexing syntax to filter the series.

A pandas series has str attribute which allows executing string operations.

apply method

Series has a method called apply which allows doing advanced filtering.

We want to find all names having surname as darwin .

In [27]: names[names.apply(lambda x: x.split(' ')[-1] == 'darwin')]

Out[27]:

3 charles darwin

6 charles darwin

dtype: object

apply method expects a function to be passed to it. That’s why we passed it a lambda function.

Calling names.apply() applies the lambda function to all values of the series. The return value of .apply() is a series of booleans.

In [26]: names.apply(lambda x: x.split(' ')[-1] == 'darwin')

Out[26]:

0 False

1 False

2 False

3 True

4 False

5 False

6 True

dtype: bool