In the previous parts of the “Python for data science” series, we looked at:

Part 1: Basic built-in features in Python such as functions, data types, date/time, map, reduce, filter and lambda functions.

Part 2: The NumPy library for creating, accessing and manipulating arrays.

In this article, we will look at the most widely used library for data analysis: Pandas. How did it get its name? The name Pandas is derived from “panel data”, which comprises observations over multiple time periods for the same individuals.

Pandas provides easy-to-use data structures and data analysis tools for creating and manipulating datasets. We will be looking at the following features in Pandas:

Series and dataframes

Querying a series

Read and write files

Indexing

Merging

Aggregating

Filtering

As always, the first step is to import the library. Let’s import both the pandas and numpy libraries.

import pandas as pd

import numpy as np

1. Pandas Series

A Series is a one-dimensional data structure that can hold any data type such as integers and strings. It is similar to a list in Python.

First, let’s create a list.

name = ['Rohan','Joseph','Rohit']

name

Output : ['Rohan', 'Joseph', 'Rohit']

Now, let’s convert the same list into a Pandas series.

name = pd.Series(name)

name

Output :

0 Rohan

1 Joseph

2 Rohit

dtype: object

We can observe that a Pandas series shows the index along with the value in each position. Similarly, let’s create a dictionary and convert it to a Pandas series.

sport = {'cricket' : 'India',

'soccer' : 'UK',

'Football' : 'USA'}

sport = pd.Series(sport)

sport

Output :

cricket India

soccer UK

Football USA

dtype: object

The ‘keys’ of the dictionary become the index in the series and the ‘values’ of the dictionary remain as the values of the series. Let’s understand this further by querying a Pandas series.
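To make the mapping from dictionary to series concrete, here is a minimal sketch that rebuilds the same series and looks at its index and its values separately. The variable names `labels` and `values` are only illustrative.

```python
import pandas as pd

# Same dictionary as above: keys become index labels, values stay as values.
sport = pd.Series({'cricket': 'India', 'soccer': 'UK', 'Football': 'USA'})

labels = list(sport.index)    # index labels, in insertion order
values = list(sport.values)   # the stored values, in the same order

print(labels)
print(values)
```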

2. Querying a Pandas Series

Let’s continue with the series ‘sport’ created above and access its third value.

sport.iloc[2]

Output : 'USA'

‘iloc’ selects a value by its integer position, counting from 0. Now, let’s access the same value by its index label using ‘loc’.

sport.loc['Football']

Output : 'USA'
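The difference between position and label is easiest to see when the labels are themselves integers. The sketch below uses a made-up series `s` (not part of the example above) whose labels start at 10.

```python
import pandas as pd

# Hypothetical series whose integer labels do not match the positions.
s = pd.Series(['a', 'b', 'c'], index=[10, 20, 30])

by_position = s.iloc[0]   # first element by position
by_label = s.loc[10]      # element whose label is 10
last = s.iloc[-1]         # negative positions count from the end

print(by_position, by_label, last)
```

Here `s.iloc[0]` and `s.loc[10]` point to the same element, while `s.loc[0]` would raise a KeyError because no label 0 exists.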

Great, now let’s sum up the values in a series.

a = pd.Series([1,2,3])

np.sum(a) # np.sum is usually faster than the built-in sum() for large series

Output : 6
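As a small sketch, the same total can be computed in three ways. All three give the same result here; they differ only in how the work is done.

```python
import numpy as np
import pandas as pd

a = pd.Series([1, 2, 3])

total_np = np.sum(a)   # NumPy function applied to the series
total_pd = a.sum()     # equivalent Series method
total_py = sum(a)      # built-in sum, iterates element by element in Python

print(total_np, total_pd, total_py)
```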

Now, let’s add a value to an existing series. Assigning to a label that does not exist yet adds a new entry.

a = pd.Series([1,2,3])

a.loc['City'] = 'Delhi'

a

Output :

0 1

1 2

2 3

City Delhi

dtype: object
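Notice that the dtype in the output is now object. The sketch below checks the dtype before and after the assignment to show why: once a string sits next to integers, the series has to fall back to a general-purpose type.

```python
import pandas as pd

a = pd.Series([1, 2, 3])
dtype_before = a.dtype      # an integer dtype (typically int64)

a.loc['City'] = 'Delhi'     # new label, so the value is added to the series
dtype_after = a.dtype       # object, because the values are now mixed

print(dtype_before, dtype_after)
```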

Let’s create a series in which multiple values share the same index label.

b = pd.Series(['a','b','c','d'],index=['e','e','e','e'])

b

Output :

e a

e b

e c

e d

dtype: object
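With repeated labels, a label lookup no longer returns a single value. A minimal sketch of what happens when we query the series above by its only label:

```python
import pandas as pd

b = pd.Series(['a', 'b', 'c', 'd'], index=['e', 'e', 'e', 'e'])

matches = b.loc['e']            # every entry labelled 'e', returned as a Series
is_unique = b.index.is_unique   # tells us whether the labels are unique

print(list(matches), is_unique)
```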

3. Pandas Dataframe

A dataframe is a two-dimensional data structure whose columns can have different data types (string, integer, date etc.).

Let’s create a dataframe in Pandas. We are creating a dataset with three columns: Name, Occupation and Age.

df1 = pd.DataFrame([{'Name' : 'John', 'Occupation' : 'Data Scientist', 'Age' : 25},

{'Name' : 'David', 'Occupation' : 'Analyst', 'Age' : 28},

{'Name' : 'Mark', 'Occupation' : 'Teacher', 'Age' : 30}],

index=['1','2','3'])

df1

Output :

Now, let’s create a dataframe from two Series. Each Series becomes one row of the dataframe.

s1 = pd.Series({'Name' : 'Rohan',

'Age' : 25})

s2 = pd.Series({'Name' : 'Rohit',

'Age' : 28})

df1 = pd.DataFrame([s1,s2],index=['1','2'])

df1

Output :
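To check how the two Series were laid out, we can inspect the resulting dataframe. This is a small sketch that rebuilds the same data under the name `df`, with both ages stored as integers.

```python
import pandas as pd

s1 = pd.Series({'Name': 'Rohan', 'Age': 25})
s2 = pd.Series({'Name': 'Rohit', 'Age': 28})
df = pd.DataFrame([s1, s2], index=['1', '2'])

shape = df.shape             # (rows, columns)
columns = list(df.columns)   # the keys of the Series become column names
row = df.loc['2']            # a single row, returned as a Series

print(shape, columns)
print(row['Name'], row['Age'])
```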

4. Read and write files

Let’s look at how to read a CSV file.

iris = pd.read_csv('C:\\Users\\rohan\\Documents\\Analytics\\Data\\iris.csv')

Let’s see the top 5 rows of the file.

iris.head()

Output :

Save the dataframe as a new CSV file in the current working directory.

iris.to_csv('iris2.csv')
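By default, to_csv also writes the row index as an extra first column. The sketch below shows a round trip that avoids this with index=False. It uses a tiny made-up dataframe and a temporary file instead of the iris file, so the values and the file name are assumptions for illustration only.

```python
import os
import tempfile

import pandas as pd

# Small stand-in for the iris data; the values are made up.
df = pd.DataFrame({'SepalLength': [5.1, 7.0],
                   'Name': ['setosa', 'versicolor']})

path = os.path.join(tempfile.mkdtemp(), 'sample.csv')
df.to_csv(path, index=False)   # index=False skips writing the row index
restored = pd.read_csv(path)   # read the same file back

restored_columns = list(restored.columns)
print(restored_columns)
```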

5. Indexing Dataframes

Check the index of the iris dataset imported in the previous step.

iris.index.values

Output :

Now, change the index to the species name. The ‘Name’ column becomes the index and replaces the previous one.

b = iris.set_index('Name')

b.head()

Output :

To revert to the previous index, just reset the index as follows.

c = b.reset_index()

c.index.values

Output :
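The same two steps can be tried on a small dataframe without the iris file. This is a minimal sketch with made-up values; only the column names follow the iris example.

```python
import pandas as pd

df = pd.DataFrame({'Name': ['setosa', 'virginica'],
                   'SepalLength': [5.1, 6.3]})

indexed = df.set_index('Name')      # 'Name' moves from a column to the index
restored = indexed.reset_index()    # 'Name' becomes a regular column again

index_labels = list(indexed.index)
restored_index = list(restored.index)

print(index_labels, restored_index)
```

Both methods return a new dataframe, so `df` itself is left unchanged.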

6. Merging dataframes

Let’s import the iris data again and merge another data set with it.

df1 = pd.read_csv('iris.csv')

df1.head()

Output :

Create a new dataframe to merge with this.

df2 = pd.DataFrame([{'Name' : 'setosa', 'Species' : 'Species 1'},

{'Name':'versicolor','Species':'Species 2'},

{'Name':'virginica','Species':'Species 3'}])

df2

Output :

Merge the above two datasets on the Name column by performing an inner join.

df3 = pd.merge(df1,df2,how='inner',left_on='Name',right_on='Name')

df3.head()

Output :
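An inner join keeps only the rows whose key appears in both dataframes. To illustrate, the sketch below compares it with a left join on two small made-up dataframes, `left` and `right`, where one name has no match.

```python
import pandas as pd

left = pd.DataFrame({'Name': ['setosa', 'versicolor', 'unknown'],
                     'SepalLength': [5.1, 7.0, 6.0]})
right = pd.DataFrame({'Name': ['setosa', 'versicolor'],
                      'Species': ['Species 1', 'Species 2']})

# Inner join: 'unknown' is dropped because it has no match in `right`.
inner = pd.merge(left, right, how='inner', on='Name')

# Left join: every row of `left` is kept; missing matches become NaN.
left_join = pd.merge(left, right, how='left', on='Name')

print(len(inner), len(left_join))
```

When the key column has the same name on both sides, `on='Name'` is a shorter way of writing `left_on='Name', right_on='Name'`.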

7. Aggregating Dataframes

Let’s aggregate a few columns in the iris dataset. First, let’s find the average sepal length for each species.

df1.groupby('Name')['SepalLength'].mean()

Output :

Now, let’s find the average of all the numerical columns by species.

df1.groupby('Name')[['SepalLength','SepalWidth','PetalLength','PetalWidth']].mean()

Output :

Instead of averaging all the columns, let’s average one column (SepalLength) and sum another (SepalWidth).

a=df1.groupby('Name').agg({'SepalLength':'mean','SepalWidth':'sum'})

a

Output :

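Let’s rename the columns. Note that rename returns a new dataframe; ‘a’ itself stays unchanged unless the result is assigned back to it.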

a.rename(columns={'SepalLength':'Avg_SepalLength','SepalWidth':'Sum_SepalWidth'})

Output :
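Aggregating and renaming can also be done in one step with named aggregation, where each keyword argument to agg is the output column name and its value is a (column, function) pair. This is a sketch on a small made-up dataframe, not the full iris data.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['setosa', 'setosa', 'virginica'],
    'SepalLength': [5.0, 5.2, 6.3],
    'SepalWidth': [3.5, 3.0, 3.3],
})

# Keyword = new column name, value = (source column, aggregation function).
summary = df.groupby('Name').agg(
    Avg_SepalLength=('SepalLength', 'mean'),
    Sum_SepalWidth=('SepalWidth', 'sum'),
)

print(summary)
```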

8. Filtering Dataframes

Once again, let’s import the iris dataset and perform operations to subset the dataset. First, let’s subset the data where sepal length is greater than 7 cm.

iris = pd.read_csv('iris.csv') # import file

a = iris[(iris.SepalLength>7)]

a.head()

Output :

Now, let’s subset the data based on two conditions.

b = iris[(iris.SepalLength>5) & (iris.PetalLength>6)]

b.head()

Output :

Subset the data by filtering on the ‘Name’ column.

c = iris[iris['Name']=='versicolor']

c.head()

Filter on the ‘Name’ column again, this time matching either of two names.

d = iris[iris['Name'].isin(['virginica','versicolor'])]

d.head()

Output :
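All of these filters work the same way: the expression inside the brackets produces a boolean series, and only the rows marked True are kept. The sketch below shows this on a three-row made-up sample (`iris_sample` is a stand-in, not the real file) and adds the "or" (`|`) and "not" (`~`) operators.

```python
import pandas as pd

iris_sample = pd.DataFrame({
    'SepalLength': [5.1, 7.2, 6.4],
    'PetalLength': [1.4, 6.1, 4.5],
    'Name': ['setosa', 'virginica', 'versicolor'],
})

# A boolean mask: True where both conditions hold.
mask = (iris_sample['SepalLength'] > 5) & (iris_sample['PetalLength'] > 6)
both = iris_sample[mask]

# Either condition may hold.
either = iris_sample[(iris_sample['SepalLength'] > 7) |
                     (iris_sample['Name'] == 'setosa')]

# Negate a condition with ~ to exclude names.
not_setosa = iris_sample[~iris_sample['Name'].isin(['setosa'])]

print(both['Name'].tolist())
print(either['Name'].tolist())
print(not_setosa['Name'].tolist())
```

Each condition needs its own parentheses, because `&` and `|` bind more tightly than the comparison operators.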