Pandas is an open-source Python library that provides data analysis and manipulation in Python programming.

It’s a very promising library in data representation, filtering, and statistical programming. The most important piece in pandas is the DataFrame, where you store and play with the data.

In this tutorial, you will learn what the DataFrame is, how to create it from different sources, how to export it to different outputs, and how to manipulate its data.

Install pandas

You can install pandas in Python by using pip. Run the following command in cmd:

pip install pandas

Also, you can install pandas using conda like this:

conda install pandas

Read Excel file

You can read from an Excel file using the read_excel() method of pandas. For this, you need to import one more module called xlrd. Install xlrd using pip:

pip install xlrd

The example below demonstrates how to read from an Excel sheet:

We created an Excel sheet with the following contents:

Import the pandas module:

import pandas

We will pass the name of the Excel file and the sheet number we need to read data from to the read_excel() method.

pandas.read_excel('pandasExcel.xlsx', 'Sheet1')

The above snippet will generate the following output:

If you check the type of the output using type keyword, it will give you the following result:

<class 'pandas.core.frame.DataFrame'>

It’s called a DataFrame! That is the basic unit of pandas that we are going to deal with.

The DataFrame is a labeled 2 Dimensional structure where we can store data of different types. DataFrame is similar to a SQL table or an Excel spreadsheet.

Import CSV file

To read from a CSV file, you can use the read_csv() method of pandas.

Import the pandas module:

import pandas

Now call the read_csv() method as follows:

pandas.read_csv('Book1.csv')

Book1.csv has the following contents:

The code will generate the following DataFrame:

Read text file

We can also use the read_csv method of pandas to read from a text file; consider the following example:

import pandas pandas.read_csv('myFile.txt')

The myFile.txt looks like the following:

The output of the above code will be:

Pandas treats the file as a CSV file because we have comma-separated elements. The file also may use another delimiter such as a semicolon, tab, etc.

Suppose we have a tab delimiter and the file looks like this:

When the delimiter is a tab, we will have the following output:

Since pandas has no idea about the delimiter, it translates the tab to \t.

To define the tab character as a delimiter, pass the delimiter argument like this:

pandas.read_csv('myFile.txt', delimiter='\t')

Now the output will be:

Looks correct now.

Read SQL

You can use the read_sql() method of pandas to read from an SQL database:

import sqlite3 import pandas con = sqlite3.connect('mydatabase.db') pandas.read_sql('select * from Employee', con)

In this example, we connected to a SQLite3 database that has a table named “Employee”. Using the read_sql() method of pandas, then we passed a query and a connection object to the read_sql() method. The query fetches all the data in the table.

Our Employee table looks like the following:

When you run the above code, the output will be like the following:

Select columns

Suppose we have three columns in the Employee table like this:

To select columns from the table, we will pass the following query:

select Name, Job from Employee

The pandas code statement will be as follows:

pandas.read_sql('select Name, Job from Employee', con)

We can also select a column from a table by accessing the data frame. Consider the following example:

x = pandas.read_sql('select * from Employee', con) x['Name']

The result will be as follows:

Select rows by value

First, we will create a DataFrame from which we will select rows.

To create a DataFrame, consider the code below:

import pandas frame_data = {'name': ['James', 'Jason', 'Rogers'], 'age': [18, 20, 22], 'job': ['Assistant', 'Manager', 'Clerk']} df = pandas.DataFrame(frame_data)

In this code, we created a DataFrame with three columns and three rows using the DataFrame() method of pandas. The result will be the following:

To select a row based on value, run the following statement:

df.loc[df['name'] == 'Jason']

df.loc[] or DataFrame.loc[] is a Boolean array that you can use to access rows or columns by values or labels. In the above code, it will select rows where the name equals Jason.

The output will be:

Select row by index

To select a row by its index, we can either use the slicing (:) operator or the df.loc[] array.

Consider the code below:

>>> frame_data = {'name': ['James', 'Jason', 'Rogers'], 'age': [18, 20, 22], 'job': ['Assistant', 'Manager', 'Clerk']} >>> df = pandas.DataFrame(frame_data)

We created a DataFrame. Now let’s access a row using df.loc[]:

>>> df.loc[1]

As you can see, one row is fetched. We can do the same using the slicing operator as follows:

>>> df[1:2]

Change column type

The Data type of a column can be changed by using the astype() attribute of DataFrame. To check the data type of columns, we use the dtypes attribute of DataFrame.

>>> df.dtypes

The output will be:

Now to convert the data type from one to another:

>>> df.name = df.name.astype(str)

We fetched the column’ name’ from our DataFrame and changed its data type from object to string.

Apply a function to columns/rows

To apply a function on a column or a row, you can use the apply() method of DataFrame.

Consider the following example:

>>> frame_data = {'A': [1, 2, 3], 'B': [18, 20, 22], 'C': [54, 12, 13]} >>> df = pandas.DataFrame(frame_data)

We created a DataFrame and added values of integer type in the rows. To apply a function for example square root on the values, we will import the numpy module to use the sqrt function from it like this:

>>> import numpy as np >>> df.apply(np.sqrt)

The output will be as follows:

To apply the sum function, the code will be:

>>> df.apply(np.sum)

To apply the function to a specific column, you can specify the column like this:

>>>df['A'].apply(np.sqrt)

Sort Values/sort by column

To sort values in a DataFrame, use the sort_values() method of the DataFrame.

Create a DataFrame with integer values:

>>> frame_data = {'A': [23, 12, 30], 'B': [18, 20, 22], 'C': [54, 112, 13]} >>> df = pandas.DataFrame(frame_data)

Now to sort the values:

>>> df.sort_values(by=['A'])

The output will be:

In the code above, the values are sorted by column A. To sort by multiple columns; the code will be:

>>> df.sort_values(by=['A', 'B'])

If you want to sort in descending order, set ascending attributed of set_values to False as follows:

>>> df.sort_values(by=['A'], ascending=False)

The output will be:

Drop/Remove duplicates

To drop duplicate rows from a DataFrame, use the drop_duplicates() method of the DataFrame.

Consider the following example:

>>> frame_data = {'name': ['James', 'Jason', 'Rogers', 'Jason'], 'age': [18, 20, 22, 20], 'job': ['Assistant', 'Manager', 'Clerk', 'Manager']} >>> df = pandas.DataFrame(frame_data)

Here we created a DataFrame with a duplicate row. To check if any duplicate rows are present in the DataFrame, use the duplicated() method of the DataFrame.

>>> df.duplicated()

The result will be:

It can be seen that the last row is a duplicate. To drop or remove this row, run the following line of code:

>>> df.drop_duplicates()

Now the result will be:

Drop duplicates by column

Sometimes, we have data where the column values are the same and we wish to delete them. We can drop a row by column by passing the name of the column we need to delete.

For example, we have the following DataFrame:

>>> frame_data = {'name': ['James', 'Jason', 'Rogers', 'Jason'], 'age': [18, 20, 22, 21], 'job': ['Assistant', 'Manager', 'Clerk', 'Employee']} >>> df = pandas.DataFrame(frame_data)

Here you can see that Jason is two times. If you want to remove duplicate by column, just pass the column name as follows:

>>> df.drop_duplicates(['name'])

The result will be like the following:

Delete a column

To delete an entire column or row, we can use the drop() method of the DataFrame by specifying the name of the column or row.

Consider the following example:

>>> df.drop(['job'], axis=1)

In this line of code, we are deleting the column named ‘job’ The axis argument is necessary here. If the axis value is 1, it means we want to delete columns. If the axis value is 0, it means that row will be deleted. In axis values, 0 is for index and 1 is for columns.

The result will be:

Delete rows

We can use the drop() method to drop or delete a row by passing the index of the row.

Suppose we have the following DataFrame:

>>> frame_data = {'name': ['James', 'Jason', 'Rogers'], 'age': [18, 20, 22], 'job': ['Assistant', 'Manager', 'Clerk']} >>> df = pandas.DataFrame(frame_data)

To drop a row with index 0 where the name is James, age is 18, and the job is Assistant, use the following code:

>>> df.drop([0])

Let’s create a DataFrame where the indexes are the names:

>>> frame_data = {'name': ['James', 'Jason', 'Rogers'], 'age': [18, 20, 22], 'job': ['Assistant', 'Manager', 'Clerk']} >>> df = pandas.DataFrame(frame_data, index = ['James', 'Jason', 'Rogers'])

Now we can delete a row with a certain value. For example, if we want to delete a row where the name is Rogers, then the code will be:

>>> df.drop(['Rogers'])

The output will be:

You can also delete a range of row as:

>>> df.drop(df.index[[0, 1]])

This will delete rows from index 0 to 1 and one row left only since our DataFrame composed of 3 rows:

If you want to delete the last row from the DataFrame and do not know what the total number of rows is, then you can use the negative indexing as below:

>>> df.drop(df.index[-1])

-1 deletes the last row. Similarly, -2 will delete the last two rows and so on.

Sum a column

You can use the sum() method of the DataFrame to sum the column items.

Suppose we have the following DataFrame:

>>> frame_data = {'A': [23, 12, 12], 'B': [18, 18, 22], 'C': [13, 112, 13]} >>> df = pandas.DataFrame(frame_data)

Now to sum the items of column A, use the following line of code:

>>> df['A'].sum()

You can also use the apply() method of the DataFrame and pass in the sum method of numpy to sum the values.

Count unique values

To count unique values in a column, you can use the nunique() method of the DataFrame.

Suppose we have DataFrame as below:

>>> frame_data = {'A': [23, 12, 12], 'B': [18, 18, 22], 'C': [13, 112, 13]} >>> df = pandas.DataFrame(frame_data)

To count the unique values in column A:

>>> df['A'].nunique()

As you can see, column A has only two unique values 23 and 12, and another 12 is a duplicate. That’s why we have 2 in the output.

If you want to count all the values in a column, you can use the count() method as follows:

>>> df['A'].count()

Subset rows

To select a subset of a DataFrame, You can use the square brackets.

For example, we have a DataFrame that contains some integers. We can select or subset a row like this:

df.[start:count]

The start point will be included in the subset, but the stop point is not included. For example, to select three rows starting from the first row, you will write:

>>> df[0:3]

The output will be:

That code means start from the first row which is 0 and select three rows.

Similarly, to select the first two rows, you will write:

>>> df[0:2]

To select or subset the last row, use the negative indexing as:

>>> df[-1:]

Write to Excel

To write a DataFrame to an Excel sheet, we can use the to_excel() method.

To write to an Excel sheet, you have to open the sheet, and to open an Excel sheet; you will have to import the openpyxl module.

Install openpyxl using pip:

pip install openpyxl

Consider the following example:

>>> import openpyxl >>> frame_data = {'name': ['James', 'Jason', 'Rogers'], 'age': [18, 20, 22], 'job': ['Assistant', 'Manager', 'Clerk']} >>> df = pandas.DataFrame(frame_data) >>> df.to_excel("pandasExcel.xlsx", "Sheet1")

The Excel file will look like the following:

Write to CSV

Similarly, to write a DataFrame to CSV, you can use the to_csv() method as in the following line of code.

>>> df.to_csv("pandasCSV.csv")

The output file will be like the following:

Write to SQL

To write data to SQL, we can use the to_sql() method. Consider the following example:

import sqlite3 import pandas con = sqlite3.connect('mydatabase.db') frame_data = {'name': ['James', 'Jason', 'Rogers'], 'age': [18, 20, 22], 'job': ['Assistant', 'Manager', 'Clerk']} df = pandas.DataFrame(frame_data) df.to_sql('users', con)

In this code, we created a connection with the sqlite3 database. Then we created a DataFrame with three rows and three columns.

Finally, we used the to_sql method of our DataFrame (df) and passed the name of the table where the data will be stored along with the connection object.

The SQL database will look like the following:

Write to JSON

You can use the to_json() method of the DataFrame to write to a JSON file:

>>> df.to_json("myJson.json")

In this line of code, the name of the JSON file is passed as an argument. The DataFrame will be stored in the JSON file. The file will have the following content:

Write to HTML file

You can use the to_html() method of the DataFrame to create an HTML file with the DataFrame content.

Consider the following example:

>>> df.to_html("myhtml.html")

The resulting file will have the following content:

When you open the HTML file in the browser, it will look like the following:

Working with pandas is very easy. It’s like working with Excel sheets! pandas DataFrame is a very flexible piece library you can ever use.

I hope you find the tutorial useful. Keep coming back.