Introduction

Pandas is an open source data analysis library in Python and it is extensively used for Data analysis, Data munging and Cleaning. Pandas has a high performing and user friendly data structures.

What makes Pandas a great choice for data analysis? It is it’s rich and highly performant data structures which are built on top of numpy and leverage it’s fast array operations. In thispost we will see how Pandas can be used efficiently for the data analysis and will explore some of it’s basic operations. Pandas has two types of data structures:

a) Series – it’s a one dimensional array with indexes, it stores a single column or row of data in a Dataframe

b) Dataframe – it’s a tabular spreadsheet like structure representing rows each of which contains one or multiple columns

First, Let’s do some import

<span class="kn" style="color: #008080;">import</span> <span class="nn" style="color: #0000ff;">pandas</span> <span class="k" style="color: #008080;">as</span> <span class="nn" style="color: #0000ff;">pd</span> <span class="kn" style="color: #008080;">import</span> <span class="nn" style="color: #0000ff;">numpy</span> <span class="k" style="color: #008080;">as</span> <span class="nn" style="color: #0000ff;">np</span> <span class="kn" style="color: #008080;">from</span> <span class="nn" style="color: #0000ff;">pandas</span> <span class="k" style="color: #008080;">import</span> <span class="n" style="color: #0000ff;">Series</span><span class="p">,</span> <span class="n" style="color: #0000ff;">DataFrame </span> <span class="kn" style="color: #008080;">import</span> <span class="nn" style="color: #0000ff;">pandas</span> <span class="k" style="color: #008080;">as</span> <span class="nn" style="color: #0000ff;">pd</span> <span class="kn" style="color: #008080;">import</span> <span class="nn" style="color: #0000ff;">numpy</span> <span class="k" style="color: #008080;">as</span> <span class="nn" style="color: #0000ff;">np</span> <span class="kn" style="color: #008080;">from</span> <span class="nn" style="color: #0000ff;">pandas</span> <span class="k" style="color: #008080;">import</span> <span class="n" style="color: #0000ff;">Series</span><span class="p">,</span> <span class="n" style="color: #0000ff;">DataFrame </span>

Series

Create a Series with the list of values and default indexes

<span class="n">this</span><span class="o">=</span><span class="n">Series</span><span class="p">([</span><span style="color: #993300;"><span class="mi">6</span><span class="p">,</span><span class="mi">5</span><span class="p">,</span><span class="mi">4</span><span class="p">,</span><span class="mi">9</span><span class="p">,</span><span class="mi">2</span></span><span class="p">])</span> <span class="n">this</span><span class="o">=</span><span class="n">Series</span><span class="p">([</span><span style="color: #993300;"><span class="mi">6</span><span class="p">,</span><span class="mi">5</span><span class="p">,</span><span class="mi">4</span><span class="p">,</span><span class="mi">9</span><span class="p">,</span><span class="mi">2</span></span><span class="p">])</span>

Check how the Series object looks like

<span class="n">this</span>

0 6 1 5 2 4 3 9 4 2 dtype: int6

<span class="n">this</span><span class="o">.</span><span class="n">values</span>

array([6, 5, 4, 9, 2])

Check the Index for this Series, Pandas generates the indexes automatically and default starts from 0

<span class="n">this</span><span class="o">.</span><span class="n">index</span>

Int64Index([0, 1, 2, 3, 4], dtype=’int64′)

Create a Series of US Cities with their Population and Index is set as the City for the Series, The Object takes the Indexes as defined with their corresponding column values i.e Population

citypopulation_2016 = Series([8537673,3976322,2704958,2303482,1615017], index=[‘NYC’,‘Los Angeles’,‘Chicago’,‘Houston’,‘Phoenix’])

<span class="n">citypopulation_2016</span>

NYC 8537673 Los Angeles 3976322 Chicago 2704958 Houston 2303482 Phoenix 1615017 dtype: int64

Find Population lesser than 2.5 million

Find the rows with population lower than 2.5 million

<span class="n">citypopulation_2016</span><span class="p">[</span><span style="color: #993300;"><span class="n">citypopulation_2016</span><span class="o"><</span><span class="mi">2500000</span></span><span class="p">]</span>

Houston 2303482 Phoenix 1615017 dtype: int64

Is Chicago in the Series

Check if Chicago is in the Series or not, it will return True or False boolean value

<span class="s">'Chicago'</span> <span class="ow">in</span> <span class="n">citypopulation_2016</span>

True

Create a list of Index

Create a list of Indexes with one extra value Philadelphia

<span class="n">cities</span> <span class="o">=</span> <span class="p">[</span><span style="color: #993300;"><span class="s">'NYC'</span><span class="p">,</span><span class="s">'Los Angeles'</span><span class="p">,</span><span class="s">'Chicago'</span><span class="p">,</span><span class="s">'Houston'</span><span class="p">,</span><span class="s">'Phoenix'</span><span class="p">,</span><span class="s">'Philadelhia'</span></span><span class="p">]</span>

Create a Series with this new Indexes, The value for Philadelhia is NaN (Not a Number)

<span class="n">Ser</span> <span class="o">=</span> <span class="n">Series</span><span class="p">(</span><span class="n">citypopulation_2016</span><span class="p">,</span><span class="n">index</span><span class="o">=</span><span class="n">cities</span><span class="p">)</span>

<span class="n">Ser</span>

NYC 8537673 Los Angeles 3976322 Chicago 2704958 Houston 2303482 Phoenix 1615017 Philadelhia NaN dtype: float64

Philadelphia not in list that’s why it is null

Check what are the null values in the above Series

<span class="n">Ser</span><span class="o">.</span><span class="n">isnull</span><span class="p">()</span>

NYC False Los Angeles False Chicago False Houston False Phoenix False Philadelhia True dtype: bool

Adding two series

Add the two Series object Ser & citypopulation_2016 and store the sum to a new object Ser1

<span class="n">Ser1</span> <span class="o">=</span> <span class="n">Ser</span><span class="o">+</span><span class="n">citypopulation_2016</span> <span class="n">Ser1</span>

Chicago 5409916 Houston 4606964 Los Angeles 7952644 NYC 17075346 Philadelhia NaN Phoenix 3230034 dtype: float64

Name the Series

<span class="n">Ser1</span><span class="o">.</span><span class="n">name</span> <span class="o">=</span> <span class="s">"US Populate cities"</span> <span class="n">Ser1</span>

Cities Chicago 5409916 Houston 4606964 Los Angeles 7952644 NYC 17075346 Philadelhia NaN Phoenix 3230034 Name: US Populate cities, dtype: float64

Name the Index

<span class="n">Ser1</span><span class="o">.</span><span class="n">index</span><span class="o">.</span><span class="n">name</span> <span class="o">=</span> <span class="s">'Cities'</span> <span class="n">Ser1</span>

Cities Chicago 5409916 Houston 4606964 Los Angeles 7952644 NYC 17075346 Philadelhia NaN Phoenix 3230034 Name: US Populate cities, dtype: float64

Dataframes

Dataframe has many functions to read the data from various sources like Json, CSV, Excel, HTML etc. ro dataframe object.Here we would be reading the Salary data from the CSV file

<span class="n">df</span><span class="o">=</span><span class="n">pd</span><span class="o">.</span><span class="n">read_csv</span><span class="p">(</span><span class="s">"<span style="color: #993300;">/Users/vbabu/Documents/personal/MyGitWork/csv/mls-salaries-2016.csv</span>"</span><span class="p">)</span>

Now using the ‘head’ function you can see the first five rows by default and similarly for the bottom 5 rows use ‘tail’ function

<span class="n">df</span><span class="o">.</span><span class="n">head</span><span class="p">()</span>

club last_name first_name position base_salary guaranteed_compensation ATL Burgos Efrain M 62508 62508.0 ATL Oblitey Otoo Jeffrey M 51504 51504.0 ATL Tambakis Alexander GK 63000 63000.0 CHI Accam David F-M 700000 770937.5 CHI Alvarez Arturo F 115000 118264.0

<span class="n">df</span><span class="o">.</span><span class="n">tail</span><span class="p">()</span>

club last_name first_name position base_salary guaranteed_compensation NaN Martinez Cristian M 62508.00 67008.00 NaN Moore Luke F 115000.00 137500.00 NaN Nyassi Sanna M 135000.00 141250.00 NaN Ovalle Adolfo M 63000.00 73500.00 NaN Tshuma Schillo F 81999.96 119999.96

Get the information about the dataframe using the ‘info’ function, Which gives details like field datatype, number of values, memory usage etc.

<class ‘pandas.core.frame.DataFrame’> Int64Index: 555 entries, 0 to 554 Data columns (total 6 columns): club 549 non-null object last_name 555 non-null object first_name 553 non-null object position 555 non-null object base_salary 555 non-null float64 guaranteed_compensation 555 non-null float64 dtypes: float64(2), object(4) memory usage: 30.4+ KB

Let’s check the data for a single Column ‘last_name”, Using ‘head’ function we will check the first five values for this column

<span class="n">df</span><span class="p">[</span><span class="s">'<span style="color: #993300;">last_name</span>'</span><span class="p">]</span><span class="o">.</span><span class="n">head</span><span class="p">()</span>

0 Burgos 1 Oblitey Otoo 2 Tambakis 3 Accam 4 Alvarez Name: last_name, dtype: object

Add a New Column to Dataframe

Let’s add ‘Age’ column to the existing dataframe, You can see there is no value set for the new column so Pandas handles it internally and set the value as NaN

<span class="n">DataFrame</span><span class="p">(</span><span class="n">df</span><span class="p">,</span><span class="n">columns</span><span class="o">=</span><span class="p">[</span><span style="color: #993300;"><span class="s">'last_name'</span><span class="p">,</span><span class="s">'first_name'</span><span class="p">,</span><span class="s">'Age'</span></span><span class="p">])</span>

last_name first_name Age Burgos Efrain NaN Oblitey Otoo Jeffrey NaN Tambakis Alexander NaN Accam David NaN Alvarez Arturo NaN Calistri Joey NaN Campbell Jonathan NaN Cocis Razvan NaN Conner Drew NaN Doody Patrick NaN Fernandez Collin NaN Gehrig Eric NaN Goossens John NaN Harrington Michael NaN Igboananike Kennedy NaN Johnson Sean NaN Junior Gilberto NaN Kappelhof Johan NaN LaBrocca Nick NaN Lampson Matt NaN McLain Patrick NaN Meira Joao NaN Morrell Alex NaN Polster Matt NaN Ramos Rodrigo NaN Stephens Michael NaN Thiam Khaly NaN Vincent Brandon NaN Afful Harrison NaN Ashe Corey NaN … … … Carducci Marco NaN Dean Christian NaN Flores Deybi NaN Froese Kianz NaN Harvey Jordan NaN Hurtado Erik NaN Jacobson Andrew NaN Kah Pa Modou NaN Kudo Masato NaN Laba Matias NaN Manneh Kekuta NaN McKendry Ben NaN Mezquida Nicolas NaN Morales Pedro NaN Ousted David NaN Parker Tim NaN Perez Blas NaN Rivero Octavio NaN Seiler Cole NaN Smith Jordan NaN Techera Cristian NaN Teibert Russell NaN Tornaghi Paolo NaN Waston Kendall NaN Gargan Dan NaN Martinez Cristian NaN Moore Luke NaN Nyassi Sanna NaN Ovalle Adolfo NaN Tshuma Schillo NaN

555 rows × 3 columns

Let’s see how we can get the value from the index, So here we want to get the 4th row value from the dataframe. All the columns of the 4th row is shown

<span class="n">df</span><span class="o">.</span><span class="n">ix</span><span class="p">[</span><span class="mi">4</span><span class="p">]</span>

club CHI last_name Alvarez first_name Arturo position F base_salary 115000 guaranteed_compensation 118264 Name: 4, dtype: object

Add a new column – base salary * 2

This is another example of adding a new column ‘Double_Salary’ which is double the ‘base_salary’ of existing column.

<span class="n">df</span><span class="p">[</span><span class="s">'Double_Salary'</span><span class="p">]</span> <span class="o">=</span> <span class="n">df</span><span class="p">[</span><span class="s">'base_salary'</span><span class="p">]</span><span class="o">*</span><span class="mi">2</span>

<span class="n">df</span><span class="o">.</span><span class="n">head</span><span class="p">(</span><span class="mi">3</span><span class="p">)</span>

club last_name first_name position base_salary guaranteed_compensation Double_Salary taxes ATL Burgos Efrain M 62508.00 62508.00 125016.00 Base Tax ATL Oblitey Otoo Jeffrey M 51504.00 51504.00 103008.00 Base Tax ATL Tambakis Alexander GK 63000.00 63000.00 126000.00 Base Tax CHI Accam David F-M 700000.00 770937.50 1400000.00 Base Tax CHI Alvarez Arturo F 115000.00 118264.00 230000.00 Base Tax CHI Calistri Joey M 51500.00 51500.00 103000.00 Base Tax CHI Campbell Jonathan D 70000.00 78125.00 140000.00 Base Tax CHI Cocis Razvan M 160000.00 160000.00 320000.00 Base Tax CHI Conner Drew M 62500.00 62500.00 125000.00 Base Tax CHI Doody Patrick D 53471.00 53471.00 106942.00 Base Tax CHI Fernandez Collin M 75000.00 77000.00 150000.00 Base Tax CHI Gehrig Eric D 106875.00 112708.33 213750.00 Base Tax CHI Goossens John M 200000.00 200000.00 400000.00 Base Tax CHI Harrington Michael D 125000.00 125000.00 250000.00 Base Tax CHI Igboananike Kennedy F 800000.00 901666.67 1600000.00 Base Tax CHI Johnson Sean GK 250000.00 253000.00 500000.00 Base Tax CHI Junior Gilberto F 1145000.00 1145000.00 2290000.00 Base Tax CHI Kappelhof Johan D 480000.00 520000.00 960000.00 Base Tax CHI LaBrocca Nick M 110000.00 110000.00 220000.00 Base Tax CHI Lampson Matt GK 71250.00 76625.00 142500.00 Base Tax CHI McLain Patrick GK 72500.00 72500.00 145000.00 Base Tax CHI Meira Joao D-M 110000.00 126500.00 220000.00 Base Tax CHI Morrell Alex M 51500.00 51500.00 103000.00 Base Tax CHI Polster Matt D-M 84000.00 99000.00 168000.00 Base Tax CHI Ramos Rodrigo D 80000.00 84000.00 160000.00 Base Tax CHI Stephens Michael M 105000.00 115000.00 210000.00 Base Tax CHI Thiam Khaly M 144000.00 144000.00 288000.00 Base Tax CHI Vincent Brandon D 70000.00 91875.00 140000.00 Base Tax CLB Afful Harrison D 275000.00 291666.67 550000.00 Base Tax CLB Ashe Corey D 95000.00 105500.00 190000.00 Base Tax … … … … … … … … VAN Carducci Marco GK 63000.00 63000.00 126000.00 Base Tax VAN Dean Christian D 110000.00 191000.00 220000.00 Base Tax VAN Flores Deybi M 62500.00 68385.00 125000.00 Base Tax VAN Froese Kianz M 66000.00 70500.00 132000.00 Base Tax VAN Harvey Jordan D 165000.00 165000.00 330000.00 Base Tax VAN Hurtado Erik F/M 86091.50 121091.50 172183.00 Base Tax VAN Jacobson Andrew M 62500.08 87500.08 125000.16 Base Tax VAN Kah Pa Modou D 97000.00 101850.00 194000.00 Base Tax VAN Kudo Masato F 310000.00 314000.00 620000.00 Base Tax VAN Laba Matias M 560000.00 720500.00 1120000.00 Base Tax VAN Manneh Kekuta M-F 127500.00 157000.00 255000.00 Base Tax VAN McKendry Ben M 52500.00 52500.00 105000.00 Base Tax VAN Mezquida Nicolas M-F 88000.00 88000.00 176000.00 Base Tax VAN Morales Pedro M 1232500.00 1258900.00 2465000.00 Base Tax VAN Ousted David GK 360000.00 378933.33 720000.00 Base Tax VAN Parker Tim D 66000.00 84750.00 132000.00 Base Tax VAN Perez Blas F 215000.00 215000.00 430000.00 Base Tax VAN Rivero Octavio F 890850.00 890850.00 1781700.00 Base Tax VAN Seiler Cole D 51500.00 51500.00 103000.00 Base Tax VAN Smith Jordan D 115000.00 122743.00 230000.00 Base Tax VAN Techera Cristian M 320000.00 345000.00 640000.00 Base Tax VAN Teibert Russell M 115000.00 182500.00 230000.00 Base Tax VAN Tornaghi Paolo GK 62500.00 62500.00 125000.00 Base Tax VAN Waston Kendall D 300000.00 318125.00 600000.00 Base Tax NaN Gargan Dan D 145000.00 145000.00 290000.00 Base Tax NaN Martinez Cristian M 62508.00 67008.00 125016.00 Base Tax NaN Moore Luke F 115000.00 137500.00 230000.00 Base Tax NaN Nyassi Sanna M 135000.00 141250.00 270000.00 Base Tax NaN Ovalle Adolfo M 63000.00 73500.00 126000.00 Base Tax NaN Tshuma Schillo F 81999.96 119999.96 163999.92 Base Tax

Add specific values to the new column based on the indexes, So here for the indexes 0 & 6, the ‘taxes’ column value would be ‘Federal Tax’ & ‘State Tax’ and rest of the indexes will have the NaN value or null

<span class="n">tax</span> <span class="o">=</span> <span class="n">Series</span><span class="p">([</span><span style="color: #993300;"><span class="s">'Federal Tax'</span><span class="p">,</span><span class="s">'State Tax'</span></span><span class="p">],</span><span class="n">index</span><span class="o">=</span><span class="p">[</span><span style="color: #993300;"><span class="mi">0</span><span class="p">,</span><span class="mi">6</span></span><span class="p">])</span>

<span class="n">df</span><span class="p">[</span><span class="s">'taxes'</span><span class="p">]</span> <span class="o">=</span> <span class="n">tax</span>

<span class="n">df</span>

club last_name first_name position base_salary guaranteed_compensation Double_Salary taxes ATL Burgos Efrain M 62508.00 62508.00 125016.00 Federal Tax ATL Oblitey Otoo Jeffrey M 51504.00 51504.00 103008.00 NaN ATL Tambakis Alexander GK 63000.00 63000.00 126000.00 NaN CHI Accam David F-M 700000.00 770937.50 1400000.00 NaN CHI Alvarez Arturo F 115000.00 118264.00 230000.00 NaN CHI Calistri Joey M 51500.00 51500.00 103000.00 NaN CHI Campbell Jonathan D 70000.00 78125.00 140000.00 State Tax CHI Cocis Razvan M 160000.00 160000.00 320000.00 NaN CHI Conner Drew M 62500.00 62500.00 125000.00 NaN CHI Doody Patrick D 53471.00 53471.00 106942.00 NaN CHI Fernandez Collin M 75000.00 77000.00 150000.00 NaN CHI Gehrig Eric D 106875.00 112708.33 213750.00 NaN CHI Goossens John M 200000.00 200000.00 400000.00 NaN CHI Harrington Michael D 125000.00 125000.00 250000.00 NaN CHI Igboananike Kennedy F 800000.00 901666.67 1600000.00 NaN CHI Johnson Sean GK 250000.00 253000.00 500000.00 NaN CHI Junior Gilberto F 1145000.00 1145000.00 2290000.00 NaN CHI Kappelhof Johan D 480000.00 520000.00 960000.00 NaN CHI LaBrocca Nick M 110000.00 110000.00 220000.00 NaN CHI Lampson Matt GK 71250.00 76625.00 142500.00 NaN CHI McLain Patrick GK 72500.00 72500.00 145000.00 NaN CHI Meira Joao D-M 110000.00 126500.00 220000.00 NaN CHI Morrell Alex M 51500.00 51500.00 103000.00 NaN CHI Polster Matt D-M 84000.00 99000.00 168000.00 NaN CHI Ramos Rodrigo D 80000.00 84000.00 160000.00 NaN CHI Stephens Michael M 105000.00 115000.00 210000.00 NaN CHI Thiam Khaly M 144000.00 144000.00 288000.00 NaN CHI Vincent Brandon D 70000.00 91875.00 140000.00 NaN CLB Afful Harrison D 275000.00 291666.67 550000.00 NaN CLB Ashe Corey D 95000.00 105500.00 190000.00 NaN VAN Carducci Marco GK 63000.00 63000.00 126000.00 NaN VAN Dean Christian D 110000.00 191000.00 220000.00 NaN VAN Flores Deybi M 62500.00 68385.00 125000.00 NaN VAN Froese Kianz M 66000.00 70500.00 132000.00 NaN VAN Harvey Jordan D 165000.00 165000.00 330000.00 NaN VAN Hurtado Erik F/M 86091.50 121091.50 172183.00 NaN VAN Jacobson Andrew M 62500.08 87500.08 125000.16 NaN VAN Kah Pa Modou D 97000.00 101850.00 194000.00 NaN VAN Kudo Masato F 310000.00 314000.00 620000.00 NaN VAN Laba Matias M 560000.00 720500.00 1120000.00 NaN VAN Manneh Kekuta M-F 127500.00 157000.00 255000.00 NaN VAN McKendry Ben M 52500.00 52500.00 105000.00 NaN VAN Mezquida Nicolas M-F 88000.00 88000.00 176000.00 NaN VAN Morales Pedro M 1232500.00 1258900.00 2465000.00 NaN VAN Ousted David GK 360000.00 378933.33 720000.00 NaN VAN Parker Tim D 66000.00 84750.00 132000.00 NaN VAN Perez Blas F 215000.00 215000.00 430000.00 NaN VAN Rivero Octavio F 890850.00 890850.00 1781700.00 NaN VAN Seiler Cole D 51500.00 51500.00 103000.00 NaN VAN Smith Jordan D 115000.00 122743.00 230000.00 NaN VAN Techera Cristian M 320000.00 345000.00 640000.00 NaN VAN Teibert Russell M 115000.00 182500.00 230000.00 NaN VAN Tornaghi Paolo GK 62500.00 62500.00 125000.00 NaN VAN Waston Kendall D 300000.00 318125.00 600000.00 NaN NaN Gargan Dan D 145000.00 145000.00 290000.00 NaN NaN Martinez Cristian M 62508.00 67008.00 125016.00 NaN NaN Moore Luke F 115000.00 137500.00 230000.00 NaN NaN Nyassi Sanna M 135000.00 141250.00 270000.00 NaN NaN Ovalle Adolfo M 63000.00 73500.00 126000.00 NaN NaN Tshuma Schillo F 81999.96 119999.96 163999.92 NaN

Delete Column

‘taxes’ column deleted from the dataframe

<span class="k">del</span> <span class="n">df</span><span class="p">[</span><span class="s">'taxes'</span><span class="p">]</span>

<span class="n">df</span><span class="o">.</span><span class="n">head</span><span class="p">()</span>

club last_name first_name position base_salary guaranteed_compensation Double_Salary ATL Burgos Efrain M 62508 62508.0 125016 ATL Oblitey Otoo Jeffrey M 51504 51504.0 103008 ATL Tambakis Alexander GK 63000 63000.0 126000 CHI Accam David F-M 700000 770937.5 1400000 CHI Alvarez Arturo F 115000 118264.0 230000

Dataframe from Dictionary

<span class="n">Myfruits</span> <span class="o">=</span> <span class="p">{</span><span class="s">'fruits'</span><span class="p">:[</span><span style="color: #993300;"><span class="s">'Orange'</span><span class="p">,</span><span class="s">'Mango'</span><span class="p">,</span><span class="s">'Grapes'</span><span class="p">,</span><span class="s">'Guava'</span></span><span class="p">],</span><span class="s">'Price'</span> <span class="p">:[</span><span style="color: #993300;"><span class="mi">20</span><span class="p">,</span><span class="mi">50</span><span class="p">,</span><span class="mi">90</span><span class="p">,</span><span class="mi">100</span></span><span class="p">]}</span>

<span class="n">dffruits</span><span class="o">=</span><span class="n">DataFrame</span><span class="p">(</span><span class="n">Myfruits</span><span class="p">)</span>

<span class="n">dffruits</span>

Price fruits 20 Orange 50 Mango 90 Grapes 100 Guava

Reindexing

Create a new Series for colors as index

<span class="n">df</span> <span class="o">=</span> <span class="n">Series</span><span class="p">([</span><span style="color: #993300;"><span class="mi">6</span><span class="p">,</span><span class="mi">4</span><span class="p">,</span><span class="mi">5</span><span class="p">,</span><span class="mi">9</span><span class="p">,</span><span class="mi">7</span></span><span class="p">],</span><span class="n">index</span><span class="o">=</span><span class="p">[</span><span style="color: #993300;"><span class="s">'Red'</span><span class="p">,</span><span class="s">'Blue'</span><span class="p">,</span><span class="s">'Green'</span><span class="p">,</span><span class="s">'Brown'</span><span class="p">,</span><span class="s">'Violet'</span></span><span class="p">])</span> <span class="n">df</span>

Red 6 Blue 4 Green 5 Brown 9 Violet 7 dtype: int64

Re-index and add a new index ‘Pink’ in the existing Series, Therefore the value of index ‘Pink’ is null

<span class="n">df</span><span class="o">.</span><span class="n">reindex</span><span class="p">([</span><span style="color: #993300;"><span class="s">'Red'</span><span class="p">,</span><span class="s">'Blue'</span><span class="p">,</span><span class="s">'Green'</span><span class="p">,</span><span class="s">'Brown'</span><span class="p">,</span><span class="s">'Violet'</span><span class="p">,</span><span class="s">'Pink'</span></span><span class="p">])</span>

Red 6 Blue 4 Green 5 Brown 9 Violet 7 Pink NaN dtype: float64

Fill the null values with a default value

<span class="n">df</span><span class="o">.</span><span class="n">reindex</span><span class="p">([</span><span style="color: #993300;"><span class="s">'Red'</span><span class="p">,</span><span class="s">'Blue'</span><span class="p">,</span><span class="s">'Green'</span><span class="p">,</span><span class="s">'Brown'</span><span class="p">,</span><span class="s">'Violet'</span><span class="p">,</span><span class="s">'Pink'</span></span><span class="p">],</span><span class="n">fill_value</span><span class="o">=</span><span class="mi">10</span><span class="p">)</span>

Red 6 Blue 4 Green 5 Brown 9 Violet 7 Pink 10 dtype: int64

Reindexing rows, columns or both

Create a Dataframe with random integers and index value as defined below and columns also as shown

<span class="n">df</span> <span class="o">=</span> <span class="n">DataFrame</span><span class="p">(</span><span class="n">np</span><span class="o">.</span><span class="n">random</span><span class="o">.</span><span class="n">random_integers</span><span class="p">(</span><span class="mi">25</span><span class="p">,</span><span class="n">size</span><span class="o">=</span><span class="mi">25</span><span class="p">)</span><span class="o">.</span><span class="n">reshape</span><span class="p">((</span><span class="mi">5</span><span class="p">,</span><span class="mi">5</span><span class="p">)),</span><span class="n">index</span><span class="o">=</span><span class="p">[</span><span style="color: #993300;"><span class="s">'A'</span><span class="p">,</span><span class="s">'B'</span><span class="p">,</span><span class="s">'D'</span><span class="p">,</span><span class="s">'E'</span><span class="p">,</span><span class="s">'F'</span></span><span class="p">],</span> <span class="n">columns</span><span class="o">=</span><span class="p">[</span><span style="color: #993300;"><span class="s">'col1'</span><span class="p">,</span><span class="s">'col2'</span><span class="p">,</span><span class="s">'col3'</span><span class="p">,</span><span class="s">'col4'</span><span class="p">,</span><span class="s">'col5'</span></span><span class="p">])</span> <span class="n">df</span>

col1 col2 col3 col4 col5 A 17 3 16 16 18 B 2 24 5 3 15 C 20 22 5 21 16 D 13 13 10 3 20 E 15 18 20 2 2

Summing up columns and rows

Sum along the rows by using axis =1

<span class="n">df</span><span class="o">.</span><span class="n">sum</span><span class="p">(</span><span class="n">axis</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span>

A 70 B 49 D 84 E 59 F 57 dtype: int64

Sum along the rows by using axis=0

<span class="n">dframe</span><span class="o">.</span><span class="n">sum</span><span class="p">(</span><span class="n">axis</span><span class="o">=</span><span class="mi">0</span><span class="p">)</span>

col1 65 col2 60 col3 53 col4 82 col5 77 dtype: int64

Simple Stats

<span class="n">dframe</span><span class="o">.</span><span class="n">describe</span><span class="p">()</span>

col1 col2 col3 col4 col5 5.000000 5 5.000000 5.000000 5.00000 13.000000 12 10.600000 16.400000 15.40000 8.774964 10 5.549775 5.727128 8.70632 1.000000 2 1.000000 9.000000 2.00000 8.000000 3 11.000000 15.000000 12.00000 16.000000 11 13.000000 16.000000 18.00000 16.000000 19 13.000000 17.000000 21.00000 24.000000 25 15.000000 25.000000 24.00000

Find the correlation between each of the variables of the dataframe using ‘corr’ function

<span class="n">corr</span><span class="o">=</span><span class="n">dframe</span><span class="o">.</span><span class="n">corr</span><span class="p">()</span>

Plot the Correlation heatmap matrix using Seaborn – A Data Visualization Library

<span class="kn">import</span> <span class="nn">seaborn</span> <span class="k">as</span> <span class="nn">sns</span> <span class="o">%</span><span class="k">matplotlib</span> inline <span class="n">sns</span><span class="o">.</span><span class="n">heatmap</span><span class="p">(</span><span class="n">corr</span><span class="p">)</span>

Missing Values

<span class="n">df</span> <span class="o">=</span> <span class="n">DataFrame</span><span class="p">([[</span><span style="color: #993300;"><span class="mi">5</span><span class="p">,</span><span class="mi">4</span><span class="p">,</span><span class="mi">3</span><span class="p">,</span><span class="mi">2</span><span class="p">,</span><span class="mi">7</span></span><span class="p">],[</span><span style="color: #993300;"><span class="mi">6</span><span class="p">,</span><span class="mi">3</span><span class="p">,</span><span class="n">np</span><span class="o">.</span><span class="n">NaN</span><span class="p">,</span><span class="n">np</span><span class="o">.</span><span class="n">NAN</span><span class="p">,</span><span class="mi">9</span></span><span class="p">],[</span><span style="color: #993300;"><span class="n">np</span><span class="o">.</span><span class="n">NaN</span><span class="p">,</span><span class="mi">3</span><span class="p">,</span><spn class="mi">4</span><span class="p">,</span><span class="mi">2</span><span class="p">,</span><span class="n">np</span><span class="o">.</span><span class="n">NaN</span></span><span class="p">]])</span> <span class="n">df</span>

0 1 2 3 4 0 5 4 3 2 7 1 6 3 NaN NaN 9 2 NaN 3 4 2 NaN

drop all the columns where null value is available

<span class="n">df1</span><span class="o">=</span><span class="n">df</span><span class="o">.</span><span class="n">dropna</span><span class="p">(</span><span class="n">axis</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span>

<span class="n">df1</span>

|1|0| |— |— | |0|4| |1|3| |2|3|

Show rows with atleast ‘thresh’ no. of values, Since all the rows contains minimum three non null values that’s why all rows are displayed

<span class="n">df2</span><span class="o">=</span><span class="n">df</span><span class="o">.</span><span class="n">dropna</span><span class="p">(</span><span class="n">thresh</span><span class="o">=</span><span class="mi">3</span><span class="p">)</span> <span class="n">df2</span>

0 1 2 3 4 0 5 4 3 2 7 1 6 3 NaN NaN 9 2 NaN 3 4 2 NaN

fillna

fill nulls with values

<span class="n">df</span><span class="o">.</span><span class="n">fillna</span><span class="p">(</span><span class="s">'5'</span><span class="p">)</span>

0 1 2 3 4 0 5 4 3 2 7 1 6 3 5 5 9 2 5 3 4 2 5

Drop rows

<span class="n">dframe</span><span class="o">.</span><span class="n">drop</span><span class="p">(</span><span class="s">'B'</span><span class="p">)</span>

col1 col2 col3 col4 col5 8 2 1 15 18 16 3 11 9 12 16 11 13 25 24 1 25 13 17 21

Selecting only 2 columns

<span class="n">dframe</span><span class="p">[[</span><span class="s">'col1'</span><span class="p">,</span><span class="s">'col2'</span><span class="p">]]</span>

col1 col2 8 2 24 19 16 3 16 11 1 25

Selecting Index

<span class="n">dframe</span><span class="o">.</span><span class="n">ix</span><span class="p">[</span><span class="s">'D'</span><span class="p">]</span>

col1 16 col2 3 col3 11 col4 9 col5 12 Name: D, dtype: int64

Selecting on Condition

<span class="n">dframe</span><span class="p">[</span><span class="n">dframe</span><span class="p">[</span><span class="s">'col1'</span><span class="p">]</span><span class="o">></span><span class="mi">10</span><span class="p">]</span>

Out[207]: col1 col2 col3 col4 col5 B 24 19 15 16 2 D 16 3 11 9 12 E 16 11 13 25 24

Selecting values

<span class="n">df</span><span class="o">=</span><span class="n">Series</span><span class="p">([</span><span style="color: #993300;"><span class="mi">6</span><span class="p">,</span><span class="mi">4</span><span class="p">,</span><span class="mi">3</span><span class="p">,</span><span class="mi">2</span></span><span class="p">],</span><span class="n">index</span><span class="o">=</span><span class="p">[</span><span style="color: #993300;"><span class="s">'A'</span><span class="p">,</span><span class="s">'B'</span><span class="p">,</span><span class="s">'C'</span><span class="p">,</span><span class="s">'D'</span></span><span class="p">])</span> <span class="n">df</span>

A 6 B 4 C 3 D 2 dtype: int64

Select the Index ‘C’ value

<span class="n">df</span><span class="p">[</span><span class="s">'C'</span><span class="p">]</span>

3 1 3

Select First two values

<span class="n">df</span><span class="p">[</span><span class="mi">0</span><span class="p">:</span><span class="mi">2</span><span class="p">]</span>

A 6 B 4 dtype: int64

Sorting

Sort will be done by default on the Indexes, so before sorting index was P,R,S,Q and after sorting it changed to P,Q,R,S

<span class="n">df</span><span class="o">=</span><span class="n">Series</span><span class="p">([</span><span style="color: #993300;"><span class="mi">6</span><span class="p">,</span><span class="mi">4</span><span class="p">,</span><span class="mi">2</span><span class="p">,</span><span class="mi">1</span></span><span class="p">],</span><span class="n">index</span><span class="o">=</span><span class="p">[</span><span style="color: #993300;"><span class="s">'P'</span><span class="p">,</span><span class="s">'R'</span><span class="p">,</span><span class="s">'S'</span><span class="p">,</span><span class="s">'Q'</span></span><span class="p">])</span> <span class="n">df</span>

P 6 R 4 S 2 Q 1 dtype: int64

<span class="n">df</span><span class="o">.</span><span class="n">sort_index</span><span class="p">()</span>

P 6 Q 1 R 4 S 2 dtype: int64

Order by values

<span class="n">df</span><span class="o">.</span><span class="n">sort_values</span><span class="p">()</span>

Q 1 S 2 R 4 P 6 dtype: int64

<span class="n">df</span><span class="o">.</span><span class="n">rank</span><span class="p">()</span>

P 4 R 3 S 2 Q 1 dtype: float64

MultiIndexing

Here we have first index as 1,2,3 and the second index has a,b for all the first index values, as shown here

<span class="n">Sr</span> <span class="o">=</span> <span class="n">Series</span><span class="p">([</span><span style="color: #993300;"><span class="mi">5</span><span class="p">,</span><span class="mi">4</span><span class="p">,</span><span class="mi">3</span><span class="p">,</span><span class="mi">7</span><span class="p">,</span><span class="mi">9</span><span class="p">,</span><span class="mi">6</span></span><span class="p">],</span><span class="n">index</span><span class="o">=</span><span class="p">[[</span><span style="color: #993300;"><span class="mi">1</span><span class="p">,</span><span class="mi">1</span><span class="p">,</span><span class="mi">2</span><span class="p">,</span><span class="mi">2</span><span class="p">,</span><span class="mi">3</span><span class="p">,</span><span class="mi">3</span></span><span class="p">],[</span><span style="color: #993300;"><span class="s">'a'</span><span class="p">,</span><span class="s">'b'</span><span class="p">,</span><span class="s">'a'</span><span class="p">,</span><span class="s">'b'</span><span class="p">,</span><span class="s">'a'</span><span class="p">,</span><span class="s">'b'</span></span><span class="p">]])</span> <span class="n">Sr</span>

1 a 5 b 4 2 a 3 b 7 3 a 9 b 6 dtype: int64

Verify the Index Levels

<span class="n">Sr</span><span class="o">.</span><span class="n">index</span>

MultiIndex(levels=[[<span style="color: #993300;">1, 2, 3</span>], [<span style="color: #993300;">'a', 'b'</span>]], labels=[[<span style="color: #993300;">0, 0, 1, 1, 2, 2</span>], [<span style="color: #993300;">0, 1, 0, 1, 0, 1</span>]])

Get Values by first Index

<span class="n">Sr</span><span class="p">[</span><span class="mi">3</span><span class="p">]</span>

a 9 b 6 dtype: int64

Get Values by Second Index

<span class="n">Sr</span><span class="p">[:,</span><span class="s">'b'</span><span class="p">]</span>

1 4 2 7 3 6 dtype: int64

<span class="n">Sr</span><span class="o">.</span><span class="n">unstack</span><span class="p">()</span>

a b 5 4 3 7 9 6

Group By

Read a CSV file with the data of Booker Prize Winners from 1996-2016

<span class="n">df</span><span class="o">=</span><span class="n">pd</span><span class="o">.</span><span class="n">read_csv</span><span class="p">(</span><span class="s">'<span style="color: #993300;">./Desktop/Reports-Delete/Booker-Prize.csv</span>'</span><span class="p">)</span>

Get the first 3 rows of the dataset

<span class="n">df</span><span class="o">.</span><span class="n">head</span><span class="p">(</span><span class="mi">3</span><span class="p">)</span>

Year Author Title Genre(s) Country Unnamed: 5 Unnamed: 6 1969 P. H. Newby Something to Answer For Novel United Kingdom NaN NaN 1970 Bernice Rubens The Elected Member Novel United Kingdom NaN NaN 1970 J. G. Farrell Troubles Novel United Kingdom

Ireland NaN NaN

Groupby Country, Author & Genre to know who are the Authors won the title more than once and which is the country won the Award maximum number of time

<span class="n">df</span><span class="o">.</span><span class="n">groupby</span><span class="p">([</span><span style="color: #993300;"><span class="s">'Country'</span><span class="p">,</span><span class="s">'Author'</span><span class="p">,</span><span class="s">'Genre(s)'</span></span><span class="p">])</span><span class="o">.</span><span class="n">count</span><span class="p">()</span><span class="o">.</span><span class="n">sort_values</span><span class="p">(</span><span class="n">by</span><span class="o">=</span><span class="p">[</span><span class="s">'<span style="color: #993300;">Year</span>'</span><span class="p">],</span><span class="n">ascending</span><span class="o">=</span><span class="p">[</span><span class="k">False</span><span class="p">])</span><span class="o">.</span><span class="n">head</span><span class="p">(</span><span class="mi">5</span><span class="p">)</span>

Year Title Country Author Genre(s) Australia Peter Carey Historical novel 2 2 United Kingdom

Ireland J. G. Farrell Novel 2 2 United Kingdom Hilary Mantel Historical novel 2 2 South Africa J. M. Coetzee Novel 2 2 United Kingdom Kazuo Ishiguro Historical novel 1 1

Pandas is an amazing library for data analysis and I have shown some of the basic stuffs to learn but to get more hands on with the library and it’s functions, Use Varied set of data and practice a lot. The only mantra to learn Pandas and gain expertise is Practice, Practice & Practice