Python has become the language of choice for many statisticians and data scientists. A major driving force behind Python's adoption among data scientists is Pandas, the open-source data analysis library for Python.

Here are some features of the library:

- A fast and efficient DataFrame object for data manipulation with integrated indexing
- Tools for reading and writing data between in-memory data structures and different formats: CSV and text files, Microsoft Excel, SQL databases, and the fast HDF5 format
- Flexible reshaping and pivoting of data sets
- Intelligent label-based slicing, fancy indexing, and subsetting of large data sets
- Columns can be inserted and deleted from data structures for size mutability
- Time series functionality: date range generation and frequency conversion
- Aggregating or transforming data with a powerful group by engine allowing split-apply-combine operations on data sets
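The last feature, split-apply-combine with group by, can be sketched in a few lines. This is a minimal, hypothetical example: the "region" labels are invented here purely to illustrate grouping, and are not part of the insurance data set used below.

```python
import pandas as pd

# Hypothetical mini data set: 2013 rates for a few states, with an
# invented "region" label used only to illustrate split-apply-combine.
df = pd.DataFrame({
    "state": ["Alaska", "Alabama", "Arkansas", "Arizona"],
    "region": ["West", "South", "South", "West"],
    "rate_2013": [1455, 1667, 1545, 1227],
})

# Split the rows by region, apply a mean to each group,
# and combine the results into one value per region.
mean_by_region = df.groupby("region")["rate_2013"].mean()
print(mean_by_region)
```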

Based on these features, we have given below some examples of using Pandas on home insurance rates data from vHomeInsurance.



The base data set for exploring Pandas is home insurance rates for US states.

We assume the input file to be: US_Homeinsurance_Rates.csv

This file contains state-wise home insurance rates from 1999 to 2013. A sample of the data set is given below, along with a summary chart for 2013.

state                 State Short  1999  2000  2001  2002  2003  2004  2005  2006  2007  2008  2013
Alaska                AK            601   607   614   668   731   810   837   850   861   856  1455
Alabama               AL            457   476   501   533   681   793   847   894   905   845  1667
Arkansas              AR            501   494   537   616   721   768   775   802   762   788  1545
Arizona               AZ            418   438   460   543   614   642   635   640   634   628  1227
California            CA            578   592   599   660   753   835   895   937   925   911  1819
Colorado              CO            559   571   595   660   762   811   807   813   826   842  1271
Connecticut           CT            546   571   608   652   714   777   823   878   929   980  1723
District of Columbia  DC            617   635   655   697   806   894   963  1012  1089   926  1574
Delaware              DE            317   329   358   390   442   488   498   530   559   535  1586
Florida               FL            657   666   709   786   810   929  1083  1386  1534  1390  1364
Georgia               GA            448   458   471   517   570   635   672   703   724   749  2155
Hawaii                HI            606   554   553   565   687   726   837   776   850   862  1304
Iowa                  IA            347   358   382   450   542   575   594   596   610   612  1028
Idaho                 ID            319   326   334   382   433   448   457   477   422   387   969
Illinois              IL            387   410   428   516   610   659   660   674   700   628  1322

To start working with the data, we first import the pandas library:

import pandas as pd

Next, we can read the data from the CSV file using the read_csv function in pandas; to view a quick summary, the head function is useful.

Insurance_rates = pd.read_csv("US_Homeinsurance_Rates.csv", delimiter=",")
Insurance_rates.head()

Output of head

state       State Short  1999  2000  2001  2002  2003  2004  2005  2006  2007  2008  2013
Alaska      AK            601   607   614   668   731   810   837   850   861   856  1455
Alabama     AL            457   476   501   533   681   793   847   894   905   845  1667
Arkansas    AR            501   494   537   616   721   768   775   802   762   788  1545
Arizona     AZ            418   438   460   543   614   642   635   640   634   628  1227
California  CA            578   592   599   660   753   835   895   937   925   911  1819
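A self-contained version of this reading step can be sketched with io.StringIO standing in for the actual file, using a few rows and columns from the table above; this is only for reproducibility when the CSV file is not at hand.

```python
import io
import pandas as pd

# A few rows from the sample table, inlined so the snippet runs
# without the actual US_Homeinsurance_Rates.csv file.
csv_data = """state,State Short,2003,2013
Alaska,AK,731,1455
Alabama,AL,681,1667
Arkansas,AR,721,1545
Arizona,AZ,614,1227
California,CA,753,1819
"""

# read_csv accepts any file-like object, not just a path.
Insurance_rates = pd.read_csv(io.StringIO(csv_data))
print(Insurance_rates.head())
```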

Once we are done with a quick read of the data, we can start exploring it. Here is an example of a simple sort.

Syntax of sort_values (the current replacement for the older DataFrame.sort, which has been removed from pandas): DataFrame.sort_values(by, axis=0, ascending=True, inplace=False, kind='quicksort', na_position='last')

Now we will sort the data by the "2013" column:

Insurance_rates.sort_values(['2013', 'state'], ascending=[True, False])

Output:

Maine           ME  336  352  372  416  462  513  553  573  596  572   827
Vermont         VT  414  426  451  493  552  608  646  677  704  650   873
Idaho           ID  319  326  334  382  433  448  457  477  422  387   969
Iowa            IA  347  358  382  450  542  575  594  596  610  612  1028
North Carolina  NC  427  449  484  527  576  623  644  649  674  683  1085

Based on the above sorting, you will find that North Carolina's home insurance rate was the 5th cheapest in the US.
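The same ranking can be checked with pandas' nsmallest, which returns the n rows with the lowest values in a column without sorting the whole frame. A small sketch, using only the rows shown in the sorted output above:

```python
import pandas as pd

# Rows taken from the sorted output above.
rates = pd.DataFrame({
    "state": ["Maine", "Vermont", "Idaho", "Iowa", "North Carolina"],
    "2013": [827, 873, 969, 1028, 1085],
})

# nsmallest(5, "2013") keeps the 5 rows with the lowest 2013 rates,
# ordered cheapest first.
cheapest = rates.nsmallest(5, "2013")
print(cheapest)
```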

We can move on to a relatively powerful statistical summarization function called describe, which summarizes the data set with key metrics such as count, min, max, mean, and quartiles.

Insurance_rates.describe()

Output:

             1999        2000        2001         2002         2003
count   51.000000   51.000000   51.000000    51.000000    51.000000
mean   463.745098  477.823529  502.509804   562.705882   643.313725
std    114.516696  113.403828  119.359017   144.584411   154.357311
min    266.000000  287.000000  308.000000   340.000000   433.000000
25%    387.500000  403.500000  424.500000   478.500000   551.500000
50%    441.000000  451.000000  480.000000   536.000000   621.000000
75%    529.500000  548.000000  566.500000   613.500000   717.500000
max    861.000000  880.000000  955.000000  1238.000000  1328.000000

             2004         2005         2006         2007         2008
count   51.000000    51.000000    51.000000    51.000000           51
mean   700.725490   730.078431   758.000000   778.000000          757
std    164.628197   176.989925   203.835326   231.311219          217
min    448.000000   457.000000   477.000000   422.000000          387
25%    604.500000   631.500000   639.000000   646.500000          620
50%    659.000000   682.000000   706.000000   721.000000          703
75%    773.000000   831.500000   858.500000   882.500000   859.000000
max   1362.000000  1372.000000  1409.000000  1534.000000  1460.000000

             2013
count   51.000000
mean  1479.000000
std    359.323698
min    827.000000
25%   1287.500000
50%   1431.000000
75%   1631.500000
max   2699.000000

To find the median and mean, here are the commands:

Insurance_rates.median()
Insurance_rates.mean()

Output:

Median:

1999     441
2000     451
2001     480
2002     536
2003     621
2004     659
2005     682
2006     706
2007     721
2008     703
2013    1431

Mean:

1999     463.745098
2000     477.823529
2001     502.509804
2002     562.705882
2003     643.313725
2004     700.725490
2005     730.078431
2006     758.000000
2007     778.000000
2008     757.509804
2013    1479.000000
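Both statistics can also be computed in one call with agg, which applies several reductions at once and returns one row per statistic. A minimal sketch, using a small numeric frame standing in for the year columns of the full data:

```python
import pandas as pd

# A small numeric frame standing in for the year columns
# of the full insurance data set.
rates = pd.DataFrame({
    "2003": [731, 681, 721, 614],
    "2013": [1455, 1667, 1545, 1227],
})

# agg applies several reductions at once; the result has one
# row per statistic and one column per input column.
summary = rates.agg(["mean", "median"])
print(summary)
```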

We can attempt a more complicated problem of finding the percentage change from 2003 to 2013. Executing a query like this in SQL would have required several steps and calculation of many base values before getting a final answer. In Pandas, the query is just two lines!

To find the percentage change from 2003 to 2013:

We can easily add new columns to a DataFrame: just write DataFrame['column_name'] = value.

Insurance_rates['Percentage Change'] = (Insurance_rates['2013'] - Insurance_rates['2003']) * 100 / Insurance_rates['2003']
print(Insurance_rates.loc[:, ['State Short', '2003', '2013', 'Percentage Change']])

Output

   State Short  2003  2013  Percentage Change
0           AK   731  1455          99.042408
1           AL   681  1667         144.787078
2           AR   721  1545         114.285714
3           AZ   614  1227          99.837134
4           CA   753  1819         141.567065
5           CO   762  1271          66.797900
6           CT   714  1723         141.316527
7           DC   806  1574          95.285360
8           DE   442  1586         258.823529
9           FL   810  1364          68.395062
10          GA   570  2155         278.070175
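The new column is an ordinary column, so we can sort on it to find the states with the biggest increases. A sketch, recomputing the percentage change on a few rows from the output above:

```python
import pandas as pd

# A few rows from the output above.
result = pd.DataFrame({
    "State Short": ["AK", "AL", "DE", "GA"],
    "2003": [731, 681, 442, 570],
    "2013": [1455, 1667, 1586, 2155],
})

# Recompute the percentage change, then rank states by it,
# biggest increase first.
result["Percentage Change"] = (result["2013"] - result["2003"]) * 100 / result["2003"]
biggest = result.sort_values("Percentage Change", ascending=False)
print(biggest.head())
```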

The above examples illustrate that, if you are a data scientist or analyst comfortable with Python, Pandas can quickly become a must-have tool.

This is a guest blog post by vHomeInsurance.com.