UPDATE: If you're interested in learning pandas from a SQL perspective and would prefer to watch a video, you can find video of my 2014 PyData NYC talk here.

This is part three of a three part introduction to pandas, a Python library for data analysis. The tutorial is primarily geared towards SQL users, but is useful for anyone wanting to get started with the library.

Part 1: Intro to pandas data structures

Part 2: Working with DataFrames

Part 3: Using pandas with the MovieLens dataset

In [61]: chicago [ chicago . department == "LAW" ][: 5 ] Out[61]: name title department salary dept_rank 21971 PATTON, STEPHEN R CORPORATION COUNSEL LAW 173664 1 6311 DARLING, LESLIE M FIRST ASST CORPORATION COUNSEL LAW 149160 2 17680 MARTINICO, JOSEPH P CHIEF LABOR NEGOTIATOR LAW 144036 3 22357 PETERS, LYNDA A CITY PROSECUTOR LAW 139932 4 31383 WONG JR, EDWARD J DEPUTY CORPORATION COUNSEL LAW 137076 5

We can now see where each employee ranks within their department based on salary.

Using pandas on the MovieLens dataset¶

To show pandas in a more "applied" sense, let's use it to answer some questions about the MovieLens dataset. Recall that we've already read our data into DataFrames and merged it.

In [62]: # pass in column names for each CSV u_cols = [ 'user_id' , 'age' , 'sex' , 'occupation' , 'zip_code' ] users = pd . read_csv ( 'ml-100k/u.user' , sep = '|' , names = u_cols , encoding = 'latin-1' ) r_cols = [ 'user_id' , 'movie_id' , 'rating' , 'unix_timestamp' ] ratings = pd . read_csv ( 'ml-100k/u.data' , sep = ' \t ' , names = r_cols , encoding = 'latin-1' ) # the movies file contains columns indicating the movie's genres # let's only load the first five columns of the file with usecols m_cols = [ 'movie_id' , 'title' , 'release_date' , 'video_release_date' , 'imdb_url' ] movies = pd . read_csv ( 'ml-100k/u.item' , sep = '|' , names = m_cols , usecols = range ( 5 ), encoding = 'latin-1' ) # create one merged DataFrame movie_ratings = pd . merge ( movies , ratings ) lens = pd . merge ( movie_ratings , users )

What are the 25 most rated movies?

In [63]: most_rated = lens . groupby ( 'title' ) . size () . sort_values ( ascending = False )[: 25 ] most_rated Out[63]: title Star Wars (1977) 583 Contact (1997) 509 Fargo (1996) 508 Return of the Jedi (1983) 507 Liar Liar (1997) 485 English Patient, The (1996) 481 Scream (1996) 478 Toy Story (1995) 452 Air Force One (1997) 431 Independence Day (ID4) (1996) 429 Raiders of the Lost Ark (1981) 420 Godfather, The (1972) 413 Pulp Fiction (1994) 394 Twelve Monkeys (1995) 392 Silence of the Lambs, The (1991) 390 Jerry Maguire (1996) 384 Chasing Amy (1997) 379 Rock, The (1996) 378 Empire Strikes Back, The (1980) 367 Star Trek: First Contact (1996) 365 Back to the Future (1985) 350 Titanic (1997) 350 Mission: Impossible (1996) 344 Fugitive, The (1993) 336 Indiana Jones and the Last Crusade (1989) 331 dtype: int64

There's a lot going on in the code above, but it's very idomatic. We're splitting the DataFrame into groups by movie title and applying the size method to get the count of records in each group. Then we order our results in descending order and limit the output to the top 25 using Python's slicing syntax. In SQL, this would be equivalent to: SELECT title, count(1) FROM lens GROUP BY title ORDER BY 2 DESC LIMIT 25; Alternatively, pandas has a nifty value_counts method - yes, this is simpler - the goal above was to show a basic groupby example.

In [64]: lens . title . value_counts ()[: 25 ] Out[64]: Star Wars (1977) 583 Contact (1997) 509 Fargo (1996) 508 Return of the Jedi (1983) 507 Liar Liar (1997) 485 English Patient, The (1996) 481 Scream (1996) 478 Toy Story (1995) 452 Air Force One (1997) 431 Independence Day (ID4) (1996) 429 Raiders of the Lost Ark (1981) 420 Godfather, The (1972) 413 Pulp Fiction (1994) 394 Twelve Monkeys (1995) 392 Silence of the Lambs, The (1991) 390 Jerry Maguire (1996) 384 Chasing Amy (1997) 379 Rock, The (1996) 378 Empire Strikes Back, The (1980) 367 Star Trek: First Contact (1996) 365 Titanic (1997) 350 Back to the Future (1985) 350 Mission: Impossible (1996) 344 Fugitive, The (1993) 336 Indiana Jones and the Last Crusade (1989) 331 Name: title, dtype: int64

Which movies are most highly rated?

In [65]: movie_stats = lens . groupby ( 'title' ) . agg ({ 'rating' : [ np . size , np . mean ]}) movie_stats . head () Out[65]: rating size mean title 'Til There Was You (1997) 9 2.333333 1-900 (1994) 5 2.600000 101 Dalmatians (1996) 109 2.908257 12 Angry Men (1957) 125 4.344000 187 (1997) 41 3.024390

We can use the agg method to pass a dictionary specifying the columns to aggregate (as keys) and a list of functions we'd like to apply. Let's sort the resulting DataFrame so that we can see which movies have the highest average score.

In [66]: # sort by rating average movie_stats . sort_values ([( 'rating' , 'mean' )], ascending = False ) . head () Out[66]: rating size mean title They Made Me a Criminal (1939) 1 5 Marlene Dietrich: Shadow and Light (1996) 1 5 Saint of Fort Washington, The (1993) 2 5 Someone Else's America (1995) 1 5 Star Kid (1997) 3 5

Because movie_stats is a DataFrame, we use the sort method - only Series objects use order . Additionally, because our columns are now a MultiIndex, we need to pass in a tuple specifying how to sort. The above movies are rated so rarely that we can't count them as quality films. Let's only look at movies that have been rated at least 100 times.

In [67]: atleast_100 = movie_stats [ 'rating' ][ 'size' ] >= 100 movie_stats [ atleast_100 ] . sort_values ([( 'rating' , 'mean' )], ascending = False )[: 15 ] Out[67]: rating size mean title Close Shave, A (1995) 112 4.491071 Schindler's List (1993) 298 4.466443 Wrong Trousers, The (1993) 118 4.466102 Casablanca (1942) 243 4.456790 Shawshank Redemption, The (1994) 283 4.445230 Rear Window (1954) 209 4.387560 Usual Suspects, The (1995) 267 4.385768 Star Wars (1977) 583 4.358491 12 Angry Men (1957) 125 4.344000 Citizen Kane (1941) 198 4.292929 To Kill a Mockingbird (1962) 219 4.292237 One Flew Over the Cuckoo's Nest (1975) 264 4.291667 Silence of the Lambs, The (1991) 390 4.289744 North by Northwest (1959) 179 4.284916 Godfather, The (1972) 413 4.283293

Those results look realistic. Notice that we used boolean indexing to filter our movie_stats frame. We broke this question down into many parts, so here's the Python needed to get the 15 movies with the highest average rating, requiring that they had at least 100 ratings: movie_stats = lens . groupby ( 'title' ) . agg ({ 'rating' : [ np . size , np . mean ]}) atleast_100 = movie_stats [ 'rating' ] . size >= 100 movie_stats [ atleast_100 ] . sort_values ([( 'rating' , 'mean' )], ascending = False )[: 15 ] The SQL equivalent would be: SELECT title, COUNT(1) size, AVG(rating) mean FROM lens GROUP BY title HAVING COUNT(1) >= 100 ORDER BY 3 DESC LIMIT 15;

Limiting our population going forward Going forward, let's only look at the 50 most rated movies. Let's make a Series of movies that meet this threshold so we can use it for filtering later.

In [68]: most_50 = lens . groupby ( 'movie_id' ) . size () . sort_values ( ascending = False )[: 50 ]

The SQL to match this would be: CREATE TABLE most_50 AS ( SELECT movie_id, COUNT(1) FROM lens GROUP BY movie_id ORDER BY 2 DESC LIMIT 50 ); This table would then allow us to use EXISTS, IN, or JOIN whenever we wanted to filter our results. Here's an example using EXISTS: SELECT * FROM lens WHERE EXISTS (SELECT 1 FROM most_50 WHERE lens.movie_id = most_50.movie_id);

Which movies are most controversial amongst different ages?

Let's look at how these movies are viewed across different age groups. First, let's look at how age is distributed amongst our users.

In [69]: users . age . plot . hist ( bins = 30 ) plt . title ( "Distribution of users' ages" ) plt . ylabel ( 'count of users' ) plt . xlabel ( 'age' );

pandas' integration with matplotlib makes basic graphing of Series/DataFrames trivial. In this case, just call hist on the column to produce a histogram. We can also use matplotlib.pyplot to customize our graph a bit (always label your axes).

Binning our users I don't think it'd be very useful to compare individual ages - let's bin our users into age groups using pandas.cut .

In [70]: labels = [ '0-9' , '10-19' , '20-29' , '30-39' , '40-49' , '50-59' , '60-69' , '70-79' ] lens [ 'age_group' ] = pd . cut ( lens . age , range ( 0 , 81 , 10 ), right = False , labels = labels ) lens [[ 'age' , 'age_group' ]] . drop_duplicates ()[: 10 ] Out[70]: age age_group 0 60 60-69 397 21 20-29 459 33 30-39 524 30 30-39 782 23 20-29 995 29 20-29 1229 26 20-29 1664 31 30-39 1942 24 20-29 2270 32 30-39

pandas.cut allows you to bin numeric data. In the above lines, we first created labels to name our bins, then split our users into eight bins of ten years (0-9, 10-19, 20-29, etc.). Our use of right=False told the function that we wanted the bins to be exclusive of the max age in the bin (e.g. a 30 year old user gets the 30s label). Now we can now compare ratings across age groups.

In [71]: lens . groupby ( 'age_group' ) . agg ({ 'rating' : [ np . size , np . mean ]}) Out[71]: rating size mean age_group 0-9 43 3.767442 10-19 8181 3.486126 20-29 39535 3.467333 30-39 25696 3.554444 40-49 15021 3.591772 50-59 8704 3.635800 60-69 2623 3.648875 70-79 197 3.649746

Young users seem a bit more critical than other age groups. Let's look at how the 50 most rated movies are viewed across each age group. We can use the most_50 Series we created earlier for filtering.

In [72]: lens . set_index ( 'movie_id' , inplace = True )

In [73]: by_age = lens . loc [ most_50 . index ] . groupby ([ 'title' , 'age_group' ]) by_age . rating . mean () . head ( 15 ) Out[73]: title age_group Air Force One (1997) 10-19 3.647059 20-29 3.666667 30-39 3.570000 40-49 3.555556 50-59 3.750000 60-69 3.666667 70-79 3.666667 Alien (1979) 10-19 4.111111 20-29 4.026087 30-39 4.103448 40-49 3.833333 50-59 4.272727 60-69 3.500000 70-79 4.000000 Aliens (1986) 10-19 4.050000 Name: rating, dtype: float64

Notice that both the title and age group are indexes here, with the average rating value being a Series. This is going to produce a really long list of values. Wouldn't it be nice to see the data as a table? Each title as a row, each age group as a column, and the average rating in each cell. Behold! The magic of unstack !

In [80]: by_age . rating . mean () . unstack ( 1 ) . fillna ( 0 )[ 10 : 20 ] Out[80]: age_group 0-9 10-19 20-29 30-39 40-49 50-59 60-69 70-79 title E.T. the Extra-Terrestrial (1982) 0 3.680000 3.609091 3.806818 4.160000 4.368421 4.375000 0.000000 Empire Strikes Back, The (1980) 4 4.642857 4.311688 4.052083 4.100000 3.909091 4.250000 5.000000 English Patient, The (1996) 5 3.739130 3.571429 3.621849 3.634615 3.774648 3.904762 4.500000 Fargo (1996) 0 3.937500 4.010471 4.230769 4.294118 4.442308 4.000000 4.333333 Forrest Gump (1994) 5 4.047619 3.785714 3.861702 3.847826 4.000000 3.800000 0.000000 Fugitive, The (1993) 0 4.320000 3.969925 3.981481 4.190476 4.240000 3.666667 0.000000 Full Monty, The (1997) 0 3.421053 4.056818 3.933333 3.714286 4.146341 4.166667 3.500000 Godfather, The (1972) 0 4.400000 4.345070 4.412844 3.929412 4.463415 4.125000 0.000000 Groundhog Day (1993) 0 3.476190 3.798246 3.786667 3.851064 3.571429 3.571429 4.000000 Independence Day (ID4) (1996) 0 3.595238 3.291429 3.389381 3.718750 3.888889 2.750000 0.000000

unstack , well, unstacks the specified level of a MultiIndex (by default, groupby turns the grouped field into an index - since we grouped by two fields, it became a MultiIndex). We unstacked the second index (remember that Python uses 0-based indexes), and then filled in NULL values with 0. If we would have used: by_age . rating . mean () . unstack ( 0 ) . fillna ( 0 ) We would have had our age groups as rows and movie titles as columns.

Which movies do men and women most disagree on? EDIT: I realized after writing this question that Wes McKinney basically went through the exact same question in his book. It's a good, yet simple example of pivot_table, so I'm going to leave it here. Seriously though, go buy the book. Think about how you'd have to do this in SQL for a second. You'd have to use a combination of IF/CASE statements with aggregate functions in order to pivot your dataset. Your query would look something like this: SELECT title, AVG(IF(sex = 'F', rating, NULL)), AVG(IF(sex = 'M', rating, NULL)) FROM lens GROUP BY title; Imagine how annoying it'd be if you had to do this on more than two columns. DataFrame's have a pivot_table method that makes these kinds of operations much easier (and less verbose).

In [75]: lens . reset_index ( 'movie_id' , inplace = True )

In [76]: pivoted = lens . pivot_table ( index = [ 'movie_id' , 'title' ], columns = [ 'sex' ], values = 'rating' , fill_value = 0 ) pivoted . head () Out[76]: sex F M movie_id title 1 Toy Story (1995) 3.789916 3.909910 2 GoldenEye (1995) 3.368421 3.178571 3 Four Rooms (1995) 2.687500 3.108108 4 Get Shorty (1995) 3.400000 3.591463 5 Copycat (1995) 3.772727 3.140625

In [77]: pivoted [ 'diff' ] = pivoted . M - pivoted . F pivoted . head () Out[77]: sex F M diff movie_id title 1 Toy Story (1995) 3.789916 3.909910 0.119994 2 GoldenEye (1995) 3.368421 3.178571 -0.189850 3 Four Rooms (1995) 2.687500 3.108108 0.420608 4 Get Shorty (1995) 3.400000 3.591463 0.191463 5 Copycat (1995) 3.772727 3.140625 -0.632102

In [78]: pivoted . reset_index ( 'movie_id' , inplace = True )

In [79]: disagreements = pivoted [ pivoted . movie_id . isin ( most_50 . index )][ 'diff' ] disagreements . sort_values () . plot ( kind = 'barh' , figsize = [ 9 , 15 ]) plt . title ( 'Male vs. Female Avg. Ratings

(Difference > 0 = Favored by Men)' ) plt . ylabel ( 'Title' ) plt . xlabel ( 'Average Rating Difference' );

Of course men like Terminator more than women. Independence Day though? Really?

Closing

This is the point where I finally wrap this tutorial up. Hopefully I've covered the basics well enough to pique your interest and help you get started with the library. If I've missed something critical, feel free to let me know on Twitter or in the comments - I'd love constructive feedback.