In today’s tutorial we are going to build on our last blog post by using the function we created to aggregate the ENTIRE history of team performance data. Once the data is collected, we will analyze it primarily using dplyr and wrap up with some ggvis-generated visualizations.

As an ode to a music group near to my heart, I have decided to assign naming rights to this process, which will help you tackle data questions big and small.

This process is somewhat complicated since GitHub uses https. If you have any questions about the functions being used here, or anywhere in this tutorial, I highly recommend that you use R’s fantastic help function by entering ?functionName.

The first thing we need to do after firing up R is load/install the packages we’re going to use.
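A minimal sketch of that load step follows. The package list is an assumption: dplyr and ggvis are named in this post, and pipeR is inferred from the %>>% operator used throughout. The install call is left commented out so nothing is installed without your say-so.

```r
# Assumed package list: dplyr and ggvis are named in the post,
# pipeR is inferred from the %>>% pipe operator used in the code.
needed <- c("dplyr", "ggvis", "pipeR")

# Check which packages are already installed (TRUE/FALSE per package).
installed <- vapply(needed, requireNamespace, logical(1), quietly = TRUE)

missing_pkgs <- needed[!installed]
if (length(missing_pkgs) > 0) {
  message("Not installed: ", paste(missing_pkgs, collapse = ", "))
  # install.packages(missing_pkgs)  # uncomment to install from CRAN
}

# Attach whatever is available.
for (pkg in needed[installed]) {
  library(pkg, character.only = TRUE)
}
```

Checking before installing keeps the script safe to re-run: packages you already have are simply attached.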

How do we go about that, you ask? Though R has multiple options up to this task, today we are going to use the magic of a for loop. Since every season besides 2014-2015 is in the record books, we are going to separate the historic seasons from the current 2014-2015 season and then bind the 2 data frames together; that way we only scrape the historic data once.

After some perusing it looks like the first official season was 1949-1950. Perfect, now we know what to feed our function to make her happy: every season end from 1950 to 2015.
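The loop-and-bind idea can be sketched roughly as below. Note that fetch_season is a hypothetical stand-in that returns a tiny fake table; it is not the real scraper from the last post, whose arguments are not shown here. The point is only the shape of the loop: collect the finished seasons once, then bind on the current season.

```r
# Hypothetical stand-in for the real scraper (which hits the network).
# It returns a tiny fake table so the loop can run offline.
fetch_season <- function(season_end) {
  data.frame(season_end = season_end,
             team = c("Team A", "Team B"),
             stringsAsFactors = FALSE)
}

historic_seasons <- 1950:2014  # finished seasons: collect these once
current_season   <- 2015       # ongoing season: refresh as needed

# Collect one data frame per historic season, then stack them.
historic_list <- vector("list", length(historic_seasons))
for (i in seq_along(historic_seasons)) {
  historic_list[[i]] <- fetch_season(historic_seasons[i])
}
historic_data <- do.call(rbind, historic_list)

# Bind the current season onto the historic data.
current_data <- fetch_season(current_season)
all_data <- rbind(historic_data, current_data)
```

Filling a pre-sized list and binding once at the end avoids growing a data frame inside the loop, which gets slow as the number of seasons rises.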

You should see the function getBREFTeamStatTable in your workspace, but before we can use it we must decide which data we want to investigate. How about the entire history of NBA team statistics?

Time to Analyze

Now that we are in possession of the 1363 row data frame containing every team’s statistical performance since the 1949-50 season it’s time to explore it in order to better understand the data and find interesting ideas through visualization.

One nice way to quickly analyze a bunch of a data frame's numeric data is with the summary function. Let’s use it on our NBA data frame.

    all_nba_team_data %>>% summary  # summary of the variables

    ##     season           table_name        bref_team_id
    ##  Length:1363        Length:1363        Length:1363
    ##  Class :character   Class :character   Class :character
    ##  Mode  :character   Mode  :character   Mode  :character
    ##
    ##      team                 g              mp              fg
    ##  Length:1363        Min.   : 6.0   Min.   : 1465   Min.   : 214
    ##  Class :character   1st Qu.:82.0   1st Qu.:19755   1st Qu.:2932
    ##  Mode  :character   Median :82.0   Median :19805   Median :3179
    ##                     Mean   :78.3   Mean   :19087   Mean   :3087
    ##                     3rd Qu.:82.0   3rd Qu.:19855   3rd Qu.:3491
    ##                     Max.   :82.0   Max.   :20080   Max.   :3980
    ##                     NA's   :1      NA's   :141     NA's   :1
    ##
    ##       fga             fg.              X3p              X3pa
    ##  Min.   : 496   Min.   :0.3100   Min.   : 10.0   Min.   :  75.0
    ##  1st Qu.:6526   1st Qu.:0.4410   1st Qu.:133.8   1st Qu.: 423.5
    ##  Median :6899   Median :0.4570   Median :333.0   Median : 963.5
    ##  Mean   :6784   Mean   :0.4538   Mean   :329.3   Mean   : 943.4
    ##  3rd Qu.:7380   3rd Qu.:0.4748   3rd Qu.:493.2   3rd Qu.:1388.2
    ##  Max.   :9295   Max.   :0.5450   Max.   :891.0   Max.   :2371.0
    ##  NA's   :1      NA's   :1        NA's   :379     NA's   :379
    ##
    ##       X3p.             X2p            X2pa           X2p.
    ##  Min.   :0.1040   Min.   : 168   Min.   : 358   Min.   :0.3100
    ##  1st Qu.:0.3140   1st Qu.:2466   1st Qu.:5249   1st Qu.:0.4550
    ##  Median :0.3430   Median :2785   Median :5996   Median :0.4750
    ##  Mean   :0.3293   Mean   :2849   Mean   :6102   Mean   :0.4682
    ##  3rd Qu.:0.3620   3rd Qu.:3449   3rd Qu.:7197   3rd Qu.:0.4910
    ##  Max.   :0.4280   Max.   :3972   Max.   :9295   Max.   :0.5580
    ##  NA's   :379      NA's   :1      NA's   :1      NA's   :1
    ##
    ##        ft            fta            ft.              orb
    ##  Min.   :  99   Min.   : 124   Min.   :0.6250   Min.   :  57
    ##  1st Qu.:1486   1st Qu.:1984   1st Qu.:0.7310   1st Qu.: 919
    ##  Median :1646   Median :2196   Median :0.7510   Median :1044
    ##  Mean   :1635   Mean   :2183   Mean   :0.7498   Mean   :1022
    ##  3rd Qu.:1836   3rd Qu.:2443   3rd Qu.:0.7700   3rd Qu.:1172
    ##  Max.   :2434   Max.   :3411   Max.   :0.8320   Max.   :1520
    ##  NA's   :1      NA's   :1      NA's   :1        NA's   :260
    ##
    ##       drb            trb            ast            stl
    ##  Min.   : 174   Min.   : 243   Min.   : 115   Min.   :  34.0
    ##  1st Qu.:2335   1st Qu.:3374   1st Qu.:1679   1st Qu.: 585.5
    ##  Median :2440   Median :3533   Median :1854   Median : 655.0
    ##  Mean   :2363   Mean   :3610   Mean   :1812   Mean   : 647.5
    ##  3rd Qu.:2545   3rd Qu.:3763   3rd Qu.:2041   3rd Qu.: 731.5
    ##  Max.   :3074   Max.   :6131   Max.   :2575   Max.   :1059.0
    ##  NA's   :260    NA's   :18     NA's   :1      NA's   :260
    ##
    ##       blk             tov            pf            pts
    ##  Min.   : 21.0   Min.   :  76   Min.   : 101   Min.   :  614
    ##  1st Qu.:347.0   1st Qu.:1175   1st Qu.:1731   1st Qu.: 7820
    ##  Median :399.0   Median :1281   Median :1875   Median : 8340
    ##  Mean   :399.5   Mean   :1279   Mean   :1833   Mean   : 8047
    ##  3rd Qu.:460.0   3rd Qu.:1433   3rd Qu.:2028   3rd Qu.: 8881
    ##  Max.   :716.0   Max.   :2011   Max.   :2470   Max.   :10371
    ##  NA's   :260     NA's   :260    NA's   :1      NA's   :1
    ##
    ##      pts.g        playoff_team     scrape_time
    ##  Min.   : 70.0   Mode :logical   Min.   :2014-11-14 16:36:47
    ##  1st Qu.: 96.6   FALSE:584       1st Qu.:2014-11-14 16:37:02
    ##  Median :102.4   TRUE :779       Median :2014-11-14 16:37:12
    ##  Mean   :102.5   NA's :0         Mean   :2014-11-14 16:37:11
    ##  3rd Qu.:108.6                   3rd Qu.:2014-11-14 16:37:22
    ##  Max.   :126.5                   Max.   :2014-11-14 16:37:31
    ##  NA's   :1
    ##
    ##    season_end
    ##  Min.   :1950
    ##  1st Qu.:1978
    ##  Median :1992
    ##  Mean   :1990
    ##  3rd Qu.:2004
    ##  Max.   :2015

We can now see all sorts of interesting things; for example, the highest ever team field goal percentage was 54.5% [the 1984-85 Lakers]. I wonder if we can quickly find all the unique NBA teams that ever played? Super easy to do in R.

    all_nba_team_data %>>%
      select(team) %>>%   # select the team
      arrange(team) %>>%  # sort it
      unique %>>%         # gives us the unique results
      unlist %>>%         # comes in a list form so need to unlist it
      as.character        # every NBA team

    ##  [1] "Anderson Packers"
    ##  [2] "Atlanta Hawks"
    ##  [3] "Baltimore Bullets"
    ##  [4] "Boston Celtics"
    ##  [5] "Brooklyn Nets"
    ##  [6] "Buffalo Braves"
    ##  [7] "Capital Bullets"
    ##  [8] "Charlotte Bobcats"
    ##  [9] "Charlotte Hornets"
    ## [10] "Chicago Bulls"
    ## [11] "Chicago Packers"
    ## [12] "Chicago Stags"
    ## [13] "Chicago Zephyrs"
    ## [14] "Cincinnati Royals"
    ## [15] "Cleveland Cavaliers"
    ## [16] "Dallas Mavericks"
    ## [17] "Denver Nuggets"
    ## [18] "Detroit Pistons"
    ## [19] "Fort Wayne Pistons"
    ## [20] "Golden State Warriors"
    ## [21] "Houston Rockets"
    ## [22] "Indiana Pacers"
    ## [23] "Indianapolis Olympians"
    ## [24] "Kansas City Kings"
    ## [25] "Kansas City-Omaha Kings"
    ## [26] "Los Angeles Clippers"
    ## [27] "Los Angeles Lakers"
    ## [28] "Memphis Grizzlies"
    ## [29] "Miami Heat"
    ## [30] "Milwaukee Bucks"
    ## [31] "Milwaukee Hawks"
    ## [32] "Minneapolis Lakers"
    ## [33] "Minnesota Timberwolves"
    ## [34] "New Jersey Nets"
    ## [35] "New Orleans Hornets"
    ## [36] "New Orleans Jazz"
    ## [37] "New Orleans Pelicans"
    ## [38] "New Orleans/Oklahoma City Hornets"
    ## [39] "New York Knicks"
    ## [40] "New York Nets"
    ## [41] "Oklahoma City Thunder"
    ## [42] "Orlando Magic"
    ## [43] "Philadelphia 76ers"
    ## [44] "Philadelphia Warriors"
    ## [45] "Phoenix Suns"
    ## [46] "Portland Trail Blazers"
    ## [47] "Rochester Royals"
    ## [48] "Sacramento Kings"
    ## [49] "San Antonio Spurs"
    ## [50] "San Diego Clippers"
    ## [51] "San Diego Rockets"
    ## [52] "San Francisco Warriors"
    ## [53] "Seattle SuperSonics"
    ## [54] "Sheboygan Red Skins"
    ## [55] "St. Louis Bombers"
    ## [56] "St. Louis Hawks"
    ## [57] "Syracuse Nationals"
    ## [58] "Toronto Raptors"
    ## [59] "Tri-Cities Blackhawks"
    ## [60] "Utah Jazz"
    ## [61] "Vancouver Grizzlies"
    ## [62] "Washington Bullets"
    ## [63] "Washington Capitols"
    ## [64] "Washington Wizards"
    ## [65] "Waterloo Hawks"
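For comparison, here is a base R sketch of the same idea on a made-up four-row table (the rows are illustrative, not the scraped data). It is not the pipeline used in this post, just a shorter route to a sorted vector of unique names, plus a count of how many there are.

```r
# Toy stand-in for the team column; the rows are made up for illustration.
toy <- data.frame(
  team = c("Utah Jazz", "Boston Celtics", "Utah Jazz", "Chicago Stags"),
  stringsAsFactors = FALSE
)

# unique() drops repeats, sort() orders alphabetically.
teams <- sort(unique(toy$team))

# length() answers "how many distinct teams are there?"
n_teams <- length(teams)
```

On the real data frame the same two calls on all_nba_team_data$team should reproduce the list above, and the count matches the highest index printed.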

Look at Us, Masters of This Data

I wonder if this arrogant troll knew that or could tell us how many official NBA teams there have been since 1949-50??



Time to Create Some New Variables

As fun as it is to explore the variables we already have in the data frame, we need to move on to creating some of our own. One thing we can compute ourselves is how many points each team scored over the course of the season: games played times points per game, which should closely track the existing pts column. I’d also like to see a metric that captures the number of points per field goal attempt. Seems like a daunting task; it would be in Excel, but we hate Excel and don't need it when we have nuclear weapons like R. Yet again there are countless ways to achieve both of these things in R, but my favorite way is to use dplyr's mutate function.

    all_nba_team_data %>>%
      filter(!is.na(g)) %>>%
      mutate(points_total = g * pts.g,
             points_per_fga = points_total / fga) -> all_nba_team_data

    all_nba_team_data %>>%
      select(team, season, points_total, points_per_fga)  # make sure it worked

    ## Source: local data frame [1,362 x 4]
    ##
    ##                      team    season points_total points_per_fga
    ## 1        Dallas Mavericks 2014-2015        963.9       1.276689
    ## 2  Portland Trail Blazers 2014-2015        948.6       1.224000
    ## 3         Toronto Raptors 2014-2015        948.6       1.278437
    ## 4            Phoenix Suns 2014-2015        838.4       1.243917
    ## 5   Golden State Warriors 2014-2015        838.4       1.332909
    ## 6          Boston Celtics 2014-2015        732.2       1.190569
    ## 7           Brooklyn Nets 2014-2015        831.2       1.272894
    ## 8        Sacramento Kings 2014-2015        934.2       1.326989
    ## 9           Chicago Bulls 2014-2015        932.4       1.307714
    ## 10        Houston Rockets 2014-2015        826.4       1.341558
    ## ..                    ...       ...          ...            ...

Cool like Joe Johnson at the end of a game with the Brooklyn Nets looking for the win, but since we went to all that work to add those new columns we should at least explore them a little. How about trying to find which team had the highest ever points per field goal attempt?

    all_nba_team_data %>>%
      filter(max(points_per_fga) == points_per_fga) %>>%
      select(team, season_end, points_per_fga, playoff_team)

    ## Source: local data frame [1 x 4]
    ##
    ##        team season_end points_per_fga playoff_team
    ## 1 Utah Jazz       1995       1.376369         TRUE

Well, well, well, those 1994-95 Utah Jazz. That was a pretty good team with superstars like Felton Spencer, John Crotty, Blue Edwards, Adam Keefe and this bad ass at age 35.

I don’t know, maybe that team had a few other decent players, just can’t think of them right now! This whole analysis thing has been fun and we have just scratched the surface of what R can do, but we came here to visualize, so let’s do it.

Visualize What You Can’t C

No better way to think about it than those lyrics from the late Tupac Shakur. We can see the data of course, all 44,946 pieces of it, but it’s hard to really see what’s going on in a tabular format. That, ladies and gentlemen, is what data visualization is for.

What Do We Visualize?

Before we get started visualizing we need to decide what exactly we want to explore. I keep hearing about this Daryl Morey character and how he is all into all these crazy things including data science. I also hear rumors that he is encouraging a game-plan centered around shooting tons of three pointers and even tried to formulate his roster around doing that.

Can data analysis tell us anything about this and the history of the three point shot? Of course it can.

Step 1: Filter Down to the Three Point Era

First we need to create a data frame that includes only the seasons since the advent of the Three Point Shot Era. We could use Google to find the answer, but we are learning and we are data scientists, so let’s use R to answer the question of what years cover this era. How can R do this? When we looked at the summary of our data frame we noticed that the X3p., X3p and X3pa columns all contained NA values. Hmmm, maybe that’s the answer: if we filter out the NAs in one of those columns, intuition says that should work.

    all_nba_team_data %>>%
      filter(!is.na(X3pa)) -> era3pt_shot  # create the 3pt era data frame by filtering out NA

Remember that the symbol ! means NOT and is.na is a function to find NA values.

It Worked, JEAH

Looks like R got us the right answer and if you don’t trust R, this should do the trick.
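To read the era's boundaries straight off the filtered data, one option is to ask for the range of season_end once the NA rows are gone. The sketch below uses base R on a made-up four-row table; the attempt counts are illustrative, not real totals.

```r
# Toy table: seasons before the three point shot have NA attempts.
# The attempt counts are made up for illustration.
toy <- data.frame(
  season_end = c(1978, 1979, 1980, 1981),
  X3pa       = c(NA, NA, 227, 407)
)

# Keep only rows where three point attempts were recorded.
era <- toy[!is.na(toy$X3pa), ]

# First and last season end in the filtered data.
era_years <- range(era$season_end)
```

Running range(era3pt_shot$season_end) on the real data frame should give the first and last season ends of the three point era in the same way.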



Step 2: Add a New Discrete Column for the Decade

We have one item to take care of before getting to the best part, and that’s adding a discrete column for the decade, which we intend to utilize to add some spice to our visualizations.

    era3pt_shot$season_end %>>%  # select season end
      cut(breaks = c(0, 1989, 1999, 2009, 2015),  # use cut to tell it the end of each decade
          labels = c('1980s', '1990s', '2000s', '2010s')  # label the new column
      ) -> era3pt_shot$decade
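It is worth checking how cut treats the boundary years. By default its intervals are closed on the right, so a break at 1989 puts season end 1989 in the first bin and 1990 in the next. A small check on hand-picked boundary seasons, using the same breaks and labels as above:

```r
# Boundary seasons on either side of each break.
season_end <- c(1980, 1989, 1990, 1999, 2000, 2009, 2010, 2015)

# Same breaks and labels as the step above; intervals are (lower, upper].
decade <- cut(season_end,
              breaks = c(0, 1989, 1999, 2009, 2015),
              labels = c("1980s", "1990s", "2000s", "2010s"))

# Pair each season with its assigned label for easy inspection.
check <- data.frame(season_end, decade = as.character(decade),
                    stringsAsFactors = FALSE)
```

Keep in mind that the labels follow season_end, so the 1989-90 season (season end 1990) lands in the 1990s bin.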

Step 3A: Get Hyped

Step 3B: Visualization Time

Alright, we have what we need, it’s #DataViz time. Let’s look at a colored Scatter Plot of Total Three Point Shot Attempts against Total Field Goal Attempts. Note that since these variables are not calculated in real terms across the seasons [remember our 2015 season data is ongoing] we must filter out the 2014-15 data in order to keep things consistent and not ruin the visualization.

Time to Admire Our First Visualization

    era3pt_shot %>>%
      filter(season_end != 2015) %>>%  # filter out 2015
      ggvis(x = ~fga, y = ~X3pa, fill = ~factor(decade),
            shape = ~factor(decade), fillOpacity := 0.65) %>>%  # fga vs 3pt attempts colored by decade
      layer_points() %>>%  # ggvis nomenclature for scatter
      add_axis("x", title = "Field Goal Attempts", title_offset = 50) %>>%  # fix x title
      add_axis("y", title = "3PT Shot Attempts", title_offset = 55) %>>%  # fix y title
      add_legend(c("fill", "shape"), title = "Decade")  # clean up the legend


Look at that beautiful ggvis scatter plot. What do we see here? Well, first it looks like back in the 1980s and 1990s teams on average used to take significantly more field goal attempts; we even see one insane outlier, the absolutely DREADFUL 1990-91 Denver Nuggets, who attempted an insane 8,668 field goals [bonus challenge to those up for it: try to write the 2 lines of dplyr I used to figure this out]. In addition to that, it looks like as we move through the decades, despite fewer overall field goal attempts, teams appear to be taking more 3 point shots.

There are all sorts of factors that may have influenced this pattern outside of the magical powers of Daryl Morey. I don’t want to give it away, but if you have time, try to investigate how the NBA has changed the Three Point Shot rules since its adoption in the 1979-80 season. Though we won’t get into it in this post, it looks like the data during the 1990s and the 2010s would be ripe for Clustering Analysis [hint: there may be actual clusters in what we see, or these very apparent clusters may be the result of a different external force].

This visualization is quite interesting, and the more we look at and think about it the more potential investigative topics may come up. But this visualization isn’t fair: it's missing the 2015 season data and it isn't easy to see if there have been trends in how the 3 point shot progressed as the years have passed.

Step 4: Figure Out a Way to Create a Real Variable that Compares All Seasons Equally

I know a way: let’s add a new column that is an apples to apples comparison across all seasons since 1979-80. But what? Hmm… what are potential real variables that could be used to generate something like this? Two good options come to mind, minutes and games played. We could use either, but for the purposes of this next analysis let’s use minutes played. Since we are investigating three point shot attempts, let’s select that and divide by total minutes played.
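Here is a toy illustration of why dividing by minutes puts a finished season and an ongoing one on the same footing. The numbers are made up: both fake teams take 20 threes per game, one over 82 games and one over 10, with 240 team minutes per game assumed.

```r
# Two made-up teams shooting at the same pace:
# 82 games * 240 minutes = 19680, 10 games * 240 minutes = 2400.
toy <- data.frame(
  team = c("Full season", "Partial season"),
  X3pa = c(1640, 200),    # raw totals differ hugely
  mp   = c(19680, 2400),
  stringsAsFactors = FALSE
)

# The per-minute rate removes the effect of season length.
toy$X3pa_per_min <- toy$X3pa / toy$mp
```

The raw totals differ by a factor of about eight, yet the per-minute rates are the same, which is what lets the 2015 season sit on the same axis as the rest.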
    era3pt_shot %>>%
      mutate(X3pa_per_min = X3pa / mp) -> era3pt_shot  # add the new variable

Step 5: Time to Visualize It, the Iggy Azalea Way

Ok, we have the variables we need, let’s get to it. For this visualization we want to look at 3 Point Shot Attempts Per Minute by Season End. We could do this visually in any number of ways, but since I want to get fancy and showcase the power of easy regression analysis in ggvis, thanks to the hard work of the man, the myth, the legend Hadley Wickham, we are going to stick with a scatter plot and layer on a regression line.

    era3pt_shot %>>%
      ggvis(x = ~season_end, y = ~X3pa_per_min,
            fill = ~factor(decade), fillOpacity := 0.45) %>>%
      add_axis("x", title = "",
               values = seq(from = 1980, to = 2015, 1),
               format = "####", title_offset = 50,
               properties = axis_props(
                 ticks = list(stroke = "black"),
                 majorTicks = list(strokeWidth = 2),
                 labels = list(angle = 50, fontSize = 11, align = "left",
                               baseline = "middle", dx = 3))) %>>%
      add_axis("y", title = "3PT Attempts Per Minute",
               title_offset = 55, ticks = 20) %>>%
      layer_points(shape = ~factor(decade)) %>>%
      group_by(decade) %>>%
      layer_model_predictions(model = 'lm', stroke = ~factor(decade), se = TRUE) %>>%
      add_legend(c("fill", "stroke", "shape"), orient = 'right', title = "Decade",
                 properties = legend_props(
                   title = list(fontSize = 12),
                   labels = list(fontSize = 10, dx = 10),
                   symbol = list(stroke = "black", strokeWidth = 1, size = 100)))


Damn That 984 Point Plot is FINE

Look at this amazing regression scatter plot highlighting every team’s 3 point shot attempts per minute since the introduction of the shot. It appears there was certainly an upward trend from the 1980s into the late 1990s and another starting in the mid 2000s through this season. There are a host of potential explanatory factors and other follow-on analyses we could perform based on this viz, some of which may be topics in forthcoming posts, but before we call it a day we want to look at, and visualize, one last item.

When looking at this graph it should become apparent that, with only a few exceptions, each season’s leader in three point shot attempts appears to be higher than the prior season’s leader, signifying a potential trend. In fancy statistical words, this possible trend appears to display linearity. Appearances can deceive though, and we need to thoroughly investigate this hypothesis in keeping with the data ninja code of mathematically proving the possibility of a relationship. There are a number of ways to do this while adhering to this code, all of which R or any good programming language makes easy for you. In this case we are going to use the results of a linear model, the easiest of which to fit in R uses the lm function.

Step 6: Let’s Do It, Linear Regression That Is.

    era3pt_shot %>>%
      group_by(season_end) %>>%  # group by season
      filter(X3pa_per_min == max(X3pa_per_min)) %>>%  # take the max of each year
      (lm(X3pa_per_min ~ season_end, data = .)) %>>%  # apply lm against season
      summary  # let's look at the summary data to see the fit

    ##
    ## Call:
    ## lm(formula = X3pa_per_min ~ season_end, data = .)
    ##
    ## Residuals:
    ##       Min        1Q    Median        3Q       Max
    ## -0.019115 -0.006928 -0.000761  0.002920  0.032767
    ##
    ## Coefficients:
    ##               Estimate Std. Error t value Pr(>|t|)
    ## (Intercept) -6.5236007  0.3623104  -18.01   <2e-16 ***
    ## season_end   0.0033032  0.0001814   18.21   <2e-16 ***
    ## ---
    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    ##
    ## Residual standard error: 0.01131 on 34 degrees of freedom
    ## Multiple R-squared:  0.907,  Adjusted R-squared:  0.9043
    ## F-statistic: 331.7 on 1 and 34 DF,  p-value: < 2.2e-16
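If you want the slope and R-squared as numbers rather than reading them off the printout, they can be pulled out of the fitted model. The sketch below fits lm to six made-up league-leader values (illustrative only, not the scraped data) and extracts both.

```r
# Made-up league-leading rates for six seasons, for illustration only.
leaders <- data.frame(
  season_end   = 1980:1985,
  X3pa_per_min = c(0.020, 0.024, 0.026, 0.031, 0.033, 0.037)
)

# Fit a straight line of rate against season.
fit <- lm(X3pa_per_min ~ season_end, data = leaders)

# Slope: change in attempts per minute per season.
slope <- unname(coef(fit)["season_end"])

# Share of variance in the leaders' rate explained by the line.
r_squared <- summary(fit)$r.squared
```

In the real output above, the slope of 0.0033032 means the league-leading rate rose by roughly 0.0033 attempts per minute each season, and the Multiple R-squared of 0.907 is the figure behind the "fit" claim.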

Good Job Intuition, We Were Right: That’s a 90% Fit to the Data

Our instincts served us well. There is unquestionably a linear relationship over time: the rate of 3 point shot attempts per minute posted by the league-leading team has been going up year by year. What this means for the future we can’t say definitively, and I by no means would extrapolate that this trend will continue, but historically it has. Now for our final visualization let’s look at which teams actually led the league in three point shot attempts per minute.

Step 7: Create A Better “Real” Variable to Plot With

Before we do this: although I like the real statistic we created, it is kind of hard to process what it is actually telling us. Put it this way, could we use it to explain this trend to these basketball loving ladies?

Maybe, but I think we could do better. We know that there are 5 players on the court per team at a time and that there are a minimum of 4 quarters, each 12 minutes long. That means, excluding overtime, there are 240 available minutes per team per game. So we can take our new variable and multiply it by 240 to get a new version that contains the same information but per 240 minutes; essentially it is a roundabout way to nearly mimic three point shot attempts per game, if we were to calculate it. Let’s add our new variable and then create a new data frame containing only the top performing teams by season.

    era3pt_shot %>>%
      mutate(X3pa_per_240_min = X3pa_per_min * 240) %>>%
      group_by(season_end) %>>%
      filter(X3pa_per_min == max(X3pa_per_min)) %>>%
      select(season_end, team, X3pa_per_240_min) -> top_teams
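The 240 figure and the "nearly mimics attempts per game" claim can be checked with a little arithmetic. The team below is made up: 1,500 attempts over 82 games, with 19,805 minutes played (the median mp from the summary earlier, used here only as a plausible value that includes some overtime).

```r
# Regulation minutes available to one team in one game.
players_on_court      <- 5
regulation_minutes    <- 4 * 12                                  # 48
team_minutes_per_game <- players_on_court * regulation_minutes   # 240

# Made-up season: attempts and games are illustrative values.
X3pa  <- 1500
games <- 82
mp    <- 19805  # a bit above 82 * 240 = 19680 because of overtime

per_240_min <- X3pa / mp * team_minutes_per_game
per_game    <- X3pa / games
```

The two results land close together but not on the same value, because overtime adds minutes without adding games; that gap is why the post calls it a near mimic rather than an exact match.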

Step 8: Let’s Visualize These Teams

Now that we have the data and the new variable, let’s quickly visualize it. Since I am a stickler for colors matching the entities they are associated with, I want to find the right colors for each team. Fortunately I’ve already done this and we can just read in the file with this data and add in the 2 missing teams [the San Diego Clippers and the late Seattle SuperSonics, RIP].

    # Bring in Correct Colors
    'https://asbcllc.com/data/NBA/team_colors.csv' %>>%
      read.csv %>>%
      data.frame %>>%
      tbl_df -> active_colors

    data.frame(team = top_teams$team %>>% unique) %>>%
      tbl_df %>>%
      arrange(team) -> teams

    teams %>>%
      merge(active_colors, all.x = TRUE) -> teams

    '#EE2944' -> teams[14, 2]  # add the San Diego Clippers
    '#266A2E' -> teams[15, 2]  # add the Sonics

We are now ready to create our final visualization.

    top_teams %>>%
      ggvis(x = ~season_end,       # season
            y = ~X3pa_per_240_min, # by our 240 stat
            text := ~team,         # plot the team name
            fill = ~team) %>>%     # color the team
      add_axis("x", title = "",    # no title
               values = seq(from = 1980, to = 2015, 1),  # plot the years
               format = "####", title_offset = 50,       # format the axis
               properties = axis_props(
                 ticks = list(stroke = "black"),
                 majorTicks = list(strokeWidth = 2),
                 labels = list(angle = 50, fontSize = 11, align = "left",
                               baseline = "middle", dx = 3))) %>>%
      add_axis("y", title = "3PT Attempts Per 240 Available Minutes",
               format = "####",                      # uses d3 axis format
               title_offset = 55, ticks = 20) %>>%   # fix title and ticks
      layer_text() %>>%                              # adds the text
      scale_nominal(property = 'fill',               # use our pretty colors
                    domain = as.character(teams$team),
                    range = as.character(teams$primary_color)) %>>%
      add_legend("fill", orient = 'right', title = "Team",  # need a pretty legend
                 properties = legend_props(
                   title = list(fontSize = 12),
                   labels = list(fontSize = 10, dx = 10),
                   symbol = list(stroke = "black", strokeWidth = 1, size = 100)))
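One caution on the color step: assigning by row position (rows 14 and 15) only works while the sorted team list stays the same. A sketch of a more defensive alternative, filling the gaps by team name, is below. It uses base R and a made-up lookup table; the Houston hex value is a placeholder, not the value from the real CSV, and only the two hex codes given in this post are taken from it.

```r
# Teams we need colors for (a small subset, for illustration).
teams <- data.frame(
  team = c("Houston Rockets", "San Diego Clippers", "Seattle SuperSonics"),
  stringsAsFactors = FALSE
)

# Made-up lookup table; "#FF0000" is a placeholder color.
active_colors <- data.frame(
  team          = "Houston Rockets",
  primary_color = "#FF0000",
  stringsAsFactors = FALSE
)

# Left join: teams without a match get NA for primary_color.
teams <- merge(teams, active_colors, by = "team", all.x = TRUE)

# Fill the gaps by name rather than by row number.
teams$primary_color[teams$team == "San Diego Clippers"]  <- "#EE2944"
teams$primary_color[teams$team == "Seattle SuperSonics"] <- "#266A2E"
```

Matching on the name means the fix keeps working if another team joins the list and shifts the row order.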

Wow, look at this amazing chart. This clearly validates the linearity we explored earlier and shows that in recent times this Daryl Morey guy has clearly pushed for and structured teams around taking lots of three point shots. What is even more interesting is that there appears to be another candy loving, last minute ETO signing, father of 8 who is nearly universally despised by Nets and Lakers fans and whose presence on a roster, whether by design or chance, appears as a constant in all but 1 of the teams that have led the NBA in 3 point shot attempts per 240 minutes since the 2009-10 season. Any idea who he may be? Here’s a hint.



Wow I never knew there was a team called the Chicago Stags.