Apache Spark has become a common tool in the data scientist’s toolbox, and in this post we show how to use the recently released Spark 2.1 for data analysis using data from the National Basketball Association (NBA). All code and examples from this blog post are available on GitHub.

Analytics have become a major tool in the sports world, and in the NBA in particular, analytics have shaped how the game is played. The league has skewed toward taking more 3-point shots because of their high efficiency as measured by points per field goal attempt. In this post we evaluate and analyze this trend using NBA season statistics going back to 1979 along with geospatial shot chart data. The concepts covered -- data cleansing, visualization, and modeling in Spark -- are general data science concepts applicable to tasks well beyond analyzing sports data. The post concludes with the author's general impressions of using Spark and with tips and suggestions for new users.

For the analyses, we use Python 3 with the Spark Python API (PySpark) to create and analyze Spark DataFrames. We use both the Spark DataFrame's domain-specific language (DSL) and Spark SQL to cleanse and visualize the season data, and we finish by building a simple linear regression model with the spark.ml package -- now Spark's primary machine learning API.

Finally, we note that the analysis in this tutorial can be run either on a distributed Spark setup on a cloud service such as Amazon Web Services (AWS) or on a Spark instance running on a local machine. We have tested both and include resources for getting started on AWS or a local machine at the end of this post.

The Code

Using data from Basketball Reference, we read the season total stats for every player since the 1979-80 season into a Spark DataFrame using PySpark. DataFrames are designed to ease the processing of large amounts of structured tabular data on the Spark infrastructure; in the Scala API, DataFrame is now in fact just a type alias for Dataset[Row].

We can also view the column names of our DataFrame:

print(df.columns)

['_c0', 'player', 'pos', 'age', 'team_id', 'g', 'gs', 'mp', 'fg', 'fga', 'fg_pct', 'fg3', 'fg3a', 'fg3_pct', 'fg2', 'fg2a', 'fg2_pct', 'efg_pct', 'ft', 'fta', 'ft_pct', 'orb', 'drb', 'trb', 'ast', 'stl', 'blk', 'tov', 'pf', 'pts', 'yr']

Using our DataFrame, we can view the top 10 players, sorted by number of points in an individual season. Notice that we use the toPandas function to retrieve our results; its output renders more cleanly for display than that of the take function.

df.orderBy('pts', ascending=False).limit(10).toPandas()[['yr','player','age','pts','fg3']]

yr    player          age  pts   fg3
1987  Jordan,Michael  23   3041  12
1988  Jordan,Michael  24   2868  7
2006  Bryant,Kobe     27   2832  180
1990  Jordan,Michael  26   2753  92
1989  Jordan,Michael  25   2633  27
2014  Durant,Kevin    25   2593  192
1980  Gervin,George   27   2585  32
1991  Jordan,Michael  27   2580  29
1982  Gervin,George   29   2551  10
1993  Jordan,Michael  29   2541  81

Next, using the DataFrame domain-specific language (DSL), we can analyze the average number of 3-point attempts for each season, scaled to the industry-standard per-36-minutes rate (fg3a_p36m). The per-36-minutes metric projects a given player's stats to 36 minutes, an interval corresponding to an approximate full NBA game with adequate rest, while also allowing comparison across players who play different numbers of minutes. We compute this metric from the number of 3-point field goal attempts (fg3a) and minutes played (mp).

Alternatively, we can use Spark SQL to perform the same query using SQL syntax.

Now that we have aggregated our data and computed the average attempts per 36 minutes for each season, we can collect our results into a Pandas DataFrame and plot them using matplotlib. We see a steady rise in the number of 3-point attempts since the shot's introduction in the 1979-80 season, along with a blip during the mid-'90s period when the NBA moved the line in a few feet.

We can fit a linear regression model to this curve to project the number of shot attempts over the next 5 years. Of course, this assumes the rate of increase in attempts is linear, which is likely a naive assumption.

First, we must use the VectorAssembler to transform our data into a single column in which each row of the DataFrame contains a feature vector; this is a requirement of the linear regression API in spark.ml. We build the transformer on our single variable yr and then transform the season total data with it.
We then build our linear regression model object using our transformed data.

yr    fga_pm       fg3a_pm       features  label
1980  13.49321407  0.410089262   [1980.0]  0.410089262
1981  13.15346947  0.3093759891  [1981.0]  0.3093759891
1982  13.20229631  0.3415114296  [1982.0]  0.3415114296
1983  13.30541336  0.3314785517  [1983.0]  0.3314785517
1984  13.14301635  0.3571099981  [1984.0]  0.3571099981

Next, we apply our trained model to the original training set along with 5 years of future data. Covering this time period, we build a new DataFrame, transform it to include a feature vector, and then apply the model to make predictions. We can then plot our results:

Analyzing Geospatial Shot Chart Data

In addition to the season total data, we process and analyze NBA shot charts to see the impact the 3-point revolution has had on shot selection. The shot chart data was originally sourced from NBA.com and contains the xy coordinates of field goal attempts on the court for individual players, the game date, the time of the shot, the shot distance, a shot-made flag, and other fields. We compiled all individual seasons in which a player attempted at least 1,000 field goals, from the 2010-11 season through the 2015-16 season.

As before, we read the CSV data into a Spark DataFrame and preview it.

df.orderBy('game_date').limit(10).toPandas()[['yr','name','game_date','shot_distance','x','y','shot_made_flag']]

yr    name               game_date   shot_distance  x    y    shot_made_flag
2011  LaMarcus Aldridge  2010-10-26  1              4    11   0
2011  Paul Pierce        2010-10-26  25             67   246  1
2011  Paul Pierce        2010-10-26  18             165  83   0
2011  Paul Pierce        2010-10-26  24             159  186  0
2011  Paul Pierce        2010-10-26  24             198  148  1
2011  Paul Pierce        2010-10-26  23             231  4    1
2011  Paul Pierce        2010-10-26  1              -7   9    0
2011  Paul Pierce        2010-10-26  0              -2   -5   1
2011  LaMarcus Aldridge  2010-10-26  21             39   211  0
2011  LaMarcus Aldridge  2010-10-26  8              -82  23   0

We can query an individual player and season and visualize their shot locations. We built a plotting function, plot_shot_chart (see the GitHub repo), based on Savvas Tjortjoglou's example.

As an example, we query and visualize Steph Curry's historic 2015-16 shooting season using a hexbin plot, which is a two-dimensional histogram with hexagonal bins.
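A minimal, self-contained sketch of such a hexbin plot (randomly generated coordinates stand in for the real queried rows, and the court outline drawn by the repo's plot_shot_chart is omitted):

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen so the sketch runs headless
import matplotlib.pyplot as plt
import numpy as np

# Fake shot coordinates in tenths of feet, standing in for a real query
# of a single player-season of attempts.
rng = np.random.default_rng(0)
x = rng.normal(0, 80, size=500)
y = rng.normal(100, 80, size=500)

fig, ax = plt.subplots(figsize=(6, 5))
# A hexbin plot is a 2-D histogram with hexagonal bins; mincnt=1 hides empty bins.
hb = ax.hexbin(x, y, gridsize=30, cmap="Blues", mincnt=1)
fig.colorbar(hb, ax=ax, label="shot attempts")
ax.set_title("Hexbin shot chart (illustrative data)")
fig.savefig("/tmp/shot_chart_hexbin.png")
```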

The shot chart data is rich in information, but it does not label the type of each shot -- for example, whether an attempt is a 3-pointer, or more specifically a corner 3. We solve this by building User Defined Functions (UDFs) that identify the shot type given the xy coordinates of the shot attempt.

Here we define our shot-labeling functions as standard Python functions that use numpy routines.

We then register our UDFs and apply each UDF to the entire dataset to classify each shot type:

We can visualize the change in shot selection over the past 6 years using all of our data from the 2010-11 season through the 2015-16 season. For visualization purposes, we exclude all shot attempts taken inside of 8 feet, since we want to focus on midrange and 3-point shots.

Over the years, there is a notable trend toward more 3-pointers and fewer midrange shots.

Finally, we evaluate shot efficiency as a function of shot distance.

We then plot our results.

Among the top scorers in the league, close 3-point attempts are among the most efficient shots, on par with shots taken right at the basket. It's no wonder that accurate 3-point shooting is one of the most coveted skills in the NBA today!

Conclusion

Lastly, as a seasoned data scientist, SQL user, and Python junkie, I'll offer my two cents on getting started with Spark. The Spark ecosystem and documentation are continually evolving, and it is important to use the newest Spark version. A first-time user will notice there are multiple ways to solve a problem using different languages (Scala, Java, Python, R), different APIs (Resilient Distributed Dataset (RDD), Dataset, DataFrame), and different data manipulation routines (DataFrame DSL, Spark SQL). Many of these choices are left to the user, while others are guided by the documentation. Since Spark 2.0, for example, the DataFrame has been the primary Spark API for Python and R users (rather than the original, and still useful, RDD). In addition, the DataFrame-based spark.ml package is now the primary machine learning API in Spark, replacing the RDD-based API. Bottom line: the platform is evolving, and it pays to stay up to date.

In this post, we’ve demonstrated how to use Apache Spark to accomplish key data science tasks including data exploration, visualization, and model building. These principles are applicable to other data science tasks and datasets, and we encourage you to check out the repository and try it on your own!

Additional Resources