Methods

The methods described below summarize, in layman’s terms, the applications and machine learning algorithms used in this study. All programming was done in Python 2.7.

Step 1: Data Acquisition with Selenium

Selenium is a popular browser automation tool that is often used for web scraping. For this study, I used Selenium to scrape my data from Basketball-Reference.com.
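A minimal sketch of what that scraping step can look like is below; the Firefox driver, the example player page, and the use of pandas’ HTML table parser are assumptions for illustration, not necessarily the exact code behind this study.

from selenium import webdriver
import pandas as pd

# Load a player page in a real browser; Basketball-Reference builds some
# tables with JavaScript that a plain HTTP request would miss.
driver = webdriver.Firefox()
driver.get("http://www.basketball-reference.com/players/w/westbru01.html")
tables = pd.read_html(driver.page_source)  # parse every HTML stat table on the page
driver.quit()
print(len(tables))  # number of stat tables recovered from the page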

[Figure: Can you guess who this player is? (http://www.basketball-reference.com/players/w/westbru01.html)]

To best define a player, I characterized each player by his career statistics from the Per-100 Possessions, Advanced Metrics, and Shooting Metrics tables. Per-game statistics (e.g. points per game) and cumulative statistics (e.g. total points) can be misleading in this kind of analysis: per-game numbers are skewed by minutes and pace, and career totals inflate players with lengthier careers. To deal with small-sample outliers, I instituted a minimum threshold of 40 games played.
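As an illustration, the threshold reduces to a one-line filter over a DataFrame of scraped career statistics; the file name and the "G" (games played) column are hypothetical.

import pandas as pd

players = pd.read_csv("nba_players.csv")  # hypothetical file of scraped career stats
players = players[players["G"] >= 40]     # keep only players with at least 40 games
print(len(players))                       # players remaining after the filter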

Prior to analysis, my data consisted of 547 players and 56 features (or dimensions) spanning 2014 to 2017. While this was definitely a small sample size, my goal was to uncover the various positions in today’s NBA rather than to compare today’s NBA players with those from stylistically different generations.

Note: The 2016–2017 data includes everything up until the NBA All-Star Break.

Step 2: Dimensionality Reduction with Linear Discriminant Analysis

[Figure: As dimensions increase, the available data becomes more and more sparse. (http://www.visiondummy.com/2014/04/curse-dimensionality-affect-classification/)]

In high-dimensional data, the volume of the space grows so quickly that the available data becomes sparse. This is known as the “Curse of Dimensionality,” and it is problematic for any method that requires statistical significance. In this study, each dimension is a player feature statistic (e.g. PER, TS%, 3P%), so to obtain a statistically sound result, the number of dimensions must be reduced to a small set of components.

Linear Discriminant Analysis (LDA) is a method used in statistics and machine learning to find a linear combination of features that characterizes or separates classes of objects. Put simply, LDA attempts to find a feature subspace that maximizes class separability. In this case, I used each player’s current position (i.e. point guard, shooting guard, small forward, power forward, or center) as the prior class. LDA then found the linear combination of features that best separated the five classes and projected the data down to two dimensions. While Principal Component Analysis (PCA) is also a method for dimensionality reduction, two LDA components captured 71.85% of the variance in the data, whereas two PCA components captured only 54.46%.
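A sketch of this reduction with scikit-learn, continuing the hypothetical players DataFrame from the earlier sketch (the "Player" and "Pos" column names are assumptions):

from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# every column other than the player's name and listed position is
# treated as a numeric feature
feature_cols = [c for c in players.columns if c not in ("Player", "Pos")]
X = players[feature_cols].values
y = players["Pos"].values  # prior class: PG, SG, SF, PF, or C

lda = LinearDiscriminantAnalysis(n_components=2)
X_lda = lda.fit_transform(X, y)             # supervised projection down to 2-D
print(lda.explained_variance_ratio_.sum())  # ~0.7185 reported in this study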

Step 3: Clustering the Data with KMeans

[Figure: k-means clustering on a generated data set (http://rossfarrelly.blogspot.com/2012/12/k-means-clustering.html)]

KMeans Clustering is a simple and popular clustering algorithm that finds the cluster centers that best represent certain regions of the data. The algorithm alternates between assigning each data point to the closest cluster center and then setting each cluster center as the mean of the data points that are assigned to it. The algorithm finishes when the assignment of instances to clusters no longer changes.
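A bare-bones NumPy version of that alternation, as a sketch of the algorithm itself rather than this study’s actual implementation:

import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.RandomState(seed)
    # initialize the centers as k randomly chosen data points
    centers = X[rng.choice(len(X), k, replace=False)]
    for _ in range(n_iter):
        # assignment step: each point joins its nearest center
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # update step: each center moves to the mean of its assigned points
        # (an empty cluster simply keeps its old center here)
        new_centers = np.array([X[labels == j].mean(axis=0) if (labels == j).any()
                                else centers[j] for j in range(k)])
        if np.allclose(new_centers, centers):  # assignments no longer change
            break
        centers = new_centers
    return labels, centers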

The decision to use eight clusters was based on the best silhouette score, which measures how similar an object is to its own cluster compared to other clusters. The silhouette score ranges from -1 to 1; a high value indicates that an object is well matched to its own cluster and poorly matched to neighboring clusters. If most objects have high values, then the clustering configuration is appropriate.
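With scikit-learn, that model selection can be sketched as a sweep over candidate values of k, reusing the 2-D LDA projection X_lda from the earlier sketch:

from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

scores = {}
for k in range(2, 13):  # candidate numbers of clusters
    labels = KMeans(n_clusters=k, random_state=0).fit_predict(X_lda)
    scores[k] = silhouette_score(X_lda, labels)
print(max(scores, key=scores.get))  # k = 8 scored highest in this study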

Step 4: Feature Extraction with Principal Component Analysis

Principal Component Analysis (PCA) is a common feature extraction method in machine learning. The algorithm finds the eigenvectors of the data’s covariance matrix with the highest eigenvalues and uses those eigenvectors to project the data into a new subspace of equal or lower dimension. In feature extraction, PCA reduces the number of features by constructing a smaller set of variables that capture a significant portion of the information in the original features. Using PCA, I identified the most important features for defining each cluster.
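A sketch of that extraction, reusing the feature matrix X and column list feature_cols from the earlier sketches; the loadings in components_ show which original statistics drive each component, and the same fit can be run on a single cluster’s rows to characterize that cluster.

import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X_std = StandardScaler().fit_transform(X)  # PCA is scale-sensitive, so standardize first
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_std)      # project onto the top two components
print(pca.explained_variance_ratio_)  # variance captured by each component
# rank the original features by the size of their loadings on component 1
order = np.argsort(np.abs(pca.components_[0]))[::-1]
print([feature_cols[i] for i in order[:5]])  # five most influential statistics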

Step 5: Data Visualization with Tableau

Tableau is a powerful application that renders data in a clean and concise way. In the plots below, I map out all of the clusters across the NBA, highlight each cluster in detail, and explore some Advanced Metrics for each cluster.