In basketball, we typically talk about 5 positions: point guard, shooting guard, small forward, power forward, and center. Based on this, one might expect NBA players to fall into 5 distinct groups- Point guards perform similar to other point guards, shooting guards perform similar to other shooting guards, etc. Is this the case? Do NBA players fall neatly into position groups?

To answer this question, I will look at how NBA players “group” together. For example, there might be a group of players who collect lots of rebounds, shoot poorly from behind the 3 point line, and block lots of shots. I might call these players forwards. If we allow player performance to create groups, what will these groups look like?

To group players, I will use k-means clustering (https://en.wikipedia.org/wiki/K-means_clustering).

When choosing a clustering algorithm, its important to think about how the clustering algorithm defines clusters. k-means minimizes the distance between data points (players in my case) and the center of K different points. Because distance is between the cluster center and a given point, k-means assumes clusters are spherical. When thinking about clusters of NBA players, do I think these clusters will be spherical? If not, then I might want try a different clustering algorithm.

For now, I will assume generally spherical clusters and use k-means. At the end of this post, I will comment on whether this assumption seems valid.

1 2 3 4 5 6 7 #import libraries and tell ipython we want inline figures rather than interactive figures. import matplotlib.pyplot as plt , pandas as pd , numpy as np , matplotlib as mpl , requests , time % matplotlib inline pd . options . display . mpl_style = 'default' #load matplotlib for plotting plt . style . use ( 'ggplot' ) #im addicted to ggplot. so pretty. mpl . rcParams [ 'font.family' ] = [ 'Bitstream Vera Sans' ]

We need data. Collecting the data will require a couple steps. First, I will create a matrix of all players who ever played in the NBA (via the NBA.com API).

1 2 3 4 5 6 url = 'http://stats.nba.com/stats/commonallplayers?IsOnlyCurrentSeason=0&LeagueID=00&Season=2015-16' # the critical part of the URL is IsOnlyCurrentSeason=0. The 0 means all seasons. response = requests . get ( url ) #get data headers = response . json ()[ 'resultSets' ][ 0 ] players = response . json ()[ 'resultSets' ][ 0 ][ 'rowSet' ] players_df = pd . DataFrame ( players , columns = headers [ 'headers' ]) #turn data into dataframe

In the 1979-1980 season, the NBA started using the 3-point line. The 3-point has dramatically changed basketball, so players performed different before it. While this change in play was not instantaneous, it does not make sense to include players before the 3-point line.

1 2 players_df [ 'TO_YEAR' ] = players_df [ 'TO_YEAR' ] . apply ( lambda x : int ( x )) #change data in the TO_YEAR column from strings to numbers players_df = players_df [ players_df [ 'TO_YEAR' ] > 1979 ] #only use players after 3 pt line

I have a list of all the players after 1979, but I want data about all these players. When grouping the players, I am not interested in how much a player played. Instead, I want to know HOW a player played. To remove variability associated with playing time, I will gather data that is standardized for 36 minutes of play. For example, if a player averages 4 points and 12 minutes a game, this player averages 12 points per 36 minutes.

Below, I have written a function that will collect every player’s performance per 36 minutes. The function collects data one player at a time, so its VERY slow. If you want the data, it can be found on my github (https://github.com/dvatterott/nba_project).

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 def careerPer36 ( playerID ): url = 'http://stats.nba.com/stats/playercareerstats?LeagueID=00&PerMode=Per36&PlayerID=' + str ( playerID ) header_data = { #I pulled this header from the py goldsberry library 'Accept-Encoding' : 'gzip, deflate, sdch' , 'Accept-Language' : 'en-US,en;q=0.8' , 'Upgrade-Insecure-Requests' : '1' , 'User-Agent' : 'Mozilla/5.0 (Windows NT 10.0; WOW64)' \ ' AppleWebKit/537.36 (KHTML, like Gecko) Chrome/48.0.2564.82 ' \ 'Safari/537.36' , 'Accept' : 'text/html,application/xhtml+xml,application/xml;q=0.9' \ ',image/webp,*/*;q=0.8' , 'Cache-Control' : 'max-age=0' , 'Connection' : 'keep-alive' } try : response = requests . get ( url , headers = header_data ) #get the data #print playerID #testing purposes except : time . sleep ( 5 ) #sometime the nba api gets mad about all the requests, so i take a quick break response = requests . get ( url , headers = header_data ) headers = response . json ()[ 'resultSets' ][ 1 ] #find headers players = response . json ()[ 'resultSets' ][ 1 ][ 'rowSet' ] #actual data players_df = pd . DataFrame ( players , columns = headers [ 'headers' ]) #create dataframe return players_df #df = pd.DataFrame() #player_list = players_df['PERSON_ID'] #df = df.append([careerPer36(x) for x in player_list]) #BEWARE. this takes forever. #df.index = range(np.size(df,0)) #df.to_pickle('nba_career_stats_'+time.strftime("%Y_%b_%d", time.gmtime())+'.pkl') df = pd . read_pickle ( 'nba_career_stats_2016_Feb_14.pkl' )

1 df . columns

Index([u'PLAYER_ID', u'LEAGUE_ID', u'TEAM_ID', u'GP', u'GS', u'MIN', u'FGM', u'FGA', u'FG_PCT', u'FG3M', u'FG3A', u'FG3_PCT', u'FTM', u'FTA', u'FT_PCT', u'OREB', u'DREB', u'REB', u'AST', u'STL', u'BLK', u'TOV', u'PF', u'PTS'], dtype='object')

Great! Now we have data that is scaled for 36 minutes of play (per36 data) from every player between 1979 and 2016. Above, I printed out the columns. I don’t want all this data. For instance, I do not care about how many minutes a player played. Also, some of the data is redundant. For instance, if I know a player’s field goal attempts (FGA) and field goal percentage (FG_PCT), I can calculate the number of made field goals (FGM). I removed the data columns that seem redundant. I do this because I do not want redundant data exercising too much influence on the grouping process.

Below, I create new data columns for 2 point field goal attempts and 2 point field goal percentage. I also remove all players who played less than 50 games. I do this because these players have not had the opportunity to establish consistent performance.

1 2 3 4 5 6 7 df = df [ df [ 'GP' ] > 50 ] #only players with more than 50 games. df = df . fillna ( 0 ) #some players have "None" in some cells. Turn these into 0s df [ 'FG2M' ] = df [ 'FGM' ] - df [ 'FG3M' ] #find number of 2pt field goals df [ 'FG2A' ] = df [ 'FGA' ] - df [ 'FG3A' ] #2 point field goal attempts df [ 'FG2_PCT' ] = df [ 'FG2M' ] / df [ 'FG2A' ] #2 point field goal percentage saveIDs = df [ 'PLAYER_ID' ] #save player IDs for later df = df . drop ([ 'PLAYER_ID' , 'LEAGUE_ID' , 'TEAM_ID' , 'GP' , 'GS' , 'MIN' , 'FGM' , 'FGA' , 'FG_PCT' , 'FG3M' , 'FTM' , 'REB' , 'PTS' , 'FG2M' ], 1 ) #pull out unncessary columns

It’s always important to visualize the data, so lets get an idea what we’re working with!

The plot below is called a scatter matrix. This type of plot will appear again, so lets go through it carefully. Each subplot has the feature (stat) labeled on its row which serves as its y-axis. The column feature serves as the x-axis. For example the subplot in the second column of the first row plots 3-point field goal attempts by 3-point field goal percentage. As you can see, players that have higher 3-point percentages tend to take more 3-pointers… makes sense.

On the diagonals, I plot the Kernel Density Estimation for the sample histogram. More players fall into areas where where the line is higher on the y-axis. For instance, no players shoot better than ~45% from behind the 3 point line.

One interesting part about scatter matrices is the plots below the diagonal are a reflection of the plots above the diagonal. For example, the data in the second column of the first row and the first column of the second row are the same. The only difference is the axes have switched.

1 2 3 axs = pd . tools . plotting . scatter_matrix ( df , alpha = 0.2 , figsize = ( 12 , 12 ), diagonal = 'kde' ); #the diagonal will show kernel density [ ax . set_yticks ([]) for ax in axs [:, 0 ]] #turn off the ticks that take up way too much space in such a crammed figure [ ax . set_xticks ([]) for ax in axs [ - 1 ,:]];

There are a couple things to note in the graph above. First, there’s a TON of information there. Second, it looks like there are some strong correlations. For example, look at the subplots depicting offensive rebounds by defensive rebounds.

While I tried to throw out redundant data, I clearly did not throw out all redundant data. For example, players that are good 3-point shooters are probably also good free throw shooters. These players are simply good shooters, and being a good shooter contributes to multiple data columns above.

When I group the data, I do not want an ability such as shooting to contribute too much. I want to group players equally according to all their abilities. Below I use a PCA to seperate variance associated with the different “components” (e.g., shooting ability) of basketball performance.

For an explanation of PCA I recommend this link - https://georgemdallas.wordpress.com/2013/10/30/principal-component-analysis-4-dummies-eigenvectors-eigenvalues-and-dimension-reduction/.

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 from sklearn.decomposition import PCA from sklearn.preprocessing import scale X = df . as_matrix () #take data out of dataframe X = scale ( X ) #standardize the data before giving it to the PCA. #I standardize the data because some features such as PF or steals have lower magnitudes than other features such as FG2A #I want both to contribute equally to the PCA, so I make sure they're on the same scale. pca = PCA () #great PCA object pca . fit ( X ) #pull out principle components var_expl = pca . explained_variance_ratio_ #find amount of variance explained by each component tot_var_expl = np . array ([ sum ( var_expl [ 0 : i + 1 ]) for i , x in enumerate ( var_expl )]) #create vector with cumulative variance plt . figure ( figsize = ( 12 , 4 )) #create cumulative proportion of variance plot plt . subplot ( 1 , 2 , 1 ) plt . plot ( range ( 1 , len ( tot_var_expl ) + 1 ), tot_var_expl * 100 , 'o-' ) plt . axis ([ 0 , len ( tot_var_expl ) + 1 , 0 , 100 ]) plt . xlabel ( 'Number of PCA Components Included' ) plt . ylabel ( 'Percentage of variance explained (%)' ) plt . subplot ( 1 , 2 , 2 ) #create scree plot plt . plot ( range ( 1 , len ( var_expl ) + 1 ), var_expl * 100 , 'o-' ) plt . axis ([ 0 , len ( var_expl ) + 1 , 0 , 100 ]) plt . xlabel ( 'PCA Component' );

On the left, I plot the amount of variance explained after including each additional PCA component. Using all the components explains all the variability, but notice how little the last few components contribute. It doesn’t make sense to include a component that only explains 1% of the variability…but how many components to include!?

I chose to include the first 5 components because no component after the 5th explained more than 5% of the data. This part of the analysis is admittedly arbitrary, but 5% is a relatively conservative cut-off.

Below is the fun part of the data. We get to look at what features contribute to the different principle components.

Assists and 3-point shooting contribute to the first component. I will call this the Outside Skills component.

component. Free throw attempts, assists, turnovers and 2-point field goals contribute to the second component. I will call this the Rim Scoring component.

component. Free throw percentage and 2-point field goal percentage contribute to the third component. I will call this the Pure Points component.

component. 2-point field goal percentage and steals contribute to the fourth component. I will call this the Defensive Big Man component.

component. 3-point shooting and free throws contribute to the fifth component. I will call this the Dead Eye component.

One thing to keep in mind here is that each component explains less variance than the last. So while 3 point shooting contributes to both the 1st and 5th component, more 3 point shooting variability is probably explained by the 1st component.

It would be great if we had a PCA component that was only shooting and another that was only rebounding since we typically conceive these to be different skills. Yet, there are multiple aspects of each skill. For example, a 3-point shooter not only has to be a dead-eye shooter, but also has to find ways to get open. Additionally, being good at “getting open” might be something akin to basketball IQ which would also contribute to assists and steals!

1 2 3 factor_names = [ 'Outside Skills' , 'Rim Scoring' , 'Pure Points' , 'Defensive Big Man' , 'Dead Eye' ] #my component names loadings_df = pd . DataFrame ( pca . components_ , columns = df . columns ) #loadings_df[0:5] #all the factor loadings appear below.

Cool, we have our 5 PCA components. Now lets transform the data into our 5 component PCA space (from our 13 feature space - e.g., FG3A, FG3_PCT, ect.). To do this, we give each player a score on each of the 5 PCA components.

Next, I want to see how players cluster together based on their scores on these components. First, let’s investigate how using more or less clusters (i.e., groups) explains different amounts of variance.

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 from scipy.spatial.distance import cdist , pdist , euclidean from sklearn.cluster import KMeans from sklearn import metrics #http://stackoverflow.com/questions/6645895/calculating-the-percentage-of-variance-measure-for-k-means #The above link was helpful when writing this code. reduced_data = PCA ( n_components = 5 , whiten = True ) . fit_transform ( X ) #transform data into the 5 PCA components space #kmeans assumes clusters have equal variance, and whitening helps keep this assumption. k_range = range ( 2 , 31 ) #looking amount of variance explained by 1 through 30 cluster k_means_var = [ KMeans ( n_clusters = k ) . fit ( reduced_data ) for k in k_range ] #fit kmeans with 1 cluster to 30 clusters #get labels and calculate silhouette score labels = [ i . labels_ for i in k_means_var ] sil_score = [ metrics . silhouette_score ( reduced_data , i , metric = 'euclidean' ) for i in labels ] centroids = [ i . cluster_centers_ for i in k_means_var ] #get the center of each cluster k_euclid = [ cdist ( reduced_data , cent , 'euclidean' ) for cent in centroids ] #calculate distance between each item and each cluster center dist = [ np . min ( ke , axis = 1 ) for ke in k_euclid ] #get the distance between each item and its cluster wcss = [ sum ( d ** 2 ) for d in dist ] #within cluster sum of squares tss = sum ( pdist ( reduced_data ) ** 2 / reduced_data . shape [ 0 ]) #total sum of squares bss = tss - wcss #between cluster sum of squares plt . clf () plt . figure ( figsize = ( 12 , 4 )) #create cumulative proportion of variance plot plt . subplot ( 1 , 2 , 1 ) plt . plot ( k_range , bss / tss * 100 , 'o-' ) plt . axis ([ 0 , np . max ( k_range ), 0 , 100 ]) plt . xlabel ( 'Number of Clusters' ) plt . ylabel ( 'Percentage of variance explained (%)' ); plt . subplot ( 1 , 2 , 2 ) #create scree plot plt . plot ( k_range , np . transpose ( sil_score ) * 100 , 'o-' ) plt . axis ([ 0 , np . max ( k_range ), 0 , 40 ]) plt . xlabel ( 'Number of Clusters' ); plt . ylabel ( 'Average Silhouette Score*100' );

As you can in the left hand plot, adding more clusters explains more of the variance, but there are diminishing returns. Each additional cluster explains a little less data than the last (much like each PCA component explained less variance than the previous component).

The particularly intersting point here is the point where the second derivative is greatest, when the amount of change changes the most (the elbow). The elbow occurs at the 6th cluster.

Perhaps not coincidently, 6 clusters also has the highest silhouette score (right hand plot). The silhouette score computes the average distance between a player and all other players in this player’s cluster. It then divides this distance by the distance between this player and all players in the next nearest cluster. Silhouette scores range between -1 and 1 (where negative one means the player is in the wrong cluster, 0 means the clusters completely overlap, and 1 means the clusters are extermely well separated).

Six clusters has the highest silhouette score at 0.19. 0.19 is not great, and suggests a different clustering algorithm might be better. More on this later.

Because 6 clusters is the elbow and has the highest silhouette score, I will use 6 clusters in my grouping analysis. Okay, now that I decided on 6 clusters lets see what players fall into what clusters!

1 2 3 4 5 6 7 8 9 10 final_fit = KMeans ( n_clusters = 6 ) . fit ( reduced_data ) #fit 6 clusters df [ 'kmeans_label' ] = final_fit . labels_ #label each data point with its clusters df [ 'PLAYER_ID' ] = saveIDs #of course we want to know what players are in what cluster player_names = [ pd . DataFrame ( players_df [ players_df [ 'PERSON_ID' ] == x ][ 'DISPLAY_LAST_COMMA_FIRST' ]) . to_string ( header = False , index = False ) for x in df [ 'PLAYER_ID' ]] # because playerID #s mean nothing to me, lets get the names too df [ 'Name' ] = player_names #lets also create a dataframe with data about where the clusters occur in the 5 component PCA space. cluster_locs = pd . DataFrame ( final_fit . cluster_centers_ , columns = [ 'component %s ' % str ( s ) for s in range ( np . size ( final_fit . cluster_centers_ , 1 ))]) cluster_locs . columns = factor_names

Awesome. Now lets see how all the clusters look. These clusters were created in 5 dimensional space, which is not easy to visualize. Below I plot another scatter matrix. The scatter matrix allows us to visualize the clusters in different 2D combinations of the 5D space.

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 from scipy.stats import gaussian_kde plt . clf () centroids = final_fit . cluster_centers_ #find centroids so we can plot them colors = [ 'r' , 'g' , 'y' , 'b' , 'c' , 'm' ] #cluster colors Clusters = [ 'Cluster 0' , 'Cluster 1' , 'Cluster 2' , 'Cluster 3' , 'Cluster 4' , 'Cluster 5' ] #cluster...names numdata , numvars = reduced_data . shape #players by PCA components fig , axes = plt . subplots ( nrows = numvars , ncols = numvars , figsize = ( 10 , 10 )) #create a scatter matrix with 5**2 cells fig . subplots_adjust ( hspace = 0.05 , wspace = 0.05 ) recs = [] for col in colors : #make some patches for the legend recs . append ( mpl . patches . Rectangle (( 0 , 0 ), 1 , 1 , fc = col )) fig . legend ( recs , Clusters , 8 , ncol = 6 ) #create legend with patches above for i , ax in enumerate ( axes . flat ): # Hide all ticks and labels plt . setp ( ax . get_yticklabels (), visible = False ) #tick labels are too much with this many subplots plt . setp ( ax . get_xticklabels (), visible = False ) ax . grid ( False ) #again, too much if i % 5 == 0 : ax . set_ylabel ( factor_names [ i / 5 ]) #label outer y axes if i > 19 : ax . set_xlabel ( factor_names [ i - 20 ]) #label outer x axes for i , j in zip ( * np . triu_indices_from ( axes , k = 1 )): for x , y in [( i , j ), ( j , i )]: #plot individual data points and cluster centers axes [ y , x ] . plot ( reduced_data [:, x ], reduced_data [:, y ], 'k.' , markersize = 2 ) axes [ y , x ] . scatter ( centroids [:, x ], centroids [:, y ], marker = 'x' , s = 169 , linewidth = 3 , color = colors , zorder = 10 ) #create kernel density estimation for each PCA factor on the diagonals for i , label in enumerate ( factor_names ): density = gaussian_kde ( reduced_data [:, i ]) density . covariance_factor = lambda : . 25 density . _compute_covariance () x = np . linspace ( min ( reduced_data [:, i ]), max ( reduced_data [:, 1 ])) axes [ i , i ] . plot ( x , density ( x ))

In this plot above. I mark the center of a given cluster with an X. For example, Cluster 0 and Cluster 5 are both high in outside skills. Cluster 5 is also high in rim scoring, but low in pure points.

Below I look at the players in each cluster. The first thing I do is identify the player closest to the cluster’s center. I call this player the prototype. It is the player that most exemplifies a cluster.

I then show a picture of this player because… well I wanted to see who these players were. I print out this player’s stats and the cluster’s centroid location. Finally, I print out the first ten players in this cluster. This is the first ten players alphabetically. Not the ten players closest to cluster center.

1 2 3 4 5 6 7 8 9 10 11 from IPython.display import display from IPython.display import Image name = player_names [ np . argmin ([ euclidean ( x , final_fit . cluster_centers_ [ 0 ]) for x in reduced_data ])] #find cluster prototype PlayerID = str ( int ( df [ df [ 'Name' ] == name ][ 'PLAYER_ID' ])) #get players ID number #player = Image(url = "http://stats.nba.com/media/players/230x185/"+PlayerID+".png") player = Image ( url = 'http://4.bp.blogspot.com/_RaOrchOImw8/S3mNk3exLeI/AAAAAAAAdZk/Hs-81mnXO_E/s400/Lloyd+Daniels.jpg' , width = 100 ) display ( player ) #display(df[df['Name']==name]) #prototype's stats display ( cluster_locs [ 0 : 1 ]) #cluster centroid location df [ df [ 'kmeans_label' ] == 0 ][ 'Name' ][: 10 ] #first ten players in the cluster (alphabetically)

Outside Skills Rim Scoring Pure Points Defensive Big Man Dead Eye 0.830457 -0.930833 0.28203 -0.054093 0.43606

16 Afflalo, Arron 20 Ainge, Danny 40 Allen, Ray 46 Alston, Rafer 50 Aminu, Al-Farouq 53 Andersen, David 54 Anderson, Alan 56 Anderson, Derek 60 Anderson, James 63 Anderson, Kyle Name: Name, dtype: object

First, let me mention that cluster number is a purely categorical variable. Not ordinal. If you run this analysis, you will likely create clusters with similar players, but in a different order. For example, your cluster 1 might be my cluster 0.

Cluster 0 has the most players (25%; about 490 of the 1965 in this cluster analysis) and is red in the scatter matrix above.

Cluster 0 players are second highest in outside shooting (in the table above you can see their average score on the outside skills component is 0.83). These players are lowest in rim scoring (-0.93), so they do not draw many fouls - they are basically the snipers from the outside.

The prototype is Lloyd Daniels who takes a fair number of 3s. I wouldn’t call 31% a dominant 3-point percentage, but its certainly not bad. Notably, Lloyd Daniels doesn’t seem to do much but shoot threes, as 55% of his shots come from the great beyond.

Cluster 0 notable players include Andrea Bargnani, JJ Barea, Danilo Gallinari, and Brandon Jennings. Some forwards. Some Guards. Mostly good shooters.

On to Cluster 1… I probably should have made a function from this code, but I enjoyed picking the players pictures too much.

1 2 3 4 5 6 7 8 name = player_names [ np . argmin ([ euclidean ( x , final_fit . cluster_centers_ [ 1 ]) for x in reduced_data ])] PlayerID = str ( int ( df [ df [ 'Name' ] == name ][ 'PLAYER_ID' ])) #player = Image(url = "http://stats.nba.com/media/players/230x185/"+PlayerID+".png") player = Image ( url = 'https://allshookdown.files.wordpress.com/2009/06/laettner_200_921121.jpg?w=200&h=300' , width = 100 ) display ( player ) #display(df[df['Name']==name]) display ( cluster_locs [ 1 : 2 ]) df [ df [ 'kmeans_label' ] == 1 ][ 'Name' ][: 10 ]

Outside Skills Rim Scoring Pure Points Defensive Big Man Dead Eye -0.340177 1.008111 1.051622 -0.150204 0.599516

1 Abdul-Jabbar, Kareem 4 Abdur-Rahim, Shareef 9 Adams, Alvan 18 Aguirre, Mark 75 Antetokounmpo, Giannis 77 Anthony, Carmelo 85 Arenas, Gilbert 121 Baker, Vin 133 Barkley, Charles 148 Bates, Billyray Name: Name, dtype: object

Cluster 1 is green in the scatter matrix and includes about 14% of players.

Cluster 1 is highest on the rim scoring, pure points, and Dead Eye components. These players get the ball in the hoop.

Christian Laettner is the prototype. He’s a solid scoring forward.

Gilbert Arenas stands out in the first ten names as I was tempted to think of this cluster as big men, but it really seems to be players who shoot, score, and draw fouls.

Cluster 1 Notable players include James Harden,Kevin Garnet, Kevin Durant, Tim Duncan, Kobe, Lebron, Kevin Martin, Shaq, Anthony Randolph??, Kevin Love, Derrick Rose, and Michael Jordan.

1 2 3 4 5 6 7 8 name = player_names [ np . argmin ([ euclidean ( x , final_fit . cluster_centers_ [ 2 ]) for x in reduced_data ])] PlayerID = str ( int ( df [ df [ 'Name' ] == name ][ 'PLAYER_ID' ])) #player = Image(url = "http://stats.nba.com/media/players/230x185/"+PlayerID+".png") player = Image ( url = 'http://imageocd.com/images/nba10/doug-west-wallpaper-height-weight-position-college-high-school/doug-west.jpg' , width = 100 ) display ( player ) #display(df[df['Name']==name]) display ( cluster_locs [ 2 : 3 ]) df [ df [ 'kmeans_label' ] == 2 ][ 'Name' ][: 10 ]

Outside Skills Rim Scoring Pure Points Defensive Big Man Dead Eye 0.013618 0.101054 0.445377 -0.347974 -1.257634

2 Abdul-Rauf, Mahmoud 3 Abdul-Wahad, Tariq 5 Abernethy, Tom 10 Adams, Hassan 14 Addison, Rafael 24 Alarie, Mark 27 Aldridge, LaMarcus 31 Alexander, Courtney 35 Alford, Steve 37 Allen, Lavoy Name: Name, dtype: object

Cluster 2 is yellow in the scatter matrix and includes about 17% of players.

Lots of big men who are not outside shooters and don’t draw many fouls. These players are strong 2 point shooters and free throw shooters. I think of these players as mid-range shooters. Many of the more recent Cluster 2 players are forwards since mid-range guards do not have much of a place in the current NBA.

Cluster 2’s prototype is Doug West. Doug West shoots well from the free throw line and on 2-point attempts, but not the 3-point line. He does not draw many fouls or collect many rebounds.

Cluster 2 noteable players include LaMarcus Aldridge, Tayshaun Prince, Thaddeus Young, and Shaun Livingston

1 2 3 4 5 6 7 8 name = player_names [ np . argmin ([ euclidean ( x , final_fit . cluster_centers_ [ 3 ]) for x in reduced_data ])] PlayerID = str ( int ( df [ df [ 'Name' ] == name ][ 'PLAYER_ID' ])) #player = Image(url = "http://stats.nba.com/media/players/230x185/"+PlayerID+".png") player = Image ( url = 'http://4.bp.blogspot.com/_f5MWZq1BJXI/TCDRy3v6m9I/AAAAAAAAACc/Ux8M7uiadoc/s400/cato.jpg' , width = 100 ) display ( player ) #display(df[df['Name']==name]) display ( cluster_locs [ 3 : 4 ]) df [ df [ 'kmeans_label' ] == 3 ][ 'Name' ][: 10 ]

Outside Skills Rim Scoring Pure Points Defensive Big Man Dead Eye -1.28655 -0.467105 -0.133546 0.905368 0.000679

7 Acres, Mark 8 Acy, Quincy 13 Adams, Steven 15 Adrien, Jeff 21 Ajinca, Alexis 26 Aldrich, Cole 34 Alexander, Victor 45 Alston, Derrick 51 Amundson, Lou 52 Andersen, Chris Name: Name, dtype: object

Cluster 3 is blue in the scatter matrix and includes about 16% of players.

Cluster 3 players do not have outside skills such as assists and 3-point shooting (they’re last in outside skills). They do not draw many fouls or shoot well from the free throw line. These players do not shoot often, but have a decent shooting percentage. This is likely because they only shoot when wide open next to the hoop.

Cluster 3 players are highest on the defensive big man component. They block lots of shots and collect lots of rebounds.

The Cluster 3 prototype is Kelvin Cato. Cato is not and outside shooter and he only averages 7.5 shots per 36, but he makes these shots at a decent clip. Cato averages about 10 rebounds per 36.

Notable Cluster 3 players include Andrew Bogut, Tyson Chandler, Andre Drummond, Kawahi Leonard??, Dikembe Mutumbo, and Hassan Whiteside.

1 2 3 4 5 6 7 8 name = player_names [ np . argmin ([ euclidean ( x , final_fit . cluster_centers_ [ 4 ]) for x in reduced_data ])] PlayerID = str ( int ( df [ df [ 'Name' ] == name ][ 'PLAYER_ID' ])) #player = Image(url = "http://stats.nba.com/media/players/230x185/"+PlayerID+".png") player = Image ( url = 'http://www.thenolookpass.com/wp-content/uploads/2012/01/IMG-724x1024.jpg' , width = 100 ) #a photo just for fun display ( player ) #display(df[df['Name']==name]) display ( cluster_locs [ 4 : 5 ]) df [ df [ 'kmeans_label' ] == 4 ][ 'Name' ][: 10 ]

Outside Skills Rim Scoring Pure Points Defensive Big Man Dead Eye -0.668445 0.035927 -0.917479 -1.243347 0.244897

0 Abdelnaby, Alaa 17 Ager, Maurice 28 Aleksinas, Chuck 33 Alexander, Joe 36 Allen, Jerome 48 Amaechi, John 49 Amaya, Ashraf 74 Anstey, Chris 82 Araujo, Rafael 89 Armstrong, Brandon Name: Name, dtype: object

Cluster 4 is cyan in the scatter matrix above and includes the least number of players (about 13%).

Cluster 4 players are not high on outsize skills. They are average on rim scoring. They do not score many points, and they don’t fill up the defensive side of the stat sheet. These players don’t seem like all stars.

Looking at Doug Edwards’ stats - certainly not a 3-point shooter. I guess a good description of cluster 4 players might be … NBA caliber bench warmers.

Cluster 4’s notable players include Yi Jianlian and Anthony Bennet….yeesh

1 2 3 4 5 6 7 8 9 name = player_names [ np . argmin ([ euclidean ( x , final_fit . cluster_centers_ [ 5 ]) for x in reduced_data ])] #PlayerID = str(int(df[df['Name']==name]['PLAYER_ID'])) PlayerID = str ( 76993 ) #Two Gerald Hendersons! #player = Image(url = "http://stats.nba.com/media/players/230x185/"+PlayerID+".png", width=100) player = Image ( url = 'http://cdn.fansided.com/wp-content/blogs.dir/18/files/2014/04/b__gerald_henderson_82.jpg' , width = 100 ) display ( player ) #display(df[df['PLAYER_ID']==76993]) display ( cluster_locs [ 5 :]) df [ df [ 'kmeans_label' ] == 5 ][ 'Name' ][: 10 ]

Outside Skills Rim Scoring Pure Points Defensive Big Man Dead Eye 0.890984 0.846109 -0.926444 0.735306 -0.092395

12 Adams, Michael 30 Alexander, Cory 41 Allen, Tony 62 Anderson, Kenny 65 Anderson, Mitchell 78 Anthony, Greg 90 Armstrong, Darrell 113 Bagley, John 126 Banks, Marcus 137 Barrett, Andre Name: Name, dtype: object

Cluster 5 is magenta in the scatter matrix and includes 16% of players.

Cluster 5 players are highest in outside skills and second highest in rim scoring yet these players are dead last in pure points. It seems they score around the rim, but do not draw many fouls. They are second highest in defensive big man.

Gerald Henderson Sr is the prototype. Henderson is a good 3 point and free throw shooter but does not draw many fouls. He has lots of assists and steals.

Of interest mostly because it generates an error in my code, Gerald Henderson Jr is in cluster 2 - the mid range shooters.

Notable cluster 5 players include Mugsy Bogues, MCW, Jeff Hornacek, Magic Johnson, Jason Kidd, Steve Nash, Rajon Rando, John Stockton. Lots of guards.

In the cell below, I plot the percentage of players in each cluster.

1 2 3 4 plt . clf () plt . hist ( df [ 'kmeans_label' ], normed = True , bins = [ 0 , 1 , 2 , 3 , 4 , 5 , 6 ], rwidth = 0.5 ) plt . xticks ([ 0.5 , 1.5 , 2.5 , 3.5 , 4.5 , 5.5 ],[ 'Group 0' , 'Group 1' , 'Group 2' , 'Goup 3' , 'Group 4' , 'Group 5' ]) plt . ylabel ( 'Percentage of Players in Each Cluster' );

I began this post by asking whether player positions is the most natural way to group NBA players. The clustering analysis here suggests not.

Here’s my take on the clusters: Cluster 0 is pure shooters, Cluster 1 is talented scorers, Cluster 2 is mid-range shooters, Cluster 3 is defensive big-men, Cluster 4 is bench warmers, Cluster 5 is distributors. We might call the “positions” shooters, scorers, rim protectors, and distributors.

It’s possible that our notion of position comes more from defensive performance than offensive. On defense, a player must have a particular size and agility to guard a particular opposing player. Because of this, a team will want a range of sizes and agility - strong men to defend the rim and quick men to defend agile ball carriers. Box scores are notoriously bad at describing defensive performance. This could account for the lack of “positions” in my cluster.

I did not include player height and weight in this analysis. I imagine height and weight might have made clusters that resemble the traditional positions. I chose to not include height and weight because these are player attributes; not player performance.

After looking through all the groups one thing that stands out to me is the lack of specialization. For example we did not find a single cluster of incredible 3-point shooters. Cluster 1 includes many great shooters, but it’s not populated exclusively by great shooters. It would be interesting if adding additional clusters to the analysis could find more specific clusters such as big-men that can shoot from the outside (e.g., Dirk) or high-volume scorers (e.g., Kobe).

I tried to list some of the aberrant cluster choices in the notable players to give you an idea for the amount of error in the clustering. These aberrant choices are not errors, they are simply an artifact of how k-means defines clusters. Using a different clustering algorithm would produce different clusters. On that note, the silhouette score of this clustering model is not great, yet the clustering algorithm definitely found similar players, so its not worthless. Nonetheless, clusters of NBA players might not be spherical. This would prevent a high silhouette score. Trying a different algorithm without the spherical clusters assumption would definitely be worthwhile.

Throughout this entire analysis, I was tempted to think about group membership, as a predictor of a player’s future performance. For instance, when I saw Karl Anthony Towns in the same cluster as Kareem Abdul-Jabbar, I couldn’t help but think this meant good things for Karl Anthony Towns. Right now, this doesn’t seem justified. No group included less than 10% of players so not much of an oppotunity for a uniformly “star” group to form. Each group contained some good and some bad players. Could more clusters change this? I plan on examining whether more clusters can improve the clustering algorithm’s ability to find clusters of exclusively quality players. If it works, I’ll post it here.