Databricks recently announced GraphFrames, an awesome Spark extension for graph processing with DataFrames.

I ran a graph analysis and visualized the beautiful ball-movement network of the Golden State Warriors, using the rich data provided by NBA.com's stats.

Pass network of Warriors

Passes received & made

The league’s MVP, Stephen Curry, received the most passes, and the team’s MVP, Draymond Green, made the most passes.

We’ve seen most of the offense start with a pick-and-roll or one of Curry’s off-ball cuts, with Green as the passer.

inDegree

id                    inDegree
CurryStephen          3993
GreenDraymond         3123
ThompsonKlay          2276
LivingstonShaun       1925
IguodalaAndre         1814
BarnesHarrison        1241
BogutAndrew           1062
BarbosaLeandro        946
SpeightsMarreese      826
ClarkIan              692
RushBrandon           685
EzeliFestus           559
McAdooJames Michael   182
VarejaoAnderson       67
LooneyKevon           22

outDegree

id                    outDegree
GreenDraymond         3841
CurryStephen          3300
IguodalaAndre         1896
LivingstonShaun       1878
BogutAndrew           1660
ThompsonKlay          1460
BarnesHarrison        1300
SpeightsMarreese      795
RushBrandon           772
EzeliFestus           765
BarbosaLeandro        758
ClarkIan              597
McAdooJames Michael   261
VarejaoAnderson       94
LooneyKevon           36

Label Propagation

Label Propagation is an algorithm for finding communities in a graph.

The algorithm nicely classifies the players into backcourt and frontcourt without being given any labels!

name                    label
Thompson, Klay          3
Barbosa, Leandro        3
Curry, Stephen          3
Clark, Ian              3
Livingston, Shaun       3
Rush, Brandon           7
Green, Draymond         7
Speights, Marreese      7
Bogut, Andrew           7
McAdoo, James Michael   7
Iguodala, Andre         7
Varejao, Anderson       7
Ezeli, Festus           7
Looney, Kevon           7
Barnes, Harrison        7
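To give a feel for how communities can emerge without any labels, here is a minimal plain-Python sketch of label propagation. It is an illustration under stated assumptions, not necessarily how GraphFrames implements it: the function name `label_propagation`, the synchronous update, the smallest-label tie-break and the toy graph are all choices made for this example.

```python
from collections import Counter


def label_propagation(adjacency, max_iter=5):
    """Synchronous label propagation; ties go to the smallest label (assumption)."""
    labels = {node: node for node in adjacency}  # every node starts in its own community
    for _ in range(max_iter):
        new_labels = {}
        for node, neighbours in adjacency.items():
            if not neighbours:  # an isolated node keeps its label
                new_labels[node] = labels[node]
                continue
            # Each neighbour votes for its current label; repeated neighbours vote repeatedly
            votes = Counter(labels[n] for n in neighbours)
            best = max(votes.values())
            new_labels[node] = min(l for l, c in votes.items() if c == best)
        if new_labels == labels:  # nothing changed, so stop early
            break
        labels = new_labels
    return labels


# Toy undirected graph: two triangles (0-1-2 and 3-4-5) joined by the edge 2-3
adjacency = {
    0: [1, 2], 1: [0, 2], 2: [0, 1, 3],
    3: [2, 4, 5], 4: [3, 5], 5: [3, 4],
}
communities = label_propagation(adjacency)
print(communities)
```

On this toy graph the two triangles end up with two different labels. The label values themselves carry no meaning, only which nodes share one.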

PageRank

PageRank can detect important nodes (players in this case) in a network.

It’s no surprise that Stephen Curry, Draymond Green and Klay Thompson are the top three.

The algorithm also detects that Shaun Livingston and Andre Iguodala play key roles in the Warriors’ passing game.

name                    pagerank
Curry, Stephen          2.17
Green, Draymond         1.99
Thompson, Klay          1.34
Livingston, Shaun       1.29
Iguodala, Andre         1.21
Barnes, Harrison        0.86
Bogut, Andrew           0.77
Barbosa, Leandro        0.72
Speights, Marreese      0.66
Clark, Ian              0.59
Rush, Brandon           0.57
Ezeli, Festus           0.48
McAdoo, James Michael   0.27
Varejao, Anderson       0.19
Looney, Kevon           0.16
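As a rough illustration of what PageRank computes, the following plain-Python sketch runs the textbook power iteration on a tiny weighted graph. The function `pagerank`, its parameters and the three-node graph are assumptions made for this example. The scores here are normalized to sum to 1, so their scale may differ from the values GraphFrames reports.

```python
def pagerank(edges, reset_probability=0.15, tol=1e-6, max_iter=100):
    """Power-iteration PageRank over weighted directed edges (src, dst, weight)."""
    nodes = sorted({n for src, dst, _ in edges for n in (src, dst)})
    out_weight = {n: 0.0 for n in nodes}
    for src, _, weight in edges:
        out_weight[src] += weight

    rank = {n: 1.0 / len(nodes) for n in nodes}  # start from a uniform distribution
    for _ in range(max_iter):
        # Every node receives the "reset" share, then shares flowing along edges
        new_rank = {n: reset_probability / len(nodes) for n in nodes}
        for src, dst, weight in edges:
            share = rank[src] * weight / out_weight[src]
            new_rank[dst] += (1 - reset_probability) * share
        # Nodes without outgoing edges spread their rank evenly over all nodes
        dangling = sum(rank[n] for n in nodes if out_weight[n] == 0)
        for n in nodes:
            new_rank[n] += (1 - reset_probability) * dangling / len(nodes)
        converged = sum(abs(new_rank[n] - rank[n]) for n in nodes) < tol
        rank = new_rank
        if converged:
            break
    return rank


# Toy graph: A passes mostly to B, while B and C both pass back to A
toy_edges = [("A", "B", 3), ("A", "C", 1), ("B", "A", 1), ("C", "A", 1)]
ranks = pagerank(toy_edges)
for name, score in sorted(ranks.items(), key=lambda kv: -kv[1]):
    print(name, round(score, 3))
```

A node ranks highly when it receives a large share of passes from nodes that themselves rank highly, which is why a player can score well without leading the raw pass counts.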

Everything together

Here is a network visualization that combines the results above.

Node size: pagerank

Node color: community

Link width: passes received & made

Workflow

Calling API

I used the playerdashptpass endpoint and saved the data for every player on the team into local JSON files.

The data records who passed to whom, and how many times, in the 2015-16 season.

    import os

    # GSW player IDs
    playerids = [201575, 201578, 2738, 202691, 101106, 2760, 2571, 203949,
                 203546, 203110, 201939, 203105, 2733, 1626172, 203084]

    # Call the API and store each player's result as a JSON file
    for playerid in playerids:
        os.system('curl "http://stats.nba.com/stats/playerdashptpass?'
                  'DateFrom=&'
                  'DateTo=&'
                  'GameSegment=&'
                  'LastNGames=0&'
                  'LeagueID=00&'
                  'Location=&'
                  'Month=0&'
                  'OpponentTeamID=0&'
                  'Outcome=&'
                  'PerMode=Totals&'
                  'Period=0&'
                  'PlayerID={playerid}&'
                  'Season=2015-16&'
                  'SeasonSegment=&'
                  'SeasonType=Regular+Season&'
                  'TeamID=0&'
                  'VsConference=&'
                  'VsDivision=" > {playerid}.json'.format(playerid=playerid))
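The long query string is easier to read and maintain when it is assembled from key/value pairs. A minimal sketch, assuming the same endpoint and parameters as above: `build_pass_url` is a hypothetical helper that only builds the URL and does not make any request.

```python
from urllib.parse import urlencode

BASE_URL = "http://stats.nba.com/stats/playerdashptpass"  # endpoint used in the post


def build_pass_url(player_id, season="2015-16"):
    """Return the playerdashptpass URL for one player (hypothetical helper)."""
    params = [
        ("DateFrom", ""), ("DateTo", ""), ("GameSegment", ""),
        ("LastNGames", 0), ("LeagueID", "00"), ("Location", ""),
        ("Month", 0), ("OpponentTeamID", 0), ("Outcome", ""),
        ("PerMode", "Totals"), ("Period", 0), ("PlayerID", player_id),
        ("Season", season), ("SeasonSegment", ""),
        ("SeasonType", "Regular Season"), ("TeamID", 0),
        ("VsConference", ""), ("VsDivision", ""),
    ]
    # urlencode escapes the values; the space in "Regular Season" becomes "+"
    return BASE_URL + "?" + urlencode(params)


url = build_pass_url(201939)
print(url)
```

The resulting string can be handed to curl or to any HTTP client in place of the hand-written URL.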

JSON -> pandas DataFrame

Then I combined all the individual JSON files into a single DataFrame for later aggregation.

    import json
    import pandas as pd

    # Build one DataFrame per player, then concatenate them once
    # (DataFrame.append was removed in pandas 2.0)
    frames = []
    for playerid in playerids:
        with open("{playerid}.json".format(playerid=playerid)) as json_file:
            parsed = json.load(json_file)['resultSets'][0]
            frames.append(pd.DataFrame(parsed['rowSet'], columns=parsed['headers']))
    raw = pd.concat(frames)

    raw = raw.rename(columns={'PLAYER_NAME_LAST_FIRST': 'PLAYER'})
    # Vertex id: the player name without the ", " separator
    raw['id'] = raw['PLAYER'].str.replace(', ', '')
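The code above relies on each result set carrying a `headers` list and a `rowSet` list of rows. The sketch below shows that shape with the standard library only. The payload is a toy with made-up names and numbers; only the key names and the three column names are taken from the code in this post.

```python
import json

# Toy payload: the values are invented for illustration, not real NBA data
payload = json.loads("""
{"resultSets": [{
    "headers": ["PLAYER_NAME_LAST_FIRST", "PASS_TO", "PASS"],
    "rowSet": [["Passer, A", "Receiver, B", 3],
               ["Passer, A", "Receiver, C", 1]]
}]}
""")

result = payload["resultSets"][0]
# Pair every row with the header names to get one dict per pass record
rows = [dict(zip(result["headers"], row)) for row in result["rowSet"]]
for row in rows:
    # Same id convention as in the post: drop the ", " from the name
    row["id"] = row["PLAYER_NAME_LAST_FIRST"].replace(", ", "")

print(rows[0])
```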

Prepare vertices and edges

GraphFrames in Spark needs the data in a specific format: one DataFrame of vertices and one of edges.

Vertices are the list of nodes in the graph, each with an ID.

Edges are the relationships between the nodes.

You can pass additional features such as a weight, but I couldn’t find a way to make good use of these features in the later analysis. The workaround below is brute force and not even a proper graph operation, but it works: it adds one edge row per pass (suggestions and comments are very welcome).

    # Make raw vertices
    pandas_vertices = raw[['PLAYER', 'id']].drop_duplicates()
    pandas_vertices.columns = ['name', 'id']

    # Make raw edges: one row per pass, so that degrees equal pass counts
    # (frames are collected and concatenated once, since DataFrame.append
    # was removed in pandas 2.0)
    edge_frames = []
    for passer in raw['id'].drop_duplicates():
        # Only keep receivers who are themselves on the team
        receivers = raw[(raw['PASS_TO'].isin(raw['PLAYER'])) &
                        (raw['id'] == passer)]['PASS_TO'].drop_duplicates()
        for receiver in receivers:
            n_passes = int(raw[(raw['id'] == passer) &
                               (raw['PASS_TO'] == receiver)]['PASS'].item())
            edge_frames.append(pd.DataFrame(
                {'passer': passer, 'receiver': receiver.replace(', ', '')},
                index=range(n_passes)))
    pandas_edges = pd.concat(edge_frames)
    pandas_edges.columns = ['src', 'dst']
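For the degree counts alone, replicating rows is not strictly necessary: summing a weight per passer and per receiver gives the same totals. A minimal plain-Python sketch of that idea, using hypothetical `(src, dst, passes)` tuples with toy numbers. In Spark the equivalent would presumably be a grouped sum over a weight column on the edges DataFrame. This does not by itself make label propagation or PageRank weight-aware.

```python
from collections import Counter

# Hypothetical weighted edges: (src, dst, number of passes); toy numbers
weighted_edges = [("A", "B", 3), ("A", "C", 1), ("B", "A", 2), ("C", "A", 4)]

out_degree = Counter()  # passes made
in_degree = Counter()   # passes received
for src, dst, passes in weighted_edges:
    out_degree[src] += passes
    in_degree[dst] += passes

print("passes made:", out_degree.most_common())
print("passes received:", in_degree.most_common())
```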

Graph analysis

Bring the local vertices and edges to Spark and let it spark.

    from graphframes import GraphFrame

    # sqlContext is assumed to exist already, as it does in the pyspark shell
    vertices = sqlContext.createDataFrame(pandas_vertices)
    edges = sqlContext.createDataFrame(pandas_edges)

    # Analysis part
    g = GraphFrame(vertices, edges)

    print("vertices")
    g.vertices.show()

    print("edges")
    g.edges.show()

    print("inDegrees")
    g.inDegrees.sort('inDegree', ascending=False).show()

    print("outDegrees")
    g.outDegrees.sort('outDegree', ascending=False).show()

    print("degrees")
    g.degrees.sort('degree', ascending=False).show()

    print("labelPropagation")
    g.labelPropagation(maxIter=5).show()

    print("pageRank")
    g.pageRank(resetProbability=0.15, tol=0.01).vertices.sort(
        'pagerank', ascending=False).show()

Visualise the network

Running gsw_passing_network.py from my GitHub repo writes passes.csv, groups.csv and size.csv to your working directory.

I used the networkD3 package in R to make a cool interactive D3 chart.

    library(networkD3)

    setwd('/Users/yuki/Documents/code_for_blog/gsw_passing_network')

    passes <- read.csv("passes.csv")
    groups <- read.csv("groups.csv")
    size <- read.csv("size.csv")

    passes$source <- as.numeric(as.factor(passes$PLAYER)) - 1
    passes$target <- as.numeric(as.factor(passes$PASS_TO)) - 1
    passes$PASS <- passes$PASS / 50

    groups$nodeid <- groups$name
    groups$name <- as.numeric(as.factor(groups$name)) - 1
    groups$group <- as.numeric(as.factor(groups$label)) - 1
    nodes <- merge(groups, size[-1], by = "id")

    nodes$pagerank <- nodes$pagerank ^ 2 * 100

    forceNetwork(Links = passes,
                 Nodes = nodes,
                 Source = "source",
                 fontFamily = "Arial",
                 colourScale = JS("d3.scale.category10()"),
                 Target = "target",
                 Value = "PASS",
                 NodeID = "nodeid",
                 Nodesize = "pagerank",
                 linkDistance = 350,
                 Group = "group",
                 opacity = 0.8,
                 fontSize = 16,
                 zoom = TRUE,
                 opacityNoHover = TRUE)
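The R snippet turns player names into zero-based indices separately for each column, which lines up only when every column contains the same set of names. One way to make that explicit is a single shared lookup. Here is a hedged sketch in Python, where the player names, the pass counts and the output field names are all hypothetical.

```python
# Hypothetical pass records: (passer, receiver, passes); toy values
pass_records = [
    ("PlayerA", "PlayerB", 5),
    ("PlayerB", "PlayerA", 7),
    ("PlayerB", "PlayerC", 2),
]

# One lookup built from every name that appears on either side of a pass
names = sorted({name for passer, receiver, _ in pass_records
                for name in (passer, receiver)})
index = {name: i for i, name in enumerate(names)}

# Links refer to nodes by position, so the node list keeps the same order
links = [{"source": index[p], "target": index[r], "value": n}
         for p, r, n in pass_records]
nodes = [{"name": name} for name in names]

print(links)
print(nodes)
```

Because both ends of every link go through the same `index`, a player who only receives passes still gets the position that the node list expects.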

Code

The full code is available on GitHub.
