We talk about it all the time: who is the greatest player of all time?

Is it the player who won the most Grand Slam titles, the player ranked number one for the most weeks or the person with the best record against the premier players of that era? Is it a combination of these three things or more? What factors are important when it comes to this discussion?

These are the questions that arise while discussing the GOAT, and it is especially fun in the current era of Roger Federer, Novak Djokovic and Rafael Nadal.

I recently began pondering who would even make the cut if we were to construct a list of the greats. Another interesting proposition would be to determine the best second-tier players of all time, or even the best third-tier players. How can we identify the best of each hierarchical step in tennis?

In order to answer this question, I have turned to a common tool in analytics called cluster analysis.

In the interest of full disclosure, I drew inspiration from an article in the Journal of Quantitative Analysis of Sports entitled “Match Play: Using Statistical Methods to Categorize PGA Tour Players’ Careers” by Martin L. Puterman and Stefan M. Wittman at the University of British Columbia.

METHODOLOGY

I decided to categorize male professional tennis players during most of the Open Era (1972–2015). Initially, I created over twenty categories from the year-end rankings. These included mean rank, best rank, range of ranks, years in the top 10, percent of career spent improving and maximum single decline. However, I found that this data, statistically speaking, did not describe overall career performance well because the variables were too highly correlated with one another.
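To illustrate what "too highly correlated" means in practice, here is a minimal sketch that computes the Pearson correlation between two of the candidate variables. The five players' values are made up for illustration, and the `pearson` helper is a hypothetical name; the article does not say which correlation measure or software was actually used.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Made-up career summaries for five hypothetical players (not real data).
mean_rank = [12.0, 35.0, 60.0, 85.0, 140.0]
best_rank = [1, 8, 20, 45, 90]

r = pearson(mean_rank, best_rank)
print(round(r, 3))
```

A coefficient close to 1 means the two variables carry nearly the same information, so feeding both into a clustering would effectively count that information twice.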

Ultimately, I found overall career performance better described by the proportion of time a player spent ranked in various categories or buckets — top 10, top 20, top 50, top 100, beyond 100.
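As a rough sketch of this idea, the snippet below turns a list of year-end rankings into the share of seasons spent in each bucket. It assumes the buckets are mutually exclusive (so "top 20" means ranks 11 to 20), which the text does not state explicitly, and the sample career is invented for illustration.

```python
from collections import Counter

# Upper bound of each bucket, following the categories named in the text.
# Assumption: buckets are mutually exclusive, e.g. "top 20" covers ranks 11-20.
BUCKETS = [(10, "top 10"), (20, "top 20"), (50, "top 50"), (100, "top 100")]
LABELS = [label for _, label in BUCKETS] + ["beyond 100"]

def bucket_of(rank):
    """Return the label of the first bucket whose upper bound contains the rank."""
    for upper, label in BUCKETS:
        if rank <= upper:
            return label
    return "beyond 100"

def bucket_proportions(year_end_ranks):
    """Share of a player's ranked seasons spent in each bucket."""
    counts = Counter(bucket_of(rank) for rank in year_end_ranks)
    total = len(year_end_ranks)
    return {label: counts[label] / total for label in LABELS}

# Made-up ten-season career (not a real player).
career = [150, 60, 18, 7, 3, 5, 9, 15, 40, 120]
profile = bucket_proportions(career)
print(profile)
```

Each player is then described by five numbers that sum to one, which is the kind of profile a clustering method can compare across players.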

To cut down on the sheer number of players being considered and find the crème de la crème, I required players to finish the year in the top 100 for a minimum of eight years.

Why top 100? It seems to be right around the cutoff for the Grand Slam tournaments each year and a bellwether of success.

Why eight years? Five years seems too little to be considered one of the ‘greats’ and 10 years was a bit of a stretch.

Of course, because of these assumptions, several older players from the early years of the Open Era are eliminated or just plain misinterpreted within the analysis because their careers ended soon after 1972. Also, several younger players did not reach the top 100 for the first time until after 2007 and therefore do not have eight years in the top 100. With constraints like these, I expected the cutoffs.

In addition, I eliminated all year-end rankings greater than 600, since for the first 30 years of the rankings, the lists rarely passed that range. Rankings that deep also did not seem useful when comparing across decades, since players there are most likely playing only a few ITF tournaments and working their way onto the main tour.
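The two selection rules described in this section (at least eight year-end finishes in the top 100, and no rankings beyond 600) could be sketched as follows. The function names and the two sample players are hypothetical, and the sketch assumes the eight years do not need to be consecutive, which the text leaves open.

```python
MIN_TOP100_YEARS = 8  # minimum number of year-end top-100 finishes, from the text
MAX_RANK = 600        # year-end rankings beyond this are dropped, from the text

def drop_deep_rankings(ranks_by_year):
    """Remove seasons where the year-end ranking is greater than MAX_RANK."""
    return {year: rank for year, rank in ranks_by_year.items() if rank <= MAX_RANK}

def qualifies(ranks_by_year):
    """True if the player finished in the top 100 in enough seasons.

    Assumption: the qualifying seasons do not have to be consecutive.
    """
    top100_years = sum(1 for rank in ranks_by_year.values() if rank <= 100)
    return top100_years >= MIN_TOP100_YEARS

# Made-up careers (not real players): year -> year-end ranking.
players = {
    "Player X": {1990: 750, 1991: 310, 1992: 95, 1993: 40, 1994: 22, 1995: 9,
                 1996: 12, 1997: 30, 1998: 61, 1999: 99, 2000: 180},
    "Player Y": {2008: 88, 2009: 45, 2010: 30, 2011: 52, 2012: 110},
}

selected = {
    name: drop_deep_rankings(career)
    for name, career in players.items()
    if qualifies(career)
}
print(sorted(selected))
```

Here Player X stays in the sample with the 750 season removed, while Player Y is excluded for having only four top-100 finishes.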

K-MEANS CLUSTERING

K-means clustering is a mathematical method of grouping similar observations in which k represents the number of groups. Interestingly enough, there was very little change in the top groups no matter which value of k I selected. The numbers suggested somewhere between five and seven clusters, so I opted for six.
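For readers unfamiliar with the technique, here is a minimal standard-library sketch of k-means (Lloyd's algorithm) applied to bucket-proportion profiles. It is not the software used for this analysis, which the article does not name. The four profiles are invented, and the within-cluster sum of squares shown at the end is one common way to compare values of k, assumed here for illustration rather than taken from the article.

```python
import random

def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def k_means(points, k, iterations=100, seed=0):
    """Lloyd's algorithm: returns (labels, centroids) for a list of tuples."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k randomly chosen points
    labels = None
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing, so the algorithm converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels, centroids

def within_cluster_ss(points, labels, centroids):
    """Total squared distance from each point to its assigned centroid."""
    return sum(squared_distance(p, centroids[lab]) for p, lab in zip(points, labels))

# Made-up profiles: share of career in top 10, top 20, top 50, top 100, beyond 100.
profiles = [
    (0.8, 0.1, 0.05, 0.03, 0.02),
    (0.7, 0.2, 0.05, 0.03, 0.02),
    (0.0, 0.05, 0.15, 0.4, 0.4),
    (0.0, 0.0, 0.2, 0.4, 0.4),
]

labels, centroids = k_means(profiles, k=2)
print(labels)
print(within_cluster_ss(profiles, labels, centroids))
```

Running the same function for several values of k and watching where the within-cluster sum of squares stops dropping sharply is one way to narrow down a range of plausible cluster counts.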

UNDERSTANDING THE RESULTS

The table below shows the percentage of their careers that players in each cluster spent at each ranking level (using year-end rankings). For instance, players in Cluster A spent approximately 72% of their collective careers in the top 10 and about 8% outside the top 100. Based on this knowledge, we can identify the different groups.