ehir. This is no good, as only the champions of the Turkish Super League qualify directly for the next year’s CL group stage, while the team that finishes second starts from the earlier qualifying rounds. History tells us that a Turkish team is unlikely to get through these early qualifying rounds and make it to the group stage. So why is Beşiktaş doing so well in the CL, but failing to achieve glory at home? There are likely to be many interconnected variables at play here. Industrial soccer and its professional competitions in the information age form a complicated phenomenon; as a wise man once said, soccer resembles life itself. The common consensus is that the management and the team itself are investing so much physical and mental energy into the CL that they have difficulty focusing on local league games. In light of this, I analysed historical data to investigate the relationship between the local and CL performances of the teams which attended the CL competition and, indeed, there seem to be some interesting correlations at play.

I looked at all the teams which played in the CL between 2003–04 and 2016–17. I started with the 2003 season because this was when the current tournament structure was introduced: a single group stage (32 teams) followed by a knockout stage starting from the round of 16 and advancing all the way to the final. I implemented a web scraping algorithm to read data from the Wikipedia pages of each CL season and get the advance level of each team (final, semi-final, quarter-final etc.) in each competition. Then, for each team, I implemented another web scraping algorithm to read the final standings from the Wikipedia pages of their local league seasons. Finally, I compared the CL advance levels and the corresponding final local league standings to investigate a possible relationship between the two variables, making use of suitable data stratifications when necessary (as will be detailed in the rest of this blog).

Most of the elements in the full sample are non-attendances (NA), as can be seen in Figure 1c; i.e., there are many teams who attended the CL only once or a few times during the 14 years of analysis. The histogram of CL advance levels for the limited sample (Figure 1d) is a zoomed-in version of Figure 1c without the NA. Figure 1d shows that, among the teams who attended the CL during the analysis period, the number of advances to the round of 16 almost equals the number of 4th- and 3rd-place finishes in the group stage. The number of advances to QF and onwards, of course, decreases steadily. Before looking at some distributions of CL advance level versus the final league standings, here are some of the facts that came out of this preliminary analysis which I found interesting:

The final league standings exhibit a geometric distribution skewed towards the top of the league table. Of course, these are some pretty strong teams; strong enough to compete in the CL (151 of the 448 topped their league, and 103 came second within the limited sample).

Over the 14 years, 32 teams make up a sample size of 14x32 = 448 for the limited sample. The full sample number goes up to 1426 with relegations excluded. The histograms and the distributions of the final league standings and the CL advance levels of the teams in both the full and the limited samples are given in Figure 1.

Data sample populated by including all final local league standings (except when relegated) of each team for every year during the analysis period. This means the final league standing of a team is included even when it didn’t make it to the CL group stage in that year. Hereafter, this sample will be referred to as ‘full’.

I will try to keep the technicalities of the analysis to a minimum in this blog, but occasionally there will be some details that might appeal to readers who have prior knowledge of statistics. In such cases, I provide references in case you want to learn more about the techniques I employ; however, they are not necessary to follow the story. All the web scraping and the subsequent data analysis were conducted using Python 3.7.

…La Liga is tough! In fact, the worst official finish is by Juventus in the 2005–06 Serie A season (20th), while they advanced to the QF in the CL. However, this was due to the penalty they received after a match-fixing scandal was uncovered (they actually finished 1st in the league that year).

Belarus (BATE Borisov), Croatia (Dinamo Zagreb), Norway (Rosenborg), Sweden (Malmo), Hungary (Debrecen), Poland (Legia Warsaw), Serbia/Serbia and Montenegro (Partizan), Slovenia (Maribor) and Kazakhstan (Astana) have each been represented by only one team. Among these teams, Legia Warsaw, Debrecen, Partizan, Maribor and Astana attended the competition only once between 2003–2016. BATE Borisov attended five times (i.e., more than Beşiktaş and Fenerbahce of Turkey, with four times each). Although Austria sent two teams (Austria Wien and Rapid Wien) to the competition, they each attended only once (2013 and 2005 respectively), making Austria one of the most underrepresented countries in the competition between 2003–2016.

A total of 107 teams attended the CL competition between 2003–2016. During this period, 26 of these teams were either relegated from their local league (Monaco is one notable example among the relegated teams) or promoted to their top local competition league.

So, this nice and colourful plot shows a first-order relationship between the two variables, with some statistics on it. It gives some idea of what the final league standings of the teams look like if they didn’t attend the CL, or if they did but only advanced as far as the R16, and so on. Each of the coloured shapes in Figure 2 (supposedly looking like a violin, but in this case they really look like a kemençe, so I will refer to them as kemençe plots) shows the distribution of a different category. All the category (CL advance level) distributions in Figure 2 resemble the geometric distribution apparent in Figures 1a and 1b, only sideways and symmetric, thus forming the shape of an upright standing string instrument. The thick black lines in the middle of each kemençe show the interquartile ranges, while the white dots show the median value of the final league standings for each category. We can see that the median final league standing for the teams which didn’t attend the CL (NA) is 4, whereas the value for the teams who advanced to every level other than the final (F) is 2. The mighty finalists usually finish their leagues as champions (thus the median value 1).

We also see that the distribution of the finalists bulges towards the lower numbers in Figure 2 (the top of the league table), with a short tail pointing upwards (the shorter the tail, the higher the confidence in the median values and the ranges shown on the kemençe). So, this means that the CL finalists are likely to finish at the top of their leagues, whereas the final standings of the rest (that is, all the NA, G4, G3, R16, QF and SF) can vary between the top and the bottom, with a higher likelihood towards the top.
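The medians and interquartile ranges that the kemençe plots display can be computed per category with a few lines of Python. This is only a minimal sketch: the `records` list is made-up data rather than the scraped sample, `summarise` is a hypothetical helper name, and `statistics.quantiles` needs Python 3.8 or later with its default quartile method, which may differ slightly from whatever the plotting library used.

```python
from statistics import median, quantiles

# Hypothetical (CL level, final league standing) pairs; NOT the post's data.
records = [
    ("NA", 3), ("NA", 4), ("NA", 6), ("NA", 9),
    ("R16", 1), ("R16", 2), ("R16", 2), ("R16", 5),
    ("F", 1), ("F", 1), ("F", 1), ("F", 2), ("F", 3),
]


def summarise(records):
    """Return {level: (q1, median, q3)} of league standings per CL level."""
    by_level = {}
    for level, standing in records:
        by_level.setdefault(level, []).append(standing)

    summary = {}
    for level, standings in by_level.items():
        q1, _, q3 = quantiles(standings, n=4)  # the three quartile cut points
        summary[level] = (q1, median(standings), q3)
    return summary


summary = summarise(records)
for level, (q1, med, q3) in summary.items():
    print(f"{level}: median={med}, interquartile range={q1}..{q3}")
```

The white dot of each kemençe corresponds to the middle value of the tuple, and the thick black line spans the first to the third value.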





So, we should be sufficiently confident that a CL finalist will finish their league in 1st or 2nd place, or at least 3rd place. Or should we? So far, both figures I’ve shown demonstrate what we can already guess: the further a team progresses in the CL competition, the stronger it is, and thus the more likely it is to finish at the top of its local league.





Well, this doesn’t really agree with my initial motivation of Beşiktaş doing really badly in the local league because of their success in the CL. Before jumping to quick conclusions and claiming that my team sucks in the Turkish Super League because the players (such as Ricardo Quaresma) just don’t care about the local league anymore, we should look at the data in more depth and detail. The results so far come only from the bulk data, so next I will look at some correlations over specific stratifications of both the full and the limited samples.

Analysis of the Correlation between CL Advance Levels and the Final League Standings

Here, I will investigate whether there is any correlation between two variables, namely CL advance levels and final league standings. If so, is it a positive correlation (i.e., the higher the CL advance level, the higher the final league standing) or a negative one? Before delving into the results, I would like to present information about the correlation metrics I used in this analysis. First, I assigned monotonic values to each CL level, such that NA=7, G4=6, G3=5, …, F=1, to convert this categorical/ordinal data to numerical values. Then I calculated Spearman’s rank statistic to quantify any correlation between the two variables. This statistic is suitable for this case, as it is a nonparametric measure of rank correlation, which is more reliable for non-normal distributions (as observed in Figures 1a and 1b).
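The encoding and the rank statistic described above can be sketched in plain Python. The values for R16, QF and SF are filled in by assuming the sequence NA=7 … F=1 is consecutive, the seasons are made up, and the helper names are hypothetical. This sketch computes only the coefficient; a statistics library such as SciPy (`scipy.stats.spearmanr`) would also return the p value.

```python
from math import sqrt

# Encoding from the text (NA=7, G4=6, G3=5, ..., F=1); the values for
# R16, QF and SF are an assumption that the sequence is consecutive.
LEVEL_VALUE = {"NA": 7, "G4": 6, "G3": 5, "R16": 4, "QF": 3, "SF": 2, "F": 1}


def average_ranks(values):
    """1-based ranks; tied values share the mean of the positions they occupy."""
    first_pos, count = {}, {}
    for pos, v in enumerate(sorted(values), start=1):
        first_pos.setdefault(v, pos)
        count[v] = count.get(v, 0) + 1
    return [first_pos[v] + (count[v] - 1) / 2 for v in values]


def pearson(x, y):
    """Pearson's linear correlation coefficient of two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def spearman(x, y):
    """Spearman's rho: Pearson's coefficient computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Made-up seasons, purely illustrative (NOT the post's data).
levels = ["F", "SF", "QF", "R16", "G3", "G4", "NA"]
standings = [1, 2, 2, 3, 4, 3, 6]

rho = spearman([LEVEL_VALUE[lv] for lv in levels], standings)
print(f"Spearman's rho = {rho:.2f}")
```

With this encoding, a positive coefficient means that a deeper CL run (smaller encoded value) goes together with a finish nearer the top of the league table (smaller standing).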





The conversion of the categorical CL level data to monotonic numerical values has some caveats, though. For example, the difficulty of reaching the final (F, value = 1) is probably exponentially, not linearly, higher than that of, say, the quarter final (QF, value = 3). However, the same can be said of the final league standings, which are already monotonic (1 to 20), thus making it a fair comparison. I played with the conversion, changing it to weighted values etc., to understand whether it makes a difference in the results, and didn’t observe any significant changes. For the sake of completeness, I also added the Pearson correlation statistic, which is a measure of the linear correlation between two variables and, as you will see next, both Pearson and Spearman agree on the final correlations.
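One reason the choice of conversion matters little for Spearman’s statistic is that it depends only on ranks, so any order-preserving re-encoding of the CL levels leaves it unchanged, while Pearson’s coefficient can move. The sketch below illustrates this on made-up seasons; the `WEIGHTED` mapping is a hypothetical example of a weighted conversion, not the one used in the analysis, and the size of the Pearson shift on this toy data says nothing about the real sample.

```python
from math import sqrt

# Linear encoding from the text and a HYPOTHETICAL weighted alternative in
# which the gaps grow towards the final; both preserve the same ordering.
LINEAR = {"NA": 7, "G4": 6, "G3": 5, "R16": 4, "QF": 3, "SF": 2, "F": 1}
WEIGHTED = {"NA": 18, "G4": 17, "G3": 16, "R16": 15, "QF": 13, "SF": 9, "F": 1}


def average_ranks(values):
    """1-based ranks; tied values share the mean of the positions they occupy."""
    first_pos, count = {}, {}
    for pos, v in enumerate(sorted(values), start=1):
        first_pos.setdefault(v, pos)
        count[v] = count.get(v, 0) + 1
    return [first_pos[v] + (count[v] - 1) / 2 for v in values]


def pearson(x, y):
    """Pearson's linear correlation coefficient of two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def spearman(x, y):
    """Spearman's rho: Pearson's coefficient computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Made-up seasons, purely illustrative (NOT the post's data).
levels = ["F", "SF", "QF", "R16", "G3", "G4", "NA", "R16", "G4", "NA"]
standings = [1, 1, 2, 2, 3, 5, 4, 1, 3, 8]

linear = [LINEAR[lv] for lv in levels]
weighted = [WEIGHTED[lv] for lv in levels]

spearman_linear = spearman(linear, standings)
spearman_weighted = spearman(weighted, standings)
pearson_linear = pearson(linear, standings)
pearson_weighted = pearson(weighted, standings)

print(f"Spearman: linear={spearman_linear:.3f}, weighted={spearman_weighted:.3f}")
print(f"Pearson:  linear={pearson_linear:.3f}, weighted={pearson_weighted:.3f}")
```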





Let’s look at some numbers and try to understand what they mean in relation to the question at hand.









| | Full Sample | Limited Sample |
|---|---|---|
| Spearman’s Correlation Coeff. | 0.28 | 0.08 |
| P value (Spearman) | 0.00 | 0.07 |
| Pearson’s Correlation Coeff. | 0.24 | 0.12 |
| P value (Pearson) | 0.00 | 0.01 |

Table 1: Spearman’s and Pearson’s correlation coefficients and their corresponding p values for the full and limited samples.

The correlation coefficient tells us how much correlation exists, and the p value tells us how much confidence we have in the coefficient (the lower the p value, the higher the confidence). In statistics, a p value less than or equal to 0.05 is usually accepted as giving sufficient confidence (95%, to be exact) in the coefficient. In Table 1, the coefficient is almost zero (0.08/0.12) for the limited sample and a bit higher (0.28/0.24) for the full sample, along with low p values (only the Spearman p value of 0.07 for the limited sample sits slightly above the 0.05 threshold), thus indicating reasonable confidence in both. A perfect positive correlation would be 1, which in our case would mean that the higher up a team progresses in the CL, the higher up they finish in the local league (and vice versa for the perfect negative correlation, -1). In other words, with a correlation of 1, good performance in the CL would go hand in hand with good performance in the local league in a perfectly linear fashion (something Beşiktaş is failing to achieve so far). In both the limited and the full samples, we have positive correlations. The fact that the correlation for the full sample is higher can be attributed to the inclusion of the non-attendances to the CL (NA), which comprise a very large portion of the full sample (Figure 1c). This creates a larger contrast between the local league standings and the CL levels. This all makes sense; after all, the CL advance level is definitely not the only influence on a team’s final local league standing. There are, of course, many other parameters at play, such as a club’s wealth, the difficulty of the league etc. And here, in both samples, we have a mixture of beasts like Real Madrid and Manchester United along with underdogs such as Artmedia, Zilina etc.





If you follow the CL at all, you will know that experience is everything in that competition. It is imperative for a team to have competed in the CL for a while, to understand the atmosphere and get used to what it is like to play against the top teams in the world, before being able to even think about advancing towards, say, the QF. And advancing even higher is usually reserved for select teams such as Barcelona, Real Madrid, Bayern Munich, Liverpool and a few others which UEFA loves (!!!). But let’s assume for a moment that we are living in a fair world and that there is total meritocracy in the CL. So, next I will investigate how CL experience affects the correlation between the CL advance levels and final league standings. I will look at this problem from two different angles by defining ‘CL experience’ in two different ways:

The number of seasons a team attended the CL competition (Quantity)

The highest level a team advanced to during a CL season (Quality)





The first definition (quantity) assumes that the more a team was involved in the competition and listened to that CL music before the games, the more they will be used to competing in both the CL and the local league at the same time. The second definition (quality) assumes that the higher a team ascends the mountain, the more they will be attuned to the atmosphere of the CL and the less they will be consumed by its glitter while competing in the local league.





CL Experience: Quantity





Here, I stratified the teams in both the full and the limited samples by the number of times they attended the CL (i.e., less than or equal to 14 times, 13 times, 12 times and so on, down to 1). Then, I calculated correlation coefficients and their corresponding p values for each stratification. Additionally, I removed the teams which have small standard deviations (less than 1.5) in their league performances, to eliminate teams such as Celtic, Olympiacos, BATE Borisov, Ludogorets Razgrad etc., who are consistently placed 1st or 2nd in their local leagues regardless of their performance in the CL and thus contaminate the signal I am looking for in the correlation analysis. The results are in Figure 3.

Figure 3: The Spearman’s (blue lines) and Pearson’s (green lines) correlation coefficients and their corresponding p values (dashed lines) between the CL advance levels and the final league standings with respect to the CL experience: attendance over (a) the full sample and (b) the limited sample.
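The cumulative stratification behind Figure 3 can be sketched as follows. Everything here is illustrative: the team names and seasons are invented, `stratum` is a hypothetical helper, and the sketch assumes the standard deviation is the sample standard deviation taken over the seasons present in the sample, which the text does not specify. Strata without any variability return `None` instead of a coefficient.

```python
from math import sqrt
from statistics import stdev


def average_ranks(values):
    """1-based ranks; tied values share the mean of the positions they occupy."""
    first_pos, count = {}, {}
    for pos, v in enumerate(sorted(values), start=1):
        first_pos.setdefault(v, pos)
        count[v] = count.get(v, 0) + 1
    return [first_pos[v] + (count[v] - 1) / 2 for v in values]


def spearman(x, y):
    """Spearman's rho, or None when it is undefined (no variability)."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    if n < 2:
        return None
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        return None
    return cov / sqrt(var_x * var_y)


# Invented teams: one (encoded CL level, final league standing) per CL season.
teams = {
    "Veterans FC": [(1, 1), (2, 1), (3, 2), (4, 4), (2, 1), (4, 5)],
    "Steady United": [(6, 1), (5, 1), (6, 2), (6, 1)],  # always near the top
    "Newcomers SK": [(6, 2), (4, 6)],
    "Debutants FK": [(5, 3), (4, 7), (6, 2)],
}


def stratum(teams, max_attendances, min_std=1.5):
    """Pool the seasons of teams with at most `max_attendances` CL seasons,
    dropping teams whose league standings barely vary (std below `min_std`)."""
    pooled = []
    for seasons in teams.values():
        standings = [standing for _, standing in seasons]
        if len(seasons) > max_attendances:
            continue  # too experienced for this stratum
        if len(standings) < 2 or stdev(standings) < min_std:
            continue  # consistently similar league finish: filtered out
        pooled.extend(seasons)
    return pooled


results = {}
for k in range(6, 0, -1):  # "6 or fewer attendances", "5 or fewer", ...
    pooled = stratum(teams, k)
    levels = [level for level, _ in pooled]
    standings = [standing for _, standing in pooled]
    results[k] = spearman(levels, standings)

for k, rho in results.items():
    print(k, "n/a" if rho is None else f"{rho:.2f}")
```

In this sketch a team with a single season in the sample is dropped, because a standard deviation cannot be computed from one value; the real analysis may have handled that case differently.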

Finally, the data starts telling us something. The CL experience of the teams is highest at the leftmost side of Figures 3a and 3b and decreases towards the right. Figure 3 shows how the correlation between CL advance level and final local league standing changes with diminishing CL experience: the correlation diminishes as well! It never goes negative in the full sample (Figure 3a), but it does in the limited sample (Figure 3b), indicating that the less experienced a team is, the more poorly it will likely perform in its local league as it advances higher in the CL. The p values increase towards the low-experience side in both Figures 3a and 3b, due to the reduction of the sample size as the cumulative experience level goes down. However, the consistent decrease in the correlation coefficients suggests that the analysis is reliable. Besides, the p values start heading down again after experience level 6 in Figure 3b, indicating that even though the sample size is smaller, the negative correlation becomes clearer, and so the confidence in the results increases.

One question that pops into my head looking at Figure 3 is: why do the correlation coefficients go all the way down to -0.4 in the limited sample, whereas they never drop below 0.0 in the full sample? This may be attributed to the long streaks of non-attendance by a significant number of teams in the full sample, which introduce an irrelevant relationship into the final results (that is, irrelevant to the CL advance level vs. final local league standing). The limited sample only contains data from seasons in which the team attended the CL, thus giving a more accurate view of the correlation I am looking for.

My main motivation for including the full sample in my analysis was to see if there is a clear effect of non-attendance (i.e., if the teams were performing significantly better in their local leagues when they didn’t attend the CL). However, it seems hard to discern such information from this analysis due to the possible introduction of the irrelevant relationships I mentioned above.





CL Experience: Quality





In the case of quality, I looked for the maximum CL level reached by teams and stratified accordingly. For example, if a team attended CL only twice during the 14-year analysis period, but reached the quarter final in one of them, that team goes to the QF bin. The results are in Figure 4.

Figure 4: Same as Figure 3 but with respect to the CL experience: highest advance level.
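The binning by highest advance level can be sketched like this. The team names and seasons are invented, `highest_level_bins` is a hypothetical helper, and the encoding of the levels between G3 and F is assumed to be consecutive. The sketch only assigns each team to a bin; pooling bins cumulatively and computing the correlations per stratum is not shown.

```python
# Encoding from the text (NA=7, G4=6, G3=5, ..., F=1); smaller means deeper run.
LEVEL_VALUE = {"NA": 7, "G4": 6, "G3": 5, "R16": 4, "QF": 3, "SF": 2, "F": 1}
VALUE_LEVEL = {value: level for level, value in LEVEL_VALUE.items()}

# Invented (team, CL level) records, one per attended CL season.
seasons = [
    ("Twice Only FC", "G4"),
    ("Twice Only FC", "QF"),  # attended twice, reached the QF once -> QF bin
    ("Regulars CF", "R16"),
    ("Regulars CF", "SF"),
    ("Regulars CF", "F"),
    ("One Timer SK", "G4"),
]


def highest_level_bins(seasons):
    """Group teams by the deepest CL level they reached in any season."""
    best = {}
    for team, level in seasons:
        value = LEVEL_VALUE[level]
        if team not in best or value < best[team]:
            best[team] = value  # smaller encoded value = higher advance level

    bins = {}
    for team, value in best.items():
        bins.setdefault(VALUE_LEVEL[value], []).append(team)
    return bins


bins = highest_level_bins(seasons)
for level, members in bins.items():
    print(level, members)
```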

What we see in Figure 4 is quite similar to what we saw in Figure 3, and this time the decrease in the correlation coefficient is more linear and sharper for the limited sample (Figure 4b). The correlation becomes negative below the QF level, suggesting that a team that only reached the round of 16 or lower during the analysis period would likely struggle in their local league if they advanced to higher levels in the CL. You might notice there is no reported value for G4 in the limited sample. That is because there is no variability in the sample for the lowest advance level (it only contains G4), so it does not yield a correlation. It is possible, however, to calculate a correlation coefficient for G4 in the full sample (Figure 4a), since it contains two categories, namely G4 and NA. However, having only two categories leads to weak variability, which produces a bump in the p value for G4 in Figure 4a. The p values for the limited sample reach and exceed 0.6 for SF and QF, reducing the reliability of the statistics for these two levels. However, the consistency of the monotonic decrease in the results is convincing enough (for me, at least) that they have value.





As a matter of fact, it is no surprise that the results of the quantity (Figure 3) and the quality (Figure 4) analyses are similar, since these two approaches are not independent of each other. To demonstrate that, I calculated the average CL attendances for each of the CL advance level categories and plotted them (in other words, I plotted the quantity versus quality).





Figure 5: CL advance level vs. the average CL attendance
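The quantity-versus-quality calculation plotted in Figure 5 amounts to counting each team’s CL attendances and averaging those counts within each highest-level category. A minimal sketch, with invented teams and a hypothetical helper name, and with the level encoding between G3 and F assumed to be consecutive:

```python
from collections import defaultdict

# Encoding from the text (NA=7, G4=6, G3=5, ..., F=1); smaller means deeper run.
LEVEL_VALUE = {"NA": 7, "G4": 6, "G3": 5, "R16": 4, "QF": 3, "SF": 2, "F": 1}
VALUE_LEVEL = {value: level for level, value in LEVEL_VALUE.items()}

# Invented (team, CL level) records, one per attended CL season.
seasons = [
    ("A", "F"), ("A", "SF"), ("A", "R16"), ("A", "G3"),  # 4 seasons, best F
    ("B", "F"), ("B", "QF"),                             # 2 seasons, best F
    ("C", "G4"),                                         # 1 season, best G4
    ("D", "G4"), ("D", "G4"),                            # 2 seasons, best G4
    ("E", "QF"), ("E", "G3"), ("E", "G4"),               # 3 seasons, best QF
]


def average_attendance_by_best_level(seasons):
    """Mean number of CL attendances of the teams in each highest-level bin."""
    attendances = defaultdict(int)
    best = {}
    for team, level in seasons:
        attendances[team] += 1
        value = LEVEL_VALUE[level]
        best[team] = min(value, best.get(team, value))

    counts_per_level = defaultdict(list)
    for team, value in best.items():
        counts_per_level[VALUE_LEVEL[value]].append(attendances[team])

    return {
        level: sum(counts) / len(counts)
        for level, counts in counts_per_level.items()
    }


averages = average_attendance_by_best_level(seasons)
for level, mean_attendance in averages.items():
    print(level, mean_attendance)
```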

Indeed, there is a very strong relationship between the experience of a team in the CL and the highest level it advances to. Figure 5 demonstrates that the average attendance for the finalists between 2003–2016 is 10.2, that is, they attended more than 10 of the 14 competitions on average. This number is 1.3 for the teams who didn’t go higher than placing last in the initial group stage (G4). You can see from Figure 5 that you might want your team to be in the CL at least 5 out of 14 times to see the quarter finals or higher. Did you also notice the abrupt jump from around 5.9 to 10.2, the average CL attendances for SF and F respectively? Crunch time is when you need experience the most.





Correlation Coefficient by Country



