Analysis of third-party website references and the online advertising ecosystem

Overview of websites and size of the ecosystem

Our approach to analysing the ‘behind the scenes’ ecosystem began with an assessment of the 1250 websites selected. The websites examined were of various types including search engines, news agencies, social media, commerce and finance, professional networks, video streaming, and cloud services. While each country had its unique set of websites, there were several common domains present, and with similar degrees of popularity. For instance, well-known sites such as google.com, youtube.com and wikipedia.org all featured in the top 10 sites of the five countries.

The largest similarity in web domains was between the UK and USA (at 46%), but this is to be expected given the likeness of the two countries (e.g., in society and language). This was followed by the two European countries, the UK and Germany (at 35%), and then the USA and Germany (at 29%). Unsurprisingly, the most dissimilar countries were Russia and Japan (at 9%) and the USA and Russia (at 11%). Language spoken and location are likely to be reasons for these differences. It is important to note such similarities and differences as they may help explain differences in ecosystems across countries in the subsequent sections.
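The exact calculation behind these similarity percentages is not spelled out above. As a minimal sketch, assuming similarity is simply the share of one country's list that also appears in another's (the function name and the miniature site lists are hypothetical), the pairwise comparison might look like this:

```python
from itertools import combinations


def overlap_percent(sites_a, sites_b):
    """Percentage of the domains in list A that also appear in list B."""
    set_a, set_b = set(sites_a), set(sites_b)
    if not set_a:
        return 0.0
    return 100.0 * len(set_a & set_b) / len(set_a)


# Hypothetical miniature lists; the study used 250 sites per country.
top_sites = {
    "UK": ["google.com", "youtube.com", "wikipedia.org", "bbc.co.uk"],
    "USA": ["google.com", "youtube.com", "wikipedia.org", "cnn.com"],
    "Russia": ["google.com", "yandex.ru", "vk.com", "mail.ru"],
}

# Compare every pair of countries once.
pairwise = {
    (a, b): overlap_percent(top_sites[a], top_sites[b])
    for a, b in combinations(top_sites, 2)
}
```

With lists of equal length the measure is symmetric, which is why each pair only needs to be computed once.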

Fig. 1 The network of UK websites and all of the third-party references made by these sites. This illustrates the size and complexity of these networks, and while it is not possible to read the intricate details, this does highlight the significance of these networks.

From our analysis of each country’s websites, we were able to extract a comprehensive list of third-party sites that were referenced. We used a simple “source-to-reference” list format, where source was the initial site and reference was the site referenced; the format also catered for third-party sites making their own references. The real advantage of this format, however, was the ease with which social network analysis (SNA) tools could be applied to allow the characteristics of the third-party ecosystem to be explored and visualised. As an example, we present Fig. 1, which was created using the SNA tool Gephi (see Footnote 1). This figure displays a network constituted of the UK websites, the third-party sites referenced, and the various connections between them.

The first point of significance regarding Fig. 1 was the substantial size of the network created based on only 250 initial sites; we do not intend this image to be legible but instead use it to exemplify the network’s vastness. Overall, the network contained 1078 unique domains (or in graph terms, nodes), with 4242 connections (also known as directed graph edges) between them. In this context, connections denote references that indicate third-party website calls, potential links to advertisement networks, or associations to legitimate sites (e.g., bbc.com making references to bbci.co.uk for images). From a user perspective, two key aspects to highlight here are the number of additional domains referenced (4 times the initial number), many of which are likely to be unknown to users, and the large number of connections between these various sites. Moreover, as can be seen, some of the domains are very well connected and may act as hubs that gather user information from various sites; this point will be discussed further in the next section.
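To illustrate how node and edge counts of this kind follow from a source-to-reference list, here is a minimal sketch. It assumes that repeated references between the same two domains count as a single directed edge, which the text does not confirm; the function name and the sample pairs (apart from the bbc.com example above) are hypothetical.

```python
def build_network(pairs):
    """Return (nodes, edges) of a directed graph from (source, reference) pairs."""
    # A set keeps each directed edge once, even if the reference repeats.
    edges = set(pairs)
    # Every domain that appears as a source or a reference is a node.
    nodes = {domain for edge in edges for domain in edge}
    return nodes, edges


# Hypothetical sample data in source-to-reference form.
pairs = [
    ("bbc.com", "bbci.co.uk"),
    ("bbc.com", "doubleclick.net"),
    ("example-news.co.uk", "doubleclick.net"),
    ("doubleclick.net", "adnxs.com"),  # a third party making its own reference
    ("bbc.com", "doubleclick.net"),    # repeated reference, counted once
]

nodes, edges = build_network(pairs)
print(len(nodes), "domains,", len(edges), "references")
```

The same edge list can be exported as a two-column file and loaded into an SNA tool for visualisation.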

With regard to the characteristics of the other networks, the USA possessed the largest and most dense network with 1135 domains and 4455 references. It was followed by Germany with 1034 domains and 3808 references, Russia with 976 domains and 3582 references, and Japan with 796 domains and 2967 references. Each of these networks highlights a notable number of third-party sites that are referenced as individuals browse webpages.

Although the size of these networks was a noteworthy factor in itself, our most intriguing finding was the positioning of Japan given its large market size (126 million citizens) and high Internet adoption rate (at 91%) [31]. In particular, Japan has the 5th largest number of Internet users in the world (second only to the USA in our study), and its adoption rate is behind only that of the UK (at 92.6%) in our study. A market such as this would be ideal for web advertising networks, hence our expectation of a larger third-party website prevalence. Upon further investigation, we found that the reason for this disparity may not be due to a lack of attempts, but rather because Japan still primarily focuses on television advertising [32]. This is in contrast to markets such as the USA and UK where online advertisements (as might be inferred from the size of the networks discussed) are the dominant way to reach consumers.

The big players: third-party sites referenced the most

The networks discussed above provide a ‘bird’s eye view’ of the website ecosystem and its complexities. Our aim in this section is to further understand that ecosystem, and in particular, the most referenced sites. This would allow us to empirically identify the central third-party websites in each country, the key players behind these domains, and how they may differ across locales.

Fig. 2 The key sites that are referenced within the UK network, where the size of the domain name and node indicates the number of times that this site is referenced. While all of the finer details are not entirely clear, this figure is key in illustrating the sheer volume of referenced sites.

To determine the most referenced websites in the country networks that were created, we calculated the in-degree network centrality (i.e., an SNA metric that measures the number of incoming connections) for all of the websites. We then used this in-degree measure to proportionally size the various nodes in each of the networks such that the larger the node the more times it was referenced. Figure 2 shows the visual of the UK network from Fig. 1 updated to account for this metric. For purposes of readability, we also present an alternative visualisation in Fig. 3 which focuses on the top 50 domains.
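As a minimal sketch of the in-degree calculation described above (not the tooling actually used in the study), the number of incoming references per domain can be counted directly from the edge list. The function names and sample edges are hypothetical, and repeated references between the same pair of domains are assumed to count once.

```python
from collections import Counter


def in_degree(edges):
    """Count incoming references per domain from (source, reference) pairs."""
    unique_edges = dict.fromkeys(edges)  # drop duplicates, keep first-seen order
    return Counter(ref for _src, ref in unique_edges)


def top_referenced(edges, n=10):
    """Return the n most referenced domains with their in-degree."""
    return in_degree(edges).most_common(n)


# Hypothetical sample edges.
edges = [
    ("site-a.co.uk", "doubleclick.net"),
    ("site-b.co.uk", "doubleclick.net"),
    ("site-c.co.uk", "doubleclick.net"),
    ("site-a.co.uk", "google-analytics.com"),
    ("site-b.co.uk", "google-analytics.com"),
    ("site-a.co.uk", "facebook.com"),
]

for domain, count in top_referenced(edges, n=3):
    print(f"{domain} ({count} references to it)")
```

The resulting counts are the values that would drive node size in a visualisation such as Fig. 2.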

Fig. 3 Word cloud displaying the key sites that are referenced within the UK network, where the size of the domain name indicates the frequency with which that site is referenced.

From an analysis of each website’s in-degree, the top ten were: doubleclick.net (189 references to it), google-analytics.com (134 references), google.com (117 references), facebook.com (107 references), google.co.uk (90 references), adnxs.com (82 references), scorecardresearch.com (71 references), googletagmanager.com (56 references), twitter.com (55 references) and quantserve.com (52 references).

The first observation that may be made from this list is the prevalence of domains that are attributable to the Internet giant, Google Inc. These include doubleclick.net (DoubleClick is a subsidiary of Google that develops and provides Internet advertisement-serving services), google-analytics.com (Google Analytics is a web analytics service offered by Google), google.com and google.co.uk (Google search engines or portals to other Google services), and googletagmanager.com (Google Tag Manager is a system that allows the management of tags or code snippets which send information to third parties). Given that most of these sites can be linked to online ads, this finding highlights Google’s dominance of the online advertising industry in the UK and acts to re-emphasise the conclusions of earlier works [11].

While not as significant as Google, Facebook (facebook.com) and Twitter (twitter.com) are also referenced by several websites. These references may be due to embedded Facebook and Twitter services (e.g., ‘Like’ or ‘Share’ buttons on sites) or ads, considering that Facebook, for instance, is one of the leading companies in the online advertising industry [11]. Our analysis of the other top domain names uncovered that they are all generally linked to digital advertising services and platforms.

AppNexus (the owner of adnxs.com), for instance, regards itself as an independent technology company that provides trading solutions and powers marketplaces for Internet advertising; it partners with, and is backed by, companies such as Microsoft [33]. The aim of the next company, Scorecard Research (scorecardresearch.com), is to collect data (in this instance, information on users’ website visitation patterns) to assist its clients in better meeting the needs of consumers (see Footnote 2). Finally, Quantcast (quantserve.com) engages in website audience measurement for publishers and tailored online advertisement delivery (see Footnote 3).

Broadly speaking, a central goal of these and other third-party organisations is to facilitate a better understanding of website users (including their profiles and browsing patterns across sites) to allow for more unique and targeted marketing. This is intended to benefit advertisers by enabling them to reach their ideal consumers, and consumers by ensuring they are shown more suitable ads.

The networks of the other countries were similar to that of the UK, with Google dominating the top 10 most referenced sites. As shown in Table 1, the Google domains doubleclick.net, google-analytics.com and google.com were permanent features, with facebook.com the only other domain appearing in all countries.

Table 1 The most referenced sites in the USA, Germany, Russia and Japan.

There were also a range of new domains found, many of which were associated with other digital advertising firms. These include DemDex (demdex.net), now owned by the American multinational Adobe Systems Inc., which captures behavioural data on users to allow for better targeting of online ads [34], and InfoOnline (ioam.de), a German-based organisation focused on digital audience measurement. In Russia, Yandex N.V. (yandex.ru) is a large technology firm which also engages in online advertising, while the site tns-counter.ru appears to be part of a project by a Russian enterprise (TNS) seeking to understand the behaviour of Russian citizens on websites. Lastly, the technology company OpenX (openx.net) specialises in online advertising marketplaces.

For those domains not directly associated with advertising firms, one reason for their prevalence may be social plug-ins (as discussed with Facebook and Twitter earlier). For instance, VK (vk.com) is a Russian social networking service that has site widgets and plug-ins for sharing and ‘liking’ similar to Facebook. We also have found plug-ins for auctions and marketplaces, which may explain the popularity of sites such as yahoo.co.jp (though we should note that Yahoo! does also engage in the online advertising industry [35]).

While Table 1 is useful for highlighting the key domains in the ecosystem, we were also keen to explore the websites that were well referenced but not present in the initial 250 for each country. This would more clearly depict the prominent websites that were in the ecosystem only as a result of being referenced by other sites. Figure 4 presents our findings when considering the top 10 websites of each country. We have also noted the prevalence of these third-party sites in other countries to allow for comparison.

Fig. 4 A comparison of the most common third-party websites across the five countries that were studied (USA, UK, Germany, Russia and Japan). This shows the third-party sites that were the most prevalent across all of the countries.

Looking beyond the websites that are well-known to us based on earlier findings, Fig. 4 uncovers several additional domains. These include adition.com (from the German-based, ADITION Technologies), bluekai.com (BlueKai, which was acquired by Oracle in 2014), criteo.com (Criteo) and rubiconproject.com (Rubicon Project). These are all companies that engage in digital advertising of some form. One noteworthy point here is the fact that although certain organisations (e.g., AppNexus) are established in multiple countries, others appear to concentrate only on local networks. This seems particularly relevant in the German and Russian cases with sites such as adition.com, adriver.ru, ioam.de and yadro.ru. While culture may generally be a factor, it could also be a business decision to focus on core markets and their dominance.

With respect to the US and UK, they have similar distributions across each of the various domains, with the US typically leading the UK. Figure 4, therefore, also mirrors the structure of those networks more broadly. Although Japan has the smallest network overall, it should be noted that several of the advertising domains are present which may suggest a growing interest by marketers generally.

The dominance of Google in Japan, however, is not to be overlooked. We compared the percentage of references made to the main Google domains (i.e., those in the top 10 in Table 1) across the countries and found that 20.42% of all (2967) third-party references on Japanese sites are to Google domains. For the other countries, the US, UK, Germany and Russia are at 11.76%, 13.81%, 14.36% and 14.48% of all references respectively. It will be interesting to monitor this trend in the future and examine whether Google’s dominance expands, or other local advertising companies such as Dentsu and Hakuhodo establish a substantial online presence.
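A share of this kind can be sketched as the fraction of all references whose target belongs to a chosen set of domains. The function name, the sample edges and the choice of three Google domains below are illustrative assumptions; the study's own set was the Google domains in the top 10 of Table 1 for each country.

```python
def reference_share(edges, target_domains):
    """Percentage of all (source, reference) pairs whose reference is a target."""
    targets = set(target_domains)
    total = len(edges)
    if total == 0:
        return 0.0
    hits = sum(1 for _src, ref in edges if ref in targets)
    return 100.0 * hits / total


# Illustrative set only; not necessarily the exact domains used in the study.
google_domains = ["doubleclick.net", "google-analytics.com", "google.com"]

# Hypothetical sample edges.
edges = [
    ("site-a.jp", "doubleclick.net"),
    ("site-a.jp", "yahoo.co.jp"),
    ("site-b.jp", "google-analytics.com"),
    ("site-b.jp", "rubiconproject.com"),
    ("site-c.jp", "criteo.com"),
]

share = reference_share(edges, google_domains)
print(f"{share:.2f}% of references point to the chosen domains")
```

As a check on the arithmetic in the text, 20.42% of 2967 references corresponds to roughly 606 references.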

The inbetweeners: third-party sites acting as a bridge

In addition to analysing websites that were referenced the most, we also engaged in a brief assessment of the third-party sites that made the most references. In SNA terms, this is called the out-degree network centrality. Identifying these sites could indicate central points in the ecosystem which act as a connector or bridge for other sites, even if they themselves are not heavily referenced. In Fig. 5, we present a comparison of the top ten third-party websites that make the most references in each country (this is similar in concept to the depiction in Fig. 4).
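The out-degree measure can be sketched in the same way as the in-degree, counting outgoing rather than incoming references. The sketch below assumes that sites from the initial list are excluded so that only third parties are ranked, and that repeated references count once; the function name and sample data are hypothetical.

```python
from collections import Counter


def out_degree(edges, initial_sites=()):
    """Count outgoing references per domain, skipping the initial sites."""
    initial = set(initial_sites)
    counts = Counter()
    for src, _ref in dict.fromkeys(edges):  # each unique edge once
        if src not in initial:
            counts[src] += 1
    return counts


# Hypothetical sample edges.
edges = [
    ("bbc.com", "adnxs.com"),
    ("adnxs.com", "rubiconproject.com"),
    ("adnxs.com", "casalemedia.com"),
    ("doubleclick.net", "adnxs.com"),
    ("adnxs.com", "rubiconproject.com"),  # repeated reference, counted once
]

bridges = out_degree(edges, initial_sites=["bbc.com"])
print(bridges.most_common(10))
```

A domain with a high count here but a modest in-degree would be a candidate 'bridge' in the sense used in this section.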

Fig. 5 A comparison to show the number of references made by each of the third-party websites, across the five countries that were studied (USA, UK, Germany, Russia and Japan).

The first point of note, as indicated in Fig. 5, was that several of the domains (e.g., adnxs.com, casalemedia.com, and doubleclick.net) and organisations uncovered in the previous section remain present. This means that not only were these domains heavily referenced, but that they were also responsible for several references to other third-party sites themselves. A reason for this could be partnering third-party services, shared advertising networks or advertising marketplaces. Moreover, although we found that Google domains were again prevalent, they were not as dominant as when examining sites by references received. For instance, other domains such as those by AppNexus (adnxs.com) and Rubicon Project (rubiconproject.com) featured reasonably well here.

Across countries, there were nuances in the domains present, as might be expected. AdRiver (adriver.ru), for example, is a Russian company that specialises in Internet advertising technology; this explains its presence in Russia as opposed to in other nations. The US-based company AppNexus (adnxs.com) was responsible for a significant number of out-going references in the US, UK and Germany, but none in Russia or Japan. This finding echoes the low in-degree centrality measures for this domain in these countries (as shown in Fig. 4). In Japan, the Rubicon Project (rubiconproject.com) has the highest number of out-going references and could indicate a hub or main ‘bridge’ platform in the Japanese advertisement marketplace. The opposite is true in the US and Russia, where the Project is hardly prominent.

While understanding the domains with the highest out-degrees allows some insight into the main intermediate third-parties, we were also interested in the specific sites that they reference. This could enable more insight into the ecosystem. We therefore analysed each of the top ten domains for each country to determine what domains they referenced. As an example of our findings, we present Fig. 6 which is a directed network graph of the top UK third-party domains.

Fig. 6 A directed graph that shows the top ten third-party sites in the UK, which are ordered by out-degree. The size of the domain is proportionate to the number of references made by that site (i.e., the larger the text, the more references). Additionally, the other domains that are referenced by these third-party sites are also included.

In addition to re-emphasising the prominence of adnxs.com and doubleclick.net, Fig. 6 displays the range of domains which are referenced. A detailed inspection of the graph highlights the existence of additional advertising networks and organisations such as Adblade (adblade.com), Optimatic (optimatic.com) and Smart AdServer (smartadserver.com). This may suggest some association or sharing across web domains to supply ads to users. We also identified a few well-known organisations such as Expedia (expedia.co.uk), Facebook (facebook.com), GoDaddy (godaddy.com), and Travelocity (travelocity.com). The links here could be justified by various means, but possibly the most likely is the provision of ads. This, in addition to our earlier findings, exemplifies the extent to which advertising networks and organisations may work together in the website and third-party ecosystem.

To follow up on the potential association between these various third-party networks, we decided to conduct a brief exploratory study. For this, we returned to our initial network graphs of the full sites and references (e.g., as shown in Fig. 2). We then applied a clustering algorithm [36] to these graphs to determine the main associated groups and clusters that can be uncovered. Figure 7 depicts the clusters discovered in the UK network graph.

Fig. 7 A graph to highlight the different clusters within the third-party sites, which aims to draw out the distinct groupings within these sites in the UK network.

From Fig. 7, we can see that the clustering algorithm identified three large clusters as represented in purple, green and light blue. The purple cluster is the largest one and accounts for approximately 25% of the sites. It contains many of Google’s sites such as google.com, doubleclick.net, google-analytics.com and googletagmanager.com. An interesting addition to this cluster is facebook.com, given its own advertising ambitions, but this may be explained by the presence of these third-party references on many of the same sites.

The green cluster is the second in size at approximately 20% of the network. The prominent sites within this cluster include adnxs.com, rubiconproject.com, mathtag.com and yahoo.com. Visually analysing the cluster, many of the second-tier third-party sites and advertising companies (e.g., AppNexus and the Rubicon Project) can be seen. This hints more strongly at the reality that advertising networks and organisations, particularly smaller ones, may work together in the third-party ecosystem.

Finally, in light blue and covering approximately 13% of the network, is the third-largest cluster. Here there are more domains from advertisers (or enabling the provision of advertisements) such as googlesyndication.com, moatads.com, adsafeprotected.com and googletagservices.com. This cluster was somewhat understandable given the associations between the included sites, but we might have expected a closer association with the first cluster (e.g., potentially the main Google domains being present in the same cluster). Possibly the most significant finding from this exploratory study was the close network (or grouping) between some sites that almost certainly alludes to a common advertising marketplace. This is an area we will seek to investigate more in our future work.

User perceptions and understanding

The results presented to date have highlighted the significance of advertising networks within popular websites across five countries. In fact it is clear to see that there are a number of networks that are prevalent across the world, with a core set of companies providing a large percentage of third-party content on websites. The next stage of our research involved understanding the perceptions of Internet users as it pertains to such networks.

To facilitate our investigation, an online survey was carried out over the course of 3 months. The specific aim was to consider topics such as personal privacy, and also to understand how aware users are of the prevalence of advertising networks and other third-party content on popular websites. The survey received ethical approval from our local university review board, and we obtained informed consent from participants. In total we received 109 responses, with questions covering standard biographic information (age, gender, education and technical understanding), perceptions of online privacy, and finally perceptions regarding the amount of third-party content that was linked to by a selection of popular websites (drawn from the Alexa Top websites). The majority of respondents to the survey were based in the United Kingdom; however, the survey was not limited by a user’s location, with a range of countries being represented.

The majority of responses to the survey came from females (54%) and 48% of the respondents were under the age of 34. The younger profile of the respondents suggested our sample had grown up with the Internet and associated technologies, and therefore might be more knowledgeable in this area. This was further underlined by the participants’ high knowledge of computers and the Internet: over 67% of participants ranked themselves at 7 or above, with no-one ranking themselves below a 3 (this is based on self-reports using a scale of 1 to 10, where 10 was extremely knowledgeable and 1 was very little knowledge).

After completing the biographic information, we queried participants about their own perceptions of online privacy and its importance to them; this would be especially important when considering knowledge of third-party networks. Specifically, they were asked to rate the importance of online privacy, using a 5-point Likert scale. In total, 80% of respondents expressed that online privacy was either very important (52%) or extremely important to them (28%). A Pearson’s Chi-Squared test was used to determine any association between the importance of online privacy to a participant and their perceived level of technical knowledge. This test resulted in a p-value of 0.85, which indicates no statistically significant association between perceived technical knowledge and the importance of online privacy. This apparent independence could be interpreted as online privacy being an important concept regardless of the level of technical knowledge and understanding of the individual, which in itself is a positive finding.
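For readers unfamiliar with the test, the following is a minimal sketch of Pearson's Chi-Squared test of independence on a contingency table of counts. It is not the software used for the survey analysis, and the example table is hypothetical rather than survey data. No continuity correction is applied, so library routines that apply one by default for 2x2 tables may give slightly different values.

```python
import math


def chi_squared_independence(table):
    """Return (statistic, degrees of freedom) for a table of observed counts.

    Assumes every row and column total is non-zero.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if the two variables were independent.
            expected = row_totals[i] * col_totals[j] / total
            statistic += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, dof


def chi2_survival(x, dof):
    """P(X >= x) for a chi-squared variable with integer dof >= 1."""
    if x <= 0:
        return 1.0
    half = x / 2.0
    if dof % 2 == 0:
        # Even dof: exp(-x/2) * sum_{k < dof/2} (x/2)^k / k!
        term, total = 1.0, 1.0
        for k in range(1, dof // 2):
            term *= half / k
            total += term
        return math.exp(-half) * total
    # Odd dof: normal tail plus a finite series in odd powers of sqrt(x).
    chi = math.sqrt(x)
    term, total = chi, 0.0
    for r in range(1, (dof - 1) // 2 + 1):
        if r > 1:
            term *= x / (2 * r - 1)
        total += term
    density = math.exp(-half) / math.sqrt(2 * math.pi)
    return math.erfc(chi / math.sqrt(2)) + 2 * density * total


# Hypothetical 2x2 table, e.g. two groups by two response categories.
observed = [[10, 20], [20, 10]]
statistic, dof = chi_squared_independence(observed)
p_value = chi2_survival(statistic, dof)
print(f"chi2={statistic:.3f}, dof={dof}, p={p_value:.4f}")
```

A large p-value (such as the 0.85 reported above) means the data provide no evidence against independence; it does not by itself prove that the two variables are unrelated.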

Fig. 8 A graph showing a comparison between the sites that were used most commonly by survey participants and the sites that they thought to contain the most advertising.

To assess participants’ perceptions and understanding of online advertisements, we asked them to indicate the category of sites they regularly visit and which of these categories they thought contained the most advertisements. Figure 8 shows the comparison between the types of sites that are commonly used and those which participants thought contained the most advertisements. The results show that the participants were of the opinion that news (15%) and social media (60%) sites were likely to contain the most advertisements. Interestingly, although that was felt to be the case, these were still the top two types of sites most commonly used by participants (with news at 75% of respondents and social media used by 86% of respondents). While there are several other variables that could be in play here, this could highlight the value of site utility over concerns of ads or online tracking.

We were also interested in understanding how often participants tended to access third-party links and advertisements online. The majority of respondents said that they ‘never’ (38%) or ‘rarely’ (53%) clicked on advertisements with only 2% of respondents claiming to ‘always’ click on such links. This result goes some way to explaining the responses to the previous question, in that users might be aware that certain sites contain the most advertisements (e.g. news or social media sites) but they simply choose to ignore these ads, for the most part. In further support of this point, we found that over half of the respondents (52%) claimed to use advertisement-blocking browser extensions ‘frequently’ (26%) or ‘always’ (26%). This potentially could account for why users were content visiting sites with advertisements and also, why they rarely clicked on them.

When comparing the importance of online privacy and the use of advertisement-blocking browser extensions, a Pearson’s Chi-Squared test resulted in a p-value of 0.69. This suggests that there is little dependence between the use of advertisement-blocking extensions and the importance of online privacy. This result is understandable because it may be argued that advertisement-blocking browser extensions are not necessarily a privacy preserving tool, with the main focus being to hide unwanted advertisements. For example, Adblock [19] is listed as the ‘most popular Chrome extension with over 40 million downloads’. However, this extension is marketed as a means to block advertisements with no mention of preserving privacy. When comparing gender to the use of ad-blocking extensions, a Pearson’s Chi-Squared test resulted in a p-value of 0.01, indicating a statistically significant association between gender and the use of advertisement-blocking software, with males generally using ad-blockers more frequently.

Our next aim was to understand how much participants were aware of some components of the underlying advertising ecosystem. For this, participants were asked to identify where they thought that the majority of advertisements on web pages originated from. A large proportion of respondents (90%) correctly identified that advertisements typically originate from a third-party, as opposed to the company that owns the site (as demonstrated in our earlier website analysis section). We also found that over half of participants were either ‘moderately aware’ (42%) or ‘extremely aware’ (21%) that when a website was visited several third-party sites were also automatically referenced. While this was good to see as it demonstrated awareness, approximately a third of participants (37%) were in the lower bracket of ‘not at all aware’, ‘slightly aware’, and ‘somewhat aware’. This is some cause for concern as it highlights that the activities that occur when users visit websites may not be transparent for all users involved.

In addition to exploring participants’ awareness, we asked them to estimate the number of third-party sites that particular websites might make reference to. This would allow us to assess how well participants’ perceptions match reality (as presented in earlier sections). We used a subset of the main-site (or home) pages from the Alexa Top 250 such that they would cover a range of different categories and would be generally recognisable by individuals. While there were a number of UK-based sites included they were those that have a global reach. As previously discussed, while the majority of participants were indeed based in the UK the sites were specifically chosen to be recognisable by a wider, global audience.

Table 2 The number of third-party websites that participants estimate specific sites link to, and the actual number of links made by those sites.

Table 2 contains a summary of the survey responses and for each site, it includes the minimum number of third-party sites stated, the maximum number of third-party sites stated, and the mean, median and mode number of sites stated across the entire set of responses. It should be noted here that we only asked individuals about the number of third-party website references on the main page of the site (e.g., google.com), and did not focus on sub-pages (e.g., google.com/news).

A point that is immediately obvious from the table is the extremely large estimates that participants expressed about third-party links. Overall, we found that in 38% of responses users estimated site references of above 500 third-party links (far above the actual cases as mentioned earlier). This highlights a sizeable disparity between perceptions and reality (as will be discussed below). To take Google as an example, it was the site that participants felt would contain the most references to third-party sites. One factor in this ranking could be that Google is a company built on data. Therefore, as much of Google’s business relates to collecting and analysing information, it is reasonable that participants would rank Google highly. In fact, from our earlier findings, we know that Google is one of the largest companies behind advertising networks.

The comparison between the number of third-party links on a website (the ‘actual value’ column in Table 2) and participants’ perceptions of these links is an interesting one. As can be seen, the estimates given by respondents are a long way from the values that were actually recorded in the course of our research. The mean values presented can be directly attributed to a number of outlying and extremely high values for each of the websites, as mentioned earlier. These values have acted to skew the mean and exemplify why the mean would not be an appropriate measure of central tendency for comparison in such distributions. In Amazon’s case, for instance, there is a mean value of 71,974, as values ranged from 1 to 5,000,000. In reality, however, there were only 3 third-party references made on the site. Comparing these would be unhelpful, and would allow no insight into the difference between perceptions and reality.

To address the issue of the extreme values entered by participants and the resulting skewed mean, we also examined the median (this is a common use of the median [37]) and mode values for each site. In Table 2 we present the median values and mode values (after value grouping), along with the total number of times (and respective percentage that) those values occur. While the median identifies the central value in the list of values supplied by participants, the modal groups categorise the values into groups spanning 25 values each (e.g., 1–25 links, 26–50 links, and so on) and give an indication of which value-groups appear the most in the sample.
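The effect of a few extreme estimates on the mean, and the grouping used for the mode, can be sketched as follows. The estimates are hypothetical (not survey responses), the function name is illustrative, and placing an estimate of 0 in the first group is an assumption.

```python
from collections import Counter
from statistics import mean, median


def modal_group(values, width=25):
    """Return ((low, high), count, percentage) for the most common value group.

    Groups are 1-25, 26-50, and so on; an estimate of 0 is placed in the
    first group (an assumption). Values are assumed to be whole numbers.
    """
    counts = Counter(max(v - 1, 0) // width for v in values)
    index, count = counts.most_common(1)[0]
    bounds = (index * width + 1, (index + 1) * width)
    return bounds, count, 100.0 * count / len(values)


# Hypothetical estimates for one site, including one extreme outlier.
estimates = [1, 5, 10, 20, 30, 80, 5_000_000]

print("mean:", round(mean(estimates)))      # dominated by the outlier
print("median:", median(estimates))         # unaffected by the outlier
print("modal group:", modal_group(estimates))
```

A single very large answer is enough to pull the mean far from every other response, whereas the median and the modal group stay close to what most participants actually entered.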

Comparing the medians to the actual observed values, some were close (e.g., Facebook, Huffington Post and BBC), and overall, they were certainly within the correct order of magnitude, with Google and Amazon being the main outliers. An interesting point to note here is that through the median, participants were largely correct in identifying the site with the highest number of links to third-party sites, i.e., Facebook. Facebook generates a large percentage of its income through data and through advertising revenues, as alluded to earlier in our article. The fact that the survey participants were able to correctly identify this site as the one with the most third-party content demonstrates some understanding of which organisations are more likely to rely on third-party material. Conversely, Google was identified as a site that users would consider to reference numerous third-party websites. In reality, Google (at least google.com) ranks as one of the lowest sites in terms of links to third-party content.

From assessing the mode groups, we could see that the most common values supplied by participants were actually within the 0–25 range for all sites. This inclination was somewhat accurate for a few cases, such as the BBC, Spotify and TalkTalk, but not for others, such as Facebook and Expedia. The total percentage (shown in Table 2) is useful here as it highlights the proportion of participants with a value within the modal range. From this we can see that for the BBC, for instance, most participants (67%) were relatively close to the correct value. However, for Google this was not the case, as less than a third of individuals were within the modal range. To investigate this further, we examined the second most common value ranges and found that Google and Facebook were both in the range of 76–100, each with 16% of participants. This acts to support our other findings above, and emphasises that while participants do have some understanding of the ‘behind the scenes’ activities, there are still misconceptions about sites and how much third-party referencing is conducted.