Alethio’s data scientists dig into the Ethereum blockchain to identify the major players across the transaction network. Leveraging the rich data available through Alethio’s platform, learn about the Hubs and Authorities of the Ethereum blockchain through Alethio’s most recent case study.

Alethio Data Science & SANSA Stack Lab

Since the inception of the Ethereum blockchain, participants have wanted to know which players have the most impact across the blockchain’s transaction network. To explore this question, we worked with the SANSA [1] team and applied two classic graph algorithms, Connected Components [2] and PageRank [3], and found some interesting facts about the distribution of top accounts.

Preliminaries & Definitions

Nodes & Edges

In this article, we analyze the value transaction network graph, where all nodes are external accounts that have had transactions on the Ethereum blockchain. The edges of the network graph indicate the transaction relationship between the two connected nodes: when a node (an external account) sends ETH to another, a transaction record is written, and an edge between them is added to the network in the direction of the ETH flow. Each edge weight is the total value, in Ether, transferred along that edge.

For example, if address A sent x ETH to address B, there will be an edge of weight x from node A to node B. Furthermore, self-loops, transactions from an external account to itself, have been removed.
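The construction above can be sketched in a few lines of Python. This is a minimal illustration, not the pipeline used in the study: the function name `build_value_graph` and the sample transfers are hypothetical, and the edge list is kept as a plain dictionary rather than a graph library object.

```python
from collections import defaultdict

def build_value_graph(transfers):
    """Aggregate (sender, receiver, value_eth) records into weighted directed edges."""
    weights = defaultdict(float)
    for sender, receiver, value_eth in transfers:
        if sender == receiver:
            continue  # self-loops are removed, as described above
        weights[(sender, receiver)] += value_eth  # sum repeated transfers
    return dict(weights)

# Hypothetical transfers; the addresses and amounts are made up.
transfers = [
    ("A", "B", 1.5),
    ("A", "B", 2.5),
    ("B", "A", 0.25),
    ("C", "C", 9.0),  # self-loop, dropped
]
edges = build_value_graph(transfers)
```

Here the two transfers from A to B collapse into one edge carrying their combined value, while the reverse transfer from B to A stays a separate edge because direction is preserved.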

RDF and Triples

The data we have is in RDF (Resource Description Framework) [4] format, a schema-less data model that differs from the relational data model. It consists of triples [5]: subject, predicate, and object. In RDF data, all information is described semantically: each record, i.e. each triple, expresses a relationship of the predicate’s type between the subject and the object. For instance, if there is a transaction where address A sends x ETH to address B, we will have a triple: address A (subject: from) sends ETH (predicate: hasValueTx) to address B (object: to). RDF data are essentially graphs.
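To make the triple-to-graph correspondence concrete, here is a small sketch that models triples as plain Python tuples. The `Triple` type, the `edges_for` helper, and the `account:` identifiers are assumptions made for illustration; only the predicate name `hasValueTx` comes from the description above, and real RDF data would use full IRIs and an RDF library.

```python
from typing import NamedTuple

class Triple(NamedTuple):
    subject: str
    predicate: str
    obj: str

# Hypothetical triples; identifiers are placeholders, not real addresses.
triples = [
    Triple("account:A", "hasValueTx", "account:B"),
    Triple("account:B", "hasValueTx", "account:C"),
    Triple("account:A", "rdf:type", "Account"),
]

def edges_for(triples, predicate):
    """Read every triple with the given predicate as a directed edge."""
    return [(t.subject, t.obj) for t in triples if t.predicate == predicate]

value_edges = edges_for(triples, "hasValueTx")
```

Selecting a single predicate yields the edge list of one graph layer, which is why RDF data can be treated as a graph directly.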

SANSA library

For data processing, we use the SANSA library, an open-source library for distributed processing of large-scale RDF data, to efficiently read the RDF triple data and execute SPARQL queries. With SANSA on Spark, RDF triples are loaded into dataframes for further analysis. The query layer is fast and supports the SPARQL language as well as all Spark SQL functions.

In this implementation of the model, we applied two classic graph analysis algorithms: Connected Components and PageRank.

Connected Components

The Connected Components algorithm detects and separates clusters in a network within which any node can reach any other along edge paths. It treats the graph as undirected: once an edge exists, the two nodes it connects are considered reachable from each other in both directions. A stricter alternative is Strongly Connected Components, which treats the graph as directed, so that nodes are reachable only along the direction of the edges. Within a strongly connected component, every node can reach every other node when direction is taken into account.
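The difference between the two notions can be illustrated with a short, self-contained sketch. The helper names `weakly_connected_components` and `reachable` and the toy edge list are hypothetical; this uses plain breadth-first search and is not the distributed implementation used in the study.

```python
from collections import defaultdict, deque

def weakly_connected_components(edges):
    """Group nodes into components, ignoring edge direction."""
    neighbours = defaultdict(set)
    for src, dst in edges:
        neighbours[src].add(dst)
        neighbours[dst].add(src)
    seen, components = set(), []
    for start in neighbours:
        if start in seen:
            continue
        seen.add(start)
        component, queue = set(), deque([start])
        while queue:
            node = queue.popleft()
            component.add(node)
            for nxt in neighbours[node] - seen:
                seen.add(nxt)
                queue.append(nxt)
        components.append(component)
    return components

def reachable(edges, source, target):
    """True if target can be reached from source following edge direction."""
    successors = defaultdict(set)
    for src, dst in edges:
        successors[src].add(dst)
    seen, queue = {source}, deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            return True
        for nxt in successors.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False

edges = [("A", "B"), ("B", "C"), ("D", "E")]
components = weakly_connected_components(edges)
```

In this toy graph A, B, and C form one connected component, yet `reachable(edges, "C", "A")` is false, so they would not form a strongly connected component.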

PageRank

PageRank was initially used by Google to rank web pages according to their link flow. The web links are regarded as a Markov chain, over which a “random surfer” (i.e. a random walk) is simulated. Each incoming edge is seen as a “vote” for the receiving web page. In this article, pages that receive many votes are interpreted as “Hubs” in the network, while pages that cast many outgoing votes are considered “Authorities”. The scores for each node are updated iteratively until either a maximum iteration limit is reached or the scores converge.
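For reference, the textbook form of the random-surfer update can be written as follows. The symbols are assumptions introduced here: d is the damping factor (commonly set to 0.85), N is the number of nodes, In(v) is the set of nodes with an edge into v, and outdeg(u) is the number of outgoing edges of u. This basic form ignores edge weights and nodes without outgoing edges, so it is not necessarily the exact variant used in this study.

```latex
PR(v) = \frac{1 - d}{N} + d \sum_{u \in \mathrm{In}(v)} \frac{PR(u)}{\mathrm{outdeg}(u)}
```

The first term models the surfer jumping to a random node, and the second term distributes each node's current score equally among the nodes it links to.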

“Authorities” and “Hubs”

In our case, “Authorities” should be accounts that pay out to a large number of addresses with high volume, while “Hubs” should be the big players that receive extensive ETH flow into their accounts. In the implementation we don’t differentiate between these two roles, but rank them together as the biggest players.

Data Set & Implementation

We limited the dataset to 10,000 blocks (Block 4,100,000 to Block 4,109,999), which contain 38,599,485 (about 38.6 million) triples, including both value transactions and contract messages.

After basic sanitization of the dataset, we applied the following two graph algorithms:

1. Connected Components [6]

By limiting the dataset to 10,000 blocks, we effectively truncated the whole transaction history. It is therefore highly likely that there are isolated nodes standing alone next to the main connected cluster of nodes. These nodes are of no interest to us, as they do not contribute to the connectivity of the main transaction network. Therefore, we filter them out and proceed to find the largest cluster inside the remaining transaction graph.

The Connected Components algorithm enables us to find well-connected clusters. In this case, we favor it over the Strongly Connected Components algorithm [7], as we only need to explore all connected nodes. Strongly Connected Components requires every pair of nodes in a component to be reachable in both directions, which is too strict for our case: in a transaction graph, a pair of accounts may trade in only one direction. For example, A may always pay B and never the reverse; the pair is still highly related for our purposes if they trade with high frequency or volume.

Using the Connected Components algorithm, we checked whether the members of any connected component form a complete graph, and we filtered out the self-loop edges, which are not compatible with the PageRank implementation in Python’s NetworkX library. The largest component contains 185,741 nodes (accounts) and 250,637 edges (aggregated transaction relations).
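The filtering step described above, dropping self-loops and keeping only the largest connected component, might look like the following sketch. It uses a simple union-find structure on a toy edge list; the function name `largest_component_edges` and the sample data are hypothetical, and the study itself ran on Spark rather than in plain Python.

```python
from collections import Counter

def largest_component_edges(edges):
    """Return edges of the largest weakly connected component, without self-loops."""
    parent = {}

    def find(node):
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # path halving
            node = parent[node]
        return node

    # Drop self-loops first; nodes that only had self-loops vanish with them.
    cleaned = [(s, t, w) for s, t, w in edges if s != t]
    for s, t, _ in cleaned:
        parent[find(s)] = find(t)  # merge the two components

    sizes = Counter(find(node) for node in list(parent))
    if not sizes:
        return []
    biggest_root, _ = sizes.most_common(1)[0]
    return [(s, t, w) for s, t, w in cleaned if find(s) == biggest_root]

# Hypothetical (source, target, value) edges.
edges = [
    ("A", "B", 1.0),
    ("B", "C", 2.0),
    ("C", "C", 5.0),  # self-loop
    ("D", "E", 3.0),  # smaller component
    ("F", "F", 1.0),  # isolated node with a self-loop
]
main_edges = largest_component_edges(edges)
```

Only the edges among A, B, and C survive, since they form the largest component once direction is ignored.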

2. PageRank [8]

Over the years, different versions of PageRank have emerged: the initial design didn’t include edge attributes, but some more recent implementations consider edge weights, and some handle multiple edges by aggregating their weights first. To stay consistent with our data semantics, we chose the NetworkX reference implementation [9], which takes a directed graph and initializes edge values with user-defined weights. In this way, the transaction value information carried by our edges is taken into consideration. In theory, the resulting ranking is more meaningful because it takes transaction volume into account.
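To show how edge weights enter the computation, here is a minimal power-iteration sketch of weighted PageRank in plain Python. It is not the NetworkX code: the function `weighted_pagerank`, its default parameters, and the toy edges are assumptions for illustration. In NetworkX itself, the comparable call would be `networkx.pagerank(G, alpha=0.85, weight="weight")` on a `DiGraph`.

```python
def weighted_pagerank(edges, damping=0.85, max_iter=100, tol=1e-10):
    """Power iteration for PageRank on a weighted directed graph.

    edges maps (source, target) -> non-negative weight.
    """
    nodes = sorted({n for pair in edges for n in pair})
    out_weight = {n: 0.0 for n in nodes}
    for (src, _), w in edges.items():
        out_weight[src] += w

    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(max_iter):
        # Rank held by nodes without outgoing weight is spread uniformly.
        dangling = sum(rank[v] for v in nodes if out_weight[v] == 0.0)
        new_rank = {v: (1.0 - damping) / n + damping * dangling / n for v in nodes}
        for (src, dst), w in edges.items():
            if out_weight[src] > 0.0:
                # Each node passes on its rank in proportion to edge weight.
                new_rank[dst] += damping * rank[src] * w / out_weight[src]
        change = sum(abs(new_rank[v] - rank[v]) for v in nodes)
        rank = new_rank
        if change < tol:
            break
    return rank

# Hypothetical edges: A sends ten times more value to C than to B.
edges = {("A", "C"): 10.0, ("A", "B"): 1.0, ("B", "A"): 1.0, ("C", "A"): 1.0}
scores = weighted_pagerank(edges)
```

B and C each receive a single incoming edge from A, so an unweighted ranking would treat them alike; with weights, C ends up with the higher score because it receives most of the value A sends.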

Results & Findings

Category of Top Accounts

With PageRank scores for all 185,741 nodes (the top 50 are shown in the appendix), we plot the distribution of the nodes’ scores. The following figure shows the scores on the y-axis for all nodes on the x-axis. Every score lies between 0 and 1, because the NetworkX implementation normalizes the scores so that they sum to 1. The plot shows a long-tail distribution: only a very small portion of nodes have relatively high scores compared with the majority. Most nodes in the network have PageRank scores close to 0, implying that they have negligible impact on the transaction network compared with the few top accounts.
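One simple way to quantify such a long tail is to measure how much of the total score the top k nodes hold. The sketch below uses made-up scores purely for illustration; the helper `top_share` is hypothetical and the numbers are not results from this study.

```python
def top_share(scores, k):
    """Fraction of the total score held by the k highest-scoring nodes."""
    ordered = sorted(scores.values(), reverse=True)
    total = sum(ordered)
    return sum(ordered[:k]) / total if total else 0.0

# Hypothetical scores, not taken from the study.
scores = {"hub1": 0.40, "hub2": 0.30, "a": 0.10, "b": 0.10, "c": 0.05, "d": 0.05}
share = top_share(scores, 2)
```

A share close to 1 for a small k indicates that a handful of accounts dominate the ranking.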

Figure 1 — PageRank Score Distribution of all Nodes

Zooming in on the score distribution of the top accounts, we see below more clearly that only the top 50 show meaningful values; the others are too small to be considered big players. We therefore restrict further exploration to the top 50.

Figure 2 — PageRank Score Distribution of Top 50 Accounts

Furthermore, we labelled the top 50 accounts with data gathered from Etherscan nametags, comments, and various forums in which the addresses have been mentioned. It turns out that the labelled accounts fall into two types: mining pool wallets and (mostly centralized) exchange wallets. The full table is attached in the Appendix. The pie chart in Figure 3 shows that 58% of the addresses are controlled by exchanges, while another 12%, with convincing tags, are related to mining pools.

Figure 3 — Category Distribution of Top-50 Accounts

The exchange and mining pool wallets can be found in the top positions of our ranking, which underlines the effectiveness of PageRank: addresses related to mining pools send large numbers of payouts to their subscribed miners, resulting in large out-degrees in our graph representation as well as a high accumulated transaction value. The main wallets of centralized exchanges distribute (and receive) large volumes of transactions to (and from) their deposit wallets, token contracts, etc. The most popular exchanges and mining pools appear in our ranking. The algorithms successfully detect the most influential accounts across the network, which manifest as Hubs and Authorities, connecting various transactors and carrying heavy flow weights.

The whole network graph is too large to show. Focusing on those known accounts (with labels from Etherscan), we present the network overview of top hubs and authorities with transactions as edges surrounding them: