Introduction

Ethereum users may be anonymous, but their addresses are unique identifiers that leave a publicly visible trail on the blockchain.

I built a clustering algorithm based on transaction activity that divides Ethereum users into distinct behavioral subgroups. It can predict whether an address belongs to an exchange, miner, or ICO wallet.

The database was constructed using SQL, and the model was coded in Python. Source code is available on GitHub.

3D representation of Ethereum address feature space using t-SNE

Background

The Ethereum blockchain is a platform for decentralized applications built from programs called smart contracts. These contracts are often used to represent assets, which can be physical objects in the real world (like real estate titles) or purely digital objects (such as utility tokens).

The computations required to execute smart contracts are paid for in ether, the native currency of the ecosystem.

Ether is stored in cryptographically secured accounts called addresses.

Motivation

Many people believe that cryptocurrencies offer digital anonymity, and there is some truth to that belief. In fact, anonymity is the core mission of Monero and Zcash.

Ethereum, however, is more widely used, and its broad flexibility results in a rich, public dataset of transactional behavior. Because Ethereum addresses are unique identifiers whose ownership does not change, their activity can be tracked, aggregated, and analyzed.

Here, I attempt to create user archetypes by effectively clustering the Ethereum address space. These archetypes could be used to predict the owner of an unknown address.

This opens up a wide array of applications:

understanding network activity

enhancing trading strategies

improving anti-money-laundering (AML) efforts

Results

Participants in the Ethereum ecosystem can be separated by patterns in their transaction activity. Checking the clusters against addresses known to belong to exchanges, miners, and ICOs indicates, qualitatively, that the clustering is accurate.
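One way to make this kind of qualitative check concrete is to cross-tabulate cluster assignments against the addresses whose owners are known. The sketch below is illustrative only: the addresses, cluster IDs, and labels are made-up toy values, and the column names (cluster, known_label) are assumptions rather than names from the project's code.

```python
import pandas as pd

# Toy data only: hypothetical addresses, cluster IDs, and known owner types.
assignments = pd.DataFrame(
    {
        "address": ["0x01", "0x02", "0x03", "0x04", "0x05", "0x06"],
        "cluster": [0, 0, 1, 1, 2, 2],
        "known_label": ["exchange", "exchange", "miner", "miner", "ico", "exchange"],
    }
)

# Rows are clusters, columns are known labels, cells count labelled addresses.
label_counts = pd.crosstab(assignments["cluster"], assignments["known_label"])

# Share of each known label that lands in its single most common cluster.
# Values near 1.0 mean that label is concentrated in one cluster.
concentration = label_counts.max(axis=0) / label_counts.sum(axis=0)

print(label_counts)
print(concentration)
```

If a label's addresses are spread evenly across clusters, its concentration drops toward 1 divided by the number of clusters, which would suggest the clustering does not separate that group.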

Technical Details

Feel free to skip to Interpreting the Results below.

Feature Engineering

The Ethereum transaction dataset is hosted on Google BigQuery. Using the 40,000 addresses with the highest ether balances, I created 25 features to characterize differences in user behavior.

Features derived for each address
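To give a feel for what per-address features look like, here is a minimal pandas sketch that aggregates a toy transaction table into a few behavioral features. It is not the project's SQL: the column names (from_address, to_address, value_eth, block_timestamp) and the five features computed are assumptions chosen for illustration, and they are not the 25 features used in the model.

```python
import pandas as pd

# Toy transaction table; the schema here is an assumption for illustration.
transactions = pd.DataFrame(
    {
        "from_address": ["0xa", "0xa", "0xb", "0xc", "0xa"],
        "to_address": ["0xb", "0xc", "0xa", "0xa", "0xb"],
        "value_eth": [1.0, 2.0, 0.5, 4.0, 3.0],
        "block_timestamp": pd.to_datetime(
            ["2018-01-01", "2018-01-02", "2018-01-03", "2018-01-04", "2018-01-05"]
        ),
    }
)


def address_features(tx: pd.DataFrame) -> pd.DataFrame:
    """Aggregate hypothetical per-address features from a transaction table."""
    outgoing = tx.groupby("from_address").agg(
        out_tx_count=("value_eth", "count"),
        out_value_mean=("value_eth", "mean"),
        first_out=("block_timestamp", "min"),
        last_out=("block_timestamp", "max"),
    )
    incoming = tx.groupby("to_address").agg(
        in_tx_count=("value_eth", "count"),
        in_value_mean=("value_eth", "mean"),
    )
    # Outer join keeps addresses that only send or only receive.
    feats = outgoing.join(incoming, how="outer")

    # Mean seconds between outgoing transactions; undefined (NaN) when an
    # address has fewer than two outgoing transactions.
    span_s = (feats["last_out"] - feats["first_out"]).dt.total_seconds()
    gaps = (feats["out_tx_count"] - 1).where(feats["out_tx_count"] > 1)
    feats["out_interval_mean_s"] = span_s / gaps

    # Addresses missing from one side have zero transactions on that side.
    count_cols = ["out_tx_count", "in_tx_count"]
    feats[count_cols] = feats[count_cols].fillna(0).astype(int)
    return feats.drop(columns=["first_out", "last_out"])


feature_table = address_features(transactions)
print(feature_table)
```

In practice the same aggregations would more likely run inside BigQuery as GROUP BY queries, with only the resulting feature table pulled into Python.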

Choosing the Appropriate Number of Clusters

Using silhouette analysis, I determined the optimal number of clusters to be roughly 8.

This choice minimizes the number of samples with negative silhouette scores, which indicate that a sample may be assigned to the wrong cluster.
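The selection procedure can be sketched with scikit-learn. This is a rough illustration under stated assumptions, not the project's code: the data is synthetic (four planted blobs standing in for the address feature matrix), k-means is assumed as the clustering algorithm, and breaking ties by mean silhouette score is my own addition. The helper name silhouette_summary is hypothetical.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_samples
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the feature matrix: four well-separated groups.
centers = [(-10, -10), (-10, 10), (10, -10), (10, 10)]
X, _ = make_blobs(n_samples=400, centers=centers, cluster_std=0.5, random_state=0)
X = StandardScaler().fit_transform(X)


def silhouette_summary(X, candidate_ks, random_state=0):
    """For each k, count negative silhouette samples and record the mean score."""
    summary = {}
    for k in candidate_ks:
        model = KMeans(n_clusters=k, n_init=10, random_state=random_state)
        labels = model.fit_predict(X)
        scores = silhouette_samples(X, labels)  # one score per sample, in [-1, 1]
        summary[k] = {
            "n_negative": int((scores < 0).sum()),
            "mean": float(scores.mean()),
        }
    return summary


summary = silhouette_summary(X, range(2, 9))

# Prefer the k with the fewest negative-silhouette samples; break ties by the
# higher mean silhouette score (the tie-break is an assumption of this sketch).
best_k = min(summary, key=lambda k: (summary[k]["n_negative"], -summary[k]["mean"]))

for k, stats in summary.items():
    print(k, stats)
print("chosen k:", best_k)
```

On real address features the picture is rarely this clean, which is why the result is best read as "roughly" a number of clusters rather than an exact answer.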