The Importance of Datasets

This is where the data requirements for supervised learning on the blockchain become an important problem (read: opportunity!). In order to train and calibrate a supervised learning model, there must be some large initial set of data for which the value of the labels or responses is known. Training calibrates the model so that the predicted and actual responses are as close as possible. This means that when a new observation comes in where the response is unknown, the prediction will be close to the true value, assuming the new observation is generated by a process similar to the one that generated the original dataset. Once the training phase is complete and the model is calibrated, it can be applied to new observations where the response is unknown.
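
To make that workflow concrete, here is a minimal sketch of the train-then-predict loop using scikit-learn. The file names and column names ("labeled_accounts.csv", "label", and so on) are hypothetical placeholders, not actual Alethio data.

```python
# Minimal sketch of the supervised-learning workflow described above.
# File and column names ("labeled_accounts.csv", "label") are hypothetical
# placeholders, not actual Alethio datasets.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# 1. Start from a dataset where the label (response) is already known.
labeled = pd.read_csv("labeled_accounts.csv")
X = labeled.drop(columns=["label"])
y = labeled["label"]

# 2. Train ("calibrate") the model so predicted and actual responses agree.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
model = RandomForestClassifier(n_estimators=200, random_state=0)
model.fit(X_train, y_train)
print("held-out accuracy:", accuracy_score(y_test, model.predict(X_test)))

# 3. Apply the calibrated model to new observations whose response is unknown.
new_accounts = pd.read_csv("unlabeled_accounts.csv")
predicted_labels = model.predict(new_accounts)
```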

In the case of price prediction, this means having a large database of historical prices. In the case of account classification, this means having an initial set of accounts that are already labeled as, say, a decentralized exchange, a DOS-related account, or a Ponzi scheme.

In these classification examples, the labels in the dataset used for training are often only available through significant effort. Possibilities include pulling data from websites like CoinMarketCap or Etherscan, building ETLs to import interesting data from other blockchain businesses, or relying on the painstaking effort of trained research assistants who gather data about on-chain accounts by surfing the web and analyzing source code.

The realization of the importance of gathering external data about accounts (metadata) for the purposes of machine learning was the motivation for creating a new spoke at ConsenSys called Rakr. Through collaboration with Alethio and other spokes and services within the mesh, Rakr hopes to provide a platform for gathering and sharing this valuable metadata. While the implications of integrating blockchain metadata with raw on-chain data go far beyond machine learning, the applicability of this metadata for supervised machine learning will continue to be a primary use case for the Rakr platform. By combining Alethio’s powerful analytics platform with the valuable metadata provided by Rakr, the applications of data science at ConsenSys will be limited only by the imagination.

In Practice

The first example of a supervised learning model produced at ConsenSys was the Ponzi model developed by Alethio, which will be described in more detail in the sequel to this article. The development of this model lays the groundwork for many future analytics possibilities for Alethio, which hopes to expand it into a more general fraud model in the near term.

More generally, the feature extraction pipelines built during this model development effort can be reused to classify any account according to one of the labels in the Rakr database, including whether an account or contract is an exchange, an art DAO, a casino, a DOS-related account, and much more. As the set of interesting metadata provided by Rakr continues to grow, new models will become possible. And as the analytics capabilities of Alethio grow and more useful features are created, these models will become more powerful and versatile.
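
As a rough sketch of what that reuse could look like, the snippet below feeds one shared (here, stubbed-out) feature pipeline into a multi-class classifier keyed to Rakr-style tags. The extract_features function, the "rakr_labels.csv" file, and the tag values are illustrative assumptions, not real Alethio or Rakr interfaces.

```python
# Sketch: reusing one feature-extraction pipeline across many label types.
# `extract_features`, "rakr_labels.csv", and the tag values are illustrative
# stand-ins for the pipelines and metadata described in the text.
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier

def extract_features(addresses):
    """Stub for the shared on-chain feature pipeline (transaction counts,
    value flows, bytecode statistics, ...). Returns dummy features here so
    the sketch stays self-contained."""
    return pd.DataFrame({"tx_count": 0, "total_value_eth": 0.0}, index=list(addresses))

# Rakr-style metadata: one row per address with a human-assigned tag,
# e.g. "exchange", "art_dao", "casino", "dos_related".
labels = pd.read_csv("rakr_labels.csv")          # columns: address, tag
X = extract_features(labels["address"])
y = labels["tag"]

# One multi-class model covers every tag in the metadata; as new tags are
# added to the database, the same pipeline simply gains new classes.
clf = GradientBoostingClassifier().fit(X, y)
predicted_tags = clf.predict(extract_features(["0xabc...", "0xdef..."]))  # placeholder addresses
```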

Knowing whether a given account is a fraud or related to a DOS attack is crucial for managing financial and network risk on the Ethereum network. If we want to productionize models that provide actionable insights about new accounts and very recent behavioral data, they must satisfy special requirements. For example, we must make sure that they are updated in real time, and that the features used for classification and prediction are reliable and complete at the time the model is run. This means that certain features that can be used to classify “old” accounts, such as “whether a contract eventually self-destructed,” cannot be applied to accounts in real time: since the value of such a feature may change in the future, its true value is not really known at the time the model is run.
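
One simple way to enforce that constraint is to keep an explicit list of features that are fully observable at scoring time and drop everything else when a model runs live. The feature names in this sketch are illustrative, not Alethio's actual feature set.

```python
# Sketch: excluding features whose true value is not yet known at scoring time.
# Feature names are illustrative, not Alethio's actual feature set.

# Computable from behaviour observed so far -- safe for real-time scoring.
REALTIME_SAFE = {
    "tx_count_so_far",
    "mean_incoming_value",
    "unique_counterparties",
}

# Only final once an account's history is complete, e.g. "did this contract
# eventually self-destruct?" -- usable for historical models only.
HISTORICAL_ONLY = {
    "eventually_self_destructed",
    "lifetime_total_volume",
}

def select_features(feature_row: dict, realtime: bool) -> dict:
    """Drop features that cannot be trusted when scoring a live account."""
    allowed = REALTIME_SAFE if realtime else REALTIME_SAFE | HISTORICAL_ONLY
    return {name: value for name, value in feature_row.items() if name in allowed}
```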

Real-time machine learning models present unique challenges and opportunities that go beyond those of historical modeling techniques. With that said, the ability to classify accounts as frauds goes beyond real-time risk management; classification models can still be valuable even if they are applied “in the past”. Being able to accurately classify historical frauds is useful for research purposes, even if those accounts are no longer active. More generally, attaching tags to accounts on the blockchain allows users to define semantically interesting subsets of accounts on the blockchain (such as “ICOs” or “exchanges”), rendering the blockchain searchable based on criteria that humans care about.

Creating a database of empirical human knowledge about on-chain entities is already a valuable and challenging task, and a necessary foundation for many other products and services. But with over 30,000,000 Ethereum accounts and contracts to date and roughly 100,000 new accounts created every day, it is simply impossible for humans to tag the entire history of Ethereum accounts, most of which have no useful information (such as contract source, a website, or any other identifying information) that humans could use to classify or tag them. This is why machine learning models are crucial: they scale far beyond what manual tagging ever could, and they can classify accounts using only the raw data characterizing their on-chain behavior.
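
A rough sketch of how that scaling could look in practice: score accounts in batches straight from their raw on-chain features, with no human in the loop. The iter_account_batches function and the trained model here are hypothetical placeholders.

```python
# Sketch: tagging accounts at a scale no human team could match, using only
# raw on-chain features. `iter_account_batches` and `model` are hypothetical.

def iter_account_batches(batch_size: int = 10_000):
    """Placeholder: yield DataFrames of raw on-chain features (plus an
    'address' column), one batch of accounts at a time, e.g. read from the
    analytics store."""
    yield from ()  # no real data source in this sketch

def tag_all_accounts(model) -> dict:
    """Predict a tag for every account the feature store knows about."""
    tags = {}
    for batch in iter_account_batches():
        predictions = model.predict(batch.drop(columns=["address"]))
        tags.update(zip(batch["address"], predictions))
    return tags
```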

By augmenting human knowledge about the blockchain with powerful analytics and machine learning, we envision a blockchain where every account and entity is enriched with useful classifications and properties, whether empirical and created by humans, or predicted and created by statistical models. This will be a major step forward for the transparency and accessibility of knowledge on the blockchain, both of which are essential if blockchain technology is to flourish.

Keep an eye out for the next article by Paul Lintilhac, which will give an exposition of one of Alethio’s recent data science initiatives: the Ponzi Model.