The Ethereum network generates a whopping amount of data: protocol-level data, application-level data, account interactions, smart contract deployments, token velocities. For an analytics suite to handle this volume effectively, its load-bearing infrastructure must be resilient and scalable. And so Alethio is thrilled to announce our collaboration with SANSA, the major open-source solution for large-scale, distributed, RDF-based querying, reasoning, and machine learning.

Alethio is building an analytics dashboard that strives to provide transparency and an “archaeology” of what’s happening and has happened on the Ethereum node network, in the transaction pool, and on the blockchain itself. Our 6-billion-triple dataset contains blockchain transaction data modeled as RDF according to the structure of the Ethereum ontology and custom extensions to it. EthOn, the Ethereum Ontology, is a formalization of the concepts, entities, and relations of the Ethereum ecosystem, represented in RDF and OWL. It describes all Ethereum terms, including blocks, transactions, contracts, nonces, etc., as well as their relationships. Custom extensions to EthOn add the application layer: the decentralized applications that make Ethereum so versatile and powerful.

Large-scale Processing in Real Time

Alethio is using SANSA as a scalable processing engine for large-scale batch and stream processing tasks, such as querying blockchain data in real time via SPARQL and performing related analytics on a wide range of subjects (asset turnover for sets of accounts, attack pattern detection, token flow, etc.). Besides SPARQL, SANSA is also working to support other graph languages like Gremlin. The SANSA team is interested in running further industrial pilot applications to test the stack’s scalability on large datasets, and looks forward to maturing its code base and gaining experience running the stack on production clusters.

Alethio’s initial goal was to load a 2TB EthOn dataset containing more than 5 billion triples and then run several analytic queries against it. The queries characterize movement between groups of Ethereum accounts (for payment, consumer, and investment token sales) and aggregate their inbound and outbound value flows over the history of the Ethereum blockchain.
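To make that aggregation concrete, here is a minimal sketch in plain Python. It is not SANSA, Spark, or SPARQL code, and it is not the query Alethio actually ran. The triples, the predicate names (`from`, `to`, `value`), the account addresses, and the group assignments are all hypothetical and only loosely echo how a transaction might be described in RDF. The sketch also assumes that transfers inside a single group, or involving an ungrouped account, are ignored.

```python
from collections import defaultdict

# Hypothetical (subject, predicate, object) triples describing three transfers.
# Predicate names are illustrative only and are not checked against EthOn.
TRIPLES = [
    ("tx1", "from", "0xA"), ("tx1", "to", "0xB"), ("tx1", "value", 5),
    ("tx2", "from", "0xB"), ("tx2", "to", "0xC"), ("tx2", "value", 2),
    ("tx3", "from", "0xC"), ("tx3", "to", "0xA"), ("tx3", "value", 1),
]

# Hypothetical mapping from account address to account group.
GROUPS = {"0xA": "payment", "0xB": "consumer", "0xC": "investment"}


def transactions(triples):
    """Collect the triples of each transaction into one dict per subject."""
    by_tx = defaultdict(dict)
    for subject, predicate, obj in triples:
        by_tx[subject][predicate] = obj
    return by_tx


def group_flows(triples, groups):
    """Sum inbound and outbound value per account group."""
    flows = defaultdict(lambda: {"in": 0, "out": 0})
    for tx in transactions(triples).values():
        sender = groups.get(tx["from"])
        receiver = groups.get(tx["to"])
        # Assumption: only movement between two different known groups counts.
        if sender is None or receiver is None or sender == receiver:
            continue
        flows[sender]["out"] += tx["value"]
        flows[receiver]["in"] += tx["value"]
    return {group: dict(totals) for group, totals in flows.items()}


if __name__ == "__main__":
    for group, totals in sorted(group_flows(TRIPLES, GROUPS).items()):
        print(group, totals)
```

At the scale of billions of triples, the same group-and-sum step would be expressed as a distributed query rather than an in-memory loop, which is where an engine like SANSA comes in.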

Figure: Core model of the fork history of the Ethereum blockchain in EthOn

Going Forward: Increasing the Load

Alethio is excited to see that SANSA works and scales well with our data. Now, we want to experiment with more complex queries and tune the Spark parameters to get the best performance for our dataset. After the success of our first few experiments, Alethio’s data science team and the SANSA team are eager to continue collaborating.

Beyond our initial tests, Alethio and SANSA are jointly discussing possibilities for efficient stream processing in SANSA, fine-tuning aggregate queries, and choosing suitable Apache Spark parameters. In the future, we want to work together to optimize data loading (reducing the disk footprint of datasets with compression techniques, which in turn allows more efficient SPARQL evaluation), the handling of streaming data, querying, and real-time analytics. We’re very excited to see where our research will lead us as we continue to develop these tools for processing data at new scales.

To learn more about our solutions, visit aleth.io, and check out ethstats.net for our first foray into Ethereum node network analytics.