We are going to create a multi-client, asynchronous, highly concurrent enterprise blockchain indexer as a microservice architecture on top of a sharded MongoDB infrastructure. We’re going to use NodeJS, MongoDB, Bcoin and modern JavaScript.

To skip to the end, here’s the Github repo.

I’ll demonstrate using Bitcoin, but this can work with Ethereum’s web3 too; any NodeJS adapter to a blockchain can be dropped in place of Bcoin: Dash, Litecoin, Bitcoin Cash, etc.

The majority of Bitcoin clients rely on Leveldb under the hood. Leveldb is really great and has super fast writes. MongoDB offers some fantastic features too, but it is not the most performant DB for the high volume of writes and deletes involved in maintaining the blockchain UTXO set, which makes MongoDB an impractical drop-in replacement for Leveldb in a full node client.

The Problem

A large organization ultimately cannot rely solely on the RPC API provided by a single node. A large organization has its own infrastructure needs and requirements. It makes more sense to keep the full nodes as they are and index the blockchain data into a working set.

A simple architecture relies on a full node. It serves as a boundary to the Blockchain network.

The Boundary Node (BN) acts like a kind of firewall for the data. It’s a full node that you control. It provides full validation of the block data. In practice there should be more than one.

The more connections and nodes on the boundary, the more difficult they are to subvert.

This is exactly the design of the RPC interface. A full node connects to the Bitcoin network and provides an API to some application or infrastructure. The drawback is that as the organization scales it’s left relying on the same underlying Leveldb.

The difficulty in a blockchain like Bitcoin is processing and maintaining the UTXO set: the Unspent Transaction Outputs. Generally, every transaction creates two outputs: one for a specific amount to some receiver, and a second that refunds the change to the sender of the transaction.

When spent, outputs are included as inputs to a new transaction. To validate them, the miner looks up the historic transaction and ensures the UTXO is spendable and the signature is valid.
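As a toy illustration of that create-and-spend lifecycle (this is not Bcoin's data model, just a sketch), the UTXO set can be thought of as a map keyed by "txid:index", where spending deletes an entry and each new transaction adds entries:

```javascript
// Toy model of a UTXO set: keys are "txid:index", values are amounts.
// Illustration only; not Bcoin's actual data structures.
class UtxoSet {
  constructor() {
    this.coins = new Map();
  }
  // A transaction consumes existing outputs and creates new ones.
  apply(tx) {
    for (const input of tx.inputs) {
      this.coins.delete(input.prevout); // spend: remove from the set
    }
    tx.outputs.forEach((value, idx) => {
      this.coins.set(`${tx.id}:${idx}`, value); // create new spendable outputs
    });
  }
  total() {
    let sum = 0;
    for (const v of this.coins.values()) sum += v;
    return sum;
  }
}

const set = new UtxoSet();
// Payment of 3 to a receiver plus 7 change back to the sender.
set.apply({ id: 'tx1', inputs: [], outputs: [3, 7] });
// The change output is spent by a later transaction.
set.apply({ id: 'tx2', inputs: [{ prevout: 'tx1:1' }], outputs: [6] });
console.log(set.total()); // 3 + 6 = 9
```

Validating a spend means looking up the referenced output in this set, which is exactly the lookup our full node already performs for us.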

Our full node is already taking care of the validation. We are going to trust the data that comes from our node. What we need is something fast, scalable and simple to move the data from Leveldb to another datastore.

New Architecture

We’ll borrow from the Map/Reduce architecture and simplify it. As an added bonus, we want this to be as agnostic as possible. Working with new blockchains should be as simple as dropping in a new interface.

We’re going to add the concept of a Trust Node.

Under this architecture, the boundary nodes become multiple different clients, each connected over a peer-to-peer interface and also connected to the Bitcoin network.

The Trust Node ONLY connects to the organization’s Boundary Nodes.

This gives additional redundancy in the event that one client fails. It also helps decentralize client and node development. Multiple competing clients become a benefit. If one Boundary Node suffers a catastrophic failure, the others may not.

To make the Trust Node, we’re going to create a Tokenizer service and add it to a regular full node. It will treat each block as a unit of work and pass the work to a Mapper (“M”) that will process it and persist it to Mongo.

Each mapper is a separate virtual machine. To scale, we spin up new Mappers and connect them to the Tokenizer. Regardless of the size of the blockchain, we can continue to scale up by increasing the number of mappers. Fine-tuning would involve finding the ideal ratio of Trust Nodes to Mappers.

The Mongo Infrastructure will be a sharded replica set.

Tokenizer

'use strict';

const app = require('express')();
const server = require('http').Server(app);
const io = require('socket.io')(server);
const bcoin = require('bcoin');
const FullNode = bcoin.fullnode;

class Tokenizer {
  constructor() {
    if (!(this instanceof Tokenizer)) {
      return new Tokenizer();
    }

    this.idx = 0; // block index
    this.mappers = [];
    this.node = new FullNode({
      network: 'main',
      db: 'leveldb',
      checkpoints: true,
      workers: true,
      logLevel: 'info',
      'coin-cache': 100,
      'max-inbound': 3,
      'max-outbound': 3,
      'http-port': 8332,
      'nodes': 'ip1,ip2,ip3' // Replace with Boundary Node IPs
    });
  }

  async start() {
    await this.node.open();
    await this.node.connect();
    this.node.startSync();

    server.listen(3000, () => {
      console.log('Tokenizer listening on port 3000');
    });

    io.on('connection', async (socket) => {
      console.log(socket.handshake.query.name + ' connected');
      this.mappers.push(socket.id);
      await socket.emit('start');

      socket.on('getBlock', async (data) => {
        this.idx++;
        await this.sendBlock(socket, this.idx);
      });

      socket.on('disconnect', () => {
        // Remove from tracked mappers
        this.mappers = this.mappers.filter((socketId) => {
          return socketId !== socket.id;
        });
      });
    });
  }

  getBlock(height) {
    return this.node.chain.getBlock(height);
  }

  async sendBlock(socket, index) {
    let block = await this.getBlock(index);
    if (block) {
      const view = await this.node.chain.getBlockView(block);
      await socket.emit('block', {
        raw: block.toRaw(),
        view,
        idx: index
      });
    }
  }
}

module.exports = Tokenizer;

We start by instantiating a Bcoin full node: the Trust Node. We provide a list of boundary nodes and limit our incoming and outgoing connections to only those nodes. When the Tokenizer starts, it will start the full node, begin syncing, and listen for socket connections from Mappers.

The Tokenizer will track its current block height as its current index, but this could easily be changed to start at any given height.

The Tokenizer’s only responsibility is to get blocks from the node and pass them to the Mappers.

The Mappers

'use strict';

const app = require('express')();
const server = require('http').Server(app);
const DB = require('bcoin-mongo-api');
const io = require('socket.io')(server);
const tokenClient = require('socket.io-client');
const inputClient = require('socket.io-client');
const bcoin = require('bcoin');
const Block = bcoin.primitives.Block;

class Mapper {
  constructor() {
    if (!(this instanceof Mapper)) {
      return new Mapper();
    }

    this.inputsCache = [];
    this.coinbaseHash = '0000000000000000000000000000000000000000000000000000000000000000';
    this.db = new DB({
      dbhost: '127.0.0.1',
      dbname: 'mapperData'
    });
    this.tokenizer = tokenClient.connect('http://localhost:3000', { query: 'name=mapper' });
    this.inputClient = inputClient.connect('http://localhost:3002', { query: 'name=mapper' });
  }

  async start() {
    await this.db.open();

    this.tokenizer.on('block', async (block) => {
      console.log(block.idx);
      await this.processBlock(block);
      await this.getBlock();
    });

    this.tokenizer.on('start', () => {
      this.getBlock();
    });

    server.listen(3001, () => {
      console.log('Mapper listening on port 3001');
    });
  }

  async processBlock(block) {
    // Chain-specific processing; see below.
  }

  async getBlock() {
    await this.tokenizer.emit('getBlock');
  }

  revHex(hexString) {
    let out = '';
    for (let i = 0; i < hexString.length; i += 2) {
      out = hexString.slice(i, i + 2) + out;
    }
    return out;
  }
}

module.exports = Mapper;
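The revHex helper reverses the byte order of a hex string: Bitcoin stores hashes in little-endian byte order internally but displays txids big-endian, so the Mapper reverses the raw hash before using it as a txid. A quick standalone check of the function as written above:

```javascript
// Reverse a hex string two characters (one byte) at a time.
function revHex(hexString) {
  let out = '';
  for (let i = 0; i < hexString.length; i += 2) {
    out = hexString.slice(i, i + 2) + out;
  }
  return out;
}

console.log(revHex('aabbcc'));           // 'ccbbaa'
console.log(revHex(revHex('deadbeef'))); // round-trips back to 'deadbeef'
```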

The Mapper makes two connections: one to the Tokenizer and another to an input store. The input store is not shown in the architecture diagram; it’s just an in-memory store that tracks each mapper’s safe height.

Since the architecture is highly asynchronous, you cannot guarantee that the inputs for a given tx already exist in the database.

100 Mappers will process 100 blocks concurrently. The Mapper that is farthest ahead may encounter a tx whose UTXO is still waiting to be processed by another Mapper. The input store allows us to safely commit spent UTXOs to the database by only committing UTXOs below the lowest Mapper height.
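The safe height is simply the minimum of all mappers' reported heights; spends at or above it stay buffered. A minimal sketch of that rule (names here are illustrative, not the Input Store's actual API):

```javascript
// Compute the lowest height reported by any mapper. Spends below this
// height are safe to commit because no mapper is still working there.
function safeHeight(reported) {
  const heights = Object.values(reported);
  return heights.length ? Math.min(...heights) : 0;
}

const reported = { mapperA: 120, mapperB: 97, mapperC: 143 };
console.log(safeHeight(reported)); // 97: only spends below block 97 get committed
```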

Below is the processBlock function for a Bcoin Block.

async processBlock(block) {
  if (!block) {
    return;
  }

  let b = Block.fromRaw(block.raw);

  // Block Entry Data
  let chainEntry = {
    height: block.idx,
    hash: b.hash().toString('hex'),
    time: b.time
  };

  try {
    // Save Coins, then Txs, then the block
    b.txs.forEach(async (tx) => {
      let idx = 0;
      let txHash = this.revHex(tx.hash().toString('hex'));

      // Save Outputs
      tx.outputs.forEach(async (output) => {
        await this.db.saveCoins(txHash, idx, output);
        idx++;
      });

      // Cache each input
      tx.inputs.forEach((input) => {
        let json = input.toJSON();
        let isCoinbase = input.prevout.hash === this.coinbaseHash;
        if (!isCoinbase) {
          this.inputClient.emit('input', {
            height: block.idx,
            spentTxId: txHash,
            prevoutHash: json.prevout.hash
          });
        }
      });

      // Save tx
      await this.db.saveBcoinTx(chainEntry, tx);
    });

    // Save Block
    await this.db.saveBcoinBlock(chainEntry, b);
  } catch (e) {
    console.log(e);
  }

  this.inputClient.emit('safeHeight', {
    height: block.idx
  });
}

This part of the code is separated because a planned optimization is to remove this function and provide the Mapper with a configuration object containing a different processBlock function for each blockchain.

E.g. ‘A Bitcoin Mapper’ or ‘A Dash Mapper’ — the config object will contain the appropriate processing function.

For now, this is hard coded and saves UTXOs, Transactions, and then Blocks.
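One way that configuration object might look. This is a sketch; the chain names and the per-chain processBlock bodies are hypothetical, standing in for real chain-specific persistence logic:

```javascript
// Hypothetical per-chain configuration: each entry supplies its own
// processBlock implementation, so the Mapper core stays chain-agnostic.
const chains = {
  bitcoin: {
    processBlock: async (block) => ({ chain: 'bitcoin', idx: block.idx })
  },
  dash: {
    processBlock: async (block) => ({ chain: 'dash', idx: block.idx })
  }
};

// The Mapper would be constructed with one of these entries and simply
// delegate each unit of work to the configured function.
async function runMapper(chainName, block) {
  const config = chains[chainName];
  return config.processBlock(block);
}

runMapper('dash', { idx: 42 }).then((r) => console.log(r.chain)); // 'dash'
```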

The Input Store

'use strict';

const app = require('express')();
const server = require('http').Server(app);
const io = require('socket.io')(server);
const DB = require('bcoin-mongo-api');

class InputStore {
  constructor() {
    if (!(this instanceof InputStore)) {
      return new InputStore();
    }

    this.db = new DB({
      dbhost: '127.0.0.1',
      dbname: 'mapperData'
    });
    this.inputs = [];
    this.safeHeights = {};
    this.lastHeight = 0;
    this.mappers = [];
    this.server = server;
    this.io = io;
  }

  async start() {
    await this.db.open();

    this.server.listen(3002, () => {
      console.log('Input Store listening on port 3002');
    });

    this.io.on('connection', (socket) => {
      this.mappers.push(socket.id);
      this.startListeners(socket);
    });

    this.io.on('disconnect', (socket) => {
      this.mappers = this.mappers.filter((socketId) => {
        return socketId !== socket.id;
      });
    });
  }

  startListeners(socket) {
    console.log(socket.handshake.query.name + ' connected');

    socket.on('input', (data) => {
      if (!this.inputs[data.height]) {
        this.inputs[data.height] = [];
      }
      this.inputs[data.height].push(data);
    });

    socket.on('safeHeight', async (data) => {
      this.safeHeights[socket.id] = data.height;
      await this.checkInputs();
    });
  }

  async checkInputs() {
    let curHeight = this.lastHeight;
    let height = this.getSafeHeight();
    while (curHeight < height) {
      if (this.inputs[curHeight]) {
        this.inputs[curHeight].forEach(async (input) => {
          try {
            await this.db.updateSpentCoins(input.prevoutHash, input.spentTxId);
          } catch (e) {
            console.log(e);
          }
        });
        // Free the memory
        this.inputs[curHeight] = null;
      }
      curHeight++;
    }
    this.lastHeight = curHeight;
  }

  getSafeHeight() {
    let safeHeight = 999999999999;
    this.mappers.forEach((mapper) => {
      if (this.safeHeights[mapper] < safeHeight) {
        safeHeight = this.safeHeights[mapper];
      }
    });
    return safeHeight;
  }
}

module.exports = InputStore;

After each mapper finishes saving a block, it sends that block height to the input store. After receiving the safe height, the store moves its tip up toward it, committing any pending inputs along the way.

To control everything we are going to make a simple app.js:

const express = require('express');
const cluster = require('cluster');
const Tokenizer = require('./lib/tokenizer');
const Mapper = require('./lib/mapper');
const Store = require('./lib/inputStore');

const indexer = {
  tokenizer: Tokenizer,
  mapper: Mapper,
  store: Store
};

init(process.argv[2], process.argv[3]);

function init(type, numCPU) {
  if (cluster.isMaster) {
    console.log(`Master ${process.pid} is running`);
    for (let i = 0; i < numCPU; i++) {
      cluster.fork();
    }
    cluster.on('exit', (worker) => {
      console.log(`Worker ${worker.process.pid} has died`);
    });
  } else {
    const service = new indexer[type]();
    service.start();
  }
}

We can start individual services and assign them CPUs with the following commands in separate terminals:

$ node app.js tokenizer 1

$ node app.js store 1

$ node app.js mapper 4

This starts a tokenizer with one CPU, one input store with one CPU, and four mappers with one CPU each. That runs the full system on a single machine, which is useful for testing.

You’ll easily push one machine to its limit.

The mappers will immediately begin processing blocks. You can spin up more mappers and they will join the rotation seamlessly. One protective optimization: mappers should listen for Ctrl+C (SIGINT) and exit gracefully only after finishing the block they are processing.
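A sketch of that graceful-shutdown pattern (the class and names here are an assumption, not part of the repo): track whether a block is in flight and defer the exit until it completes.

```javascript
// Defer shutdown until the in-flight block has finished saving.
class GracefulWorker {
  constructor(exitFn = (code) => process.exit(code)) {
    this.busy = false;     // is a block currently being processed?
    this.stopping = false; // has a shutdown been requested?
    this.exit = exitFn;
  }

  // Wire this to a signal handler: process.on('SIGINT', () => worker.requestStop())
  requestStop() {
    this.stopping = true;
    if (!this.busy) this.exit(0); // idle: safe to exit immediately
  }

  async run(processBlock) {
    this.busy = true;
    try {
      await processBlock();
    } finally {
      this.busy = false;
      if (this.stopping) this.exit(0); // exit only after the block is saved
    }
  }
}

const worker = new GracefulWorker();
process.on('SIGINT', () => worker.requestStop());
```

This keeps a Ctrl+C from leaving a half-saved block in Mongo; the mapper simply finishes its current unit of work and then exits.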

To scale, add more mappers. To optimize further, split block responsibility between multiple Trust Nodes, but always connect the Mappers to the same input store.

A necessary feature not included here is reorg handling. In a blockchain reorg the most recent blocks change, but their UTXO references can point anywhere in the blockchain. To handle the reorg, you will need to detect the change in the Trust Node, have the mappers rewind by marking the UTXOs spent in those past blocks as unspent again, and then remove the reorged blocks.
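A rough sketch of such a rewind (the db methods here are placeholders for whatever the Mongo API layer provides, not real bcoin-mongo-api calls):

```javascript
// Hypothetical rewind: mark every spend recorded in the orphaned blocks
// as unspent again, then delete the blocks themselves.
async function rewind(db, orphanedBlocks) {
  // Walk backwards from the old tip toward the fork point.
  for (const block of orphanedBlocks.slice().reverse()) {
    for (const tx of block.txs) {
      for (const input of tx.inputs) {
        // The spent UTXO may live anywhere in history; unmark it.
        await db.markUnspent(input.prevoutHash, input.prevoutIndex);
      }
    }
    await db.removeBlock(block.hash);
  }
}
```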

Or… have your devops team orchestrate the sync process: drop the database and re-index back into Mongo. The blockchain is already downloaded, so this will not take days or weeks.

Hundreds or thousands of mappers can work in concert and, when optimized, reduce blockchain processing time to minutes or possibly seconds.

Sharding MongoDB

You can connect directly to a local MongoDB for everything above, or you can set up a sharded infrastructure.

Before getting started you will need:

Nine virtual machines. I recommend Vultr’s $2.50 offerings for testing.

We will build:

3 Config Servers; 2 Query Servers; 4 Shards

Optional: set up /etc/hosts on each server (if you have no DNS; your IPs will be different):

/etc/hosts

104.156.254.137 config1
45.32.215.153   config2
45.32.221.127   config3
45.32.216.7     query1
45.32.213.50    query2
45.32.211.213   shard1
45.32.211.107   shard2
45.32.211.107   shard3
45.32.216.252   shard4

This allows you to ssh root@config1 for example.

Golden Image

Each server is nearly identical. Create one golden image and clone it nine times. On Ubuntu 16.04, ensure the following is installed:



$ sudo apt-key adv --keyserver hkp://keyserver.ubuntu.com:80 --recv 0C49F3730359A14518585931BC711F9BA15703C6
$ echo "deb [ arch=amd64,arm64 ] http://repo.mongodb.org/apt/ubuntu xenial/mongodb-org/3.4 multiverse" | sudo tee /etc/apt/sources.list.d/mongodb-org-3.4.list
$ sudo apt-get update
$ sudo apt-get install mongodb-org -y
$ sudo apt-get install screen -y

Each image will have the same hostname, so rename each clone:

hostname <nameOfHost>

sudo vi /etc/hostname

And change its name.

1. Setup Config Servers

Throughout the guide, use the screen command to persist your terminal session on remote boxes. This guide assumes you do so on each login.

Use Ctrl-A d to detach from the screen session before logging out of the box; that terminal will continue to run while you are disconnected. To reattach to a running session, use screen -list to view active sessions and screen -r <nameOfSession> to attach. As always, man screen or screen --help will give you the manual.

Login to config1

$ screen -t mongo

$ mkdir /mongo-metadata

$ mongod --dbpath /mongo-metadata/ --replSet set0 --configsvr --port 27019

Use Ctrl-A d to detach from your screen session before logging out.

Set up the other two config servers the same way before moving forward.



# Connect to one of the config servers
$ mongo --host localhost --port 27019

# Give this command, replacing with your IPs:
rs.initiate(
  {
    _id: "set0",
    configsvr: true,
    members: [
      { _id: 0, host: "45.76.250.42:27019" },
      { _id: 1, host: "45.32.216.181:27019" },
      { _id: 2, host: "45.32.219.175:27019" }
    ]
  }
)

Expected Results:

{ "ok" : 1 }
set0:OTHER>

If you have a console open to the other config servers, you’ll see them start building the replica set.

Setup the Shards

Shards can be started from the command line without a configuration file. Once the shards are ready and the config servers are running we will connect a query server to the configs and add the shards to the mongos process on the query server.

Note: replica sets are not required here, but they should be used in production.

$ screen -t mongo
$ mongod --shardsvr

# Mongo may complain about a missing data directory; create it if the process crashes out.
# Use --replSet <replSetName> to create a replica set.

# To set up multiple shards on one machine, vary the port and data directory:
$ mongod --shardsvr --port <diffPort> --dbpath <diffDbDir> --replSet <replSet>

Then, for each server:

$ mongo --host <hostname> --port <port>

rs.initiate(
  {
    _id: "set1",
    members: [
      { _id: 0, host: "45.32.211.213:27018" },
      { _id: 1, host: "45.32.223.10:27018" },
      { _id: 2, host: "45.32.211.107:27018" },
      { _id: 3, host: "45.32.216.252:27018" }
    ]
  }
)

Expected Results:

{ "ok" : 1 }

Setup Query Servers

The query server is what you’ll use to connect to the Mongo cluster. You could connect to a shard individually, but you would only have access to that shard and would probably break the replica set, so don’t do that. Connect through your query servers instead; they are lightweight and can run on their own servers, on the shards themselves, on the config servers, or as part of your application.

$ service mongod stop

Add config servers to the Query Server

If you configured /etc/hosts or have DNS:

mongos --configdb set0/config1:27019,config2:27019,config3:27019

Or you can use static IPs, but this is less flexible:

mongos --configdb set0/45.76.250.42:27019,45.32.216.181:27019,45.32.219.175:27019

The configuration command must be identical on each Query Server. If successful, mongos will output a series of connection messages.

Expected Results:

2017-10-25T03:30:20.879+0000 I NETWORK [thread2] waiting for connections on port 27017

Add shards to the query server

sh.addShard( "set1/45.32.211.213:27018")

sh.addShard( "set1/45.32.223.10:27018")

sh.addShard( "set1/45.32.211.107:27018")

sh.addShard( "set1/45.32.216.252:27018")
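Adding shards alone does not distribute data: you still have to enable sharding on the database and pick a shard key for each collection. For example, in the mongos shell (the collection and key names here are illustrative, not the repo’s actual schema; a hashed key with high cardinality, such as the tx hash, spreads writes evenly):

```javascript
// Run in the mongos shell. Names are examples only.
sh.enableSharding("mapperData")

// Hashed shard keys distribute the write-heavy collections across shards.
sh.shardCollection("mapperData.txs", { hash: "hashed" })
sh.shardCollection("mapperData.coins", { txHash: "hashed" })
```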

If you missed it up top, you can find the project at this Github Repo.

There you go. That’s a start to getting blockchain data out of the client and into the beginnings of an enterprise infrastructure.

Thanks for reading.