The real data size for the graph was slightly smaller than expected at the beginning, though still challenging. The whole graph consisted of 830 million vertices and almost 7 billion relations. The final subsets for full-text searching varied in size from a few thousand to a few million vertices.

Existing app

The existing application stored all data, including 750 million profiles, 80 million companies and 7 billion relations, in Amazon Redshift. Generating a subset took a few days from the end user's perspective. Technically, the whole process was handled manually with a set of batch scripts, and required pulling subsets from Redshift into JSON files, pushing them to MongoDB and then on to Elasticsearch. We decided to propose something much more reliable and better aligned with the real product requirements.

Proof of Concept

To demonstrate what can be achieved with Scala and to validate our concept, we decided to build a PoC implementation on fully anonymized data.

The goal

The primary objective was dead simple: handle a billion-scale social network automatically. We specified the Definition of Done for the PoC as a working application exposing REST endpoints that allow us to run full-text searches on the subset, which has to be generated on the fly for an arbitrarily chosen customer. We also decided to impose additional requirements on the solution:

The first results from the subset must be available in under 1 minute

The entire subset has to be ready for the end user within a few minutes

Architecture concept

We decided to build a simple application that would store data in two database engines: a graph database holding all vertices and relations, without any properties, and a document database with full-text search holding all the profile data. On demand, we would run a graph traversal in the graph database and fetch vertex ids. The response to the graph query would be streamed to the API application and used to tag the existing documents in the search engine with a unique token. Finally, the token can be used in search queries to select only the profiles belonging to the subset.
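The tagging step can be sketched directly in Elasticsearch's query DSL. This is only an illustration of the idea, not our production code: the index name `profiles`, the fields `profile_id`, `subset_tokens` and `full_name`, and the token value are all hypothetical.

```json
POST /profiles/_update_by_query
{
  "query": { "terms": { "profile_id": [101, 102, 103] } },
  "script": {
    "source": "ctx._source.subset_tokens.add(params.token)",
    "params": { "token": "customer-42-run-7" }
  }
}
```

A subsequent search then filters on that token, so full-text queries only ever see documents belonging to the freshly generated subset:

```json
POST /profiles/_search
{
  "query": {
    "bool": {
      "must":   { "match": { "full_name": "smith" } },
      "filter": { "term": { "subset_tokens": "customer-42-run-7" } }
    }
  }
}
```

Because the token lands in a filter clause, Elasticsearch can cache it, and the tagging writes can be batched as the vertex ids stream in from the graph query.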

Weapon of choice

We decided to use Play Framework with Scala to build the application. We already had an idea of which document database would suit us best: having significant experience with Elasticsearch, the choice was quite obvious. Elasticsearch scales horizontally very well and allows an extremely resilient cluster configuration. Additionally, its documentation is clear and exhaustive.

The graph database engine, on the other hand, was not an easy choice, and we had to run a few experiments. The engines we considered were Titan, OrientDB and Neo4j. We pulled a fully anonymized graph from the source Redshift database into CSV files, cleaned them up and started the graph database evaluation.

Titan, which is built on top of Cassandra and Elasticsearch, performed quite well. Unfortunately, development of the project seems to have been put on hold. OrientDB was not very performant during the initial load; additionally, it refused to work after 200-300 million vertices and relations - we could read data, but write requests were not processed and no errors were reported. Only a database restart could temporarily resolve the issue. We also dug through a plethora of blog posts describing various horror stories with OrientDB. The last man standing, and thus the implicit winner, was Neo4j.
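With Neo4j as the graph store, the on-demand traversal that feeds the tagging step can be expressed as a single Cypher query. The schema below is purely illustrative - the `Profile` label, `KNOWS` relationship type, `id` property, traversal depth and `$seedIds` parameter are assumptions, not the actual data model.

```cypher
// Hypothetical schema: (:Profile) vertices connected by [:KNOWS] relations.
// Fetch the ids of all profiles within 3 hops of a customer's seed profiles.
MATCH (seed:Profile)-[:KNOWS*1..3]-(p:Profile)
WHERE seed.id IN $seedIds
RETURN DISTINCT p.id AS profileId
```

The resulting id rows are exactly what gets streamed to the API application and turned into batched tagging requests against the search engine.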