Claudius Weinberger, CEO ArangoDB

TL;DR Native multi-model databases combine different data models like documents or graphs in one tool and even allow to mix them in a single query. How can this concept compete with a pure document store like MongoDB or a graph database like Neo4j? I myself and a lot of folks in the community asked that question.

So here are some benchmark results: 100k reads → competitive; 100k writes → competitive; friends-of-friends → superior; shortest-path → superior; aggregation → superior.

Feel free to comment, join the discussion on HN and contribute – it’s all on Github.

The latest edition of the NoSQL Performance Benchmark (2018) has been released. Please click here

Please note: There are more benchmark results online.

Introduction

To start with, what actually is a “native multi-model database”? From my perspective it’s a document store (JSON documents), a key/value store and a graph database, all in one database engine and with a unifying query language and API that covers all three data models and even allows to mix them in a single query.

In this post I want to demonstrate that a native multi-model database can compete with specialised solutions with respect to performance and memory usage. To support this claim, my team and I have conducted performance tests that compare the same types of queries in different databases.

For these tests, we’ve used a dataset typical for a social network site, with user profiles and a friendship relation. We won’t measure every possible database operation. Rather, we focus on queries that are sensible for a social network: for example, we perform single reads and writes of profiles, we compute an aggregation to get an overview of the age distribution, we ask for friends of friends, or we ask for shortest friendship paths. These queries are run for all tested databases, irrespective of the data model they are using internally. As a result, we have a performance comparison between specialised solutions and a multi-model database.

Surely, we could have investigated a lot of different questions and scenarios, and certainly, other cases could well have seen different winners and losers. In the end, every user has to test herself considering the concrete use case at hand. This is the only way to get valid statements for production use.

I want to demonstrate the competitiveness of multi-model databases and don’t want to blame any particular player. Rather, I want to invite everyone who could improve the test setup for any of the databases involved, to contribute. To directly address an active community member and much appreciated advocate of Neo4j: Michael Hunger – you might probably have some ideas how to improve the results for your database, don’t you?

For the comparison we have selected MongoDB as the current champion amongst the pure document stores that recently switched to the new, superior storage engine WiredTiger, and Neo4j, the leader amongst the graph databases. However, we plan to add more in the near future to get a better overview.

Overview of the tests

Our test data are a snapshot from a social network in Slovakia and provided by the Stanford University SNAP collection. It contains 1 632 803 vertices describing people and 30 622 564 edges describing their friendship relation. We use the vertex data for the document tests and the combined vertex and edge data for the graph tests. We wanted to study the standard client/server setup rather than embedding the database into the application, because this model is used more widely and therefore more relevant in practice. We used JavaScript/node.js as client language and environment, see appendix for a discussion of this choice.

We performed the following tests, all implemented in JavaScript running in node.js:

single read : single document reads of profiles (100 000 documents)

: single document reads of profiles (100 000 documents) single write : single document writes of profiles (100 000 documents)

: single document writes of profiles (100 000 documents) aggregation : aggregation over a single collection (1 632 803 documents) Here, we compute statistics about the age distribution for everyone in the network, simply counting which age occurs how often.

: aggregation over a single collection (1 632 803 documents) Here, we compute statistics about the age distribution for everyone in the network, simply counting which age occurs how often. neighbors : finding direct neighbors plus the neighbors of the neighbors (for 500 vertices)

: finding direct neighbors plus the neighbors of the neighbors (for 500 vertices) shortest path: finding 19 shortest paths (in a highly connected social graph) This answers the question how close to each other two people are in the social network.

We have used a 16 core virtual machine with 60 GB main memory and 256 GB SSD, for more details see the appendix.

Test results

The throughput measurements on the test machine for ArangoDB define the baseline (100%) for the comparisons. Lower percentages point to higher throughput and accordingly, higher percentages indicate lower throughput.

Overall test results:

The tests show that multi-model databases can compete with single model databases. MongoDB is faster at single document reads/writes but couldn’t compete when it comes to aggregations or 2nd neighbors selections. Note: the shortest path query was not tested for MongoDB as it would have to be implemented completely on the client side.

Neo4j’s results in graph queries were a bit unexpected to me. I would have expected more competitive results in a 2nd neighbors search. Here I really would love to see that a Neo4J expert double checks the Cypher query to see if we missed any optimization. Neo4J uses caches on different layers that speed-up queries that run repeatedly – so one might get better results in a production environment with a mixed and somehow repeated workload. As we don’t wanted to benchmark caches we didn’t repeat the tests several times.

For our tests we run the workloads 5 times (from scratch), averaging the results.

UPDATE: Hans-Peter Grahsl has pointed out that underscore union is incredibly slow. We have replaced that in the MongoDB neighbors test with pure Javascript.

UPDATE 2: Hans-Peter Grahsl has rewritten the neighbors tests using the aggregation framework in MongoDB. This is faster than requesting the neighbors using single “find” calls. Thanks a lot for the improvement.

UPDATE 3: Michael Hunger pointed out, that we should replace the index by a constraint. Jacob Hansson suggested to trigger the JIT compilation by executing a few thousand shortest paths computation before the tests. That has been added in the current version.

UPDATE 4: Following suggestions by Aseem Kishore and Michael Hunger, we switch to the “neo4j” 2.0.0-RC1 driver. With this driver we are able to use keep-alive connections by adding a custom HTTP agent with a connection pool of 25 connections. We still restrict the number of single reads and writes to 20.000 (instead of 100.000) as we were experiencing crashes with node. We were using the system node.js version of Ubuntu 14.10. It was suggested to upgrade to io.js. This has not yet been done, because we would need to rerun the tests for all databases.

Appendix – Details about data, machines, software and tests

The data

Pokec is the most popular online social network in Slovakia. We used a snapshot of its data provided by the Stanford University SNAP. It contains profile data from 1 632 803 people. The corresponding friendship graph has 30 622 564 edges. The profile data contain gender, age, hobbies, interest, education etc., but the individual JSON documents are very diverse, because many fields are empty for many people. Profile data are in the Slovak language. Friendships in Pokec are directed. The uncompressed JSON data for the vertices need around 600 MB and the uncompressed JSON data for the edges requires around 1.832 GB. The diameter of the graph (longest shortest path) is 11, but the graph is highly connected, as is normal for a social network. This makes the shortest path problem particularly hard.

The hardware

All benchmarks were done on a virtual machine of type n1-standard-16 in Google Compute Engine with 16 virtual cores (on these, a virtual core is implemented as a single hardware hyper-thread on a 2.3 GHz Intel Xeon E5 v3 (see Haswell)) and altogether 60 GB of RAM. The data was stored on a 256 GB SSD drive, directly attached to the server. The client was an n1-standard-8 (8 vCPU, 30 GB RAM) in the same network.

All databases were installed on the same machine, we have done our best to tune the configuration parameters best, we have for example switched off transparent huge pages and configured up to 40 000 open file descriptors for each process.

The software

We have used

ArangoDB V2.6.0 alpha3 (pre-release) for x86_64

MongoDB V3.0.3 for x86_64, using the WiredTiger storage engine

Neo4j Community Edition V2.2.2 running on JDK 1.7.0_79

We wanted to use a client/server model, thus we needed a language to implement the tests, and we decided that it has to fulfill the following criteria:

Each database in the comparison must have a reasonable driver.

It is not one of the native languages our contenders are implemented in, because this would potentially give an unfair advantage for some. This ruled out C++ and Java.

The language must be reasonably popular and relevant in the market.

The language should be available on all major platforms.

This essentially left JavaScript , PHP , Python and Ruby . We decided to use JavaScript with node.js, because it’s popular and known to be fast, in particular with network workloads.

For each database we used the most up-to-date JavaScript driver that was recommended by the respective database vendor. Here are the versions of the npm-modules:

arangojs in version 3.8.0

in version 3.8.0 mongodb in version 2.0.33, which builds on top of

in version 2.0.33, which builds on top of mongodb-core in version 1.1.32

in version 1.1.32 node-neo4j in version 2.0.0 RC1 (Update 4: We’ve switch from seraph 0.11.3 as suggested)

The tests

We have made sure for each experiment that the database has a chance to load all relevant data into RAM. Some DBs allow explicit load commands for collections, others not. Therefore, we increased cache sizes accordingly where relevant and used full collection scans as a warm-up procedure. For the single document tests, we use individual requests for each document but use keep-alive and allow multiple simultaneous connections, since we wanted to test throughput rather than latency. Whenever the driver allowed to configure this, we chose to use a TCP/IP connection pool of up to 25 connections. Note that the ArangoDB driver does not use HTTP pipelining, whereas the MongoDB driver seems to do a corresponding thing for its binary protocol, which can help to increase throughput. For more detailed information about each individual database see below.

We discuss each of the five tests separately:

single document reads (100 000 documents)

In this test we store 100 000 ids of people in the node.js client and try to fetch the corresponding profiles from the database, each in a separate query. In node.js, everything happens in a single thread but asynchronously. To fully load the database connections we first submit all queries to the driver and then await all the callbacks using the node.js event loop. We measure the wallclock time from just before we start sending queries until the last answer has arrived. Obviously, this measures throughput of the driver/database combination and not latency, therefore we give as a result the complete wallclock time for all requests.

single document writes (100 000 documents)

For this test we proceed similarly: We load 100 000 documents into the node.js client and then measure the wallclock time needed to send all of them to the database, using individual queries. We again first schedule all requests to the driver and then wait for all callbacks using the node.js event loop. As above, this is a throughput measurement.

aggregation over a single collection (1 632 803 documents)

In this test we do an aggregation over all 1 632 803 profile documents and count how often each value of the AGE attribute occurs. We did not put a secondary index for this attribute on any of the databases, so they all have to perform a full collection scan and do a counting statistics. We only measure a single request, since this is enough work to get an accurate measurement. The amount of data scanned should be more than any CPU cache can hold, so we should see real RAM accesses but usually no disk accesses because of the above warm-up procedure.

finding the neighbors of the neighbors (for 500 vertices)

This is the first graph test. For each of altogether 500 vertices we find all neighbors and all neighbors of all neighbors, which achieves finding the friends of the friends of a person. This is a typical graph matching problem considering paths of length 1 or 2. For the non-graph database MongoDB, we have to emulate this with several requests for each vertex.

finding 19 shortest paths (in a highly connected social graph)

This is a pure graph test with a query that is particularly suited for a graph database. We ask the database in 19 different requests to find a shortest path between two given vertices in our social graph. Due to the high connectivity of the graph, such a query is hard, since the neighborhood of a vertex grows exponentially with the radius. Shortest path is notoriously bad in more traditional databases, because the answer involves an a priori unknown number of steps in the graph, usually leading to an a priori unknown number of joins. Originally we picked 20 random pairs of vertices but it turned out that for one of the pairs there is not path in the graph at all. We excluded that one for the measurements because Neo4j, which did altogether quite well at shortest paths, was exceedingly slow to notice that there is no such path. Again, the network communication between client and database was negligible and therefore 19 paths are enough to get an accurate measurement. Note however, that the time for different pairs varies considerably, because it depends on the length of the shortest path as well as sometimes on the order in which edges are traversed.

We finish the description with a few more detailed comments for each individual database:

ArangoDB:

ArangoDB allows to specify the value of the primary key attribute _key, as long as the unique constraint is not violated. It automatically creates a primary hash index on that attribute, as well as an edge index on the _from and _to attributes in the friendship relation (edge collection). No other indexes were used.

MongoDB:

Since MongoDB treats the edges just as documents in another collection, we helped it for the graph queries by creating two more indexes on the _from and _to attributes of the friendship relation. Due to the absence of graph operations we did neighbors of neighbors using 3 requests and did not even try to do shortest paths.

Neo4j:

(Update 4: please read the update information above on keep-alive and driver changes)

In Neo4j the attribute values of the profile documents are stored as properties of the vertices. For a fair comparison, we created an index on the _key attribute. Neo4j claims to use “index-free adjacency” for the edges, so we did not add another index on edges. We have configured the cache sizes so that Neo4j should be able to hold all data in RAM. We were not able to verify whether or not this really happened. Furthermore, we have tried different node.js drivers. In the end, we settled for the most stable one (see above). We have observed that with this driver the time per request was occasionally smaller if we did fewer requests (e.g. 5 000 instead of 100 000). This might have to do with the fact that this driver seems to open a new TCP/IP connection for each request, opening a very large amount of simultaneous connections. However, we did not find a possibility to configure a connection pool or anything similar. In our opinion this is really not a problem of the database. Either we missed the configuration option to enable keep-alive or the driver has not yet implemented this option. In order to avoid distortions caused by the driver, we have reduced the number of tests case and extrapolated the result. Any help how to solve this issue is welcome.

Resources and Contribution

All code used in this test can be downloaded from my Github repository and all the data is published in a public Amazon S3 bucket. The tar file consists of two folders data (database) and import (source files).

Everybody is welcome to contribute by testing other databases and sharing the results.

[]: https://snap.stanford.edu/data/soc-pokec.html