Editor’s note: Full disclosure — the author is a developer and software architect at ArangoDB GmbH, which leads the development of the open source multi-model database ArangoDB.

In recent years, the idea of “polyglot persistence” has emerged and become popular — for example, see Martin Fowler’s excellent blog post. Fowler’s basic idea can be interpreted that it is beneficial to use a variety of appropriate data models for different parts of the persistence layer of larger software architectures. According to this, one would, for example, use a relational database to persist structured, tabular data; a document store for unstructured, object-like data; a key/value store for a hash table; and a graph database for highly linked referential data. Traditionally, this means that one has to use multiple databases in the same project, which leads to some operational friction (more complicated deployment, more frequent upgrades) as well as data consistency and duplication issues.

Learn faster. Dig deeper. See farther.

Figure 1: tables, documents, graphs and key/value pairs: different data models. Image courtesy of Max Neunhöffer.

This is the calamity that a multi-model database addresses. You can solve this problem by using a multi-model database that consists of a document store (JSON documents), a key/value store, and a graph database, all in one database engine and with a unifying query language and API that cover all three data models and even allow for mixing them in a single query. Without getting into too much technical detail, these three data models are specially chosen because an architecture like this can successfully compete with more specialised solutions on their own turf, both with respect to query performance and memory usage. The column-oriented data model has, for example, been left out intentionally. Nevertheless, this combination allows you — to a certain extent — to follow the polyglot persistence approach without the need for multiple data stores.

At first glance, the concept of a multi-model database might be a bit hard to swallow, so let me explain this idea briefly. Documents in a document collection usually have a unique primary key that encodes document identity, which makes a document store naturally into a key/value store, in which the keys are strings and the values are JSON documents. In the absence of secondary indexes, the fact that the values are JSON does not really impose a performance penalty and offers a good amount of flexibility. The graph data model can be implemented by storing a JSON document for each vertex and a JSON document for each edge. The edges are kept in special edge collections that ensure that every edge has “from” and “to” attributes that reference the starting and ending vertices of the edge respectively. Having unified the data for the three data models in this way, it only remains to devise and implement a common query language that allows users to express document queries, key/value lookups, “graphy queries,” and arbitrary mixtures of these. By “graphy queries,” I mean queries that involve the particular connectivity features coming from the edges, for example “ShortestPath,” “GraphTraversal,” and “Neighbors.”

Aircraft fleet maintenance: A case study

One area where the flexibility of a multi-model database is extremely well suited is the management of large amounts of hierarchical data, such as in an aircraft fleet. Aircraft fleets consists of several aircraft, and a typical aircraft consists of several million parts, which form subcomponents, larger and smaller components, such that we get a whole hierarchy of “items.” To organise the maintenance of such a fleet, one has to store a multitude of data at different levels of this hierarchy. There are names of parts or components, serial numbers, manufacturer information, maintenance intervals, maintenance dates, information about subcontractors, links to manuals and documentation, contact persons, warranty and service contract information, to name but a few. Every single piece of data is usually attached to a specific item in the above hierarchy.

This data is tracked in order to provide information and answer questions. Questions can include but are not limited to the following examples:

What are all the parts in a given component? Given a (broken) part, what is the smallest component of the aircraft that contains the part and for which there is a maintenance procedure? Which parts of this aircraft need maintenance next week?

A data model for an aircraft fleet

So, how do we model the data about our aircraft fleet if we have a multi-model database at our disposal?

There are probably several possibilities, but one good option here is the following (because it allows us to execute all required queries quickly): there is a JSON document for each item in our hierarchy. Due to the flexibility and recursive nature of JSON, we can store nearly arbitrary information about each item, and since the document store is schemaless, it is no problem that the data about an aircraft is completely different from the data about an engine or a small screw. Furthermore, we store containment as a graph structure. That is, the fleet vertex has an edge to every single aircraft vertex, an aircraft vertex has an edge to every top-level component it consists of, component vertices have edges to the subcomponents they are made of, and so on, until a small component has edges to every single individual part it contains. The graph that is formed in this way is in fact a directed tree:

Figure 2: A tree of items. Image courtesy of Max Neunhöffer.

We can either put all items in a single (vertex) collection or sort them into different ones — for example, grouping aircraft, components, and individual parts respectively. For the graph, this does not matter, but when it comes to defining secondary indexes, multiple collections are probably better. We can ask the database for exactly those secondary indexes we need, such that the particular queries for our application are efficient.

Queries for aircraft fleet maintenance

We now come back to the typical questions we might ask of the data, and discuss which kinds of queries they might require. We will also look at concrete code examples for these queries using the ArangoDB Query Language (AQL).

What are all the parts in a given component?

This involves starting at a particular vertex in the graph and finding all vertices “below” — that is, all vertices that can be reached by following edges in the forward directions. This is a graph traversal, which is a typical graphy query.

Figure 3: Finding all parts in a component. Image courtesy of Max Neunhöffer.

Here is an example of this type of query, which finds all vertices that can be reached from “ components/Engine765 ” by doing a graph traversal:

RETURN GRAPH_TRAVERSAL("FleetGraph", "components/Engine765", "outbound")

In ArangoDB, one can define graphs by giving them a name and by specifying which document collections contain the vertices and which edge collections contain the edges. Documents, regardless of whether they are vertices or edges, are uniquely identified by their _id attribute, which is a string that consists of the collection name, a slash “/” character and then the primary key. The call to GRAPH_TRAVERSAL thus only needs the graph name “ FleetGraph ”, the starting vertex, and “ outbound ” for the direction of the edges to be followed. You can specify further options, but that is not relevant here. AQL directly supports this type of graphy query.

Given a (broken) part, what is the smallest component of the aircraft that contains the part and for which there is a maintenance procedure?

This involves starting at a leaf vertex and searching upward in the tree until a component is found for which there is a maintenance procedure, which can be read off the corresponding JSON document. This is again a typical graphy query since the number of steps to go is not known a priori. This particular case is relatively easy since there is always a unique edge going upward.

Figure 4: Finding the smallest maintainable component. Image courtesy of Max Neunhöffer.

For example, the following is an AQL query that finds the shortest path from “ parts/Screw56744 ” to a vertex whose isMaintainable attribute has the boolean value true, following the edges in the “ inbound ” direction:

RETURN GRAPH_SHORTEST_PATH("FleetGraph", "parts/Screw56744", {isMaintainable: true}, {direction: "inbound", stopAtFirstMatch: true})

Note that here, we specify the graph name, the _id of the start vertex and a pattern for the target vertex. We could have given a concrete _id instead, or could have given further options in addition to the direction of travel in the last argument. We see again that AQL directly supports this type of graphy query.

Which parts of this aircraft need maintenance next week?

This is a query that does not involve the graph structure at all: rather, the result tends to be nearly orthogonal to the graph structure. Nevertheless, the document data model with the right secondary index is a perfect fit for this query.

Figure 5: Query whose result is orthogonal to the graph structure. Image courtesy of Max Neunhöffer.

With a pure graph database, we would be in trouble rather quickly for such a query. That is because we cannot use the graph structure in any sensible way, so we have to rely on secondary indexes — here, for example, on the attribute storing the date of the next maintenance. Obviously, a graph database could implement secondary indexes on its vertex data, but then it would essentially become a multi-model database.

To get our answer, we turn to a document query, which does not consider the graph structure. Here is one that finds the components that are due for maintenance:

FOR c IN components FILTER c.nextMaintenance <= "2015-05-15" RETURN {id: c._id, nextMaintenance: c.nextMaintenance}

What looks like a loop is AQL’s way to describe an iteration over the components collection. The query optimiser recognises the presence of a secondary index for the nextMaintenance attribute such that the execution engine does not have to perform a full collection scan to satisfy the FILTER condition. Note AQL’s way to specify projections by simply forming a new JSON document in the RETURN statement from known data. We see that the very same language supports queries usually found in a document store.

Using multi-model querying

To illustrate the potential of the multi-model approach, I’ll finally present an AQL query that mixes the three data models. The following query starts by finding parts with maintenance due, runs the above shortest path computation for each of them, and then performs a join operation with the contacts collection to add concrete contact information to the result:

FOR p IN parts FILTER p.nextMaintenance <= "2015-05-15" LET path = GRAPH_SHORTEST_PATH("FleetGraph", p._id, {isMaintainable: true}, {direction: "inbound", stopAtFirstMatch: true}) LET pathverts = path[0].vertices LET c = DOCUMENT(pathverts[LENGTH(pathverts)-1]) FOR person IN contacts FILTER person._key == c.contact RETURN {part: p._id, component: c, contact: person}

In AQL, the DOCUMENT function call performs a key/value lookup via the provided _id attribute; this is done for each vertex found as target of the shortest path computation. Finally, we can see AQL’s formulation for a join. The second FOR statement brings the contacts collection into play, and the query optimiser recognises that the FILTER statement can be satisfied best by doing a join, which in turn is very efficient because it can use the primary index of the contacts collection for a fast hash lookup.

This is a prime example for the potential of the multi-model approach. The query needs all three data models: documents with secondary indexes, graphy queries, and a join powered by fast key/value lookup. Imagine the hoops through which we would have to jump if the three data models would not reside in the same database engine, or if it would not be possible to mix them in the same query.

Even more importantly, this case study shows that the three different data models were indeed necessary to achieve good performance for all queries arising from the application. Without a graph database, the queries of a graphy nature with path lengths, which are not a priori known, notoriously lead to nasty, inefficient multiple join operations. However, a pure graph database cannot satisfy our needs for the document queries that we got efficiently by using the right secondary indexes. The efficient key/value lookups complement the picture by allowing interesting join operations that give us further flexibility in the data modeling. For example, in the above situation, we did not have to embed the whole contact information with every single path, simply because we could perform the join operation in the last query.

Lessons learned for data modeling

The case study of aircraft fleet maintenance reveals several important points about data modeling and multi-model databases.