Hypergraph data model

Hypergraphs generalise the common notion of graphs by relaxing the definition of edges. An edge in a graph is simply a pair of vertices. Instead, a hyperedge in a hypergraph is a set of vertices. Such sets of vertices can be further structured, following some additional restrictions involved in different possible definitions of hypergraphs.

The hypergraph data model (HDM) that we have developed and proposed as the formal foundation of Grakn, is based on a specific notion of hypergraphs, the structure of which can be derived from three basic premises:

a hypergraph consists of a non-empty set of vertices and a set of hyperedges; a hyperedge is a finite set of vertices (distinguishable by specific roles they play in that hyperedge); a hyperedge is also a vertex itself and can be connected by other hyperedges.

For instance, the figure below depicts one such hypergraph consisting of four vertices (bob, alice, m-1, df-1), two of which happen to be also hyperedges:

m-1, describing a binary marriage relationship between bob and alice playing the roles of husband and wife, respectively;

df-1, describing a ternary divorce filing relationship involving three role-players in the roles of certified marriage, petitioner and respondent.

In other words, a hyperedge can be simply seen as a collection of role-role-player pairs of arbitrary cardinality. The role-players in such hyperedges can be simple vertices, such as bob or alice, but also other hyperedges, as is the case with m-1 playing the role of certified marriage in df-1. In terms of data modelling, hypergraphs of this structure offer a very natural and straightforward data representation formalism, closely aligned with popular conceptual modelling frameworks such as entity-relationship diagrams. The basic modelling proviso in HDM is this: entities are simple vertices, while relationships of any arity are modelled uniformly as hyperedges.

While several alternative generalisations of hypergraphs are known and have been studied, the definition adopted here has been recognised and received particular attention in the context of knowledge representation [1], general AI [2], and graph- and object-oriented databases [3,4]. The appeal of this formulation stems predominantly from the way it naturally generalises over a number of information modelling structures ubiquitous in data engineering and AI, including: first-order relations, graphs, nesting of information patterns, meta-modelling, and others. This characteristic turns out to be invaluable also in the context of representing and reasoning over knowledge graphs.

Hypergraphs vs. relations vs. directed graphs

What then is the core value of using hypergraphs over other prominent data representation structures, such as the relational model (underlying SQL databases) or variants of the directed graph model (underlying graph databases and RDF triple stores)? Interestingly, HDM fits naturally within this landscape. While being essentially equally expressive as those, it endorses a somewhat differently balanced data modelling approach, and thus offers a largely missing link between these existing alternatives. In the following paragraphs, we explain these key differentiating aspects and the benefits of modelling data using HDM.

Relations

Let us consider a typical representation of the same data set as used in the previous example in SQL tables (or formally speaking relations).

The correspondence to the hypergraph above is evident. Simple vertices in the hypergraph (bob, alice) correspond to the records of the “entity” tables (Man, Woman), while the hyperedges (m-1, df-1) directly reflect the contents of each record from the “relationship” tables (Marriage, Divorce_filing). Thus, the structuring of information characteristic and critical to the SQL-style modelling, where relevant pieces of data are grouped together in records, is faithfully preserved in hypergraphs. In general, the hypergraph data model can seamlessly accommodate any relational information.

Yet, the connected nature of this information (i.e., the fact that the presence of foreign keys in the records indicates the existence of actual relationships between entities) is made much more prominent in the hypergraph representation. This is a critical difference not only on the conceptual level, when just exploring the data, but also in a deeper, technical sense, when expressing and executing complex traversal queries, which are well-known bottlenecks of SQL databases.

Furthermore, the strictness of the SQL schema makes it very hard to cater for irregular or incomplete data instances, e.g., an odd instance of polygamous marriage showing up in the data set, or an instance with missing information about one of the spouses. Similarly, it is cumbersome to revise the structuring of data once the SQL database has been defined and populated. As has been often argued by graph data advocates, this shortcoming is naturally alleviated in much more flexible graph-based models, in which aspect hypergraphs are no different. All three hyperedges (husband:bob, wife:alice), (husband:mark) and (husband:jacob, wife:gloria, wife:gertrude) can naturally coexist as instances of the marriage relation in the same hypergraph.

Finally, the native graph-oriented structure of hypergraphs make them also an obvious ground for applying advanced graph computing techniques, such as shortest path finding or network analysis, which we will explore in depth in future posts.

Directed graphs

Is, then, a hypergraph anything more than just a graph? Arguably, the hypergraph depicted in the beginning of this post could be seen as a plain labelled directed graph. That’s actually a very desirable characteristic. Every hypergraph can simply be mapped to the corresponding directed graph. For instance, in the RDF model, we could represent our marriage example as the following RDF graph, where entity and relation types are also explicitly encoded as RDF resources, in the typical RDF style.

Virtually the same mapping could be applied to achieve a direct reduction of hypergraphs to the property graph model. Because of this close relationship to directed graphs, HDM can be naturally implemented over any graph-based data storage, such as increasingly popular graph databases or RDF triple stores. In fact, Grakn is mounted on top of TinkerPop — an open-source interoperability layer that exposes a uniform graph data model (property graph) over any TinkerPop-compliant data management system. This allows us to greatly reduce the cost of developing a robust and mature system and grants a good level of vendor-agnosticity in terms of the choice of the underlying storage platform.

The central differentiator between hypergraphs and directed graphs, however, is the introduction of hyperedges as first-class modelling constructs in HDM. Firstly, hyperedges have a significant impact on the data modelling practice and possibilities, when compared to directed graph models. In principle, it is possible one could arrive immediately at the RDF graph depicted above when modelling our example dataset, thus achieving the same clean and uniform representation without explicitly employing HDM. In a real-life scenario, however, when the complete conceptual model is not fully foreseen at the outset of the process, and the data is injected to the graph gradually over time, the actual outcome may be considerably different. In fact, RDF practitioners should not be surprised to see, as the result of this exercise, the RDF graph below instead.

To start with, the binary marriage relation would typically be represented using a directed edge involving the married_to predicate, following the standard “good practice” recommended by popular RDF tutorials. Later on, however, once the married couple files for divorce, and the graph must account for this new fact, the RDF data modeller faces two problems:

Modelling ternary divorce_filing relationship has to follow a different route than that employed in case of binary relations; instead of asserting a link between two entities, one needs to use a dedicated ontology pattern for n-ary relations, which would typically mimic the hyperedge structure: one resource representing the relation object connected with a set of outgoing links towards respective role-players. There’s no simple way to connect the original binary marriage relationship with other domain elements. In this respect, RDF offers a notorious triple reification mechanism, which allows for introducing a dedicated resource (of type rdf:Statement) representing the entire triple and linked to all its constitutive components via predicates rdf:subject, rdf:predicate and rdf:object. RDF reification has attracted a lot of criticism in the past, essentially by being a crude tool immediately damaging the transparency of the model and its usability on the query level.

The second benefit that hyperedges can bring to the data management table, as compared to the binary directed edges, is the potential improvements to query planning and query optimisation mechanisms. Arguably, the data grouped together in the same structural “containers”, precisely such as hyperedges or SQL records, is also often retrieved in similar groupings by users and applications. By acknowledging the structure of these collections in advance of querying, the information retrieval process can be more optimally planned and executed.

Summary

The hypergraph data model underpinning the knowledge representation system implemented in Grakn, presents a novel alternative in the data modelling space, providing a viable middle-ground between the relational and directed graph-based models — i.e., “the best of both worlds”. The key points supporting this view, which we have touched upon in this post, can be summarised as follows:

Benefits over the relational model

native graph-oriented structure:

—relationships (connections) in data are first class-citizens;

— graph-oriented data modelling frameworks (e.g., ER diagrams) can be more easily applied and linked to the actual data model;

— graph-oriented computation techniques can be efficiently applied (e.g., path queries, network analysis);

the flexibility of data modelling is not impeded by restrictive relational schemas (e.g., multiple values of the same role can be involved in the same hyperedge, without an a priori modelling decision).

Benefits over the directed graph model