Posted on December 25, 2018 and last updated on December 7, 2019

In which we implement a simple Datalog engine in not many lines of Haskell to understand its semantics.

Datalog is arguably the simplest logic programming language there is. Depending on your background, you can see it as a principled SQL or a Prolog with manners. It has recently become popular among the scalable program analysis crowd. Semantically, it is simple and clean, so it doesn’t take many lines of Haskell to produce an (inefficient) implementation. I’ll follow Richard Feynman’s maxim: “What I cannot create, I do not understand.” Hopefully, this post should give you some feel for logic and relational programming and also demonstrate that λ-calculus is not the only language you can implement with ease in Haskell.

We start by looking at a simple program and implement an evaluator for it. Along with the implementation, we give a lightweight discussion of semantic concerns, in particular about its termination. Later, we try quenching the thirst for efficiency a little by giving some avenues for optimisation and conclude with where to go next for those interested. There is a copy of the full implementation at the end of the post.

I want this post to be accessible to everyone, not only to the programming language people. For that reason, all uses of theory terms are auxiliary. If you feel that’s not the case at any point, give me a shout on Twitter or email me (both available in my homepage) and I’ll fix it.

Acknowledgements

Thanks to Dominic Orchard for pointing out that he did not advise Andrew Rice. Also thanks to Daniel Kröni for providing a less restrictive isRangeRestricted implementation and to Gilbert G for pointing out a broken link.

A crash course in logic programming

The litmus test for any logic programming language is to be able to compute a variation of the following ancestry program.

This program encodes some facts about my academic genealogy encoded in the adviser relation and has some basic rules that define what it means to be an academicAncestor . The facts effectively constitute a database table. Rules allow us to deduce new knowledge. The first rule says that an adviser is an ancestor. The second says the adviser of a known ancestor is also an ancestor. This ability to use recursion to specify relationships in data was one of the original selling points of Datalog.

We want to query the data that is represented explicitly and implicitly. Here are some example queries:

The first one says “is there an academic ancestorial connection between Robin Milner and I and if so who?”. The second one says “is Alan Turing my academic ancestor?” and the third one is the same question but for David Wheeler.

Another appeal of Datalog can be seen in these queries. Datalog is mostly relations (hence the name relational programming). This means there are no inherent inputs and outputs. That’s why we can fill in or leave out either parameter of academicAncestor . You leave a variable in place of the knowledge you want to know and the system fills it for you. This also means you can write a program for a particular input and output relationship in mind and you get the opposite program for free.

Evaluating a Datalog

Datalog is similar to λ-calculus in the sense that there are as many variants of both as there are sands in a beach. The variation we are going to implement today is the simplest one there is. The bits and pieces you see in the previous section are all you get.

We’ll implement the engine the way we evaluate Datalog programs: initially, one rule at a time, then all at once. Since Datalog is deeply rooted in database theory, it is usually implemented as relational algebra concepts such as joins and selections. I have a programming language background, so I’ll use substitutions instead. This will be entirely equivalent, but also lead to simpler code.

Representing Datalog

A rule consists of a head and a body. A head is an atom and so are the comma separated items in the body. An atom then consists of a predicate name like adviser and a list of terms. We have two kinds of terms: variables and symbols. This corresponds to few simple datatype declarations.

The following rule

corresponds to

Facts are just bodiless rules. For example,

corresponds to

You might be thinking but what about the queries, the Rule datatype doesn’t allow headless bodies. That’s true, but the dirty secret about queries is that they get rewritten as rules with the head atom generated from the query. The predicate symbol for each query is fresh and the terms of the head are just the variables in the body. For example,

gets fresh head atom query1(Intermediate) and can be thought of as

An entire Datalog program is nothing but a collection of rules.

We also need a way of representing the known facts about the universe. We can call this our knowledge base. For simplicity, we use a horribly inefficient list of atoms.

Since we’ll use substitutions to evaluate atoms, we can define that here too.

Our substitutions are simpler than those that are used in λ-calculus. A Haskell tuple (x,c) represents the substitution [c/x] meaning we replace c for x . Here, x is always a variable and c is always a constant (symbol). The variable restriction is usual; it is just not reflected in the type. The constant restriction, however, is not. Datalog doesn’t have function symbols and substituting a variable is not necessary for its evaluation. Like the variable restriction, the constant restriction is also not reflected in the type.

Applying a substitution to an atom or a variable (potentially) makes it ground that is it replaces the variable with a constant.

We can also define the only constant that will be used in the evaluator: the empty substitution.

Mind-blowing, I know.

One rule at a time

Let’s focus on evaluating a single rule from our example program.

We know some facts about the world and using this rule, we want to know some more. There are two ways. One is to gather everything we know about adviser and academicAncestor separately and join them together making sure the second parameter ( Y ) of adviser matches the first parameter ( Y ) of academicAncestor . Another way is to look at any one of the atoms and find assignments to its variables and substitute them in the other atom, only then we look for assignments to the remaining variables of this second atom. This approach would work for the example above as follows: find possible assignments to X and Y through adviser , substitute the values of Y in academicAncestor , and finally look for values of academicAncestor in the knowledge base that agree on the newly substituted value of Y to obtain values for Z .

In the context of databases, the first is called join-before-select and the second, not so surprisingly, select-before-join. We’ll implement the latter because it is simpler.

Since we need unification to obtain substitutions from facts & body atoms and substitution to ground further atoms. We start by implementing those.

Substitution is pretty much what you would expect looking up what to substitute for variables and do it whenever it can find a binding otherwise leaving the variable alone.

Unification is also simple, in fact, too simple. For one thing, we cheat and unify whenever we have two atoms that have the same predicate symbol. In Datalog, a predicate is determined by its predicate symbol and arity (the number of terms). Here, we assume each predicate symbol determines the arity. More importantly, we throw an error when a term from the second atom is a variable. The reason is unification only occurs between a body atom and a fact. Since facts cannot have variables, this is safe. This is consistent with our earlier assumptions about the form of substitutions.

We take special care of unification of atoms with repeated variables. It requires failing in case of a contradictory variable assignment. Consider unifying p(X,X) with p("a","b") .

Next we evaluate an atom into a list of substitutions by finding facts that fit the template provided by a body atom and capturing the assignments to its variables. Those assignments to variables are just the substitutions we are looking for.

Here we do exactly what is described above but build the substitutions by accumulation. This means we have substitutions that we use on the body atom to ground its variables, but then unifying this more down to earth atom (it’s more ground, get it?) with the facts we know gives us extensions to the substitution we started with. Thus, we extend and accumulate substitutions.

Now all we need to do is to walk the body and accumulate substitutions, then using these substitutions we deduce new facts based on the head of the rule.

Here’s something to think about. We said the facts we deduce will be ground, but evalRule does not check if the substitution has a binding for all the variables that appear in the head. Does that mean we are potentially concluding non-ground facts? Don’t worry if you’re not sure, we’ll come back to it below.

All at once

All we need now is to evaluate all of the rules together. We’ll do one of the simplest possible things. We’ll evaluate each rule independently and then concatenate the newly derived facts together and bundle them with what we already know. Then repeat the process until we can’t produce new facts any more.

The immediateConsequence function does the bundling step and solve computes the fixpoint from an empty set of facts.

The only thing left unexplained is the isRangeRestricted predicate which ensures the program is well-formed.

Semantic considerations

Now that we have something that almost compiles we can talk about semantics. Starting with the last missing piece: the range restriction predicate followed by termination.

Range restriction and domain independence

In a rule, if every variable in the head appears somewhere in the body, we call the rule range-restricted. We check if every rule is range-restricted before evaluating them in solve . This is why we didn’t have to check if there were any variables left in the head after substitution in evalRule . If we can get to the stage of substituting into head, we are guaranteed to find values for each head variable.

There still remains the question of why we need range restriction at all? After all, the ability to deduce generic facts seems useful. For example, stating p(X,X) as a fact is a compact way of saying p is reflexive.

The problem is something called domain independence. It basically means if the set of values a variable can take changes from one database to another (but not the instance itself), queries still compute the same thing. In other words, it prevents your program’s result to change under your feet if the values in your database are the same. Since checking domain independence in general is undecidable, range restriction is a safe syntactic approximation.

In this implementation what this means is if I change the definition of datatype Term so that the symbols do not use String but instead a type for strings up to length 10, if my initial ground facts were all strings of length up to 10, then the results to all queries remain the same. We can’t guarantee the same thing without domain independence.

Another problem is that when a query is posed with free variables, we expect values to be filled for that variable. If we can deduce a generic fact such as p(X,X) , then asking for the values of X would enumerate the set of infinite strings.

Bear in mind, there are Datalog variants that lift this restriction. In fact, it wouldn’t be too much work to implement it here. We just need a notion of α-equivalence, a store that can handle non-ground values, and a unification algorithm that handles variables.

Every good thing must come to an end

Does this procedure terminate? Yes, it does. When? So long as you don’t have infinite programs. It would be a cheeky thing to do though. One of the appeals of Datalog is that it is not Turing-complete and every query to the system terminates.

The termination argument is pretty simple. Immediate consequence function is monotone that is the facts it produces encompasses the facts it starts with. Further, the number of facts that can be produced is bounded if we start with a finite database. Our initial set of facts used to kick-start solve is empty. Hence, as long as our program is finite, we have finite number of facts. As an upper bound, if we have \(N\) constants throughout the program, then for each relation of arity \(k\), we can have at most \(N^k\) facts. So the number of facts we can produce is also finite.

One place I’m being a bit loose is the monotone bit. What I said makes sense with sets, but we’re working here with lists. Although [1,2,3] and [2,1,3] encompass each other, they don’t compare equal using == . The reason this implementation never gets in a cycle is because nub function called on the amalgamation of facts in immediateConsequence preserves the first occurrence of each element in the list. Since we prepend the already known facts before calling nub , == behaves as if it is set equality.

If you want to sound smart explaining all of this and reduce the size of your audience to a group of people who would understand this without you mentioning anyway, you can just say the following. We have a non-empty finite lattice which means it is complete and Knaster-Tarski theorem (which is a very cool theorem ❤️) applied to the immediate consequence monotone function over this complete lattice implies that it has a least fixpoint.

Quality assurance

Time for the litmus test. We can now translate the ancestry program into Haskell.

We can make querying a bit more pleasant with a function that returns possible bindings.

All it does it to run the program, find the named query and return the bindings to its arguments. Let’s execute all three queries.

> query "query1" ancestor [[(Intermediate,"Dominic Orchard")],[(Intermediate,"Alan Mycroft")]] > query "query2" ancestor [] > query "query3" ancestor [[]]

The empty list of variable bindings, [] , mean there are no possible assignments that satisfy the query, whereas the singleton empty binding, [[]] , means the query is satisfied. It just didn’t have any variables that need to be bound to values.

All three queries return the expected results. Dominic Orchard and Alan Mycroft are between me and Robin Milner. I am not an academic descendant of Alan Turing, but I am of David Wheeler.

Since these three queries run fine, we can safely conclude that the evaluator here is bug-free.

Addressing the turtle in the room

As I mentioned before, this evaluator is inefficient. Discussing the reasons why and what we can do to make it better is not only a software engineering exercise, but also another way of highlighting why Datalog is a good language. The optimisations I discuss here are quite major, but they are not a complete list. There are many other optimisations that draw both from programming languages and database theory literatures.

Optimise for the query

Basically, all I told you about Datalog semantics is incomplete. You’ll struggle to find a resource that discusses Datalog evaluation the way I do because the semantics are always defined with respect to a query. So what we really need is a program query pair. What we compute here is much stronger and is also unnecessary. For example, if the queries were interested only in the adviser relation, computing academicAncestors would be a waste of time.

One way of dealing with this is to use a top-down evaluator which starts from the query and uses resolution which is a proof technique. This allows only the relevant facts to be derived. Most Prolog interpreters use a similar top-down evaluation strategy. It is not preferred in Datalog for various reasons. One of which is that the naïve implementation of it brings non-termination.

We can achieve the same in a bottom-up evaluator (like ours) using magic set transformation. The very long story short, based on the input program, it generates magic rules based on the dataflow of the program and inserts atoms defined by these rules into the original rule bodies to restrict what can be derived. Its effectiveness varies from program to program and there are alternatives to it. If you want to learn more about it, you can look at On the Power of Magic (you must admit, it is a catchy title) by Beeri and Ramakrishnan.

Semi-naïve evaluation

While discussing range restriction, we mentioned that this implementation rederives all the known facts in each iteration. This is awful and is why it is called the naïve evaluation. There is also a modestly named semi-naïve evaluation. It exploits a simple observation. In order to derive a new fact from a rule, at least one new fact of the previous iteration needs to be used.

The way this gets implemented is that we maintain a delta of facts as well as an accumulator and rewrite the rules to make versions of them that use the deltas.

This is particularly effective for the so called linear rules such as the recursive academicAncestor rule which only has one derived predicate in its body. In that case, the semi-naïve evaluation turns a quadratic computation into a linear one.

Incremental evaluation

Datalog is amenable to incremental evaluation. Even in this version of the evaluator, you can see that when a fixpoint is reached, we can enrich the resulting knowledge base with additional facts and apply the fixpoint algorithm again. This performs at least as good as starting from scratch and often much much better.

The situation gets more complicated when there is negation in the language because additional facts invalidate facts that depend on the negation of those facts. This is a particular instance of non-monotonic reasoning. However, Datalog, overall, is still a good language if you’re after incremental evaluation.

Furthermore, the idea of incremental evaluation is intricately connected with maintaining a delta of facts. If your evaluator uses semi-naïve evaluation already, having an incremental evaluator is not much extra effort.

Data structures

Using lists to compute over sets is a bad idea. Depending on the application, hash tables, proper databases, in memory key-value stores, or at the very least of Set and Map modules from the containers package will certainly make things better. Use of sets in general is a good idea because then we don’t have to rely on nub function’s internals for termination as discussed earlier.

Dependency graphs

We evaluate all rules in one pot but that is also very inefficient. We can partition the program by determining dependencies between predicates. This allows us to treat evaluated dependencies as static knowledge, so we have to deal with a smaller collection of changing facts at any one time.

For example in the ancestor program, we would have a graph with adviser and academicAncestor nodes where the former points to the latter and the latter loops around. Meaning we can compute the fixpoint for adviser rules first, then forget about those rules and compute the fixpoint of academicAncestor .

Also if your Datalog variant has stratified negation which is a popular way of incorporating negated atoms, you already have to do the dependency analysis and partition your program. So engineering-wise, this optimisation comes for free.

Parallelisation & distributed computation

Datalog evaluation is great for data-parallel computation. Even by inspecting our implementation, it is evident that evaluating all the rules in a given iteration is an embarrassingly parallel problem. The dependency graph can be used to further parallelise the evaluation using work queues. Similarly, the workload can be separated over multiple machines, a bit like MapReduce.

Closing remarks

Short program, long prose is the theme. I hope I managed to give some ideas about how Datalog works. I tried to squeeze in as much semantics and tips for more efficient implementation as possible which is probably more useful than a short program.

If you want to look at Datalog further. Soufflé is a modern variant geared towards program analysis and is blazingly fast.

If you are interested in a more formal treatment of Datalog as well as some of the optimisations I mentioned chapters 12 to 15 of Foundations of Databases by Abiteboul, Hull, and Vianu are probably the best resources which collect everything together in such detail and with modern exposition.

Full program

Here’s the full program for your convenience.