I recently built a REST API prototype where one of the endpoints accepted a string representing a filter to apply to a set of results. For instance, for entities with named properties "Foo" and "Bar", a string like "(Foo = 'some string') or (Bar > 99)" would filter out the results where either Bar is less than or equal to 99, or Foo is not "some string".

This would translate pretty straightforwardly into a SQL query, but as a masochist I was set on using Google Datastore as the backend, which unfortunately has a limited filtering API:

It does not support disjunctions, ie. "OR" clauses. It does not support filtering using inequalities on more than one property. It does not support a not-equal operation.

So in this post, I will describe the design which achieves the following goals:

A backend-agnostic querying API supporting arbitrary clauses, conjunctions ("AND"), and disjunctions ("OR"). Implementations of this backend that are amenable to unit testing the parser and the query logic. A simple implementation of this backend for Google Datastore supporting the full query capabilities.

The Query Language

The query language is supposed to support the following filtering operations:

Clauses on properties should support: =, !=, <, >. Supported data types are: numbers, dates, strings. Clauses can be combined by "and" and "or" operators. Precedence must be specified by parenthesis and compound expressions, ie. "(expression)" and "(expression) and (expression)" are valid, but "expression and expression" is not.

Anyone who follows me probably knows I'm a fan of tagless interpreters, also known as object algebras, so I started by defining the query language in this fashion:

public interface IQueryLanguage<T> { T Id(); T Const(string val); T Const(DateTime date); T Const(double num); T Property(string propertyName); T And(T lhs, T rhs); T Or(T lhs, T rhs); T LessThan(T lhs, T rhs); T GreaterThan(T lhs, T rhs); T Equal(T lhs, T rhs); T NotEqual(T lhs, T rhs); }

You can clearly see the language structure here and how queries can be composed into larger compound expressions. Another good practice is to factor all backend operations into a repository, so this is roughly how it might look:

public interface IRepository<T> { ... IQueryLanguage<T> CreateQuery<TEntity>() where TEntity : class; Task<IEnumerable<TEntity>> Execute<TEntity>(T query, int limit, int page) where TEntity : class; }

So our repository instances will expose a method to create a query for an entity type, and a method used to execute that query with some extra parameters for paging.

The grammar for the language looks roughly like this in pseudo-EBNF:

number := [0-9]+[.]?[0-9]* quoted := "'" datetime "'" | "'" char* "'" e := number | quoted | (e) | (e) "or" (e) | (e) "and" (e)

Now that we have a general interface and a grammar for the query language, we can build a simple and generic recursive descent parser against that interface to generically construct a query:

public static class QueryParser { public static T Parse<T>(this IQueryLanguage<T> provider, string input) { if (string.IsNullOrEmpty(input)) return provider.Id(); int pos = 0; var e = provider.Parse(input, ref pos); if (pos < input.Length) throw new ArgumentException("Malformed query"); return e; } static T Parse<T>(this IQueryLanguage<T> provider, string input, ref int pos) { var last = pos; if (Match(input, "(", ref pos)) { var e = provider.Parse(input, ref pos); return !Match(input, ")", ref pos) ? Fail<T>("expected ')'", pos): pos < input.Length ? provider.Operation(e, input, ref pos): e; } else { return provider.Operation(provider.Terminal(input, ref pos), input, ref pos); } } static T Operation<T>(this IQueryLanguage<T> provider, T lhs, string input, ref int pos) { Skip(input, ref pos); var last = pos; if (Match(input, "=", ref pos)) { return provider.Equal(lhs, provider.Terminal(input, ref pos)); } else if (Match(input, ">", ref pos)) { return provider.GreaterThan(lhs, provider.Terminal(input, ref pos)); } else if (Match(input, "!=", ref pos)) { return provider.NotEqual(lhs, provider.Terminal(input, ref pos)); } else if (Match(input, "<", ref pos)) { return provider.LessThan(lhs, provider.Terminal(input, ref pos)); } else if (Match(input, "and ", ref pos)) { return provider.And(lhs, provider.Parse(input, ref pos)); } else if (Match(input, "or ", ref pos)) { return provider.Or(lhs, provider.Parse(input, ref pos)); } else if (Match(input, ")", ref pos)) { --pos; return lhs; } else { return Fail<T>("Unrecognized token", pos); } } static T Terminal<T>(this IQueryLanguage<T> provider, string input, ref int pos) { Skip(input, ref pos); var last = pos; if (Match(input, "'", ref pos)) { var end = input.IndexOf('\'', pos); if (end < 0) return Fail<T>("Unrecognized value", pos); else if (DateTime.TryParse(input.AsSpan(pos, end - pos), out var date)) { pos = end + 1; return provider.Const(DateTime.SpecifyKind(date, DateTimeKind.Utc)); } else { var x = input.Substring(pos, end - pos); pos = end + 1; return provider.Const(x); } } else if (Match(input, (x, i) => x[i] == '.' || char.IsDigit(x, i), ref pos)) { return double.TryParse(input.AsSpan(last, pos - last), out var i) ? provider.Const(i) : Fail<T>("Expected integer", last); } else if (Match(input, char.IsLetter, ref pos)) { return provider.Property(input.Substring(last, pos - last)); } else { return Fail<T>("Unrecognized token", pos); } } static T Fail<T>(string error, int pos) => throw new ArgumentException($"Parse error @ {pos}: {error}"); static void Skip(string input, ref int pos) { while (pos < input.Length && char.IsWhiteSpace(input[pos])) ++pos; } static bool Match(string input, ReadOnlySpan<char> expect, ref int pos) { if (pos + expect.Length > input.Length || !input.AsSpan(pos, expect.Length).Equals(expect, StringComparison.OrdinalIgnoreCase)) return false; pos += expect.Length; return true; } static bool Match(string input, Func<string, int, bool> expect, ref int pos) { var start = pos; while (pos < input.Length && expect(input, pos)) ++pos; return start < pos; } }

This isn't the most efficient implementation, but it's hopefully understandable since the parser doesn't implement operator precedence and doesn't really do any backtracking. It recursively descends into a bracketed expression, moving a position pointer along the string as it matches characters, and then invokes any query language methods that matched.

The important part is that this parser is agnostic to the query language implementation, so we can reuse it for any query implementation like so:

IRepository<T> repo = ...; ... var query = repo.CreateQuery ().Parse("(Foo = 'some string') or (Bar > 99)"); var results = repo.Execute(query, limit: 10, page: 1);

The Datastore Implementation

Since the Datastore filtering API has the limitations described above, we need to transform the query into a form that we can actually execute. Here is an outline of the choices I made, which I will then explain in more detail:

No disjunctions: translate a compound query into disjunctive normal form. This consists of a list of queries with no "OR" operators that are executed individually, and whose results are then merged. No inequalities on more than one property: keep track of the property that was referenced in an inequality, and build a parallel filter using delegates. We can then execute one inequality filter server-side and post-process the results with the delegates to ensure all inequalities are satisfied. No != operation: translate "x != y" into "(x < y) or (x > y)". All of the supported types have strict orderings, so not-equal is equivalent to a compound inequality defined by greater-than or less-than.

So let's start with a basic filter in our query language:

public struct QueryFilter { // the datastore filter query public Google.Cloud.Datastore.V1.Filter Filter { get; set; } // the property referenced by any inequality test; null if no inequality test public string InequalityProp { get; set; } public bool HasInequality => InequalityProp != null; // post-processing filter that applies any extra inequalities public Func<Google.Cloud.Datastore.V1.Entity, bool> Post { get; set; } public QueryFilter(Filter f, string inequalityProp, Func<Google.Cloud.Datastore.V1.Entity, bool> post) { Filter = f; InequalityProp = inequalityProp; Post = post; } }

This groups the Datastore filter, the post-processing filter, and the property used for inequality tests (if any). So a simple property test like "Foo = 'hello world!" would generate:

var foo = new QueryFilter(Filter.Equal("Foo", "hello world!"), null, e => CompareTo(e["Foo"], "hello world!") == 0);

The other basic clauses look similar, just expecting different results from CompareTo . The post-processing filter need not be used for basic queries, but compound queries with multiple inequalities need this post-processing filter since only one inequality filter will be applied server-side.

The CompareTo method implements the idiomatic IComparable for Datastore values:

static int CompareTo(Value x, Value y) { if (x.ValueTypeCase != y.ValueTypeCase) throw new InvalidCastException( $"Values are not of the same type: {x.ValueTypeCase} != {y.ValueTypeCase}."); switch (x.ValueTypeCase) { case Value.ValueTypeOneofCase.DoubleValue: return x.DoubleValue.CompareTo(y.DoubleValue); case Value.ValueTypeOneofCase.StringValue: return x.StringValue.CompareTo(y.StringValue); case Value.ValueTypeOneofCase.TimestampValue: return x.TimestampValue.ToDateTime().CompareTo(y.TimestampValue.ToDateTime()); default: throw new InvalidCastException($"Can't compare type: {x.ValueTypeCase}"); } }

QueryFilter suffices for basic queries that Google Datastore already supports, but it's insufficient for "OR" queries. So we need another level of indirection to keep track of lists of QueryFilter that represent the query in disjunctive-normal form:

public struct Term { public Term(IEnumerable<QueryFilter> filter) : this() { Filters = filter; } public Term(params QueryFilter[] filter) : this() { Filters = filter; } public Term(Value value) : this() { Value = value; } public Term(string prop) : this() { Property = prop; } public IEnumerable<QueryFilter> Filters { get; private set; } public Value Value { get; private set; } public string Property { get; private set; } }

These are the real terms generated by our query language. In reality, this should be a set of classes representing the sum term = property | value | query , but it's simpler to keep everything in a struct to avoid casts. The list of QueryFilter values represents the list of queries that themselves contain no "OR" clauses. Implementations of our query combinators must preserve this property, and we'll see later on that it's pretty straightforward to do this. Let's start with the basic value constructors:

public struct DatastoreQueries : IQueryLanguage<Term> { public Term Id() => new Term(); public Term Const(DateTime date) => new Term(date); public Term Const(double num) => new Term(num); public Term Const(string val) => new Term((Google.Cloud.Datastore.V1.Value)val); public Term Property(string propertyName) => new Term(propertyName); ... }

Pretty straightforward, values are converted into instances of Google.Cloud.Datastore.V1.Value, string property names are tracked as properties. Now let's look at the basic filters:

public struct DatastoreQueries : IQueryLanguage<Term> { ... public Term Equal(Term lhs, Term rhs) { if (lhs.Filters != null || rhs.Filters != null) throw new ArgumentException("Operations can only compare properties against values."); if (rhs.Property != null && lhs.Property == null) return Equal(rhs, lhs); if (lhs.Property == null || rhs.Property != null) throw new ArgumentException("Operations can only compare properties against values, not two properties."); var filter = Filter.Equal(lhs.Property, rhs.Value); return new Term(new QueryFilter(filter, null, e => CompareTo(e[lhs.Property], rhs.Value) == 0)); } }

Here we ensure that the terms being compared for equality are a property and a value, and we also ensure that the property name is on the left hand side so it's easier to operate on. Equality is symmetric, so we can just reverse the order of the parameter and the meaning is preserved. This is not quite the case for inequality checks, but two inequality operators can be applied in terms of each other:

public struct DatastoreQueries : IQueryLanguage<Term> { ... public Term LessThan(Term lhs, Term rhs) { // insert validations as above return rhs.Property != null ? GreaterThan(rhs, lhs): lhs.Property != null ? new Term(new QueryFilter(Filter.LessThan(lhs.Property, rhs.Value), lhs.Property, e => CompareTo(e[p], v) < 0): throw new ArgumentException("A property is required for comparisons."); } public Term GreaterThan(Term lhs, Term rhs) => { // insert validations as above return rhs.Property != null ? LessThan(rhs, lhs): lhs.Property != null ? new Term(new QueryFilter(Filter.GreaterThan(lhs.Property, rhs.Value), lhs.Property, e => CompareTo(e[p], v) > 0): throw new ArgumentException("A property is required for comparisons."); } }

I've omitted the validations for brevity, but the basic idea is the same. We ensure the term property is on the left-hand side by transforming greater-than and less-than into each other if they're backwards. As before, we also create a Datastore filter and a post-processing filter for the same operations. And now we get to the composite query combinators. The "OR" combinator is very simple:

public struct DatastoreQueries : IQueryLanguage<Term> { ... public Term Or(Term lhs, Term rhs) => new Term(lhs.Filters.Concat(rhs.Filters)); }

OR-ing two queries that themselves lack "OR" clauses is simply merging them into a single list. Given two lists of queries each element of which contain no "OR" clauses is then simply concatenating the two lists. This preserves the invariant that each individual sub-query contains no "OR" clauses. This leaves only the AND combinator:

public struct DatastoreQueries : IQueryLanguage<Term> { ... public Term And(Term lhs, Term rhs) { var permutations = from x in lhs.Filters from y in rhs.Filters select !x.HasInequality || !y.HasInequality || x.InequalityProp.Equals(y.InequalityProp) ? new QFilter(Filter.And(x.Filter, y.Filter), x.InequalityProp, e => x.Post(e) && y.Post(e)): x.HasInequality ? new QFilter(x.Filter, x.InequalityProp, e => x.Post(e) && y.Post(e)): new QFilter(y.Filter, y.InequalityProp, e => x.Post(e) && y.Post(e)); return new Term(permutations); } }

There's a lot going on here, but here's the gist: every filter in the lhs is AND-ed with every filter in the rhs . The various cases you see being checked are to handle inequalities in either the lhs or the rhs . If neither side contains an inequality filter, or if they refer to the same property, then we can combine the queries as a server-side filter. Otherwise, we take one of the filters with an inequality and post-process the other one. Technically one branch is redundant here, I'm just showing it for completeness. I also execute the post-processing for all branches for safety while I was prototyping and testing (see below).

The reason we can just permute the individual lhs and rhs expression this way is because boolean algebra is distributive. That means, given an expression in disjunctive normal form (a series of OR-ed expressions), if you then apply an AND-ed clause to that whole expression, that's equivalent to applying AND to each individual clause:

A and (B or C or D or ...) = A and B or A and C or A and D or A and ...

This is exactly what we're doing above: we're tracking a set of OR-ed clauses and then combining them with another expression via AND, which amounts to distributing the AND-ed expression across the list. But what if A itself is an OR-ed expression? Just do it all over again:

A = (X or Y or Z or ...) A and B or A and C or ... = (X or Y or Z or ...) and B or (X or Y or Z or ...) and C or ... = (X and B or Y and B or Z and B or ...) or (X and C or Y and C or Z and C or ...) or ...

Every element of A and each element of the original "OR" expression were permuted to produce the final expression, still in disjunctive normal form. This is precisely what the LINQ expression above is doing, with some extra logic to handle inequalities.

We can also see here the necessity of carrying around the post-processing filters through all the terms. When an inequality is applied to two different terms in the same filter, only one of them can be applied server-side, and the remainder has to be applied as a delegate filter when we get the results back.

Let's see how executing a query happens in a repository implementation:

public class DatastoreRepository : IRepository<Term> { DatastoreDb db; ... public async Task<IEnumerable<Entity>> Execute<TEntity>(Term query, int limit, int page) where TEntity : class { if (page <= 0) throw new ArgumentOutOfRangeException(nameof(page), "Page number must be greater than or equal to 1."); if (limit <= 0) throw new ArgumentOutOfRangeException(nameof(limit), "Number of results per page must be greater than or equal to 1."); // execute each OR-free filter server-side, then apply the post-filter to ensure all // inequalities are satisfied, then merge all results into a sorted set to remove duplicates // and ensure results are always in the same order for paging purposes var results = new SortedSet<Entity>(ecompare); foreach (var f in query.Filters ?? Enumerable.Repeat(new QueryFilter(), 1)) { // generate the entity kind from type var q = new Query(Kind<TEntity>()) { Filter = f.Filter, }; var r = await db.RunQueryAsync(q); var filtered = f.Post != null ? r.Entities.Where(f.Post) : r.Entities; foreach(var x in filtered) results.Add(x); } // rough paging implementation that's not particularly efficient, but suffices for now return results.Skip((page - 1) * limit).Take(limit); } static readonly EntityComparer ecompare = new EntityComparer(); sealed class EntityComparer: IComparer<Entity> { public int Compare(Entity x, Entity y) => Id(x.Key).CompareTo(Id(y.Key)); } }

The gist is simply that each QueryFilter is executed individually, then the post-processing filter is applied to ensure that all remaining inequalities are satisfied, and the results from all individual OR-ed queries are merged into a single sorted set. This set is then paged through using the given paging parameters and the limited results returned.

This is probably among the simplest possible implementations for the approach I took, and so is not particularly efficient, but it suffices for prototyping purposes or programs whose filters would return moderately sized result sets.

Possible Extensions

If you've ever wondered what a query planner in a relational database does, you can start with the above naive implementation and consider:

what choices you can make about which inequality to preserve in order to make the server-side query return the smallest data set, and what sort of metadata you might need to track in order to make such choices stricter clauses can eliminate other clauses, ie. "(Foo < 'hello') or (Foo = 'hell')" => "Foo = 'hell' a more complex equality query might sometimes be preferable to a simpler inequality query, so sometimes the equality query should be executed server-side instead of the inequality whether you can exploit the paging parameters to limit the results of each OR query and interleave them somehow ...and so on...

There's also a way to make the parsing and query construction less error-prone. I currently use dynamic checks where I could make a new type distinction that would eliminate these errors statically. Consider the following interface for the query language and see if you can modify the query parser to accommodate this new distinction and what sorts of errors it statically prevents:

public interface IQueryLanguage<TClause, TQuery> { TClause Id(); TClause Const(string val); TClause Const(DateTime date); TClause Const(double num); TClause Property(string propertyName); TClause LessThan(TClause lhs, TClause rhs); TClause GreaterThan(TClause lhs, TClause rhs); TClause Equal(TClause lhs, TClause rhs); TClause NotEqual(TClause lhs, TClause rhs); TQuery Lift(TClause clause); TQuery And(TQuery lhs, TQuery rhs); TQuery Or(TQuery lhs, TQuery rhs); }

Finally, my actual prototype also passes in a delegate that compares the property name passed in against a type-specific whitelist of allowed property names, so this too would be a simple extension to what I've presented.

Testing

The abstract interfaces used here make testing pretty straightforward. Here's a simple implementation of a pretty-printer for parsed expressions:

public delegate StringBuilder Eval(StringBuilder buf); public class PrettyPrinter : IQueryLanguage<Eval> { public Eval Id() => buf => buf; public Eval Const(DateTime date) => buf => buf.Append('\'').AppendFormat("{0:yyyy-MM-dd}", date).Append('\''); public Eval Const(double num) => buf => buf.Append(num); public Eval Const(string val) => buf => buf.Append('\'').Append(val).Append('\''); public Eval Property(string propertyName) => buf => buf.Append(propertyName); public Eval And(Eval lhs, Eval rhs) => Op(lhs, "and", rhs); public Eval Or(Eval lhs, Eval rhs) => Op(lhs, "or", rhs); public Eval LessThan(Eval lhs, Eval rhs) => Op(lhs, "<", rhs); public Eval GreaterThan(Eval lhs, Eval rhs) => Op(lhs, ">", rhs); public Eval Equal(Eval lhs, Eval rhs) => Op(lhs, "=", rhs); public Eval NotEqual(Eval lhs, Eval rhs) => Op(lhs, "!=", rhs); static Eval Op(Eval lhs, string op, Eval rhs) => buf => rhs(lhs(buf.Append('(')).Append(' ').Append(op).Append(' ')).Append(')'); }

This generates a nicely formatted output from a parsed expression which should be semantically equal to the input. This enables writing simple tests of the parsing algorithm like this:

using Xunit; ... public class QueryParseTests { static PrettyPrinter provider = new PrettyPrinter(); static string Parse(string input) => provider.Parse(input)(new StringBuilder()).ToString(); [Fact] static void SimpleQueryParse() { Assert.Equal("(foo = 3)", Parse("foo = 3")); Assert.Equal("(foo != 3)", Parse("foo != 3")); Assert.Equal("(foo < 3)", Parse("foo < 3")); Assert.Equal("(foo > 3)", Parse("foo > 3")); Assert.Equal("(foo > '2019-08-08')", Parse("foo > '2019-08-08'")); Assert.Equal("('2019-08-08' = foo)", Parse("'2019-08-08' = foo")); Assert.Equal("('2019-08-08' = foo)", Parse("(('2019-08-08' = foo))")); Assert.Equal("", Parse("")); } [Fact] static void CompoundQueryParse() { Assert.Equal("((foo = 3) and (20 != bar))", Parse("((foo = 3) and (20 != bar))")); Assert.Equal("((foo != 3) or (20 < bar))", Parse("((foo != 3) or (20 < bar))")); Assert.Equal("((foo != '2018-08-08') or ('2018-08-08' < bar))", Parse("((foo != '2018-08-08') or ('2018-08-08' < bar))")); } [Fact] static void CompoundFailParse() { Assert.Throws<ArgumentException>(() => Parse("((foo = 3) and (20 != bar)")); Assert.Throws<ArgumentException>(() => Parse("((foo = 3) not 20 != bar))")); } [Fact] static void SimpleFailParse() { Assert.Throws<ArgumentException>(() => Parse("foo = 3)")); Assert.Throws<ArgumentException>(() => Parse("(foo = 3")); Assert.Throws<ArgumentException>(() => Parse("(foo bar 3)")); } ... }

I can even test the Datastore query implementation in a similar way, since the implementation above carries all of the post-processing clauses that must be satisfied. For example:

public class QueryTests { static DatastoreQueries provider = new DatastoreQueries(); static Term Parse(string input) => provider.Parse(input); [Fact] static void TestComplex() { var e = new[] { new Entity { ["Date"] = new DateTime(2016, 05, 01, 0, 0, 0, DateTimeKind.Utc), ["Distance"] = 20.0 }, new Entity { ["Date"] = new DateTime(2016, 05, 01, 0, 0, 0, DateTimeKind.Utc), ["Distance"] = 23.0 }, new Entity { ["Date"] = new DateTime(2016, 04, 01, 0, 0, 0, DateTimeKind.Utc), ["Distance"] = 23.0 }, new Entity { ["Date"] = new DateTime(2016, 04, 01, 0, 0, 0, DateTimeKind.Utc), ["Distance"] = 20.0 }, new Entity { ["Date"] = new DateTime(2016, 05, 01, 0, 0, 0, DateTimeKind.Utc), ["Distance"] = 2.0 }, new Entity { ["Date"] = new DateTime(2016, 04, 30, 0, 0, 0, DateTimeKind.Utc), ["Distance"] = 4.0 }, }; var t0 = Parse("(Date = '2016-05-01') AND ((Distance > 20) OR (Distance < 10))"); Assert.Equal(2, t0.Filters.Count()); var matches = t0.Filters.SelectMany(x => e.Where(x.Post)).Distinct().ToList(); Assert.Equal(2, matches.Count); Assert.Contains(matches, x => 23 == x["Distance"].DoubleValue); Assert.Contains(matches, x => 2 == x["Distance"].DoubleValue); var t1 = Parse("((Date != '2016-04-01') or (Date = '2016-05-01')) AND ((Distance > 20) OR (Distance < 10))"); Assert.Equal(6, t1.Filters.Count()); var match1 = t1.Filters.SelectMany(x => e.Where(x.Post)).Distinct().ToList(); Assert.Equal(3, match1.Count); Assert.Contains(match1, x => 23 == x["Distance"].DoubleValue); Assert.Contains(match1, x => 2 == x["Distance"].DoubleValue); Assert.Contains(match1, x => 4 == x["Distance"].DoubleValue); } ... }

Final Words

Apologies for any typos or if something appears to be missing from the above, these are slightly adjusted code fragments from a larger operational project. Let me know if anything needs clarification!

My intent here is to illustrate the value of applying some ideas from academia to real-world programming challenges that crop up in our daily work. Object algebras/tagless final interpreters are a powerful idea for organizing powerful, reusable abstractions. In particular, it let me design, build and test individual parts of the querying API, which ensured that they would all work when combined into a larger system. Hopefully you too can get some use out of this!