For a while now, CrateDB has supported subselects in the FROM clause, as long as the subselect can be collapsed into as a single select.

In one of our recent batches of work, my team addressed this limitation and now CrateDB supports non-collapsable subselects.

In this post, I will introduce you to the problem in a little more detail, and show you the steps we took to solve it.

Introduction

Collapsable Subselects

Here’s an example of a very simple query using a subselect:

SELECT column_a, column_b FROM ( SELECT * FROM my_table WHERE column_a > 10 ) AS subselect_a ORDER BY column_b

This is functionally the same as:

SELECT column_a, column_b FROM my_table WHERE column_a > 10 ORDER BY column_b

That’s because, in the original query, the inner select applies the WHERE operator and the outer select applies the ORDER BY operator. And these two operators can be merged and executed in a single step.

We do not have to produce an intermediate result set from the inner select and then apply the operations of the outer select.

These are the only sorts of subselects that CrateDB was able to handle before we started our work.

Non-Collapsable Subselects

Let’s take a look at a more complex example of a subselect:

SELECT * FROM ( SELECT * FROM my_table ORDER BY column_a DESC LIMIT 10 ) AS subselect_a ORDER BY column_b LIMIT 5

Here, the inner select applies ORDER BY and LIMIT operators, and the outer select applies ORDER BY and LIMIT operators. There is no way to merge these operators and execute them in a single step. Instead, we must first produce an intermediary result set by applying the operators from the inner select to the table, and then apply the operators from the outer select to the intermediary result set.

The first select operation looks like this:

SELECT * FROM my_table ORDER BY column_a DESC LIMIT 10

If the result set produced by this query was a table called subselect_a , the final select operation would look like this:

SELECT * FROM subselect_a ORDER BY column_b LIMIT 5

It is the combination of these two queries that is needed.

In the past CrateDB did not support any sort of subqueries that required an intermediary result set, so these sorts of queries would raise an error.

Here’s another example:

SELECT * FROM ( SELECT * FROM ( SELECT column_a, column_b FROM my_table WHERE column_c > 10 ) AS subselect_a ORDER BY column_a DESC LIMIT 10 ) AS subselect_b ORDER BY column_b LIMIT 5

This can be partially collapsed, to this:

SELECT * FROM ( SELECT column_a, column_b FROM my_table WHERE column_c > 10 ORDER BY column_a DESC LIMIT 10 ) AS subselect_a ORDER BY column_b LIMIT 5

Here, the innermost select has been collapsed. But we cannot collapse it any further, for the same reasons as before.

Adding support for non-collapsable subselects (which includes partially non-collapsable subselects) was our primary goal for this batch of work. And in the next section, I will show you how we did it.

How We Implemented Non-Collapsable Subselects

Changes to Query Rewriting

To support non-collapsable subselects, we had to make some changes to our SubselectRewriter . This is the module responsible for collapsing subselects, per the examples in the previous section.

Previously, the SubselectRewriter would construct a nested hierarchy of objects, modeling the nested hierarchy of selects. QueriedSelectRelation objects are used to model selects that operate on the results of a select operation, and QueriedDocTable objects are used to model selects that operate directly on a table.

Take this select:

SELECT column_a, column_b FROM ( SELECT * FROM my_table WHERE column_a > 10 ) AS subselect_a ORDER BY column_b

This would be modelled with the following nested object tree:

The top level object is always a QueriedSelectRelation , because the outermost select of a nested collection of selects is always operating on the result set of another select. And the innermost select is always operating directly on a table.

Previously, the SubselectRewriter would recursively collapse and rewrite this nested hierarchy of objects until only the QueriedDocTable object was remaining. This corresponds to a single select operating directly on a table.

The final tree would look like this:

But let’s take this select:

SELECT * FROM ( SELECT * FROM my_table ORDER BY column_a DESC LIMIT 10 ) AS subselect_a ORDER BY column_b LIMIT 5

We would model this with the following nested object tree:

But we cannot collapse it any further.

The SubselectRewriter collapses the object tree from the inside out. Every time it encounters a QueriedSelectRelation above the innermost QueriedDocTable , it replaces both of them with a new QueriedDocTable .

If it was unable to complete one of replacements, it would give up, and the end result would be a partially collapsed tree, with a QueriedDocTable on the inside and at least one uncollapsed QueriedSelectRelation object wrapped around it. CrateDB was unable to handle QueriedSelectRelation objects, so an error would be raised.

To add support for QueriedSelectRelation objects, we first had to change something about the way this collapsing was being handled.

In the past, we never bothered to update the parent object’s references to the fields of the child object. Because we assumed that the only useful end result would be a single collapsed QueriedDocTable table. So a hierarchy of object with broken references was fine. Because this was an error condition.

But now we want to support a hierarchy of objects, every time we collapse an object pair, we need to update the parent object’s references.

We start with something like this:

To collapse this, we need to replace the references to xx with references to (x+x) :

We can collapse again by replacing x with (x+x) + (x+x) :

Implementing this object rewriting was a major chunk of the work we undertook.

Implementing this object rewriting was a major chunk of the work we undertook.

Changes to Data Fetching

The most complex part of the implementation though has to do with optimizing the fetch operations that take place during the execution of non-collapsable subselects.

What Is a Fetch Operation?

Query execution fetches row data in two steps:

CrateDB fetches the minimum amount of data necessary for doing distributed filtering and ordering on specific columns Once CrateDB knows precisely which rows need returning to the client, it requests the rest of the data needed for those rows only

Consider this query:

SELECT column_a, column_b, column_c FROM my_table WHERE column_a > 10 ORDER BY column_b LIMIT 100

This is what happens when CrateDB executes this query:

Each my_table shard is asked to produce a set of _id , column_a ) tuples where column_a is greater than 10. These tuples are then sent to the query handling node, which merges them together into a single set. This is the first fetch operation. The merged set of tuples is then ordered. The set of tuples is then limited to 100. The 100 document IDs we now have are used to request the rest of the data for each row. This is the second fetch.

This approach avoids unnecessary disk operations and unnecessary network traffic, thereby improving the performance of both overall.

What We Changed

To support non-collapsable subqueries, we need to modify this behaviour in a way that continues to optimise for both disk and network performance.

So, let’s take this modified example:

SELECT * FROM ( SELECT column_a, column_b, column_c FROM my_table WHERE column_a > 10 ORDER BY column_b DESC LIMIT 10 ) AS subselect_a ORDER BY column_a DESC LIMIT 5

The inner query, if it was a standalone query, would fetch a set of ( _id , column_a ) tuples, and then fetch the values of column_b and column_c as the final step.

As the inner query of a subselect, this process is modified so that the outer query is notified that there’s a pending fetch operation for the values of column_b and column_c .

Because column_b and column_c are also not required for the execution of the outer query, this fetch can be postponed. Whether or not this fetch needs to be done later depends on whether the outer query's final fetch would fetch this data.

Because column_b and column_c are also not required for the execution of the outer query, but only for the final response, this fetch can be postponed. Whether the fetch needs to be done immediately or not, depends on whether the outer query operates on those columns (e.g. uses them in the WHERE clause, ORDER BY clause, and so on). In some cases, the fetch can be ignored entirely, because the fetch that the outer query produces fetches all the necessary data.

Additional Work

Group By on Subselects

A GROUP BY on a subselect is more complex than on a table.

Consider this non-collapsable subselect:

SELECT column_a, count(*) FROM ( SELECT column_a FROM my_table ORDER BY column_a LIMIT 100 ) AS subselect_a GROUP BY column_a HAVING count(*) > 10 ORDER BY 2 LIMIT 10

Let’s take a look at how CrateDB now handles this.

First of all, the inner select is planned as if it was an independent query.

If that plan only touches the node that is handling the client request, then the GROUP BY of the outer select will be handled with a non-distributed grouping plan. However, if the plan for the inner select touches additional nodes, the GROUP BY of the outer select will be handled with a distributed grouping plan.

In the case of non-distributed grouping plan, the grouping operation takes place on the handler node.

In the case of distributed grouping plan, the grouping operation takes place on all nodes that hold the data being operated on.

In both cases, the grouping operation also involves calculating aggregate functions like min , max , avg , and so on.

If the grouping operation was performed across multiple nodes, the handling node fetches that data and merges it together. This includes merging the results of aggregate functions.

HAVING , ORDER BY , and LIMIT clauses are performed the GROUP BY is done (which for distributed grouping plans means after the merge is complete).

Group By on Joins

A GROUP BY on a join requires similar handling.

Consider this query:

SELECT table_a.column_b table_b.column_b, count(*) FROM FROM table_a INNER JOIN table_b ON table_a.column_a = table_b.column_a GROUP BY table_a.column_c, table_b.column_c HAVING count(*) > 10 ORDER BY 3 LIMIT 10

To handle this, we need to rewrite the query so that the join conditions are separated from the GROUP BY operation and any subsequent operations.

We can do that, like so:

SELECT subselect_a.ta_column_b, subselect_a.tb_column_b, count(*) FROM ( SELECT table_a.column_b AS ta_column_b, table_b.column_b AS tb_column_b, table_a.column_c AS ta_column_c, table_b.column_c AS tb_column_c, FROM table_a INNER JOIN table_b ON table_a.column_a = table_b.column_a ) AS subselect_a GROUP BY subselect_a.ta_column_c, subselect_a.tb_column_c HAVING count(*) > 10 ORDER BY 3 LIMIT 10

The inner select is planned as if it was a independent select and a distributed or non-distributed plan is produced. On top of this plan, the grouping plan is created along with any subsequent operations.

This whole process functions like it does with grouping on subselects, as detailed in the previous section.

Implement DISTINCT for All Queries

The DISTINCT operator is implemented by actually adding a group by operation on top of the existing query. Example:

When you use a DISTINCT operation, like this:

SELECT DISTINCT column_a, column_b, column_c FROM my_table WHERE column_a > 10 ORDER BY 2

Internally this is rewritten by adding a GROUP BY operation, like so:

SELECT column_a, column_b, column_c FROM my_table WHERE column_a > 10 ORDER BY 2 GROUP BY column_a, column_b, column_c

Previously doing work already covered in this post, it was not possible to use DISTINCT with queries that already had a GROUP BY clause.

Why?

Consider this query:

SELECT DISTINCT column_a, max(column_b) FROM my_table GROUP BY column_a, column_b

Internally, this would be rewritten as:

SELECT column_a, max(column_b) FROM ( SELECT DISTINCT column_a, max(column_b) FROM my_table GROUP BY column_a, column_b ) AS subselect_a GROUP BY column_a, max(column_b)

This is a non-collapsable subselect, and so wasn’t supported.

With the addition of support for non-collapsable subselects and grouping on subselects, DISTINCT operations can now be rewritten internally and are thereby now supported for all queries.

Wrap Up

In this batch of work we:

Implemented nested subquery class attribute rewriting, finally ;)

Rewrote how row fetching works so that subqueries could be supported without intermediary result sets prematurely fetching data (which would degrade disk and network performance, increasing infrastructure costs)

Modified CrateDB’s grouping logic (which includes aggregate functions) so that subselects are supported

Added support for grouping on joins, because now they can be internally rewritten as subselects

Added support for DISTINCT for all queries, because the queries we rewrite them to internally are now also supported

This isn’t the end of our work on subselects.

In my next lab notes post I’m going to take a look at some additional subselects improvements we’ve been working on.

Stay tuned.