Archery: an immutable R-Tree for Scala


At Meetup, we often need to search large amounts of geographic data, such as where our members live or where a Meetup is happening. This data is continually being added to as members join and events are scheduled, and also updated as events happen and move into the past.

We wanted a structure that would allow us to quickly search this data, and that would be able to handle the very dense data that we have in major cities (such as New York). After some research we settled on the R-Tree.

Archery is an immutable R-Tree implementation written for Scala. Someone misheard “R-Tree” ('är\ 'trē\) as “archery” ('är-chə-rē\) and the name stuck.

History

The R-Tree structure was first described by Antonin Guttman in 1984. Since then, it has been widely used in industry and research. There are many different R-Tree variants, including the more complex R+-Tree and R*-Tree variants.

Requirements

We had previously been using an open-source R-Tree written in Java. It performed adequately but due to mutable state there were problems with concurrent access and updates. Rather than try to fix this with locking strategies, we decided to design our own.

Our implementation had the following requirements:

Efficiency: We wanted the R-Tree to be able to search millions of points efficiently. We also needed to be able to quickly add and remove points as our data changes. It should also not use an “unreasonable” amount of memory (left fuzzy on purpose).

Immutability: Many different threads needed to be able to work with the same R-Tree (and to be able to make changes) without problems. Immutability and structural sharing meant that we could take references to the tree without worrying about who would modify it later.

These requirements (particularly immutability) shaped Archery’s development.

How it works

Most developers are familiar with using balanced trees to represent ordered data. The R-Tree generalizes this notion to allow searching of an N-dimensional space (although for our purposes we will treat this as a 2D space, i.e. map). As new entries are added to the tree, it will re-balance itself. This means that it is able to handle the fact that some areas will potentially have millions of points per square mile, while other areas may be very sparse.

Terminology

Since the R-Tree is somewhat different from a standard tree, it’s important to carefully define the terms that we’ll use to describe Archery’s implementation:

Entry: An entry represents a single piece of data in the tree: a point paired with non-geographic data (an ID, an object, etc.).

Point: A point on the map, defined as a pair of coordinates (x, y).

Box: A rectangle, defined by its lower-left point (x, y) and its upper-right point (x2, y2).

Node: Nodes make up the tree, and are defined by a bounding box and a list of children. There are two kinds of nodes: branches and leaves.

Branch: A node whose children are also nodes.

Leaf: A node whose children are entries.

These definitions should be pretty straightforward, but since “entry”, “leaf”, and “node” are somewhat overloaded terms, it’s good to be clear about what they mean in this context.
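As a rough sketch, the geometric terms could map onto Scala types like the ones below. The field names and the helper methods (area, contains, intersects, expand) are assumptions made for illustration, and Double coordinates are an arbitrary choice; Archery's actual definitions may differ.

```scala
// Hypothetical shapes for the terms above; not Archery's real classes.
final case class Point(x: Double, y: Double)

// A rectangle given by its lower-left (x, y) and upper-right (x2, y2) corners.
final case class Box(x: Double, y: Double, x2: Double, y2: Double) {
  def area: Double = (x2 - x) * (y2 - y)

  // True when the point lies inside the box or on its border.
  def contains(p: Point): Boolean =
    x <= p.x && p.x <= x2 && y <= p.y && p.y <= y2

  // True when the two boxes share at least one point.
  def intersects(o: Box): Boolean =
    x <= o.x2 && o.x <= x2 && y <= o.y2 && o.y <= y2

  // Smallest box that covers both this box and the given point.
  def expand(p: Point): Box =
    Box(math.min(x, p.x), math.min(y, p.y), math.max(x2, p.x), math.max(y2, p.y))
}

// A point paired with some non-geographic payload.
final case class Entry[A](pt: Point, value: A)
```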

Structure

The R-Tree can be thought of as having levels. The top node (i.e. the root node) is at level 1. This node has a bounding box that encompasses every point added to the tree. So, if a user searches an area which is completely outside the root node’s bounding box, there will be no matching entries.

Let’s assume that our tree has three levels. This means that our root node is a branch, its children (at level 2) are leaves, and its grandchildren (at level 3) are entries. The following diagram shows how this arrangement might look:

This root node contains six children (which are leaves), and each of the children contains 7-25 points (which are entries). The diagram actually doesn’t make it clear how many points each leaf contains, since points that appear in overlapping leaves could be contained in either leaf (but not both). This is an important point: the R-Tree does not guarantee that a node’s children do not overlap.

Searching the R-Tree

Let’s imagine we want to find all entries existing in a given box (which we’ll call the “search space”). We can informally describe this via the following rules:

For a given node:

Check if the node intersects with the search space. If so, search its children and return their combined results. Otherwise, return an empty list.

For a given entry:

Check if the search space contains the entry’s point. If so, return the entry. Otherwise, return nothing.

So starting with the root node, we will recursively build up a list of all results. Ideally, our search space will be much smaller than the bounding box of our root node, and so this process will let us ignore most of the nodes (since their bounding boxes will be entirely outside the search space).
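The two rules above can be sketched as a short recursive function. This is a minimal illustration rather than Archery's code: the types are simplified stand-ins, and SearchSketch, contains, and intersects are names assumed for the example.

```scala
final case class Point(x: Double, y: Double)
final case class Box(x: Double, y: Double, x2: Double, y2: Double) {
  def contains(p: Point): Boolean =
    x <= p.x && p.x <= x2 && y <= p.y && p.y <= y2
  def intersects(o: Box): Boolean =
    x <= o.x2 && o.x <= x2 && y <= o.y2 && o.y <= y2
}
final case class Entry[A](pt: Point, value: A)

sealed trait Node[A] { def box: Box }
final case class Branch[A](children: Vector[Node[A]], box: Box) extends Node[A]
final case class Leaf[A](children: Vector[Entry[A]], box: Box) extends Node[A]

object SearchSketch {
  // Returns every entry under `node` whose point lies inside `space`.
  def search[A](node: Node[A], space: Box): Vector[Entry[A]] =
    if (!node.box.intersects(space)) Vector.empty // prune this whole subtree
    else node match {
      case Leaf(entries, _) => entries.filter(e => space.contains(e.pt))
      case Branch(nodes, _) => nodes.flatMap(n => search(n, space))
    }
}
```

The pruning step on the first line of search is what lets a small search space skip most of the tree.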

Effective structure

From the previous example, it’s clear that overlapping nodes can have a huge impact on performance. In the worst case, imagine a situation where each child of a node had the same bounding box as its parent. This node would always have to search all its children, defeating the benefits of the tree structure.

Trees that are unbalanced will also hurt the performance of our search algorithm: if a tree is too narrow and deep (i.e. the branching factor is too low) then all searches will take much longer.

To avoid these situations, the R-Tree uses heuristics when adding and removing nodes to minimize overlap, and to keep the tree balanced.

Adding entries

To add entries to the R-Tree, there are two main steps:

1. Determine which leaf should contain the entry.
2. Check if the leaf should be split.

The first step can be thought of as traversing down the tree to the leaves. In each case, we want to find the node whose bounding box already contains the new point to be added. If the point is outside all of these boxes, we will determine which box will gain the least amount of area by expanding to contain the point.

(In this diagram, the red node “B” will gain less area than the blue node “A”.)
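The "least area gained" rule can be sketched as follows. The names ChooseSketch, enlargement, and choose are hypothetical, and breaking ties in favor of the smaller box is an assumption of this example, not something the description above specifies.

```scala
final case class Point(x: Double, y: Double)
final case class Box(x: Double, y: Double, x2: Double, y2: Double) {
  def area: Double = (x2 - x) * (y2 - y)
  // Smallest box that covers both this box and the given point.
  def expand(p: Point): Box =
    Box(math.min(x, p.x), math.min(y, p.y), math.max(x2, p.x), math.max(y2, p.y))
}

object ChooseSketch {
  // Extra area `box` needs in order to cover `p`; zero if it already contains `p`.
  def enlargement(box: Box, p: Point): Double =
    box.expand(p).area - box.area

  // Index of the box that grows the least. Ties go to the smaller box
  // (an assumed tie-break). Throws on an empty collection.
  def choose(boxes: Vector[Box], p: Point): Int =
    boxes.indices.minBy(i => (enlargement(boxes(i), p), boxes(i).area))
}
```

A box that already contains the point has an enlargement of zero, so it wins automatically, which matches the first half of the rule.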

Once we’ve found the leaf to add to, we check to make sure the leaf isn’t already “full”. Archery splits nodes that contain more than 50 children. While splitting nodes is an important part of the balancing strategy, in most cases, we won’t need to split leaves.

Splitting a node means assigning its children to new nodes:

1. Find the two children whose bounding boxes are the “farthest apart”. These will be our seeds. (See “Smarter seeding heuristics” for more information.)
2. Create two new nodes, inserting a seed into each.
3. Add the remaining children to one of the two nodes, using the previous rules to determine which.

Once a node is split, it is removed from its parent and its two halves are added. This can cause the parent to split as well, continuing the process up the tree.

If the root node splits, a new, higher-level root is created containing the two halves of the previous root.
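The splitting steps can be sketched for a leaf whose children are plain points. This is a simplified illustration: SplitSketch, pickSeeds, and split are hypothetical names, the seeding just takes the extreme points along the axis with the larger spread, and ties go to the first group with no attempt to keep the groups the same size.

```scala
final case class Point(x: Double, y: Double)
final case class Box(x: Double, y: Double, x2: Double, y2: Double) {
  def area: Double = (x2 - x) * (y2 - y)
  def expand(p: Point): Box =
    Box(math.min(x, p.x), math.min(y, p.y), math.max(x2, p.x), math.max(y2, p.y))
}

object SplitSketch {
  private def boxOf(p: Point): Box = Box(p.x, p.y, p.x, p.y)

  // Step 1: seeds are the two extreme points along the wider axis.
  def pickSeeds(ps: Vector[Point]): (Point, Point) = {
    val (loX, hiX) = (ps.minBy(_.x), ps.maxBy(_.x))
    val (loY, hiY) = (ps.minBy(_.y), ps.maxBy(_.y))
    if (hiX.x - loX.x >= hiY.y - loY.y) (loX, hiX) else (loY, hiY)
  }

  def split(ps: Vector[Point]): (Vector[Point], Vector[Point]) = {
    require(ps.size >= 2, "need at least two points to split")
    val (s1, s2) = pickSeeds(ps)
    val rest = ps.diff(Vector(s1, s2))
    // Step 2: each group starts with one seed and that seed's bounding box.
    val start = ((Vector(s1), boxOf(s1)), (Vector(s2), boxOf(s2)))
    // Step 3: every other point joins the group whose box grows the least.
    val ((left, _), (right, _)) = rest.foldLeft(start) {
      case (((g1, b1), (g2, b2)), p) =>
        val (e1, e2) = (b1.expand(p), b2.expand(p))
        if (e1.area - b1.area <= e2.area - b2.area) ((g1 :+ p, e1), (g2, b2))
        else ((g1, b1), (g2 :+ p, e2))
    }
    (left, right)
  }
}
```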

Removing entries

Removing entries from the R-Tree also involves two steps:

1. Find the entry’s parent node and remove the entry.
2. Check if the parent should be shattered.

Finding the entry is easy (we use the same strategy we do for searching). Once the entry is removed, we want to make sure that the parent node is still viable. Archery considers a node that has fewer than two children to be “non-viable” and will shatter that node.

Shattering a node means removing a node from its parent, taking the orphaned children (if any), and later re-adding them to the tree. Since a shattered node is removed from its parent, it may also cause its parent to shatter (which will shatter any remaining children it has). After all shattered nodes are removed, all the orphaned entries are re-added.

(In this diagram, the lower-right point is removed. This shatters its immediate parent (the green node), its grandparent (the yellow node), and its uncle (the blue-ish node). The uncle’s points are re-added, stretching the remaining nodes.)

The important thing to note is that we can only add entries to a tree, so shattering can cause a chain reaction that produces a large collection of entries to be re-added.

Handling immutability

We’ve been talking about modifying the R-Tree as if it was a mutable data structure. But since immutability was one of our main requirements, we are actually going to return a new tree after every operation. We will rely on structural sharing to minimize the amount of copying that actually takes place.

To power our tree, we’re going to use Vector, an immutable data structure with (effectively) constant-time access and update.

Here’s the essential structure of our nodes:

sealed trait Node[A]
case class Branch[A](children: Vector[Node[A]], box: Box) extends Node[A]
case class Leaf[A](children: Vector[Entry[A]], box: Box) extends Node[A]

Any method that wants to modify the tree will need to return new node instances, rather than modifying them. The two main methods for “modifying” the tree, insert and remove, are described here. In both cases, the return value encompasses all the changes that were made to the node in question (and its descendants).
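To make structural sharing concrete, here is a minimal sketch using deliberately simplified stand-in types (the Leaf and Branch below are not Archery's node classes, and replaceChild is a hypothetical helper). Replacing one child builds a new parent while every untouched child is shared between the old and new versions.

```scala
object SharingSketch {
  // Simplified stand-ins, just enough to show the copying behavior.
  final case class Leaf(label: String)
  final case class Branch(children: Vector[Leaf])

  // Returns a new Branch; `b` itself is left unchanged.
  def replaceChild(b: Branch, i: Int, child: Leaf): Branch =
    b.copy(children = b.children.updated(i, child))
}
```

Any thread still holding the old Branch keeps seeing the old children, which is why references to a tree can be handed out freely.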

Node.insert

Here’s the signature of the Node.insert method:

def insert(entry: Entry[A]): Either[Vector[Node[A]], Node[A]]

The return type handles two cases:

Right(node): This is the most common case: we have constructed a new node (containing entry) to replace this node.

Left(nodes): This case handles the situation where the node has been split. We’re returning a collection of new nodes to replace this node (one of which will contain entry).

Instead of encoding these cases in Either we could have just returned a collection in both cases and checked the length. However, it’s nice to let the type system help us keep these cases separate, rather than hoping that we remember to check the length correctly.
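Here is a sketch of how an insert with this return type might be written as a standalone function. Several parts are assumptions: the types are simplified, InsertSketch and its helpers are hypothetical names, the max parameter exists only to make small examples possible, and the split is a placeholder that sorts by x and halves the children instead of using a seeding heuristic. It assumes Scala 2.13 or later.

```scala
final case class Point(x: Double, y: Double)
final case class Box(x: Double, y: Double, x2: Double, y2: Double) {
  def area: Double = (x2 - x) * (y2 - y)
  def expand(o: Box): Box =
    Box(math.min(x, o.x), math.min(y, o.y), math.max(x2, o.x2), math.max(y2, o.y2))
}
final case class Entry[A](pt: Point, value: A) {
  def box: Box = Box(pt.x, pt.y, pt.x, pt.y)
}
sealed trait Node[A] { def box: Box }
final case class Branch[A](children: Vector[Node[A]], box: Box) extends Node[A]
final case class Leaf[A](children: Vector[Entry[A]], box: Box) extends Node[A]

object InsertSketch {
  private def cover(boxes: Vector[Box]): Box = boxes.reduce(_ expand _)

  // Placeholder split (NOT a seeding heuristic): sort by x and cut in half.
  private def halves[T](items: Vector[T])(boxOf: T => Box): (Vector[T], Vector[T]) =
    items.sortBy(t => boxOf(t).x).splitAt(items.size / 2)

  def insert[A](node: Node[A], entry: Entry[A], max: Int = 50): Either[Vector[Node[A]], Node[A]] =
    node match {
      case Leaf(es, box) =>
        val all = es :+ entry
        if (all.size <= max) Right(Leaf(all, box.expand(entry.box)))
        else {
          val (l, r) = halves(all)(_.box)
          Left(Vector(Leaf(l, cover(l.map(_.box))), Leaf(r, cover(r.map(_.box)))))
        }
      case Branch(ns, box) =>
        // Descend into the child whose box grows the least.
        val i = ns.indices.minBy(j => ns(j).box.expand(entry.box).area - ns(j).box.area)
        val replaced = insert(ns(i), entry, max) match {
          case Right(n)    => ns.updated(i, n)     // one-for-one replacement
          case Left(parts) => ns.patch(i, parts, 1) // child split: swap in its halves
        }
        if (replaced.size <= max) Right(Branch(replaced, box.expand(entry.box)))
        else {
          val (l, r) = halves(replaced)(_.box)
          Left(Vector(Branch(l, cover(l.map(_.box))), Branch(r, cover(r.map(_.box)))))
        }
    }
}
```

Because the Left case can only be handled by pattern matching, the compiler forces every caller to deal with a split explicitly.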

Node.remove

And here’s the Node.remove signature:

def remove(entry: Entry[A]): Option[(Joined[Entry[A]], Option[Node[A]])]

This method’s return type is also a bit complicated. The return value breaks down into three cases:

None: This is the most common case. It means that the entry was not found within this node (or its descendants).

Some((entries, None)): In this case, the entry was found in this node (or a descendant of this node), and after removing it, the node was shattered. So, we are returning a sequence of entries to be re-added, but no node. Remember that entries might be relatively large, if many of our descendants have shattered.

Some((entries, Some(node))): The entry was found, but after removing it this node is still viable. Since some of our descendants may have shattered, we still return a collection of entries but also a replacement node.

Recall that the entries to be re-added will be passed all the way back up the tree. They will be inserted after the tree structure has been updated from the deletions.
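The three cases can be sketched as a standalone function. This is an illustration under stated assumptions: a plain Vector stands in for Archery's Joined collection, RemoveSketch and its helpers are hypothetical names, and a child is only searched when its box contains the entry's point. It assumes Scala 2.13 or later.

```scala
final case class Point(x: Double, y: Double)
final case class Box(x: Double, y: Double, x2: Double, y2: Double) {
  def contains(p: Point): Boolean = x <= p.x && p.x <= x2 && y <= p.y && p.y <= y2
  def expand(o: Box): Box =
    Box(math.min(x, o.x), math.min(y, o.y), math.max(x2, o.x2), math.max(y2, o.y2))
}
final case class Entry[A](pt: Point, value: A) {
  def box: Box = Box(pt.x, pt.y, pt.x, pt.y)
}
sealed trait Node[A] { def box: Box }
final case class Branch[A](children: Vector[Node[A]], box: Box) extends Node[A]
final case class Leaf[A](children: Vector[Entry[A]], box: Box) extends Node[A]

object RemoveSketch {
  // Vector is used here in place of Archery's Joined (an assumption).
  type Result[A] = Option[(Vector[Entry[A]], Option[Node[A]])]

  private def cover(boxes: Vector[Box]): Box = boxes.reduce(_ expand _)

  private def entriesOf[A](node: Node[A]): Vector[Entry[A]] = node match {
    case Leaf(es, _)   => es
    case Branch(ns, _) => ns.flatMap(n => entriesOf(n))
  }

  def remove[A](node: Node[A], entry: Entry[A]): Result[A] = node match {
    case Leaf(es, _) =>
      val i = es.indexOf(entry)
      if (i < 0) None // not found here
      else {
        val rest = es.patch(i, Nil, 1)
        if (rest.size < 2) Some((rest, None)) // shattered: orphan what is left
        else Some((Vector.empty, Some(Leaf(rest, cover(rest.map(_.box))))))
      }
    case Branch(ns, _) =>
      // Try candidate children lazily and stop at the first one that held the entry.
      val hit = ns.indices.iterator
        .filter(i => ns(i).box.contains(entry.pt))
        .map(i => remove(ns(i), entry).map(r => (i, r)))
        .collectFirst { case Some(found) => found }
      hit.map { case (i, (orphans, replacement)) =>
        val kept = replacement match {
          case Some(n) => ns.updated(i, n)
          case None    => ns.patch(i, Nil, 1)
        }
        // A branch left with fewer than two children shatters as well.
        if (kept.size < 2) (orphans ++ kept.flatMap(n => entriesOf(n)), None)
        else (orphans, Some(Branch(kept, cover(kept.map(_.box)))))
      }
  }
}
```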

The RTree wrapper

You might be wondering what happens if the root node is split (or shattered). Who handles that case? The answer is that we have an RTree class that wraps the root node, and provides wrapper methods for insert, remove, search, and so on.

Its main job is to handle these cases, creating a new root node. Like the nodes, it is immutable, so it ensures that all modifications result in a new copy. It can also hide some of the complexity of the tree structure (users don’t need to worry about Leaf and Branch instances).

Here’s the basic prototype:

case class RTree[A](root: Node[A], size: Int) {
  def insert(entry: Entry[A]): RTree[A]
  def remove(entry: Entry[A]): RTree[A]
  def search(space: Box): Seq[Entry[A]]
  ...
}
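As an illustration of the root-split case only, the wrapper's job after an insert might look like the sketch below. RootSketch and newRoot are hypothetical names, the types are simplified, and the shattered-root case (which also needs to re-add orphaned entries) is left out.

```scala
final case class Point(x: Double, y: Double)
final case class Box(x: Double, y: Double, x2: Double, y2: Double) {
  def expand(o: Box): Box =
    Box(math.min(x, o.x), math.min(y, o.y), math.max(x2, o.x2), math.max(y2, o.y2))
}
final case class Entry[A](pt: Point, value: A)
sealed trait Node[A] { def box: Box }
final case class Branch[A](children: Vector[Node[A]], box: Box) extends Node[A]
final case class Leaf[A](children: Vector[Entry[A]], box: Box) extends Node[A]

object RootSketch {
  // Turns the result of inserting into the root node into the next root.
  def newRoot[A](result: Either[Vector[Node[A]], Node[A]]): Node[A] =
    result match {
      // No split: the replacement node is the new root.
      case Right(root) => root
      // Split: grow the tree by one level, with the pieces as children.
      case Left(parts) => Branch(parts, parts.map(_.box).reduce(_ expand _))
    }
}
```

This is the only place where the tree gains a level, so every leaf stays at the same depth.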

Future Work

There’s a lot left for us to do with Archery:

Entries with area

R-Trees can support storing rectangles, not just points. For the initial version of Archery we’ve stuck with just points, but we plan to add this feature very soon.

Better filtering support

There are many additional flavors of search result filtering we could support. Users might want to provide functions to narrow results, such as A => Boolean or Entry[A] => Boolean.

We could also generalize search, count, and other methods with some kind of fold method.
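One possible shape for such a fold is sketched below. Everything here is hypothetical, since this is future work rather than an existing API: FoldSketch, fold, count, and searchWhere are names invented for the example, and the types are simplified stand-ins.

```scala
final case class Point(x: Double, y: Double)
final case class Box(x: Double, y: Double, x2: Double, y2: Double) {
  def contains(p: Point): Boolean = x <= p.x && p.x <= x2 && y <= p.y && p.y <= y2
  def intersects(o: Box): Boolean = x <= o.x2 && o.x <= x2 && y <= o.y2 && o.y <= y2
}
final case class Entry[A](pt: Point, value: A)
sealed trait Node[A] { def box: Box }
final case class Branch[A](children: Vector[Node[A]], box: Box) extends Node[A]
final case class Leaf[A](children: Vector[Entry[A]], box: Box) extends Node[A]

object FoldSketch {
  // Folds `f` over every entry inside `space`, pruning subtrees that miss it.
  def fold[A, B](node: Node[A], space: Box)(init: B)(f: (B, Entry[A]) => B): B =
    if (!node.box.intersects(space)) init
    else node match {
      case Leaf(es, _) =>
        es.foldLeft(init)((acc, e) => if (space.contains(e.pt)) f(acc, e) else acc)
      case Branch(ns, _) =>
        ns.foldLeft(init)((acc, n) => fold(n, space)(acc)(f))
    }

  // count without building an intermediate collection.
  def count[A](node: Node[A], space: Box): Int =
    fold(node, space)(0)((n, _) => n + 1)

  // search narrowed by a user-supplied Entry[A] => Boolean predicate.
  def searchWhere[A](node: Node[A], space: Box)(p: Entry[A] => Boolean): Vector[Entry[A]] =
    fold(node, space)(Vector.empty[Entry[A]])((acc, e) => if (p(e)) acc :+ e else acc)
}
```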

Flexible geometries

Right now users have to use Archery’s Point, Box, and so on. It would be nice to make these pluggable via type classes, so that users could use their existing geometry objects instead.

This could also make it easier to use Archery to support 3D data.

Polygons

In addition to using bounding boxes, Archery could potentially support searches and entries with arbitrary polygon data. In these cases, Archery would use the R-Tree for fast bounding box tests, and only use more expensive point-in-polygon and polygon-in-polygon tests on entries that could plausibly be a match.

Less boxing

Right now Archery performs very well for the cases we’re interested in. But it still allocates a lot of Entry and Point instances, and will also box primitive types like Int.

There are strategies we could use to reduce the number of allocations necessary.

Smarter seeding heuristics

Archery is using a simple (but relatively effective) “linear” seeding strategy. This strategy looks for the two bounding boxes with the largest distance in one dimension. It is fast (O(n)) but doesn’t always produce optimal results.

There are other strategies (including “quadratic” as well as the more involved strategy that R*-Trees use). These are often a bit more expensive but can produce better results. It would be nice if this strategy was pluggable, or at least if some work was done to determine the optimal strategy for Archery.

Bulk loading and packing

In many cases users will have a large collection of points. While Archery doesn’t use a special loading/clustering strategy in this case, many exist. It would be nice to implement something like a Hilbert R-Tree for this case.

Contributing

Meetup has released Archery under the MIT license. This means you are free to use and improve Archery in your own projects. If you make any improvements, feel free to open a pull request to send them back upstream. You can also open an issue to report a bug or request a feature.