In 2008 I was getting paid to work on my creation, Apache CouchDB. IBM had hired me early that year to put it into the Apache Incubator and by then it was a top level project. I was living in Asheville NC with my wife and kids, in the house 23hrs a day working or taking care of kids or goofing off (a lot). Code, change diapers, exercise, play Guitar Hero, code, argue on the internet, etc.

I reported to the CTO of Information Management (DB2, IMS, Informix, etc). As such I was invited to give a talk at Almaden Research Center introducing the design of CouchDB.

The talk was really well attended; there were at least 2 IBM Fellows there and a bunch of Distinguished Engineers. That’s what the sponsor I was with all day said, anyway.

Almaden Research Center is so damn swank. It’s at the top of a nature preserve in the San Jose hills. All the engineers had offices with doors that shut. And nice offices. With windows, hardwood desks and cabinets instead of grey plastic and fiber. I call it Fort Awesome.

Anyway, while I was giving my talk some scraggly bearded dude started asking pointed questions and being a real prick about it. Where’s the write-ahead log? How can you have atomicity without it? Questions that basically amounted to “Why doesn’t this work just like our RDBMS?”

At one point I zinged him. I said “because it’s a MODERN DESIGN” and I enunciated like a real prick. The whole place FELL OUT! The guy meekly shut up after that.

I have no idea if the rest of the talk went well. But damn that zinger felt good. Worth it.

After the talk I had a meeting about an experimental project looking at Cassandra as an email store. I’m still not sure why they were telling me about that. It was a weird meeting.

Then another meeting showed me two IBM projects for JSON manipulation: Jaql (pronounced “jackal”) and JSearch.

The Jaql syntax I found confusing. Nothing looked familiar or intuitive. I had the guy who wrote it explaining it to me, and I still couldn’t figure out what the code he showed did. But he seemed to like it a lot.

JSearch, however, was really cool. It had a query syntax that mimicked the JSON data structure, and it allowed you to combine predicates with boolean operators and parentheses for grouping. Everything about it was intuitive and expressive.

This was something CouchDB could really use; we had no way to do ad hoc queries. I almost immediately had plans to get it into the project, but it never happened. Smaller, more urgent tasks always got in the way. A while later I left IBM to found Couch.io. I was in way over my head running a startup and I just couldn’t push JSearch forward.

Here’s the bug to add it to Apache CouchDB. That bug, with the JSearch source code attached, and a patent are the only places I know of with information about it.

Anyway, that was a long time ago. But since that meeting I always wanted to build it in something other than Java. I tried to get it into Couchbase Server, but then got sidetracked on this UnQL thing with Richard Hipp, the author of SQLite. We collaborated on the design of a query language that I ended up not liking. At. All. It was SQL for JSON. I felt like it made you understand SQL way too well to do simple things with JSON. I gave up on that effort. Besides, SQL is for sets, not hierarchical data.

Later I tried again to get another form of it into Couchbase but I left the company before that was possible.

Fast forward to late 2015: divorced father of 3 kids, taking time off work to spend more time with them and learn to be a better dad. I fixed up an old Buick to occupy the part of my mind that loves problem solving and tinkering. Once it was running smoothly and reliably I was getting bored again. I wanted to work again but I didn’t want a job. A job means you work on someone else’s vision. If I can avoid one I will. But I love working. A little too much sometimes. Software projects are a lot of work, with less job.

And so I remembered that cool little JSearch project from so many years ago that I always wanted to build, but never had the time. Now I had the time and the inclination.

So with that I started Noise: Nested Object Inverted Search Engine. It’s a search library that lets you index any valid JSON, whatever it looks like (schema-less), query it, and pick off and aggregate information about objects nested in arrays nested in objects nested in more arrays.

Years ago I had abandoned the JSON example-based syntax of JSearch (at the behest of the SQL guys) and instead used a JavaScript dot-path syntax: {foo: {bar: == "thing"}} vs. doc.foo.bar == "thing". The path syntax looked more SQLish, and the whole language looked like SQL. For some reason I started with the path syntax this time too. But there are advantages to the example-based approach, especially as the queries for fields get more deeply nested and complex. Plus it looks like the JSON you are querying.
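To see the difference, here’s the same hypothetical query written both ways (illustrative only; the field names are made up and the exact Noise syntax may differ):

```
Example-based: the query mirrors the shape of the documents it matches
  {author: {name: == "Kafka"}}

Path-based: a JavaScript-style dot path, reading more like SQL
  doc.author.name == "Kafka"
```

With one shallow field the two are a wash, but once you’re matching fields several objects and arrays deep, the example-based form stays visually aligned with the documents themselves.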

I started writing it in early 2016, first in C, until I remembered how painful error handling and cleanup are. And the C API for RocksDB was a wrapper around a C++ API, but wasn’t completely faithful with respect to errors. So I switched to C++. Then I remembered C++’s header madness. But I got the basics of the index and the path-syntax version of the query language written and working with RocksDB (I can’t say enough good things about RocksDB for this project. It’s well designed and fast).

I wrote a document shredder and indexer, lifted a Porter stemmer in C, wrote a query parser and runtime filters. I got a bare minimum prototype running.

And then I stopped. I had a health issue I needed to focus on. Don’t worry, I’m healthy now. But I wasn’t for almost a year.

Enter Volker Mische. Volker blew me away when, during an internship at Couch.io, he built a prototype indexing engine for CouchDB in about 2 weeks in Erlang, a language he barely knew. You don’t let talent like that get away, so I hired him ASAP for the startup, and he’d been working on GeoCouch for CouchDB and Couchbase ever since.

Last year during a visit to the US we got together to catch up. He wanted to build an LSM R-Tree into RocksDB. I showed him what I was building, also on RocksDB, and said his work would go great with mine: Noise could really use a geospatial query engine. Help me with the project and if things take off you’ll eventually get paid to build this out as a co-founder.

So he joined the project, which for me was like “Yes! No wait. Shit, now I’ve got Volker wrapped up in this mess too.” But it’s been great, he’s been pushing me as much as I’ve been pushing him.

He wasn’t happy about my choice to write Noise in C++. I had originally talked about writing things in Rust.

But then a funny thing happened: I tried to learn Rust. And it was confusing as hell. I think coming to Rust I expected it to be more C++ish. All I could think was that this thing was going to have a damn hard time going mainstream. So I dropped it. I’d had a lot of frustration with Erlang in the past as too niche and different (though I loved the weirdness), suffering for lack of tooling, and all the profiling and VM tuning you need to do seemed to make little difference in performance. And I spent weeks tracking down a crashing race condition bug in the VM. Those memories still fresh, I was afraid Rust would meet a similarly obscure fate.

But Volker didn’t program in C++. He wanted to write the project in Rust. So he started porting my C++ code over to Rust. Ok, I didn’t like it, but I won’t stop an engineer working his way. Volker has strong convictions. And I’m glad he does.

When he came to visit for 3 weeks last November (I had him staying in the garage of my ex-wife!) he showed me his progress so far. He had ported everything over except the query parser and query filters. So while he was here we set 2 goals for the trip:

1. Finish porting work. 100% Rust.

2. Design a full query language syntax and semantics.

We worked side by side, pair programming. Him usually at the keyboard, me watching and explaining the C++ code, and the two of us translating it line by line.

Then we fought with the borrow checker, for like a whole day or more. We couldn’t figure out the right way to annotate the lifetimes until we did some weird thing of passing a struct by value and returning the same struct by value, instead of just passing it in by reference. If that makes no sense to you don’t worry, it still makes no sense to me. But it worked.
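For the curious, here’s a contrived sketch of the shape of that workaround (the struct and function are made up for illustration, not the actual Noise code). Instead of lending the struct out by reference, which forced us into lifetime annotations we couldn’t get right, the function takes ownership and hands it back:

```rust
struct Filter {
    terms: Vec<String>,
}

// Take the struct by value and return it by value. No borrow outlives
// the call, so there are no lifetimes to annotate at all.
fn add_term(mut filter: Filter, term: &str) -> Filter {
    filter.terms.push(term.to_string());
    filter // ownership moves back to the caller
}

fn main() {
    let mut filter = Filter { terms: Vec::new() };
    filter = add_term(filter, "modern");
    filter = add_term(filter, "design");
    assert_eq!(filter.terms, vec!["modern", "design"]);
    println!("{:?}", filter.terms);
}
```

The move in and move out is semantically a no-op to the caller, but it sidesteps the borrow checker entirely.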

We just barely accomplished the goals, but it was a crazy productive 3 weeks and I remembered how great it is to work with brilliant people. It lit a fire in me.

So we had the full text search stuff working for exact string match and term match and the boolean && operator. But that’s it.

We also designed a new syntax that looked much more like the JSearch stuff. Volker got to work writing a new PEG grammar for the new language, and I started investigating the details of how to get comparison operators (> >= < <=) working for numbers and text. And I discovered I was very wrong about how hard it would be. The more I thought about it, the more I realized it was incompatible with how I was doing the full text search. Oops. What now? Wait: Volker’s R-Tree stuff, it turns out, is a perfect fit for the project. Each piece of data is just a value on one dimension, with the sequence # of the document on another dimension. I thought it would work perfectly with the first_result(startdoc) and next_result() scheme of the query runtime filters.

On our next Skype meeting I explained to Volker where I had gone wrong with the comparison support. It took a lot of explaining, but eventually he agreed the original way we planned to do it wouldn’t work. Then I explained that the R-Tree work for geospatial support would actually be ideal for regular scalar values. I asked him a few questions about the semantics of how the trees work and everything lined up.

So he got busy working on the R-Tree backend for RocksDB (in C++, oh the irony!) and I continued pushing the Noise code forward.

Since last November I’ve implemented the new query language parser, returning document values (not just ids), proximity search, phrase search, sorting and aggregates (group, min, max, avg, sum, etc), bind variables (extracting array-nested values and objects that match criteria), relevancy scoring, term boosting, boolean or (||), logical not (!), a command-based test suite for the query language, and in-process Node.js integration.

I became really productive in Rust. It’s explicit and verbose, but it eliminates a huge class of bugs from C++ (buffer overruns, use after free, iterator invalidation) and even Java (race conditions), and the compiler gives useful error messages, so I can code quickly and confidently. And its performance is on par with C++.

The standard library is well designed, fairly complete and the online documentation is easy to navigate and understand.

For anything not in the standard library, the Cargo build and package system has an impressive number of modules that seem to be reasonably high quality. Or maybe I’ve just been lucky in my needs and choices.

A lot of people complain about the borrow checker in Rust. And I’m one of them. It’s not that I don’t think it’s great. I do. It’s the way you have to annotate lifetimes when the borrow checker can’t figure stuff out. I think I understand the borrow checker and lifetimes. I’ve been able to annotate them when necessary. But I also suspect my understanding of lifetimes is shallow and I’m missing some bigger truth. I’m really not sure. I definitely don’t understand why I need to annotate lifetimes sometimes. Fortunately it’s a rare occurrence so far.

The first challenge I faced was rewriting the query parser for the JSearch style query syntax. At first I wrote an easy-to-grok grammar using a peg parser. Then I used that grammar as a guide to write a recursive descent parser. As the language got larger, the parser got larger. What started out simple now feels complicated. Eventually the hand-coded parser and the peg grammar diverged and couldn’t easily be reconciled. I feel like the current parser code needs some simplification, but I’m not quite sure how.

The thing is, every non-trivial language parser seems arbitrarily complex, even as a grammar for a parser generator (I’ve written them in ANTLR and yacc/lex), especially once the action code is added. So maybe that’s just how it goes. With recursive descent the simple stuff is a little harder, but complex things are often much easier. The Noise parser can definitely use some improvements. In particular the error handling and messages need work: the parser should always show some context of the input where the parse error occurred.
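To show the shape of the technique, here’s a toy recursive descent parser, far smaller than the real Noise one, for single-letter terms combined with &&, || and parentheses (everything here is invented for illustration). Note how each grammar rule becomes one function, and how errors can carry the input position for context:

```rust
// Grammar: expr := and_expr ("||" and_expr)*
//          and_expr := atom ("&&" atom)*
//          atom := letter | "(" expr ")"
struct Parser { chars: Vec<char>, pos: usize }

impl Parser {
    fn new(input: &str) -> Parser {
        Parser { chars: input.chars().filter(|c| !c.is_whitespace()).collect(), pos: 0 }
    }
    fn peek(&self) -> Option<char> { self.chars.get(self.pos).copied() }
    // Consume the token if it is next in the input.
    fn eat(&mut self, s: &str) -> bool {
        let cs: Vec<char> = s.chars().collect();
        if self.chars[self.pos..].starts_with(&cs) { self.pos += cs.len(); true } else { false }
    }
    fn expr(&mut self) -> Result<String, String> {
        let mut left = self.and_expr()?;
        while self.eat("||") {
            left = format!("Or({}, {})", left, self.and_expr()?);
        }
        Ok(left)
    }
    fn and_expr(&mut self) -> Result<String, String> {
        let mut left = self.atom()?;
        while self.eat("&&") {
            left = format!("And({}, {})", left, self.atom()?);
        }
        Ok(left)
    }
    fn atom(&mut self) -> Result<String, String> {
        if self.eat("(") {
            let inner = self.expr()?;
            if !self.eat(")") { return Err(format!("expected ')' at position {}", self.pos)); }
            Ok(inner)
        } else if let Some(c) = self.peek().filter(|c| c.is_alphabetic()) {
            self.pos += 1;
            Ok(c.to_string())
        } else {
            Err(format!("unexpected input at position {}", self.pos))
        }
    }
}

fn main() {
    let ast = Parser::new("a && (b || c)").expr().unwrap();
    assert_eq!(ast, "And(a, Or(b, c))");
    println!("{}", ast);
}
```

Because && binds tighter than || simply by being a deeper rule, operator precedence falls out of the structure for free, which is part of why the complex cases are easy. The tradeoff is that every trivial rule still costs a function.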

When using a new language with new features, you aren’t always sure where and when to use them. One mistake I’ve made is over-using tuples as return types. Sometimes that’s really nice and readable, but sometimes it obscures the significance of the return values. With function input parameters you give each parameter a name, which helps greatly with understanding the significance of each one. But tuples only specify the types, so the significance of each tuple member is often lost unless you write explicit documentation, which is sub-optimal in many cases (it doesn’t show up in editor typeahead/hints). You can create a new struct type just for returning things, but that creates another place to look while trying to understand the code. Like I said, still trying to figure out when each is most appropriate.
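Here’s the tradeoff in miniature (the functions and fields are hypothetical, just to show the two shapes):

```rust
// Tuple return: the call site gives no hint which value is which.
fn parse_tuple(input: &str) -> (usize, bool) {
    (input.len(), input.starts_with('{'))
}

// Named struct return: each field is self-documenting,
// at the cost of one more type to go look up.
struct ParseInfo {
    consumed: usize,
    is_object: bool,
}

fn parse_struct(input: &str) -> ParseInfo {
    ParseInfo { consumed: input.len(), is_object: input.starts_with('{') }
}

fn main() {
    let (n, obj) = parse_tuple("{\"a\": 1}");  // what do n and obj mean here?
    let info = parse_struct("{\"a\": 1}");     // info.consumed, info.is_object
    assert_eq!(n, info.consumed);
    assert_eq!(obj, info.is_object);
}
```

The tuple reads fine at the definition but leaks meaning at every call site; the struct pushes the meaning to the call sites where readers actually are.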

One of the pieces of code I’m the least happy with is my implementation of relevancy scoring. I implemented tf*idf the same way Lucene does it, but the code for it ended up scattered around too many places. I have ideas how to fix that, but I’m not even sure I got the scoring completely right. It seems right and ranks things as expected, but I also suspect it’s a little off somehow. Scores seem lower than they should be. But again I’m not sure. What I need is to compare the scores with what Lucene gets for the same input documents and fields and with similar boolean queries. But I haven’t gotten around to that yet.
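For reference, the core of Lucene’s classic tf*idf similarity looks roughly like this, stripped of length norms and boosts (this is a sketch of the general formula, not the Noise implementation):

```rust
// tf: term frequency dampened with a square root.
fn tf(term_freq: u64) -> f64 {
    (term_freq as f64).sqrt()
}

// idf: rare terms (low doc_freq) get a bigger weight.
fn idf(num_docs: u64, doc_freq: u64) -> f64 {
    1.0 + (num_docs as f64 / (doc_freq as f64 + 1.0)).ln()
}

// One term's contribution to one document's score. idf is squared
// because it weights both the query side and the document side.
fn term_score(term_freq: u64, num_docs: u64, doc_freq: u64) -> f64 {
    tf(term_freq) * idf(num_docs, doc_freq).powi(2)
}

fn main() {
    // A rare term (in 1 of 1000 docs) outscores a common one (in 500).
    let rare = term_score(2, 1000, 1);
    let common = term_score(2, 1000, 500);
    assert!(rare > common);
    println!("rare: {:.3}, common: {:.3}", rare, common);
}
```

The norms and query-normalization factors I left out are exactly the kind of thing that can make scores come out lower or higher than another engine’s for the same input, which is why comparing against Lucene directly is the right test.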

But today I learned that Lucene switched over to BM25 as its default scoring system, and it’s regarded as generally superior. Soooooo I guess we’ll implement that one as well.

Another problem with the current code structure is the single LSM keyspace. I’m using key prefix characters to simulate multiple tables in the RocksDB LSM, which is fairly standard when using an LSM. But instead of creating a central place to map symbols to prefix characters, I just use the prefix character literals when constructing keys. This worked fine before I implemented scoring, when there were only a small handful of key types, but after I added scoring the number of key types doubled, and now the literals are confusing. I definitely need to refactor that code for clarity.
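The refactor is basically this: one central set of named prefixes instead of bare character literals at every key-construction site (the names, letters, and key layout below are invented for illustration; the real Noise keyspace differs):

```rust
// One central map of key-type prefixes, instead of scattered 'W'/'D'/'C'
// literals at every call site.
const KEY_WORD: u8 = b'W';  // inverted-index entries
const KEY_DOC: u8 = b'D';   // stored documents
const KEY_COUNT: u8 = b'C'; // counts used for relevancy scoring

fn make_key(prefix: u8, body: &str) -> Vec<u8> {
    let mut key = vec![prefix];
    key.extend_from_slice(body.as_bytes());
    key
}

fn main() {
    let key = make_key(KEY_WORD, "some.path#42#modern");
    assert_eq!(key[0], b'W');
    // A range scan over one "table" is a scan over keys sharing the prefix.
    println!("{:?}", String::from_utf8(key).unwrap());
}
```

With the constants in one place, adding a new key type means adding one line, and a reader can see the whole keyspace layout at a glance.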

I was about to implement fuzzy search and prefix search when I read the amazing story of how Lucene does fuzzy search. I started reading a bunch of Lucene code and figured out how it works (well, except for the really mathy parts. Read the story!). I found the Python script that emits the Java code, which uses another Python file from a spell checker library. I was planning to adapt it to emit Rust code. But ultimately I decided we had enough full text features for now. We need to get what we have into developers’ hands.

The current phase of the Noise project is mostly about developer ergonomics and solid design. We need to see if users find the syntax and semantics useful and intuitive, where they get confused, and what features they need before they can deploy. The best way is to actually deliver a working project and see what users think.

We decided that targeting Node.js integration would be good to give us exposure to mainstream developers. So I started working on Node.js bindings for Noise, to run in the same process with Node code.

I wasn’t familiar with Node.js or NPM, so my learning curve was steep. I had to hook Rust code up to V8, which meant figuring out how to build Rust code on demand within NPM and link it to Node.

Thankfully there is a really great Rust project called Neon which takes a lot of the work out of integrating Node and Rust.

But still there were a lot of challenges. The Neon project didn’t support async calls; it was strictly for calling into Rust and immediately returning an answer (though that might have changed since, I don’t know). I looked into implementing it myself, but it appeared to need a C++ compiler, then linking that to both Rust and Node native code, and my tolerance for build issues was already exceeded. So instead I figured out a way for long running queries to notify the main event loop when a result was ready.

I use a simple named pipe between the Node.js code and the background worker threads (the same approach I took when I implemented thread pools in MySQL using libevent). When Node needs to talk to Noise, it calls into Rust native code to put data into a mutex-protected “mail slot”, then pushes a single byte through the pipe, which wakes up a worker thread that gets the message out of the mail slot.

When the thread completes the action, it puts the result back into the same mail slot and sends a single byte back through the pipe, which notifies the Node event handler for the pipe so it can call into native code and retrieve the result from the mail slot.

I also had to do a bunch of work allowing multiple threads to access the same index instance, to allow for concurrent querying. The more instances of the index you open, the more threads can process queries concurrently.

But now I’m also second guessing that design decision. Maybe only one instance of the index should be open at a time, with an argument to specify the number of worker threads that should be spawned. That seems easier for the user.

Anyway, the Node integration is now functional and seems solid. I’ve also gained a newfound respect for Node. However, I’ve been told my Javascript coding style could use some work. That I shouldn’t use let and var in the same code base. Sheesh.

Anyway, while I was doing the Node work Volker dropped a bomb on me! The R-Tree wouldn’t be a good fit after all for the comparison operators, because of the way most number sets have a normal distribution. Or something like that. So instead he coded up comparison matches in a fairly simple brute-force way that for now is acceptably fast. We are looking into more advanced data structures to make it faster.

I then wrote the first draft of the query language and Node documentation (it sucked); Volker helped clean it up. Hoping for more help with that soon. MC Brown, where are you?

Most recently we’ve built a Try Noise tutorial site that lets you try Noise queries directly in your browser against a TV show dataset. You can find it here.

The homepage for the project is here.