Transcript

Lemire: My name is Daniel Lemire, I'm from the University of Quebec, and I'm going to talk to you about Parsing JSON Really Quickly. I'm one of the authors of what might be the fastest JSON parser in the world, and so I'm going to try to tell you about the strategies we've used and give you some examples of the tricks that we've been building up to make this possible.

How Fast Can You Read a Large File?

I'm going to start with a relatively simple and naïve question to motivate my talk. If I give you a relatively large file, and I ask you to read it in software, and then I ask you what is your limit, are you limited by the performance of your disk, or are you limited by the performance of your processor? I would guess that most people would say that you're strictly limited by your disk. I could reframe this same question by talking about network.

I would argue that the story is a little bit more complicated than people sometimes think. In preparing this talk, I just benchmarked the disk on my iMac, which is just a basic stupid iMac on my desk, on campus. My disk was rated at 2.2 gigabytes per second as a throughput. Of course, there are better disks out there and network adapters that are much faster also.

Let's compare with how quickly we can do a very naïve task in software. Let's say, for example, that I take a relatively large text file, and I put it in memory. My memory is probably not going to be much of a bottleneck because I can easily read tens of gigabytes per second from memory. Let's not worry about the memory, and let's just go through the lines in my file and add up their lengths. This is a stupid benchmark, but the point is to illustrate just about the simplest thing you could do with this sort of data.

If I benchmark it using a fairly standard CPU, I get a fraction of a gigabyte per second in speed. If I were to do the same thing with my input coming from disk, I would be entirely CPU bound in this case, if the disk is all mine and the file is large enough. You can switch to C++, and again, I'm only using the standard APIs rather naïvely. In this case, at least in my test, I do slightly better, so I break the 1-gigabyte-per-second barrier, and it's basically the same silly test. My point here is not that these things couldn't be done much faster - certainly, whether you're in Java or in C++, you can beat these numbers. Here I'm just trying to illustrate that if you're writing just standard code, you're probably going to have some trouble ingesting files at gigabytes of data per second.

Nobody here will say this, but some people could say, "It doesn't matter because your processor or your core is going to get much faster next year or in two years." That's what people used to say in the '90s. Now, of course, we all know that our cores are not getting much faster with each passing year. If we're going to reach gigabytes per second processing data, we better have good software to do it.

JSON

Of course, nobody cares about parsing lines of text and counting their lengths, that's kind of a silly problem, but people do care about things like JSON. I would assume that most people here have heard of JSON, hopefully. It's a fairly standard thing, it's well established, it's fairly simple. You have arrays, strings, numbers, and it's all glued together with some text. It's all over public APIs and so forth, and entire systems are even built around producing and parsing JSON.

What I hear a lot - not from everyone, but from enough people - is that they have all this cool AI stuff, but their servers are just spending all their time producing JSON and parsing JSON. I see a few people laughing, so maybe this happens to you. It's kind of silly if you think about it, but that's life.

Here I'm going to focus on JSON parsing. It's maybe important to define what I mean by JSON parsing because it could mean different things to different people. What I mean here is, you read all of the content, you check that it is valid as per the specification, you check, for example, that the encoding is correct so you have proper strings. You parse the numbers, and you build some kind of tree data structure, so a document object model. Arguably, this is a little bit harder than parsing lines in a text file, so it should be a little bit slower. How quickly can you go? If I pick what might be one of the very best JSON libraries in Java - Jackson - and if I use a pretty good benchmark file that I've picked up somewhere on the web, that I'm going to use throughout this talk - it's twitter.json, it was collected from the Twitter API, and it contains numbers, Unicode strings, and so forth. It's a pretty good benchmark because it's an all-around typical JSON file. I find that Jackson can ingest the JSON file at about a third of a gigabyte per second in my case - of course, the results will vary depending on your hardware and your software, but I post my source code - so we're very far from maxing out our disk with a single core.

If we switch to C++, then a very good library is RapidJSON. If you're coding in C++, and you're processing JSON, and you want really good performance, you probably know about RapidJSON. You do a little bit better: in this benchmark, you reach about two-thirds of a gigabyte per second. Again, you're very far from maxing out the disk in this instance. The question is, can you do it? Can you parse JSON at gigabyte speeds? Of course you can, otherwise it wouldn't be much of a talk - no, I have a bit more to go.

We built this library, simdjson, that can achieve on this benchmark 2.4 gigabytes per second. It's actually the only library that I know of that can actually max out my disk. This means that on my test machine, roughly speaking - please don't do any math here, it's just an order of magnitude - it's about 1.5 cycles per input byte. This does not give you a lot of leeway; you cannot take each byte and start to think deeply about what to do with this byte and then switch to the next one, you only have 1.5 cycles per input byte. It's a little bit better than it sounds because our superscalar processors can do many things in one cycle, at least when they're not doing silly random access work in memory or other such things. When you're CPU bound, you can certainly beat one instruction per cycle, so this is a bit better than it sounds.

Avoid Hard-to-Predict Branches

How did we do it? I'm going to cover a few basic strategies that probably most people here know, but I'm going to go a little bit more deeply into them, and then I'm going to show how we apply them. This was mentioned today; several times, people talked about measuring mispredicted branches and so on. I'm going to go deeper into this, and actually say that you should really work hard to avoid hard-to-predict branches. I'm going to work from an example.

Let's say, for example, you're trying to write random integers to an array. You do a silly loop, and if your random integers are generated using a fast, state-of-the-art pseudo-random number generator, you might be able to do this task using about, say, three cycles per number generated. Let's say you modify this task. This code here is silly, but it's just to illustrate my point. You say, "I'm going to do the same work, but I only want to write out odd numbers." I do it in a silly manner: I generate my random integer and then I check whether it's odd. This is very fast, I only have to check the least significant bit, and if it is odd, I'm going to write it out.

If you're a bit naïve about it, and if you're doing textbook computer science, you might look at this and say, "It's not necessarily any slower, it might even be faster than the previous code because you're not writing as much to the output array." Actually, it's massively slower, because what's happening here is that a modern processor relies on branch prediction. Each time the processor sees a branch, it tries to predict its outcome, several cycles ahead of time. Then it does the computation based on this prediction. If the prediction is wrong, then it needs to throw away all of this work and start again back at the point where it mispredicted. I carefully designed my silly benchmark so that it would be clearly very difficult for the processor to correctly predict the branch, because it's got a random number, and how can it predict whether it's odd or not?

Thankfully, I can rewrite the same benchmark without a branch, and typically this is almost always possible. What I'm doing here is that - this is a typical trick - I'm always writing out a random integer to the array, but I only increment my index when the integer is odd. In this code, there's no more branch; actually, there's a branch due to the while loop, but inside the while loop, there's no branch. I've basically transformed the problem with the branch into one where it's just arithmetic, and the performance is back to nearly what it was originally.
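The trick described above can be sketched like this in C++. This is my own minimal illustration, not the speaker's benchmark code; the function names are made up, and any odd-valued input stands in for the random numbers.

```cpp
#include <cstdint>
#include <cstddef>

// Branchy version: write only odd numbers; the data-dependent `if`
// is hard to predict when the inputs are random.
size_t write_odd_branchy(const uint64_t* in, size_t n, uint64_t* out) {
    size_t idx = 0;
    for (size_t i = 0; i < n; i++) {
        if (in[i] & 1) {        // hard-to-predict branch
            out[idx++] = in[i];
        }
    }
    return idx;
}

// Branchless version: always store the value, but only advance the
// index when the value is odd. The branch becomes arithmetic.
// (Note: `out` must have room for n values, since every value is
// written somewhere before possibly being overwritten.)
size_t write_odd_branchless(const uint64_t* in, size_t n, uint64_t* out) {
    size_t idx = 0;
    for (size_t i = 0; i < n; i++) {
        out[idx] = in[i];       // unconditional store
        idx += in[i] & 1;       // advance only for odd values
    }
    return idx;
}
```

Both functions produce the same output; only the second avoids the data-dependent branch inside the loop.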

What happens quite often is that when you make this point to people, they write a little benchmark, and then they say, "My code with the branch is actually faster than your branchless code." Here's one scenario that might explain what happens. If I take my code with the branch, the one I showed you, the silly one - by design, I'm using a pseudo-random number generator, which means that I can repeat the benchmark many times, but it's always going to be the same random numbers. To try to fool the processor, I can use a loop that has 2,000 iterations. If I do that, and I repeat and repeat the iterations, and I plot the misprediction rate, initially, during the first iteration, I've got a 50% misprediction rate. Very quickly, the processor adapts and actually learns - and I'm not kidding - it learns the 2,000 predictions, and it can learn them pretty well. In this example, with my relatively old Skylake processor, it takes a really long time before it falls down to 1%, but you see after 50 trials, I'm down to 5%. If you use the new fancy AMD processors, in the same time it falls down to 0.1%. Apparently, there's something about deep learning or whatever, neural networks; it's not deep learning, but anyhow, it's a very good branch predictor.

One problem is that it's really hard to benchmark code with branches, because processors do all sorts of crazy things to fool you. Another fact, which I'm only mentioning briefly here, is that sometimes, by adding a branch, you can worsen the branch prediction elsewhere. It can depend on your processor, but lots of processors predict branches based on history - the history of which branches were taken - and if you're adding a branch, then this history gets more complicated to learn, and so it can worsen things, even if the new branch that you've introduced is predictable. Basically, branches can have bad effects, and these bad effects are not so easy to measure.

Use Wide "Words"

I said earlier that I only have about 1.5 cycles per byte. This means I cannot go byte by byte when I process my input. I need to go with wide words, so maybe 64-bit words, or, when possible, I should be using SIMD instructions. SIMD instructions have been around for a long time, they go back to the Pentium 4. They were first introduced with multimedia as the motivation, sound for example. Now, people invoke machine learning and deep learning [inaudible 00:18:00] but it's the same story.

What they do is add wider registers, so the normal general-purpose registers are 64 bits on most processors, but then they add 128-bit, 256-bit, and even 512-bit registers. They also add new fancy instructions, like really fast lookup tables. Basically, the story goes like this: your mobile phone, your iPhone for example, has NEON instructions, which use registers that span 128 bits, the same as legacy x64 processors. The more recent processors you can buy now for your servers use AVX and AVX2, so they use 256-bit registers. The fancy new processors from Intel go up to 512-bit. For our work on simdjson, we use the first three types of systems. We do not yet go to AVX-512. Part of it is that it's not widespread yet.

To program with SIMD instructions, the approach that we've been using for simdjson is to use intrinsic functions. These are special functions that often map to one very specific instruction on the processor you're targeting. There are higher-level APIs, higher-level functions - in Swift, in C++, and you've got the Java Vector API that is along these lines - but we don't use those. You can also rely on compiler magic: the compiler, whether it's Java or C, can take a loop, for example, and vectorize it; it's like magic. You can use optimized functions, Java has some of them. Or, for example, when you're using a crypto library, these guys typically write all of their code in assembly, which I don't recommend because it gets a little bit difficult.

Avoid Memory/Object Allocation

Another trick - again, nothing revolutionary - is that you should avoid memory and object allocation as much as you can. In simdjson, we use what we call a tape. When you're parsing the JSON document, everything gets written to one tape that's reusable, so whenever we encounter a string, we don't allocate memory for the string; whenever we encounter a number, we don't allocate memory for the number. Everything gets written consecutively.
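To make the idea concrete, here is a hypothetical sketch of such a tape, not simdjson's actual layout: one flat, reusable buffer that values are appended to, with strings referenced by offset and length into the input rather than copied into fresh allocations.

```cpp
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical sketch of a reusable "tape": every parsed value is
// appended to one flat buffer instead of being allocated on the heap.
struct Tape {
    std::vector<uint64_t> cells;   // reused across documents

    void clear() { cells.clear(); }   // reuse the capacity, don't reallocate

    // Numbers are stored in place as raw 64-bit payloads.
    void write_number(double d) {
        uint64_t bits;
        std::memcpy(&bits, &d, sizeof(bits));
        cells.push_back(bits);
    }

    // Strings are referenced by offset/length into the input buffer,
    // so no per-string allocation is needed.
    void write_string(uint32_t offset, uint32_t length) {
        cells.push_back((uint64_t)offset << 32 | length);
    }
};
```

Parsing a second document just calls `clear()` and reuses the same storage.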

Measure the Performance

Another strategy that we use is that we measure the performance a lot. We do what I would call benchmark-driven development. It might not be practical, but it's fun. Here's a plot of our performance on the twitter.json file on one specific machine over time. On the x-axis, you've got the commits, and on the y-axis, you've got throughput. Here I'm cheating a little bit because the y-axis does not start at zero, but I just wanted to show what it looks like. You can see we have these big jumps when someone finds a new, clever way to do things. What's interesting is, in our first public release, we reached 2 gigabytes per second, and we thought we were pretty clever. Now we're at 2.4, and I think we're going to go higher. I'm planning to go to at least 2.5, if not more. Also, we use a performance test in our continuous integration. That's a can of worms in itself, but we try to detect very quickly commits that cause a major problem on one type of system.

This is almost an aside, but here is a point that I find I often have to make. If you're doing CPU-intensive work - I'm not talking about accessing data in RAM or something like that - if you're doing processor-intensive work, then you have to worry about the fact that, no matter what you think, your processor frequency is probably not constant. Especially if you're working on a nifty new laptop that's thin, it's probably not constant. If you want to measure performance seriously, then you probably don't want to equate time with the number of CPU cycles. This was mentioned today: you probably need to use performance counters from your CPU if you're serious about it.

Example 1. UTF-8

Let me go into specific examples of what we do. One problem that we have when we want to parse JSON is that the input, at least on the web typically, is Unicode, so UTF-8, and we want to check that the bytes are actually UTF-8. UTF-8 is an extension of ASCII, so ASCII is valid UTF-8, but it adds extra code points that span 2, 3, or 4 bytes. If you want to write in Klingon, for example, you're going to have to use more than 1 byte.

There are only about 1.1 million valid code points, everything else is garbage. Of course, you want to stop it; you don't want to ingest strings that are not valid, because then it's going to end up in your database, and maybe even show on your website, and God knows what. You want to stop it right there. Typically, the way people validate Unicode is with code like this. This is not actual code, I took real code and then I simplified it because it's much longer, but basically it's a bunch of branches. This works fine if your input is ASCII, because you've got one predictable branch and everything is fine, but the minute you start hitting non-ASCII Unicode, then you've got branch mispredictions all over. You can avoid the branch mispredictions by using a finite state machine, but it's complicated. You can do even better than this: you can use SIMD instructions, so you load 32 bytes of data, you use about 20 magical instructions, and then you've got no branches and no branch mispredictions. I don't have time to go into what these 20-something instructions are. Actually, I know of three different strategies that end up with just about the same instruction count, but I'm just going to illustrate it.

For example, in UTF-8, in the standard Unicode that we see on the web, no byte value can be larger than 244. The way we check this - you could just do a comparison, but we like to do a saturated subtraction. Basically, we take the byte value and we subtract 244, and obviously, if the result is not zero, then you have a value that's greater than 244. Otherwise, because of the saturation, it goes to zero. Saturation just means that it doesn't wrap around; it goes to zero if the result would be negative.
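A scalar model of this check, for illustration (the real thing applies the same saturating subtraction to 32 bytes at once with a single SIMD instruction; the function names here are my own):

```cpp
#include <cstdint>
#include <cstddef>

// Saturating subtraction clamps at zero instead of wrapping around.
// Subtracting 244 from every byte therefore leaves zero for legal
// UTF-8 byte values and a nonzero result only for bytes > 244.
static inline uint8_t saturating_sub(uint8_t a, uint8_t b) {
    return (a > b) ? (uint8_t)(a - b) : 0;
}

bool has_byte_above_244(const uint8_t* bytes, size_t n) {
    uint8_t accum = 0;
    for (size_t i = 0; i < n; i++) {
        accum |= saturating_sub(bytes[i], 244);  // nonzero only if byte > 244
    }
    return accum != 0;   // any nonzero byte means invalid input
}
```

The branchless accumulate-then-test shape is the same one the vectorized version uses: do the subtraction everywhere, OR the results together, and check once at the end.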

This can be written in code using one of these intrinsic functions I was talking to you about. This avoids assembly, but it's super ugly. In this case, with this one function, I can check 32 bytes at once, so it's really efficient, really fast. Then I could go on for about an hour to explain how everything else falls into place, but let me jump to the results. Compared to branching, if I have an input that's random UTF-8, I'm about 20 times faster using SIMD instructions than I am using branching, so, much better.

Example 2. Classifying characters

Let me work out another fun problem that's more closely related to JSON. In JSON, we have what they call "structural characters" - I think it's the specification that calls them that - so the comma, the colon that separates keys and values, and then the braces and brackets that delimit objects and arrays.

Then you have white-space, and basically, outside the strings, you cannot have much else. You have the atoms, you have the numbers, but the structure is given by these characters. You want to identify them, but you don't want to identify them one by one, it's too slow. We're going to build a lookup table approach, so what we do is take each byte value and decompose it into two nibbles. A nibble is 4 bits, so the least significant 4 bits I'm going to call the low nibble, and the most significant 4 bits I'm going to call the high nibble.

I'm going to use the fact that, whether you're using ARM, or Intel, or AMD, you have fast instructions that can do table lookups as long as the tables are relatively small. Let me give you an example of what these instructions are capable of. I start with an array of 4-bit values, so nibbles, and I create a lookup table. Here, for simplicity, my lookup table is just the numbers from 200 to 215, but it could be entirely random. Of course here, I could just add 200 to each nibble, but I wanted something people could follow. What I want to do as a task is map 0 to 200, 1 to 201, and so on. This task can be done in one instruction. It's a really fast instruction on most processors.

That doesn't give me the character classification I was talking about, but the recipe is actually quite simple. I need two lookups. I take all my low nibbles, and I look them up in one table. Then I take all my high nibbles and I look them up into another table, and then I do the bitwise AND between the two. Then I choose my lookup tables carefully, so that the comma ends up mapping to 1, the colon to 2, the brackets to 4, and the white-space characters to 8 and 16.
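Here is a portable scalar model of that two-table recipe. The tables below are ones I derived myself to match the classes just described (comma → 1, colon → 2, braces/brackets → 4, space → 8, tab/LF/CR → 16); simdjson's actual constants may differ, and the real version does 16 such lookups in parallel with SIMD table-lookup instructions.

```cpp
#include <cstdint>

// Scalar model of the SIMD classifier: two 16-entry tables, one indexed
// by the low nibble and one by the high nibble; the bitwise AND of the
// two lookups yields the character class. Tables derived so that bytes
// outside the structural/whitespace set always AND to zero.
static const uint8_t lo_table[16] = {
    8, 0, 0, 0, 0, 0, 0, 0, 0, 16, 18, 4, 1, 20, 0, 0
};
static const uint8_t hi_table[16] = {
    16, 0, 9, 2, 0, 4, 0, 4, 0, 0, 0, 0, 0, 0, 0, 0
};

uint8_t classify(uint8_t c) {
    return lo_table[c & 0x0F] & hi_table[c >> 4];
}
```

For example, ',' is 0x2C: the low-nibble lookup gives 1, the high-nibble lookup gives 9, and 1 AND 9 is 1, the class for a comma; 'a' is 0x61, and both lookups miss, giving 0.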

I'm going again to show you terrible, really ugly code; in this case, it's the implementation using ARM NEON with intrinsic functions. It looks really scary, but it's not. At the top, I basically define my constants, which are my two lookup tables. Then my five instructions are given below. The first two instructions extract the high and low nibbles. Then instructions three and four are just lookups, and in the last instruction, I do a bitwise AND. In five instructions, in this case, I can classify 16 characters without any branches whatsoever, it's super fast.

Example 3. Detecting Escaped Characters

Here's a fun one. You all know that if you want to put a quote inside a string, you need to escape it, which means that you have to add a backslash before it. That backslash itself also needs to be escaped, of course, so it's "backslash, backslash." If I've got "backslash, quote," then it gets really confusing because it's "backslash, backslash, backslash, quote," and I could keep going. In practice, this means that you could get a JSON input that looks like this. Can you tell where the strings start, where they end? You don't know, so it's really hard to figure out the structure from this.

There's a trick, actually. If you've got an odd number of backslash characters before a character, then this character is escaped. If you've got an even number, then you don't need to worry, because an even number of backslash characters just maps back to a series of backslashes. Let me give you an example of how we go about it. I go back to my input string, I identify the backslashes, so I map them to a bit set: everywhere I've got a backslash I put a one, otherwise I've got zeros. Then I'm going to define two constants. You'll see where I'm going with this, or maybe not, but it will be fun. I've got one constant where I put a one at every even index, so 0, 2 and so forth. Then finally, I have a constant where I've got a one at every odd index, so 1, 3 and so forth. Then I plug these into the formula at the very top. I'm not going to explain it. A student of mine asked me, "How did you get that?" I said, "Lots of hard work." You've got this formula here, you can just try it out at home, it's fun.
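The formula isn't shown in the transcript, but a version of the odd-backslash trick appears in the published simdjson paper (Langdale & Lemire); here is a sketch along those lines, with the intermediate steps named and commented. Bit i of the input corresponds to byte i of the text.

```cpp
#include <cstdint>

// Given a bitmask with a one wherever the input byte is a backslash,
// return a bitmask with a one wherever a byte is escaped by an
// odd-length run of backslashes ending just before it.
uint64_t find_escaped(uint64_t backslashes) {
    const uint64_t even = 0x5555555555555555ULL;  // ones at even indexes
    const uint64_t odd  = 0xAAAAAAAAAAAAAAAAULL;  // ones at odd indexes
    // A backslash with no backslash right before it starts a run.
    uint64_t starts = backslashes & ~(backslashes << 1);
    // Runs starting at an even index: adding the start bit makes a
    // carry ripple through the run and pop out on the byte after it.
    uint64_t even_carries = (backslashes + (starts & even)) & ~backslashes;
    // The carry lands on an odd index exactly when the run length is odd.
    uint64_t odd_len_from_even = even_carries & odd;
    // Same reasoning for runs starting at an odd index.
    uint64_t odd_carries = (backslashes + (starts & odd)) & ~backslashes;
    uint64_t odd_len_from_odd = odd_carries & even;
    return odd_len_from_even | odd_len_from_odd;
}
```

So for the input `\"` (one backslash at bit 0), the quote at bit 1 comes out marked as escaped; for `\\"` (two backslashes), nothing is escaped and the quote is a real delimiter.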

This actually does the right thing. In this case, it will identify the fact that my quote character right there is escaped, so it's not actually a string-delimiter quote. It sounds really painful, there are lots of instructions there, but again, no branches whatsoever. If I remove the escaped quotes, then the remaining quotes tell me where my strings are. I can just identify all my quotes, so I put a one where I've got a quote, and I knock out my escaped quotes. Then whatever remains are my string-delimiter quotes. It's important for me to identify where my quotes are when I'm parsing JSON because, if I want the structure of the document, I need to be sure that I remove all the braces, the colons, and so on that are inside strings, because they don't count. I want to know where my strings are.

I do a little bit of mathematical magic here: if I start with where my quotes are, and I want to turn this into a bitmap that indicates the inside of my strings, I can do a prefix XOR. I show it in the code: basically, I shift by one, I XOR with the original, I take the result, I shift it by two, and I XOR again, and so forth. This looks a little bit expensive, but you can actually do this with one instruction on most processors. It's a carry-less multiplication; it's used for cryptography, so probably those of you who don't do crypto don't know about this instruction, but it can actually be really cheap. If you do this, you go from the locations of the quotes to the string regions. This means I can mask out any structural character that is inside a string, again, without any branching.
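The shift-and-XOR cascade described above can be sketched like this (the single-instruction alternative is a carry-less multiplication by an all-ones constant; this is the portable fallback):

```cpp
#include <cstdint>

// Prefix XOR: bit i of the result is the XOR of bits 0 through i of
// the input. Fed the bitmask of unescaped quotes, it turns on every
// bit from an opening quote up to (but not including) its closing
// quote, marking the string regions.
uint64_t prefix_xor(uint64_t x) {
    x ^= x << 1;
    x ^= x << 2;
    x ^= x << 4;
    x ^= x << 8;
    x ^= x << 16;
    x ^= x << 32;
    return x;
}
```

For quotes at bits 1 and 5 (mask 0b100010), the result is 0b011110: bits 1 through 4 are "inside a string," and everything after the closing quote is outside again.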

If you follow all of my examples, if you put them together in your mind, you realize that the entire structure of the JSON document can be identified as a bitset - so with the ones and so forth - without any branch.

Example 4. Number Parsing Is Expensive

At this point, you're going to have to go from the bitset to the locations of the ones. I've got a nice trick on how to do this, but let me jump ahead because the time is a bit short; maybe I can come back to it later.

Another problem we have is that number parsing is surprisingly expensive. If you take some Java code that has to ingest data in text form, and you try to benchmark it, the time spent parsing the floating-point numbers is totally crazy. I built a little benchmark: I generated random floating-point numbers and wrote them to a string in memory. Then I went back and tried to read back these floating-point numbers using a well-optimized C function, strtod. I reached the fantastic speed of 90 megabytes per second, which is certainly much slower than your disk, I would hope.

I'm basically spending 38 cycles per byte, and I have a total of 10 branch misses per floating-point number. This is not fun, and this is going to end up being a bottleneck. Basically, you have to use either a fast floating-point parsing library that someone wrote or, like we did, you write your own and you hope for the best.

I'm going to go back a little bit to my original strategy, which said, "Let's try to use wide words." Again, you have the same problem. If you look at most code that parses numbers, it does it byte by byte, and this is, of course, never going to be super fast. I'm going to do some mathematical magic, and I'm not going to explain the formula. This is a formula I actually came up with, probably working late on a Saturday night. If you give me eight ASCII characters, I take them and map them to a 64-bit integer, I just copy them over. Then I apply this little formula, and it's going to give me true when my characters are specifically eight digits. Why is that important? Because very often, the numbers that are expensive to parse are numbers that are made of lots of digits. Some people really like to throw in a lot of precision when they serialize their data, and it is expensive to parse back. Very often you want to speculate and say, "Ok, maybe here I've got eight digits," and you want to be able to check it really quickly. In this case, I'm able to use this formula, it's really cheap, only a few instructions, and I know right away whether I have eight digits. If I do, then I've got a function to sell to you.
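The formula isn't reproduced in the transcript; a check along these lines appears in the simdjson source, and here is my sketch of it. A digit byte is 0x30 to 0x39: its high nibble must be 3, and adding 6 must not carry into the high nibble.

```cpp
#include <cstdint>
#include <cstring>

// Check whether eight consecutive characters are all ASCII digits,
// using one 64-bit load and a handful of arithmetic operations.
bool is_eight_digits(const char* chars) {
    uint64_t val;
    std::memcpy(&val, chars, 8);   // reinterpret 8 bytes as one word
    // First term: the high nibble of every byte. Second term: after
    // adding 6, a byte above '9' carries into its high nibble, so the
    // shifted high nibbles reveal it. All-digits gives 0x33 per byte.
    return (((val & 0xF0F0F0F0F0F0F0F0ULL) |
             (((val + 0x0606060606060606ULL) & 0xF0F0F0F0F0F0F0F0ULL) >> 4))
            == 0x3333333333333333ULL);
}
```

For example, ':' is 0x3A, so its high nibble passes the first term, but 0x3A + 6 = 0x40 carries, and the combined byte becomes 0x34 instead of 0x33, failing the test.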

This one I did not invent, I picked it up on the web somewhere, Stack Overflow probably. There's credit for it in the source code. What you want to do is take these eight digits and turn them into an integer. You don't do it character by character; you actually do three multiplications and a few arithmetic operations, and that's about the sum total of it. I'm not going to go into everything else we do for parsing, but this gives you an idea of the strategy.
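The three-multiplication trick referred to is a known SWAR technique; here is a sketch of it. It assumes a little-endian byte order, and the function name is mine.

```cpp
#include <cstdint>
#include <cstring>

// Turn eight ASCII digits into an integer with three multiplications.
// Each step merges adjacent groups: single digits into two-digit pairs,
// pairs into four-digit groups, then groups into the final value.
// Assumes little-endian byte order.
uint32_t parse_eight_digits(const char* chars) {
    uint64_t val;
    std::memcpy(&val, chars, 8);
    // 2561 = 10*256 + 1: each byte becomes 10*digit + next digit.
    val = (val & 0x0F0F0F0F0F0F0F0FULL) * 2561 >> 8;
    // 6553601 = 100*65536 + 1: merge pairs into four-digit groups.
    val = (val & 0x00FF00FF00FF00FFULL) * 6553601 >> 16;
    // 42949672960001 = 10000*2^32 + 1: merge the two four-digit groups.
    return (uint32_t)((val & 0x0000FFFF0000FFFFULL) * 42949672960001ULL >> 32);
}
```

Each multiply-and-shift does several digit merges in parallel inside the one 64-bit word, which is exactly the "wide words" strategy from earlier applied to number parsing.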

Runtime Dispatch

When you write Java code, or anything that runs on a JIT and so forth, you often don't have this problem of not knowing the hardware you're running on. We are doing optimizations that are really low-level, that are specific to the hardware. We do some optimizations for processors that have 256-bit registers. Then, to support legacy hardware, we also need a version for 128-bit registers, so we basically need two functions for each task. Then we also support ARM, but that's a little bit easier because the instruction set there is a little bit more stable. On Intel and AMD processors, we have to support two code paths. The way we do it is fairly standard. We basically build and compile two functions, and then we do what they call runtime dispatch. The first time the parsing function is called, it actually calls a special function that should be called only once. There's concurrency involved, but I removed it from the code for simplicity. You check the features of the CPU, and then you branch depending on the function you want to use.
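A minimal sketch of this pattern, with the concurrency and the real CPUID check omitted just as in the talk; all the names here are made up, and the two implementations are stubs that just return a marker value:

```cpp
// Runtime dispatch through a function pointer: the first call lands in
// the dispatcher, which picks an implementation and overwrites the
// pointer, so every later call goes straight to the chosen function.
namespace {

int parse_avx2(const char* json) { (void)json; return 2; }  // 256-bit path (stub)
int parse_sse(const char* json)  { (void)json; return 1; }  // 128-bit path (stub)

// Stub CPU check: a real implementation would use CPUID here.
bool cpu_has_avx2() { return false; }

int dispatch_parse(const char* json);  // forward declaration

// The active implementation; starts out pointing at the dispatcher.
int (*parse_impl)(const char*) = dispatch_parse;

int dispatch_parse(const char* json) {
    parse_impl = cpu_has_avx2() ? parse_avx2 : parse_sse;  // decide once
    return parse_impl(json);  // forward this first call
}

}  // namespace

// Public entry point: no CPU checking after the first call.
int parse(const char* json) { return parse_impl(json); }
```

After the first call, `parse_impl` points directly at the selected routine, so subsequent calls pay only the cost of one indirect call.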

Then you basically reassign the function pointer. The next time, there's not going to be any CPU checking, because you're assuming that the person running your program is not switching the CPU under you, so you're going to call the right function right away. I've known about runtime dispatch for a long time, and everyone says it's super easy, but then when you ask around, you find out that few people have actually tried to implement it. When we tried to implement it using portable code as much as possible, so that it runs under Visual Studio, Clang, and GCC, we found it really hard to do. In part, it's because there are bugs in some of these compilers, and there was no good model for how to do it.

One of our objectives was to have a single header library. We don't know how people are going to build our code, so we don't want to depend on the build system, we really want the code to do all of the work. For this reason, this was a little bit hard. Obviously, if you're working in Java side then you don't have to worry about that because someone else is worrying about it for you.

simdjson is a free library, it's available on GitHub. It's a single-header library, so it's really easy to integrate; you can just plug it into your system. It's relatively modern C++. One of my students says that it's advanced C++. We support different hardware, so we support ARM. There's a co-author who actually wrote a version for Swift that acts as a wrapper for our stuff, and it beats Apple's parser. Then we support relatively old Intel and AMD processors.

It's under an Apache license, and there's no patent because I'm poor and stupid. It's used by reasonable people at Microsoft and Yandex. We have wrappers in Python, PHP, C#, Rust, JavaScript, Ruby. There are also ports, which is very exciting. There's a port to Rust, so there's a version that's written entirely in Rust, but apparently the keyword unsafe was used. There's an ongoing port to Go, and there's a C# port also. No Java port, sadly; that's missing.

We have an academic reference; it was published in the "VLDB Journal." I'm going to end with some credits. Not all of them, but a lot of the clever, magical algorithms with really crazy formulas were designed by Geoff Langdale, who's my primary co-author. Also, there are lots of contributors; this is GitHub, we're in 2019, so when you post something on the web, everyone comes to help you, sometimes.