Last year I stumbled upon FlatBuffers, a fast serialisation library developed at Google. I liked it so much that I decided to bring FlatBuffers to Swift. If you are curious about the details, here is a talk I delivered on this matter a couple of months ago.

I implemented a generic Builder and a Reader class which can write values into a byte array and read values from a byte array according to the FlatBuffers specification. I also wrote an Xtext-based code generator to generate classes from a FlatBuffers schema. I compared the reading and building performance to JSON serialisation and deserialisation and got some decent results. I even made a small showcase video, where I query a dataset of over 3 million cities in real time on an iPhone.

I put the FlatBuffersSwift project on hold for a couple of months, because I had to wrap up some other projects. A couple of weeks ago I was approached by Joakim Hassila, who had become curious about the project and asked if I was still working on it. It was a good push for me to revive my efforts, and we started creating issues and playing around with FlatBuffersSwift.

Even though I already had a small performance test, it was implemented merely to compare FlatBuffers against JSON. Google's FlatBuffers implementation has a benchmark which checks how fast FlatBuffers encoding and decoding is compared to serialising raw structs.

Joakim ported this benchmark test to Swift and started profiling. He found a few memory issues and slow paths that could be easily resolved.

Before I continue, I would like to explain what the test does. It encodes and decodes a small object graph (a bit more than 300 bytes) one million times. Here is a similar data set represented as JSON:
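The original sample is not reproduced here, but a payload of this shape could look roughly like the following sketch (field names follow Google's FooBar benchmark schema; the concrete values are illustrative):

```json
{
  "list": [
    {
      "sibling": {
        "parent": { "id": 123456, "count": 10000, "prefix": 64, "length": 1000000 },
        "time": 123456789,
        "ratio": 3.14159,
        "size": 10000
      },
      "name": "Hello, World!",
      "rating": 3.1415432432445543,
      "postfix": 33
    }
  ],
  "initialized": true,
  "fruit": 2,
  "location": "http://google.com/flatbuffers/"
}
```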

Now this kind of benchmark might feel artificial for app developers, but for a back-end developer it is actually a decent use case. Being able to decode thousands of small payloads is something that a server should be able to do very efficiently. And as the dream of Swift on the server is in the air, this is actually quite an interesting topic.

After cleaning up some obvious problems, we ended up with around 6 seconds for decoding and 3.6 seconds for encoding. Those are actually decent numbers if you take into consideration that we do something one million times. During profiling, what struck us first was that we spent a lot of time in retain/release calls and in deallocating objects. Profiling becomes much harder in such cases, because it is not directly your code. We started pooling instances and being smart about memory allocation. That gave us a 20 to 30% boost. However, it introduced some complications in the implementation.
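The pooling idea can be sketched roughly like this (a hypothetical `Pool` type, not the actual FlatBuffersSwift code): instead of allocating a fresh object for every decode, we hand back instances and reuse them, which cuts down on malloc traffic and retain/release churn.

```swift
// Minimal instance-pool sketch (hypothetical; not the real implementation).
final class Pool<T: AnyObject> {
    private var free: [T] = []
    private let make: () -> T

    init(make: @escaping () -> T) { self.make = make }

    // Hand out a pooled instance if one is available, otherwise allocate.
    func take() -> T { return free.popLast() ?? make() }

    // Return an instance to the pool for later reuse.
    func give(_ object: T) { free.append(object) }
}

// Usage: the second `take` reuses the instance we gave back.
final class Node { var value = 0 }
let pool = Pool(make: { Node() })
let first = pool.take()
pool.give(first)
let second = pool.take()   // same instance as `first`
```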

One thing that I remembered from my previous benchmarks was that converting a byte array to a String is a very slow operation in Swift. There are plenty of ways to do it, but they are all slow.
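For reference, here are two of the common conversions (both allocate and copy, which is what makes them expensive inside a hot loop):

```swift
import Foundation

let bytes: [UInt8] = Array("FlatBuffers".utf8)

// Via Foundation: returns an optional, nil if the bytes are not valid UTF-8.
let viaFoundation = String(bytes: bytes, encoding: .utf8)

// Via the standard library: repairs invalid sequences instead of failing.
let viaStdlib = String(decoding: bytes, as: UTF8.self)
```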

Abandoning the byte array to String conversion and keeping the strings as UnsafeBufferPointer<UInt8> gave us a huge boost. Decoding went from 6 seconds to 0.9 seconds. This was a big win. And by becoming even smarter about object allocation, we were able to get down to about 0.5 seconds.

At this point I became curious how it compares to the C++ implementation, so I took a look at the benchmark page from Google:

And here is where my morale was crushed. Google claims that they need only 0.08 seconds for the same benchmark. To make matters worse, I checked the C port of FlatBuffers, and its benchmark claims to run in 0.029 seconds.

Honestly, I could not believe it. I checked out the flatcc repo and ran the test myself: 0.025 seconds.

This was crushing. All the talk about Swift being as fast as or even faster than C seemed to be a myth. I started to wildly comment out code in the benchmark test and stumbled upon an interesting realisation. Casting a Bool to an Int like the following:

let i = Int(b)

is 30 ms slower than having an expression like this:

let i = b ? 1 : 0

This was interesting, but it still didn’t explain why Swift was 20 times slower than the C implementation.

Then I decided to take a step back. We saw that allocating memory and retain/release calls were dominating the profile. So why not just do the same bare-bones thing that C does: write a bunch of functions which read data from a byte array without allocating any objects.

The gist shows this implementation. The API is not nice, but it is really bare-bones. After I hooked it up into the benchmark, I was stunned.
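The core idea can be sketched like this (a hypothetical helper, much simpler than the actual gist): a generic function that reinterprets bytes in place, so reading a scalar allocates nothing at all.

```swift
// Hypothetical helper, not the actual gist: read a trivial scalar
// directly from a byte array at a given offset. No objects are
// allocated; the bytes are reinterpreted in place.
func fromByteArray<T>(_ bytes: [UInt8], at offset: Int) -> T {
    precondition(offset >= 0 && offset + MemoryLayout<T>.size <= bytes.count)
    return bytes.withUnsafeBytes { raw in
        // loadUnaligned (Swift 5.7+) avoids alignment traps for offsets
        // that are not a multiple of the scalar's alignment.
        raw.loadUnaligned(fromByteOffset: offset, as: T.self)
    }
}

// Little-endian hosts (x86, ARM) read these four bytes as 42.
let buffer: [UInt8] = [0x2A, 0x00, 0x00, 0x00]
let value: UInt32 = fromByteArray(buffer, at: 0)
```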

Decoding the small object graph one million times took 0.35ms. That is 0.00035 seconds.

I thought I was doing something wrong and started debugging the run. But everything seemed to be just fine. The code went through the object graph, added the values together, and returned the same number as I got from the other, “slow” runs.

I wrote Joakim that I needed a second pair of eyes on it :)

He reassured me that everything was legit, and that we could even use structs and not just bare-bones functions.

This gave us the nice API that we had before, but with the speed of calling bare-bones functions.
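The struct-based variant can be sketched like this (hypothetical names; the field offsets are illustrative and much simpler than the real FlatBuffers layout, which goes through a vtable). Because structs are value types, reading fields causes no heap allocation and no retain/release traffic:

```swift
// Hypothetical sketch: a value type wrapping the buffer and an offset.
struct FooBarReader {
    let bytes: [UInt8]
    let offset: Int

    // Illustrative accessors with fixed field offsets.
    var count: UInt16 {
        return bytes.withUnsafeBytes {
            $0.loadUnaligned(fromByteOffset: offset, as: UInt16.self)
        }
    }
    var rating: Double {
        return bytes.withUnsafeBytes {
            $0.loadUnaligned(fromByteOffset: offset + 2, as: Double.self)
        }
    }
}

// Usage: 2 bytes for count, then 8 bytes for rating (little-endian host).
let data: [UInt8] = [0x07, 0x00,                                     // count = 7
                     0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0xF8, 0x3F] // rating = 1.5
let reader = FooBarReader(bytes: data, offset: 0)
```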

To be honest, it is hard for me to say how exactly we ended up 70 times faster than C, but it is a measurable fact.

Now I can also pinpoint all these small, weird slow parts:

- String conversion. If I use byte array to String conversion, I move from 0.35ms to 1774.73ms. And if I do what the test needs to do (get the length of the string via `s.utf8.count`), I am at 2737.4ms, which is an additional second spent doing factually nothing. IMHO this is very worrying for Swift on the server, as web servers are nothing but string jugglers. In order to do this juggling efficiently, we need a lighter string.
- `Int(Bool)` conversion is very expensive. It moves the needle from 0.35ms to 79.92ms. Again, this is very mysterious. I can’t imagine what happens under the hood.

If you are curious, give it a spin yourself:

Maybe you will find out a way to make it even faster :)

Currently I am changing the code generator to leverage what we learned on this weird journey into performance optimisation.