About this talk

In this talk, we’ll see how Go’s garbage collector works (the tricolor algorithm), why it works (achieving such short GC pauses), and most importantly, whether it works (benchmarking these GC pauses, and comparing them with other languages).

Transcript

- [Jim] Hello. Right. I should mention before I start, if you haven't noticed, this is actually a comment. So I'm quite proud of that. If you want to embed it in any code, you can just copy and paste it. So I'm going to be talking... Well, Will and I are going to be talking about Go's real-time GC. So before I actually start I'll just tell you how we're structuring this talk and how this evening is going to pan out. So we've got about half an hour for this talk and we'll split it up into two sections: that's me, and then Will. I will be talking a little bit about Garbage Collection in theory: things like what is Garbage Collection anyway, why do we want it, how do you do it, what makes a good or bad garbage collector, and, in theory, how does Go's garbage collector work. And then I'll hand it over to Will, who will talk about this in practice. So we've done some work to actually benchmark this, compare it with other Garbage Collectors, and see whether the theory actually pans out in the real world. And then we've got some pizza arriving. So why are we talking about Garbage Collection anyway? Since this is Realtime Guild, what does that have to do with it? Well, I will start with a bit of backstory from Pusher here. So we're a real-time company, we make real-time systems. And last year we made a system which looks roughly like this, at a very high level. It's a message bus. I wanted to use this laser pointer to point at the screen, but we realized it's reflective. So my pointer is up here somewhere. So I'm just going to use my hand. We've got customers here who are pushing messages onto channels here, and we've got some clients here which are consuming those. Fairly standard. The important thing to us is that things get from here to here pretty quickly, so real-time. Something like 100 milliseconds or less. So we wrote this system in Haskell using GHC. And fairly late in that development we actually got to performance profiling of it, and we saw this.
It is probably not immediately interpretable, but it's bad news. So this is a trace of one run of our program, where you've got time along here, and these pink lines here are where we're sending messages in the system. So these blocks, these are all great, but you can see these blocks here where basically nothing useful is happening; we're not sending messages. And for our customers that's bad. You can see the time units along the top. Well, you probably can't see the time units along the top, they're quite small. But this is on the order of a few milliseconds in each of these pauses. And when we increased our heap size, so the number of messages that we were dealing with, these pauses got pretty big, like hundreds of milliseconds. And we didn't find any way to resolve that. So why was this happening? Well, the answer is garbage collection. So these orange blocks here are telling us that this is a garbage collector run. And to do that, the design in GHC is to pause the program, or process, then do some garbage collection, and then carry on. So let's back up a little bit more. What actually is Garbage Collection? Why is it pausing our process? I just want to do this useful work, so let me get on with it. Why is it useful? So at a high level, all programs that we write really create a sort of graph of objects. So you can do things like create an object, using something like "new." Or in Go you do something like taking the address of a literal thing. Or you can change pointers, so you can move them around from one object to another. And in doing so you have this rather unfortunate consequence that things like this happen, where objects stick around even though you can't access them. So our process is happily running here, but it's only actually aware of this stuff here, and this stuff here is actually just dead. So we want to remove it, and that's what the Garbage Collector is for. So how do we do Garbage Collection?
Well, there's loads of ways to do it, but this is a simple one called mark and sweep. So let me explain first how this mark-and-sweep collector works. So here's the graph again. The idea is that it steps through these phases here. So what we saw were these steps here, where occasionally we pause the process and then we unpause it again. And in the middle we do these steps, predictably called mark and sweep. A little animation here. Oh yeah. So, pausing the process, and then we walk across all of the accessible objects, marking them. And then we move on to sweeping, which just means removing the rest. Then we unpause and the program happily continues. So that sounds quite simple, nice. It is simple, but it has some downsides. Here are three of them. The main one that we're concerned about this evening is this: it pauses the process, and potentially pauses it for a long time. There are other considerations here, like the fact that it traverses the whole heap. So when it's marking things, it will traverse everything that it can find. And then when it's sweeping, it will traverse everything to find the things that aren't marked, and then sweep them away. Another problem is that it leaves the heap fragmented. So you can think of disk fragmentation as essentially the same thing: it leaves all of the objects in place where they were initially allocated. What you might want it to do is sort of shuffle them into a nice, tidy block of memory. It doesn't do that. So because we've got multiple concerns like those and others (simplicity, whether it's parallelizable, and these ones here which I'll talk about), we've got lots of trade-offs. So which Garbage Collectors are good? Well, there's not really any kind of hard definition of what's good, because there are all of these trade-offs. So let's plot here two important ones to us: we've got throughput and we've got latency. So what we saw with Haskell, or GHC to be specific, was that it actually gave us very good throughput.
So in our case that means something like the number of messages sent within a minute; it was good at that, but it gave us these quite bad latency metrics. So hundreds-of-millisecond pauses. At another point in this space is Go, which we'll talk about this evening. So here, on this green line, are supposed to be sort of good Garbage Collectors, different points in that kind of space. So Go optimizes for latency, so our use case makes us an ideal user for Go. In terms of throughput it actually has a little bit higher overhead, but we've found it to be very high throughput in reality. There are other points in this space. So these are basically just bad Garbage Collectors. And these are ones that are sort of good, but somehow cheat. So Python, it's not really fair to call it garbage collected, because it actually uses... Well, it uses a combination of reference counting and a Garbage Collector. So if you turn off the Garbage Collector and just use reference counting, you get great low latency, great throughput, but the price you pay for that is that it doesn't actually clean up all the garbage, potentially. Because if you create cycles of objects, then that's something that reference counting can't identify, so they just get left there forever. - [Man] Where are the JVM and the .NET ones in that? - I will defer that question to Will, since that is part of some of the comparison stuff that we did work on. Roughly, the answer is all over the map. - [Will] They give you a lot of options, basically. - Yeah. - And the JVM, for example, has a number of different Garbage Collectors and each of them are very [inaudible], as well. - Yeah. Yeah. So the one thing that it definitely scores low on is this one. Okay, so let's talk about how Go actually achieves that point in that diagram, how it gets that good, low latency that we want. And it starts off with the mark-and-sweep algorithm, the classic, simple mark and sweep.
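Before getting into Go's concurrent variant, the classic stop-the-world mark and sweep Jim describes can be sketched in Go itself. This is a toy model: the `Object` type, the `markSweep` function, and the slice standing in for the heap are all illustrative names for this sketch, not the runtime's actual implementation.

```go
package main

import "fmt"

// Object is a toy heap object with outgoing pointers.
type Object struct {
	name   string
	refs   []*Object
	marked bool
}

// markSweep is a stop-the-world sketch: mark everything reachable
// from the roots, then sweep, keeping only the marked objects.
func markSweep(heap []*Object, roots []*Object) []*Object {
	// Mark phase: walk every object reachable from the roots.
	var mark func(o *Object)
	mark = func(o *Object) {
		if o == nil || o.marked {
			return
		}
		o.marked = true
		for _, r := range o.refs {
			mark(r)
		}
	}
	for _, r := range roots {
		mark(r)
	}

	// Sweep phase: traverse the whole heap, discarding unmarked objects.
	var live []*Object
	for _, o := range heap {
		if o.marked {
			o.marked = false // reset for the next cycle
			live = append(live, o)
		}
	}
	return live
}

func main() {
	a := &Object{name: "a"}
	b := &Object{name: "b", refs: []*Object{a}}
	dead := &Object{name: "dead"} // allocated but unreachable
	heap := []*Object{a, b, dead}

	live := markSweep(heap, []*Object{b}) // b is the only root
	for _, o := range live {
		fmt.Println(o.name) // prints "a" then "b"; "dead" is swept
	}
}
```

Note how both downsides Jim mentions are visible here: the whole program stops while `markSweep` runs, and both phases touch the entire heap.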
And the strategy that it takes for eliminating those undesirable pauses is to run that concurrently with your process. So at the same time that your process is creating things and moving pointers around, you also have your garbage collector cleaning up after it, running essentially... concurrently, potentially in parallel. So is that actually possible, though? Because at first sight that sounds kind of tricky to think about. So as a precursor to explaining that, I am going to just explain breadth-first graph traversal. So this is a specific way to do mark and sweep. The two steps in your mark-and-sweep collector are mark and then sweep. In the mark phase, essentially what you're doing is starting from the things that the process can access, and then traversing outwards. And there are multiple ways that you can do that. One of those is breadth-first. So the way that breadth-first search works is to maintain what I like to think of as sort of a boundary. So this light gray set here. So it starts here and then expands outwards. So it starts with things that are close to the process. So it will start with this one, which is only one hop away, and this one, which is only one hop away. And it will gradually expand that out. And we can sort of think of this dark or black set as the scanned objects. You can think of this gray boundary set as the things to be scanned. And the things outside, those are the things that are just not scanned yet. So we've got a little animation of this in action. So if I go back, we've got one thing that is scanned. To scan something we pick it from the gray set. So here we picked this one and we said, "Let's scan that one." Which means moving it into our black scanned set, and then looking at all pointers out from this object and moving those into our gray sort of boundary set. And we just do that repeatedly. So here's another one, then we do this one, and presumably we do this one. Yes?
And then at this point you have no more things left in your gray set. So what do we do? Well, we just sweep the rest away. I should mention what this is. Zan told me this is the tricolor, what are they called, coordinated planes of some sort? It's the Italian flag. The reason is that I'm talking about an algorithm called the tricolor algorithm. Where, if we go back, we have three colors here. We have this black set, we have this gray set, and we have this white set, which is everything outside. So the tricolor algorithm is essentially taking that breadth-first search and making it concurrent with your process. So how does it do that? Let's just think about why breadth-first traversal actually works, and why you're able to sweep away everything once the gray set becomes empty. And the reason is that we maintain this invariant: that there are no pointers from this black inner scanned set to anything in this white outside unscanned set. So any pointers crossing this boundary would violate that invariant. So this is an example of something that doesn't happen in breadth-first search, because it's crossing that boundary. The reason this matters is that if you don't have anything crossing that boundary, then you know, once this gray set is empty, that this inner set contains everything that's accessible, so you can get rid of everything else. So if you run that sort of naively in parallel or concurrently with your process, you run into problems, because your process is moving things around in that graph under your feet. And in particular it can do things that break that important invariant. So this is an example. If we just let the process run free, it could do things like this, it could move... We had a pointer from here to here previously, but now it's moved it to this object here, which, sad face, crosses the boundary and breaks our invariant. It can also create new objects, so these ones. And that can potentially cross the boundary.
Or if we put it here, then it will cross that boundary and it will break our invariant. So how do we deal with that? Well, the answer is fairly simple. Whenever it breaks, we fix it. There are really only two cases where it breaks, because there are only two kinds of things that your process can do. It can create new objects and it can move pointers. So if we just look at that first case, where it creates objects, the question is where do we put them in terms of these three sets. Well, we can just see by process of elimination where they should go. We can't put them in the white set, because things like this happen. We get this pointer crossing the boundary. We can't put them in the black set, the scanned set, either, because potentially your new object has pointers in it already, which can point across your boundary to things in the white set. And so we're really only left with the option of putting them in the gray set. And it turns out that works. That always works. Because you're not touching anything in your white or black set, which are the important ones in terms of your invariant. So, second case: you can move pointers around. And why is that bad? Well, here we've got an example where the process took this pointer here and instead moved it to this thing here, which breaks your invariant. Again, it's crossing the boundary. So what do we do? Similar to before, we color it gray. So we move it into our set of things to scan, the gray set. And by doing so you get rid of that pointer crossing the boundary. And that's really all that there is to it in terms of the theory. Oh, I should mention one thing. How do we actually know when the process is moving these pointers around? Well, it turns out there's a little bit of code that runs every time that a pointer is moved in Go. So it's not sort of C-style "just put this thing in this block of memory." There's something which runs every time that you assign a pointer to a field, and it will color the pointee gray.
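All of this fix-up machinery can be sketched together in one toy model: a gray worklist for the mark phase, new allocations shaded gray, and a little check on every pointer store (what the Go runtime calls a write barrier) that shades the pointee gray. This is a simplified, single-threaded sketch with made-up names (`gc`, `shade`, `markStep`, and so on); the real runtime's barrier is more sophisticated than this.

```go
package main

import "fmt"

// Obj is a toy heap object with outgoing pointers.
type Obj struct {
	name string
	refs []*Obj
}

// gc models the tricolor state: black = scanned, gray = to be scanned,
// everything else is white (unscanned).
type gc struct {
	black map[*Obj]bool
	gray  []*Obj
}

// shade moves an object into the gray set unless it is already black.
func (c *gc) shade(o *Obj) {
	if o != nil && !c.black[o] {
		c.gray = append(c.gray, o)
	}
}

// writeBarrier runs on every pointer store: shading the pointee keeps
// the invariant that no black object ever points at a white one.
func (c *gc) writeBarrier(dst *Obj, i int, ptr *Obj) {
	c.shade(ptr)
	dst.refs[i] = ptr
}

// alloc: new objects go straight into the gray set.
func (c *gc) alloc(name string) *Obj {
	o := &Obj{name: name}
	c.shade(o)
	return o
}

// markStep scans one gray object; it returns false once the gray set
// is empty, at which point every remaining white object can be swept.
func (c *gc) markStep() bool {
	if len(c.gray) == 0 {
		return false
	}
	o := c.gray[0]
	c.gray = c.gray[1:]
	if !c.black[o] {
		c.black[o] = true
		for _, r := range o.refs {
			c.shade(r)
		}
	}
	return true
}

func main() {
	b := &Obj{name: "b"}
	a := &Obj{name: "a", refs: []*Obj{b}}
	c := &gc{black: map[*Obj]bool{}}
	c.shade(a) // the root

	c.markStep() // scans a; b goes gray
	// The mutator runs between mark steps: it allocates w and stores
	// a pointer to it into the already-scanned a. Without the barrier,
	// w would stay white but reachable, and be swept by mistake.
	w := c.alloc("w")
	c.writeBarrier(a, 0, w)

	for c.markStep() {
	}
	fmt.Println(len(c.black)) // prints 3: a, b and w are all black
}
```

Note that b survives this cycle even though the pointer move made it unreachable; it was already gray. That "floating garbage" is harmless and gets collected on the next cycle.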
So you pay a small overhead for that. So now that really is all that there is to it. And the overview is that you just take the classic mark-and-sweep algorithm and you make it concurrent, by starting with the breadth-first search and fixing things up when your process screws around with that graph. And it all works, in theory. So does it work in practice? Well, I will hand it over to Will to answer that question. - [Will] Okay, so hopefully Jim has given you a good overview of how the algorithm works and, in particular, how it can run concurrently with a program, and why this enables you to achieve these relatively short pause times that are no longer proportional to the size of the heap that you're traversing. So of course we had this problem with GHC, and were looking for a language to change to, or at least one whose runtime allowed us to achieve these short pause times. Go seemed very appealing to us for the reasons that Jim has outlined, but we also wanted to actually benchmark it. So we had clear requirements. We knew roughly how much we wanted to store in memory, how many objects to store in memory, and also the kind of latencies that we needed. So given these, we could create a benchmark that would test whether Go would actually achieve the kind of GC pause times that we wanted. So yeah, like I mentioned, we knew that we wanted this large working set, so a large number of live objects, and low latency. And the benchmark we created for this, which is basically an idealized version of what we were doing in the real program, was: allocate a large array in memory, and then write items to it in a loop, basically like a ring buffer. So you just overwrite the oldest item with the newest item. And the idea here is that you create this large heap that needs traversing, and you're also constantly creating garbage as well. So there's always work for the garbage collector to do. So Jim originally wrote this benchmark for our Haskell implementation.
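The ring-buffer idea Will describes can be sketched in a few lines of Go. This is a rough reconstruction under stated assumptions, not the actual benchmark code: the constants `windowSize`, `msgCount`, and `msgSize` are illustrative values, and the worst observed single-write time is used as a crude proxy for GC pauses.

```go
package main

import (
	"fmt"
	"time"
)

const (
	windowSize = 100000 // live messages kept in the ring (illustrative)
	msgCount   = 500000 // total messages written (illustrative)
	msgSize    = 1024   // bytes per message (illustrative)
)

func main() {
	ring := make([][]byte, windowSize)
	worst := time.Duration(0)

	for i := 0; i < msgCount; i++ {
		start := time.Now()
		// Overwrite the oldest message with a fresh allocation: the heap
		// stays large while garbage is constantly being created.
		ring[i%windowSize] = make([]byte, msgSize)
		if d := time.Since(start); d > worst {
			worst = d // a GC pause shows up as one slow iteration
		}
	}
	fmt.Println("worst-case write pause:", worst)
}
```

The single number out at the end is what makes this so easy to port and compare across languages, as Will notes next.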
And then when we were evaluating languages to switch to, we just ported it to Go, and then benchmarked it with that. So following on from that, we blogged about what we had done, and this guy, Gabriel Scherer, also ported the benchmark to OCaml and Racket. And then he created this repository, and since then it's been ported to at least 10 different languages. And I think people really like this benchmark because it's very simple, it's universal, and you get one value out at the end which you can easily compare between languages. You always need to take this with a grain of salt, because there are a lot of factors that it depends on, and we're just testing one very specific thing here. But for us it was a good proxy for the program that we wanted to write. So I ran this benchmark on a number of languages. And there are more than this, but I think these are enough to give a good variety of the kind of results that we were getting. So as you can see, we have Haskell, which we originally had. And Go up here. And as you can see, Haskell's worst-case pause time is an order of magnitude longer than Go's. And this is actually worse than it looks here because, as we know, this is a stop-the-world collector, and the pause time is going to be proportional to the heap size. Whereas with Go, because of the concurrent collector, it shouldn't be proportional to the heap size; it should just be able to do as much work as it needs to before yielding to the program. I put Java in there as well. I was kind of initially surprised at how poorly this performed in terms of worst-case pause time. It turns out it's a bit more complicated than this, as I sort of mentioned previously. So there's a number of Garbage Collectors that the JVM can use. This is the G1 collector. But there's a number of other ones, and they're all highly tunable.
So you can do that manually, but by default the runtime system will try and figure out how you're using the heap, and then try and tune that automatically. And that's kind of why you get these poor pauses at the start. But I think if we spent more time looking at this, we could really get this down by looking at how the pause times converge. They should converge on much shorter pause times. But it's something I want to look more into, I just haven't really had the time for it. There's also... so, Jim mentioned Python, which is reference counted. There are also languages like Objective-C, which are also reference counted. And I didn't run the benchmarks myself, but a number of people have, and you end up with incredibly short pause times for this particular benchmark. So you get down to, like, each iteration of the benchmark, so every creation and writing of a message into this array, will take a few microseconds. So much faster. And it really shows how reference counting for this particular use case is really effective. One thing that is probably interesting to you guys is just how fast OCaml is. And that is actually a regular tracing Garbage Collector, like all the other languages here. And I saw this and I was like, "Wow, that's really fast. I'm surprised it's so much better than Go." So I decided... All right, that's not working. Strange. Oh, that's nice. When you're offline you can't present it. That's really bad. Anyway, just look at it like this, I guess. So yeah, why are the OCaml results so good? The worst-case pause times were a quarter that of Go's. And it actually turns out it uses pretty much exactly the same algorithm as Jim described with Go, but it is also generational. So I'm not sure how familiar people are with generational Garbage Collectors, but basically the idea is that you divide the heap up into, normally, two heaps, and they're called generations.
And the reason for doing this is based on this idea of the generational hypothesis, which says that in pretty much all programs most objects die young. So you have a large number of objects which become garbage very rapidly, and then a smaller set of objects which last a very long time. And the idea of splitting them into two is that, if you're traversing the entire heap for each collection, there's no point traversing these long-lived objects when they are very unlikely to become garbage for any given collection. So the idea is you move them into this older generation, and then collect that at a slower frequency. And it also means that you can employ different garbage collection strategies in each generation. So in this case you use tricolor mark and sweep in the old generation, which can end up becoming quite large. So in this benchmark, for example, because all of the objects live a long time, they all end up in the old generation. But OCaml, for example, in the young generation will use a copying collector. And the reason for this is that it's a functional language, so this generational hypothesis is particularly true. Because you don't really mutate data, you always create new data and then update references. So you're creating a huge amount of garbage that lives a very short amount of time. And copying collectors are nice because the running time is only proportional to the data that survives a collection, which is not true of mark-and-sweep collectors, because you also have to find out which objects you need to free. And so there's a slight, subtle difference from Go. The fundamental idea is the same, in that you don't need to traverse the entire heap in one go while you pause the program; you can interleave a collection while the program is running. But with Go, I almost think of the collector as... So it's concurrent, it almost runs as just a separate thread in the program.
Whereas in OCaml, basically, when a thread needs more memory and there's not enough, it will just run some Garbage Collection, and then allocate, and then carry on running. So you can imagine it happening in the same thread, essentially. So based on this, and based on the similarity between the two, I was very curious as to why the Go results weren't as good as the OCaml results. And this was heightened by some of the material that the Go team has put out; they sort of say that you shouldn't really be seeing pauses longer than 100 microseconds, whereas the pauses we were seeing in this particular benchmark were much longer than that. So I decided to dig down a little bit more into this, and there's this really great tool in Go which is basically a runtime visualizer of a run of a program. And I've just, somewhere here, here we go. That looks legible from where you guys are, hopefully. So this is actually a run of the benchmark, and it visualizes the runtime events that happened over that run. And it's really nice, you can sort of browse it very easily. And the key points here are this red... So I should mention first, time runs from left to right here. At the top you've got the number of objects in the heap and the number of goroutines. And you can see here the heap is suddenly diminishing in size, in number of objects, and that's when Garbage Collection is running. And down here, this red section is when the program, the benchmark itself, is running. And what I was interested in is why we were seeing these longer-than-expected pauses. So I was looking at times when the garbage collector is running. So, for example, if we zoom in around here. And it's really interesting, you can clearly see the two phases of the collector that Jim was mentioning. So around here you have the mark phase. And you can actually clearly see it's running in parallel on all four cores, which is nice. So each of these sort of horizontal lines is one of the cores.
And then after that you see the sweep phase. So interestingly it happens in these very small sort of sections of time. And you can see them down here. So this happens in parallel on two threads: one of them is interleaved with the program and the other one is happening fully in parallel. Along here it seems fine, there are no pauses, because the program is happily running while this is going on, the sweep phase. But if you look at the mark phase, if I zoom out a lot, you can see that while this section is happening the program is essentially blocked from doing any work. And if you look at the time at the top, it's a good 20 milliseconds, almost, while the program is blocked from doing any work. And it's probably a bit shorter than this in practice, because of all the instrumentation that has to happen to enable these events to be emitted. But you can see a problem here. And I wrote this up in more detail in the blog post, check it out if you're interested. So this got some interest from the Go team when it was released, because they are interested in these edge cases that can lead to these long pauses. And so there were a couple of issues that came out of this, you can find them there. One of them is now fixed, but another one here is ongoing, and it should be fixed in Go 1.9, I think. But I'm not going to go too much into the details of them, I only very vaguely understand them. And you really have to understand really properly the exact implementation of the Garbage Collector, because they're kind of edge cases, really. But the takeaway for me is that it just demonstrates how tricky it is to get these consistently short pause times for every single type of heap usage characteristic that you can imagine.
For example, in the second issue there, one of the guys in the issue mentioned how this is only really an issue for these microbenchmarks, like what we're doing here, where we're basically just running at 100% CPU, constantly writing new messages into this array. That's not really the kind of heap usage you would expect in a production system, where you're not going to be running at 100% and there will be idle time when the Garbage Collector can run. So it's almost like you can see it as an edge case in some ways, but it does demonstrate that it's very hard to cover all cases with Garbage Collectors, and it's very hard to completely rule out these significant pause times. So basically, in conclusion, what I've really taken away from this is that it's impossible to design the perfect GC that is good in all cases; you're always making trade-offs. And the main thing you can do is just work out what your requirements are in terms of latency and also throughput and so on. Then decide on what kind of Garbage Collector satisfies those requirements, but also benchmark it and test that it really does. Because I think, at least what I've found, is that it doesn't always live up to the marketing that you read. So yeah, that's pretty much it. I think there's one more slide here. Yeah, Jim created it. But yeah, we're really keen on getting people in for talks. I think what we like about this topic is that it's very general, so we can cover just about anything. Like Sam mentioned: avionics, keyboard latency, anything really that has some kind of latency requirement. Because I think it's a really interesting way of looking at performance in a huge number of areas. And with that, I think, we're done.