Transcript

Stoodley: My name is Mark Stoodley, I am the project lead for an open source Java Virtual Machine project called Eclipse OpenJ9 and I work at IBM in Canada. I'm really happy to be here, not just because we've had lots of great talks and lots of great speakers to listen to at this conference, but because I left behind at home six inches of snow on my driveway and a traffic nightmare. My wife, who's now in charge of shuttling my daughter around to school, is not quite so happy about the situation. That's an example of two people applying a very different trade-off to the exact same situation and coming to a different conclusion about what's good and what's not so good.

In this talk, I'm actually going to talk about that same subject. I'm going to try to teach you about the trade-offs that are involved in using different types of compilers for your Java applications and try to explain why the trade-offs are what they are and what you should expect when you use different ones. The subtitle is "From AOT and beyond," so at the end of this talk, I'm going to be transitioning into an explanation of how we at OpenJ9 are starting to put different types of technologies together in order to address all kinds of different needs and do some pretty cool stuff.

I wanted to start off by just saying the Java ecosystem is this amazing place to work on compilers. There has been a tremendous investment over the years, more than 20 years, in the Java ecosystem, looking at JIT compilers, looking at AOT compilers, looking at JIT compilers that can cache their code and load it later. I've listed a whole bunch of different projects here; not all of these are still with us, but these are a lot of the major projects that have been looking at compilation over the years. These are really only the things that have been happening in industry and in very popular open source projects. There's a whole ton of work that's happened in academia; hundreds of graduate students have gotten degrees looking at compilation in the Java ecosystem. It's this amazing place, but it's created a lot of stuff to think about.

If we fast forward to today, that long list of things has crystallized into four projects that are live at the moment. Everyone's probably heard of this thing called HotSpot, it's pretty popular in the Java ecosystem. It has two JIT compilers, C1 and C2, and they're basically the default; pretty much everybody is using these things. I work on a project called Eclipse OpenJ9, which is a Java Virtual Machine that was originally built by and contributed by IBM to the Eclipse Foundation and is now available for everyone to use. It has a JIT compiler in it that's very adaptive and has optimization levels that use temperature as a metaphor, so we have cold, warm, hot, very hot, and scorching compiles, which is cool. Historically, it's been invested in AOT compilation from the embedded and real-time systems space, so we come at the problem of compiling for Java from a very different direction. Azul has the Falcon JIT, which is based on the LLVM project you've probably heard of.

This provides an alternative high-opt compiler to C2 and it also has the ability to stash compiles on disk and reload them in subsequent runs, which, I should have mentioned, OpenJ9 does too, when I was talking about my own project. Finally, there's this project, the Oracle Graal compiler, which has the distinction on this list of being the only one that's actually written in Java and compiles for Java, which is neat. Since Java 9, it's been available as an experimental AOT compiler in this tool called jaotc; I'll talk a little bit more about that in the next few slides. Since Java 10, it's also been available as an experimental alternative to the C2 JIT compiler, so you can use it as a JIT.

There is also this option of creating native images that they have available using their SubstrateVM project, which takes a closed world assumption that basically says, "I'm not going to load anything other than this set of classes," and tries to compile as small a native image as it possibly can. That's a very interesting project, but I'm not going to be focusing on it primarily in this talk. That's almost as much as I'll say about it in this talk; there was another talk yesterday where you can learn a lot more about Graal and how it works.

The outline of my talk is: I'm going to start off by comparing JIT, AOT, and caching JIT compilers, say what I mean by these things, and talk about some of the trade-offs in using them. Then I'm going to talk a little bit about how we take JITs to the cloud and we'll wrap up.

JIT

As I've talked about before, JIT stands for Just In Time, if you haven't heard the term before. Basically, a JIT compiler is active at the same time that your program is running, and so it can adapt to all the things that are happening in the program as it runs. It can collect profile data, it can watch classes getting loaded, unloaded, etc. It can adapt even to the platforms that you're running on: you ship around a set of class files, and you can run those on x86, you can run those on ARM, you can run those on any platform and it doesn't matter, because the JIT compiler, which is the thing that's going to convert them to native code, runs at the same time the program runs, so you can defer the decision about how you compile code with a JIT. After more than two decades of sustained effort, JITs are really the leader in Java application performance. We've proven this time and time again.

Despite multiple significant parallel efforts aimed at AOT performance, Java, because of its dynamic nature, just lends itself very well to JIT compilation. Why is that? There are a couple of reasons, which, if you squint at them, are actually really the same reason, but the first one is that JITs aggressively speculate on class hierarchy. In the Java language, calls are virtual by specification; that means they can be overridden by other classes. When you're making a call to one of these functions, you don't really know what the target is. It could be anything; it could be a class that hasn't even been loaded yet.

Many of these calls, as we know because we've studied many Java applications over many years, really only do have a single target at runtime. The JIT can see that because it's watching the program run; it knows which calls have only a single target and it can optimize for that. It can speculate that the one target it has seen is the only target that will ever be called, and so it can aggressively optimize: it can inline the target of the call into the code around it and then it can optimize that code and make it much faster. Inlining is one of the great enablers in compilers.

It allows you to see more operations in a wider scope, to combine the context outside the call with the code that's inside the call, and so on. By being able to inline, you can greatly expand the scope of optimization and generate really great code. Because of this dynamic nature of Java, if you compile too early, you can actually fool the compiler into speculating on something that doesn't only have one target. You'll end up generating code that's right for a while, but then, in the real world, you have multiple targets and it's not right, so the JIT has to generate backup paths to deal with that situation when it happens. That ends up being not great code, with not great performance, until you've recompiled it.
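To make that speculation concrete, here is a hedged sketch of the transformation a JIT performs when it sees a virtual call with only one observed target. The class names are illustrative, not from the talk, and `callAfterSpeculation` just shows in source form what the JIT effectively does to the machine code: a cheap type check guarding the inlined body, with a backup path.

```java
// Illustrative sketch of speculative devirtualization; names are made up.
interface Greeter {
    String greet();
}

final class EnglishGreeter implements Greeter {
    public String greet() { return "hello"; }
}

public class MonomorphicCall {
    // The call g.greet() is virtual by specification: any Greeter could arrive
    // here, including one whose class hasn't been loaded yet.
    static String call(Greeter g) {
        return g.greet();
    }

    // Conceptually, once the JIT has only ever observed EnglishGreeter at this
    // call site, it rewrites the hot path into something like this.
    static String callAfterSpeculation(Greeter g) {
        if (g instanceof EnglishGreeter) {
            return "hello";   // inlined body, now optimizable with surrounding code
        }
        return g.greet();     // backup path: deoptimize and take the slow virtual call
    }

    public static void main(String[] args) {
        System.out.println(call(new EnglishGreeter()));
        System.out.println(callAfterSpeculation(new EnglishGreeter()));
    }
}
```

The backup path is what makes compiling too early costly: if a second `Greeter` implementation shows up later, the guard starts failing and the method has to be recompiled.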

JITs also use profile data as the program is running, so it might not surprise you to learn that not all code paths in your application execute equally frequently. Some code paths execute a lot more frequently than others. The profile data that the compiler can collect while the program is running tells it which paths to focus on. The simplest example here is that you don't have to compile every method in the program. If a method never gets called, or only rarely executes in a particular run of your application, there's no reason to spend the time compiling it. If it's only going to run a few times, you can run it in the interpreter, and you can afford to do that because it's not going to consume very much of your execution profile. Something that's running and being called all the time, you want to compile, because you get a big benefit from compiling it to native code.

Not all calls have a single possible target; there are megamorphic calls, they do exist, but profile data can help you prioritize which one of those targets you might inline and then optimize the code around it. Profile data can help you do inlining even for those calls that aren't monomorphic. The third point is that profile data is a very efficient substitute for analysis in the JIT. You can actually identify a constant via profiling, not by having to do extensive analyses of lots of code. You can imagine, if somebody creates a constant and then passes it through a long call chain, for a compiler to realize that it's a constant, it has to see that entire chain of calls, look at all of that code, and make sure that the constant is really propagating all the way down to the uses. If I'm profiling the use and I see that only one value ever gets there, I can be pretty sure that it's likely to be a constant. I might not be 100% sure, but I can still generate better code assuming that it is a constant.
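The profiled-constant idea can be sketched like this. The method names and the value 21 are made up for illustration: proving statically that only one value ever reaches `scale()` would mean analyzing every caller in the chain, but a profile at `scale()` that has only ever seen 21 lets the JIT emit a specialized fast path that stays correct even when the profile turns out to be wrong.

```java
// Hedged sketch of value profiling standing in for whole-program analysis.
public class ProfiledConstant {
    // A value threaded through a long call chain.
    static int a(int v) { return b(v); }
    static int b(int v) { return c(v); }
    static int c(int v) { return scale(v); }

    static int scale(int v) { return v * 2; }

    // What the JIT effectively generates if the profile only ever saw v == 21:
    static int scaleSpecialized(int v) {
        if (v == 21) {
            return 42;    // constant-folded fast path
        }
        return v * 2;     // general path keeps the code correct for other values
    }

    public static void main(String[] args) {
        System.out.println(a(21));                // the common, profiled case
        System.out.println(scaleSpecialized(21)); // same answer on the fast path
        System.out.println(scaleSpecialized(3));  // still correct off the fast path
    }
}
```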

JIT compilers work really very well if the profile data is high quality. We've spent two decades making sure that that's true, and that's why JITs get such great performance. This advantage doesn't come for free. Obviously, collecting profile data is an overhead, for one; you have to spend time to do it. That cost is usually paid while the code is being interpreted, which means it's going to slow down your start-up and ramp-up, which is something that people don't always like. If you need high quality data, that means that you have to profile for a while before you actually pull the trigger. To decide to compile something, you have to wait until you've collected enough profile data to do a really good job compiling it, which also slows down your ramp-up and start-up.

A second part of this performance advantage that's not free is that there are resources being consumed by the JIT compiler itself, CPU cycles and memory. It takes milliseconds to seconds to do a JIT compilation for one method, and it's going to consume potentially hundreds of megabytes of memory transiently. You're not going to see it for very long, because JIT compiles tend to be short, but it's there, it's measurable. That cost is paid while you're compiling, and mostly you're compiling during start-up and during ramp-up, so all of these overheads come home to roost in exactly the place where people already have problems with start-up and ramp-up: they're interfering with the ability to get compiled code performance faster.

There's also some persistent resource allocation. You have to store profile data somewhere, you have to store metadata about classes somewhere but for the most part, it's the transient stuff that really gets in the way.

If I were to collect that together in a scorecard for the JIT and assign a score: steady state code performance is really great for a JIT, so that's green. It can adapt to changes at runtime, that's great. It's really easy to use. Does anybody know the command line option to disable the JIT? No hands. That's easy to use. Nobody even cares how to get rid of it. You have platform neutral deployment because you're compiling when the program is running, not when you're actually building.

However, as we've noticed, there are some issues with start-up performance and with ramp-up performance. I haven't defined those yet. Start-up is what I call the time until an application is ready to handle load. You spend a lot of time initializing data structures, getting stuff ready, and then eventually, you're ready to start accepting load or start solving the problem that you're trying to solve. Ramp-up is the time after that until you hit your steady state performance; you may not be able to immediately do things as fast as you'd like to, but eventually you'll get there, and that's when you'll hit steady state.

We also mentioned that there's a runtime impact on CPU and memory, so we'll give red scores for those things. I don't know how many people have come to talk to me about these things, but when they see this scorecard they point at it and say, "Can AOT help with these red things here? We really don't like these red things. We want to start up fast, we want to ramp up fast, and we don't want to pay all this extra stuff." Let's talk about Ahead of Time compilers.

AOT

The basic idea here, which you probably already know, is you introduce an extra step at your build time to generate native code before you deploy the application to wherever it's going to be run. In the OpenJDK ecosystem, there's a tool called jaotc, which is used to convert a set of class files to a platform-specific shared object. It's very akin to the approach that's taken with less dynamic languages like C or C++ or Rust or Go or Swift, etc. It's currently still experimental; it has had the experimental tag associated with it since JDK 9, and right now it's x86-64 platforms only, as far as I know. There are a couple of deployment options, things you have to decide at build time before you deploy your app, and that's whether or not you want a JIT to be able to top up your performance on top of what the AOT code is able to give you.

You can have no JIT at runtime, which means you get statically compiled code and nothing else; anything else runs in the interpreter, slowly. Or, you can run with a JIT at runtime, in which case there are mechanisms built into the code so that it can trigger JIT compilations using C1 or C2 in order to upgrade the performance and get closer to what the JIT is capable of doing. You get faster start-up, plus you get the ability to reach higher runtime performance.

AOT has some runtime advantages over a JIT compiler. You get that compiled code performance immediately. There's no watch the program, wait to figure out what methods are running a lot, put them in a queue, wait till they get to the head of the queue, run a compiler to compile them, generate native code, inject it into the running system, and wait for the next time for it to get invoked. It skips all of that stuff. You load it in your process and you've got native code; it runs, everything's great. Start-up performance here can typically be 20% to 50% better, especially if you're combining it with technologies like AppCDS. It's going to reduce the CPU and memory impact of the JIT compiler, particularly if you use that first deployment option where there is no JIT.

There are quite a few big BUTs, actually. You're no longer platform neutral. You have to decide which platform you're going to target when you're building, because you need different AOT code for different platforms. In particular, you need to package it differently. The way you package a shared object or code on Linux, Mac, and Windows is different, even if you're just talking about x86. If you brought other platforms into the mix, it would be an even stranger mix of things that you'd have to decide upfront and ahead of time. You have to pick which processor you're going to generate code for, too. You can't necessarily say, "Target the latest and greatest Skylake processor," if your code might run on something that's not a Skylake. At worst, you might take some performance impact by targeting conservatively, but if you choose to use instructions that are only available on that CPU and then try to run somewhere else, you're going to abort and cause all kinds of unhappiness.

There are a few other usability kinds of issues too, some of which are getting better as the ecosystem moves forward, but basically, there are deployment options that you have to decide on at build time and you're locked into them, because those options change how you generate code. Different GC policies require different kinds of write barriers and sometimes read barriers, so you can't just arbitrarily say, "Ok, I want to use G1 now," or, "I just want to use ZGC."

As I mentioned, on different platforms you may even be generating different sets of classes and methods, so one of the other things that you have to do when you're generating AOT code is tell it which classes and which methods you want it to compile. On different platforms, that might be a different list, because there are classes that only get loaded on Mac when you're running on Mac and some that only get loaded on Linux when you're running on Linux. We've seen that repeatedly, but anyway, you get the point.

Now, those lists are things that you have to curate and maintain. It's all well and good to do one study and try AOT and it works great, fantastic. But my application is a changing thing, and if I have a lot of applications, then I have a lot of things that continue to change. Those lists of classes and methods have to be curated; you have to maintain them, you have to keep track of them as your application evolves. New code paths are being created, new code paths are being activated; you have to remember to add those to the list and keep them there, and if there are things that are no longer used, you probably want to take those out of the list because they're not worth being there anymore.

Then, of course, there's always a question: what about classes that aren't there until the run starts? You can't AOT compile what you don't have. There are some things that are troublesome from a usability standpoint with AOT code. Let's look back at those two reasons that we had for JIT compilers to deliver excellent performance: speculating on class hierarchy and profile data. An AOT compiler doesn't have either of those. It can't do those things, because it doesn't know what classes have been loaded, because it hasn't run the program yet. AOT compilers in their very pure form, not combined with JITs, can't really reason about things that are happening at runtime, because they're not at runtime.

Let's take a bit of a sidebar into that discussion and look at the lifetime of some generic Java application; hopefully it matches at least somebody's Java application in the room. Things start off, you run Java, and you have that big bang: the Java process has been created. Poof, it didn't exist before and now it exists, you have a process. A little while after that, the JVM gets loaded and initialized and you're about to load the first class and run main. By the time you get to the point where you can actually run main, about 750 classes have been loaded and there's a handful of class loader objects that are live, and they're responsible for doing the remaining class loading that goes on.

In this diagram, I'm starting to show a bit of an envelope. There are these two blue lines here, which hopefully you can see my pointer on down here. The further apart they are, the more classes there are and the more complex the relationships between those classes are. During this application class loading and initialization phase, up to hundreds of class loaders can be active and tens of thousands of classes can be loaded. If you're running a big Jakarta EE app, say, you can have lots of stuff flying around at the same time, so you can get very large numbers of classes and a very complex class hierarchy that you have to be looking at.

Finally, your application gets to the point where it's ready to do work. This is the end of the phase I called start-up. You're now starting to exercise the actual code paths that are going to be commonly used at runtime, and you'll probably end up loading a lot more classes now, so if you did too much compilation during start-up, you're going to start invalidating some of the assumptions that you were making at that point. Eventually, your code paths stabilize, your profile stabilizes, and everything is fantastic. The ramp-up ends and you're in the world of normality. Real applications will go through phases; they'll go idle and do all sorts of other complicated, nasty things. I'm not trying to show that with this diagram.

Getting back to the topic of compilers, the JIT compiler is basically inside this process the whole time and at any point in time, it knows exactly which classes have been loaded, where they've been loaded, how they relate to all the other classes in the system regardless of which class loader loaded them, etc. It can see all of that complexity that's right in front of it but the AOT compiler has to view it all through the big bang of the Java process being created, it's outside of this whole process. What that means is that AOT really has to predict all of that complexity that I just described, those hundreds of class loaders potentially and tens of thousands of classes, how are those things all going to relate? It has to predict all of that and that's really hard.

Let's go through an example. Imagine two very simple classes here, B and C, where C.foo calls B.bar. This looks like a very simple opportunity to inline the call to B.bar. Bar just returns five, so I can take five, replace the call to B.bar with five, and optimize it with the code around it. Cool, that would be great. How did that actually happen? How did this connection between C and B get formed? Classes C and B got loaded by a class loader, let's call it CL1, and when foo is running, it needs to figure out which B it is that I'm really talking about.
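The example from the talk, written out in source form. C.foo calls B.bar, and bar just returns five; a JIT that has only ever seen this B can effectively replace `b.bar()` with the constant 5 and optimize from there.

```java
// The two simple classes from the talk's inlining example.
class B {
    int bar() { return 5; }
}

class C {
    int foo(B b) {
        return b.bar(); // inlining candidate: effectively "return 5;"
    }
}

public class InlineExample {
    public static void main(String[] args) {
        System.out.println(new C().foo(new B())); // prints 5
    }
}
```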

If you look at the class file format for C, B is a string, not a class. It's not an identity, it's not B or its superclasses, it's not anything; it's a string. It's the class loader's responsibility to take that string and say, "That's this class over here, class B, that's this guy." If you're compiling C.foo and you've resolved the constant pool entry for B, it will point to this class B, and then you'll be able to hook up bar, find out that it returns five, and do this magical optimization that we all really hope happens.

These class loader objects don't really exist anywhere but in the Java heap; they're just objects. There is no concept of a class loader outside of the JVM process. In particular, you can have other class loader objects, which can equally load class C and look up a different class B from that string that's sitting in C's constant pool, and that might be some other B that returns minus five. In that case, we'd better not inline five; we'd better do the right thing, which is just call B.bar. In fact, you might not even know what B is until you run C.foo. That line of code there, B b = get_a_b, might be the first time a B object got allocated in the whole program. How is AOT supposed to figure all of this out without actually constructing any of it? It's hard.
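To see that this isn't hypothetical, here is a hedged, self-contained sketch showing that two class loader objects can each define a class with the same name, and the JVM treats the results as distinct types. All names here are illustrative; a real OSGi-style setup would load two different versions of the bytes, but even identical bytes defined twice produce two different classes.

```java
// Two loaders, one class name, two distinct classes.
import java.io.ByteArrayOutputStream;
import java.io.InputStream;

public class TwoLoaders {
    // A loader that defines Payload itself instead of delegating to its parent,
    // so each instance produces its own copy of the class.
    static class Isolating extends ClassLoader {
        @Override
        protected Class<?> loadClass(String name, boolean resolve) throws ClassNotFoundException {
            if (!name.equals("TwoLoaders$Payload")) {
                return super.loadClass(name, resolve); // delegate JDK classes as usual
            }
            try (InputStream in = getResourceAsStream(name + ".class")) {
                ByteArrayOutputStream out = new ByteArrayOutputStream();
                in.transferTo(out);
                byte[] bytes = out.toByteArray();
                return defineClass(name, bytes, 0, bytes.length);
            } catch (Exception e) {
                throw new ClassNotFoundException(name, e);
            }
        }
    }

    public static class Payload {}

    public static void main(String[] args) throws Exception {
        Class<?> a = new Isolating().loadClass("TwoLoaders$Payload");
        Class<?> b = new Isolating().loadClass("TwoLoaders$Payload");
        System.out.println(a == b);                          // false: two distinct classes
        System.out.println(a.getName().equals(b.getName())); // true: same name
    }
}
```

An AOT compiler looking only at class files has no way to know how many of these loader objects will exist at runtime, which is exactly why it has to hedge.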

AOT probably has to hedge in this case, because maybe only class loader one exists, or maybe only class loader two exists, or maybe both of them. It doesn't know what the scenario is, so it probably has to hedge. Now, you might be saying that seems like a pretty contrived example. I don't know, but it's actually modeled on how OSGi modules work, enabling two different versions of the same library to be loaded at the same time, for which we have a great name: jar file hell. We don't like it and nobody likes it, but it is a reality and there's nothing in the Java specification that says it can't happen. You have to ask yourself, "What prevents this scenario if classes can be loaded dynamically or even created on the fly?" You just don't know.

Because AOT has to completely understand what's going on, it means it's probably going to have to hedge at these kinds of inlining opportunities. It makes it really hard for AOT to inline and inlining, like I said, is a great enabler for performance. The JIT is acting at runtime, it can look at exactly what's happening. If it's only class loader one, great, I'll inline five. If it's only class loader two, great, I'll inline minus five. You got both of them? Fantastic, I've got two C's and in each of those C.foo, I'll compile them independently and put five in one and minus five in the other. The JIT really has the advantage here.

These hedges that the AOT compiler has to make really increase the potential gap between the AOT and JIT performance levels, and here we're talking about steady state performance levels. Now, you might be thinking, "Ok, profile directed feedback," sometimes called profile guided optimization, PGO, "maybe that'll help." Yes, maybe, but AOT code has to handle all possible user executions. The JIT gets the advantage of knowing the profile data in this run, and if you do a different run where it does something completely different, it will still get the profile data for that run and be able to optimize for it.

With AOT, you have to handle everything. You've only built it once; you don't have the option to rebuild it for each runtime or runtime instance. What that means is that it's really important to use representative input data that crosses all of the possible user data that you might have when you're dealing with AOT. The risk is that it can be very misleading if you use only a few input sets, because the AOT compiler is going to specialize just like the JIT would if it only had that profile data. It's going to specialize for the input sets that you give it, and if you give it something else, then you're going to have lower performance. That profile directed feedback approach can really lead the compiler astray if the data is not representative enough.

We talked about monomorphic calls, where the call only has one target, and said that we've done lots of studies showing that calls are generally monomorphic. Those studies were done 20 years ago, a lot of them, on a benchmark called SPECjvm98, which was really just a bunch of C programs that got converted to Java. It's been borne out by lots of applications when you look inside a single run, but when you look across all the possible input data sets, I'm not sure I'm as confident that calls are always going to be monomorphic across all user instances. That also means we have to be very careful with benchmark results when we're looking at AOT, because benchmarks might not use a lot of different inputs and data sets. They might not properly reflect what you would really get using AOT.

Cross training here is really critically important. It's very important in machine learning, and it's very important for AOT compiling. You want to train with PDF on one data set and then measure on other data sets to see just how well it's doing. These input data sets are just like those lists of classes and methods that I talked about before: they need to be curated, and you need to maintain them as your application evolves and as your users evolve, because the input data might change. All of that is really on the application provider. It's on the person who's going to do the AOT compile and then distribute it to all of the different people who might use that AOT code.

I'll make one observation, that PDF has not really been a huge success for static languages. There are cases where it's been used to great advantage, but as a general rule, not everybody is using PDF for statically compiled languages, for a lot of these same reasons. You might think I'm pretty down on AOT here and, as a pure technology, I kind of am, so I'll be clear about that.

If I put them back up on my scorecard, it's true that I turned all my red boxes green by using AOT, but at the same time, I changed all my green boxes to red, so that really didn't help me very much. I would call that one step forward and one step back. If I were to combine AOT and JIT, using AOT to generate an initial set of code and then recompiling things with the JIT in order to get better performance, I can start to do a better job here. I have a JIT running, so I can get good steady state performance eventually. I have a JIT running, so I can adapt to runtime changes and what's going on, so I get a green box there. AOT has all of the issues with curating lists of methods and classes and profile data that I mentioned, so I took off ease of use for that one. It's not platform neutral, because I have to generate this AOT code and decide which class of platforms I'm going to target. I get good start-up, and then I get pretty good ramp-up, because for ramp-up you really need the JIT to take you all the way, but I have a JIT at runtime now, so I still have CPU and memory problems.

Is that as good as it gets? Well, not quite, I still have other tricks in my bag. What I'm describing right now is pretty much the state of OpenJ9, before it was open sourced, back in about 2007. The solution that we put out to accelerate start-up was to use AOT code: basically, we generate it with the JIT, we store it in a cache, and we get basically the scorecard that I showed before.

Caching JIT Compiles

Caching JITs have gone further; we've gotten better at doing this. The basic idea with a caching JIT is: you have a JIT compiler running, so why not just take the code that the JIT generated and store it someplace? Then in another run, let's take that code, so I don't have to compile it again. How many times do I have to compile String.equals, really?
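The shape of that idea can be shown with a toy sketch. This is not OpenJ9's or Azul's actual mechanism (those persist code across JVMs via a shared cache on disk or in shared memory, with validation metadata); it just illustrates the compile-once, reuse-everywhere pattern within one process.

```java
// Toy compile cache: first request pays the compile cost, later ones hit the cache.
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

public class CompileCache {
    // Stand-in for a blob of generated native code.
    static final class CompiledMethod {
        final String name;
        CompiledMethod(String name) { this.name = name; }
    }

    static final AtomicInteger compileCount = new AtomicInteger();
    private final Map<String, CompiledMethod> cache = new ConcurrentHashMap<>();

    CompiledMethod getOrCompile(String methodName) {
        return cache.computeIfAbsent(methodName, CompileCache::compile);
    }

    private static CompiledMethod compile(String name) {
        compileCount.incrementAndGet(); // the expensive JIT compile would happen here
        return new CompiledMethod(name);
    }

    public static void main(String[] args) {
        CompileCache cache = new CompileCache();
        CompiledMethod first = cache.getOrCompile("String.equals");
        CompiledMethod again = cache.getOrCompile("String.equals");
        System.out.println(first == again);      // true: same cached code reused
        System.out.println(compileCount.get());  // 1: compiled exactly once
    }
}
```

The hard part the real systems add on top, discussed next, is validating that cached code is still safe to use when the second run's classes and paths differ from the first run's.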

This is like a JIT and AOT mix here. Is it really different from AOT? No and yes. From the second JVM's perspective, it looks like AOT. I'm loading code that's already generated, I'm not compiling it, I'm not doing anything at runtime to produce it; I'm just going to load it and use it. That looks like AOT to me. The first JVM, that poor sucker, has to go through the whole process of JIT compiling everything, generating the code, and storing it in the cache.

On top of that, I have to generate a bunch of metadata to make sure that I don't screw things up in that second run if some different class gets loaded, or some different path becomes hot, or classes don't get loaded with exactly the same relationships that the JIT relied on when it optimized the code I stored away in that first run. I have to be able to catch that, because it would be bad if I took code that was optimized for a scenario that's not true in the current JVM and used it. You would be unhappy with me. We have to be very careful, and that has a bit of an impact on making sure that all of this works very well.

We do get to return to platform neutrality, because even in that first run where I'm generating the code, it's the JIT compiler, it's running in the process, and it sees at least the profile of the user running that code. I don't have to generalize across all users or all possible scenarios; I can focus on one, and that's usually enough to be good, because not too many people run the same app in wildly different ways all the time. That's just not a very common way to work with a system. Different users are using different caches, and so they get the benefit of the code being tailored for their environment. It's happening at runtime, so it's tailored for the processor: I can compile for Skylake if I'm running on Skylake.

There are two basic implementations of this that are available out there in the wild. One is a technology in Eclipse OpenJ9, the JVM that I work on, we call it Dynamic AOT. It was originally introduced a long, long time ago in the IBM SDK for Java 6. At this point, we've got it to the point where there's about a 5% to 10% possible hit to peak JIT performance and we're getting better all the time at that and it's resilient to application changes.

Azul also has a technology which they call Compile Stashing. The Azul Falcon JIT, which is based on LLVM, has this compile stashing ability; it was introduced in 2018 or '19, I'm not 100% sure when it actually shipped, but it's the same idea. You can store compiled code to disk and load it in a subsequent run. There are some issues in making sure that this run completely matches the scenario you had in the previous run, so it's not 100% perfect, but they do JIT recompilation to get everything up to scratch, so it works really well too. It's also resilient to application changes. These things are all fully Java-compliant: they run any Java application and get it right.

In OpenJ9, we primarily use cached JIT code to accelerate start-up; that's been our main use case for it. We have an option called -Xshareclasses. If you turn on that option, you will share the memory for classes, you will save the time spent initializing those classes at load time, you will store AOT code that's been compiled by the JIT into that cache, you will store profile data, and you will store hints to the JIT on what it should do going forward. That population of the cache happens naturally and transparently at runtime; there's nothing else you have to do to make it work. You can name caches, you can put them in different locations, and so on; there's all kinds of cool stuff you can do with it.
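As a sketch, a run with a named cache might look like this (the cache name, directory, and application jar here are made up for illustration; -Xshareclasses and its name, cacheDir, and printStats suboptions are documented OpenJ9 options):

```shell
# First run populates the shared cache (classes, AOT code, profile data).
java -Xshareclasses:name=myapp,cacheDir=/tmp/j9cache -jar myapp.jar

# Subsequent runs with the same option load classes and AOT code from the cache.
java -Xshareclasses:name=myapp,cacheDir=/tmp/j9cache -jar myapp.jar

# Inspect what ended up in the cache.
java -Xshareclasses:name=myapp,cacheDir=/tmp/j9cache,printStats
```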

Very recently, in one of our releases, we turned it on for the bootstrap classes by default, which is comparable to HotSpot, which also shares the bootstrap classes by default. There's also an option, -Xtune:virtualized, which will use cached JIT code even more aggressively. It will even use it to accelerate ramp-up, not just start-up, though there may be a slight performance drop from doing that. The JIT can top it up, but you might still see a difference. The graph on the right here shows Tomcat starting up with OpenJ9.
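Combining the two options is a one-liner (the jar name is a placeholder; both flags are real OpenJ9 options):

```shell
# Use cached AOT/JIT code aggressively for both start-up and ramp-up,
# trading a little peak throughput for faster warm starts.
java -Xshareclasses -Xtune:virtualized -jar myapp.jar
```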

On the left side, if you completely disable all sharing, you get the normalized 100% number. HotSpot does some class sharing by default, so you get a 19% boost from using HotSpot. However, with the new change we've made to the default shared class cache, there's actually a 28% boost that you get from OpenJ9 by default; you don't have to turn anything on. Then if you take that extra step and use -Xshareclasses, you actually get a 43% boost in start-up performance for the Tomcat server, so it works pretty well.

If we add that to our scorecard, we almost have great steady state performance. We can adapt to runtime changes, it's very easy to use, it's platform neutral. Start-up is great except for that poor first run; ramp-up is great except for that poor first run. In the second run, we get pretty good CPU and memory because we're not doing as many JIT compiles, but that first run still gets hit, so that's the downside. There are still some boxes that are not green, which is unfortunate, even for caching JITs. I'm going to talk now about some technology that we're in the process of building at OpenJ9, and we're very close to having it ready for people to try out and use. Actually, it is ready for people to try out and use, it's just not in our builds by default quite yet.

Taking JITs to the Cloud

The basic question is, what if the JIT became a JIT server? We're trying to get rid of those transient resource requirements that the JIT imposes on the JVM client when you run your application, because they're really hard to predict. Who knows how much memory a JIT compiler is going to take in their Java application? No idea. Again, because the JIT is so lovely and transparent, you have no way to predict what it's doing. We've got your back. Let's dodge the problem and take this unpredictable, transparent thing that we all love, and move it somewhere else, so that it can be unpredictable and transparent somewhere we don't have to care about as much.

Let our applications run nice and clean; we probably have more chance of predicting the memory requirements of the Java application that we wrote than of this JIT compiler that somebody else wrote and that activates at random times. The basic idea is that the JVM still does the profiling and the work to identify which paths need to get compiled, but the actual work of compiling those methods gets shifted off to a remote server. Then you have some wonderful orchestrator in the middle that handles load balancing, affinity, scaling, reliability, and all that stuff. Gee, if only there were things that could do that for us. Eventually, they'll come along, I'm sure. Kubernetes, Istio, lots of good things.

What are the benefits? We get all of these random, hard-to-predict, JIT-induced CPU spikes and memory spikes out of the client, and the performance becomes more along the lines of what you're familiar with from the application you're running. That JIT server can connect to the client JVM at runtime, at the same time the application is running, so, theoretically, there's no loss in performance. We can use the same profile data, we can get the same class hierarchy information from that running process. It's still adaptable to changing conditions, so that's great, and the JVM client is still platform neutral. In fact, the JIT server doesn't even have to run on the same kind of machine as the client anymore; in principle, you could do cross compiles. There are reasons why it usually still runs on something quite similar, but in principle you could.

Could it really work? Yes; I wouldn't be here telling you about it if it didn't work. That's a less interesting talk. Here's a chart for a Java EE benchmark called AcmeAir, which models a flight reservation system, but that's not super important. It's using the JIT server technology that we've been building at OpenJ9, in combination with -Xshareclasses. I'm showing the cold run on the left and the warm run on the right. The blue line is OpenJ9 using a local JIT and the orange line is OpenJ9 using a JIT server to do its JIT compilations. The JVM client here I've backed into a bit of a corner: it's running in a container and I'm only giving it one CPU and 150 MB of memory, which is pretty tight for this particular application. You can see that - and don't call me a cheater yet - taking the CPU load of doing compilations out of the JVM client and moving it to the server helps the JVM client even in the cold run, where that poor victimized JVM has to do all the work for everybody else and doesn't get the limelight. Even the cold run starts up faster and ramps up faster, and actually quite a bit faster.
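The constrained-container setup described here can be sketched with standard Docker flags (the image name is a placeholder for whatever packages the benchmark client; --cpus and -m are real Docker options):

```shell
# Pin the JVM client to 1 CPU and 150 MB of memory,
# mirroring the tight container the benchmark client runs in.
docker run --cpus=1 -m 150m my-acmeair-client-image
```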

On the right-hand side is the warm run. Ok, it's not exactly a lot better using the JIT server, but it's as good and, in fact, you do hit peak performance faster, and there's an interesting reason why. When you're doing a cold run with a local JIT, the JIT can't compile methods very quickly, which means its queue of methods gets very long. We have a heuristic that says, "When the queue is long, start downgrading AOT compilations and doing them at a lower opt level," so I can chew through more of them and get compiled-code performance faster. Which is great, except that when I lower the opt level, I don't get as good performance, which means I have to recompile a bunch of stuff.

As you can see, there's this drawn-out process here: it gets to a point and then it has to do some other work to get all the way up to peak performance; it has this delay in here. You get here faster because you've compiled a lot of AOT methods in your cold run, but then you have to do more of those higher-opt compiles to get all the way up, and those compiles end up happening in-process during the warm run. When you're using a JIT server, those JIT compiles move somewhere else, so the queue doesn't get long, so we don't downgrade as many things, and so we end up being able to compile all of the methods that are going to be used in start-up and ramp-up at warm, our normal opt level.

That means we get this nice straight line all the way up to peak performance, much earlier than what the in-process JIT is able to achieve, so that's cool. Now, I have to be fair: these two things are running on two different machines, though with a direct cable connection. I could have put them on the same machine, but a direct cable is about the best option in terms of latency for connecting two separate machines. I do want to make the point that HotSpot takes about twice as long as OpenJ9 with a local JIT to start up this application and ramp up to peak performance, so OpenJ9 is already great, and this makes it even better.

Another example I'm going to show here is running a different application called Daytrader 7. This is a bit of a beefier Java EE benchmark, which simulates day trading of stocks. Here I've shown three scenarios where I'm backing the client application into an increasingly difficult corner. The left-hand graph shows one CPU and 300 MB, and you can see that that's really not a very tight corner. Even the local JIT is able to make enough progress there, and we get about the same ramp-up and about the same performance. In the middle graph, I've reduced the memory to 256 MB, and the local JIT is now starting to have trouble managing the workload of compiling at the same time that the application is trying to do work.

What that means is that when the JIT compiler runs out of memory, it doesn't throw an OutOfMemoryError and take down the whole JVM; that would be stupid. What we do instead is just bail on the compile: we can't do that compile at that optimization level, so we back off to a lower optimization level, but that means lower performance. As you can see, the local JIT is starting to take a bit of a performance hit, whereas the JIT server, because it's still sending its compiles over to the server, is still able to manage things quite well.

On the right-hand side, I've reduced it even further to 200 MB, and now you can really see that the local JIT is struggling. It's having real trouble compiling methods, and the throughput it manages to achieve suffers; it doesn't fall over, it doesn't die, but it's not doing as well as in the other scenarios. The JIT server is doing just fine. Why is it a little lower? I'm not 100% sure; that's something I'll have to look at.

What this means is that you can start to be a lot more aggressive about how you size your containers once the JIT compiler has been taken out of the process. You still have to allocate memory and CPU resources for the JIT server, obviously, but this simplifies the task of managing the containers for your applications.

I know what you're thinking: "Those have been dedicated machines with network cables between them, but what about network latency? Isn't that going to hurt start-up and ramp-up when all that compile traffic is happening? Will it really be practical in the cloud?" The title of the talk was "Taking JITs to the cloud," after all. We tried it on Amazon and it worked pretty well.

On the left, the blue line is, again, the OpenJ9 client with a local JIT, and you can see that it has massive spikes in footprint on its way to peak performance in order to do JIT compiles. At steady state it's quite regular, but the JIT server has moved all of that workload off the client just fine, and we get this nice, clean memory footprint curve, very predictable, very easy to deal with. In terms of performance, which is the graph on the right, you can see it's not quite as perfect as the other graphs I showed, but it's not all that far out of whack, and we're still working on this, so we think there are still improvements to be made.

Now, why is this the case? It's because we can trade bandwidth for latency. Yes, it's true that compiles take a lot longer when they happen across a network, but we can afford to do more of them at a time, because each one drives lower utilization, even on the server. We can afford to put more resources on the server if it's doing a lot of compiles, and it can do compiles across multiple applications; you don't have to have one server per application anymore. It's not a one-to-one relationship: you can have one server feeding a whole bunch of clients if you want.

If we put that one up, I've got almost all of my boxes green. Everything is great on the performance side, everything is platform neutral, it's easy to use. We've got two stars now, because across a cluster of applications talking to the server, the first run still pays to compile this code, but the server can then send that code to any number of clients that want it, in principle. The only reason I didn't make the runtime CPU and memory a full green bar is that there's actually still some CPU consumed at the client in order to send these JIT compilation requests. It turns out it takes quite a lot of CPU cycles to send network traffic.

As a compiler guy, this shocked me, because I thought compilers had to be one of the most computationally demanding things you could possibly do. It turns out that sending network messages is actually even more intensive, which I'm still unhappy about, but that's the way it is.

The current status is that the code is fully open source at Eclipse OpenJ9 and the project it builds on, Eclipse OMR. It's now been merged into our master branch; that's new for this iteration of the presentation, it wasn't true before. All the code is in our master branch, but we're not building it into the binaries at AdoptOpenJDK by default quite yet.

We've introduced some simple options which lend themselves really well to any kind of Java workload deployment. Whether you want to start the server or the client, you're still just running Java. There's an option, StartAsJITServer, which causes the JVM to start up as a JIT server, predictably; you tell it which port to listen on, it listens on that port, and it starts like that, very quickly. If you're running a client application, you use the option UseJITServer, give it the server's address and port, plus whatever command-line options you'd like to run your Java application with, and that's it; it will send all of its compiles to the server.
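A sketch of those two invocations (the option names follow the talk; the exact -XX: spelling and defaults may differ between OpenJ9 releases, and the port, hostname, and jar name are placeholders):

```shell
# Start one JVM as a JIT compilation server listening on a port.
java -XX:StartAsJITServer -XX:JITServerPort=38400

# Run the application as a client that ships its compiles to that server.
java -XX:+UseJITServer \
     -XX:JITServerAddress=jitserver.example.com \
     -XX:JITServerPort=38400 \
     -jar myapp.jar
```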

Our current focus is on ensuring the stability of the code base so that we can turn it on by default in our OpenJ9 builds, and we're hoping that will happen in our very next release in early 2020, the 0.18 release. Because OpenJ9 is a JVM built from the same code base for every JDK release, that means you'll be able to run JIT servers with JDK8, JDK11, and JDK13 at that point, and actually, one JIT server can handle all three of those.

We're really just at the beginning of this process with JIT servers; I think there's some really cool stuff that can happen here. Our primary focus so far has been on implementing the mechanics to move the JIT compile workload into a separate process. Once it's there, there are actually a lot of very interesting things you can do. You can figure out how to do that work more efficiently, so that you can use one server to serve a whole bunch of different clients. If you have n JVMs, you shouldn't have to do n times the compile workload of one JVM; you should be able to do that more efficiently. Again, it's that "How many times do I have to compile String.equals?" question.

That's obviously a very good fit for the current trend towards microservices, where you've got lots and lots of JVMs that you want to run in small footprints, so it's a really good fit to take the JIT out of all of those things and compile across them. We can start using classification algorithms to try to categorize those JVMs automatically, so you don't have to do very much. There's no user effort in figuring this out; we can make it just work, the same way JITs work, where nobody knows how to disable them because they're so ubiquitous and great.

We can even optimize groups of microservices together. I think there are some really exciting opportunities to use this in, say, CI/CD pipelines. You can imagine even your development experience being tied in: "I know which methods changed in this pull request," so communicate that to the JIT server so it knows exactly which compiles it can't reuse. Then it can even start compiling those methods while you're waiting to run your tests, and your whole pipeline gets accelerated by this JIT server sitting on the side, communicating back and forth.

It's even a good place to collect information. Imagine all the profile data, all the class information, everything that's inside the JVM, now sitting someplace I can independently query, and I could find a way to present that data, maybe even in the IDE. You could start feeding back some of that information about how applications are running into your IDE, because that server can live longer than your application does; it doesn't have to be gone when the application is.

Wrapping Up

Quickly wrapping up: JITs continue to provide the best peak performance, and there are some really good opportunities to do even cooler things. AOT compilers, as we talked about, are very interesting technology; they can improve start-up performance dramatically, but there are some steady state performance issues with them and some serious usability issues if you were to use AOT by itself. That's why in OpenJ9 we don't use AOT by itself; we use AOT in combination with the JIT, now with caching of our JIT compiles, and now starting to go to JIT servers. You can get to within 5% to 10% of peak JIT performance with a caching JIT, with excellent start-up and ramp-up, even for very large JakartaEE applications. I think there's still room to improve throughput, start-up, and footprint without sacrificing compliance, without having to go to a closed-world model.

It would be very interesting to see if there are any intermediate steps between full Java compliance and definitely-not-Java-compliant. There's a spectrum in between, and I think it'd be interesting to see what solutions we can build in that space without having to completely sacrifice Java compliance. JIT servers are coming, hopefully built into AdoptOpenJDK in early 2020. If you haven't tried Eclipse OpenJ9 yet, I don't know what you're waiting for: go to adoptopenjdk.net and you'll be presented with a page like that. Make sure you pick OpenJ9 on the right-hand side, try it out, and let us know how it goes.