This post is all about speculative compilation, or just speculation for short, in the context of the JavaScriptCore virtual machine. Speculative compilation is ideal for making dynamic languages, or any language with enough dynamic features, run faster. In this post, we will look at speculation for JavaScript. Historically, this technique or closely related variants has been applied successfully to Smalltalk, Self, Java, .NET, Python, and Ruby, among others. Starting in the 90’s, intense benchmark-driven competition between many Java implementations helped to create an understanding of how to build speculative compilers for languages with small amounts of dynamism. Despite being a lot more dynamic than Java, the JavaScript performance war that started in the naughts has generally favored increasingly aggressive applications of the same speculative compilation tricks that worked great for Java. It seems like speculation can be applied to any language implementation that uses runtime checks that are hard to reason about statically.

This is a long post that tries to demystify a complex topic. It’s based on a two hour compiler lecture (slides also available in PDF). We assume some familiarity with compiler concepts like intermediate representations (especially Static Single Assignment Form, or SSA for short), static analysis, and code generation. The intended audience is anyone wanting to understand JavaScriptCore better, or anyone thinking about using these techniques to speed up their own language implementation. Most of the concepts described in this post are not specific to JavaScript and this post doesn’t assume prior knowledge about JavaScriptCore.

Before going into the details of speculation, we’ll provide an overview of speculation and an overview of JavaScriptCore. This will help provide context for the main part of this post, which describes speculation by breaking it down into five parts: bytecode (the common IR), control, profiling, compilation, and OSR (on stack replacement). We conclude with a small review of related work.

Overview of Speculation

The intuition behind speculation is to leverage traditional compiler technology to make dynamic languages as fast as possible. Construction of high-performance compilers is a well-understood art, so we want to reuse as much of that as we can. But we cannot do this directly for a language like JavaScript because the lack of type information means that the compiler can’t do meaningful optimizations for any of the fundamental operations (even things like + or == ). Speculative compilers use profiling to infer types dynamically. The generated code uses dynamic type checks to validate the profiled types. If the program uses a type that is different from what we profiled, we throw out the optimized code and try again. This lets the optimizing compiler work with a statically typed representation of the dynamically typed program.

Types are a major theme of this post even though the techniques we are describing are for implementing dynamically typed languages. When languages include static types, it can be to provide safety properties for the programmer or to help give an optimizing compiler leverage. We are only interested in types for performance and the speculation strategy in JavaScriptCore can be thought of in broad strokes as inferring the kinds of types that a C program would have, but using an internal type system purpose built for our optimizing compiler. More generally, the techniques described in this post can be used to enable any kind of profile-guided optimizations, including ones that aren’t related to types. But both this post and JavaScriptCore focus on the kind of profiling and speculation that is most natural to think if as being about type (whether a variable is an integer, what object shapes a pointer points to, whether an operation has effects, etc).

To dive into this a bit deeper, we first consider the impact of types. Then we look at how speculation gives us types.

Impact of Types

We want to give dynamically typed languages the kind of optimizing compiler pipeline that would usually be found in ahead-of-time compilers for high-performance statically typed languages like C. The input to such an optimizer is typically some kind of internal representation (IR) that is precise about the type of each operation, or at least a representation from which the type of each operation can be inferred.

To understand the impact of types and how speculative compilers deal with them, consider this C function:

int foo(int a, int b) { return a + b; }

In C, types like int are used to describe variables, arguments, return values, etc. Before the optimizing compiler has a chance to take a crack at the above function, a type checker fills in the blanks so that the + operation will be represented using an IR instruction that knows that it is adding 32-bit signed integers (i.e. int s). This knowledge is essential:

Type information tells the compiler’s code generator how to emit code for this instruction. We know to use integer addition instructions (not double addition or something else) because of the int type.

type. Type information tells the optimizer how to allocate registers for the inputs and outputs. Integers mean using general purpose registers. Floating point means using floating point registers.

Type information tells the optimizer what optimizations are possible for this instruction. Knowing exactly what it does allows us to know what other operations can be used in place of it, allows us to do some algebraic reasoning about the math the program is doing, and allows us to fold the instruction to a constant if the inputs are constants. If there are types for which + has effects (like in C++), then the fact that this is an integer + means that it’s pure. Lots of compiler optimizations that work for + would not work if it wasn’t pure.

Now consider the same program in JavaScript:

function foo(a, b) { return a + b; }

We no longer have the luxury of types. The program doesn’t tell us the types of a or b . There is no way that a type checker can label the + operation as being anything specific. It can do a bunch of different things based on the runtime types of a and b :

It might be a 32-bit integer addition.

It might be a double addition.

It might be a string concatenation.

It might be a loop with method calls. Those methods can be user-defined and may perform arbitrary effects. This’ll happen if a or b are objects.

Branch

if

Figure 1. The best that a nonspeculative compiler can do if given a JavaScript plus operation. This figure depicts a control flow graph as a compiler like JavaScriptCore’s DFG might see. Theoperation is like anand has outgoing edges for the then/else outcomes of the condition.

Based on this, it’s not possible for an optimizer to know what to do. Instruction selection means emitting either a function call for the whole thing or an expensive control flow subgraph to handle all of the various cases (Figure 1). We won’t know which register file is best for the inputs or results; we’re likely to go with general purpose registers and then do additional move instructions to get the data into floating point registers in case we have to do a double addition. It’s not possible to know if one addition produces the same results as another, since they have loops with effectful method calls. Anytime a + happens we have to allow for the the possibility that the whole heap might have been mutated.

In short, it’s not practical to use optimizing compilers for JavaScript unless we can somehow provide types for all of the values and operations. For those types to be useful, they need to help us avoid basic operations like + seeming like they require control flow or effects. They also need to help us understand which instructions or register files to use. Speculative compilers get speed-ups by applying this kind of reasoning to all of the dynamic operations in a language — ranging from those represented as fundamental operations (like + or memory accesses like o.f and o[i] ) to those that involve intrinsics or recognizable code patterns (like calling Function.prototype.apply ).

Speculated Types

This post focuses on those speculations where the collected information can be most naturally understood as type information, like whether or not a variable is an integer and what properties a pointed-to object has (and in what order). Let’s appreciate two aspects of this more deeply: when and how the profiling and optimization happen and what it means to speculate on type.

Figure 2. Optimizing compilers for C and JavaScript.

Let’s consider what we mean by speculative compilation for JavaScript. JavaScript implementations pretend to be interpreters; they accept JS source as input. But internally, these implementations use a combination of interpreters and compilers. Initially, code starts out running in an execution engine that does no speculative type-based optimizations but collects profiling about types. This is usually an interpreter, but not always. Once a function has a satisfactory amount of profiling, the engine will start an optimizing compiler for that function. The optimizing compiler is based on the same fundamentals as the one found in a C compiler, but instead of accepting types from a type checker and running as a command-line tool, here it accepts types from a profiler and runs in a thread in the same process as the program it’s compiling. Once that compiler finishes emitting optimized machine code, we switch execution of that function from the profiling tier to the optimized tier. Running JavaScript code has no way of observing this happening to itself except if it measures execution time. (However, the environment we use for testing JavaScriptCore includes many hooks for introspecting what has been compiled.) Figure 2 illustrates how and when profiling and optimization happens when running JavaScript.

Roughly, speculative compilation means that our example function will be transformed to look something like this:

function foo(a, b) { speculate(isInt32(a)); speculate(isInt32(b)); return a + b; }

The tricky thing is what exactly it means to speculate. One simple option is what we call diamond speculation. This means that every time that we perform an operation, we have a fast path specialized for what the profiler told us and a slow path to handle the generic case:

if (is int) int add else Call(slow path)

To see how that plays out, let’s consider a slightly different example:

var tmp1 = x + 42; ... // things var tmp2 = x + 100;

Here, we use x twice, both times adding it to a known integer. Let’s say that the profiler tells us that x is an integer but that we have no way of proving this statically. Let’s also say that x ‘s value does not change between the two uses and we have proved that statically.

x

Figure 3. Diamond speculation thatis an integer.

Figure 3 shows what happens if we speculate on the fact that x is an integer using a diamond speculation: we get a fast path that does the integer addition and a slow path that bails out to a helper function. Speculations like this can produce modest speed-ups at modest cost. The cost is modest because if the speculation is wrong, only the operations on x pay the price. The trouble with this approach is that repeated uses of x must recheck whether it is an integer. The rechecking is necessary because of the control flow merge that happens at the things block and again at more things.

The original solution to this problem was splitting, where the region of the program between things and more things would get duplicated to avoid the branch. An extreme version of this is tracing, where the entire remainder of a function is duplicated after any branch. The trouble with these techniques is that duplicating code is expensive. We want to minimize the number of times that the same piece of code is compiled so that we can compile a lot of code quickly. The closest thing to splitting that JavaScriptCore does is tail duplication, which optimizes diamond speculations by duplicating the code between them if that code is tiny.

A better alternative to diamond speculations or splitting is OSR (on stack replacement). When using OSR, a failing type check exits out of the optimized function back to the equivalent point in the unoptimized code (i.e. the profiling tier’s version of the function).

x

Figure 4. OSR speculation thatis an integer.

Figure 4 shows what happens when we speculate that x is an integer using OSR. Because there is no control flow merge between the case where x is an int and the case where it isn’t, the second check becomes redundant and can be eliminated. The lack of a merge means that the only way to reach the second check is if the first check passed.

OSR speculations are what gives our traditional optimizing compiler its static types. After any OSR-based type check, the compiler can assume that the property that was checked is now fact. Moreover, because OSR check failure does not affect semantics (we exit to the same point in the same code, just with fewer optimizations), we can hoist those checks as high as we want and infer that a variable always has some type simply by guarding all assignments to it with the corresponding type check.

Note that what we call OSR exit in this post and in JavaScriptCore is usually called deoptimization elsewhere. We prefer to use the term OSR exit in our codebase because it emphasizes that the point is to exit an optimized function using an exotic technique (OSR). The term deoptimization makes it seem like we are undoing optimization, which is only true in the narrow sense that a particular execution jumps from optimized code to unoptimized code. For this post we will follow the JavaScriptCore jargon.

JavaScriptCore uses OSR or diamond speculations depending on our confidence that the speculation will be right. OSR speculation has higher benefit and higher cost: the benefit is higher because repeated checks can be eliminated but the cost is also higher because OSR is more expensive than calling a helper function. However, the cost is only paid if the exit actually happens. The benefits of OSR speculation are so superior that we focus on that as our main speculation strategy, with diamond speculation being the fallback if our profiling indicates lack of confidence in the speculation.

Figure 5. Speculating with OSR and exiting to bytecode.

OSR-based speculation relies on the fact that traditional compilers are already good at reasoning about side exits. Trapping instructions (like for null check optimization in Java virtual machines), exceptions, and multiple return statements are all examples of how compilers already support exiting from a function.

Assuming that we use bytecode as the common language shared between the unoptimizing profiled tier of execution and the optimizing tier, the exit destinations can just be bytecode instruction boundaries. Figure 5 shows how this might work. The machine code generated by the optimizing compiler contains speculation checks against unlikely conditions. The idea is to do lots of speculations. For example, the prologue (the enter instruction in the figure) may speculate about the types of the arguments — that’s one speculation per argument. An add instruction may speculate about the types of its inputs and about the result not overflowing. Our type profiling may tell us that some variable tends to always have some type, so a mov instruction whose source is not proved to have that type may speculate that the value has that type at runtime. Accessing an array element (what we call get_by_val ) may speculate that the array is really an array, that the index is an integer, that the index is in bounds, and that the value at the index is not a hole (in JavaScript, loading from a never assigned array element means walking the array’s prototype chain to see if the element can be found there — something we avoid doing most of the time by speculating that we don’t have to). Calling a function may speculate that the callee is the one we expected or at least that it has the appropriate type (that it’s something we can call).

While exiting out of a function is straightforward without breaking fundamental assumptions in optimizing compilers, entering turns out to be super hard. Entering into a function somewhere other than at its primary entrypoint pessimises optimizations at any merge points between entrypoints. If we allowed entering at every bytecode instruction boundary, this would negate the benefits of OSR exit by forcing every instruction boundary to make worst-case assumptions about type. Even allowing OSR entry just at loop headers would break lots of loop optimizations. This means that it’s generally not possible to reenter optimized execution after exiting. We only support entry in cases where the reward is high, like when our profiler tells us that a loop has not yet terminated at the time of compilation. Put simply, the fact that traditional compilers are designed for single-entry multiple-exit procedures means that OSR entry is hard but OSR exit is easy.

JavaScriptCore and most speculative compilers support OSR entry at hot loops, but since it’s not an essential feature for most applications, we’ll leave understanding how we do it as an exercise for the reader.

Figure 6. Speculation broken into the five topics of this post.

The main part of this post describes speculation in terms of its five components (Figure 6): the bytecode, or common IR, of the virtual machine that allows for a shared understanding about the meaning of profiling and exit sites between the unoptimized profiling tier and the optimizing tier; the unoptimized profiling tier that is used to execute functions at start-up, collect profiling about them, and to serve as an exit destination; the control system for deciding when to invoke the optimizing compiler; the optimizing tier that combines a traditional optimizing compiler with enhancements to support speculation based on profiling; and the OSR exit technology that allows the optimizing compiler to use the profiling tier as an exit destination when speculation checks fail.

Overview of JavaScriptCore

Figure 7. The tiers of JavaScriptCore.

JavaScriptCore embraces the idea of tiering and has four tiers for JavaScript (and three tiers for WebAssembly, but that’s outside the scope of this post). Tiering has two benefits: the primary benefit, described in the previous section, of enabling speculation; and a secondary benefit of allowing us to fine-tune the throughput-latency tradeoff on a per-function basis. Some functions run for so short — like straight-line run-once initialization code — that running any compiler on those functions would be more expensive than interpreting them. Some functions get invoked so frequently, or have such long loops, that their total execution time far exceeds the time to compile them with an aggressive optimizing compiler. But there are also lots of functions in the grey area in between: they run for not enough time to make an aggressive compiler profitable, but long enough that some intermediate compiler designs can provide speed-ups. JavaScriptCore has four tiers as shown in Figure 7:

The LLInt, or low-level interpreter, which is an interpreter that obeys JIT compiler ABI. It runs on the same stack as the JITs and uses a known set of registers and stack locations for its internal state.

The Baseline JIT, also known as a bytecode template JIT, which emits a template of machine code for each bytecode instruction without trying to reason about relationships between multiple instructions in the function. It compiles whole functions, which makes it a method JIT. Baseline does no OSR speculations but does have a handful of diamond speculations based on profiling from the LLInt.

The DFG JIT, or data flow graph JIT, which does OSR speculation based on profiling from the LLInt, Baseline, and in some rare cases even using profiling data collected by the DFG JIT and FTL JIT. It may OSR exit to either baseline or LLInt. The DFG has a compiler IR called DFG IR, which allows for sophisticated reasoning about speculation. The DFG avoids doing expensive optimizations and makes many compromises to enable fast code generation.

The FTL JIT, or faster than light JIT, which does comprehensive compiler optimizations. It’s designed for peak throughput. The FTL never compromises on throughput to improve compile times. This JIT reuses most of the DFG JIT’s optimizations and adds lots more. The FTL JIT uses multiple IRs (DFG IR, DFG SSA IR, B3 IR, and Assembly IR).

An ideal example of this in action is this program:

"use strict"; let result = 0; for (let i = 0; i < 10000000; ++i) { let o = {f: i}; result += o.f; } print(result);

Thanks to the object allocation inside the loop, it will run for a long time until the FTL JIT can compile it. The FTL JIT will kill that allocation, so then the loop finishes quickly. The long running time before optimization virtually guarantees that the FTL JIT will take a stab at this program’s global function. Additionally, because the function is clean and simple, all of our speculations are right and there are no OSR exits.

Figure 8. Example timeline of a simple long loop executing in JavaScriptCore. Execution times recorded on my computer one day.

Figure 8 shows the timeline of this benchmark executing in JavaScriptCore. The program starts executing in the LLInt. After about a thousand loop iterations, the loop trigger causes us to start a baseline compiler thread for this code. Once that finishes, we do an OSR entry into the baseline JITed code at the for loop’s header. The baseline JIT also counts loop iterations, and after about a thousand more, we spawn the DFG compiler. The process repeats until we are in the FTL. When I measured this, I found that the DFG compiler needs about 4× the time of the baseline compiler, and the FTL needs about 6× the time of the DFG. While this example is contrived and ideal, the basic idea holds for any JavaScript program that runs long enough since all tiers of JavaScriptCore support the full JavaScript language.

Figure 9. JavaScriptCore tier architecture.

JavaScriptCore is architected so that having many tiers is practical. Figure 9 illustrates this architecture. All tiers share the same bytecode as input. That bytecode is generated by a compiler pipeline that desugars many language features, such as generators and classes, among others. In many cases, it’s possible to add new language features just by modifying the bytecode generation frontend. Once linked, the bytecode can be understood by any of the tiers. The bytecode can be interpreted by the LLInt directly or compiled with the baseline JIT, which mostly just converts each bytecode instruction into a preset template of machine code. The LLInt and Baseline JIT share a lot of code, mostly in the slow paths of bytecode instruction execution. The DFG JIT converts bytecode to its own IR, the DFG IR, and optimizes it before emitting code. In many cases, operations that the DFG chooses not to speculate on are emitted using the same code generation helpers as the Baseline JIT. Even operations that the DFG does speculate on often share slow paths with the Baseline JIT. The FTL JIT reuses the DFG’s compiler pipeline and adds new optimizations to it, including multiple new IRs that have their own optimization pipelines. Despite being more sophisticated than the DFG or Baseline, the FTL JIT shares slow path implementations with those JITs and in some cases even shares code generation for operations that we choose not to speculate on. Even though the various tiers try to share code whenever possible, they aren’t required to. Take the get_by_val (access an array element) instruction in bytecode. This has duplicate definitions in the bytecode liveness analysis (which knows the liveness rules for get_by_val ), the LLInt (which has a very large implementation that switches on a bunch of the common array types and has good code for all of them), the Baseline (which uses a polymorphic inline cache), and the DFG bytecode parser. The DFG bytecode parser converts get_by_val to the DFG IR GetByVal operation, which has separate definitions in the DFG and FTL backends as well as in a bunch of phases that know how to optimize and model GetByVal. The only thing that keeps those implementations in agreement is good convention and extensive testing.

To give a feeling for the relative throughput of the various tiers, I’ll share some informal performance data that I’ve gathered over the years out of curiosity.

Figure 10. Relative performance of the four tiers on JetStream 2 on my computer at the time of that benchmark’s introduction.

We’re going to use the JetStream 2 benchmark suite since that’s the main suite that JavaScriptCore is tuned for. Let’s first consider an experiment where we run JetStream 2 with the tiers progressively enabled starting with the LLInt. Figure 10 shows the results: the Baseline and DFG are more than 2× better than the tier below them and the FTL is 1.1× better than the DFG.

The FTL’s benefits may be modest but they are unique. If we did not have the FTL, we would have no way of achieving the same peak throughput. A great example is the gaussian-blur subtest. This is the kind of compute test that the FTL is built for. I managed to measure the benchmark’s performance when we first introduced it and did not yet have a chance to tune for it. So, this gives a glimpse of the speed-ups that we expect to see from our tiers for code that hasn’t yet been through the benchmark tuning grind. Figure 11 shows the results. All of the JITs achieve spectacular speed-ups: Baseline is 3× faster than LLInt, DFG is 6× faster than Baseline, and FTL is 1.6× faster than DFG.

Figure 11. Relative performance of the four tiers on the guassian-blur subtest of JetStream 2.

The DFG and FTL complement one another. The DFG is designed to be a fast-running compiler and it achieves this by excluding the most aggressive optimizations, like global register allocation, escape analysis, loop optimizations, or anything that needs SSA. This means that the DFG will always get crushed on peak throughput by compilers that have those features. It’s the FTL’s job to provide those optimizations if a function runs long enough to warrant it. This ensures that there is no scenario where a hypothetical competing implementation could outperform us unless they had the same number of tiers. If you wanted to make a compiler that compiles faster than the FTL then you’d lose on peak throughput, but if you wanted to make a compiler that generates better code than the DFG then you’d get crushed on start-up times. You need both to stay in the game.

Another way of looking at the performance of these tiers is to ask: how much time does a bytecode instruction take to execute in each of the tiers on average? This tells us just about the throughput that a tier achieves without considering start-up at all. This can be hard to estimate, but I made an attempt at it by repeatedly running each JetStream 2 benchmark and having it limit the maximum tier of each function at random. Then I employed a stochastic counting mechanism to get an estimate of the number of bytecode instructions executed at each tier in each run. Combined with the execution times of those runs, this gave a simple linear regression problem of the form:

ExecutionTime = (Latency of LLInt) * (Bytecodes in LLInt) + (Latency of Baseline) * (Bytecodes in Baseline) + (Latency of DFG) * (Bytecodes in DFG) + (Latency of FTL) * (Bytecodes in FTL)

Where the Latency of LLInt means the average amount of time it takes to execute a bytecode instruction in LLInt.

After excluding benchmarks that spent most of their time outside JavaScript execution (like regexp and wasm benchmarks) and fiddling with how to weight benchmarks (I settled on solving each benchmarks separately and computing geomean of the coefficients since this matches JetStream 2 weighting), the solution I arrived at was:

Execution Time = (3.97 ns) * (Bytecodes in LLInt) + (1.71 ns) * (Bytecodes in Baseline) + (.349 ns) * (Bytecodes in DFG) + (.225 ns) * (Bytecodes in FTL)

In other words, Baseline executes code about 2× faster than LLInt, DFG executes code about 5× faster than Baseline, and the FTL executes code about 1.5× faster than DFG. Note how this data is in the same ballpark as what we saw for gaussian-blur. That makes sense since that was a peak throughput benchmark.

Although this isn’t a garbage collection blog post, it’s worth understanding a bit about how the garbage collector works. JavaScriptCore picks a garbage collection strategy that makes the rest of the virtual machine, including all of the support for speculation, easier to implement. The garbage collector has the following features that make speculation easier:

The collector scans the stack conservatively. This means that compilers don’t have to worry about how to report pointers to the collector.

The collector doesn’t move objects. This means that if a data structure (like the compiler IR) has many possible ways of referencing some object, we only have to report one of them to the collector.

The collector runs to fixpoint. This makes it possible to invent precise rules for whether objects created by speculation should be kept alive.

The collector’s object model is expressed in C++. JavaScript objects look like C++ objects, and JS object pointers look like C++ pointers.

These features make the compiler and runtime easier to write, which is great, since speculation requires us to write a lot of compiler and runtime code. JavaScript is a slow enough language even with the optimizations we describe in this post that garbage collector performance is rarely the longest pole in the tent. Therefore, our garbage collector makes many tradeoffs to make it easier to work on the performance-critical parts of our engine (like speculation). It would be unwise, for example, to make it harder to implement some compiler optimization as a way of getting a small garbage collector optimization, since the compiler has a bigger impact on performance for typical JavaScript programs.

To summarize: JavaScriptCore has four tiers, two of which do speculative optimizations, and all of which participate in the collection of profiling. The first two tiers are an interpreter and bytecode template JIT while the last two are optimizing compilers tuned for different throughput-latency trade-offs.

Speculative Compilation

Now that we’ve established some basic background about speculation and JavaScriptCore, this section goes into the details. First we will discuss JavaScriptCore’s bytecode. Then we show the control system for launching the optimizing compiler. Next will be a detailed section about how JavaScriptCore’s profiling tiers work, which focuses mostly on how they collect profiling. Finally we discuss JavaScriptCore’s optimizing compilers and their approach to OSR.

Bytecode

Speculation requires having a profiling tier and an optimizing tier. When the profiling tier reports profiling, it needs to be able to say what part of the code that profiling is for. When the optimizing compiler wishes to compile an OSR exit, it needs to be able to identify the exit site in a way that both tiers understand. To solve both issues, we need a common IR that is:

Used by all tiers as input.

Persistent for as long as the function that it represents is still live.

Immutable (at least for those parts that all tiers are interested in).

In this post, we will use bytecode as the common IR. This isn’t required; abstract syntax trees or even SSA could work as a common IR. We offer some insights into how we designed our bytecode for JavaScriptCore. JavaScriptCore’s bytecode is register-based, compact, untyped, high-level, directly interpretable, and transformable.

Our bytecode is register-based in the sense that operations tend to be written as:

add result, left, right

Which is taken to mean:

result = left + right

Where result , left , and right are virtual registers. Virtual registers may refer to locals, arguments, or constants in the constant pool. Functions declare how many locals they need. Locals are used both for named variables (like var , let , or const variables) and temporaries arising from expression tree evaluation.

Our bytecode is compact: each opcode and operand is usually encoded as one byte. We have wide prefixes to allow 16-bit or 32-bit operands. This is important since JavaScript programs can be large and the bytecode must persist for as long as the function it represents is still live.

Our bytecode is untyped. Virtual registers never have static type. Opcodes generally don’t have static type except for the few opcodes that have a meaningful type guarantee on their output (for example, the | operator always returns int32, so our bitor opcode returns int32). This is important since the bytecode is meant to be a common source of truth for all tiers. The profiling tier runs before we have done type inference, so the bytecode can’t have any more types than the JavaScript language.

Our bytecode is almost as high-level as JavaScript. While we use desugaring for many JavaScript features, we only do that when implementation by desugaring isn’t thought to cost performance. So, even the “fundamental” features of our bytecode are high level. For example, the add opcode has all of the power of the JavaScript + operator, including that it might mean a loop with effects.

Our bytecode is directly interpretable. The same bytecode stream that the interpreter executes is the bytecode stream that we will save in the cache (to skip parsing later) and feed to the compiler tiers.

Finally, our bytecode is transformable. Normally, intermediate representations use a control flow graph and make it easy to insert and remove instructions. That’s not how bytecode works: it’s an array of instructions encoded using a nontrivial variable-width encoding. But we do have a bytecode editing API and we use it for generatorification (our generator desugaring bytecode-to-bytecode pass). We can imagine this facility also being useful for other desugarings or for experimenting with bytecode instrumentation.

Compared to non-bytecode IRs, the main advantages of bytecode are that it’s easy to:

Identify targets for OSR exit. OSR exit in JavaScriptCore requires entering into an unoptimized bytecode execution engine (like an interpreter) at some arbitrary bytecode instruction. Using bytecode instruction index as a way of naming an exit target is intuitive since it’s just an integer.

Compute live state at exit. Register-based bytecode tends to have dense register numberings so it’s straightforward to analyze liveness using bitvectors. That tends to be fast and doesn’t require a lot of memory. It’s practical to cache the results of bytecode liveness analysis, for example.

JavaScriptCore’s bytecode format is independently implemented by the execution tiers. For example, the baseline JIT doesn’t try to use the LLInt to create its machine code templates; it just emits those templates itself and doesn’t try to match the LLInt exactly (the behavior is identical but the implementation isn’t). The tiers do share a lot of code – particularly for inline caches and slow paths – but they aren’t required to. It’s common for bytecode instructions to have algorithmically different implementations in the four tiers. For example the LLInt might implement some instruction with a large switch that handles all possible types, the Baseline might implement the same instruction with an inline cache that repatches based on type, and the DFG and FTL might try to do some combination of inline speculations, inline caches, and emitting a switch on all types. This exact scenario happens for add and other arithmetic ops as well as get_by_val / put_by_val . Allowing this independence allows each tier to take advantage of its unique properties to make things run faster. Of course, this approach also means that adding new bytecodes or changing bytecode semantics requires changing all of the tiers. For that reason, we try to implement new language features by desugaring them to existing bytecode constructs.

It’s possible to use any sensible IR as the common IR for a speculative compiler, including abstract syntax trees or SSA, but JavaScriptCore uses bytecode so that’s what we’ll talk about in the rest of this post.

Control

Speculative compilation needs a control system to decide when to run the optimizing compiler. The control system has to balance competing concerns: compiling functions as soon as it’s profitable, avoiding compiling functions that aren’t going to run long enough to benefit from it, avoiding compiling functions that have inadequate type profiling, and recompiling functions if a prior compilation did speculations that turned out to be wrong. This section describes JavaScriptCore’s control system. Most of the heuristics we describe were necessary, in our experience, to make speculative compilation profitable. Otherwise the optimizing compiler would kick in too often, not often enough, or not at the right rate for the right functions. This section describes the full details of JavaScriptCore’s tier-up heuristics because we suspect that to reproduce our performance, one would need all of these heuristics.

JavaScriptCore counts executions of functions and loops to decide when to compile. Once a function is compiled, we count exits to decide when to throw away compiled functions. Finally, we count recompilations to decide how much to back off from recompiling a function in the future.

Execution Counting

JavaScriptCore maintains an execution counter for each function. This counter gets incremented as follows:

Each call to the function adds 15 points to the execution counter.

Each loop execution adds 1 point to the execution counter.

We trigger tier-up once the counter reaches some threshold. Thresholds are determined dynamically. To understand our thresholds, first consider their static versions and then let’s look at how we modulate these thresholds based on other information.

LLInt→Baseline tier-up requires 500 points.

Baseline→DFG tier-up requires 1000 points.

DFG→FTL tier-up requires 100000 points.

Over the years we’ve found ways to dynamically adjust these thresholds based on other sources of information, like:

Whether the function got JITed the last time we encountered it (according to our cache). Let’s call this wasJITed .

. How big the function is. Let’s call this S . We use the number of bytecode opcodes plus operands as the size.

. We use the number of bytecode opcodes plus operands as the size. How many times it has been recompiled. Let’s call this R .

. How much executable memory is available. Let’s use M to say how much executable memory we have total, and U is the amount we estimate that we would use (total) if we compiled this function.

to say how much executable memory we have total, and is the amount we estimate that we would use (total) if we compiled this function. Whether profiling is “full” enough.

We select the LLInt→Baseline threshold based on wasJITed . If we don’t know (the function wasn’t in the cache) then we use the basic threshold, 500. Otherwise, if the function wasJITed then we use 250 (to accelerate tier-up) otherwise we use 2000. This optimization is especially useful for improving page load times.

Baseline→DFG and DFG→FTL use the same scaling factor based on S , R , M , and U . The scaling factor is defined as follows:

(0.825914 + 0.061504 * sqrt(S + 1.02406)) * pow(2, R) * M / (M - U)

We multiply this by 1000 for Baseline→DFG and by 100000 for DFG→FTL. Let’s break down what this scaling factor does:

First we scale by the square root of the size. The expression 0.825914 + 0.061504 * sqrt(S + 1.02406) gives a scaling factor that is between 1 and 2 for functions smaller than about 350 bytecodes, which we consider to be “easy” functions to compile. The scaling factor uses square root so it grows somewhat gently. We’ve also tried having the staling factor be linear, but that’s much worse. It is worth it to delay compilations of large functions a bit, but it’s not worth it to delay it too much. Note that the ideal delay doesn’t just have to do with the cost of compilation. It’s also about running long enough to get good profiling. Maybe there is some deep reason why square root works well here, but all we really care about is that scaling by this amount makes programs run faster.

Then we introduce exponential backoff based on the number of times that the function has been recompiled. The pow(2, R) expression means that each recompilation doubles the thresholds.

After that we introduce a hyperbolic scaling factor, M / (M - U) , to help avoid cases where we run out of executable memory altogether. This is important since some configurations of JavaScriptCore run with a small available pool of executable memory. This expression means that if we use half of executable memory then the thresholds are doubled. If we use 3/4 of executable memory then the thresholds are quadrupled. This makes filling up executable memory a bit like going at the speed of light: the math makes it so that as you get closer to filling it up the thresholds get closer to infinity. However, it’s worth noting that this is imperfect for truly large programs, since those might have other reasons to allocate executable memory not covered by this heuristic. The heuristic is also imperfect in cases of multiple things being compiled in parallel. Using this factor increases the maximum program size we can handle with small pools of executable memory, but it’s not a silver bullet.

Finally, if the execution count does reach this dynamically computed threshold, we check that some kinds of profiling (specifically, value and array profiling, discussed in detail in the upcoming profiling section) are full enough. We say that profiling is full enough if more than 3/4 of the profiling sites in the function have data. If this threshold is not met, we reset the execution counters. We let this process repeat five times. The optimizing compilers tend to speculate that unprofiled code is unreachable. This is profitable if that code really won’t ever run, but we want to be extra sure before doing that, hence we give functions with partial profiling 5× the time to warm up.

This is an exciting combination of heuristics! These heuristics were added early in the development of tiering in JSC. They were all added before we built the FTL, and the FTL inherited those heuristics just with a 100× multiplier. Each heuristic was added because it produced either a speed-up or a memory usage reduction or both. We try to remove heuristics that are not known to be speed-ups anymore, and to our knowledge, all of these still contribute to better performance on benchmarks we track.

Exit Counting

After we compile a function with the DFG or FTL, it’s possible that one of the speculations we made is wrong. This will cause the function to OSR exit back to LLInt or Baseline (we prefer Baseline, but may throw away Baseline code during GC, in which case exits from DFG and FTL will go to LLInt). We’ve found that the best way of dealing with a wrong speculation is to throw away the optimized code and try optimizing again later with better profiling. We detect if a DFG or FTL function should be recompiled by counting exits. The exit count thresholds are:

For a normal exit, we require 100 * pow(2, R) exits to recompile.

exits to recompile. If the exit causes the Baseline JIT to enter its loop trigger (i.e. we got stuck in a hot loop after exit), then it’s counted specially. We only allow 5 * pow(2, R) of those kinds of exits before we recompile. Note that this can mean exiting five times and tripping the loop optimization trigger each time or it can mean exiting once and tripping the loop optimization trigger five times.

The first step to recompilation is to jettison the DFG or FTL function. That means that all future calls to the function will call the Baseline or LLInt function instead.

Recompilation

If a function is jettisoned, we increment the recompilation counter ( R in our notation) and reset the tier-up functionality in the Baseline JIT. This means that the function will keep running in Baseline for a while (twice as long as it did before it was optimized last time). It will gather new profiling, which we will be able to combine with the profiling we collected before to get an even more accurate picture of how types behave in the function.

It’s worth looking at an example of this in action. We already showed an idealized case of tier-up in Figure 8, where a function gets compiled by each compiler exactly once and there are no OSR exits or recompilations. We will now show an example where things don’t go so well. This example is picked because it’s a particularly awful outlier. This isn’t how we expect our engine to behave normally. We expect amusingly bad cases like the following to happen occasionally since the success or failure of speculation is random and random behavior means having bad outliers.

_handlePropertyAccessExpression = function (result, node) { result.possibleGetOverloads = node.possibleGetOverloads; result.possibleSetOverloads = node.possibleSetOverloads; result.possibleAndOverloads = node.possibleAndOverloads; result.baseType = Node.visit(node.baseType, this); result.callForGet = Node.visit(node.callForGet, this); result.resultTypeForGet = Node.visit(node.resultTypeForGet, this); result.callForAnd = Node.visit(node.callForAnd, this); result.resultTypeForAnd = Node.visit(node.resultTypeForAnd, this); result.callForSet = Node.visit(node.callForSet, this); result.errorForSet = node.errorForSet; result.updateCalls(); }

This function belongs to the WSL subtest of JetStream 2. It’s part of the WSL compiler’s AST walk. It ends up being a large function after inlining Node.visit. When I ran this on my computer, I found that JSC did 8 compilations before hitting equilibrium for this function:

After running the function in LLInt for a bit, we compile this with Baseline. This is the easy part since Baseline doesn’t need to be recompiled. We compile with DFG. Unfortunately, the DFG compilation exits 101 times and gets jettisoned. The exit is due to a bad type check that the DFG emitted on this . We again compile with the DFG. This time, we exit twice due to a check on result . This isn’t enough times to trigger jettison and it doesn’t prevent tier-up to the FTL. We compile with the FTL. Unfortunately, this compilation gets jettisoned due to a failing watchpoint. Watchpoints (discussed in greater detail in later sections) are a way for the compiler to ask the runtime to notify it when bad things happen rather than emitting a check. Failing watchpoints cause immediate jettison. This puts us back in Baseline. We try the DFG again. We exit seven times due to a bad check on result , just like in step 3. This still isn’t enough times to trigger jettison and it doesn’t prevent tier-up to the FTL. We compile with the FTL. This time we exit 402 times due to a bad type check on node . We jettison and go back to Baseline. We compile with the DFG again. This time there are no exits. We compile with the FTL again. There are no further exits or recompilations.

This sequence of events has some intriguing quirks in addition to the number of compilations. Notice how in steps 3 and 5, we encounter exits due to a bad check on result , but none of the FTL compilations encounter those exits. This seems implausible since the FTL will do at least all of the speculations that the DFG did and a speculation that doesn’t cause jettison also cannot pessimise future speculations. It’s also surprising that the speculation that jettisons the FTL in step 6 wasn’t encountered by the DFG. It is possible that the FTL does more speculations than the DFG, but that usually only happens in inlined functions, and this speculation on node doesn’t seem to be in inlined code. A possible explanation for all of these surprising quirks is that the function is undergoing phase changes: during some parts of execution, it sees one set of types, and during another part of execution, it sees a somewhat different set. This is a common issue. Types are not random and they are often a function of time.

JavaScriptCore’s compiler control system is designed to get good outcomes both for functions where speculation “just works” and for functions like the one in this example that need some extra time. To summarize, control is all about counting executions, exits, and recompilations, and either launching a higher tier compiler (“tiering up”) or jettisoning optimized code and returning to Baseline.

Profiling

This section describes the profiling tiers of JavaScriptCore. The profiling tiers have the following responsibilities:

To provide a non-speculative execution engine. This is important for start-up (before we do any speculation) and for OSR exits. OSR exit needs to exit to something that does no speculation so that we don’t have chains of exits for the same operation.

To record useful profiling. Profiling is useful if it enables us to make profitable speculations. Speculations are profitable if doing them makes programs run faster.

In JavaScriptCore, the LLInt and Baseline are the profiling tiers while DFG and FTL are the optimizing tiers. However, DFG and FTL also collect some profiling, usually only when it’s free to do so and for the purpose of refining profiling collected by the profiling tiers.

This section is organized as follows. First we explain how JavaScriptCore’s profiling tiers execute code. Then we explain the philosophy of how to profile. Finally we go into the details of JavaScriptCore’s profiling implementation.

How Profiled Execution Works

JavaScriptCore profiles using the LLInt and Baseline tiers. LLInt interprets bytecode while Baseline compiles it. The two tiers share a nearly identical ABI so that it’s possible to jump from one to the other at any bytecode instruction boundary.

LLInt: The Low Level Interpreter

The LLInt is an interpreter that obeys JIT ABI (in the style of HotSpot‘s interpreter). To that end, it is written in a portable assembly language called offlineasm. Offlineasm has a functional macro language (you can pass macro closures around) embedded in it. The offlineasm compiler is written in Ruby and can compile to multiple CPUs as well as C++. This section tells the story of why this crazy design produces a good outcome.

The LLInt simultaneously achieves multiple goals for JavaScriptCore:

LLInt is JIT-friendly. The LLInt runs on the same stack that the JITs run on (which happens to be the C stack). The LLInt even agrees on register conventions with the JITs. This makes it cheap for LLInt to call JITed functions and vice versa. It makes LLInt→Baseline and Baseline→LLInt OSR trivial and it makes any JIT→LLInt OSR possible.

LLInt allows us to execute JavaScript code even if we can’t JIT. JavaScriptCore in no-JIT mode (we call it “mini mode”) has some advantages: it’s harder to exploit and uses less memory. Some JavaScriptCore clients prefer the mini mode. JSC is also used on CPUs that we don’t have JIT support for. LLInt works great on those CPUs.

LLInt reduces memory usage. Any machine code you generate from JavaScript is going to be big. Remember, there’s a reason why they call JavaScript “high level” and machine code “low level”: it refers to the fact that when you lower JavaScript to machine code, you’re going to get many instructions for each JavaScript expression. Having the LLInt means that we don’t have to generate machine code for all JavaScript code, which saves us memory.

LLInt starts quickly. LLInt interprets our bytecode format directly. It’s designed so that we could map bytecode from disk and point the interpreter at it. The LLInt is essential for achieving great page load time in the browser.

LLInt is portable. It can be compiled to C++.

It would have been natural to write the LLInt in C++, since that’s what most of JavaScriptCore is written in. But that would have meant that the interpreter would have a C++ stack frame constructed and controlled by the C++ compiler. This would have introduced two big problems:

It would be unclear how to OSR from the LLInt to the Baseline JIT or vice versa, since OSR would have to know how to decode and reencode a C++ stack frame. We don’t doubt that it’s possible to do this with enough cleverness, but it would create constraints on exactly how OSR works and it’s not an easy piece of machinery to maintain. JS functions running in the LLInt would have two stack frames instead of one. One of those stack frames would have to go onto the C++ stack (because it’s a C++ stack frame). We have multiple choices of how to manage the JS stack frame (we could try to alloca it on top of the C++ frame, or allocate it somewhere else) but this inevitably increases cost: calls into the interpreter would have to do twice the work. A common optimization to this approach is to have interpreter→interpreter calls reuse the same C++ stack frame by managing a separate JS stack on the side. Then you can have the JITs use that separate JS stack. This still leaves cost when calling out of interpreter to JIT or vice versa.

A natural way to avoid these problems is to write the interpreter in assembly. That’s basically what we did. But a JavaScript interpreter is a complex beast. It would be awful if porting JavaScriptCore to a new CPU meant rewriting the interpreter in another assembly language. Also, we want to use abstraction to write it. If we wrote it in C++, we’d probably have multiple functions, templates, and lambdas, and we would want all of them to be inlined. So we designed a new language, offlineasm, which has the following features:

Portable assembly with our own mnemonics and register names that match the way we do portable assembly in our JIT. Some high-level mnemonics require lowering. Offlineasm reserves some scratch registers to use for lowering.

The macro construct. It’s best to think of this as a lambda that takes some arguments and returns void. Then think of the portable assembly statements as print statements that output that assembly. So, the macros are executed for effect and that effect is to produce an assembly program. These are the execution semantics of offlineasm at compile time.

Macros allow us to write code with rich abstractions. Consider this example from the LLInt:

macro llintJumpTrueOrFalseOp(name, op, conditionOp) llintOpWithJump(op_%name%, op, macro (size, get, jump, dispatch) get(condition, t1) loadConstantOrVariable(size, t1, t0) btqnz t0, ~0xf, .slow conditionOp(t0, .target) dispatch() .target: jump(target) .slow: callSlowPath(_llint_slow_path_%name%) nextInstruction() end) end

This is a macro that we use for implementing both jtrue and jfalse and opcodes. There are only three lines of actual assembly in this listing: the btqnz (branch test quad not zero) and the two labels ( .target and .slow ). This also shows the use of first-class macros: on the second line, we call llintOpWithJump and pass it a macro closure as the third argument. The great thing about having a lambda-like construct like macro is that we don’t need much else to have a pleasant programming experience. The LLInt is written in about 5000 lines of offlineasm (if you only count the 64-bit version).

To summarize, LLInt is an interpreter written in offlineasm. LLInt understands JIT ABI so calls and OSR between LLInt and JIT are cheap. The LLInt allows JavaScriptCore to load code more quickly, use less memory, and run on more platforms.

Baseline: The Bytecode Template JIT

The Baseline JIT achieves a speed-up over the LLInt at the cost of some memory and the time it takes to generate machine code. Baseline’s speed-up is thanks to two factors:

Removal of interpreter dispatch. Interpreter dispatch is the costliest part of interpretation, since the indirect branches used for selecting the implementation of an opcode are hard for the CPU to predict. This is the primary reason why Baseline is faster than LLInt.

Comprehensive support for polymorphic inline caching. It is possible to do sophisticated inline caching in an interpreter, but currently our best inline caching implementation is the one shared by the JITs.

The Baseline JIT compiles bytecode by turning each bytecode instruction into a template of machine code. For example, a bytecode instruction like:

add loc6, arg1, arg2

Is turned into something like:

0x2f8084601a65: mov 0x30(%rbp), %rsi 0x2f8084601a69: mov 0x38(%rbp), %rdx 0x2f8084601a6d: cmp %r14, %rsi 0x2f8084601a70: jb 0x2f8084601af2 0x2f8084601a76: cmp %r14, %rdx 0x2f8084601a79: jb 0x2f8084601af2 0x2f8084601a7f: mov %esi, %eax 0x2f8084601a81: add %edx, %eax 0x2f8084601a83: jo 0x2f8084601af2 0x2f8084601a89: or %r14, %rax 0x2f8084601a8c: mov %rax, -0x38(%rbp)

The only parts of this code that would vary from one add instruction to another are the references to the operands. For example, 0x30(%rbp) (that’s x86 for the memory location at frame pointer plus 0x30) is the machine code representation of arg1 in bytecode.

The Baseline JIT does few optimizations beyond just emitting code templates. It does no register allocation between instruction boundaries, for example. The Baseline JIT does some local optimizations, like if an operand to a math operation is a constant, or by using profiling information collected by the LLInt. Baseline also has good support for code repatching, which is essential for implementing inline caching. We discuss inline caching in detail later in this section.

To summarize, the Baseline JIT is a mostly unoptimized JIT compiler that focuses on removing interpreter dispatch overhead. This is enough to make it a ~2× speed-up over the LLInt.

Profiling Philosophy

Profiling in JSC is designed to be cheap and useful.

JavaScriptCore’s profiling aims to incur little or no cost in the common case. Running with profiling turned on but never using the results to do optimizations should result in throughput that is about as good as if all of the profiling was disabled. We want profiling to be cheap because even in a long running program, lots of functions will only run once or for too short to make an optimizing JIT profitable. Some functions might finish running in less time than it takes to optimize them. The profiling can’t be so expensive that it makes functions like that run slower.

Profiling is meant to help the compiler make the kinds of speculations that cause the program to run faster when we factor in both the speed-ups from speculations that are right and the slow-downs from speculations that are wrong. It’s possible to understand this formally by thinking of speculation as a bet. We say that profiling is useful if it turns the speculation into a value bet. A value bet is one where the expected value (EV) is positive. That’s another way of saying that the average outcome is profitable, so if we repeated the bet an infinite number of times, we’d be richer. Formally the expected value of a bet is:

p * B - (1 - p) * C

Where p is the probability of winning, B is the benefit of winning, and C is the cost of losing (both B and C are positive). A bet is a value bet iff:

p * B - (1 - p) * C > 0

Let’s view speculation using this formula. The scenario in which we have the choice to make a bet or not is that we are compiling a bytecode instruction, we have some profiling that implies that we should speculate, and we have to choose whether to speculate or not. Let’s say that B and C both have to do with the latency, in nanoseconds, of executing a bytecode instruction once. B is the improvement to that latency if we do some speculation and it turns out to be right. C is the regression to that latency if the speculation we make is wrong. Of course, after we have made a speculation, it will run many times and may be right sometimes and wrong sometimes. But B is just about the speed-up in the right cases, and C is just about the slow-down in the wrong cases. The baseline relative to which B and C are measured is the latency of the bytecode instruction if it was compiled with an optimizing JIT but without that particular OSR-exit-based speculation.

For example, we may have a less-than operation, and we are considering whether to speculate that neither input is double. We can of course compile less-than without making that speculation, so that’s the baseline. If we do choose to speculate, then B is the speed-up to the average execution latency of that bytecode in those cases when neither input is double. Meanwhile, C is the slow-down to the average execution latency of that bytecode in those cases when at least one input is a double.

For B , let’s just compute some bounds. The lower bound is zero, since some speculations are not profitable. A pretty good first order upper bound for B is the difference in per-bytecode-instruction latency between the baseline JIT and the FTL. Usually, the full speed-up of a bytecode instruction between baseline to FTL is the result of multiple speculations as well as nonspeculative compiler optimizations. So, a single speculation being responsible for the full difference in performance between baseline and FTL is a fairly conservative upper bound for B . Previously, we said that on average in the JetStream 2 benchmark on my computer, a bytecode instruction takes 1.71 ns to execute in Baseline and .225 ns to execute in FTL. So we can say:

B <= 1.71 ns - .225 ns = 1.48 ns

Now let’s estimate C . C is how many more nanoseconds it takes to execute the bytecode instruction if we have speculated and we experience speculation failure. Failure means executing an OSR exit stub and then reexecuting the same bytecode instruction in baseline or LLInt. Then, all subsequent bytecodes in the function will execute in baseline or LLInt rather than DFG or FTL. Every 100 exits or so, we jettison and eventually recompile. Compiling is concurrent, but running a concurrent compiler is sure to slow down the main thread even if there is no lock contention. To fully capture C , we have to account for the cost of the OSR exit itself and then amortize the cost of reduced execution speed of the remainder of the function and the cost of eventual recompilation. Fortunately, it’s pretty easy to measure this directly by hacking the DFG frontend to randomly insert pointless OSR exits with low probability and by having JSC report a count of the number of exits. I did an experiment with this hack for every JetStream 2 benchmark. Running without the synthetic exits, we get an execution time and a count of the number of exits. Running with synthetic exits, we get a longer execution time and a larger number of exits. The slope between these two points is an estimate of C . This is what I found, on the same machine that I used for running the experiments to compute B :

[DFG] C = 2499 ns [FTL] C = 9998 ns

Notice how C is way bigger than B ! This isn’t some slight difference. We are talking about three orders of magnitude for the DFG and four orders of magnitude for the FTL. This paints a clear picture: speculation is a bet with tiny benefit and enormous cost.

For the DFG, this means that we need:

p > 0.9994

For speculation to be a value bet. p has to be even closer to 1 for FTL. Based on this, our philosophy for speculation is we won’t do it unless we think that:

p ~ 1

Since the cost of speculation failure is so enormous, we only want to speculate when we know that we won’t fail. The speed-up of speculation happens because we make lots of sure bets and only a tiny fraction of them ever fail.

It’s pretty clear what this means for profiling:

Profiling needs to focus on noting counterexamples to whatever speculations we want to do. We don’t want to speculate if profiling tells us that the counterexample ever happened, since if it ever happened, then the EV of this speculation is probably negative. This means that we are not interested in collecting probability distributions. We just want to know if the bad thing ever happened.

Profiling needs to run for a long time. It’s common to wish for JIT compilers to compile hot functions sooner. One reason why we don’t is that we need about 3-4 “nines” of confidence that that the counterexamples didn’t happen. Recall that our threshold for tiering up into the DFG is about 1000 executions. That’s probably not a coincidence.

Finally, since profiling is a bet, it’s important to approach it with a healthy gambler’s philosophy: the fact that a speculation succeeded or failed in a particular program does not tell us if the speculation is good or bad. Speculations are good or bad only based on their average behavior. Focusing too much on whether profiling does a good job for a particular program may result in approaches that cause it to perform badly on average.

Profiling Sources in JavaScriptCore

JavaScriptCore gathers profiling from multiple different sources. These profiling sources use different designs. Sometimes, a profiling source is a unique source of data, but other times, profiling sources are able to provide some redundant data. We only speculate when all profiling sources concur that the speculation would always succeed. The following sections describe our profiling sources in detail.

Case Flags

Case flags are used for branch speculation. This applies anytime the best way to implement a JS operation involves branches and multiple paths, like a math operation having to handle either integers or doubles. The easiest way to profile and speculate is to have the profiling tiers implement both sides of the branch and set a different flag on each side. That way, the optimizing tier knows that it can profitably speculate that only one path is needed if the flags for the other paths are not set. In cases where there is clearly a preferred speculation — for example, speculating that an integer add did not overflow is clearly preferred overspeculating that it did overflow — we only need flags on the paths that we don’t like (like the overflow path).

Let’s consider two examples of case flags in more detail: integer overflow and property accesses on non-object values.

Say that we are compiling an add operation that is known to take integers as inputs. Usually the way that the LLInt interpreter or Baseline compiler would “know” this is that the add operation we’ll talk about is actually the part of a larger add implementation after we’ve already checked that the inputs are integers. Here’s the logic that the profiling tier would use written as if it was C++ code to make it easy to parse:

int32_t left = ...; int32_t right = ...; ArithProfile* profile = ...; // This is the thing with the case flags. int32_t intResult; JSValue result; // This is a tagged JavaScript value that carries type. if (UNLIKELY(addOverflowed(left, right, &intResult))) { result = jsNumber(static_cast<double>(left) + static_cast<double>(right)); // Set the case flag indicating that overflow happened. profile->setObservedInt32Overflow(); } else result = jsNumber(intResult);

When optimizing the code, we will inspect the ArithProfile object for this instruction. If !profile->didObserveInt32Overflow() , we will emit something like:

int32_t left = ...; int32_t right = ...; int32_t result; speculate(!addOverflowed(left, right, &result));

I.e. we will add and branch to an exit on overflow. Otherwise we will just emit the double path:

double left = ...; double right = ...; double result = left + right;

Unconditionally doing double math is not that expensive; in fact on benchmarks that I’ve tried, it’s cheaper than doing integer math and checking overflow. The only reason why integers are profitable is that they are cheaper to use for bit operations and pointer arithmetic. Since CPUs don’t accept floats or doubles for bit and pointer math, we need to convert the double to an integer first if the JavaScript program uses it that way (pointer math arises when a number is used as an array index). Such conversions are relatively expensive even on CPUs that support them natively. Usually it’s hard to tell, using profiling or any static analysis, whether a number that a program computed will be used for bit or pointer math in the future. Therefore, it’s better to use integer math with overflow checks so that if the number ever flows into an operation that requires integers, we won’t have to pay for expensive conversions. But if we learn that any such operation overflows — even occasionally — we’ve found that it’s more efficient overall to unconditionally switch to double math. Perhaps the presence of overflows is strongly correlated with the result of those operations not being fed into bit math or pointer math.

A simpler example is how case flags are used in property accesses. As we will discuss in the inline caches section, property accesses have associated metadata that we use to track details about their behavior. That metadata also has flags, like the sawNonCell bit, which we set to true if the property access ever sees a non-object as the base. If the flag is set, the optimizing compilers know not to speculate that the property access will see objects. This typically forces all kinds of conservatism for that property access, but that’s better than speculating wrong and exiting in this case. Lots of case flags look like sawNonCell : they are casually added as a bit in some existing data structure to help the optimizing compiler know which paths were taken.

To summarize, case flags are used to record counterexamples to the speculations that we want to do. They are a natural way to implement profiling in those cases where the profiling tiers would have had to branch anyway.

Case Counts

A predecessor to case flags in JavaScriptCore is case counts. It’s the same idea as flags, but instead of just setting a bit to indicate that a bad thing happened, we would count. If the count never got above some threshold, we would speculate.

Case counts were written before we realized that the EV of speculation is awful unless the probability of success is basically 1. We thought that we could speculate in cases where we knew we’d be right a majority of the time, for example. Initial versions of case counts had variable thresholds — we would compute a ratio with the execution count to get a case rate. That didn’t work as well as fixed thresholds, so we switched to a fixed count threshold of 100. Over time, we lowered the threshold to 20 or 10, and then eventually found that the threshold should really be 1, at which point we switched to case flags.

Some functionality still uses case counts. We still have case counts for determining if the this argument is exotic (some values of this require the function to perform a possibly-effectful conversion in the prologue). We still have case counts as a backup for math operations overflowing, though that is almost certainly redundant with our case flags for math overflow. It’s likely that we will remove case counts from JavaScriptCore eventually.

Value Profiling

Value profiling is all about inferring the types of JavaScript values (JSValues). Since JS is a dynamic language, JSValues have a runtime type. We use a 64-bit JSValue representation that uses bit encoding tricks to hold either doubles, integers, booleans, null, undefined, or pointers to cell, which may be JavaScript objects, symbols, or strings. We refer to the act of encoding a value in a JSValue as boxing it and the act of decoding as unboxing (note that boxing is a term used in other engines to refer specifically to the act of allocating a box object in the heap to hold a value; our use of the term boxing is more like what others call tagging). In order to effectively optimize JavaScript, we need to have some way of inferring the type so that the compiler can assume things about it statically. Value profiling tracks the set of values that a particular program point saw so that we can predict what types that program point will see in the future.

Figure 12. Value profiling and prediction propagation for a sample data flow graph.

We combine value profiling with a static analysis called prediction propagation. The key insight is that prediction propagation can infer good guesses for the types for most operations if it is given a starting point for certain opaque operations:

Arguments incoming to the function.

Results of most load operations.

Results of most calls.

There’s no way that a static analysis running just on some function could guess what types loads from plain JavaScript arrays or calls to plain JavaScript functions could have. Value profiling is about trying to help the static analysis guess the types of those opaque operations. Figure 12 shows how this plays out for a sample data flow graph. There’s no way static analysis can tell the type of most GetByVal and GetById oerations, since those are loads from dynamically typed locations in the heap. But if we did know what those operations return then we can infer types for this entire graph by using simple type rules for Add (like that if it takes integers as inputs and the case flags tell us there was no overflow then it will produce integers).

Let’s break down value profiling into the details of how exactly values are profiled, how prediction propagation works, and how the results of prediction propagation are used.

Recording value profiles. At its core, value profiling is all about having some program point (either a point in the interpreter or something emitted by the Baseline JIT) log the value that it saw. We log values into a single bucket so that each time the profiling point runs, it overwrites the last seen value. The code looks like this in the LLInt:

macro valueProfile(op, metadata, value) storeq value, %op%::Metadata::profile.m_buckets[metadata] end

Let’s look at how value profiling works for the get_by_val bytecode instruction. Here’s part of the code for get_by_val in LLInt:

llintOpWithMetadata( op_get_by_val, OpGetByVal, macro (size, get, dispatch, metadata, return) macro finishGetByVal(result, scratch) get(dst, scratch) storeq result, [cfr, scratch, 8] valueProfile(OpGetByVal, t5, result) dispatch() end ... // more code for get_by_val

The implementation of get_by_val includes a finishGetByVal helper macro that stores the result in the right place on the stack and then dispatches to the next instruction. Note that it also calls valueProfile to log the result just before finishing.

Each ValueProfile object has a pair of buckets and a predicted type. One bucket is for normal execution. The valueProfile macro in the LLInt uses this bucket. The other bucket is for OSR exit: if we exit due to a speculation on a type that we got from value profiling, we feed the value that caused OSR exit back into the second bucket of the ValueProfile.

Each time that our execution counters (used for controlling when to invoke the next tier) count about 1000 points, the execution counting slow path updates all predicted types for the value profiles in that function. Updating value profiles means computing a predicted type for the value in the bucket and merging that type with the previously predicted type. Therefore, after repeated predicted type updates, the type will be broad enough to be valid for multiple different values that the code saw.

Predicted types use the SpeculatedType type system. A SpeculatedType is a 64-bit integer in which we use the low 40 bits to represent a set of 40 fundamental types. The fundamental types, shown in Figure 13, represent non-overlapping set of possible JSValues. 240 SpeculatedTypes are possible by setting any combination of bits.

Figure 13. All of the fundamental SpeculatedTypes.

This allows us to invent whatever types are useful for optimization. For example, we distinguish between 32-bit integers whose value is either 0 or 1 (BoolInt32) versus whose value is anything else (NonBoolInt32). Together these form the Int32Only type, which just has both bits set. BoolInt32 is useful for cases there integers are converted to booleans.

Prediction propagation. We use value profiling to fill in the blanks for the prediction propagation pass of the DFG compiler pipeline. Prediction propagation is an abstract interpreter that tracks the set of types that each variable in the program can have. It’s unsound since the types it produces are just predictions (it can produce any combination of types and at worst we will just OSR exit too much). However, it can be said that we optimize it to be sound; the more sound it is, the fewer OSR exits we have. Prediction propagation fills in the things that the abstract interpreter can’t reason about (loads from the heap, results returned by calls, arguments to the function, etc.) using the results of value profiling. On the topic of soundness, we would consider it to be a bug if the prediction propagation was unsound in a world where value profiling is never wrong. Of course, in reality, we know that value profiling will be wrong, so we know that prediction propagation is unsound.

Let’s consider some of the cases where prediction propagation can be sure about the result type of an operation based on the types of its inputs.

array[index]

Figure 14. Some of the prediction propagation rules for Add. This figure doesn’t show the rules for string concatenation and objects.Figure 15. Some of the prediction propagation rules for GetByVal (the DFG opcode for subscript access like). This figure only shows a small sample of the GetByVal rules.

Figure 14 shows some of the rules for the Add operation in DFG IR. Prediction propagation and case flags tell us everything we want to know about the output of Add. If the inputs are integers and the overflow flag isn’t set, the output is an integer. If the inputs are any other kinds of numbers or there are overflows, the output is a double. We don’t need anything else (like value profiling) to understand the output type of Add.

Figure 15 shows some of the rules for GetByVal, which is the DFG representation of array[index] . In this case, there are types of arrays that could hold any type of value. So, even knowing that it is a JSArray isn’t enough to know the types of values inside the array. Also, if the index is a string, then this could be accessing some named property on the array object or one of its prototypes and those could have any type. It’s in cases like GetByVal that we leverage value profiling to guess what the result type is.

Prediction propagation combined with value profiling allows the DFG to infer a predicted type at every point in the program where a variable is used. This allows operations that don’t do any profiling on their own to still perform type-based speculations. It’s of course possible to also have bytecode instructions that can speculate on type collect case flags (or use some other mechanism) to drive those speculations — and that approach can be more precise — but value profiling means that we don’t have to do this for every operation that wants type-based speculation.

Using predicted types. Consider the CompareEq operation in DFG IR, which is used for the DFG lowering of the eq , eq_null , neq , neq_null , jeq , jeq_null , jneq , and jneq_null bytecodes. These bytecodes do no profiling of their own. But CompareEq is one of the most aggressive type speculators in all of the DFG. CompareEq can speculate on the types it sees without doing any profiling of its own because the values it uses will either have value profiling or will have a predicted type filled in by prediction propagation.

Type speculations in the DFG are written like:

CompareEq(Int32:@left, Int32:@right)

This example means that the CompareEq will specuate that both operands are Int32. CompareEq supports the following speculations, plus others we don’t list here:

CompareEq(Boolean:@left, Boolean:@right) CompareEq(Int32:@left, Int32:@right) CompareEq(Int32:BooleanToNumber(Boolean:@left), Int32:@right) CompareEq(Int32:BooleanToNumber(Untyped:@left), Int32:@right) CompareEq(Int32:@left, Int32:BooleanToNumber(Boolean:@right)) CompareEq(Int32:@left, Int32:BooleanToNumber(Untyped:@right)) CompareEq(Int52Rep:@left, Int52Rep:@right) CompareEq(DoubleRep:DoubleRep(Int52:@left), DoubleRep:DoubleRep(Int52:@right)) CompareEq(DoubleRep:DoubleRep(Int52:@left), DoubleRep:DoubleRep(RealNumber:@right)) CompareEq(DoubleRep:DoubleRep(Int52:@left), DoubleRep:DoubleRep(Number:@right)) CompareEq(DoubleRep:DoubleRep(Int52:@left), DoubleRep:DoubleRep(NotCell:@right)) CompareEq(DoubleRep:DoubleRep(RealNumber:@left), DoubleRep:DoubleRep(RealNumber:@right)) CompareEq(DoubleRep:..., DoubleRep:...) CompareEq(StringIdent:@left, StringIdent:@right) CompareEq(String:@left, String:@right) CompareEq(Symbol:@left, Symbol:@right) CompareEq(Object:@left, Object:@right) CompareEq(Other:@left, Untyped:@right) CompareEq(Untyped:@left, Other:@right) CompareEq(Object:@left, ObjectOrOther:@right) CompareEq(ObjectOrOther:@left, Object:@right) CompareEq(Untyped:@left, Untyped:@right)

Some of these speculations, like CompareEq(Int32:, Int32:) or CompareEq(Object:, Object:) , allow the compiler to just emit an integer compare instruction. Others, like CompareEq(String:, String:) , emit a string compare loop. We have lots of variants to optimally handle bizarre comparisons that are not only possible in JS but that we have seen happen frequently in the wild, like comparisons between numbers and booleans and comparisons between one value that is always a number and another that is either a number or a boolean. We provide additional optimizations for comparisons between doubles, comparisons between strings that have been hash-consed (so-called StringIdent, which can be compared using comparison of the string pointer), and comparisons where we don’t know how to speculate ( CompareEq(Untyped:, Untyped:) ).

The basic idea of value profiling — storing a last-seen value into a bucket and then using that to bootstrap a static analysis — is something that we also use for profiling the behavior of array accesses. Array profiles and array allocation profiles are like value profiles in that they save the last result in a bucket. Like value profiling, data from those profiles is incorporated into prediction propagation.

To summarize, value profiling allows us to predict the types of variables at all of their use sites by just collecting profiling at those bytecode instructions whose output cannot be predicted with abstract interpretation. This serves as the foundation for how the DFG (and FTL, since it reuses the DFG’s frontend) speculates on the types of JSValues.

Inline Caches

Property accesses and function calls are particularly difficult parts of JavaScript to optimize:

Objects behave as if they were just ordered mappings from strings to JSValues. Lookup, insertion, deletion, replacement, and iteration are possible. Programs do these operations a lot, so they have to be fast. In some cases, programs use objects the same way that programs in other languages would use hashtables. In other cases, programs use objects the same way that they would in Java or some sensibly-typed object-oriented language. Most programs do both.

Function calls are polymorphic. You can’t make static promises about what function will be called.

Both of these dynamic features are amenable to optimization with Deutsch and Schiffman’s inline caches (ICs). For dynamic property access, we combine this with structures, based on the idea of maps in the Chambers, Ungar, and Lee’s Self implementation. We also follow Hölzle, Chambers, and Ungar: our inline caches are polymorphic and we use data from these caches as profiling of the types observed at a particular property access or call site.

It’s worth dwelling a bit on the power of inline caching. Inline caches are great optimizations separately from speculative compilation. They make the LLInt and Baseline run faster. Inline caches are our most powerful profiling source, since they can precisely collect information about every type encountered by an access or call. Note that we previously said that good profiling has to be cheap. We think of inline caches as negative cost profiling since inline caches make the LLInt and Baseline faster. It doesn’t get cheaper than that!

This section focuses on inline caching for dynamic property access, since it’s strictly more complex than for calls (accesses use structures, polymorphic inline caches (PICs), and speculative compilation; calls only use polymorphic inline caches and speculative compilation). We organize our discussion of inline caching for dynamic property access as follows. First we describe how structures work. Then we show the JavaScriptCore object model and how it incorporates structures. Next we show how inline caches work. Then we show how profiling from inline caches is used by the optimizing compilers. After that we show how inline caches support polymorphism and polyvariance. Finally we talk about how inline caches are integrated with the garbage collector.

Structures. Objects in JavaScript are just mappings from strings to JSValues. Lookup, insertion, deletion, replacement, and iteration are all possible. We want to optimize those uses of objects that would have had a type if the language had given the programmer a way to say it.

Figure 16. Some JavaScript objects that have x and y properties. Some of them have exactly the same shape (only x and y in the same order).

Consider how to implement a property access like:

var tmp = o.x;

Or:

o.x = tmp;

One way to make this fast is to use hashtables. That’s certainly a necessary fallback mechanism when the JavaScript program uses objects more like hashtables than like objects (i.e. it frequently inserts and deletes properties). But we can do better.

This problem frequently arises in dynamic programming languages and it has a well-understood solution. The key insight of Chambers, Ungar, and Lee’s Self implementation is that property access sites in the program will typically only see objects of the same shape. Consider the objects in Figure 16 that have x and y properties. Of course it’s possible to insert x and y in two possible orders, but folks will tend to pick some order and stick to it (like x first). And of course it’s possible to also have objects that have a z property, but it’s less likely that a property access written as part of the part of the program that works with {x, y} objects will be reused for the part that uses {x, y, z}. It’s possible to have shared code for many different kinds of objects but unshared code is more common. Therefore, we split the object representation into two parts:

The object itself, which only contains the property values and a structure pointer.

The structure, which is a hashtable that maps property names (strings) to indices in the objects that have that structure.

Figure 17. The same objects as in Figure 16, but using structures.

Figure 17 shows objects represented using structures. Objects only contain object property values and a pointer to a structure. The structure tells the property names and their order. For example, if we wanted to ask the {1, 2} object in Figure 17 for the value of property x, we would load the pointer to its structure, {x, y}, and ask that structure for the index of x. The index is 0, and the value at index 0 in the {1, 2} object is 1.

A key feature of structures is that they are hash consed. If two objects have the same properties in the same order, they are likely to have the same structure. This means that checking if an object has a certain structure is O(1): just load the structure pointer from the object header and compare the pointer to a known value.

Structures can also indicate that objects are in dictionary or uncacheable dictionary mode, which are basically two levels of hashtable badness. In both cases, the structure stops being hash consed and is instead paired 1:1 with its object. Dictionary objects can have new properties added to them without the structure changing (the property is added to the structure in-place). Uncacheable dictionary objects can have properties deleted from them without the structure changing. We won’t go into these modes in too much detail in this post.

To summarize, structures are hashtables that map property names to indices in the object. Object property lookup uses the object’s structure to find the index of the property. Structures are hash consed to allow for fast structure checks.

Figure 18. The JavaScriptCode object model.

JavaScriptCore object model. JavaScriptCore uses objects with a 64-bit header that includes a 32-bit structure ID and 32 bits worth of extra state for GC, type checks, and arrays. Figure 18 shows the object model. Named object properties may end up either in the inline slots or the out-of-line slots. Objects get some number of inline slots based on simple static analysis around the allocation site. If a property is added that doesn’t fit in the inline slots, we allocate a butterfly to store additional properties out-of-line. Accessing out-of-line properties in the butterfly costs one extra load.

Figure 19 shows an example object that only has two inline properties. This is the kind of object you would get if you used the object literal {f:5, g:6} or if you assigned to the f and g properties reasonably close to the allocation.

Figure 19. Example JavaScriptCore object together with its structure.

Simple inline caches. Let’s consider the code:

var v = o.f;

Let’s assume that all of the objects that flow into this have structure 42 like the object in Figure 19. Inline caching this property access is all about emitting code like the following:

if (o->structureID == 42) v = o->inlineStorage[0] else v = slowGet(o, "f")

But how do we know that o will have structure 42? JavaScript does not give us this information statically. Inline caches get this information by filling it in once the code runs. There are a number of techniques for this, all of which come down to self-modifying code. Let’s look at how the LLInt and Baseline do it.

In the LLInt, the metadata for get_by_id contains a cached structure ID and a cached offset. The cached structure ID is initialized to an absurd value that no structure can have. The fast path of get_by_id loads the property at the cached offset if the object has the cached structure. Otherwise, we take a slow path that does the full lookup. If that full lookup is cacheable, it stores the structure ID and offset in the metadata.

The Baseline JIT does something more sophisticated. When emitting a get_by_id , it reserves a slab of machine code space that the inline caches will later fill in with real code. The only code in this slab initially is an unconditional jump to a slow path. The slow path does the fully dynamic lookup. If that is deemed cacheable, the reserved slab is replaced with code that does the right structure check and loads at the right offset. Here’s an example of a get_by_id initially compiled with Baseline:

0x46f8c30b9b0: mov 0x30(%rbp), %rax 0x46f8c30b9b4: test %rax, %r15 0x46f8c30b9b7: jnz 0x46f8c30ba2c 0x46f8c30b9bd: jmp 0x46f8c30ba2c 0x46f8c30b9c2: o16 nop %cs:0x200(%rax,%rax) 0x46f8c30b9d1: nop (%rax) 0x46f8c30b9d4: mov %rax, -0x38(%rbp)

The first thing that this code does is check that o (stored in %rax ) is really an object (using a test and jnz ). Then notice the unconditional jmp followed by two long nop instructions. This jump goes to the same slow path that we would have branched to if o was not an object. After the slow path runs, this is repatched to:

0x46f8c30b9b0: mov 0x30(%rbp), %rax 0x46f8c30b9b4: test %rax, %r15 0x46f8c30b9b7: jnz 0x46f8c30ba2c 0x46f8c30b9bd: cmp $0x125, (%rax) 0x46f8c30b9c3: jnz 0x46f8c30ba2c 0x46f8c30b9c9: mov 0x18(%rax), %rax 0x46f8c30b9cd: nop 0x200(%rax) 0x46f8c30b9d4: mov %rax, -0x38(%rbp)

Now, the is-object check is followed by a structure check (using cmp to check that the structure is 0x125) and a load at offset 0x18.

Inline caches as a profiling source. The metadata we use to maintain inline caches makes for a fantastic profiling source. Let’s look closely at what this means.

Figure 20. Timeline of using an inline cache at each JIT tier. Note that we end up having to generate code for this `get_by_id` *six times* in the berst case that each tier compiles this only once.

Figure 20 shows a naive use of inline caches in a multi-tier engine, where the DFG JIT forgets everything that we learned from the Baseline inline cache and just compiles a blank inline cache. This is reasonably efficient and we fall back on this approach when the inline caches from the LLInt and Baseline tell us that there is unmanagable polymorphism. Before we go into how polymorphism is profiled, let’s look at how a speculative compiler really wants to handle simple monomorphic inline caches like the one in Figure 20, where we only see one structure (S1) and the code that the IC emits is trivial (load at offset 10 from %rax ).

When the DFG frontend (shared by DFG and FTL) sees an operation like get_by_id that can be implemented with ICs, it reads the state of all ICs generated for that get_by_id . By “all ICs” we mean all ICs that are currently in the heap. This usually just means reading the LLInt and Baseline ICs, but if there exists a DFG or FTL function that generated an IC for this get_by_id then we will also read that IC. This can happen if a function gets compiled multiple times due to inlining — we may be compiling function bar that inlines a call to function foo and foo already got compiled with FTL and the FTL emitted an IC for our get_by_id .

If all ICs for a get_by_id concur that the operation is monomorphic and they tell us the structure to use, then the DFG frontend converts the get_by_id into inline code that does not get repatched. This is shown in Figure 21. Note that this simple get_by_id is lowered to two DFG operations: CheckStructure, which OSR exits if the given object does not have the required structure, and GetByOffset, which is just a load with known offset and field name.

Figure 21. Inlining a simple momomorphic inline cache in DFG and FTL.

CheckStructure and GetByOffset are understood precisely in DFG IR:

CheckStructure is a load to get the structure ID of an object and a branch to compare that structure ID to a constant. The compiler knows what structures are. After a CheckStructcure, the compiler knows that it’s safe to execute loads to any of the properties that the structure says that the object has.

GetByOffset is a load from either an inline or out-of-line property of a JavaScript object. The compiler knows what kind of property is being loaded, what its offset is, and what the name of the property would have been.

The DFG knows all about how to model these operations and the dependency between them:

The DFG knows that neither operation causes a side effect, but that the CheckStructure represents a conditional side exit, and both operations read the heap.

The DFG knows that two CheckStructures on the same structure are redundant unless some operation between them could have changed object structure. The DFG knows a lot about how to optimize away redundant structure checks, even in cases where there is a function call between two of them (more on this later).

The DFG knows that two GetByOffsets that speak of the same property and object are loading from the same memory location. The DFG knows how to do alias analaysis on those properties, so it can precisely know when a GetByOffset’s memory location got clobbered.

The DFG knows that if it wants to hoist a GetByOffset then it has to ensure that the corresponding CheckStructure gets hoisted first. It does this using abstract interpretation, so there is no need to have a dependency edge between these operations.

The DFG knows how to generate either machine code (in the DFG tier) or B3 IR (in the FTL tier) for CheckStructure and GetByOffset. In B3, CheckStructure becomes a Load, NotEqual, and Check, while GetByOffset usually just becomes a Load.

Figure 22. Inlining two momomorphic inline caches, for different properties on the same object, in DFG and FTL. The DFG and FTL are able to eliminate the CheckStructure for the second IC.

The biggest upshot of lowering ICs to CheckStructure and GetByOffset is the redundancy elimination. The most common redundancy we eliminate is multiple CheckStrutures. Lots of code will do multiple loads from the same object, like:

var f = o.f; var g = o.g;

With ICs, we would check the structure twice. Figure 22 shows what happens when the speculative compilers inline these ICs. We are left with just a single CheckStructure instead of two thanks to the fact that:

CheckStructure is an OSR speculation.

CheckStructure is not an IC. The compiler knows exactly what it does, so that it can model it, so that it can eliminate it.

Let’s pause to appreciate what this technique gives us so far. We started out with a language in which property accesses seem to need hashtable lookups. A o.f operation requires calling some procedure that is doing hashing and so forth. But by combining inline caches, structures, and speculative compilation we have landed on something where some o.f operations are nothing more than load-at-offset like they would have been in C++ or Java. But this assumes that the o.f operation was monomorphic. The rest of this section considers minimorphism, polymorphism, and polyvariance.

Minimorphism. Certain kinds of polymorphic accesses are easier to handle than others. Sometimes an access will see two or more structures but all of those structures have the property at the same offset. Other times an access will see multiple structures and those structures do not agree on the offset of the property. We say that an access is minimorphic if it sees more than one structure and all structures agree on the offset of the property.

Our inline caches handle all forms of polymorphism by generating a stub that switches on the structure. But in the DFG, minimorphic accesses are special because they still qualify for full inlining. Consider an access o.f that sees structures S1 and S2, and both agree that f is at offset 0. Then we would have:

CheckStructure(@o, S1, S2) GetByOffset(@o, 0)

This minimorphic CheckStructure will OSR exit if @o has none of the listed structures. Our optimizations for CheckStructure generally work for both monomorphic and minimorphic variants. So, minimorphism usually doesn’t hurt performance much compared to monomorphism.

Polymorphism. But what about some access sees different structures, and those structures have the property at different offsets? Consider an access to o.f that sees structures S1 = {f, g}, S2 = {f, g, h}, and S3 = {g, f}. This would be a minimorphic access if it was just S1 or S2, but S3 has f at a different offset. In this case, the FTL will convert this to:

MultiGetByOffset(@o, [S1, S2] => 0, [S3] => 1)

in DFG IR and then lower it to something like:

if (o->structureID == S1 || o->structureID == S2) result = o->inlineStorage[0] else result = o->inlineStorage[1]

in B3 IR. In fact, we would use B3’s Switch since that’s the canonical form for this code pattern in B3.

Note that we only do this optimization in the FTL. The reason is that we want polymorphic accesses to remain ICs in the DFG so that we can use them to collect refined profiling.

foo

bar

bar

get_by_id

Figure 23. Polyvariant inlining of an inline cache. The FTL can inline the inline cache