While in the middle of writing “Reading bits in far too many ways, part 3”, I realized that I had written a lot of background material that had absolutely nothing to do with bit I/O and really was worth putting in its own post. This is that post.

The problem I’m concerned with is fairly easy to state: say we have some piece of C++ code that we’re trying to understand (and perhaps improve) the performance of. A good first step is to profile it, which will give us some hints about which parts are slow, but not necessarily why. On a fundamental level, any kind of profiling (or other measurement) is descriptive, not predictive: it can tell you how an existing system is behaving, but if you’re designing something that’s more than a few afternoons worth of work, you probably don’t have the time or resources to implement 5 or 6 completely different design alternatives, pick whichever one happens to work best, and throw the rest away. You should be able to make informed decisions up front from an algorithm sketch without having to actually write a fleshed-out implementation.

One thing I want to emphasize particularly here is that experiments coupled with before/after measurements are no adequate substitute for a useful performance model. These kinds of measurements can tell you how much you’ve improved, but not if you are where you should be: if I tell you that by tweaking some config files, I managed to double the number of requests served per second by the web server, that sounds great. It sounds less good if I give you the additional piece of information that with this fix deployed, we’re now at a whopping 1.5 requests per second; having an absolute scale of reference matters!

This goes especially for microbenchmarks. With microbenchmarks, like a trial lawyer during cross-examination, you should never ask a question you don’t know the answer to (or at least have a pretty good idea of what it is). Real-world systems are generally too complex and intertwined to understand from surface measurements alone. If you have no idea how a system works at all, you don’t know what the right questions are, nor how to ask them, and any answers you get will be opaque at best, if not outright garbage. Microbenchmarks are a useful tool to confirm that an existing model is a good approximation to reality, but not very helpful in building these models to begin with.

Machine models

So, if we want to go deeper than just squinting at C/C++ code and doing some hand-waving, we need to start looking at a somewhat lower abstraction level and define a machine model that is more sophisticated than “statements execute one by one”. If you’re only interested in a single specific processor, one option is to use whatever documentation and tools you can find for the chip in question and analyze your code in detail for that specific machine. And if you’re willing to go all-out on microarchitectural tweaking, that’s indeed the way to go, but it’s a giant step from looking at C++ code, and complete overkill in most cases.

Instead, what I’m going to do is use a simplified machine model that allows us to make quantitative predictions about the behavior of straightforward compute-bound loops, which is simple to describe but still gives us a lot of useful groundwork for more complex scenarios. Here’s what I’ll use:

We have an unlimited set of 64-bit integer general-purpose registers, which I’ll refer to by names like rSomething. Any “identifiers” that aren’t prefixed with a lowercase r are either symbolic constants or things like labels.

We have the usual 64-bit integer arithmetic and logic operations. All operations can either be performed between two registers or a register and an immediate constant, and the result is placed in another register. All arithmetic uses two’s complement. For simplicity, all 64-bit values are permitted as immediate constants.

There’s a flat, byte-granular 64-bit address space, and pointers are just represented as integers.

All memory accesses require explicit load and store operations. Memory accesses are either 8, 16, 32, or 64 bits in size and can use (for my convenience) either little-endian or big-endian byte ordering, when requested. One of these is the default, but both are the same cost. Narrow stores store the least significant bits of the register in question; narrow loads zero-extend to 64 bits. Loads and stores have a few common addressing modes (that I’ll introduce as I use them). Unaligned loads and stores are supported.

There’s unconditional branches, which just jump to a given location, and conditional branches, which compare a register to either another register or an immediate constant, and branch to a given destination if the condition is true.

Code will be written in a pseudo-C form, at most one instruction per line. Here’s a brief example showing what kind of thing I have in mind:

    loop:                                 // label
      rFoo = rBar | 1;                    // bitwise logical OR
      rFoo = lsl(rFoo, 3);                // logical shift left
      rBar = asr(rBar, rBaz);             // arithmetic shift right
      rMem = load64LE(rBase + rFoo);      // little-endian load
      store16BE(rDest + 3, rMem);         // big-endian store
      rCount = rCount - 1;                // basic arithmetic
      if rCount != 0 goto loop;           // branch

Shifts use explicit mnemonics because there’s different types of right shifts and at this level of abstraction, registers are generally treated as untyped bags of bits. I’ll introduce other operations and addressing modes as we get to them. What we’ve seen so far is quite close to classic RISC instruction sets, although I’ll allow a larger set of addressing modes than some of the more minimalist designs, and require support for unaligned access on all loads and stores. It’s also close in spirit to an IR (Intermediate Representation) you’d expect to see early in the backend of a modern compiler: somewhat lower-level than LLVM IR, and comparable to early-stage LLVM Machine IR or GCC RTL.

This model requires us to make the distinction between values kept in registers and memory accesses explicit, and flattens down control flow to basic blocks connected by branches. But it’s still relatively easy to look at a small snippet of C++ and e.g. figure out how many arithmetic instructions it boils down to: just count the number of operations.

As a next step, we could now specify a virtual processor to go with our instruction set, but I don’t want to really get into that level of detail; instead of specifying the actual processor, I’ll work the same way actual architectures do: we require that the end result (eventual register and memory contents in our model) of running a program must be as if we had executed the instructions sequentially one by one (as-if rule). Beyond that, an aggressive implementation is free to cut corners as much as it wants provided it doesn’t get caught. We’ll assume we’re in an environment—the combination of compilers/tools and the processor itself—that uses pipelining and tries to extract instruction-level parallelism to achieve higher performance, in particular:

Instructions can launch independent from each other, and take some number of clock cycles to complete. For an instruction to start executing, all the operands it depends on need to have been computed. As long as the dependencies are respected, all reorderings are valid.

There is some limit W (“width”) on how many new instructions we can start per clock cycle. In-flight instructions don’t interfere with each other; as long as we have enough independent work, we can start W new instructions every cycle. We’re going to treat W as variable.

Memory operations have a latency of 4 cycles, meaning that the result of a load is available 4 cycles after the load issued, and a load reading the bytes written by a prior store can issue 4 cycles after the store. That’s a fairly typical latency for a load that hits in the L1 cache, in case you were wondering.

Branches (conditional or not) count as a single instruction, but their latency is variable. Unconditional branches or easily predicted branches such as the loop counter in a long-running loop have an effective latency of 0 cycles, meaning the instructions being branched to can issue at the same time as the branch itself. Unpredictable branches have a nonzero cost that depends on how unpredictable they are; I won’t even try to be more precise here.

Every other instruction has a latency of 1 clock cycle, meaning the result is available in the next cycle.

This model can be understood as approximating either a dataflow architecture, an out-of-order machine with a very large issue window (and infinitely fast front-end), or a statically scheduled in-order machine running code compiled with a Sufficiently Smart Scheduler. (The kind that actually exists; e.g. a compiler implementing software pipelining).

Furthermore, I’m assuming that while there is explicit control flow (unlike a pure dataflow machine), there is a branch prediction mechanism in place that allows the machine to guess the control flow path taken arbitrarily far in advance. When these guesses are correct, the branches are effectively free other than still taking an instruction slot, during which time the machine checks whether its prediction was correct. When the guess was incorrect, the machine reverts all computations that were down the incorrectly guessed path, and takes some number of clock cycles to recover. If this idea of branch prediction is new to you, I’ll refer you to Dan Luu’s excellent article on the subject, which explains both how and why computers would be doing this.

The end result of these model assumptions is that while control flow exists, it’s on the sidelines: its only observable effect is that it sometimes causes us to throw away a bunch of work and take a brief pause to recover when we guessed wrong. Dataflow, on the other hand—the dependencies between instructions, and how long it takes for these dependencies to be satisfied—is front and center.

Dataflow graphs

Why this emphasis? Because dataflow and data dependencies can be viewed as the fundamental expression of the structure of a particular computation, whether it’s done on a small sequential machine, a larger superscalar out-of-order CPU, a GPU, or in hardware (be it a hand-soldered digital circuit, an FPGA, or an ASIC). Keeping track of dataflow and the shape of data dependencies is an organizing principle of both the machines themselves and the compilers that target them.

And these dependencies are naturally expressed in graph form, with individual operations being the nodes and data dependencies denoted by directed edges. In this post, I’ll have dependent operations point towards the operations they depend on, with the directed edges labeled with their latency. To reduce clutter, I’ll only write latency numbers when they’re not 1.

With all that covered, and to see what the point of this all is, let’s start with a simple, short toy program that just sums the 64-bit integers in some array delineated by two pointers stored in rCurPtr (which starts pointing to the first element) and rEndPtr (which points to one past the last element), idiomatic C++ iterator-style.

    loop:
      rCurInt = load64(rCurPtr);          // Load
      rSum = rSum + rCurInt;              // Sum
      rCurPtr = rCurPtr + 8;              // Advance
      if rCurPtr != rEndPtr goto loop;    // Done?

We load a 64-bit integer from the current pointer, add it to our current running total in register rSum, increment the pointer by 8 bytes (since we grabbed a 64-bit integer), and then loop until we’re done. Now let’s say we run this program for a short 6 iterations and draw the corresponding dataflow graph:

Note I group nodes into ranks by which cycle they can execute in, at the earliest, assuming we can issue as many instructions in parallel as we want, purely constrained by the data dependencies. The “Load” and “Advance” from the first iteration can execute immediately; the “Done?” check from the first iteration looks at the updated rCurPtr, which is only known one cycle later; and “Sum” from the first iteration needs to wait for the load to finish, which means it can only start a full 4 cycles later.

As we can see, during the first four cycles, all we do is keep issuing more loads and advancing the pointer. It takes until cycle 4 for the results of the first load to become available, so we can actually do some summing. After that, one more load completes every cycle, allowing us to add one more integer to the running sum in turn. If we let this process continue for longer, all the middle iterations would look the way cycles 4 and 5 do: in our steady state, we’re issuing a copy of all four instructions in the loop every cycle, but from different iterations.

There’s a few conclusions we can draw from this: first, we can see that this four-instruction loop achieves a steady-state throughput of one integer added to the sum in every clock cycle. We take a few cycles to get into the steady state, and then a few more cycles at the end to drain out the pipeline, but if we start in cycle 0 and keep running N iterations, then the final sum will be completed by cycle N+4. Second, even though I said that our model has infinite lookahead and is free to issue as many instructions per cycle as it wants, we “only” end up using at most 4 instructions per cycle. The limiter here ends up being the address increment (“Advance”); we increment the pointer after every load, per our cost model this increment takes a cycle of latency, and therefore the load in the next iteration of the loop (which wants to use the updated pointer) can start in the next cycle at the earliest.
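If you’d rather check these numbers than take the graph on faith, the model is small enough to simulate. What follows is only a sketch under stated assumptions: the `Instr` record and the `schedule` and `array_sum` helpers are names made up for this illustration, only register dataflow is tracked (no store-to-load dependencies), branches are treated as perfectly predicted with latency 0, and Python is used purely for convenience.

```python
from collections import Counter, namedtuple

# Hypothetical record for one instruction of the pseudo-assembly:
# dst is the register written (None for branches), srcs the registers read.
Instr = namedtuple("Instr", ["name", "dst", "srcs", "latency"])

def schedule(program, width=None):
    """Return the earliest issue cycle of each instruction in `program`.

    Instructions are considered in program order (oldest first) and issue as
    soon as all their operands are available; width=None means no issue limit.
    """
    last_writer = {}             # register -> index of its most recent writer
    issued_in_cycle = Counter()  # cycle -> instructions already started there
    issue = []
    for index, ins in enumerate(program):
        cycle = 0
        for reg in ins.srcs:
            if reg in last_writer:  # never-written registers are ready at cycle 0
                producer = last_writer[reg]
                cycle = max(cycle, issue[producer] + program[producer].latency)
        # With a finite width, slide to the first cycle that has a free slot.
        while width is not None and issued_in_cycle[cycle] >= width:
            cycle += 1
        issued_in_cycle[cycle] += 1
        issue.append(cycle)
        if ins.dst is not None:
            last_writer[ins.dst] = index
    return issue

def array_sum(iterations):
    """The four-instruction array-sum loop, repeated `iterations` times."""
    body = [
        Instr("Load", "rCurInt", ("rCurPtr",), 4),
        Instr("Sum", "rSum", ("rSum", "rCurInt"), 1),
        Instr("Advance", "rCurPtr", ("rCurPtr",), 1),
        Instr("Done?", None, ("rCurPtr", "rEndPtr"), 0),
    ]
    return body * iterations

N = 6
program = array_sum(N)
issue = schedule(program)
# Cycle in which the final value of rSum becomes available.
sum_done = max(c + i.latency for c, i in zip(issue, program) if i.name == "Sum")
# Largest number of instructions started in any single cycle.
peak_issue = max(Counter(issue).values())
print(issue[:4], sum_done, peak_issue)  # for N = 6: [0, 4, 0, 1] 10 4
```

The first iteration issues Load and Advance in cycle 0, Done? in cycle 1 and Sum in cycle 4; the final sum is ready in cycle N+4, and no cycle starts more than 4 instructions. Passing `width=2` or `width=3` to `schedule` gives a rough feel for what a narrower machine does to the same instruction stream.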

This is a crucial point: the longest-latency instruction in this loop is definitely the load, at 4 cycles. But that’s not a limiting factor; we can schedule around the load and do the summing later. The actual problem here is with the pointer advance; every single instruction that comes after it in program order depends on it either directly or indirectly, and therefore, its 1 cycle of latency determines when the next loop iteration can start. We say it’s on the critical path. In loops specifically, we generally distinguish between intra-iteration dependencies (between instructions within the same iteration, say “Sum 0” depending on “Load 0”) and inter-iteration or loop-carried dependencies (say “Sum 1” depending on “Sum 0”, or “Load 1” depending on “Advance 0”). Intra-iteration dependencies may end up delaying instructions within that iteration quite a lot, but it’s inter-iteration dependencies that determine how soon we can start working on the next iteration of the loop, which is usually more important because it tends to open up more independent instructions to work on.

The good news is that W=4 is actually a fairly typical number of instructions decoded/retired per cycle in current (as of this writing in early 2018) out-of-order designs, and the instruction mixture here (1 load, 1 branch, 2 arithmetic instructions) is also one that is quite likely to be able to issue in parallel on a realistic 4-wide decode/retire design. While many machines can issue a lot more instructions than that in short bursts, a steady state of 4 instructions per cycle is definitely good. So even though we’re not making much of the infinite parallel computing power of our theoretical machine, in practical terms, we’re doing OK, although on real machines we might want to apply some more transforms to the loop; see below.

Because these real-world machines can’t start an arbitrary number of instructions at the same time, we have another concern: throughput. Say we’re running the same loop on a processor that has W=2, i.e. only two instructions can start every cycle. Because our loop has 4 instructions, that means that we can’t possibly start a new loop iteration more often than once every two clock cycles, and the limiter isn’t the data dependencies, but the number of instructions our imaginary processor can execute in a clock cycle; we’re throughput-bound. We would also be throughput-bound on a machine with W=3, with a steady state of 3 new instructions issued per clock cycle, where we can start working on a new iteration every 4/3≈1.33 cycles.
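Put as arithmetic, the steady-state cost of one iteration is bounded below by whichever is larger: the latency of the longest loop-carried dependency chain, or the instruction count divided by W. A minimal sketch of that bound (the function name and its parameters are my own, not part of the model):

```python
from fractions import Fraction

def cycles_per_iteration(instructions, width, carried_latency):
    """Steady-state lower bound on cycles per loop iteration.

    carried_latency is the latency of the longest loop-carried dependency
    chain; instructions / width is the issue-throughput limit.
    """
    return max(Fraction(carried_latency), Fraction(instructions, width))

# The array-sum loop: 4 instructions, loop-carried chains of 1 cycle each.
for width in (2, 3, 4, 8):
    bound = cycles_per_iteration(4, width, 1)
    kind = "throughput-bound" if bound > 1 else "dependency-bound"
    print(f"W={width}: {bound} cycles/iteration ({kind})")
```

For W=2 and W=3 this gives 2 and 4/3 cycles per iteration; from W=4 upwards the 1-cycle pointer advance is the limit, and extra width buys nothing.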

A different example

For the next example, we’re going to look at what’s turned into everyone’s favorite punching-bag of a data structure, the linked list. Let’s do the exact same task as before, only this time, the integers are stored in a singly-linked list instead of laid out as an array. We store first a 64-bit integer and then a 64-bit pointer to the next element, with the end of the list denoted by a special value stored in rEndPtr as before. We also assume the list has at least 1 element. The corresponding program looks like this:

    loop:
      rCurInt = load64(rCurPtr);          // LoadInt
      rSum = rSum + rCurInt;              // Sum
      rCurPtr = load64(rCurPtr + 8);      // LoadNext
      if rCurPtr != rEndPtr goto loop;    // Done?

Very similar to before, only this time, instead of incrementing the pointer, we do another load to grab the “next” pointer. And here’s what happens to the dataflow graph if we make this one-line change:

Switching from a contiguous array to a linked list means that we have to wait for the load to finish before we can start the next iteration. Because loads have a latency of 4 cycles in our model, that means we can’t start a new iteration any more often than once every 4 cycles. With our 4-instruction loop, we don’t even need any instruction-level parallelism to reach that target; we might as well just execute one instruction per cycle and still hit the same overall throughput.

Now, this example, with its short 4-instruction loop, is fairly extreme; if our loop had say a total of 12 instructions that worked out nicely, the same figure might well end up averaging 3 instructions per clock cycle, and that’s not so bad. But the underlying problem here is a nasty one: because our longest-latency instruction is on the critical path between iterations, it ends up determining the overall loop throughput.
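The same kind of bookkeeping can be written down directly for the linked-list loop. This is a hand-rolled sketch for this one loop, assuming unlimited issue width and the model’s 4-cycle loads; the function and variable names are made up for the illustration.

```python
LOAD_LATENCY = 4  # cycles, as in the model
ALU_LATENCY = 1

def linked_list_sum_schedule(nodes):
    """Issue cycles of (LoadInt, LoadNext, Sum) for each list node."""
    rows = []
    ptr_ready = 0  # cycle in which rCurPtr for this node is known
    sum_ready = 0  # cycle in which the running rSum is available
    for _ in range(nodes):
        load_int = ptr_ready
        load_next = ptr_ready
        add = max(load_int + LOAD_LATENCY, sum_ready)
        rows.append((load_int, load_next, add))
        ptr_ready = load_next + LOAD_LATENCY  # next node waits for the pointer
        sum_ready = add + ALU_LATENCY
    return rows

rows = linked_list_sum_schedule(6)
final_sum_ready = rows[-1][2] + ALU_LATENCY
print(rows[:3], final_sum_ready)  # [(0, 0, 4), (4, 4, 8), (8, 8, 12)] 25

# Average instructions per cycle when one iteration starts every 4 cycles.
for instructions in (4, 12):
    print(instructions, "instructions/iteration:", instructions / LOAD_LATENCY, "per cycle")
```

Each node starts exactly 4 cycles after the previous one, so N nodes take about 4N cycles where the array version took about N.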

In our model, we’re still primarily focused on compute-bound code, and memory access is very simple: there’s no memory hierarchy with different cache levels, all memory accesses take the same time. If we instead had a more realistic model, we would also have to deal with the fact that some memory accesses take a whole lot longer than 4 cycles to complete. For example, suppose we have three cache levels and, at the bottom, DRAM. Sticking with the powers-of-4 theme, let’s say that a L1 cache hit takes 4 cycles (i.e. our current memory access latency), a L2 hit takes 16 cycles, a L3 hit takes 64 cycles, and an actual memory access takes 256 cycles—for what it’s worth, all these numbers are roughly in the right ballpark for high-frequency desktop CPUs under medium memory subsystem load as of this writing.

Finding work to keep the machine otherwise occupied for the next 4 cycles (L1 hit) is usually not that big a deal, unless we have a very short loop with unfavorable dependency structure, as in the above example. Fully covering the 16 cycles for a L1 miss but L2 hit is a bit trickier and requires a larger out-of-order window, but current out-of-order CPUs have those, and as long as there’s enough other independent work and not too many hard-to-predict branches along the way, things will work out okay. With a L3 cache hit, we’ll generally be hard-pressed to find enough independent work to keep the core usefully busy during the wait for the result, and if we actually miss all the way to DRAM, then in our current model, the machine is all but guaranteed to stall; that is, to have many cycles with no instructions executed at all, just like the gaps in the diagram above.

Because linked lists have this nasty habit of putting memory access latencies on the critical path, they have a reputation of being slow “because they’re bad for the cache”. Now while it’s definitely true that most CPUs with a cache would much rather have you iterate sequentially over an array, we have to be careful how we think about it. To elaborate, suppose we have yet another sum kernel, this time processing an array of pointers to integers, to compute the sum of the pointed-to values.

    loop:
      rCurIntPtr = load64(rCurPtr);       // LoadPtr
      rCurInt = load64(rCurIntPtr);       // LoadInt
      rSum = rSum + rCurInt;              // Sum
      rCurPtr = rCurPtr + 8;              // Advance
      if rCurPtr != rEndPtr goto loop;    // Done?

And this time, I’ll prune the dataflow graph to show only the current iteration and its direct dependency relationships with earlier and later iterations, because otherwise these more complicated graphs will get cluttered and unreadable quickly:

A quick look over that graph shows us that copies of the same instruction from different iterations are all spaced 1 cycle apart; this means that in the steady state, we will again execute one iteration of the loop per clock cycle, this time issuing 5 instructions instead of 4 (because there are 5 instructions in the loop). Just like in the linked list case, the pointer indirection here allows us to jump all over memory (potentially incurring cache misses along the way) if we want to, but there’s a crucial difference: in this setup, we can keep setting up future iterations of the loop and get more loads started while we’re waiting for the first memory access to complete.

To explain what I mean, let’s pretend that every single one of the “LoadInt”s misses the L1 cache, but hits in the L2 cache, so its actual latency is 16 cycles, not 4. But a latency of 16 cycles just means that it takes 16 cycles between issuing the load and getting the result; we can keep issuing other loads for the entire time. So the only thing that ends up happening is that the “Sum k” in the graph above happens 12 cycles later. We still start two new loads every clock cycle in the steady state; some of them end up taking longer, but that does not keep us from starting work on a new iteration of the loop in every cycle.

Both the linked list and the indirect-sum examples have the opportunity to skip all over memory if they want to; but in the linked-list case, we need to wait for the result of the previous load until we can get started on the next one, whereas in the indirect-sum case, we get to overlap the wait times from the different iterations nicely. As a result, in the indirect-sum case, the extra latency towards reaching the final sum is essentially determined by the worst single iteration we had, whereas in the linked-list case, every single cache miss makes our final result later (and costs us throughput).
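To put numbers on this difference, here is a small sketch that gives every element its own load latency and computes when the final sum is ready in both loops. The assumptions are mine and deliberately crude: unlimited issue width, the “LoadInt” and “LoadNext” of a list node see the same latency (say, because they share a cache line), and “LoadPtr” always hits in L1.

```python
def linked_list_finish(latencies):
    """Cycle in which the sum is complete when node k's loads take latencies[k]."""
    ptr_ready = 0
    sum_ready = 0
    for latency in latencies:
        loaded = ptr_ready + latency  # LoadInt and LoadNext issue together
        sum_ready = max(loaded, sum_ready) + 1
        ptr_ready = loaded            # the next node cannot start any earlier
    return sum_ready

def indirect_sum_finish(latencies, pointer_load_latency=4):
    """Same for the array-of-pointers loop; LoadInt k takes latencies[k]."""
    sum_ready = 0
    for k, latency in enumerate(latencies):
        load_int = k + pointer_load_latency  # LoadPtr k issues in cycle k
        loaded = load_int + latency
        sum_ready = max(loaded, sum_ready) + 1
    return sum_ready

hits = [4] * 100
one_miss = list(hits)
one_miss[50] = 256       # a single access goes all the way to DRAM
two_misses = list(one_miss)
two_misses[10] = 256     # a second, independent DRAM access

for name, lat in (("all L1 hits", hits), ("one miss", one_miss), ("two misses", two_misses)):
    print(name, linked_list_finish(lat), indirect_sum_finish(lat))
```

With these inputs the linked list finishes in cycle 401, 653 and 905: every miss adds its full 252 extra cycles. The indirect sum finishes in cycle 108, 360 and 360: the second miss overlaps with the first and costs nothing more.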

The fundamental issue isn’t that the linked-list traversal might end up missing the cache a lot; while this isn’t ideal (and might cost us in other ways), the far more serious issue is that any such cache miss prevents us from making progress elsewhere. Having a lot of cache misses isn’t necessarily a problem if we get to overlap them; having long stretches of time where we can’t do anything else, because everything else we could do depends on that one cache-missing load, is.

In fact, when we hit this kind of problem, our best bet is to just switch to doing something else entirely. This is what CPUs with simultaneous multithreading/hardware threads (“hyperthreads”) and essentially all GPUs do: build the machine so that it can process instructions from multiple instruction streams (threads), and then if one of the threads isn’t really making progress right now because it’s waiting for something, just work on something else for a while. If we have enough threads, then we can hopefully fill those gaps and always have something useful to work on. This trade-off is worthwhile if we have many threads and aren’t really worried about the extra latency caused by time-slicing, which is why this approach is especially popular in throughput-centric architectures that don’t worry about slight latency increases.

Unrolling

But let’s get back to our original integer sum code for a second:

    loop:
      rCurInt = load64(rCurPtr);          // Load
      rSum = rSum + rCurInt;              // Sum
      rCurPtr = rCurPtr + 8;              // Advance
      if rCurPtr != rEndPtr goto loop;    // Done?

We have a kernel with four instructions here. Out of these four, two (“Load” and “Sum”) do the actual work we want done, whereas “Advance” and “Done?” just implement the loop itself and are essentially overhead. This type of loop is a prime target for unrolling, where we collapse two or more iterations of the loop into one to decrease the overhead fraction. Let’s not worry about the setup or what to do when the number of elements in the array is odd right now, and only focus on the “meat” of the loop. Then a 2× unrolled version might look like this:

    loop:
      rCurInt = load64(rCurPtr);          // LoadEven
      rSum = rSum + rCurInt;              // SumEven
      rCurInt = load64(rCurPtr + 8);      // LoadOdd
      rSum = rSum + rCurInt;              // SumOdd
      rCurPtr = rCurPtr + 16;             // Advance
      if rCurPtr != rEndPtr goto loop;    // Done?

which has this dataflow graph:

Note that even though I’m writing to rCurInt twice in an iteration, which constitutes a write-after-write (WAW) or “output dependency”, there’s no actual dataflow between the loads and sums for the first and second version of rCurInt, so the loads can issue in parallel just fine.

This isn’t bad: we now have two loads every iteration and spend 6N instructions to sum 2N integers, meaning we take 3 instructions per integer summed, whereas our original kernel took 4. That’s an improvement, and (among other things) means that while our original integer-summing loop needed a machine that sustained 4 instructions per clock cycle to hit full throughput, we can now hit the same throughput on a smaller machine that only does 3 instructions per clock. This is definitely progress.

However, if we look at the diagram, we can see that while we can indeed start a new pair of loads every clock cycle, there’s a problem with the summing: we have two dependent adds in our loop, and as we can see from the relationship between “SumEven k” and “SumEven k+1”, the actual summing part of the computation still takes 2 cycles per iteration. On our idealized dataflow machine with infinite lookahead, that just means that all the loads will get front-loaded, and then the adds computing the final sum proceed at their own pace; the result will eventually be available, but it will still take a bit more than 2N cycles to sum 2N integers, no faster than we were in the original version of the loop. On a more realistic machine (which can only look ahead by a limited number of instructions), we would eventually stop being able to start new loop iterations until some of the old loop iterations have completed. No matter how we slice it, we’ve gone from adding one integer to the sum per cycle to adding two integers to the sum every two cycles. We might take fewer instructions to do so, which is a nice consolation prize, but this is not what we wanted!

What’s happened is that unrolling shifted the critical path. Before, the critical path between iterations went through the pointer advance (or, to be more precise, there were two critical paths, one through the pointer advance and one through the sum, and they were both the same length). Now that we do half the number of advances per item, that isn’t a problem anymore; but the fact that we’re summing these integers sequentially is now the limiter.

A working solution is to change the algorithm slightly: instead of keeping a single sum of all integers, we keep two separate sums, one for the integers at even-numbered array positions, and one for the integers at odd-numbered positions. Then we need to sum those two values at the end. This is the algorithm:

    loop:
      rCurInt = load64(rCurPtr);          // LoadEven
      rSumEven = rSumEven + rCurInt;      // SumEven
      rCurInt = load64(rCurPtr + 8);      // LoadOdd
      rSumOdd = rSumOdd + rCurInt;        // SumOdd
      rCurPtr = rCurPtr + 16;             // Advance
      if rCurPtr != rEndPtr goto loop;    // Done?
    rSum = rSumEven + rSumOdd;            // FinalSum
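Splitting the sum like this is legal because wrap-around 64-bit integer addition is associative and commutative, so regrouping the additions cannot change the result. A quick way to convince yourself is to compare both variants on values that overflow; this is a sketch only, with Python integers masked to 64 bits standing in for the registers, and function names invented for the example.

```python
MASK64 = (1 << 64) - 1  # registers in the model hold 64-bit values

def sum_single(values):
    """One running total, as in the original loop."""
    total = 0
    for v in values:
        total = (total + v) & MASK64
    return total

def sum_two_chains(values):
    """2x unrolled sum with separate even/odd accumulators."""
    # Odd-length tail handling is deliberately left out, as in the text.
    assert len(values) % 2 == 0
    sum_even = 0
    sum_odd = 0
    for i in range(0, len(values), 2):
        sum_even = (sum_even + values[i]) & MASK64    # SumEven
        sum_odd = (sum_odd + values[i + 1]) & MASK64  # SumOdd
    return (sum_even + sum_odd) & MASK64              # FinalSum

values = [MASK64, 1, 2, 3, 1 << 63, 1 << 63]
print(sum_single(values), sum_two_chains(values))  # 5 5
```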

And the dataflow graph for the loop kernel looks as follows:

Where before all the summing was in what’s called the same dependency chain (the name should be self-explanatory by now, I hope), we have now split the summation into two dependency chains. And this is enough to make a sufficiently-wide machine that can sustain 6 instructions per cycle complete our integer-summing task in just slightly more than half a cycle per integer being summed. Progress!
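The effect of the split on the finish time can be sketched with the same kind of bookkeeping as before. The assumptions, all mine: unlimited issue width, one loop iteration (two loads) starting per cycle, 4-cycle loads and 1-cycle adds; `unrolled_sum_finish` is a made-up helper that only handles one or two chains.

```python
LOAD_LATENCY = 4
ADD_LATENCY = 1

def unrolled_sum_finish(iterations, chains):
    """Cycle in which the total is available for the 2x unrolled loop.

    chains=1 models the single-rSum version, chains=2 the version with
    rSumEven and rSumOdd followed by FinalSum.
    """
    if chains not in (1, 2):
        raise ValueError("this sketch only models one or two chains")
    ready = [0] * chains  # cycle in which each accumulator is next available
    for k in range(iterations):
        loaded = k + LOAD_LATENCY   # both loads of iteration k issue in cycle k
        for element in range(2):    # even element first, then odd
            acc = element % chains
            ready[acc] = max(loaded, ready[acc]) + ADD_LATENCY
    if chains == 2:
        return max(ready) + ADD_LATENCY  # FinalSum combines the two chains
    return ready[0]

N = 100  # iterations, i.e. 2N = 200 integers
print(unrolled_sum_finish(N, 1), unrolled_sum_finish(N, 2))  # 204 105
```

With one chain the 200 integers take 2N+4 = 204 cycles; with two chains they take N+5 = 105 cycles, a bit over half a cycle per integer.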

On a somewhat narrower 4-wide design, we are now throughput-bound, and take around 6/4=1.5 cycles per two integers summed, or 0.75 cycles per integer. That’s still a good improvement from the 1 cycle per integer we would have gotten on the same machine from the non-unrolled version; this gain is purely from reducing the loop overhead fraction, and further unrolling could reduce it even further. (That said, unless your loop really is as tiny as our example, you don’t generally want to go overboard with unrolling.)
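To see how quickly the returns from further unrolling diminish, here is the same estimate as a function of the unroll factor. It’s a rough sketch that assumes one accumulator per unrolled element (so every loop-carried chain stays at 1 cycle) and counts one load and one add per element plus “Advance” and “Done?”; the function name is invented for the illustration.

```python
from fractions import Fraction

def cycles_per_integer(unroll, width):
    """Steady-state bound for the array sum unrolled `unroll` times."""
    instructions = 2 * unroll + 2  # unroll x (load + add), Advance, Done?
    per_iteration = max(Fraction(1), Fraction(instructions, width))
    return per_iteration / unroll

for unroll in (1, 2, 4, 8):
    print(f"unroll {unroll}x on W=4: {cycles_per_integer(unroll, 4)} cycles/integer")
```

On the 4-wide machine this gives 1, 3/4, 5/8 and 9/16 cycles per integer: the overhead shrinks, but the two instructions of real work per integer set a floor of 2/4 = 0.5 that no amount of unrolling gets below.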

Tying it all together

In the introduction, I talked about the need for a model detailed enough to make quantitative, not just qualitative, predictions; and at least for very simple compute-bound loops, that is exactly what we have now. At this point, you should know enough to look at the dependency structure of simple loops, and have some idea for how much (or how little) latent parallelism there is, and be able to compute a coarse upper bound on their “speed of light” on various machines with different peak instructions/cycle rates.

Of course, there are many simplifications here, most of which have been already noted in the text; we’re mostly ignoring the effects of the memory hierarchy, we’re not worrying at all about where the decoded instructions come from and how fast they can possibly be delivered, we’ve been flat-out assuming that our branch prediction oracle is perfect, and we’ve been pretending that while there may be a limit on the total number of instructions we can issue per cycle, it doesn’t matter what these instructions are. None of these are true. And even if we’re still compute-bound, we need to worry at least about that latter constraint: sometimes it can make a noticeable difference to tweak the “instruction mix” so it matches better what the hardware can actually do in a given clock cycle.

But all these caveats aside, the basic concepts introduced here are very general, and even just sketching out the dependency graph of a loop like this and seeing it in front of you should give you useful ideas about what potential problems are and how you might address them. If you’re interested in performance optimization, it is definitely worth your time practicing this so you can look at loops and get a “feel” for how they execute, and how the shape of your algorithm (or your data structures, in the linked list case) aids or constrains the compiler and processor.

UPDATE: Some additional clarifications in answer to some questions: paraphrasing one, “if you have to first write C code, translate it to some pseudo-assembly, and then look at the graph, how can this possibly be a better process than just measuring the code in the first place?” Well, the trick here is that to measure anything, you actually need a working program. You don’t need one to draw a dataflow graph. For example, a common scenario is that there are many ways you could structure some task, and they all want their data structured differently. Actually implementing and testing multiple variants like this requires you to write a lot of plumbing to massage data from one format into another (all of which can be buggy). Drawing a graph can be done from a brief description of the inner loop alone, and you can leave out the parts that you don’t currently care about, or “dummy them out” by replacing them with a coarse approximation (“random work here, maybe 10 cycles latency?”). You only need to make these things precise when they become close to the critical path (or you’re throughput-bound).

The other thing I’ll say is that even though I’ve been talking about adding cycle estimates for compute-bound loops here, this technique works and is useful at pretty much any scale. It’s applicable in any system where work is started and then processed asynchronously, with the results arriving some time later. If you’re analyzing a tight, compute-bound loop, cycle-level granularity is the way to go. But you can zoom out and use the same technique to figure out how your decomposition of an algorithm into tasklets processed by a thread pool works out: do you actually have some meaningful overlap, or is there still one long serial dependency chain that dominates everything, and all you’re doing by splitting it into tasklets like that is adding overhead? Zooming out even further, it works to analyze RPCs you’re sending to a different machine, or queries to some database. Say you have a 30ms target response time, and each RPC takes about 2ms to return its results. In a system that takes 50 RPCs to produce a result, can you meet that deadline? The answer depends on how the dataflow between them looks. If they’re all in series, almost certainly not. If they’re in 5 “layers” that each fan out to 10 different machines then collect the results, you probably can. It certainly applies in project scheduling, and is one of the big reasons the “man-month” isn’t a very useful metric: adding manpower increases your available resources but does nothing to relax your dependencies. In fact, it often adds more of them, to bring new people up to speed. If the extra manpower ends up resulting in more work on the critical path towards finishing your project (for example to train new hires), then adding these extra people to the project made it finish later. And so forth. The point being, this is not just limited to cycle-by-cycle analysis, even though that’s the context I’ve been introducing it in. It’s far more general than that.
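The RPC example is easy to check with a few lines of code: the response time is the longest chain of dependent calls. This is a sketch under generous assumptions (unlimited parallelism, a flat 2ms per RPC, no cost for fanning out or collecting results), and the helper name and graph encoding are made up for the illustration.

```python
def critical_path_ms(durations, depends_on):
    """Longest dependency chain through a DAG of tasks, in milliseconds.

    durations: task -> time it takes; depends_on: task -> tasks it waits for.
    """
    finish = {}

    def finish_time(task):
        if task not in finish:
            start = max((finish_time(d) for d in depends_on.get(task, ())), default=0)
            finish[task] = start + durations[task]
        return finish[task]

    return max(finish_time(task) for task in durations)

RPC_MS = 2

# 50 RPCs in series: each one waits for the one before it.
series_time = critical_path_ms(
    {i: RPC_MS for i in range(50)},
    {i: [i - 1] for i in range(1, 50)},
)

# 5 layers of 10 RPCs: each RPC waits for every RPC in the previous layer.
layered_time = critical_path_ms(
    {(layer, i): RPC_MS for layer in range(5) for i in range(10)},
    {(layer, i): [(layer - 1, j) for j in range(10)]
     for layer in range(1, 5) for i in range(10)},
)

print(series_time, layered_time)  # 100 10
```

The serial arrangement needs 100ms and blows the 30ms budget; the layered one needs 10ms and leaves room for the overhead this sketch ignores.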

And I think that’s enough material for today. Next up, I’ll continue my “Reading bits in far too many ways” series with the third part, where I’ll be using these techniques to get some insight into what kind of difference the algorithm variants make. Until then!