Speeding Up Your Engineering Org, Part I: Beyond the Cost Center Mentality

April 17, 2014 by Edmund Jorgensen

It is a truth universally acknowledged, that engineering orgs—like greyhounds, sports cars, and wide receivers—slow down as they age.

Odds are good that you have experienced this phenomenon personally at some point in your engineering career. The slowdown was gradual, frustrating, and oddly stubborn. It survived: numerous rounds of hiring; a spate of offsites where inspiring speakers harangued everyone to “cut through the crap” and just “get shit done”; a blood-spattered re-org or two; and even a few ground-up rewrites that utterly failed to deliver on their promised boost in velocity.

If you’re now involved with engineering leadership in some capacity, you may well have accepted the slowdown as a sad universal truth. Accordingly, you may have shifted your efforts from the impossible task of making the org go faster to the thankless but crucial job of jealously guarding how engineers spend their time—because as it takes longer and longer to get even simple features out the door, those engineering hours become increasingly precious.

If all this sounds familiar, I have good news and bad news for you.

The good news: it isn’t actually a law of nature that engineering orgs have to slow down as they mature and grow. With active, countervailing investment, it’s possible to maintain and even gain speed.

“But,” you protest, “I’ve made investments, remember? I’ve hired! I’ve brought in speakers! I’ve re-orged and re-factored and tried out every flavor of agile there is, and still we go slower and slower!”

Yes, which brings us to the bad news: that slowdown is a far bigger deal than you might have realized, and way more harmful to the bottom line of your business than you might imagine. Oh, and that jealous guarding of engineer hours for features? It’s only making things worse.

In this article I’m going to consider the speed of an engineering org as an economic question—not a moral question, or a question of technology choices, or a question of people “hustling” and “powering through” the obstacles they find in their path. I believe that a good percentage of engineering and business leaders economically model their engineering org—consciously or unconsciously—as a “cost center,” where every engineer hour not spent on features must translate to (at least) one engineer hour saved, and I believe that this economic model makes it extremely difficult to identify and justify the investments that could actually speed that org up. I’ll propose an alternate economic model of an engineering org—one in which speed to delivery, rather than number of engineer hours paid, is the dominant economic factor—and in which considerable, sustained investment in that speed can reap massive economic returns.

But let’s get a little more concrete with this—let’s look at an example of the kinds of decisions that face engineering orgs and their leaders every day, and just how easy it is to slip into the “cost center” mentality when attempting to juggle them.

A Tale of Two Engineers

Say you’re an engineering manager at Company X, and one morning you arrive at work to find two of your best engineers waiting outside your office. You haven’t even opened your door before they start in on you.

“Look,” says Cindy, the first engineer, “I know that the CEO is breathing down our neck to finish the new Facebook for Cats integration, but we’ve got to clear some time to work on automating database migrations. I’m the only one who knows enough to apply them to the prod DB, and I’m getting tired of spending half an hour every morning rolling out everyone else’s changes. So can we push a feature or two back and squeeze that in?”

“Forget the migrations,” says Scott, the second engineer, “we need to talk about the Frobulator Service. Two years ago we agreed to hack it up quickly in PHP, but product promised us—PROMISED—that we would have time to go back and clean it up. Yesterday I happened to be back in that code while I was updating the copyright years in our headers, and it’s even worse than I remembered. We need to rewrite it in Scala so it’s more modern, performant, and easier to maintain. Can you tell product we’re calling in that promise, please, and I’ll get started?”

First off: everything your engineers have said is true. Cindy really is spending a half hour every morning dealing with database migrations; the source for the Frobulator Service really does look like a plate of partially digested capellini; product really did promise time to clean that mess up; and of course there really is a long and growing backlog of features for the upcoming Facebook for Cats integration, each of them (according to the CEO and product) absolutely essential and destined to become a customer favorite.

Furthermore, you’ve been around long enough to know that there won’t be any “calm periods” when there’s time for your engineers to scratch these other itches—after the Facebook for Cats integration goes out, you’ll be right on to integrating with Twitter for Dogs, or LinkedIn for Ferrets. So on this fine morning someone has to make a real and uncomfortable decision: either tell Cindy and Scott to stop complaining and get back to feature work, or let product and the CEO know that you’re going to spend some engineering hours on something other than features. And today that someone is you.

Pop quiz, hot shot: what do you do?

WHAT DO YOU DO?

A Simple, Responsible, and Totally Wrong Approach

If you’re a mature, business-focused engineering leader, you might grab some coffee, sit Cindy and Scott down, and tell them something like this:

“Cindy, I’m sorry to hear that you’re getting bored doing so much production DB work, but realistically it would take you at least 40 hours of work to write, test, and deploy a migration utility, right? So if you’re spending a half hour a day on migrations, it would be 80 working days before we saw a return on our investment—that’s like 4 months, and that’s just too long for me to sanction—precisely because you’re such a valuable member of the team, and I can’t spare so much of your time right now away from our feature backlog. We can touch base if the migration workload increases too much, OK? Until then, I have to ask you to put your head down and be a team player.

“Scott, you’re absolutely right, product did promise that we could spend time cleaning up the Frobulator Service, and I’m sure they were acting in good faith, but none of us could have possibly known at the time how our product was going to take off—we’ve got customers practically beating down our door for new features, and they’re not going to see any difference whether the Frobulator Service is written in crappy PHP or transcendent Scala.

“Both of you are great engineers with bright futures, and if those futures include engineering management, then part of your job will be to understand that engineering’s job is to produce effects that are visible to customers. So if we burn hours on projects that aren’t customer visible—projects that are by engineers, for engineers—we need to be able to show directly how those hours will pay for themselves in saved engineering hours in pretty short order.”

This approach feels rational, responsible, and easy to apply, right? There’s only one small problem: by slipping into the “cost center” mentality, where engineering hours must only be spent on features or a greater savings in engineering hours, you’ve actually just slowed your engineering org down further, and cost your company real (though largely invisible) money in the process. How did this happen without our even noticing, while we thought we were being so responsible?

“Engineer Hours” vs. Latency: Where the “Cost Center” Gets It Wrong

The cost center model of engineering, to which our hypothetical engineering leader has just retreated, is basically this: an engineering org is a furnace which burns money, in the form of compensated engineer hours, and produces features. Therefore if org A can produce the same feature at half the cost of org B, then org A is twice as good as org B! And if spending 1 engineer hour on some task today will save you 100 engineer hours in the next few weeks, then you have just improved your org’s economics by 99 of those expensive engineer hours!
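The arithmetic behind the answer Cindy got is simple enough to write down. Here is a minimal sketch of it, using the numbers from the coffee chat; the function name and the figure of 20 working days per month are my own assumptions for illustration.

```python
WORKING_DAYS_PER_MONTH = 20  # assumption: a rough figure, not from the article


def payback_working_days(upfront_hours: float, hours_saved_per_day: float) -> float:
    """Working days until the engineer hours saved equal the hours invested."""
    if hours_saved_per_day <= 0:
        raise ValueError("hours_saved_per_day must be positive")
    return upfront_hours / hours_saved_per_day


# Cindy's migration tool: 40 hours to build, saves half an hour a day.
days = payback_working_days(upfront_hours=40, hours_saved_per_day=0.5)
months = days / WORKING_DAYS_PER_MONTH
print(f"payback in {days:.0f} working days (about {months:.0f} months)")
```

This comes out to 80 working days, or roughly 4 months, which is exactly the figure that got Cindy turned down. Notice that the only inputs are engineer hours.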

The fundamental and deadly flaw in this model is that it does not account economically for the speed of work through the engineering org—or what I’ll refer to from here on out as “latency”—the wall-clock hours, not paid engineer hours, that it takes the engineering org to turn some concept into reality. In other words, we can’t simply think of an engineering org as “an engine that produces thing X at cost Y.” We have to model it as “an engine that produces thing X at cost Y with latency Z,” and recognize that “latency Z” itself can and should be translated into some cost / value structure.

This is not to say that engineering leaders who employ this cost center model don’t care or think about latency. To the contrary, they often talk about it quite a bit, exhorting their teams to feel a “sense of urgency” and to exhibit a “just git ’er done” attitude. But they treat latency as a moral or personal question (a matter of character or work ethic) rather than something that is, at its heart, organizational and economic.

It’s human nature to experience paid engineer hours as expensive and latency as annoying, because the costs of latency tend to be invisible—they usually take the form of lost opportunities or earnings, many of which, once you miss them, you never even know existed—rather than real, painful checks that you have to cut each month for payroll.

Consider an analogue: the rent your business pays on an office building. If you found a building that was only half the rent, you might well be tempted to move and count that as a huge savings—but that’s rarely the whole economic story. Is the new building farther away from where the bulk of your employees live? Does it lack the public transit options of the more expensive building? How’s the light? What’s the layout like? All of these factors can affect the amount of time your employees spend in the office, the amount and quality of work they get done there, and even the kind of people who want to work at your company in the first place—and if the cheaper building leads to a drop in productivity, or to worse hires, then that “savings” on rent might turn out to be very expensive indeed to your business’s bottom line, even though—and here’s the horrific part—that connection will probably never show up on your company’s balance sheet. It’s not hard to imagine the employee who found the cheaper building being rewarded with a fat bonus in the same cycle that a bunch of other employees are dinged for a stagnant product, increasing bug count, and flagging sales—even if all those problems were caused, to some extent, by the change in location.

One method to expose some of these invisible economic effects is to take them to an absurd extreme. For example, if your business is currently paying a half million in rent a year for a Boston office, with a workforce who lives in nearby suburbs, it’s clearly not a smart economic decision to move to a snow-cave in Juneau, Alaska—even if it’s wired for Ethernet and your annual rent would drop to $1. We’ve managed to magnify the invisible costs to a size where they can’t be easily ignored.

So let’s employ the same technique—reduction to some absurd extreme—in a thought experiment designed to demonstrate how the latency of your engineering org is almost certainly its dominant economic factor—much, much larger than the piddling six-figure salaries you’re paying the engineers it comprises.

The Thought Experiment

Role change: you’re no longer an engineering leader overseeing Facebook for Cats integration. Now you’re the CEO of a company that makes its money through big, enterprise contracts. A potential customer you’ve been after for a while is entertaining bids on a project, and will consider proposals—which are expected to include a working proof of concept—in one month.

You aren’t the only company trying to land this contract—there are lots of smart competitors. And, by the way, you’re not allowed to deliver early, even if you finish the proof of concept early—all proposals will be considered on the same day, one month from now.

As CEO you have two engineering teams available to you.

The first team is a group of good, steady developers, who correctly estimate that the proof of concept will take exactly one month for them to build (of course they can’t possibly know this, but that’s a story for another article and here we’ll just pretend they can, because we’re in a thought experiment and we can do whatever we want). Over this month of development, this team will cost the business $100,000 in salary and other compensation.

The second team, on the other hand, is a group of freelancers who are amazingly, inhumanly fast: they can produce the same proof of concept, at the same level of quality, in just one second. Before you get too excited thinking about all the money you’re going to save with this team, however, you should know this: for that one second of work, these freelancers will be invoicing you dearly—to the tune of $100,000.

Recapping your options, you have:

the normal team, which will take a month to produce the proof of concept for a total cost of $100,000

the insanely fast team, which will take a second to produce the proof of concept for a total cost of $100,000

The costs of the proof of concept are equivalent with either team, as is the quality of the product—only the latency differs. Obviously if you could deliver the proposal as soon as the proof of concept was done, you’d choose the insanely fast team every time. But that would be too easy, so in our thought experiment—where you’re not allowed to deliver the proposal early—does the latency even matter?

There’s only one scenario to consider with the normal team—they have to start working today, and they’ll finish just in time for the presentation. Start them even a day late, and they won’t finish.

With the insanely fast team, on the other hand, you have on the order of 2,592,000 scenarios to consider, as they could start and finish at any second in the entire month. But are any of these scenarios valuable?

Let’s take a look at a couple of these possibilities.

The Need for Speed

One obvious approach with the insanely fast team would be to produce the proof of concept immediately, in the very first second. Does that buy you anything? You can’t deliver the proof of concept early, but now that it exists, there are a couple things you could do with it.

For example, you could show it around and get a reaction—internally, if your business has some good proxies for your customer’s needs, or to one of the customers “on the ground” (not the Big Important People you’ll be pitching at the end of the month, just regular workers). Then you can take their feedback and do any of the following:

Iterate: Have the insanely fast team produce a second, improved version of the proof of concept—you’ll have to pay them another $100,000, but you’ll have good information about whether that’s worth it or not. You can repeat this process as many times as you like or can afford, and go into the demo having iterated through N versions to your competition’s one.

Abandon: If the feedback you get is “this is crap, and the only ways to make it good enough are too difficult or expensive to consider,” then you can abandon the contract and move on to try to sell something different to a different customer—or something different to the same customer! Meanwhile, your competition is sweating away trying to produce their own proofs of concept—squandering precious time and attention on a contest you already know isn’t worth winning.

Sell to Someone Else: By the rules you can’t deliver your proof of concept early to the one potential customer, but nothing says you can’t go out and try to sell it to a different one, or a different six. By the time proposal day arrives, you’re already a month ahead of your competition in other markets, and you might even have a nice story to tell about how your customer’s competition has already bought your version—and they’d better too, if they don’t want to fall behind.

So yeah, you could definitely say there’s some value to being able to finish the proof of concept in a second. That insanely fast team is starting to look pretty good right about now.

But wait…there’s more!

The Genius of Procrastination

What if you went to the other extreme, and waited as long as you could to produce the proof of concept, until the last possible second—literally while you’re walking down the hallway to make your presentation? Does that give you any interesting advantages?

One possibility that leaps to mind: given that your development is so expensive, you could do some cheaper exploration before you committed to a proof of concept. For example, you could send some PMs to shadow the customers, research companies that had tried similar approaches, etc.

By the time you commit to spending $100,000 on the proof of concept, you can have much better information about what it should do and what it shouldn’t. Maybe it turns out to be so difficult that you decide not to even build. Or maybe, with the insanely fast team at your back, an offhand remark as the customer is walking you to the presentation room prompts a quick phone call and a development cycle, allowing you to produce a last-second revision that totally changes the game.

In essence, by waiting until the last second to produce your proof of concept, you have the chance to be roughly 29 days, 23 hours, 59 minutes and 59 seconds better informed than your competition (the actual amount of time will depend on the particular month, whether it’s a leap year, etc., which is left as an exercise for the reader).
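For readers who would rather not do that exercise by hand, here is a minimal sketch using Python's standard calendar module. It assumes every day is exactly 24 hours long (no daylight saving shifts, no leap seconds), and the function names are mine.

```python
import calendar

SECONDS_PER_DAY = 24 * 60 * 60


def seconds_in_month(year: int, month: int) -> int:
    """Number of one-second start times available in the given month."""
    days_in_month = calendar.monthrange(year, month)[1]
    return days_in_month * SECONDS_PER_DAY


def information_head_start(year: int, month: int) -> tuple:
    """(days, hours, minutes, seconds) gained by building in the very last second."""
    remaining = seconds_in_month(year, month) - 1
    days, remaining = divmod(remaining, SECONDS_PER_DAY)
    hours, remaining = divmod(remaining, 3600)
    minutes, seconds = divmod(remaining, 60)
    return days, hours, minutes, seconds


print(seconds_in_month(2014, 4))        # a 30-day month
print(information_head_start(2014, 4))
print(information_head_start(2016, 2))  # February in a leap year
```

A 30-day month gives the 2,592,000 scenarios mentioned earlier, and a head start of 29 days, 23 hours, 59 minutes and 59 seconds.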

Mix and Match

But the real power of the insanely fast team comes when you mix and match all the techniques above.

Step 1: Do cheap research until you have an idea of what to build.

Step 2: Build it instantly and loop back to Step 1, until you decide another iteration isn’t worth $100,000 (either because the proof of concept is now good enough, or because you’ve decided to scrap the project).

Step 3: Profit!
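If it helps to see the loop written out, here is a rough sketch of the steps above. The $100,000 per build comes from the thought experiment; everything else (the function name, and the list of estimates standing in for what each round of cheap research tells you) is hypothetical.

```python
ITERATION_COST = 100_000  # price of one build by the insanely fast team


def mix_and_match(estimated_gains):
    """Walk the research/build loop and return (builds, dollars_spent).

    estimated_gains holds, for each round, the value that cheap research
    (Step 1) says another build would add. These are made-up inputs.
    """
    builds = 0
    spent = 0
    for gain in estimated_gains:
        # Step 2 exit: another iteration is no longer worth its price,
        # either because the result is good enough or the project is a dud.
        if gain <= ITERATION_COST:
            break
        # Step 2: build instantly, then loop back to Step 1.
        builds += 1
        spent += ITERATION_COST
    return builds, spent


builds, spent = mix_and_match([900_000, 400_000, 150_000, 60_000, 200_000])
print(builds, spent)
```

With these invented estimates you would build three times and stop at the fourth round, having spent $300,000. The normal team never gets to make that decision even once.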

Finish Early, Start Late

What the insanely fast team gives you, in other words, is the ability to finish early or start late. In an environment where uncertainty rules and information is value (like software development), that allows for tremendously valuable information gain, because what you finish early tends to generate information, and what you start late tends to benefit from newly available information. The poor old regular engineering team, on the other hand, has to start early and finish late just in order to get the work done by the deadline. Their labor can neither generate extra information nor benefit from it as it becomes available.

So Which Team Do You Want, Mr. CEO?

By now it should be clear: although the two teams cost the same, and produce the same quality output, you would be crazy not to choose the insanely fast team and their drastically reduced latency. In fact, you’d be crazy not to pay a steep premium, well beyond the normal team’s salaries, to use the insanely fast team, or even to keep them inactive but on retainer.

This is so important it’s worth calling out: if you’re any kind of rational, you would pay a tremendous amount of extra money to use the insanely fast team, which means that a reduction in latency equals money. Real, actual money—and usually a lot of it. In our thought experiment, for example, a smart CEO would gladly pay $1,000,000 to use the insanely fast team instead of the regular team if it meant a massively increased chance at a $15,000,000 project. A smart CEO would see that not as “spending” money—but as investing it—putting money out into the world in the reasonable expectation of having that money return, now increased by some multiple.
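To put rough numbers on that claim, here is a back-of-the-envelope sketch. The $15,000,000 contract and the $100,000 and $1,000,000 prices come from the thought experiment; I am assuming the $1,000,000 is the fast team's total price, and the win probabilities are invented purely for illustration.

```python
def expected_net_value(win_probability: float, contract_value: float, team_cost: float) -> float:
    """Expected contract revenue minus what the team costs."""
    return win_probability * contract_value - team_cost


def break_even_uplift(extra_cost: float, contract_value: float) -> float:
    """Increase in win probability needed for the extra spend to pay for itself."""
    return extra_cost / contract_value


CONTRACT_VALUE = 15_000_000
NORMAL_TEAM_COST = 100_000
FAST_TEAM_COST = 1_000_000  # assumption: total price, not a premium on top

P_WIN_NORMAL = 0.10  # made-up probability
P_WIN_FAST = 0.40    # made-up probability

normal = expected_net_value(P_WIN_NORMAL, CONTRACT_VALUE, NORMAL_TEAM_COST)
fast = expected_net_value(P_WIN_FAST, CONTRACT_VALUE, FAST_TEAM_COST)
uplift = break_even_uplift(FAST_TEAM_COST - NORMAL_TEAM_COST, CONTRACT_VALUE)

print(f"normal team: {normal:,.0f}")
print(f"fast team:   {fast:,.0f}")
print(f"break-even uplift in win probability: {uplift:.0%}")
```

Whatever probabilities you plug in, the break-even point only depends on the extra cost and the size of the prize: under these assumptions the fast team pays for itself if it raises your chance of winning by more than about 6 percentage points.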

Once you start thinking of engineering dollars as investment rather than cost, the fallacies of the “cost center” model become glaringly obvious. The equation behind your org isn’t “engineer hours paid for features or saved engineering hours”—it’s “money invested in the expectation of more money.” Often the money invested is in the form of paid engineer hours, but sometimes it’s new machines, or better chairs, or office space for a remote contingent, and so on. And sometimes the “more money” you expect in return comes from features for which customers will pay, but often (as in our thought experiment) it comes in the form of valuable information, or—if you’re doing it right—a reduction in (or prevention of) latency for future work, which, as we’ve just shown with our thought experiment, is worth actual money.

Sitting Down Again with Cindy and Scott

Let’s rewind back to that coffee with Cindy and Scott, where you as engineering leader were explaining to them all about how engineer hours could only be spent on features or efforts that would cut future engineer hours. With the clearer economic picture in mind, this argument no longer seems so simple and rational.

Cindy wanted time to work on DB deploy scripts, since she was the only one who could reliably get changes out to the production DB and was spending a chunk of her mornings doing so. At the time, what we heard behind her lament was “I’m getting bored doing the job you’re paying me to do and I need to be gently cat-herded to keep doing it”—but what we should have heard was “DANGER, WILL ROBINSON—a queue is forming in your engineering org.”

Cindy has become a bottleneck for changes making their way to production, and a queue of people trying to make those changes is forming behind her. Queues are one of the clearest signals of developing latency. What happens if Cindy is out for a few days on (gasp) vacation? No changes will go out. What happens if she becomes overloaded with other matters, and—without telling you—starts applying DB migrations only once a week, to “batch things up” and “be more efficient” with her time? Your latency has just skyrocketed invisibly—and the fact that this is possible should terrify you as an engineering leader. Cindy’s complaint is a warning of latency to come, and you need to nip that in the bud with extreme prejudice. You should probably allow Cindy to do her migration project—and you should definitely explain to her why you’re allowing it.
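To see how quietly that batching hurts, here is a small sketch comparing daily and weekly application of migrations. The arrival pattern (one migration ready at midday on each of 20 working days) and the function name are assumptions of mine, not data from any real team.

```python
import math


def average_wait_days(arrival_days, batch_interval_days):
    """Mean working days a migration waits if batches go out only at
    multiples of batch_interval_days."""
    waits = []
    for arrival in arrival_days:
        next_batch = math.ceil(arrival / batch_interval_days) * batch_interval_days
        waits.append(next_batch - arrival)
    return sum(waits) / len(waits)


# Assumption: one migration becomes ready at midday on each of 20 working days.
arrivals = [day + 0.5 for day in range(20)]

daily = average_wait_days(arrivals, batch_interval_days=1)
weekly = average_wait_days(arrivals, batch_interval_days=5)
print(daily, weekly)
```

In this toy setup the average wait goes from half a working day to two and a half. The same migrations get applied either way, so an accounting of engineer hours would show nothing wrong, and might even show an improvement.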

As for Scott, who wanted to rewrite the Frobulator Service from horrific PHP to stunning Scala because product had promised the time to clean it up: the “promise” from product is clearly economically irrelevant, and big rewrites tend to be a terrible investment, so you probably shouldn’t say yes to Scott’s exact request—but you still have some digging to do here to figure out whether this (almost certainly misguided) desire to rewrite is just a blue-sky engineering itch, or a signal that the Frobulator Service is creating latency.

First of all, Scott was only in that code to “update copyright years”—he wasn’t making functional changes, and apparently hadn’t made any in at least a year. Is this a clue that the Frobulator Service doesn’t see that much coding activity? Worth digging into, because if engineers aren’t touching the Frobulator Service because it’s frobulating just fine and there aren’t really any changes to make, that’s great—the code might read like Cthulhu’s diary, but it’s not affecting your latency and can be left as is for the moment. If, on the other hand, there are tons of changes that should go into the Frobulator Service, but which are finding their way into compensatory hacks throughout the rest of the codebase instead—because engineers are terrified to touch the Frobulator Service code—then you’ve got a brewing latency problem that you need to expose and deal with, because those hacks are probably already slowing you down, and the situation is only going to get worse. Almost certainly you still don’t want to commission a full-on rewrite, but a steady, incremental investment in testing, monitoring, and refactoring the Frobulator Service might be indicated.

Takeaways from Cindy and Scott

One of the deadliest things about latency is that often the slowdown of even a single piece of your org can introduce it, while making things faster generally requires steady work on a lot of fronts. That’s an imbalance that’s not in your favor. Add to this the certainty that latency is developing in your organization at every moment—that is the nature of organizations—and that it is often invisible to you (or any single individual)—and that, as we saw in our thought experiment, latency is tremendously expensive—and the response that’s indicated from you, the engineering leader, is a calm but constant terror.

Your job is to translate that terror into a form of shared vigilance: listen carefully to your engineers, dig into the problems they bring you, and ensure that every one of them understands the cost of latency and is on the lookout for it, making micro speed-ups everywhere they see the opportunity and surfacing brewing slowdowns.

In other words, make latency something your whole team seeks, hates, and destroys.

How to Invest in Latency Reduction

“All right,” you say, “I’m convinced—latency is a bigger deal than I thought before, and something I can improve—in theory. But how do I do it in practice? I’ve made all those investments that didn’t help at all—how do I know that if I invest in something, it will actually improve my latency?”

Some of this also comes down to how much you invest, but we’ll leave that until Part II, and here just discuss what you can look to invest in.

Here are a few places you can start.

Activities Engineers Bitch About

Engineers tend to experience latency centers as painful or “busywork.” For example, do your engineers play “Rock Paper Scissors” to determine who has to spin up a new server? Does the loser go off cursing his luck and the world? Do your engineers go to absurd lengths to pack new services onto old machines, even when a new server would be the natural solution to the problem? Then take a look at what it requires to spin up a new server, and whether you can make an investment to make it less painful—you’ll likely effect a drop in latency.

Things Only Cindy Can Do

We saw an example of this with Cindy, who was the only engineer who knew enough about the prod DB to get migrations out. If only person X can do thing Y in your organization, you’ve created a bottleneck, and bottlenecks lead to latency. Cross-train or create tools to terminate these bottlenecks with extreme prejudice.

Look for Queues

Queues are a manifestation of latency, and once you can see them, you can attack them. Find them where they’re visible—ticketing systems and so on—and try to make them visible where they’re not, using techniques like a Kanban board.
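Even without a board, a few lines of script can surface a queue. The sketch below assumes you can export, from whatever system you use, a list of items that are waiting (not being worked on) along with the stage they are stuck in and the date they got there. The stage names, dates, and function name are all hypothetical.

```python
from collections import defaultdict
from datetime import date


def queue_report(waiting_items, today):
    """Summarize waiting work per stage: how many items, and how old the oldest is."""
    report = defaultdict(lambda: {"waiting": 0, "oldest_days": 0})
    for stage, entered in waiting_items:
        entry = report[stage]
        entry["waiting"] += 1
        entry["oldest_days"] = max(entry["oldest_days"], (today - entered).days)
    return dict(report)


# Made-up sample data for illustration only.
items = [
    ("awaiting prod migration", date(2014, 4, 10)),
    ("awaiting prod migration", date(2014, 4, 15)),
    ("awaiting code review", date(2014, 4, 16)),
]
for stage, stats in queue_report(items, today=date(2014, 4, 17)).items():
    print(stage, stats)
```

A stage whose count or oldest age keeps growing from week to week is a good place to start digging.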

Automated Tests

Good automated tests reduce latency, because they help you make changes more quickly and confidently.

Monitoring

Good monitors reduce latency, because they allow you to release more frequently, confident in the knowledge that, if something goes wrong, you’ll find out immediately.

Post Mortems

A good post mortem is a great opportunity to let reality point you towards improvements that not only make your systems safer, but reduce your latency as well. Do them!

Decentralization with Safety Nets / Impact Reduction Schemes

Organizations often insist that high-impact changes to products or systems pass through multiple steps of centralized review for correctness, which can become a source of dramatic latency—sometimes on the order of weeks or months. Usually these controls exist for a reason, because the mistakes they attempt to prevent are expensive.

You can attack such a situation in two ways: either by making it harder to break things in the first place (often more difficult and expensive), or by changing the game so that breaking things isn’t as big a deal (often cheaper and easier). For example, if engineers can deploy potentially high-impact changes at will to a small percentage of traffic, or to a known beta-tolerant population, or to internal users, then the downside of breaking changes is capped, and is often eminently worth the decreased latency you enjoy.
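One common way to cap the downside is a percentage rollout gate. What follows is only a sketch of the idea, not a description of any particular feature-flag product: the function names, the feature name, and the user ids are invented. It hashes the user id so that each user consistently lands in or out of the rollout, and always lets internal users through.

```python
import hashlib


def in_rollout(user_id: str, feature: str, percent: float) -> bool:
    """Deterministically decide whether a user falls inside a partial rollout.

    percent is a number from 0 to 100. The same user and feature always
    produce the same answer, so nobody flips between old and new behavior.
    """
    digest = hashlib.sha256(f"{feature}:{user_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000  # 0..9999
    return bucket < percent * 100


def should_see_change(user_id, feature, percent, internal_users=frozenset()):
    """Internal users always get the change; everyone else follows the rollout."""
    return user_id in internal_users or in_rollout(user_id, feature, percent)


# Hypothetical usage: expose a risky change to staff plus 5% of traffic.
staff = frozenset({"cindy", "scott"})
exposed = sum(should_see_change(f"user-{i}", "new-frobulator", 5, staff) for i in range(10_000))
print(f"{exposed} of 10000 simulated users would see the change")
```

If the change turns out to be broken, only the exposed slice is affected, and backing out is a matter of setting the percentage to zero.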

And Many, Many More

We’ve only scratched the surface here: tools for operators, intelligent development tools, even crazy things like DSLs for demo or test data creation can all reduce your latency. Once you start looking specifically for projects that reduce latency, you will see opportunities everywhere.

How Not to Invest in Latency Reduction: REWRITE ALL THE THINGS

The “rewrite reflex” exhibited by Scott is, unfortunately, a real and dangerous tendency that almost all engineers have to some extent (I myself struggle with it daily): the fanatical belief that, if a system were rewritten in framework X or language Y, development would proceed much more quickly. Generally this doesn’t pan out, both because of the astounding (and routinely underestimated) cost of the rewrite, and because the real-world causes of latency are usually addressed more directly by operational and organizational changes than by languages and frameworks. The latency caused by having to write three ugly lines in one language rather than one pretty line in another tends to pale in comparison with the latency caused by slow deploys, by finding and fixing bugs that tests could have caught, etc. (note: I’m not arguing that there is no difference in language productivity, and no point to choosing a language for a new venture carefully, just that for a working system the gain is usually dwarfed by the rewrite cost and other, lower hanging fruit).

Incrementalism FTW

Maybe it’s a “one ring to rule them all” deployment system, or a templating system to speed up writing your views, or a monitoring framework to end all monitoring frameworks—whatever it is, if you think it will reduce latency, and it’s a big project, you should probably try breaking it into smaller increments, each of which reduces some latency, and release those independently, as each is ready.

Most engineers will hate to hear this. They’ve already “seen” the full system in their head, and now want to bang it out in a couple caffeine-fueled weeks. Typically if you object and request smaller increments, they will point out that, broken up into discrete releases, the job will require more hours overall, and therefore represent an inefficiency. They’re generally right, of course, that you will spend more engineer hours by delivering in increments—they’re just wrong about the economic consequences.

You should insist on smaller, incremental latency improvements, not just because of all the normal, eminently true reasons that big increments are bad (everything that makes waterfall a bad idea applies here too), but because latency reduction improves the same channels by which you deliver future latency reduction. That is, since latency reduction efforts generally come in the form of new software or processes, and what they’re reducing is the latency of delivering new software or processes, finished latency reduction efforts tend to speed up future latency reduction efforts.

Latency reduction is therefore a form of compound interest, which Einstein is often (if probably apocryphally) said to have called “the most powerful force in the universe.” Latency reduction works just like your retirement account: because you earn interest on your interest, money deposited early and steadily generates more value than the same money deposited later in one lump, so you want the money in the account as soon as it becomes available. When you break a big, massively valuable latency reducing project into numerous smaller (but still latency reducing) projects, some of which can be delivered earlier, the one-time multiple you pay on extra engineering hours is nearly always a rounding error compared to the benefit of compounded latency reduction you enjoy forever.
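Here is a toy model of that trade-off, with every number invented for illustration. It assumes a latency reducing project that takes 8 weeks of effort as one big release, or 10 weeks of effort (25% more) as four increments, that each increment makes the org 10% faster, and that a faster org also finishes the remaining increments sooner. “Output” is simply org speed added up over a 26 week horizon.

```python
def output_over_horizon(releases, horizon_weeks):
    """Add up org speed over the horizon as releases ship one after another.

    releases: list of (work_weeks, speed_multiplier) pairs, where work_weeks
    is the effort needed at the starting speed of 1.0.
    """
    speed, clock, output = 1.0, 0.0, 0.0
    for work_weeks, multiplier in releases:
        duration = work_weeks / speed  # a faster org ships the next release sooner
        if clock + duration > horizon_weeks:
            break  # this release would not ship within the horizon
        output += duration * speed
        clock += duration
        speed *= multiplier
    return output + (horizon_weeks - clock) * speed


HORIZON = 26  # weeks; an assumption

big_bang = output_over_horizon([(8, 1.1 ** 4)], HORIZON)
incremental = output_over_horizon([(2.5, 1.1)] * 4, HORIZON)
print(f"big bang:    {big_bang:.2f}")
print(f"incremental: {incremental:.2f}")
```

Under these made-up numbers the incremental plan costs more effort, ends at exactly the same speed, and still comes out ahead, because the early increments were already paying off while the later ones were being built. Different numbers could tip the balance, so treat this as a way to reason about the trade-off rather than as proof.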

So Much for the Easy Part

All right, we’ve skirted the hard part long enough. At this point we understand some of the costs of latency. We’ve sounded out whether projects like those that Cindy and Scott want to undertake will actually reduce latency, talked about some other projects that are good candidates for reducing latency, and understand how to generate the maximum overall value by attacking them in valuable increments. But there’s still the small matter of that endless stream of features—how do we compare the relative value of a feature and a project to reduce latency for the delivery of future features, and prioritize appropriately? How do we know how much time to spend on latency reduction vs. features? And—more difficult still—how do we convince the CEO and other Important People in the business, who are the ones asking for those features and signing our checks, that they should allow us to carve out that time to work on latency reduction?

Tune in as we tackle that in the upcoming Part II: Selling the Big Boss.