Here’s a common pattern that I see play out very often in software teams:

There is a need to change the existing system behavior to accomplish new functionality.

The software engineer looking at the task realizes that the existing design isn’t well suited to the change needed. They suggest design changes and refactoring, as part of implementing the desired functionality.

Their peers review the work, and they worry that the changes being proposed might be dangerous. Because no one in the team has a great understanding of the system, they worry that any refactoring will introduce unintended bugs. They have been bitten by this in the past, and they are determined not to make the same mistake again.

They identify a way to avoid making significant changes. “Right here, in method Foo in class Bar, just add another if-check, a service call and/or a dependency to class Baz.” This way, the changes will be only a couple lines of code. Far less risky.

Class Bar and method Foo were never meant to perform service calls, or to filter out results based on a dependency on Baz. The change violates the original design of the class. The single responsibility of the class has now become multiple responsibilities. “Bar always returns ABC” has now become “Bar always returns ABC unless DEF in which case it returns XYZ”. Separation of concerns gives way to an intersection of concerns.
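To make the shift concrete, here is a minimal sketch of what such a patch might look like. The names Bar, Foo, Baz, ABC, DEF and XYZ come from the scenario above; everything else (the `is_def` method, the `item_id` parameter, the flagged-ID set) is a hypothetical stand-in invented purely for illustration.

```python
class Baz:
    """Hypothetical dependency that Bar was never designed to know about."""

    def __init__(self, flagged_ids):
        self._flagged = set(flagged_ids)

    def is_def(self, item_id):
        # Stand-in for the "DEF" condition; could just as well be a service call.
        return item_id in self._flagged


class BarOriginal:
    """Original design: one responsibility, no collaborators."""

    def foo(self, item_id):
        return "ABC"


class BarPatched:
    """After the 'low-risk' patch: Bar depends on Baz and branches on it."""

    def __init__(self, baz):
        self._baz = baz  # new dependency introduced by the patch

    def foo(self, item_id):
        if self._baz.is_def(item_id):  # the extra if-check
            return "XYZ"
        return "ABC"


baz = Baz(flagged_ids=[2])
original_results = [BarOriginal().foo(i) for i in (1, 2, 3)]
patched_results = [BarPatched(baz).foo(i) for i in (1, 2, 3)]
print(original_results)
print(patched_results)
```

The diff really is only a few lines. But every caller of Foo now has to know about Baz to predict what it will get back, and every test of Bar has to construct or fake a Baz.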

But still, it’s only a few additional lines of code. How bad can it really be? Besides, the risk is so much lower, because we’re touching such a small surface area. It’s hard to argue against that on the basis of abstract design principles. We’re pragmatists here, not architecture astronauts.

And so the team resolves to follow the path of least resistance. They dutifully update the unit tests. They also manually test the relevant changes, because they have learnt from experience that unit tests are only half the story. The code is successfully deployed without problems, and everyone breathes a sigh of relief and congratulates themselves on a job well done.

The above process repeats itself every time a change needs to be made. With each iteration, the design becomes a little more muddied, the system becomes a little harder to understand, and behaves in slightly more unpredictable ways. All of which makes the team even more risk averse and determined to avoid unnecessary changes. Which in turn accelerates the rot even further.

And any time someone leaves the team and is replaced by a newcomer, this acceleration kicks into overdrive.

Pretty soon, you have a full-fledged legacy system that everyone loves to complain about and no one wants to meddle with. The code base is no longer a living system, but a museum artifact. It is meant to be gazed on with wonder and mystery, but certainly not touched.

The Origins of Legacy

In “Working Effectively with Legacy Code”, Michael Feathers describes legacy software as any software system that lacks tests. I think this observation is very insightful, but I would go one step further. Legacy software is any software where people are afraid to make changes.

The two most common reasons for change aversion:

Complexity

Test coverage holes

Complex code makes testing even more necessary. And holes in test coverage make it clear that the automated test suite can’t be relied on.

When asked about testing, most software teams insist that they are very disciplined about writing tests to cover all functionality. To truly figure out how good their test suite is, there is one simple question you can ask: how much time do you spend on manual testing?

In my experience, there are a lot of teams that claim to be very disciplined about writing tests, but also spend significant amounts of time on manual testing. Clearly the reason is that their automated tests have significant holes. Often this is because their tests focus only on unit testing, and not on the emergent functionality that you can cover in end-to-end testing.
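As an illustration of how unit tests alone can leave such a hole, consider the following sketch. The functions `parse_price`, `apply_discount` and `checkout` are hypothetical and not taken from any real system. Each piece passes its own unit test, yet the composed behavior is wrong, because one works in cents and the other in dollars.

```python
def parse_price(text):
    """Parse a price string such as '12.50' into integer cents."""
    dollars, _, cents = text.partition(".")
    return int(dollars) * 100 + int(cents or 0)


def apply_discount(price_dollars, percent):
    """Apply a percentage discount to a price expressed in dollars."""
    return round(price_dollars * (1 - percent / 100), 2)


def checkout(text, percent):
    """Glue code: nobody unit-tested the units flowing between the two."""
    return apply_discount(parse_price(text), percent)


# Unit-level checks: each function honors its own contract.
unit_tests_pass = (
    parse_price("12.50") == 1250
    and apply_discount(10.0, 10) == 9.0
)

# End-to-end check: a $12.50 item at 10% off should cost $11.25.
end_to_end_result = checkout("12.50", 10)
end_to_end_passes = end_to_end_result == 11.25

print(unit_tests_pass, end_to_end_result, end_to_end_passes)
```

Here the unit-level checks succeed while the end-to-end check fails, which is exactly the kind of gap that a team ends up papering over with manual testing.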

This lack of test coverage causes a destructive feedback loop.

Lack of confidence in making changes

-> All changes are extensively tested manually

-> People avoid major changes because manual testing is time and labor intensive

-> All changes follow the path of least resistance

-> Code becomes increasingly complex, poorly designed, and full of gotchas

-> Lack of confidence in making changes

Once started, the doom loop only gets worse and worse with time, until you finally wind up with a legacy system that is business critical but a horrendous mess to work with. It takes more and more time to make any changes at all. And eventually, the day arrives when the system collapses under its own weight.

Breaking Out Of The Doom Loop

The only way to break this loop is by attacking it at its very source – the lack of confidence in making changes. I’ve seen a team successfully attack it by just hiring manual testers whose job is to thoroughly (manually) test all changes before deploying. This does have limitations, such as slow iteration times and payroll costs. But at least it successfully breaks the doom loop.

The more common and ideal way to break the doom loop is by investing heavily in automated testing, incremental deployments, and automated monitoring and alerting of production errors.

It takes significant time and effort to establish automated testing as a first-class citizen in the development process, especially to develop the tooling necessary for high-quality integration and end-to-end tests. But the benefits are tremendous. Your developers no longer need to spend gobs of time on manual testing.

Incremental deployments and better production alarms may not prevent bugs, but they can help immensely in mitigating them. At companies like Amazon and Google, new software builds are often deployed to just one machine at first. A machine that resides in your production fleet and serves production traffic, just like the rest of your fleet. Then, after a period of time, the build is deployed incrementally to more and more machines, until it is eventually deployed onto your entire fleet.

This synergizes particularly well with automated alarms. By practicing a fail-fast methodology and configuring your code to automatically trigger alarms when such failures occur, you can nip bugs in the bud, rather than discovering them weeks or months later when customer complaints trickle in. When coupled with incremental deployments, you can release your new builds to just a few machines in your fleet, check to see if any alarms are going off, and roll back the deployment if they are. This way, only a tiny fraction of your users will be impacted. This is certainly not ideal, but it is orders of magnitude better than the alternative.
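The deploy, check alarms, roll back cycle can be sketched roughly as follows. This is a toy simulation under stated assumptions, not how any particular company's deployment tooling works: the `deploy`, `rollback` and `alarms_firing` callbacks, the stage fractions, and the 100-host fleet are all hypothetical, and a real rollout would wait for a bake period between stages before checking alarms.

```python
def staged_rollout(fleet, stages, deploy, alarms_firing, rollback):
    """Deploy to growing fractions of the fleet, rolling back on alarms.

    Returns True if the build reached the whole fleet, False if rolled back.
    """
    deployed = []
    for fraction in stages:
        target = max(1, round(len(fleet) * fraction))
        for host in fleet[len(deployed):target]:
            deploy(host)
            deployed.append(host)
        # A real system would let the build bake here before checking.
        if alarms_firing(deployed):
            for host in reversed(deployed):
                rollback(host)
            return False
    return True


def simulate(build_is_bad, fleet_size=100, stages=(0.01, 0.10, 0.50, 1.0)):
    """Run one rollout; return (completed, peak hosts exposed, hosts left on new build)."""
    fleet = [f"host-{i}" for i in range(fleet_size)]
    running = set()
    peak = 0

    def deploy(host):
        nonlocal peak
        running.add(host)
        peak = max(peak, len(running))

    def rollback(host):
        running.discard(host)

    def alarms_firing(hosts):
        # Assumption: a bad build fails fast and trips an alarm right away.
        return build_is_bad and len(hosts) > 0

    completed = staged_rollout(fleet, stages, deploy, alarms_firing, rollback)
    return completed, peak, len(running)


bad = simulate(build_is_bad=True)
good = simulate(build_is_bad=False)
print(bad, good)
```

In this simulation the bad build only ever reaches one host out of a hundred before it is rolled back, while the good build proceeds through every stage to the full fleet.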

Finding The Will

If your team is already well inside the doom loop, it can be very hard to break out of it. The solutions discussed above don’t deliver any business value in and of themselves. Hence the tendency to throw them on the back burner. It’s hard to justify spending significant time on testing, monitoring and deployment enhancements, for a functioning legacy system, when the marketing team is insistent on releasing the next killer feature that will delight users and win over the competition.

Breaking out of the doom loop can certainly be done with concerted effort and investment. But that is something most maintenance teams are seldom given. Hence why they invariably spiral further and further… until things get so bad that everyone decides to junk the whole thing and rewrite it from scratch. At which point, one legacy system is retired and the next legacy system is born.
