There’s a kind of cognitive dissonance in most people who’ve moved from the academic study of computer science to a job as a real-world software developer. The conflict lies in the fact that, whereas nearly every sample program in every textbook is a perfect and well-thought-out specimen, virtually no software out in the wild is, and this is rarely acknowledged.

To be precise: a tremendous amount of source code written for real applications is not merely less perfect than the simple examples seen in school — it’s outright terrible by any number of measures.

Due to bad design, sloppy or opaque coding practices, non-scalability, and layers of ugly “temporary” patches, it’s often difficult to maintain, harder still to modify or upgrade, painful or impossible for a new person joining the dev team to understand, or (a different kind of problem) slow and inefficient. In short, a mess.

Of course there are many exceptions, but they’re just that: exceptions. In my experience, software is, almost as a rule, bad in one way or another. And lest I be accused of over-generalising: in more than 20 years I’ve done work for maybe a dozen companies, almost all of them in the banking industry and many of them household names.

The technology people employed at these companies are considered to be the very best, if only because the pay tends to be so good. I’ll play it safe and stick to my actual experience in the financial sector even though I'm convinced this state of affairs is not limited to that one industry.

Getting back to the cognitive-dissonance problem: in casual discussion, developers and tech managers will talk about all the wonderful things their system does, the stellar technical skills of their team, and how much their users love them — and all that may be true.

But talk privately, colleague-to-colleague, to one of these developers about the quality of the code base, all the daily headaches, the quick hacks and patches, the laughable mistakes made by the original author of the system (who left the firm a couple of years ago), or the fear that the person who “knows the system” will leave for another job, and you’ll hear a different story: “Of course there are problems. Everyone knows that. Things are always this way — it’s barely even necessary to mention it.”

Very few coworkers with whom I’ve broached this subject have seen things differently, and I’ve often heard stories of costly screw-ups that would shock the most jaded techie. But on a daily basis, in all but the worst cases, it’s easier for developers to talk about things like what their system does, or how elegant its user interface is, than to dwell on any horrors lurking inside.

It also may be that, after years of working on a system with serious maintainability flaws, people simply become accustomed to the strange procedures they have to go through regularly to keep things running.

Complex systems + borked code = beelllions down the drain

In the financial business there have been several software-related blowups in the last few years that were big enough to make it onto the evening news. To name just three, there were: the Nasdaq failure that wreaked havoc with Facebook’s IPO; a trading fiasco at Knight Capital in August that led to widespread market disruption and a $400m drop in Knight’s market value; and the “flash crash” of May, 2010, which caused market losses of at least $1 trillion in a matter of minutes.

System glitches and bugs this visible and this costly are relatively rare, but for every one of them there are a hundred smaller ones that only a handful of people ever hear about. A Reuters article this summer with the title “Morgan Stanley Smith Barney Rainmakers Consider Exit” said this: “Several dozen Morgan Stanley Smith Barney advisers who manage tens of billions of dollars of client money are considering leaving the firm, saying that widespread technology problems have made it very difficult for them to do their jobs.” (Italics mine.)

These are all outright failures in highly complex systems, but poorly written code can crop up in applications of any size, and it may not lead to a direct, quantifiable loss. It will, however, require untold extra hours of work for routine support, make even minor upgrades painful, or force systems to be retired prematurely (in some cases, before they’ve even gone live). How does this happen?