Wall Street and the Mismanagement of Software

How Knight Capital became a knight errant when it came to software design and delivery.



Last week, an error in some automated high-frequency trading software from Knight Capital Group caused the program to go seriously amok, and when the cyberdust cleared, the company was left barely alive, holding the bill for almost a half-billion dollars to cover the erroneous trades. Much of the ensuing uproar has cited the incident as rationale for additional regulation and/or putting humans more directly in the decision loop. However, that argument is implicitly based on the assumption that software, or at least automated trading software, is intrinsically unreliable and cannot be trusted. Such an assumption is faulty. Reliable software is indeed possible, and people's lives and well-being depend on it every day. But it requires an appropriate combination of technology, process, and culture.

In this specific case, the Knight software was an update intended to accommodate a new NYSE system, the Retail Liquidity Program, which went live on August 1. Other trading companies' systems were able to cope with the new NYSE program; Knight was not so fortunate, and, in what was perhaps the most astounding part of the whole episode, the company took 30 minutes to shut down the program. By then, the expensive damage had been done.

It's clear that Knight's software was deployed without adequate verification. With a deadline that could not be extended, Knight had to choose between two alternatives: delaying their new system until they had a high degree of confidence in its reliability (possibly resulting in a loss of business to competitors in the interim), or deploying an incompletely verified system and hoping that any bugs would be minor. They did not choose wisely.

With a disaster of this magnitude (Knight's stock has nosedived since the incident), there is of course a lot of postmortem analysis: what went wrong, and how can it be prevented in the future?

The first question can only be answered in detail by the Knight software developers themselves, but several general observations may be made. First, the company's verification processes were clearly insufficient. This is sometimes phrased as "not enough testing," but there is more to verification than testing; for example, source code can be analyzed by humans or by automated tools to detect potential errors and vulnerabilities. Second, the process known as hazard analysis or safety analysis in other domains was not followed. Such an analysis involves planning for "what if..." scenarios: if the software fails, whether from bad code or bad data, what is the worst that can happen? Answering such questions could have resulted in code to perform limit checks or carry out "fail soft" procedures. This would at least have shut down the program with minimal damage, rather than letting it rumble on like a software version of the sorcerer's apprentice.
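To make the idea of limit checks and "fail soft" behavior concrete, here is a minimal sketch in Python. It is purely illustrative and says nothing about how Knight's system was actually built: the RiskGuard class, the TradingHalted exception, and the dollar limits are all hypothetical names and values invented for this example. The guard approves each order against a per-order limit and a cumulative exposure limit, and once either limit is breached it refuses every further order until a human intervenes.

```python
from dataclasses import dataclass


class TradingHalted(Exception):
    """Raised when the guard has tripped and no further orders may be sent."""


@dataclass
class RiskGuard:
    # Hypothetical limits; real values would come out of a hazard analysis.
    max_order_notional: float   # largest single order allowed, in dollars
    max_total_notional: float   # cumulative exposure allowed, in dollars
    total_notional: float = 0.0
    halted: bool = False

    def check(self, quantity: int, price: float) -> None:
        """Approve one order, or halt trading if a limit would be breached."""
        if self.halted:
            raise TradingHalted("trading already halted")
        notional = abs(quantity) * price
        if notional > self.max_order_notional:
            self._halt(f"order notional {notional:.2f} exceeds per-order limit")
        if self.total_notional + notional > self.max_total_notional:
            self._halt("cumulative exposure limit would be exceeded")
        self.total_notional += notional

    def _halt(self, reason: str) -> None:
        # Fail soft: stop sending orders and leave the next step to a human.
        self.halted = True
        raise TradingHalted(reason)


# Made-up order stream of (quantity, price) pairs for illustration.
guard = RiskGuard(max_order_notional=50_000.0, max_total_notional=120_000.0)
sent = []
try:
    for qty, px in [(1000, 25.0), (2000, 20.0), (3000, 15.0), (4000, 10.0)]:
        guard.check(qty, px)
        sent.append((qty, px))
except TradingHalted as exc:
    print(f"halted after {len(sent)} orders: {exc}")
```

With these made-up numbers the first three orders pass and the fourth trips the cumulative limit, so the program stops itself rather than running on unattended. The design choice worth noting is that the guard latches: once halted, it stays halted, which is the behavior a "what if..." analysis would typically ask for.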

The question of how to prevent such incidents in the future is more interesting. Some commentators have claimed that the underlying application in high-frequency trading (calculating trades within microseconds to take advantage of fraction-of-a-cent price differentials) is simply a bad idea that frightens investors and should be banned or heavily regulated. There are arguments on both sides of that issue, and we will leave that discussion to others. However, if such trading is permitted, then how are its risks to be mitigated?

To put things in perspective, in spite of the attention that the incident has attracted, the overall system (the trading infrastructure) basically worked. Certainly Knight itself was affected, but the problem was localized: we didn't have another "flash crash." We don't know yet whether this is because we got lucky or because the "circuit breakers" in the NYSE system were sufficient, but it's clear that such an error has the potential to cause much larger problems.

What is needed is a change in the way that such critical software is developed and deployed. Safety-critical domains such as commercial avionics, where software failure could directly cause or contribute to the loss of human life, have known about this for decades. These industries have produced standards for software certification that heavily emphasize appropriate "life cycle" processes for software development, verification, and quality assurance. A "safety culture" has infused the entire industry, with hazard/safety analysis a key part of the overall process. Until the software has been certified as compliant with the standard, the plane does not fly. The result is an impressive record in practice: no human fatality on a commercial aircraft has been attributed to a software error.

High-frequency automated trading is not avionics flight control, but the aviation industry has demonstrated that safe, reliable real-time software is possible, practical, and necessary. It requires appropriate development technology and processes as well as a culture that thinks in terms of safety (or reliability) first. That is the real lesson to be learned from last week's incident. It doesn't come for free, but it certainly costs less than $440M.

Robert Dewar is the president and CEO of AdaCore. He is the principal author of GNAT, the free software Ada compiler, and earlier of the Realia COBOL compiler.