On August 1, 2012, Knight Capital Group had a very bad day, losing $440 million in forty-five minutes. More than two weeks later, there has been no official detailed explanation of what happened. Knight CEO Thomas Joyce has stated “Sadly it was a very simple breakdown — a very large breakdown — but a very simple breakdown…”, but exactly what that “simple breakdown” was remains unknown.
In the absence of facts, anonymous statements and speculation about the cause of the disaster have been rife. In a Dr. Dobb’s article, “Wall Street and the Mismanagement of Software”, Robert Dewar, president and CEO of AdaCore, blamed testing:
It’s clear that Knight’s software was deployed without adequate verification. With a deadline that could not be extended, Knight had to choose between two alternatives: delaying their new system until they had a high degree of confidence in its reliability (possibly resulting in a loss of business to competitors in the interim), or deploying an incompletely verified system and hoping that any bugs would be minor. They did not choose wisely.
Other articles have focused on deployment issues. According to an August 14 article on Businessweek, the problem stemmed from an “old set of computer software that was inadvertently reactivated when a new program was installed”. On August 3, Nanex, LLC published “The Knightmare Explained” with the tagline “The following theory fits all available facts”:
We believe Knight accidentally released the test software they used to verify that their new market making software functioned properly, into NYSE’s live system.
In the safety of Knight’s test laboratory, this test software (we’ll call it, the Tester) sends patterns of buy and sell orders to its new Retail Liquidity Provider (RLP) Market Making software, and the resulting mock executions are recorded. This is how they could ensure their new market making software worked properly before deploying to the NYSE live system.
When the time comes to deploy the new market making software, which is likely handled by a different group, the Tester is accidentally included in the release package and started on NYSE’s live system. On the morning of August 1st, the Tester is ready to do its job: test market making software. Except this time it’s no longer in the lab, it’s running on NYSE’s live system. And it’s about to test any market making software running, not just Knights. With real orders and real dollars. And it won’t tell anyone about it, because that’s not its function.
Last December, I posted “Do you have releases or escapes?”, discussing the importance of release management. In that post, I stated that excellent code poorly delivered is effectively poor code. A professional release management practice is essential to creating and maintaining quality systems.
Obviously there will be configuration differences between environments and these represent a risk that must be managed. However, failing to standardize the deployment of code is needlessly introducing a risk. An effective release management process should promote repeatable (preferably automated) deployments across all environments. Deployments should be seen as an opportunity to test this process, with the goal of ensuring that the release to production is thoroughly uneventful.
If Nanex’s assessment is correct, either Knight Capital failed to have one standard release process or their process allowed the test harness access to the real world. Either case would make the events of August 1 possible. Avoidable errors are bad enough; one that costs $10 million per minute is epic.