Plug and Play or Punt and Pray?

On August 1, 2012, Knight Capital Group had a very bad day, losing $440 million in forty-five minutes. More than two weeks later, there has been no official detailed explanation of what happened. Knight CEO Thomas Joyce has stated “Sadly it was a very simple breakdown — a very large breakdown — but a very simple breakdown…”, but exactly what that “simple breakdown” was remains unknown.

In the absence of facts, anonymous statements and speculation about the cause of the disaster have been rife. In a Dr. Dobb’s article, “Wall Street and the Mismanagement of Software”, Robert Dewar, president and CEO of AdaCore, blamed testing:

It’s clear that Knight’s software was deployed without adequate verification. With a deadline that could not be extended, Knight had to choose between two alternatives: delaying their new system until they had a high degree of confidence in its reliability (possibly resulting in a loss of business to competitors in the interim), or deploying an incompletely verified system and hoping that any bugs would be minor. They did not choose wisely.

Other articles have focused on deployment issues. According to an August 14 article in Businessweek, the problem stemmed from an “old set of computer software that was inadvertently reactivated when a new program was installed”. On August 3, Nanex, LLC published “The Knightmare Explained” with the tagline “The following theory fits all available facts”:

We believe Knight accidentally released the test software they used to verify that their new market making software functioned properly, into NYSE’s live system.

In the safety of Knight’s test laboratory, this test software (we’ll call it, the Tester) sends patterns of buy and sell orders to its new Retail Liquidity Provider (RLP) Market Making software, and the resulting mock executions are recorded. This is how they could ensure their new market making software worked properly before deploying to the NYSE live system.

When the time comes to deploy the new market making software, which is likely handled by a different group, the Tester is accidentally included in the release package and started on NYSE’s live system. On the morning of August 1st, the Tester is ready to do its job: test market making software. Except this time it’s no longer in the lab, it’s running on NYSE’s live system. And it’s about to test any market making software running, not just Knights. With real orders and real dollars. And it won’t tell anyone about it, because that’s not its function.

Last December, I posted “Do you have releases or escapes?”, discussing the importance of release management. In that post, I stated that excellent code poorly delivered is effectively poor code. A professional release management practice is essential to creating and maintaining quality systems.

Obviously there will be configuration differences between environments and these represent a risk that must be managed. However, failing to standardize the deployment of code is needlessly introducing a risk. An effective release management process should promote repeatable (preferably automated) deployments across all environments. Deployments should be seen as an opportunity to test this process, with the goal of ensuring that the release to production is thoroughly uneventful.

If Nanex’s assessment is correct, either Knight Capital lacked a single standard release process, or its process allowed the test harness access to the real world. Either case would make the events of August 1 possible. Avoidable errors are bad enough; one that costs roughly $10 million per minute ($440 million over forty-five minutes) is epic.
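As a rough illustration of what such a process might guard against, the sketch below runs the same deployment routine for every environment and refuses to ship test-only components to production. It is not a description of Knight's systems, which remain undisclosed; the names (`Artifact`, `validate_package`, `deploy`, the `test_only` flag) are hypothetical and exist only for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Artifact:
    """One component in a release package (hypothetical model)."""
    name: str
    test_only: bool = False  # assumption: test harnesses are flagged at build time


class ReleaseError(Exception):
    """Raised when a package is not safe for the target environment."""


def validate_package(artifacts, environment):
    """Return the artifacts unchanged, or raise if test-only code targets production."""
    if environment == "production":
        offenders = [a.name for a in artifacts if a.test_only]
        if offenders:
            raise ReleaseError(
                "test-only artifacts in production package: " + ", ".join(offenders)
            )
    return list(artifacts)


def deploy(artifacts, environment):
    """Same steps in every environment; only the environment name differs."""
    approved = validate_package(artifacts, environment)
    # A real process would copy files and start services here; this sketch
    # only records what would have been deployed and where.
    return [f"{environment}:{a.name}" for a in approved]
```

Because every environment goes through the same `deploy` path, each release to a test or staging environment also exercises the check that protects production.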

Closing a door

While following the Tweets that sparked my last post, I read the following regarding software architecture as a set of design decisions:

Yes, these decisions shape–give form to–the system. But they also shape–set direction; constrain; bring integrity, consistency and unifying aesthetic to–the system. And, yes, the structure of the system encompasses decisions, so there is at least an overlap between these definitions, but the emphasis is shaded differently

Ruth Malan, A Trace in the Sand, Software Architecture Journal (1/13/2011 entry)

The second sentence, particularly the word “constrain”, stood out to me. Architecturally significant decisions, while setting the path for an application’s evolution, can also eliminate alternate paths from consideration. Our decisions frequently close doors, some of which may be extremely difficult to re-open.

Later in her blog entry, Ruth notes that architecture expresses intent: “what” (functional requirements) and “how well” (quality of service requirements). In her words, “…the form should deliver and serve the system function(s) or capabilities, and it should be sustainable (so evolvable and, as pertinent, scalable, and so forth)”. She cautions, however, that “…there will also be choices that are so subtle we didn’t even know we were making them”. This is particularly important, because “…deciding that something is architecturally significant means it is one more thing the architect (or architecture team) is going to need advocate and explain and defend and influence and coach and tend…nurturing and evolving the design in the context of organization needs, forces, personalities and more”. This applies to unconscious, unintended decisions as much as those that are made deliberately.

There is a common thread that runs through this post, “It worked in my head”, and “The Most Important Question”: care. Exercising due diligence to provide a sustainable architecture is the hallmark of an effective architect. Painting yourself into a corner when you had alternatives is the unforgivable sin.

In “What’s driving your architecture”, I cataloged a dozen different factors that drive architectural choices. The complexity inherent in considering and balancing these factors is a prime reason that I have no confidence in the concept of emergent architecture. It is difficult enough to evolve and maintain a cohesive, sustainable architecture of any real size when working with an architect/architecture team with a well-articulated vision. To suggest that it will “just happen” as a result of individuals doing the “simplest thing that could possibly work” seems a bit Panglossian. As Charlie Alfred noted in a comment on “The Most Important Question”: “Decomposing a system into its parts can only tell you how they work together, but never why they work that way”. Without some unitary guidance, it’s unlikely that the “why” for the system as a whole was adequately considered.

Another potential source of inadequate decision-making is deferring decisions for too long. Rebecca Wirfs-Brock, in “Agile Architecture Myths #2 Architecture Decisions Should Be Made At the Last Responsible Moment”, made the case for just enough up-front design:

So what is it about forcing decision-making to be just-in-time at the last responsible moment that bugs me, the notorious non-planner? Well, one thing I’ve observed on complex projects is that it takes time to disseminate decisions. And decisions that initially appear to be localized (and not to impact others who are working in other areas) can and frequently do have ripple affects outside their initially perceived sphere of influence. And, sometimes, in the thick of development, it can be hard to consciously make any decisions whatsoever. How I’ve coded up something for one story may inadvertently dictate the preferred style for implementing future stories, even though it turns out to be wrongheaded. The last responsible moment mindset can at times lull me (erroneously) into thinking that I’ll always have time to change my mind if I need to.

Making decisions early that are going to have huge implications isn’t bad or always wasteful. Just be sure they are vetted and revisited if need be. Deferring decisions until you know more is OK, too. Just don’t dawdle or keep changing your mind. And don’t just make decisions only to eliminate alternatives, but make them to keep others from being delayed or bogged down waiting for you to get your act together. Remember you are collaborating with others. Delaying decisions may put others in a bind.

In a follow-up post, she buttresses this with scientific evidence from a study that found “…those who were exposed to stress of any kind tended to offer solutions before they considered all available alternatives”. This does not mean that the decisions made were automatically “bad”. However, neither does it follow that they were optimal for the given context. Lots of “good enough” decisions can add up to failure if they do not cohere over the long term and are not revisited.

Care must also be taken not to rely too much on the ability to revisit and refactor. Rework to address changing circumstances and unforeseen issues is a necessary evil from the customer’s viewpoint. Rework that could have been avoided, however, is rightly seen as waste. A little extra work to ensure future flexibility is easier to justify than major revisions to accommodate something you should have known was on the horizon.