Go-to People Considered Harmful

Neck of Codd bottle

Okay, so the title’s a little derivative, but it’s both accurate and it fits in with the “organizations as systems” theme of recent posts. Just as dependency management is important for software systems, it’s likewise just as critical for social systems. Failures anywhere along the chain of execution can potentially bring the whole system to a halt if resilience isn’t considered in the design (and evolution) of the system.

Dependency issues in social systems can take a variety of forms. One that comes easily to mind is what is referred to as the “bus factor” – how badly the team is affected if a person is lost (e.g. hit by a bus). Roy Osherove’s post from today, “A Critical Chain of Bus Factors”, expands on this. Interlocking chains of dependencies can multiply the bus factor:

A chain of bus factors happens when you have bus factors depending on bus factors:

Your one developer who knows how to configure the pipeline can’t test the changes because the agent is down. The one guy in IT who has access to the agent needs to reboot it, but does not have access. The one person who has access to reboot it (in the Infra team) is sick, so now there are three people waiting, and there is nothing in this earth that can help that situation.

The “bus factor”, either individually or as a cascading chain, is only part of the problem, however. A column on CIO.com, “The hazards of go-to people”, identifies the potential negative impacts on the go-to person:

They may:

  • Resent that they shoulder so much of the burden for the entire group.
  • Feel underpaid.
  • Burn out from the stress of being on the never-ending-crisis treadmill.
  • Feel trapped and unable to progress in their careers since they are so important in the role that they are in.
  • Become arrogant and condescending to their peers, drunk with the glory of being important.

The same column also lists potential problems for those who are not the go-to person:

When they realize that they are not one of the go-to people they might:

  • Feel underappreciated and untrusted.
  • Lose the desire to work hard since they don’t feel that their work will be recognized or rewarded.
  • Miss out on the opportunities to work on exciting or important things, since they are not considered dedicated and capable.
  • Feel underappreciated and untrusted.

A particularly nasty effect of relying on go-to people is that it’s self-reinforcing if not recognized and actively worked against. People get used to relying on the specialist (which is, admittedly, very effective right up until the bus arrives) and neglect learning to do for themselves. Osherove suggests several methods to mitigate these problems: pairing, teaching, rotating positions, etc. The key idea being, spreading the knowledge around.

Having individuals with deep knowledge can be a good thing if they’re a reservoir supplying others and not a pipeline constraining the flow. Intentional management of dependencies is just as important in social systems as in software systems.

Abuse Cases – What Could Go Wrong?

Trainwreck

Last week, in a post titled “The Flaw in All Things”, John Vincent discussed the problem of seeing “the flaw in all things”:

It’s overwhelming. It’s paralyzing.

I can’t finish a project because I keep finding things that could cause problems. I even mentioned this to our CTO and CEO at one point when we were trying to size some private deploys of our stack.

I couldn’t see anything but the largest configuration because all I could see was places where there was a risk. There were corners I wasn’t willing to cut (not bad corners like risking availability but more like “use a smaller instance here”) because I could see and feel and taste the pain that would come from having to grow the environment under duress.

I’m frustrated with putting everything in Docker containers because all I see is having to take down EVERYTHING running on one node because there’s may be a critical Docker upgrade. I see Elasticsearch rebalancing because of it. I see Kafka elections. Mind you the system is designed for this to happen but why add something that makes it a regular occurance?

I can certainly sympathize. For what it’s worth, it sounds like those making the trade-offs he’s worried about could stand to be a bit more inclusive. That doesn’t necessarily mean the decisions would change, but at least being heard and knowing the answers to his questions might reduce some of the stress (not to mention perhaps helping out those responsible for the decisions).

When making design decisions, having (or, at least, having access to) this level of knowledge and experience has a great deal of value. As I noted in “NPM, Tay, and the Need for Design”, you have to consider both the use cases and the abuse cases for a given system (whether software or social).

It’s not possible to foresee every potential flaw and it probably won’t be feasible to eliminate every risk that’s discovered. That being said, it doesn’t mean time spent in risk evaluation is wasted. Dealing with foreseeable issues before they become problems (where “dealing” is defined as either mitigating it outright or at least planning for a response should it occur) will work better than figuring it out on the fly when the problem occurs.

Understanding why “Should I?” is a more important question than “Can I?” is something I’ve touched on before. Snapchat is finding this out by way of a lawsuit over their filter that allows users to record their speed. Who would have thought someone might cause an accident using that?

Trial and error/experimenting is one method of learning, but it’s not the only method and is frequently not the best method. Fear of failure can hold back learning, but a cavalier attitude toward risk can make experiments just as, if not more costly. It’s the difference between testing a 9 volt battery by touching it to your tongue and using the same technique on a 240 volt circuit.

Talking about TayandYou on Architecture Corner

I had the pleasure of appearing on episode #37 of Architecture Corner, “Fail fast, learn fast”, with Greger Wikstrand and Casimir Artmann. In the episode, we discuss learning, experiments, and the idea of “fail fast” in relation to the recent incident with Microsoft’s artificial intelligence chatbot, @TayandYou.

I hope you enjoy the discussion as much as I did!

[updated 4/5/2016 to fix the episode number in the link]

NPM, Tay, and the Need for Design

Take a couple of seconds and watch the clip in the tweet below:

While it would be incredibly difficult to predict that exact outcome, it is also incredibly easy to foresee that it’s a possibility. As the saying goes, “forewarned is forearmed”.

Being forewarned and forearmed is an important part of what an architect does. An architect is supposed to focus on the architecturally significant aspects of a system. I like to use Ruth Malan‘s definition of architectural significance due to its flexibility:

Decisions (both those that were made and those that were left unmade) that end up taking systems offline and causing very public embarrassment are, in my opinion, architecturally significant.

Last week, two very public, very foreseeable failures took place: first was the chaos caused by a developer removing his modules from NPM, which was followed by Microsoft having to pull the plug on its Tay chatbot when it was “trained” to spew offensive comments in less than 24 hours. In my opinion, these both represented design failures resulting from a lack of due consideration of the context in which these systems would operate.

After all, can anyone really claim that no one would expect that people on the internet would try to “corrupt” a chatbot? According to Azeem Azhar as quoted in Business Insider, not really:

“Of course, Twitter users were going to tinker with Tay and push it to extremes. That’s what users do — any product manager knows that.

“This is an extension of the Boaty McBoatface saga, and runs all the way back to the Hank the Angry Drunken Dwarf write in during Time magazine’s Internet vote for Most Beautiful Person. There is nearly a two-decade history of these sort of things being pushed to the limit.”

The current claim, as reported in CIO.com, is that Tay was developed with filtering built-in, but there was a “critical oversight” for a specific kind of attack. According to the article, it’s believed that the attack vector involved asking Tay to “repeat after me”.

Or, as Matt Ballantine put it:

Likewise, who could imagine issues with a centralized repository of cascading dependencies? Failing to consider what would happen if someone suddenly pulled one of the bottom blocks out led to a huge inconvenience to anyone depending on that module or any downstream module. There’s plenty of blame to go around: the developer who took his toys and went home, those responsible for NPM’s design, and those who depended on it without understanding its weaknesses.

“The Iron Law of Tools” is “that which does for you will also do to you”. Understanding the trade-offs allows you to plan for risk mitigation in advance. Ignoring them merely ensures that they will have to be dealt with in crisis mode. This is something I covered in a previous post, “Dependency Management is Risk Management”.

Effective design involves not only the internals of a system but its externals as well. The conditions under which the system will be used, it’s context, is highly significant. That means considering not only the system’s use cases, but also its abuse cases. A post written almost a year ago by Brandon Harris, “Designing for Evil”, conveys this well:

When all is said and done, when you’ve set your ideas to paper, you have to sit down and ask yourself a very specific question:

How could this feature be exploited to harm someone?

Now, replace the word “could” with the word “will.”

How will this feature be exploited to harm someone?

You have to ask that question. You have to be unflinching about the answers, too.

Because if you don’t, someone else will.


When I began working on this post, the portion above was what I had in mind to say. In essence, I planned a longer-form version of what I’d tweeted about the Tay fiasco:

However, before I had finished writing the post, Greger Wikstrand posted “The fail fast fallacy”. Greger and I have been carrying on a conversation about innovation over the last few months. While I had initially intended to approach this as a general issue of architectural practice rather than innovation, the points he makes are just too apropos to leave out.

In the post, Greger points out that the focus seems to have shifted from learning to failure. Learning from experience can be the best way to test an idea. However, it’s not the only way:

Evolution and nature has shown us that there are two, equally valid, approaches to winning the gene game. The first approach is to get as much offspring as possible and “hope” many of them survive (r-selection). The second approach is to have few offspring but raise them and nurture them carefully (K-selection). Biologists tell us that the first strategy works best in a harsh, unpredictable environment where the effort of creating offspring is low. The second strategy works better in an environment where there is less change and offspring are more expensive to produce. Some of the factors that favour r-selection seems to be large uncompeted resources. K-selection is more favourable in resource scarce, low predator areas.

The phrase “…where the effort of creating offspring is low” is critical here. The higher the “cost” of the experiment, the more risk is involved in failure. This makes it advisable to tilt the playing field by supporting and nurturing the “offspring”.

In response to Greger’s post, Casimir Artmann posted two excellent articles that further elaborated on this. In “Fail Fast During Adventures”, he noted that “There is a fine line between fail fast and Darwin Awards in IRL.” His point, preparation beforehand and being willing to abort during an experiment before failure is equivalent to suffering a fatality can be effective learning strategies. Lessons that you don’t live to apply aren’t worth much.

Casimir followed with “Fail is not an Option”, in which he stated:

I want the project to succeed, but I plan for things going wrong so that the consequences wouldn’t be to huge. Some risk are manageable, as walking alone, but not alone and off-trail. That’s to risky. If you doing outdoor adventures, you are probably more prepared and skilled than a ordinarie project member, and thats a huge benefit.

I guess the best advice, when doing completely new things with IT, is to start really small so that the majority of your business is not impacted if there is a failure. When something goes wrong, be sure that you could go back to safe place. Point of no return is like being on a sailing boot in the middle of the Atlantic where you can’t go back.

That’s excellent advice. “Fail Fast” has the advantage of being able to fit on a bumper sticker, but the longer, more nuanced version is more likely to serve you well.

No, Uncle Bob, No – the Obligatory healthcare.gov Post

Good for what ails you?

I tried to avoid this one. First of all, I don’t do politics on this site and this topic has way too much political baggage. Second, a great many people have already written about it, so I didn’t think I really had anything to add.

Then, Uncle Bob Martin chimed in.

I agree with some of what he has to say. I have no doubt that this particular debacle has harmed the image of software development in the eyes of the general public. Then he falls over the edge, comparing the launch of healthcare.gov with the Challenger disaster. After all, in both cases, political considerations overrode technical concerns. Regardless of this, Bob puts the blame on those far down the ladder:

Perhaps you disagree. Perhaps you think this was a failure of government, or of management. Of course I agree. Government failed and management failed. But government and management don’t know how to build software. We do. We were hired because of that knowledge. And we are expected to use that knowledge to communicate to the managers and administrators who don’t have it.

The thing is, the Centers for Medicare and Medicaid Services (CMS) is both a government agency and the system integrator on the healthcare.gov project. While there’s plenty of evidence of really poor code across the various parts, the integration of those parts is where the project fell down. Had the various contractors hired numerous Bob Martin clones and obtained the cleanest of clean code, the result would have still been the same.

Those with the technical knowledge and experience are, without a doubt, obligated to provide their best advice to the managers and administrators. When those managers and administrators ignore that advice, it is incorrect to allege that the fault lies elsewhere.

The end of the post, however, is the worst:

So, if I were in government right now, I’d be thinking about laws to regulate the Software Industry. I’d be thinking about what languages and processes we should force them to use, what auditing should be done, what schooling is necessary, etc. etc. I’d be thinking about passing laws to get this unruly and chaotic industry under some kind of control.

If I were the President right now, I might even be thinking about creating a new Czar or Cabinet position: The Secretary of Software Quality. Someone who could regulate this misbehaving industry upon which so much of our future depends.

Considering that all indications are that the laws and regulations around government purchasing and contracting contributed to this mess, I’m not sure how additional regulation is supposed to fix it. Likewise, it’s a little boneheaded to suggest that those responsible for this debacle (by attempting to manage what they should have known they were unqualified to manage) should now regulate the entire software development industry. For a fact, the very diversity of the industry should make it obvious that a one-size-fits-all mandate would make matters irretrievably worse.

Handing out aspirin to treat Ebola is just bad medicine.

Plug and Play or Punt and Pray?

Grenade

On August 1, 2012, Knight Capital Group had a very bad day, losing $440 million in forty-five minutes. More than two weeks later, there has been no official detailed explanation of what happened. Knight CEO Thomas Joyce has stated “Sadly it was a very simple breakdown — a very large breakdown — but a very simple breakdown…”, but exactly what that “simple breakdown” was remains unknown.

In the absence of facts, anonymous statements and speculation about the cause of the disaster have been rife. In a Dr. Dobb’s article, “Wall Street and the Mismanagement of Software”, Robert Dewar, president and CEO of AdaCore, blamed testing:

It’s clear that Knight’s software was deployed without adequate verification. With a deadline that could not be extended, Knight had to choose between two alternatives: delaying their new system until they had a high degree of confidence in its reliability (possibly resulting in a loss of business to competitors in the interim), or deploying an incompletely verified system and hoping that any bugs would be minor. They did not choose wisely.

Other articles have focused on deployment issues. According to an August 14 article on Businessweek, the problem stemmed from an “old set of computer software that was inadvertently reactivated when a new program was installed”. On August 3, Nanex, LLC published “The Knightmare Explained” with the tagline “The following theory fits all available facts”:

We believe Knight accidentally released the test software they used to verify that their new market making software functioned properly, into NYSE’s live system.

In the safety of Knight’s test laboratory, this test software (we’ll call it, the Tester) sends patterns of buy and sell orders to its new Retail Liquidity Provider (RLP) Market Making software, and the resulting mock executions are recorded. This is how they could ensure their new market making software worked properly before deploying to the NYSE live system.

When the time comes to deploy the new market making software, which is likely handled by a different group, the Tester is accidentally included in the release package and started on NYSE’s live system. On the morning of August 1st, the Tester is ready to do its job: test market making software. Except this time it’s no longer in the lab, it’s running on NYSE’s live system. And it’s about to test any market making software running, not just Knights. With real orders and real dollars. And it won’t tell anyone about it, because that’s not its function.

Last December, I posted “Do you have releases or escapes?”, discussing the importance of release management. In that post, I stated that excellent code poorly delivered is effectively poor code. A professional release management practice is essential to creating and maintaining quality systems.

Obviously there will be configuration differences between environments and these represent a risk that must be managed. However, failing to standardize the deployment of code is needlessly introducing a risk. An effective release management process should promote repeatable (preferably automated) deployments across all environments. Deployments should be seen as an opportunity to test this process, with the goal of ensuring that the release to production is thoroughly uneventful.

If Nanex’s assessment is correct, either Knight Capital failed to have one standard release process or their process allowed the test harness access to the real world. Either case would make the events of August 1 possible. Avoidable errors are bad enough; one that costs $10 million per minute is epic.