Learning Organizations: When Wrens Take Down Wolfpacks

A Women's Royal Naval Service plotter at work in the Operations Room at Derby House in Liverpool, the headquarters of the Commander-in-Chief Western Approaches, September 1944.

What does the World War II naval campaign known as the Battle of the Atlantic have to do with learning and innovation?

Quite a lot, as it turns out. Early in the war, Britain found itself in a precarious position. While being an island nation provided defensive advantages, it also came with logistical challenges. Food, armaments, and other vital supplies as well as reinforcements had to come to it by sea. The shipping lanes were heavily threatened, primarily by the German u-boat (submarine) fleet. Needing more than a million tons of imports per week, maintaining the flow of goods was a matter of survival.

Businesses may not have to worry about literal torpedoes severing their lifelines, but they are at risk due to a number of factors. Whether its changing technology or tastes, competitive pressures, or even criminal activity, organizations cannot afford to sit idle. In his post “Heraclitus was wrong about innovation”, Greger Wikstrand talked about the mismatch between the speed of change (high) and rate of innovation (not fast enough). This is a recurrent theme in our ongoing discussion of innovation (we’ve been trading posts on the subject for over a year now).

The British response to the threat involved many facets, but an article I saw yesterday about one response in particular struck a chord. “The Wargaming “Wrens” of the Western Approaches Tactical Unit” told the story of a group of officers and ratings of the Women’s Royal Naval Service (nicknamed “Wrens”) who, under the command of a naval officer, Captain Gilbert Roberts, revolutionized British anti-submarine warfare (ASW). Their mandate was to “…explore and evaluate new tactics and then to pass them on to escort captains in a dedicated ASW course”.

Using simulation (wargaming) to develop and improve tactics was an unorthodox proposition, particularly in the eyes of Admiral Percy Noble, who was responsible for Britain’s shipping lifeline. However, Admiral Noble was capable of appreciating the value of unorthodox methods:

A sceptical Sir Percy Noble arrived with his staff the next day and watched as the team worked through a series of attacks on convoy HG.76. As Roberts described the logic behind their assumptions about the tactics being used by the U-Boats and demonstrated the counter move, one that Wren Officer Laidlaw had mischievously named Raspberry, Sir Percy changed his view of the unit. From now on the WATU would be regular visitors to the Operations Room and all escort officers were expected to attend the course.

Each of the courses looked at ASW and surface attacks on a convoy and the students were encouraged to take part in the wargames that evaluated potential new tactics. Raspbery was soon followed by Strawberry, Goosebery and Pineapple and as the RN went over to the offensive, the tactical priority shifted to hunting and killing U Boats. Roberts continued as Director of WATU but was also appointed as Assistant Chief of Staff Intelligence at Western Approaches Command.

This type of learning culture, such as I described in “Learning to Deal with the Inevitable”, was key to winning the naval war. Clinging to tradition would have led to a fatal inertia.

One aspect of the WATU approach that I find particularly interesting is the use of simulation to limit risk during learning. Experiments involving real ships cost real lives when they don’t pan out. Simulation (assuming sufficient validity of the theoretical underpinnings of the model used) is a technique that can be used to explore more without sending costs through the roof.

Dealing with Technical Debt Like We Mean it

What’s the biggest problem with technical debt?

In my opinion, the biggest problem is that it works. Just like the electrical outlet pictured above, systems with technical debt get the job done, even when there’s a hidden surprise or two waiting to make life interesting for us at some later date. If it flat-out failed, getting it fixed would be far easier. Making the argument to spend time (money) changing something that “works” can be difficult.

Failing to make the argument, however, is not the answer:

Brenda Michelson‘s observation is half the battle. The argument for paying down technical debt needs to be made in business-relevant terms (cost, risk, customer impact, etc.). We need more focus on the “debt” part and remember “technical” is just a qualifier:

The other half of the battle is communicating, in the same business-relevant manner, the costs and/or risks involved when taking on technical debt is considered:

Tracking what technical debt exists and managing the payoff (or write off, removing failed experiments is a reduction technique) is important. Likewise, managing the assumption of technical debt is critical to avoid being swamped by it.

Of course, one could take the approach that the only acceptable level of technical debt is zero. This is equivalent to saying “if we can’t have a perfect product, we won’t have a product”. That might be a difficult position to sell to those writing the checks.

Even if you could get an agreement for that position, reality will conspire to frustrate you. Entropy emerges. Even if the code is perfected and then left unchanged, the system can rot as its platform ages and the needs of the business change. When a system is actively maintained over time without an eye to maintaining a coherent, intentional architecture, then the situation becomes worse. In his post “Enterprise Modernization – The Next Big Thing!”, David Sprott noted:

The problem with modernization is that it is widely perceived as slow, very expensive and high risk because the core business legacy systems are hugely complex as a result of decades of tactical change projects that inevitably compromise any original architecture. But modernization activity must not be limited to the old, core systems; I observe all enterprises old and new, traditional and internet based delivering what I call “instant legacy” [Note 1] generally as outcomes of Agile projects that prioritize speed of delivery over compliance with a well-defined reference architecture that enables ongoing agility and continuous modernization.

Kellan Elliot-McCrea, in “Towards an understanding of technical debt”, captured the problem:

All code is technical debt. All code is, to varying degrees, an incorrect bet on what the future will look like.

This means that assessing and managing technical debt should be an ongoing activity with a responsible owner rather than a one-off event that “somebody” will take care of. The alternative is a bit like using a credit card at every opportunity and ignoring the statements until the repo-man is at the door.

What’s Innovation Worth?

Animated GIF of Sherman Tank Variants

What does an old World War II tank have to do with innovation?

I’ve mentioned it before, but it bears repeating – one of benefits of having a blog is the ability to interact with and learn from people all over the world. For example, Greger Wikstrand and I have been trading blog posts on innovation for six months now. His latest post, “Switcher’s curse and legacy decisions”,is the 18th installment in the series. In this post, Greger discusses switcher’s curse, “a trap in which a decision maker systematically switches too often”.

Just as the sunk cost fallacy can keep you holding on to a legacy system long past its expiration date, switcher’s curse can cause you to waste money on too-frequent changes. As Greger points out in his post, the net benefit of the new system must outweigh both the net benefit of the old, plus the cost of switching (with a significant safety margin to account for estimation errors in assessing the costs and benefits). Newer isn’t automatically better.

“Disruption” is a two-edged sword when it comes to innovation. As Greger notes regarding legacy systems:

Existing software is much more than a series of decisions to keep it. It embodies a huge number of decisions on how the business of the company should work. The software is full of decisions about business objects and what should be done with them. These decisions, embodied in the software, forms the operating system of the company. The decision to switch is bigger than replacing some immaterial asset with another. It is a decision about replacing a proven way of working with a new way of working.

Disruption involves risk. Change involves cost; disruptive change involves higher costs. In “Innovate or Execute?”, Earl Beede asked:

So, do our employers really want us taking the processes they have paid dearly to implement and products they have scheduled out for the next 15 quarters and, individually, do something disruptive? Every team member taking a risk to see what they can learn and then build on?

Wouldn’t that be chaos?

Beede’s answer to the dilemma:

Now, please don’t think I am completely cynical. I do think that the board of directors and maybe even the C-level officers want to have innovative companies. I really believe that there needs to be parts of a company whose primary mission is to make the rest of the company obsolete. But those disruptive parts need to be small, isolated groups, kept out of the day-to-day delivery of the existing products or services.

What employers should be asking for is for most of the company to be focused on executing the existing plans and for some of the company to be trying to put the executing majority into a whole new space.

This meshes well with Greger’s recommendations:

Conservatism is often the best approach. But it needs to be a prudent conservatism. Making changes smaller and more easily reversible decreases the need for caution. We should consider a prudent application of fail fast mentality in our decision-making process. (But I prefer to call it learn fast.)

Informed decision-making (i.e. making decisions that make sense in light of your context) is critical. The alternative is to rely on blind luck. Being informed requires learning, and as Greger noted, fast turn-around on that learning is to be preferred. Likewise, limiting risk during learning is to be preferred as well. Casimir Artmann, in his post “Fail is not an option”, discussed this concept in relation to hiking in the wilderness. Assessing and controlling risk in that environment can be a matter of life in death. In a business context, it’s the same (even if the “death” is figurative, it’s not much comfort considering the lives impacted). Learning is only useful if you survive to put it to use.

Lastly, it must be understood that decision-making is not a one-time activity. Context is not static, neither should your decision-making process be. An iterative cycle of sense-making and decision-making is required to maintain the balance between innovation churn and stagnation.

So, why the tank?

The M-4 Sherman, in addition to being the workhorse of the U.S. Army’s armor forces in World War II, is also an excellent illustration of avoiding the switcher’s curse. When it was introduced, it was a match for existing German armored vehicles. Shortly afterwards, however, it was outclassed as newer, heavier, better armed German models came online. The U.S. stuck with their existing design, and were able to produce almost three times the number of tanks as Germany (not counting German tanks inferior to the Sherman). As the saying goes, “quantity has a quality all its own”, particularly when paired with other weapon systems in a way that did not disrupt production. The German strategy of producing multiple models hampered their ability to produce in quantity, negating their qualitative advantage. In this instance, progressive enhancement and innovating on the edges was a winning strategy for the U.S.

Abuse Cases – What Could Go Wrong?


Last week, in a post titled “The Flaw in All Things”, John Vincent discussed the problem of seeing “the flaw in all things”:

It’s overwhelming. It’s paralyzing.

I can’t finish a project because I keep finding things that could cause problems. I even mentioned this to our CTO and CEO at one point when we were trying to size some private deploys of our stack.

I couldn’t see anything but the largest configuration because all I could see was places where there was a risk. There were corners I wasn’t willing to cut (not bad corners like risking availability but more like “use a smaller instance here”) because I could see and feel and taste the pain that would come from having to grow the environment under duress.

I’m frustrated with putting everything in Docker containers because all I see is having to take down EVERYTHING running on one node because there’s may be a critical Docker upgrade. I see Elasticsearch rebalancing because of it. I see Kafka elections. Mind you the system is designed for this to happen but why add something that makes it a regular occurance?

I can certainly sympathize. For what it’s worth, it sounds like those making the trade-offs he’s worried about could stand to be a bit more inclusive. That doesn’t necessarily mean the decisions would change, but at least being heard and knowing the answers to his questions might reduce some of the stress (not to mention perhaps helping out those responsible for the decisions).

When making design decisions, having (or, at least, having access to) this level of knowledge and experience has a great deal of value. As I noted in “NPM, Tay, and the Need for Design”, you have to consider both the use cases and the abuse cases for a given system (whether software or social).

It’s not possible to foresee every potential flaw and it probably won’t be feasible to eliminate every risk that’s discovered. That being said, it doesn’t mean time spent in risk evaluation is wasted. Dealing with foreseeable issues before they become problems (where “dealing” is defined as either mitigating it outright or at least planning for a response should it occur) will work better than figuring it out on the fly when the problem occurs.

Understanding why “Should I?” is a more important question than “Can I?” is something I’ve touched on before. Snapchat is finding this out by way of a lawsuit over their filter that allows users to record their speed. Who would have thought someone might cause an accident using that?

Trial and error/experimenting is one method of learning, but it’s not the only method and is frequently not the best method. Fear of failure can hold back learning, but a cavalier attitude toward risk can make experiments just as, if not more costly. It’s the difference between testing a 9 volt battery by touching it to your tongue and using the same technique on a 240 volt circuit.

Talking about TayandYou on Architecture Corner

I had the pleasure of appearing on episode #37 of Architecture Corner, “Fail fast, learn fast”, with Greger Wikstrand and Casimir Artmann. In the episode, we discuss learning, experiments, and the idea of “fail fast” in relation to the recent incident with Microsoft’s artificial intelligence chatbot, @TayandYou.

I hope you enjoy the discussion as much as I did!

[updated 4/5/2016 to fix the episode number in the link]

NPM, Tay, and the Need for Design

Take a couple of seconds and watch the clip in the tweet below:

While it would be incredibly difficult to predict that exact outcome, it is also incredibly easy to foresee that it’s a possibility. As the saying goes, “forewarned is forearmed”.

Being forewarned and forearmed is an important part of what an architect does. An architect is supposed to focus on the architecturally significant aspects of a system. I like to use Ruth Malan‘s definition of architectural significance due to its flexibility:

Decisions (both those that were made and those that were left unmade) that end up taking systems offline and causing very public embarrassment are, in my opinion, architecturally significant.

Last week, two very public, very foreseeable failures took place: first was the chaos caused by a developer removing his modules from NPM, which was followed by Microsoft having to pull the plug on its Tay chatbot when it was “trained” to spew offensive comments in less than 24 hours. In my opinion, these both represented design failures resulting from a lack of due consideration of the context in which these systems would operate.

After all, can anyone really claim that no one would expect that people on the internet would try to “corrupt” a chatbot? According to Azeem Azhar as quoted in Business Insider, not really:

“Of course, Twitter users were going to tinker with Tay and push it to extremes. That’s what users do — any product manager knows that.

“This is an extension of the Boaty McBoatface saga, and runs all the way back to the Hank the Angry Drunken Dwarf write in during Time magazine’s Internet vote for Most Beautiful Person. There is nearly a two-decade history of these sort of things being pushed to the limit.”

The current claim, as reported in CIO.com, is that Tay was developed with filtering built-in, but there was a “critical oversight” for a specific kind of attack. According to the article, it’s believed that the attack vector involved asking Tay to “repeat after me”.

Or, as Matt Ballantine put it:

Likewise, who could imagine issues with a centralized repository of cascading dependencies? Failing to consider what would happen if someone suddenly pulled one of the bottom blocks out led to a huge inconvenience to anyone depending on that module or any downstream module. There’s plenty of blame to go around: the developer who took his toys and went home, those responsible for NPM’s design, and those who depended on it without understanding its weaknesses.

“The Iron Law of Tools” is “that which does for you will also do to you”. Understanding the trade-offs allows you to plan for risk mitigation in advance. Ignoring them merely ensures that they will have to be dealt with in crisis mode. This is something I covered in a previous post, “Dependency Management is Risk Management”.

Effective design involves not only the internals of a system but its externals as well. The conditions under which the system will be used, it’s context, is highly significant. That means considering not only the system’s use cases, but also its abuse cases. A post written almost a year ago by Brandon Harris, “Designing for Evil”, conveys this well:

When all is said and done, when you’ve set your ideas to paper, you have to sit down and ask yourself a very specific question:

How could this feature be exploited to harm someone?

Now, replace the word “could” with the word “will.”

How will this feature be exploited to harm someone?

You have to ask that question. You have to be unflinching about the answers, too.

Because if you don’t, someone else will.

When I began working on this post, the portion above was what I had in mind to say. In essence, I planned a longer-form version of what I’d tweeted about the Tay fiasco:

However, before I had finished writing the post, Greger Wikstrand posted “The fail fast fallacy”. Greger and I have been carrying on a conversation about innovation over the last few months. While I had initially intended to approach this as a general issue of architectural practice rather than innovation, the points he makes are just too apropos to leave out.

In the post, Greger points out that the focus seems to have shifted from learning to failure. Learning from experience can be the best way to test an idea. However, it’s not the only way:

Evolution and nature has shown us that there are two, equally valid, approaches to winning the gene game. The first approach is to get as much offspring as possible and “hope” many of them survive (r-selection). The second approach is to have few offspring but raise them and nurture them carefully (K-selection). Biologists tell us that the first strategy works best in a harsh, unpredictable environment where the effort of creating offspring is low. The second strategy works better in an environment where there is less change and offspring are more expensive to produce. Some of the factors that favour r-selection seems to be large uncompeted resources. K-selection is more favourable in resource scarce, low predator areas.

The phrase “…where the effort of creating offspring is low” is critical here. The higher the “cost” of the experiment, the more risk is involved in failure. This makes it advisable to tilt the playing field by supporting and nurturing the “offspring”.

In response to Greger’s post, Casimir Artmann posted two excellent articles that further elaborated on this. In “Fail Fast During Adventures”, he noted that “There is a fine line between fail fast and Darwin Awards in IRL.” His point, preparation beforehand and being willing to abort during an experiment before failure is equivalent to suffering a fatality can be effective learning strategies. Lessons that you don’t live to apply aren’t worth much.

Casimir followed with “Fail is not an Option”, in which he stated:

I want the project to succeed, but I plan for things going wrong so that the consequences wouldn’t be to huge. Some risk are manageable, as walking alone, but not alone and off-trail. That’s to risky. If you doing outdoor adventures, you are probably more prepared and skilled than a ordinarie project member, and thats a huge benefit.

I guess the best advice, when doing completely new things with IT, is to start really small so that the majority of your business is not impacted if there is a failure. When something goes wrong, be sure that you could go back to safe place. Point of no return is like being on a sailing boot in the middle of the Atlantic where you can’t go back.

That’s excellent advice. “Fail Fast” has the advantage of being able to fit on a bumper sticker, but the longer, more nuanced version is more likely to serve you well.

Twitter, Timelines, and the Open/Closed Principle

Consider this Tweet for a moment. I’ll be coming back to it at the end.

In my last post, I brought up Twitter’s rumored changes to the timeline feature as a poor example of customer awareness in connection with an attempt to innovate. The initial rumor set off a storm of protest that brought out CEO Jack Dorsey to deny (sort of) that the timeline will change. Today, the other shoe dropped, the timeline will change (sort of):

Essentially, it will be a re-implementation of the “While You Were Away” feature with an opt-out:

In the “coming weeks,” Twitter will turn on the feature for users by default, and put a notification in the timeline when it does, Haq says. But even then, you’ll be able to turn it off again.

Of course, Twitter’s expectation is that most people will like the timeline tweak—or at least not hate it—once they’re exposed to it. “We have the opt-out because we also prioritize user control,” Haq says. “But we do encourage people to give it a chance.”

So, what does this have to do with the Open/Closed Principle? The Wikipedia article for it contains a quote from Bertrand Meyer’s Object-Oriented Software Construction (emphasis is mine):

A class is closed, since it may be compiled, stored in a library, baselined, and used by client classes. But it is also open, since any new class may use it as parent, adding new features. When a descendant class is defined, there is no need to change the original or to disturb its clients.

Just as change to the code of class may disturb its clients, change to user experience of a product may disturb the clientele. Sometimes extension won’t work and change must take place. As it turns out, the timeline has been extended with optional behavior rather than changed unconditionally as was rumored.

Some thoughts:

  • Twitter isn’t the only social media application out there with a timeline for updates. Perhaps that chronological timeline (among other features) provides some value to the user base?
  • Assuming that value and the risk of upsetting the user base if that value was taken away, wouldn’t it have been wise to communicate in advance? Wouldn’t it have been even wiser to communicate when the rumor hit?

Innovation will involve change, but not all change is necessarily innovative. Understanding customer wants and needs is a good first step to identifying risky changes to user experience (whether real or just perceived). I’d argue this is even more pronounced when you consider that Twitter’s user base is really its product. Twitter’s customers are advertisers and data consumers who want and need an engaged, growing user base to view promoted Tweets and generate usage data.

Returning to the Tweet at the beginning of this post. Considering the accuracy of that recommendation, would it be reasonable to think turning over your timeline to their algorithms might degrade your user experience?