Stopping Accidental Technical Debt

Buster Keaton looking at a poorly constructed house

In one of my earlier posts about technical debt, I differentiated between intentional debt (that taken on deliberately and purposefully) and accidental debt (that which just accrues over time without rhyme or reason or record). Dealing with technical debt (in the sense of evaluating, tracking, and resolving it) is obviously a consideration for someone in an application architect role. While someone in that role absolutely should be aware of the intentional debt, is there a way to be more attuned to the accidental debt as well?

Last summer, I published a post titled “Distance…is the one true enemy…”. The post started with a group of tweets from Gregory Brown talking about the corrosive effects of distance on software development (distance between compile and run, between failure and correction, between development and feedback, etc.). I then extended the concept to management, talking about how distance between sense-maker and decision-maker could negatively affect the quality of the decisions being made.

There’s also a distance that neither Greg nor I covered at the time: design distance. Design distance is the distance between the design and the outcome. Reducing design distance makes it easier to keep a handle on the accidental debt as well as the intentional.

Distance between the architectural decisions and the implementation can introduce technical debt. This distance can come from remote decision-makers, architecture pigeons who swoop in, deposit their “wisdom”, and then fly away home. It can come from failing to communicate the design considerations effectively across the entire team. It can also come from failing to monitor the system as it evolves. The design and the implementation need to be in alignment. Even more so, the design and the implementation need to align with particular problems to be solved/jobs to be done. Otherwise, the result may look like this:

Distance between development of the system and keeping the system running can introduce technical debt as well. The platform a system runs on is a vital part of the system, as critical as the code it supports. As with the code, the design, implementation, and context all need to be kept in alignment.

Alignment of design, implementation, and context can only be maintained by on-going architectural assessment. Stefan Dreverman’s “Using Philosophy in IT architecture” identified four questions to be asked as part of an assessment:

  1. “What is my purpose?”
  2. “What am I composed of?”
  3. “What’s in my environment?”
  4. “What do I communicate?”

These questions are applicable not only to the beginning of a system, but throughout its life-cycle. Failing to re-evaluate the architecture as a whole as the system evolves can lead to inconsistencies as design distance grows. We can get so busy dealing with the present that we create a future of pain:

At first glance, this approach might seem to be expensive, but rewriting legacy systems is expensive as well (assuming the rewrite would be successful, which is a tenuous assumption). Building applications with a one-and-done mindset is effectively building a legacy system.

Fear of Failure, Fear and Failure

Capricho 43, Goya's 'The Sleep of Reason Produces Monsters'

Some things seem so logically inconsistent that you just have to check them out.

Such was the title of a post on LinkedIn that I saw the other day: “Innovation In Fear-Based Cultures? Or, why hire lions to be dogs?”. In it, Michael Graber noted that “…top-down organizations have the most trouble innovating.”:

In particular, the fearful mindsets that review, align, and sign off on “decks” to be presented to Vice President-level colleagues often edit out the insights and recommendations that have the power to grow the business in new ways.

These well-trained, obedient keepers of the status quo are rewarded for not taking risks and for not thinking outside of the existing paradigm of the business.

None of this is particularly shocking. A culture of fear is pretty much the antithesis of a learning culture, and innovation in the absence of a learning culture is a bit like snow in the desert – not impossible, but certainly remarkable.

Learning involves risk. Whether the method is “move fast and break things” or something more deliberate and considered (such as that outlined in Greger Wikstrand’s post “Jobs to be done innovation”), there is a risk of failure. Where there is a culture of fear, people will avoid all failure. Even limited-risk failure in the context of an acknowledged experiment will be avoided because people won’t trust the powers that be not to punish the failure. In avoiding this type of failure, learning that leads to innovation is avoided as well. You can still learn from what others have done (or failed to do), but even then there’s the problem of finding someone foolhardy enough to propose an action that’s out of the norm for the organization.

Why would an organization foster this kind of culture?

Seth Godin’s post, “What bureaucracy can’t do for you”, holds the key:

It lets us off the hook in many ways. It creates systems and momentum and eliminates many decisions for its members.

“I’m just doing my job.”

“That’s the way the system works.”

Decisions involve risk; someone could make the wrong one. For that reason, the number of people making decisions should be minimized (not a position I endorse, mind you).

That’s the irony of top-down, bureaucratic organizations – often the culture is by design, intended to eliminate risk. By succeeding in doing so on the mundane level, the organization actually introduces an existential risk, the risk of stagnation. The law of unintended consequences has a very long arm.

This type of culture actually introduces perverse incentives that further threaten the organization’s long-term health. Creativity is a huge risk; you could be wrong. Even if you’re right, you’ve become noticeable. Visibility becomes the same as risk. Likewise, responsibility means appearing on the radar. This not only discourages positive actions, but can easily be a corrupting influence.

Fear isn’t the only thing we have to fear, but sometimes it’s something we really need to be concerned about.


This post is another installment of an ongoing conversation about innovation with Greger Wikstrand.

Learning Organizations: When Wrens Take Down Wolfpacks

A Women's Royal Naval Service plotter at work in the Operations Room at Derby House in Liverpool, the headquarters of the Commander-in-Chief Western Approaches, September 1944.

What does the World War II naval campaign known as the Battle of the Atlantic have to do with learning and innovation?

Quite a lot, as it turns out. Early in the war, Britain found itself in a precarious position. While being an island nation provided defensive advantages, it also came with logistical challenges. Food, armaments, and other vital supplies as well as reinforcements had to come to it by sea. The shipping lanes were heavily threatened, primarily by the German U-boat (submarine) fleet. With the country needing more than a million tons of imports per week, maintaining the flow of goods was a matter of survival.

Businesses may not have to worry about literal torpedoes severing their lifelines, but they are at risk due to a number of factors. Whether it’s changing technology or tastes, competitive pressures, or even criminal activity, organizations cannot afford to sit idle. In his post “Heraclitus was wrong about innovation”, Greger Wikstrand talked about the mismatch between the speed of change (high) and rate of innovation (not fast enough). This is a recurrent theme in our ongoing discussion of innovation (we’ve been trading posts on the subject for over a year now).

The British response to the threat involved many facets, but an article I saw yesterday about one response in particular struck a chord. “The Wargaming “Wrens” of the Western Approaches Tactical Unit” told the story of a group of officers and ratings of the Women’s Royal Naval Service (nicknamed “Wrens”) who, under the command of a naval officer, Captain Gilbert Roberts, revolutionized British anti-submarine warfare (ASW). Their mandate was to “…explore and evaluate new tactics and then to pass them on to escort captains in a dedicated ASW course”.

Using simulation (wargaming) to develop and improve tactics was an unorthodox proposition, particularly in the eyes of Admiral Percy Noble, who was responsible for Britain’s shipping lifeline. However, Admiral Noble was capable of appreciating the value of unorthodox methods:

A sceptical Sir Percy Noble arrived with his staff the next day and watched as the team worked through a series of attacks on convoy HG.76. As Roberts described the logic behind their assumptions about the tactics being used by the U-Boats and demonstrated the counter move, one that Wren Officer Laidlaw had mischievously named Raspberry, Sir Percy changed his view of the unit. From now on the WATU would be regular visitors to the Operations Room and all escort officers were expected to attend the course.

Each of the courses looked at ASW and surface attacks on a convoy and the students were encouraged to take part in the wargames that evaluated potential new tactics. Raspberry was soon followed by Strawberry, Gooseberry and Pineapple and as the RN went over to the offensive, the tactical priority shifted to hunting and killing U Boats. Roberts continued as Director of WATU but was also appointed as Assistant Chief of Staff Intelligence at Western Approaches Command.

This type of learning culture, such as I described in “Learning to Deal with the Inevitable”, was key to winning the naval war. Clinging to tradition would have led to a fatal inertia.

One aspect of the WATU approach that I find particularly interesting is the use of simulation to limit risk during learning. Experiments involving real ships cost real lives when they don’t pan out. Simulation (assuming sufficient validity of the theoretical underpinnings of the model used) is a technique that can be used to explore more without sending costs through the roof.

Dealing with Technical Debt Like We Mean it

What’s the biggest problem with technical debt?

In my opinion, the biggest problem is that it works. Just like the electrical outlet pictured above, systems with technical debt get the job done, even when there’s a hidden surprise or two waiting to make life interesting for us at some later date. If it flat-out failed, getting it fixed would be far easier. Making the argument to spend time (money) changing something that “works” can be difficult.

Failing to make the argument, however, is not the answer:

Brenda Michelson’s observation is half the battle. The argument for paying down technical debt needs to be made in business-relevant terms (cost, risk, customer impact, etc.). We need more focus on the “debt” part and remember “technical” is just a qualifier:

The other half of the battle is communicating, in the same business-relevant manner, the costs and/or risks involved when taking on technical debt is considered:

Tracking what technical debt exists and managing the payoff (or write off, removing failed experiments is a reduction technique) is important. Likewise, managing the assumption of technical debt is critical to avoid being swamped by it.
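As a rough illustration of what tracking could look like, here is a minimal sketch of a debt register in Python. Everything in it (the `DebtItem` and `DebtRegister` names, the fields, the two resolution labels) is hypothetical and invented for this example. It is not drawn from any particular tool; it simply records each item in business-relevant terms, gives it an owner, and distinguishes paying a debt off from writing it off.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional

# Hypothetical resolution labels: "paid_off" means the debt was fixed,
# "written_off" means the indebted code (e.g. a failed experiment) was removed.
RESOLUTIONS = ("paid_off", "written_off")


@dataclass
class DebtItem:
    description: str          # what the shortcut or flaw is
    intentional: bool         # taken on deliberately, or discovered after the fact
    business_impact: str      # cost, risk, or customer impact in business terms
    owner: str                # who is responsible for following up
    recorded_on: date
    resolved_on: Optional[date] = None
    resolution: Optional[str] = None


class DebtRegister:
    """An in-memory list of debt items; a real register would persist them."""

    def __init__(self) -> None:
        self._items: List[DebtItem] = []

    def record(self, item: DebtItem) -> DebtItem:
        self._items.append(item)
        return item

    def resolve(self, item: DebtItem, resolution: str, on: date) -> None:
        if resolution not in RESOLUTIONS:
            raise ValueError(f"unknown resolution: {resolution!r}")
        item.resolution = resolution
        item.resolved_on = on

    def outstanding(self) -> List[DebtItem]:
        return [i for i in self._items if i.resolved_on is None]


register = DebtRegister()
shortcut = register.record(DebtItem(
    description="Hard-coded tax rates in the invoicing module",
    intentional=True,
    business_impact="Rate changes require a release; risk of incorrect invoices",
    owner="billing team",
    recorded_on=date(2016, 1, 15),
))
experiment = register.record(DebtItem(
    description="Abandoned recommendation prototype still deployed",
    intentional=False,
    business_impact="Hosting cost and an unpatched dependency",
    owner="platform team",
    recorded_on=date(2016, 2, 1),
))
register.resolve(experiment, "written_off", on=date(2016, 3, 1))
```

The point of the sketch is the shape of the record rather than the code itself: accidental debt only becomes manageable once it is written down next to the intentional kind, with an impact statement and an owner.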

Of course, one could take the approach that the only acceptable level of technical debt is zero. This is equivalent to saying “if we can’t have a perfect product, we won’t have a product”. That might be a difficult position to sell to those writing the checks.

Even if you could get an agreement for that position, reality will conspire to frustrate you. Entropy emerges. Even if the code is perfected and then left unchanged, the system can rot as its platform ages and the needs of the business change. When a system is actively maintained over time without an eye to maintaining a coherent, intentional architecture, then the situation becomes worse. In his post “Enterprise Modernization – The Next Big Thing!”, David Sprott noted:

The problem with modernization is that it is widely perceived as slow, very expensive and high risk because the core business legacy systems are hugely complex as a result of decades of tactical change projects that inevitably compromise any original architecture. But modernization activity must not be limited to the old, core systems; I observe all enterprises old and new, traditional and internet based delivering what I call “instant legacy” [Note 1] generally as outcomes of Agile projects that prioritize speed of delivery over compliance with a well-defined reference architecture that enables ongoing agility and continuous modernization.

Kellan Elliot-McCrea, in “Towards an understanding of technical debt”, captured the problem:

All code is technical debt. All code is, to varying degrees, an incorrect bet on what the future will look like.

This means that assessing and managing technical debt should be an ongoing activity with a responsible owner rather than a one-off event that “somebody” will take care of. The alternative is a bit like using a credit card at every opportunity and ignoring the statements until the repo-man is at the door.

What’s Innovation Worth?

Animated GIF of Sherman Tank Variants

What does an old World War II tank have to do with innovation?

I’ve mentioned it before, but it bears repeating – one of the benefits of having a blog is the ability to interact with and learn from people all over the world. For example, Greger Wikstrand and I have been trading blog posts on innovation for six months now. His latest post, “Switcher’s curse and legacy decisions”, is the 18th installment in the series. In this post, Greger discusses switcher’s curse, “a trap in which a decision maker systematically switches too often”.

Just as the sunk cost fallacy can keep you holding on to a legacy system long past its expiration date, switcher’s curse can cause you to waste money on too-frequent changes. As Greger points out in his post, the net benefit of the new system must outweigh the net benefit of the old system plus the cost of switching (with a significant safety margin to account for estimation errors in assessing the costs and benefits). Newer isn’t automatically better.
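The switching test described above can be written down as a simple comparison. The sketch below is only an illustration: the function name, the parameter names, and the choice to express the safety margin as an absolute amount in the same units as the other figures are all assumptions made for this example, not something taken from Greger’s post.

```python
def should_switch(new_net_benefit: float,
                  old_net_benefit: float,
                  switching_cost: float,
                  safety_margin: float) -> bool:
    """Return True only if the new system clears the full hurdle.

    The hurdle is the old system's net benefit, plus the cost of switching,
    plus a margin that pads for errors in the estimates. All four values are
    assumed to be in the same units (for example, money over the same period).
    """
    if switching_cost < 0 or safety_margin < 0:
        raise ValueError("switching_cost and safety_margin must not be negative")
    hurdle = old_net_benefit + switching_cost + safety_margin
    return new_net_benefit > hurdle


# The new system looks better than the old one (150 vs. 100), but not by
# enough to cover the switch and the margin: the hurdle is 100 + 30 + 25 = 155.
marginal_case = should_switch(150, 100, 30, 25)

# A larger estimated benefit clears the same hurdle.
clear_case = should_switch(200, 100, 30, 25)
```

In the first case a naive “new beats old” comparison would say switch, while the padded comparison says stay put, which is exactly the trap that switcher’s curse describes.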

“Disruption” is a two-edged sword when it comes to innovation. As Greger notes regarding legacy systems:

Existing software is much more than a series of decisions to keep it. It embodies a huge number of decisions on how the business of the company should work. The software is full of decisions about business objects and what should be done with them. These decisions, embodied in the software, forms the operating system of the company. The decision to switch is bigger than replacing some immaterial asset with another. It is a decision about replacing a proven way of working with a new way of working.

Disruption involves risk. Change involves cost; disruptive change involves higher costs. In “Innovate or Execute?”, Earl Beede asked:

So, do our employers really want us taking the processes they have paid dearly to implement and products they have scheduled out for the next 15 quarters and, individually, do something disruptive? Every team member taking a risk to see what they can learn and then build on?

Wouldn’t that be chaos?

Beede’s answer to the dilemma:

Now, please don’t think I am completely cynical. I do think that the board of directors and maybe even the C-level officers want to have innovative companies. I really believe that there needs to be parts of a company whose primary mission is to make the rest of the company obsolete. But those disruptive parts need to be small, isolated groups, kept out of the day-to-day delivery of the existing products or services.

What employers should be asking for is for most of the company to be focused on executing the existing plans and for some of the company to be trying to put the executing majority into a whole new space.

This meshes well with Greger’s recommendations:

Conservatism is often the best approach. But it needs to be a prudent conservatism. Making changes smaller and more easily reversible decreases the need for caution. We should consider a prudent application of fail fast mentality in our decision-making process. (But I prefer to call it learn fast.)

Informed decision-making (i.e. making decisions that make sense in light of your context) is critical. The alternative is to rely on blind luck. Being informed requires learning, and as Greger noted, fast turn-around on that learning is to be preferred. Likewise, limiting risk during learning is to be preferred as well. Casimir Artmann, in his post “Fail is not an option”, discussed this concept in relation to hiking in the wilderness. Assessing and controlling risk in that environment can be a matter of life and death. In a business context, it’s the same (even if the “death” is figurative, it’s not much comfort considering the lives impacted). Learning is only useful if you survive to put it to use.

Lastly, it must be understood that decision-making is not a one-time activity. Context is not static; neither should your decision-making process be. An iterative cycle of sense-making and decision-making is required to maintain the balance between innovation churn and stagnation.

So, why the tank?

The M-4 Sherman, in addition to being the workhorse of the U.S. Army’s armor forces in World War II, is also an excellent illustration of avoiding the switcher’s curse. When it was introduced, it was a match for existing German armored vehicles. Shortly afterwards, however, it was outclassed as newer, heavier, better armed German models came online. The U.S. stuck with their existing design, and were able to produce almost three times the number of tanks as Germany (not counting German tanks inferior to the Sherman). As the saying goes, “quantity has a quality all its own”, particularly when paired with other weapon systems in a way that did not disrupt production. The German strategy of producing multiple models hampered their ability to produce in quantity, negating their qualitative advantage. In this instance, progressive enhancement and innovating on the edges was a winning strategy for the U.S.

Abuse Cases – What Could Go Wrong?

Trainwreck

Last week, in a post titled “The Flaw in All Things”, John Vincent discussed the problem of seeing “the flaw in all things”:

It’s overwhelming. It’s paralyzing.

I can’t finish a project because I keep finding things that could cause problems. I even mentioned this to our CTO and CEO at one point when we were trying to size some private deploys of our stack.

I couldn’t see anything but the largest configuration because all I could see was places where there was a risk. There were corners I wasn’t willing to cut (not bad corners like risking availability but more like “use a smaller instance here”) because I could see and feel and taste the pain that would come from having to grow the environment under duress.

I’m frustrated with putting everything in Docker containers because all I see is having to take down EVERYTHING running on one node because there’s may be a critical Docker upgrade. I see Elasticsearch rebalancing because of it. I see Kafka elections. Mind you the system is designed for this to happen but why add something that makes it a regular occurance?

I can certainly sympathize. For what it’s worth, it sounds like those making the trade-offs he’s worried about could stand to be a bit more inclusive. That doesn’t necessarily mean the decisions would change, but at least being heard and knowing the answers to his questions might reduce some of the stress (not to mention perhaps helping out those responsible for the decisions).

When making design decisions, having (or, at least, having access to) this level of knowledge and experience has a great deal of value. As I noted in “NPM, Tay, and the Need for Design”, you have to consider both the use cases and the abuse cases for a given system (whether software or social).

It’s not possible to foresee every potential flaw and it probably won’t be feasible to eliminate every risk that’s discovered. That being said, it doesn’t mean time spent in risk evaluation is wasted. Dealing with foreseeable issues before they become problems (where “dealing” is defined as either mitigating it outright or at least planning for a response should it occur) will work better than figuring it out on the fly when the problem occurs.

Understanding why “Should I?” is a more important question than “Can I?” is something I’ve touched on before. Snapchat is finding this out by way of a lawsuit over their filter that allows users to record their speed. Who would have thought someone might cause an accident using that?

Trial and error/experimenting is one method of learning, but it’s not the only method and is frequently not the best method. Fear of failure can hold back learning, but a cavalier attitude toward risk can make experiments just as costly, if not more so. It’s the difference between testing a 9 volt battery by touching it to your tongue and using the same technique on a 240 volt circuit.

Talking about TayandYou on Architecture Corner

I had the pleasure of appearing on episode #37 of Architecture Corner, “Fail fast, learn fast”, with Greger Wikstrand and Casimir Artmann. In the episode, we discuss learning, experiments, and the idea of “fail fast” in relation to the recent incident with Microsoft’s artificial intelligence chatbot, @TayandYou.

I hope you enjoy the discussion as much as I did!

[updated 4/5/2016 to fix the episode number in the link]