Dealing with Technical Debt Like We Mean It

What’s the biggest problem with technical debt?

https://twitter.com/jetpack/status/724637235459547136

In my opinion, the biggest problem is that it works. Just like the electrical outlet pictured above, systems with technical debt get the job done, even when there’s a hidden surprise or two waiting to make life interesting for us at some later date. If it flat-out failed, getting it fixed would be far easier. Making the argument to spend time (money) changing something that “works” can be difficult.

Failing to make the argument, however, is not the answer:

Brenda Michelson’s observation is half the battle. The argument for paying down technical debt needs to be made in business-relevant terms (cost, risk, customer impact, etc.). We need more focus on the “debt” part and remember “technical” is just a qualifier:

https://twitter.com/jetpack/status/703327984107782144

The other half of the battle is communicating, in the same business-relevant terms, the costs and/or risks involved when taking on technical debt is under consideration:

Tracking what technical debt exists and managing its payoff (or write-off; removing failed experiments is one way to reduce debt) is important. Likewise, managing the assumption of new technical debt is critical to avoid being swamped by it.
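As a minimal sketch of what that tracking might look like, here’s a hypothetical technical debt register in Python. The field names, categories, and prioritization rule are assumptions for illustration, not a prescription:

```python
from dataclasses import dataclass
from enum import Enum


class Resolution(Enum):
    PAY_DOWN = "pay down"    # schedule remediation work
    WRITE_OFF = "write off"  # remove the failed experiment entirely
    ACCEPT = "accept"        # consciously carry the debt, for now


@dataclass
class DebtItem:
    """One entry in a technical debt register (illustrative fields only)."""
    description: str
    owner: str                    # who is responsible for managing this item
    monthly_carrying_cost: float  # estimated "interest": rework, support, delay
    payoff_cost: float            # estimated one-time cost to remediate
    customer_impact: str          # business-relevant framing, not just technical
    resolution: Resolution = Resolution.ACCEPT


def prioritize(register: list[DebtItem]) -> list[DebtItem]:
    """Surface the items whose carrying cost most outweighs their payoff cost."""
    return sorted(register,
                  key=lambda d: d.monthly_carrying_cost / max(d.payoff_cost, 1),
                  reverse=True)
```

The data structure matters less than the discipline it represents: every item has an owner, a business-relevant cost, and an explicit decision about what will happen to it.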

Of course, one could take the approach that the only acceptable level of technical debt is zero. This is equivalent to saying “if we can’t have a perfect product, we won’t have a product”. That might be a difficult position to sell to those writing the checks.

Even if you could get agreement on that position, reality will conspire to frustrate you. Entropy emerges. Even if the code is perfected and then left unchanged, the system can rot as its platform ages and the needs of the business change. When a system is actively changed over time without attention to maintaining a coherent, intentional architecture, the situation becomes worse. In his post “Enterprise Modernization – The Next Big Thing!”, David Sprott noted:

The problem with modernization is that it is widely perceived as slow, very expensive and high risk because the core business legacy systems are hugely complex as a result of decades of tactical change projects that inevitably compromise any original architecture. But modernization activity must not be limited to the old, core systems; I observe all enterprises old and new, traditional and internet based delivering what I call “instant legacy” [Note 1] generally as outcomes of Agile projects that prioritize speed of delivery over compliance with a well-defined reference architecture that enables ongoing agility and continuous modernization.

Kellan Elliott-McCrea, in “Towards an understanding of technical debt”, captured the problem:

All code is technical debt. All code is, to varying degrees, an incorrect bet on what the future will look like.

This means that assessing and managing technical debt should be an ongoing activity with a responsible owner rather than a one-off event that “somebody” will take care of. The alternative is a bit like using a credit card at every opportunity and ignoring the statements until the repo-man is at the door.

Form Follows Function on SPaMCast 395

SPaMCAST logo

This week’s episode of Tom Cagley’s Software Process and Measurement (SPaMCast) podcast, number 395, features Tom’s essay on productivity, Kim Pries on “how software developers leverage assimilation and accommodation in the acquisition of knowledge”, and a Form Follows Function installment on accidental innovation.

Tom and I discuss my post “Accidental Innovation”. Without an environment that intentionally melds process, structure, and technology, we’re left hoping for a happy accident when it comes to innovation.

You can find all my SPaMCast episodes under the SPaMCast Appearances category on this blog. Enjoy!

The Hidden Cost of Cheap – UX and Internal Applications

Sisyphus by Titian

Why would anyone worry about user experience for anything that’s not customer-facing?

This question was the premise of Maurice Roach’s post on the Zühlke blog, “Empathise with your users or you won’t solve their problems”:

Bring up the subject of user empathy with some engineers or product owners and you’ll probably hear comments that fall into one of the following categories:

  • Why do we need to empathise when the requirements tell us all we need to know about the problem at hand?
  • Is this really going to improve anything?
  • Sounds like an expensive waste of time
  • They’ll have to use whatever they’re given

These aren’t unexpected responses, it’s easy to put empathy into the “touchy feely”, “let’s all hug and get along” box of product management.

Roach’s answers:

Empathy does a number of things, but mainly it increases the likelihood that the delivery team will think of a user and their pain points when delivering a feature.

If an engineer, UX designer or product owner has sat with a user, watched them interact with their current software or device, they will have an understanding of their frustrations, concerns and impediments to success. The team will be focused on creating features with the things they have witnessed in mind, they’re thinking about how their software will affect a human being and no amount of requirement documentation will give them that emotional connection.

Empathy can also help to develop a shared trust in the application development process. The users see that the delivery team are interested in helping to solve their problems and the product delivery team see the real users behind the application.

These are all valid reasons, but the list is incomplete; each answers the question from a software development point of view. To his credit, Roach pushes past the purely technical aspects into the world of the user. This expanded exploration of the context is, in my opinion, absolutely essential. What’s presented above is an IT-centric viewpoint that needs to be married with a business-centric viewpoint in order to get a fuller picture.

Nick Shackleton-Jones, in his post “The Future Is… Organisational Usability!”, outlined the problem:

Here’s how your organisation works: you hire people who are increasingly used to a world where they can do pretty much anything via an app on their iPhone, and you subject them to a blizzard of process, policy, antiquated systems and outdated ways of working which pretty much stop them in their tracks, leaving them unproductive and demoralised. Frankly, it’s a miracle they manage to accomplish anything at all.

As he notes, enterprises are putting a lot of effort into digital initiatives aimed at making it easier for customers to engage with them. However:

…if we are going to be successful in future we need to make it much easier for our people to do their jobs: because they are going to be spending less time with us, and because we want engagement and retention, and because if we require high levels of capability (to work our complex systems) then our resourcing costs will go through the roof. We have to simplify ‘getting stuff done’. To put it another way: in an ideal world, any job in your organisation should be do-able by a 12-yr old.

While I disagree that “any job in your organisation should be do-able by a 12-yr old”, Shackleton-Jones’ point is well taken: it is in the interests of the business to make it easier for people to do their jobs. All aspects of the system, whether organizational, procedural or technological, should be facilitating, not hindering, the mission. Self-inflicted, unnecessary impediments are morale-killers and degrade both effectiveness and efficiency. All three of these directly impact customer experience.

While this linkage between employee user experience and customer experience makes usability important for line-of-business systems (both technological and social), it has value for peripheral systems as well. Time people spend on ancillary tasks (filling out time sheets, requesting supplies, etc.) is time not spent on their primary duties. You may not be able to eliminate those tasks, but you can minimize their expense by making them quick and easy to complete. The further someone’s knowledge/skill/experience level gets from “do-able by a 12-yr old”, the bigger the savings from paying attention to this.
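A rough back-of-the-envelope calculation illustrates the scale involved; the numbers below are entirely hypothetical and exist only to show the arithmetic:

```python
# Hypothetical figures for illustration only; substitute your own.
employees = 500
minutes_saved_per_week = 15      # e.g., a streamlined time-sheet workflow
working_weeks_per_year = 48
loaded_hourly_cost = 60.0        # salary plus overhead, in dollars

hours_saved_per_year = employees * (minutes_saved_per_week / 60) * working_weeks_per_year
annual_savings = hours_saved_per_year * loaded_hourly_cost
print(f"{hours_saved_per_year:,.0f} hours, roughly ${annual_savings:,.0f} per year")
# 6,000 hours, roughly $360,000 per year
```

Fifteen minutes a week sounds trivial; at organizational scale, it isn’t.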

Rather than asking if you can afford to pay attention to user experience, you might want to ask whether you can afford not to.

What’s Innovation Worth?

Animated GIF of Sherman Tank Variants

What does an old World War II tank have to do with innovation?

I’ve mentioned it before, but it bears repeating – one of the benefits of having a blog is the ability to interact with and learn from people all over the world. For example, Greger Wikstrand and I have been trading blog posts on innovation for six months now. His latest post, “Switcher’s curse and legacy decisions”, is the 18th installment in the series. In it, Greger discusses switcher’s curse, “a trap in which a decision maker systematically switches too often”.

Just as the sunk cost fallacy can keep you holding on to a legacy system long past its expiration date, switcher’s curse can cause you to waste money on too-frequent changes. As Greger points out in his post, the net benefit of the new system must outweigh the net benefit of the old plus the cost of switching (with a significant safety margin to account for estimation errors in assessing those costs and benefits). Newer isn’t automatically better.
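One way to express that rule of thumb is sketched below. This is my paraphrase of Greger’s point, not a formula from his post; the 50% safety margin is an arbitrary illustrative default:

```python
def should_switch(benefit_new: float, benefit_old: float,
                  switching_cost: float, safety_margin: float = 0.5) -> bool:
    """Switch only when the new system's net benefit clearly exceeds the old
    system's net benefit plus the cost of switching. The safety margin hedges
    against the optimism that creeps into our own estimates."""
    return benefit_new > (benefit_old + switching_cost) * (1 + safety_margin)


# A new system estimated at 1.2x the benefit of the old one rarely clears the
# bar once switching costs and estimation error are taken into account.
print(should_switch(benefit_new=120, benefit_old=100, switching_cost=30))  # False
```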

“Disruption” is a two-edged sword when it comes to innovation. As Greger notes regarding legacy systems:

Existing software is much more than a series of decisions to keep it. It embodies a huge number of decisions on how the business of the company should work. The software is full of decisions about business objects and what should be done with them. These decisions, embodied in the software, forms the operating system of the company. The decision to switch is bigger than replacing some immaterial asset with another. It is a decision about replacing a proven way of working with a new way of working.

Disruption involves risk. Change involves cost; disruptive change involves higher costs. In “Innovate or Execute?”, Earl Beede asked:

So, do our employers really want us taking the processes they have paid dearly to implement and products they have scheduled out for the next 15 quarters and, individually, do something disruptive? Every team member taking a risk to see what they can learn and then build on?

Wouldn’t that be chaos?

Beede’s answer to the dilemma:

Now, please don’t think I am completely cynical. I do think that the board of directors and maybe even the C-level officers want to have innovative companies. I really believe that there needs to be parts of a company whose primary mission is to make the rest of the company obsolete. But those disruptive parts need to be small, isolated groups, kept out of the day-to-day delivery of the existing products or services.

What employers should be asking for is for most of the company to be focused on executing the existing plans and for some of the company to be trying to put the executing majority into a whole new space.

This meshes well with Greger’s recommendations:

Conservatism is often the best approach. But it needs to be a prudent conservatism. Making changes smaller and more easily reversible decreases the need for caution. We should consider a prudent application of fail fast mentality in our decision-making process. (But I prefer to call it learn fast.)

Informed decision-making (i.e. making decisions that make sense in light of your context) is critical. The alternative is to rely on blind luck. Being informed requires learning, and as Greger noted, fast turn-around on that learning is to be preferred. Likewise, limiting risk during that learning is preferable. Casimir Artmann, in his post “Fail is not an option”, discussed this concept in relation to hiking in the wilderness. Assessing and controlling risk in that environment can be a matter of life and death. In a business context the principle is the same; even when the “death” is only figurative, that’s little comfort to the people whose livelihoods are affected. Learning is only useful if you survive to put it to use.

Lastly, it must be understood that decision-making is not a one-time activity. Context is not static, and neither should your decision-making process be. An iterative cycle of sense-making and decision-making is required to maintain the balance between innovation churn and stagnation.

So, why the tank?

The M-4 Sherman, in addition to being the workhorse of the U.S. Army’s armor forces in World War II, is also an excellent illustration of avoiding the switcher’s curse. When it was introduced, it was a match for existing German armored vehicles. Shortly afterwards, however, it was outclassed as newer, heavier, better-armed German models came online. The U.S. stuck with its existing design and was able to produce almost three times as many tanks as Germany (not counting German tanks inferior to the Sherman). As the saying goes, “quantity has a quality all its own”, particularly when the Sherman was paired with other weapon systems in a way that did not disrupt production. The German strategy of fielding multiple models hampered their ability to produce in quantity, negating their qualitative advantage. In this instance, progressive enhancement and innovating on the edges was a winning strategy for the U.S.

Back to the OODA – Making Design Decisions

OODA Loop Diagram

A few weeks back, in my post “Enterprise Architecture and the Business of IT”, I mentioned that I was finding myself drawn more and more toward Enterprise Architecture (EA) as a discipline, given its impact on my work as a software architect. Rather than a top-down approach, seeking to design an enterprise as a whole, I find myself backing into it from the bottom-up, seeking to ensure that the systems I design fit the context in which they will live. This involves understanding not only the technology, but also how it interacts, or not, with multiple social systems in order to (ideally) carry out the purpose of the enterprise.

Tom Graves is currently in the middle of a series on whole-enterprise architecture on his Tetradian blog. The third post of the series, “Towards a whole-enterprise architecture standard – 3: Method”, focuses on the need for a flexible design method:

But as we move towards whole-enterprise architecture, we need architecture-methods that can self-adapt for any scope, any scale, any level, any domains, any forms of implementation – and that’s a whole new ball-game…

He further states:

the methods we need for the kind of architecture-work we’re facing now and, even more, into the future, will need not only to work well with any scope, any scale and so on, but must have deep support for non-linearity built right into the core – yet in a way that fully supports formal rigour and discipline as well.

To begin answering the question “But where do we start?”, Tom looks at the Plan-Do-Check-Act (PDCA) cycle, which he finds wanting because:

… the reality is that PDCA doesn’t go anything like far enough for what we need. In particular, it doesn’t really touch all that much on the human aspects of change

This is a weakness that I mentioned in “OODA vs PDCA – What’s the Difference?”. PDCA, by starting with “Plan” and without any reference to context, is flawed in my opinion. Even if one argues that assessing the context is implied, leaving it to an implication fails to give it the prominence it deserves. In his post, Tom refers to ‘the squiggle’, a visualization of the design process:

Damien Newman's squiggle diagram

In an environment of uncertainty (which pretty much includes anything humans are even peripherally involved with), exploration of the context space will be required to understand the gross architecture of the problem. In reconciling the multiple contexts involved, challenges will emerge and will need to be integrated into the architecture of the solution as well. This fractal and iterative exploratory process is well represented by ‘the squiggle’ and, in my opinion, well described by the OODA (Observe, Orient, Decide, and Act) loop.

In “Architecture and OODA Loops – Fast is not Enough”, I discussed how the OODA loop maps to this kind of messy, multi-level process of sense-making and decision-making. The “and” is extremely important: while decision-making with little or no sense-making may be quick, it’s less likely to be effective due to the disconnect from reality. On the other hand, filtering and prioritizing (parts of the Orient component of the loop) are also needed to prevent analysis paralysis.
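A skeletal sketch of the loop, with Orient doing the filtering and prioritizing described above. The signatures are placeholders of my own invention, intended only to show the shape of the cycle, not a method definition from any of the posts cited:

```python
from typing import Any, Callable


def ooda_loop(context: Any,
              observe: Callable[[Any], Any],
              orient: Callable[[Any, Any], Any],
              decide: Callable[[Any], Any],
              act: Callable[[Any, Any], Any],
              done: Callable[[Any], bool]) -> Any:
    """Iterate sense-making (observe, orient) and decision-making (decide, act)
    until the caller-supplied `done` predicate says the design has stabilized."""
    while not done(context):
        observations = observe(context)        # gather raw signals from the context
        model = orient(observations, context)  # filter and prioritize to avoid analysis paralysis
        decision = decide(model)               # choose the next design move
        context = act(decision, context)       # apply it; results feed back into the next pass
    return context
```

The feedback from Act back into Observe is the point: each pass can reshape the context that the next pass works from.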

In my opinion, its recognition and handling of the tension between informed decision-making and quick decision-making makes OODA an excellent candidate for a design meta-method. It is heavily subjective, relying on context to guide appropriate response. That being said, an objective method that’s divorced from context imposes a false sense of simplicity with no real benefit (and very real risks).

Reality is messy; our design methodology should work with, not against, that messiness.

[OODA Loop diagram by Patrick Edwin Moran via Wikimedia Commons]

Abuse Cases – What Could Go Wrong?

Trainwreck

Last week, in a post titled “The Flaw in All Things”, John Vincent discussed the problem of seeing “the flaw in all things”:

It’s overwhelming. It’s paralyzing.

I can’t finish a project because I keep finding things that could cause problems. I even mentioned this to our CTO and CEO at one point when we were trying to size some private deploys of our stack.

I couldn’t see anything but the largest configuration because all I could see was places where there was a risk. There were corners I wasn’t willing to cut (not bad corners like risking availability but more like “use a smaller instance here”) because I could see and feel and taste the pain that would come from having to grow the environment under duress.

I’m frustrated with putting everything in Docker containers because all I see is having to take down EVERYTHING running on one node because there’s may be a critical Docker upgrade. I see Elasticsearch rebalancing because of it. I see Kafka elections. Mind you the system is designed for this to happen but why add something that makes it a regular occurance?

I can certainly sympathize. For what it’s worth, it sounds like those making the trade-offs he’s worried about could stand to be a bit more inclusive. That doesn’t necessarily mean the decisions would change, but at least being heard and knowing the answers to his questions might reduce some of the stress (not to mention perhaps helping out those responsible for the decisions).

When making design decisions, having (or, at least, having access to) this level of knowledge and experience has a great deal of value. As I noted in “NPM, Tay, and the Need for Design”, you have to consider both the use cases and the abuse cases for a given system (whether software or social).
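As a toy illustration of what considering an abuse case can look like in practice, here’s a hypothetical pair of tests for an invented display-name renderer. None of this comes from the systems discussed above; it simply shows the use case and the abuse cases being checked side by side:

```python
import html
import unittest


def render_display_name(name: str) -> str:
    """Render a user-supplied display name for an HTML page (hypothetical example)."""
    cleaned = name.strip()
    if not 1 <= len(cleaned) <= 40:
        raise ValueError("display name must be 1-40 characters")
    return html.escape(cleaned)


class DisplayNameTests(unittest.TestCase):
    def test_use_case_normal_name(self):
        self.assertEqual(render_display_name("Ada Lovelace"), "Ada Lovelace")

    def test_abuse_case_script_injection(self):
        # The abuse case: hostile input must come out inert, not executable.
        rendered = render_display_name("<script>alert('hi')</script>")
        self.assertNotIn("<script>", rendered)

    def test_abuse_case_oversized_input(self):
        with self.assertRaises(ValueError):
            render_display_name("x" * 10_000)


if __name__ == "__main__":
    unittest.main()
```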

It’s not possible to foresee every potential flaw, and it probably won’t be feasible to eliminate every risk that’s discovered. That doesn’t mean time spent in risk evaluation is wasted, though. Dealing with foreseeable issues before they become problems (where “dealing” means either mitigating them outright or at least planning a response should they occur) works better than figuring things out on the fly once the problem has arrived.

Understanding why “Should I?” is a more important question than “Can I?” is something I’ve touched on before. Snapchat is finding this out by way of a lawsuit over their filter that allows users to record their speed. Who would have thought someone might cause an accident using that?

Trial and error/experimenting is one method of learning, but it’s not the only method, and it’s frequently not the best one. Fear of failure can hold back learning, but a cavalier attitude toward risk can make experiments just as costly, if not more so. It’s the difference between testing a 9-volt battery by touching it to your tongue and using the same technique on a 240-volt circuit.