Form Follows Function on SPaMCast 446

SPaMCAST logo

It’s time for another appearance on Tom Cagley’s Software Process and Measurement (SPaMCast) podcast.

This week’s episode, number 446, features Tom’s essay on questions, a powerful tool for coaches and facilitators. A Form Follows Function installment based on my post “Go-to People Considered Harmful” comes next and Kim Pries rounds out the podcast with a Software Sensei column on servant leadership.

Our conversation in this episode continues with the organizations as system concept and how concentrating institutional knowledge in go-to people creates a dependency management nightmare. Social systems run on relationships and when we allow knowledge and skill bottlenecks to form, we set our organization up for failure. Specialists with deep knowledge are great, but if they don’t spread that knowledge around, we risk avoidable disasters when they’re unavailable. Redundancy aids resilience.

You can find all my SPaMCast episodes using under the SPaMCast Appearances category on this blog. Enjoy!

Advertisements

Go-to People Considered Harmful

Neck of Codd bottle

Okay, so the title’s a little derivative, but it’s both accurate and it fits in with the “organizations as systems” theme of recent posts. Just as dependency management is important for software systems, it’s likewise just as critical for social systems. Failures anywhere along the chain of execution can potentially bring the whole system to a halt if resilience isn’t considered in the design (and evolution) of the system.

Dependency issues in social systems can take a variety of forms. One that comes easily to mind is what is referred to as the “bus factor” – how badly the team is affected if a person is lost (e.g. hit by a bus). Roy Osherove’s post from today, “A Critical Chain of Bus Factors”, expands on this. Interlocking chains of dependencies can multiply the bus factor:

A chain of bus factors happens when you have bus factors depending on bus factors:

Your one developer who knows how to configure the pipeline can’t test the changes because the agent is down. The one guy in IT who has access to the agent needs to reboot it, but does not have access. The one person who has access to reboot it (in the Infra team) is sick, so now there are three people waiting, and there is nothing in this earth that can help that situation.

The “bus factor”, either individually or as a cascading chain, is only part of the problem, however. A column on CIO.com, “The hazards of go-to people”, identifies the potential negative impacts on the go-to person:

They may:

  • Resent that they shoulder so much of the burden for the entire group.
  • Feel underpaid.
  • Burn out from the stress of being on the never-ending-crisis treadmill.
  • Feel trapped and unable to progress in their careers since they are so important in the role that they are in.
  • Become arrogant and condescending to their peers, drunk with the glory of being important.

The same column also lists potential problems for those who are not the go-to person:

When they realize that they are not one of the go-to people they might:

  • Feel underappreciated and untrusted.
  • Lose the desire to work hard since they don’t feel that their work will be recognized or rewarded.
  • Miss out on the opportunities to work on exciting or important things, since they are not considered dedicated and capable.
  • Feel underappreciated and untrusted.

A particularly nasty effect of relying on go-to people is that it’s self-reinforcing if not recognized and actively worked against. People get used to relying on the specialist (which is, admittedly, very effective right up until the bus arrives) and neglect learning to do for themselves. Osherove suggests several methods to mitigate these problems: pairing, teaching, rotating positions, etc. The key idea being, spreading the knowledge around.

Having individuals with deep knowledge can be a good thing if they’re a reservoir supplying others and not a pipeline constraining the flow. Intentional management of dependencies is just as important in social systems as in software systems.

NPM, Tay, and the Need for Design

Take a couple of seconds and watch the clip in the tweet below:

While it would be incredibly difficult to predict that exact outcome, it is also incredibly easy to foresee that it’s a possibility. As the saying goes, “forewarned is forearmed”.

Being forewarned and forearmed is an important part of what an architect does. An architect is supposed to focus on the architecturally significant aspects of a system. I like to use Ruth Malan‘s definition of architectural significance due to its flexibility:

Decisions (both those that were made and those that were left unmade) that end up taking systems offline and causing very public embarrassment are, in my opinion, architecturally significant.

Last week, two very public, very foreseeable failures took place: first was the chaos caused by a developer removing his modules from NPM, which was followed by Microsoft having to pull the plug on its Tay chatbot when it was “trained” to spew offensive comments in less than 24 hours. In my opinion, these both represented design failures resulting from a lack of due consideration of the context in which these systems would operate.

After all, can anyone really claim that no one would expect that people on the internet would try to “corrupt” a chatbot? According to Azeem Azhar as quoted in Business Insider, not really:

“Of course, Twitter users were going to tinker with Tay and push it to extremes. That’s what users do — any product manager knows that.

“This is an extension of the Boaty McBoatface saga, and runs all the way back to the Hank the Angry Drunken Dwarf write in during Time magazine’s Internet vote for Most Beautiful Person. There is nearly a two-decade history of these sort of things being pushed to the limit.”

The current claim, as reported in CIO.com, is that Tay was developed with filtering built-in, but there was a “critical oversight” for a specific kind of attack. According to the article, it’s believed that the attack vector involved asking Tay to “repeat after me”.

Or, as Matt Ballantine put it:

Likewise, who could imagine issues with a centralized repository of cascading dependencies? Failing to consider what would happen if someone suddenly pulled one of the bottom blocks out led to a huge inconvenience to anyone depending on that module or any downstream module. There’s plenty of blame to go around: the developer who took his toys and went home, those responsible for NPM’s design, and those who depended on it without understanding its weaknesses.

“The Iron Law of Tools” is “that which does for you will also do to you”. Understanding the trade-offs allows you to plan for risk mitigation in advance. Ignoring them merely ensures that they will have to be dealt with in crisis mode. This is something I covered in a previous post, “Dependency Management is Risk Management”.

Effective design involves not only the internals of a system but its externals as well. The conditions under which the system will be used, it’s context, is highly significant. That means considering not only the system’s use cases, but also its abuse cases. A post written almost a year ago by Brandon Harris, “Designing for Evil”, conveys this well:

When all is said and done, when you’ve set your ideas to paper, you have to sit down and ask yourself a very specific question:

How could this feature be exploited to harm someone?

Now, replace the word “could” with the word “will.”

How will this feature be exploited to harm someone?

You have to ask that question. You have to be unflinching about the answers, too.

Because if you don’t, someone else will.


When I began working on this post, the portion above was what I had in mind to say. In essence, I planned a longer-form version of what I’d tweeted about the Tay fiasco:

However, before I had finished writing the post, Greger Wikstrand posted “The fail fast fallacy”. Greger and I have been carrying on a conversation about innovation over the last few months. While I had initially intended to approach this as a general issue of architectural practice rather than innovation, the points he makes are just too apropos to leave out.

In the post, Greger points out that the focus seems to have shifted from learning to failure. Learning from experience can be the best way to test an idea. However, it’s not the only way:

Evolution and nature has shown us that there are two, equally valid, approaches to winning the gene game. The first approach is to get as much offspring as possible and “hope” many of them survive (r-selection). The second approach is to have few offspring but raise them and nurture them carefully (K-selection). Biologists tell us that the first strategy works best in a harsh, unpredictable environment where the effort of creating offspring is low. The second strategy works better in an environment where there is less change and offspring are more expensive to produce. Some of the factors that favour r-selection seems to be large uncompeted resources. K-selection is more favourable in resource scarce, low predator areas.

The phrase “…where the effort of creating offspring is low” is critical here. The higher the “cost” of the experiment, the more risk is involved in failure. This makes it advisable to tilt the playing field by supporting and nurturing the “offspring”.

In response to Greger’s post, Casimir Artmann posted two excellent articles that further elaborated on this. In “Fail Fast During Adventures”, he noted that “There is a fine line between fail fast and Darwin Awards in IRL.” His point, preparation beforehand and being willing to abort during an experiment before failure is equivalent to suffering a fatality can be effective learning strategies. Lessons that you don’t live to apply aren’t worth much.

Casimir followed with “Fail is not an Option”, in which he stated:

I want the project to succeed, but I plan for things going wrong so that the consequences wouldn’t be to huge. Some risk are manageable, as walking alone, but not alone and off-trail. That’s to risky. If you doing outdoor adventures, you are probably more prepared and skilled than a ordinarie project member, and thats a huge benefit.

I guess the best advice, when doing completely new things with IT, is to start really small so that the majority of your business is not impacted if there is a failure. When something goes wrong, be sure that you could go back to safe place. Point of no return is like being on a sailing boot in the middle of the Atlantic where you can’t go back.

That’s excellent advice. “Fail Fast” has the advantage of being able to fit on a bumper sticker, but the longer, more nuanced version is more likely to serve you well.

Microservices, Monoliths, and Conflicts to Resolve

Two tweets, opposites in position, and both have merit. Welcome to the wonderful world of architecture, where the only rule is that there is no rule that survives contact with reality.

Enhancing resilience via redundancy is a technique with a long pedigree. While microservices are a relatively recent and extreme example of this, they’re hardly groundbreaking in that respect. Caching, mirroring, load-balancing, etc. has been with us a long, long time. Redundancy is a path to high availability.

Centralization (as exemplified by monolithic systems) can be a useful technique for simplification, increasing data and behavioral integrity, and promoting cohesion. Like redundancy, it’s a system design technique older than automation. There was a reason that “all roads lead to Rome”. Centralization provides authoritativeness and, frequently, economies of scale.

The problem with both techniques is that neither comes without costs. Redundancy introduces complexity in order to support distributing changes between multiple points and reconciling conflicts. Centralization constrains access and can introduce a single point of failure. Getting the benefits without the incurring the costs remains a known issue.

The essence of architectural design is decision-making. Given that those decisions will involve costs as well as benefits, both must be taken into account to ensure that, on balance, the decision achieves its aims. Additionally, decisions must be evaluated in the greater context rather than in isolation. As Tom Graves is fond of saying “things work better when they work together, on purpose”.

This need for designs to not only be internally optimal, but also optimized for their ecosystem means that these, as well as other principles, transcend the boundaries between application architecture, enterprise IT architecture, and even enterprise architecture. The effectiveness of this fractal architecture of systems of systems (both automated and human) is a direct result of the appropriateness of the decisions made across the full range of the organization to the contexts in play.

Since there is no one context, no rule can suffice. The answer we’re looking for is neither “microservice” nor “monolith” (or any other one tactic or technique), but fit to purpose for our context.

Planning and Designing – Intentional, Accidental, Emergent

Over the last three years, I’ve written eleven posts tagged with “Emergence”. In a discussion over the past week of my previous post, I’ve come to the realization that I’ve been misusing that term. In essence, I’ve been conflating emergent architecture with accidental architecture when they’re very different concepts:

In both cases, aspects of the architecture emerge, but the circumstance under which that occurs is vastly different. When architectural design is intentional, emergence occurs as a result of learning. For example, as multiple contexts are reconciled, challenges will emerge. This learning process will continue over the lifetime of the product. With accidental architecture, emergence occurs via lack of preparation, either through inadequate analysis or perversely, through intentionally ignoring needs that aren’t required for the task at hand (even if those needs are confirmed). With this type of emergence, lack of a clear direction leads to conflicting ad hoc responses. If time is not spent to rework these responses, then system coherence suffers. The fix for the problem of Big Design Up Front (BDUF) is appropriate design, not absence of design.

James Coplien, in his recent post “Procrastination”, takes issue with the idea of purposeful ignorance:

There is a catch phrase for it which we’ll examine in a moment: “defer decisions to the last responsible moment.” The agile folks add an interesting twist (with a grain of truth) that the later one defers a decision, the more information there will be on which to base the decision.

Alarmingly, this agile posture is used either as an excuse or as an admonition to temper up-front planning. The attitude perhaps arose as a rationalisation against the planning fanaticism of 1980s methodologies. It’s true that time uncovers more insight, but the march of time also opens the door both to entropy and “progress.” Both constrain options. And to add our own twist, acting early allows more time for feedback and course correction. A stitch in time saves nine. If you’re on a journey and you wait until the end to make course corrections, when you’re 40% off-course, it takes longer to remedy than if you adjust your path from the beginning. Procrastination is the thief of time.

Rebecca Wirfs-Brock has also blogged on feeling “discomfort” and “stress” when making decisions at the last responsible moment. That stress is significant, given study findings she quoted:

Giora Keinan, in a 1987 Journal of Personal and Social Psychology article, reports on a study that examined whether “deficient decision making” under stress was largely due to not systematically considering all relevant alternatives. He exposed college student test subjects to “controllable stress”, “uncontrollable stress”, or no stress, and measured how it affected their ability to solve interactive decision problems. In a nutshell being stressed didn’t affect their overall performance. However, those who were exposed to stress of any kind tended to offer solutions before they considered all available alternatives. And they did not systematically examine the alternatives.

Admittedly, the test subjects were college students doing word analogy puzzles. And the uncontrolled stress was the threat of a small random electric shock….but still…the study demonstrated that once you think you have a reasonable answer, you jump to it more quickly under stress.

It should be noted that although this study didn’t show a drop in performance due to stress, the problems involved were more black and white than design decisions which are best fit type problems. Failure to “systematically examine the alternatives” and the tendency to “offer solutions before they considered all available alternatives” should be considered red flags.

Coplien’s connection of design and planning is significant. Merriam-Webster defines “design” as a form of planning (and the reverse works as well if you consider organizations to be social systems). A tweet from J. B. Rainsberger illustrates an extremely important point about planning (and by extension, design):

In my opinion, a response to “unexpected results” is more likely to be effective if you have conducted the systematic examination of the alternatives beforehand when the stress that leads you to jump to a solution without considering all available alternatives is absent. What needs to be avoided is failing to ensure that the plan/design aligns with the context. This type of intentional planning/design can provide resilience for systems by taking future needs and foreseeable issues into account, giving you options for when the context changes. Even if those needs are not implemented, you can avoid constraints that would make dealing with them when they arise more difficult. Likewise, having options in place for dealing with likely issues can make the difference between a brief problem and a prolonged outage. YAGNI is only a virtue when you really aren’t going to need it.

As Ruth Malan has noted, architectural design involves shaping:

Would you expect that shaping to result in something coherent if it was merely a collection of disconnected tactical responses?

Microservices – The Too Good to be True Parts

Label for Clark Stanley's Snake Oil Liniment

Over the last several months, I’ve written several posts about microservices. My attitude toward this architectural style is one of guarded optimism. I consider the “purist” version of it to be overkill for most applications (are you really creating something Netflix-scale?), but see a lot of valuable ideas developing out of it. Smaller, focused, service-enabled applications are, in my opinion, an excellent way to increase systems agility. Where the benefits outweigh the costs and you’ve done your homework, systems of systems make sense.

However, the history of Service-Oriented Architecture (SOA), is instructive. A tried and true method of discrediting yourself is to over-promise and under-deliver. Microservice architectures, as the latest hot topic, currently receive a lot of uncritical press, just as SOA did a few years back. An article on ZDNet, “How Nike thinks about app development: Lots of micro services”, illustrates this (emphasis is mine):

Nike is breaking down all the parts of its apps to crate (sic) building blocks that can be reused and tweaked as needed. There’s also a redundancy benefit: Should one micro service fail the other ones will work in the app.

Reuse and agility tend to be antagonists. The governance needed to promote reuse impedes agility. Distribution increase complexity on its own; reuse adds additional complexity. This complexity comes not only from communication issues but also from coordination and coupling. Rationalization, reuse and the ability to compose applications from the individual service is absolutely a feature of this style. The catch is the cost involved in achieving it.

A naive reading of Nike’s strategy would imply that breaking everything up “auto-magically” yields reuse and agility. Without an intentional design, this is very unlikely. Cohesion of the individual services, rather than their size is the important factor in achieving those goals. As Stefan Tilkov notes in “How Small Should Your Microservice Be?”:

In other words, I think it’s not a goal to make your services as small as possible. Doing so would mean you view the separation into individual, stand-alone services as your only structuring mechanism, while it should be only one of many.

Redundancy and resilience are likewise issues that need careful consideration. The quote from the Nike article might lead you to believe that resilience and redundancy are a by-product of deploying microservices. Far from it. Resilience and distribution are orthogonal concepts; in fact, breaking up a monolith can have a negative impact on resilience if resilience is not specifically accounted for in the design. Coupling, in all its various forms, reduces resilience. Jeppe Cramon, in “SOA: synchronous communication, data ownership and coupling”, has shown that distribution, in and of itself, does not eliminate coupling. This means that “Should one micro service fail the other ones will work in the app” may prove false if the service that fails is coupled with and depended on by other services. Decoupling is unlikely to happen accidentally. Likewise, redundant instances of the same service will do little good if a resource shared by those instances (e.g. the data store) is down.

Even where a full-blown microservice architecture is inappropriate, many of the principles behind the style are useful. Swimming with the tide of Conway’s Law, rather than against it, is more likely to yield successful application architectures and enterprise IT architectures. The coherence that makes it successful is a product of design, however, and not serendipity. Microservices are most definitely not snake oil. Selling the style like it is snake oil is a really bad idea.

Design Away Error Handling?

Evil Monkey Pointing

Writing is an interesting process. Some posts spring to life; ignited by some inspiration, they swiftly flow from fingertips to (virtual) page. Other posts simmer. An idea is half-conceived, then languishes incomplete. It sits in the corner staring at you balefully, a reproach for your lack of commitment. In the case of this one, it sat for the better part of a year because I wasn’t quite sure which side I wanted to come down on.

It started with a fairly uncontroversial tweet from Michael Feathers: “Spend more time designing away errors so that you don’t have to handle them.” On its face, this is reasonable; eliminating error vectors should lead to a more robust product. Some warning flags appear, however, when you read the stream leading up to that tweet:

Francis Fish’s point about context (“…medical equipment and (say) android app have totally diff needs…”) certainly applies, as does Feather’s reply. Whenever I see the word “best” devoid of context, the credibility detector bottoms out. It’s the response to Brian Knapp (“Yeah, they are better not used at all. :)”) that is worrisome. Under some circumstances, throwing an exception when an error condition occurs is the right answer.

Having default values for parameters is one technique for designing away errors. Checking for problem conditions such as disk space or network connectivity prior to use can be used as well. The key thing to remember is that these techniques assume that the problem is an expected one and that something can be done about it. Checking for space or connectivity is useless if you don’t have an alternate location to write to or if you lack the ability to restore the connection. Likewise, use of a default value is only appropriate when there is a meaningful default.

The thing to remember is that avoiding an exception is not the goal, correct execution/valid state is. If you’re transferring money between accounts, you want to be able to trust that either the transaction completed and the balances are adjusted or that you know something went wrong. Silent failures are much more of a problem than noisy errors. As Jef Claes noted in “Tests as part of your code”, silent failures can put you in the newspaper (and not in a good way).

A more recent Twitter exchange involving Feathers returned to this same issue:

The short answer, is yes, it’s a bug. Otherwise things found in code reviews would not count as defects because they had not happened “live”. The last tweet in that stream summed it up nicely:

We cannot rely on design alone to eliminate error conditions because we cannot foresee all potential issues. Testing shares this same dilemma. I believe Arlo Belshee strikes the right balance in “Treat bugs as fires”. Fire departments concentrate foremost on preventing fires while still extinguishing those that fall through the cracks. Where one occurs, it’s treated as a learning experience. So too should we treat error conditions. Dan Cresswell put it nicely:

Handling exceptions is tedious, but critical. Where we can remove risks, or at least reduce them via design, so much the better. We cannot, however, rely on our ability to foresee every circumstance. Chaos Monkey chooses you.

[Hat tip to Lorrie MacVittie for tweeting the evil monkey image above]