Go-to People Considered Harmful

Neck of Codd bottle

Okay, so the title’s a little derivative, but it’s both accurate and it fits in with the “organizations as systems” theme of recent posts. Just as dependency management is critical for software systems, it is critical for social systems. Failures anywhere along the chain of execution can bring the whole system to a halt if resilience isn’t considered in the design (and evolution) of the system.

Dependency issues in social systems can take a variety of forms. One that comes easily to mind is what is referred to as the “bus factor” – how badly the team is affected if a person is lost (e.g. hit by a bus). Roy Osherove’s post from today, “A Critical Chain of Bus Factors”, expands on this. Interlocking chains of dependencies can multiply the bus factor:

A chain of bus factors happens when you have bus factors depending on bus factors:

Your one developer who knows how to configure the pipeline can’t test the changes because the agent is down. The one guy in IT who has access to the agent needs to reboot it, but does not have access. The one person who has access to reboot it (in the Infra team) is sick, so now there are three people waiting, and there is nothing in this earth that can help that situation.

The “bus factor”, either individually or as a cascading chain, is only part of the problem, however. A column on CIO.com, “The hazards of go-to people”, identifies the potential negative impacts on the go-to person:

They may:

  • Resent that they shoulder so much of the burden for the entire group.
  • Feel underpaid.
  • Burn out from the stress of being on the never-ending-crisis treadmill.
  • Feel trapped and unable to progress in their careers since they are so important in the role that they are in.
  • Become arrogant and condescending to their peers, drunk with the glory of being important.

The same column also lists potential problems for those who are not the go-to person:

When they realize that they are not one of the go-to people they might:

  • Feel underappreciated and untrusted.
  • Lose the desire to work hard since they don’t feel that their work will be recognized or rewarded.
  • Miss out on the opportunities to work on exciting or important things, since they are not considered dedicated and capable.

A particularly nasty effect of relying on go-to people is that it’s self-reinforcing if not recognized and actively worked against. People get used to relying on the specialist (which is, admittedly, very effective right up until the bus arrives) and neglect learning to do for themselves. Osherove suggests several methods to mitigate these problems: pairing, teaching, rotating positions, etc. The key idea being, spreading the knowledge around.

Having individuals with deep knowledge can be a good thing if they’re a reservoir supplying others and not a pipeline constraining the flow. Intentional management of dependencies is just as important in social systems as in software systems.

NPM, Tay, and the Need for Design

Take a couple of seconds and watch the clip in the tweet below:

While it would be incredibly difficult to predict that exact outcome, it is also incredibly easy to foresee that it’s a possibility. As the saying goes, “forewarned is forearmed”.

Being forewarned and forearmed is an important part of what an architect does. An architect is supposed to focus on the architecturally significant aspects of a system. I like to use Ruth Malan’s definition of architectural significance due to its flexibility:

Decisions (both those that were made and those that were left unmade) that end up taking systems offline and causing very public embarrassment are, in my opinion, architecturally significant.

Last week, two very public, very foreseeable failures took place: first was the chaos caused by a developer removing his modules from NPM, which was followed by Microsoft having to pull the plug on its Tay chatbot when it was “trained” to spew offensive comments in less than 24 hours. In my opinion, these both represented design failures resulting from a lack of due consideration of the context in which these systems would operate.

After all, can anyone really claim that no one would expect that people on the internet would try to “corrupt” a chatbot? According to Azeem Azhar as quoted in Business Insider, not really:

“Of course, Twitter users were going to tinker with Tay and push it to extremes. That’s what users do — any product manager knows that.

“This is an extension of the Boaty McBoatface saga, and runs all the way back to the Hank the Angry Drunken Dwarf write in during Time magazine’s Internet vote for Most Beautiful Person. There is nearly a two-decade history of these sort of things being pushed to the limit.”

The current claim, as reported in CIO.com, is that Tay was developed with filtering built-in, but there was a “critical oversight” for a specific kind of attack. According to the article, it’s believed that the attack vector involved asking Tay to “repeat after me”.

Or, as Matt Ballantine put it:

Likewise, who could imagine issues with a centralized repository of cascading dependencies? Failing to consider what would happen if someone suddenly pulled one of the bottom blocks out led to a huge inconvenience to anyone depending on that module or any downstream module. There’s plenty of blame to go around: the developer who took his toys and went home, those responsible for NPM’s design, and those who depended on it without understanding its weaknesses.

“The Iron Law of Tools” is “that which does for you will also do to you”. Understanding the trade-offs allows you to plan for risk mitigation in advance. Ignoring them merely ensures that they will have to be dealt with in crisis mode. This is something I covered in a previous post, “Dependency Management is Risk Management”.

Effective design involves not only the internals of a system but its externals as well. The conditions under which the system will be used, its context, are highly significant. That means considering not only the system’s use cases, but also its abuse cases. A post written almost a year ago by Brandon Harris, “Designing for Evil”, conveys this well:

When all is said and done, when you’ve set your ideas to paper, you have to sit down and ask yourself a very specific question:

How could this feature be exploited to harm someone?

Now, replace the word “could” with the word “will.”

How will this feature be exploited to harm someone?

You have to ask that question. You have to be unflinching about the answers, too.

Because if you don’t, someone else will.
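Harris’s question can be made concrete by treating abuse cases as inputs to design and test against. A minimal sketch, assuming a hypothetical learning chatbot (the `ChatBot` class and its blocked-pattern list are my own illustration, not Tay’s actual filtering):

```python
BLOCKED_PATTERNS = ("repeat after me",)  # a known coaching/echo attack vector

class ChatBot:
    def __init__(self):
        self.learned = []  # phrases the bot will incorporate into future replies

    def receive(self, message: str) -> str:
        if any(p in message.lower() for p in BLOCKED_PATTERNS):
            # Refuse to learn from coached input instead of echoing it back
            return "I'd rather not repeat that."
        self.learned.append(message)
        return "Thanks, noted."
```

The point isn’t that a deny-list is sufficient (it isn’t); it’s that the abuse case exists as an explicit, testable requirement rather than an afterthought.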

When I began working on this post, the portion above was what I had in mind to say. In essence, I planned a longer-form version of what I’d tweeted about the Tay fiasco:

However, before I had finished writing the post, Greger Wikstrand posted “The fail fast fallacy”. Greger and I have been carrying on a conversation about innovation over the last few months. While I had initially intended to approach this as a general issue of architectural practice rather than innovation, the points he makes are just too apropos to leave out.

In the post, Greger points out that the focus seems to have shifted from learning to failure. Learning from experience can be the best way to test an idea. However, it’s not the only way:

Evolution and nature has shown us that there are two, equally valid, approaches to winning the gene game. The first approach is to get as much offspring as possible and “hope” many of them survive (r-selection). The second approach is to have few offspring but raise them and nurture them carefully (K-selection). Biologists tell us that the first strategy works best in a harsh, unpredictable environment where the effort of creating offspring is low. The second strategy works better in an environment where there is less change and offspring are more expensive to produce. Some of the factors that favour r-selection seems to be large uncompeted resources. K-selection is more favourable in resource scarce, low predator areas.

The phrase “…where the effort of creating offspring is low” is critical here. The higher the “cost” of the experiment, the more risk is involved in failure. This makes it advisable to tilt the playing field by supporting and nurturing the “offspring”.

In response to Greger’s post, Casimir Artmann posted two excellent articles that further elaborated on this. In “Fail Fast During Adventures”, he noted that “There is a fine line between fail fast and Darwin Awards in IRL.” His point: preparation beforehand, and a willingness to abort an experiment before failure becomes fatal, can be effective learning strategies. Lessons that you don’t live to apply aren’t worth much.

Casimir followed with “Fail is not an Option”, in which he stated:

I want the project to succeed, but I plan for things going wrong so that the consequences won’t be too huge. Some risks are manageable, such as walking alone, but not alone and off-trail. That’s too risky. If you do outdoor adventures, you are probably more prepared and skilled than an ordinary project member, and that’s a huge benefit.

I guess the best advice, when doing completely new things with IT, is to start really small so that the majority of your business is not impacted if there is a failure. When something goes wrong, be sure that you can go back to a safe place. The point of no return is like being on a sailing boat in the middle of the Atlantic where you can’t go back.

That’s excellent advice. “Fail Fast” has the advantage of being able to fit on a bumper sticker, but the longer, more nuanced version is more likely to serve you well.

Microservices – Sharpening the Focus

Motion Blurred London Bus

While it was not the genesis of the architectural style known as microservices, the March 2014 post by James Lewis and Martin Fowler certainly put it on the software development community’s radar. Although the level of interest generated has been considerable, the article was far from an unqualified endorsement:

Despite these positive experiences, however, we aren’t arguing that we are certain that microservices are the future direction for software architectures. While our experiences so far are positive compared to monolithic applications, we’re conscious of the fact that not enough time has passed for us to make a full judgement.

One reasonable argument we’ve heard is that you shouldn’t start with a microservices architecture. Instead begin with a monolith, keep it modular, and split it into microservices once the monolith becomes a problem. (Although this advice isn’t ideal, since a good in-process interface is usually not a good service interface.)

So we write this with cautious optimism. So far, we’ve seen enough about the microservice style to feel that it can be a worthwhile road to tread. We can’t say for sure where we’ll end up, but one of the challenges of software development is that you can only make decisions based on the imperfect information that you currently have to hand.

In the course of roughly fourteen months, Fowler’s opinion has gelled around the “reasonable argument”:

So my primary guideline would be don’t even consider microservices unless you have a system that’s too complex to manage as a monolith. The majority of software systems should be built as a single monolithic application. Do pay attention to good modularity within that monolith, but don’t try to separate it into separate services.

This mirrors what Sam Newman stated in “Microservices For Greenfield?”:

I remain convinced that it is much easier to partition an existing, “brownfield” system than to do so up front with a new, greenfield system. You have more to work with. You have code you can examine, you can speak to people who use and maintain the system. You also know what ‘good’ looks like – you have a working system to change, making it easier for you to know when you may have got something wrong or been too aggressive in your decision making process.

You also have a system that is actually running. You understand how it operates, how it behaves in production. Decomposition into microservices can cause some nasty performance issues for example, but with a brownfield system you have a chance to establish a healthy baseline before making potentially performance-impacting changes.

I’m certainly not saying ‘never do microservices for greenfield’, but I am saying that the factors above lead me to conclude that you should be cautious. Only split around those boundaries that are very clear at the beginning, and keep the rest on the more monolithic side. This will also give you time to assess how mature you are from an operational point of view – if you struggle to manage two services, managing 10 is going to be difficult.

In short, the application architectural style known as microservice architecture (MSA) is unlikely to be an appropriate choice for the early stages of an application. Rather, it is a style most likely to be migrated to from a more monolithic beginning. Some subset of applications may benefit from that form of distributed componentization at some point, but distribution, at any degree of granularity, should be based on need. Separation of concerns and modularity do not imply a need for distribution. In fact, poorly planned distribution may actually increase complexity and coupling while destroying encapsulation. Dependencies must be managed whether local or remote.

This is probably a good point to note that there is a great deal of room between a purely monolithic approach and a full-blown MSA. Rather than a binary choice, there is a wide range of options between the two. The fractal nature of the environment we inhabit means that responsibilities can be described as singular and separate without their being required to share the same granularity. Monoliths can be carved up and the resulting component parts still be considered monolithic compared to an extremely fine-grained sub-application microservice and that’s okay. The granularity of the partitioning (and the associated complexity) can be tailored to the desired outcome (such as making components reusable across multiple applications or more easily replaceable).

The moral of the story, at least in my opinion, is that intentional design concentrating on separation of concerns, loose coupling, and high cohesion is beneficial from the very start. Vertical (functional) slices, perhaps combined with layers (what I call “dicing”), can be used to achieve these ends. Regardless of whether the components are to be distributed at first, designing them with that in mind from the start will ease any transition that comes in the future without ill effects for the present. Neglecting these issues risks hampering, if not outright preventing, breaking them out at a later date without resorting to a re-write.

These same concerns apply at higher levels of abstraction as well. Rather than blindly growing a monolith that is all things to all people, adding new features should be treated as an opportunity to evaluate whether that functionality coheres with the existing application or is better suited to being a service from an external provider. Just as the application architecture should aim for modularity, so too should the solution architecture.

A modular design is a flexible design. While we cannot know up front the extent of change an application will undergo over its lifetime, we can be sure that there will be change. Designing with flexibility in mind means that change, when it comes, is less likely to be an existential crisis. As Hayim Makabee noted in his write-up of Rotem Hermon’s talk, “Change Driven Design”: “Change should entail extending the system rather than refactoring.”
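Hermon’s principle can be illustrated with a small sketch: new behavior arrives by registering an extension, not by editing existing dispatch logic. The event-handler registry below is my own illustration, not taken from the talk:

```python
HANDLERS = {}

def handles(event_type):
    """Register a handler for an event type; new types extend, not refactor."""
    def register(fn):
        HANDLERS[event_type] = fn
        return fn
    return register

@handles("order.placed")
def on_order_placed(event):
    return f"reserving stock for {event['order_id']}"

# Supporting a new event later means adding a handler; the dispatch
# code below never changes.
@handles("order.cancelled")
def on_order_cancelled(event):
    return f"releasing stock for {event['order_id']}"

def dispatch(event):
    return HANDLERS[event["type"]](event)
```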

A full-blown MSA is one possible outcome for an application. It is, however, not the most likely outcome for most applications. What is important is to avoid unnecessary constraints and retain sufficient flexibility to deal with the needs that arise.

[London Bus Image by E01 via Wikimedia Commons.]

Microservices, SOA, Reuse and Replaceability


While it’s not as elusive as the unicorn, the concept of reuse tends to be more often talked about than seen. Over the years, object-orientation, design patterns, and services have all held out the promise of reuse of code or, at least, design. Similar claims have been made regarding microservices.

Reuse is a creature of extremes. Very fine-grained components (e.g. the classes that make up the standard libraries of Java and .Net) are highly reusable but require glue code to coordinate their interaction in order to yield something useful. This will often be the case with microservices, although not always; it is possible to have very small services with few or no dependencies on other services (it’s important to remember that, unlike libraries, services generally share both behavior and data). Coarse-grained components, such as traditional SOA services, can be reused across an enterprise’s IT architecture to provide standard high-level interfaces into centralized systems for other applications.
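The glue-code point is easy to see with very fine-grained components. Each standard-library piece below is broadly reusable on its own; producing something useful requires application-specific coordination (the `top_words` function itself is just an illustration):

```python
import re
from collections import Counter

def top_words(text: str, n: int = 2) -> list[str]:
    # Glue code: tokenize, normalize, count, rank. Each library piece
    # (re.findall, str.lower, Counter) is highly reusable; this particular
    # combination is specific to one application's need.
    words = re.findall(r"[a-z']+", text.lower())
    return [word for word, _count in Counter(words).most_common(n)]
```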

The important thing to bear in mind, though, is that reuse is not an end in itself. It can be a means of achieving consistency and/or efficiency, but its benefits come from avoiding cost and duplication rather than from the extra usage. Just as other forms of reuse have had costs in addition to benefits, so it is with microservices as well.

Anything that is reused rather than duplicated becomes a dependency of its client application. This dependency relationship is a form of coupling, tying the two codebases together and constraining the ability of the dependency to change. Within the confines of an application, it is generally better for reuse to emerge. Inter-application reuse will require more coordination and tend to be more deliberately designed. As with most things, there is no free lunch. Context is required to determine whether the trade is a good one or not.

Replaceability is, in my opinion, just as important, if not more so, than reuse. Being able to switch from one dependency to another (or from one version of a dependency to another) because that dependency has its own independent lifecycle and is independently deployed enables a great deal of flexibility. That flexibility enables easier upgrades (rolling migration rather than a big bang). Reducing the friction inherent in migrations reduces the likelihood of technical debt due to inertia.
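That independent lifecycle is what makes a rolling migration possible. In the sketch below (the store versions and client are hypothetical), each client depends only on a narrow interface, so one client can move to a new version of the dependency while others stay put:

```python
from typing import Protocol

class MessageStore(Protocol):
    def save(self, msg: str) -> None: ...

class StoreV1:
    def __init__(self):
        self.saved = []
    def save(self, msg: str) -> None:
        self.saved.append(msg)

class StoreV2:  # independently deployed successor with the same interface
    def __init__(self):
        self.saved = []
    def save(self, msg: str) -> None:
        self.saved.append(msg.strip())  # e.g., adds input normalization

class Client:
    def __init__(self, store: MessageStore):
        self.store = store  # each client decides when (and whether) to migrate
    def record(self, msg: str) -> None:
        self.store.save(msg)
```

Because the interface, not the implementation, is the dependency, clients A and B can run against different versions simultaneously during the migration window.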

While a shared service may well find more constraints with each additional client, each client can determine how much replaceability is appropriate for itself.

Microservices – The Too Good to be True Parts

Label for Clark Stanley's Snake Oil Liniment

Over the last several months, I’ve written several posts about microservices. My attitude toward this architectural style is one of guarded optimism. I consider the “purist” version of it to be overkill for most applications (are you really creating something Netflix-scale?), but see a lot of valuable ideas developing out of it. Smaller, focused, service-enabled applications are, in my opinion, an excellent way to increase systems agility. Where the benefits outweigh the costs and you’ve done your homework, systems of systems make sense.

However, the history of Service-Oriented Architecture (SOA), is instructive. A tried and true method of discrediting yourself is to over-promise and under-deliver. Microservice architectures, as the latest hot topic, currently receive a lot of uncritical press, just as SOA did a few years back. An article on ZDNet, “How Nike thinks about app development: Lots of micro services”, illustrates this (emphasis is mine):

Nike is breaking down all the parts of its apps to crate (sic) building blocks that can be reused and tweaked as needed. There’s also a redundancy benefit: Should one micro service fail the other ones will work in the app.

Reuse and agility tend to be antagonists. The governance needed to promote reuse impedes agility. Distribution increases complexity on its own; reuse adds additional complexity. This complexity comes not only from communication issues but also from coordination and coupling. Rationalization, reuse, and the ability to compose applications from the individual services are absolutely features of this style. The catch is the cost involved in achieving them.

A naive reading of Nike’s strategy would imply that breaking everything up “auto-magically” yields reuse and agility. Without an intentional design, this is very unlikely. Cohesion of the individual services, rather than their size is the important factor in achieving those goals. As Stefan Tilkov notes in “How Small Should Your Microservice Be?”:

In other words, I think it’s not a goal to make your services as small as possible. Doing so would mean you view the separation into individual, stand-alone services as your only structuring mechanism, while it should be only one of many.

Redundancy and resilience are likewise issues that need careful consideration. The quote from the Nike article might lead you to believe that resilience and redundancy are a by-product of deploying microservices. Far from it. Resilience and distribution are orthogonal concepts; in fact, breaking up a monolith can have a negative impact on resilience if resilience is not specifically accounted for in the design. Coupling, in all its various forms, reduces resilience. Jeppe Cramon, in “SOA: synchronous communication, data ownership and coupling”, has shown that distribution, in and of itself, does not eliminate coupling. This means that “Should one micro service fail the other ones will work in the app” may prove false if the service that fails is coupled with and depended on by other services. Decoupling is unlikely to happen accidentally. Likewise, redundant instances of the same service will do little good if a resource shared by those instances (e.g. the data store) is down.
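The shared-resource failure mode deserves emphasis, because it is easy to miss. In the sketch below (all names are hypothetical), two “redundant” service instances fail together because they depend on a single data store:

```python
class DataStore:
    def __init__(self):
        self.up = True
    def read(self, key):
        if not self.up:
            raise ConnectionError("store unavailable")
        return f"value-for-{key}"

class ServiceInstance:
    def __init__(self, store):
        self.store = store
    def handle(self, key):
        try:
            return self.store.read(key)
        except ConnectionError:
            return None  # the instance is "up" but cannot serve requests
```

Redundancy at the service tier bought nothing here; resilience had to be designed into the shared dependency as well.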

Even where a full-blown microservice architecture is inappropriate, many of the principles behind the style are useful. Swimming with the tide of Conway’s Law, rather than against it, is more likely to yield successful application architectures and enterprise IT architectures. The coherence that makes it successful is a product of design, however, and not serendipity. Microservices are most definitely not snake oil. Selling the style like it is snake oil is a really bad idea.

Organizing an Application – Layering, Slicing, or Dicing?

stump cut to show wood grain

How did you choose the architecture for your last greenfield product?

Perhaps a better question is did you consciously choose the architecture of your last greenfield product?

When I tweeted an announcement of my previous post, “Accidental Architecture”, I received a couple of replies from Simon Brown:

While I tend to use layers in my designs, Simon’s observation is definitely on point. Purposely using a particular architectural style is a far different thing than using that style “just because”. Understanding the design considerations that drove a particular set of architectural choices can be extremely useful in making sure the design is worked within rather than around. This understanding is, of course, impossible if there was no consideration involved.

Structure is fundamental to the architecture of an application, existing as both logical relationships between modules (typically classes) and the packaging of those modules into executable components. Packaging provides stronger constraints on relationships. It’s easier to break a convention that classes in namespace A don’t use those in namespace C except via classes in namespace B if they’re all in the same package. Packaged separately, constraints on the visibility of classes can provide an extra layer of enforcement. Packaging also affects deployment, dependency management, and other capabilities. For developers, the package structure of an application is probably the most visible aspect of the application’s architecture.

Various partitioning strategies exist. Many choose one that reflects a horizontally layered design, with one or more packages per layer. Some, such as Jimmy Bogard, prefer a single-package strategy. Simon Brown, however, has been advocating a vertically sliced style that he calls “architecturally-evident”, where the packaging reflects a bounded context and the layering is internal. Johannes Brodwall is another prominent proponent of this partitioning strategy.

The all-in-one strategy is, in my opinion, a limited one. Because so many of the products I’ve worked on have evolved to include service interfaces (often as separate sites), I favor keeping the back-end separate from the front as a rule. While a single-package application could be refactored to tease it apart when necessary, architectural refactoring can be a much more involved process than code refactoring. With only soft limits in both the horizontal and vertical dimensions, it’s likely that the overall design will become muddled as the application grows. Both the horizontal and the vertical partitioning strategies allow for greater control over component relationships.

Determining the optimal partitioning strategy will involve understanding what goals are most important and what scenarios are most likely. Horizontal partitions promote adherence to logical layering rules (e.g. the UI does not make use of data access except via the business layer) and can be used for tiered deployments where the front-end and back-end are on different machines. Horizontal layers are also useful for dynamically configuring dependencies (e.g. swappable persistence layers). Vertical partitions promote semantic cohesion by providing hard contextual boundaries. Vertical partitioning enables easier architectural refactoring from a monolithic structure to a distributed one built around microservices.
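A layering rule is more valuable when it is checkable rather than merely conventional. A rough sketch of mechanically detecting violations (the layer names and rule table are illustrative, not a real tool):

```python
import ast

# ui may not import data-access modules directly; it must go through business
FORBIDDEN = {"ui": {"data"}}

def layer_violations(module_layer: str, source: str) -> list[str]:
    """Return imports in `source` that break the layering rules for `module_layer`."""
    violations = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom):
            names = [node.module or ""]
        else:
            continue
        for name in names:
            if name.split(".")[0] in FORBIDDEN.get(module_layer, set()):
                violations.append(name)
    return violations
```

A check like this, run in the build, turns “the UI shouldn’t touch data access” from a convention into a constraint.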

Another option would be to combine layering with slicing – dicing, if you will. This technique would allow you to combine the benefits of both approaches, albeit with greater complexity. There is also the danger of weakening contextual cohesion when layering a vertical slice. The common caution (at least in the .Net world) that this harms performance seems to be more of an issue during coding and builds than at run time.

As with most architectural decisions, the answer to which partitioning scheme is best is “it depends”. Knowing what you want and the priority of those wants in relation to each other is vitally important. You won’t be able to answer “The Most Important Question” otherwise.

Coordinating Microservices – Playing Well with Others

Eugene Ormandy Conducting

In “More on Microservices – Boundaries, Governance, Reuse & Complexity”, I made the statement that I loved feedback on my posts. Thomas Cagley and Alexander Samarin submitted two comments that reinforced that sentiment and led directly to this post.

Thomas’ comment asked about the risks inherent in microservice architectures. It was a good, straightforward question that was right on point with the post. It also foreshadowed Alexander’s comment that “An explicit coordination between services is still missing…Coordination should be externalized…” because coordination of microservices is a significant area of risk.

In his comment, Alexander provided links to two of his own posts “Ideas for #BPMshift – Delenda est “vendor-centric #BPM” – How to modernise a legacy ERP” and “Enterprise patterns: eclipse”. These posts deal with decomposing monoliths into services and then composing services into larger coordinating services. They support his position that the coordination should be external to the various component services, a position that I agree with for the most part. However, according to my understanding of those posts, his position rests on considerations of dependency management and ease of composition. While these are very important, other factors are equally important to consider when designing how the components of a distributed system work together.

There is a tendency for people to design and implement distributed applications in the same manner they would a monolith, resulting in a web of service dependencies. Services are not distributed objects. Arnon Rotem-Gal-Oz’s “Fallacies of Distributed Computing Explained” explains in detail why treating them as such introduces risk (all of these fallacies affect the quality of coordination of collaborating services). That people are still making these mistaken assumptions so many years later (Peter Deutsch contributed the first 7 fallacies in 1994 and James Gosling added the 8th in 1997) is mind-boggling:

  1. The network is reliable.
  2. Latency is zero.
  3. Bandwidth is infinite.
  4. The network is secure.
  5. Topology doesn’t change.
  6. There is one administrator.
  7. Transport cost is zero.
  8. The network is homogeneous.

In addition to the issues illustrated by the fallacies, coupling in distributed systems becomes more of an area of operational concern than just an element of “good” design. Ben Morris’ “How loose coupling can be undermined in service-orientated architectures” is a good resource on types of coupling that can be present in service architectures.

Synchronous request/response communication is a style familiar to most developers in that it mimics the communication pattern between objects in object-oriented software systems. It is a simple style to comprehend. That familiarity and simplicity, however, make it a particularly troublesome style in that it is subject to many of the issues listed above (items 1 through 3 and 7 especially). The synchronous nature introduces a great deal of problematic coupling, noted by Jeppe Cramon in “Micro services: It’s not (only) the size that matters, it’s (also) how you use them – part 1”:

Coupling has a tendency of creating cascading side effects: When a service changes its contract it becomes something ALL dependent services must deal with. When a service is unavailable, all services that depend upon the service are also unavailable. When a service fails during a data update, all other services involved in the same coordinated process / update also have to deal with the failed update (process coupling)

Systems using the synchronous request/response style can be structured to minimize the effects of some of the fallacies, but there is a cost for doing so. The more provision one makes for reliability, for example, the more complicated the client system becomes. Additionally, one can further aggravate the amount of coupling via the use of distributed transactions to improve reliability, which Jeppe Cramon addresses in “Micro services: It’s not (only) the size that matters, it’s (also) how you use them – part 2”.
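To make that cost concrete, here is a sketch of the provision a synchronous client must make for fallacy #1 alone. The attempt count, backoff schedule, and choice of retryable errors are all policy the client now owns (the flaky service below is simulated, not a real endpoint):

```python
import time

def call_with_retries(fn, attempts=3, base_delay=0.01):
    """Invoke fn, retrying transport errors with exponential backoff."""
    last_error = None
    for attempt in range(attempts):
        try:
            return fn()
        except ConnectionError as err:  # only transport errors are retried
            last_error = err
            time.sleep(base_delay * (2 ** attempt))
    raise last_error

# A simulated dependency that fails twice, then succeeds:
calls = {"n": 0}
def flaky_service():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("timeout")
    return "ok"
```

And this is only the beginning: add timeouts, circuit breaking, and idempotency concerns, and the “simple” request/response client is no longer simple.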

In the Lewis and Fowler post, orchestration was dealt with in the section named “Smart endpoints and dumb pipes”. Their approach emphasized pipe and filter composition (with a nod to reducing the chattiness of the communication compared to that within the process space of a monolith) and/or lightweight messaging systems instead of Enterprise Service Bus (ESB) products. While complex ESBs may be overkill, at least initially, I would not necessarily counsel avoiding them. Once the need moves beyond simple composition into routing and transformation, then the value proposition for these types of products becomes clearer (especially where they include management, monitoring and logging features). The message routing and transformation capabilities in particular can allow you to decouple from a particular service implementation providing that the potential providers have similar data profiles.

Asynchronous communication methods are more resilient to the issues posed by the eight fallacies and can also reduce some types of coupling (temporal coupling at a minimum). As Jeppe Cramon states in “Microservices: It’s not (only) the size that matters, it’s (also) how you use them – part 3”, asynchronous communication can be either one way (events) or it can still be two way (request/reply as opposed to request/response). Jeppe’s position is that true one way communication is superior and in many cases, I would agree. There will still be many situations, however, where a degree of process coupling, however reduced, must be lived with.
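The difference one-way events make can be sketched in a few lines: the publisher neither waits on nor knows about its consumers, and a consumer that subscribes (or recovers) later still sees the event. The in-process queue below stands in for a real message broker, which is an assumption for illustration only:

```python
from collections import deque

class EventBus:
    def __init__(self):
        self.queue = deque()
        self.subscribers = {}

    def publish(self, topic, payload):
        self.queue.append((topic, payload))  # fire and forget: no reply awaited

    def subscribe(self, topic, handler):
        self.subscribers.setdefault(topic, []).append(handler)

    def drain(self):
        # Consumers process events later, at their own pace (temporal decoupling)
        while self.queue:
            topic, payload = self.queue.popleft()
            for handler in self.subscribers.get(topic, []):
                handler(payload)
```

Note that the publisher succeeds even when no subscriber exists yet; the temporal coupling of synchronous request/response is gone, though delivery guarantees now depend on the broker.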

In summary, composing services is far more complex than composing method calls within the single process space of a monolithic application. A microservice architecture that looks like a traditional layered monolith with services at the layer boundaries betrays a poor understanding of the constraints that distributed applications operate under. The cost of going out of process should not be a surprise to architects and developers. Even with custom protocols narrowly tailored to their function, database accesses are a recognized source of performance issues and managed accordingly. We shouldn’t expect equivalent, much less better performance from services running over HTTP.