Microservices Under the Microscope

Buzz and backlash seem to describe the technology circle of life. Something (language, process, platform, etc.; it doesn’t seem to matter) gets noticed, interest increases, then the reaction sets in. We saw this with Service-Oriented Architecture (SOA) in the early 2000s. More was promised than could ever realistically be attained, and eventually the hype collapsed under its own weight (with a little help from the economic downturn). While the term SOA has died out, services, being useful, have remained.

2014 appears to be the year of microservices. While neither the term nor the architectural style itself is new, James Lewis and Martin Fowler’s post from earlier this year appears to have significantly raised the level of interest in it. In response to the enthusiasm, others have pointed out that the microservices architectural style, like any technique, involves trade-offs. Michael Feathers pointed out in “Microservices Until Macro Complexity”:

If services are often bigger than classes in OO, then the next thing to look for is a level above microservices that provides a more abstract view of an architecture. People are struggling with that right now and it was foreseeable. Along with that concern, we have the general issue of dealing with asynchrony and communication patterns between services. I strongly believe that there is a law of conservation of complexity in software. When we break up big things into small pieces we invariably push the complexity to their interaction.

Robert “Uncle Bob” Martin has recently been a prominent voice questioning the silver-bullet status of microservices. In “Microservices and Jars”, he pointed out that applications can achieve separation of concerns via componentization (using jars/Gems/DLLs depending on the platform) without incurring the overhead of over-the-wire communication. According to Uncle Bob, by using a plugin scheme, components can be as independently deployable as a microservice.
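
To make the componentization argument concrete, here is a minimal Java sketch of the kind of plugin scheme described (the names are hypothetical; neither post prescribes a particular mechanism): the host application compiles only against an interface, and implementations packaged in separate jars are discovered at runtime, so swapping a jar swaps the behavior without redeploying the host.

    import java.util.ServiceLoader;

    // Hypothetical plugin contract; the host application compiles only against this interface.
    public interface BillingPlugin {
        String name();
        void charge(String accountId, long amountInCents);
    }

    // Host-side discovery: implementations live in separate jars on the classpath and
    // declare themselves via META-INF/services/BillingPlugin. Replacing the jar
    // replaces the behavior; the host is not recompiled or redeployed.
    class PluginHost {
        public static void main(String[] args) {
            ServiceLoader<BillingPlugin> plugins = ServiceLoader.load(BillingPlugin.class);
            for (BillingPlugin plugin : plugins) {
                System.out.println("Loaded plugin: " + plugin.name());
                plugin.charge("account-42", 19_99);
            }
        }
    }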

Giorgio Sironi responded with the post “Microservices are not Jars”. In it, Giorgio pointed out that independent deployment is only part of the equation; independent scalability is possible via microservices but not via plugins. Giorgio questioned the safety of swapping out libraries, but I can vouch for the fact that plugins can be hot-swapped at runtime. One important point made was in regard to this quote from Uncle Bob’s post:

If I want two jars to get into a rapid chat with each other, I can. But I don’t dare do that with a MS because the communication time will kill me.

Sironi’s response:

Of course, chatty fine-grained interfaces are not a microservices trait. I prefer accept a Command, emit Events as an integration style. After all, microservices can become dangerous if integrated with purely synchronous calls so the kind of interfaces they expose to each other is necessarily different from the one of objects that work in the same process. This is a property of every distributed system, as we know from 1996.

Remember this for later.

Uncle Bob’s follow-up post, “Clean Micro-service Architecture”, concentrated on scalability. It made the point that microservices are not the only method for scaling an application (true), and stated that “the deployment model is a detail” and “details are never part of an architecture” (not true, at least in my opinion and that of others).

While Uncle Bob may consider the idea of designing for distribution to be “BDUF Baloney”, he’s wrong. Not only is he wrong, he knows it’s wrong – see his quote above regarding “a rapid chat”. In the paper referenced in the Sironi quote above, Waldo et al. put it this way:

We argue that objects that interact in a distributed system need to be dealt with in ways that are intrinsically different from objects that interact in a single address space. These differences are required because distributed systems require that the programmer be aware of latency, have a different model of memory access, and take into account issues of concurrency and partial failure.

You can design a system with components that can run in the same process, across multiple processes, and across multiple machines. To do so, however, you must design them as if they were going to be distributed from the start. If you begin chatty, you will find yourself jumping through hoops to adapt to a coarse-grained interface later. If you start with the assumption of synchronous and/or reliable communications, you may well find a lot of obstacles when you need to change to a model that lacks one or both of those qualities. I’ve seen systems that work reasonably well on a single machine (excluding the database server) fall over when someone attempts to load balance them because of a failure to take scaling into account. Things like invalidating and refreshing caches as well as event publication become much more complex starting with node number two if a “simplest thing that can work” approach is taken.
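
As a rough illustration of what designing components “as if they were going to be distributed” can look like in code, consider the following Java sketch (the interface and types are assumptions for illustration, not anything from the posts quoted above): the contract is coarse-grained, asynchronous, and admits failure, so the same interface can front an in-process implementation today and a remote one later.

    import java.util.List;
    import java.util.concurrent.CompletableFuture;

    // A deliberately coarse-grained, asynchronous contract. Callers get everything they
    // need in one exchange and cannot assume the call is local, fast, or guaranteed to
    // succeed; the future may complete exceptionally (timeout, partial failure, etc.).
    public interface OrderQueryService {
        CompletableFuture<OrderSummary> getOrderSummary(String orderId);
    }

    // One chunky result object instead of a chatty sequence of per-field calls.
    record OrderSummary(String orderId, String status, List<String> lineItems, long totalInCents) {}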

Distributed applications in general and microservice architectures in particular are not universal solutions. There are costs as well as benefits to every architectural style and sometimes having everything in-process is the right answer for a given point in time. On the other hand, you can’t expect to scale easily if you haven’t taken scalability into consideration previously.

Fears for Tiers – Do You Need a Service Layer?

One of the perils of popularity (at least in technology) is that some people equate popularity with efficacy. Everyone’s using this everywhere, so it must be great for everything!

Actually, not so much.

No one, to my knowledge, has created the secret sauce that makes everything better with no trade-offs or side effects. Context still rules. Martin Fowler recently published “Microservices and the First Law of Distributed Objects”, which touches on this principle. In it, Fowler refers to what he calls the First Law of Distributed Object Design: “don’t distribute your objects”. To make a long story short, distributed objects are a very bad idea because object interfaces tend to be fine-grained and fine-grained interfaces perform poorly in a distributed environment. “Chunkier” interfaces are more appropriate to distributed scenarios due to the difference in context between in-process and out-of-process communication (particularly when out of process also means crossing machine, or even network, boundaries). Additionally, distributed architectures introduce complexity, adding another contextual element to be accounted for when evaluating whether to use them.
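
To put that point in code terms, here is a hedged Java sketch contrasting the two interface styles (the interfaces and types are invented for illustration): the fine-grained version is pleasant in-process but pays a round trip per call when distributed, while the chunkier version moves the whole view in one exchange.

    import java.util.List;

    // Fine-grained, "distributed object" style: natural in-process, painful remotely,
    // since assembling one order view costs a round trip (latency plus serialization)
    // per call.
    interface ChattyOrder {
        String getStatus();
        String getCustomerName();
        long getTotalInCents();
        List<String> getLineItemDescriptions();
    }

    // Chunkier alternative: one request, one response carrying the whole view.
    interface OrderFacade {
        OrderView getOrderView(String orderId);
    }

    record OrderView(String status, String customerName, long totalInCents, List<String> lineItems) {}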

In his post, under “Further Readings”, Fowler references an older post of his on Dr. Dobb’s titled “Errant Architectures” (a re-packaging of Chapter 7 of his book Patterns of Enterprise Application Architecture). In it, he discusses encountering what was then a common architectural anti-pattern (early 2000s):

There’s a recurring presentation I used to see two or three times a year during design reviews. Proudly, the system architect of a new OO system lays out his plan for a new distributed object system—let’s pretend it’s some kind of ordering system. He shows me a design that looks rather like “Architect’s Dream, Developer’s Nightmare” with separate remote objects for customers, orders, products and deliveries. Each one is a separate component that can be placed in a separate processing node.
I ask, “Why do you do this?”

“Performance, of course,” the architect replies, looking at me a little oddly. “We can run each component on a separate box. If one component gets too busy, we add extra boxes for it so we can load-balance our application.” The look is now curious, as if he wonders if I really know anything about real distributed object stuff at all.

The quintessential example of this style was the web site presentation layer with its business and data layers behind a set of web services. Particularly in the .Net world, web services were the cutting edge and everyone needed them. Of course, the system architect in the story was confusing scalability with performance (as were most who jumped onto the web service layer bandwagon). In order to increase the former, a hit is taken to the latter. In most cases, this hit is entirely unnecessary. A web application with an in-process back-end can be load-balanced much more simply than separate UI and service sites, without suffering the performance penalty of remote communication.

There are valid use cases for a service layer, such as when an application both provides an interactive user interface and is service-enabled. When there are multiple user interfaces, some front ends will benefit from physical separation from the back-end (native mobile apps, desktop clients, SharePoint web parts, etc.). Applications can be structured to accommodate the same back-end either in-process or out of process. This structure can be particularly important for applications that have both internal (where internal is defined as built and deployed simultaneously with the back-end) and external service clients, as these have different needs in terms of versioning. Wrapping the back-end with a service layer, rather than letting it bleed into a particular service implementation, can allow one back-end to serve internal and external customers either directly or via services (RESTful and/or SOAP) without violating the DRY principle.
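
A rough sketch of that structure in Java (the names, and the hand-rolled JSON in particular, are illustrative assumptions rather than a recommended implementation): the business logic sits behind one back-end interface, the web UI calls it in-process, and a thin service wrapper exposes the same interface to external clients.

    // Core back-end contract; all of the business logic lives behind this interface.
    public interface OrderBackEnd {
        OrderState updateStatus(String orderId, String newStatus);
    }

    record OrderState(String orderId, String status) {}

    // In-process consumer: the web application's controller calls the back-end directly,
    // paying no serialization or network cost.
    class OrderPageController {
        private final OrderBackEnd backEnd;
        OrderPageController(OrderBackEnd backEnd) { this.backEnd = backEnd; }

        OrderState handleStatusChange(String orderId, String newStatus) {
            return backEnd.updateStatus(orderId, newStatus);
        }
    }

    // Out-of-process consumer: a thin service wrapper exposes the same back-end to
    // external clients (REST, SOAP, etc.), so there is only one implementation to maintain.
    class OrderServiceEndpoint {
        private final OrderBackEnd backEnd;
        OrderServiceEndpoint(OrderBackEnd backEnd) { this.backEnd = backEnd; }

        String handleHttpPost(String orderId, String newStatus) {
            OrderState state = backEnd.updateStatus(orderId, newStatus);
            return "{\"orderId\":\"" + state.orderId() + "\",\"status\":\"" + state.status() + "\"}";
        }
    }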

Many of the same caveats that apply to composite applications formed from microservices (network latency, serialization costs, etc.) apply to these types of applications. It makes little sense to take on these downsides in the absence of a clear demonstrable need. Doing so looks more like falling prey to a fad than creating a well thought out design.

Why does software development have to be so hard?

A series of 8 tweets by Dan Creswell paints a familiar, if depressing, picture of the state of software development:

(1) Developers growing up with modern machinery have no sense of constrained resource.

(2) Thus these developers have not developed the mental tools for coping with problems that require a level of computational efficiency.

(3) In fact they have no sensitivity to the need for efficiency in various situations. E.g. network services, mobile, variable rates of change.

(4) Which in turn means they are prone to delivering systems inadequate for those situations.

(5) In a world that is increasingly networked & demanding of efficiency at scale, we would expect to see substantial polarisation.

(6) The small number of successful products and services built by a few and many poor attempts by the masses.

(7) Expect commodity dev teams to repeatedly fail to meet these challenges and many wasted dollars.

(8) Expect smart startups to limit themselves to hiring a few good techies that will out-deliver the big orgs and define the future.

The Fallacies of Distributed Computing are more than twenty years old, but Arnon Rotem-Gal-Oz’s observations (five years after he first made them) still apply:

With almost 15 years since the fallacies were drafted and more than 40 years since we started building distributed systems – the characteristics and underlying problems of distributed systems remain pretty much the same. What is more alarming is that architects, designers and developers are still tempted to wave some of these problems off thinking technology solves everything.

Why?

Is it really this hard to get it right?

More importantly, how do we change this?

In order to determine a solution, we first have to understand the nature of the problem. Dan’s tweets point to the machines developers are used to, although in fairness, those of us who lived through the bad old days of personal computing can attest that developers were getting it wrong back then too. In “Most software developers are not architects”, Simon Brown points out that too many teams are ignorant of, or downright hostile to, the need for architectural design. Uncle Bob Martin, in “Where is the Foreman?”, suggests that the lack of a gatekeeper to enforce standards and quality is why “our floors squeak”. Are we over-emphasizing education and underestimating training? Has the increasing complexity, and the amount of abstraction used to manage it, left us with too generalized a knowledge base relative to our needs?

As with any wicked problem, I suspect that the answer to “why?” lies not in any one aspect but in the combination. Likewise, no one aspect is likely, in my opinion, to hold the answer in any given case, much less all cases.

People can be spoiled by the latest and greatest equipment, as well as by the optimal performance that comes from working and testing on the local network. However, reproducing real-world conditions is a bit more complicated than giving someone an older machine. You can simulate load and traffic on your site, but understanding and accounting for competing traffic on the local network and the internet is a bit more difficult. We cannot say “application x will handle y number of users”, only that it will handle that number of users under the exact same conditions and environment as we have simulated – a subtle, but critical, difference.

Obviously, I’m partial to Simon Brown’s viewpoint. The idea of a coherent, performant design just “emerging” from doing the simplest thing that could possibly work is ludicrous. The analogy would be walking into an auto parts store, buying components individually, and expecting them to “just work” – you have to have some sort of idea of the end product in mind. On the other hand, attempting to specify too much up front is as bad as too little – the knowledge needed is not there and even if it were, a single designer doesn’t scale when dealing with any system that has a team of any real size.

Uncle Bob’s idea of a “foreman” could work under some circumstances. Like Big Design Up Front, however, it doesn’t scale. Collaboration is as important to the team leader as it is to the architect. The consequences of an all-knowing, all-powerful personality can be just as dire in this role as for an architect.

In “Hordes of Novices”, Bob Martin observed, “When it’s possible to get a degree in computer science without writing any code, the quality of the graduates is questionable at best”. The problem here is that universities are geared to educate, not train. Just because training is more useful to an employer (at least in the short term) does not make education unimportant. Training deals with this tool at this time, while determining which tool is right for a given situation is more in the province of education. It’s the difference between how to do versus how to figure out how to do. Both are necessary.

As I’ve already noted, it’s a thorny issue. Rather than offering an answer, I’d rather offer the opportunity for others to add to the conversation in the comments below. How did we get here and how do we go forward?

Think asynch for performance

It’s fairly natural to think in terms of synchronous steps when defining processes to be automated. You do this, then that, followed by a third thing, and then you’re done. When applied to applications, this paradigm provides a relatively simple, easy to understand flow. The problem is that each of the steps takes time, and as the steps accumulate (due to enhancements added over time), the duration of the process increases. As the duration of the process increases, the probability that a user takes exception to the wait approaches 1.

There are several approaches to dealing with this dilemma. Hoping it goes away is a non-starter. Removing functionality, unless it’s blatant gold-plating, is probably not going to be an easy sell either. One workable option is scaling up and/or out to reduce the impact of user load on the time to completion. Another is to adjust the user’s perception of performance by executing selected steps, if not entire processes, asynchronously. Combining scale up/scale out and asynchronous operation can provide further performance gains, both actual and perceived.

When scaling up, the trade-off for increased performance is normally the cost of the hardware resources (processor/cores, memory, and/or storage). Either price or physical constraints will prove the limiting factor. For scaling out, the main trade-off will again be price, with a secondary consideration of some added complexity (e.g. session state considerations when load balancing a web application). Price would generally provide more of a limit to this method than any physical considerations. Greater complexity will be the main trade-off for asynchronous operations. Costs can also be incurred if coupled with one or more of the scaling methods above and/or if using commercial tools to host the asynchronous steps.

While a full treatment of the tools available to support asynchronous execution is beyond the scope of this post, they range from full-blown messaging systems to queuing to home-grown solutions. High-end tools can be expensive, complex, and have considerable hardware requirements. However, when evaluating methods, it is important to include maintenance considerations. A complex system developed in-house may soon equal the cost of a third-party product when you factor in both initial development and on-going maintenance.
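
For a sense of the low end of that spectrum, here is a minimal Java sketch of a home-grown, in-memory work queue (the class is an invented example): everything a messaging product supplies beyond this (durability, retries, dead-lettering, monitoring) becomes in-house development and maintenance work.

    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.LinkedBlockingQueue;

    // A bare-bones, in-memory work queue: cheap to write, but nothing is persisted
    // and failures are merely logged, so the operational features come later, by hand.
    class HomeGrownWorkQueue {
        private final BlockingQueue<Runnable> tasks = new LinkedBlockingQueue<>();
        private final ExecutorService worker = Executors.newSingleThreadExecutor();

        void start() {
            worker.submit(() -> {
                while (!Thread.currentThread().isInterrupted()) {
                    try {
                        tasks.take().run();  // blocks until work arrives
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    } catch (RuntimeException e) {
                        // No built-in retry or dead-letter queue; failures are just logged.
                        System.err.println("Task failed: " + e.getMessage());
                    }
                }
            });
        }

        void enqueue(Runnable task) {
            tasks.add(task);
        }
    }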

Steps from a synchronous process can be carved out for asynchronous execution provided they meet certain criteria. Such steps must not change the semantics of the process in the event of a failure, nor affect the response to the caller. Additionally, they must be amenable to retry logic (either automated or manual) or compensating transactions in order to deal with any failures. Feedback to the user must be handled either by periodically checking for messages or by the caller providing a callback mechanism. These considerations are a significant source of the additional complexity incurred.

As an example, consider a process where:

  • the status of an order is updated
  • a document is created and emailed to the participant assigned to fulfill the order
  • an accounting entry is made reflecting the expected cost of fulfilling the order
  • an audit entry is made reflecting the status change
  • the current state of the order is returned to the caller

Over a number of releases, the process is enhanced so that now:

  • the status of an order is updated
  • a document is created and emailed to the participant assigned to fulfill the order
  • an accounting entry is made reflecting the expected cost of fulfilling the order
  • an audit entry is made reflecting the status change
  • a snapshot of the document data is saved
  • a copy of the document is forwarded to the document management system
  • the current state of the order is returned to the caller

The creation and emailing of the document, as well as sending it to the document management system, are prime candidates for asynchronous execution. Both tasks are reproducible in the event of a failure and neither leaves the system in an unstable state should they fail to complete. Because the interaction with the document management system crosses machine boundaries and involves a large message, the improvement should be noticeable. As noted above, the trade-offs for this will include the additional complexity in dealing with failures (retry logic: manual, automatic, or both) as well as the costs and complexity associated with whatever method is used to accomplish the step out of band.
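
A hedged sketch of what that carve-out might look like in Java (the class and method names are invented for illustration): the steps that must succeed in-line stay synchronous, while the reproducible steps are handed off with simple retry logic and a manual fallback.

    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    record OrderState(String orderId, String status) {}

    // The steps that affect semantics or the response stay in-line; the reproducible
    // steps run out of band, with simple retries and a manual fallback.
    class OrderStatusProcess {
        private final ExecutorService background = Executors.newFixedThreadPool(2);

        OrderState changeStatus(String orderId, String newStatus) {
            OrderState state = updateStatus(orderId, newStatus);  // must succeed in-line
            recordAccountingEntry(orderId);                       // must succeed in-line
            recordAuditEntry(orderId, newStatus);                 // must succeed in-line

            // Carved out: reproducible on failure, no effect on the caller's response.
            background.submit(() -> withRetries(3, () -> createAndEmailDocument(orderId)));
            background.submit(() -> withRetries(3, () -> sendToDocumentManagement(orderId)));

            return state;                                         // the caller does not wait
        }

        private void withRetries(int attempts, Runnable step) {
            for (int i = 1; i <= attempts; i++) {
                try {
                    step.run();
                    return;
                } catch (RuntimeException e) {
                    if (i == attempts) flagForManualRetry(step, e);  // manual retry fallback
                }
            }
        }

        // Stubs standing in for the real implementations.
        private OrderState updateStatus(String id, String status) { return new OrderState(id, status); }
        private void recordAccountingEntry(String id) {}
        private void recordAuditEntry(String id, String status) {}
        private void createAndEmailDocument(String id) {}
        private void sendToDocumentManagement(String id) {}
        private void flagForManualRetry(Runnable step, Exception cause) {}
    }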

Implementing a process that is entirely asynchronous can involve a broader range of complexity. Simple actions that require little feedback to the user, such as re-sending the email from the example above, should be no more difficult than the carved-out step. Processes that require greater user interaction will require more design and development effort. Processes that were originally implemented in a synchronous manner will require the most effort.

Converting a process from a synchronous communications model to an asynchronous one will require substantial refactoring across all layers. When this refactoring crosses application boundaries (as in the case of a service application supporting other client applications), the complexity increases. Crossing organizational boundaries (e.g. company to company) will entail the greatest complexity. In these cases, providing a new asynchronous method while maintaining the old synchronous one will simplify the migration. Trying to coordinate multiple releases, particularly across enterprises, is asking for an ulcer.
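
One way to picture that migration path, sketched in Java with hypothetical names: the contract exposes both shapes side by side, so existing clients keep the synchronous call while new clients move to the asynchronous one on their own schedule.

    import java.util.concurrent.CompletableFuture;

    // During the migration, both shapes exist side by side: existing clients keep the
    // synchronous call while new clients adopt the asynchronous one, avoiding a
    // coordinated "big bang" release across applications or organizations.
    public interface FulfillmentService {
        FulfillmentResult fulfill(String orderId);                          // existing, synchronous
        CompletableFuture<FulfillmentResult> fulfillAsync(String orderId);  // new, asynchronous
    }

    record FulfillmentResult(String orderId, String disposition) {}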

In spite of the number of times I’ve used the word “complexity” in this post, my intent is not to discourage the use of the technique. Where appropriate, it is an extremely powerful tool that can allow you to meet the needs of the customer without requiring them to pick between function and performance. They tend to like the “have your cake and eat it too” scenario.