Architecture – Finding Simple Solutions Over a Lifetime of Problems

On Roger Sessions’ LinkedIn group, Simpler IT, the discussion “What do I mean by Simplifying” talks about finding simple solutions to problems. Roger’s premise is that every problem has its own inherent complexity:

Let’s say P is some problem that we need to solve. For example, P could be the earthquake in Tom’s example or P could be the need of a bank to process credit cards or P could be my car that needs its oil changed. P may range in complexity from low (my car needs its oil changed) to high (a devastating earthquake has occurred.)

For a given P, the complexity of P is a constant. There is no strategy that will change the complexity of P.

Complexity and Effectiveness

Roger goes on to say that for any given problem, there will be a set of solutions to that problem. He further states “…if P is non-trivial, then the cardinality of the solution set of P is very very large”. Each solution can be characterized by how well it solves the problem at hand and how complex the solution is. These attributes can be graphed, as in the image to the right, yielding quadrants that range from the most effective and least complex (best) to least effective and most complex (worst). Thus, simplifying means:

The best possible s in the solution set is the one that lives in the upper left corner of the graph, as high on the Y axis as possible and as low on the X axis as possible.

When I talk about simplifying, I am talking about finding that one specific s out of all the possible solutions in the solution set.
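The selection Roger describes can be sketched in a few lines. This is only an illustration: the `Solution` type, the candidate names, and the numeric scores are all made up for the example, and it assumes complexity and effectiveness can each be reduced to a single number. A candidate survives if no other candidate is at least as good on both axes and strictly better on one.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Solution:
    name: str
    complexity: float     # X axis: lower is better
    effectiveness: float  # Y axis: higher is better


def dominates(a: Solution, b: Solution) -> bool:
    """True if a is at least as good as b on both axes and strictly better on one."""
    no_worse = a.complexity <= b.complexity and a.effectiveness >= b.effectiveness
    better = a.complexity < b.complexity or a.effectiveness > b.effectiveness
    return no_worse and better


def non_dominated(solutions: list[Solution]) -> list[Solution]:
    """Keep only the candidates that no other candidate dominates."""
    return [s for s in solutions if not any(dominates(o, s) for o in solutions)]


# Hypothetical candidates with invented scores.
candidates = [
    Solution("A", complexity=2, effectiveness=9),
    Solution("B", complexity=8, effectiveness=9),
    Solution("C", complexity=2, effectiveness=4),
    Solution("D", complexity=7, effectiveness=3),
]
best = non_dominated(candidates)
print([s.name for s in best])
```

With these scores only "A" survives, which matches the "one specific s" in the upper left corner. When no single candidate wins on both axes, the function returns every candidate that nothing else beats, and the final choice is a judgment call.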

Simplification, as a strategy, makes a great deal of sense in my opinion. There is, however, another aspect to be considered. While the complexity of a given problem P is constant, P represents the problem space of a system at a given time, not the entire lifecycle. The lifecycle of a system will consist of a set of problem spaces over time, from first release to decommissioning. An architect must take this lifecycle into consideration or risk introducing an ill-considered constraint on the future direction of the product. This is complicated by the fact that there will be uncertainty in how the problem space evolves over time, with the uncertainty being the greatest at the point furthest from the present (as represented by the image below).

[Image: product timeline]

Some information regarding the transition from one problem space to the next will be available. Product roadmaps and deferred issues provide some insight into what will be needed next. That being said, emergent circumstances (everything from unforeseen changes in business direction to unexpected increases in traffic) will conspire to prevent the trajectory of the set of problem spaces from being completely predictable.

Excessive complexity will certainly constrain the options for evolving a system. However, flexibility can come with a certain amount of complexity as well. The simplest solution may also complicate the evolution of a system.

Form Follows Function on SPaMCAST

The latest episode (#268) of Tom Cagley’s excellent series of podcasts features an interview with me on the subjects of architecture, process, and management, as well as why I blog. It was not only an honor to be asked, but also a very enjoyable half hour of conversation on subjects near and dear to me – well worth the time it takes to listen to (in my not so humble opinion).

Error Handling – No News is Really Bad News

Did you think I wouldn't notice?

A recent post on The Daily WTF highlighted a system that “…throws the fewest errors of any of our code, so it should be very stable”. The punchline, of course, was that the system threw so few errors because it was catching and suppressing almost all the errors that were occurring. Once the “no news is good news” code was removed, the dysfunctional nature of the system was revealed.

On one level, it’s funny to think of a system being considered “very stable” on the basis of it destroying the evidence of its failures. Anyone who has been in software development for any length of time probably has a war story about a colleague who couldn’t tell the difference between getting rid of error messages and correcting the error condition. However, if the system in question is critical to the user’s personal or financial well-being, then it’s not so amusing. Imagine thinking you had health insurance because the site where you enrolled said you did, and finding out later that you really didn’t.

Developing software that accomplishes something isn’t trivial, but then again, it isn’t rocket science either. Performing a task when all is correct is the easy part. We earn our money by how we handle the other cases. This is not only a matter of technical professionalism, but also a business issue. End users are likely to be annoyed if our applications leave them stranded out of town or jumping through hoops only to be frustrated at the end of the process.

Better an obvious failure than a mystery that leaves the user wondering if the system did what it said it did. Mystery impairs trust, which is a key ingredient in the customer relationship.
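A minimal sketch of the difference, using the enrollment scenario above. Every name here (`save_enrollment`, `enroll_quietly`, `enroll_loudly`, `EnrollmentError`) is hypothetical, and a plain dictionary stands in for whatever store a real system would use, with `None` simulating an outage.

```python
class EnrollmentError(Exception):
    """Raised when an enrollment could not be recorded."""


def save_enrollment(store, member_id):
    """Hypothetical persistence step; fails when the store is unavailable."""
    if store is None:
        raise ConnectionError("enrollment store unavailable")
    store[member_id] = "enrolled"


def enroll_quietly(store, member_id):
    """Anti-pattern: swallow the error and report success anyway."""
    try:
        save_enrollment(store, member_id)
    except Exception:
        pass  # the evidence of the failure is destroyed here
    return "You are enrolled"


def enroll_loudly(store, member_id):
    """Surface the failure so that neither the user nor the operator is misled."""
    try:
        save_enrollment(store, member_id)
    except ConnectionError as exc:
        # Chain the original exception so the root cause stays visible.
        raise EnrollmentError(f"enrollment for {member_id} was not saved") from exc
    return "You are enrolled"
```

Called with `None` as the store, `enroll_quietly` still returns "You are enrolled", while `enroll_loudly` raises `EnrollmentError`. The first version throws fewer errors, but only the second one tells the truth.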

All of the above was written with the assumption of incompetence rather than malice. However, a comment from Charlie Alfred made during a Twitter discussion about technical debt raised another possibility:

Wonder if such a thing as “Technical Powerball”? Poor design, unreadable code, no doc, but hits jackpot anyway 🙂

Charlie’s question doesn’t assume bad intent, but it occurred to me that if “jackpot” is defined as “it just has to hold together ’til I clear the door and cash the check”, then perhaps a case of technical debt is really a case of “Technical Powerball”. Geek and Poke put it well:

Knowing the Price of Everything and the Value of Nothing

Are they worth it?

Late on a September afternoon in 1812, outside a village on the road to Moscow, Napoleon had a problem. After nine hours of grinding battle in which both armies sustained massive losses, the Russian armies were on the verge of disintegration. Napoleon’s staff was begging him to commit his elites, the Imperial Guard, and complete the victory, but Marshal Bessières asked “Will you risk your last reserves eight hundred miles from Paris?” He would not, and although the French army would march into Moscow a week later on September 14, it would also march back out five weeks later, retreating back to the Polish border. Nearly five-sixths of the 685,000-man army that started the campaign had been lost, and the end of Napoleon’s control of the continent was in sight.

While my inner history geek finds this fascinating in and of itself, there is an architecturally significant moral to this story. The lack of reserves limits your options and the limited set of options you’re left with tends to range from bad to worse. Rather than reserves of soldiers, supplies, and ammunition, we deal in reserves of storage, memory, processor, and bandwidth. Exhausting these reserves can lead to catastrophe:

It can be tempting for some to seek out high levels of utilization. In their minds, a system that spends the majority of its time at fifty percent utilization is wasting fifty percent of its resources. After all, while storage, memory, processor, and bandwidth are cheaper than in the past, the cost is still non-zero. Far better, in their opinion, to more closely manage the allocation and eliminate the waste.

The problem, of course, is that exceeding the critical level of a resource will degrade a system. Whether that degradation takes the form of a crash or is handled more gracefully via throttling, queuing, suspending functionality, etc., it is still degradation. The only way to prevent degradation is to ensure that sufficient excess capacity is in place to handle peak loads. For storage, this would be an amount sufficient to hold the data generated at peak load for the period it would take to recognize and respond to the impending shortage.
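The storage rule works out to simple arithmetic: the reserve is the peak write rate multiplied by the time needed to notice the shortage and do something about it. The sketch below assumes that rate and those times are known. The function names and the figures in the example are invented for illustration.

```python
def storage_reserve_gb(peak_write_gb_per_hour, hours_to_detect, hours_to_respond):
    """Space needed to absorb peak-rate writes until more capacity is online."""
    return peak_write_gb_per_hour * (hours_to_detect + hours_to_respond)


def alert_threshold_gb(total_capacity_gb, reserve_gb):
    """Usage level at which the reserve starts being consumed."""
    return total_capacity_gb - reserve_gb


# Illustrative figures only: 20 GB/hour at peak, 4 hours to notice,
# 44 hours to provision and bring more storage online.
reserve = storage_reserve_gb(20, hours_to_detect=4, hours_to_respond=44)
threshold = alert_threshold_gb(5000, reserve)
print(reserve, threshold)
```

With these numbers the reserve is 960 GB, so a 5,000 GB volume should raise the alarm at 4,040 GB used, long before it looks "full".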

Peak load is the critical metric, as it reflects the worst-case scenario. In situations where resources are shared, peak load across all systems should be used. Average load is useless in this context because it smooths out the peaks and valleys (remember, if you stick one foot in a bucket of ice water and the other in a bucket of boiling water, on average, you’re comfortable).
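A small sketch makes the point about averages concrete. The two load series, the system names, and the capacity figure are made-up samples, and the code assumes both systems were measured over the same intervals.

```python
from statistics import mean


def combined_load(*series):
    """Sum the load of systems sharing a resource, interval by interval.

    Assumes every series covers the same intervals in the same order.
    """
    return [sum(values) for values in zip(*series)]


# Hypothetical load samples (percent of a shared resource) for two systems.
orders = [10, 10, 90, 10]
reports = [5, 5, 85, 5]

shared = combined_load(orders, reports)
capacity = 100

print("average:", mean(shared))  # comfortably under capacity
print("peak:", max(shared))      # well over capacity
print("enough capacity?", max(shared) <= capacity)
```

The average load here is 55, which suggests plenty of headroom, while the combined peak of 175 is far beyond what the resource can deliver. Sizing against the average would guarantee degradation every time the two systems got busy at once.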

Maintaining reserve capacity is guaranteed to allocate excess resources that represent wasted money – just as backups, disaster recovery environments, and insurance all represent wasted money (until needed). Obsessing about the cost of those excess resources without factoring in the cost of an outage or slowdown is a perfect example of being penny-wise and pound-foolish.