Late on a September afternoon in 1812, outside a village on the road to Moscow, Napoleon had a problem. After nine hours of grinding battle in which both armies sustained massive losses, the Russian army was on the verge of disintegration. Napoleon’s staff was begging him to commit his elite reserve, the Imperial Guard, and complete the victory, but Marshal Bessières asked, “Will you risk your last reserves eight hundred miles from Paris?” He would not, and although the French army would march into Moscow a week later on September 14, it would also march back out five weeks after that, retreating to the Polish border. Nearly five-sixths of the 685,000-man army that started the campaign had been lost, and the end of Napoleon’s control of the continent was in sight.
While my inner history geek finds this fascinating in and of itself, there is an architecturally significant moral to this story. A lack of reserves limits your options, and the options you’re left with tend to range from bad to worse. Rather than reserves of soldiers, supplies, and ammunition, we deal in reserves of storage, memory, processor, and bandwidth. Exhausting these reserves can lead to catastrophe:
A system without spare capacity means that a single unexpected event can have major knock-on consequences. cc @SW_Trains
— Kris Coverdale (@kriscoverdale) November 22, 2013
It can be tempting for some to seek out high levels of utilization. In their minds, a system that spends the majority of its time at fifty percent utilization is wasting fifty percent of its resources. After all, while storage, memory, processor, and bandwidth are cheaper than in the past, the cost is still non-zero. Far better, in their opinion, to more closely manage the allocation and eliminate the waste.
The problem, of course, is that exceeding the critical level of a resource will degrade a system. Whether that degradation takes the form of a crash or is handled more gracefully via throttling, queuing, suspending functionality, etc., it is still degradation. The only way to prevent it is to ensure that sufficient excess capacity is in place to handle peak loads. For storage, this would be an amount sufficient to hold the data generated at peak load for the period it would take to recognize and respond to the impending shortage.
Peak load is the critical metric, as this reflects the worst-case scenario. In situations where resources are shared, peak load across all systems should be used. Average load is useless in this context because it smooths out the peaks and valleys (remember, if you stick one foot in a bucket of ice water and the other in a bucket of boiling water, on average, you’re comfortable).
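To make the arithmetic concrete, here’s a minimal sketch in Python; the load samples, response window, and numbers are illustrative assumptions, not measurements from any real system:

    # Hypothetical hourly storage consumption samples (GB written per hour).
    # The numbers are invented to show a daily spike.
    hourly_load_gb = [2, 3, 2, 4, 3, 12, 14, 3, 2, 2, 3, 2]

    average_load = sum(hourly_load_gb) / len(hourly_load_gb)  # ~4.3 GB/hr
    peak_load = max(hourly_load_gb)                           # 14 GB/hr

    # Reserve sizing: enough headroom to absorb peak load for as long as it
    # takes to recognize the shortage and respond (an assumed 48 hours here).
    response_window_hr = 48
    print(f"sized to average: {average_load * response_window_hr:.0f} GB")
    print(f"sized to peak:    {peak_load * response_window_hr} GB")

Sized to the average, the reserve runs dry partway through a sustained peak; sized to the peak, it buys the full window needed to react.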
Maintaining reserve capacity means allocating excess resources that will be wasted money – just as backups, disaster recovery environments, and insurance all represent wasted money (until they’re needed). Obsessing over the cost of those excess resources without factoring in the cost of an outage or slowdown is a perfect example of being penny-wise and pound-foolish.
As I was reading Roger Sessions’ latest white paper, “The Thirteen Laws of Highly Complex IT Systems”, Laws 1 and 2 immediately caught my eye:
Law 1. There are three categories of complexity: business, architectural and implementation.
Law 2. The three categories of complexity are largely independent of each other.
That complexity in these categories can vary independently (e.g. complex business processes can be designed and implemented simply, just as simple processes can be designed and implemented in an extremely complex manner) is important to understanding complexity in IT. Likewise, it serves as a reminder that the architecture and implementation of a system can vary independently from the underlying business process(es) it was intended to enable. That variance is an insidious form of technical debt, whether it accrues over time or was a foundational aspect of the system. In either case (though perhaps more so in the latter), customer satisfaction will suffer.
Poor customer service, particularly in the form of ignoring (or being perceived as ignoring) the needs of the business, is a prime trigger for rogue IT implementations. The uncoordinated nature of these implementations leads to an overly complex “accidental architecture”. These accidental architectures pose problems not only in that they tend to be fragmented and more expensive than a well-designed solution, but also in that their existence constrains future architectures. Structure follows strategy when building anew, but then strategy will find itself constrained by structure.
The antithesis of this IT dystopia is the “fluid enterprise”, described by Brenda Michelson as one where “…assets in our portfolios are no longer sole-purposed applications or databases; they are also potential multiuse components and triggers to be exploited in the new architecture”. In order to evolve applications that come together as an enterprise platform, it is necessary to start from a base of applications that meet the needs of their users. While rationalizing a collection of shadow IT components is likely to be a long and expensive task, that does not mean that gluing together a bunch of inadequate (albeit “official”) systems will be a better solution.
I tried to avoid this one. First of all, I don’t do politics on this site and this topic has way too much political baggage. Second, a great many people have already written about it, so I didn’t think I really had anything to add.
Then, Uncle Bob Martin chimed in.
I agree with some of what he has to say. I have no doubt that this particular debacle has harmed the image of software development in the eyes of the general public. Then he falls over the edge, comparing the launch of healthcare.gov with the Challenger disaster. After all, in both cases, political considerations overrode technical concerns. Regardless of this, Bob puts the blame on those far down the ladder:
Perhaps you disagree. Perhaps you think this was a failure of government, or of management. Of course I agree. Government failed and management failed. But government and management don’t know how to build software. We do. We were hired because of that knowledge. And we are expected to use that knowledge to communicate to the managers and administrators who don’t have it.
The thing is, the Centers for Medicare and Medicaid Services (CMS) is both a government agency and the system integrator on the healthcare.gov project. While there’s plenty of evidence of really poor code across the various parts, the integration of those parts is where the project fell down. Had the various contractors hired numerous Bob Martin clones and obtained the cleanest of clean code, the result would have still been the same.
Those with the technical knowledge and experience are, without a doubt, obligated to provide their best advice to the managers and administrators. When those managers and administrators ignore that advice, however, the fault lies with them, not with those who gave it.
The end of the post, however, is the worst:
So, if I were in government right now, I’d be thinking about laws to regulate the Software Industry. I’d be thinking about what languages and processes we should force them to use, what auditing should be done, what schooling is necessary, etc. etc. I’d be thinking about passing laws to get this unruly and chaotic industry under some kind of control.
If I were the President right now, I might even be thinking about creating a new Czar or Cabinet position: The Secretary of Software Quality. Someone who could regulate this misbehaving industry upon which so much of our future depends.
Considering that all indications are that the laws and regulations around government purchasing and contracting contributed to this mess, I’m not sure how additional regulation is supposed to fix it. Likewise, it’s a little boneheaded to suggest that those responsible for this debacle (by attempting to manage what they should have known they were unqualified to manage) should now regulate the entire software development industry. In fact, the very diversity of the industry should make it obvious that a one-size-fits-all mandate would make matters irretrievably worse.
Handing out aspirin to treat Ebola is just bad medicine.
According to Dwight D. Eisenhower, “…plans are useless but planning is indispensable”. How can the production of something “useless” be “indispensable”?
Read more at CitizenTekk
In spite of all that’s been written on the subject of technical debt, it’s still a common occurrence to see it defined as simply “bad code”. Likewise, it’s still common to see the solution offered being “stop writing bad code”. Technical debt encompasses much more than that simplistic definition, so while “stop writing bad code” is good advice, it’s wholly insufficient to deal with the subject.
Steve McConnell’s definition is much more comprehensive (and, in my opinion, closer to the mark):
A design or construction approach that’s expedient in the short term but that creates a technical context in which the same work will cost more to do later than it would cost to do now (including increased cost over time)
While it’s a better definition, I’d differ with it in three ways. Technical debt may incur costs not only through rework of the original item, but also by making changes that depend on that item more difficult. Technical debt may also end up costing nothing extra over time (because a risk never materializes or because the feature associated with the debt is eliminated). Lastly, it should be noted that the cost of technical debt can extend beyond effort, affecting customer satisfaction as well.
In short, I define technical debt as any technical deficit that involves a risk of greater cost and/or end user dissatisfaction.
This definition encompasses debts that are taken on deliberately and rationally, those that are taken on impulsively, and those that are taken on unconsciously.
Code that is brittle, redundant, unnecessary, unclear, insecure, and/or untested is, of course, a type of technical debt. Although Bob Martin argues otherwise, the risk of costs to be paid clearly makes it so. Likewise, aspects of design can be considered technical debt, whether in the form of poor decisions, intentional shortcuts, decisions deferred too long, or architectural “drift” (losing design coherence via new features being added using new technologies/techniques without bringing older components up to date, or failing to evolve the system as the needs of the business change). Deferring bug fixes is a form of technical debt as is deferring automation of recurring support tasks. Dependencies can be a source of technical debt, both in regard to debt they carry and in terms of their fitness to your purpose. The platform that hosts your application is yet another potential source of technical debt if not maintained.
As noted above, the “interest” on technical debt can manifest as the need for rework and/or more effort in implementing changes over time. This increase in effort can come through the proliferation of code to compensate for the effects of unresolved debt or even just through increased time to comprehend the existing code base prior to attempting a change. As Ruth Malan has noted, strategy may drive architecture, but once the initial architecture is in place, it serves to both enable and constrain the strategy of the system going forward (strategies requiring major architectural changes typically must offer extremely high ROI to get approval). Time spent on manual maintenance tasks (e.g. running scripts to add new reference values) can also be a form of interest, considering that time spent there is time that could be spent on other tasks.
Costs associated with technical debt are not always a gradual payback over time as with an ordinary loan. Some can be like a debt to the mob: “they come at night, pointing a gun to your head, and they want their money NOW”. Security issues are a prime example of this type of debt. Obviously, debts that carry the danger of coming due with little or no notice should be considered too risky to take on.
Having proposed a definition for the term “technical debt” and identified the risks it entails, it remains to discuss what to do about it. The first step is to recognize it when it’s incurred (or as soon as possible thereafter). For debt taken on deliberately, recognition should be trivial going forward. Recognition of existing debt in an established system may require discovery if it has not been cataloged previously. Debt that has been taken on unconsciously will always require effort to discover. In all cases, the goal is to maintain a technical debt backlog that is as comprehensive as possible. Maintaining this backlog provides insight into the current state of the system and can inform risk assessments for future decisions.
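As a sketch of what an entry in such a backlog might look like (the fields, categories, and example items are my own assumptions, not any standard schema), something like this Python structure would do:

    from dataclasses import dataclass, field
    from datetime import date

    @dataclass
    class DebtItem:
        """One entry in a technical debt backlog (fields are illustrative)."""
        description: str
        origin: str             # "deliberate", "impulsive", or "unconscious"
        discovered: date
        risk: int               # 1 (low) to 5 (could come due with no notice)
        est_effort_days: float  # rough cost to retire the debt
        affected_areas: list = field(default_factory=list)

    backlog = [
        DebtItem("No retry logic on payment gateway calls", "deliberate",
                 date(2013, 6, 1), risk=4, est_effort_days=3,
                 affected_areas=["checkout"]),
        DebtItem("Reference data loaded via hand-run script", "unconscious",
                 date(2013, 9, 15), risk=2, est_effort_days=5,
                 affected_areas=["operations", "reporting"]),
    ]

Even this little structure captures when and how each debt was taken on, which is exactly the information a later risk assessment will want.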
Becoming aware of existing debt is a critical first step, but is insufficient in itself. Taking steps to actively manage the system’s debt portfolio is essential. The first step should be to stop unconsciously taking on new debt. Latent debt tends to fit into the immediate, unexpected payback model mentioned above. Likewise, steps taken to improve the quality up front (unit testing, code review, static analysis, process changes, etc.) should reduce the effort needed for detection and remediation on the back end. Architectural and design practices should also be examined. Too little design can be as counter-productive as too much. Striking the right balance can yield savings over the life of the application.
Deciding whether or not to take on intentional technical debt is less black and white. Often this type of debt is taken on for rational reasons. An example of this is what Ruth Malan characterizes as “…trading technical learning and code design improvement for market learning (getting features into user hands to improve them)”. Other times, the balance between risk and reward (whether time to market or budget) may tilt in favor of taking on a debt. When this is the case, it is critical that the owner(s) of the system make the decision in possession of the best information those with the technical knowledge can provide. An impulsive decision made on the basis of “feel” rather than information will likely carry more risk.
Retiring old debt should be the final link in the chain. Just as the taking on of new debt should be done in a rational manner, so should the retirement of old debt. Not all debt carries the same risk/reward ratio and efforts that carry more bang for the buck will be an easier sell. Although some may disagree, I firmly believe that better outcomes will result from making those who own the system active partners in its development and evolution.
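A deliberately naive sketch of that “bang for the buck” ranking (the items and scores are invented, and no single ratio substitutes for an actual risk assessment):

    # Rank retirement candidates by risk retired per day of effort.
    candidates = [
        {"name": "payment retry logic", "risk": 4, "effort_days": 3},
        {"name": "manual reference-data load", "risk": 2, "effort_days": 5},
        {"name": "unpatched app server", "risk": 5, "effort_days": 2},
    ]

    for item in sorted(candidates,
                       key=lambda i: i["risk"] / i["effort_days"],
                       reverse=True):
        print(f'{item["name"]}: {item["risk"] / item["effort_days"]:.2f} risk/day')

The output is a conversation starter for the system’s owners, not a decision made on their behalf.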
It’s highly unlikely that a system will be free of technical debt. Perversely, being free of such debt could actually be a liability. That being said, there is a world of difference between the two poles of debt-free and technical anarchy. Effort spent to rationally manage a system’s debt load will free up time to be put to better use.
(originally posted on CitizenTekk)
Like most industrial concerns at the time, Black & Decker became an integral part of the United States’ war effort during World War II, producing power tools used in the defense industry. Alonzo G. Decker, Jr., a Vice President of the company (as well as the son of one of the founders), noticed that a defense contractor was buying drills at an unusual rate:
“Are they breaking down?” Mr. Decker asked.
“No, they are disappearing. Women are taking them home in their lunch baskets,” he was told.
Mr. Decker said he replied, “When females are taking drills home, we ought to be making something just for the home.”
“Toolmaker innovator Decker Jr. dead at 94”, Baltimore Sun, 3/19/2002
Aside from the sexism (it was the 1940s), two things should stand out: Decker was proactively looking for anomalies (including orders that were “too good”) that might indicate problems and he was alert to new opportunities. With the end of the war, Black & Decker launched a line of power tools for the home market, selling their millionth drill in less than five years. After nearly seventy years, the market has grown to $14.5 billion, of which Stanley Black & Decker holds $5.2 billion. Imagine the difference it might have made had he ignored either the anomaly or the opportunity.
In my previous post, “Faster Horses – Henry Ford and Customer Development”, I pointed out the importance of understanding the problem space in terms of customer needs and wants. While listening to customers is a valuable technique, Decker’s story shows that observing them can be just as valuable, if not more so. Some people dislike confrontation, some people dislike complaining, but their actions will generally conform to their feelings.
While the latest Big Data techniques can potentially give great insight into a customer’s motivations, Decker’s experience proves that lower-tech metrics can work as well. The key is to determine what the needs are and then pay attention to how well those needs are being met, using the best methods at your disposal. Detecting issues before the customer complains can pay dividends – Alonzo Decker was looking for a problem when he found a multi-billion dollar market segment.
An easy way to start a fight these days is to refer to “the business”. Whether as a grumble or a shout, the response will inevitably come: “IT is part of the business”.
Of course it is. Having written a post or two (or three or four) about that very subject, I fully agree. That being said, not all differentiation is evidence that one believes “the business” exists to be the life support system for IT.
“The business” can be a useful shorthand for “not IT” (also “not HR”, “not legal”, “not purchasing”, etc., depending on who is using the phrase). This type of differentiation is extremely useful in ensuring that we associate the different aspects of a system (both features and qualities of service) with their appropriate stakeholder(s) and prioritize them appropriately. In dealing with storage, we might have concerns around time to back up (predominantly an IT concern) and available capacity (more likely to concern “the business”). We can play word games to avoid using those exact terms, but ultimately we need to make sure the concerns of “the business” take priority. We also need to make sure that where IT concerns have business impact, that impact is presented in terms of business outcomes.
This is the main reason I prefer having IT’s funding reside in the budgets of those using it – nothing destroys customer service quicker than giving someone a target (e.g. cost cutting) that is at odds with serving their customer. However, it should also be noted that customers who are paying their own bills tend to be more responsible consumers. Helping business units find the most cost-effective way to achieve their ends meets the goal of providing good service to those units and serves to further the enterprise’s goal of reducing costs. Satisfying its immediate customers (the business units) and the enterprise as a whole is more important than merely paying lip service to alignment. As a support organization, it’s how IT contributes to the satisfaction of the external customer. Since IT is a part of the business, that should be our concern – whether our concept of service is contributing to the success of the enterprise, not whether we’re using the term of the day.