Holistic Architecture – Keeping the Gears Turning

Gears Turning Animation

In last week’s post, “Trash or Treasure – What’s Your Legacy?”, I talked about how to define “legacy systems”. Essentially, as the divergence grows between the needs of social systems and the fitness for purpose of the software systems that enable them, the more likely that those software systems can considered “legacy”. The post attracted a few comments.

I love comments.

It’s nearly impossible to have writers’ block when you’ve got smart people commenting on your work and giving you more to think about. I got just that courtesy of theslowdiyer. The comment captured a critical point:

Agree that ALM is important, and actually also for a different reason – a financial one:

First of all, the cost of operating the system though the full Application Life Cycle (up to and including decommissioning) needs to be incorporated in the investment calculation. Some organisations will invariably get this wrong – by accident or by (poor) design (of processes).

But secondly (and this is where I have seen things go really wrong): If you invest a capability in the form of a new system then once that system is no longer viable to maintain, you probably still need the capability. Which means that if you are adding new capabilities to your system landscape, some form of accruals to sustain the capability ad infinitum will probably be required.

The most important thing is the capability, not the software system.

The capability is an organizational/enterprise concern. It belongs to the social systems that comprise the organization and the over-arching enterprise. This is not to say that software systems are not important – lack of automation or systems that have slipped into the legacy category can certainly impede the enterprise. However, without the enterprise, there is no purpose for the software system. Accordingly, we need to keep our focus centered on the key concern, the capability. So long as the capability is important to enterprise, then all the components, both social and technical, need to be working in harmony. In short, there’s a need for cohesion.

Last fall, Grady Booch tweeted:

Ruth Malan replied with a great illustration of it from her “Design Visualization: Smoke and Mirrors” slide deck:

Obviously, no one would want to fly on a plane in that state (which illustrates the enterprise IT architecture of too many organizations). The more important thing, however, is that even if the plane (the technical architecture of the enterprise) is perfectly cohesive, if the social system maintaining and operating it is similarly fractured, it’s still unsafe. If I thought that pilots, mechanics, and air traffic controllers were all operating at cross purposes (or at least without any thought of common cause), I’d become a fan of travel by train.

Unfortunately, for too many organizations, accidental architecture is the most charitable way to describe the enterprise. Both social and technical systems have been built up on an ad hoc basis and allowed to evolve without reference to any unifying plan. Technical systems tend to be built (and worse, maintained) according to project-oriented mindset (aka “done and run”) leading to an expensive cycle of decay, then fix. The social systems can become self-perpetuating fiefs. The level of cohesion between the two, to the extent that it existed, breaks down even more.

A post from Matt Balantine, “Garbage In” illustrates the cohesion issue across both social and technical systems. Describing an attempt to analyze spending data across a large organization composed of federated subsidiaries:

The theory was that if we could find the classifications that existed across each of the organisations, we could then map them, Rosetta Stone-like, to a standard schema. As we spoke to each of the organisations we started to realise that there may be a problem.

The classification systems that were in use weren’t being managed to achieve integrity of data, but instead to deliver short-term operational needs. In most cases the classification was a drop-down list in the Finance system. It hadn’t been modelled – it just evolved over time, with new codes being added as necessary (and old ones not being removed because of previous use). Moreover, the classifications weren’t consistent. In the same field information would be encapsulated in various ways.

Even in more homogeneous organizations, I would expect to find something similar. It’s extremely common for aspects of one capability to bear on others. What is the primary concern for one business unit may be one of many subsidiary concerns for another (see “Making and Taming Monoliths” for an illustrated example). Because of the disconnected way capabilities (and their supporting systems) are traditionally developed, however, there tends to be a lot of redundant data. This isn’t necessarily a bad thing (e.g. a cache is redundant data maintained for performance purposes). What is a bad thing is when the disconnects cause disagreements and no governance exists to mediate the disputes. Not having an authoritative source is arguably worse than having no data at all since you don’t know what to trust.

Having an idea of what pieces exist, how they fit together, and how they will evolve while remaining aligned is, in my opinion, critical for any system. When it’s a complex socio-technical system, this awareness needs to span the whole enterprise stack (social and technical). Time and effort spent maintaining coherence across the enterprise, rather than detracting from the primary concerns will actually enhance them.

Are you confident that the plane will stay in the air or just hoping that the wing doesn’t fall off?

Building a Legacy

Greek Trireme image from Deutsches Museum, Munich, Germany

 

Over the last few weeks, I’ve run across a flurry of articles dealing with the issue of legacy systems used by the U.S. government.

An Associated Press story on the findings from the Government Accountability Office (GAO) issued in May reported that roughly three-fourths of the $80 billion IT budget was used to maintain legacy systems, some more than fifty years old and without an end of life date in sight. An article on CIO.com about the same GAO report detailed seven of the oldest systems. Two were over 56 years old, two 53, one 51, one 35, and one 31. Four of the seven have plans to be replaced, but the two oldest have no replacement yet planned.

Cost was not the only issue, reliability is a problem as well. An article on Timeline.com noted:

Then there’s the fact that, up until 2010, the Secret Service’s computer systems were only operational about 60% of the time, thanks to a highly outdated 1980s mainframe. When Senator Joe Lieberman spoke out on the issue back in 2010, he claimed that, in comparison, “industry and government standards are around 98 percent generally.” It’s alright though, protecting the president and vice president is a job that’s really only important about 60 percent of the time, right?

It would be easy to write this off as just another example of public-sector inefficiency, but you can find these same issues in the private sector as well. Inertia can, and does, affect systems belonging to government agencies and business alike. Even a perfectly designed implemented system (we’ve all got those, right?) is subject to platform rot if ignored. Ironically, our organizations seem designed to do just that by being project-centric.

In philosophy, there’s a paradox called the Ship of Theseus, that explores the question of identity. The question arises, if we maintain something by replacing its constituent parts, does it remain the same thing? While many hours could be spent debating this, to those whose opinion should matter most, those who use the system, the answer is yes. To them, the identity of the system is bound up in what they do with it, such that it ceases to be the same thing, not when we maintain it but when its function is degraded through neglect.

Common practice, however, separates ownership and interest. Those with the greatest interest in the system typically will not own the budget for work on it. Those owning the budget, will typically be biased towards projects which add value, not maintenance work that represents cost.

Speaking of cost, is 75% of the budget an unreasonable amount for maintenance? How well are the systems meeting the needs of their users? Is quality increasing, decreasing, or holding steady? Was more money spent because of deferred maintenance than would have been spent with earlier intervention? How much business risk is involved? Without this context, it’s extremely difficult to say. It’s understandable that someone outside an organization might lack this information, but even within it, would a centralized IT group have access to it all? Is the context as meaningful at a higher, central level as it is “at the pointy end of the spear”?

Maintaining systems bit by bit, replacing them gradually over time, is likely to be more successful and less expensive, than letting them rot and then having a big-bang re-write. In my opinion, having an effective architecture for the enterprise’s IT systems is dependent on having an effective architecture for the enterprise itself. If the various systems (social and software) are not operating in conjunction, drift and inertia will take care of building your legacy (system).

[Greek Trireme image from Deutsches Museum, Munich, Germany via Wikimedia Commons]

Dealing with Technical Debt Like We Mean it

What’s the biggest problem with technical debt?

In my opinion, the biggest problem is that it works. Just like the electrical outlet pictured above, systems with technical debt get the job done, even when there’s a hidden surprise or two waiting to make life interesting for us at some later date. If it flat-out failed, getting it fixed would be far easier. Making the argument to spend time (money) changing something that “works” can be difficult.

Failing to make the argument, however, is not the answer:

Brenda Michelson‘s observation is half the battle. The argument for paying down technical debt needs to be made in business-relevant terms (cost, risk, customer impact, etc.). We need more focus on the “debt” part and remember “technical” is just a qualifier:

The other half of the battle is communicating, in the same business-relevant manner, the costs and/or risks involved when taking on technical debt is considered:

Tracking what technical debt exists and managing the payoff (or write off, removing failed experiments is a reduction technique) is important. Likewise, managing the assumption of technical debt is critical to avoid being swamped by it.

Of course, one could take the approach that the only acceptable level of technical debt is zero. This is equivalent to saying “if we can’t have a perfect product, we won’t have a product”. That might be a difficult position to sell to those writing the checks.

Even if you could get an agreement for that position, reality will conspire to frustrate you. Entropy emerges. Even if the code is perfected and then left unchanged, the system can rot as its platform ages and the needs of the business change. When a system is actively maintained over time without an eye to maintaining a coherent, intentional architecture, then the situation becomes worse. In his post “Enterprise Modernization – The Next Big Thing!”, David Sprott noted:

The problem with modernization is that it is widely perceived as slow, very expensive and high risk because the core business legacy systems are hugely complex as a result of decades of tactical change projects that inevitably compromise any original architecture. But modernization activity must not be limited to the old, core systems; I observe all enterprises old and new, traditional and internet based delivering what I call “instant legacy” [Note 1] generally as outcomes of Agile projects that prioritize speed of delivery over compliance with a well-defined reference architecture that enables ongoing agility and continuous modernization.

Kellan Elliot-McCrea, in “Towards an understanding of technical debt”, captured the problem:

All code is technical debt. All code is, to varying degrees, an incorrect bet on what the future will look like.

This means that assessing and managing technical debt should be an ongoing activity with a responsible owner rather than a one-off event that “somebody” will take care of. The alternative is a bit like using a credit card at every opportunity and ignoring the statements until the repo-man is at the door.

Technical Debt – Why not just do the work better?

Soap or oil press ruins in Panayouda, Lesvos

A comment from Tom Cagley was the catalyst for my post “Design Communicates the Solution to a Problem”, and he’s done it again. I recently re-posted “Technical Debt – What it is and what to do about it” to the Iasa Global blog, and Tom was kind enough to tweet a link to it. In doing so, he added a little (friendly) barb: “Why not just do the work better?” My reply was “It’s certainly the best strategy, assuming you can“.

Obviously, we have a great deal of control over intentional coding shortcuts. It’s not absolute control; business considerations can intrude, despite our best advice to the contrary. There are a host of other factors that we have less control over: aging platforms, issues in dependencies, changing runtime circumstances (changes in load, usage patterns, etc.), and increased complexity due to organic growth are all factors that can change a “good” design into a “poor” one. As Tom has noted, some have extended the metaphor to refer to these as “Quality Debt, Design Debt, Configuration Management Debt, User Experience Debt, Architectural Debt”, etc., but in essence, they’re all technical deficits affecting user satisfaction.

Unlike coherent architectural design, these other forms of technical debt emerge quite easily. They can emerge singly or in concert. In many cases, they can emerge without your doing anything “wrong”.

Compromise is often a source of technical debt, both in terms of the traditional variety (code shortcuts) and in terms of those I listed above. While it’s easy to say “no compromises”, reality is far different. While I’m no fan of YAGNI, at least the knee-jerk kind, over-engineering is not the answer either. Not every application needs to be built for hundreds of thousands of concurrent users. This is compromise. Having insufficient documentation is a form of technical debt that can come back to haunt you if personnel changes result in loss of knowledge. How many have made a compromise in this respect? While it is a fundamental form of documentation, code is insufficient by itself.

A compromise that I’ve had to make twice in my career is the use of the shared database EAI pattern when an integration had to be in place and the target application did not have a better mechanism in place. While I am absolutely in agreement that this method is sub-standard, the business need was too great (i.e. revenue was at stake). The risk was identified and the product owner was able to make an informed decision with the understanding that the integration would be revised as soon as possible. Under the circumstances, both compromises were the right decision.

In my opinion, taking the widest possible view of technical debt is critical. Identification and management of all forms of technical debt is a key component of sustainable evolution of a product over its lifetime. Maintaining a focus on the product as a whole, rather than projects, provides a long-term focus should help make better compromises – those that are tailored to getting the optimal value out of the product. Having tunnel vision on just the code leaves too much room for surprises.

[Soap or oil press ruins in Panayouda, Lesvos Image by Fallacia83 via Wikimedia Commons.]

“Avoiding Platform Rot” on Iasa Global Blog

just never had the time to keep up with the maintenance

Is your OS the latest version? How about your web server? Database server? If not now, when?

A “no” answer to the first three questions is likely not that big a deal. There can be advantages to staying off the bleeding edge. That being said, the last question is the key one. If the answer to that is “I haven’t thought about it”, then there’s potential for problems.

See the full post on the Iasa Global Blog (a re-post, originally written for the old Iasa blog).

Technical Debt – What it is and what to do about it

This is gonna cost you

In spite of all that’s been written on the subject of technical debt, it’s still a common occurrence to see it defined as simply “bad code”. Likewise, it’s still common to see the solution offered being “stop writing bad code”. Technical debt encompasses much more than that simplistic definition, so while “stop writing bad code” is good advice, it’s wholly insufficient to deal with the subject.

Steve McConnell’s definition is much more comprehensive (and, in my opinion, closer to the mark):

A design or construction approach that’s expedient in the short term but that creates a technical context in which the same work will cost more to do later than it would cost to do now (including increased cost over time)

While it’s a better definition, I’d differ with it in three ways. Technical debt may not only incur costs due to rework of the original item, but also by making more difficult changes that are dependent on the original item. Technical debt may also end up costing nothing extra over time (due to a risk not materializing or because the feature associated with the debt is eliminated). Lastly, it should be noted that the cost of technical debt can extend beyond just effort by also affecting customer satisfaction.

In short, I define technical debt as any technical deficit that involves a risk of greater cost and/or end user dissatisfaction.

This definition encompasses debts that are taken on deliberately and rationally, those that are taken on impulsively, and those that are taken on unconsciously.

Code that is brittle, redundant, unnecessary, unclear, insecure, and/or untested is, of course, a type of technical debt. Although Bob Martin argues otherwise, the risk of costs to be paid clearly makes it so. Likewise, aspects of design can be considered technical debt, whether in the form of poor decisions, intentional shortcuts, decisions deferred too long, or architectural “drift” (losing design coherence via new features being added using new technologies/techniques without bringing older components up to date, or failing to evolve the system as the needs of the business change). Deferring bug fixes is a form of technical debt as is deferring automation of recurring support tasks. Dependencies can be a source of technical debt, both in regard to debt they carry and in terms of their fitness to your purpose. The platform that hosts your application is yet another potential source of technical debt if not maintained.

As noted above, the “interest” on technical debt can manifest as the need for rework and/or more effort in implementing changes over time. This increase in effort can come through the proliferation of code to compensate for the effects of unresolved debt or even just through increased time to comprehend the existing code base prior to attempting a change. As Ruth Malan has noted, strategy may drive architecture, but once the initial architecture is in place, it serves to both enable and constrain the strategy of the system going forward (strategies requiring major architectural changes typically must offer extremely high ROI to get approval). Time spent on manual maintenance tasks (e.g. running scripts to add new reference values) can also be a form of interest, considering that time spent there is time that could be spent on other tasks.

Costs associated with technical debt are not always a gradual payback over time as with an ordinary loan. Some can be like a debt to the mob: “they come at night, pointing a gun to your head, and they want their money NOW”. Security issues are a prime example of this type of debt. Obviously, debts that carry the danger of coming due with little or no notice should be considered too risky to take on.

Having proposed a definition for the term “technical debt” and identified the risks that it entails, it remains to discuss what to do about it. The first step is to recognize it when it’s incurred (or as soon as possible thereafter). For debt taken on deliberately, recognition should be trivial going forward. Recognition of existing debt in an established system may require discovery if it has not been cataloged previously. Debt that has been taken on unconsciously will always require effort to discover. In all cases, the goal is to maintain a technical debt backlog that is as comprehensive as possible. Maintaining this backlog provides insight into both the current state of the system and can inform risk assessments for future decisions.

Becoming aware of existing debt is a critical first step, but is insufficient in itself. Taking steps to actively manage the system’s debt portfolio is essential. The first step should be to stop unconsciously taking on new debt. Latent debt tends to fit into the immediate, unexpected payback model mentioned above. Likewise, steps taken to improve the quality up front (unit testing, code review, static analysis, process changes, etc.) should reduce the effort needed for detection and remediation on the back end. Architectural and design practices should also be examined. Too little design can be as counter-productive as too much. Striking the right balance can yield savings over the life of the application.

Deciding whether or not to take on intentional technical debt is less black and white. Often this type of debt is taken on for rational reasons. An example of this is what Ruth Malan characterizes as “…trading technical learning and code design improvement for market learning (getting features into user hands to improve them)”. Other times, the balance between risk and reward (whether time to market or budget) may tilt in favor of taking on a debt. When this is the case, it is critical that the owner(s) of the system make the decision in possession of the best possible information you can provide. An impulsive decision taken on the basis of “feel” rather than information will likely carry more risk.

Retiring old debt should be the final link in the chain. Just as the taking on of new debt should be done in a rational manner, so should the retirement of old debt. Not all debt carries the same risk/reward ratio and efforts that carry more bang for the buck will be an easier sell. Although some may disagree, I firmly believe that better outcomes will result from making those who own the system active partners in its development and evolution.

It’s highly unlikely that a system will be free of technical debt. Perversely, being free of such debt could actually be a liability. That being said, there is a world of difference between the two poles of debt-free and technical anarchy. Effort spent to rationally manage a system’s debt load will free up time to be put to better use.

Avoiding Platform Rot

(Mirrored from the Iasa Blog)

just never had the time to keep up with the maintenance

Is your OS the latest version? How about your web server? Database server? If not now, when?

A “no” answer to the first three questions is likely not that big a deal. There can be advantages to staying off the bleeding edge. That being said, the last question is the key one. If the answer to that is “I haven’t thought about it”, then there’s potential for problems.

“Technical Debt” is a currently a hot topic. Although the term normally brings to mind hacks and quick fixes, more subtle issues can be technical debt as well. A slide from a recent Michael Feathers presentation (slide 5) is particularly applicable to this:

Technical Debt is: the Effect of
Unavoidable Growth and the Remnants
of Inertia

New features tend to be the priority for systems, particularly early in their lifecycle. The plumbing (that which no one cares about until it quits working), tends to be neglected. Plumbing that is outside the responsibility of the development team (such as operating systems and database management systems) is likely to get the least attention. This can lead to systems running on platforms that are past their end of support date or a scramble to verify that the system can run on a later version. The former carries significant security risks while the latter is hardly conducive to adequately testing that the system will function identically on the updated platform. Additionally, new capabilities as well as operational and performance improvements may be missed out on if no one is paying attention to the platform.

One method to help avoid these types of issues is adoption of a DevOps philosophy, such as Amazon’s:

Amazon applies the motto “You build it, you run it”. This means the team that develops the product is responsible for maintaining it in production for its entire life-cycle. All products are services managed by the teams that built them. The team is dedicated to each product throughout its lifecycle and the organization is built around product management instead of project management.

This blending of responsibilities within a single team and focus on the application as a product (something I consider extremely beneficial) lessens the chance that housekeeping tasks fall between the cracks by removing the cracks. The operations aspect is enhanced by ensuring that its concerns are visible to those developing and the development aspect is enhanced by increased visibility into new capabilities of the platform components. The Solutions Architect role, spanning application, infrastructure, and business, is well placed to lead this effort.