Knowing the Price of Everything and the Value of Nothing

Are they worth it?

Late on a September afternoon in 1812, outside a village on the road to Moscow, Napoleon had a problem. After nine hours of grinding battle in which both armies sustained massive losses, the Russian army was on the verge of disintegration. Napoleon's staff begged him to commit his elite Imperial Guard and complete the victory, but Marshal Bessières asked, "Will you risk your last reserves eight hundred miles from Paris?" He would not, and although the French army marched into Moscow a week later, on September 14, it marched back out five weeks after that, retreating to the Polish border. Nearly five sixths of the 685,000-man army that started the campaign had been lost, and the end of Napoleon's control of the continent was in sight.

While my inner history geek finds this fascinating in and of itself, there is an architecturally significant moral to this story. The lack of reserves limits your options, and the limited set of options you're left with tends to range from bad to worse. Rather than reserves of soldiers, supplies, and ammunition, we deal in reserves of storage, memory, processor, and bandwidth. Exhausting these reserves can lead to catastrophe.

It can be tempting for some to seek out high levels of utilization. In their minds, a system that spends the majority of its time at fifty percent utilization is wasting fifty percent of its resources. After all, while storage, memory, processor, and bandwidth are cheaper than in the past, the cost is still non-zero. Far better, in their opinion, to more closely manage the allocation and eliminate the waste.

The problem, of course, is that exceeding the critical level of a resource degrades the system. Whether that degradation takes the form of a crash or is handled more gracefully via throttling, queuing, suspending functionality, and so on, it is still degradation. The only way to prevent it is to ensure that sufficient excess capacity is in place to handle peak loads. For storage, this means enough space to hold the data generated at peak load for the period it would take to recognize and respond to the impending shortage.
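The sizing rule above can be sketched as a simple calculation: reserve capacity equals the peak consumption rate multiplied by the time it takes to detect and react to a shortage. This is a minimal illustration, not a recommendation; the function name, numbers, and safety factor are all hypothetical.

```python
# Sketch of the storage-reserve rule described above: hold enough
# space to absorb peak-load writes for as long as it takes to notice
# and react to an impending shortage. All numbers are hypothetical.

def storage_reserve_bytes(peak_write_rate_bps: float,
                          detect_seconds: float,
                          respond_seconds: float,
                          safety_factor: float = 1.25) -> float:
    """Excess capacity needed to ride out peak load while you react."""
    window = detect_seconds + respond_seconds
    return peak_write_rate_bps * window * safety_factor

# e.g. 50 MB/s peak writes, 15 minutes to detect, 45 minutes to
# bring more capacity online -> reserve for a one-hour window, padded
reserve = storage_reserve_bytes(50e6, 15 * 60, 45 * 60)
```

Note that the window is the *response* time, not some average interval: if provisioning more storage takes days rather than minutes, the reserve must cover days of peak-rate growth.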

Peak load is the critical metric, as it reflects the worst-case scenario. Where resources are shared, use the combined peak load across all systems. Average load is useless in this context because it smooths out the peaks and valleys (remember: if you stick one foot in a bucket of ice water and the other in a bucket of boiling water, on average you're comfortable).
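A few lines of code make the ice-water/boiling-water point concrete. The load samples below are made up for illustration; the only claim is arithmetic: the mean of a spiky series looks comfortable while the peaks sit near saturation.

```python
# Made-up utilization samples (%) for a system that is usually idle
# but spikes hard twice. The average says "relaxed"; the peak says
# "one bad moment away from degradation."
samples = [20, 25, 30, 95, 22, 28, 90, 24]

average = sum(samples) / len(samples)
peak = max(samples)

print(f"average: {average:.2f}%")  # 41.75% -- looks fine
print(f"peak:    {peak}%")         # 95% -- the number that matters
```

Capacity planned against the 41.75% average would be exhausted twice in this window; planning against the peak is what keeps reserves intact.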

Maintaining reserve capacity guarantees that some resources will sit unused, and unused resources look like wasted money, just as backups, disaster recovery environments, and insurance all look like wasted money (until they're needed). Obsessing over the cost of those excess resources without factoring in the cost of an outage or slowdown is a perfect example of being penny-wise and pound-foolish.


9 thoughts on "Knowing the Price of Everything and the Value of Nothing"

  1. This post resonates with Hayim Makabee's post on technical debt and risk aversion: not investing in capacity for peak periods is also a kind of debt.
    That investment can take the shape of enabling the product(s) to operate at peak within the available resources, of procuring more resources, or both.
    It is tempting to postpone such decisions until it's too late, hoping that everything will be OK.
    But the risk of losing business or prestige is typically much, much higher than the cost of preparing for it.


    • Indeed, Hayim is definitely one of those I pay close attention to and I agree that this is one of those examples of technical debt that people don’t normally think of as such. I forget where I heard it, but I remember someone saying that if you think insurance is expensive, try a disaster without it.


  2. Excellent post. The idea of peak load as a critical metric is important. It is also important to consider peak load as a predictor of impending systemic failure: as peak load approaches maximum capacity, technical debt increases. Finding the balance point is important for teams and leaders.


    • Thanks, Tom. You’re absolutely correct that peak load can be a predictor of failure (assuming that load is close to available capacity). That’s why Kris’s Tweet really grabbed my attention – lack of reserves makes you a hostage to fortune.


  3. I use a combination of PPL (Predictable Peak Load) and MTBF to lay out the characteristics of a system, be it hardware or software. Works well in practical life. Learned this from years of software engineering in the aerospace sector.


  4. I just found this post and think it's great. My original tweet was aimed at the South West London train network, where many lines run through a single station (Wimbledon) that is at 100% capacity. As soon as anything goes wrong there, however minor, it leads to massive delays up and down the line, with unexpected knock-on consequences as connections are missed, rolling stock ends up in the wrong place, and so forth. I had the IT architecture parallel in my head, and this fleshes it out brilliantly.

    The elasticity of cloud computing promises much in this area. I just wish we could scale out Wimbledon station so easily!


    • Thanks Kris, very glad you liked the post.

      Re: Wimbledon station, those architecting physical things have at least some excuse for running out of capacity (uncertainty around how much excess capacity to build, time & expense of bringing more online, etc.). In our line, the same issues apply, but at a tiny fraction of what’s there in the real world.

