Design Away Error Handling?

Evil Monkey Pointing

Writing is an interesting process. Some posts spring to life; ignited by some inspiration, they swiftly flow from fingertips to (virtual) page. Other posts simmer. An idea is half-conceived, then languishes incomplete. It sits in the corner staring at you balefully, a reproach for your lack of commitment. In the case of this one, it sat for the better part of a year because I wasn’t quite sure which side I wanted to come down on.

It started with a fairly uncontroversial tweet from Michael Feathers: “Spend more time designing away errors so that you don’t have to handle them.” On its face, this is reasonable; eliminating error vectors should lead to a more robust product. Some warning flags appear, however, when you read the stream leading up to that tweet:

Francis Fish’s point about context (“…medical equipment and (say) android app have totally diff needs…”) certainly applies, as does Feathers’ reply. Whenever I see the word “best” devoid of context, my credibility detector bottoms out. It’s the response to Brian Knapp (“Yeah, they are better not used at all. :)”) that is worrisome. Under some circumstances, throwing an exception when an error condition occurs is the right answer.

Having default values for parameters is one technique for designing away errors. Checking for problem conditions, such as insufficient disk space or a lost network connection, prior to use is another. The key thing to remember is that these techniques assume that the problem is an expected one and that something can be done about it. Checking for space or connectivity is useless if you don’t have an alternate location to write to or if you lack the ability to restore the connection. Likewise, use of a default value is only appropriate when there is a meaningful default.
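To make that concrete, here’s a minimal Java sketch (the class, paths, and page-size default are all hypothetical) of both techniques: an overload supplying a meaningful default, and a precondition check that only earns its keep because there is a fallback location to write to:

```java
import java.io.IOException;
import java.nio.file.FileStore;
import java.nio.file.Files;
import java.nio.file.Path;

public class ReportWriter {

    // Technique 1 -- default values: the single-argument overload supplies a
    // meaningful default, so callers can't "forget" the page size.
    public static String paginate(String text) {
        return paginate(text, 50); // 50 lines per page is a sensible default here
    }

    public static String paginate(String text, int linesPerPage) {
        // pagination logic elided for brevity
        return text;
    }

    // Technique 2 -- check before use: only worthwhile because there is an
    // alternate location to fall back to when the primary volume is full.
    public static void write(Path primaryDir, Path fallbackDir, String name, byte[] content)
            throws IOException {
        Path dir = hasSpaceFor(primaryDir, content.length) ? primaryDir : fallbackDir;
        Files.write(dir.resolve(name), content);
    }

    private static boolean hasSpaceFor(Path dir, long bytes) throws IOException {
        FileStore store = Files.getFileStore(dir);
        return store.getUsableSpace() > bytes;
    }
}
```

If there were no fallback directory and no meaningful page size, neither technique would apply, and an exception would be back on the table.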

The thing to remember is that avoiding an exception is not the goal; correct execution and a valid state are. If you’re transferring money between accounts, you want to be able to trust that either the transaction completed and the balances were adjusted, or that you know something went wrong. Silent failures are much more of a problem than noisy errors. As Jef Claes noted in “Tests as part of your code”, silent failures can put you in the newspaper (and not in a good way).
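For illustration, a minimal JDBC sketch of that property (the account table and column names are assumptions): both balance updates commit together or the whole transfer rolls back, and in either case the caller hears about it.

```java
import java.math.BigDecimal;
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

public class Transfers {

    // Either the transaction completes and both balances are adjusted,
    // or the caller gets a noisy exception. Nothing here fails silently.
    public static void transfer(Connection conn, long fromId, long toId, BigDecimal amount)
            throws SQLException {
        conn.setAutoCommit(false);
        try (PreparedStatement debit = conn.prepareStatement(
                 "UPDATE account SET balance = balance - ? WHERE id = ? AND balance >= ?");
             PreparedStatement credit = conn.prepareStatement(
                 "UPDATE account SET balance = balance + ? WHERE id = ?")) {

            debit.setBigDecimal(1, amount);
            debit.setLong(2, fromId);
            debit.setBigDecimal(3, amount);
            if (debit.executeUpdate() != 1) {
                throw new SQLException("Insufficient funds or unknown account: " + fromId);
            }

            credit.setBigDecimal(1, amount);
            credit.setLong(2, toId);
            credit.executeUpdate();

            conn.commit();
        } catch (SQLException e) {
            conn.rollback();   // leave the balances untouched...
            throw e;           // ...and make sure somebody hears about it
        }
    }
}
```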

A more recent Twitter exchange involving Feathers returned to this same issue:

The short answer is yes, it’s a bug. Otherwise, defects found in code reviews would not count as defects because they had not happened “live”. The last tweet in that stream summed it up nicely:

We cannot rely on design alone to eliminate error conditions because we cannot foresee all potential issues. Testing shares this same dilemma. I believe Arlo Belshee strikes the right balance in “Treat bugs as fires”. Fire departments concentrate foremost on preventing fires while still extinguishing those that slip through. When one does occur, it’s treated as a learning experience. So too should we treat error conditions. Dan Cresswell put it nicely:

Handling exceptions is tedious, but critical. Where we can remove risks, or at least reduce them via design, so much the better. We cannot, however, rely on our ability to foresee every circumstance. Chaos Monkey chooses you.

[Hat tip to Lorrie MacVittie for tweeting the evil monkey image above]


“Error Handling – No News is Really Bad News” on Iasa Global Blog

Editorial Cartoon - Titanic

A recent post on The Daily WTF highlighted a system that “…throws the fewest errors of any of our code, so it should be very stable”. The punchline, of course, was that the system threw so few errors because it was catching and suppressing almost all the errors that were occurring. Once the “no news is good news” code was removed, the dysfunctional nature of the system was revealed.

See the full post on the Iasa Global Blog (a re-post, originally published here).

Hatin’ on Nulls

Dante's Inferno; Lucifer, King of Hell

When I first read Christian Neumanns’ “Why We Should Love ‘null'”, I found myself agreeing with his position. Yes, null references have “…led to innumerable errors, vulnerabilities, and system crashes, which have probably caused a billion dollars of pain and damage…” per Sir C. A. R. Hoare. Yes, many people heartily dislike null references and will go to great lengths to work around the problem. Finally, yes, these workarounds may be more detrimental than the problem they are intended to solve. While I agreed with the position that null references are a necessary inconvenience (the ill effects are ultimately the result of failure to check for null, not the null condition itself), I didn’t initially see the issue as being particularly “architectural”.

Further on in the article, however, Christian covered why null references and the various workarounds become architecturally significant. The concept of null (nothing) is semantically important. A price of zero dollars is not intrinsically the same as a missing price. A date several millennia into the future does not universally convey “unknown” or “to be determined”. Using the null object pattern may eliminate errors due to unchecked references, but it’s far from “safe”. According to Wikipedia, “…a Null Object is very predictable and has no side effects: it does nothing“. That, however, is untrue. A Null Object masks a potential error condition and allows the user to continue on in ignorance. That, in my opinion, is very much doing something.
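A small sketch of the difference (the pricing lookup and SKU are hypothetical): the null-object style quietly substitutes a zero price and lets the caller continue in ignorance, while an explicit Optional forces the absence to be acknowledged, or turned into a noisy error.

```java
import java.math.BigDecimal;
import java.util.Map;
import java.util.Optional;

public class Pricing {

    private final Map<String, BigDecimal> prices;

    public Pricing(Map<String, BigDecimal> prices) {
        this.prices = prices;
    }

    // Null Object-ish approach: "does nothing" by returning a harmless-looking
    // zero. A missing price now looks exactly like a free product.
    public BigDecimal priceOrZero(String sku) {
        return prices.getOrDefault(sku, BigDecimal.ZERO);
    }

    // Explicit absence: the type tells the caller that "no price" is a real
    // possibility that has to be handled.
    public Optional<BigDecimal> price(String sku) {
        return Optional.ofNullable(prices.get(sku));
    }
}

// Usage sketch:
// BigDecimal p = pricing.price(sku)
//         .orElseThrow(() -> new IllegalStateException("No price for " + sku));
```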

A person commenting on Christian’s post stated that “…a crash is the worst kind of experience a user can have”, arguing that masking a null reference error may not be as bad for the user as a crash. There’s a kernel of truth there, but it’s a matter of risk. If an application continues on and the result is a misunderstanding of what’s been done or, worse, corrupted data, how bad is that? If the application in question is a game, there’s little real harm. What if it’s dealing with health information? I stand by the position that where there is an error, no news is bad news.

As more and more applications are service-enabled and become platforms, semantic issues gain importance. Versioning strategies can ensure structural compatibility, but semantic issues can still break clients. Coherence and consistency should be considered hallmarks of an API. As Erik Dietrich noted in “Notes on Writing Discoverable Framework Code”, a good API should “make screwing up impossible”. Ambiguity makes screwing up very possible.

Error Handling – No News is Really Bad News

Did you think I wouldn't notice?

A recent post on The Daily WTF highlighted a system that “…throws the fewest errors of any of our code, so it should be very stable”. The punchline, of course, was that the system threw so few errors because it was catching and suppressing almost all the errors that were occurring. Once the “no news is good news” code was removed, the dysfunctional nature of the system was revealed.
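“No news is good news” code in stories like that one usually boils down to an empty catch block. A hypothetical before-and-after sketch (the service and class names are invented for illustration):

```java
public class EnrollmentJob {

    // "No news is good news": the error is destroyed along with any chance of
    // acting on it. The system looks stable because it has stopped talking.
    void submitQuietly(EnrollmentService service, Application app) {
        try {
            service.submit(app);
        } catch (Exception e) {
            // swallowed -- nothing logged, nothing rethrown, nothing retried
        }
    }

    // Noisy alternative: record what happened and let the caller (and
    // ultimately the user) find out that something went wrong.
    void submit(EnrollmentService service, Application app) {
        try {
            service.submit(app);
        } catch (RuntimeException e) {
            System.err.println("Enrollment failed for " + app.id() + ": " + e);
            throw e;
        }
    }

    // Hypothetical collaborators, just enough to make the sketch compile.
    interface EnrollmentService { void submit(Application app); }
    record Application(String id) {}
}
```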

On one level, it’s funny to think of a system being considered “very stable” on the basis of it destroying the evidence of its failures. Anyone who has been in software development for any length of time probably has a war story about a colleague who couldn’t tell the difference between getting rid of error messages and correcting the error condition. However, if the system in question is critical to the user’s personal or financial well-being, then it’s not so amusing. Imagine thinking you had health insurance because the site where you enrolled said you did, and finding out later that you really didn’t.

Developing software that accomplishes something isn’t trivial, but then again, it isn’t rocket science either. Performing a task when all is correct is the easy part. We earn our money by how we handle the other cases. This is not only a matter of technical professionalism, but also a business issue. End users are likely to be annoyed if our applications leave them stranded out of town or jumping through hoops only to be frustrated at the end of the process.

Better an obvious failure than a mystery that leaves the user wondering if the system did what it said it did. Mystery impairs trust, which is a key ingredient in the customer relationship.

All of the above was written with the assumption of incompetence rather than malice. However, a comment from Charlie Alfred made during a Twitter discussion about technical debt raised another possibility:

Wonder if such a thing as “Technical Powerball”? Poor design, unreadable code, no doc, but hits jackpot anyway 🙂

Charlie’s question doesn’t assume bad intent, but it occurred to me that if “jackpot” is defined as “it just has to hold together ’til I clear the door and cash the check”, then perhaps a case of technical debt is really a case of “Technical Powerball”. Geek and Poke put it well: