Friday 24 June 2016

Confusing Assert and Throw

One of the earliest programming books I read, and one I still have a habit of citing fairly often, is Steve Maguire’s Writing Solid Code. It introduced me to the idea of using “asserts” to find bugs where programming contracts had been violated, and to the wider concept of Design by Contract. I thought I had understood “when and where” to use asserts (as opposed to signalling errors, e.g. by throwing an exception), but one day many years back a production outage brought home to me how I had been lured into misusing asserts when I should have been generating errors instead.

The Wake Up Call

I was one of a small team developing a new online trading system around the turn of the millennium. The back end was written in C++ and the front-end was a hybrid Java applet/HTML concoction. In between was a custom HTTP tunnelling mechanism that helped transport the binary messages over the Internet.

One morning, not long after the system had gone live, the server hung. I went into the server room to investigate and found an assert dialog on the screen telling me that a “time_t value < 0” had been encountered. The failed assert had caused the server to stop itself to avoid any potential further corruption and wait in case we needed to attach a remote debugger. However, all the server messages were logged and it was not difficult to see that the erroneous value had come in via a client request. Happy that nothing really bad had happened, we switched the server back on, apologised to our customers for the unexpected outage, and I contacted the user to find out more about what browser, version of Java, etc. they were using.

Inside the HTML web page we had a comma-separated list of known Java versions that appeared to have broken implementations of the Date class. The applet would refuse to run if it found one of these. This particular user had an even more broken JVM version that was returning -1 for the resulting date/time value represented in milliseconds. Of course the problem within the JVM should never have managed to “jump the air gap” and cause the server to hang; hence the more important bug was how the bad request was managing to affect the server.

Whilst we had used asserts extensively in the C++ codebase to ensure programmatic errors would show up during development, we had forgotten to put validation into certain parts of the service - namely the message deserialization. The reason the assert halted the server instead of silently continuing with a bad value was that we were still running a debug build, as performance at that point didn’t demand we switch to a release build.

Non-validated Input

In a sense the assert was correct in that it told me there had been a programmatic error; the error being that we had forgotten to add validation on a message which had come from an untrusted source. Whereas something like JSON coming in over HTTP on a public channel is naturally treated as untrusted, when you own the server, client, serialization and encryption code it’s easy to be lured into a false sense of security that you control the entire chain. But we didn’t control the JVM, and that proved to be our undoing [1].
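By way of illustration, the missing check amounts to little more than the following C# sketch (the real server was C++, and all the names here are invented):

using System.IO;

public static class RequestReader
{
    // Hypothetical deserialization helper: values arriving over the wire
    // are validated before the rest of the server is allowed to see them.
    public static long ReadTimestamp(BinaryReader reader)
    {
        long timestampMs = reader.ReadInt64();

        // The broken JVM sent -1 here; reject bad input at the boundary
        // rather than let it reach the core as a negative time value.
        if (timestampMs < 0)
            throw new InvalidDataException("invalid timestamp: " + timestampMs);

        return timestampMs;
    }
}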

Asserts in C++ versus C#

In my move from C++ towards C# I discovered there was a different mentality between the two cultures. In C++ the philosophy is that you don’t pay for what you don’t use, and so anything slightly dubious can be denounced as “undefined behaviour” [2]. There is also a different attitude towards debug and release builds, where the former often performs poorly in exchange for more rigorous checking around contract violations. In a (naïve) sense you shouldn’t have to pay for the extra checks in a release build because you don’t need them, having already removed (in theory) any contract violations via testing with the debug build.

Hence it is common for C++ libraries such as STLport to highlight entering into undefined behaviour using an assert. For example iterators in the debug build of STLport are implemented as actual objects that track their underlying container and can therefore detect if they have been invalidated due to modifications. In a release build they might be nothing more than an alias for the built-in pointer type [3].

However in C# the contract tends to be validated in a build-agnostic manner through the use of checks that throw an explicit ArgumentException / ArgumentNullException style exception. If no formal error checking is done up front then you’ll probably encounter a terser NullReferenceException or IndexOutOfRangeException when you try and dereference / index on the value sometime later.
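For example, the up-front style typically looks something like this (the class is made up purely for illustration):

using System;

public class OrderBook
{
    private readonly string _symbol;

    public OrderBook(string symbol)
    {
        // Formal, build-agnostic validation: this check behaves the
        // same way in both debug and release builds.
        if (symbol == null)
            throw new ArgumentNullException(nameof(symbol));

        _symbol = symbol;
    }
}

Without the check the failure still shows up, just later and more tersely, as a NullReferenceException at the first point _symbol is dereferenced.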

Historically many of the assertions that you see in C++ code are validations around the passing of null pointers or index bounds. In C# these are catered for directly by the CLR, and so it’s probably unsurprising that there is a difference in culture. Given that it’s (virtually) impossible to dereference an invalid pointer in the CLR, and dereferencing a null one has well defined behaviour, there is not the same need to be so cautious about memory corruption caused by dodgy pointer arithmetic.

Design by Contract

Even after all this time the ideas behind Design by Contract (DbC) still do not appear to be very well known or considered useful. .NET 4 added support for Code Contracts but I have only seen one codebase so far which has used them, and even then its use was patchy. Even Debug.Assert, as a very simple DbC mechanism, hardly ever appears to be used, presumably because so little use is made of debug builds in the first place (so the asserts would never fire anyway).
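For completeness, the Debug.Assert style looks like this; the call is compiled away entirely unless the DEBUG symbol is defined, which is exactly why it never fires when only release builds get exercised (the class is, once again, invented):

using System.Diagnostics;

internal class PriceCache
{
    public void Update(string symbol, decimal price)
    {
        // Only present in builds compiled with the DEBUG symbol;
        // in a release build this line vanishes completely.
        Debug.Assert(price > 0, "prices must be positive");

        // ...
    }
}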

Consequently I have tended to use my own home-brew version which is really just a bunch of static methods that I use wherever necessary:

Constraint.MustNotBeNull(anArgument, "what it is");

When these trigger they throw a ConstraintViolationException that includes a suitable message. When I first adopted this in C# I wasn’t sure whether to raise an exception or not, and so I made the behaviour configurable. It wasn’t long before I stripped that out, along with the Debug.Assert() calls, and just made throwing the exception the default behaviour. I decided that if performance was ever going to be an issue I would deal with it when I had a real example to optimise for. So far that hasn’t happened.
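A minimal sketch of how such a helper might hang together is shown below; the real thing has more methods and richer messages, but the shape is the same (the MustBeTrue sibling is included because it features in the next example):

using System;

public class ConstraintViolationException : Exception
{
    public ConstraintViolationException(string message)
        : base(message)
    { }
}

public static class Constraint
{
    // Always enabled, in every build flavour; throwing is the default
    // behaviour rather than being configurable.
    public static void MustNotBeNull(object value, string description)
    {
        if (value == null)
            throw new ConstraintViolationException(description + " must not be null");
    }

    public static void MustBeTrue(bool condition, string description)
    {
        if (!condition)
            throw new ConstraintViolationException(description);
    }
}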

Always Enabled

If I were in any doubt as to their value, I got another reminder a while back when developing an in-house compute grid. I added a whole bunch of asserts inside the engine management code which, strictly speaking, weren’t essential as long as the relationship between the node manager, engine and task didn’t change.

One day someone added a bit of retry logic to allow the engine to reconnect back to its parent manager, which made perfect sense in isolation. But now the invariant was broken, as the node manager didn’t know about engines it had never spawned. When the orphaned task finally completed it re-established the broken connection to its parent and tried to signal completion of its work. Only it wasn’t the same work dished out by this instance of the manager. Luckily an internal assert detected the impedance mismatch and screamed loudly.
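Conceptually the check amounted to no more than this (the names are invented for the sketch, and it leans on the hypothetical MustBeTrue helper from earlier):

using System;
using System.Collections.Generic;

public class NodeManager
{
    private readonly HashSet<Guid> _spawnedEngines = new HashSet<Guid>();

    public void OnTaskCompleted(Guid engineId)
    {
        // Internal invariant: completion signals must come from an
        // engine this manager instance actually spawned. The retry
        // logic silently broke this assumption.
        Constraint.MustBeTrue(_spawnedEngines.Contains(engineId),
            "task completed by an engine this manager did not spawn");

        // ...
    }
}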

This is another example of how easily you can get lulled into a false sense of security because you own all the pieces and the change you’re making seems obvious (retrying a broken connection). However there may be invariants you don’t even realise exist that will break, because cause and effect are so far apart.

Confused About the Problem

The title for this post has been under consideration for the entire time I’ve been trying to write it. Twice I seriously considered changing it to “Confusing Assert and Validation”, as essentially the decision about whether or not to throw is about how to handle the failure in each scenario. But I’ve stuck with the original title because it highlights my confused thinking.

I suspect that the reason for this confusion is that the distinction between a contract violation and input validation is quite subtle. In my mind the validation of inputs happens at the boundary of a component, and it only needs to be done once. When you are inside that component you should be able to assume that validation has already occurred. But, as I described earlier, a bug can allow something to slip through the net, hence input validation happens at the external boundary while contract enforcement happens at the internal boundaries, e.g. a public versus internal class or method [4].
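In C# terms the split might look something like this (again with invented names, and reusing the hypothetical Constraint helper from earlier):

using System;

public class Request { /* hypothetical message type */ }

public class MessageProcessor
{
    // External boundary: formal validation that is part of the type's
    // contract and covered by tests.
    public void Process(Request request)
    {
        if (request == null)
            throw new ArgumentNullException(nameof(request));

        ProcessCore(request);
    }

    // Internal boundary: by now the input should already be valid, so
    // a failure here means a genuine bug, not merely bad input.
    internal void ProcessCore(Request request)
    {
        Constraint.MustNotBeNull(request, "request");

        // ... the real work ...
    }
}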

If the input validation triggers, its significance is probably lower because the check is doing exactly the job it was intended to do; but if a contract violation fires it suggests there is a genuine bug. Also, by throwing different exception types it’s easier to distinguish the two cases, and then any Big Outer Try Blocks can recover in different ways if necessary, e.g. continue vs shutdown.
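At the Big Outer Try Block the two failure modes can then be told apart with something along these lines:

try
{
    processor.Process(request);
}
catch (ArgumentException e)
{
    // The validation did its job: reject the request and carry on
    // serving other clients.
    Console.Error.WriteLine("Rejected invalid request: " + e.Message);
}
catch (ConstraintViolationException e)
{
    // A contract violation implies a genuine bug: safer to shut down
    // than risk propagating garbage any further.
    Console.Error.WriteLine("Contract violation: " + e.Message);
    Environment.Exit(1);
}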

Another way to look at the difference is that input validation is a formal part of the type’s contract and therefore I’d write tests to ensure it is enforced correctly. However I would not write tests to ensure any internal contracts are enforced (any more than I would write tests for private methods).
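So, using the earlier (invented) MessageProcessor as an example, I would happily write a test like this for the public boundary (xUnit used for illustration), but not an equivalent one for the internal method:

using System;
using Xunit;

public class MessageProcessorTests
{
    [Fact]
    public void process_rejects_a_null_request()
    {
        var processor = new MessageProcessor();

        Assert.Throws<ArgumentNullException>(() => processor.Process(null));
    }
}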

Epilogue

I started writing this post a few years ago and the fact that it has taken this long shows that even now I’m not 100% sure I really know what I’m doing :o). While I knew I had made mistakes in the past and I thought I knew what I was doing wrong, I wasn’t convinced I could collate my thoughts into a coherent piece. Then my switch to a largely C# based world had another disruptive effect on my thinking and succeeded in showing that I really wasn’t ready to put pen to paper.

Now that I’ve finally finished, I’m happy with where it landed. I’ve been bitten enough times to know that shortcuts get taken, whether consciously or not, and that garbage will end up getting propagated. I can still see value in the distinction between the two approaches, but am happy to acknowledge that it’s subtle enough not to get overly hung up about. Better error handling is something we can always do more of, and doing it one way or another is what really matters.

[1] Back then the common joke about Java was “write once, test everywhere”, and I found out first-hand why.

[2] There is also the slightly more “constrained” variant that is implementation-defined behaviour. It’s a little less vague, if you know what your implementation will do, but in a general sense you still probably want to assume very little.

[3] See “My Kingdom for a C# Typedef” for why this kind of technique isn’t used.

[4] The transformation from primitive values to domain types is another example, see “Primitive Domain Types - Too Much Like Hard Work?”.

1 comment:

  1. Enjoyable & informative post. I like the point that once you have done pre-condition checking at your external boundaries you can assume that things are good.

    In C++ I do this by taking the incoming types as pointers and, once determined to be not-null, switching to references. In Swift it's easily done by switching from optionals to non-optionals.

    In terms of asserts & throws, I think that unless you're in a tight, performance-sensitive loop you shouldn't use asserts; just throw. Use asserts only in those places, though even there it's ideally better to return success/failure and test for it (this can be forced by RAII blocks that assert in their destructors if the return value is not checked).

    Even if the exception handling is just a graceful exit.
