As a general rule of thumb I’ve always been fond of the principles behind failing fast - fail sooner rather than ignore/sidestep the issue and hope something magic happens later. But, depending on the context, the notion of failure is different and this needs to be kept in mind - one size never fits all.
I first came across the idea of failing fast back in the ‘90s when I was working at a software house writing desktop applications in C. Someone had just bought Steve Maguire’s excellent book Writing Solid Code, which promoted the idea of using ASSERTs to detect and report bugs as soon as possible during development. Windows 3.x had a couple of functions (IsBadReadPtr and IsBadWritePtr) that allowed you to “probe” a piece of memory to see if it was valid before dereferencing it, so that you could ASSERT with some useful context rather than just crash (then called a UAE, and later a GPF) in an ugly mess.
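To make that concrete, here is a minimal sketch of the kind of argument checking the book promoted. The ASSERT macro and Customer type are stand-ins of my own; IsBadReadPtr is the genuine (and long since deprecated) Windows call:

```cpp
#include <windows.h>
#include <assert.h>
#include <stdio.h>

#define ASSERT(expr) assert(expr)   // stand-in for the book's richer ASSERT

typedef struct { int id; char name[32]; } Customer;

void PrintCustomer(const Customer* customer)
{
    // Probe the pointer up front so a bad value is reported here, with some
    // context, rather than blowing up (UAE/GPF) somewhere deeper down.
    ASSERT(customer != NULL);
    ASSERT(!IsBadReadPtr(customer, sizeof(Customer)));

    printf("%d: %s\n", customer->id, customer->name);
}
```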
The flip side to this is that developers could then engage in the act of Defensive Programming. This is another term that has both good and bad connotations. I’m using it to mean code that protects itself at the expense of the caller and so tries to avoid being the victim. Why would a developer do this? Because prior to the modern days of shared code ownership, if an application crashed in your code it was your fault - guilty until proven innocent. It was quite likely you were passed a bum value and just happened to be the first one to touch it. This was epitomised by the lstrcpy() Windows function, as shown by this nugget from the SDK documentation:-
“When this function catches SEH errors, it returns NULL without null-terminating the string and without notifying the caller of the error”
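For illustration, here is a hedged sketch of what that defensive style might look like - wrapping the copy in SEH and quietly swallowing the damage, just as the documentation describes. This is not the real implementation, merely the shape of it:

```cpp
#include <windows.h>

// Not the real lstrcpy(), just the defensive shape the documentation hints at:
// if the caller hands us a bum pointer we silently bail out rather than crash.
char* DefensiveCopy(char* dest, const char* src)
{
    __try
    {
        return lstrcpyA(dest, src);
    }
    __except (EXCEPTION_EXECUTE_HANDLER)
    {
        return NULL;    // no crash in our code, and no clue for the caller either
    }
}
```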
So, in this context failing fast is all about not letting an error go undetected for too long. And if you can detect the problem, don’t mask it, as the side-effects could be much worse - corrupt data - and then the user really will be annoyed! When you encounter a Blue Screen of Death (BSOD) this is Windows telling you it would rather ruin your day now than limp along and potentially allow all your data to be slowly mangled in the background.
The second meaning of failing fast I formally met in Michael Nygard’s splendid book Release It! In this context we are talking about Distributed Systems where many different processes are talking to one another. The kinds of scenarios Michael describes relate to the inefficient use of resources that is a side-effect of some failure further downstream; a failure that is then only detected after some tangential processing has already taken place. The idea is that if you know a priori that the latter resource is borked, you can avoid wasting any further resources now by failing up front. More importantly, in a system that handles a variety of tasks, this avoids unnecessarily impacting unrelated work (e.g. by hogging request processing threads).
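As a rough sketch of the idea (the Request and DatabaseGateway types below are hypothetical), the check on the known-dodgy dependency happens before any real work is done:

```cpp
#include <stdexcept>
#include <string>

struct Request { std::string payload; };

class DatabaseGateway
{
public:
    bool IsAvailable() const { return m_available; }    // fed by a health check
    void SetAvailable(bool available) { m_available = available; }
    void Store(const Request& /*request*/) { /* write to the database */ }
private:
    bool m_available = true;
};

void ProcessRequest(DatabaseGateway& database, const Request& request)
{
    // Fail fast: check the expensive dependency before doing any real work.
    if (!database.IsAvailable())
        throw std::runtime_error("database unavailable - failing fast");

    // ...expensive parsing, validation and enrichment would happen here...
    database.Store(request);
}
```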
He also addresses one of my own pet peeves in the book - infinite timeouts. I’ve always believed that virtually no code should wait “forever”. Yes, it’s often hard to find out (or come up with) how long someone is willing to wait for an action to happen, but I’m pretty sure it’s unlikely to be forever. Debugging and stress testing add an extra dimension to this problem because you may be willing to wait much longer than normal, and so perhaps the timeout needs to be easily configurable. Michael has some great examples in his book about the effects of infinitely waiting on connection pools, etc. and the chain of events this can cause.
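A toy example of the principle - a connection pool acquire that takes a timeout (which in real life would come from configuration) rather than blocking the caller indefinitely:

```cpp
#include <chrono>
#include <condition_variable>
#include <mutex>

class ConnectionPool
{
public:
    explicit ConnectionPool(int size) : m_free(size) {}

    // Returns false if no connection became free within the timeout,
    // instead of parking the caller forever.
    bool Acquire(std::chrono::milliseconds timeout)
    {
        std::unique_lock<std::mutex> lock(m_lock);
        if (!m_available.wait_for(lock, timeout, [this] { return m_free > 0; }))
            return false;   // caller can fail fast, retry or report an error
        --m_free;
        return true;
    }

    void Release()
    {
        { std::lock_guard<std::mutex> lock(m_lock); ++m_free; }
        m_available.notify_one();
    }

private:
    std::mutex m_lock;
    std::condition_variable m_available;
    int m_free;
};
```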
If you’ve been wondering when I’m finally going to get around to the title of this blog post, stay with me - it’s coming…
Applying the above patterns to certain parts of your system, say, the middle tier, without considering the side-effects has the potential to create a Black Hole. This is either a service that sucks in requests at an alarming rate - because the moment it starts work on them it detects they will fail and so can move on to fresh work that much quicker - or the empty space created by a service that’s shut itself down. Either way your system has turned itself into an expensive NOP.
I’ve also seen systems exhibit Black Hole-like behaviour for other reasons, most commonly when the process has reached its memory limit due to a leak or massive heap fragmentation. The danger here is putting something in the infrastructure code to react to this kind of event in a way that might be detrimental to the system’s other components. It’s not nice, but a desktop app restarting does not have quite the same potential to ruin the business’s day as a distributed failure where a rogue service or pooled process has gone AWOL and left a gaping chasm into which requests are falling. The other common problem, at least in development environments, is mis-configured or corrupt configurations. This is more an annoyance due to wasted test time, but it does give you the opportunity to see how your system could fail in reality.
The key word in that earlier phrase (alarming rate) is obviously “alarming”. There must always be something watching each service or process to notice when the normal heartbeat of the system is disrupted. If you do want to take a common approach to an out-of-memory condition by shutting the process or service down, make sure there is something around to restart it as soon as possible. For example, the SRVANY tool in the Windows Resource Kit is a great way to get a simple console app up and running as an NT Service. But it comes at a cost - it doesn’t notice if your process has died and restart it, like the Service Control Manager (SCM) does for real services.
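If you do end up rolling your own, the missing piece is only a few lines of Win32. Here is a hedged sketch of a watchdog that launches the worker process and relaunches it whenever it dies (back-off, logging and error handling omitted for brevity):

```cpp
#include <windows.h>

void KeepAlive(const wchar_t* commandLine)
{
    for (;;)
    {
        STARTUPINFOW startup = { sizeof(startup) };
        PROCESS_INFORMATION process = { 0 };

        wchar_t command[MAX_PATH];
        lstrcpynW(command, commandLine, MAX_PATH);  // CreateProcess wants a writable buffer

        if (!CreateProcessW(NULL, command, NULL, NULL, FALSE, 0, NULL, NULL,
                            &startup, &process))
            return;     // couldn't even start it - give up (or alert someone)

        // One of the rare legitimate "forever" waits - we block until it dies.
        WaitForSingleObject(process.hProcess, INFINITE);

        CloseHandle(process.hThread);
        CloseHandle(process.hProcess);
        // ...then loop round and restart it, ideally after a short back-off...
    }
}
```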
By far my favourite pattern in the book is Circuit Breaker. For me this is often the biggest area where developers fail to appreciate the differences between simple client-side development and creating scalable, robust, server-side components. It’s hard to think about the knock-on effects of a local design decision when it’s placed in the wider context of the entire system. A request failing in isolation is a different prospect to considering what happens when more requests are failing than succeeding, or how the failure of one request can affect subsequent ones, such as by exhausting memory or corrupting the process in the case of native code.
What makes the design of circuit breakers hard is the varying notion of what a “transient” error is. Flick the switch too soon and you’ll find your processing grinds to a halt too rapidly; leave it too long and any black holes may consume your entire workload. My current client performs infrastructure maintenance at the weekends when our system is processing low priority work. The database and/or network can disappear for hours at a time, whereas during the week we’d expect any outage to be measured in minutes. It’s very tempting to punt on this kind of issue and call in the support team, but I reckon that if you have to start writing up wiki pages to guide the support team then you could have put that effort into coding up the same rules and making the resolution automatic.
Using a circuit breaker can slow down the effects of a Black Hole by throttling the rate of failure. Hopefully, once the transient error has passed, the breaker will close again and normal service will be resumed.
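To finish, here is a bare-bones sketch of the sort of circuit breaker I’m talking about - not Michael’s code, just the shape of it - with the failure threshold and cool-off period exposed so that those weekend-versus-weekday rules can live in configuration rather than on a wiki page:

```cpp
#include <chrono>
#include <functional>
#include <stdexcept>

class CircuitBreaker
{
    using Clock = std::chrono::steady_clock;

public:
    CircuitBreaker(int failureThreshold, std::chrono::seconds coolOff)
        : m_failureThreshold(failureThreshold), m_coolOff(coolOff) {}

    void Execute(const std::function<void()>& action)
    {
        // While open and still inside the cool-off period, fail fast.
        if (IsOpen() && Clock::now() - m_lastFailure < m_coolOff)
            throw std::runtime_error("circuit breaker open - failing fast");

        try
        {
            action();           // either normal operation or a trial call
            m_failures = 0;     // success closes the breaker again
        }
        catch (...)
        {
            ++m_failures;       // enough of these and the breaker opens
            m_lastFailure = Clock::now();
            throw;
        }
    }

private:
    bool IsOpen() const { return m_failures >= m_failureThreshold; }

    int m_failureThreshold;
    std::chrono::seconds m_coolOff;
    int m_failures = 0;
    Clock::time_point m_lastFailure;
};
```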