Monday, 17 September 2012

Black Hole - The Fail Fast Anti-Pattern

As a general rule of thumb I’ve always been fond of the principles behind failing fast - fail sooner rather than ignore/sidestep the issue and hope something magic happens later. But, depending on the context, the notion of failure is different and this needs to be kept in mind - one size never fits all.

I first came across the idea of failing fast back in the ‘90s when I was working at a software house writing desktop applications in C. Someone had just bought Steve Maguire’s excellent book Writing Solid Code, which promoted the idea of using ASSERTs to detect and report bugs as soon as possible during development. Windows 3.x had a couple of functions (IsBadReadPtr and IsBadWritePtr) that allowed you to “probe” a piece of memory to see if it was valid before referencing it, so that you could ASSERT with some useful context rather than just crash (then called a UAE, and later a GPF) in an ugly mess.
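For illustration, a minimal sketch of that style (using the plain CRT assert rather than the book’s own macros, and an invented function name) might look something like this:-

#include <assert.h>
#include <windows.h>

void copyName(const char* name, char* buffer, size_t size)
{
    // Probe both pointers so that a bad argument fires an assert with some
    // context during development, rather than just GPFing inside the copy.
    assert(name != NULL && !IsBadReadPtr(name, 1));
    assert(buffer != NULL && !IsBadWritePtr(buffer, size));

    lstrcpynA(buffer, name, (int)size);
}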

The flip side to this is that developers could then engage in the act of Defensive Programming. This is another term that has both good and bad connotations. I’m using it to mean code that protects itself at the expense of the caller and so tries to avoid being the victim. Why would a developer do this? Because prior to the modern days of shared code ownership, if an application crashed in your code it was your fault - guilty until proven innocent. It was quite likely you were passed a bum value and just happened to be the first one to touch it. This was epitomised by the lstrcpy() Windows function, as shown by this nugget from the SDK documentation:-

“When this function catches SEH errors, it returns NULL without null-terminating the string and without notifying the caller of the error”

So, in this context failing fast is all about not letting an error go undetected for too long. And if you can detect the problem don’t mask it as the side-effects could be much worse - corrupt data - and then the user really will be annoyed! When you encounter a Blue Screen of Death (BSOD) this is Windows telling you it would rather ruin your day now than limp along and potentially allow all your data to be slowly mangled in the background.

The second meaning of failing fast I formally met in Michael Nygard’s splendid book Release It! In this context we are talking about Distributed Systems where many different processes are talking to one another. The kind of scenarios Michael describes relate to the inefficient use of resources that are a side-effect of some failure further downstream; a failure that is then only detected after some tangential processing has already taken place. The idea is that if you know a priori that the latter resource is borked you could avoid wasting any further resources now by failing up front. More importantly, in a system that handles a variety of tasks, this avoids unnecessarily impacting unrelated work (e.g. by hogging request processing threads).

He also addresses one of my own pet peeves in the book - infinite timeouts. I’ve always believed that virtually no code should wait “forever”. Yes, it’s often hard to find out (or come up with) how long someone is willing to wait for an action to happen, but I’m pretty sure it’s unlikely to be forever. Debugging and stress testing add an extra dimension to this problem because you may be willing to wait much longer than normal, so perhaps the timeout needs to be easily configurable. Michael has some great examples in his book about the effects of infinitely waiting on connection pools, etc. and the chain of events this can cause.
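As a trivial illustration (the function name and timeout parameter are mine, not from the book), preferring a finite, configurable wait over the classic INFINITE looks something like this:-

#include <windows.h>

bool waitForReply(HANDLE replyEvent, DWORD timeoutInMs)
{
    // Passing a bounded timeout (ideally read from configuration) means a
    // dead downstream service cannot pin this thread forever, unlike the
    // all too common WaitForSingleObject(replyEvent, INFINITE).
    return (WaitForSingleObject(replyEvent, timeoutInMs) == WAIT_OBJECT_0);
}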

If you’ve been wondering when I’m finally going to get around to the title of this blog post, stay with me it’s coming…

Applying the above patterns to certain parts of your system, say, the middle tier, without considering the side-effects has the potential to create a Black Hole. This is either a service that sucks in requests at an alarming rate because the moment it starts work on them it detects they will fail, and so can move on to fresh work that much quicker, or it is the empty space created by a service that has shut itself down. Either way your system has turned itself into an expensive NOP.

I’ve also seen systems exhibit Black Hole like behaviour for other reasons, most commonly when a process has reached its memory limit due to a leak or massive heap fragmentation. The danger here is putting something in the infrastructure code that reacts to this kind of event in a way that might be detrimental to the system’s other components. It’s not nice, but a desktop app restarting does not have the same potential to ruin the business’s day as a distributed failure where a rogue service or pooled process has gone AWOL and left a gaping chasm into which requests are falling. The other common problem, at least in development environments, is a mis-configured or corrupt configuration. This is more an annoyance due to wasted test time, but it does give you the opportunity to see how your system could fail in reality.

The key word in that earlier phrase (alarming rate) is obviously “alarming”. There must always be something watching each service or process to notice when the normal heartbeat of the system is disrupted. If you do want to take a common approach to an out-of-memory condition by shutting the process or service down, make sure there is something around to restart it as soon as possible. For example, the SRVANY tool in the Windows Resource Kit is a great way to get a simple console app up and running as an NT Service. But it comes at a cost - it doesn’t notice if your process has died and restart it like the Service Control Manager (SCM) does for real services.

By far my favourite pattern in the book is Circuit Breaker. For me this is often the biggest area where developers fail to appreciate the differences between simple client-side development and creating scalable, robust, server-side components. It’s hard to think about the knock-on effects of a local design decision when it is placed in the wider context of the entire system. A request failing in isolation is a different prospect to considering what happens when more requests are failing than succeeding. Or how the failure of one request can affect subsequent ones, such as by exhausting memory or corrupting the process in the case of native code.

What makes the design of circuit breakers hard is the varying notion of what a “transient” error is. Flick the switch too soon and you’ll find your processing grinds to a halt too rapidly; leave it too long and any black holes may consume your entire workload. My current client performs infrastructure maintenance at the weekends when our system is processing low priority work. The database and/or network can disappear for hours at a time, whereas during the week we’d expect any outage to be measured in minutes. It’s very tempting to punt with this kind of issue and call in the support team, but I reckon that if you have to start writing up wiki pages that guide the support team then you could have put that effort into coding up the same rules and making the resolution automatic.

Using a circuit breaker can slow down the effects of a Black Hole by throttling the rate of failure. Hopefully once the transient error has passed it will close again and normal service will be resumed.
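For the curious, a toy sketch of the mechanics (single-threaded, with illustrative names and thresholds, and no separate half-open state) might look something like this:-

#include <chrono>
#include <stdexcept>

class CircuitBreaker
{
public:
    CircuitBreaker(int threshold, std::chrono::seconds cooldown)
        : m_failures(0), m_threshold(threshold), m_cooldown(cooldown)
    { }

    template<typename Operation>
    void invoke(Operation operation)
    {
        // While the breaker is open, fail fast rather than hammer the
        // sick downstream service with requests that are doomed anyway.
        if (isOpen() && !cooledDown())
            throw std::runtime_error("circuit breaker open");

        try
        {
            operation();
            m_failures = 0; // success closes the breaker again
        }
        catch (...)
        {
            ++m_failures;
            m_lastFailure = std::chrono::steady_clock::now();
            throw;
        }
    }

private:
    bool isOpen() const { return m_failures >= m_threshold; }

    bool cooledDown() const
    {
        return (std::chrono::steady_clock::now() - m_lastFailure) >= m_cooldown;
    }

    int m_failures;
    int m_threshold;
    std::chrono::seconds m_cooldown;
    std::chrono::steady_clock::time_point m_lastFailure;
};

The knotty part, as discussed above, is picking the failure threshold and cool-down period that match your particular notion of “transient”.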

Friday, 14 September 2012

From Visual C++ to GCC - Linker Order Matters

Back in July 2009 I wrote a post about how I had started using GCC as a cheap alternative to performing some static code analysis as I lamented the lack of low-cost options for hobbyist programmers the month before. I also promised to write up more about my experiences as I had had a variety of problems outside those with the code itself - this, finally, is the first of my follow-up posts.

Using Code::Blocks (with TDM-GCC) the first project I ported was my Core library as it’s the newest and was written with the goal of portability[*] in mind. After applying a liberal sprinkling of #ifdefs and fixing a few other minor transgressions I got it compiling, and somewhat surprisingly it linked out of the box too. Sadly I needed to fix a couple of bugs with my ANSI/Unicode string conversion helper functions, but other than that it was pretty easy. My simple XML library went the same way too.

But success was short-lived because the moment I ported my ancient Windows class library (that underpins every app I’ve written) the linker started spewing out reams of unresolved symbols. I wasn’t overly surprised because I had just fixed up various uses of the Visual C++ specific #pragma[#] that allows you to specify a library that the MS linker should include automatically:-

#ifdef _MSC_VER
#pragma comment(lib, "version")
#endif

So I ploughed through the linker errors, and with a little help from “strings” and “grep” managed to work out which libraries from the MinGW toolset I needed to include. The error count swiftly went down and down until I had just a few left. And then I was stumped. There were only a handful of symbols left to resolve, but one of them was a function of my own that resided in the Core library I had just managed to build! The Core unit tests had just been run too so I couldn’t understand why GCC was now having trouble when it had linked fine minutes before.

I spent ages Googling, but being new to the GCC toolchain and Code::Blocks I didn’t really know if this was a problem with my code, GCC, Code::Blocks or just a limitation of Windows development with that configuration. I decided that given my original goal was satisfied - that I could use GCC to at least compile my code and so provide me with extra static analysis - I would just live with it for now. I still hadn’t got into command line builds with Code::Blocks or worked out how to keep the .vcproj and .cbp project files in sync, so there I left it. For far too long...

A few months ago I reopened this case as I had ported many of my libraries and apps and now really wanted to build everything end-to-end to satisfy myself that I could (notionally) support more than just Visual C++ and stop being quite such a luddite. So I vented my anger via Twitter and one of my fellow ACCU members (either Anthony Williams or Pete Goodliffe) asked if the order of the libraries made a difference. Naturally my response was “What, the order of libraries matters with GCC?”.

Apparently so. The next problem was that by taking a shortcut and using Code::Blocks I had distanced myself from the underlying compiler and so wasn’t sure what, if any, control I had over the ordering. However, a quick flick around the Code::Blocks project settings UI revealed buttons and dials I hadn’t spotted before.
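For what it’s worth, the underlying reason is that the GNU linker resolves static libraries in a single left-to-right pass - a library only contributes the object files needed to satisfy symbols that are still undefined at the point it is scanned. Using my own libraries as an example (and a made-up object file name), the difference is something like this:-

C:\> g++ test.o -lWCL -lCore     (links: WCL’s references to Core can still be resolved)
C:\> g++ test.o -lCore -lWCL     (fails: nothing needs Core when it’s scanned, so it’s skipped)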

[Screenshot: Code::Blocks project build options]

And then I discovered the magic setting that made all the difference - the policy “Prepend target options to project options”. This normally defaults to “Append…” but open the drop list and you can switch it to “Prepend…”. That sorted the 3rd party symbols, but I still had a couple of my own symbols unresolved. I quickly realised that was because my own libraries were in the wrong order too. A quick play with the arrows that allow you to reorder the “Link libraries” and everything was good.

GUIs are all very good for browsing and tinkering but when it comes to making repetitive changes they suck. Fortunately the Code::Blocks projects files (.cbp) are XML and so it was pretty easy to isolate the relevant sections and make sure they were consistent across all my solutions:-

<Target title=. . .>
    . . .
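        <!-- "2" here corresponds to the "Prepend target options to project options" policy mentioned above -->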
    <Option projectLinkerOptionsRelation="2" />
    . . .
    <Linker>
        <Add library="..\Lib\NCL\Debug\libNCL.a" />
        <Add library="..\Lib\WCL\Debug\libWCL.a" />
        <Add library="..\Lib\Core\Debug\libCore.a" />
    </Linker>
</Target>
. . .
<Linker>
    <Add library="ole32" />
    <Add library="oleaut32" />
    <Add library="uuid" />
    <Add library="comdlg32" />
    <Add library="version" />
    <Add library="gdi32" />
    <Add library="ntdll" />
    <Add library="advapi32" />
    <Add library="shlwapi" />
</Linker>

Once I had sorted all my project settings everything was building exactly as it should. Now, with the command line below, I can also build all my Code::Blocks projects just like I do with Visual C++ so I should be able to make GCC a more frequent part of my development regime.

C:\> codeblocks.exe --build %f --no-batch-window-close

 

[*] Portability across Windows compilers and STL implementations essentially. Although I have encapsulated use of the Windows API on the vague off-chance that I should decide to embrace other OS ecosystems.

[#] I think this is one of the best extensions Microsoft added to Visual C++ because it makes writing Facades so much easier - you only have to link your application with your Facade library and the additional linker dependencies are automatically taken care of for you.

Thursday, 13 September 2012

Can Code Be Too Simple?

Many years ago I was reflecting on some changes that a newer (albeit experienced) colleague[#] had made and I came across a refactoring that led to a method being reduced to just this one line of C++ code:-

void MemoryTracker::incUsage(const string& typeName)
{
    m_objects[typeName]++;
}

But this troubled me. My spider-sense was tingling that something was missing - surely it couldn’t be that simple? The main reason for my immediate worries was that this was C++ and I was working on a very old system where I’d spend a fair bit of time fixing a variety of problems caused by the language’s willingness to let you leave variables uninitialised.

In case you’re still pondering, m_objects is declared as a std::map<std::string, size_t>. To add a little more substance to my argument, I had been doing a fair bit of VBScript lately and a tiny sideline in C#. In those languages you can’t achieve the same degree of simplicity because they throw an exception when you try and index an item that doesn’t exist. Consequently you’d end up writing this kind of code instead[*]:-

public void IncUsage(string typeName)
{
    if (!m_objects.ContainsKey(typeName))
        m_objects.Add(typeName, 1);
    else
        m_objects[typeName]++;
}

So the C++ version is actually doing quite a bit behind your back. Not only that, but it is also doing stuff which you’d normally have to watch out for in your own code.

  1. If the item already exists in the map, return a reference to its value
  2. If the item doesn’t exist, create an entry for it, default construct the value, and then return a reference to the value

The bit that makes me tingle is the “default construct the value” because for primitive types like size_t this is ingrained in my brain as a NOP, as that’s usually what happens unless you go out of your way to ensure it isn’t. Such is the mark of a brain addled by too many years doing C and “Visual” C++. Of course the beauty of containers like std::map is exactly that they “do the right thing” and so I need to let go and just accept that.

Hopefully now you can see why the original code is so simple. When passing a type name that doesn’t exist in the map, an entry gets added. Not only that, but the default value for a size_t is 0 and so the post-increment immediately bumps that to 1. Clever, eh?
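If you want to convince yourself, a self-contained toy snippet (with an arbitrary key) demonstrates the behaviour:-

#include <cassert>
#include <cstddef>
#include <map>
#include <string>

int main()
{
    std::map<std::string, size_t> objects;

    // operator[] value-initialises the missing entry to 0, then the
    // post-increment bumps it to 1 - exactly what incUsage() relies on.
    objects["SomeType"]++;

    assert(objects["SomeType"] == 1);
    return 0;
}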

As I remarked back at the very beginning this was many years ago and with hindsight my response should have been “show me the tests”. A quick read of the tests should have been enough to convince that it worked correctly in the case I was concerned about. But hey, guess what, there were no unit tests at all for the set of shared libraries this code featured in. In fact it was working on this legacy codebase that opened my eyes to the necessity of unit testing.

There are a couple of quotes that usually get trotted out when talking about the beauty and simplicity that code should possess:-

“Make things as simple as possible, but not simpler” -- Albert Einstein

“Perfection is achieved, not when there is nothing more to add, but when there is nothing left to take away” -- Antoine de Saint-Exupery

I totally agree with the sentiment on all levels - from architecture down to individual statements. And I’ve admitted that backing production code with tests should quash any doubts about code not meeting its requirements. So why do I still feel that it is questionable? There is a fine line between “clever” code (code that does stuff in a concise and yet unobvious way) and “simple” code that does its job using the right tools. I guess that when working in a mixed language/mixed ability environment the “right tools” wielded correctly can sometimes seem to be indistinguishable from “clever”.

 

[#] This “chat” was made all the more interesting by their reluctance to accept that there could be anything even worth discussing. Their position was that the code is correct and does exactly what the specification demands. End of.

[*] In C# you could of course wrap this up in an extension method and there may also be a more optimal way. My point is that different containers sometimes have different semantics even within the same language when it comes to “adding” or “setting” a value.

Monday, 10 September 2012

Maintaining VCS History Across Refactorings

Refactoring is seen as an essential tool in the ever increasing battle against software entropy, and the most common variant[*] I do day-to-day is Extract Method. The other two refactorings that make more than an occasional guest appearance are Extract Interface and Extract Class. The former usually comes about when trying to mock legacy code, whilst the latter is the result of applying Extract Method too many times and missing the proverbial wood because so many trees have sprung up to hide it.

The other thing I tend to spend more than my fair share of time on is Software Archaeology - digging through the Version Control System to look back at the evolution of a class or file. Just the other day I was trawling the archives trying to work out why a database table had a particular primary key in the hope that a check-in comment might help explain the decision; it didn’t this time, but so often it does.

One side-effect of the relentless use of refactoring to improve the codebase is that it creates a lot more revisions in the VCS. I’ve written in the past about the effects of fine-grained check-ins (on an integration branch) and for the most part it’s not too much of an issue. The most common refactorings tend to cause changes that remain within the same source file and so from the VCS’s perspective it’s easy to manage. Where the VCS does struggle is with those refactorings that involve extracting a definition or behaviour into a new entity, i.e. a new interface or class. This is because we tend to store our code as one interface/class per source file. Yes, nested classes can be found in the wild occasionally, but in my experience they are a rare breed.

The most common way to perform an Extract Class refactoring is to create a new source file and then cut-and-paste the code from one to the other. If you’re a modern developer your IDE will probably do the donkey work for you. The Extract Interface refactoring is very similar, but this time you copy-and-paste the method signatures to a new file. In both cases though you have created a new source file for the VCS to track and from its perspective you might as well be creating a brand spanking new entity because it can’t see that logically you have moved or copied code from one source file to another. What’s a poor developer to do?

The way I handle these two refactorings (and any other change where a new source file is logically seeded from an existing one) is to clone[#] the original file in the VCS, then paste the correct contents in so that the resulting file is the same as if starting from scratch. The difference though is that at commit time the VCS has the extra metadata it needs to allow you to walk directly back in time to the code before the refactoring. In the case of Extract Class this means the evolution of the methods prior to the refactoring is readily available without having to use check-in comments (or worse still “just knowing” where to look) to find their past lives.
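By way of example, assuming Subversion and some entirely made-up file names, the Extract Class dance goes something like this:-

C:\> svn copy MemoryTracker.cpp MemoryUsage.cpp
(edit MemoryUsage.cpp down to just the extracted class, and remove it from MemoryTracker.cpp)
C:\> svn commit -m "Extract MemoryUsage class from MemoryTracker"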

I have no idea if any IDEs or refactoring tools have support for this way of working. From the little I’ve seen (I tend to refactor manually because they never put things quite where I want them) they seem to just drive the IDE rather like a macro. Please feel free to set me straight if this isn’t the case.

 

[*] The name comes from Martin Fowler’s excellent book on refactoring.

[#] In Subversion this is just a COPY, which is essentially what every operation in SVN is - branching, labelling and even renaming.

Friday, 7 September 2012

Assume Failure by Default

Maybe I’m just an old pessimist but when I see code like this below, it makes me think “now there’s an optimistic programmer”:-

bool doSomething()
{
    bool ok = true;
    . . .
    return ok;
}

Personally I’d start with ok = false. Actually I wouldn’t even write that because I’ve always favoured early exit (in C++) and despise the heavily nested code that this style can lead to. But does it matter? Well, yes, because developers generally don’t test their error handling with anything like the effort they’ll put into making sure it works under normal conditions. And I’m not just talking unit testing here either, what about system testing where machines run out of memory and disk space, and networks disappear and reappear like the Cheshire Cat?

So, let me give a more common example which touches on one of my pet peeves - process exit codes. How many programmers check that the exit code produced by their executable is anything other than 0? Putting aside the whole “void main()” issue for now on the basis that you’ll be using a language that forces you to return an exit code in the first place, this is not IMHO the best way to begin:-

int main(int argc, char** argv)
{
    . . .
    return 0;
}

For starters let’s stop with the magic numbers (which also happen to look like FALSE for extra mind-bending confusion[+]) and make it obvious:-

int main(int argc, char** argv)
{
    . . .
    return EXIT_SUCCESS;
}

Now that obviously looks dodgy - success, always? In reality main() is just a stub to invoke factories to get the services needed by your application logic and so you might end up with something more like this instead:-

int main(int argc, char** argv)
{
    int result = EXIT_FAILURE;

    try
    {
        // Create services
        . . .
        // Invoke application logic
        . . .
        result = EXIT_SUCCESS;
    }
    catch(/* something */)
    {
    }
    catch(/* something different */)
    {
    }

    return result;
}

If you prefer the early exit approach then you might elide the result variable and just perform a return EXIT_<WHATEVER> instead at the relevant spot. The key point is that the default paths through the code should end in failure being reported, unless you happen to strike gold and manage to make it through to the end without hitting any problems. By assuming failure you have to work harder to screw up and create a false positive; and personally I’d rather code fails by accident - a false negative - immediately, right where the problem is, than later when the context has long gone.
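For example, an early exit version of the earlier main() might pan out something like this (purely illustrative), with success only ever reported explicitly:-

#include <cstdlib>

int main(int argc, char** argv)
{
    try
    {
        // Create services
        // . . .
        // Invoke application logic
        // . . .
        return EXIT_SUCCESS;   // the only route to success
    }
    catch (...)
    {
        // log/report the error, then fall through to the failure path
    }

    return EXIT_FAILURE;       // every other path reports failure
}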

As if to labour the point I’ve recently had to work with some components where the code is written in the “optimistic” style - essentially the methods are written this way round[#]:-

{
    bool ok = true;

    try
    {
        if (doSomething())
            if (doSomething())
                . . .
            else
                ok = false;
        else
            ok = false;
    }
    catch(/* stuff */)
    {
        ok = false;
    }

    return ok;
}

I’ve lost count of the number of times that I’ve debugged this code only to find that somewhere earlier a return code was ignored or an exception handler swallowed an error. But hey, this is nothing new, this problem has been around far longer than I’ve been programming. Exceptions are pretty much ubiquitous these days and so it does feel less common to see this kind of problem within application logic.

For me the acid test[*] is still “if this code traps an out-of-memory error what do I want it to do?”. And I’ve yet to answer that question with “just ignore it”.

 

[+] Have you ever seen C code that does an equality string comparison like this: if (!strcmp(lhs, rhs))?

[#] It’s actually C-style procedural code encoded as C#. The author is not one of the development team and is the first to admit it’s not exactly a beacon of quality. But it’s in production and generating value every day (along with a side order of technical debt).

[*] By this I mean when someone writes “catch (Exception e)” I ask them the acid test question, because that’s exactly what they’re going to catch eventually. Presumably it’ll be forwarded until we hit a component boundary (process, COM, DLL, etc) at which point you’re going to have to transform every error into something the caller understands (ERRORLEVEL, HRESULT, etc).