Wednesday, 29 October 2014

When to Break the Rules

As a general rule being consistent is better than not being consistent. We create guidelines for how our code looks and what idioms and patterns we use to try and create a codebase that is easy to comprehend so that we don’t then waste brain cycles trying to sift through the trees looking for the wood. But there are times when I find those general rules do not apply, or perhaps I should say, different rules may apply.

Different Domains

In my last two posts about test naming conventions I touched on this very briefly. Tests are a classic case where not all the normal rules for production code apply. Yes, as a general rule, most of them probably still do, but when it comes to naming and code formatting it can be beneficial to do something that helps to convey the behaviour under test. I would break other rules in C# here too such as using public fields, or at least creating mutable properties so that I can use the object initialisation syntax [1]. In C++ I might rely more heavily on the pre-processor to hide some of the technical grunge and allow the reader to focus on the important bits.

A consequence of this is that you can start fighting with your static code analysis tools, e.g. ReSharper, and so being able to suppress certain rules on a per-component basis becomes a necessity if you want to remain free to express yourself in code. For instance with ReSharper I will disable some of the rules about naming conventions and the ones about being explicit where localisation concerns might arise.

Of course what we’re really saying here is not that we’re breaking the rules for the codebase per se, but that there are really two different dominant domains within the codebase that serve different purposes - production code and test code. As such we’re not really breaking the production code rules but defining a different set of rules for the test code. But it’s important to realise that this distinction occurs and that it’s a valid approach to take. Even within the production codebase there may be some smaller domains that are better served by loosening the rules, such as configuration, but this may be driven in part by an inability to configure a tool to meet your desires [2].

Refactoring

For production code the decision to do something different can often be much harder to justify. The desire is often to be consistent even if an earlier decision seems like a bad choice now, and the team decides that in future they believe a better approach should be adopted. During the period of change the codebase lives in a state of flux with two opposing sides - one demanding consistency and the other promoting a better life. Whilst the old timers will know about these changes in direction the new joiners wont and so they will likely follow whatever they see first, which may give the wrong impression. Unless you are pairing or have code reviews the opportunity to guide them in the new direction might be missed and the old ways will persist even longer.

One example of this was the change in naming convention in a C# project I worked on. The project was started by a bunch of ex-C++ programmers and so we adopted the all-uppercase convention for constants. Quite a bit of code was written before someone started using ReSharper which complained about the convention and so we investigated why ReSharper was suggesting a different style. We realised at that point that our style was not idiomatic and, more importantly, it actually made sense not to do it. So we decided that going forward we would drop some of our old inherited C++ habits and follow the more traditional C# conventions.

This might sound as if I’m contradicting my advice in “Don’t Let Your Tools Pwn You”, but really I’m advocating how we used it as a tool for learning. The problem was that only one of us had a license for the tool and there was now quite a bit of code that violated the conventions. What to do?

We had to decide how we wanted to move forward and the only way forward appeared to involve us going against our principles in one way or another. Trying to go back and fix all the code up at once seemed like a criminal waste of time as it wasn’t dangerous; just not exactly what we’d like to see now. If we add new constants where there are already ones in the old style, do we follow the existing code style or adopt the new one thereby mixing different styles. If we fix up the style of the old constants whilst adding the new ones we create a commit that includes functional and non-functional changes which we tried to avoid. If we fix up just the old subset where the new constant goes in a separate check-in we then create other bits of code of differing styles.

While it might seem like it took longer to explain this problem than to correct it, especially given that this particular example can be largely automated, it does illustrate the tensions that can come into play. Sometimes the problem has to get worse before it can get better. For example we tried using a 3rd party mocking framework in a number of different ways as we couldn’t all reach a general consensus about a style we liked. Eventually we discovered a way of using it that supported more like what we were after, but there was no automated way of going back and re-writing all those existing tests in the preferred way.

Guidelines, Not Rules

I find the following quote from Pirates of the Caribbean nicely sums up the attitude we should take towards any “rules” we might establish to try and maintain a consistent codebase and development process:

Elizabeth: Wait! You have to take me to shore. According to the Code of the Order of the Brethren...

Barbossa: First, your return to shore was not part of our negotiations nor our agreement so I must do nothing. And secondly, you must be a pirate for the pirate's code to apply and you're not. And thirdly, the code is more what you'd call "guidelines" than actual rules. Welcome aboard the Black Pearl, Miss Turner.

Many of these latter problems probably stem from an immaturity with the toolchain, but even in an established language like C++ things change over time. The recent C++ 11 and C++ 14 standards both introduced sweeping changes that have caused developers to question past practices and to unlearn what they learned so long ago. We need to remember that our processes are never cast in stone but are fungible and need constant refinement as the programming landscape around us changes.

 

[1] Modern C# supports named arguments and so this is less of a concern.

[2] Some C# serializers make it hard to work with immutable types by default because of the nature of the reflection based approach they use.

Monday, 27 October 2014

Other Test Naming Conventions

The length of my last post “Unit Testing Evolution Part II – Naming Conventions” was already massively over budget and it was largely about how I name tests within my own C++ codebase. In my professional programming career I’m currently doing mostly C# and so am bound by the tools that have already been chosen, i.e. NUnit. Here there is still a fair bit of creativity with using multiple nested classes and method names to create an overall structure that avoids relying solely on overly long method names. Given that a test framework like NUnit can also easily be used for other forms of testing, such as component and integration level tests, I’ve found that a little more structure helps deal with the additional complexity that can arise one you start dealing with more dependencies or the need to check multiple bits of state.

Unit Tests

When you look closely at the way, say, NUnit displays the full names of failed tests you can see that it includes the names of any nested classes as well as the method name. We can exploit this by breaking the test name into pieces and using more nested class names to create sub-namespaces, for example:

public static class StringTests
{
  public static class trimming_should_
  {
    [Test]
    public void remove_leading_whitespace()
    {  } 

    [Test]
    public void remove_trailing_whitespace()
    {  }
}

Note the trailing underscore on the nested class name provides a marginally nicer output:

StringTests.trimming_should_.remove_leading_whitespace
StringTests.trimming_should_.remove_trailing_whitespace

In this example we have used the top-level class as a namespace to categorise the tests in the module, and then used nested static classes as another namespace for the “feature” under test (trimming). In this instance we could probably have used a real namespace for the top-level static class, but that class often isn’t static and so you would end up with two slightly different test styles (e.g. indentation) which can be irritating.

Component/Integration Tests

In a unit test you often have only very simple “arrange, act, assert” needs, but as you move up the testing pyramid into component and integration tests you usually have a little more setup, more scenarios and more state or actions to assert. Consequently it can be beneficial to add more levels of nested classes to group common scenarios together (the “when xxx”), and at the same time also split the asserts out into separately named methods (the “then xxx”) to make it clearer what the assertion is telling you.

public class ConnectionPoolTests
{
  // Setup
  . . .
  public class when_a_connection_is_acquired_
  {
    [Test]
    public void then_the_returned_connection_is_open()
    {  }

    [Test]
    public void then_the_connections_in_use_count_is_incremented()
    {  }
  }
}

While in a Cucumber test you would start with a “then”, and the remaining asserts would begin with “and”, that doesn’t really work when an arbitrary test fails, however you might find it makes reading the test’s source code easier instead.

Setup == Given

In an NUnit test you have the SetUp and TearDown helper functions that execute around the test method. The SetUp helper is akin to the “Given” in a Cucumber style test and so you can name it something more imaginative than just “SetUp”. It doesn’t make an arbitrary failing test any easier to diagnose but it does allow you to document the test setup better:

public class WebProxyTests
{
  [SetUp]
  public void given_a_proxy_to_the_remote_service_
  {  }

  public class when_I_make_a_valid_request_
  {
    [Test]
    public void then_I_receive_a_successful_response_code
    {  }

    [Test]
    public void then_the_body_contains_the_data
    {  }
  }
}

Performance

One consideration you need to bear in mind is that structuring your tests this way makes the “arrange” and “act” aspects get executed more frequently because each test case is only asserting part of the outcome. If either of these steps are lengthy it can drag out the overall test suite run time which is probably undesirable.

Friday, 24 October 2014

Unit Testing Evolution Part II – Naming Conventions

Well over 4 years ago I fell into the very trap I once heard Kevlin Henney warn me about: don’t append “Part I” to the end of a title as it suggests there will be further parts, but there may not. At the tail end of “Unit Testing Evolution Part I - Structure” I described how I had tried to short-circuit some of the tediousness of writing unit tests by skipping the naming aspect, because it’s hard. One of the most important things I learned from that talk by Kevlin (and his Sticky Minds articles on GUTs) was how important this particular aspect is. No, really, I honestly cannot believe I was once so naive.

Test Descriptions

The reason we call it a test “name” is probably because the tooling we use forces us into partitioning our tests as functions (or methods) and uses the function name to identity it [1]. Consequently we associate the function name with being the test name, but really what we want is a description of the test. In Kevlin’s talk he suggested that test descriptions should be propositions, they should attempt to say in as clear a manner as possible what the behaviour under test is.

Why is this important? Because this step, which is almost certainly the most difficult part of the task of writing a test, is the bit that focuses the mind on what exactly it is you are going to do in the test. In a TDD scenario it’s even harder as you’re not trying to describe some existing behaviour but trying to describe what the next thing you’re going to implement is. Going back to my original post, what became apparent was that instead of listening to my process (finding it hard to describe tests) I avoided the problem altogether and created a different one - tests with no description of what was and wasn’t covered. As such there became no way to know whether any behaviour was accidental or by design.

Make no mistake, at this point [2], the amount of thought you put into naming your test is crucial to you writing good tests, of any sort. If I cannot easily think of how to describe the next test then I know something’s wrong. Rather than just stick a placeholder in I’ll sweat it out because I now appreciate how important it is. In my mind I might mentally write test and production code to try and explore what it is I’m trying to unearth, but I try and resist the temptation to write any concrete code without any clear aim so as to avoid muddying the waters.

One nice side-effect of putting this effort in (which Kevlin pointed out in his talk) is that you will have a set of propositions that should read very much like a traditional specification, but scribed in code and of an executable nature to boot.

First Steps

When you first start out naming tests it’s hard, very hard. It’s probably even harder if you’re doing it in a language that is not even your native tongue. But there are some conventions that might help you get on the right track. Just remember, these are not ideal test names for other reasons that we shall come to, but when you’re struggling, anything that can help you make progress is a good thing. Even if it stops you writing “Test 1”, “Test 2”, etc. you’ve made some headway.

Note: One of the concessions we’re going to make here initially is that we will be writing test names that must also masquerade as method names, and so the format will naturally be restricted to the character set allowable in identifiers, i.e. no punctuation or spaces (in many popular programming languages).

Where we left off at the end of my original post was how the test structure had changed. From a naming perspective I had conceded that I needed to put more effort in but was perturbed by the apparent need to express the test name in a way that could be used to name a function. Consequently one of my unit tests in my Core library for trimming strings was expressed like this:

TEST_CASE(StringUtils, trimWhitespace)
{
  . . .
}
TEST_CASE_END

Even for a simple function like trim, which we all know roughly what it does, this name “trimWhitespace” says nothing about what it’s behaviour actually is. It raises many questions, such as “what exactly denotes whitespace?” and “where does it trim this space: the front, the middle, the back?”. Clearly I needed to find a better way to name the tests to answer these kinds of questions.

One of the earliest examples I came across was by Roy Osherove (who later went on to write The Art of Unit Testing) and he suggested a template along the lines of “<Method>_When<Condition>_Should<Action>”. Anyone who has ever done any BDD or used a tool like Cucumber or SpecFlow will recognise this as a sort of variation of the Given/When/Then style. Personally I found this ordering harder to use than “<Method>_Should<Action>_When<Condition>”, this variation just seemed a bit more natural. Using the Trim() example again I might have gone for something like these (I’ve given them both for Roy’s and my preferred style):

Trim_WhenStringContainsLeadingSpaces_ShouldRemoveThem
Trim_WhenThereIsTrailingWhitespace_ShouldRemoveIt

…or:

Trim_ShouldRemoveSpaces_WhenTheyAppearAtTheBeginning
Trim_ShouldRemoveTrailingWhitespace

One of the reasons I found putting the predicate at the end easier (the _WhenXxx) was that it can be elided when it’s not required, such as in the last case. I found that for really simple methods I was straining to add the predicate when it was already obvious. Of course what seems obvious today may not be obvious tomorrow so once again I had to think real hard about whether it should be elided or not.

Word Spacing

One of the problems with trying to write long function names in a camelCase or PascalCase style is that they can get pretty hard to read which is defeating the purpose of using descriptive names in the first place. Consequently judicious use of underscores and dropping the capitalisation can make test names easier to read, e.g.

trim_should_remove_spaces_from_the_beginning

Some test frameworks acknowledge this style and will replace the underscores with spaces when reporting test names in their output to complete the illusion.

Test Behaviour, Not Methods

Once I started to get the hang of writing more precise test descriptions I was ready to tackle that other big smell which was emanating from the tests - their affinity with specific functions or methods. When testing pure functions the “act” part of the “arrange, act, assert” dance will usually only involve invoking the function under test, but this is not the case with (non-static) classes [3].

With classes there is a call to the constructor in addition to any method you want to test. Or if you want to test the constructor you probably need to invoke some properties to sense how the object was constructed. Factor in the advice about only testing one thing at a time and this way leads to madness. What you really want to be testing is behaviour (nay, features) not specific methods. There will likely be a high correlation between the behaviour and the public methods, which is to be expected in OO, but one does not imply the other.

Kevlin’s article on GUTs says this far better than I ever could, but for the purposes of this post what we’re interested in is the effect it has on the way we name our tests. Most notably we shouldn’t be including the exact name the method in the test, unless it happens to also be the best way to express the behaviour under test. That might sound odd, given that we try and name our functions and methods to clearly express our intent, but what I mean is that we don’t need to contort our test name so that our method name is used verbatim, we might use a different tense for example. Of course if we find ourselves using an entirely different verb in the test name then that is probably a good sign that we need to refactor. If we are doing TDD then we have probably just answered another piece of the puzzle; such is the power given to us by trying to name tests well.

Constructors are a good example of where this test/method affinity problem shows up. When writing a test for an object’s default state, if you use a style that includes the method name, what do you use for it? Do you use the name of the type, the word “constructor”, or the method used to observe the state? Consider a test for the default construction of a string value. This is what it might look like under the earlier method-orientated styles:

constructor_should_initialise_itself_to_an_empty_string_when_no_arguments_provided

string_should_initialise_itself_to_an_empty_value_when_no_arguments_provided

length_should_return_zero_when_constructed_with_no_arguments

The last example gives us a big hint that we’re looking at the problem the wrong way. The first two test names describe the how the implementation represents the default value, which in this case is “an empty string”. But we know that there is more than one way to represent an empty string (See “Null String Reference vs Empty String Value”) and so we should not burden the test with that knowledge because it’s a proxy for a client, and they don’t care about that either.

That last test name gets us thinking in more abstract terms and leaves it up to the test itself to show us how it could be realised, e.g.

public void a_string_is_empty_by_default()
{
  string defaultValue = new string();

  // Could be either (or both) of these…
  Assert.That(defaultValue.Length, Is.EqualTo(0);
  Assert.That(defaultValue.ToString(), Is.EqualTo(“”);
}

Writing code in a test-first manner really started to make this kind of mistake all the more apparent. I found that trying to express what behaviour I wanted to achieve before I had written it highlighted the inadequacy of the naming convention I had been using up to that point.

One other point about not being directly dependent on the function or method name is that refactoring can have less of an impact on the structure of the tests. In theory if you changed the name of the method you have to manually change the name of the associated tests as I’m not aware of any tooling which is that clever. If you move the implementation, say, from the constructor to a method you then have to rewrite the associated tests, and therefore the test names. The definition of refactoring is to change the design without changing the observable behaviour, and so if your tests and names are framed around the observable behaviour, rather than the implementation, you should not need to change anything.

Towards a Natural Language

The final step in my journey was to move towards a test framework that allowed natural language to be used instead of a limited character set for test names. This also paved the way for use of punctuation marks which helps give the test name a more human feel. The following example is what the unit test for my trim function looks like now:

TEST_CASE("trim strips spaces, tabs, carrage returns and newlines from the front and back of strings")

  . . . 
}
TEST_CASE_END

In retrospect this test case could probably be broken down a little further, perhaps by handling trimming of leading and trailing whitespace as separate cases. The reason it’s all one TEST_CASE() comes down to how I moved from the old style “TEST_CASE(StringUtils, trimWhitespace)” to the new style “TEST_CASE(trims strips...)”. As you can probably guess I just used SED to go to an intermediate step where the test name was simply “trimWhitespace” and then I slowly went back over my tests and tried to work out what a better description would be.

The trade-off here was that I gave up the ability to use the test name as a function or method name, but that’s of little use in C++ where there is no reflection capability [4]. The way it’s currently implemented means running a single test from the command line would require some thought, but I know that other more mature test frameworks, like Phil Nash’s Catch, have come up with something usable. At the moment I’m happy to treat the need to run a test in isolation as a smell that my tests take too long or there are too many tests in a single fixture.

No Pre-Processor?

In C & C++ we have the (much derided) pre-processor to help us implement such a convention, but there is no such facility in C#. However, there has been a recent pattern emerging in C# where the indexing operator is “abused” to allow you to register closures that look like long-named methods, e.g.

public class SomeTests : TestBase
{
  public void Tests()
  {
    Test[“a nice free form name”] = () =>
    {
      // Arrange, act and assert.
    };
  }
}

I like this idea as tests usually have no return value and take no arguments so it doesn’t look that weird. One should also remember that test code and production code are allowed to be different and may use different idioms because they serve different needs. When I started out writing my C++ unit testing framework I had one eye on writing a library in a production code style instead of taking a step back and trying to write something that solved the problem for a specific domain. Part of this journey has been understanding what writing tests is all about and in particular finally understanding the importance of naming them well. Once again I find myself having to learn that there is no such thing as a free lunch.

 

[1] Despite what I’ve written I’m going to stick with the term “test name” purely for SEO reasons. I’d like this blog post to be easily discoverable and that term I suspect is the one most commonly used.

[2] These are weasel words because I cannot guarantee someone like Kevlin will not come up with something even better in the future.

[3] A function like strtok() in the C standard library which maintains some hidden state would require extra calls to it in the “arrange” part of some of its tests, but these really should be the exception, not the norm (See “Static - A Force for Good and Evil”).

[4] There are some handy macros though, like __FUNCTION__, that could be used along with some other pre-processor trickery to help obtain the function name automatically.

Tuesday, 21 October 2014

So Many Wrongs, But No Rights

As we approach Christmas, that wonderful time of year for giving and receiving, we also approach that annual corporate ritual that is the Change Freeze. It’s also a time when I get to remember what is probably the worst software release I’ve had the misfortune to be involved in...

The Baroque Import Process

Like all successful systems it had been grown organically from a walking skeleton. Most of the codebase had decent test coverage and we had a fairly good build pipeline going out to a development instance that ran as close to production as possible. Most functional problems showed up in development and the UAT environment highlighted anything else non-environmental. We even had unit tests for most of our SQL code!

However, we also had some big chunks of technical debt [1] too. One particular stored procedure was now a behemoth (a many-hundred line monster) and had been developed without any test coverage. By the time this was recognised it had already grown massively by cut-and-paste and naturally nobody wanted to go back and write tests for it, and there was no drive from management to tackle this particular piece of debt either [2]. The other related data was not handled by this particular procedure but another complex maze of procedures and views. There was some “token” test coverage here, but not around what will transpire below.

These procedures were used to handle the versioning of an upstream feed. Very little data changed day-to-day so we used a versioning strategy that involved comparing yesterday’s and today’s data and only creating new records for updated entities. Another table then tied together which version should be used for which business date.

The Replacement Process

These procedures had managed to allow us to go live on time and had lived for a year in production, but eventually parts of the manual process were going to be automated upstream and so finally we got the go ahead to replace the major parts of this Heath Robinson process with something that lived outside the database and could provide us with better validation and diagnostics. This was developed in the lead up to Christmas and was looking good to be scheduled for release just before the change freeze kicked in.

The Change Freeze

This company, like many others, tries to mitigate the risk of problems escalating when no one is around to fix them during the festive period by putting a halt on all non-critical changes. It doesn’t matter whether you have been delivering continuously for over 12 months without a single cock-up, you still can’t deploy any changes unless they are to directly fix a priority one production incident.

The Last Minute Data Update

With just a week or so to go before the change freeze the business decided they needed to tweak some key data so that it would be in for the year end runs. This data was “manually calibrated” by them and very rarely changed. So we restored UAT to match production, ran in the new data and waited for the results. Most of the results were fine but as we looked a little closer we discovered that the report contained some entities where the data change had appeared to take effect, but the calculation result had not changed too.

Given the number of problems that had already shown up in the reporting component it seemed entirely feasible that another one had crept in somehow, but digging deeper proved that wasn’t entirely the case. The reporting code was known to be dubious for other reasons, but it didn’t explain why this particular data change had not had an effect everywhere it was expected.

Development != UAT

We also applied the data change to the development environment where the new release had been chugging along quite nicely. But the following morning the development system disagreed with UAT too. It looked like there might be a bug in the replacement process as well. Given how close we were to going live with this larger change we quickly dived in to see what had broken and whether we could also fix this before the freeze kicked in.

The Bug Unearthed

It turned out the new code was correct, the problem was actually in the old code, the code currently in production. The bug was in the versioning process where it failed to detect a new entity version if this one piece of datum that the business had wanted to change was the only change in the entity’s data. As a consequence the calculations had been done using the old value. Oops.

The One-Line Fix

The fix was trivial. All we needed to do was add an extra predicate in one of the select statements to ensure that the comparison was done against the latest version and not every prior version:

AND thing.Version = thing.LatestVersion

We knew the fix was this simple, and that it would work without even testing it...

The Irony

How did we know? Because we’d seen it before. Nope, it was better than that, the person who made this mistake had seen it before and even raised a bug request for the previous problem some months earlier.

Now It Gets Ugly

Not only do we have a new bug, but the results of the test run also highlights a longstanding problem with the reporting code. This code, which was bashed out in isolation [3], pulled in data from all over the place. In particular, instead of reporting data from the dated snapshot tables which is what the calculations use, it pulls them from the staging tables where the input feed is kept.

It might sound like these dated snapshot tables are in the “sanitised data” schema, they’re not. Both the unprocessed input feed and the dated snapshot live in the same table in the staging schema. The column where the value should have gone was never populated. No, that’s not true, it was populated, and versioned correctly, but with a different value that was never used.

Consequently what was reported was the value provided by the business, but that value never made its way into the set used for number crunching, hence the disparity. Sadly two wrongs did not make it right in this case.

Now It Gets Political

In any “sane” project I’d hope we could just hold up our hands and admit we have a bug, mention that we already have a fix for it ready to go and then discuss how best to get it deployed and move on the next most important thing. If the result of the discussion was that we would have to wait, that they could live with the bug, then so be it; at least we are being transparent. But that’s probably why I’d never make a “good” manager.

Instead we tried to find ways to disguise the bug. We couldn’t deploy the fix because so many other calculations that have also been working with incorrect data would come to light. We couldn’t deploy our new release because the replacement component didn’t have the bug either. This left us with somehow tweaking other data so that it would force a version change to occur, e.g. adding an extra space in a text field that was unused. The final option was stalling until the change freeze ended and we could hopefully bury bad news by folding the supposedly technical only release [4] in with a data release and the bug would get lost in the noise.

The Dark Cloud Arrives

With the change freeze over (6 weeks later) we had quite a backlog of changes. The frustration of the change freeze meant that the business were happy to lump together our new release alongside some other 3rd party and data changes. Our opportunity to bury bad news had arrived and the release was pushed out, the numbers all moved about, people murmured that things moved more than expected, but it quickly went quiet again. Normality resumed.

Epilogue

I don’t think I’ll ever understand why we couldn’t just hold up our hand and admit we’d made a mistake. Compared to many other projects around us we were a beacon of success as we had originally delivered something on time and under budget; although it may not have had all the bells and whistles originally planned. In the 12 months after going into production we had delivered at regular intervals measured in weeks, not months, and had not once had an outage caused by something our team had done. In contrast we had to push back a number of times on the 3rd party components provided to us because they didn’t entirely do what was expected of them - we became their regression test suite! I would hope that kind of delivery record would have afforded us the right to mess up slightly once in a while, but perhaps not. Maybe what trust we had built up was actually worth nothing.

They say confession is good for the soul. This is mine.

 

[1] Technical Debt is really about shortcuts taken for expediency, not crap code. However the crap code was quickly identified and a decision was made to live with it, by which I mean no decision was made to do anything about it up front. Is that now a conscious decision which means it becomes technical debt?

[2] Whilst I agree in principle that refactoring should be a by-product of any story, sometimes the debt grows into something better served by an architectural refactoring.

[3] This one component has provided inspiration for a number of other blog posts, e.g. this one in particular “The Cost of Defensive Programming”.

[4] We tried to avoid mixing purely technical changes, e.g. architectural refactorings and upcoming features (toggled off), with changes that affected the calculation results. This was to ensure our regression testing had virtually no noise. Packaging a number-breaking change in isolation was also a defence mechanism that allowed us to squarely point the finger outside our team when it went wrong. And it did, on numerous occasions.

Monday, 20 October 2014

What’s the Price of Confidence?

I recently had one of those the conversations about testing that comes up every now and then. It usually starts off with someone in the team, probably the project manager, conveying deep concerns about a particular change or new feature with them getting twitchy about whether it’s been tested well enough or not. In those cases where it comes from the management side, that fear can be projected onto the team, but in a way that attempts to somehow use that fear as a tool to try and “magically” ensure there are no bugs or performance problems (e.g. by implying that lots of late nights running manual tests will do the trick). This is all in contrast to the famous quote from Edsger Dijkstra:

Program testing can be used to show the presence of bugs, but never to show their absence

Financial Loss

The first time this situation came up I was working at an investment bank and the conversation meandered around until a manager started to get anxious and suggested that if we screw up it could cost the company upwards of 10-20 million quid. Okay, so that’s got mine and my colleague’s attention, but neither of us could see what we could obviously do that would ensure we were “100%” bug free. We were already doing unit testing and some informal code reviewing, and we were also delivering to our development system-test environment as often as we could where we ran in lock-step with production but on a reduced data set and calculation resolution.

In fact the crux of the argument was really that our UAT environment was woefully underpowered - it had become the production environment on the first release. If we had parity with production we could also do the kind of regression testing that would get pretty close to 100% confidence that there was nothing, either functionally or performance wise, that was likely to appear after releasing.

My argument was that, knowing what we do from Dijkstra, if the company stands to lose so much money from us making a mistake, then surely the risk is worth the investment by the company to help us minimise the chances of a problem slipping through; us being human beings and all (even if we are experienced ones). Bear in mind that this was an investment bank, where the team was made up of 6 skilled contractors, and we were only asking for a handful of beefy app servers and a few dozen blades to go in the compute grid. I posited that the cost of the hardware, which was likely to be far less than 100K, was orders of magnitude lower than the cost of failure and it was only a month or two of what the entire team costs. That outlay did not seem unrealistic to me given all the other project costs.

Loss of Reputation

The more recent conversation was once again about the parity between the pre-production and production environments, but this time about the database. The same “fear” was there again, that the behaviour of a new maintenance service might screw up, but this time the cost was more likely to be directly expressed as a soiled reputation. That could of course lead to the loss of future business from the affected customers and anyone else who was unhappy about a similar prospect happening to them, so indirectly it could still lead to some financial loss.

My response was once again a sense of dismay that we could not just get the database restored to the test environment and get on with it. I could understand if the data was sensitive, i.e. real customer data needed masking, or if it was huge (hundreds of TBs, not a couple of hundred GBs) although that would give me more cause for concern, not less. But it wasn’t, and I don’t know why this couldn’t just be done either as a one off, which is possibly more valid in this scenario, or better yet to establish it as a practice going forward.

The cultural difference at play here is that the databases appear to be closely guarded and so there are more hoops to go through to gain access to both a server and the backups.

Provisioning

Maybe I’m being naive here, but I thought one of the benefits of all the effort going into cloud computing is that the provisioning of servers, at least for the run-of-the-mill roles, becomes trivial. I accept that for the bigger server roles, such as databases, the effort and cost may be higher, but given how sensitive they can be to becoming the bottleneck we should put more effort into ensuring they are made available when the chances of a performance problem showing up is heightened. At the very least it must be possible to temporarily tailor any test environment so that it can be used to perform adequate testing of the changes that are a cause for concern.

Continuous Delivery

This all sounds decidedly old school though, i.e. doing development and a big bang release where you can focus on some final testing. In his talk at Agile on the Beach 2014 Steve Smith described Release Testing as Risk Management Theatre. A more modern approach is to focus on delivering “little and often” which means that you’re constantly pushing changes through your pre-production environment; so it has to be agile too. If you can not or will not invest in what it takes to continuously scale your test environment(s) to meet their demands then I find it difficult to see how you are ever going to gain the level of confidence that appears to be being sought after.

One thing Steve simplified in his talk [1] was the way features are pushed through the pipeline. In his model features go through in their entirety, and only their entirety, which is not necessarily the case when using practices such as Feature Toggles which forces integration to happen as early as possible. A side-effect of this technique is that partially finished features can go out into production sooner, which is potentially desirable for pure refactorings so that you begin to reap your return in the investment (ROI) sooner. But at the same time you need to be careful that the refactoring does not have some adverse impact on performance. Consequently Continuous Delivery comes with it’s own set of risks, but the general consensus is that these are manageable and worth taking to establish an earlier ROI, but you must be geared up for it.

One of the questions Steve asked in his talk was “how long does it take to get a code change into production?” [2]. Personally I like to think there are really two questions here: “how long does it take to reap the ROI on a new feature or change” and “how long does it take to roll a fix out”. A factor in both of these questions is how confident you are in your development process that you’re delivering quality code and doing adequate (automated) testing to root out any unintended side-effects. This confidence will come at a price that takes in direct costs, such as infrastructure & tooling, but also indirect costs such as the time spent by the team writing & running tests and reviewing the design & code. If you decide to save money on infrastructure and tooling, or work in an environment that makes it difficult to get what you need, how are you going to compensate for that? And will it cost you more in time and energy in the long run?

 

[1] I asked him about this after his talk and he agreed that it was a simplified model used to keep the underlying message clear and simple.

[2] This question more famously comes from “Lean Software Development: An Agile Toolkit” by Mary and Tom Poppendieck.

Friday, 17 October 2014

Terse Exception Messages

Yesterday, whilst putting together a testing tool, I hit one of those generic framework error messages that says exactly what’s wrong but says nothing about where the problem might be. A large part of this diversion was down to being half way through adding support for a new command line verb and then returning to it after a lengthy phone call whilst misremembering that I hadn’t quite finished wiring in the initial NOP implementation.

Everything looked to be in the right place, but when I ran it I got no output at all - not even an error message. Luckily I always test my command line applications with my two-line batch file (See “Windows Batch File Template”) that reports the process exit code to ensure I’m returning the correct code on failure. It showed a “2” which is our generic failure code, so that was ok.

A quick spot of debugging of our console application base class and the problem was easy to see, the code looked like this:

int exitCode = ExitCode.Failure;

foreach (var handler in _handlers)
{
  if (handler.Verb == verb)
  {
    exitCode = handler.Execute(options);
    break;
  }
}

return exitCode;

As you can probably see if there is no handler wired up for the verb then it just falls out the bottom. Luckily we got one thing right which was to adopt the stance of “Assume Failure by Default” so that at least an error was reported, albeit not a particularly helpful one.

A Better Error Message

One could argue at this point that while the code is probably not brilliant, given the domain (a console application), it’s fairly easy to test and fix whatever mistake I made. But, is it worth putting any effort into improving matters and if so how much? Personally I think avoiding fixing this kind of problem is part of what leads to the “Broken Windows” syndrome. It’s only a small piece of code and a sprinkling of LINQ will soon sort it out:

return _handlers.Single(v => v.Name == verb); 
                .Execute(options);

This is much simpler and it generates a bit more noise when it fails, which feels like A Good Thing. When I ran it I got the following error in my console window though:

{System.InvalidOperationException} Sequence contains no matching element

Anyone who has ever done any LINQ will recognise this somewhat terse error message. Of course it’s not the framework’s fault, after all how can they know anything about how the method is used? And it uses generics too so it’s not as if they could easily tell you what the element was that you were looking for because we’ve passed a lambda in.

An Even Better Error Message

At this point we have managed to reduce our code to something that is pretty simple and gets the job done - case closed, right?. The question I asked myself though was whether I should undo some of that simplification in order to produce a better diagnostic message? As I mentioned above the scenario where this code is used means it wouldn’t take long with a debugger to track down the mistake and so is there really any value in going one step further?

I have found in the past that developers often concentrate on the happy path and then generally gloss over the error paths, by which I mean they make sure an error is reported, but don’t put too much effort into ensuring that what is reported would be of use to fellow developers or, more importantly, the support team. I’ve discussed this before in “Diagnostic & Support User Interfaces” but suffice to say that a key part of what makes a complex system supportable is well thought out error handling and clearly written diagnostic messages.

In this instance I decided the LINQ exception was still too terse, and although I wouldn’t be able to pinpoint the source of the error clearly, I felt I could probably do much better with only a little extra code:

var handler = _handlers.SingleOrDefault(v =>
                                    v.Name == verb);

Constraint.MustNotBeNull(handler,
    "No handler configured for verb '{0}'", verbName);

return handler.Execute(options);

This meant that the output from running the console application now was more like this:

{ConstraintViolationException} No handler configured for verb 'new-verb'

By switching from using Single() to SingleOrDefault() I got to catch the missing handler problem myself and effectively report it through an assertion failure [1]. Of course I don’t explicitly catch the case where two handlers with the same name are registered; that fault will still be reported via a terser LINQ exception. However I felt at the time that this was less likely (in retrospect a copy-and-paste error is probably more likely).

The Anguish of Choice

This is the kind of small problem I bump into all the time and I never really know which way to go. While on the one hand I love the simplicity of the code that I could have written, I’ve also worked on too many codebases where problems are tedious to solve because not quite enough thought was put into the likely failure scenarios up front, or refactored when they did show up so that it improves the lives of our successors [2].

 

[1] The Constraint class plays a similar role to Debug.Assert() but is compiled into release builds too. Code contracts could also be used but I wrote this long before working on a C# codebase where the use of real Code Contracts was possible.

[2] A seamless link to another recent post: “Will Your Successor Be a Superstar Programmer”.

Wednesday, 8 October 2014

Will Your Successor Be a Superstar Programmer?

Something I struggle with when writing code is trying to factor in what the maintenance programmer that comes after me will be like. I’ve touched on this before in “Can Code Be Too Simple?” when I showed some code that could be succinctly implemented in C++, but might not be quite as obvious to someone more well versed in other languages. As such can they maintain the code you’ve written?

It might sound ridiculous asking someone who is not an “expert” in a particular programming language to fix a problem but that is exactly what has happened in some of the companies I’ve worked in. The likes of GitHub and Google might get to hire the cream of the crop and therefore have a level of expertise in their programmers that ensures they can write the best code possible, but those who don’t quite cut the mustard will end up somewhere else. As such the average level of programmer ability in the rest of the world is likely to be somewhat lower than the industry’s superstars.

Of course that assumes the code is maintained by the programmers who originally built the system in the first place. One client I worked at has a more unusual approach to handling software maintenance - give it to the support team. That’s right, they hire some experienced developers to build them a system and then on release they hand the codebase over to the support team to maintain it. Don’t get me wrong the support team are not idiots, far from it, but their day job is operational support and system administration, not becoming the best programmer they can be. I’m not sure whether I should find it condescending or not that my client considers the code I write as being so simple as to be maintainable by someone who does it only part-time.

When I was working on that project I had a brief discussion with a colleague about this setup. I voiced my concerns about the fact that this codebase was going to be supported by people who were most probably not nearly as experienced as the programmers that were going to develop it. In the back on my mind I always have that quote from Brian Kernighan [1]:

“Debugging is twice as hard as writing the code in the first place. Therefore, if you write the code as cleverly as possible, you are, by definition, not smart enough to debug it.”

His reply, which I think is a perfectly valid viewpoint, is that it is not our problem if the company that hires us to write their software chooses not to hire people of equal talent (or even more talented) to support it.

Maybe that is the whole point. By writing the absolute best code I can (by using the natural idioms, proper domain types, etc.) the code stands a better chance of being supported by a non-expert? Sadly I won’t be around to find out, because if I was, I’d be the one supporting it [2].

 

[1] Brian W. Kernighan and P. J. Plauger in The Elements of Programming Style. See Programming Quotes.

[2] Whilst I was around, even on a different project, any changes still came my way. I’m not sure whether that was for convenience, a lack of resources or a risk-reduction measure for this very reason.