When writing production code, the functions with the lowest coupling, and the ones easiest to reason about, are (generally speaking) pure functions. They depend only on their inputs and have no side effects, which means that given the same inputs you should always get the same output. But this doesn’t just have to apply to production code; in fact it needs to apply to the test code and the environment too.
I’m sure a proper mathematician will snarl at my simple analogy, but the equation “A + B = C” is how I expect tests to behave. That is to say: given the same inputs and behaviours I expect to get the same output. Always. In essence I expect tests to be as dependable as simple arithmetic.
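The arithmetic analogy can be sketched directly as a test. This is a minimal, hypothetical example (the function and names are mine, not from any real codebase) showing the shape: a pure function whose output depends only on its inputs, verified by a test that will pass identically on every run.

```python
def add_vat(net: float, rate: float) -> float:
    """A pure function: no hidden inputs, no side effects."""
    return round(net * (1 + rate), 2)

def test_add_vat_is_deterministic():
    # Same inputs, same output -- every single run, on every machine.
    assert add_vat(100.0, 0.2) == 120.0
    # Repeating the call changes nothing; the test is as dependable
    # as the arithmetic it exercises.
    assert add_vat(100.0, 0.2) == add_vat(100.0, 0.2)
```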
This simple premise forms the bedrock on which all tests sit, because you cannot (easily) make progress on your software project if there is a lot of noise going on around you. Every time a test fails for seemingly innocuous reasons you will be distracted from your goal. Even if you choose to completely ignore the failing tests, your build pipeline will stall every time it happens, so you will still be distracted, if only to keep restarting the build! In my experience it pays to cure the disease rather than keep treating the symptoms.
Removing sources of non-determinism in tests is definitely one problem I find myself tackling fairly regularly. Trying to ensure that A and B are the only inputs into the test, and that A and B are always the same, is a common source of noise. In particular, code and tests where date/times or concurrency are involved are likely to suffer from random, distracting failures.
The other aspect of testing that this little mathematical equation is intended to convey is simplicity. The number of variables is consciously small and there is only a single operator, which provides the point of focus. While you might think this implies a discussion about low-level unit tests, it can equally apply to other forms of automated test too. Where A and B might be simple values in a unit test, they might be a single, larger compound value in an acceptance test. Similarly, the operation may be a single method in the former case and a RESTful API call in the latter. The increase in size and complexity can be offset (to a certain degree) by applying Separation of Concerns to the scenario.
One of the first invariants to consider is the environment. It helps if you can keep it as stable as possible across test runs – if you keep switching the version of the toolchain and dependencies on each test run, how will you know whether a failure is down to your change or to something shifting in the background?
Obviously it’s not normally that blatant, but there are common differences that I see trip developers up time and again. One is the build server using a fresh checkout versus the developer’s workstation, which often has piles of detritus building up because they don’t realise what their IDE leaves behind (see “Cleaning the Workspace”). Another is using a different, “compatible” test runner from within the IDE whilst the build script uses the real packaged command-line tool. If you don’t run the same build script before publishing your changes then you’re faced with all the differences between the build server and workstation to consider when the build breaks.
Many development practices have changed over time, and the notion of a stable working environment is one that is now being heavily challenged. The use of cloud-based build services like Travis and AppVeyor means you may have much less control over the specific details of the build environment. Also, the pace at which third-party packages change, such as those from NuGet or NPM, along with the number we now depend on, means that any two builds may have used different dependencies. Sitting on the bleeding edge comes at a cost, so be sure you know what you’re getting yourself into.
The analogy I favour most when it comes to changing any invariant, such as the build environment, is to follow the advice of the rock climber – only move one thing at a time. So I’m told, a rock climber always keeps at least three limbs attached to the rock face and therefore only ever moves one at a time. Only when the moved limb is secure do they move another. I’m sure experienced rock climbers do break the rules sometimes, but they do it knowing the full consequences.
Avoiding Non-Determinism In Tests
With our environment behaving in a predictable manner we have reduced the potential for noise so that the first port of call for our test failures is the production code or test itself. Unless you are writing something inherently non-deterministic, like a true random number generator, then it should be possible to control the production code in a way that allows you to test it in a well-defined way.
This isn’t always easy, but it starts by not trying to test behaviours which are inherently unreliable, such as waiting for a fixed amount of time to pass, or for an operation to run on a background thread. Whilst these operations may complete fairly quickly on your super-fast developer workstation, your build server will likely be a VM sharing a hugely overloaded box, where CPU cycles are somewhat scarcer.
My preferred method of dealing with such behaviours is either to mock system calls, such as getting the current time, or to use synchronisation objects like manual events, semaphores and countdown events to sense the transitions through mocks. If I’m scheduling work onto other threads I might mock the scheduler (or dispatch function) with one that just runs the task directly on the same thread so that I know for sure it will complete before the method returns.
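Both of these techniques can be sketched together. In this hypothetical example (the `Cache` class and its names are invented for illustration), the production code takes its clock and its scheduler as injectable collaborators, so the test can substitute a fixed time and an inline dispatcher that runs the task synchronously on the same thread:

```python
import datetime as dt

class Cache:
    """Hypothetical production class with injectable clock and scheduler."""
    def __init__(self, now=dt.datetime.now, schedule=None):
        self._now = now                                     # injectable clock
        # In production, schedule might post to a thread pool; the default
        # here simply invokes the task directly on the calling thread.
        self._schedule = schedule or (lambda task: task())
        self.refreshed_at = None

    def refresh(self):
        self._schedule(self._do_refresh)

    def _do_refresh(self):
        self.refreshed_at = self._now()

def test_refresh_records_time():
    fixed = dt.datetime(2020, 2, 29, 12, 0, 0)   # a known, fixed "now"
    cache = Cache(now=lambda: fixed)             # inline scheduler by default
    cache.refresh()                              # guaranteed complete on return
    assert cache.refreshed_at == fixed
```

Because the mocked scheduler dispatches on the calling thread, the test knows for certain the work has finished before `refresh()` returns, and the mocked clock removes any dependence on wall-clock time.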
The mistake I think many developers make is believing it’s possible to test many concurrent behaviours reliably. Whilst you might be able to prove the presence of a deadlock by testing continuously for a period of time, you cannot prove the absence of one that way (Dijkstra taught us that). The cost is longer-running tests and therefore the chance that they will be run less often. Consequently I prefer mostly to write tests that characterise the concurrent behaviour, rather than attempt to prove the concurrency aspect is correct. That I leave for testing in an environment much better suited to the task.
Just recently I came across a test that failed one in every 10 attempts, but when I added another test the failure rate went right up. The reason the original test failed at all was down to it spinning up a background thread to do a cache refresh, which occasionally took longer than the test was prepared to wait. Waiting an arbitrary amount of time is generally a big no-no; you should try to synchronise with the thread somehow. In this instance I could do that through the outbound API mock which the operation was invoking.
The reason the failure rate went up was due to the way that NUnit runs tests alphabetically. My new test came “after” the previous one, which never cleaned up the background thread, and so it could fire a refresh again when my test was running. This caused multiple refreshes to be registered instead of zero. The answer was to retain ownership of the background thread and ensure it terminated by the time the test completed.
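The ownership fix can be sketched as follows; this is a hypothetical reconstruction, not the original code. The refresher owns its thread and exposes a `stop()` that the test’s teardown calls, guaranteeing the thread has terminated before the next test begins.

```python
import threading

class Refresher:
    """Hypothetical background refresher that owns its thread."""
    def __init__(self, task):
        self._task = task
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._loop)

    def start(self):
        self._thread.start()

    def _loop(self):
        # Refresh repeatedly until asked to stop; the wait doubles as
        # the refresh interval and the stop signal.
        while not self._stop.wait(timeout=0.01):
            self._task()

    def running(self):
        return self._thread.is_alive()

    def stop(self):
        # Called from test teardown: no thread outlives the test, so no
        # stray refreshes can bleed into the next (alphabetical) test.
        self._stop.set()
        self._thread.join()

def test_refresher_stops_cleanly():
    count = [0]
    r = Refresher(lambda: count.__setitem__(0, count[0] + 1))
    r.start()
    r.stop()
    assert not r.running()           # thread fully terminated before exit
```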
Does this add complexity to the production code? Yes. But, as is often the case when writing tests properly, it uncovers a number of questions about the expected behaviour that might not be obvious from quickly knocking something up.
Property Based Testing
One of the problems with example-based testing is that, by definition, you only run through the scenarios laid out with specific examples. If you miss one, e.g. testing for a leap year, you won’t find the bug. Property-based testing takes a slightly different approach: it runs through a variety of scenarios made up on the fly by a generator.
This might seem to go against the earlier advice, but it doesn’t, because as long as you know the seed for the test generator you can reproduce the scenario exactly. If you can explore the entire problem space on each test run then do so, but unless it’s a trivial feature that’s unlikely. Hence you use a generator to create inputs that explore a (different) part of it each time. Consequently you still might not unearth that leap year bug in time, but you stand a better chance of finding it, and other unanticipated problems, too.
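A minimal sketch of the idea, without any particular framework (real libraries such as FsCheck or Hypothesis do this with shrinking and automatic seed reporting; the `is_leap_year` function here is hypothetical): the generator is seeded explicitly, so any failing run can be replayed exactly by re-using that seed.

```python
import random

def is_leap_year(year: int) -> bool:
    """Hypothetical function under test."""
    return year % 4 == 0 and (year % 100 != 0 or year % 400 == 0)

def test_leap_year_properties(seed=12345):
    # Record (or log) the seed so a failure can be reproduced exactly.
    rng = random.Random(seed)
    for _ in range(1000):
        year = rng.randrange(1600, 3000)
        # Property: every leap year is divisible by 4.
        if is_leap_year(year):
            assert year % 4 == 0, f"seed={seed}, year={year}"
        # Property: century years are leap only when divisible by 400.
        if year % 100 == 0:
            assert is_leap_year(year) == (year % 400 == 0), \
                f"seed={seed}, year={year}"
```

Each CI run could pass a fresh seed to explore a different slice of the input space, while a failure message carrying the seed turns a one-off random failure back into a deterministic, reproducible test.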
Not every test we write will be as simple and predictable as my little maths equation suggests, but that doesn’t mean we shouldn’t try to get as close to it as possible. Some of the unit tests I once wrote for a message bus connection pool definitely weren’t easy to understand, despite my best efforts to keep them clear, but hopefully they are very much in the minority.
We all make mistakes too, so we might not get it right first time. The unit tests I wrote for logging garbage collections ran fine for over a year before I discovered one of them was leaking slightly and causing another (very intermittent) random failure. But I managed to fix it because I believe a zero tolerance approach to test failures pays dividends in the long run.
Good tests are hard to write, but they should be treated as first class citizens, just like your production code. If they aren’t as simple as “A + B = C” then consider it a test smell and see if there is some unnecessary complexity that can be factored out.
It might be possible to verify its behaviour through other means, such as induction, but that’s something I personally know far too little about to use effectively.