Monday, 3 September 2018

Surgical Support Needs Surgical Tools

In the world of IT support there is the universal solution to every problem – turn it off and on again. You would hope that this kind of drastic action is only drawn upon when all other options have been explored or that the problem is known a priori to require such a response and is the least disruptive course of action. Sadly what was often just a joke in the past has become everyday life for many systems as rebooting or restarting is woven into the daily support routine.

In my professional career I have worked on a number of systems where there are daily scheduled jobs to reboot machines or restart services. I’m not talking about the modern, proactive kind like the Chaos Monkey which is used to probe for weaknesses, or when you force cluster failovers to check everything’s healthy; I’m talking about jobs where the restart is required to ensure the correct functioning of the system – disabling them would cripple it.

Sledgehammers

The need for the restart is usually to overcome some kind of corrupt or immutable internal state, or to compensate for a resource leak, such as memory, which degrades the service to an unacceptable level. An example of the former I’ve seen is to flush a poisoned cache or pick up the change in date, while for the latter it might be unclosed database connections or file handles. The notion of “recycling” processes to overcome residual effects has become so prominent that services like IIS provide explicit support for it [1].

Depending on where the service sits in the stack the restart could be somewhat disruptive, if it’s located on the edge, or largely benign if it’s purely internal, say, a background message queue listener. For example I once worked on a compute farm where one of the front-end services was restarted every night and that caused all clients to drop their connection resulting in a support email being sent due to the “unhandled exception”. Needless to say everyone just ignored all the emails as they only added to the background noise making genuine failures harder to spot.

These kind of draconian measures to try and attain some system stability actually make matters worse as the restarts then begin to hide genuine stability issues which eventually start happening during business hours as well and therefore cause grief for customers as unplanned downtime starts occurring. The impetus for one of my first ACCU articles “Utilising More Than 4GB of Memory in a 32-bit Windows Process” came from just such an issue where a service suddenly starting failing with out-of-memory errors even after a restart if the load was awkwardly skewed. It took almost four weeks to diagnose and properly fix the issue during which there were no acceptable workarounds – just constant manual intervention from the support team.

I also lost quite a few hours on the system I mentioned earlier debugging a problem in the caching mechanism which was masked by a restart and only surfaced because the restart failed to occur. No one had remembered about this failure mode because everyone was so used to the restart hiding it. Having additional complexity in the code for a feature that will never be used in practice is essentially waste.

Cracking Nuts

Although it’s not true in all cases (the memory problem just described being a good example) the restart option may be avoidable if the process exposed additional behaviours that allowed for a more surgical approach to recovery to take place. Do you really need to restart the entire process just to flush some internal cache, or maybe just a small part of it? Similarly if you need to bump the business date via an external stimulus can that not be done through a “discoverable” API instead of hidden as part of a service restart [2]?

In some of my previous posts and articles, e.g “From Test Harness To Support Tool”, “Home-Grown Tools” and more recently in “Libraries, Console Apps, and GUIs”, I have described how useful I have found writing simple custom tools to be for development and deployment but also, more importantly, for support. What I think was missing from those posts that I have tried to capture in this one, most notably through its title, is to focus on resolving system problems with the minimal amount of disruption. Assuming that you cannot actually fix the underlying design issue without a serious amount of time and effort can you at least alleviate the problem in the shorter term by adding simple endpoints and tools that can be used to make surgical-like changes inside the critical parts of the system?

For example, imagine that you’re working on a grid computing system where tasks are dished out to different processes and the results are aggregated. Ideally you would assume that failures are going to happen and that some kind of timeout and retry mechanism would be in place so that when a process dies the system recovers automatically [3]. However, if for some reason that automatic mechanism does not exist how are you going to recover? Given that someone (or something) somewhere is waiting for their results how are you going to “unblock the system” and allow them to make progress, without also disturbing all your other users who are unaffected?

You can either try and re-submit the failed task and allow the entire job to complete or kill the entire job and get the client to re-submit its job. As we’ve seen one way to achieve the latter would be to restart parts of the system thereby flushing the job from it. When this is a once in a million event it might make sense [4] but once the failures start racking up throwing away every in-flight request just to fix the odd broken one becomes more and more disruptive. Instead you need a way to identify the failed task (perhaps using the logs) and then instruct the system to recover such as by killing just that job or by asking it to resubmit it to another node.

Hence, ideally you’d just like to hit one admin endpoint and say something like this:

> admin-cli kill-job --server prod --job-id 12345

If that’s not easily achievable and there is distributed state to clear up you might need a few separate little tools instead that can target each part of system, for example:

> admin-cli remove-node –s broker-prod --node NODE99
> admin-cli remove-results -s results-prod --job 12345
> admin-cli remove-tasks –s taskq-prod --job 12345
> admin-cli reset-client –s ui-prod --client CLT42
> admin-cli add-node –s broker-prod --node NODE99

This probably seems like a lot of work to write all these little tools but what I’ve found in practice is that usually most of the tricky logic in the services already exists – you just need to find a way to invoke it externally with the correct arguments.

These days it’s far easier to graft a simple administration endpoint onto an existing service. There are plenty of HTTP libraries available that will allow you to expose a very basic API which you could even poke with CURL. If you’re already using something more meaty like WCF then adding another interface should be trivial too.

Modern systems are becoming more and more decoupled through the use of message queues which also adds a natural extension point as they typically already do all the heavy lifting and you just need to add new message handlers for the extra behaviours. One of the earliest distributed systems I worked on used pub/sub on a system-wide message bus both for functional and administrative use. Instead of using lots of different tools we had a single admin command line tool that the run playbook generally used (even for some classic sysadmin stuff like service restarts) as it made the whole support experience simpler.

Once you have these basic tools it then becomes easy to compose and automate them. For example on a similar system I started by adding a simple support stored procedure to find failed tasks in a batch processing system. This was soon augmented by another one to resubmit a failed task, which was then automated by a simple script. Eventually it got “productionised” and became a formal part of the system providing the “slow retry” path [3] for automatic recovery from less transient outages.

Design for Supportability

One of the design concepts I’ve personally found really helpful is Design for Testability; something which came out of the hardware world and pushes us to design our systems in a way that makes them much easier test. A knock-on effect of this is that you can reduce your end-to-end testing burden and deliver quicker.

A by-product of Design for Testability is that it causes you to design your system in a way that allows internal behaviours to be observed and controlled in isolation. These same behaviours are often the same ones that supporting a system in a more fine-grained manner will also require. Hence by thinking about how you test your system components you are almost certainly thinking about how they would need to be supported too.

Ultimately of course those same thoughts should also be making you think about how the system will likely fail and therefore what needs to be put in place beforehand to help it recover automatically. In the unfortunate event that you can’t recover automatically in the short term you should still have some tools handy that should facilitate a swift and far less disruptive manual recovery.

 

[1] Note that this is different from a process restarting itself because it has detected that it might have become unstable, perhaps through the use of the Circuit Breaker pattern.

[2] Aside from the benefits of comprehension this makes the system far more testable as it means you can control the date and therefore write deterministic tests.

[3] See “When Does a Transient Failure Stop Being Transient” for a tangent about fast and slow retries.

[4] Designing any distributed system that does not tolerate network failures is asking for trouble in my book but the enterprise has a habit of believing the network is “reliable enough”.

Thursday, 9 August 2018

Test Language: Behaviours, Not Examples

Naming is hard, as we know from the old adage about the two hardest problems in Computer Science, and naming in tests is no different. I’ve documented my own journey around how I structure tests in two previous posts: “Unit Testing Evolution Part II – Naming Conventions” and “Other Test Naming Conventions”. I’ve also covered some similar ground before quite recently in “Overly Prescriptive Tests” but that was more about the content of the tests themselves, whereas here I’m trying to focus more on the language aspects.

Describing the Example

Something which I’ve observed, both from reviewing Fizz Buzz submissions with tests [1] and from real tests, is that there is often that missing leap from writing a test which describes a single example to generalising the language to describe the effective behaviour [2]. For example, imagine you’re writing a unit test for a calculator, if you literally encode your example as your test name you might write:

[Test]
public void two_plus_two_is_equal_to_four()

Given that you could accidentally implement it with multiplication and still make the test pass you might add another scenario to be sure you don’t fall into that trap:

[Test]
public void three_plus_seven_is_equal_to_ten()

The problem with these test names is that they only tell you about the specific scenario covered by the test, not about the bigger picture. One potential refactoring might be to parameterise the test thereby forcing you to generalise the name:

[TestCase(2, 2, 4)]
[TestCase(3, 7, 10)]
public void adding_two_numbers_together_returns_their_sum(. . .)

One way this often shows up in FizzBuzz tests is with examples for the various rules, e.g.

[Test]
public void three_returns_the_word_fizz()

[Test]
public void five_returns_the_word_buzz()

The rules of a basic calculator are already known to pretty much everyone but here, unless you know the rules of the game Fizz Buzz, you would not be able to derive them from these examples alone and one very important role of tests are to document, nay specify, the behaviour of our code.

Describing the Behaviour

Hence to encode the rules you need to think more generally:

a_number_divisible_by_three_returns_the_word_fizz

There are a couple of issues here, namely that technically any number is divisible by three (just not wholly), and also that it won’t be true once we start bringing in the more advanced rules. It’s not easy trying to be precise and yet also somewhat vague at the same time, but we can try:

a_number_wholly_divisible_by_three_generally_returns_the_word_fizz

Once we bring in the “divisible by three and divisible by five” rule it becomes much harder to be precise in our test names as we’d have to include the overriding rules too which likely makes them harder to read and comprehend:

a_number_wholly_divisible_by_three_but_not_also_wholly_divisible_by_five_returns_the_word_fizz

You might just get away with it this time but its not really scalable and test names, much like code comments, often have a habit of getting out of sync with reality. Even when they break due to new functionality it’s easy to end up fixing the test and forgetting to check whether the “documentation” aspect still reflects the new behaviour.

Hence I personally prefer to use words in test names that suggest “broad strokes” when necessary and guide the reader (top to bottom) from the more general scenarios to the more specific. This, in my mind, is similar to putting the happy path cases before the various error handling ones.

Validating Collections

These examples might be a little too trivial but the impetus for this post came from similar scenarios where the test language talked about the outcome of the example itself rather than the behaviour of the logic in general. The knock-on effect of doing this, apart from making the intent of the example harder to comprehend in the future, was that it also became brittle as the specific scenario outcome was encoded in the test and any change in logic that might be orthogonal to it could break it unnecessarily. (As mentioned earlier, “Overly Prescriptive Tests” looks at brittle tests from a different angle.)

A common place where this shows up is when asserting behaviours around collections. For example imagine writing tests for querying the seats available in a cinema where there are seats in different price bands. When testing the “seat query” method for an exhausted price band you might be inclined to write:

[TestFixture]
public class when_querying_for_seats_and_none_left_in_band
{
  [Test]
  public void then_the_result_is_empty()
  {
    auditorium.Add(“Posh Seats”, new Seats[0]);

    var seats = auditorium.FindAvailableSeats();

    Assert.That(seats, Is.Empty);
  }
}

The example, being minimal in nature, means that technically in this scenario the result will be empty. However that is an artefact of the way the example is expressed and the test has been written. If I were to change the test set-up and add the following line, the test would break:

auditorium.Add(“Cheap Seats”, new Seats[100]);

While the outcome of the example above might be “empty”, that is not the general behaviour of the logic under test and our test language should be changed to describe that:

[Test]
public void then_no_seats_in_that_band_are_returned()

Now we’re not making a statement about what else might or might not be in that result, only what our expectations are for seats in the band in question. Once we have fixed the test language we can address how we validate that in the example. Instead of looking at what is in the collection we should be looking at what isn’t there as the test name tells us to expect that something should be absent, and the assert should reflect that language:

Assert.That(seats.Where(s => s.Band == “Posh Seats”), Is.Empty);

Now I should only be able to break this test by changing the data or logic specific to the example, orthogonal behaviours should not break it by accident. (See “Manual Mutation Testing” for more on how you can test the quality of your tests.)

Invest in Tests

If you’ve ever worked on a codebase with brittle tests you’ll know how frustrating it can be when your feature mushrooms because you broke a bunch of badly written tests. If we’re lucky we see the failed assertion and if it’s not obvious then we can look back at the test name to see if the scenario rings any bells. If we’re unlucky we have to immediately reach for the debugger and likely add “refactor tests” to the yak stack.

If you “pay it forward” by taking the time to write good tests up front you’ll find it easier to sustain delivery in the future.

 

[1] A company I once worked for used Fizz Buzz in their candidate early screening process. Despite being overkill in practice (as was pointed out to candidates) a suite of tests was requested as part of the submission to help get a feel for what style they used. IMHO the tests said much more about candidates than the production code.

[2] Yes, “property based testing” takes this entire concept a step further so that it exercises the behaviour with multiple examples generated differently each time. That’s the destination, this post is about one possible journey.

Tuesday, 24 July 2018

TODO or TODO Not – Redux

One of my most “successful” posts was one I wrote way back in 2011 about my dislike for TODO style comments in code – “TODO or TODO Not - There Is No Later”. The premise of that post, which includes a number of examples from real codebases I’ve worked on, is that they are fundamentally pointless because they are almost certainly too low in value to get done. If they were valuable enough they either should be a proper feature on the backlog or be left to be handled as part of a relevant story, e.g. refactoring.

At the time I wrote that post I was convinced about them not living in the codebase (past the feature’s release) but I suggested that any potentially useful ones should be converted into bona fide user stories so they could be formally considered and prioritised along with the other items. After receiving a reply to a tweet that referenced my aging blog post from a company that tries to quantify technical debt by analysing such TODO style comments I felt it must be time for an update. (The TL;DR of this post was my reply to them.)

// Learn to Let Go

One of the hardest lessons I have learned over the intervening 7 years is how to let go of stuff. What caused me to cling onto some of those TODOs that I would run across was a fear of forgetting something important. My daily use of a log book (see “Pen & Paper”) to record notes as I go along exemplifies my apparent need to keep track of my current state as my brain is like the proverbial sieve. In an era where change was much harder because the software development QA process was largely manual this makes sense as it was more popular to batch-up changes. Couple this with a general disregard for anything non-functional, such as an appreciation for refactoring, and it’s all too easy to see why people bury their personal backlog in the code rather than open it up for discussion with non-developers.

There is a familiar ring here and that’s because I’ve walked this path before more recently in 2016’s “Developers Can Be Their Own Worst Enemy”. With a modern development process that puts an emphasis on trust and transparency we can let go of the past and should have more confidence in our peers and the management to see the value in our opinions.

It’s also not acceptable to just leave a cryptic comment in the code about why something should be implemented differently or moved elsewhere – the need to change must be backed up with a reason so that we can understand the value in it, and most TODO comments are throwaway rather than well reasoned arguments.

// Measuring Technical Debt

I’ll be honest, I genuinely cannot see the point of attempting to measure the level of technical debt in a codebase, even if it were possible to do so. For a start you need to decide if you’re taking the original interpretation – a conscious decision to temporarily take a shortcut - or the ever more popular one – crap code. Even if you could do that, what are you ever going to do with the output? I once saw a SonarQube report that said we had £X million of technical debt in a codebase I was working on; how exactly does that inform you?

The reason I find it pointless is because it feels like it’s measuring the wrong thing. Just like the story about the man looking for his keys under the streetlight we apply a tool to measure the quality of the code when what really matters is measuring our ability to deliver. The reason poor quality code affects us is because it makes future change hard and slows us down. Hence if it matters there must be a causal link because if the code is that bad our pace of delivery will slow down and we’ll see that (all other things being equal).

But what is more important in my mind is that even if you did know how much debt you had you would only want to focus on the areas that are subject to change, and you do that already by focusing on the most valuable work in the backlog! And the technique for tackling the debt is refactoring. Hence we already have what we need to address the real problem.

// Trimming the Backlog

Unwinding the stack slightly let me return to the topic of reflecting unfinished work in the product’s backlog.

That sentence should already be making your spider-senses tingle – how can it be unfinished? If we’ve not met the acceptance criteria we’re not done, so carry on. If the acceptance criteria was wrong or incomplete then that’s a discussion to have with the product owner for clarification. Note that “of good quality” is an acceptance criteria inherent in every story we do as it is virtually non-negotiable [1].

Perhaps a better term would be undiscovered work. Woody Zuill has this saying “it’s in the doing of the work that we discover the work that needs doing”. What we often unearth are things which never showed up in our initial analysis and therefore we now have to factor them into our schedule or decide to drop them. When we’re knee deep in the feature it may seem really important to do it now, but given time the need may slowly dissolve, and in the end possibly disappear altogether.

What you might initially put on the backlog (or write as a TODO in the code) could be quite specific, e.g. “Need to handle poisoned messages”. As you continue you might also add other scenarios such as “Should refresh token before it expires”  and “Triage malformed messages separately from those well-formed but still invalid”. These are all valuable behaviours which go towards building a reliable service but the initial focus might not be on reliability, maybe its currently just a tactical fix to enable learning about something else. You don’t want to waste time over-engineering but at the same time you also don’t want to forget what else you have learned doing it.

The problem with this mentality is that the backlog just grows and grows, and we’re back to where we were before with TODOs littered all over the code – it’s just an every growing feature list. As time passes the need to implement some of those features will either happen because they have suddenly become important (it’s no longer tactical) or they will vanish (it remains a mere stop-gap). Leaving the stories on the backlog just makes it harder and harder to prioritise as you keep going over old ground again and again.

Once again we need to let go and rely on the fact that if they are important they will eventually resurface. If you feel really uncomfortable about deleting them entirely then you might consider rolling up related stories back into epic sized ones as part of the backlog grooming. Applying this to our recent example you could easily fold them all into a single “Service Reliability” epic that would be easier to handle. You could turn each card’s title into a checklist item.

// Confidence

That said as I get older I become less tolerant of a never-ending backlog and want to be more aggressive in the backlog grooming. Part of that is down to having more confidence in knowing that issues will be addressed [2] in a more timely fashion and that prioritisation will take into account both the technical and functional features with more or less equal consideration because the team is trusted to do The Right Thing.

Being more experienced there is no doubt more than a grain of truth that my sphere of influence is probably wider but that shouldn’t really matter. Every story must be valuable and ideally stand-up on its own merit, where my experience comes in is probably in being able to express that value more succinctly.

// TODO: Redux

I haven’t seen, heard or read anything in the intervening years that has been able to convince me in the value of using TODO comments in the code as a more successful technique of managing what needs to be done. If anything my appetite for tracking any work outside the next few sprints / weeks has begun to wane simply because it is now so easy to change in direction with only a moment’s notice should the situation deteriorate. This does not mean we should be reckless, far from it, but leaving a TODO in the code as a way of conveying a change in architecture hardly seems optimal either.

As for the notion that a TODO in the code can be equated with an increase in technical debt I don’t buy that either. I would posit it is more likely to indicate a failure within the development process itself as the mismatch between behaviour and acceptance criteria either indicates a bug in the code or a bug in the requirements and neither of these outcomes sounds like something that should be brushed off lightly with a comment in the code.

 

[1] I try not to deal in absolutes as there always seems to be an exception, but either way it must be a conscious decision.

[2] Assuming the organisation is behaving in a moderately agile way and not just paying lip service to a couple of the ceremonies or practices.

Thursday, 19 July 2018

Delivery Anti-Pattern: Local Optimisations

The daily stand-up mostly went along as usual. I wasn’t entirely sure why there was a stand-up as it wasn’t so much a team as a bunch of people working on the same codebase but with more-or-less individual goals. Applying the microservices analogy to the team you could say we were a bunch of micro-teams – each person largely acting with autonomy rather than striving to work together to achieve a common goal [1].

Time

But I digress, only slightly. What happened at the stand-up was a brief discussion around the delivery of a feature with the suggestion that it should be hidden behind a feature toggle. The implementer explained that they weren’t going to add a feature toggle because “it was a waste of their time”.

This surprised me. Knowing what I do now about how the team operates it isn’t that surprising but at the time I was used to working in teams where every member works towards common goals. One of those common goals is to try and ensure the delivery of features is a continuous flow and is not disrupted by a bad change which then has to be backed out because rolling back has the potential to create all sorts of disruption, not least the delay of those unaffected changes.

You should note that I’m not disagreeing with their choice of whether or not to use a feature toggle – I did not know anywhere near enough about the change or the system at that time to contribute to that decision. No, what disturbed me was the reason why they chose not to take that approach – that their time is more valuable than that of anyone else in the team, or the business for that matter.

In isolation that paints an unpleasant picture of the individual in question and that simply is not the case. However their choice of words, even if done without real consideration, does appear to reinforce the culture that surrounds them. In essence, with a feeling that the focus is on them and their performance, they are naturally going to behave in a way that favours optimising delivering their own features over that of the team at large.

Quality

Another example of favouring a local optimisation over the longer term goal of sustained delivery occurred when I was assigned my first piece of work. This was not so much a story as a couple of epics funded as an entire project (over 4 months solid work in the end). My instinct, after being shown roughly where in the code I needed to dig, was to ask where the existing tests were so that I knew where to add mine. The tech lead’s immediate response was “you won’t have time to write tests”.

My usual response to this statement is a jovially phrased “how will I know if it works then?” which often has the effect of opening a line of dialogue around the testing strategy and where it’s heading. Unfortunately this time around it only succeeded in the tech lead launching into a diatribe about how important delivery was, how much the business trusted us to deliver on time, blah, blah, blah, in fact almost everything that a good test suite enables!

Of course I still went ahead and implemented the entire project TDD-style and easily delivered it on time because I knew the approach was sound and the investment was more than worth it. The subsequent enhancements and repaying of some technical debt also became trivial at that point and meant that anyone, not just me, could safely and quickly make changes to that area of code. It also showed how easy it was to add new tests to cover changes to the older parts of the component when required later.

In the end over 10% of unit tests of the entire system had been written by me during that project for a codebase of probably over 1/2 million lines of code. I also added a command line test harness and a regression testing “framework” [2] in that time too all in an effort to reduce the amount of hoops you needed to go through to diagnose and safely fix any edge cases that showed up later. None of this was rocket science or in the least bit onerous.

Knowledge

I would consider a lack of supporting documentation one further local optimisation too. When only a select few have the knowledge to help support a system you have to continually rely on their help to nurse it through the bad times. This is especially true when the system has enough quirks that the cost of taking the wrong action is quite high (in terms of additional noise). If you need to remember a complex set of conditions and actions you’re going to get it wrong eventually without some form of checklist to work from. Relying on tribal knowledge is a great form of optimisation until core members of the team leave and you unearth the gaping holes in the team’s knowledge.

Better yet, design away the problems entirely, but that’s a different can of worms…

Project Before Product

I believe this was another example of how “projects” are detrimental to the development of a complex system. With the team funded by various projects and those projects being used as a very clear division on the task board through swim lanes [3] it killed the desire to swarm on anything but a production incident because you felt beholden to your specific stakeholders.

For example there were a number of conversations about fixing issues with the system that were slowing down delivery through unreliability that ended with “but who’s going to pay for that?” Although improvements were made they had to be so small as to not really affect the delivery of the project work. Hence the only real choice was to find easier ways to treat the symptoms rather than cure the disease.

Victims of Circumstance

Whenever I bump into this kind of culture my gut instinct is not to assume they are “incompetent” people, on the contrary, they’re clearly intelligent so I’ll assume they are shaped by their environment. Of course we all have our differences, that’s what makes diversity so useful, but we have to remember to stop once in a while and reflect on what we’re doing and question whether it’s still the right approach to take. What works for building Fizz Buzz does not work for a real-time, distributed calculation engine. And even if that approach did work once upon a time the world keeps moving on and so now we might be able to do even better.

 

[1] Pairing was only something you did when you’d already been stuck for some time, and when the mistake was found you went your separate ways again.

[2] I say “framework” because it was really just leveraging a classic technique: a command line tool reading CSV format data which fired requests into a server, the results of which are then diff’d against a known set of results (Golden Master Testing).

[3] The stand-up was originally run in project order, lead by the PM. Unsurprisingly those not involved in the other projects were rarely engaged in the meeting unless it was their turn to speak.

Thursday, 12 July 2018

Optimistic SQL

One of the benefits of learning other programming languages is the way it teaches you about other paradigms and idioms. This is the premise behind the “Seven Xxx in Seven Weeks” range of books. Although I have the database one on my bookshelf I’ve only ever skimmed it as at the time I bought it I suddenly found myself leaving the world of the classic RDBMS behind and working with other types of DB for real; most notably the document-oriented kind.

Although some of these products like MongoDB and Couchbase have come a long way from their early beginnings as highly available key-value stores they often still lack the full-on transaction support of the old stalwarts like SQL Server and PostgreSQL. Coupled with a high-availability service you have to think differently about how you react to concurrency conflicts as explicit locking is almost certainly never the answer [1].

The impetus for this post was going back into the world of SQL databases and being slightly bemused by a stored procedure that appeared to implement an “upsert” (an UPDATE or INSERT depending on whether the row already exists) as I realised it wasn’t how I’d approach it these days.

The Existing SQL Approach

Initially I was somewhat flummoxed why it was even written the way it was as there appeared to be no concurrency issues in play at all, it was a single service doing the writing, but I later discovered that an accident of the implementation meant there were two writers internally competing and they chose to resolve this in the database rather than remove the root source of concurrency in the service.

The upsert was basically written like this:

  • Try SELECTing the existing row.
  • If it exists, UPDATE it.
  • If it doesn’t exist, INSERT it.

In the service code there were a number of comments describing why the transaction level was being bumped up to “serializable” – it was effectively to deal with the concurrency they had introduced within the code by creating two competing writers. On top of that the initial SELECT statement in the upsert applied a HOLDLOCK which also effectively makes the transaction serializable because it puts a range lock on the row’s key (even if that key doesn’t exist yet).

The Document DB Approach

The last few years away from the relational world meant that I was used to dealing with these kinds of conflicts at a slightly lower level. Also, dealing with document updates in the service rather than writing them as SQL mean that updates were done in a server-side loop rather than pushing the concurrency issue down into the database, hence it would look more like this:

  • Try selecting the document.
  • If it exists, update it and try writing it back.
  • If it doesn’t exist, try creating it.
  • If any write fails start over from the beginning.

Due to the lack of transactions and locking, write conflicts are commonly detected by using a version number attribute that gets used in the update predicate [2]. (A write failure, via a “document not found” error, means the predicate failed to match the specific document and version and therefore a conflict has occurred.)

Another SQL Approach

So what does all this have to do with upserts in SQL?

Well, what I found interesting was that my gut reaction was to question why there is the initial select there as I would have written it as:

  • Try to UPDATE the row.
  • If no rows were updated, then INSERT it.

This particular order makes an assumption that updates are more prevalent than inserts and as a I rule I’d say that checking @@ROWCOUNT to see if anything was written is far less ugly than adding a TRY…CATCH block in T-SQL and attempting to verify that the insert failed due to a primary key violation.

That all seemed fairly obvious but I had forgotten that with the document DB approach you tend to expect, and handle, write failures as part of handling concurrency, but in this case if two connections both attempted the insert concurrently it’s theoretically possible that they could both fail the UPDATE step and then one of the INSERTs would succeed and the other would fail resulting in a primary key violation. However the code in the service was not written to detect this and retry the operation (as you would with a document DB) which is why the initial SELECT is there – to lock the “unwritten row” up front which ensures that another transaction is blocked until the row is then inserted or updated. This way no client logic needs to handle the concurrency problem.

However I believe we can still achieve the same effect by adding the same HOLDLOCK hint to our initial UPDATE so that if the row does not exist other writers will be blocked by the range lock until the subsequent INSERT goes through. Hence the initial SELECT is, I believe, redundant.

The MERGE Approach

At this point I remembered that way back in the past SQL Server introduced the MERGE operation which effectively allows you to write an upsert with a single statement as you factor both the hit and miss logic into different branches of the statement. This caused me to go looking to see what the start of the art in upsert techniques were, possibly with performance comparisons to see how much faster this must be (given that SQL Server clearly has the potential to optimise the query plan as it better knows our intent).

I started digging and was somewhat surprised when I came across the page “Performance of the SQL MERGE vs. INSERT/UPDATE”. I was expecting to have my hypothesis validated but discovered that the answer was far from clear cut. Naturally I then googled “SQL Server upsert performance” to see what else had been written on the subject and I discovered this wasn’t an anomaly so much as a misunderstanding about what problem the MERGE statement is really intended to solve.

You should of course never take performance improvements at face value but “measure, measure, measure” yourself. I wasn’t avoiding doing that, I was looking to see if there might be any pitfalls I needed to be wary of when benchmarking the approach.

At this point I haven’t gone any further with this as it’s more of a personal investigation (there is no actual performance issue to solve) but it just goes to show that writing SQL is as much an art as it’s always been.

 

[1] Some document databases, such as Couchbase, do support locking of documents, but there are heavy restrictions so you tend to find another way.

[2] In the particular example I was looking at no version number was needed in the SQL predicate because the data had a total ordering independent of the write order (it was tracking the minimum and maximum of a value over the day).

Wednesday, 11 July 2018

Support-Friendly Tooling

One of the techniques I briefly mentioned in my last post “Treat All Test Environments Like Production” was how constraining the test environments by adhering to the Principle of Least Privilege drove us to add diagnostic specific features to our services and tools.

In some cases that might be as simple as exposing some existing functionality through an extra command line verb or service endpoint. For example a common technique these days is to add a “version” verb or “–-version” switch to allow you to check which build of a particular tool or service you have deployed [1].

As Bertrand Meyer suggests in his notion of Command/Query Separation (CQS) any behaviour which is a query in nature should have no side-effects and therefore could also be freely available to use for diagnostic purposes – security and performance issues notwithstanding. Naturally these queries would be over-and-above any queries you might run directly against your various data stores, i.e. databases, file-system, etc. using the vendors own lower-level tools.

Where it gets a little more tricky is on the “command” side as we might need to investigate the operation but without disturbing the current state of the system. In an ideal world it should be possible to execute them against a part of the system reserved for such eventualities, e.g. a special customer or schema that looks and acts like a real one but is owned by the team and therefore its side-effects are invisible to any real users. (This is one of the techniques that falls under the in-vogue term of “testing in production”.)

If the issue can be isolated to a particular component then it’s probably more effective to focus on that part of the system by replaying the operation whilst simultaneously redirecting the side-effects somewhere else (or avoiding them altogether) so that the investigation can be safely repeated. One technique here is to host the component in another type of process, such as a GUI or command line tool and provide a variety of widgets or switches to control the input and output locations. Alternatively you could use the Null Object pattern to send the side-effects into oblivion.

In its most simplest form it might be a case of adding a “--ReadOnly” switch that disables all attempts to write to back-end stores (but leaves logging intact if that won’t interfere). This would give you the chance to safely debug the process locally using production inputs. As an aside this idea has been formalised in the PowerShell world via the “-WhatIf” switch which allows you to run a script whilst disabling (where supported) the write actions of any cmdlets.

If the operation requires some form of bulk processing where there is likely to be far too much output for stdout or because you need a little more structure to the data then you can add multiple switches instead, such as the folder to write to and perhaps even a different format to use which is easier to analyse with the usual UNIX command line tools. If implementing a whole different persistence mechanism for support is considered excessive [2] you could just allow, say, an alternative database connection string to be provided for the writing side and point to a local instance.

Earlier I mentioned that the Principle of Least Privilege helped drive out the need for these customisations and that’s because restricting your access affects you in two ways. The first is that by not allowing you to make unintentional changes you cannot make the situation worse simply through your analysis. For example if you happened to be mistaken that a particular operation had no side-effects but it actually does now, then they would be blocked as a matter of security and an error reported. If done in the comfort of a test environment you now know what else you need to “mock out” to be able to execute the operation safely in future. And if the mocking feature somehow gets broken, your lack of privilege has always got your back. This is essentially just the principle of Defence in Depth applied for slightly different reasons.

The second benefit you get is a variation of yet another principle – Design for Testability. To support such features we need to be able to substitute alternative implementations for the real ones, which effectively means we need to “program to an interface, not an implementation”. Of course this will likely already be a by-product of any unit tests we write, but it’s good to know that it serves another purpose outside that use case.

What I’ve described might seem like a lot of work but you don’t have to go the whole hog and provide a separate host for the components and a variety of command-line switches to enable these behaviours, you could probably get away with just tweaking various configuration settings, which is the approach that initially drove my 2011 post “Testing Drives the Need for Flexible Configuration”. What has usually caused me to go the extra step though is the need to use these features more than just once in a blue moon, often to automate their use for longer term use. This is something I covered in much more detail very recently in “Libraries, Console Apps & GUIs”.

 

[1] Version information has been embedded in Windows binaries since the 3.x days back in the ‘90s but accessing it easily usually involved using the GUI shell (i.e. Explorer) unless the machine is remote and has limited access, e.g. the cloud. Luckily PowerShell provides an alternative route here and I’m sure there are plenty of third party command line tools as well.

[2] Do not underestimate how easy it is these days to serialise whole object graphs into JSON files and then process them with tools like JQ.

Friday, 6 July 2018

Treat All Test Environments Like Production

One of the policies I pushed for from the start when working on a greenfield system many years ago was the notion that we were going to treat all test environments (e.g. dev and UAT) like the production environment.

As you can probably imagine this was initially greeted with a heavy dose of scepticism. However all the complaints I could see against the idea were dysfunctional behaviours of the delivery process. All the little workarounds and hacks that were used to back-up their reasons for granting unfettered access to the environments seemed to be the result of poorly thought out design, inadequate localised testing or organisational problems. (See “Testing Drives the Need for Flexible Configuration” for how we addressed one of those concerns.)

To be clear, I am not suggesting that you should completely disable all access to the environment; on the contrary I believe that this is required even in production for those rare occasions when you just cannot piece together the problem from your monitoring and source code alone. No, what I was suggesting was that we employ the same speed bumps and privileges in our test environments that we would in production. And that went for the database too.

The underlying principle I was trying to enshrine here was that shared testing environments, by their very nature, should be treated with the utmost care to ensure a smooth delivery of change. In the past I have worked on systems where dev and test environments were a free-for-all. The result is that you waste so much time investigating issues that are orthogonal to your actual problem because someone messed with it for their own use and just left it in a broken state. (This is another example of the “Broken Windows” syndrome.)

A secondary point I was trying to make was that your test environments are also, by definition, your practice runs at getting things right. Many organisations have a lot of rigour around how they deploy to production but very little when it comes to the opportunities leading up to it. In essence your dev and test environments give you two chances to get things right before the final performance – if you’re not doing dress rehearsals beforehand how can you expect it to go right on the day? When production deployments go wrong we get fearful of them and then risk aversion kicks in meaning we do them less often and a downward spiral kicks in.

The outcome of this seemingly “draconian” approach to managing the development and test environments was that we also got to practice supporting the system in two other environments, and in a way that prepared us for what we needed to do when the fire was no longer just a drill. In particular we quickly learned what diagnostic tools we should already have on the box and, most importantly, what privileges we needed to perform certain actions. It also affected what custom tools we built and what extra features we added to the services and processes to allow safe use for analysis during support (e.g. a --ReadOnly switch).

The Principle of Least Privilege suggests that for our incident analysis we should only require read access to any resource, such as files, the database, OS logs, etc. If you know that you are protected from making accidental mistakes you can be more aggressive in your approach as you feel confident that the outcome of any mistake will not result in breaking the system any further [1][2]. Only at the point at which you need to make a change to the system configuration or data should the speed bumps kick in and you elevate yourself temporarily, make the change and immediately drop back to mere mortal status again.

The database was an area in particular where we had all been bitten before by support issues made worse through the execution of ad-hoc SQL passed around by email or pasted in off the wiki. Instead we added a new schema (i.e. namespace) specifically for admin and support stored procedures that were developed properly, i.e. they were written test-first. (See “You Write Your SQL Unit Tests in SQL” for more on how and why we did it this way.) This meant applying certain kinds of workarounds were easier to administer because they were essentially part of the production codebase, not just some afterthought that nobody maintained.

On the design front this also started to have an interesting effect as we found ourselves wanting to leverage our production service code in new ways to ensure that we avoided violating invariants by hosting the underlying service components inside new containers, i.e. command line tools or making them scriptable. (See “Building Systems as Toolkits”.)

The Interface Segregation Principle is your friend here as it pushes you towards having separate interfaces for reading and writing making it clearer which components you can direct towards a production service if you’re trying to reproduce an issue locally. For example our calculation engine support tool allowed you to point any “readers” towards real service endpoints whilst redirecting the the writers to /dev/null (i.e. using the Null Object pattern) or to some simple in-memory implementation (think Dictionary) to pass data from one internal task to the next.

I find it somewhat annoying that we went to a lot of effort to give ourselves the best chance of designing and building a supportable system that also provided traceability only for the infrastructure team to disallow our request for personal per-environment support accounts, saying instead that we needed to share a single one! Even getting them to give us a separate account for dev, UAT and production was hard work. It sometimes feel like the people who complain most about a lack of transparency and rigour are the same ones that deny you access to exactly that.

I know there were times when it felt as though we could drop our guard in dev or UAT “just this once” but I don’t remember us ever doing that. Instead we always used it as an opportunity to learn more about what the real need was and how it could become a bona fide feature rather than just a hack.

 

[1] That’s not entirely true. A BA once concocted a SQL query during support that ended up “bug checking” SQL Server and brought the entire system to a grinding halt. They then did it again by accident after it was restarted :o).

[2] A second example was where someone left the Sysinternals DebugView tool running overnight on a server whereupon it filled up the log window and locked up a service due to the way OutputDebugString works under the covers.