As we approach Christmas, that wonderful time of year for giving and receiving, we also approach that annual corporate ritual that is the Change Freeze. It’s also a time when I get to remember what is probably the worst software release I’ve had the misfortune to be involved in...
The Baroque Import Process
Like all successful systems it had been grown organically from a walking skeleton. Most of the codebase had decent test coverage and we had a fairly good build pipeline going out to a development instance that ran as close to production as possible. Most functional problems showed up in development and the UAT environment highlighted anything else non-environmental. We even had unit tests for most of our SQL code!
However, we also had some big chunks of technical debt [1] too. One particular stored procedure was now a behemoth (a many-hundred line monster) and had been developed without any test coverage. By the time this was recognised it had already grown massively by cut-and-paste and naturally nobody wanted to go back and write tests for it, and there was no drive from management to tackle this particular piece of debt either [2]. The other related data was not handled by this particular procedure but another complex maze of procedures and views. There was some “token” test coverage here, but not around what will transpire below.
These procedures were used to handle the versioning of an upstream feed. Very little data changed day-to-day so we used a versioning strategy that involved comparing yesterday’s and today’s data and only creating new records for updated entities. Another table then tied together which version should be used for which business date.
The Replacement Process
These procedures had managed to allow us to go live on time and had lived for a year in production, but eventually parts of the manual process were going to be automated upstream and so finally we got the go ahead to replace the major parts of this Heath Robinson process with something that lived outside the database and could provide us with better validation and diagnostics. This was developed in the lead up to Christmas and was looking good to be scheduled for release just before the change freeze kicked in.
The Change Freeze
This company, like many others, tries to mitigate the risk of problems escalating when no one is around to fix them during the festive period by putting a halt on all non-critical changes. It doesn’t matter whether you have been delivering continuously for over 12 months without a single cock-up, you still can’t deploy any changes unless they are to directly fix a priority one production incident.
The Last Minute Data Update
With just a week or so to go before the change freeze the business decided they needed to tweak some key data so that it would be in for the year end runs. This data was “manually calibrated” by them and very rarely changed. So we restored UAT to match production, ran in the new data and waited for the results. Most of the results were fine but as we looked a little closer we discovered that the report contained some entities where the data change had appeared to take effect, but the calculation result had not changed too.
Given the number of problems that had already shown up in the reporting component it seemed entirely feasible that another one had crept in somehow, but digging deeper proved that wasn’t entirely the case. The reporting code was known to be dubious for other reasons, but it didn’t explain why this particular data change had not had an effect everywhere it was expected.
Development != UAT
We also applied the data change to the development environment where the new release had been chugging along quite nicely. But the following morning the development system disagreed with UAT too. It looked like there might be a bug in the replacement process as well. Given how close we were to going live with this larger change we quickly dived in to see what had broken and whether we could also fix this before the freeze kicked in.
The Bug Unearthed
It turned out the new code was correct, the problem was actually in the old code, the code currently in production. The bug was in the versioning process where it failed to detect a new entity version if this one piece of datum that the business had wanted to change was the only change in the entity’s data. As a consequence the calculations had been done using the old value. Oops.
The One-Line Fix
The fix was trivial. All we needed to do was add an extra predicate in one of the select statements to ensure that the comparison was done against the latest version and not every prior version:
AND thing.Version = thing.LatestVersion
We knew the fix was this simple, and that it would work without even testing it...
The Irony
How did we know? Because we’d seen it before. Nope, it was better than that, the person who made this mistake had seen it before and even raised a bug request for the previous problem some months earlier.
Now It Gets Ugly
Not only do we have a new bug, but the results of the test run also highlights a longstanding problem with the reporting code. This code, which was bashed out in isolation [3], pulled in data from all over the place. In particular, instead of reporting data from the dated snapshot tables which is what the calculations use, it pulls them from the staging tables where the input feed is kept.
It might sound like these dated snapshot tables are in the “sanitised data” schema, they’re not. Both the unprocessed input feed and the dated snapshot live in the same table in the staging schema. The column where the value should have gone was never populated. No, that’s not true, it was populated, and versioned correctly, but with a different value that was never used.
Consequently what was reported was the value provided by the business, but that value never made its way into the set used for number crunching, hence the disparity. Sadly two wrongs did not make it right in this case.
Now It Gets Political
In any “sane” project I’d hope we could just hold up our hands and admit we have a bug, mention that we already have a fix for it ready to go and then discuss how best to get it deployed and move on the next most important thing. If the result of the discussion was that we would have to wait, that they could live with the bug, then so be it; at least we are being transparent. But that’s probably why I’d never make a “good” manager.
Instead we tried to find ways to disguise the bug. We couldn’t deploy the fix because so many other calculations that have also been working with incorrect data would come to light. We couldn’t deploy our new release because the replacement component didn’t have the bug either. This left us with somehow tweaking other data so that it would force a version change to occur, e.g. adding an extra space in a text field that was unused. The final option was stalling until the change freeze ended and we could hopefully bury bad news by folding the supposedly technical only release [4] in with a data release and the bug would get lost in the noise.
The Dark Cloud Arrives
With the change freeze over (6 weeks later) we had quite a backlog of changes. The frustration of the change freeze meant that the business were happy to lump together our new release alongside some other 3rd party and data changes. Our opportunity to bury bad news had arrived and the release was pushed out, the numbers all moved about, people murmured that things moved more than expected, but it quickly went quiet again. Normality resumed.
Epilogue
I don’t think I’ll ever understand why we couldn’t just hold up our hand and admit we’d made a mistake. Compared to many other projects around us we were a beacon of success as we had originally delivered something on time and under budget; although it may not have had all the bells and whistles originally planned. In the 12 months after going into production we had delivered at regular intervals measured in weeks, not months, and had not once had an outage caused by something our team had done. In contrast we had to push back a number of times on the 3rd party components provided to us because they didn’t entirely do what was expected of them - we became their regression test suite! I would hope that kind of delivery record would have afforded us the right to mess up slightly once in a while, but perhaps not. Maybe what trust we had built up was actually worth nothing.
They say confession is good for the soul. This is mine.
[1] Technical Debt is really about shortcuts taken for expediency, not crap code. However the crap code was quickly identified and a decision was made to live with it, by which I mean no decision was made to do anything about it up front. Is that now a conscious decision which means it becomes technical debt?
[2] Whilst I agree in principle that refactoring should be a by-product of any story, sometimes the debt grows into something better served by an architectural refactoring.
[3] This one component has provided inspiration for a number of other blog posts, e.g. this one in particular “The Cost of Defensive Programming”.
[4] We tried to avoid mixing purely technical changes, e.g. architectural refactorings and upcoming features (toggled off), with changes that affected the calculation results. This was to ensure our regression testing had virtually no noise. Packaging a number-breaking change in isolation was also a defence mechanism that allowed us to squarely point the finger outside our team when it went wrong. And it did, on numerous occasions.
No comments:
Post a Comment