Wednesday, 12 December 2018

“Hello World” Stories

I’ve always tried really hard to fight against “technical stories”. These are supposedly user stories but which are really framed as a solution to a problem and really just technical tasks. In “Turning Technical Tasks Into User Stories” I looked at how it’s often possible to elevate these from an obvious solution to a problem back up to a problem which needs to be solved. At this point you may discover there are other, hopefully cheaper, solutions to the problem which have been missed in the original analysis either because things have changed or different people are doing the thinking.

On the flip-side there are occasionally times where, after having looked at a few related stories, it’s apparent that they all require the same underlying mechanism to work. One common solution to this is to bulk up the first story with the technical work and let the rest flow through as normal. This way you have no technical work on your backlog per-se as it’s all hidden in the stories.

Transparency

What I don’t like about this approach is that one story arbitrarily gets hit with a load of extra work, which, if you’re using historical data to stick a finger in the air for estimation of similar work later, skews the average somewhat. It also means that from a visibility perspective one story takes longer while the mechanism is being built.

One way I’ve found to address this has been to pull out the bare bones of the technical work into a “Hello, World!” story [1]. This story is framed around building the skeleton of the mechanism that will be used to drive the implementation of the subsequent features. The aim is keep the scope minimal enough that we avoid speculating while still delivering something which stands on its own two feet and remains clearly visible on the board.

Value Proposition

While the value to the end-user is in the eventual feature, the value in the mechanism is proving to the development team that the basic approach seems sound. With the skeleton built, the idiosyncrasies around each individual feature can then be dealt with appropriately at the right time and accounted for in the usual way.

To be clear this is not about doing a spike or building a prototype, although that may have happened earlier to gain the knowledge needed to undertake this piece of work. No, here we’re talking about building the bare bones of a real mechanism along with the most basic feature possible.

The reason I’ve called these “Hello World” Stories is probably self-evident, it alludes to the classic program many have chosen as their first – to write “Hello, World!” to the console. In this context the name is intended to conjure up simplicity and remind us that what we’re doing is delivering the minimum required to make the platform viable. We probably won’t literally write “Hello, World!” to the console, but it may a log message instead that we can then observe and monitor, or be a message on a queue that we can see discarded. Essentially whatever we can do to make its effects observable without wasting any real effort or leaving it partially complete.

Based on the classic INVEST acronym we should strive to make every unit of work: Independent, Negotiable, Valuable, Estimable, Small and Testable. By splitting it out from one of the arbitrary features it becomes more independent, negotiable, estimable and small which can be useful should short-term priorities change. And by extending the scope from a pure mechanism just a little bit further to the most trivial feature possible we make it more testable from a technical perspective, even if not from a product viewpoint. Most importantly, however, is it valuable in its own right? I think sometimes splitting the mechanism out gives value by making the I,N,E,S and T more tangible. In particular breaking work down into smaller deliverable units is often the most valuable practice even if occasionally the end-user has nothing initially to show for it.

Ultimately, I guess, I can’t ever remember anyone complaining they had broken their work down into pieces that were so small they were too visible.

 

[1] I’m sure there is an argument about this not being a “story” per-se but just a “task”. However I prefer to call it a story because our “Hello, World!” realization should have a grounding in the real world, even if it is more abstract than what the end-user will eventually receive.

[2] There is an assumption here that we’ve already decided we cannot or do not want to solve the dependent features in different ways, probably because it would be far more costly (in the long run) than briefly delaying them by building a common pillar.

Friday, 7 December 2018

Overthinking is not Overengineering

As the pendulum swings ever closer towards being leaner and focusing on simplicity I grow more concerned about how this is beginning to affect software architecture. By breaking our work down into ever smaller chunks and then focusing on delivering the next most valuable thing, how much of what is further down the pipeline is being factored into the design decisions we make today?

Wasteful Thinking

Part of the ideas around being leaner is an attempt to reduce waste caused by speculative requirements which has led many a project in the past into a state of “analysis paralysis” where they can’t decide what to build because the goalposts keep moving. By focusing on delivering something simpler much sooner we begin to receive some return on our investment earlier and also shape the future based on practical feedback from today, rather than trying to guess what we need.

When we’re building those simpler features that sit nicely upon our existing foundations we have much less need to worry about the cost of rework from getting it wrong as it’s unlikely to be expensive. But as we move from independent features to those which are based around, say, a new “concept” or “pillar” we should spend a little more time looking further down the backlog to see how any design choices we make might play out later.

Thinking to Excess

The term “overthinking” implies that we are doing more thinking than is actually necessary; trying to fit everyone’s requirements in and getting bogged down in analysis is definitely an undesirable outcome of spending too much time thinking about a problem. As a consequence we are starting to think less and less up-front about the problems we solve to try and ensure that we only solve the problem we actually have and not the problems we think we’ll have in the future. Solving those problems that we are only speculating about can lead to overengineering if they never manage to materialise or could have been solved more simply when the facts where eventually known.

But how much thinking is “overthinking”? If I have a feature to develop and only spend as much effort thinking as I need to solve that problem then, by definition, any more thinking than that is “overthinking it”. But not thinking about the wider picture is exactly what leads to the kinds of architecture & design problems that begin to hamper us later in the product’s lifetime, and later on might not be measured in years but even in days or weeks if we are looking to build a set of related features that all sit on top of a new concept or pillar.

The Horizon

Hence, it feels to me that some amount of overthinking is necessary to ensure that we don’t prematurely pessimise our solution and paint ourselves into a corner. We should factor work further down the backlog into our thoughts to help us see the bigger picture and work out how we can shape our decisions today to ensure it biases our thinking towards our anticipated future rather than an arbitrary one.

Acting on our impulses prematurely can lead to overengineering if we implement what’s in our thoughts without having a fairly solid backlog to draw on, and overengineering is wasteful. In contrast a small amount of overthinking – thought experiments – are relatively cheap and can go towards helping to maintain the integrity of the system’s architecture.

One has to be careful quoting old adages like “a stich in time saves nine” or “an ounce of prevention is worth a pound of cure” because they can send the wrong message and lead us back to where we were before – stuck in The Analysis Phase [1]. That said I want us to avoid “throwing the baby out with the bathwater” and forget exactly how much thinking is required to achieve sustained delivery in the longer term.

 

[1] The one phrase I always want to mean this is think globally, act locally” because it sounds like it promotes big picture thinking while only implementing what we need today, but that’s probably stretching it too far.

Monday, 12 November 2018

Feeling Isolated

By and large I think I’ve been fairly lucky with my time as a contract programmer. Virtually all the teams I’ve worked in and systems I’ve worked on have been pretty decent. None of them are going to change the world but they’ve been enjoyable, which is probably why I’ve ended up working on them for a decent length of time [1].

I can only say “virtually all” because one contract sadly fell way short of the mark. Although I was technically part of a team it only really felt that way from a managerial perspective, even though we shared a codebase. I felt somewhat isolated both physically and mentally. Aside from the morning stand-up I could easily have gone the rest of the day without speaking to my teammates if I had chosen to do so.

Physical Isolation

I started the contract on a separate floor from the rest of my team with a couple of other recent joiners [2]. We were the only people on that floor with the air conditioning on full blast so we had to wear our coats in the afternoon to stay warm. None of the rest of my team had an office pass that could access the floor either, should they want to talk face-to-face while getting us up to speed.

Even when they moved us onto the same floor a month later we were still on the opposite side of the room. In the next desk shuffle I got to swap colleagues although they were working on an entirely separate area of the system with a totally different bunch of people so we had little need to collaborate per-se, only to make small talk. Also the two desks next to me only seemed to be used for a game of Tower of Hanoi by the office movers given how the occupants came and went.

Even my “customer”, at least, the one I knew about, because they were paying for the project, was situated in a different country and spoke a different language. Although their English was way better than any knowledge I have of a second language I quickly discovered why most communication was via email or IM instead of vocally.

Project Isolation

Being an enterprise scale organisation the work was all about projects, and who was sponsoring how many “resources”. Nowhere was this more apparent than the Scrum Board with its project-oriented swim-lanes. Each swim-lane had the names of the team members assigned to that project, and as the stand-up proceeded it walked down the board a project at a time with each member of the sub-team providing an update.

It was fairly apparent right from the moment I started, just by reading the body language of the team members, that there was often little real interest in what the rest of the team was doing. Those that did, cut across projects to some degree because they tended to nurse the build system, deployments and monitoring. A couple of team members never attended our stand-up because they already attended a different one that encompassed their project.

To be fair some of the apathy at the stand-up was almost certainly down to its excessive length. And with little reason for attending except to provide a status update for the managers it’s no surprise those mostly on the periphery zoned out. Sometimes the only common goal of the team seemed to be to not break the system.

Code Isolation

During my short stint I effectively had one feature to work on. There were a couple of other minor tweaks to begin with but ultimately my project was one feature (nay, user story) and it took 5 months to deliver. That one feature involved making a change in an area of the codebase that nobody else knew except one of the tech leads who I soon discovered was leaving. In fact, taking away his days off after the announcement of his departure, I effectively had 3 days for any handover.

Not only were there no docs to work from there were no tests either. The only real knowledge about how any of the service was expected to behave had left firmly inside the head of the author. This pretty much just left doing a spot of software archaeology with the VCS in the hope that the commit messages might contain some extra clues. Many features had been tracked in a feature tracking tool but there were not enough licenses to go round so I had to hassle a teammate to look things up. Even then it often wasn’t worth it as there were no useful details; it felt like the ticket was just there to “tick a box”.

The code relied heavily on the caller “doing the right thing” so any understanding only made sense if you already knew what the caller was supposed to do, and that relied heavily on knowledge of the problem domain and the organisation’s other systems. (At the interview I made it perfectly clear that I still knew little about the problem domain, despite the many years I have worked in it [3].)

Methodology Isolation

Ever since I had my epiphany [4] around testing all those years ago I have become a firm believer in TDD and automated testing as the preferred approach to the sustainable delivery of quality software. Being told early in the project that “you won’t have time to write tests”, despite being asked in the interview about what your approach is, did not bode well.

It soon became apparent that the previous approach had been to rush something out and rely on manual, end-to-end testing and the customer doing things “right”. Validation was almost entirely left to the underlying maths library and so bizarre errors manifested and needed investigating by the developers due to a lack of basic error handling and reporting [5].

With no way of knowing if I had broken anything, because I didn’t know for sure what anything was supposed to do, my only recourse was to write new code with tests and then refactor later when someone (potentially me) could be sure that it was safe to do so. For existing code that I had to change or understand I would write a barrage of tests first to try and ensure I didn’t accidentally break anything. In some cases it was hard to know what was “by design” and what was “by accident”.

Clearly not everyone took this approach, as you can see in “It Compiles, Ship It!”. My pessimism paid off though once the edge cases and little extras started appearing as I could turn around a fix or improvement (safely) in minutes due to my suite of automated unit and regression tests.

Environment Isolation

Sadly, despite my ability to push through changes quickly into the integration test environment, it still took weeks for them to actually appear in the production environment. When my first task, a handful of lines of boilerplate code, took 6 weeks to make it into production I assumed continuous delivery was not something they cared about.

On the contrary, for one aspect of the business, releases were very frequent. It was just that I was on the other side and due to some (IMHO) poor architecture and deployment decisions my part of the distributed system was tightly-coupled to another (major) system’s release cycle.

While it might seem great having my own integration test environment to play with, I ran into issues no one else knew about and I had no idea who was really using it and for what. Once again that information pretty much departed with the author.

Parting Thoughts

On reflection I have to look at my own behaviour first and ask myself whether I was at least partly responsible for feeling left out. Once we moved onto the same floor it was definitely easier to wander over and ask people questions, which I did. However when the response is “well I worked all this out by myself originally” and “that’s more than anyone ever gave me” I think it’s not entirely unfair to assume that knowledge sharing isn’t high on some people’s agenda.

I believe I was as welcoming as I normally am and was happy to help out where possible, given the limited knowledge I had acquired. I guess that culturally there was such a large drive for autonomy that the idea of just chatting about stuff to see what improvements in the system or process would be beneficial just wasn’t on the cards. A couple of times what should have been a constructive comment or question definitely came out of me more as a snide remark which is never a good sign. I’ve been trying hard to be more aware of any sarcasm, which unfortunately comes all too easily to me, and so not add to any unnecessary negativity but I know I failed a few times.

Ultimately I think it says a lot about an organisation that rejects your approach because “they are not a start-up” when your application of that approach has only ever been in large enterprises and none of them has ever had an issue with it before. On the contrary they have often been grateful for the insights and improvements that I’ve brought.

Maybe if I was a lot younger I’d not have known any better and stuck it out a bit more but these days I know it’s just not worth the effort. I feel comfortable that I left the place in a better state than I joined it by documenting various things and writing tests for the code I wrote. After a slightly rocky start my customer seemed pretty pleased with everything I delivered, which I guess is largely what matters most.

As ever, my main regret is leaving behind some people that I wish I could have gotten to know better. Maybe I will, in another life, one where the benefits of collaboration are more positively encouraged.

 

[1] Mostly my tenure has been measured in years, not months.

[2] Only one of which was left when I called it a day – the other two barely lasted a month or so.

[3] See “Problem Domain Expert or Technical Expert or Even Both” for more on this recurring theme.

[4] See “My [Unit] Testing Epiphany” and my more recent ACCU / Agile on the Beach talk “A Test of Strength” for what lead to my enlightenment.

[5] Poor error messages is a popular topic of mine, see “Terse Exception Messages”. Also “The Perils of DateTime.Parse()” covers one specific example.

Thursday, 8 November 2018

Proxy Weirdness – Socket Closed on 404

While investigating the issue that led to the discovery of the strange default behaviour of the .Net HttpClient class which I wrote up in “Surprising Defaults – HttpClient ExpectContinue” we also unearthed some other weirdness in a web proxy that sat between our on-premise adapter and our cloud hosted service.

Web proxies are something I’ve had cause to complain about before (see “The Curse of NTLM Based HTTP Proxies”) as they seem to interfere in unobvious ways and the people you need to consult with to resolve them are almost always out of reach [1]. In this particular instance nobody we spoke to in the company’s networks team knew anything about it and trying to identify if it’s your on-premise proxy and not something broken with any of the other intermediaries that sit between you and the endpoint is often hard to establish.

The Symptoms

Whilst trying to track down where the “Expect: 100-Continue” header was coming from, as we didn’t initially believe it was from our code, we ran a WireShark trace to see if we could capture the traffic from, and to, our box. What was weird in the short trace that we captured was that the socket looked like it kept closing after every request. Effectively we sent a PUT request, the response would come back, and immediately afterwards the socket would be closed (RST).

Naturally we put this on the yak stack. Sometime later when checking the number of connections to the TIBCO server I used the Sysinternals’ TCPView tool to see what the service was doing and again I noticed that sockets were being opened and closed repeatedly. As we had 8 threads concurrently processing the message queue it was easy to see 8 sockets open and close again as in TCPView they go green on creation and red briefly on termination.

At least, that appeared to be true for the HTTP requests which went out to the cloud, but not for the HTTP requests that went sideways to the internal authentication service. However they also had an endpoint hosted in the cloud which our cloud service used and we didn’t see that behaviour with them (i.e. cloud-to-cloud), or when we re-configured our on-premise service to use it either (i.e. on-premise-to-cloud). This suggested it was somehow related to our service, but how?

The HttpClient we were using for both sets of requests were the same [2] and so we were pretty sure that it wasn’t our fault, this time, although as the old saying goes “once bitten, twice shy”.

Naturally when it comes to working with HTTP one of the main diagnostic tools you reach for is CURL and so we replayed our requests via that to see if we could reproduce it with a different (i.e. non-.Net based) technology.

Phased Switchover

While the service we were writing was new, it was intended to replace an existing one and so part of the rollout plan was to phase it in slowly. This meant that all reads and writes would go to both versions of the service but only the one where any particular customer’s data resided would succeed. The consumers of the service would therefore get a 404 from us if the data hadn’t been migrated, which in the early stages of development applied to virtually every request.

A few experiments later to compare the behaviour for requests of migrated data versus unmigrated data and we had an answer. For some reason a proxy between our on-premise adapter and our web hosted service endpoint was injecting a “Connection: Close” header when a PUT or DELETE [3] request returned a 404. The HttpClient naturally honoured the response and duly closed the underlying socket.

However it did not have this behaviour for a GET or HEAD request that returned a 404 (I can’t remember about POST). Hence the reason we didn’t see this behaviour with the authentication service was because we only sent GETs, and anyway, they returned a 200 with a JSON error body instead of a 404 for invalid tokens [4].

Epilogue

I wish I could say that we tracked down the source of the behaviour and provide some closure but I can’t. The need for the on-premise adapter flip-flopped between being essential and merely a performance test aid, and then back again. The issue remained as a product backlog item so we wouldn’t forget it, but nothing more happened while I was there.

We informed the network team that we were opening and closing sockets like crazy, which these days with TLS is somewhat more expensive and therefore would generate extra load, but had to leave that with them along with an offer of help if they wanted to investigate it further, as much for our own sanity.

It’s problems like these which cause teams to deviate from established conventions because ultimately one is within their control while the other is outside it and the path of least resistance is nearly always seen as a winner from the business perspective.

 

[1] I’m sure they’re not hidden on purpose but unless you have a P1 incident it’s hard to get their attention as they’re too busy dealing with actual fires to worry about a bit of smoke elsewhere.

[2] The HttpClient should be treated as a Singleton and not disposed per request, which is a common mistake.

[3] See “PUT vs POST and Idempotency” for more about that particular choice.

[4] The effects of this style of API response on monitoring and how you need to refactor to make the true outcome visible are covered in my recent Overload article “Monitoring: Turning Noise into Signal”.

Tuesday, 30 October 2018

Always Reply Within Your SLA – Succeed or Abort

Way back in 2012 I wrote the blog post “Service Providers Are Interested In Your Timeouts Too” about how you can help service teams understand your intentions so that they can handle requests more efficiently. That was written at a time when I had been working for many years on internal systems where there were no real SLAs per-se, often just a “best efforts” approach with manual intervention required to “unblock” the system when the failures start occurring [1]. In contrast I have always strived to create self-healing systems as much as possible so that only truly remarkable events require any kind of human remediation.

In more recent years I’ve spent far more time working on web services where there is a much stronger notion of an SLA and therefore a much higher probability that if you fail to meet your SLA then the client will attempt to perform some kind of recovery rather than hang around and wait for the reply [2]. Hence what I wrote about wasting resources on dead requests in that earlier blog post have started to become more significant.

Deadlines

A consequence of this ideology is that I’ve started to become far more interested in the approach of always responding within the SLA even if that means aborting mid-request. Often an SLA is seen as an aspiration rather than any kind of hard deadline, something which we hope to achieve more often than not, where “more often” usually involves quoting some (arbitrary) number of “nines”. For those requests that fall outside this magical number all bets are off and you might get an answer in a useful timeframe or you might not. This kind of uncertainty has always bothered me as a client consumer.

Hence, I’ve started moving towards building services that always provide a reply within the SLA whether or not the request has been satisfied. Instead of tying up valuable resources in the hope that when the answer finally arrives the client still has a vested interest in it, I’d prefer to just abandon the request and let the client know the SLA would be violated if it had continued servicing it. In essence the request times-out server-side, where the time-out is the SLA.

What this means for the client is that they have a definitive reply (network issues notwithstanding) to their request within the time limit allowed. More importantly if they want to allow more time to handle the request than the SLA allows for then they need to tell the service that they’re willing to wait. Essentially this creates a priority system and allows the service to decide what to do with requests that are happy to hang around for a bit longer.

Mechanics

Implementation-wise what this mostly boils down to is ensuring that every non-trivial piece of work (think: database query, network call, disk read, etc.) must be made with a bounded call time, i.e. one where a timeout can be provided so that the caller always regains control in a timely fashion. Similarly we don’t start any work that we suspect we can’t finish in time either. This generally manifests as aborting on the first timeout which is usually given the entire SLA and therefore you’re never going to recover in time.

Internally the maximum timeout starts with the SLA and as each background query is sent it is timed and the timeout gets progressively shorter [3]. As the load increases and internal queries take longer the chances of a request aborting rises but at least the load on the upstream systems doesn’t keep rising too. Ultimately it’s just a classic negative feedback loop.

Limitations

Unfortunately what makes implementing this somewhat less than idea is that we still don’t really have cancellable requests in many frameworks and you’re never entirely sure what happens when the timeout triggers. If the underlying operation is abandoned, but has to complete anyway because it can’t be cancelled, you may not be much better off. The modern async-enhanced programming world is great for avoiding tying up threads in the happy path but once you start considering the failure modes it’s much harder to reason about and, more importantly, control what’s going to happen. Despite the fact that under the covers the world of I/O has practically always been asynchronous the higher layers still assume a synchronous model with syntactic sugar only helping to reinforce that perspective.

So far I don’t have nearly enough production-level data points to know if it’s an idea that is truly worth the effort to implement or not. Being able to reject work outright because you’ve already missed the SLA isn’t too onerous but does mean you need to tap into the processing pipeline early before the request is queued in the background to know when the internal clock has started ticking. What’s harder to determine is whether you really get any benefit out of the additional complexity needed to track your request’s progress and if aborting upstream requests creates a more or equally unstable service due to the way the timeouts leave their underlying requests dangling.

I still think it’s an approach worth pursuing but I wouldn’t be surprised to find The Morning Paper covering something from decades ago that shows it’s just a fools errand :o).

 

[1] See “Support-Friendly Tooling” for some other examples about how this can play out if reliability out-of-the-box is “assumed”.

[2] In one instance that would mean abandoning the request and potentially taking on some small financial risk on behalf of the customer.

[3] Naturally for parallel / scatter-gather I/O it’s the time of the longest concurrent request.

Tuesday, 2 October 2018

Technical Debt – Conscious Competence

Once upon a time the term Technical Debt seemed to have a very clear meaning but over the last few years that has been diluted to generally mean any crap code or process which is holding back delivery. I’m sure any scholars of Wittgenstein will be at pains to point out that “meaning is use” and therefore if everyone uses it this way who am I to argue?

For me the canonical source of information on the technical debt metaphor comes from the wiki of the person who coined the phrase in the first place – Ward Cunningham. The entry on Technical Debt there suggests to me that the choice to enter into debt is a wholly conscious one, not the unconscious acts of a less professional bunch of programmers.

By way of example I thought I’d take the opportunity to write up one of those occasions where I’ve been involved with taking debt on (in the original spirit of the term) and how we dealt with it, to show where the distinction lies.

The Bug

Soon after going live with v1.0 of a new calculation system in a large financial organisation we discovered that a number of key counterparties were missing from the daily report. The report generator was a late addition and there were various other issues around it’s development and testing which muddied the waters somewhat but suffice to say that this wasn’t delivered as cleanly as the core system was. (You might consider the more recent meaning of the term to apply here.)

More importantly what transpired was that due to various mergers in the company’s history a few counterparties had the same “unique” code in different back-end systems. This wasn’t just news to my team (we were all recent hires) but also to quite a few people in the business too. Due to only dealing with a limited set of “books” the codes were always unique to them in their context, but our new system cut right across them all.

The Root Cause

The generation of calculations was ultimately based around a Cartesian product of two counterparties, however given that most of those were pointless there was an optimisation which used another source of data to reduce that by more than an order of magnitude.

This optimisation should have been fairly simple but due to a need to initially use some existing manually managed counterparty data to ease the cutover (so regression testing should then reconcile exactly) it was somewhat more complicated than first envisaged.

Our system was designed to use the correct source of data eventually, but do a reverse lookup for the time being. It might sound simple but the lookup actually involved multiple lookups using combinations of keys that had to make assumptions about which legacy back-end system might hold the related data. The right person who could explain how we could do what we needed to do correctly also seemed elusive; there were many people with “heuristics”, but nobody who knew for sure.

In total there were ~100 counterparties out of a total of ~15,000 permutations that suffered from this problem. Unfortunately a handful of those 100 had a significant effect on the “bottom line” and therefore the usefulness of the system as a whole was in doubt at that point.

Entering Into Debt

Naturally once we unearthed this clanger we had to decide how to tackle it. After getting our heads around what this all meant and roughly where in the code the missing logic probably needed to go we had to make a decision – do we try and fix the underlying issue right away or try and put a workaround in place (assuming that’s even possible) to mitigate the problem, at least temporarily.

We were all very aware of going down the dark road of putting a tactical fix in place because we’d all seen where that can lead. We had made a concerted effort over the 12 months required to build the system to refactor relentlessly [1] and squash any bugs as soon as possible. This felt like a backwards step.

On the positive side by adopting a Design for Testability approach in most parts of the code we had extra switches on our processes [2] that allowed us to make per-counterparty requests, usually for diagnostic purposes. Hence the workaround took the form of sticking the list of missing counterparties in a simple text file, then using a command prompt FOR loop [3] to read the file and invoke the tool in “single counterparty mode”. Yes it was a little slow due the constant restarting of the process but it was easy to surgically insert into the workflow with the minimum of testing or risk.

Paying Back the Debt

With the hole plugged for now, and an easy mechanism in place for adding any other missing counterparties – update the text file – we were in a position to sort out the root problem without feeling under pressure to get the system working correctly 100%, ASAP.

As you can probably imagine the real solution wasn’t easy, not least because it was one of a few areas of SQL code that didn’t have any unit tests and was a tangled web of tables and views which had grown organically in an attempt to graft the old and the new worlds together [4].

What Did it Cost?

If we assume that the gung-ho approach would have been to just jump in and start fixing the real code, then what did we lose by not doing that? It’s possible that the final fix was simple and a little more investigation may have lead to that solution instead.

In contrast, the risk is that we end up in one of those “have we fixed it or not” scenarios where we spend an indeterminate amount of time being “real close” to getting towards “done”. The old adage about the last 10% also taking 90% of the time springs immediately to mind. Instead we were almost positive we had a simple workaround that could be deployed and get the system running correctly enough in an estimable amount of time. I believe there is a lot of value in having that degree of confidence.

What I think was critical was being able to remove the pressure on finding the right solution as this gave us time to really consider what needed to be done. Any fix done under pressure is not going to be given the attention to detail that it probably deserves. You then run the risk of making the system worse and having an even deeper hole to dig yourself out of.

The customer does not care about strategic versus tactical decisions per-se, they just want the thing to work. We cared about the solution because we knew it would be a burden in the short term as everyone had to remember about the bit “grafted on the side”. The general trust the team had built up by keeping quality at the forefront meant that the business would be more willing to trust us to reconcile the problem appropriately when the time came.

Use Language With Care

I really hope the term Technical Debt doesn’t continue to get watered down even further as it’s a powerful concept which is incredibly useful in the right hands. We already have far too many words for “alternate implementation” that are pejoratives carrying an air of unprofessionalness about them. I would like this one to remain in the hands of the professionals so they can continue to have “grown up” conversations with their customers about when it’s appropriate to consider taking shortcuts for a short term business need without them rolling their eyes, yet again.

 

[1] See “Relentless Refactoring” for more thoughts around this (unfortunately) contentious topic.

[2] “From Test Harness To Support Tool”, “Building Systems as Toolkits” and “In The Toolbox - Home-Grown Tools” all look at the non-functional side of tooling.

[3] A batch file just wouldn’t be complete without a for loop, see “Every Solution Starts With ‘FOR /F’”.

[4] This one feature seemed destined to plague us forever, see “So Many Wrongs, But No Rights” for another tale of woe.

Friday, 28 September 2018

Beware the Easy Fix

Whenever you get a bug report be sure you can reproduce the problem before you start and check you’ve fixed the bug when you make your change.

This advice might seem blindly obvious and you’re probably wondering who on earth would try and fix a bug without reproducing the problem first and then without testing the fix works afterwards [1]. I wondered that too but I was recently involved in a bug report that seemed so cut-and-dried I thought I might have to reconsider my own obsessive desire to stick rigidly to the process. I was of course mistaken…

The Bug

A bug showed up in a new message queue processing service that meant when the message queue broker was down for longer than a minute or so the consumer lost its connection and never reconnected in the background. In turn this meant the queue would slowly back-up – the process was still alive and kicking, it just wasn’t servicing the queue.

This bug report came by way of a production incident and an experienced colleague had triaged the problem so the ticket came into the team with some useful details attached. In the ticket the final log message from the service before it went dark told us that the dispatcher thread had shut down due to the failure to reconnect. The ticket also pointed us to the bit of code where the dispatcher thread was configured.

Looking at the service code along with a quick read of the third party library documentation made it seem pretty obvious that the recovery options configured for the dispatcher were insufficient. It was set-up with only 3 short retries and a circuit breaker for good measure. As a result of the incident some monitoring had been added to the queue so there was no reason why we couldn’t just enable infinite connection retries [2] and effectively disable the circuit breaker. Fixing the dispatcher code was a doddle as the message consumer library is well designed and has good documentation.

It almost seemed too easy…

The Shortest Path

The problem with bugs in infrastructure code like this is that they almost certainly don’t have any automated test coverage because writing them is really hard [3]. In fact testing this kind of issue can be arduous even when done manually as you need to control the middleware which might be outside your control or just something which sits in the background ticking away and therefore is almost invisible unless you wrote the original code or have had to fix it before. Throw in the fact that the bug wasn’t a showstopper and it’s easy to see how you could apply Sir Tony Hoare’s principle about code “being so simple there are no obvious bugs” and just push the change out based on the ability to compile the code and the fact that it doesn’t make matters any worse (you can show you’ve not broken the ability to connect to the queue).

Of course when the problem shows up in production again you’ll know that you never really fixed the problem and you’ll have to go around the loop once more, but do what you should have done first time around as the second outage will no doubt have made a few more people annoyed.

Another Bug

Unsurprisingly the simple code change suggested by the ticket actually had no effect at all when we came to test it, and this sudden realisation that we didn’t really understand what was going on was the impetus needed to take a step back and start again from the beginning.

Whilst performing a quick disconnection test (by bouncing the middleware) we noticed that the queue was behaving weirdly and not backing up like it said in the bug report. Another rabbit hole later [4] and we discover that the queue was not set-up to be durable, which in itself turned out to be another bug.

Eventually we find a way to reproduce the problem and in the process we learn a bit more about how the middleware and message consumer library both work. However we still don’t understand why the new dispatcher configuration does not appear to be working. Luckily the library is open source and so we can debug the issue ourselves and see what is going on under the hood.

The Real Fix

Who would have guessed that internally the message consumer library had another retry and circuit breaker policy that was used to control the (re)connection attempts to the message queue broker. Unlike the dispatcher thread error recovery policy, which was configured explicitly, the message queue connection policy was controlled by a couple of defaulted arguments on the connection configuration object constructor [5].

Sadly we couldn’t be explicit and use the “wait and retry forever” policy that was available on the dispatcher so instead we had to settle for configurating the number of connection attempts to int.MaxValue.

Problem Solved

Naturally it was far simpler to test the fix because we eventually put the effort into working out how to reproduce the problem in the first place. This can be quite significant from a status reporting perspective because it means you are less likely to be over optimistic about your progress. If you’re struggling to reproduce the problem then you’re going to struggle to prove that you’ve fixed it. If you mistakenly believe that the fix is simple and you then feel under pressure to get the testing done at the end it’s harder to convince yourself to do what needs to be done rather than settle for only potentially being right.

 

[1] This is somewhat disingenuous as there are times where this is not possible, but that’s unusual in the world of mainstream software development.

[2] Without the alert on the queue size we would need to find another way to signal when processing has dropped off. For example the circuit breaker should have triggered some other alert as connection failures are to be expected, but only for a limited time before escalation needs to occur.

[3] See “Automated Integration Testing with TIBCO” for an example of how I’ve done this in the past with a TIBCO message queue.

[4] Yes, the middleware was RabbitMQ but no pun was intended, for once.

[5] I’m not suggesting the library, which is provided for free out of kindness, is at fault. On the contrary the documentation was excellent, as was the support we received on Gitter. I need to help fix this, somehow.