Wednesday 18 December 2019

Branching 0 – Git 1

My recent tirade against unnecessary branching – “Git is Not the Problem” – might have given the impression that I don’t appreciate the power that git provides. That’s not true and hopefully the following example highlights the appreciation I have for the power git provides but also why I dislike being put in that position in the first place.

The Branching Strategy

I was working in a small team with a handful of experienced developers making an old C++/ATL based GUI more accessible for users with disabilities. Given the codebase was very mature and maintenance was minimal, our remit only extended so far as making the minimal changes we needed to both the code and resource files. Hence this effectively meant no refactoring – a strictly surgical approach.

The set-up involved an integration branch per-project with us on one and the client’s team on another – master was reserved for releases. However, as they were using Stash for their repos they also wanted us to make use of its ability to create separate pull requests (PR) for every feature. This meant we needed to create independent branches for every single feature as we didn’t have permission to push directly to the integration branch even if we wanted to.

The Bottleneck

For those who haven’t had the pleasure of working with Visual Studio and C++/ATL on a native GUI with other people, there are certain files which tend to be a bottleneck, most notably resource.h. This file contains the mapping for the symbols (nay #defines) to the resource file IDs. Whenever you add a new resource, such as a localizable string, you add a new symbol and bump the two “next ID” counters at the bottom. This project ended up with us adding a lot of new resource strings for the various (localizable) annotations we used to make the various dialog controls more accessible [1].

Aside from the more obvious bottleneck this resource.h file creates, in terms of editing it in a team scenario, it also has one other undesirable effect – project rebuilds. Being a header file, and also one that has a habit of being used across most of the codebase (whether intentionally or not) if it changes then most of the codebase needs re-building. On a GUI of the size we were working on, using the development VMs we had been provided, this amounted to 45 minutes of thumb twiddling every time it changed. As an aside we couldn’t use the built-in Visual Studio editor either as the file had been edited by hand for so long that when it was saved by the editor you ended up with the diff from hell [2].

The Side-Effects

Consequently we ran into two big problems working on this codebase that were essentially linked to that one file. The first was that adding new resources meant updating the file in a way that was undoubtedly going to generate a merge conflict with every other branch because most tasks meant adding new resources. Even though we tried to coordinate ourselves by introducing padding into the file and artificially bumping the IDs we still ended up causing merge conflicts most of the time.

In hindsight we probably could have made this idea work if we added a huge amount of padding up front and reserved a large range of IDs but we knew there was another team adding GUI stuff on another branch and we expected to integrate with them more often than we did. (We had no real contact with them and the plethora of open branches made it difficult to see what code they were touching.)

The second issue was around the rebuilds. While you can git checkout –b <branch> to create your feature branch without touching resource.h again, the moment you git pull the integration branch and merge you’re going to have to take the hit [3]. Once your changes are integrated and you push your feature branch to the git server it does the integration branch merge for you and moves it forward.

Back on your own machine you want to re-sync by switching back to the integration branch, which I’d normally do with:

> git checkout <branch>
> git pull --ff-only

…except the first step restores the old resource.h before updating it again in the second step back to where you just were! Except now we’ve got another 45 minute rebuild on our hands [3].

Git to the Rescue

It had been some years since any of us had used Visual Studio on such a large GUI and therefore it took us a while to work out why the codebase always seemed to want rebuilding so much. Consequently I looked to the Internet to see if there was a way of going from my feature branch back to the integration branch (which should be identical from a working copy perspective) without any files being touched. It’s git, of course there was a way, and “Fast-forwarding a branch without checking it out” provided the answer [4]:

> git fetch origin <branch>:<branch>
> git checkout <branch>

The trick is to fetch the branch changes from upstream and point the local copy of that branch to its tip. Then, when you do checkout, only the branch metadata needs to change as the versions of the files are identical and nothing gets touched (assuming no other upstream changes have occurred in the meantime).

Discontinuous Integration

In a modern software development world where we strive to integrate as frequently as possible with our colleagues it’s issues like these that remind us what some of the barriers are for some teams. Visual C++ has been around a long time (since 1993) so this problem is not new. It is possible to break up a GUI project – it doesn’t need to have a monolithic resource file – but that requires time & effort to fix and needs to be done in a timely fashion to reap the rewards. In a product this old which is effectively on life-support this is never going to happen now.

As Gerry Weinberg once said “Things are the way they are because they got that way” which is little consolation when the clock is ticking and you’re watching the codebase compile, again.

 

[1] I hope to write up more on this later as the information around this whole area for native apps was pretty sparse and hugely diluted by the same information for web apps.

[2] Luckily it’s a fairly easy format but laying out controls by calculating every window rectangle is pretty tedious. We eventually took a hybrid approach for more complex dialogs where we used the resource editor first, saved our code snippet, reverted all changes, and then manually pasted our snippet back in thereby keeping the diff minimal.

[3] Yes, you can use touch to tweak the file’s timestamp but you need to be sure you can get away with that by working out what the effects might be.

[4] As with any “googling” knowing what the right terms are, to ask the right question, is the majority of the battle.

Monday 16 December 2019

Git is Not the Problem

Git comes in for a lot of stick for being a complicated tool that’s hard to learn, and they’re right, git is a complicated tool. But it’s a tool designed to solve a difficult problem – many disparate people collaborating on a single product in a totally decentralized fashion. However, many of us don’t need to work that way, so why are we using the tool in a way that makes our lives more difficult?

KISS

For my entire professional programming career, which now spans over 25 years, and my personal endeavours, I have used a version control tool (VCS) to manage the source code. In that time, for the most part, I have worked in a trunk-based development fashion [1]. That means all development goes on in one integration branch and the general philosophy for every commit is “always be ready to ship” [2]. As you might guess features toggles (in many different guises) play a significant part in achieving that.

A consequence of this simplistic way of working is that my development cycle, and therefore my use of git, boils down to these few steps [3]:

  • clone
  • edit / build / test
  • diff
  • add / commit
  • pull
  • push

There may occasionally be a short inner loop where a merge conflict shows up during the pull (integration) phase which causes me to go through the edit / diff / commit cycle again, but by-and-large conflicts are rare due to close collaboration and very short change cycles. Ultimately though, from the gazillions of commands that git supports, I mostly use just those 6. As you can probably guess, despite using git for nearly 7 years, I actually know very little about it (command wise). [4]

Isolation

Where I see people getting into trouble and subsequently venting their anger is when branches are involved. This is not a problem which is specific to git though, you see this crop up with any VCS that supports branches whether it be ClearCase, Perforce, Subversion, etc. Hence, the tool is not the problem, the workflow is. And that commonly stems from a delivery process mandated by the organization, meaning that ultimately the issue is one of an organizational nature, not the tooling per-se.

An organisation which seeks to reduce risk by isolating work (and by extension its people) onto branches is increasing the delay in feedback thereby paradoxically increasing the risk of integration, or so-called “merge debt”. A natural side-effect of making it harder to push through changes is that people will start batching up work in an attempt to boost "efficiency”. The trick is to go in the opposite direction and break things down into smaller units of work that are easier to produce and quicker to improve. Balancing production code changes with a solid investment in test coverage and automation reduces that risk further along with collaboration boosting techniques like pair and mob programming.

Less is More

Instead of enforcing a complicated workflow and employing complex tools in the hope that we can remain in control of our process we should instead seek to keep the workflow simple so that our tools remain easy to use. Git was written to solve a problem most teams don’t have as they neither have the volume of distributed people or complexity of product to deal with. Organisations that do have complex codebases cannot expect to dig themselves out of their hole simply by introducing a more powerful version control tool, it will only increase the cost of delay while bringing a false sense of security as programmers work in the dark for longer.

 

[1] My “Branching Strategies” article in ACCU’s Overload covers this topic if you’re looking for a summary.

[2] This does not preclude the use of private branches for spikes and/or release branches for hotfix engineering when absolutely needed. #NoAbsolutes.

[3] See “In The Toolbox - Commit Checklist” for some deeper discussion about what goes through my head during the diff / commit phase.

[4] I pondered including “log” in the list for when doing a spot of software archaeology but that is becoming much rarer these days. I also only use “fetch” when I have to work with feature branches.

Friday 13 December 2019

Choosing “a” Database, not “the” Database

One thing I’ve run across a few times over the years is the notion that an application or system has one, and only one, database product. It’s as if the answer to the question about where we should store our data must be about where we store “all” our data.

Horses for Courses

I’ve actually touched on this topic before in “Deferring the Database Choice” where our team tried to put off the question as long as possible because of a previous myopic mindset and there was a really strong possibility that we might even have a need for two different styles of database – relational and document-oriented – because we had two different types of data to store with very different constraints.

In that instance, after eventually working out what we really needed, we decided to look at a traditional relational database for the transactional data [1], while we looked towards the blossoming NoSQL crowd for the higher-volume non-transactional data. While one might have sufficed for both purposes the organisational structure and lack of operational experience at the time meant we didn’t feel comfortable putting all our eggs in that one NoSQL basket up front.

As an aside the Solution Architect [2] who was assigned to our team by the client definitely seemed out of their comfort zone with the notion that we might want to use different products for different purposes.

Platform Investment

My more recent example of this line of reasoning around “the one size fits all” misnomer was while doing some consulting at a firm in the insurance sector, an area where mainframes and legacy systems pervade the landscape.

In this particular case I had been asked to help advise on the architecture of a few new internal services they were planning. Two were really just caches of upstream data designed to reduce the per-cost call of 3rd party services while the third would serve up flood related data which was due to be incorporated into insurance pricing.

To me they all seemed like no-brainers. Even the flood data service just felt like it was probably a simple web service (maybe REST) that looks up the data in a document oriented database based on the postcode key. The volume of requests and size of the dataset did not seem remarkable in any way, nor the other caches. The only thing that I felt deserved any real thought was around the versioning of the data, if that was even a genuine consideration. (I was mostly trying to think of any potential risks that might vaguely add to the apparent lack of complexity.)

Given the company already called out from its mainframe to other web services they had built, this was a solved problem, and therefore I felt there was no reason not to start knocking up the flood data service which, given its simplicity, could be done outside-in so that they’d have their first microservice built TDD-style (an approach they wanted to try out anyway). They could even plug it in pretty quickly and just ignore the responses back to the mainframe in the short term so that they could start getting a feel for the operational aspects. In essence it seemed the perfect learning opportunity for many new skills within the department.

An Undercurrent

However, while I saw this as a low-risk venture there were questions from further up effectively about choosing the database. I suspected there were concerns about the cost but some rudimentary calculations based around a three-node cluster with redundant disks versus storage for the mainframe showed that they weren’t even in the same ballpark and we’re not even talking SSDs here either. (This also ignores the fact that they were close to maxing out the mainframe anyway.)

One of the great things about databases in these modern times is that you can download the binaries and just fire one up and get playing. Given the dataset fitted the document-oriented paradigm and there were no transactions to speak of I suggested they pick either MongoDB or Couchbase and just get started as it was the paradigm they most needed to get acquainted with, the specific vendor (to me) was less of a concern in the shorter term as the data model was simple.

Nevertheless, rather than build something first and get a feel for what makes most sense, they wanted to invite the various big NoSQL vendors in and discuss contracts and products up-front. So I arranged for the three main contenders at the time to visit the company’s offices and give a pitch, followed by some Q&A time for the management to ask any burning questions. It was during the first of these three pitches that I began to realise where the disconnect lay between my vision and theirs.

While I had always been working on the assumption that the company was most comfortable with mainframes and relational databases and that they wanted to step outside that and move to a less monolithic architecture, perhaps using the Strangler Pattern to break out the peripheral services into independent self-contained ones, they still saw a single database product sitting at the heart. Yes, the services might be built separately, and the data may well be partitioned via namespaces or collections or whatever, but fundamentally the assumption was that the data storage was still effectively monolithic.

A False Economy

In retrospect I shouldn’t really have been that surprised. The reason the mainframe had probably survived for so long was that the data was seen as the crown jewels and the problems of redundancy and backup had been solved long ago and were pretty robust. In fact if anything went wrong the vendor could helicopter some experts in (which they had done in the past). This was not the level of service offered by the new kids on the block and the company was still far from getting comfortable with cloud hosting and managed service providers which were are starting to spring up.

Hence, where I was looking at the somewhat disposable nature of the new services purely as an opportunity for learning, others higher up were looking at it as a stepping stone to moving all their data across to another platform. Coupled with this was the old-fashioned view that the decision needed to be made up-front and needed to be the right one from the off [3].

A Different Investment

Even with this misconception acknowledged and the shining cost savings to be had there was still a heavy reluctance to go with something new. I believe that in the end they put their investment into more mainframe storage instead of investing in their people and the organisation’s longer term future.

 

[1] There was definitely an element of “availability bias” here as the organisation had a volume licensing agreement with a relational database vendor.

[2] A role which highlighted their Ivory Tower approach at the time but has since fallen away as architecture has thankfully started leaning more towards shared ownership.

[3] Some of the impetus for “Don’t Fail Fast, Learn Cheaply” came from conversations I had with this organisation about their approach to career development.

Monday 9 December 2019

Automating Windows VM Creation on Ubuntu

TL;DR you can find my resulting Oz and Packer configuration files in this Oz gist and this Packer gist on my GitHub account.

As someone who has worked almost exclusively on Windows for the last 25 years I was somewhat surprised to find myself needing to create Windows VMs on Linux. Ultimately these were to be build server agents and therefore I needed to automate everything from creating the VM image, to installing Windows, and eventually the build toolchain. This post looks at the first two aspects of this process.

I did have a little prior experience with Packer, but that was on AWS where the base AMIs you’re provided have already got you over the initial OS install hurdle and you can focus on baking in your chosen toolchain and application. This time I was working on-premise and so needed to unpick the Linux virtualization world too.

In the end I managed to get two approaches working – Oz and Packer – on the Ubuntu 18.04 machine I was using. (You may find these instructions useful for other distributions but I have no idea how portable this information is.)

QEMU/KVM/libvirt

On the Windows-as-host side (until fairly recently) virtualization boiled down to a few classic options, such as Hyper-V and Virtual Box. The addition of Docker-style Windows containers, along with Hyper-V containers has padded things out a bit more but to me it’s still fairly manageable.

In contrast on the Linux front, where this technology has been maturing for much longer, we have far more choice, and ultimately, for a Linux n00b like me [1], this means far more noise to wade through on top of the usual “which distribution are you running” type questions. In particular the fact that any documentation on “virtualization” could be referring to containers or hypervisors (or something in-between), when you’re only concerned with hypervisors for running Windows VMs, doesn’t exactly aid comprehension.

Luckily I was pointed towards KVM as a good starting point on the Linux hypervisor front. QEMU is one of those minor distractions as it can provide full emulation, but it also provides the other bit KVM needs to be useful in practice – device emulation. (If you’re feeling nostalgic you can fire up an MS-DOS recovery boot-disk from “All Boot Disks” under QMEU/KVM with minimal effort which gives you a quick sense of achievement.)

What I also found mentioned in the same breath as these two was a virtualization “add-on layer” called libvirt which provides a layer on top of the underlying technology so that you can use more technology agnostic tools. Confusingly you might notice that Packer doesn’t mention libvirt, presumably because it already has providers that work directly with the lower layer.

In summary, using apt, we can install this lot with:

$ sudo apt install qemu qemu-kvm libvirt-bin  bridge-utils  virt-manager -y

Windows ISO & Product Key

We’re going to need a Windows ISO along with a related product key to make this work. While in the end you’ll need a proper license key I found the Windows 10 Evaluation Edition was perfect for experimentation as the VM only lasts for a few minutes before you bin it and start all over again.

You can download the latest Windows image from the MS downloads page which, if you’ve configured your browser’s User-Agent string to appear to be from a non-Windows OS, will avoid all the sign-up nonsense. Alternatively google for “care.dlservice.microsoft.com” and you’ll find plenty of public build scripts that have direct download URLs which are beneficial for automation.

Although the Windows 10 evaluation edition doesn’t need a specific license key you will need a product key to stick in the autounattend.xml file when we get to that point. Luckily you can easily get that from the MS KMS client keys page.

Windows Answer File

By default Windows presents a GUI to configure the OS installation, but if you give it a special XML file known as autounattend.xml (in a special location, which we’ll get to later) all the configuration settings can go in there and the OS installation will be hands-free.

There is a specific Windows tool you can use to generate this file, but an online version in the guise of the Windows Answer File Generator produced a working file with fairly minimal questions. You can also generate one for different versions of the Windows OS which is important as there are many examples that appear on the Internet but it feels like pot-luck as to whether it would work or not as the format changes slightly between releases and it’s not easy to discover where the impedance mismatch lies.

So, at this point we have our Linux hypervisor installed, and downloaded a Windows installation .iso along with a generated autounattend.xml file to drive the Windows install. Now we can get onto building the VM, which I managed to do with two different tools – Oz and Packer.

Oz

I was flicking through a copy of Mastering KVM Virtualization and it mentioned a tool called Oz which was designed to make it easy to build a VM along with installing an OS. More importantly it listed having support for most Windows editions too! Plus it’s been around for a fairly long time so is relatively mature. You can install it with apt:

$ sudo apt install oz -y

To use it you create a simple configuration file (.tdl) with the basic VM details such as CPU count, memory, disk size, etc. along with the OS details, .iso filename, and product key (for Windows), and then run the tool:

$ oz-install -d2 -p windows.tdl -x windows.libvirt.xml

If everything goes according to plan you end up with a QEMU disk image and an .xml file for the VM (called a “domain”) that you can then register with libvirt:

$ virsh define windows.libvirt.xml

Finally you can start the VM via libvirt with:

$ virsh start windows-vm

I initially tried this with the Windows 8 RTM evaluation .iso and it worked right out of the box with the Oz built-in template! However, when it came to Windows 10 the Windows installer complained about there being no product key, despite the Windows 10 template having a placeholder for it and the key was defined in the .tdl configuration file.

It turns out, as you can see from Issue #268 (which I raised in the Oz GitHub repo) that the Windows 10 template is broken. The autounattend.xml file also wants the key in the <UserData> section too it seems. Luckily for me oz-install can accept a custom autounattend.xml file via the -a option as long as we fill in any details manually, like the <AutoLogin> account username / password, product key, and machine name.

$ oz-install -d2 -p windows.tdl -x windows.libvirt.xml –a autounattend.xml

That Oz GitHub issue only contains my suggestions as to what I think needs fixing in the autounattend.xml file, I also have a personal gist on GitHub that contains both the .tdl and .xml files that I successfully used. (Hopefully I’ll get a chance to submit a formal PR at some point so we can get it properly fixed; it also needs a tweak to the Python code as well I believe.)

Note: while I managed to build the basic VM I didn’t try to do any post-processing, e.g. using WinRM to drive the installation of applications and tools from the outside.

Packer

I had originally put Packer to one side because of difficulties getting anything working under Hyper-V on Windows but with my new found knowledge I decided to try again on Linux. What I hadn’t appreciated was quite how much Oz was actually doing for me under the covers.

If you use the Packer documentation [2] [3] and online examples you should happily get the disk image allocated and the VM to fire up in VNC and sit there waiting for you to configure the Windows install. However, after selecting your locale and keyboard you’ll probably find the disk partitioning step stumps you. Even if you follow some examples and put an autounattend.xml on a floppy drive you’ll still likely hit a <DiskConfiguration> error during set-up. The reason is probably because you don’t have the right Windows driver available for it to talk to the underlying virtual disk device (unless you’re lucky enough to pick an IDE based example).

One of the really cool things Oz appears to do is handle this nonsense along with the autounattend.xml file which it also slips into the .iso that it builds on-the-fly. With Packer you have to be more aware and fetch the drivers yourself (which come as part of another .iso) and then mount that explicitly as another CD-ROM drive by using the qemuargs section of the Packer builder config. (In my example it’s mapped as drive E: inside Windows.)

[ "-drive", "file=./virtio-win.iso,media=cdrom,index=3" ]

Luckily you can download the VirtIO drivers .iso from a Fedora page and stick it alongside the Windows .iso. That’s still not quite enough though, we also need to tell the Windows installer where our drivers are located; we do that with a special section in the autounattend.xml file.

<DriverPaths>
  <PathAndCredentials wcm:action="add" wcm:keyValue="1">
    <Path>E:\NetKVM\w10\amd64\</Path>

Finally, in case you’ve not already discovered it, the autounattend.xml file is presented by Packer to the Windows installer as a file in the root of a floppy drive. (The floppy drive and extra CD-ROM drives both fall away once Windows has bootstrapped itself.)

"floppy_files":
[
  "autounattend.xml",

Once again, as mentioned right at the top, I have a personal gist on GitHub that contains the files I eventually got working.

With the QEMU/KVM image built we can then register it with libvirt by using virt-install. I thought the --import switch would be enough here as we now have a runnable image, but that option appears to be for a different scenario [4], instead we have to take two steps – generate the libvirt XML config file using the --print-xml option, and then apply it:

$ virt-install --vcpus ... --disk ...  --print-xml > windows.libvert.xml
$ virsh define windows.libvert.xml

Once again you can start the finalised VM via libvirt with:

$ virsh start windows-vm

Epilogue

While having lots of documentation is generally A Good Thing™, when it’s spread out over a considerable time period it’s sometimes difficult to know if the information you’re reading still applies today. This is particularly true when looking at other people’s example configuration files alongside reading the docs. The long-winded route might still work but the tool might also do it automatically now if you just let it, which keeps your source files much simpler.

Since getting this working I’ve seen other examples which suggest I may have fallen foul of this myself and what I’ve written up may also still be overly complicated! Please feel free to use the comments section on this blog or my gists to inform any other travellers of your own wisdom in any of this.

 

[1] That’s not entirely true. I ran Linux on an Atari TT and a circa v0.85 Linux kernel on a 386 PC in the early-to-mid ‘90s.

[2] The Packer docs can be misleading. For example it says the disk_size is in bytes and you can use suffixes like M or G to simplify matters. Except they don’t work and the value is actually in megabytes. No wonder a value of 15,000,000,000 didn’t work either :o).

[3] Also be aware that the version of Packer available via apt is only 1.0.x and you need to manually download the latest 1.4.x version and unpack the .zip. (I initially thought the bug in [2] was down to a stale version but it’s not.)

[4] The --import switch still fires up the VM as it appears to assume you’re going to add to the current image, not that it is the final image.


Monday 18 November 2019

Arbitrary Cache Timeouts

Like many other programmers I’ve probably added my fair share of caches to systems over the years, and as we know from the old joke, one of the two hardest problems in computer science is knowing when to invalidate them. It’s a hard question, to be sure, but a really annoying behaviour you can run into as a maintainer is when the invalidation appears to be done arbitrarily, usually by specifying some timeout seemingly plucked out of thin air and maybe even changed equally arbitrarily. (It may not be, but documenting such decisions is usually way down the list of important things to do.)

Invalidation

If there is a need for a cache in production, and let’s face it that’s the usual driver, then any automatic invalidation is likely to be based on doing it as infrequently as possible to ensure the highest hit ratio. The problem is that that value can often be hard-coded and mask cache invalidation bugs because it rarely kicks in. The knee-jerk reaction to “things behaving weirdly” in production is to switch everything off-and-on again thereby implicitly invalidating any caches, but this doesn’t help us find those bugs.

The most recent impetus for this post was just such a bug which surfaced because the cache invalidation logic never ran in practice. The cache timeout was set arbitrarily large, which seemed odd, but I eventually discovered it was supposed to be irrelevant because the service hosting it should have been rebooted at midnight every day! Due to the password for the account used to run the reboot task expiring it never happened and the invalidated items then got upset when they were requested again. Instead of simply fetching the item from the upstream source and caching it again, the cache had some remnants of the stale items and failed the request instead. Being an infrequent code path it didn’t obviously ring any bells so took longer to diagnose.

Design for Testability

While it’s useful to avoid throwing away data unnecessarily in production we know that the live environment rarely needs the most flexibility when it comes to configuration (see “Testing Drives the Need for Flexible Configuration”). On the contrary, I’d expect to have any cache being cycled reasonably quickly in a test environment to try and flush out any issues as I’d expect more side-effects from cache misses than hits.

If you are writing any automated tests around the caching behaviour that is often a good time to consider the other non-functional requirements, such as monitoring and support. For example, does the service or tool hosting the cache expose some means to flush it manually? While rebooting a service may do the trick it does nothing to help you track down issues around residual state and often ends up wreaking havoc with any connected clients if they’re not written with a proper distributed system mindset.

Another scenario to consider is if the cache gets poisoned; if there is no easy way to eject the bad data you’re looking at the sledgehammer approach again. If your cache is HA (highly available) and backed by some persistent storage getting bad data out could be a real challenge when you’re under the cosh. One system I worked on had random caches poisoned with bad data due to a threading serialization bug in an external library.

Monitoring

The monitoring side is probably equally important. If you generate no instrumentation data how do you know if your cache is even having the desired effect? One team I was on added a new cache to a service and we were bewildered to discover that it was never used. It turned out the WCF service settings were configured to create a new service instance for every request and therefore a new cache was created every time! This was despite the fact that we had unit tests for the cache and they were happily passing [1].

It’s also important to realise that a cache without an eviction policy is just another name for a memory leak. You cannot keep caching data forever unless you know there is a hard upper bound. Hence you’re going to need to use the instrumentation data to help find the sweet spot that gives you the right balance between time and space.

We also shouldn’t blindly assume that caches will continue to provide the same performance in future as they do now; our metrics will allow us to see any change in trends over time which might highlight a change in data that’s causing it to be less efficient. For example one cache I saw would see its efficiency plummet for a while because a large bunch of single use items got requested, cached, and then discarded as the common data got requested again. Once identified we disabled caching for those kinds of items, not so much for the performance benefit but to avoid blurring the monitoring data with unnecessary “glitches” [2].

 

[1] See “Man Cannot Live by Unit Testing Alone” for other tales of the perils of that mindset.

[2] This is a topic I covered more extensively in my Overload article “Monitoring: Turning Noise Into Signal”.

Thursday 14 November 2019

Validate in Production

The change was reasonably simple: we had to denormalise some postcode data which was currently held in a centralised relational database into some new fields in every client’s database to remove some cross-database joins that would be unsupported on the new SQL platform we were migrating too [1].

As you might imagine the database schema changes were fairly simple – we just needed to add the new columns as nullable strings into every database. The next step was to update the service code to start populating these new fields as addresses were added or edited by using data from the centralised postcode database [2].

At this point any new data or data that changed going forward would have the correctly denormalised state. However we still needed to fix up any existing data and that’s the focus of this post.

Migration Plan

To fix-up all the existing client data we needed to write a tool which would load each client’s address data that was missing its new postcode data, look it up against the centralised list, and then write back any changes. Given we were still using the cross-database joins in live for the time being to satisfy the existing reports we could roll this out in the background and avoiding putting any unnecessary load on the database cluster.

The tool wasn’t throw-away because the postcode dataset gets updated regularly and so the denormalised client data needs to be refreshed whenever the master list is updated. (This would not be that often but enough to make it worth spending a little extra time writing a reusable tool for the job for ops to run.)

Clearly this isn’t rocket science, it just requires loading the centralised data into a map, fetching the client’s addresses, looking them up, and writing back the relevant fields. The tool only took a few hours to write and test and so it was ready to run for the next release during a quiet period.

When that moment arrived the tool was run across the hundreds of client databases and plenty of data was fixed up in the process, so the task appeared completed.

Next Steps

With all the existing postcode data now correctly populated too we should have been in a position to switch the report generation feature toggle on so that it used the new denormalised data instead of doing a cross-database join to the existing centralised store.

While the team were generally confident in the changes to date I suggested we should just do a sanity check and make sure that everything was working as intended as I felt this was a reasonably simple check to run.

An initial SQL query someone knocked up just checked how many of the new fields had been populated and the numbers seemed about right, i.e. very high (we’d expect some addresses to be missing data due to missing postcodes, typos and stale postcode data). However I still felt that we should be able to get a definitive answer with very little effort by leveraging the existing we SQL we were about to discard, i.e. use the cross-database join one last time to verify the data population more precisely.

Close, but No Cigar

I massaged the existing report query to show where data from the dynamic join was different to that in the new columns that had been added (again, not rocket science). To our surprise there were quite a significant number of discrepancies.

Fortunately it didn’t take long to work out that those addresses which were missing postcode data all had postcodes which were at least partially written in lowercase whereas the ones that had worked were entirely written in uppercase.

Hence the bug was fairly simple to track down. The tool loaded the postcode data into a dictionary (map) keyed on the string postcode and did a straight lookup which is case-sensitive by default. A quick change to use a case-insensitive comparison and the tool was fixed. The data was corrected soon after and the migration verified.

Why didn’t this show up in the initial testing? Well, it turned out the tools used to generate the test data sets and also to anonymize real client databases were somewhat simplistic and this helped to provide a false level of confidence in the new tool.

Testing in Production

Whenever we make a change to our system it’s important that we verify we’ve delivered what we intended. Oftentimes the addition of a feature has some impact on the front-end and the customer and therefore it’s fairly easy to see if it’s working or not. (The customer usually has something to say about it.)

However back-end changes can be harder to verify thoroughly, but it’s still important that we do the best we can to ensure they have the expected effect. In this instance we could easily check every migrated address within a reasonable time frame and know for sure, but on large data sets this might unfeasible so you might have to settle for less. Also the use of feature switches and incremental delivery meant that even though there was a bug it did not affect the customers and we were always making forward progress.

Testing does not end with a successful run of the build pipeline or a sign-off from a QA team – it also needs to work in real life too. Ideally the work we put in up-front will make that more likely but for some classes of change, most notably where actual customer data is involved, we need to follow through and ensure that practice and theory tie up.

 

[1] Storage limitations and other factors precluded simply moving the entire postcode database into each customer DB before moving platforms. The cost was worth it to de-risk the overall migration.

[2] There was no problem with the web service having two connections to two different databases, we just needed to stop writing SQL queries that did cross-database joins.

Thursday 28 March 2019

PowerShell’s Call Operator (&) Arguments with Embedded Spaces and Quotes

I was recently upgrading a PowerShell script that used the v2 nunit-console runner to use the v3 one instead when I ran across a weird issue with PowerShell. I’ve haven’t found a definitive bug report or release note yet to describe the change in behaviour, hence I’m documenting my observation here in the meantime.

When running the script on my desktop machine, which runs Windows 10 and PowerShell v5.x it worked first time, but when pushing the script to our build server, which was running Windows Server 2012 and PowerShell v4.x it failed with a weird error that suggested the command line being passed to nunit-console was borked.

Passing Arguments with Spaces

The v3 nunit-console command line takes a “/where” argument which allows you to provide a filter to describe which test cases to run. This is a form of expression and the script’s default filter was essentially this:

cat == Integration && cat != LongRunning

Formatting this as a command line argument it then becomes:

/where:“cat == Integration && cat != LongRunning”

Note that the value for the /where argument contains spaces and therefore needs to be enclosed in double quotes. An alternative of course is to enclose the whole argument in double quotes instead:

“/where:cat == Integration && cat != LongRunning”

or you can try splitting the argument name and value up into two separate arguments:

/where “cat == Integration && cat != LongRunning”

I’ve generally found these command-line argument games unnecessary unless the tool I’m invoking is using some broken or naïve command line parsing library [1]. (In this particular scenario I could have removed the spaces too but if it was a path, like “C:\Program Files\Xxx”, I would not have had that luxury.)

PowerShell Differences

What I discovered was that on PowerShell v4 when an argument has embedded spaces it appears to ignore the embedded quotes and therefore sticks an extra pair of quotes around the entire argument, which you can see here:

> $where='/where:"cat == Integration"'; & cmd /c echo $where
"/where:"cat == Integration""

…whereas on PowerShell v5 it “notices” that the value with spaces is already correctly quoted and therefore elides the outer pair of double quotes:

> $where='/where:"cat == Integration"'; & cmd /c echo $where
/where:"cat == Integration"

On PowerShell v4 only by removing the spaces, which I mentioned above may not always be possible, can you stop it adding the outer pair of quotes:

> $where='/where:"cat==Integration"'; & cmd /c echo $where
/where:"cat==Integration"

…of course now you don’t need the quotes anymore :o). However, if for some reason you are formatting the string, such as with the –f operator that might be useful (e.g. you control the value but not the format string).

I should point out that this doesn’t just affect PowerShell v4, I also tried it on my Vista machine with PowerShell v2 and that exhibited the same behaviour, so my guess is this was “fixed” in v5.

[1] I once worked with an in-house C++ based application framework that completely ignored the standard parser that fed main() and instead re-parsed the arguments, very badly, from the raw string obtained from GetCommandLine().

Friday 22 March 2019

CI/CD Server Inline Scripts

As you might have already gathered if you’d read my 2014 post “Building the Pipeline - Process Led or Product Led?” I’m very much in favour of developing a build and deployment process locally first, then automating that, rather than clicking buttons in a dedicated CI/CD tool and hoping I can debug it later. I usually end up at least partially scripting builds anyway [1] to save time waiting for the IDE to open [2] when I just need some binaries for a dependency, so it’s not wasted effort.

Inline Scripts

If other teams prefer to configure their build or deployment through a tool’s UI I don’t really have a problem with that if I know I can replay the same steps locally should I need to test something out as the complexity grows. What I do find disturbing though is when some of the tasks use inline scripts to do something non-trivial, like perform the entire deployment. What’s even more disturbing is when that task script is then duplicated across environments and maintained independently.

Versioning

There are various reasons why we use a version control tool, but first and foremost they provide a history, which implies that we can trace back any changes that have been made and we have a natural backup should we need to roll back or restore the build server.

Admittedly most half-decent build and deployment tools come with some form of versioning built in which you gives that safety net. However having that code versioned in a separate tool and repository from the main codebase means that you have to work harder to correlate what version of the system requires what version of the build process. CI/CD tools tend to present you with a fancy UI for looking at the history rather than giving you direct access to, say, it’s internal git repo. And even then what the tool usually gives you is “what” changed, but does not also provide the commentary on “why” it was changed. Much of what I wrote in my “Commit Checklist” equally applies to build and deployment scripts as it does production code.

Although Jenkins isn’t the most polished of tools compared to, say, TeamCity it is pretty easy to configure one of the 3rd party plugins to yank the configuration files out and check them into the same repo as the source code along with a suitable comment. As a consequence any time the repo is tagged due to a build being promoted the Jenkins build configuration gets included for free.

Duplication

My biggest gripe is not with the versioning aspect though, which I believe is pretty important for any non-trivial process, but it’s when the script is manually duplicated across environments. Having no single point of truth, from a logic perspective, is simply asking for trouble. The script will start to drift as subtleties in the environmental differences become enshrined directly in the logic rather than becoming parameterised behaviours.

The tool’s text editor for inline script blocks is usually a simple edit box designed solely for trivial changes; anything more significant is expected to be handled by pasting into a real editor instead. But we all know different people like different editors and so this becomes another unintentional source of difference as tabs and spaces fight for domination.

Fundamentally there should be one common flow of logic that works for every environment. The differences between them should boil down to simple settings, like credentials, or cardinality of resources, e.g. the number of machines in the cluster. Occasionally there may be custom branches in logic, such as the need for a proxy server, but it should be treated as a minor deviation that could apply to any environment, but just happens to only be applicable to, say, one at the moment.

Testability

This naturally leads onto the inherent lack of testability outside of the tool and workflow. It’s even worse if the script makes use of some variable substitution system that the CI/CD tool provides because that means you have to manually fix-up the code before running it outside the tool, or keep running it in the tool and use printf() style debugging by looking at the task’s output.

All script engines I’m aware of accept arguments, so why not run the script as an external script and pass the arguments from the tool in the tried and tested way? This means the tool runs it pretty much the same way you do except perhaps for some minor environmental differences, like user account or current working directory which are all common problems and easily overcome. Most modern scripting languages come with a debugger too which seems silly to give up.

Of course this doesn’t mean that you have to make every single configuration setting a separate parameter to the script, that would be overly complicated too. Maybe you just provide one parameter which is a settings file for the environment with a bunch of key/value pairs. You can then tweak the settings as appropriate while you test and debug. While idempotence and the ideas behind Desired State Configuration (DSC) are highly desirable, there is no reason we can’t also borrow from the Design for Testability guidebook here too by adding features making it easier to test.

Don’t forget that scripting languages often come with unit test frameworks these days too which can allow you to mock out code which has nasty side-effects so you can check your handling and orchestration logic. For example PowerShell has Pester which really helps bring some extra discipline to script development; an area which has historically been tough due to the kinds of side-effects created by executing the code.

Complexity

When an inline script has grown beyond the point where Hoare suggests “there are obviously no deficiencies”, which is probably anything more than a trivial calculation or invocation of another tool, then it should be decomposed into smaller functional units. Then each of these units can be tested and debugged in isolation and perhaps the inline script then merely contains a couple of lines of orchestration code, which would be trivial to replicate at a REPL / prompt.

For example anything around manipulating configuration files is a perfect candidate for factoring out into a function or child script. It might be less efficient to invoke the same function a few times rather than read and write the file once, but in the grand scheme of things I’d bet it’s marginal in comparison to the rest of the build or deployment process.

Many modern scripting languages have a mechanism for loading some sort of module or library of code. Setting up an internal package manager is a pretty heavyweight option in comparison to publishing a .zip file of scripts but if it helps keep the script complexity under control and provides a versioned repository that can be reliably queried at execution time, then why not go for that instead?

Scripts are Artefacts

It’s easy to see how these things happen. What starts off as a line or two of script code eventually turns into a behemoth before anyone realises it’s not been versioned and there are multiple copies. After all, the deployment requirements historically come up at the end of the journey, after the main investment in the feature has already happened. The pressure is then on to get it live, and build & deployment, like tests, is often just another second class citizen.

The Walking Skeleton came about in part to push back against this attitude and make the build pipeline and tests part and parcel of the whole delivery process; it should not be an afterthought. This means it deserves the same rigour we apply elsewhere in our process.

Personally I like to see everything go through the pipeline, by which I mean that source code, scripts, configuration, etc. all enter the pipeline as versioned inputs and are passed along until the deployed product pops out the other end. The way you build your artefacts is inherently tied to the source code and project configuration that produces it. Configuration, whether it be infrastructure or application settings, is also linked to the version of the tools, scripts, and code which consumes it. It’s more awkward to inject version numbers into scripts, like you do with binaries, but even pushing them through the pipeline in a .zip file with version number in the filename makes a big difference to tracking the “glue”.

Ultimately any piece of the puzzle that directly affects the ability to safely deliver continuous increments of a product needs to be held in high regard and treated with the respect it deserves.

 

[1] See “Cleaning the Workspace” for more about why I don’t trust my IDE to clean up after itself.

[2] I’m sure I could load Visual Studio, etc. in “safe mode” to avoid waiting for all the plug-ins and extensions to initialise but it still seems “wrong” to load an entire IDE just to invoke the same build tool I could invoke almost directly from the command line myself.

Thursday 14 March 2019

Abstraction with Database Views

After being away from the relational database world for a few years it’s been interesting coming back and working on a mature system with plenty of SQL code. It’s been said that SQL is the assembly language of databases and when SQL code is written only using its primitives (types and tables) it’s easy to see why.

Way back in 2011 I wrote “The Public Interface of a Database” which was a distillation of my thoughts at the time about what I felt was generally wrong with much of the database code I saw. One aspect in particular which I felt was sorely underutilised was the use of views to build a logical model over the top of the physical model to allow a more emergent design to unfold. This post documents some of the ways I’ve found views to be beneficial in supporting a more agile approach to database design.

Views for Code Reuse

The first thing that struck me about the recent SQL code I saw was how much there was of it. Most queries were pretty verbose and as a consequence you had to work hard to comprehend what was going on. Just as you see the same tired examples around Orders => OrderItems => Products so the code had a similar set of 3 table joins over and over again as they formed the basis for so many queries.

One of the primary uses for database views is as a code reuse mechanism. Instead of copy-and-pasting the same bunch of joins everywhere:

FROM Orders o
INNER JOIN OrderItems oi
ON o.Id = oi.OrderId 
INNER JOIN Products p
ON oi.ProductId = p.Id

we could simply say:

FROM OrdersOrderItemsProducts

This one simplification reduces a lot of complexity and means that wherever we see that name we instantly recognise it without mentally working through the joins in our head. Views are composable too meaning that we can implement one view in terms of another rather than starting from scratch every time.

Naming

However, if the name OrdersOrderItemsProducts makes you wince then I don’t blame you because it’s jarring due to its length and unnaturalness. It’s a classic attempt at naming based on how it’s implemented rather than what it means.

I suspect a difficulty in naming views is part of the reason for their lack of use in some cases. For our classic example above I would probably go with OrderedProducts or ProductsOrdered. The latter is probably preferable as the point of focus is the Products “set” with the use of Orders being a means to qualify which products we’re interested in, like “users online”. Of course one could just easily say “unread messages” and therefore we quickly remember why naming is one of the two hardest problems in computer science.

Either way it’s important that we do spend the time required to name our views appropriately as they become the foundation on which we base many of our other queries.

Views for Encapsulation

Using views as a code reuse mechanism is definitely highly beneficial but where I think they start to provide more value are as a mechanism for revealing new, derived sets of data. The name ProductsOrdered is not radically different from the more long-winded OrdersOrderItemsProducts and therefore it still heavily reflects the physical relationship of the underlying tables.

Now imagine a cinema ticketing system where you have two core relationships: Venue => Screen => SeatingPlan and Film => Screening => Ticket => Seat. By navigating these two relationships it is possible to determine the occupancy of the venue, screen, showing, etc. and yet the term Occupancy says nothing about how that is achieved. In essence we have revealed a new abstraction (Occupancy) which can be independently queried and therefore elevates our thinking to a higher plane instead of getting bogged down in the lengthy chain of joins across a variety of base tables.

Views for Addressing Uncertainty

We can also turn this thinking upside down, so that rather than creating something new by hiding the underlying existing structure, we can start with something concrete and re-organise how things work underneath. This is the essence of refactoring – changing the design without changing the behaviour.

When databases were used as a point of integration this idea of hiding the underlying schema from “consumers” made sense as it gave you more room to change the schema without breaking a bunch of queries your consumers had already created. But even if you have sole control over your schema there is still a good reason why you might want to hide the schema, nay implementation, even from much of your own code.

Imagine you are developing a system where you need to keep daily versions of your customer’s details easily accessible because you regularly perform computations across multiple dates [1] and you need to use the correct version of each customer’s data for the relevant date. When you start out you may not know what the most appropriate way to store them because you do not know how frequently they change, what kinds of changes are made, or how the data will be used in practice.

If you assume that most attributes change most days you may well plump to just store them daily, in full, e.g.

| Date       | Name      | Valuation | ... | 
| 2019-03-01 | Company A | £102m     | ... |  
| 2019-03-01 | Company B | £47m      | ... |  
| 2019-03-02 | Company A | £105m     | ... |  
| 2019-03-02 | Company B | £42m      | ... |  
| 2019-03-03 | Company A | £105m     | ... |  
| 2019-03-03 | Company B | £42m      | ... |

On the contrary, if the attributes rarely change each day then maybe we can version the data instead:

| Name      | Version | Valuation | ... |
| Company A | 1       | £147m     | ... |
| Company A | 2       | £156m     | ... |
| Company B | 1       | £27m      | ... |

So far so good, but how do we track which version belongs to which date? Once again I can think of two obvious choices. The first is much like the original verbose table and we record it on a daily basis:

| Date       | Name      | Version |
| 2019-03-01 | Company A | 1       |
| 2019-03-01 | Company B | 1       |
| 2019-03-02 | Company A | 1       |
| 2019-03-02 | Company B | 2       |

The second is to coalesce dates with the same version creating a much more compact form:

| From       | To         | Name      | Version |
| 2019-03-01 | (null)     | Company A | 1       |
| 2019-03-01 | 2019-03-01 | Company B | 1       |
| 2019-03-02 | (null)     | Company B | 2       |

Notice how we have yet another design choice to make here – whether to use NULL to represent “the future”, or whether to put today’s date as the upper bound and bump it on a daily basis [2].

So, with all those choices how do we make a decision? What if we don’t need to make a decision, now? What if we Use Uncertainty as a Driver and create a design that is easily changeable when we know more about the shape of the data and how it’s used?

What we do know is that we need to process customer data on a per-date basis, therefore, instead of starting with a Customer table we start with a Customer view which has the shape we’re interested in:

| Date | Name | Valuation | ... | 

We can happily use this view wherever we like knowing that the underlying structure could change without us needing to fix up lots of code. Naturally some code will be dependent on the physical structure, but the point is that we’ve kept it to a bare minimum. If we need to transition from one design to another, but can’t take the downtime to rewrite all the data up-front, that can often be hidden behind the view too.

Views as Interfaces

It’s probably my background [3] but I can’t help but notice a strong parallel in the latter two examples with the use of interfaces in object-oriented code. George Box reminds us that “all models are wrong, but some are useful” and so we should be careful not to strain the analogy too far but I think there is some value in considering the relationship between views and tables as somewhat akin to interfaces and classes, at least for the purposes of encapsulation as described above.

On a similar note we often strive to create and use the narrowest interface that solves our problem and that should be no different in the database world either. Creating narrower interfaces (views) allows us to remain more in control of our implementation by leaking less.

One final type related comparison that I think worthy of mention is that it’s easier to spot structural problems when you have a “richer type system”, i.e. many well-named views. For example, if a query joins through ProductsOrdered to get to UserPreferences you can easily see something funky is going on.

Embracing Change

When you work alongside a database where the SQL code and schema gets refactored almost as heavily as the services that depend on it is a pleasurable experience [4]. Scott Ambler wrote a couple of books over a decade ago (Refactoring Databases: Evolutionary Database Design and Agile Database Techniques) which convinced me long ago that it was possible to design databases that could embrace change. Making judicious use of views certainly helped achieve that in part by keeping the accidental complexity down.

Admittedly performance concerns, still a dark art in the world of databases, gets in the way every now and but I’d rather try to make the database a better place for my successors rather than assume it can’t be done.

 

[1] In investment banking it’s common to re-evaluate trades and portfolios on historical dates both for regulatory and analytical purposes.

[2] Some interesting scenarios crop up here when repeatability matters and you have an unreliable upstream data source.

[3] I’m largely a self-taught, back-end developer with many years of writing C++ and C# based services.

[4] Having a large suite of database unit tests, also written in T-SQL, really helped as we could use TDD on the database schema too.

Friday 8 March 2019

The Perils of Multi-Phase Construction

I’ve never really been a fan of C#’s object initializer syntax. Yes, it’s a little more convenient to write but it has a big downside which is it forces you to make your types mutable by default. Okay, that’s a bit strong, it doesn’t force you to do anything, but it does promote that way of thinking and allows people to take advantage of mutability outside the initialisation block [1].

This post is inspired by some buggy code I encountered where my suspicion is that the subtleties of the object initialisation syntax got lost along the way and partially constructed objects eventually found their way into the wild.

No Dragons Yet

The method, which was to get the next message from a message queue, was originally written something like this:

Message result = null;
RawMessage message = queue.Receive();

if (message != null)
{
  result = new Message
  {
    Priority = message.Priority,
    Type = GetHeader(message, “MessageType”),
    Body = message.Body, 
  };
}

return result;

This was effectively correct. I say “effectively correct” because it doesn’t contain the bug which came later but still relies on mutability which we know can be dangerous.

For example, what would happen if the GetHeader() method threw an exception? At the moment there is no error handling and so the exception propagates out the method and back up the stack. Because we make no effort to recover we let the caller decide what happens when a duff message comes in.

The Dragons Begin Circling

Presumably the behaviour when a malformed message arrived was undesirable because the method was changed slightly to include some recovery fairly soon after:

Message result = null;
RawMessage message = queue.Receive();

if (message != null)
{
  try
  {
    result = new Message
    {
      Priority = message.Priority,
      Type = GetHeader(message, “MessageType”),
      Body = message.Body,  
    };
  }
  catch (Exception e)
  {
    Log.Error(“Invalid message. Skipping.”);
  }
}

return result;

Still no bug yet, but that catch handler falling through to the return at the bottom is somewhat questionable; we are making the reader work hard to track what happens to result under the happy / sad paths to ensure it remains correct under further change.

Object Initialisation Syntax

Before showing the bug, here’s a brief refresher on how the object initialisation syntax works under the covers [2] in the context of our example code. Essentially it invokes the default constructor first and then performs assignments on the various other properties, e.g.

var __m = new Message();
__m.Priority = message.Priority;
__m.Type = GetHeader(message, “MessageType”);
__m.Body = message.Body,  
result = __m;

Notice how the compiler introduces a hidden temporary variable during the construction which it then assigns to the target at the end? This ensures that any exceptions during construction won’t create partially constructed objects that are bound to variables by accident. (This assumes you don’t use the constructor or property setter to attach itself to any global variables either.)

Hence, with respect to our example, if any part of the initialization fails then result will be left as null and therefore the message is indeed discarded and the caller gets a null reference back.

The Dragons Surface

Time passes and the code is then updated to support a new property which is also passed via a header. And then another, and another. However, being more complicated than a simple string value the logic to parse it is placed outside the object initialisation block, like this:

Message result = null;
RawMessage message = queue.Receive();

if (message != null)
{
  try
  {
    result = new Message
    {
      Priority = message.Priority,
      Type = GetHeader(message, “MessageType”),
      Body = message.Body,  
    };

    var str = GetHeader(message, “SomeIntValue”);
    if (str != null && TryParseInt(str, out var value))
      result.IntValue = value;

    // ... more of the same ...
  }
  catch (Exception e)
  {
    Log.Error(“Invalid message. Skipping.”);
  }
}

return result;

Now the problems start. With the latter header parsing code outside the initialisation block result is assigned a partially constructed object while the remaining parsing code runs. Any exceptions that occur [3] mean that result will be left only partially constructed and the caller will be returned the duff object because the exception handler falls out the bottom.

+1 for Tests

The reason I spotted the bug was because I was writing some tests around the code for a new header which also temporarily needed to be optional, like the others, to decouple the deployments. When running the tests there was an error displayed on the console output [4] telling me the message was being discarded, which I didn’t twig at first. It was when I added a retrospective test for the previous optional fields and I found my new one wasn’t be parsed correctly that I realised something funky was going on.

Alternatives

So, what’s the answer? Well, I can think of a number of approaches that would fix this particular code, ranging from small to large in terms of the amount of code that needs changing and our appetite for it.

Firstly we could avoid falling through in the exception handler and make it easier on the reader to comprehend what would be returned in the face of a parsing error:

catch (Exception e)  
{  
  Log.Error(“Invalid message. Skipping.”);
  return null;
}

Secondly we could reduce the scope of the result variable and return that at the end of the parsing block so it’s also clearer about what the happy path returns:

var result = new Message  
{  
  // . . .  
};

var str = GetHeader(message, “SomeIntValue”);
if (str != null && TryParseInt(str, out var value)
  result.IntValue = value;

return result;

We could also short circuit the original check too and remove the longer lived result variable altogether with:

RawMessage message = queue.Receive();

if (message == null)
    return null;

These are all quite simple changes which are also safe going forward should someone add more header values in the same way. Of course, if we were truly perverse and wanted to show how clever we were, we could fold the extra values back into the initialisation block by doing an Extract Function on the logic instead and leave the original dragons in place, e.g.

try
{  
  result = new Message  
  {  
    Priority = message.Priority,  
    Type = GetHeader(message, “MessageType”),  
    Body = message.Body,
    IntValue = GetIntHeader(message, “SomeIntValue”),
    // ... more of the same ...  
  };
}  
catch (Exception e)  
{  
  Log.Error(“Invalid message. Skipping.”);  
}

But we would never do that because the aim is to write code that helps stop people making these kinds of mistakes in the first place. If we want to be clever we should make it easier for the maintainers to fall into The Pit of Success.

Other Alternatives 

I said at the beginning that I was not a fan of mutability by default and therefore it would be remiss of me not to suggest that the entire Message type be made immutable and all properties set via the constructor instead:

result = new Message  
(  
  priority: message.Priority,  
  type: GetHeader(message, “MessageType”),  
  body: message.Body,
  IntValue: GetIntHeader(message, “SomeIntValue”),
  // ... more of the same ...  
);

Yes, adding a new property is a little more work but, as always, writing the tests to make sure it all works correctly will dominate here.

I would also prefer to see use of an Optional<> type instead of a null reference for signalling “no message” but that’s a different discussion.

Epilogue

While this bug was merely “theoretical” at the time I discovered it [5] it quickly came back to bite. A bug fix I made on the sending side got deployed before the receiving end and so the misleading error popped up in the logs after all.

Although the system appeared to be functioning correctly it had slowed down noticeably which we quickly discovered was down to the receiving process continually restarting. What I hadn’t twigged just from reading this nugget of code was that due to the catch handler falling through and passing the message on it was being acknowledged on the queue twice –– once in that catch handler, and again after processing it. This second acknowledgment attempt generated a fatal error that caused the process to restart. Deploying the fixed receiver code as well sorted the issue out.

Ironically the impetus for my blog post “Black Hole - The Fail Fast Anti-Pattern” way back in 2012 was also triggered by two-phase construction problems that caused a process to go into a nasty failure mode, but that time it processed messages much too quickly and stayed alive failing them all.

 

[1] Generally speaking the setting of multiple properties implies it’s multi-phase construction. The more common term Two-Phase Construction comes (I presume) from explicit constructor methods names like Initialise() or Create() which take multiple arguments, like the constructor, rather than setting properties one-by-one.

[2] This is based on my copy of The C# Programming Language: The Annotated Edition.

[3] When the header was missing it was passing a null byte[] reference into a UTF8 decoder which caused it to throw an ArgumentNullException.

[4] Internally it created a logger on-the-fly so it wasn’t an obvious dependency that initially needed mocking.

[5] It’s old, so possibly it did bite in the past but nobody knew why or it magically fixed itself when both ends where upgraded close enough together.

Monday 4 March 2019

A Not So Minor Hardware Revision

[These events took place two decades ago, so consider it food for thought rather than a modern tale of misfortune. Naturally some details are hazy and possibly misremembered but the basic premise is still sound.]

Back in the late ‘90s I was working on a Travelling Salesman style problem (TSP) for a large oil company which had performance improvements as a key element. Essentially we were taking a new rewrite of their existing scheduling product and trying to solve some huge performance problems with it, such as taking many minutes to load, let alone perform any scheduling computations.

We had made a number of serious improvements, such as reducing the load time from minutes to mere seconds, and, given our successes so far, were tasked with continuing to implement the rest of the features that were needed to make it usable in practice. One feature was to import the set of orders from the various customer sites which were scheduled by the underlying TSP engine.

The Catalyst

The importing of orders required reading some reasonably large text files, parsing them (which was implemented using the classic Lex & YACC toolset) and pushing them into the database where upon the engine would find them and work out a schedule for their delivery.

Initially this importer was packaged as an ActiveX control, written in C and C++, and hosted inside the PowerBuilder (PB) based GUI. Working on the engine side (written entirely in C) we had created a number of native test harnesses (in C++/MFC) to avoid needing to use the PB front-end unless absolutely necessary due to its generally poor performance. Up until this point the importer appeared to work fine on our dev workstations, but when it was passed to the QA a performance problem started showing up.

The entire team (developers and tester) had all been given identical Compaq machines. Give that we needed to run Oracle locally as well as use it for development and testing we had a whopping 256 MB of RAM to play with along with a couple of cores. The workstations were running Windows NT 4.0 and we were using Visual C++ 2 to develop with. As far as we could see they looked and behaved identically too.

The Problem

The initial bug report from the QA was that after importing a fresh set of orders the scheduling engine run took orders of magnitude longer (no pun intended) to find a solution. However, after restarting the product the engine run took the normal amount of time. Hence the conclusion was that the importer ActiveX control, being in-process with the engine, was somehow causing the slowdown. (This was in the days before the low-fragmentation heap in Windows and heap fragmentation was known to be a problem for our kind of application.)

Weirdly though the developer of the importer could not reproduce this issue on their machine, or another developer’s machine that they tried, but it was pretty consistently reproducible on the QA’s machine. As a workaround the logic was hoisted into a separate command-line based tool instead which was then passed along to the QA to see if matters improved, but it didn’t. Restarting the product was the only way to get the engine to perform well after importing new orders and naturally this wasn’t a flyer with the client as this would happen in real-life throughout the day.

In the meantime I had started to read up on Windows heaps and found some info that allowed me to write some code which could help analyse the state of the heaps and see if fragmentation was likely to be an issue anyway, even with the importer running out-of-process now. This didn’t turn up anything useful at the time but the knowledge did come in handy some years later.

Tests on various other machines were now beginning to show that the problem was most likely with the QA’s machine or configuration rather than with the product itself. After checking some basic Windows settings it was posited that it might be a hardware problem, such as a faulty RAM chip. The Compaq machines we had been given weren’t cheap and weren’t using cheap RAM chips either; the POST was doing a memory check too, but it was worth checking out further. Despite swapping over the RAM (and possibly CPUs) with another machine the problem still persisted on the QA’s machine.

Whilst putting the machines back the way they were I somehow noticed that the motherboard revision was slightly different. We double-checked the version numbers and the QAs machine was one minor revision lower. We checked a few other machines we knew worked and lo-and-behold they were all on the newer revision too.

Fortunately, inside the case of one machine was the manual for the motherboard which gave a run down of the different revisions. According to the manual the slightly lower revision motherboard only supported caching of the first 64 MB RAM! Due to the way the application’s memory footprint changed during the order import and subsequent cache reloading it was entirely plausible that the new data could reside outside the cached region [1].

This was enough evidence to get the QA’s machine replaced and the problem never surfaced again.

Retrospective

Two decades of experience later and I find the way this issue was handled as rather peculiar by today’s standards.

Mostly I find the amount of time we devoted to identifying this problem as inappropriate. Granted, this problem was weird and one of the most enjoyable things about software development is dealing with “interesting” puzzles. I for one was no doubt guilty of wanting to solve the mystery at any cost. We should have been able to chalk the issue up to something environmental much sooner and been able to move on. Perhaps if a replacement machine had shown similar issues later it would be cause to investigate further [2].

I, along with most of the other devs, only had a handful of years of experience which probably meant we were young enough not to be bored by such issues, but also were likely too immature to escalate the problem and get a “grown-up” to make a more rational decision. While I suspect we had experienced some hardware failures in our time we hadn’t experienced enough weird ones (i.e. non-terminal) to suspect a hardware issue sooner.

Given the focus on performance and the fact that the project was acquired from a competing consultancy after they appeared to “drop the ball” I guess there were some political aspects that I would have been entirely unaware of. At the time I was solely interested in finding the cause [3] whereas now I might be far more aware of any ongoing “costs” in this kind of investigation and would no doubt have more clout to short-circuit it even if that means we never get to the bottom of it.

As more of the infrastructure we deal with moves into the cloud there is less need, or even ability, to deal with problems in this way. That’s great from a business point of view but I’m left wondering if that takes just a little bit more fun out of the job sometimes.

 

[1] This suggests to me that the OS was dishing out physical pages from a free-list where address ordering was somehow involved. I have no idea how realistic that is or was at the time.

[2] It’s entirely possible that I’ve forgotten some details here and maybe more than one machine was acting weirdly but we focused on the QA’s machine for some reason.

[3] I’m going to avoid using the term “root cause” because we know from How Complex Systems Fail that we still haven’t gotten to the bottom of it. For example, where does the responsibility for verifying the hardware was identical lie, etc.?