Thursday, 30 September 2021

Transient Expand-Archive Failures

[I’m sure there is something else going on here but on the off-chance someone else is also observing this and also lost at least they’ll know they’re not alone.]

We have a GitLab project pipeline that started out as a monolithic job but over the last 9 months has slowly been parallelized and now runs as over 150 jobs spread out across a cluster of 4 fairly decent [1] machines with 8 to 10 concurrent jobs per host. More recently we’ve started seeing the PowerShell Expand-Archive cmdlet failing randomly up to 5% of the time with the following error:

Remove-Item : Cannot find path {...} because it does not exist.

The line of code highlighted in the error is:

$expandedItems | % { Remove-Item $_ -Force -Recurse }

If you google this message it suggests this probably isn’t the real error but a problem with the cmdlet trying to clean-up after failing to extract the contents of the .zip file. Sadly the reason why the extraction might have failed in the first place is now lost.

Investigation

While investigating this error message I ran across two main hits – one from Stack Overflow and the other on the PowerShell GitHub project – both about hitting the classic long path problem in Windows. In our case the extracted paths, even including the build agent root, is still only 100 characters so well within the limit as the archive only has one subfolder and the filenames are short.

Also the archive is built with it’s companion cmdlet Compress-Archive so I doubt it’s an impedance mismatch in our choice of tools.

My gut reaction to anything spurious like this is that it’s the virus scanner (AV) [2]. Sadly I have no direct control over the virus scanner product choice or its configuration. In this instance the machines have Trend Micro whereas the other build agents I’ve built are VMs and have Windows Defender [3], but their load is also much lower. I managed to get the build folder excluded temporarily but that appears to have had no effect and nothing was logged in the AV to say it had blocked anything. (The “behaviour monitoring” in modern AV products often gets triggered by build tools which is annoying.)

After discounting the obvious and checking that memory exhaustion also wasn’t a factor as the memory load for the jobs is variable and the worst case loading can cause the page-file to be used, I wondered if there the problem lay with the GitLab runner cache somehow.

Corrupt Runner Cache?

To avoid downloading the .zip file artefact for every job run we utilise the GitLab runner local cache. This is effectively a .zip file of a packages folder in the project working copy that gets packed up and re-used in the other jobs on the same machine which, given our level of concurrency, means it’s constantly in use. Hence I wondered if our archive was being corrupted when the cache was being unpacked as I’ve seen embedded .zip files cause problems in the past for AV tools (even though it supposedly shouldn’t have been touching the folder). So I added a step to test our archive’s integrity before unpacking it by using 7-Zip as there doesn’t appear to be a companion cmdlet Test-Archive. I immediately saw the integrity test pass but the Expand-Archive step fail a few times so I’m pretty sure the problem is not archive corruption.

Workaround

The workaround which I’ve employed is to use 7-Zip for the unpacking step too and so far we’ve seen no errors at all but I’m left wondering why Expand-Archive was intermittently failing. Taking an extra dependency on a popular tool like 7-Zip is hardly onerous but it bumps the complexity up very slightly and needs to be accounted for in the docs / scripts.

In my 2017 post Fallibility I mentioned how I once worked with someone who was more content to accept they’d found an undocumented bug in the Windows CopyFile() function than believe there was a flaw in their code or analysis [4]. Hence I feel something as ubiquitous as Expand-Archive is unlikely to have a decompression bug and that there is some piece of the puzzle here that I’m missing. Maybe the AV is still interfering in some way that isn’t triggered by 7-Zip or the transient memory pressure caused by the heavier jobs is having an impact?

Given the low cost of the workaround (use 7-Zip instead) the time, effort and disruption needed to run further experiments to explore this problem further is sadly too high. For the time being annecdata is the best I can do.

 

[1] 8 /16 cores, 64 / 128 GB RAM, and NVMe based disks.

[2] I once did some Windows kernel debugging to help prove an anti-virus product update was the reason our engine processes where not terminating correctly under low memory conditions.

[3] Ideally servers shouldn’t need anti-virus tools at all but the principle of Defence in Depth suggests the minor performance impact is worth it to potentially help slow lateral movement.

[4] TL;DR: I quickly showed it was the latter at fault not the Windows API.

 

Monday, 27 September 2021

Lose the Source Luke?

We were writing a new service to distribute financial pricing data around the trading floor as a companion to our new desktop pricing tool. The plugin architecture allowed us to write modular components that could tap into the event streams for various reasons, e.g. provide gateways to 3rd party data streams.

Linking New to Old

One of the first plugins we wrote allowed us to publish pricing data to a much older in-house data service which had been sat running in the server room for some years as part of the contributions system. This meant we could eventually phase that out and switch over to the new platform once we had parity with it.

The plugin was a doddle to write and we quickly had pricing data flowing from the new service out to a test instance of the old service which we intended to leave running in the background for soak testing. As it was an in-house tool there was no installer and my colleague had a copy of the binaries lying around on his machine [1]. Also he was one of the original developers so knew exactly what he was doing to set it up.

A Curious Error Message

Everything seemed to be working fine at first but as the data volumes grew we suddenly noticed that the data feed would eventually hang after a few days. In the beginning we were developing the core of the new service so quickly it was constantly being upgraded but now the pace was slowing down the new service was alive for much longer. Given how mature the old service was we assumed the issue was with the new one. Also there was a curious message in the log for the old service about “an invalid transaction ID” before the feed stopped.

While debugging the new plugin code my colleague remembered that the Transaction ID meant the message sequence number that goes in every message to allow for ordering and re-transmission when running over UDP. The data type for that was a 16-bit unsigned integer so it dawned on us that we had probably messed up handling the wrapping of the Transaction ID.

Use the Source Luke

Given how long ago he last worked on the old service he couldn’t quite remember what the protocol was for resetting the Transaction ID so we decided to go and look at the old service source code to see how it handled it. Despite being at the company for a few years myself this all pre-dated me so I left my colleague to do the rummaging.

Not long after my colleague came back over to my desk and asked if I might know where the source code was. Like so many programmers in a small company I was a part-time sysadmin and generally looked after some of servers we used for development duties, such as the one where our Visual SourceSafe repository lived that contained all the projects we’d ever worked on since I joined.

The VCS Upgrade

When I first started at the company there were only a couple of programmers not working on the mainframe and they wrote their own version control system. It was very Heath Robinson and used exclusive file locks to side-step the problem of concurrent changes. Having been used to a few VCS tools by then such as PVCS, Star Versions, and Visual SourceSafe I suggested that we move to a 3rd party VCS product as we needed more optimistic concurrency controls as more people were going to join the team. Given the MSDN licenses we already had along with my own experience Visual SourceSafe (VSS) seemed like a natural choice back then [2].

Around the same time the existing development server was getting a bit long in the tooth so the company forked out for a brand new server and so I set-up the new VSS repository on that and all my code went in there along with all the subsequent projects we started. None of the people that joined after me ever touched any of the old codebase or VCS as it was so mature it hadn’t needed changing in some time and anyway the two original devs where still there to look after it.

The Office Move

A couple of years after I joined, the owners of the lovely building the company had been renting for the last few decades decided they wanted to gut and renovate it as the area in London where we were based was getting a big makeover. Hence we were forced to move to new premises about half a mile away. The new premises were nice and modern and I no longer had the vent from the portable air-conditioning machine from one of the small server rooms pumping out hot air right behind my desk [3].

When moving day came I made sure the new server with all our stuff on got safely transported to the new office’s server room so that we ready to go again on Monday morning. As we stood staring around the empty office floor my colleague pointed to the old development server which had lay dormant in the corner and asked me (rhetorically) whether we should even bother taking it with us. As far as I was concerned everything I’d ever needed had always been on the new server and so I didn’t know what was left that we’d still need.

My colleague agreed and so we left the server to be chucked in the skip when the bulldozers came.

Dormant, But Not Redundant

It turned out their original home-grown version control system had a few projects in it, including the old data service. Luckily one of the original developers who worked on the contributions side still had an up-to-date copy of that and my colleague found a local copy of the code for one of the other services but had no idea how up-to-date it was. Sadly nobody had even a partial copy of the source to the data service we were interested in but we were going to replace that anyway so in the end the loss was far less significant than we originally feared.

In retrospect I can’t believe we didn’t even just take the hard disk with us. The server was a classic tower so took up a far bit of room which was still somewhat at a premium in the new office whereas the disk could probably have sit in a desk drawer or even been fitted as an extra drive in the new midi sized development server.

 

[1] +1 for xcopy deployment which made setting up development and test instances a piece of cake.

[2] There are a lot of stories of file corruption issues with VSS but in the 7 years I’d used it with small teams, even over a VPN, we only had one file corruption issue that we quickly restored from a backup.

[3] We were on the opposite side from the windows too so didn’t even get a cool breeze from those either.

 

Thursday, 23 September 2021

The Case of the Curious Commit Message

I had taken a new contract at an investment bank and started working on a very mature codebase which was stored in ClearCase. As a long-time user [1] of version control systems one of the things that bugged me about the codebase were empty commit messages. On a mature codebase where people have come and gone it’s hard enough to work out what was going on just from the code, decent commit messages should be there to give you that extra context around the “why”.

Rallying the Troops

After attempting to sell the virtues of commit messages to my colleagues a couple of times during team meetings there were still the odd one or two that consistently avoided doing so. So I decided to try a name-and-shame approach [2] by emailing a table of names along with their percentage of non-empty commit messages hoping those that appeared at the bottom would consider changing their ways.

At the time I was just getting my head around ClearCase and there were a couple of complaints from people who felt unduly chastised because they didn’t have 100% when they felt they should. It turned out their accounts were used for some automated check-ins which had no message which I didn’t know about, so I excluded those commits and published a revised table.

Progress?

On the plus side this got people discussing what a good commit message looked like and it brought up some question marks around certain practices that others had done. For example a few team members wouldn’t write a formal message but simply paste the ID of the issue from ClearQuest [3]. Naturally this passed my “not empty” test but it raised a question about overly terse commit messages. Given where we were coming from I felt this was definitely acceptable (for the time being) as they were still using the commit message to provide more details, albeit in the form of a link to the underlying business request [5].

However, it got me thinking about whether people were not really playing ball and might be gaming the system so I started looking into overly terse commit messages and I’m glad to say everyone was entering into the spirit of things [4]. Everyone except one person who had never even been on the initial radar but who had a sizable number of commits with the simple message:

    nt

That’s right, just the two letters ‘n’ and ‘t’. (There were others but this was the most prevalent and memorable.)

A Curious Message

Looking at the diffs that went with these messages it wasn’t obvious what “nt” meant. My initial instinct was that it was an abbreviation of some sort, perhaps a business term I was unaware of as the developer was involved in the more maths heavy side of the project. They were far more common before my “shake-up” so I was pleased that whatever this term was it was being replaced by more useful comments now but I was still intrigued. Naturally I walked across the room to the very pleasant developer in question and asked him what “nt” meant.

It turned out it didn’t mean anything, and the developer was largely unaware they even existed! So where did they come from?

The Mist Clears

Luckily while we were chatting he started making a new change and I saw the ClearCase check-out dialog appear and the initial message was a few letters of garbage. I looked at what he intended to type in the editor and it dawned on me what was happening – the “nt” was the latter part of the word “int”.

Just as with Visual SourceSafe, the ClearCase Visual Studio plugin would trigger when you started editing a file and nothing else was checked out at that point. It would pop-up a dialog so you could configure how the check-out was done. For example you might want to put an exclusive lock on the file [6] or you could provide a message so others could see what files were being edited concurrently. By default the focus in this dialog was on the OK button so it was possible to dismiss this dialog without even really seeing it…

Hence this is what was going on:

  1. The dev typed “int” to start a declaration as part of a new set of code changes.
  2. The “i” keypress triggered the ClearCase plugin which noticed this was the start of a new check-out and promptly threw up a dialog with the remaining letters “nt” in the message field.
  3. By then the dev had already pressed “space” at the end of the type name which, due to the default button focus, caused the dialog to immediately disappear.
  4. When he committed the changes at the end he never edited the message anyway, he would just click the commit button and move on.

Case closed. From a UI perspective it probably falls into the same category (although with less disastrous consequences) as those unexpected popups that ask if you want to reboot your machine, NOW. Ouch!

 

[1] I was introduced to them on my very first job and have been fortunate enough to use one on virtually every job since, even if I ended up setting one up :o).

[2] In retrospect I probably didn’t try hard enough to sell it and should have taken a more personal approach for the laggards as maybe there were good reasons why they weren’t doing it, e.g. tooling.

[3] Yes, an enterprise level defect tracking tool with all the pain you’d expect from such a product.

[4] For non-trivial things that is, the message “typo” still appeared for some of those but that raised a whole different set of questions around not compiling or testing changes before committing them!

[5] Including the ticket number at the start of a commit message is something I promote in my Commit Checklist.

[6] This was useful for non-mergeable files like DTS packages and media assets but often ended up creating more harm than good as they got left locked and you had to get an admin to unlock them, and they were in another team.