Monday 4 March 2019

A Not So Minor Hardware Revision

[These events took place two decades ago, so consider it food for thought rather than a modern tale of misfortune. Naturally some details are hazy and possibly misremembered but the basic premise is still sound.]

Back in the late ‘90s I was working on a Travelling Salesman style problem (TSP) for a large oil company which had performance improvements as a key element. Essentially we were taking a new rewrite of their existing scheduling product and trying to solve some huge performance problems with it, such as taking many minutes to load, let alone perform any scheduling computations.

We had made a number of serious improvements, such as reducing the load time from minutes to mere seconds, and, given our successes so far, were tasked with continuing to implement the rest of the features that were needed to make it usable in practice. One feature was to import the set of orders from the various customer sites which were scheduled by the underlying TSP engine.

The Catalyst

The importing of orders required reading some reasonably large text files, parsing them (which was implemented using the classic Lex & YACC toolset) and pushing them into the database where upon the engine would find them and work out a schedule for their delivery.

Initially this importer was packaged as an ActiveX control, written in C and C++, and hosted inside the PowerBuilder (PB) based GUI. Working on the engine side (written entirely in C) we had created a number of native test harnesses (in C++/MFC) to avoid needing to use the PB front-end unless absolutely necessary due to its generally poor performance. Up until this point the importer appeared to work fine on our dev workstations, but when it was passed to the QA a performance problem started showing up.

The entire team (developers and tester) had all been given identical Compaq machines. Give that we needed to run Oracle locally as well as use it for development and testing we had a whopping 256 MB of RAM to play with along with a couple of cores. The workstations were running Windows NT 4.0 and we were using Visual C++ 2 to develop with. As far as we could see they looked and behaved identically too.

The Problem

The initial bug report from the QA was that after importing a fresh set of orders the scheduling engine run took orders of magnitude longer (no pun intended) to find a solution. However, after restarting the product the engine run took the normal amount of time. Hence the conclusion was that the importer ActiveX control, being in-process with the engine, was somehow causing the slowdown. (This was in the days before the low-fragmentation heap in Windows and heap fragmentation was known to be a problem for our kind of application.)

Weirdly though the developer of the importer could not reproduce this issue on their machine, or another developer’s machine that they tried, but it was pretty consistently reproducible on the QA’s machine. As a workaround the logic was hoisted into a separate command-line based tool instead which was then passed along to the QA to see if matters improved, but it didn’t. Restarting the product was the only way to get the engine to perform well after importing new orders and naturally this wasn’t a flyer with the client as this would happen in real-life throughout the day.

In the meantime I had started to read up on Windows heaps and found some info that allowed me to write some code which could help analyse the state of the heaps and see if fragmentation was likely to be an issue anyway, even with the importer running out-of-process now. This didn’t turn up anything useful at the time but the knowledge did come in handy some years later.

Tests on various other machines were now beginning to show that the problem was most likely with the QA’s machine or configuration rather than with the product itself. After checking some basic Windows settings it was posited that it might be a hardware problem, such as a faulty RAM chip. The Compaq machines we had been given weren’t cheap and weren’t using cheap RAM chips either; the POST was doing a memory check too, but it was worth checking out further. Despite swapping over the RAM (and possibly CPUs) with another machine the problem still persisted on the QA’s machine.

Whilst putting the machines back the way they were I somehow noticed that the motherboard revision was slightly different. We double-checked the version numbers and the QAs machine was one minor revision lower. We checked a few other machines we knew worked and lo-and-behold they were all on the newer revision too.

Fortunately, inside the case of one machine was the manual for the motherboard which gave a run down of the different revisions. According to the manual the slightly lower revision motherboard only supported caching of the first 64 MB RAM! Due to the way the application’s memory footprint changed during the order import and subsequent cache reloading it was entirely plausible that the new data could reside outside the cached region [1].

This was enough evidence to get the QA’s machine replaced and the problem never surfaced again.

Retrospective

Two decades of experience later and I find the way this issue was handled as rather peculiar by today’s standards.

Mostly I find the amount of time we devoted to identifying this problem as inappropriate. Granted, this problem was weird and one of the most enjoyable things about software development is dealing with “interesting” puzzles. I for one was no doubt guilty of wanting to solve the mystery at any cost. We should have been able to chalk the issue up to something environmental much sooner and been able to move on. Perhaps if a replacement machine had shown similar issues later it would be cause to investigate further [2].

I, along with most of the other devs, only had a handful of years of experience which probably meant we were young enough not to be bored by such issues, but also were likely too immature to escalate the problem and get a “grown-up” to make a more rational decision. While I suspect we had experienced some hardware failures in our time we hadn’t experienced enough weird ones (i.e. non-terminal) to suspect a hardware issue sooner.

Given the focus on performance and the fact that the project was acquired from a competing consultancy after they appeared to “drop the ball” I guess there were some political aspects that I would have been entirely unaware of. At the time I was solely interested in finding the cause [3] whereas now I might be far more aware of any ongoing “costs” in this kind of investigation and would no doubt have more clout to short-circuit it even if that means we never get to the bottom of it.

As more of the infrastructure we deal with moves into the cloud there is less need, or even ability, to deal with problems in this way. That’s great from a business point of view but I’m left wondering if that takes just a little bit more fun out of the job sometimes.

 

[1] This suggests to me that the OS was dishing out physical pages from a free-list where address ordering was somehow involved. I have no idea how realistic that is or was at the time.

[2] It’s entirely possible that I’ve forgotten some details here and maybe more than one machine was acting weirdly but we focused on the QA’s machine for some reason.

[3] I’m going to avoid using the term “root cause” because we know from How Complex Systems Fail that we still haven’t gotten to the bottom of it. For example, where does the responsibility for verifying the hardware was identical lie, etc.?

1 comment:

  1. Reading this I got an increasing sense of deja vu, but mine was not quite the same. Back when I was working on Railtrack (or software was only a small part of that big disaster) I to got a report from Test that I couldn't reproduce. Again, testers could see it but not developers just like your example.

    Eventually I got it down to a Visual C++ 2.0 bug which only effected Intel machines. The testers had Intel machines, developers had AMD based Dell's. The compiler was making an optimisation in integer initialization which was fine on AND but failed in Intel. The fix was to disable optimisation.

    Yes I know we expect it to be the other way around - AND is a copy of Intel after all - but that's why it sticks in my mind.

    Otherwise, my story is pretty much yours.

    ReplyDelete