The OldWood Thing: support

Showing posts with label support. Show all posts

Thursday, 30 September 2021

Transient Expand-Archive Failures

[I’m sure there is something else going on here but on the off-chance someone else is also observing this and also lost at least they’ll know they’re not alone.]

We have a GitLab project pipeline that started out as a monolithic job but over the last 9 months has slowly been parallelized and now runs as over 150 jobs spread out across a cluster of 4 fairly decent [1] machines with 8 to 10 concurrent jobs per host. More recently we’ve started seeing the PowerShell Expand-Archive cmdlet failing randomly up to 5% of the time with the following error:

Remove-Item : Cannot find path {...} because it does not exist.

The line of code highlighted in the error is:

$expandedItems | % { Remove-Item $_ -Force -Recurse }

If you google this message it suggests this probably isn’t the real error but a problem with the cmdlet trying to clean-up after failing to extract the contents of the .zip file. Sadly the reason why the extraction might have failed in the first place is now lost.

Investigation

While investigating this error message I ran across two main hits – one from Stack Overflow and the other on the PowerShell GitHub project – both about hitting the classic long path problem in Windows. In our case the extracted paths, even including the build agent root, is still only 100 characters so well within the limit as the archive only has one subfolder and the filenames are short.

Also the archive is built with it’s companion cmdlet Compress-Archive so I doubt it’s an impedance mismatch in our choice of tools.

My gut reaction to anything spurious like this is that it’s the virus scanner (AV) [2]. Sadly I have no direct control over the virus scanner product choice or its configuration. In this instance the machines have Trend Micro whereas the other build agents I’ve built are VMs and have Windows Defender [3], but their load is also much lower. I managed to get the build folder excluded temporarily but that appears to have had no effect and nothing was logged in the AV to say it had blocked anything. (The “behaviour monitoring” in modern AV products often gets triggered by build tools which is annoying.)

After discounting the obvious and checking that memory exhaustion also wasn’t a factor as the memory load for the jobs is variable and the worst case loading can cause the page-file to be used, I wondered if there the problem lay with the GitLab runner cache somehow.

Corrupt Runner Cache?

To avoid downloading the .zip file artefact for every job run we utilise the GitLab runner local cache. This is effectively a .zip file of a packages folder in the project working copy that gets packed up and re-used in the other jobs on the same machine which, given our level of concurrency, means it’s constantly in use. Hence I wondered if our archive was being corrupted when the cache was being unpacked as I’ve seen embedded .zip files cause problems in the past for AV tools (even though it supposedly shouldn’t have been touching the folder). So I added a step to test our archive’s integrity before unpacking it by using 7-Zip as there doesn’t appear to be a companion cmdlet Test-Archive. I immediately saw the integrity test pass but the Expand-Archive step fail a few times so I’m pretty sure the problem is not archive corruption.

Workaround

The workaround which I’ve employed is to use 7-Zip for the unpacking step too and so far we’ve seen no errors at all but I’m left wondering why Expand-Archive was intermittently failing. Taking an extra dependency on a popular tool like 7-Zip is hardly onerous but it bumps the complexity up very slightly and needs to be accounted for in the docs / scripts.

In my 2017 post Fallibility I mentioned how I once worked with someone who was more content to accept they’d found an undocumented bug in the Windows CopyFile() function than believe there was a flaw in their code or analysis [4]. Hence I feel something as ubiquitous as Expand-Archive is unlikely to have a decompression bug and that there is some piece of the puzzle here that I’m missing. Maybe the AV is still interfering in some way that isn’t triggered by 7-Zip or the transient memory pressure caused by the heavier jobs is having an impact?

Given the low cost of the workaround (use 7-Zip instead) the time, effort and disruption needed to run further experiments to explore this problem further is sadly too high. For the time being annecdata is the best I can do.

[1] 8 /16 cores, 64 / 128 GB RAM, and NVMe based disks.

[2] I once did some Windows kernel debugging to help prove an anti-virus product update was the reason our engine processes where not terminating correctly under low memory conditions.

[3] Ideally servers shouldn’t need anti-virus tools at all but the principle of Defence in Depth suggests the minor performance impact is worth it to potentially help slow lateral movement.

[4] TL;DR: I quickly showed it was the latter at fault not the Windows API.

Monday, 3 September 2018

Surgical Support Needs Surgical Tools

In the world of IT support there is the universal solution to every problem – turn it off and on again. You would hope that this kind of drastic action is only drawn upon when all other options have been explored or that the problem is known a priori to require such a response and is the least disruptive course of action. Sadly what was often just a joke in the past has become everyday life for many systems as rebooting or restarting is woven into the daily support routine.

In my professional career I have worked on a number of systems where there are daily scheduled jobs to reboot machines or restart services. I’m not talking about the modern, proactive kind like the Chaos Monkey which is used to probe for weaknesses, or when you force cluster failovers to check everything’s healthy; I’m talking about jobs where the restart is required to ensure the correct functioning of the system – disabling them would cripple it.

Sledgehammers

The need for the restart is usually to overcome some kind of corrupt or immutable internal state, or to compensate for a resource leak, such as memory, which degrades the service to an unacceptable level. An example of the former I’ve seen is to flush a poisoned cache or pick up the change in date, while for the latter it might be unclosed database connections or file handles. The notion of “recycling” processes to overcome residual effects has become so prominent that services like IIS provide explicit support for it [1].

Depending on where the service sits in the stack the restart could be somewhat disruptive, if it’s located on the edge, or largely benign if it’s purely internal, say, a background message queue listener. For example I once worked on a compute farm where one of the front-end services was restarted every night and that caused all clients to drop their connection resulting in a support email being sent due to the “unhandled exception”. Needless to say everyone just ignored all the emails as they only added to the background noise making genuine failures harder to spot.

These kind of draconian measures to try and attain some system stability actually make matters worse as the restarts then begin to hide genuine stability issues which eventually start happening during business hours as well and therefore cause grief for customers as unplanned downtime starts occurring. The impetus for one of my first ACCU articles “Utilising More Than 4GB of Memory in a 32-bit Windows Process” came from just such an issue where a service suddenly starting failing with out-of-memory errors even after a restart if the load was awkwardly skewed. It took almost four weeks to diagnose and properly fix the issue during which there were no acceptable workarounds – just constant manual intervention from the support team.

I also lost quite a few hours on the system I mentioned earlier debugging a problem in the caching mechanism which was masked by a restart and only surfaced because the restart failed to occur. No one had remembered about this failure mode because everyone was so used to the restart hiding it. Having additional complexity in the code for a feature that will never be used in practice is essentially waste.

Cracking Nuts

Although it’s not true in all cases (the memory problem just described being a good example) the restart option may be avoidable if the process exposed additional behaviours that allowed for a more surgical approach to recovery to take place. Do you really need to restart the entire process just to flush some internal cache, or maybe just a small part of it? Similarly if you need to bump the business date via an external stimulus can that not be done through a “discoverable” API instead of hidden as part of a service restart [2]?

In some of my previous posts and articles, e.g “From Test Harness To Support Tool”, “Home-Grown Tools” and more recently in “Libraries, Console Apps, and GUIs”, I have described how useful I have found writing simple custom tools to be for development and deployment but also, more importantly, for support. What I think was missing from those posts that I have tried to capture in this one, most notably through its title, is to focus on resolving system problems with the minimal amount of disruption. Assuming that you cannot actually fix the underlying design issue without a serious amount of time and effort can you at least alleviate the problem in the shorter term by adding simple endpoints and tools that can be used to make surgical-like changes inside the critical parts of the system?

For example, imagine that you’re working on a grid computing system where tasks are dished out to different processes and the results are aggregated. Ideally you would assume that failures are going to happen and that some kind of timeout and retry mechanism would be in place so that when a process dies the system recovers automatically [3]. However, if for some reason that automatic mechanism does not exist how are you going to recover? Given that someone (or something) somewhere is waiting for their results how are you going to “unblock the system” and allow them to make progress, without also disturbing all your other users who are unaffected?

You can either try and re-submit the failed task and allow the entire job to complete or kill the entire job and get the client to re-submit its job. As we’ve seen one way to achieve the latter would be to restart parts of the system thereby flushing the job from it. When this is a once in a million event it might make sense [4] but once the failures start racking up throwing away every in-flight request just to fix the odd broken one becomes more and more disruptive. Instead you need a way to identify the failed task (perhaps using the logs) and then instruct the system to recover such as by killing just that job or by asking it to resubmit it to another node.

Hence, ideally you’d just like to hit one admin endpoint and say something like this:

> admin-cli kill-job --server prod --job-id 12345

If that’s not easily achievable and there is distributed state to clear up you might need a few separate little tools instead that can target each part of system, for example:

> admin-cli remove-node –s broker-prod --node NODE99
> admin-cli remove-results -s results-prod --job 12345
> admin-cli remove-tasks –s taskq-prod --job 12345
> admin-cli reset-client –s ui-prod --client CLT42
> admin-cli add-node –s broker-prod --node NODE99

This probably seems like a lot of work to write all these little tools but what I’ve found in practice is that usually most of the tricky logic in the services already exists – you just need to find a way to invoke it externally with the correct arguments.

These days it’s far easier to graft a simple administration endpoint onto an existing service. There are plenty of HTTP libraries available that will allow you to expose a very basic API which you could even poke with CURL. If you’re already using something more meaty like WCF then adding another interface should be trivial too.

Modern systems are becoming more and more decoupled through the use of message queues which also adds a natural extension point as they typically already do all the heavy lifting and you just need to add new message handlers for the extra behaviours. One of the earliest distributed systems I worked on used pub/sub on a system-wide message bus both for functional and administrative use. Instead of using lots of different tools we had a single admin command line tool that the run playbook generally used (even for some classic sysadmin stuff like service restarts) as it made the whole support experience simpler.

Once you have these basic tools it then becomes easy to compose and automate them. For example on a similar system I started by adding a simple support stored procedure to find failed tasks in a batch processing system. This was soon augmented by another one to resubmit a failed task, which was then automated by a simple script. Eventually it got “productionised” and became a formal part of the system providing the “slow retry” path [3] for automatic recovery from less transient outages.

Design for Supportability

One of the design concepts I’ve personally found really helpful is Design for Testability; something which came out of the hardware world and pushes us to design our systems in a way that makes them much easier test. A knock-on effect of this is that you can reduce your end-to-end testing burden and deliver quicker.

A by-product of Design for Testability is that it causes you to design your system in a way that allows internal behaviours to be observed and controlled in isolation. These same behaviours are often the same ones that supporting a system in a more fine-grained manner will also require. Hence by thinking about how you test your system components you are almost certainly thinking about how they would need to be supported too.

Ultimately of course those same thoughts should also be making you think about how the system will likely fail and therefore what needs to be put in place beforehand to help it recover automatically. In the unfortunate event that you can’t recover automatically in the short term you should still have some tools handy that should facilitate a swift and far less disruptive manual recovery.

[1] Note that this is different from a process restarting itself because it has detected that it might have become unstable, perhaps through the use of the Circuit Breaker pattern.

[2] Aside from the benefits of comprehension this makes the system far more testable as it means you can control the date and therefore write deterministic tests.

[3] See “When Does a Transient Failure Stop Being Transient” for a tangent about fast and slow retries.

[4] Designing any distributed system that does not tolerate network failures is asking for trouble in my book but the enterprise has a habit of believing the network is “reliable enough”.

Wednesday, 11 July 2018

Support-Friendly Tooling

One of the techniques I briefly mentioned in my last post “Treat All Test Environments Like Production” was how constraining the test environments by adhering to the Principle of Least Privilege drove us to add diagnostic specific features to our services and tools.

In some cases that might be as simple as exposing some existing functionality through an extra command line verb or service endpoint. For example a common technique these days is to add a “version” verb or “–-version” switch to allow you to check which build of a particular tool or service you have deployed [1].

As Bertrand Meyer suggests in his notion of Command/Query Separation (CQS) any behaviour which is a query in nature should have no side-effects and therefore could also be freely available to use for diagnostic purposes – security and performance issues notwithstanding. Naturally these queries would be over-and-above any queries you might run directly against your various data stores, i.e. databases, file-system, etc. using the vendors own lower-level tools.

Where it gets a little more tricky is on the “command” side as we might need to investigate the operation but without disturbing the current state of the system. In an ideal world it should be possible to execute them against a part of the system reserved for such eventualities, e.g. a special customer or schema that looks and acts like a real one but is owned by the team and therefore its side-effects are invisible to any real users. (This is one of the techniques that falls under the in-vogue term of “testing in production”.)

If the issue can be isolated to a particular component then it’s probably more effective to focus on that part of the system by replaying the operation whilst simultaneously redirecting the side-effects somewhere else (or avoiding them altogether) so that the investigation can be safely repeated. One technique here is to host the component in another type of process, such as a GUI or command line tool and provide a variety of widgets or switches to control the input and output locations. Alternatively you could use the Null Object pattern to send the side-effects into oblivion.

In its most simplest form it might be a case of adding a “--ReadOnly” switch that disables all attempts to write to back-end stores (but leaves logging intact if that won’t interfere). This would give you the chance to safely debug the process locally using production inputs. As an aside this idea has been formalised in the PowerShell world via the “-WhatIf” switch which allows you to run a script whilst disabling (where supported) the write actions of any cmdlets.

If the operation requires some form of bulk processing where there is likely to be far too much output for stdout or because you need a little more structure to the data then you can add multiple switches instead, such as the folder to write to and perhaps even a different format to use which is easier to analyse with the usual UNIX command line tools. If implementing a whole different persistence mechanism for support is considered excessive [2] you could just allow, say, an alternative database connection string to be provided for the writing side and point to a local instance.

Earlier I mentioned that the Principle of Least Privilege helped drive out the need for these customisations and that’s because restricting your access affects you in two ways. The first is that by not allowing you to make unintentional changes you cannot make the situation worse simply through your analysis. For example if you happened to be mistaken that a particular operation had no side-effects but it actually does now, then they would be blocked as a matter of security and an error reported. If done in the comfort of a test environment you now know what else you need to “mock out” to be able to execute the operation safely in future. And if the mocking feature somehow gets broken, your lack of privilege has always got your back. This is essentially just the principle of Defence in Depth applied for slightly different reasons.

The second benefit you get is a variation of yet another principle – Design for Testability. To support such features we need to be able to substitute alternative implementations for the real ones, which effectively means we need to “program to an interface, not an implementation”. Of course this will likely already be a by-product of any unit tests we write, but it’s good to know that it serves another purpose outside that use case.

What I’ve described might seem like a lot of work but you don’t have to go the whole hog and provide a separate host for the components and a variety of command-line switches to enable these behaviours, you could probably get away with just tweaking various configuration settings, which is the approach that initially drove my 2011 post “Testing Drives the Need for Flexible Configuration”. What has usually caused me to go the extra step though is the need to use these features more than just once in a blue moon, often to automate their use for longer term use. This is something I covered in much more detail very recently in “Libraries, Console Apps & GUIs”.

[1] Version information has been embedded in Windows binaries since the 3.x days back in the ‘90s but accessing it easily usually involved using the GUI shell (i.e. Explorer) unless the machine is remote and has limited access, e.g. the cloud. Luckily PowerShell provides an alternative route here and I’m sure there are plenty of third party command line tools as well.

[2] Do not underestimate how easy it is these days to serialise whole object graphs into JSON files and then process them with tools like JQ.

Friday, 6 July 2018

Treat All Test Environments Like Production

One of the policies I pushed for from the start when working on a greenfield system many years ago was the notion that we were going to treat all test environments (e.g. dev and UAT) like the production environment.

As you can probably imagine this was initially greeted with a heavy dose of scepticism. However all the complaints I could see against the idea were dysfunctional behaviours of the delivery process. All the little workarounds and hacks that were used to back-up their reasons for granting unfettered access to the environments seemed to be the result of poorly thought out design, inadequate localised testing or organisational problems. (See “Testing Drives the Need for Flexible Configuration” for how we addressed one of those concerns.)

To be clear, I am not suggesting that you should completely disable all access to the environment; on the contrary I believe that this is required even in production for those rare occasions when you just cannot piece together the problem from your monitoring and source code alone. No, what I was suggesting was that we employ the same speed bumps and privileges in our test environments that we would in production. And that went for the database too.

The underlying principle I was trying to enshrine here was that shared testing environments, by their very nature, should be treated with the utmost care to ensure a smooth delivery of change. In the past I have worked on systems where dev and test environments were a free-for-all. The result is that you waste so much time investigating issues that are orthogonal to your actual problem because someone messed with it for their own use and just left it in a broken state. (This is another example of the “Broken Windows” syndrome.)

A secondary point I was trying to make was that your test environments are also, by definition, your practice runs at getting things right. Many organisations have a lot of rigour around how they deploy to production but very little when it comes to the opportunities leading up to it. In essence your dev and test environments give you two chances to get things right before the final performance – if you’re not doing dress rehearsals beforehand how can you expect it to go right on the day? When production deployments go wrong we get fearful of them and then risk aversion kicks in meaning we do them less often and a downward spiral kicks in.

The outcome of this seemingly “draconian” approach to managing the development and test environments was that we also got to practice supporting the system in two other environments, and in a way that prepared us for what we needed to do when the fire was no longer just a drill. In particular we quickly learned what diagnostic tools we should already have on the box and, most importantly, what privileges we needed to perform certain actions. It also affected what custom tools we built and what extra features we added to the services and processes to allow safe use for analysis during support (e.g. a --ReadOnly switch).

The Principle of Least Privilege suggests that for our incident analysis we should only require read access to any resource, such as files, the database, OS logs, etc. If you know that you are protected from making accidental mistakes you can be more aggressive in your approach as you feel confident that the outcome of any mistake will not result in breaking the system any further [1][2]. Only at the point at which you need to make a change to the system configuration or data should the speed bumps kick in and you elevate yourself temporarily, make the change and immediately drop back to mere mortal status again.

The database was an area in particular where we had all been bitten before by support issues made worse through the execution of ad-hoc SQL passed around by email or pasted in off the wiki. Instead we added a new schema (i.e. namespace) specifically for admin and support stored procedures that were developed properly, i.e. they were written test-first. (See “You Write Your SQL Unit Tests in SQL” for more on how and why we did it this way.) This meant applying certain kinds of workarounds were easier to administer because they were essentially part of the production codebase, not just some afterthought that nobody maintained.

On the design front this also started to have an interesting effect as we found ourselves wanting to leverage our production service code in new ways to ensure that we avoided violating invariants by hosting the underlying service components inside new containers, i.e. command line tools or making them scriptable. (See “Building Systems as Toolkits”.)

The Interface Segregation Principle is your friend here as it pushes you towards having separate interfaces for reading and writing making it clearer which components you can direct towards a production service if you’re trying to reproduce an issue locally. For example our calculation engine support tool allowed you to point any “readers” towards real service endpoints whilst redirecting the the writers to /dev/null (i.e. using the Null Object pattern) or to some simple in-memory implementation (think Dictionary) to pass data from one internal task to the next.

I find it somewhat annoying that we went to a lot of effort to give ourselves the best chance of designing and building a supportable system that also provided traceability only for the infrastructure team to disallow our request for personal per-environment support accounts, saying instead that we needed to share a single one! Even getting them to give us a separate account for dev, UAT and production was hard work. It sometimes feel like the people who complain most about a lack of transparency and rigour are the same ones that deny you access to exactly that.

I know there were times when it felt as though we could drop our guard in dev or UAT “just this once” but I don’t remember us ever doing that. Instead we always used it as an opportunity to learn more about what the real need was and how it could become a bona fide feature rather than just a hack.

[1] That’s not entirely true. A BA once concocted a SQL query during support that ended up “bug checking” SQL Server and brought the entire system to a grinding halt. They then did it again by accident after it was restarted :o).

[2] A second example was where someone left the Sysinternals DebugView tool running overnight on a server whereupon it filled up the log window and locked up a service due to the way OutputDebugString works under the covers.

Wednesday, 6 December 2017

Network Saturation

The first indication that we seemed to have a problem was when some of the background processing jobs failed. The support team naturally looked at the log files where the jobs had failed and discovered that the cause was an inability to log-in to the database during process start-up. Naturally they tried to log-in themselves using SQL Server Management Studio or run a simple “SELECT GetDate();” style query via SQLCMD and discovered a similar problem.

Initial Symptoms

With the database appearing to be up the spout they raised a priority 1 ticket with the DBA team to investigate further. Whilst this was going on I started digging around the grid computation services we had built to see if any more light could be shed on what might be happening. This being the Windows Server 2003 era I had to either RDP onto a remote desktop or use PSEXEC to execute remote commands against our app servers. What surprised me was that these were behaving very erratically too.

This now started to look like some kind of network issue and so a ticket was raised with the infrastructure team to find out if they knew what was going on. In the meantime the DBAs came back and said they couldn’t find anything particularly wrong with the database, although the transaction log consumption was much higher than usual at this point.

Closing In

Eventually I managed to remote onto our central logging service [1] and found that the day’s log file was massive by comparison and eating up disk space fast. TAILing the central log file I discovered page upon page of the same error about some internal calculation that had failed on the compute nodes. At this point it was clearly time to pull the emergency chord and shut the whole thing down as no progress was being made for the business and very little in diagnosing the root of the problem.

With the tap now turned off I was able to easily jump onto a compute node and inspect its log. What I discovered there was every Monte Carlo simulation of every trade it was trying to value was failing immediately in some set-up calculation. The “best efforts” error handling approach meant that the error was simply logged and the valuation continued for the remaining simulations – rinse and repeat.

Errors at Scale

Of course what compounded the problem was the fact that there were approaching 100 compute nodes all sending any non-diagnostic log messages, i.e. all warnings and errors, across the network to one central service. This service would in turn log any error level messages in the database’s “error log” table.

Consequently with each compute node failing rapidly (see “Black Hole - The Fail Fast Anti-Pattern”) and flooding the network with thousands of log messages per-second the network eventually became saturated. Those processes which had long-lived network connections (we used a high-performance messaging product for IPC) would continue to receive and generate traffic, albeit slowly, but establishing new connections usually resulted in some form of timeout being hit instead.

The root cause of the compute node set-up calculation failure was later traced back to some bad data which itself had resulted from poor error handling in some earlier initial batch-level calculation.

Points of Failure

This all happened just before Michael Nygard published his excellent book Release It! Some months later when I finally read it I found myself frequently nodding my head as his tales of woe echoed my own experiences.

One of the patterns he talks about in his book is the use of bulkheads to stop failures “jumping the cracks”. On the compute nodes the poor error handling strategy meant that the same error occurred over-and-over needlessly instead of failing once. The use of a circuit breaker could also have mitigated the volume of errors generated and triggered some kind of cooling off period.

Duplicating the operational log data in the same database as the business data might have been a sane thing to do when the system was tiny and handling manual requests, but as the system became more automated and scaled out this kind of data should have been moved elsewhere where it could be used more effectively.

One of the characteristics of a system like this is that there are a lot of calculations forming a pipeline, so garbage-in, garbage-out means something might not go pop right away but sometime later when the error has compounded. In this instance an error return value of –1 was persisted as if it was normal data instead of being detected. Latter stages could do sanity checks on data to avoid poisoning the whole thing before it’s too late. It should also have been fairly easy to run a dummy calculation on the core inputs before opening the flood gates to mitigate a catastrophic failure, at least, for one due to bad input data.

Aside from the impedance mismatch in the error handling of different components there was also a disconnect in the error handling in the code that was biased towards one-off trader and support calculations, where the user is present, versus batch processing where the intention is for the system to run unattended. The design of the system needs to take both needs into consideration and adjust the error handling policy as appropriate. (See “The Generation, Management and Handling of Errors” for further ideas.)

Although the system had a monitoring page it only showed the progress of the entire batch – you needed to know the normal processing speed to realise something was up. A dashboard needs a variety of different indicators to show elevated error rates and other anomalous behaviour, ideally with automatic alerting when the things start heading south. Before you can do that though you need the data to work from, see “Instrument Everything You Can Afford To”.

The Devil is in the (Non-Functional) Details

Following Gall’s Law to the letter this particular system had grown over many, many years from a simple ad-hoc calculation tool to a full-blown grid-based compute engine. In the meantime some areas around stability and reliably had been addressed but ultimately the focus was generally on adding support for more calculation types rather than operational stability. The non-functional requirements are always the hardest to get buy-in for on an internal system but without them it can all come crashing down and end in tears with some dodgy inputs.

[1] Yes, back then everyone built their own logging libraries and tools like Splunk.

Friday, 13 October 2017

The User-Agent is not Just for Browsers

One of the trickiest problems when you’re building a web service is knowing who your clients are. I don’t mean your customers, that’s a much harder problem, no, I literally mean you don’t know what client software is talking to you.

Although it shouldn’t really matter who your consumers are from a technical perspective, once your service starts to field requests and you’re working out what and how to monitor it, knowing this becomes far more useful.

Proactive Monitoring

For example the last API I worked on we were generating 404’s for a regular stream of requests because the consumer had a bug in their URL formatting and erroneously appended an extra space for one of the segments. We could see this at our end but didn’t know who to tell. We had to spam our “API Consumers” Slack channel in the hope the right person would notice [1].

We also had consumers sending us the wrong kind of authorisation token, which again we could see but didn’t know which team to contact. Although having a Slack channel for the API helped, we found that people only paid attention to it when they noticed a problem. It also appeared, from our end, that devs would prefer to fumble around rather than pair with us on getting their client end working quickly and reliably.

Client Detection

Absent any other information a cloud hosted service pretty much only has the client IP to go on. If you’re behind a load balancer then you’re looking at the X-Forwarded-For header instead which might give you a clue. Of course if many of your consumers are also services running in the cloud or behind the on-premise firewall they all look pretty much the same.

Hence as part of our API documentation we strongly encouraged consumers to supply a User-Agent field with their service name, purpose, and version, e.g. MyMobileApp:Test/1.0.56. This meant that we would now have a better chance of talking to the right people when we spotted them doing something odd.

From a monitoring perspective we can then use the User-Agent in various ways to slice-and-dice our traffic. For example we can now successfully attribute load to various consumers. We can also filter out certain behaviours from triggering alerts when we know, for example, that it’s their contract tests passing bad data on purpose.

By providing us with a version number we can also see when they release a new version and help them ensure they’ve deprecated old versions. Whilst you would expect service owners to know exactly what they’ve got running where, you’d be surprised how many don’t know they have old instances lying around. It also helps identify who the laggards are that are holding up removal of your legacy features.

Causality

A somewhat related idea is the use of “trace” or “correlation” IDs, which is something I’ve covered before in “Causality - A Mechanism for Relating Distributed Diagnostic Contexts”. These are unique IDs for diagnosing problems with requests and it’s useful to include a prefix for the originating system. However that system may not be your actual client if there are various other services between you and them. Hence the causality ID covers the end-to-end where the User-Agent can cover the local client-server hop.

You would think that the benefit of passing it was fairly clear – it allows providers to proactively help consumers fix their problems. And yet like so many non-functional requirements it sits lower down their backlog because it’s only optional [2]. Not only that but by masking themselves it actually hampers delivery of new features because you’re working harder than necessary to keep the existing lights on.

[1] Ironically the requests were for some automated tests which they didn’t realise were failing!

[2] We wanted to make the User-Agent header mandatory on all non-production environments [3] to try and convince our consumers of the benefits but it didn’t sit well with the upper echelons.

[3] The idea being that its use in production then becomes automatic but does not exclude easy use of diagnostic tools like CURL for production issues.

Monday, 26 January 2015

IIS Hosted Web API Not Caching Internal HTTP Responses

Executive Summary: Set “Load Profile” to true for the IIS application pool.

I had a fun afternoon last week trying to help a colleague work out why the responses to the internal HTTP requests from one of my team’s web APIs weren’t being cached by the .Net HTTP component. It appeared that this was all working fine in the pre-production environment, but not production itself.

Caching HTTP Responses

The code (which I’m not overly familiar with) uses one of the standard .Net HTTP request classes (WebRequest I believe) to make authentication calls to another internal, RESTful service. It was anticipated by the owner of this service that responses could be cached by the .Net framework thereby alleviating the need for our own service to manually implement a caching layer. Naturally the point of (briefly) caching the HTTP responses was to reduce load on the underlying authentication servers and they had made sure that their response headers were correctly configured to make this “all just work”.

In theory this was true, but in practice it turned out to be a little more complicated once the code as running under the real IIS worker process (w3wp.exe).

Administrator Rights

My colleague had already done a fair bit of analysis and experimentation with a spare production server and the wonderful Process Monitor tool from Sysinternals. By watching the w3wp.exe process whilst he sent requests with CURL he could see that the process was getting Access Denied errors when accessing parts of the Registry and file-system.

He had also tried configuring the IIS application pool to run under another non-service account, i.e. a support user’s account, and that appeared to work. Comparing the behaviours with Process Monitor for the two accounts he could verify that the problem was one of permissions, or so it seemed.

A natural response when seeing Access Denied problems is to grant the underprivileged account admin rights and that’s what he did and lo-and-behold things started working. However further experimentation suggested that this didn’t always fix the problem, which was mightily confusing.

At this point I joined the party...

Verifying Admin Rights

Past experience has taught me that just adding an account to the Administrators group is not always enough - it’s best to actually check the security attributes of the process in question to double-check that it really has become elevated. Conversely when removing rights I make the same check to ensure they’ve been revoked correctly.

For this check I used another of Sysinternals tools - Process Explorer. We hit an initial “Access Denied” for some process information [1] so I used the technique of starting it with “psexec -s -i” which was probably a bit of overkill (See “Process Explorer Failed to Display DLLs or Handles” for when this might be necessary). After right-clicking on w3wp.exe to show the process’s Properties and then switching to the Security tab I could see that the process was indeed a member of BUILTIN\Administrators.

It later transpired that the w3wp.exe process not correctly acquiring or relinquishing its power might have been the source of some of the earlier confusion. I too witnessed the removal of the account from the local Administrators group and after recycling the web site application pool found the w3wp.exe process still appeared to be a member. The process start time strongly suggested it had been restarted too when we recycled the app pool.

The Default Profile

We then switched back to Process Monitor and looked at the Registry and file-system access. The HTTP client was accessing various “Internet Settings” registry keys under the HKEY_USERS tree - HKU\.Default in particular. It was then reading and writing to an internet cache that was stored under “C:\Windows\Temp\Temporary Internet Files\...”.

I had always suspected that user profiles might have a role to play in this problem because I remembered back in the dim-and-distant past that the native WinInet library couldn’t (or perhaps shouldn’t) be used for making HTTP calls when running as a service. Microsoft later added another native component called WinHttp that could be used by services. IIRC the main problem back then was around configuring proxy settings, and whilst we weren’t suffering from that specific ailment this time, the registry keys it appeared to be accessing was pointing in the same direction.

We then revoked the admin rights from the service account, checked they had been revoked with Process Explorer, and then executed the test HTTP requests again so I could see whether the same registry and file-system access patterns occurred, and they did. A brief play with the REG command-line tool and DIR helped prove that, although the keys and folders existed, the unprivileged service account wouldn’t have access to them.

LoadProfile=true

This suggested to me that the problem was likely down to the fact that the service account was being used by the w3wp.exe process without a user profile being loaded. As such, due to the lack of a profile, it was trying to do the same thing it would with just a default profile on hand, but it didn’t have the necessary rights to update the cache in the “C:\Windows\Temp\Temporary Internet Files” folders.

Changing the Application Pool setting for “Load Profile” to “true” seemed to fix the problem. We then applied this to another spare production server (after first proving that it was not behaving correctly) and that worked too.

Admin Rights Not Required

I think this tale highlights why the answer to many deployment issues can be solved by apparently making the service account a local administrator. The huge gain in power that right confers allows the process to read, and more importantly write, to so much of the system. Even if the correct answer is an entirely different setting (as in this case) taking the sledgehammer approach can get you running, but at a potentially disastrous cost [2].

Anyone doing service style deployments on Windows should know how to weld tools like NTRIGHTS and CACLS (or their more modern equivalents) to grant just the right amount of permissions for their processes.

[1] I’ve spent so many years supporting pre-Windows Vista systems that I still forget occasionally that I need to elevate admin tools. I usually start with a non-elevated command prompt for safety and then wonder later why a tool doesn’t work properly because I’ve forgotten that I’ve gone for safety first. Which of course is the whole point :-).

[2] If the service is compromised it might grant an intruder further access. Even if it’s an internal service putting it in a sandbox helps ensure it does what you think it should.

Tuesday, 30 September 2014

Single Points of Failure - The SAN

The tactical solution that was put together in my “The Surgical Team” post had being running perfectly for about 6 months when I get another one of those weird mornings where I find myself being escorted to a meeting room. This time though it’s the War Room...

It seemed our simple web service that had been happily servicing requests in under 10 ms had on occasion recently exceeded that by quite some margin - a margin measured in seconds not milliseconds! It seems that the “on occasion” had now turned into “quite a few” and the failure count was now measured in whole percentage points and rising.

Naturally I pulled out the trusty old Unix command line tools (GoW is my current favourite distribution) and grepped out the number of failed requests from the IIS log files (the last column in the default log format is the time taken in ms [1]). I did this across all four production web servers and confirmed that there were web requests taking many seconds - up to 30 seconds in some cases for a request to complete!

I switched to our application logs which log every API request to see what the internal view of the requests looked like. Nope, we still handled the request within microseconds internally (the rest is IIS + MVC overhead). Very occasionally I did see something really odd - the timestamp on the log message didn’t really agree with the measurement we had taken for the length of the call.

As far as I could tell our application, which is so utterly trivial [2], was behaving itself but something either in IIS or the OS or elsewhere was causing the request to hang. It definitely wasn’t Gen 2 garbage collections either because we log when they occur and they happened nowhere near any of the hangs we were seeing. Sniffing the traffic with WireShark confirmed the behaviour but it all made absolutely no sense. There were reports of some odd behaviour in the logs of other services too but they were not nearly so time sensitive and it was much harder to track the anomalies down. In one case there was a task scheduled onto an empty CLR thread pool and it appeared to be seconds before it was picked up.

The fact that it was not isolated to just our service, although the evidence there was weak, coupled with the fact that it affected all our web servers, but not always at the same moment in time suggested to me that it was likely to be infrastructure related. My gut reaction was something security related as I also noticed that the clocks were well out of sync. I had seen Windows servers behave oddly before when their clocks were out of sync, but our servers were still well within the legendary 5 minute window for Kerberos. Perhaps a busted domain controller or a busted NIC in a domain controller?

None of the servers had been patched recently, but just to see whether it might be a virtualization problem one web service guest VM was moved to an entirely different cluster and host VM version. This still made no difference though.

This went on for a few days. The only thing I had unearthed by running grep across the entire set of IIS logs since we went live, was that the problem had really surfaced a month or so earlier during a bank holiday. There had been some parallel investigation to find out exactly what changes were made back then, but we were advised that it really can’t be the source of the problem - just some tangential minor network reconfiguration. It sure looked like a smoking gun, but at the same time it had gone away and never really made any sense.

Someone put together a simple tool to make a web request against a local IIS instance to try and see if it was machine or network related and it proved pretty quickly that it was machine related. The next step was to try and disable everything non-essential, such as logging, and this came up trumps. So logging was disabled on all four web servers and the problem disappeared. But why?

As the title of this post has already given away the answer you won’t be surprised to discover it was the SAN. Something within the product was causing replication to hang and that was causing a cascading failure across the network. All the disks on all the web server VMs were virtualized and replicated via the SAN. Essentially the volume of logging the web servers were doing, whilst not insanely high (IMHO), was causing something within the SAN to get very upset.

Looking back I could see that that timestamp on some log messages didn’t make sense, but it had never occurred to me that the delay would be caused by a (silent) problem writing to the drive [3] hosted via the SAN. I’ve never worked directly with virtual machines before because all the years I worked in finance we were thrashing physical machines 24x7 by running intensive calculations. I’m also somewhat surprised that log files are not written to some form of local storage, even if it’s a virtualized local drive, as the point of writing to local storage is usually to avoid too many side-effects from the act of logging. Plus you have a log to look at if some part of the remote infrastructure fails. Perhaps I’m out of touch here - is it cheaper to throw all your log messages across the wire to some remote storage? I have used remote logging for operations style logging (i.e. info, warning, error), but not for diagnostic logging (i.e. debug/trace). Presumably SANs are supposed to be so reliable and performant that they are as reliable as the sun rising and setting?

The last time I checked, which was a few months after this incident happened, the SAN’s vendor had still not resolved the problem and the application logging was still disabled.

[1] The wording for the time-taken field in the IIS log file has a real air of lawyer speak about it. Basically it’s the time from the moment the request is first noticed to the point that the last packet is sent (IIRC). The weasel words seem to be a reminder that network delays outside of IIS’s control means the time may well be longer than that. And for really small responses (i.e. a single network packet in size) it could be even less reliable. Given that the resolution is 15 ms anyway I don’t think anyone would seriously consider it a truly accurate source in the first place.

[2] It was just a straight binary search across a (paged) array of 100 million pairs of fixed-length strings (close to 3 GB in size).

[3] When a physical disk starts to fail you can usually see some evidence in the event log, such as reads or writes failing. But in this case there was nothing in the guest’s event log to indicate the write took an inordinately long time. I can’t say whether the infrastructure team could see anything in the host’s logs but one would hope they would have been checked...

Tuesday, 26 November 2013

Logging Stack Traces Should be Unnecessary

I like a nice clean log file. The left hand margin should be a fixed width and easy on the eye so that my built-in pattern recognition gets to work as effectively as possible. It should also be easy to parse with the age old UNIX command line tools (and LogParser) without me having to pre-process the file first to get shot of the noise.

What really messes with this is when I see stack traces in the log file. They are always huge and contain far too much duplication because modern software design principles suggest we write short, simple methods, with overloads and reuse by forwarding calls instead of cutting-and-pasting functionality:-

. . .
void DoSomething(string[], int, int)
void DoSomething(string[])
void DoSomething(IEnumerable<string>)
. . .

So, seriously, has that level of detail ever enabled you to solve a problem? Without knowing what the parameter values are how much do stack traces even tell you? Agreed, if all you’ve got is a crash dump then a stack trace is invaluable, but I’m talking about logging stack traces which by definition means that you’re already writing other diagnostic information too.

Design Smell?

I’ve always favoured keeping stack traces out of log files on the basis that they are of little value in comparison to other techniques, and so far I’ve found that I don’t miss them. In my experience, if the design of the code is right and the error message (e.g. exception message) is well written it should be fairly easy to reason about where in the code the problem is, which is effectively what a stack traces tells you. In short that means a simple GREP on the source code to find where the message is generated.

You might argue that a stack trace tells you that up front so why make more effort than necessary, which is of course true, but you’re also going to need the context, which a stack trace will not tell you unless it logs all its parameter values too. And for that we need to visit the log file, and if we’re going to do that how much time and effort are we really saving at the cost of extra background noise? More importantly this is the moment when the quality of our log message entries will begin to shine or we find ourselves lost in the trees looking for the wood. Hopefully during development you’ve already been dog-fooding your own logs to get a feel for how well you can solve real support problems when using them.

Test Infrastructure

The second part of the puzzle of avoiding needing all this gratuitous text is an ability to reproduce the problem easily within a debugger. Hopefully from the context you should be able to explore the problem in isolation - varying different inputs to see how the code is reacting. If the design is simple you should easily be able to step through an existing test case and see where the points of trouble might be, e.g. some missing or dodgy error handling.

At this stage, while the overarching goal is to fix the problem at hand, the fact that a bug has crept in means that the development process has failed to some degree and therefore I’d be taking this as an opportunity to compensate by doing a review. It’s likely I won’t action anything there-and-then, instead favouring to make some notes so that any eventual action can be triaged and prioritised separately.

Depending on the complexity of the system, this is the point at which I might rely on any custom tooling I’ve built to help isolate certain aspects of the system so that they can be exercised in a tightly controlled and deterministic environment, e.g. console app test harness that hosts the various services in-process.

Minimal Traces

What I despise most about many stack traces I see in forum posts is the sheer volume of noise. There is clearly some value in them, more so for “generic” unhandled exceptions like a NullReferenceException that have no message, but do they have to extend to reams and reams of text? When I log an exception I write the entire exception chain with both the type and the message; all on a single line. This is done using an extension method for the Exception type in C#. The same could easily be done for the stack trace, the only reason I haven’t done it is because I haven’t needed to, yet. But if I did write one what would it do?

Firstly I’d strip away all the arguments as they are fairly meaningless without their values. I’d also collapse all overloads into a single method as forwarding calls are uninteresting too. The bottom part of any stack trace is a load of boilerplate system code, so I’d chuck that away and make the entry point to my code the first interesting point of reference, which I should be able to find because assembly names and namespaces tend to follow a corporate convention. The same is probably true for the top of the stack, but the very first frame may be meaningful so perhaps I’d keep that, although if I had to keep just one method name it would be the last method of my code I’d keep as that is the first point that has any real meaning. Finally, I’d rotate what’s left and stitch it together with pipes, probably something like this (ignoring the unfortunate word-wrap):-

2001-01-01 ERR Unhandled exception - OrderController.GetOrder|>ProductService.FindProduct {NullReferenceException}

I definitely don’t need the full namespace names, just the class and method, although I’d argue that with decent method names even the classes might easily be inferred from just the method name and context. Doesn’t that look a whole lot less noisy?

Whilst I might not convince you to drop stack traces entirely from your logs, at least entertain the idea that you can represent them in a far more support-friendly fashion than what the runtime throws out by default.

Thursday, 21 March 2013

You Want Windows 98 Support in 2013?

A few weeks back I got an email from a chap in Australia who wanted to know if I could fix one of my DDE tools (DDE Command) to work under Windows 98 SE. After a quick check of the calendar to make sure I hadn’t entered a time warp I couldn’t help but be a little curious about where this mysterious Windows box was running and why it couldn’t be upgraded…

It turned out that this machine was fitted with a propriety ISA card (remember those?) for which support was discontinued in 1996! This chap seemed to be going through heroic efforts to keep it running by using Excel to grab data from the card via DDE. He then stumbled upon my command line tool which worked fine on Windows 7 (as it does right back to Windows 2000), but was failing on Windows 98 like this:-

C:\> ddecmd servers
ERROR: Failed to query DDE servers: DMLERR_DLL_NOT_INITIALIZED

I had not seen that kind of error before and it’s pretty fundamental too - the underlying DDEML library was apparently not initialised. All ddecmd commands seemed to report the same problem. Luckily I had more than an inkling of what it might be because the tool is pretty simple and my DDE classes have barely changed in over decade. Also I didn’t really fancy trying to cobble together a VM and source a Windows 98 licence just to help this chap out. One of the reasons I provide the source code to all my tools is exactly to allow someone to find their own (paid) support if I can’t accommodate them myself.

One of the key differences between the Windows 95/98/ME lineage and the NT/2K/XP/Vista/etc one is that the former is ANSI internally whereas the latter is Unicode. Naturally for backwards compatibility reasons the NT line can also run ANSI binaries too, with a slight overhead as it translates back and forth as required. One of the decisions I made back in February 2008 was to switch to Unicode builds by default; however the Windows build targets were still left as they were when I first ported my libraries to Win32 back in the mid ‘90’s!

#define WINVER         0x0400   //! Windows 95+
#define _WIN32_WINNT   0x0400   //! Windows NT 4.0+
#define _WIN32_WINDOWS 0x0400   //! Windows 95+
#define _WIN32_IE      0x0400   //! IE 4.0+

Luckily I have no need for anything more fancy, especially with my command line tools so it was a simpler matter of flicking the switch in Visual C++ (7.1, aka VS2003, is still my favoured version) and out popped an ANSI build. In the meantime because of the time lag between the UK and Australia I suggested to the chap that he download my GUI based DDE tool (DDEQuery) and try that. I’d remembered that this tool hasn’t been touched since 2005 and so the binary would have been an ANSI build anyway. Unsurprisingly, it worked. So I shipped off an ANSI build of DDE Command and it was “case closed”.

Every time I think it’s time to leave the past behind and get-with-the-program a question like this pops up and it feels great to still be able to support such an old OS. Although I still use VC++ 7.1 (on XP) for my personal C++ code, I still compile it with modern versions of VC++ and GCC to gain access to all that extra static analysis so that someone can just pick it up if needs be. My day job now consists of writing C# and although it’s currently .Net 3.5 I have my sights firmly set on .Net 4.5 and C# 5 as I’m no luddite.

Even though the corporate desktop standard is still Windows XP in many organisations and we now have to suffer all those annoying “you’re using an outdated browser version” banners, it feels good to know that there are others out there that have it so much worse.

Monday, 21 January 2013

The Perils of Interactive Service Account Sessions

It’s common within the Enterprise to run the services (or daemon processes if you prefer) under a special “service account”. This account is often a domain account that has very special privileges and as such no one is generally expected to use it for anything other than running processes that are part of The System. Sometimes you might need to elevate to that account to diagnose a permissions problem, but those occasions should be very rare.

What you want to avoid doing is logging on interactively to a Windows machine using that account[1], such as remotely via MSTSC. What you should do is logon with your own credentials, or better yet those of a “break glass account” and then elevate yourself using, say, the RUNAS tool. This allows you to open a separate command prompt, or run another process under a separate set of credentials - usually the service account, e.g.

C:\> runas /user:chris@domain cmd.exe

There are various switches to control the loading of the user profile, etc. but that is the basic form. Once you have the command prompt or process open you can do your thing in a limited kind of sandbox.

The first reason for not logging in interactively is that by default Terminal Services will only let you have 2 connections open. Given that some developer’s (and admins) have a habit of leaving themselves logged in for extended periods, you invariably have to hunt down who has the connections open and ask one of them to drop out. If one or other user is logged in interactively using the service account it becomes a much harder job of finding out who “owns” that session and, as we’ll see, just toasting their session can be dangerous.

The main problem I’ve come across with logging in is down to the way scheduled tasks that are configured to run using separate credentials (in this case the service account) end up running in the interactive session (even without the “interactive” box checked). If you’ve ever had seemingly random console windows popping up whilst you’re logged i, this could be what they are. If you’re lucky the keyboard focus won’t be stolen, but if it is or you’re clicking with the mouse at the wrong time you can block the I/O to the process by accidentally enabling Quick Edit mode. Or worse yet you hit the close box as it pops into life.

You might notice those effects, but the more deadly one is logging off. If one of these scheduled tasks is running at the time you logoff, it will be killed. You might not notice it at first (especially if it gets scheduled again soon after) but the scheduled task will have a failed status and the very curious error code of 0xC000013A (application terminated by Ctrl+C).

The second issue I’ve seen relates to the service account not picking up changes in Windows AD group membership. I’ve read various chapters from Keith Brown’s excellent book Programming Windows Security (which is admittedly getting a bit long in the tooth) but can’t see why this would happen. Basically the account was removed from an AD group, and then later reinstated. However at the time the account was re-added to the group there was an interactive session on just one machine in the farm.

The other machines in the environment picked up the change, but the one with the interactive session didn’t. I could understand that an existing process might cache the group membership, but even starting new command prompts didn’t help. The scheduled tasks that were running, which were also new processes each time didn’t pick the change up either. After logging the session off and logging straight back on again everything was fine.

Maybe it was a one off, or perhaps it’s a known problem. Either way my Google Fu was clearly letting me down - that and the fact that the key words to describe the problem are about as vague as you can get in IT (group, windows, cache, etc). Hopefully some kind soul will leave a comment one day which explains my experience and brings me closure.

[1] I’m sure there are some edge cases to that advice, but I can’t personally remember a time when I needed to actually logon to a machine with a service account. Well, I can, but that’s because the software used to hide the passwords forced me to do it when all I needed was an elevated command prompt. That aside I haven’t.

The OldWood Thing