The tactical solution that was put together in my “The Surgical Team” post had been running perfectly for about 6 months when I got another one of those weird mornings where I found myself being escorted to a meeting room. This time though it was the War Room...
It seemed our simple web service, which had been happily servicing requests in under 10 ms, had recently started exceeding that on occasion by quite some margin - a margin measured in seconds, not milliseconds! And the “on occasion” had now turned into “quite a few”, with the failure count measured in whole percentage points and rising.
Naturally I pulled out the trusty old Unix command line tools (GoW is my current favourite distribution) and grepped out the number of failed requests from the IIS log files (the last column in the default log format is the time taken in ms [1]). I did this across all four production web servers and confirmed that there were web requests taking many seconds to complete - up to 30 seconds in some cases!
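For the curious, the filtering amounted to nothing more than the sketch below (in reality it was a grep / awk one-liner, but Python is easier to show here). It assumes the default W3C log format where time-taken, in milliseconds, is the last field on each line; the one second threshold is just an arbitrary line in the sand.

    # slow_requests.py - a rough equivalent of the grep I ran.
    # Usage: python slow_requests.py "u_ex*.log"
    # Assumes the default W3C extended log format where time-taken (in ms) is
    # the last field on each line; check the #Fields: directive in your logs.
    import glob
    import sys

    THRESHOLD_MS = 1000   # anything over a second counts as a failure for us

    for path in glob.glob(sys.argv[1]):
        with open(path, errors="replace") as log:
            for line in log:
                if line.startswith("#"):           # skip the header directives
                    continue
                fields = line.split()
                if len(fields) < 3:
                    continue
                try:
                    time_taken = int(fields[-1])   # last column: time-taken (ms)
                except ValueError:
                    continue
                if time_taken >= THRESHOLD_MS:
                    # fields[0] and fields[1] are the date and time columns
                    print(f"{path}: {fields[0]} {fields[1]} took {time_taken} ms")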
I switched to our application logs, which record every API request, to see what the internal view of the requests looked like. Nope, we were still handling each request within microseconds internally (the rest is IIS + MVC overhead). Very occasionally I did see something really odd - the timestamp on the log message didn’t really agree with the measurement we had taken for the length of the call.
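The cross-check I was doing by eye amounts to something like the sketch below. The log format here is entirely hypothetical (start timestamp, end timestamp, measured elapsed time in microseconds, endpoint name) - the point is simply to flag entries where the wall-clock gap between the two timestamps dwarfs the duration we actually measured.

    # timestamp_check.py - flag log entries where the gap between the start
    # and end timestamps is much bigger than the elapsed time we measured.
    # Hypothetical input format: "<ISO start> <ISO end> <elapsed-us> <endpoint>"
    from datetime import datetime
    import sys

    SLACK_SECS = 1.0    # allow a little slop for scheduling and log flushing

    for line in sys.stdin:
        parts = line.split()
        if len(parts) < 4:
            continue
        start = datetime.fromisoformat(parts[0])
        end = datetime.fromisoformat(parts[1])
        wall_clock = (end - start).total_seconds()
        measured = int(parts[2]) / 1_000_000    # microseconds -> seconds
        if wall_clock - measured > SLACK_SECS:
            print(f"{parts[3]}: measured {measured:.6f}s "
                  f"but wall clock says {wall_clock:.3f}s")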
As far as I could tell our application, which is utterly trivial [2], was behaving itself, but something in IIS, the OS or elsewhere was causing the request to hang. It definitely wasn’t Gen 2 garbage collections either, because we log when they occur and they happened nowhere near any of the hangs we were seeing. Sniffing the traffic with Wireshark confirmed the behaviour, but it all made absolutely no sense. There were reports of some odd behaviour in the logs of other services too, but they were not nearly so time sensitive and it was much harder to track the anomalies down. In one case a task was scheduled onto an empty CLR thread pool and it appeared to take seconds before it was picked up.
The fact that it was not isolated to just our service (although the evidence there was weak), coupled with the fact that it affected all our web servers, though not always at the same moment in time, suggested to me that it was likely to be infrastructure related. My gut reaction was that it was something security related, as I had also noticed that the clocks were well out of sync. I had seen Windows servers behave oddly before when their clocks drifted, but our servers were still well within the legendary 5 minute window for Kerberos. Perhaps a busted domain controller, or a busted NIC in a domain controller?
None of the servers had been patched recently, but just to see whether it might be a virtualization problem, one web service guest VM was moved to an entirely different cluster and host VM version. This still made no difference though.
This went on for a few days. The only thing I had unearthed, by running grep across the entire set of IIS logs since we went live, was that the problem had really surfaced a month or so earlier during a bank holiday. There had been some parallel investigation to find out exactly what changes were made back then, but we were advised that it really couldn’t be the source of the problem - just some tangential, minor network reconfiguration. It sure looked like a smoking gun, but at the same time it had gone away and never really made any sense.
Someone put together a simple tool to make web requests against a local IIS instance to try and see whether it was machine or network related, and it proved pretty quickly that it was machine related. The next step was to try disabling everything non-essential, such as logging, and this came up trumps. So logging was disabled on all four web servers and the problem disappeared. But why?
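The tool itself was nothing fancy; in spirit it was little more than the sketch below - hit a local endpoint in a loop and shout whenever a response takes suspiciously long. The URL and threshold here are just placeholders.

    # probe.py - repeatedly hit a local endpoint and report slow responses.
    # The URL and threshold are placeholders, not the real service.
    import time
    import urllib.request

    URL = "http://localhost/service/ping"   # hypothetical local endpoint
    THRESHOLD_SECS = 1.0

    while True:
        started = time.perf_counter()
        try:
            with urllib.request.urlopen(URL, timeout=60) as response:
                response.read()
            elapsed = time.perf_counter() - started
            if elapsed > THRESHOLD_SECS:
                print(f"{time.strftime('%H:%M:%S')} request took {elapsed:.3f}s")
        except OSError as error:
            print(f"{time.strftime('%H:%M:%S')} request failed: {error}")
        time.sleep(0.1)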
As the title of this post has already given away the answer, you won’t be surprised to discover it was the SAN. Something within the product was causing replication to hang, and that in turn was causing a cascading failure across the network. All the disks on all the web server VMs were virtualized and replicated via the SAN. Essentially the volume of logging the web servers were doing, whilst not insanely high (IMHO), was causing something within the SAN to get very upset.
Looking back I could see that the timestamps on some log messages didn’t make sense, but it had never occurred to me that the delay would be caused by a (silent) problem writing to the drive [3] hosted via the SAN. I had never worked directly with virtual machines before because, for all the years I worked in finance, we were thrashing physical machines 24x7 with intensive calculations. I’m also somewhat surprised that log files are not written to some form of local storage, even if it’s a virtualized local drive, as the point of writing to local storage is usually to avoid too many side-effects from the act of logging. Plus you have a log to look at if some part of the remote infrastructure fails. Perhaps I’m out of touch here - is it cheaper to throw all your log messages across the wire to some remote storage? I have used remote logging for operations style logging (i.e. info, warning, error), but not for diagnostic logging (i.e. debug/trace). Presumably SANs are supposed to be so performant and resilient that they are as reliable as the sun rising and setting?
The last time I checked, which was a few months after this incident happened, the SAN’s vendor had still not resolved the problem and the application logging was still disabled.
[1] The wording for the time-taken field in the IIS log file has a real air of lawyer speak about it. Basically it’s the time from the moment the request is first noticed to the point that the last packet is sent (IIRC). The weasel words seem to be a reminder that network delays outside of IIS’s control mean the time may well be longer than the processing time alone. And for really small responses (i.e. a single network packet in size) it could be even less reliable. Given that the resolution is only 15 ms anyway, I don’t think anyone would seriously consider it a truly accurate source in the first place.
[2] It was just a straight binary search across a (paged) array of 100 million pairs of fixed-length strings (close to 3 GB in size).
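For the curious, the lookup amounts to something like the sketch below: memory-map a file of fixed-width records sorted by key and binary search on the key. The 16-byte key and value widths are made up for illustration; the nice property is that the OS only pages in the handful of records the search actually touches.

    # lookup.py - binary search over a memory-mapped file of fixed-width
    # records sorted by key. The 16/16 byte key/value widths are made up.
    import mmap

    KEY_LEN, VALUE_LEN = 16, 16
    RECORD_LEN = KEY_LEN + VALUE_LEN

    def lookup(path, key):
        """Return the value for 'key' (bytes, padded to KEY_LEN) or None."""
        with open(path, "rb") as f, \
             mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as data:
            lo, hi = 0, len(data) // RECORD_LEN
            while lo < hi:
                mid = (lo + hi) // 2
                offset = mid * RECORD_LEN
                candidate = data[offset:offset + KEY_LEN]
                if candidate < key:
                    lo = mid + 1
                elif candidate > key:
                    hi = mid
                else:
                    return data[offset + KEY_LEN:offset + RECORD_LEN]
        return None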
[3] When a physical disk starts to fail you can usually see some evidence in the event log, such as reads or writes failing. But in this case there was nothing in the guest’s event log to indicate the write took an inordinately long time. I can’t say whether the infrastructure team could see anything in the host’s logs but one would hope they would have been checked...