Monday, 12 December 2016

Surprising Defaults – HttpClient ExpectContinue

One of the things you quickly discover when moving from building services on-premise to “the cloud” is quite how many more bits of wire and kit suddenly sit between you and your consumer. The latency of this already elongated network path can then be compounded further when the framework you’re using invokes unintuitive behaviour by default [1].

The Symptoms

The system was a new REST API built in C# on the .Net framework (4.6) and hosted in the cloud with AWS. This AWS endpoint was then further fronted by Akamai for various reasons. The initial consumer was an on-premise adaptor (also written in C#) which itself had to go through an enterprise-grade web proxy to reach the outside world.

Naturally, monitoring was added fairly early on so that we could start to get a feel for how much added latency moving to the cloud would bring. Our first-order approximation to instrumentation allowed us to tell how long the HTTP requests took to handle, along with a breakdown of the major functions, e.g. database queries and 3rd party requests. Outside the service we had some remote monitoring too that could tell us the performance from a more customer-like position.

When we integrated with the 3rd party service some poor performance stats caused us to look more closely at our metrics. The vast majority of the big delays were outside our control, but the exercise also raised some other questions as the numbers didn’t quite add up. We had expected the following simple formula to account for virtually all the time:

HTTP Request Time ~= 3rd Party Time + Database Time

However we were seeing a 300 ms discrepancy in many (but not all) cases. It was not our immediate concern as there were bigger fish to fry, but some extra instrumentation was added to the OWIN pipeline and we did a couple of quick local profile runs to look out for anything obviously out of place. The finger seemed to point to time lost somewhere in the Nancy part of the pipeline, but that didn’t entirely make sense at the time, so it was mentally filed away and we moved on.
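
By way of illustration, the kind of extra timing middleware you can drop into an OWIN pipeline is nothing fancy. The following is only a simplified sketch (the console write stands in for whatever metrics sink is really used, and the class name is made up for the example):

using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Threading.Tasks;

// Times everything downstream of it in the OWIN pipeline,
// e.g. the Nancy handler, on a per-request basis.
public class RequestTimingMiddleware
{
    private readonly Func<IDictionary<string, object>, Task> _next;

    public RequestTimingMiddleware(Func<IDictionary<string, object>, Task> next)
    {
        _next = next;
    }

    public async Task Invoke(IDictionary<string, object> environment)
    {
        var stopwatch = Stopwatch.StartNew();
        await _next(environment);
        stopwatch.Stop();

        var path = (string)environment["owin.RequestPath"];
        Console.WriteLine("{0} took {1} ms", path, stopwatch.ElapsedMilliseconds);
    }
}

Registered ahead of Nancy it brackets everything downstream and gives you a per-request figure to compare against the handler’s own timings.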

Serendipity Strikes

Whilst talking to the 3rd party about our performance woes with their service they came back to us and asked if we could stop sending them an “Expect: 100-Continue” header in our HTTP requests.

This wasn’t something anyone in the team was aware of and as far as we could see from the various RFCs and blog posts it was something “naturally occurring” on the internet. We also didn’t know if it was us adding it or one of the many proxies in between us and them.

We discovered how to turn it off, and did, but it made little difference to the performance problems we had with them, which were in the order of seconds, not milliseconds. Feeling uncomfortable about blindly switching settings off without really understanding them we reverted the change.

The mention of this header also cropped up when we started investigating some errors we were getting from Akamai that seemed to be more related to a disparity in idle connection timeouts.

Eventually, as we learned more about this mysterious header, someone in the team put two and two together and realised this was possibly where our missing time was going too.

The Cause

Our REST API uses PUT requests to add resources and it appears that the default behaviour of the .Net HttpClient class is to send this “Expect: 100-Continue” header for those types of requests. Its purpose is to tell the server that the headers have been sent, but that the client will delay sending the body until it receives a 100 Continue interim response. At that point the client sends the body, the server can then process the entire request, and the response is handled by the client as normal.

Yes, that’s right, it splits the request up so that it takes two round trips instead of one!
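
To make the defaults concrete, here is a minimal sketch (the URL and payload are purely illustrative) of where the two relevant knobs live on the .Net framework: the per-client ExpectContinue property starts out unset, so the underlying handler falls back to the process-wide ServicePointManager.Expect100Continue flag, which defaults to true and therefore adds the header to any PUT or POST that carries a body.

using System;
using System.Net;
using System.Net.Http;
using System.Threading.Tasks;

class ExpectContinueDefaults
{
    static void Main()
    {
        Run().GetAwaiter().GetResult();
    }

    static async Task Run()
    {
        var client = new HttpClient();

        // Unset (null) by default, so the handler defers to the global setting.
        Console.WriteLine("Per-client  : {0}", client.DefaultRequestHeaders.ExpectContinue);

        // True by default on the .Net framework, hence the extra round trip.
        Console.WriteLine("Process-wide: {0}", ServicePointManager.Expect100Continue);

        // A PUT with a body will therefore go out with "Expect: 100-Continue".
        var response = await client.PutAsync("https://example.com/resource",
                                             new StringContent("{}"));
        Console.WriteLine(response.StatusCode);
    }
}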

Now you can probably begin to understand why our request handling time appeared elongated and why it also appeared to be consumed somewhere within the Nancy framework. The request processing is started and handled by the OWIN middleware, as that only depends on the headers; it then enters Nancy, which finds a handler and requests the body in the background (asynchronously). When the body finally arrives the whole request is passed to our Nancy handler just as if it had all been sent as a single chunk.

The Cure

When you google this problem in relation to .Net you’ll see that there are a couple of options. We were slightly nervous about choosing the nuclear option (setting it globally on the ServicePointManager) and instead added an extra line into our HttpClient factory so that the change was localised:

var client = new HttpClient(...);
...
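// Suppress the automatic "Expect: 100-Continue" header for requests made by this client.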
client.DefaultRequestHeaders.ExpectContinue = false;
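
For completeness, the global “nuclear option” we shied away from is a one-liner on the ServicePointManager (shown here only as a sketch); it changes the fallback used when the per-client property is left unset, and so affects every request the process makes:

// Disables "Expect: 100-Continue" for the whole process, not just one HttpClient.
ServicePointManager.Expect100Continue = false;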

We re-deployed our services, checked our logs to ensure the header was no longer being sent, and then checked the various metrics to see if the time was now all accounted for, and it was.

Epilogue

In hindsight this all seems fairly obvious, at least, once you know what this header is supposed to do, and yet none of the people in my team (who are all pretty smart) joined up the dots right away. When something like this goes astray I like to try and make sense of why we didn’t pick it up as quickly as perhaps we should have.

In the beginning there were so many new things for the team to grasp. The difference in behaviour between our remote monitoring and on-premise adaptor was assumed to be one of infrastructure, especially when we had already battled the on-premise web proxy a few times [2]. We saw so many other headers in our requests that we never added ourselves, so why would we assume this one was any different (given none of us had run across it before)?

Given the popularity and maturity of the Nancy framework we surmised that no one would use it if it had the kind of performance problems we were seeing, so once again we were confused as to how the time could appear to be lost inside it. Although we were all aware of what the async/await construct does, none of us had really spent any serious time trying to track down performance anomalies in code that used it so liberally, and so once again we had difficulty understanding what the tooling was really telling us.

Ultimately though, the default behaviour just seems so utterly wrong that none of us could imagine the out-of-the-box settings would cause the HttpClient to behave this way. By choosing this default we are in essence optimising PUT requests for the scenario where the body does not need sending, which we all felt is definitely the exception, not the norm. Aside from large file uploads or massive write contention we were struggling to come up with a plausible use case.

I don’t know what forces caused this decision to be made as I clearly wasn’t there, and I can’t find any obvious sources that might explain it either. The internet and HTTP have evolved so much over the years that it’s possible this behaviour provides the best compatibility with web servers out-of-the-box. My own HTTP experience only covers the last few years, along with a few more around the turn of the millennium, but my colleagues easily cover the decades I’m missing, so I don’t feel we overlooked anything obvious.

Hopefully some kind soul will use the comments section to link to the rationale so we can all get a little closure on the issue.

 

[1] Violating The Principle of Least Astonishment for configuration settings was something I covered more generally before in “Sensible Defaults”.

[2] See “The Curse of NTLM Based HTTP Proxies”.