While investigating the issue that led to the discovery of the strange default behaviour of the .Net HttpClient class which I wrote up in “Surprising Defaults – HttpClient ExpectContinue” we also unearthed some other weirdness in a web proxy that sat between our on-premise adapter and our cloud hosted service.
Web proxies are something I’ve had cause to complain about before (see “The Curse of NTLM Based HTTP Proxies”) as they seem to interfere in unobvious ways and the people you need to consult with to resolve them are almost always out of reach [1]. In this particular instance nobody we spoke to in the company’s networks team knew anything about it and trying to identify if it’s your on-premise proxy and not something broken with any of the other intermediaries that sit between you and the endpoint is often hard to establish.
The Symptoms
Whilst trying to track down where the “Expect: 100-Continue” header was coming from, as we didn’t initially believe it was from our code, we ran a WireShark trace to see if we could capture the traffic from, and to, our box. What was weird in the short trace that we captured was that the socket looked like it kept closing after every request. Effectively we sent a PUT request, the response would come back, and immediately afterwards the socket would be closed (RST).
Naturally we put this on the yak stack. Sometime later when checking the number of connections to the TIBCO server I used the Sysinternals’ TCPView tool to see what the service was doing and again I noticed that sockets were being opened and closed repeatedly. As we had 8 threads concurrently processing the message queue it was easy to see 8 sockets open and close again as in TCPView they go green on creation and red briefly on termination.
At least, that appeared to be true for the HTTP requests which went out to the cloud, but not for the HTTP requests that went sideways to the internal authentication service. However they also had an endpoint hosted in the cloud which our cloud service used and we didn’t see that behaviour with them (i.e. cloud-to-cloud), or when we re-configured our on-premise service to use it either (i.e. on-premise-to-cloud). This suggested it was somehow related to our service, but how?
The HttpClient we were using for both sets of requests were the same [2] and so we were pretty sure that it wasn’t our fault, this time, although as the old saying goes “once bitten, twice shy”.
Naturally when it comes to working with HTTP one of the main diagnostic tools you reach for is CURL and so we replayed our requests via that to see if we could reproduce it with a different (i.e. non-.Net based) technology.
Phased Switchover
While the service we were writing was new, it was intended to replace an existing one and so part of the rollout plan was to phase it in slowly. This meant that all reads and writes would go to both versions of the service but only the one where any particular customer’s data resided would succeed. The consumers of the service would therefore get a 404 from us if the data hadn’t been migrated, which in the early stages of development applied to virtually every request.
A few experiments later to compare the behaviour for requests of migrated data versus unmigrated data and we had an answer. For some reason a proxy between our on-premise adapter and our web hosted service endpoint was injecting a “Connection: Close” header when a PUT or DELETE [3] request returned a 404. The HttpClient naturally honoured the response and duly closed the underlying socket.
However it did not have this behaviour for a GET or HEAD request that returned a 404 (I can’t remember about POST). Hence the reason we didn’t see this behaviour with the authentication service was because we only sent GETs, and anyway, they returned a 200 with a JSON error body instead of a 404 for invalid tokens [4].
Epilogue
I wish I could say that we tracked down the source of the behaviour and provide some closure but I can’t. The need for the on-premise adapter flip-flopped between being essential and merely a performance test aid, and then back again. The issue remained as a product backlog item so we wouldn’t forget it, but nothing more happened while I was there.
We informed the network team that we were opening and closing sockets like crazy, which these days with TLS is somewhat more expensive and therefore would generate extra load, but had to leave that with them along with an offer of help if they wanted to investigate it further, as much for our own sanity.
It’s problems like these which cause teams to deviate from established conventions because ultimately one is within their control while the other is outside it and the path of least resistance is nearly always seen as a winner from the business perspective.
[1] I’m sure they’re not hidden on purpose but unless you have a P1 incident it’s hard to get their attention as they’re too busy dealing with actual fires to worry about a bit of smoke elsewhere.
[2] The HttpClient should be treated as a Singleton and not disposed per request, which is a common mistake.
[3] See “PUT vs POST and Idempotency” for more about that particular choice.
[4] The effects of this style of API response on monitoring and how you need to refactor to make the true outcome visible are covered in my recent Overload article “Monitoring: Turning Noise into Signal”.
No comments:
Post a Comment