Tuesday, 30 October 2018

Always Reply Within Your SLA – Succeed or Abort

Way back in 2012 I wrote the blog post “Service Providers Are Interested In Your Timeouts Too” about how you can help service teams understand your intentions so that they can handle requests more efficiently. That was written at a time when I had been working for many years on internal systems where there were no real SLAs per-se, often just a “best efforts” approach with manual intervention required to “unblock” the system when the failures start occurring [1]. In contrast I have always strived to create self-healing systems as much as possible so that only truly remarkable events require any kind of human remediation.

In more recent years I’ve spent far more time working on web services where there is a much stronger notion of an SLA and therefore a much higher probability that if you fail to meet your SLA then the client will attempt to perform some kind of recovery rather than hang around and wait for the reply [2]. Hence what I wrote about wasting resources on dead requests in that earlier blog post have started to become more significant.


A consequence of this ideology is that I’ve started to become far more interested in the approach of always responding within the SLA even if that means aborting mid-request. Often an SLA is seen as an aspiration rather than any kind of hard deadline, something which we hope to achieve more often than not, where “more often” usually involves quoting some (arbitrary) number of “nines”. For those requests that fall outside this magical number all bets are off and you might get an answer in a useful timeframe or you might not. This kind of uncertainty has always bothered me as a client consumer.

Hence, I’ve started moving towards building services that always provide a reply within the SLA whether or not the request has been satisfied. Instead of tying up valuable resources in the hope that when the answer finally arrives the client still has a vested interest in it, I’d prefer to just abandon the request and let the client know the SLA would be violated if it had continued servicing it. In essence the request times-out server-side, where the time-out is the SLA.

What this means for the client is that they have a definitive reply (network issues notwithstanding) to their request within the time limit allowed. More importantly if they want to allow more time to handle the request than the SLA allows for then they need to tell the service that they’re willing to wait. Essentially this creates a priority system and allows the service to decide what to do with requests that are happy to hang around for a bit longer.


Implementation-wise what this mostly boils down to is ensuring that every non-trivial piece of work (think: database query, network call, disk read, etc.) must be made with a bounded call time, i.e. one where a timeout can be provided so that the caller always regains control in a timely fashion. Similarly we don’t start any work that we suspect we can’t finish in time either. This generally manifests as aborting on the first timeout which is usually given the entire SLA and therefore you’re never going to recover in time.

Internally the maximum timeout starts with the SLA and as each background query is sent it is timed and the timeout gets progressively shorter [3]. As the load increases and internal queries take longer the chances of a request aborting rises but at least the load on the upstream systems doesn’t keep rising too. Ultimately it’s just a classic negative feedback loop.


Unfortunately what makes implementing this somewhat less than idea is that we still don’t really have cancellable requests in many frameworks and you’re never entirely sure what happens when the timeout triggers. If the underlying operation is abandoned, but has to complete anyway because it can’t be cancelled, you may not be much better off. The modern async-enhanced programming world is great for avoiding tying up threads in the happy path but once you start considering the failure modes it’s much harder to reason about and, more importantly, control what’s going to happen. Despite the fact that under the covers the world of I/O has practically always been asynchronous the higher layers still assume a synchronous model with syntactic sugar only helping to reinforce that perspective.

So far I don’t have nearly enough production-level data points to know if it’s an idea that is truly worth the effort to implement or not. Being able to reject work outright because you’ve already missed the SLA isn’t too onerous but does mean you need to tap into the processing pipeline early before the request is queued in the background to know when the internal clock has started ticking. What’s harder to determine is whether you really get any benefit out of the additional complexity needed to track your request’s progress and if aborting upstream requests creates a more or equally unstable service due to the way the timeouts leave their underlying requests dangling.

I still think it’s an approach worth pursuing but I wouldn’t be surprised to find The Morning Paper covering something from decades ago that shows it’s just a fools errand :o).


[1] See “Support-Friendly Tooling” for some other examples about how this can play out if reliability out-of-the-box is “assumed”.

[2] In one instance that would mean abandoning the request and potentially taking on some small financial risk on behalf of the customer.

[3] Naturally for parallel / scatter-gather I/O it’s the time of the longest concurrent request.

No comments:

Post a Comment