Friday, 2 November 2012

Service Providers Are Interested In Your Timeouts Too

There are established figures for the attention span of humans, so when it comes to serving up web pages you can pretty well put a figure on how long a person will wait for a response. If you’re in a horizontal market serving up stuff to buy then that’s going to be pretty short - seconds, I would have thought. On the other hand, when it comes to submitting a tax return I should imagine users are a little more forgiving because ultimately it’s their wallet on the line. What this means is that if, in the act of handling a request, you have to contact another service, you might be able to get away with a simple timeout to protect yourself.
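
A minimal sketch of such a timeout in Python, using the requests library (the endpoint and the 5 second limit are made up for illustration):

    import requests

    try:
        # Give up if the service fails to connect or respond within 5 seconds
        # (requests applies the value to the connect and read phases).
        response = requests.get("https://example.com/orders",
                                params={"customer": "1234"},
                                timeout=5.0)
        response.raise_for_status()
        orders = response.json()
    except requests.Timeout:
        # Note the service may well still be doing the work; all we know
        # is that we have stopped waiting for the answer.
        orders = None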

When it comes to services that provide a much more variable set of data it’s preferable to be able to tell them more about your expectations, especially if they have no way of detecting and handling a client-side disconnect (although that might be part of the protocol). The other obvious alternative is to support some form of cancellation mechanism, but that can be tricky to implement on the client as it’ll likely involve threads, re-entrancy or asynchronous-style event handling. Many UIs still have the classic single-threaded/synchronous mindset and so cancellable I/O may well be considered a “more advanced feature”.
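
To give a flavour of what client-side cancellation entails, here is a sketch using Python’s asyncio (fetch_orders is a stand-in for the real service call):

    import asyncio

    async def fetch_orders(customer_id):
        await asyncio.sleep(60)    # stand-in for a slow service call
        return []

    async def main():
        task = asyncio.create_task(fetch_orders("1234"))
        try:
            # wait_for cancels the task for us if the timeout expires.
            return await asyncio.wait_for(task, timeout=5.0)
        except asyncio.TimeoutError:
            # Our side of the work is cancelled, but the remote service
            # knows nothing about it unless the protocol propagates it.
            return None

    asyncio.run(main())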

To give an example, say you want all the orders for a particular customer and there is a large variation in result set sizes: the majority may be small and return very quickly, but a few biggies take considerably longer. Once the service starts to be put under load the larger requests may take so much longer that the client times out. If the user (real or another system) is impatient, or has some form of naive retry logic, it may just try again. The problem is that internally the system may well still be servicing the original request whilst the retries come in. If the request servicing is poorly written and has no timeout of its own it will just end up competing with dead requests, somewhat akin to a feedback loop.
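
The naive retry logic might look something like this (a deliberately bad sketch):

    import requests

    def fetch_orders_with_retry(url, attempts=3):
        for _ in range(attempts):
            try:
                return requests.get(url, timeout=5.0).json()
            except requests.Timeout:
                # The server has no idea we gave up, so each retry piles a
                # fresh request on top of the one it is still servicing.
                continue
        raise RuntimeError("service did not respond in time")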

In Vol 2 of Pattern Languages of Program Design (PLOPD) Gerard Meszaros published a few patterns for dealing with reactive systems. One of these is “Fresh Work Before Stale”. The premise is that if you have a long queue and you operate it as a FIFO then everybody waits and so everybody’s experience sucks. Instead, if you operate it more like a LIFO then, when the system is busy, at least some clients get a decent level of service, at the expense of really poor service for others. I’m sure PhDs have been written on the subject and I’m probably trivialising it here, but I like the simplicity of this particular idea.
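
In code the difference between the two policies is trivial; a minimal sketch using Python’s deque:

    from collections import deque

    queue = deque()

    def submit(request):
        queue.append(request)      # new work always goes on the back

    def next_request_fifo():
        return queue.popleft()     # oldest first: when busy, everybody waits

    def next_request_lifo():
        return queue.pop()         # freshest first: some clients get decent
                                   # service at the expense of the stale ones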

Another alternative is to give the service provider more information about your expectations so that they have the opportunity to expend no more effort on your behalf than is absolutely necessary. In essence it’s a variation of the Fail Fast pattern: if you know at the point of starting the task that you either can’t complete it in time, or that you’ve already exceeded the Time To Live (TTL), then there is little point in continuing. One additional difference in the latter case is that the client will have already given up. Either way you have been given the opportunity to protect yourself from performing unnecessary work, which will improve matters under load.
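
On the service side that may amount to little more than checking the deadline before starting the work; a sketch, assuming the client passes an absolute deadline along with the request (and that the clocks are reasonably in sync):

    import time

    def handle_request(deadline, do_work):
        remaining = deadline - time.time()
        # Fail Fast: if the TTL has expired the client has already given up,
        # so any answer we compute now is wasted effort.
        if remaining <= 0:
            raise TimeoutError("request arrived after its deadline")
        return do_work(time_budget=remaining)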

The system at my current client provides a calculation framework that can be used by both real people and scheduled jobs to perform a variety of workloads, including user-submitted requests and bulk batch-processing style requests. The length of a calculation is unknown up front but, generally speaking, the users are after a quick turnaround whilst some of the batch jobs take many hours. Although the batch processing work is generally predictable, the user-submitted requests are largely out of our control and so they could submit a job that takes hours by accident. In fact they may not realise it and keep submitting it, much like the impatient web surfer who keeps hitting “refresh” in their browser in the hope of a quicker response. Really long-running batch jobs may even miss the reporting deadline, but they’re still not terminated because a late answer still has value for other reasons.

Balancing this kind of workload is hard because you need to ensure that system resources are not being wasted due to a hung job (e.g. the code has gone into an infinite loop) but at the same time you need to cater for a massive variation in processing time. If the work is queued up you also probably want to distinguish between “time spent processing” and “wall clock time”. A user will only wait for, say, 5 minutes from the time of submission, so if it doesn’t even make it out of the queue by then it should be binned. In contrast the batch jobs are more likely to be controlled by the amount of time spent processing the request. If you submit thousands of jobs up front you expect some waiting to occur before they get actioned. If there is some kind of business deadline you’ll need to prioritise the work and maybe use a fixed-point-in-time as the timeout, say, 9 am.
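
A sketch of how the two budgets might be modelled (the field names are purely illustrative):

    import time
    from dataclasses import dataclass

    @dataclass
    class Job:
        submitted_at: float                  # wall clock time of submission
        queue_ttl: float                     # e.g. 5 * 60 seconds for a user request
        hard_deadline: float | None = None   # fixed point in time, e.g. 9 am

    def should_start(job, now=None):
        now = time.time() if now is None else now
        # Bin user requests that never made it out of the queue in time.
        if now - job.submitted_at > job.queue_ttl:
            return False
        # Honour any fixed-point-in-time business deadline.
        if job.hard_deadline is not None and now >= job.hard_deadline:
            return False
        return True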

Naturally these various techniques can be composed so that if you have a hard deadline of 7:00 am, and you pick up a job at 6:58, then the maximum timeout for any service call is the minimum of 2 minutes and the normal service call timeout. It doesn’t matter that you might normally allow 5 minutes - if you can’t get an answer within 2 you’re stuffed.
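
Computing the effective timeout for any one call then becomes a simple min over the applicable limits (a sketch, with 5 minutes standing in for the normal service call timeout):

    import time

    NORMAL_CALL_TIMEOUT = 5 * 60   # the usual per-call allowance, in seconds

    def effective_timeout(hard_deadline, now=None):
        now = time.time() if now is None else now
        remaining = hard_deadline - now   # e.g. 2 minutes when picked up at 6:58
        # It does not matter that we would normally allow 5 minutes; if the
        # job must finish by 7:00 the call gets whichever budget is smaller.
        return min(NORMAL_CALL_TIMEOUT, max(remaining, 0))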

The flip-side to all this is, of course, increased complexity. Re-runability is also affected when you have hard deadlines, because you have to adjust the deadline first, which makes support that little bit more difficult. For UIs you have a whole other world of pain just trying to manage the user’s expectations and impatience. But they were never the target here - this is very much about helping the links in the back-end chain to help you, by being more aware of what you’re asking them to do.
