On the system I’m currently working on we have always seen the occasional failure caused by the database stalling for longer than the timeout on a particular write. When I say occasional, I mean we were seeing it maybe once every few months. There is some polling of the table, but effectively there should never be any overlap between the query used to retrieve the data and the one used to update it later[1].

This eventually started to become a little more frequent, so I decided to add a little extra code to make that specific write query more resilient by allowing the default 30 second timeout to be configured. The update is very small, and so the actual transaction, when given a chance to do its work, is tiny; the extended timeout therefore didn’t feel too dangerous. I intended to extend it a little at a time to ensure I didn’t mask any other issues by accident.
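The change itself amounts to little more than the one line that sets the timeout. As a rough sketch of the idea - written here in Java/JDBC rather than whatever the real system uses, with the table, column and setting names all invented - it might look like this:

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

public class StatusWriter {
    // Hypothetical setting; the default mirrors the usual 30 second command timeout.
    private static final int QUERY_TIMEOUT_SECS =
            Integer.getInteger("status.writer.queryTimeoutSecs", 30);

    public static void markProcessed(Connection connection, long id) throws SQLException {
        String sql = "UPDATE WorkItem SET Status = 'Processed' WHERE Id = ?";
        try (PreparedStatement update = connection.prepareStatement(sql)) {
            // The one-line change: the previously hard-coded timeout now comes from
            // configuration, so it can be inched upwards rather than guessed in one go.
            update.setQueryTimeout(QUERY_TIMEOUT_SECS);
            update.setLong(1, id);
            update.executeUpdate();
        }
    }
}
```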
As is always the way, I never quite got the time to finish plumbing the changes in, despite their relative simplicity.
When I finally came back to it I noticed a colleague had decided to take a different route - the more classic back-off and retry. This made me feel a little uneasy because I felt it was the riskier choice. The first, and probably least important, reason was that the code was now more complex. Alright, it wasn’t reams and reams of code, but it was far more than the single line needed to bump the timeout. Yes, it could have been simplified by factoring out the retry logic using an Execute Around Method. But what concerned me more was the impact on SQL Server of the constant retries from the numerous calling processes...
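For comparison, the back-off and retry route, factored through an Execute Around Method, might look something like the following sketch (Java again, with the attempt count and delays plucked out of the air):

```java
import java.sql.SQLException;
import java.util.concurrent.Callable;

public final class Retrier {
    private Retrier() {}

    // Execute Around Method: the retry policy lives in one place and callers
    // supply only the action to be wrapped.
    public static <T> T withRetry(int maxAttempts, long initialDelayMillis, Callable<T> action)
            throws Exception {
        long delay = initialDelayMillis;
        for (int attempt = 1; ; ++attempt) {
            try {
                return action.call();
            } catch (SQLException e) {
                if (attempt >= maxAttempts) {
                    throw e;            // give up and surface the original failure
                }
                Thread.sleep(delay);    // back off before having another go
                delay *= 2;             // simple exponential back-off
            }
        }
    }
}

// Usage, reusing the invented writer from earlier:
//   Retrier.withRetry(3, 1000, () -> { StatusWriter.markProcessed(connection, id); return null; });
```

Tidier than scattering try/catch loops around every call site, but still noticeably more moving parts than bumping a timeout.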
Everyone knows what deadlock is because it’s a classic interview question, but many of the people I’ve interviewed are far less familiar with its alter ego - livelock. With deadlock nobody makes progress because everyone is standing around doing nothing; with livelock the opposite is happening - everyone is so busy trying to get out of each other’s way that they still make no progress. The canonical example I know is two people walking towards each other in a corridor who never manage to pass because they both keep sidestepping into each other’s path. Change the two people into two processes that are both backing off and retrying and you can hopefully see where I’m going.
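To make the corridor analogy concrete, here is a toy sketch of the pattern in code; the barrier is an artificial device to keep the two walkers in lockstep, which is roughly what identical back-off delays do to two retrying processes:

```java
import java.util.concurrent.CyclicBarrier;

// A toy rendering of the corridor: two walkers that always sidestep when they
// see each other coming, and therefore never get past one another.
public class Corridor {
    static final boolean[] side = new boolean[2];               // which side each walker is on
    static final CyclicBarrier lockstep = new CyclicBarrier(2); // keeps the walkers in sync

    static void walk(int me, int other) throws Exception {
        while (true) {
            lockstep.await();                        // both look up at the same moment
            boolean clear = side[me] != side[other]; // is the other walker out of my way?
            lockstep.await();                        // both act on what they just saw
            if (clear) {
                System.out.println("Walker " + me + " got past.");
                return;
            }
            side[me] = !side[me];                    // sidestep... exactly as the other does
        }
    }

    public static void main(String[] args) {
        for (int i = 0; i < 2; ++i) {
            final int me = i;
            new Thread(() -> {
                try { walk(me, 1 - me); } catch (Exception e) { /* not reached in the demo */ }
            }).start();
        }
    }
}
```

Neither thread is blocked - both are busily sidestepping - yet neither ever makes it down the corridor. Take away the barrier and real-world timing jitter would eventually break the cycle, which is why livelock tends to show up as wasted work and delayed progress rather than a total stall.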
If SQL Server was blocking the update whilst some other lengthy read was occurring[2], it felt better to me to simply wait longer, on the basis that SQL Server requires no extra resources to achieve that. In contrast, backing off and retrying means that the client will time out, reconnect and re-execute the query, which means the statement has to be re-sent, a plan potentially re-generated, etc. Even with connection pooling and query plan caching it must still add some overhead and would therefore be more costly than waiting longer. If you have many clients doing this you start to approach the effects you see with livelock.
I wrote about similar effects in “Service Providers Are Interested In Your Timeouts Too”, where the client of a web service would time out and retry, but with the service still stuck executing the previous query the retry just created more contention. I’m sure SQL Server has been around more than long enough to deal with this admirably, but equally the constant retries don’t come for free.
In this particular case I’m sure the level of contention could easily have been handled either way, but when you have a performance problem that you cannot definitively pin down it’s better to inch your way forward than to change too many things in one go.
[1] Use of the READPAST hint helps reduce the readers blocking each other, as rows locked by another transaction are simply skipped rather than waited on.
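By way of illustration, the sort of polling query the hint gets applied to might be declared like so (Java again, with the table and column names invented):

```java
public final class PollingQuery {
    // The important part is the READPAST table hint, which tells SQL Server to
    // skip rows another transaction currently has locked rather than wait behind them.
    public static final String SQL =
          "SELECT TOP (10) Id, Payload"
        + "  FROM WorkItem WITH (READPAST)"
        + " WHERE Status = 'Pending'";

    private PollingQuery() {}
}
```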
[2] At first we had no obvious candidates for why this might be happening. As time went on and the table grew to many millions of rows there was some correlation with a few support queries that were being run occasionally, but nothing really concrete. If it had been a major problem we would probably have stuck SQL Profiler on and tried to force the issue somehow. Intuition is never a good substitute for hard data, but sometimes it’s all you have to go on.