Monday 18 November 2019

Arbitrary Cache Timeouts

Like many other programmers I’ve probably added my fair share of caches to systems over the years, and as we know from the old joke, one of the two hardest problems in computer science is knowing when to invalidate them. It’s a hard question, to be sure, but a really annoying behaviour you can run into as a maintainer is when the invalidation appears to be done arbitrarily, usually by specifying some timeout seemingly plucked out of thin air and maybe even changed equally arbitrarily. (It may not be, but documenting such decisions is usually way down the list of important things to do.)

Invalidation

If there is a need for a cache in production, and let’s face it that’s the usual driver, then any automatic invalidation is likely to be done as infrequently as possible to ensure the highest hit ratio. The problem is that the timeout is often hard-coded, and because it so rarely kicks in it can mask cache invalidation bugs. The knee-jerk reaction to “things behaving weirdly” in production is to switch everything off and on again, thereby implicitly invalidating any caches, but that does nothing to help us find those bugs.
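
To make that concrete, here’s a minimal sketch (in Python, with hypothetical names, purely for illustration) of the shape such a cache often takes: the timeout is a hard-coded constant, so the expiry branch almost never executes outside production.

    import time

    CACHE_TIMEOUT_SECS = 24 * 60 * 60       # hard-coded and generous, so expiry rarely triggers

    class TimedCache:
        def __init__(self, fetch):
            self._fetch = fetch              # the upstream source, e.g. a database or service
            self._entries = {}               # key -> (value, time it was cached)

        def get(self, key):
            entry = self._entries.get(key)
            if entry is not None:
                value, cached_at = entry
                if time.monotonic() - cached_at < CACHE_TIMEOUT_SECS:
                    return value             # the common path: a cache hit
            # the expiry / refresh path hides down here and rarely gets exercised
            value = self._fetch(key)
            self._entries[key] = (value, time.monotonic())
            return value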

The most recent impetus for this post was just such a bug, which surfaced because the cache invalidation logic never ran in practice. The cache timeout was set arbitrarily large, which seemed odd, but I eventually discovered it was supposed to be irrelevant because the service hosting the cache should have been rebooted at midnight every day! Because the password for the account used to run the reboot task had expired, the reboot never happened, and when the invalidated items were eventually requested again things went awry. Instead of simply fetching the item from the upstream source and caching it again, the cache still held remnants of the stale items and failed the request instead. Being an infrequent code path it didn’t obviously ring any bells and so took longer to diagnose.

Design for Testability

While it’s useful to avoid throwing away data unnecessarily in production, we know that the live environment rarely needs the most flexibility when it comes to configuration (see “Testing Drives the Need for Flexible Configuration”). On the contrary, in a test environment I’d expect any cache to be cycled reasonably quickly to flush out any issues, as I’d expect more side-effects to show up from cache misses than from hits.
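
One way to get that flexibility is to make the timeout part of the configuration rather than a hard-coded constant. A minimal sketch, assuming an environment variable is an acceptable mechanism (the variable name below is made up):

    import os

    # Production can happily rely on the default, while a test environment sets,
    # say, CACHE_TIMEOUT_SECS=5 so that items expire constantly and the
    # miss / refresh path gets exercised on every test run.
    def load_cache_timeout(default_secs=3600.0):
        return float(os.environ.get("CACHE_TIMEOUT_SECS", default_secs))

The value would then be injected into the cache’s constructor rather than baked into the implementation as it was in the earlier sketch.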

If you are writing any automated tests around the caching behaviour, that is often a good time to consider the other non-functional requirements, such as monitoring and support. For example, does the service or tool hosting the cache expose some means to flush it manually? While rebooting a service may do the trick, it does nothing to help you track down issues around residual state and often ends up wreaking havoc with any connected clients if they’re not written with a proper distributed-system mindset.
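
As a sketch of what that might look like, assuming the service speaks HTTP and using Flask purely as an example framework, an administrative endpoint to flush the cache is cheap to add and far less disruptive than a restart:

    from flask import Flask, jsonify

    app = Flask(__name__)
    cache = {}   # stand-in for whatever cache the service actually hosts

    @app.route("/admin/cache", methods=["DELETE"])
    def flush_cache():
        # Invalidate everything on demand without bouncing the service, so
        # connected clients stay up and residual state can be ruled in or out.
        count = len(cache)
        cache.clear()
        return jsonify({"entries_flushed": count})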

Another scenario to consider is the cache getting poisoned; if there is no easy way to eject the bad data you’re looking at the sledgehammer approach again. If your cache is HA (highly available) and backed by some persistent storage, getting bad data out can be a real challenge when you’re under the cosh. One system I worked on had random caches poisoned with bad data due to a threading serialization bug in an external library.
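
If the cache happens to live in something like Redis (just an example; the system in question wasn’t necessarily Redis-backed), a small tool to eject only the affected keys is far kinder than flushing the whole thing under pressure. A hypothetical sketch:

    import redis

    r = redis.Redis(host="localhost", port=6379)   # connection details are made up

    def eject(pattern):
        # Remove only the poisoned entries, e.g. eject("product:123:*"),
        # leaving the rest of the (expensive to rebuild) cache intact.
        removed = 0
        for key in r.scan_iter(match=pattern):
            removed += r.delete(key)
        return removed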

Monitoring

The monitoring side is probably equally important. If you generate no instrumentation data, how do you know whether your cache is even having the desired effect? One team I was on added a new cache to a service and we were bewildered to discover that it was never used. It turned out the WCF service settings were configured to create a new service instance for every request, and therefore a new cache was created every time! This was despite the fact that we had unit tests for the cache, and they were happily passing [1].
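
Even a crude pair of counters would have exposed that straight away, because the hit count would have sat stubbornly at zero. A sketch along those lines, with the numbers assumed to be published through whatever metrics pipeline you already have:

    import time

    class InstrumentedCache:
        def __init__(self, fetch, timeout_secs):
            self._fetch = fetch
            self._timeout_secs = timeout_secs
            self._entries = {}
            self.hits = 0        # publish these periodically; a hit count that
            self.misses = 0      # never moves off zero is an immediate red flag

        def get(self, key):
            entry = self._entries.get(key)
            if entry is not None and time.monotonic() - entry[1] < self._timeout_secs:
                self.hits += 1
                return entry[0]
            self.misses += 1
            value = self._fetch(key)
            self._entries[key] = (value, time.monotonic())
            return value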

It’s also important to realise that a cache without an eviction policy is just another name for a memory leak. You cannot keep caching data forever unless you know there is a hard upper bound. Hence you’re going to need to use the instrumentation data to help find the sweet spot that gives you the right balance between time and space.
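
In Python, for instance, the standard library’s functools.lru_cache makes that bound explicit; the maxsize below is an arbitrary placeholder that the instrumentation data would help you tune:

    from functools import lru_cache

    @lru_cache(maxsize=1024)          # bounded: least recently used entries get evicted
    def fetch_customer(customer_id):
        ...                           # expensive call to the upstream source goes here

    # cache_info() reports hits, misses and current size: exactly the raw
    # material needed to find that time versus space sweet spot.
    print(fetch_customer.cache_info())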

We also shouldn’t blindly assume that caches will continue to provide the same performance in the future as they do now; our metrics will allow us to spot any change in trends over time which might highlight a change in the data that’s making the cache less efficient. For example, one cache I saw would see its efficiency plummet for a while because a large batch of single-use items got requested, cached, and then discarded as the common data got requested again. Once identified, we disabled caching for those kinds of items, not so much for the performance benefit as to avoid blurring the monitoring data with unnecessary “glitches” [2].
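
The fix itself amounted to little more than a guard in front of the cache; something along these lines (the predicate name is hypothetical), so that single-use items go straight to the upstream source and never pollute either the cache or the metrics:

    def get_item(key, cache, fetch, is_single_use):
        # Single-use items bypass the cache entirely: they gain nothing from it
        # and would otherwise churn entries and distort the hit-ratio figures.
        if is_single_use(key):
            return fetch(key)
        return cache.get(key)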

 

[1] See “Man Cannot Live by Unit Testing Alone” for other tales of the perils of that mindset.

[2] This is a topic I covered more extensively in my Overload article “Monitoring: Turning Noise Into Signal”.
