Saturday 8 August 2020

Weekend Maintenance as Chaos Engineering

I was working on a new system – a grid-based calculation engine for an investment bank – when I began reading about some of the crazy ideas coming out of Netflix, where they would kill off actual production servers to test their resilience to failure. I really liked this idea as it had that “put your money where your mouth is” feel to it. We were designing a system that should cope with exactly this kind of failure, and if it didn’t, then we had learned something and needed to fix it.

Failure is Expected

We had already had a few minor incidents during its early operation, caused by dodgy data flowing down from upstream systems. We had tackled those by temporarily remediating the data to get the system working again, and then immediately fixing the code so that the same kind of problem would not cause an issue in future. The project manager, who had also worked on a sister legacy system to one I’d worked on before, had made it clear from the start that he didn’t want another “support nightmare” like we’d both seen before [1] and pushed the “self-healing” angle, which was a joy to hear. Consequently, reliability was always foremost in our minds.
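To give a flavour of what “self-healing” means in practice for dodgy upstream data, here is a minimal sketch in Python. It is not the original system’s code; the record fields and the validate helper are made up for illustration. The point is simply that bad records get parked for investigation rather than bringing the whole calculation down.

def validate(record):
    # Return a description of the first problem found, or None if the record is usable.
    # (The field names here are purely illustrative.)
    if record.get("notional") is None:
        return "missing notional"
    if record.get("trade_date") is None:
        return "missing trade date"
    return None

def process_feed(records, quarantine, calculate):
    # Process upstream records, diverting invalid ones instead of failing outright.
    for record in records:
        problem = validate(record)
        if problem:
            quarantine.append((record, problem))  # park it and raise an alert later
            continue
        calculate(record)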

Once the system went live and the business began to rely on it, the idea of randomly killing off services and servers in production was a hard prospect to sell. While the project manager had fought to get us a UAT environment that brought us close to parity with production, and was okay with us using that for testing the system’s reliability, he was less happy about going the whole hog and adopting the Netflix approach. (The organisation was already very conservative and, despite our impeccable record, some nasty failures in other teams had caused it to become more risk averse rather than address the root problems.)

Planned Disruption is Good!

Some months after we had gone live I drew the short straw and was involved in a large-scale DR test. We were already running active/active by making use of the DR facilities during the day, and we rotated the database cluster nodes every weekend [2] to avoid a node getting stale, so we had a high degree of confidence that we would cope admirably with the test. Unfortunately one of the bank’s main trade systems wouldn’t start after failing over to DR, so we never really got to do a full test and show that it was a no-brainer for us.

While the day was largely wasted for me, as I sat around waiting for our turn, it did give me time to think a bit more about how we would show that the system was working correctly during the test and, once it had been failed back over again at the end, that it had recovered properly. At that point I realised we didn’t need to implement any form of Chaos Engineering ourselves as the Infrastructure team were already providing it, every weekend!

It’s common for large enterprises to perform only emergency maintenance during the week and save the much more disruptive changes for the weekend, e.g. tearing parts of the network up, patching and rebooting servers, etc. At that time it was also common for support teams to shut their systems down and carefully bring them back up after the maintenance window to ensure they were operating correctly when the eastern markets opened late on Sunday evening [3]. This was the perfect opportunity to do the complete opposite – drive the system hard over the weekend and see what state it was in after the maintenance had finished. If it wasn’t still operating normally, we had missed some failure modes.

An Aria of Canaries

We were already pushing through a simple canary request every few minutes, which allowed us to spot when things had gone south unexpectedly, but we wanted something heavier that might drive out subtler problems. So we started pushing through heavy loads over the weekend too and then looked at what state they were in when the weekend was over. These loads always had a lower priority than any real work, so we could happily leave them to finish in the background rather than needing to kill them off before the working week started. (This is a nice example of using the existing features of the system to stop it disrupting the normal workload.)
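The two kinds of probe look roughly like the sketch below. This assumes a hypothetical job-submission API (submit_job, plus a job object with wait/succeeded/status methods and an alert callback) standing in for whatever the grid actually exposed; the shape of the idea is what matters – a tiny canary fired every few minutes, and a heavyweight weekend load queued at the lowest priority so real work always jumps ahead of it.

import time

CANARY_INTERVAL_SECS = 5 * 60

def run_canary(submit_job, alert):
    # Fire a small, representative request every few minutes and shout if it
    # fails or takes too long to come back.
    while True:
        job = submit_job(payload="canary-portfolio", priority="normal")
        if not job.wait(timeout_secs=60) or not job.succeeded():
            alert("canary failed: " + job.status())
        time.sleep(CANARY_INTERVAL_SECS)

def submit_weekend_soak(submit_job, portfolios):
    # Queue the heavyweight loads at the lowest priority so any real work
    # always takes precedence and nothing needs killing off on Monday morning.
    return [submit_job(payload=p, priority="lowest") for p in portfolios]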

This proved to be a fruitful idea as it unearthed a couple of places where the system wasn’t quite as reliable as we’d thought. For example, we were leaking temporary files when the network glitched and the calculation was restarted. The load also pushed the app servers over the edge memory-wise and highlighted a bug in the nanny process when the machine ran short of memory. There was also a bug in some exponential back-off code that backed off a little too far, as it never expected an outage to last most of the weekend :o).
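For the back-off bug in particular, the usual remedy is to cap the delay. Here is a rough illustration in Python – not the actual fix – of a capped, jittered back-off, where the numbers chosen are arbitrary:

import random
import time

def backoff_delays(base_secs=1.0, cap_secs=300.0):
    # Yield jittered, exponentially growing delays that never exceed the cap.
    delay = base_secs
    while True:
        yield random.uniform(0, delay)    # full jitter avoids thundering herds
        delay = min(delay * 2, cap_secs)  # the cap is what the buggy code lacked

def retry(operation, base_secs=1.0, cap_secs=300.0):
    # Keep retrying an idempotent operation until it succeeds, sleeping between
    # attempts according to the capped back-off schedule.
    for delay in backoff_delays(base_secs, cap_secs):
        try:
            return operation()
        except Exception:
            time.sleep(delay)

Without the cap, a weekend-long outage doubles the delay until the next retry is hours away, which is why the system took so long to pick up again once the maintenance had finished.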

Order From Chaos

When they finally scheduled a repeat DR test some months later, after supposedly ironing out the wrinkles with their key trade capture systems, our test was a doddle. The system just carried on after being brought back to life in the DR environment and, similarly, after reverting back to PROD it picked up where it had left off and retried those jobs that had failed when the switchover started. Rather than shying away from the weekend disruption, we had used it to our advantage to help improve the system’s reliability.


[1] Eventually the team spends so much time fire-fighting there is no time left to actually fix the system and it turns into an endless soul-destroying job.

[2] Rotating the database cluster primary causes the database to work with an empty cache which is a great way to discover how much your common queries rely on heavily cached data. In one instance a 45-second reporting query took over 15 minutes when faced with no cached pages!

[3] See Arbitrary Cache Timeouts for an example where constant rebooting masked a bug.
