Tuesday, 22 December 2015

The Cost of Not Designing the Database Schema

The tale I wrote about in “Single Points of Failure - The SAN” didn’t entirely conclude at the point the issue was identified and apparently resolved. Whilst the vast majority of problems disappeared there was still a spike every now and then that caused the simple web service we wrote to take hundreds of milliseconds to respond, way more than a gen 2 garbage collection would take. We also logged when garbage collections occurred and they were never in sight when this glitch showed up.

After taking some time off I ended up joining the team who were responsible for calling that tactical web service and so I became privy to the goings-on upstream. It turned out the remaining blips were often occurring when an early morning batch process was run. It made little sense at the time that it could affect an entirely unrelated service, but with what I now knew about the SAN I felt the evidence pointed to a smoking gun. But how to truly explain it?

More Performance Woes

One of the changes being made when I joined this team was increased visibility (for the team) about how the services they owned were behaving in production. One service in particular was beginning to show signs of trouble and with the Christmas period looming it was felt something needed to be done about it pronto.

Interestingly the investigation of timeouts caused me to start correlating data with the other service we had had problems with earlier. On one particular day this daily batch process was delayed by a couple of hours and on that very same day the unexplained timeouts in the downstream service shifted too. Whilst correlation does not imply causality, the smoke from the gun was thickening. But it still didn’t make sense how the problem was “jumping the cracks”.

The investigation for my current team’s service turned to the Oracle database and it unearthed some stats that showed the database was making quite a few reads to satisfy the most common query type – retrieving the transactions for an account.

The Mists Begin to Clear

I started to apply the “5 Whys” technique to see if I could piece together a coherent picture that would address the immediate concern, but might also encompass the other one too. The question I started with was this:

“Why are the upstream service HTTP requests timing out?”
  1. Because they are waiting for a database connection. Why?
  2. Because each query is taking much longer. Why?
  3. Because the database is constantly hitting the SAN. Why?
  4. Because the database has to read so many pages. Why?
  5. Because the table being queried is badly organised.
Switching to the problem of unexplained timeouts in the other service for a moment it all started to make sense. This batch process that runs in the early morning generates a huge amount of “non-cacheable” reads (essentially a table scan) which is saturating the SAN and therefore causing the similar SAN related problems to what we had before.

Sadly my hypothesis was never acknowledged or discussed outside the team as they had stopped asking questions when they realised the database query was taking too long. However within the team it was accepted as highly plausible so I felt comfortable that at least we had some closure, and more importantly a theory to consider if things showed up again.

The temporary solution to the database problem was to stick a whole load more RAM in it to vastly improve caching and therefore reduce query times enough during the day to avoid the bottlenecks for now.

I posited that this change would also fix (or at least heavily reduce) the problems of unknown timeouts in the other service because Oracle would need to perform far less physical reads, and therefore the load on the SAN would also be reduced. This is exactly what I observed, so the gun was smoking even more now.

Addressing the Root Cause

Fundamentally the problem was down to the database having to do way more I/O work than should be necessary to satisfy the query. The table in question is essentially a set of transactions for an account which are being queried by the account’s ID.

The table was implemented as a simple heap with an index for the account ID. Whilst this meant that the transactions for an account could be found by the index, due to the heap structure the transactions were spread right across the table’s entire set of pages. Essentially the database did a few reads of the table index to find the rows in question and then (pathologically speaking) did one read per-row to get the data itself. Hence, for accounts with many transactions that was a huge number of random I/O’s.

I wasn’t there when the table was designed and so I have no knowledge about what the rationale was. Maybe it was just “the simplest thing that would possibly work” and they thought they’d have time to address scalability later? Or maybe they expected a different read / write pattern? Either way it’s not the structure I would have expected out-of-the-box for this kind of table.

Given that the table stores data for an account, and the key for that account is the primary means of lookup, we should be looking to keep all the data for an account close together. Hence using a table physically structured around the account ID (a “clustered index” on SQL Server and “index-organised table” on Oracle) will provide fast access and excellent locality of reference because all the pages for each account will be stored together. This way the database only has to navigate the index to the start of the specific account’s data and then do a few sequential page reads to get the rest.

No Time to Fix It

The problem with modern businesses is that they run 24x7 these days and so there is no time for downtime and maintenance. So whilst a differently organised table may well now be the best approach, the cost of implementing that change may be too high. Due to the current volume of data, taking the database offline and rebuilding it was not considered possible given the current state of the business and market.

Instead the DBAs decided to add a covering index that could be built online which included all the data so the query optimiser could satisfy the main query solely from the index. Essentially they created the clustered table via an index. Of course every write now had to update the table, original index and the new one. It should have been possible at that point to drop the original index, but I’m not sure if that happened as they’d also have to prove it wasn’t being used by another query.

Back to the SAN

In the meantime I was asked to investigate some other unexplained timeouts that occurred well outside the morning batch processing window. Knowing what we did now about the database and the SAN someone questioned whether the DBAs were already implementing this new index in production?

They weren’t but they were testing the approach in the QA environment. The correlation again was very strong and so someone investigated what the topology was for the databases in the QA environment and they discovered that some of the storage pools shared a portion of the SAN with production which was clearly unintentional. Oops.

Early Warning Indicators

Hindsight is a wonderful thing and it’s good that they were gaining visibility of their service’s behaviour, but that was only able to identify immediate glitches. There also needs to be some element of trend analysis to spot when things are beginning to head south.

For me the stance on instrumentation is that you measure everything you can afford to. Any lengthy computation or external I/O (i.e. anything that could block) should be recorded so that you can get a handle on what operations are behaving strangely now, and how they are changing over time as the service ages and adapts to new loads. It’s pretty easy to add too (see “Simple Instrumentation”).

Without some form of trend analysis you become like a slow-boiled frog that isn’t noticing how the surroundings are changing. All of a sudden what once took milliseconds now takes tens of milliseconds but you haven’t noticed it creep up. Everything appears to be normal right up to the point that performance drops off the cliff and you’re fire-fighting to bring it back under control.

You also cannot just monitor everything and expect to make sense of it all when a crisis hits. The data by itself is no use if you don’t understand how it relates to the moving parts of the system – you need to know why certain things change together, or not. From this you can build a heartbeat so that you really know how the system is evolving over time.

No comments:

Post a Comment