Wednesday, 9 December 2015

Don’t Be Afraid to Throw Away Data

One of the problems I’ve seen come up in various projects is what to do with all the data we’re being given that we don’t need to use at that moment in time. For example, I worked on a new system which was estimated as needing to hold around 100 TB when in production, as it had a data retention period of just over a year. We only needed a subset of it initially.

More recently the same problem came up on a 3rd party data feed where the vast majority of the data could be discarded because only a couple of attributes were actually being used. In both cases I struggled to convince the business to reconsider mindlessly hoarding data in production that they didn’t need, either now or in the foreseeable future.

Storage Costs

Fundamentally, storing the data as part of the production data set was going to cost a non-trivial amount of money. Whilst 100 TB does not sound like a huge amount these days, once you consider that it’ll be held on top class storage (e.g. solid state), the cost begins to become noticeable. In contrast there are plenty of really cheap storage options for the parts of the data that likely have an SLA ranging from months to “never”.


We should also not forget that the more data we store in our production data stores the harder it will be to recover when (not if) something goes wrong. Why waste time restoring non-essential data when the aim is to get the business back on its feet ASAP? If you treat every piece of data the same you have no ability to prioritise.

What If?

In both instances the argument from the business was “what about when we need to use the data in the future?”. They were worried that if they threw it away and later discovered that it was useful then they’d have to wait ages again until they had accumulated enough.

In both cases what they failed to distinguish was the different use cases for the data. To them it was an all-or-nothing deal and I strongly suspect that was down to the mentality of using a single database product so they could keep all the eggs in one basket.

Production Queries versus Analysis

Production infrastructure will usually be sized and tuned to cope with the fixed subset of requests it has to serve. The more varied the demands, the harder it is to provide a service that meets all of them and therefore its SLAs. If those demands involve the ability to run ad-hoc queries then all bets are off. I’ve seen people crash production databases by running poorly written ad-hoc queries (usually by accident).

In contrast, in my experience, data analysis requirements often come with much lower expectations. It’s entirely possible that just a sample of the data might be required rather than every byte ever produced. The data store may be tuned and arranged completely differently if it’s likely to be handling unknown queries. Given the less critical nature of the data it probably comes with far lower support guarantees, and therefore running costs.

Partition Appropriately

The idea of using separate databases for separate purposes is nothing new – the traditional “transactional versus reporting” split has been around for decades. It’s just another specialisation of the more general principle regarding the Separation of Concerns.

With ever cheaper hardware and cloud computing at one’s fingertips it might seem that modern databases can handle any disparate load you care to throw at them because most data sets could probably fit into RAM these days if you decided to spend the money.

Sadly the cost of enterprise-grade hardware still makes me wince, especially when the internal price factors in all the costs of the data centre, infrastructure staff, etc. Only a couple of years ago I was quoted £36 per GB for storage on an enterprise project expected to store many tens of terabytes [1].
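To put that quote into perspective, here is a rough back-of-the-envelope sketch. Only the £36 per GB price comes from the quote above; the 50 TB volume, the 10% "needed" subset, the decimal 1 TB = 1,000 GB conversion and the function name are all assumptions made purely for illustration.

```python
PRICE_PER_GB_GBP = 36  # the internal price quoted in the text


def storage_cost_gbp(terabytes, price_per_gb=PRICE_PER_GB_GBP, gb_per_tb=1000):
    """Cost in pounds of holding `terabytes` at a flat per-GB price.

    Assumes decimal units (1 TB = 1,000 GB) and no volume discount.
    """
    return terabytes * gb_per_tb * price_per_gb


# Hypothetical volumes: "many tens of terabytes" taken as 50 TB,
# with only a tenth of it actually needed in production.
everything = storage_cost_gbp(50)
needed_only = storage_cost_gbp(5)

print(f"Store everything: £{everything:,}")
print(f"Store the subset: £{needed_only:,}")
print(f"Difference:       £{everything - needed_only:,}")
```

Under those assumptions the gap runs well into seven figures, which is the sort of number that ought to prompt a conversation about what really needs to live on the expensive tier.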

Many businesses are still holding on tightly to their own data centres, for various reasons, and so the answer is not always as clear cut as it first appears.

Deferring Decisions

In both cases what I was essentially trying to do was help the business defer some decisions that I suspected were not important in the shorter term. Rather than blindly assume that all data is valuable and get stuck spinning our wheels on speculative requirements, we should consider whether dropping the data is the easiest approach, for now. If that’s absolutely not possible, then consider other ways to put the unused data to one side until we know more about how it will be used.

In the former case cited at the beginning we were continually getting bogged down in discussions about how to store the data that we didn’t understand up-front in a way that would make it available at a (much) later date. Aside from being a schema design issue, it also meant we had to factor it into the discussions around performance. In the end we reached an agreement where we could dump all the incoming raw data (after processing) onto a compressed volume so that it wouldn’t be lost, but the cost of understanding and re-importing what we didn’t understand today would be borne at the time when it was actually required.
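As a minimal sketch of that "dump it now, understand it later" approach, the snippet below appends each raw record, untouched, to a gzip-compressed file of JSON lines. This is not what we actually built: the gzip file merely stands in for the compressed volume, and the function names and the JSON-lines format are assumptions for illustration.

```python
import gzip
import json


def archive_raw(record, path):
    """Append one raw record, as-is, to a compressed JSON-lines file.

    No schema is imposed here; making sense of the fields is
    deliberately deferred until somebody needs them.
    """
    with gzip.open(path, "at", encoding="utf-8") as archive:
        archive.write(json.dumps(record) + "\n")


def read_archive(path):
    """Yield the archived records back, in the order they were written."""
    with gzip.open(path, "rt", encoding="utf-8") as archive:
        for line in archive:
            yield json.loads(line)
```

The point of the design is that writing stays trivially cheap, while the cost of parsing, modelling and re-importing the data is only paid if and when a real requirement turns up.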

As for the latter case we proposed keeping the production cache tiny by only storing what mattered, and that the full payload would be pushed out to a queue where it could be imported into an independent, non-production database organised for analysis.
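The shape of that proposal can be sketched in a few lines. Everything here is illustrative: the attribute names are hypothetical, a plain dict stands in for the production cache, and Python's in-process queue.Queue stands in for whatever real message queue would feed the analysis database.

```python
import queue

# Hypothetical: the couple of attributes production actually uses.
USED_ATTRIBUTES = ("id", "price")


def handle_feed_message(message, production_cache, analysis_queue):
    """Keep only what matters in production; hand the rest off untouched."""
    # The full payload goes to the queue for the non-production importer.
    analysis_queue.put(message)
    # The production cache only ever sees the attributes it needs.
    production_cache[message["id"]] = {
        name: message[name] for name in USED_ATTRIBUTES if name in message
    }


cache = {}
outbound = queue.Queue()
handle_feed_message(
    {"id": "ABC", "price": 1.23, "venue": "X", "notes": "unused in production"},
    cache,
    outbound,
)
print(cache)             # only the attributes production uses
print(outbound.qsize())  # one full payload waiting for the analysis side
```

Nothing is thrown away in this arrangement; the production store simply never has to carry, back up or restore the parts nobody is using yet.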


[1] The costing model was entirely based around the notion of SAN storage for everything. The modern document-oriented databases like MongoDB are architected for commodity hardware which really messes with those kinds of costing models.

1 comment:

  1. When I used to work in the backup & archiving world I always proposed that the killer product isn't one that performs backup or archiving but instead tells you (well figures out) what data you could delete!