Friday 21 October 2016

When Mocks Became Production Services

We were a brand new team of 5 (PM + devs) tasked with building a calculation engine. The team was just one part of a larger programme that encompassed over a dozen projects in total. The intention was for those other teams to build some of the services that ours would depend on.

Our development process was somewhat DSM-like in nature, i.e. iterative. We built a skeleton based around a command-line calculator and fleshed it out from there [1]. This skeleton naturally included vague interfaces for some of the services that we knew we’d need and that we believed would be fulfilled by some of the other teams.

Fleshing Out the Skeleton

Time marched on. Our calculator was now being parallelised and we were trying to build out the distributed nature of the system. Ideally we would like to have been integrating with the other teams long ago but the programme RAG status wasn’t good. Every other team apart from us was at “red” and therefore well behind schedule.

To compensate for the lack of collaboration and integration with the other services we needed we resorted to building our own naïve mocks. We found other sources of the same data and built some noddy services that used the file-system in a dumb way to store and serve it up. We also added some simple steps to the overnight batch process to create a snapshot of the day’s data using these sources.

Programme Cuts

In the meantime we discovered that one of the services we were to depend on had now been cancelled and some initial testing with another gave serious doubts about its ability to deliver what we needed. Of course time was marching on and our release date was approaching fast. It was fast dawning on us that these simple test mocks we’d built may well have to become our production services.

One blessing that came out of building the simple mocks so early on was that we now had quite a bit of experience on how they would behave in production. Hence we managed to shore things up a bit by adding some simple caches and removing some unnecessary memory copying and serialization. The one service left we still needed to invoke had found a more performant way for us to at least bulk extract a copy of the day’s data and so we retrofitted that into our batch preparation phase. (Ideally they’d serve it on demand but it just wasn’t there for the queries we needed.)

Release Day

The delivery date arrived. We were originally due to go live a week earlier but got pushed back by a week because an important data migration got bumped and so we were bumped too. Hence we would have delivered on time and, somewhat unusually, we were well under budget our PM said [2]. 

So the mocks we had initially built just to keep the project moving along were now part of the production codebase. The naïve underlying persistence mechanism was now a production data store that needed high-availability and backing up.

The Price

Whilst the benefits of what we did (not that there was any other real choice in the end) were great, because we delivered a working system on time, there were a few problems due to the simplicity of the design.

The first one was down to the fact that we stored each data object in its own file on the file-system and each day added over a hundred-thousand new files. Although we had partitioned the data to avoid the obvious 400K files-per-folder limit in NTFS we didn’t anticipate running out of inodes on the volume when it quickly migrated from a simple Windows server file share to a Unix style DFS. The calculation engine was also using the same share to persist checkpoint data and that added to the mess of small files. We limped along for some time through monitoring and zipping up old data [3].

The other problem we hit was that using the file-system directly meant that the implementation details became exposed. Naturally we had carefully set ACLs on the folders to ensure that only the environment had write access and our special support group had read access. However one day I noticed by accident that someone had granted read access to another group and it then transpired that they were building something on top of our naïve store.

Clearly we never intended this to happen and I’ve said more about this incident previously in “The File-System Is An Implementation Detail”. Suffice to say that an arms race then developed as we fought to remove access to everyone outside our team whilst others got wind of it [4]. I can’t remember whether it happened in the end or not but I had put a scheduled task together than would use CALCS to list the permissions and fail if there were any we didn’t expect.

I guess we were a victim of our success. If you were happy with data from the previous COB, which many of the batch systems were, you could easily get it from us because the layout was obvious.

Epilogue

I have no idea whether the original versions of these services are still running to this day but I wouldn’t be surprised if they are. There was a spike around looking into a NoSQL database to alleviate the inode problem, but I suspect the ease with which the data store could be directly queried and manipulated would have created too much inertia.

Am I glad we put what were essentially our mock services into production? Definitely. Given the choice between not delivering, delivering much later, and delivering on time with a less than perfect system that does what’s important – I’ll take the last one every time. In retrospect I wish we had delivered sooner and not waited for a load of other stuff we built as the MVP was probably far smaller.

The main thing I learned out of the experience was a reminder not to be afraid of doing the simplest thing that could work. If you get the architecture right each of the pieces can evolve to meet the ever changing requirements and data volumes [5].

What we did here fell under the traditional banner of Technical Debt – making a conscious decision to deliver a sub-optimal solution now so it can start delivering value sooner. It was the right call.

 

[1] Nowadays you’d probably look to include a slice through the build pipeline and deployment process up front too but we didn’t get any hardware until a couple of months in.

[2] We didn’t build half of what we set out to, e.g. the “dashboard” was a PowerShell generated HTML page and the work queue involved doing non-blocking polling on a database table.

[3] For regulatory reasons we needed to keep the exact inputs we had used and couldn’t guarantee on being able to retrieve them later from the various upstream sources.

[4] Why was permission granted without questioning anyone in the team that owned and supported it? I never did find out, but apparently it wasn’t the first time it had happened.

[5] Within reason of course. This system was unlikely to grow by more than an order of magnitude in the next few years.

No comments:

Post a Comment