Tuesday 16 September 2014

The Surgical Team

I first read Fred Brooks’ essay The Surgical Team (from his seminal work The Mythical Man Month) over a decade ago. More recently I heard Allan Kelly talk at Agile Cambridge 2013 about Xanpan and the same idea of building a team and flowing the work through them came up again. This has always seemed perfectly logical to me, but as a freelancer who is hired for a project which tends to be quite long lived, I’ve never seen this play out for real. That changed last year and I got to be part of a surgical team.

First On The Scene

It was the Monday after the so-called “hurricane” and I was the first programmer from our team in that morning. Another would be in later when the chaos died down and the other two had the day off. I grabbed a coffee and put my bag down, but before I could unpack my laptop I was told to meet the project manager in a meeting room as soon as I could. Needless to say I was expecting the worst…

On the contrary the meeting took an entirely different tone. Due to an unexplainable performance problem a tactical solution was now needed to provide a low-latency service for doing some simple data mapping on a multi-GB data set. As always the service needed to be in production ASAP, which in this case meant 2 weeks. I was told we should drop what we were working on (a greenfield project in mid-flow) and focus entirely on this new problem.

Backup Arrives

By the time my first fellow programmer had made it in I had already worked out a possible solution with the main architect (it wasn’t rocket science) and we got started on building it. I created a Git repo, a Trello board to track the tasks and the skeleton Visual Studio solution. After that was pushed I started working on the data mapping logic and my colleague started on putting together the web API aspect so that we could stay out of each other’s way. When the other two team members came back the following day, one picked up the continuous integration aspect to get the build pipeline working, and the other started putting together the performance tests as they were essential to our success. Nearly everything was written in a test-first manner with a focus on acceptance tests for the service itself and unit tests for some of the internal machinery that would have been hard to test at system level.

We had a functionally complete demo service up and running within 3 days that the downstream team could code against, and by the end of the week we had started to bottom out some of the deployment issues which were only held up because the test and production infrastructure was not available. The service itself was ready within 2 weeks and what spilled over were the deployment and operations aspects which only became apparent once we had infrastructure to deploy on. The other three developers had already gone back to the original project by this time and I was left to mop up the loose ends.

Post Mortem

As a team we had been together for less than 6 months, with the 4 (experienced) developers all being new, both to the client and each other. Looking back it had taken us more than two weeks just to get the build pipeline working on our main project, so being able to re-purpose the CI and development servers was a massive leg up. We also had existing security groups and accounts that we could re-use for the services whilst waiting for the project specific ones to be created. Finally the service itself was a web API which we were already creating for the main project and so we could drag over a lot of the scripts and common helper code we had written and tested there. One of the best things about writing code that is “always ready to ship” is that you can reuse it in a hurry with a large degree of confidence that it’s already been decently tested.

What is probably evident from the second half of the story is that being new to the client we had no first hand experience of what it meant to put a service into production, so once we reached that point we slowed to a crawl. For instance we did not know who we needed to speak to about the support and operational requirements and so couldn’t have the conversation early enough to implement the service heartbeat, sort out the logging format, event log IDs, etc. when we easily had the capacity to handle them.

Epilogue

It was a pleasurable feeling to experience that time pressure again, but only because we knew it had an end close by and ultimately we were going to be heroes if the service was delivered on time. I’m also pleased to see that even under such pressure we stuck to the same development process we were already using - no testing corners were cut either because of the time pressure or the “allegedly” tactical nature of the solution. In essence the fastest way was the right way.
