Tuesday, 22 December 2015

The Cost of Not Designing the Database Schema

The tale I wrote about in “Single Points of Failure - The SAN” didn’t entirely conclude at the point the issue was identified and apparently resolved. Whilst the vast majority of problems disappeared, there was still a spike every now and then that caused the simple web service we wrote to take hundreds of milliseconds to respond, way more than even a gen 2 garbage collection would take. We also logged when garbage collections occurred and they were never in sight when this glitch showed up.

After taking some time off I ended up joining the team who were responsible for calling that tactical web service and so I became privy to the goings-on upstream. It turned out the remaining blips were often occurring when an early morning batch process was run. It made little sense at the time that it could affect an entirely unrelated service, but with what I now knew about the SAN I felt the evidence pointed to a smoking gun. But how to truly explain it?

More Performance Woes

One of the changes being made when I joined this team was increased visibility (for the team) into how the services they owned were behaving in production. One service in particular was beginning to show signs of trouble, and with the Christmas period looming it was felt something needed to be done about it pronto.

Interestingly the investigation of timeouts caused me to start correlating data with the other service we had had problems with earlier. On one particular day this daily batch process was delayed by a couple of hours and on that very same day the unexplained timeouts in the downstream service shifted too. Whilst correlation does not imply causality, the smoke from the gun was thickening. But it still didn’t make sense how the problem was “jumping the cracks”.

The investigation for my current team’s service turned to the Oracle database and it unearthed some stats that showed the database was making quite a few reads to satisfy the most common query type – retrieving the transactions for an account.

The Mists Begin to Clear

I started to apply the “5 Whys” technique to see if I could piece together a coherent picture that would address the immediate concern, but might also encompass the other one. The question I started with was this:

“Why are the upstream service HTTP requests timing out?”
  1. Because they are waiting for a database connection. Why?
  2. Because each query is taking much longer. Why?
  3. Because the database is constantly hitting the SAN. Why?
  4. Because the database has to read so many pages. Why?
  5. Because the table being queried is badly organised.
Switching to the problem of unexplained timeouts in the other service for a moment, it all started to make sense. The batch process that runs in the early morning generates a huge number of “non-cacheable” reads (essentially a table scan), which saturates the SAN and therefore causes SAN-related problems similar to the ones we had seen before.

Sadly my hypothesis was never acknowledged or discussed outside the team as they had stopped asking questions when they realised the database query was taking too long. However within the team it was accepted as highly plausible so I felt comfortable that at least we had some closure, and more importantly a theory to consider if things showed up again.

The temporary solution to the database problem was to stick a whole load more RAM in it to vastly improve caching and therefore reduce query times enough during the day to avoid the bottlenecks for now.

I posited that this change would also fix (or at least heavily reduce) the problems of unknown timeouts in the other service because Oracle would need to perform far fewer physical reads, and therefore the load on the SAN would also be reduced. This is exactly what I observed, so the gun was smoking even more now.

Addressing the Root Cause

Fundamentally the problem was down to the database having to do way more I/O work than should be necessary to satisfy the query. The table in question is essentially a set of transactions for an account which are being queried by the account’s ID.

The table was implemented as a simple heap with an index on the account ID. Whilst this meant that the transactions for an account could be found via the index, due to the heap structure the transactions were spread right across the table’s entire set of pages. Essentially the database did a few reads of the table index to find the rows in question and then (pathologically speaking) did one read per row to get the data itself. Hence, for accounts with many transactions that was a huge number of random I/Os.

I wasn’t there when the table was designed and so I have no knowledge about what the rationale was. Maybe it was just “the simplest thing that would possibly work” and they thought they’d have time to address scalability later? Or maybe they expected a different read / write pattern? Either way it’s not the structure I would have expected out-of-the-box for this kind of table.

Given that the table stores data for an account, and the key for that account is the primary means of lookup, we should be looking to keep all the data for an account close together. Hence using a table physically structured around the account ID (a “clustered index” on SQL Server and “index-organised table” on Oracle) will provide fast access and excellent locality of reference because all the pages for each account will be stored together. This way the database only has to navigate the index to the start of the specific account’s data and then do a few sequential page reads to get the rest.

No Time to Fix It

The problem with modern businesses is that they run 24x7 these days and so there is no window for downtime and maintenance. So whilst a differently organised table may well now be the best approach, the cost of implementing that change may be too high. Due to the volume of data, taking the database offline and rebuilding it was not considered possible given the current state of the business and market.

Instead the DBAs decided to add a covering index, which could be built online and included all the data, so the query optimiser could satisfy the main query solely from the index. Essentially they created the clustered table via an index. Of course every write now had to update the table, the original index and the new one. It should have been possible at that point to drop the original index, but I’m not sure if that happened as they’d also have to prove it wasn’t being used by another query.

Back to the SAN

In the meantime I was asked to investigate some other unexplained timeouts that occurred well outside the morning batch processing window. Knowing what we now did about the database and the SAN, someone questioned whether the DBAs were already implementing this new index in production.

They weren’t, but they were testing the approach in the QA environment. The correlation again was very strong, so someone investigated the topology of the databases in the QA environment and discovered that some of the storage pools shared a portion of the SAN with production, which was clearly unintentional. Oops.

Early Warning Indicators

Hindsight is a wonderful thing and it’s good that they were gaining visibility of their service’s behaviour, but that was only able to identify immediate glitches. There also needs to be some element of trend analysis to spot when things are beginning to head south.

For me the stance on instrumentation is that you measure everything you can afford to. Any lengthy computation or external I/O (i.e. anything that could block) should be recorded so that you can get a handle on what operations are behaving strangely now, and how they are changing over time as the service ages and adapts to new loads. It’s pretty easy to add too (see “Simple Instrumentation”).
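
The kind of thing I have in mind is a simple timing scope that wraps any potentially blocking call. The following is only a minimal sketch – the OperationTimer name and the reporting delegate are illustrative, not from any particular library:

using System;
using System.Diagnostics;

// A disposable timing scope: wrap any lengthy computation or external I/O in
// a using block and record how long it took. The reporting action is injected
// so the measurements can go to a log file, a metrics store, etc.
public sealed class OperationTimer : IDisposable
{
  private readonly string _operation;
  private readonly Action<string, TimeSpan> _report;
  private readonly Stopwatch _stopwatch = Stopwatch.StartNew();

  public OperationTimer(string operation, Action<string, TimeSpan> report)
  {
    _operation = operation;
    _report = report;
  }

  public void Dispose()
  {
    _stopwatch.Stop();
    _report(_operation, _stopwatch.Elapsed);
  }
}

Wrapping, say, the account transaction query in a scope like this means every call gets measured, and the recorded timings are exactly the raw material the trend analysis below needs.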

Without some form of trend analysis you become like a slow-boiled frog that isn’t noticing how the surroundings are changing. All of a sudden what once took milliseconds now takes tens of milliseconds but you haven’t noticed it creep up. Everything appears to be normal right up to the point that performance drops off the cliff and you’re fire-fighting to bring it back under control.


You also cannot just monitor everything and expect to make sense of it all when a crisis hits. The data by itself is no use if you don’t understand how it relates to the moving parts of the system – you need to know why certain things change together, or not. From this you can build a heartbeat so that you really know how the system is evolving over time.

Acceptance Test Is Not an Environment

In a traditional software development process, where you do analysis, then development and then testing, shared environments are common, and therefore there is often a one-to-one relationship between the name of the environment and the type of testing performed. For example UAT (User Acceptance Testing) tends to come right at the very end of the process just before production. If you are working on a back-end system there may well be no “U” in the UAT and so it really just becomes a more production-like test environment.

In a modern development process there is more of a distinction between the type of tests we are running and the environment in which we are running them. We are always trying to achieve a balance between getting the fastest feedback possible on whether our changes are correct, whilst still ensuring that enough of the system is being tested in a manner similar to production so that we minimise any problems due to environmental differences.

In my C Vu article “The Developer’s Sandbox” I described a number of different ways that you might partition a system (and test data) to allow a variety of different levels of non-unit testing. In essence I am mostly interested in running fast, automated test suites in some isolated manner to gain rapid feedback. However I also like to do a bit of manual exploratory testing, especially when making changes around deployment or infrastructure code. And demoing new features is also important to ensure that we’re building “the right thing”.

What I’ve found is that there is often some confusion when talking about testing that conflates the suite of tests being exercised with the configuration of the system it’s being run on. For example I will try and run every automated test possible on my local machine before committing my changes. This means I’m probably running some combination of unit, component, integration, acceptance and system tests against a variety of mock and real components and services depending on how expensive or not they are to use.

Similarly on the build server we will run exactly the same suite of tests but because we have more time we can use the real dependencies where possible and only rely on mocks where we have to. The closer the code gets to production the closer the test environment has to get to production too.

As a consequence this means there is no one-to-one relationship between the test suite configuration and the environment where it is run. By default we tend to optimise for the developer feedback loop, which means the out-of-the-box configuration is usually “localhost” everywhere [1]. In contrast the build server, development and test environments will likely have real networks, databases, message queues, etc. in play, and so the same suite of tests will exercise more of the real infrastructure and integration for a more production-like quality, perhaps at the expense of performance. The point is that we aim to run the same tests and only vary the configuration. Hence when talking about automated testing we may need to qualify which environment configuration we were running with to avoid confusion.
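
To make “only vary the configuration” concrete, here is a minimal sketch of the idea; the interface, class names and the TEST_DEPENDENCIES environment variable are all hypothetical, but the shape is that the tests resolve their dependencies through one factory and a single setting decides whether the real thing or an in-process fake is used:

using System;

public interface INotificationService
{
  void Send(string message);
}

public sealed class FakeNotificationService : INotificationService
{
  // A real fake would capture messages in memory for assertions.
  public void Send(string message) { }
}

public sealed class HttpNotificationService : INotificationService
{
  private readonly Uri _endpoint;

  public HttpNotificationService(Uri endpoint) { _endpoint = endpoint; }

  public void Send(string message)
  {
    // A real implementation would POST the message to _endpoint.
    Console.WriteLine($"POST {_endpoint}: {message}");
  }
}

public static class TestServices
{
  public static INotificationService CreateNotificationService()
  {
    // Default to the fake for the fast, offline developer feedback loop;
    // the build server sets TEST_DEPENDENCIES=real to use the real service.
    var mode = Environment.GetEnvironmentVariable("TEST_DEPENDENCIES") ?? "fake";

    return mode == "real"
      ? (INotificationService)new HttpNotificationService(new Uri("http://localhost/notifications"))
      : new FakeNotificationService();
  }
}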

One natural observation might be that it’s not right to call the running of the acceptance test suite on a developer’s local machine “acceptance tests” as some element of the “acceptance” must come from it being run in a more production-like manner. Whilst I get the sentiment, I think that misses the point about developers leveraging the traditionally more costly tests in a constrained, but by no means useless, environment to gain earlier feedback around the functional behaviour. No, it doesn’t mean it’s signed-off and ready for production just because it works on my machine, but it does mean that at a fundamental level the change is sound and worthy of pushing further down the deployment pipeline.



[1] I always say that I should be able to unplug from the network and go out into the garden where there is no Wi-Fi and still be able to write code and have a high degree of confidence that it works. Modern tooling (and a sane approach to developer licensing) makes that possible even when databases, message queues, etc. are in the equation without having to restrict ourselves to relying solely on unit testing.

Observable State versus Persisted State

A while back I was working on a replacement service that was intending to use one of those new-fangled document-oriented databases (Couchbase as it goes). During the sprint planning meeting we had a contentious story around persisting data and what it meant to handle multiple writes in a single “business transaction”. There was some consternation that, because there is no native transaction support (or locking) to ensure we get an atomic commit on success, or a rollback if a problem occurs somewhere, we couldn’t deliver the story on that technology stack.

Effectively we had reached the point where we were handling the stories around idempotency, and the story in question was worded in a way that assumed a classic relational, all-or-nothing style of transactional writing, which we naturally couldn’t have. The crux of the question was whether we could perform our writes in such a way that, if an error occurred, any invariants would still hold, and that if the request was retried we’d be able to complete it after being left temporarily in a potentially half-finished state.

Atomic Multi-Document Writes

The problem revolved around creating a number of child documents (e.g. Orders) for a root document (e.g. Customer). When using a traditional database the child records could just be written as-is because they will not be visible until the transaction is committed (ignoring dirty reads). If an error occurs at any point whilst writing, the whole lot is removed. If the database goes down before the commit is persisted it will roll back the transaction if it needs to on restart. Either way any invariants violated during the writes are invisible outside the transaction.
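
For contrast, a minimal sketch of that classic all-or-nothing write using a plain ADO.NET transaction might look like this (the table and column names are purely illustrative, and an already-open connection is assumed):

using System.Data;

public static class TransactionalWriter
{
  public static void SaveCustomerWithOrders(IDbConnection connection)
  {
    using (var transaction = connection.BeginTransaction())
    {
      try
      {
        using (var command = connection.CreateCommand())
        {
          command.Transaction = transaction;

          // None of these writes are visible to readers until Commit().
          command.CommandText = "INSERT INTO Customer (Id, Name) VALUES (1, 'example')";
          command.ExecuteNonQuery();

          command.CommandText = "INSERT INTO CustomerOrder (Id, CustomerId) VALUES (101, 1)";
          command.ExecuteNonQuery();
        }

        transaction.Commit();   // the all-or-nothing moment
      }
      catch
      {
        transaction.Rollback(); // any partial writes are discarded
        throw;
      }
    }
  }
}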

Non-Atomic Multi-Document Writes

Whilst writes are atomic at a document level, they are not when multiple documents (or many, separate writes to the same document) are involved. As such we need to perform each insert, update and delete in a way that assumes we might lose connectivity at that moment.

The first problem is ensuring that a failure after any single write cannot leave the data in a state where any invariants have been violated. For instance if the model says that there is a two-way relationship between two documents, then only having one-half of it is unacceptable because navigating the other way will generate an error.

As a consequence of partially written data being a possibility due to a lack of transactions, we likely have to adopt an error handling strategy that either unwinds the state or moves it forward to achieve the original desired outcome [1]. For this to happen we will almost certainly be looking at using idempotent writes where we can try the same action again and again and not incur any additional side-effects if it has already completed successfully (e.g. a counter is incremented once, and only once).
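
The counter is the simplest way to picture it; the sketch below uses an in-memory stand-in rather than the real data store, but shows the write being guarded by a record of which requests have already been applied:

using System.Collections.Generic;

// An idempotent counter: the increment is only applied if this request ID
// has not been seen before, so retrying the same request after a failure
// incurs no additional side-effects.
public sealed class IdempotentCounter
{
  private readonly HashSet<string> _appliedRequests = new HashSet<string>();

  public int Value { get; private set; }

  public void Increment(string requestId)
  {
    if (_appliedRequests.Contains(requestId))
      return; // already applied - a retry is harmless

    Value += 1;
    _appliedRequests.Add(requestId);
  }
}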

The Observable Effects of Idempotency

And so we come back to the problem we encountered when discussing the story – what exactly does idempotency mean? The way it was worded in the story was that any failed business transaction must not leave any residual state behind. Given the way that the database works and the kind of business transaction we were trying to perform, this was simply impossible to achieve. With an air of defeat the discussion turned to how we could switch back to using a traditional transactional database to meet this story.

However, I wanted clarification around what it meant for “no state” to be left within the database. What I thought the intent of that phrase really meant was “no observable state” should be left around if the transaction fails. If we consider the system as a black box, not a white one, then we can leave residual state lying around just so long as it is not visible outside the system. And as long as the system is only accessible via our public API we can control how temporary state can remain hidden.

But how? In this instance if we ordered our writes carefully enough we can ensure that any invariants remain intact after every single write. We just need to be careful about how we define when a piece of data becomes visible through the public API.

Example: File-System Writes

To understand how this can be achieved think about how a modern day editor, such as MS Word, saves documents. It does not just open the file and start writing because if it did and the machine failed both the old and new documents would be lost. Instead it follows a sequence something like this, to minimise the loss of data:
  1. Write the new document to a temporary file.
  2. Rename the current backup file to a temporary name.
  3. Rename the old document to make it the backup.
  4. Rename the temporary file to the document’s name.
  5. Delete the old backup file.
In fact this pattern of file-system behaviour (write + rename) is so common that NTFS even recognises it and makes sure the newly written document carries over the previous file’s creation date, so that it appears as if the old file was simply updated.

What makes this work is that the really dangerous work is done off to the side (i.e. writing the new version of the document) leaving just some file-system metadata changes (3 renames and a delete) to “commit” the change. I touched on this idea before in “Copy & Rename (Like Copy & Swap But For File-Systems)” after having to deal with torn files due to a badly written file transfer process.
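
A minimal sketch of the same idea in code might look like the following; the paths and naming are illustrative, and File.Replace is used to do the swap-and-backup step, leaving the write of the temporary file as the only risky part:

using System.IO;

public static class SafeSave
{
  // Write the new content off to the side, then swap it into place, keeping
  // the previous version as a backup.
  public static void SaveDocument(string path, string contents)
  {
    var temporaryPath = path + ".tmp";
    var backupPath = path + ".bak";

    // 1. Write the new document to a temporary file (the dangerous part).
    File.WriteAllText(temporaryPath, contents);

    // 2-5. Swap the temporary file into place, demoting the old document to
    // the backup. If the document doesn't exist yet a plain move will do.
    if (File.Exists(path))
      File.Replace(temporaryPath, path, backupPath);
    else
      File.Move(temporaryPath, path);
  }
}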

Idempotent Writes

The way to achieve the same effect in the database is also by writing in a particular way and by tagging each business transaction with a unique ID that we can use to replay or recover from after a failure.

In our example we split the writes up into two stages:
  1. First insert the child documents.
  2. Then update the parent document to refer to them.
It might seem as though the child documents would be visible after the initial write, but they aren’t because the public API only publishes the IDs of children that are referenced in the parent. As such there may be state persisted, but it is not observable until the single, atomic write to the parent document at the end.

The relationship is actually bidirectional (you can find a child and lookup its parent) which might seem like a loophole until you consider the previous point – the child is not publicly visible until the parent has been committed. You can’t ask for the child because you have no way of knowing of its existence via the public API.

The way the idempotent ID works is that it is logged against certain writes so that we can tell what has and hasn’t been performed already. So in our example above each child document is created (possibly with the idempotent ID [2]) and when we add the references into the parent we tag it with the idempotent ID so that we know we completed the transaction. If it fails at any point we can just discard the temporary child documents and recreate them. This does mean we have the potential for detritus to be left around on failures, but they should be rare and can be “garbage collected” in slow time using a background process [2].
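
Pulling that together, here is a rough sketch of the two-stage write. The IDocumentStore interface and the document shapes are hypothetical stand-ins, not the real Couchbase API, but the ordering of the writes and the tagging with the idempotent ID are the point:

using System;
using System.Collections.Generic;

public interface IDocumentStore
{
  void Upsert(string key, object document);
  T Get<T>(string key) where T : class;
}

public sealed class CustomerDocument
{
  public List<string> OrderIds { get; set; } = new List<string>();
  public HashSet<string> AppliedRequests { get; set; } = new HashSet<string>();
}

public sealed class OrderDocument
{
  public string CustomerId { get; set; }
  public decimal Amount { get; set; }
}

public static class OrderWriter
{
  public static void AddOrders(IDocumentStore store, string customerId,
                               string requestId, IEnumerable<OrderDocument> orders)
  {
    var customer = store.Get<CustomerDocument>(customerId);

    if (customer.AppliedRequests.Contains(requestId))
      return; // the business transaction has already completed

    // Stage 1: write the child documents. They are persisted but not yet
    // observable because the public API only publishes children that the
    // parent refers to.
    var orderIds = new List<string>();
    foreach (var order in orders)
    {
      var orderId = customerId + "/order/" + Guid.NewGuid();
      store.Upsert(orderId, order);
      orderIds.Add(orderId);
    }

    // Stage 2: a single, atomic write to the parent both publishes the
    // children and records the idempotent ID, so a retry after a failure
    // either starts again (any orphans are garbage collected later) or is
    // simply a no-op.
    customer.OrderIds.AddRange(orderIds);
    customer.AppliedRequests.Add(requestId);
    store.Upsert(customerId, customer);
  }
}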

Scalability

This technique works for simple object models which is how I’ve used it. It can be extended to some degree if you are willing to add complexity to your model (and probably increase the number of I/Os) by creating more elaborate “invariants”. For example if the sender could have controlled the child document ID it might mean that the public API would have to navigate from child to parent to validate its existence (presence of the document alone not being enough).

Given the choice between using a classic transactional database and having to think really hard about this stuff it’s probably not worth it. But if you have a simple object model and are looking at alternatives for performance reasons, then you need to think a bit differently if you’re going to cope without transactions.


[1] Just ignoring a part-failed request and leaving the data in a valid, but unusual state, should be possible but highly undesirable from a support perspective. It’s hard enough piecing together what’s happened without being plagued unnecessarily by zombie data.


[2] It’s not essential if you always re-submit and roll forward, but can help in the aftermath if cleaning up. It would probably be required though if you needed to roll-back first as it may be the only key you have to the document at that point.

Wednesday, 9 December 2015

Don’t Be Afraid to Throw Away Data

One of the problems I’ve seen come up in various projects is what to do with all the data that we’re being given that we don’t need to use at that moment in time? For example I worked on a new system which was estimated as needing to hold around 100 TB when in production as it had a data retention period of just over a year. We only needed a subset of it initially.

More recently the same problem came up on a 3rd party data feed where the vast majority of the data could be discarded because only a couple of attributes were actually being used. In both cases I struggled to convince the business to reconsider mindlessly hoarding data in production that they didn’t need, either now or in the foreseeable future.

Storage Costs

Fundamentally storing the data as part of the production data set was going to cost a non-trivial amount of money. Whilst 100 TB does not sound like a huge amount these days, once you consider that it’ll be held on top class storage (e.g. solid state), the cost begins to become noticeable. In contrast there are plenty of really cheap storage options for the parts of the data that likely has an SLA ranging from months, to “never”.

Redundancy

We should also not forget that the more data we store in our production data stores the harder it will be to recover when (not if) something goes wrong. Why waste time restoring non-essential data when the aim is to get the business back on its feet ASAP? If you treat every piece of data the same you have no ability to prioritise.

What If?

In both instances the argument from the business was one about “what about when we need to use the data in the future?”. They were worried that if they throw it away and later discover that it’s useful then they’d have to wait ages again until they had accumulated enough.

In both cases what they failed to distinguish were the different use cases for the data. To them it was an all-or-nothing deal, and I strongly suspect that was down to the mentality of using a single database product so they could keep all the eggs in one basket.

Production Queries versus Analysis

Production infrastructure will usually be sized and tuned to cope with the fixed subset of requests used by it. The more varied the demands the harder it is to provide a service that meets all its needs and therefore its SLAs. If those demands involve the ability to run ad-hoc queries then all bets are off. I’ve seen people crash production databases by running poorly written ad-hoc queries (usually by accident).

In contrast, in my experience, data analysis requirements often come with much lower expectations. It’s entirely possible that just a sample of the data might be required rather than every byte ever produced. The data store may be tuned and arranged completely differently if it’s likely to be handling unknown queries. Given the less critical nature of the data it probably comes with far lower support guarantees, and therefore running costs.

Partition Appropriately

The idea of using separate databases for separate purposes is nothing new – the traditional “transactional versus reporting” split has been around for decades. It’s just another specialisation of the more general principle regarding the Separation of Concerns.

With ever cheaper hardware and cloud computing at one’s fingertips it might seem that modern databases can handle any disparate load you care to throw at them because most data sets could probably fit into RAM these days if you decided to spend the money.

Sadly the cost of enterprise-grade hardware still makes me wince, especially when the internal price factors in all the costs of the data centre, infrastructure staff, etc. Only a couple of years ago I was quoted £36 per GB for storage on an enterprise project expected to store many tens of terabytes [1].

Many businesses are still holding on tightly to their own data centres, for various reasons, and so the answer is not always as clear cut as it first appears.

Deferring Decisions

In both cases what I was essentially trying to do was help the business defer some decisions that I suspected were not important in the shorter term. Rather than blindly assume that all data is valuable and get stuck spinning our wheels on speculative requirements, we should consider whether dropping the data is the easiest approach, for now. If that’s absolutely not possible, then consider other ways to put the unused data to one side until we know more about how it will be used.

In the former case cited at the beginning we were continually getting bogged down in discussions about how to store the data that we didn’t understand up-front in a way that would make it available at a (much) later date. Aside from being just a schema design issue it also meant we had to factor it into the discussions around performance. In the end we reached an agreement where we could dump all the incoming raw data (after processing) onto a compressed volume so that it wouldn’t be lost, but the cost of understanding and re-importing what we didn’t understand today would be borne at the time when it was actually required.

As for the latter case we proposed keeping the production cache tiny by only storing what mattered, and that the full payload would be pushed out to a queue where it could be imported into an independent, non-production database organised for analysis.

 

[1] The costing model was entirely based around the notion of SAN storage for everything. The modern document-oriented databases like MongoDB are architected for commodity hardware which really messes with those kinds of costing models.

Tuesday, 8 December 2015

Poor Performance of log4net Context Properties

Back in September last year I wrote a post about a simple, low-latency web service we had to put together in a short timeframe (see “The Surgical Team”). During this project we hit a serious performance snag with log4net caused by using its context properties collection in the log message.

The Early Warning Indicator

The web service we were building had to perform a simple lookup on a multi-GB data set in less than 10 ms. We were going to use .Net and ASP.Net and so given its garbage collected environment we sought to verify first of all that garbage collections were not going to be a problem. Hence we got the CI pipeline in place and wrapped up a call to a NOP web API handler in a performance test to measure the basic solution infrastructure.

After a few false starts with the test itself (and the test framework) we found that each call was only taking a couple of ms [1], with the occasional longer one as we hit a 2nd generation garbage collection (GC) which “stops the world” (at least in .Net 4). Even though the expensive GC meant we were occasionally more than an order of magnitude outside our SLA, the business were happy to go with it given the time constraints [2].
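
A rough sketch of that kind of check might look something like this (the URL, iteration count and percentile are illustrative, not the actual test): hammer the NOP endpoint and fail if the slower calls drift outside the SLA.

using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Net.Http;

public static class NopHandlerPerfTest
{
  public static void Run()
  {
    const string url = "http://localhost:8080/api/nop";
    var timings = new List<TimeSpan>();

    using (var client = new HttpClient())
    {
      for (var i = 0; i != 1000; ++i)
      {
        var stopwatch = Stopwatch.StartNew();
        client.GetAsync(url).Result.EnsureSuccessStatusCode();
        stopwatch.Stop();
        timings.Add(stopwatch.Elapsed);
      }
    }

    // Fail the test if the tail latency breaches the (illustrative) SLA.
    timings.Sort();
    var percentile99 = timings[(int)(timings.Count * 0.99)];

    if (percentile99 > TimeSpan.FromMilliseconds(10))
      throw new Exception("99th percentile latency exceeds the 10 ms SLA");
  }
}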

The Klaxon Goes Off

All of a sudden the core part of the build is succeeding but the performance tests are failing. We initially put it down to a blip, perhaps in the infrastructure, and ignore it for now. Then it trips again and again and so we decide to check the last few commits to see what’s happening as this now feels like it might be our code.

One of the developers had recently added log4net into the mix for diagnostic logging and to report SLA violations via the Windows event log, but that had been a few commits before things started going south. The commit that seemed to have pushed us over the edge was the addition of the HTTP request correlation ID to the log message. This was done via the context property bag in log4net, e.g.

%date %-5level %thread %property{correlationId} ...
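
For context, the property itself gets populated per request via one of log4net’s context property bags before any logging takes place; a minimal sketch (how the ID is extracted from the request is left to the caller) looks something like this:

using log4net;

public static class CorrelationLogging
{
  public static void TagRequest(string correlationId)
  {
    // ThreadContext.Properties is log4net's per-thread property bag;
    // %property{correlationId} in the pattern picks this value up.
    ThreadContext.Properties["correlationId"] = correlationId;
  }
}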

This didn’t seem quite right at first though as we’d used the same property bag elsewhere, e.g. in the event log message. But of course we quickly realised the event log message only happened when we were already outside the SLA.

We reverted the change and the performance tests were green once more. At this point we didn’t know what it was about using properties that was causing it, or even if that really was the culprit, so we looked for another way as the information was important for support and monitoring (see “Causality – Relating Distributed Diagnostic Contexts”).

(This also got noticed and formally reported by someone else a few months later, and is now tracked in the log4net JIRA under LOG4NET-421 and LOG4NET-429.)

Pattern Converters to the Rescue

Fortunately the property bag isn’t the only way to insert custom content into a log4net message (short of just concatenating it into the message text, which was our fall-back); it also supports custom fields, called “pattern converters”.

As you can see from the documentation you specify your own field names and use them as placeholders too, e.g.

%date %-5level %thread %correlationId ...

Unlike the built-in property bag you have to do a little bit more work, such as creating some (thread-local) storage for the property value that you can fish out later when you need to invoke log4net to write the message [3].

using System.IO;
using log4net.Util;

public class CorrelationIdConverter : PatternConverter
{
  protected override void Convert(TextWriter writer,
                                  object state)
  {
    // The ID is stored in a thread-local property.
    writer.Write(Correlation.Id);
  }
}
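
The Correlation class referenced above was just our own helper; a minimal sketch of it using plain thread-local storage (the real one preferred the ASP.Net HttpContext, as per footnote [3]) might be:

using System.Threading;

public static class Correlation
{
  private static readonly ThreadLocal<string> _id = new ThreadLocal<string>();

  public static string Id
  {
    get { return _id.Value; }
    set { _id.Value = value; }
  }
}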

You’ll also need to tell log4net about the field name and the class that should be invoked to format the value [4].

<converter>
  <name value="correlationId" />
  <type value="MyLib.CorrelationIdConverter, MyLib"/>
</converter>

We watched the build machine performance test results closely when the change was pushed but, just as we had hoped, the effect wasn’t noticeable.

 

[1] I am in no way suggesting that “a couple of ms” for an empty web API call could be considered decent performance. Quite frankly I was amazed when we discovered how much time and memory the ASP.Net pipeline itself used.

[2] We did suggest ways to mitigate this, such as fronting the service with a load balancer / service that could do a “best of three” as the service was read-only.

[3] Technically speaking it was stored in the ASP.Net HttpContext, if it exists, and if not we fall back to using traditional thread-local storage.

[4] We were actually using programmatic configuration of log4net there, but on another project we had to use the traditional log4net.config file and sadly it starts to add a bit of noise.

Friday, 4 December 2015

The Importance of Leading by Example

“I didn’t have time to write a short blog post, so I wrote a longer one instead.” – Mark Twain (mostly)

As I mentioned recently in “Missing the Daily Commute by Train” my previous engagement was a little out of the ordinary for me. Normally I’m just another programmer working in a delivery team alongside other developers, testers, BAs, etc. but in this instance I was doing a more hands-off consultant style role.

This is quite an unusual type of engagement for the consultancy through which I am currently working (Equal Experts). Whilst they prefer to put together an entire team, including developers, testers, UX specialists, etc. to try and ensure the best chance of delivering a successful outcome (a product the customer really wants), sometimes they have to settle for less control.

Small Cogs, Big Wheel

To date I’ve worked on a few different projects at a couple of sites and by-and-large the consultancy firm has only provided developers as that is what the client has wanted (they have preferred to provide their own people for the other roles). Even so, despite being there to deliver a software product we have naturally been able to put our many collective years of software development experience into effect and provided input, not only on the aspects of coding, but also on how the team works.

It’s very tempting to try and solve the entire organisation’s problems, but we have to remember that first-and-foremost we are there to deliver a specific project. If we can also impart other knowledge about how the team, and even possibly the division or organisation, might consider doing things better then that’s an added bonus. Shared correctly any advice is often welcomed and sometimes the desired outcome is achieved. Not every battle may be won, but a few changes here-and-there can often make a big difference to how the team behaves and the quality of work delivered.

Being a Bigger Cog

What was different about this recent engagement was that we were there explicitly to try and instigate such over-arching changes in the first place. Re-organising the teams from being silos grouped around similar skill sets (i.e. programming language/platform, testing, etc.) to cross-functional teams where the entire technology stack was catered for was an obvious change as it meant the delivery of any feature should mostly stay within the delivery team.

Much of what we did was to observe what was happening and try and correct the team’s and, to a much lesser extent, an individual’s behaviour. The latter was rare and really just signified a behaviour that was probably common to that person’s role rather than being about one specific individual.

As an outsider (to the team) this was reasonably easy to do because the advice often involved getting the right people to talk to each other. Learning to decipher a Kanban board and act on it appropriately is by no means trivial, but it generally involves focusing on helping only one person in the team – the Scrum Master. Many of the techniques we employ when running a project actually involve giving up on modern technology and tools and going back to pen and paper or whiteboards. In essence there is no learning curve for the tooling, only in the approach [1].

In contrast there are often many developers and testers in a team and getting all of them to understand what it takes to write sustainable software without actually helping them directly felt like a more insurmountable task. You can talk all day long about concepts like test-driven development and continuous integration but many of the barriers to implementing those lay at the architectural level.

As I mentioned at the beginning, the initial driver for the engagement was “an agile transformation” and, due to the organisation’s structure, the sponsor was effectively interested in us changing the way they managed the project rather than the way the software was developed. In short, changing the engineering practices was near the bottom of the list of priorities.

No Time to Improve

When the initial dust had settled around re-organising the teams and throwing out the vast quantities of paperwork they would generate to facilitate hand-offs, we got to start addressing the technical aspects. This, sadly, didn’t go anywhere near as well.

Where the team re-organisation happened very quickly (a couple of months) and, somewhat surprisingly, caused very little disruption, the problems with the architecture and development practices would require a significant investment in time and therefore also money. It would also likely have an impact on the delivery schedule due to the refactoring needed to get a fluid delivery pipeline in place [2].

As I discussed only recently in “Don’t Fail Fast, Learn Cheaply”, modern development practices often have an air of redundancy about them, and refactoring was a classic case in point. Right from the start the notion that you would rewrite existing code to support a sustainable pace of change in the future was rejected. When the management team focuses on how to stop the more disruptive 10% from being less productive, instead of enabling the other 90% to work much better, you know you have an uphill battle.

Ultimately the entire codebase, except for one small area, was not written with modern, automated testability in mind. As a consequence there would need to be a non-trivial amount of work just to get the code into a position where it was possible to even begin testing through unit or acceptance tests. Doing this “rewrite” costs time, which, when you are used to dedicating 100% of your time to delivering production code, means a substantial short-term hit.

A side-effect of having no slack in the schedule also means that there is no time to improve ones skills. Abraham Lincoln suggests that time spent “sharpening our axe” is essential as without learning we never find more efficient ways to do our work. In an industry like software development where the goal posts are moving all the time it’s paramount that we keep inspecting our axe to see when it’s becoming blunt.

The knock-on effects of the abyss they had found themselves in meant that they were in a vicious cycle where the architecture doesn’t easily support automated testing and changing the architecture to support automated testing was too difficult. With no rapid feedback there were lots of bugs which caused rework for testers and developers and that put more time pressure on them.

Eating the Elephant

There is a saying about how you go about eating an elephant – one bite at a time. With a system that resembles a Big Ball of Mud it’s often hard to even know where to start. Once upon a time I was brave enough to spend three days just sorting out the #includes (and forward declaring types) in a large C++ codebase just so that I could get the simplest unit test in place. If I had known it would take three days I’m not sure I would have started, but it paid huge dividends once it was done, both for the change I needed to make at the time and for subsequent ones. This kind of experience gives you the courage needed later on to be brave and tackle these kinds of seemingly insurmountable problems.

And this is where I began to notice how not being embedded in the team delivering code on a daily basis was beginning to constrain the work we, as consultants, were trying to do. From the perspective of the developers at the client it’s all very well having a consultant impress upon you the theory about how it will help in the long term, but I won’t be the one having the uncomfortable conversations about why I appear to be taking so long now.

I kept trying to put myself back in that position all those years ago to try and remind myself what it felt like so that I could find another way to make the developers more comfortable with doing the right thing. I also reread one of my earliest blog posts “Refactoring - Do You Tell Your Boss?” to find inspiration there too. I thought it might be possible to use pairing as a way to help ease the pressure but that never went anywhere either.

Accountability

In retrospect I think the difference between what happens when we’re inside the team helping them with delivery versus outside them consulting, is down to accountability. When we’re in the team our heads are on the block too and so leading by example is fundamental to getting people to follow. When we’re outside the team we’re ultimately not (seen as) accountable for the changes we might have a hand in advising on.

When you truly believe in techniques like refactoring, test-first development, pair programming, etc. you exude an air of confidence that makes it easier to just accept that it’s the best way to work. You stop asking for permission to do things that are rightfully yours to do and you focus on how best to achieve the desired outcome, which caters both for the short term and the longer term goals.

In the past when I’ve thought about what it means to “lead by example” I’ve probably focused on the mechanics of the various software engineering practices – what it means to develop in a test-first manner or how you refactor something. What I’m realising now is that showing someone how to use the tools and techniques is not nearly as important as showing them what can be achieved by applying them. By tackling the really gnarly messes one bite at a time we illustrate that it’s plausible to eat the entire elephant, and that is the impetus they need to start the leap from theory to practice.

 

[1] I realise that sounds like I’m trivialising how to run agile teams, which is clearly not that easy when you look at how many enterprises are run. The cultural problems are universal and usually the root cause, and so not something that is easily changed, but you can still make a very big impact by forcing more collaboration and taking their enterprise-level toys away.

[2] Of course by not changing the process at all the speed of delivery will continue to drop anyway as change becomes harder. But that has probably been so slow and gradual that it’s largely gone unnoticed.

Thursday, 3 December 2015

More | Stand-Up


The consultancy that I’m currently working through (Equal Experts, aka EE) had a Christmas Party the other night at a swanky bar in London. They had already got a band lined up (The Git Clones), which comprised various EE associates, and so I thought it might be a good opportunity once again to try my hand at a little stand-up comedy.

Given the nature of the company – software development – the audience seemed to be a perfect fit. Being a Christmas party I knew there would be plenty of +1’s in the crowd too so I thought I’d better break them in gently. That said, I also felt it was the responsibility of any +1’s to pay more attention to what their partners had been telling them if they didn’t want to feel left out…

Sadly the sound system appeared to be heavily optimised for the bass player in the band and so those of us who only spoke were inaudible to the majority of the audience. The knock-on effect of this for me was that all I could hear at the front was the chatter from those who carried on talking (and rightly so). This gave me the impression that most people weren’t that engaged, and so I skipped over the more “hard-core” stuff later on [1]. It wasn’t until I got off the stage that I discovered how little most people could hear.

Fortunately the response was very positive from those that did know what was going on which is good to know. Hence for those of you that thought the mime artist up on stage before the band was pretty rubbish, this was what you should have heard:

“I went to the opticians the other day as I started seeing keyboards, mice and printers out the corner of my eye. She said it’s just peripheral vision.”

“I’ve decided it’s time to upgrade to fibre so I’ve started eating All Bran for breakfast.”

“The last time I was at the dentist they told me I had a scaling problem. I said ‘that’s awkward as I don’t have room for any more teeth’.”

“During my mid-thirties I blew a whole load of cash on a top-of-the-range gaming rig. I think I was suffering from a Half-Life crisis.”

“My wife and I have been together for 25 years so I thought I ought to get her a token ring. It turns out you can only get 100-BASE-TX these days.”

“My son was getting hassled at school to share his music collection with BitTorrent. I told him not to give in to peer-to-peer pressure.”

“The last time I flew I put my phone into airplane mode and it promptly assumed the crash position.”

“I forgot my password the other day so I tried a dictionary attack. I just kept hitting the administrator over the head with it until he told me what it was.”

“Are electronic cigarettes just vapourware?”

“I spent an hour the other night trying to upload a picture of Marcel Marceau, but the server kept responding with ‘415 unsupported mime type’.”

“When you’re looking to score some new hallucinogenic drugs do you first consult Trip Advisor?”


“Was the Tower of Pisa built using lean manufacturing?”

“I know it’s all the rage but I think the writing’s on the wall for Kanban.”

“Is a cross-functional team just a bunch of grumpy LISP programmers?”

“If you want to adopt ‘agile release trains’ do you have to use Ruby on Rails?”

“My team isn’t very good at this agile stuff; our successes are just stories and our failures are all epics.”

“One company I worked at used mob programming. They hired a bunch of henchmen to stand around with baseball bats to ensure everyone did an 80 hour week.”

“I asked one company at an interview if they used spikes. They said ‘yes, your head goes on one if you don’t deliver on time’. ”

“The other day the head of QA asked me why all our automated tests only cover the happy path. I told him they were ‘rose tinted specs’.”

“When working from home I like to get my kids to help out with the coding. I call it ‘au pair programming’.”

“If poor quality code leads to technical debt, does that make bad programmers loan sharks?”

“The problem with the technical debt financial metaphor is that in the end everyone loses interest.”


“I once tried out some numerical recipes that involved currying functions, but all I got was a nan.”

“Is it any wonder that modern programmers are obese when they’re addicted to syntactic sugar?”

“C++ comes with complexity guarantees. If you use C++ it’s guaranteed to be complex.”

“Is the removal of a dependency injection framework from a codebase known as ‘spring cleaning’?”

“They say that Java and C# programmers overuse reflection. Given the quality of their code I’d say they aren’t reflecting enough.”

“Is it me or are Java and C# so similar these days that they’re harder to tell apart than The Addams Family and The Munsters?”

“If you think keeping up with The Kardashians is hard, you should try JavaScript frameworks!”

“I heard they were setting up a spin-off of EE to specialise in JavaScript work. It’s going to be called Equal, Equal, Equal Experts.”

“When Sherlock Holmes talks about ‘a three-pipe problem’ does he mean one that uses grep, sed, awk and sort?”

“When it comes to creating UML diagrams of micro-service architectures I never know where to draw the line.”

“Our DR strategy is less active/passive and more passive/aggressive. When the system goes down we just sit around tutting loudly until someone fixes it.”

“Most systems I work on have ‘five nines’ reliability. They’re usually available about 45% of the time.”

“I blame Facebook for the quality of SQL that young programmers produce. They’re obsessed with ‘likes’.”

“I wouldn’t bother upgrading your database as the SQL is never as good as the original.”

“The hardest problem in computer science is dealing with nans. They’re always ringing you up and asking you to fix their machine for them.”

[1] If you’re intrigued to know what the more “programmer oriented” stuff was, you can look at the set from my first ever “gig” at the ACCU 2015 Conference.