Tuesday, 22 December 2015

The Cost of Not Designing the Database Schema

The tale I wrote about in “Single Points of Failure - The SAN” didn’t entirely conclude at the point the issue was identified and apparently resolved. Whilst the vast majority of problems disappeared there was still a spike every now and then that caused the simple web service we wrote to take hundreds of milliseconds to respond, way more than a gen 2 garbage collection would take. We also logged when garbage collections occurred and they were never in sight when this glitch showed up.

After taking some time off I ended up joining the team who were responsible for calling that tactical web service and so I became privy to the goings-on upstream. It turned out the remaining blips were often occurring when an early morning batch process was run. It made little sense at the time that it could affect an entirely unrelated service, but with what I now knew about the SAN I felt the evidence pointed to a smoking gun. But how to truly explain it?

More Performance Woes

One of the changes being made when I joined this team was increased visibility (for the team) about how the services they owned were behaving in production. One service in particular was beginning to show signs of trouble and with the Christmas period looming it was felt something needed to be done about it pronto.

Interestingly the investigation of timeouts caused me to start correlating data with the other service we had had problems with earlier. On one particular day this daily batch process was delayed by a couple of hours and on that very same day the unexplained timeouts in the downstream service shifted too. Whilst correlation does not imply causality, the smoke from the gun was thickening. But it still didn’t make sense how the problem was “jumping the cracks”.

The investigation for my current team’s service turned to the Oracle database and it unearthed some stats that showed the database was making quite a few reads to satisfy the most common query type – retrieving the transactions for an account.

The Mists Begin to Clear

I started to apply the “5 Whys” technique to see if I could piece together a coherent picture that would address the immediate concern, but might also encompass the other one too. The question I started with was this:

“Why are the upstream service HTTP requests timing out?”
  1. Because they are waiting for a database connection. Why?
  2. Because each query is taking much longer. Why?
  3. Because the database is constantly hitting the SAN. Why?
  4. Because the database has to read so many pages. Why?
  5. Because the table being queried is badly organised.
Switching for a moment to the problem of unexplained timeouts in the other service, it all started to make sense. The batch process that runs in the early morning generates a huge number of “non-cacheable” reads (essentially a table scan), which saturates the SAN and therefore causes SAN-related problems similar to those we had before.

Sadly my hypothesis was never acknowledged or discussed outside the team as they had stopped asking questions when they realised the database query was taking too long. However within the team it was accepted as highly plausible so I felt comfortable that at least we had some closure, and more importantly a theory to consider if things showed up again.

The temporary solution to the database problem was to stick a whole load more RAM in it to vastly improve caching and therefore reduce query times enough during the day to avoid the bottlenecks for now.

I posited that this change would also fix (or at least heavily reduce) the problems of unknown timeouts in the other service because Oracle would need to perform far fewer physical reads, and therefore the load on the SAN would also be reduced. This is exactly what I observed, so the gun was smoking even more now.

Addressing the Root Cause

Fundamentally the problem was down to the database having to do way more I/O work than should be necessary to satisfy the query. The table in question is essentially a set of transactions for an account which are being queried by the account’s ID.

The table was implemented as a simple heap with an index for the account ID. Whilst this meant that the transactions for an account could be found by the index, due to the heap structure the transactions were spread right across the table’s entire set of pages. Essentially the database did a few reads of the table index to find the rows in question and then (pathologically speaking) did one read per row to get the data itself. Hence, for accounts with many transactions that was a huge number of random I/Os.

I wasn’t there when the table was designed and so I have no knowledge about what the rationale was. Maybe it was just “the simplest thing that would possibly work” and they thought they’d have time to address scalability later? Or maybe they expected a different read / write pattern? Either way it’s not the structure I would have expected out-of-the-box for this kind of table.

Given that the table stores data for an account, and the key for that account is the primary means of lookup, we should be looking to keep all the data for an account close together. Hence using a table physically structured around the account ID (a “clustered index” on SQL Server and “index-organised table” on Oracle) will provide fast access and excellent locality of reference because all the pages for each account will be stored together. This way the database only has to navigate the index to the start of the specific account’s data and then do a few sequential page reads to get the rest.
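To make the locality argument concrete, here is a toy model in Python. It is not how Oracle lays out storage; it only counts how many distinct pages hold one account’s rows under two physical orderings. The page capacity, the number of accounts and the round-robin arrival pattern are all assumptions chosen for illustration.

```python
ROWS_PER_PAGE = 50  # assumed page capacity, purely illustrative


def pages_touched(physical_order, account_id, rows_per_page=ROWS_PER_PAGE):
    """Count the distinct pages that hold rows for one account.

    physical_order is the sequence of account IDs in on-disk row order.
    """
    return len({
        position // rows_per_page
        for position, owner in enumerate(physical_order)
        if owner == account_id
    })


# 100 accounts with 200 transactions each, arriving interleaved (round-robin),
# which is roughly what a heap sees when rows are stored in arrival order.
ACCOUNTS = 100
TXNS_PER_ACCOUNT = 200
heap_order = [acct for _ in range(TXNS_PER_ACCOUNT) for acct in range(ACCOUNTS)]

# A table organised by account ID keeps each account's rows adjacent.
clustered_order = sorted(heap_order)

heap_pages = pages_touched(heap_order, account_id=42)
clustered_pages = pages_touched(clustered_order, account_id=42)
print(f"heap: {heap_pages} pages, clustered: {clustered_pages} pages")
```

With these made-up numbers the heap needs one page read per transaction (200 pages) whereas the account-ordered layout needs just 4, which is the difference between random and sequential I/O described above.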

No Time to Fix It

The problem with modern businesses is that they run 24x7 these days and so there is no time for downtime and maintenance. So whilst a differently organised table may well now be the best approach, the cost of implementing that change may be too high. Due to the current volume of data, taking the database offline and rebuilding it was not considered possible given the current state of the business and market.

Instead the DBAs decided to add a covering index that could be built online which included all the data so the query optimiser could satisfy the main query solely from the index. Essentially they created the clustered table via an index. Of course every write now had to update the table, original index and the new one. It should have been possible at that point to drop the original index, but I’m not sure if that happened as they’d also have to prove it wasn’t being used by another query.
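The effect of a covering index can be seen in miniature with SQLite via Python’s built-in sqlite3 module. This is only a sketch of the idea, not the Oracle change the DBAs made; the table and index names (txn, idx_account, idx_cover) and the columns are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE txn (
        txn_id     INTEGER PRIMARY KEY,
        account_id INTEGER NOT NULL,
        txn_date   TEXT    NOT NULL,
        amount     INTEGER NOT NULL
    );
    -- The original, narrow index: finds the rows but not their data.
    CREATE INDEX idx_account ON txn (account_id);
""")

QUERY = "SELECT txn_date, amount FROM txn WHERE account_id = ?"


def plan(connection):
    """Return the planner's description of how QUERY would be executed."""
    rows = connection.execute("EXPLAIN QUERY PLAN " + QUERY, (42,)).fetchall()
    # The human-readable detail is the last column of each plan row.
    return " | ".join(row[-1] for row in rows)


before = plan(conn)

# The covering index carries every column the query needs.
conn.execute("CREATE INDEX idx_cover ON txn (account_id, txn_date, amount)")
after = plan(conn)

print("before:", before)
print("after: ", after)
```

Before the second index exists the plan uses idx_account and then has to visit the table for each row; afterwards the plan reports a covering index, meaning the table itself is never touched for this query. The write-side cost mentioned above still applies: every insert now maintains the table and both indexes.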

Back to the SAN

In the meantime I was asked to investigate some other unexplained timeouts that occurred well outside the morning batch processing window. Knowing what we now did about the database and the SAN, someone asked whether the DBAs were already implementing this new index in production.

They weren’t but they were testing the approach in the QA environment. The correlation again was very strong and so someone investigated what the topology was for the databases in the QA environment and they discovered that some of the storage pools shared a portion of the SAN with production which was clearly unintentional. Oops.

Early Warning Indicators

Hindsight is a wonderful thing and it’s good that they were gaining visibility of their service’s behaviour, but that was only able to identify immediate glitches. There also needs to be some element of trend analysis to spot when things are beginning to head south.

For me the stance on instrumentation is that you measure everything you can afford to. Any lengthy computation or external I/O (i.e. anything that could block) should be recorded so that you can get a handle on what operations are behaving strangely now, and how they are changing over time as the service ages and adapts to new loads. It’s pretty easy to add too (see “Simple Instrumentation”).
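As a rough illustration of how little code this can take, here is a minimal Python sketch of a timing wrapper. The names (measure, timings, fetch_transactions) are hypothetical and the in-memory dictionary merely stands in for wherever the measurements would really be logged.

```python
import time
from collections import defaultdict
from contextlib import contextmanager

# Operation name -> list of recorded durations in milliseconds.
timings = defaultdict(list)


@contextmanager
def measure(operation):
    """Record how long the wrapped block takes, even if it raises."""
    started = time.perf_counter()
    try:
        yield
    finally:
        elapsed_ms = (time.perf_counter() - started) * 1000.0
        timings[operation].append(elapsed_ms)


def fetch_transactions(account_id):
    """Hypothetical stand-in for a call that blocks on external I/O."""
    with measure("db.fetch_transactions"):
        time.sleep(0.01)  # pretend to wait on the database
        return [account_id]


fetch_transactions(42)
print(dict(timings))
```

Wrapping anything that could block in this way gives a named duration per call, which is the raw material for both spotting glitches today and tracking trends later.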

Without some form of trend analysis you become like a slow-boiled frog that isn’t noticing how the surroundings are changing. All of a sudden what once took milliseconds now takes tens of milliseconds but you haven’t noticed it creep up. Everything appears to be normal right up to the point that performance drops off the cliff and you’re fire-fighting to bring it back under control.
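One very simple form of trend analysis is to compare a recent window of measurements against an older baseline. The Python sketch below does that with daily median latencies; the window sizes, the sample data and the alert threshold are all assumptions made up for the example.

```python
from statistics import median


def latency_drift(daily_medians_ms, baseline_days=7, recent_days=7):
    """Ratio of the recent median latency to the baseline median latency."""
    if len(daily_medians_ms) < baseline_days + recent_days:
        raise ValueError("not enough history to compare")
    baseline = median(daily_medians_ms[:baseline_days])
    recent = median(daily_medians_ms[-recent_days:])
    return recent / baseline


# Hypothetical daily medians (ms) creeping up by half a millisecond a day.
history = [4.0 + 0.5 * day for day in range(28)]

drift = latency_drift(history)
needs_attention = drift > 2.0  # assumed threshold: latency has doubled
print(f"drift x{drift:.2f}, needs attention: {needs_attention}")
```

No single day in that history looks alarming next to the previous one, yet over four weeks the typical latency has nearly tripled, which is exactly the slow creep that point-in-time monitoring misses.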


You also cannot just monitor everything and expect to make sense of it all when a crisis hits. The data by itself is no use if you don’t understand how it relates to the moving parts of the system – you need to know why certain things change together, or not. From this you can build a heartbeat so that you really know how the system is evolving over time.

Acceptance Test Is Not an Environment

In a traditional software development process where you did analysis, development and then testing, there is often the use of shared environments, and therefore there is often a one-to-one relationship between the name of the environment and the type of testing performed. For example UAT (User Acceptance Testing) tends to come right at the very end of the process just before production. If you are working on a back-end system there may well be no “U” in the UAT and so it really just becomes a more production-like test environment.

In a modern development process there is more of a distinction between the type of tests we are running and the environment in which we are running them. We are always trying to achieve a balance between getting the fastest feedback possible on whether our changes are correct, whilst still ensuring that enough of the system is being tested in a manner similar to production so that we minimise any problems due to environmental differences.

In my C Vu article “The Developer’s Sandbox” I described a number of different ways that you might partition a system (and test data) to allow a variety of different levels of non-unit testing. In essence I am mostly interested in running fast, automated test suites in some isolated manner to gain rapid feedback. However I also like to do a bit of manual exploratory testing, especially when making changes around deployment or infrastructure code. Demoing new features is also important to ensure that we’re building “the right thing”.

What I’ve found is that there is often some confusion when talking about testing that conflates the suite of tests being exercised with the configuration of the system it’s being run on. For example I will try and run every automated test possible on my local machine before committing my changes. This means I’m probably running some combination of unit, component, integration, acceptance and system tests against a variety of mock and real components and services depending on how expensive or not they are to use.

Similarly on the build server we will run exactly the same suite of tests but because we have more time we can use the real dependencies where possible and only rely on mocks where we have to. The closer the code gets to production the closer the test environment has to get to production too.

As a consequence this means there is no one-to-one relationship between the test suite configuration and the environment where it is run. By default we tend to optimise for the developer feedback loop which means the out-of-the-box configuration is usually “localhost” everywhere [1]. In contrast the build server, development and test environments will likely have real networks, databases, message queues, etc. in play and so the same suite of tests will increase the amount of infrastructure and integration for a more production-like quality, perhaps at the expense of performance. The point is that we aim to run the same tests and only vary the configuration. Hence when talking about automated testing it may require us to qualify it with the environment configuration we might be running with to avoid confusion.
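A minimal Python sketch of “same tests, different configuration” might look like the following. The TEST_ENV setting, the connect function and the test itself are hypothetical; the only point is that the test body does not know or care what it has been wired up to.

```python
import os
import sqlite3


def connect(environment=None):
    """Open the store selected by the (hypothetical) TEST_ENV setting."""
    environment = environment or os.environ.get("TEST_ENV", "local")
    if environment == "local":
        # In-process database: no network needed, fastest feedback.
        return sqlite3.connect(":memory:")
    # A build server or test environment would return a connection to
    # real infrastructure here instead.
    raise NotImplementedError(f"configure a connection for {environment!r}")


def acceptance_test_round_trip(connection):
    """The same test body, whatever the connection points at."""
    connection.execute(
        "CREATE TABLE IF NOT EXISTS account (id INTEGER PRIMARY KEY, name TEXT)"
    )
    connection.execute("INSERT INTO account (name) VALUES (?)", ("alice",))
    (name,) = connection.execute("SELECT name FROM account").fetchone()
    return name == "alice"


passed = acceptance_test_round_trip(connect("local"))
print("passed on localhost configuration:", passed)
```

Only connect varies between the developer’s machine and the more production-like environments, so a failure that appears further down the pipeline points at the configuration or infrastructure rather than at a different set of tests.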

One natural observation might be that it’s not right to call the running of the acceptance test suite on a developer’s local machine “acceptance tests” as some element of the “acceptance” must come from it being run in a more production-like manner. Whilst I get the sentiment, I think that misses the point about developers leveraging the traditionally more costly tests in a constrained, but by no means useless, environment to gain earlier feedback around the functional behaviour. No, it doesn’t mean it’s signed-off and ready for production just because it works on my machine, but it does mean that at a fundamental level the change is sound and worthy of pushing further down the deployment pipeline.



[1] I always say that I should be able to unplug from the network and go out into the garden where there is no Wi-Fi and still be able to write code and have a high degree of confidence that it works. Modern tooling (and a sane approach to developer licensing) makes that possible even when databases, message queues, etc. are in the equation without having to restrict ourselves to relying solely on unit testing.

Observable State versus Persisted State

A while back I was working on a replacement service that was intending to use one of those new-fangled document-oriented databases (Couchbase as it goes). During the sprint planning meeting we had a contentious story around persisting data and what it meant to handle multiple writes in a single “business transaction”. There was some consternation that because there is no native transaction support (or locking) to ensure we got an atomic commit on success, or a rollback if a problem occurred somewhere, then we couldn’t deliver the story on that technology stack.

Effectively we had reached the point where we were handling the stories around idempotency and the story had wording in it that assumed a classic relational all-or-nothing style of transactional writing which we naturally couldn’t have. The crux of the question was whether we could perform our writes in such a way that if an error occurred any invariants would still remain, and if the request was retried then we’d be able to complete it after being left temporarily in a potentially half-finished state.

Atomic Multi-Document Writes

The problem revolved around creating a number of child documents (e.g. Orders) for a root document (e.g. Customer). When using a traditional database the child records could just be written as-is because they will not be visible until the transaction is committed (ignoring dirty reads). If an error occurs at any point whilst writing, the whole lot are removed. If the database goes down before the commit is persisted it will roll-back the transaction if it needs to on restart. Either way any invariants violated during the writes are invisible outside the transaction.

Non-Atomic Multi-Document Writes

Whilst writes are atomic at a document level, they are not when multiple documents (or many, separate writes to the same document) are involved. As such we need to perform each insert, update and delete in a way that assumes we might lose connectivity at that moment.

The first problem is ensuring that a failure after any single write cannot leave the data in a state where any invariants have been violated. For instance if the model says that there is a two-way relationship between two documents, then only having one-half of it is unacceptable because navigating the other way will generate an error.

As a consequence of partially written data being a possibility due to a lack of transactions, we likely have to adopt an error handling strategy that either unwinds the state or moves it forward to achieve the original desired outcome [1]. For this to happen we will almost certainly be looking at using idempotent writes where we can try the same action again and again and not incur any additional side-effects if it has already completed successfully (e.g. a counter is incremented once, and only once).

The Observable Effects of Idempotency

And so we come back to the problem we encountered when discussing the story – what exactly does idempotency mean? The way it was worded in the story was that any failed business transaction must not leave any residual state behind. The way that the database works and the kind of business transaction we were trying to do meant that this was simply impossible to achieve. With an air of defeat the discussion turned to how we could switch back to using a traditional transactional database to meet this story.

However, I wanted clarification around what it meant for “no state” to be left within the database. What I thought the intent of that phrase really meant was “no observable state” should be left around if the transaction fails. If we consider the system as a black box, not a white one, then we can leave residual state lying around just so long as it is not visible outside the system. And as long as the system is only accessible via our public API we can control how temporary state can remain hidden.

But how? In this instance, if we order our writes carefully enough we can ensure that any invariants remain intact after every single write. We just need to be careful about how we define when a piece of data becomes visible through the public API.

Example: File-System Writes

To understand how this can be achieved think about how a modern day editor, such as MS Word, saves documents. It does not just open the file and start writing because if it did and the machine failed both the old and new documents would be lost. Instead it follows a sequence something like this, to minimise the loss of data:
  1. Write the new document to a temporary file.
  2. Rename the current backup file to a temporary name.
  3. Rename the old document to make it the backup.
  4. Rename the temporary file to the document’s name.
  5. Delete the old backup file.
In fact this pattern of file-system behaviour (write + rename) is so common that NTFS even recognises it to make sure the newly written document carries over the previous file’s creation date to make it appear as if it just updated the old file.

What makes this work is that the really dangerous work is done off to the side (i.e. writing the new version of the document) leaving just some file-system metadata changes (3 renames and a delete) to “commit” the change. I touched on this idea before in “Copy & Rename (Like Copy & Swap But For File-Systems)” after having to deal with torn files due to a badly written file transfer process.
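The pattern can be sketched in a few lines of Python. This is a simplified variant of the sequence above, not what any particular editor does: os.replace overwrites its destination, so the separate “rename the old backup” and “delete the old backup” steps collapse into one. The .tmp and .bak suffixes are arbitrary choices.

```python
import os
import tempfile


def safe_save(path, data):
    """Replace the file at path with data, keeping the old version as a backup."""
    temp_path = path + ".tmp"
    backup_path = path + ".bak"

    # 1. Do the risky work off to the side.
    with open(temp_path, "wb") as temp_file:
        temp_file.write(data)
        temp_file.flush()
        os.fsync(temp_file.fileno())

    # 2. "Commit" using metadata-only operations.
    if os.path.exists(path):
        os.replace(path, backup_path)  # also discards any older backup
    os.replace(temp_path, path)


with tempfile.TemporaryDirectory() as folder:
    document = os.path.join(folder, "report.txt")
    safe_save(document, b"first draft")
    safe_save(document, b"second draft")
    with open(document, "rb") as saved:
        current = saved.read()
    with open(document + ".bak", "rb") as saved:
        previous = saved.read()

print(current, previous)
```

If the process dies while writing the temporary file the original document is untouched, and if it dies between the two renames the previous version is still sitting in the backup file.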

Idempotent Writes

The way to achieve the same effect in the database is also by writing in a particular way and by tagging each business transaction with a unique ID that we can use to replay or recover from after a failure.

In our example we split the writes up into two stages:
  1. First insert the child documents.
  2. Then update the parent document to refer to them.
It might seem as though the child documents would be visible after the initial write but they aren’t because the public API only publishes the ID of children who are referenced in the parent. As such there may be state persisted, but it is not observable until the single write at the end of the parent document, which is atomic.

The relationship is actually bidirectional (you can find a child and lookup its parent) which might seem like a loophole until you consider the previous point – the child is not publicly visible until the parent has been committed. You can’t ask for the child because you have no way of knowing of its existence via the public API.

The way the idempotent ID works is that it is logged against certain writes so that we can tell what has and hasn’t been performed already. So in our example above each child document is created (possibly with the idempotent ID [2]) and when we add the references into the parent we tag it with the idempotent ID so that we know we completed the transaction. If it fails at any point we can just discard the temporary child documents and recreate them. This does mean we have the potential for detritus to be left around on failures, but they should be rare and can be “garbage collected” in slow time using a background process [2].
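The two-stage write can be sketched against a toy in-memory store. This is Python rather than Couchbase client code, and every name in it (Store, add_orders, visible_orders, the completed_requests field) is hypothetical; the only property assumed of the store is that a single put is atomic.

```python
import uuid


class Store:
    """Toy stand-in for a document database: each put is atomic, nothing more."""

    def __init__(self):
        self.docs = {}

    def put(self, key, doc):
        self.docs[key] = dict(doc)

    def get(self, key):
        return self.docs.get(key)


def add_orders(store, customer_id, request_id, orders, fail_before_commit=False):
    """Attach orders to a customer; safe to retry with the same request_id."""
    customer = store.get(customer_id)
    if request_id in customer["completed_requests"]:
        return  # already committed, so a replay changes nothing

    # Stage 1: write the children. They are persisted but not yet observable.
    child_ids = []
    for order in orders:
        child_id = f"order::{uuid.uuid4()}"
        store.put(child_id, {"customer": customer_id, "request": request_id, **order})
        child_ids.append(child_id)

    if fail_before_commit:
        raise ConnectionError("simulated loss of connectivity")

    # Stage 2: a single atomic write to the parent publishes the children
    # and records the idempotency ID.
    store.put(customer_id, {
        **customer,
        "orders": customer["orders"] + child_ids,
        "completed_requests": customer["completed_requests"] + [request_id],
    })


def visible_orders(store, customer_id):
    """The public API only exposes children referenced by the parent."""
    return [store.get(child_id) for child_id in store.get(customer_id)["orders"]]


store = Store()
store.put("cust::1", {"orders": [], "completed_requests": []})

try:
    add_orders(store, "cust::1", "req-1", [{"item": "book"}], fail_before_commit=True)
except ConnectionError:
    pass
after_failure = visible_orders(store, "cust::1")

add_orders(store, "cust::1", "req-1", [{"item": "book"}])  # retry succeeds
add_orders(store, "cust::1", "req-1", [{"item": "book"}])  # duplicate is a no-op
after_retry = visible_orders(store, "cust::1")
```

After the simulated failure nothing is visible, and after the retry (plus a duplicate submission) exactly one order is. The child written by the failed attempt is still in the store as unreferenced detritus, tagged with the request ID so a background process could find and remove it.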

Scalability

This technique works for simple object models which is how I’ve used it. It can be extended to some degree if you are willing to add complexity to your model (and probably increase the number of I/Os) by creating more elaborate “invariants”. For example if the sender could have controlled the child document ID it might mean that the public API would have to navigate from child to parent to validate its existence (presence of the document alone not being enough).

Given the choice between using a classic transactional database and having to think really hard about this stuff it’s probably not worth it. But if you have a simple object model and are looking at alternatives for performance reasons, then you need to think a bit differently if you’re going to cope without transactions.


[1] Just ignoring a part-failed request and leaving the data in a valid, but unusual state, should be possible but is highly undesirable from a support perspective. It’s hard enough piecing together what’s happened without being plagued unnecessarily by zombie data.


[2] It’s not essential if you always re-submit and roll forward, but can help in the aftermath if cleaning up. It would probably be required though if you needed to roll-back first as it may be the only key you have to the document at that point.

Wednesday, 9 December 2015

Don’t Be Afraid to Throw Away Data

One of the problems I’ve seen come up in various projects is what to do with all the data that we’re being given that we don’t need to use at that moment in time? For example I worked on a new system which was estimated as needing to hold around 100 TB when in production as it had a data retention period of just over a year. We only needed a subset of it initially.

More recently the same problem came up on a 3rd party data feed where the vast majority of the data could be discarded because only a couple of attributes were actually being used. In both cases I struggled to convince the business to reconsider mindlessly hoarding data in production that they didn’t need, either now or in the foreseeable future.

Storage Costs

Fundamentally storing the data as part of the production data set was going to cost a non-trivial amount of money. Whilst 100 TB does not sound like a huge amount these days, once you consider that it’ll be held on top class storage (e.g. solid state), the cost begins to become noticeable. In contrast there are plenty of really cheap storage options for the parts of the data that likely have an SLA ranging from months to “never”.

Redundancy

We should also not forget that the more data we store in our production data stores the harder it will be to recover when (not if) something goes wrong. Why waste time restoring non-essential data when the aim is to get the business back on its feet ASAP? If you treat every piece of data the same you have no ability to prioritise.

What If?

In both instances the argument from the business was one about “what about when we need to use the data in the future?”. They were worried that if they threw it away and later discovered that it was useful then they’d have to wait ages again until they had accumulated enough.

In both cases what they failed to distinguish is the different use cases for the data. To them it was an all-or-nothing deal and I strongly suspect that was down to the mentality of using a single database product so they could keep all the eggs in one basket.

Production Queries versus Analysis

Production infrastructure will usually be sized and tuned to cope with the fixed subset of requests used by it. The more varied the demands the harder it is to provide a service that meets all its needs and therefore its SLAs. If those demands involve the ability to run ad-hoc queries then all bets are off. I’ve seen people crash production databases by running poorly written ad-hoc queries (usually by accident).

In contrast, in my experience, data analysis requirements often come with much lower expectations. It’s entirely possible that just a sample of the data might be required rather than every byte ever produced. The data store may be tuned and arranged completely differently if it’s likely to be handling unknown queries. Given the less critical nature of the data it probably comes with far lower support guarantees, and therefore running costs.

Partition Appropriately

The idea of using separate databases for separate purposes is nothing new – the traditional “transactional versus reporting” split has been around for decades. It’s just another specialisation of the more general principle regarding the Separation of Concerns.

With ever cheaper hardware and cloud computing at one’s fingertips it might seem that modern databases can handle any disparate load you care to throw at them because most data sets could probably fit into RAM these days if you decided to spend the money.

Sadly the cost of enterprise-grade hardware still makes me wince, especially when the internal price factors in all the costs of the data centre, infrastructure staff, etc. Only a couple of years ago I was quoted £36 per GB for storage on an enterprise project expected to store many tens of terabytes [1].

Many businesses are still holding on tightly to their own data centres, for various reasons, and so the answer is not always as clear cut as it first appears.

Deferring Decisions

In both cases what I was essentially trying to do was help the business defer some decisions that I suspected were not important in the shorter term. Rather than blindly assume that all data is valuable and get stuck spinning our wheels on speculative requirements, we should consider whether dropping the data is the easiest approach, for now. If that’s absolutely not possible, then consider other ways to put the unused data to one side until we know more about how it will be used.

In the former case cited at the beginning we were continually getting bogged down in discussions about how to store the data that we didn’t understand up-front in a way that would make it available at a (much) later date. Aside from being just a schema design issue it also meant we had to factor it into the discussions around performance. In the end we reached an agreement where we could dump all the incoming raw data (after processing) onto a compressed volume so that it wouldn’t be lost, but the cost of understanding and re-importing what we didn’t understand today would be borne at the time when it was actually required.

As for the latter case we proposed keeping the production cache tiny by only storing what mattered, and that the full payload would be pushed out to a queue where it could be imported into an independent, non-production database organised for analysis.
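The shape of that proposal can be sketched as follows. It is an illustration only: the attribute names are invented, a dictionary stands in for the production cache, and Python’s in-process queue.Queue stands in for whatever message queue would really feed the analysis database.

```python
import json
import queue

# Hypothetical: the only fields the production service actually reads.
USED_ATTRIBUTES = ("id", "price")

production_cache = {}
analysis_queue = queue.Queue()  # stand-in for a real message queue


def ingest(raw_message):
    """Keep only what production needs; forward the full payload for analysis."""
    payload = json.loads(raw_message)
    production_cache[payload["id"]] = {
        name: payload[name] for name in USED_ATTRIBUTES
    }
    analysis_queue.put(raw_message)


ingest('{"id": "ABC", "price": 101.5, "venue": "X", "notes": "unused for now"}')
print(production_cache, analysis_queue.qsize())
```

Nothing is thrown away, but the cost of storing, understanding and querying the unused attributes moves out of the production data store and into one that can be sized and supported quite differently.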

 

[1] The costing model was entirely based around the notion of SAN storage for everything. The modern document-oriented databases like MongoDB are architected for commodity hardware which really messes with those kinds of costing models.

Tuesday, 8 December 2015

Poor Performance of log4net Context Properties

Back in September last year I wrote a post about a simple, low-latency web service we had to put together in a short timeframe (see “The Surgical Team”). During this project we hit a serious performance snag with log4net caused by using its context properties collection in the log message.

The Early Warning Indicator

The web service we were building had to perform a simple lookup on a multi-GB data set in less than 10 ms. We were going to use .Net and ASP.Net and so given its garbage collected environment we sought to verify first of all that garbage collections were not going to be a problem. Hence we got the CI pipeline in place and wrapped up a call to a NOP web API handler in a performance test to measure the basic solution infrastructure.

After a few false starts with the test itself (and the test framework) we found that each call was only taking a couple of ms [1] with an occasionally longer one as we hit a 2nd generation garbage collection (GC) which “stops the world” (at least in .Net 4). Even though the expensive GC meant we were occasionally more than an order of magnitude outside our SLA, the business were happy to go with it given the time constraints [2].

The Klaxon Goes Off

All of a sudden the core part of the build is succeeding but the performance tests are failing. We initially put it down to a blip, perhaps in the infrastructure, and ignore it for now. Then it trips again and again and so we decide to check the last few commits to see what’s happening as this now feels like it might be our code.

One of the developers had recently added log4net into the mix for diagnostic logging and to report SLA violations via the Windows event log, but that had been a few commits before things started going south. The commit that seemed to have pushed us over the edge was the addition of the HTTP request correlation ID to the log message. This was done via the context property bag in log4net, e.g.

%date %-5level %thread %property{correlationId} ...

This didn’t seem quite right at first though as we’d used the same property bag elsewhere, e.g. in the event log message. But of course we quickly realised the event log message only happened when we were already outside the SLA.

We reverted the change and the performance tests were green once more. At this point we didn’t know what it was about using properties that was causing it, or even if it really was that so we looked for another way as the information was important for support and monitoring (see “Causality – Relating Distributed Diagnostic Contexts”).

(This also got noticed and formally reported by someone else a few months later, and is now tracked in the log4net JIRA under LOG4NET-421 and LOG4NET-429.)

Pattern Converters to the Rescue

Fortunately the property bag isn’t the only way to insert custom content into a log4net message (without just concatenating it into the message, which was our fall-back); it also supports custom fields, called “pattern converters”.

As you can see from the documentation you specify your own field names and use them as placeholders too, e.g.

%date %-5level %thread %correlationId ...

Unlike the built-in property bag you have to do a little bit more work, such as creating some (thread-local) storage for the property value that you can fish out later when you need to invoke log4net to write the message [3].

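using System.IO;
using log4net.Util;

// Writes the current correlation ID wherever %correlationId
// appears in the layout pattern.
public class CorrelationIdConverter : PatternConverter
{
  protected override void Convert(TextWriter writer,
                                  object state)
  {
    // The ID is stored in a thread-local property.
    writer.Write(Correlation.Id);
  }
}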

You’ll also need to tell log4net about the field name and the class that should be invoked to format the value [4].

<converter>
  <name value="correlationId" />
  <type value="MyLib.CorrelationIdConverter, MyLib"/>
</converter>

We watched the build machine performance test results closely when the change was pushed but, just as we had hoped, the effect wasn’t noticeable.

 

[1] I am in no way suggesting that “a couple of ms” for an empty web API call could be considered decent performance. Quite frankly I was amazed when we discovered how much time and memory the ASP.Net pipeline itself used.

[2] We did suggest ways to mitigate this, such as fronting the service with a load balancer / service that could do a “best of three” as the service was read-only.

[3] Technically speaking it was stored in the ASP.Net HttpContext, if it exists, and if not we fall back to using traditional thread-local storage.

[4] We were actually using programmatic configuration of log4net there, but on another project we had to use the traditional log4net.config file and sadly it starts to add a bit of noise.

Friday, 4 December 2015

The Importance of Leading by Example

“I didn’t have time to write a short blog post, so I wrote a longer one instead.” – Mark Twain (mostly)

As I mentioned recently in “Missing the Daily Commute by Train” my previous engagement was a little out of the ordinary for me. Normally I’m just another programmer working in a delivery team alongside other developers, testers, BAs, etc. but in this instance I was doing a more hands-off consultant style role.

This is quite an unusual type of engagement for the consultancy through which I am currently working (Equal Experts). Whilst they prefer to put together an entire team, including developers, testers, UX specialists, etc. to try and ensure the best chance of delivering a successful outcome (a product the customer really wants), sometimes they have to settle for less control.

Small Cogs, Big Wheel

To date I’ve worked on a few different projects at a couple of sites and by-and-large the consultancy firm has only provided developers as that is what the client has wanted (they have preferred to provide their own people for the other roles). Even so, despite being there to deliver a software product we have naturally been able to put our many collective years of software development experience into effect and provide input, not only on the aspects of coding, but also on how the team works.

It’s very tempting to try and solve the entire organisation’s problems, but we have to remember that first-and-foremost we are there to deliver a specific project. If we can also impart other knowledge about how the team, and even possibly the division or organisation, might consider doing things better then that’s an added bonus. Shared correctly any advice is often welcomed and sometimes the desired outcome is achieved. Not every battle may be won, but a few changes here-and-there can often make a big difference to how the team behaves and the quality of work delivered.

Being a Bigger Cog

What was different about this recent engagement was that we were there explicitly to try and instigate such over-arching changes in the first place. Re-organising the teams from being silos grouped around similar skill sets (i.e. programming language/platform, testing, etc.) to cross-functional teams where the entire technology stack was catered for was an obvious change as it meant the delivery of any feature should mostly stay within the delivery team.

Much of what we did was to observe what was happening and try and correct the team’s and, to a much lesser extent, an individual’s behaviour. The latter was rare and really just signified a behaviour that was probably common to that person’s role rather than actually about one specific individual.

As an outsider (to the team) this was reasonably easy to do because the advice often involved getting the right people to talk to each other. Learning to decipher a Kanban board and act on it appropriately is by no means trivial, but it generally involves focusing on helping only one person in the team – the Scrum Master. Many of the techniques we employ when running a project actually involve giving up on modern technology and tools and going back to pen and paper or whiteboards. In essence there is no learning curve for the tooling, only in the approach [1].

In contrast there are often many developers and testers in a team and getting all of them to understand what it takes to write sustainable software without actually helping them directly felt like a more insurmountable task. You can talk all day long about concepts like test-driven development and continuous integration but many of the barriers to implementing those lay at the architectural level.

As I mentioned at the beginning the initial driver for the engagement was “an agile transformation” and due to the organisation’s structure the sponsor was effectively interested in us changing the way they manage the project rather than the way the software is developed. In short, changing the engineering practices was near the bottom of the list of priorities.

No Time to Improve

When the initial dust had settled around re-organising the teams and throwing out the vast quantities of paperwork they would generate to facilitate hand-offs, we got to start addressing the technical aspects. This, sadly, didn’t go anywhere near as well.

Where the team re-organisation happened very quickly (a couple of months) and, somewhat surprisingly, caused very little disruption, the problems with the architecture and development practices would require a significant investment in time and therefore also money. It would also likely have an impact on the delivery schedule due to the refactoring needed to get a fluid delivery pipeline in place [2].

As I discussed only recently in “Don’t Fail Fast, Learn Cheaply” modern development practices often have an air of redundancy about them and refactoring was a classic case in point. Right from the start the notion that you would rewrite existing code to support a sustainable pace of change in the future was rejected. When the management team focus on how to stop the more disruptive 10% being less productive instead of enabling the 90% to work much better you know you have an uphill battle.

Ultimately the entire codebase, except for one small area, was not written with modern, automated testability in mind. As a consequence there would need to be a non-trivial amount of work done just to get the code into a position where it was possible to even begin testing through unit or acceptance tests. Doing this “rewrite” costs time, which when you are used to dedicating 100% of your time to delivering production code means a substantial short-term hit.

A side-effect of having no slack in the schedule is that there is also no time to improve one’s skills. Abraham Lincoln suggests that time spent “sharpening our axe” is essential as without learning we never find more efficient ways to do our work. In an industry like software development where the goal posts are moving all the time it’s paramount that we keep inspecting our axe to see when it’s becoming blunt.

The knock-on effects of the abyss they had found themselves in meant that they were in a vicious cycle where the architecture didn’t easily support automated testing and changing the architecture to support automated testing was too difficult. With no rapid feedback there were lots of bugs which caused rework for testers and developers and that put more time pressure on them.

Eating the Elephant

There is a saying about how you go about eating an elephant – one bite at a time. With a system that resembles a Big Ball of Mud it’s often hard to even know where to start. Once upon a time I was brave enough to spend three days just sorting out the #includes (and forward declaring types) in a large C++ codebase just so that I could get the simplest unit test in place. If I had known it would take three days I’m not sure I would have started, but it paid huge dividends once it was done, both for the change I needed to make at the time and subsequent ones. This kind of experience gives you the courage needed later on to be brave and tackle these kinds of seemingly insurmountable problems.

And this is where I began to notice that not being embedded in the team delivering code on a daily basis was beginning to constrain the work we, as consultants, were trying to do. From the perspective of the developers at the client it’s all very well having a consultant impress upon you the theory about how it will help in the long term, but I won’t be the one having the uncomfortable conversations about why I appear to be taking so long now.

I kept trying to put myself back in that position all those years ago to try and remind myself what it felt like so that I could find another way to make the developers more comfortable with doing the right thing. I also reread one of my earliest blog posts “Refactoring - Do You Tell Your Boss?” to find inspiration there too. I thought it might be possible to use pairing as a way to help ease the pressure but that never went anywhere either.

Accountability

In retrospect I think the difference between what happens when we’re inside the team helping them with delivery versus outside it consulting is down to accountability. When we’re in the team our heads are on the block too and so leading by example is fundamental to getting people to follow. When we’re outside the team we’re ultimately not (seen as) accountable for the changes we might have a hand in advising on.

When you truly believe in techniques like refactoring, test-first development, pair programming, etc. you exude an air of confidence that makes it easier to just accept that it’s the best way to work. You stop asking for permission to do things that are rightfully yours to do and you focus on how best to achieve the desired outcome, which caters both for the short term and the longer term goals.

In the past when I’ve thought about what it means to “lead by example” I’ve probably focused on the mechanics of the various software engineering practices – what it means to develop in a test-first manner or how you refactor something. What I’m realising now is that showing someone how to use the tools and techniques is not nearly as important as showing them what can be achieved by applying them. By tackling the really gnarly messes one bite at a time we illustrate that it’s plausible to eat the entire elephant, and that is the impetus they need to start the leap from theory to practice.

 

[1] I realise that sounds like I’m trivialising how to run agile teams which is clearly not that easy when you look at how many enterprises are run. The cultural problems are universal and usually the root cause, and therefore not something that is easily changed, but you can still make a very big impact by forcing more collaboration and taking their enterprise-level toys away.

[2] Of course by not changing the process at all the speed of delivery will continue to drop anyway as change becomes harder. But that has probably been so slow and gradual that it’s largely gone unnoticed.

Thursday, 3 December 2015

More | Stand-Up

The consultancy that I’m currently working through (Equal Experts, aka EE) had a Christmas Party the other night at a swanky bar in London. They had already got a band lined up (The Git Clones) which was made up of various EE associates and so I thought it might be a good opportunity once again to try my hand at a little stand-up comedy.

Given the nature of the company – software development – the audience seemed to be a perfect fit. Being a Christmas party I knew there would be plenty of +1’s in the crowd too so I thought I’d better break them in gently. That said, I also felt it was the responsibility of any +1’s to pay more attention to what their partners had been telling them if they didn’t want to feel left out…

Sadly the sound system appeared to be heavily optimised for the bass player in the band and so those of us who only spoke were inaudible to the majority of the audience. The knock-on effect of this for me was that all I could hear at the front was the chatter from those who carried on talking (and rightly so). This gave me the impression that most people weren’t that engaged and so I skipped over the more “hard-core” stuff later on [1]. It wasn’t until I got off the stage that I discovered how little most people could hear.

Fortunately the response was very positive from those that did know what was going on which is good to know. Hence for those of you that thought the mime artist up on stage before the band was pretty rubbish, this was what you should have heard:

“I went to the opticians the other day as I started seeing keyboards, mice and printers out the corner of my eye. She said it’s just peripheral vision.”

“I’ve decided it’s time to upgrade to fibre so I’ve started eating All Bran for breakfast.”

“The last time I was at the dentist they told me I had a scaling problem. I said ‘that’s awkward as I don’t have room for any more teeth’.”

“During my mid-thirties I blew a whole load of cash on a top-of-the-range gaming rig. I think I was suffering from a Half-Life crisis.”

“My wife and I have been together for 25 years so I thought I ought to get her a token ring. It turns out you can only get 100-BASE-TX these days.”

“My son was getting hassled at school to share his music collection with BitTorrent. I told him not to give in to peer-to-peer pressure.”

“The last time I flew I put my phone into airplane mode and it promptly assumed the crash position.”

“I forgot my password the other day so I tried a dictionary attack. I just kept hitting the administrator over the head with it until he told me what it was.”

“Are electronic cigarettes just vapourware?”

“I spent an hour the other night trying to upload a picture of Marcel Marceau, but the server kept responding with ‘415 unsupported mime type’.”

“When you’re looking to score some new hallucinogenic drugs do you first consult Trip Advisor?”


“Was the Tower of Pisa built using lean manufacturing?”

“I know it’s all the rage but I think the writing’s on the wall for Kanban.”

“Is a cross-functional team just a bunch of grumpy LISP programmers?”

“If you want to adopt ‘agile release trains’ do you have to use Ruby on Rails?”

“My team isn’t very good at this agile stuff; our successes are just stories and our failures are all epics.”

“One company I worked at used mob programming. They hired a bunch of henchmen to stand around with baseball bats to ensure everyone did an 80 hour week.”

“I asked one company at an interview if they used spikes. They said ‘yes, your head goes on one if you don’t deliver on time’.”

“The other day the head of QA asked me why all our automated tests only cover the happy path. I told him they were ‘rose tinted specs’.”

“When working from home I like to get my kids to help out with the coding. I call it ‘au pair programming’.”

“If poor quality code leads to technical debt, does that make bad programmers loan sharks?”

“The problem with the technical debt financial metaphor is that in the end everyone loses interest.”


“I once tried out some numerical recipes that involved currying functions, but all I got was a nan.”

“Is it any wonder that modern programmers are obese when they’re addicted to syntactic sugar?”

“C++ comes with complexity guarantees. If you use C++ it’s guaranteed to be complex.”

“Is the removal of a dependency injection framework from a codebase known as ‘spring cleaning’?”

“They say that Java and C# programmers overuse reflection. Given the quality of their code I’d say they aren’t reflecting enough.”

“Is it me or are Java and C# so similar these days that they’re harder to tell apart than The Addams Family and The Munsters?”

“If you think keeping up with The Kardashians is hard, you should try JavaScript frameworks!”

“I heard they were setting up a spin-off of EE to specialise in JavaScript work. It’s going to be called Equal, Equal, Equal Experts.”

“When Sherlock Holmes talks about ‘a three-pipe problem’ does he mean one that uses grep, sed, awk and sort?”

“When it comes to creating UML diagrams of micro-service architectures I never know where to draw the line.”

“Our DR strategy is less active/passive and more passive/aggressive. When the system goes down we just sit around tutting loudly until someone fixes it.”

“Most systems I work on have ‘five nines’ reliability. They’re usually available about 45% of the time.”

“I blame Facebook for the quality of SQL that young programmers produce. They’re obsessed with ‘likes’.”

“I wouldn’t bother upgrading your database as the SQL is never as good as the original.”

“The hardest problem in computer science is dealing with nans. They’re always ringing you up and asking you to fix their machine for them.”

[1] If you’re intrigued to know what the more “programmer oriented” stuff was you can look at the set for my first ever “gig” at the ACCU 2015 Conference.

Thursday, 26 November 2015

The Cost of Not Starting

The idea of emergent design is uncomfortable to those at the top and it’s pretty easy to see why. Whilst there are no real physical barriers to overcome if the software architecture goes astray, there is the potential for some significant costs if the rework is extensive (think change of platform / paradigm / language). In times gone by there was a desire to analyse the problem to death in an attempt to try and ensure the “correct” design choices were made early and would therefore (theoretically) minimise rework.

In a modern agile world however we see the fallacy in that thinking and are beginning to rely more on emergent design as we make better provision for adapting to change. It’s relatively easy to see how this works for the small-scale stuff, but ultimately there have to be some up-front architectural choices that will shape the future of the system. Trying to minimise the number of up-front choices to remain lean, whilst also deciding enough to make progress and learn more, is a balancing act. But the cost of not actually starting the work and even beginning to learn can definitely be dear if the desire is to move to a newer platform.

A Chance to Learn

I recently had some minor involvement in a project that was to build a new, simple, lookup-style data service. Whilst the organisation had built some of these in the past they had been on a much older platform, and given the loose timescale at the project’s inception it was felt to be a great opportunity to try and build this one on a newer, more sustainable platform.

Essentially the only major decision to make up-front was about the platform itself. There had already been some inroads into both Java and .Net, with the former already being used to provide more modern service endpoints. So it seemed eminently sensible to go ahead and use it again to build a more SOA style service where it owns the data too. (Up to that point the system was a monolith where data was shared through the database.)

Due to there being an existing team familiar with the platform they already knew plenty about how to build Java-based services, so there was little risk there, aside from perhaps choosing a RESTful approach over SOAP. Where there would be an opportunity to learn was in the data storage area as a document-oriented database seemed like a good fit and it was something the department hadn’t used before.

Also as a result of the adapter-style nature of the work the team had done before they had never developed a truly “independent” service, so they had a great opportunity to try building something more original in an ATDD/BDD manner. And then there was the chance to make it independently serviceable too which would give them an initial data point on moving away from a tightly-coupled monolithic architecture to something looser [1].

Just Enough Design

In my mind there was absolutely no reason why the project could not be started based on the knowledge and decisions already made up to that point. The basic platform had been chosen and therefore the delivery team was known and so it would be possible to begin scheduling the work.

The choices of protocol and database were yet to be finalised, but in both cases they would be relying heavily on integrating a 3rd party product or library – there was little they had to write themselves. As such the risk was just in evaluating and choosing an approach, and they already had experience with SOAP and their existing database to fall back on if things didn’t pan out.

Admittedly the protocol was a choice that would affect the consumer, but the service was a simple data access affair and therefore there was very little complexity at this stage. The database was going to be purely an implementation detail and therefore any change in direction here would be of no interest to the consumers.

The only other design work might be around what is needed to support the various types of automated tests, such as test APIs. This would all just come out “in the wash”.

Deferring the Decision to Start

The main reason for choosing the project as a point of learning was its simplicity. Pretty much everything about it allowed for work in isolation (i.e. minimal integration) so that directions could be explored without fear of breaking the existing system, or development process.

What happened was that some details surrounding the data format of the 3rd party service were still up in the air. In a tightly-coupled system where the data is assumed to be handled almost verbatim, not knowing this kind of detail has the potential to cause rework and so it is seen as preferable to defer any decision it affects. But in a loosely-coupled system where we decide on a formal service contract between the consumer and producer that is independent of the underlying implementation [2], we have less reason to defer any decisions as the impact will be minimal.

As a consequence of delaying doing any actual development on the service the project reached a point well past the Last Responsible Moment and as such a decision was implicitly made for it. The looming deadline meant that there was no time or resources to confidently deliver the project on time and so it was decided that it would be done the old way instead.

Cost versus Value

One of the reasons I feel that the decision to do it the old way was so easy to make was down to the cost based view of the project. Based solely on the amount of manpower required, it likely appears to be much cheaper to deliver when you’ve done similar work before and have a supply of people readily available. But that only takes the short-term cost into account – the longer term picture is different.

For a start it’s highly likely that the service will have to be rewritten on a newer platform at some point in the future. That means some of the cost to build it will be duplicated. It’s possible many of the same learnings could be done on another project and then leveraged in the rebuild, but what are the chances they’ll have the same deadline luxuries next time?

In the meantime it will be running on a platform that is more costly to run. It may only add a small overhead, but when you’re already getting close to the ceiling it has the potential to affect the reliability of the entire monolithic system. Being done on the old platform also opens the door to any maintenance being done using the “culture” of that platform, which is to tightly-couple things. This means that when the time finally comes to apply The Strangler Pattern it won’t just be a simple lift-and-shift.

Whilst it might be easy to gauge and compare the short-term costs of the two approaches it’s pretty hard to put a tangible value on them. Even so it feels as though you could make a judgment call as to whether doing it on a newer platform was “worth” twice or three times the cost if you knew you were going to be gaining a significant amount of knowledge about how to build a more sustainable system that can also be continuously delivered.

Using Uncertainty as a Driver

One of Kevlin Henney’s contributions to the book “97 Things Every Software Architect Should Know” discusses how we can factor uncertainty into our architecture and design so that we can minimise the disruption caused when the facts finally come to light.

In this particular case I see the uncertainty around the external data format as being a driver for ensuring we encapsulate the behaviour behind a service and instead formalise a contract with the consumer to shield them from the indecision. Whilst Kevlin might have largely been alluding to design decisions the notion “use uncertainty as a driver” is also an allegory for “agile” itself.

Eliminating Waste

There is undoubtedly an element of poetic justice in this tale. The reason we have historically put more effort into our analysis is to try and avoid wasting time and money on building the wrong thing. In this instance all the delays waiting for the analysis and design phases to finish meant that there was no time left to do it “right” and so we will in all likelihood end up generating more waste by doing it twice instead.

Also, instead of moving forward our knowledge around building a more sustainable platform, we know no more than we did before, which means maintenance will continue to be more costly too, both in terms of time & money and, potentially more importantly, morale.

[1] Whilst a monolithic architecture is very likely to be tightly-coupled, it doesn’t have to be. The problem was not being monolithic per-se, but being tightly-coupled.

[2] Yes, it’s possible that such a change could cause a major re-evaluation of the tech stack, but if that happens and we had no way of foreseeing it I’m not sure what else we could have done.

Wednesday, 25 November 2015

Don’t Fail Fast, Learn Cheaply

The term “failing fast” has been around for a long time and is one that I’ve used since the early days of my career. When talking to other developers I’ve never had a problem with it, but using it with business folk has had a different reaction on occasion.

Defensive Programming

I first came across the term (I believe) when reading Steve Maguire’s excellent book “Writing Solid Code”. In it he describes how letting a process crash at the moment something really bad happens is often more desirable than trying to code defensively, as that just masks the underlying issue. Whilst it sucks for the user they stand less chance of even worse things happening, e.g. silent data corruption. I wrote about my own experiences with this type of coding in “The Cost of Defensive Programming”.

Resource Efficiency

The second context under which I met the “failing fast” term was when reading Michael Nygard’s fabulous book “Release It!” Here he was talking about avoiding queuing or doing work which was ultimately going to be wasteful. For example if you can’t acquire a resource because it is unavailable, it’s better to discover that early and fail then instead of waiting until the end at which point you need to throw work away. Once again I’ve told my own tale around this in “Service Providers Are Interested In Your Timeouts Too”.
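
As a rough illustration of that second meaning, here is a minimal sketch, not taken from the book, that checks for a free resource before doing any work rather than after. The FailFastExample class, the pool size of two and the HandleRequest method are all hypothetical names chosen for the example.

```csharp
using System;
using System.Threading;

public static class FailFastExample
{
    // Hypothetical pool of two "connections" guarded by a semaphore.
    private static readonly SemaphoreSlim Connections =
        new SemaphoreSlim(2, 2);

    public static string HandleRequest(Func<string> expensiveWork)
    {
        // Try to take a slot without waiting. If none is free we give up
        // now, before any work has been done that would be thrown away.
        if (!Connections.Wait(TimeSpan.Zero))
            throw new InvalidOperationException(
                "No connection available; failing fast.");

        try
        {
            return expensiveWork();
        }
        finally
        {
            // Always hand the slot back, even if the work itself fails.
            Connections.Release();
        }
    }
}
```

The alternative ordering, doing the expensive work first and only then discovering the resource is unavailable, costs the same failure plus all the wasted effort that preceded it.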

Projects

The most recent use of “fail fast” I’ve encountered has appeared in relation to the delivery of software projects. In this guise we are talking about how to do just enough work to either prove or disprove that the project is in fact viable. At a smaller scale you could apply the same idea to a spike, which is often one part of a project and used to validate, say, a technical approach.

In essence what we’re saying is that if you’re going to fail, make sure you do it as quickly as possible. Delaying the work that will allow you to decide whether the idea is actually viable runs the risk of so much being done that you fall foul of the Sunk Cost Fallacy. Sander Hoogendoorn has a post titled “Failing fast” that talks about this idea in more detail.

Negative Connotations

As you can see the term has many uses and so I’ve found it quite natural when talking to fellow developers in any of these three contexts to say it – they’ve always understood the real meaning. However when talking to less technical people, such as business folk and higher level managers, I’ve found that you can get a different reaction. Instead of latching onto the second word “fast”, they focus on the first word “fail”. What then happens is that the discussion turns into one about “why would we do something where we might fail?” and “isn’t that a backwards step?”

At this point you’ve now got to explain that failing quickly is really not failing per-se, but actually a successful outcome from a business cost perspective. In essence you’ve already put yourself on the back foot and you’ve potentially lost your audience as they try and work out why this “agile” stuff is beneficial if it means you’re going to fail! Why wouldn’t you just do more analysis up front and avoid failing in the first place?

Learning Cheaply

A more positive sounding way of promoting the idea is instead to focus on the point of the exercise, which is to learn more about the problem. And if we flip the notion of “fast” around and turn it into something the business really understands, money, we can talk about saving it by spending less to get an answer to our question. Also, where failing feels like a backwards step, we generally consider learning to be a cumulative process and therefore it sounds like we’re always making some sort of progress instead.

It’s all just smoke-and-mirrors of course, but in a world where everyone is vying for the company’s money being a little careful with your language may just tip the scales in your favour.

Tuesday, 24 November 2015

Missing the Daily Commute by Train

I started programming professionally just over 20 years ago and in all that time I have mostly commuted to my place of work either by car or train. My first role, straight out of university, was at the height of the recession in 1992 and so I pretty much moved to wherever it was going to be. After that I started contracting and did a couple of stints where I commuted by car for an hour each-way which pretty much convinced me that commuting by car any distance was less than desirable. Whilst I had been car-sharing and enjoyed some very interesting chats with my fellow programmer passenger it was still tiring and no fun when stuck in traffic (which eventually became a regular occurrence).

During that time my wife had tried commuting into London by train for her first job and found it was quite palatable. Hence it felt as though I either took a contract on my doorstep (which was unlikely), we moved house, or I headed into London by train [1]. And so I spent the better part of the next 20 years commuting into London and its suburbs.

All Quiet on the Writing Front

You may have noticed that my writing activities have taken a serious nosedive over the last 6 months and that’s almost entirely due to me taking a contract that was once again almost on my doorstep. A chance to commute by car for only 20 minutes a day, and in the opposite direction to all the traffic (which was heading into Cambridge) felt like too good an opportunity to pass up. It wasn’t a hands-on development role like I’ve been doing for the past 2 decades but, frankly, given the short commute I was happy to try my hand at some pure consulting for a change. And I’m glad I did as I learned heaps of stuff in the process [2].

Having such a short commute by car has been an absolute delight. I’ve left the house in the morning, and the office in the evening, when I felt like it rather than to meet a timetable. And the short drive time has meant me spending more time at both ends of the day with the wife and kids [3]. I never quite made it home to enjoy dinner every day with them as getting out of the business park’s car park was a nightmare at 5 pm; but it was close.

That said it seems a little churlish to complain about the lack of time I’ve had to write either for this blog or the ACCU journals. Clearly I have had the time, in the evening for example, but I’ve (implicitly) chosen to spend it differently. As I look back over my career I now begin to understand some of the comments that my colleagues have made in the past around how “lucky” I was to have a regular train-based commute.

A Time to Learn

My journey into London consists initially of 45 minutes solid travel followed by “an amount” of time on the underground rail network which has been anywhere from around 10 to 30 minutes. That first solid block of time has been great for really getting my teeth into a spot of reading, gaming or writing (articles or code) as I nearly always get to sit down, even if it’s on the carriage floor. The underground stretch is “standing room only” but still perfectly fine for reading a paper journal, like MSDN or one of the ACCU publications, as they are easy to hold and manipulate even on a very crowded train.

In the early days when a development capable laptop cost in the region of “thousands of pounds” I spent most of my time reading books about C++, Design Patterns, Windows internals, etc. I also read a variety of journals that have long since gone out of print such as C++ Report, MSJ, CUJ, Application Development Advisor and Dr. Dobb’s. Pretty much the only one left from this era that’s still in print is MSJ, but now under the name of MSDN Magazine. Luckily the ACCU journals, which I only discovered when CUJ disappeared (circa 2005), are also still in printed form.

Deliberate Practice

There is a saying that goes:

In theory there is no difference between theory and practice. In practice there is.

And so I’ve spent plenty of time coding on the train too. The train line I travel on has never even had decent mobile phone reception and so the idea of using the Internet is somewhat laughable. But given that my background has mostly been writing applications and services in C++ this has hardly been a problem to date, and is, in my mind, even highly desirable (See “The Developer’s Sandbox”). Most of what you see on my web site and GitHub page is code that has been written in these little 45 minute stints to-and-from work. Occasionally I’ve done a little bit in the evenings or in the office when the tool has been used to actually help me with my day job, but I’ve never worked anywhere that provides “20% time” - even for permanent staff (and I’d never expect to as a freelancer either).

Habitual Behaviour

It shouldn’t have come as any real surprise that my non-work activities would fall by the wayside the moment that my commute disappeared. After all I’ve taken a few lengthy periods of time off between contracts in the past and despite my best efforts to get motivated and spend it productively I’ve instead found it easy to fritter it away (but in a really nice way, e.g. having time with the family).

It took me a long time to realise just how much structure I need in my life to get “other” things done. Whilst I’d like to believe that I don’t need this kind of formality I’m really just kidding myself. Just as I need my notebook (paper based, of course) by my side to make notes and not forget things, so I need some semblance of order throughout my day to help guide me.

As I write this I’m beginning to wonder how much of what I said in “Code by Day, Design by Night” actually describes my behaviour outside “work time”? I guess my commute means I’ve always had “20%” time, it’s just that it’s had to be on top of my 100% working day. Either way I now realise how valuable to my career that time actually is.

[As if to prove a point I’m only just proof-reading this and hitting the “publish” button now that I’m back commuting again…]

[1] Another choice would have been to go permanent again but I had just started to enjoy the freedom of freelancing and was reluctant to give that up again so quickly.

[2] Hopefully I’ll be filling these very pages with musings from that gig in the coming months.

[3] Let’s put aside for a moment the fact that working from home means a zero-minute commute. Also, given the disruption I caused just by being around when the kids are supposed to be getting ready for school, I’m not convinced my wife always saw it as a bonus :o).

Friday, 25 September 2015

Choosing a Supplier: The Hackathon

One would hope in this day-and-age that when hiring a candidate for a role as a developer the interview process would include some element of actually doing what it is we do - programming. This could be as simple as submitting an offline coding test, but better yet a face-to-face session where you pair with one of the people you’re likely to work with. Either way, just as an artist would expect to show their portfolio, we should expect to do something similar – not just write a CV and then talk about it.

If that’s the case for hiring a single developer, why shouldn’t choosing a software supplier follow a similar process? After all, there is probably far more money at stake. And that’s really what the whole interview process is all about – it’s a form of risk management – you’re trying to find a supplier that you’re confident is going to deliver what you really need in a timely manner and for an acceptable cost.

Show Your Workings

The Hackathon, which probably sounds a bit too hipster for some people’s taste, is the embodiment of this idea – get the potential suppliers to show how they actually produce software, albeit on a very small scale. My last two days have just been spent on one of these and it was great fun, if a little nerve-wracking.

The consultancy firm I’m currently working through (Equal Experts) has been involved in one of these before, quite recently, and they relish the chance to do this kind of selection process more as it plays to their strengths – delivering working software incrementally using modern development techniques [1].

By all accounts this particular one was slightly different to the previous one, although it still followed a similar outline. There are a number of very small teams (3) representing the different suppliers. Each team has a small number of people (~4) that covers whatever skills they think they’ll need based on the brief. In our case we had 3 devs and a UX person, but obviously we could also act in a lesser capacity as a BA, QA, Ops, etc. too to cover all the bases.

The overall structure was that we would spend a short period learning a bit about the company we were pitching to and about the problem they wanted us to tackle. Each team was then given a separate “war” room in which they could work for roughly two days before presenting back at the end.

Where in the previous Hackathon they got more freedom about what they wanted to tackle (i.e. a higher-level problem statement) this one had a more specific problem to be solved. The problem was also directly related to the actual problem the chosen supplier would eventually be asked to work on, which makes sense.

During the two days various people involved in the selection process would come round and visit us to see what we’d been up to and that would also give us an opportunity to ask any questions we had about the problem. If we really needed to we could have called upon The Business at any time to come and help us but their visits were frequent enough that it meant we never needed to go that far.

Our Approach

Naturally I can’t say anything about the problem itself, but suffice to say that it was far bigger than anything we could expect to build in just two days. However that didn’t stop us tackling it in the same way we would a real project. We did some up-front analysis to explore the problem domain a fair bit, made some key architectural decisions to decide what we were going to build, created a backlog with some stories on it and then started coding. We also littered the walls with sheets of magic whiteboard [2], post-it notes and index cards that covered our questions, answers, architecture, backlog, etc.

We were able to do a little bit of work beforehand, such as creating a GitHub repo that we could all check we had access to, along with a dummy build in AppVeyor to cover the CI part. The guest Wi-Fi wasn’t brilliant [3] which meant pushing and pulling to/from GitHub was a bit laggy, but it was usable and the cloud based CI monitor was stable enough.

Despite it being unofficially described as a “Hackathon” we decided we would do the best we could to show how we worked in practice, rather than try and push out as much code as possible. Whilst we no doubt could have got more code working if we had cut corners on some of the other parts of process (e.g. analysis or testing) we would not have given a fair representation of what we do (IMHO). I’ve touched on this before in “Keeping the Faith Under Pressure” and I’m pleased with what we produced. I’m perfectly happy to call the code I wrote over the two days “production code”.

Day One

After doing our early analysis we started putting together the walking skeleton which was a text-book example of Conway’s Law as we initially kept out of each other’s way whilst we got the boilerplate stuff together. Of course this separation didn’t last long as by the end of the day the temporary assemblies and folders had all gone in a simple refactoring exercise so that we had a coherent codebase to build from.

We finished the day with our technology stack up and running (browser-based client + REST API), watched over by AppVeyor, and a very thin slice of functionality in play which we demoed to a couple of the stakeholders.

Day Two

The following day was essentially a few hours shorter to allow time for each team to present to the panel, which also meant we’d need to factor in some time to put the presentation together. In essence we only had ~5 hours to implement the next set of features and so got one of the stakeholders in first-thing to make a priority call so we knew where to focus our efforts during the morning.

During the time we three devs were building a working product our UX expert was building a clickable prototype to help explore the problem further ahead. This had the benefit of us being able to see the context in which the data was used and therefore we could better understand the bigger picture. In such a short timeframe it was perhaps difficult to see what beneficial effect it had on our design and thinking but what it did show clearly was how we work and how important we believe UX to be to the development process in general.

We stopped coding an hour before our presentation slot to give ourselves plenty of time to think about what we needed to say and show. Our UX expert had surreptitiously taken plenty of photos of us engaging with the stakeholders along with the evolving boards and backlog so that we could put together a compelling story about our approach.

We talked for 30 minutes about the process and architecture, and gave a demo of what we’d built to date. We then had about 15 minutes to answer questions from the panel, both about our approach and how we might tackle some of the questions we hadn’t yet answered in the code.

Each team got to present to the panel in isolation, but we all hung around until the end at which point we each did a very brief version of the presentation to the other teams. This was really interesting as up to that point we had no idea what the others were up to. For example we chose a thin-client whereas the other two chose thick-clients. We used post-its and a makeshift whiteboard for our notes and product backlog whilst another used an online tool and the third didn’t mention their approach.

Wrapping Up

Did the exercise achieve what it set out to do? As I’m not the client I have no idea what their final decision is or whether the eventual product has met their needs, because clearly it hasn’t been built yet. But I believe their behaviour suggested that they were pretty pleased with what they had seen from everyone. I think they seemed surprised that the three teams had behaved quite differently and so they probably got a lot more out of the exercise than anticipated as each team would probably have asked different questions and explored different avenues. Given that this was a real problem we were working on I’m sure there is a lot of value in that alone.

Personally I went into the process somewhat nervous. My current role is a real departure for me - it’s not a hands-on development role. As such I thought I might not be “match fit” even with 20 years of coding experience behind me. What I forgot though was that the point of the process was to be ourselves and write code as I would for real and that just came back naturally. Hopefully I added an equal amount of value to the team and gave as good an account of myself as the others appeared to.

So, is this the way I think tenders should be done? Yes, but I’m just a developer :o). I did ask someone at Equal Experts about how it compared cost-wise for them given that the RFP (Request for Proposal) process can also be pretty time consuming and he suggested it wasn’t that far off. Not having done any others I can’t say whether the client had a disproportionate number of people involved at their end but given the potential sums at stake I’m sure they saw it as a good investment. I certainly hope more companies do it in the future.

[1] Apologies if that sounded like a thinly-veiled sales pitch, it wasn’t meant to be. I was just trying to describe how they see themselves; I’m an associate, not an employee.

[2] Somewhat unfortunately the room we were in had frosted glass and the sheets of magic whiteboard didn’t really stick to it. Without any blu-tack or sellotape we had to use even more post-it notes to hold up the whiteboards!

[3] The Wi-Fi dropped out every now and then and had limited bandwidth which meant that any image-heavy web pages could take a while to load.

Monday, 7 September 2015

Stand-Up on the Beach

Back in April I performed a stand-up comedy routine as a lightning talk on an unsuspecting audience at the ACCU Conference (see my previous blog post The Daily Stand-Up). At this year's Agile on the Beach I got to have another go at it, but not as a lightning talk, this time it was going to form part of the pre-conference evening event (aka “pasty night”).

Whereas the ACCU conference is almost entirely about software craftsmanship, Agile on the Beach has mostly other streams covering all aspects of the software development process. As such I needed to come up with a different routine that catered for a much wider audience. Given its nature I swapped out many of the programming specific puns and replaced them with something (hopefully) more appropriate, i.e. more process related. Also there was a bonus track on continuous delivery this year so that meant I could throw in some relevant content there too.

Once again it seemed to go down pretty well, by which I mean the audience groaned appropriately :o). So for those of you unfortunate enough to have missed it, here is the set:

“Was the Tower of Pisa built using lean manufacturing?”

“Agile methods might be all the rage these days but I reckon the writing’s on the wall for Kanban.”

“Are cross-functional teams just a bunch of grumpy LISP programmers?”

“Some say Scrum’s sprint goals are necessary for motivation, but Kanban is also about cracking the WIP.”

“If you want to adopt Agile Release Trains do your developers need to use Ruby on Rails?”

“The last census had a box labelled ‘Religion’, so I put ‘TDD’.”

“When I’m working from home I like to get the kids involved in my coding; I call this ‘au pair programming’.”

“My team’s not really got the hang of this agile stuff – our successes are just stories and our fails are all epics.”

“If you have too many information radiators do you get scrumburnt?”

“The other day the product owner asked me why all our acceptance tests only covered happy paths. I said they’re rose tinted specs.”

“Agile’s a lot older than many people think – Dick Turpin was doing stand-up and deliver years ago.”

“If poor quality code results in technical debt, does that make bad programmers loan sharks?”

“Some say C# and Java programmers overuse reflection, but given the quality of their code I’d say they aren’t reflecting enough.”

“Is it me or are C# and Java so similar these days they’re harder to tell apart than The Munsters and The Addams Family?”

“Our system has five-nines reliability, it’s usually working about 45% of the time.”

“When working for New Scotland Yard do developers have to work on a special branch?”

“I really dig software archaeology.”

“As a baby I was brought up on Farley’s, and I’m bringing my children up on Farley’s too – at bedtime I read them a chapter from Continuous Delivery.”

“Are modern developers obese because they rely so heavily on syntactic sugar?”

“Is the removal of a dependency injection framework from a codebase known as ‘spring cleaning’?”

“If you think keeping up with the Kardashians is hard, try JavaScript frameworks.”

“When it comes to drawing diagrams of micro-service architectures, I never know where to draw the line.”

“The other day I went to the dentist and he told me I had a scaling problem. I said that’s awkward as I’ve no room for any more teeth.”

“Our DR strategy is not so much active/passive as passive/aggressive – when it fails we sit around and tut loudly until someone fixes it.”

“Don’t upgrade your database as the SQL is never as good as the original.”

“I blame Facebook for the quality of modern SQL – young developers are so obsessed with LIKEs.”

“When Sherlock Holmes talks of a three-pipe problem does he mean it needs grep, sed, awk and sort?”

“I don’t understand why the police are profiling criminals – surely they shouldn’t be helping them to commit crimes more efficiently?”

“C++ has complexity guarantees – if you write C++, it’s guaranteed to be complex.”

“Some people are like ‘chars’ in C, they get promoted for no apparent reason.”

“Would our codebase be healthier if we only used natural numbers?”

“One of the hardest problems in computer science is dealing with nans – they always want you to fix their machines.”

“In my thirties I blew loads of cash on a monster gaming rig, I think I was suffering from a Half-Life crisis.”

“I forgot my password the other day so I tried a dictionary attack – I just kept hitting the administrator over the head with a copy of the OED until he let me in.”

“My wife and I have been together for many years so I thought I’d get a token ring, but it seems you can only get Gigabit Ethernet these days.”

“My son was being hassled at school to share his music on the internet with BitTorrent. I told him not to give in to peer-to-peer pressure.”

“When flying here I had to put my phone into airplane mode, at which point it assumed the crash position.”

“Are electronic cigarettes just vapourware?”

“I spent all day trying to upload a picture of Marcel Marceau but the server kept responding with ‘415 Unsupported Mime Type’.”