Tuesday, 21 October 2014

So Many Wrongs, But No Rights

As we approach Christmas, that wonderful time of year for giving and receiving, we also approach that annual corporate ritual that is the Change Freeze. It’s also a time when I get to remember what is probably the worst software release I’ve had the misfortune to be involved in...

The Baroque Import Process

Like all successful systems it had been grown organically from a walking skeleton. Most of the codebase had decent test coverage and we had a fairly good build pipeline going out to a development instance that ran as close to production as possible. Most functional problems showed up in development and the UAT environment highlighted anything else non-environmental. We even had unit tests for most of our SQL code!

However, we also had some big chunks of technical debt [1] too. One particular stored procedure was now a behemoth (a many-hundred line monster) and had been developed without any test coverage. By the time this was recognised it had already grown massively by cut-and-paste, and naturally nobody wanted to go back and write tests for it; nor was there any drive from management to tackle this particular piece of debt [2]. The other related data was handled not by this procedure but by another complex maze of procedures and views. There was some “token” test coverage here, but not around the behaviour described below.

These procedures were used to handle the versioning of an upstream feed. Very little data changed day-to-day so we used a versioning strategy that involved comparing yesterday’s and today’s data and only creating new records for updated entities. Another table then tied together which version should be used for which business date.
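
To make the versioning idea a little more concrete, here is a rough sketch in C# of the shape of that day-on-day comparison. The real thing was a set of SQL stored procedures, so the types and method names below (FeedEntity, IVersionStore, etc.) are entirely hypothetical:

// Rough sketch only - the production code was SQL, and the
// FeedEntity and IVersionStore types are hypothetical.
void ApplyDailyFeed(IEnumerable<FeedEntity> todaysFeed,
                    IVersionStore store, DateTime businessDate)
{
  foreach (var entity in todaysFeed)
  {
    var latest = store.FindLatestVersion(entity.Id);

    // Only create a new version when something has actually changed.
    if (latest == null || !entity.DataEquals(latest))
      store.CreateNewVersion(entity);

    // Record which version should be used for this business date.
    store.MapBusinessDate(entity.Id, businessDate);
  }
}

The bug that surfaces later in this tale boils down to the SQL equivalent of that “has anything changed?” comparison being made against every prior version rather than just the latest one.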

The Replacement Process

These procedures had allowed us to go live on time and had survived a year in production, but eventually parts of the manual process were going to be automated upstream, so we finally got the go-ahead to replace the major parts of this Heath Robinson process with something that lived outside the database and could provide us with better validation and diagnostics. This was developed in the lead-up to Christmas and looked good to be scheduled for release just before the change freeze kicked in.

The Change Freeze

This company, like many others, tries to mitigate the risk of problems escalating when no one is around to fix them during the festive period by putting a halt on all non-critical changes. It doesn’t matter whether you have been delivering continuously for over 12 months without a single cock-up, you still can’t deploy any changes unless they are to directly fix a priority one production incident.

The Last Minute Data Update

With just a week or so to go before the change freeze the business decided they needed to tweak some key data so that it would be in for the year-end runs. This data was “manually calibrated” by them and very rarely changed. So we restored UAT to match production, ran in the new data and waited for the results. Most of the results were fine, but as we looked a little closer we discovered that the report contained some entities where the data change appeared to have taken effect, yet the calculation result had not changed with it.

Given the number of problems that had already shown up in the reporting component it seemed entirely plausible that another one had crept in somehow, but digging deeper proved that wasn’t entirely the case. The reporting code was known to be dubious for other reasons, but that didn’t explain why this particular data change had not had an effect everywhere it was expected to.

Development != UAT

We also applied the data change to the development environment where the new release had been chugging along quite nicely. But the following morning the development system disagreed with UAT too. It looked like there might be a bug in the replacement process as well. Given how close we were to going live with this larger change we quickly dived in to see what had broken and whether we could also fix this before the freeze kicked in.

The Bug Unearthed

It turned out the new code was correct; the problem was actually in the old code, the code currently in production. The bug was in the versioning process: it failed to detect a new entity version when the single value the business wanted to change was the only difference in the entity’s data. As a consequence the calculations had been done using the old value. Oops.

The One-Line Fix

The fix was trivial. All we needed to do was add an extra predicate in one of the select statements to ensure that the comparison was done against the latest version and not every prior version:

AND thing.Version = thing.LatestVersion

We knew the fix was this simple, and that it would work without even testing it...

The Irony

How did we know? Because we’d seen it before. Nope, it was better than that: the person who made this mistake had seen it before and had even raised a bug request for the previous problem some months earlier.

Now It Gets Ugly

Not only did we have a new bug, but the results of the test run also highlighted a longstanding problem with the reporting code. This code, which was bashed out in isolation [3], pulled in data from all over the place. In particular, instead of reporting data from the dated snapshot tables, which are what the calculations use, it pulled it from the staging tables where the input feed is kept.

It might sound like these dated snapshot tables live in the “sanitised data” schema; they don’t. Both the unprocessed input feed and the dated snapshot live in the same table in the staging schema. The column where the value should have gone was never populated. No, that’s not quite true: it was populated, and versioned correctly, but with a different value that was never used.

Consequently what was reported was the value provided by the business, but that value never made its way into the set used for number crunching, hence the disparity. Sadly two wrongs did not make it right in this case.

Now It Gets Political

In any “sane” project I’d hope we could just hold up our hands and admit we have a bug, mention that we already have a fix ready to go, and then discuss how best to get it deployed before moving on to the next most important thing. If the result of the discussion was that we would have to wait, that they could live with the bug, then so be it; at least we would have been transparent. But that’s probably why I’d never make a “good” manager.

Instead we tried to find ways to disguise the bug. We couldn’t deploy the fix, because so many other calculations that had also been working with incorrect data would come to light. We couldn’t deploy our new release either, because the replacement component didn’t have the bug. This left us with somehow tweaking other data so that it would force a version change to occur, e.g. adding an extra space to an unused text field. The final option was stalling until the change freeze ended, when we could hopefully bury bad news by folding the supposedly technical-only release [4] in with a data release so that the bug would get lost in the noise.

The Dark Cloud Arrives

With the change freeze over (6 weeks later) we had quite a backlog of changes. The frustration of the change freeze meant that the business were happy to lump our new release together with some other 3rd party and data changes. Our opportunity to bury bad news had arrived and the release was pushed out. The numbers all moved about, people murmured that things had moved more than expected, but it quickly went quiet again. Normality resumed.

Epilogue

I don’t think I’ll ever understand why we couldn’t just hold up our hands and admit we’d made a mistake. Compared to many other projects around us we were a beacon of success: we had originally delivered on time and under budget, albeit without all the bells and whistles originally planned. In the 12 months after going into production we had delivered at regular intervals measured in weeks, not months, and had not once had an outage caused by something our team had done. In contrast, we had to push back a number of times on the 3rd party components provided to us because they didn’t entirely do what was expected of them - we became their regression test suite! I would hope that kind of delivery record would have afforded us the right to mess up slightly once in a while, but perhaps not. Maybe what trust we had built up was actually worth nothing.

They say confession is good for the soul. This is mine.

 

[1] Technical Debt is really about shortcuts taken for expediency, not crap code. However, the crap code was quickly identified and a decision was made to live with it - by which I mean no decision was made to do anything about it up front. Does that count as a conscious decision, and does it therefore become technical debt?

[2] Whilst I agree in principle that refactoring should be a by-product of any story, sometimes the debt grows into something better served by an architectural refactoring.

[3] This one component has provided inspiration for a number of other blog posts, “The Cost of Defensive Programming” in particular.

[4] We tried to avoid mixing purely technical changes, e.g. architectural refactorings and upcoming features (toggled off), with changes that affected the calculation results. This was to ensure our regression testing had virtually no noise. Packaging a number-breaking change in isolation was also a defence mechanism that allowed us to squarely point the finger outside our team when it went wrong. And it did, on numerous occasions.

Monday, 20 October 2014

What’s the Price of Confidence?

I recently had one of those conversations about testing that comes up every now and then. It usually starts with someone in the team, probably the project manager, conveying deep concerns about a particular change or new feature and getting twitchy about whether it has been tested well enough. Where the concern comes from the management side, that fear can be projected onto the team in a way that attempts to use it as a tool to somehow “magically” ensure there are no bugs or performance problems (e.g. by implying that lots of late nights running manual tests will do the trick). This is all in contrast to the famous quote from Edsger Dijkstra:

Program testing can be used to show the presence of bugs, but never to show their absence

Financial Loss

The first time this situation came up I was working at an investment bank and the conversation meandered around until a manager started to get anxious and suggested that if we screwed up it could cost the company upwards of 10-20 million quid. Okay, so that got my colleague’s attention and mine, but neither of us could see anything obvious we could do that would ensure we were “100%” bug free. We were already doing unit testing and some informal code reviewing, and we were also delivering to our development system-test environment as often as we could, where we ran in lock-step with production but on a reduced data set and at a reduced calculation resolution.

In fact the crux of the argument was really that our UAT environment was woefully underpowered - it had become the production environment on the first release. If we had parity with production we could also do the kind of regression testing that would get us pretty close to 100% confidence that nothing, either functionally or performance-wise, was likely to appear after releasing.

My argument was that, knowing what we do from Dijkstra, if the company stands to lose so much money from us making a mistake, then surely the risk is worth the investment by the company to help us minimise the chances of a problem slipping through; us being human beings and all (even if we are experienced ones). Bear in mind that this was an investment bank, where the team was made up of 6 skilled contractors, and we were only asking for a handful of beefy app servers and a few dozen blades to go in the compute grid. I posited that the cost of the hardware, which was likely to be far less than 100K, was orders of magnitude lower than the cost of failure and amounted to only a month or two of what the entire team cost. That outlay did not seem unrealistic to me given all the other project costs.

Loss of Reputation

The more recent conversation was once again about the parity between the pre-production and production environments, but this time about the database. The same “fear” was there again, that the behaviour of a new maintenance service might screw up, but this time the cost was more likely to be directly expressed as a soiled reputation. That could of course lead to the loss of future business from the affected customers and anyone else who was unhappy about a similar prospect happening to them, so indirectly it could still lead to some financial loss.

My response was once again a sense of dismay that we could not just get the database restored to the test environment and get on with it. I could understand if the data was sensitive, i.e. real customer data needing masking, or if it was huge (hundreds of TBs, not a couple of hundred GBs), although that would give me more cause for concern, not less. But it wasn’t, and I don’t know why this couldn’t just be done, either as a one-off, which is possibly more valid in this scenario, or better yet established as a practice going forward.

The cultural difference at play here is that the databases appear to be closely guarded and so there are more hoops to go through to gain access to both a server and the backups.

Provisioning

Maybe I’m being naive here, but I thought one of the benefits of all the effort going into cloud computing is that the provisioning of servers, at least for the run-of-the-mill roles, becomes trivial. I accept that for the bigger server roles, such as databases, the effort and cost may be higher, but given how easily they can become the bottleneck we should put more effort into ensuring they are made available whenever the chances of a performance problem showing up are heightened. At the very least it must be possible to temporarily tailor any test environment so that it can be used to perform adequate testing of the changes that are a cause for concern.

Continuous Delivery

This all sounds decidedly old school though, i.e. doing development followed by a big bang release where you can focus on some final testing. In his talk at Agile on the Beach 2014 Steve Smith described Release Testing as Risk Management Theatre. A more modern approach is to focus on delivering “little and often”, which means that you’re constantly pushing changes through your pre-production environment; so it has to be agile too. If you cannot or will not invest in what it takes to continuously scale your test environment(s) to meet their demands then I find it difficult to see how you are ever going to gain the level of confidence that appears to be being sought.

One thing Steve simplified in his talk [1] was the way features are pushed through the pipeline. In his model features go through in their entirety, and only their entirety, which is not necessarily the case when using practices such as Feature Toggles, which force integration to happen as early as possible. A side-effect of this technique is that partially finished features can go out into production sooner, which is potentially desirable for pure refactorings so that you begin to reap your return on investment (ROI) sooner. But at the same time you need to be careful that the refactoring does not have some adverse impact on performance. Consequently Continuous Delivery comes with its own set of risks, but the general consensus is that these are manageable and worth taking to establish an earlier ROI - as long as you are geared up for it.
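
For anyone unfamiliar with the technique, a feature toggle is conceptually tiny. A minimal sketch, assuming a hypothetical Toggles helper and a pair of pricing engines, might look like this:

// Minimal sketch of a feature toggle; Toggles, Trade, _newEngine and
// _oldEngine are hypothetical. The new code path is integrated and
// deployed early, but only exercised once the toggle is switched on.
public decimal Price(Trade trade)
{
  if (Toggles.IsEnabled("UseNewPricingEngine"))
    return _newEngine.Price(trade);   // partially finished feature

  return _oldEngine.Price(trade);     // current production behaviour
}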

One of the questions Steve asked in his talk was “how long does it take to get a code change into production?” [2]. Personally I like to think there are really two questions here: “how long does it take to reap the ROI on a new feature or change?” and “how long does it take to roll a fix out?”. A factor in both of these is how confident you are that your development process delivers quality code and does adequate (automated) testing to root out any unintended side-effects. This confidence comes at a price that includes direct costs, such as infrastructure & tooling, but also indirect costs such as the time spent by the team writing & running tests and reviewing the design & code. If you decide to save money on infrastructure and tooling, or work in an environment that makes it difficult to get what you need, how are you going to compensate for that? And will it cost you more in time and energy in the long run?

 

[1] I asked him about this after his talk and he agreed that it was a simplified model used to keep the underlying message clear and simple.

[2] This question more famously comes from “Lean Software Development: An Agile Toolkit” by Mary and Tom Poppendieck.

Friday, 17 October 2014

Terse Exception Messages

Yesterday, whilst putting together a testing tool, I hit one of those generic framework error messages that says exactly what’s wrong but nothing about where the problem might be. A large part of this diversion was down to being halfway through adding support for a new command line verb, then returning to it after a lengthy phone call having forgotten that I hadn’t quite finished wiring in the initial NOP implementation.

Everything looked to be in the right place, but when I ran it I got no output at all - not even an error message. Luckily I always test my command line applications with my two-line batch file (See “Windows Batch File Template”) that reports the process exit code to ensure I’m returning the correct code on failure. It showed a “2” which is our generic failure code, so that was ok.

A quick spot of debugging of our console application base class and the problem was easy to see, the code looked like this:

int exitCode = ExitCode.Failure;

foreach (var handler in _handlers)
{
  if (handler.Verb == verb)
  {
    exitCode = handler.Execute(options);
    break;
  }
}

return exitCode;

As you can probably see, if there is no handler wired up for the verb it just falls out of the bottom of the loop. Luckily we got one thing right, which was to adopt the stance of “Assume Failure by Default”, so that at least an error was reported, albeit not a particularly helpful one.
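
For context, the “Assume Failure by Default” stance is baked in at the very top of the application. The following is only a sketch of what such a wrapper might look like, not the actual base class, and ExitCode is a hypothetical constants class:

// Sketch only: start by assuming failure so that any path which doesn't
// explicitly succeed - including an unhandled exception - still reports
// a failing exit code.
public static int Main(string[] args)
{
  int exitCode = ExitCode.Failure;    // e.g. 2, our generic failure code

  try
  {
    exitCode = new Program().Run(args);
  }
  catch (Exception e)
  {
    Console.Error.WriteLine(e.Message);
    // exitCode is already Failure, so there is nothing more to do here.
  }

  return exitCode;
}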

A Better Error Message

One could argue at this point that, while the code is probably not brilliant, given the domain (a console application) it’s fairly easy to test and fix whatever mistake I made. But is it worth putting any effort into improving matters, and if so, how much? Personally I think avoiding fixing this kind of problem is part of what leads to the “Broken Windows” syndrome. It’s only a small piece of code and a sprinkling of LINQ will soon sort it out:

return _handlers.Single(h => h.Verb == verb)
                .Execute(options);

This is much simpler and it generates a bit more noise when it fails, which feels like A Good Thing. When I ran it I got the following error in my console window though:

{System.InvalidOperationException} Sequence contains no matching element

Anyone who has ever done any LINQ will recognise this somewhat terse error message. Of course it’s not the framework’s fault; after all, how can it know anything about how the method is used? And since it’s generic code it can hardly tell you what element you were looking for when all we’ve passed in is a lambda.

An Even Better Error Message

At this point we have managed to reduce our code to something that is pretty simple and gets the job done - case closed, right? The question I asked myself though was whether I should undo some of that simplification in order to produce a better diagnostic message. As I mentioned above, the scenario where this code is used means it wouldn’t take long with a debugger to track down the mistake, so is there really any value in going one step further?

I have found in the past that developers often concentrate on the happy path and then generally gloss over the error paths, by which I mean they make sure an error is reported, but don’t put too much effort into ensuring that what is reported would be of use to fellow developers or, more importantly, the support team. I’ve discussed this before in “Diagnostic & Support User Interfaces” but suffice to say that a key part of what makes a complex system supportable is well thought out error handling and clearly written diagnostic messages.

In this instance I decided the LINQ exception was still too terse, and although I wouldn’t be able to pinpoint the source of the error clearly, I felt I could probably do much better with only a little extra code:

var handler = _handlers.SingleOrDefault(h => h.Verb == verb);

Constraint.MustNotBeNull(handler,
    "No handler configured for verb '{0}'", verb);

return handler.Execute(options);

This meant that the output from running the console application now was more like this:

{ConstraintViolationException} No handler configured for verb 'new-verb'

By switching from using Single() to SingleOrDefault() I got to catch the missing handler problem myself and effectively report it through an assertion failure [1]. Of course I don’t explicitly catch the case where two handlers with the same name are registered; that fault will still be reported via a terser LINQ exception. However I felt at the time that this was less likely (in retrospect a copy-and-paste error is probably more likely).
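
For completeness, a Constraint helper along these lines needs very little code. This is only a rough sketch (the real class differs), but as footnote [1] mentions the essential property is that, unlike Debug.Assert(), the check survives into release builds:

public class ConstraintViolationException : Exception
{
  public ConstraintViolationException(string message)
    : base(message)
  { }
}

public static class Constraint
{
  // Unlike Debug.Assert() this check is compiled into release builds too.
  public static void MustNotBeNull(object value, string format,
                                   params object[] args)
  {
    if (value == null)
      throw new ConstraintViolationException(String.Format(format, args));
  }
}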

The Anguish of Choice

This is the kind of small problem I bump into all the time and I never really know which way to go. While on the one hand I love the simplicity of the code I could have written, I’ve also worked on too many codebases where problems are tedious to solve because not quite enough thought was put into the likely failure scenarios up front, or the code wasn’t refactored when they did show up, to improve the lives of our successors [2].

 

[1] The Constraint class plays a similar role to Debug.Assert() but is compiled into release builds too. Code contracts could also be used but I wrote this long before working on a C# codebase where the use of real Code Contracts was possible.

[2] A seamless link to another recent post: “Will Your Successor Be a Superstar Programmer”.

Wednesday, 8 October 2014

Will Your Successor Be a Superstar Programmer?

Something I struggle with when writing code is trying to factor in what the maintenance programmer who comes after me will be like. I’ve touched on this before in “Can Code Be Too Simple?” when I showed some code that could be succinctly implemented in C++, but might not be quite as obvious to someone better versed in other languages. The question is: can they maintain the code you’ve written?

It might sound ridiculous asking someone who is not an “expert” in a particular programming language to fix a problem, but that is exactly what has happened in some of the companies I’ve worked in. The likes of GitHub and Google might get to hire the cream of the crop and therefore have a level of expertise in their programmers that ensures they can write the best code possible, but those who don’t quite cut the mustard will end up somewhere else. As such the average level of programmer ability in the rest of the world is likely to be somewhat lower than that of the industry’s superstars.

Of course that assumes the code is maintained by the programmers who originally built the system in the first place. One client I worked at has a more unusual approach to handling software maintenance - give it to the support team. That’s right, they hire some experienced developers to build them a system and then, on release, they hand the codebase over to the support team to maintain. Don’t get me wrong, the support team are not idiots, far from it, but their day job is operational support and system administration, not becoming the best programmers they can be. I’m not sure whether I should find it condescending or not that my client considers the code I write to be so simple that it can be maintained by someone who does it only part-time.

When I was working on that project I had a brief discussion with a colleague about this setup. I voiced my concerns about the fact that this codebase was going to be supported by people who were most probably not nearly as experienced as the programmers who were going to develop it. In the back of my mind I always have that quote from Brian Kernighan [1]:

“Debugging is twice as hard as writing the code in the first place. Therefore, if you write the code as cleverly as possible, you are, by definition, not smart enough to debug it.”

His reply, which I think is a perfectly valid viewpoint, was that it is not our problem if the company that hires us to write their software chooses not to hire people of equal (or even greater) talent to support it.

Maybe that is the whole point. By writing the absolute best code I can (by using the natural idioms, proper domain types, etc.) the code stands a better chance of being supported by a non-expert? Sadly I won’t be around to find out, because if I was, I’d be the one supporting it [2].

 

[1] Brian W. Kernighan and P. J. Plauger in The Elements of Programming Style. See Programming Quotes.

[2] Whilst I was around, even on a different project, any changes still came my way. I’m not sure whether that was for convenience, a lack of resources or a risk-reduction measure for this very reason.

Tuesday, 7 October 2014

Who’s Maintaining the 100 Foot View?

Last year I watched Michael Feathers give the keynote at Agile Cambridge 2013. It was another one of his Software Archaeology based talks and he touched on a few of the usual topics, such as Technical Debt and the quaint, old-fashioned notion of Big Design Up-Front (BDUF) via an all-encompassing UML model. We all chuckled at the prospect of generating our system from The Model and then “just filling in the blanks”.

Whilst I agree whole-heartedly with what he had to say, it got me thinking a bit more about the level of design that sits between the Architect and the Programmer. Sadly I only got to run my thoughts briefly past Michael as he wasn’t able to hang about. I think I got a “knowing nod of agreement”, but then I may also have been given the “I’m going to agree so that you’ll leave me alone” look too :-).

What I’ve noticed is that teams are often happy to think about The Big Picture and make sure that the really costly aspects are thought through, but less attention is paid to the design as we start to drill down to the component level. There might be a couple of big architecture diagrams hanging around that illustrate the overall shape of the system, but no medium or small diagrams that home in on the more “interesting” internal parts of the system.

In “Whatever Happened to UML?” I questioned why the tool fell out of favour, even just for the notational convenience, which is how I use it [1]. I find that once a codebase starts to acquire functionality, especially if done in a test-first manner, it is important to put together a few rough sketches to show how the design is evolving. Often the act of doing this is enough to point out inconsistencies in the design, such as a lack of symmetry in a read/write hierarchy or a bunch of namespaces that perhaps should be split out into a separate package.

In C# the “internal” access modifier is worth nothing if you bung all your code into a single assembly. Conversely, having one assembly per namespace is a different kind of maintenance burden, so the sweet spot is somewhere in between. I often start with namespaces called “Mechanisms” and “Remote” in the walking skeleton, used for technical bits-and-bobs and proxies respectively. At some point they will usually be split off into separate assemblies to help enforce the use of “internal” on any interfaces or classes. Similar activities occur for large clumps of business logic when it’s noticed that the number of common dependencies between them is getting thin on the ground, i.e. the low cohesion can be made clearer by partitioning the codebase further.
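
As a trivial illustration (with hypothetical names), once the Mechanisms namespace is split into its own assembly the “internal” modifier finally starts to earn its keep - the implementation class below simply cannot be reached from the rest of the codebase:

namespace Product.Mechanisms
{
  // The published surface of the Mechanisms assembly.
  public interface IClock
  {
    DateTime Now { get; }
  }

  // Invisible outside the assembly once Mechanisms is split off; in a
  // single-assembly codebase "internal" enforces nothing extra.
  internal sealed class SystemClock : IClock
  {
    public DateTime Now
    {
      get { return DateTime.UtcNow; }
    }
  }
}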

To me refactoring needs to happen at all levels in the system - from the architecture right down to method level. Whilst architectural refactorings have the potential to be costly, especially if some form of data migration is required, the lower levels can usually be done far more cheaply. Moving code around, either within a namespace in the same package or by splitting it off into separate packages, should be fairly painless if the code was well partitioned in the first place and the correct access modifiers (i.e. internal and private) were used and adhered to.

And yet I see little of this kind of thinking going on. What worries me is that in the rush to embrace being “agile” and to adhere to the mantra of doing “the simplest thing that could possibly work” we’ve thrown the proverbial baby out with the bath water. In our desire to distance ourselves from being seen to be designing far too much up front we’ve lost the ability to even design in the small as we go along.

 

[1] Interestingly, Simon Brown, in his talk at Agile on the Beach 2014 (Agility and the essence of software architecture), questioned whether there was any real value even in the UML notation as a common convention. It’s a good point, and I guess as long as you make it clear whether it’s a dependency or data-flow diagram you’ll know what the arrowheads correspond to.

Monday, 6 October 2014

Should You Mention Your Bio When Speaking?

The day before I was due to speak at the Agile on the Beach conference a re-tweet entered my timeline suggesting that mentioning your biography during your talk was pointless. The premise seemed to be that people came to hear your content, not your life story. And anyway they can see your bio in the programme so just “get on with it already”.

I’m not sure exactly what had happened to cause that tweet but I found myself reflecting on part of my own talk just hours before I was due to give it, which was mightily uncomfortable. I reasoned that perhaps what they had seen was someone dedicating a significant amount of time, say 10 minutes, to reeling off their CV. If that was the case then I wholeheartedly agree with the sentiment. However, on the off-chance that what they actually saw was only a minute or so of back-story, I’m going to dispense my $0.02 on the topic, for what it’s worth...

Context is everything, especially when you’re trying to understand what the presenter is trying to say and how they may have come to their conclusions. I feel my Test-Driven SQL talk requires a bit more background than the other talks I have given in the past because I am not normally a part of the community for which it is intended, nor from the community in which you might normally find someone to present it - I’m a C++/C# application programmer by trade, not a SQL developer or DBA. The entire basis of the talk is about how I came to apply the same set of principles from my normal programming endeavours to the world of the relational database. Although what I say makes sense to me and my colleagues (who also apply it), I know it may seem unnatural to a native from that side of the software development fence.

In a world where one size never fits all, it is the constraints (or lack of them) on the speaker’s subject matter that allow us to put the content they are presenting into focus. And as such we can either identify ourselves with those constraints and become more attentive, or realise it comes from a different world (or at least one we are unlikely to face in the shorter term) and just enjoy the talk as another bit of background knowledge to bank for when the relevant time comes.

For example, a common disparity exists between the start-up and enterprise cultures. Anyone from a garage start-up reading my recent “Developer Freedom” article will probably wonder what all the brouhaha is about. Similarly they may also ignore any notion of formalising a SQL-based public interface to their database because time-to-market matters, and anyway they only have a single product so there aren’t any “data leeches” to worry about.

Like all headline-grabbing tweets the truth is no doubt much murkier than it appears on the surface. Personally I welcome some relevant knowledge about a speaker’s background in the talk, as I’m generally too lazy to read all the abstracts and bios before picking a talk to listen to, at least at a conference. However, I feel that to be worthwhile the information must provide some useful context for the presentation they are about to give.

Friday, 3 October 2014

Building the Pipeline - Process Led or Automation Led?

After being in the comfortable position of working on greenfield projects in more recent times I’ve taken the opportunity to work on an existing system for the forthcoming months. Along with getting to know some new people I also have to get to know someone else’s codebase, techniques, build process, deployment, etc. One of my initial surprises was around the build pipeline - it cannot be replicated (easily) by a developer.

In a situation like this where my preconceptions have been challenged, rather than throw my hands up in disgust and cry “what are you lot doing?”, I prefer to take it as an opportunity to question my own principles. I do not ever want to be one of those people that just says “we’ve always done it this way” - I want to continually question my own beliefs to ensure that I don’t forget why I’m doing them. One of these principles is that it should be possible for me to replicate any build and deployment on my own workstation. For me, being able to work entirely in isolation is the one sure way to know that anything that goes wrong is almost entirely of my own doing and not due to interference from external sources.

The One-Stop Shop - Visual Studio

I believe one reason you might end up in this state of affairs is that Visual Studio appears to be a one-stop shop for developing, building and deploying a solution. Many developers rely heavily on its features and never step outside it to understand how its use might fit into the bigger picture of a large system (I covered this a while back in “Don’t Let Your Tools Pwn You”). For me Visual Studio is mostly an integrated text editor and makefile manager - I can just as easily view and edit code in Notepad, Notepad++, TortoiseMerge, etc.

Given the way the team works and the demarcation between roles, it is easy to see how the development process is reflected in that practice, i.e. Conway’s Law. The developers write code and tests in Visual Studio and push to the repo. The build pipeline picks up the code, builds and packages it, deploys it and runs the tests. The second stage in the process is managed (as in, by a real person) by a build manager - a dedicated role I’ve never come across before. In every team I’ve worked in to date, both (much) bigger and smaller in size, it has been the developers who put together the entire build and deployment process.

Please note that I’m not suggesting someone who remains focused on the core development duties is somehow inferior to others who have more general roles. On the contrary, diversity in all its guises is a good thing for a team to have. Customers pay for the functionality, not the development process, and therefore if anything they generate more value than I do.

Process or Automation First?

I’ve really struggled to try and succinctly categorise these two approaches. The line between them seems to be down to whether you look to develop a process first that you then automate, or whether you start automating and then develop a process. I’ve only ever worked in the former way, essentially by building a set of scripts on my local workstation that carve out the process (See “Layered Builds” and “Wrapper Scripts”). I then get to debug the process as much as possible before attempting to automate it, by which time the only (hopefully minor) differences should be down to the environment. This also has the nice side-effect that pretty much the entire build process then lives within the repo itself and so is versioned with the code [1].

Although I don’t know for sure, what I suspect has happened here is that the project got started using Visual Studio, which keeps the developers busy. Then the process of creating a build pipeline starts by picking a CI technology, such as Jenkins or TeamCity, and stitching together the building blocks using the CI tool’s UI. Because the developers’ role stops at getting the acceptance tests passing, the process beyond that becomes someone else’s responsibility. I’m sure the developers helped debug the pipeline at some point, but I’d wager that it had to be done on the build server.

In the modern agile world, where we start with a walking skeleton, is it preferable to get the walking skeleton’s build automated first, or to get a solid, isolated development process going?

Build Differences 

The difference between these two approaches has been foremost in my mind today as I spent the afternoon trying to understand why the Debug and Release build configurations were different. I tried to create a simple script that would replicate what the build pipeline does and found that the Debug build worked fine locally, but the Release build failed. However, the converse was true on the TFS build farm. What this means is that the developers work solely with debug code and the build pipeline works solely with release code. While in practice this should not be too bothersome, it does mean that any problems which only show up once CI gets its hands on your code cannot easily be replicated locally.

The first problem I turned up straight away was that building on the command line via MSBuild was fine - which explained why the build machine also passed the build - whilst building through Visual Studio failed during compilation. It turned out that you had to build the solution twice to make Visual Studio happy. The reason no one else had noticed (or, more likely, had forgotten about the problem) was that they weren’t following the sort of practice I advocate in “Cleaning the Workspace”.

This turned out to be a simple missing NuGet package dependency. The problem this afternoon was much harder to diagnose because I knew nothing about TFS and its build farm. As with all brownfield projects, the person you really want to speak to left just before you started, and so I had to figure out for myself why the Wix setup project was using $(SolutionDir) to construct the binaries path for a debug build, but $(OutDir)_PublishedWebsites for the release build. After a little googling I stumbled across the blog post “Override the TFS Team Build OutDir property in TFS 2013”, which put me on the right track.

It seems that a common practice with TFS farm builds is to put all the binaries in one folder, which can be achieved by overriding the $(OutDir) variable on the MSBuild command line. This led to me modifying my script so that a debug build executes like this:

> msbuild.exe Solution.sln /v:m ^
  "/p:Configuration=Debug" "/p:Platform=Any CPU"

…whilst a release build would be this:

> msbuild.exe Solution.sln /v:m ^
  "/p:Configuration=Release" "/p:Platform=Any CPU" ^
  "/p:OutDir=%root%\Build\Release\"

Confidence Trick

Clearly the team has coped admirably without the kind of build script I’m trying to put together, so it’s hardly essential. However, I personally feel uncomfortable developing without such a tool available to quickly run through a build and deployment so that I can do some local system-level testing [2]. I like to refactor heavily, and to have confidence in the changes I’m making I need the tools readily available, otherwise I’m tempted not to bother.

Whether the pipeline is more maintainable by leveraging the automation tools to do more of the work remains to be seen. I’m certainly looking forward to seeing how this team structure plays out, and in the meantime I may learn to trust modern build tools a bit more and perhaps let go of one or two old-fashioned responsibilities in the process.

 

[1] When I first got to use Jenkins I wondered how easy it would be to keep the tool’s configuration in the VCS - it’s trivial. I wrote a simple script to use xcopy /s to copy the config.xml files from the Jenkins folder into a suitable folder in our repo and then check it in. Whilst this is not the entire Jenkins configuration it would be enough to help us get a replacement Jenkins instance up and running quickly, which is one of the reasons for doing it.

[2] Sadly the current setup relies on shared infrastructure, e.g. databases and message queues so there is still some work to do if total isolation is to be achieved.