Monday 22 August 2016

Sharing Code with Git Subtree

The codebase I currently work on is split into a number of repositories. For example, the infrastructure and deployment scripts live in separate repos, as does each service-style “component”.

Manual Syncing

To keep things moving along, the team decided that the handful of bits of code shared between the two services could easily be managed with a spot of manual copying. Keeping the shared code in a separate namespace also partitioned it off and made it apparent that this code was, at some point, going to be elevated to a more formal “shared” status.

This approach was clearly not sustainable but sufficed whilst the team was still working out what to build. Eventually we reached a point where we needed to bring the logging and monitoring code back in sync, and I also wanted to share some other useful code like an Optional<T> type. It had also become apparent that the shared code was missing quite a few unit tests.

Share Source or Binaries?

The gut reaction to such a problem in a language like C# would probably be to hive off the shared code into a separate repo and create another build pipeline for it that publishes a package via a NuGet feed. And that is certainly what we expected to do. However, the problem was where to publish the packages to, as this was closed source. The organisation had its own licence for an Enterprise-scale product, but it wasn’t initially reachable from outside the premises where our codebase lay. There were also some problems getting NuGet to publish to it with an API key, which seemed to lie with the way the product’s permissions were configured.

Hence to keep the ball rolling we decided to share the code at the source level by pulling the shared repo into each component’s solution. There are two common ways of doing this with Git – subtrees and submodules.

Git Submodules

It seemed logical that we should adopt the submodule approach as it felt easier to attach, update and detach later. It also appeared to have support in the Jenkins 1.x plugin for doing a recursive clone, so we wouldn’t have to frig it with some manual Git voodoo.

As always there is a difference between theory and practice. Whilst I suspect the submodule feature in the Jenkins plugin works great with publicly accessible open-source repos, it’s not quite up to scratch when it comes to private repos that require credentials. After much gnashing of teeth trying to convince the Jenkins plugin to recursively clone the submodules, we conceded defeat, assuming we were another victim of JENKINS-20941.

Git Subtree

Given that our long term goal was to move to publishing a NuGet feed we decided to try using a Git subtree instead so that we could at least move forward and share code. This turned out (initially) to be much simpler because for tooling like Jenkins it appears no different to a single repo.

Our source tree looked (unsurprisingly) like this:

<solution>
  +- src
     +- app
     +- shared-lib
        +- .csproj
        +- *.cs

All we needed to do was replace the shared-lib folder with the contents of the new Shared repository.

First we needed to set up a Git remote. Just as the remote main branch of a cloned repo goes by the name origin/master, so we set up a remote for the Shared repository’s main branch:

> git remote add shared https://github.com/org/Shared.git

Next we removed the old shared library folder:

> git rm -r src/shared-lib

…and grafted the new one in from the remote branch:

> git subtree add --prefix src/shared shared master --squash

This effectively takes the shared/master branch and links it further down the repo source tree to src/shared which is where we had it before.

However the organisation of the new Shared repo is not exactly the same as the old shared-lib project folder. A single child project usually sits in its own folder, but a full-on repo has its own src folder and build scripts, so the source tree now looked like this:

<solution>
  +- src
     +- app
     +- shared
        +- src
           +- shared-lib
              +- .csproj
              +- *.cs

There are now two extra levels of indirection: first the shared folder, which corresponds to the external repo, and then that repo’s own src folder.

At this point all that was left to do was to fix up the build, i.e. fix up the path to the shared-lib project in the Visual Studio solution file (.sln) and push the changes.
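For illustration, the project entry in the .sln just needed its relative path updating to account for those two extra levels (the project file name below is a made-up stand-in, as only the path matters):

  Before: "src\shared-lib\shared-lib.csproj"
  After:  "src\shared\src\shared-lib\shared-lib.csproj"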

We chose to use the --squash flag when creating the subtree as we weren’t interested in seeing the entire history of the shared library in the solution’s repository.

Updating the Subtree

Flowing changes from the parent repo down into the subtree of the child repo is as simple as a fetch & pull:

> git fetch shared master
> git subtree pull --prefix src/shared shared master --squash

The latter command is almost the same as the one we used earlier but we pull rather than add. Once again we’re squashing the entire history as we’re not interested in it.

Pushing Changes Back

Naturally you might want to make a change in the subtree in the context of the entire solution and then push it back up to the parent repo. This is doable but involves using git subtree push to normalise the change back into the folder structure of the parent repo.
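For our layout that would look something like the following, pushing the changes made under src/shared back up to a branch on the shared remote (the branch name here is purely illustrative):

> git subtree push --prefix src/shared shared shared-lib-fixes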

In practice we decided just to make the changes test-first in the parent and always flow them down to the child. In the few cases where the child solution helped with debugging, we worked on the fix in the child solution’s workspace and then simply copied the change over to the shared workspace by hand and pushed it out through the normal route. It’s by no means optimal, but a NuGet feed was always our end game so we tolerated the little bit of friction in the short term.

The End of the Road

If we were only sucking in libraries that had no external dependencies themselves (up to that point our small shared code only relied on the .Net BCL) we might have got away with this technique for longer. But in the end the need to pull in 3rd party dependencies via NuGet in the shared project pushed it over the edge.

The problem is that NuGet packages are installed on a per-solution basis and the <HintPath> element in the project file assumes a relative path (essentially) from the solution file. When working in the real repo as part of the shared solution it was “..\..\packages\Xxx”, but when it’s part of the subtree-based solution it needed to be two levels further up, i.e. “..\..\..\..\packages\Xxx”.
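To make that concrete, the assembly reference in the shared project needs something along these lines in each context (the package version and framework folder here are made up for illustration):

  <!-- In the Shared repo's own solution -->
  <HintPath>..\..\packages\Xxx.1.2.3\lib\net45\Xxx.dll</HintPath>

  <!-- In the subtree-based solution -->
  <HintPath>..\..\..\..\packages\Xxx.1.2.3\lib\net45\Xxx.dll</HintPath>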

Although I didn’t spend long looking, I couldn’t find a simple way to overcome this problem, so we decided it was time to bite the bullet and fix the real issue, which was publishing the shared library as a NuGet package.

Partial Success

This clearly is not anything like what you’d call an extensive use of git subtree to share code, but it certainly gave me a feel for what it can do, and I think it was relatively painless. What caused us to abandon it was tooling specific (the relationship between the enclosing solution’s NuGet packages folder and the shared assembly project itself), so a different toolchain may well fare much better if build configuration is only passed down from parent to subtree.

I suspect the main force that might deter you from this technique is how much you know, or feel you need to know, about how git works. When you’re inside a tool like Visual Studio it’s very easy to make a change in the subtree folder and check it in and not necessarily realise you’re modifying what is essentially read-only code. When you next update the subtree things get sticky. Hence you really need to be diligent about your changes and pay extra attention when you commit to ensure you don’t accidentally include edits within the subtree (if you’re not planning on pushing back that way). Depending on how experienced your team are this kind of tip-toeing around the codebase might be just one more thing you’re not willing to take on.
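One small habit that helps, assuming the subtree lives under src/shared as above, is to explicitly check that path before committing so that any accidental edits to the “read-only” code show up:

> git status -- src/shared
> git diff --stat -- src/shared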

Manually Forking Chunks of Open Source Code

Consuming open source projects is generally easy when you are just taking a package that pulls source or binaries into your code “as is”. However, on occasion we might find ourselves needing to customise part of it, or even borrow and adapt some of its code, to either work around a bug or implement our own feature.

If you’re forking the entire repo and building it yourself then you are generally going to play by their rules as you’re aware that you’re playing in somebody else’s house. But when you clone just a small part of their code to create your own version then it might not seem like you have to continue honouring their style and choices, but you probably should. At least, if you want to take advantage of upstream fixes and improvements you should. If you’re just going to rip out the underlying logic it doesn’t really matter, but if what you’re doing is more like tweaking then a more surgical approach should be considered instead.

Log4Net Rolling File Appender

The driver for this post was having to take over maintenance of a codebase that used the Log4Net logging framework. The service’s shared libraries included a customised Log4Net appender that took the basic rolling file appender and then tweaked some of the date/time handling code so that it could support the finer-grained rolling log file behaviour they needed. This included keeping the original file extension and rolling more frequently than a day. They had also added some extra logic to support compressing the log files in the background too.

When I joined the team the Log4Net project had moved on quite a bit and when I discovered the customised appender I thought I’d better check that it was still going to work when we upgraded to a more recent version. Naturally this involved diffing our customised version against the current Log4Net file appender.

However to easily merge in any changes from the Log4Net codebase I would need to do a three-way diff. I needed the common ancestor version, my version and their version. Whilst I could fall back to a two-way diff (latest of theirs and mine) there were lots of overlapping changes around the date/time arithmetic which I suspected were noise as the Log4Net version appeared to now have what we needed.
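Given the three versions, one way to perform the actual three-way merge is with git merge-file; the filenames below are hypothetical, with ours.cs being our customised appender, ancestor.cs the common ancestor and theirs.cs the latest Log4Net version (the merged result, conflict markers and all, is written into ours.cs):

> git merge-file ours.cs ancestor.cs theirs.cs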

The first problem was working out what the common ancestor was. Going back through the history of our version I could see that the first version checked in was already a highly modded one. They also appeared to have applied some of the ReSharper style refactorings, which added a bit of extra noise into the mix.

What I had hoped they would have done was to start by checking in the exact version of the code they took from Log4Net, recording the Subversion revision number in the commit message so that I could see exactly which version they had forked. After a few careful manual comparisons, and some logic applied to commit timestamps, I pinned down what I believed was the original version.

From here I could then trace both sets of commit logs and work out what features had been added in the Log4Net side and what I then needed to pull over from our side which turned out to be very little in the end. The hardest part was working out if the two changes around the date rolling arithmetic were logically the same as I had no tests to back up the changes on our side.

In the end I took the latest version of the code from the Log4Net codebase and manually folded in the compression changes to restore parity. Personally I didn’t like the way the compression behaviour was hacked in [1] but I wanted to get back to working code first and then refactor later. I tried to add some integration tests too at the same time but they have to be run separately as the granularity of the rollover was per-minute as a best case [2].

Although the baseline Log4Net code didn’t match our coding style I felt it was more important to be able to rebase our changes over any new Log4Net version than to meet our coding guidelines. Naturally I made sure to include the relevant Log4Net Subversion revision numbers in my commits to make it clear what version provided the new baseline so that a future maintainer has a clear reference point to work from.

In short, if you are going to base some of your own code very closely on some open source stuff (or even internal shared code), make sure you’ve got the relevant commit details for the baseline version in your commit history. Also try to avoid changing too much unnecessarily in your forked version, to make it easier to pull and rebase underlying changes in the future.
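As a purely illustrative example (the revision number is a placeholder, not a real one), a baseline commit might be recorded like this:

  Import RollingFileAppender.cs from the log4net trunk (SVN rNNNNN) as the baseline for our customisations.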

 

[1] What worried me was the potential “hidden” performance spikes that the compression could put on the owning process. I would prefer the log file compression to be a background activity that happens in slow time and is attributable to an entirely separate process that didn’t have tight per-request SLAs to meet.

[2] I doubt there is much call for log files that roll every millisecond :o).

Monday 8 August 2016

Estimating is Liberating

Right now after having just read the title you’re probably thinking I’ve gone mad. I mean, why would a developer actively promote the use of estimation? Surely I should be joining the ranks of the #NoEstimates crowd and advocate the abolishment of estimating as a technique?

Well, yes, and no. As plenty of other much wiser people than me have already pointed out, the #NoEstimates movement, like the #NoSql one before it, is not the black-and-white issue it first appears. However this blog post isn’t really about whether or not I think you should create estimates per-se but about some of the effects I’ve observed from teams performing it. Whether the means justifies the end (even in the short term) is for you to decide.

Establishing Trust

The first time I realised there was more to estimating than simply coming up with a figure that describes how long some piece of work will likely take was almost a decade ago. I was working as “just another developer” on a greenfield project that was using something approaching DSDM as a project management style.

We had started off really well delivering a walking skeleton and fleshing it out bit-by-bit based on what the project manager and team thought was a suitable order. There were many technical risks to overcome, not least due to the cancellation of some services we were dependent on.

After what seemed like a flying start things slowly went downhill. Six months in, the project manager (PM) and I had a quiet word [1] as he was concerned things seemed to be taking so much longer than earlier in the project. I suggested that we consider going back to estimating our work. What I had noticed was that the problems we were encountering were really just delays caused by not thinking the work through properly. Hence every day the work would be finished “tomorrow”, which naturally caused the project manager to start losing faith in his team.

By forcing ourselves to have to come up with an idea of how long we would need to work on a feature we started breaking it down into much smaller chunks. Not only did this mean that we thought through more clearly what issues we might need to tackle but it also allowed us to trim any obvious fat and work in parallel where possible.

The result was an increase in the project manager’s trust in the team, and by extension the customer’s too, which meant the PM had fewer “awkward” conversations [2].

Knowledge Sharing

The next moment where I began to see the positive effects of estimation was when joining a team that adopted Planning Poker as a way of estimating the work for a sprint.

In the (not so) good old days, work was assigned to individuals and they were responsible for estimating and performing that work largely in isolation. Of course many of us would prefer to seek advice from others, but you were still essentially seen as responsible for it. As a corollary to that the number of people in the team who knew what was being worked on was therefore small. Even if you did have a whiff of what was happening you probably knew very little about the details unless you stuck your nose in [3].

This team worked in a similar fashion, but by opening up the planning session to the whole team everyone now had a better idea of what was going on. So, even if they weren’t actively going to be working on a feature they still had some input into it.

What the planning poker session did was bring everyone together so that they all felt included and therefore informed. Additionally, by actively canvassing their opinion for an estimate on each and every feature, their views were being taken into consideration. Providing an overly small or large estimate gave them a sure-fire means of having their opinion heard, because the conversation tends to focus on understanding the outliers rather than the general consensus. It also allowed members of the team to proactively request involvement in something rather than finding out later it had already been given to someone else.

I think the team started to behave more like a collective and less like a bunch of individuals after adopting this practice.

Transparency

More recently I was involved in some consultancy at a company where I got to be a pure observer in someone else’s agile transformation. The units of work they were scheduling tended to be measured in weeks-to-months rather than hours-to-days.

I observed one planning meeting where someone had a task that was estimated at 4 weeks. I didn’t really know anything about their particular system but the task sounded pretty familiar and I was surprised that it was anything more than a week, especially as the developer was fairly experienced and it sounded like it was similar to work they had done before.

I later asked them to explain how they came up with their estimate and it transpired that buried inside were huge contingencies. In fact a key part of the task involved understanding how an API that no one had used before worked. In reality there was a known part for which a reasonably sound estimate could be given, but also a large part of it was unknown. Like many organisations it never acknowledged that aspects of software development are often unknown and, when faced with something we’ve never done before, we are still expected to be able to say how long it will take.

Consequently I got them to break the task down into smaller parts and present those estimates instead. Most notable was an upfront piece around understanding the 3rd party API – a technical spike. This very consciously did not have an estimate attached and it allowed us to explain to the stakeholders what spikes are and how & when to use them to explore the unknown.

This openness with the business made both them and the delivery team more comfortable. The business were now more in the loop about the bigger risks and could also see how they were being handled. Consequently they also now had the ability to learn cheaply (fail faster) by keeping the unknown work more tightly under control and avoid unexpected spiralling costs or delays.

The benefit for the delivery team was the recognition from the business that there is stuff we just don’t know how to do. For us this is hugely liberating because we can now lay our cards firmly on the table instead of hiding behind them. Instead of worrying about how much to pad our estimate to ensure we have enough contingency to cover all the stuff we definitely don’t know about, we can instead split off that work and play it out up front as a time-boxed exercise [4]. Instead of being sorry that we are going to be late, again, we have the opportunity to be praised for saving the business money.

Training Wheels

What all of these tales have in common is that the end product – the actual estimate – is of little importance to the team. The whole #NoEstimates movement has plenty to say on whether estimates are useful or not in the end, but the by-product of the process of estimating certainly has some use as a teaching aid.

A mature (agile) team will already be able to break work down into smaller chunks, analyse the risks and prioritise it so that the most valuable is done first (or risks reduced). But an inexperienced team that has had little direct contact with its stakeholders may choose to go through this process as a way of gaining trust with the business.

In the beginning both sides may be at odds, each believing that the other doesn’t really care about what is important to them. Estimation can be used as a technique that allows the technical side to “show its workings” to the other side, just as an exam student proves to the examiner that they didn’t just stumble upon the answer through luck.

As trust eventually grows and the joint understandings of “value” take shape, along with a display of continuous delivery of the business’s (ever changing) preferred features, the task of estimation falls away to leave those useful practices which always underpinned it. At this point the training wheels have come off and the team feels liberated from the tyranny of arbitrary deadlines.

 

[1] This is how process improvement used to take place (if at all) before retrospectives came along.

[2] Without direct involvement from the customer all communication was channelled through the project manager. Whilst well-meaning (to protect the team) this created more problems than it solved.

[3] I guess I’m a “busy-body” because I have always enjoyed sticking my nose in and finding out what others are up to, mostly to see if they’d like my help.

[4] The common alternative to missing your deadline and being held to it is to work longer hours and consequently lose morale. Either way the business eventually loses, not that they will always realise that.