Monday 26 January 2015

IIS Hosted Web API Not Caching Internal HTTP Responses

Executive Summary: Set “Load Profile” to true for the IIS application pool.

I had a fun afternoon last week trying to help a colleague work out why the responses to the internal HTTP requests from one of my team’s web APIs weren’t being cached by the .Net HTTP component. It appeared that this was all working fine in the pre-production environment, but not production itself.

Caching HTTP Responses

The code (which I’m not overly familiar with) uses one of the standard .Net HTTP request classes (WebRequest I believe) to make authentication calls to another internal, RESTful service. It was anticipated by the owner of this service that responses could be cached by the .Net framework thereby alleviating the need for our own service to manually implement a caching layer. Naturally the point of (briefly) caching the HTTP responses was to reduce load on the underlying authentication servers and they had made sure that their response headers were correctly configured to make this “all just work”.

In theory this was true, but in practice it turned out to be a little more complicated once the code as running under the real IIS worker process (w3wp.exe).

Administrator Rights

My colleague had already done a fair bit of analysis and experimentation with a spare production server and the wonderful Process Monitor tool from Sysinternals. By watching the w3wp.exe process whilst he sent requests with CURL he could see that the process was getting Access Denied errors when accessing parts of the Registry and file-system.

He had also tried configuring the IIS application pool to run under another non-service account, i.e. a support user’s account, and that appeared to work. Comparing the behaviours with Process Monitor for the two accounts he could verify that the problem was one of permissions, or so it seemed.

A natural response when seeing Access Denied problems is to grant the underprivileged account admin rights and that’s what he did and lo-and-behold things started working. However further experimentation suggested that this didn’t always fix the problem, which was mightily confusing.

At this point I joined the party...

Verifying Admin Rights

Past experience has taught me that just adding an account to the Administrators group is not always enough - it’s best to actually check the security attributes of the process in question to double-check that it really has become elevated. Conversely when removing rights I make the same check to ensure they’ve been revoked correctly.

For this check I used another of Sysinternals tools - Process Explorer. We hit an initial “Access Denied” for some process information [1] so I used the technique of starting it with “psexec -s -i” which was probably a bit of overkill (See “Process Explorer Failed to Display DLLs or Handles” for when this might be necessary). After right-clicking on w3wp.exe to show the process’s Properties and then switching to the Security tab I could see that the process was indeed a member of BUILTIN\Administrators.

It later transpired that the w3wp.exe process not correctly acquiring or relinquishing its power might have been the source of some of the earlier confusion. I too witnessed the removal of the account from the local Administrators group and after recycling the web site application pool found the w3wp.exe process still appeared to be a member. The process start time strongly suggested it had been restarted too when we recycled the app pool.

The Default Profile

We then switched back to Process Monitor and looked at the Registry and file-system access. The HTTP client was accessing various “Internet Settings” registry keys under the HKEY_USERS tree - HKU\.Default in particular. It was then reading and writing to an internet cache that was stored under “C:\Windows\Temp\Temporary Internet Files\...”.

I had always suspected that user profiles might have a role to play in this problem because I remembered back in the dim-and-distant past that the native WinInet library couldn’t (or perhaps shouldn’t) be used for making HTTP calls when running as a service. Microsoft later added another native component called WinHttp that could be used by services. IIRC the main problem back then was around configuring proxy settings, and whilst we weren’t suffering from that specific ailment this time, the registry keys it appeared to be accessing was pointing in the same direction.

We then revoked the admin rights from the service account, checked they had been revoked with Process Explorer, and then executed the test HTTP requests again so I could see whether the same registry and file-system access patterns occurred, and they did. A brief play with the REG command-line tool and DIR helped prove that, although the keys and folders existed, the unprivileged service account wouldn’t have access to them.

LoadProfile=true

This suggested to me that the problem was likely down to the fact that the service account was being used by the w3wp.exe process without a user profile being loaded. As such, due to the lack of a profile, it was trying to do the same thing it would with just a default profile on hand, but it didn’t have the necessary rights to update the cache in the “C:\Windows\Temp\Temporary Internet Files” folders.

Changing the Application Pool setting for “Load Profile” to “true” seemed to fix the problem. We then applied this to another spare production server (after first proving that it was not behaving correctly) and that worked too.

Admin Rights Not Required

I think this tale highlights why the answer to many deployment issues can be solved by apparently making the service account a local administrator. The huge gain in power that right confers allows the process to read, and more importantly write, to so much of the system. Even if the correct answer is an entirely different setting (as in this case) taking the sledgehammer approach can get you running, but at a potentially disastrous cost [2].

Anyone doing service style deployments on Windows should know how to weld tools like NTRIGHTS and CACLS (or their more modern equivalents) to grant just the right amount of permissions for their processes.

 

[1] I’ve spent so many years supporting pre-Windows Vista systems that I still forget occasionally that I need to elevate admin tools. I usually start with a non-elevated command prompt for safety and then wonder later why a tool doesn’t work properly because I’ve forgotten that I’ve gone for safety first. Which of course is the whole point :-).

[2] If the service is compromised it might grant an intruder further access. Even if it’s an internal service putting it in a sandbox helps ensure it does what you think it should.

Thursday 22 January 2015

Configuration Can Be Simpler

Every time a developer creates an XML based IoC container config file I’m sure not just one, but an entire litter of kittens die. Not just them but any build managers [1] and support staff too. I’ll admit that I have yet to work on a project where I’ve seen any clear reason to even use an IoC container. I’m sure they exist, it’s just where I’ve seen them used it’s as a workaround for some other problem, e.g. a weak build and deployment pipeline [2].

Configuration Files are a User Interface

In “Diagnostic & Support User Interfaces” and “Testing Drives the Need for Flexible Configuration” I tried to show how system configuration files are a form of Human Computer Interface (HCI). This is a view that is also promoted in Michael Nygard’s wonderful book “Release It! Design and Deploy Production-Ready Software”.

If that is the case, then files with hundreds or thousands of lines of XML is not my idea of a pleasant UI to work with. One knee-jerk reaction might be to manage the complexity by trying to use an extra level of indirection and generate the XML config file(s) from a much simpler DSL written with, say, a scripting language. To me this just appears to be an attempt to overcome the XML “angle-bracket tax” instead of getting to the heart of the problem, which is using an overly-flexible and complex tool to solve what might be a much simpler problem.

Example: Configuring a Chain of Responsibility

To try and explain my position further I’m going to use a real-world example that has come up. Within a web API I worked on there was a component that handles the authentication aspects of an internal system. This component used the Chain of Responsibility design pattern to achieve its goals so that a client request could be fired into the head of the chain and somewhere along it one of the providers would answer the question of who the principle was, even if the eventual answer was “unknown”.

The section of an IoC container XML config file needed to describe this chain was not pretty - it was littered with fully qualified types names (assembly, namespace and class/interface). Whilst I grant you that it is possible to configure a Directed Acyclic Graph (DAG) of objects using the tool and its config file I would suggest that it is probably not desirable to. The example - a chain of responsibility - does not call for that amount of power (an arbitrary DAG) and I think we can do something much simpler and with very little coding effort.

Key/Value Configuration

I’ve always found the old fashioned .ini file style of configuration files one of the most easy to read and maintain (from a multi-environment perspective). One of the reasons for this I believe is that the configuration is generally flattened to a simple set of key/value pairs. Having one entry per line makes them really easy to process with traditional tools and the constraint of one-setting-per-line forces you to think a bit harder about how best to break-down and represent each “setting”. If the configuration value cannot be distilled down to a simple value then perhaps a richer structure might be needed (indicated with a simple setting like a filename), but just for that one setting, it does not invite unnecessary complexity elsewhere.

Consequently I would look to try and express the structure of our example “chain” in a single key/value entry, albeit in this case as a .Net style “App Setting”. This is how I would express it:

<appSettings>
  . . .
  <add key=“Security:AuthenticationChain” 
       value=“Token,IP,Default” />
  <!-- component specific settings below it -->
  . . .
</appSettings>

You’ll immediately notice that the name of the setting describes the structure we are trying to create - a chain - you don’t have to infer it from the structure of the XML or read an XML comment. You’ll also see that we haven’t included any implementation details in the setting’s value - each term is a shorthand (an abstraction if you will) for the component that rests in that position within the chain. This makes code refactoring a breeze as no implementation details have leaked outside the code and could therefore be hidden from a refactoring tool.

The Factory Method

If you’re thinking that all the complexity can’t have vanished, then you’d be right. But much of the accidental complexity has gone away simply by choosing to use a simpler and more restrictive representation. Parsing this representation only requires a handful of lines of code:

public static class Factory
{
  public static IProvider CreateChain(IConfiguration configuration)
  {
    var key = “Security:AuthenticationChain”;
    var setting = configuration[key];
    var chain = setting.Split(‘,’).Reverse();

    IProvider root = null;
    foreach (var type in chain)
      root = CreateProvider(type, configuration, root);

    return root;
  }

  private static IProvider CreateProvider(string type, IConfiguration configuration, IProvider nextProvider)
  {
    switch (type)
    {
      case “Token”:
        return new TokenProvider(..., nextProvider);
      case “IP”:
        return new IpProvider(..., nextProvider);
      case “Default”:
        return new DefaultProvider(..., nextProvider);
    }

    throw new ArgumentException(“Unknown provider type ‘{0}’”, type);
  }
}

If all the dependencies in this chain are purely functional (and they almost certainly should be for a Chain of Responsibility) then the factory and chain can be easily tested using component-level tests [3]. The only mock it requires is for the configuration object which is trivial to knock-up, even as a manual mock.

What About Adding New Providers?

One of the allures of IoC containers is the apparent ease with which you can “re-configure” your system and add support for a new component. Whilst that may be true for the deployment configuration step, I know that the development and testing cost of any new component is easily going to dwarf the miniscule amount of change required to the Factory Method shown above to support such a change. Unless the new component is going to be implemented in an entirely new assembly and there is a strict requirement not to touch the existing codebase, then you’ve already got a code change and re-deployment on your hands anyway.

Configuration Validation

One of the things in particular I like about manual factory methods is that they provide a suitable place to put validation code for your configuration. For example, in our authentication chain the final provider is the default one. This provider is “a provider of last resorts”, no matter what happens before it, it can always answer our query and return an answer, even if that answer is “no-one knows who they are”. The fact that this provider should always appear at the end of the chain means that there is the possibility to mis-configure the system by accident. Imagine what would happen if it was the first in the chain - every other provider would essentially be ignored.

One approach to this problem is to punt and let the documentation (if any exists) provide a warning; if the sysadmins then mess it up that’s their problem. Personally I’d like to give them a fighting chance, especially when I know that I can add some trivial validation (and even unit test its behaviour):

case “Default”:
{
  if (nextProvider != null)
  {
    throw new ConfigurationException(“The ‘Default’ provider can only be used at the end of the chain”);
  }
  return new DefaultProvider(...);
}

If you look closely you’ll see that I’ve also removed the final argument from the call to the DefaultProvider constructor. The great thing about expressing your design in code (for a statically compiled language [4]) is that once I decided to remove the final parameter from the DefaultProvider class ctor I could lean on the compiler and fix-up the call sites as appropriate. If one of the call sites wouldn’t have been usable this way I’d have spotted it immediately rather than having to debug a curious runtime issue some time later.

Test-Driven Design

I’ve heard it said that TDD done properly leads you naturally towards using an IoC container. Frankly I’m bemused by that statement because when I do TDD it leads me exactly towards what I describe above. Frankly I genuinely cannot see how the simplest step can be to go from a hardcoded implementation (i.e. the DefaultProvider in our example) to using a highly-configurable tool to define an arbitrary DAG - there is always something much simpler in between.

Even if some of the steps below are not actually coded up and just become a thought exercise, I’d still not end up reaching for the big guns  by the end:

  1. Identify a need to do authentication
  2. Create the default provider with TDD
  3. Hard-code it as the only implementation
  4. Identify a need for other authentication mechanisms
  5. Create them with TDD
  6. Hard-code the initial chain
  7. Identify a need to make the chain order configurable
  8. Introduce a factory (method) to house the creation logic, using TDD

As I said right back at the beginning, with a fluid development process and one eye on good design principles, the term “hard-coded” no longer remains quite such the pejorative it once was.

 

[1] Not a role I had come across until recently as the build pipeline and deployment process has always been maintained by the developers.

[2] When you can make a change in isolation, build your artefacts, run an extensive test suite and deploy changes to production quickly, many needs for components to be configurable through config files just evaporate.

[3] The difference between unit-level tests and component-level tests is really just the size of your “unit”. I chose not to say “unit tests” to make it clear that I am aware there would be more than one class under test in certain scenarios.

[4] Or a non-statically compiled language with some form of type annotations.

Friday 16 January 2015

Terse Exception Messages - Part II

Quite unintentionally it looks like my post back in October last year “Terse Exception Messages” needs a second part. The initial post was about dealing with terse error messages generated by external code, however I’d forgot that terse exception messages are not just the domain of 3rd party libraries, they can appear in “our code” too. Take this example from some real world production code:

throw new NotFoundException(“Invalid ID”);

Sadly this will likely appear in any diagnostic log file with only the usual thread context information to back it up, e.g.

<date, time, thread, etc.> ERROR: Invalid ID

If you also have a stack trace to work from, that could give you a clue as to what type the ID refers to. You might also be able to hunt around and find an earlier diagnostic message that could give you the value too, but you’re probably going to have work for it.

Clearer Exception Messages

At a bare minimum I would also include the “logical” type and, if possible (and it usually is), then the offending value too. As such what I would have thrown is something more like this:

throw new NotFoundException(“Invalid customer ID ‘{0}’”, id);

This message contains the two key pieces of information I know are available at the throw site and so would generate a more support-friendly message like this:

<date, time, thread, etc.> ERROR: Invalid customer ID ‘1234-5678’

Discerning Empty Values

You’ll notice that the value, which in this case was a string value, is surrounded by quotes. This is important in a message where it might not be obvious that the value is empty. For example if I had not added the quotes and the customer ID was blank (a good indication of what the problem might be) then it would have looked thus:

ERROR: Invalid customer ID

Unless you know intimately what all messages look like, you could easily be mistaken for believing that the error message contains little more than our first example, when in fact the value is there too, albeit invisibly. In the improved case you would see this:

ERROR: Invalid customer ID ‘’

For strings there is another possible source of problems that can be hard to spot when there are no delimiters around the value, and that is leading and trailing whitespace. Even when using a fixed width font for viewing log files it can be tricky to spot a single erroneous extra space - adding the quotes makes that a little easier to see:

ERROR: Invalid customer ID ‘1234-5678 ’

As an aside my personal preference for using single quotes around the value probably stems from many years working with SQL. I’ve lifted values out of log files only to paste them into a variety of SQL queries countless times and so automatically including the quotes seemed a natural thing to do.

Discerning Incorrect Values

It’s probably obvious but another reason to include the value is to discern the difference between a value that is badly formed, and could therefore never work (i.e. a logic error), and one that is only temporarily unusable (i.e. a runtime error). For example, if I saw the following error I would initially assume that the client has passed the wrong value in the wrong argument or that there is an internal logic bug which has caused the value to become “corrupted”:

ERROR: Invalid customer ID ‘10p off baked beans’

If I wanted to get a little more nit-picky about this error message I would stay clear of the word “invalid” as it is a somewhat overloaded term, much like “error”. Unless you have no validation at all for a type, there are usually two kinds of errors you could generate - one during parsing and another during range validation. This is most commonly seen with datetimes where you might only support ISO-8601 formatted values which, once parsed, could be checked to ensure an end date does not precede a start date.

To me the kind of error below would still be a terse message; better than the original one, but could be even better.

ERROR: Invalid datetime ‘01/01/2001 10:12 am’

My preferred term for values that fail “structural validation” was adopted from XML parsing - malformed. A value that parses correctly is considered “well formed” whereas the converse would be “malformed”. Hence I would have thrown:

ERROR: Malformed datetime ‘01/01/2001 10:12 am’

For types like a datetime where there are many conventions I would also try and include the common name for the format or perhaps a template if the format is dynamically configurable:

ERROR: Malformed datetime ‘01/01/2001 10:12 am’ (Only ISO-8601 supported)

ERROR: Malformed datetime ‘01/01/2001 10:12 am’ (Must conform to ‘YYYY-MM-DD HH:MM:SS’)

Once a value has passed structural validation if it then fails “semantic validation” I would report that using an appropriate term, such as “out of range” or explicitly describe the comparison. With semantic validation you often know more about the role the value is playing (e.g. it’s a ‘start’ or ‘end’ date) and so you can be more explicit in your message about what the problem is:

ERROR: The end date ‘2010-01-01’ precedes the start date ‘2014-02-03’

Custom Parsing Methods

As we saw back in my original post the language framework parsing methods are often horribly terse, even when failing to parse a simple value like an integer. The reason is that, like all general purpose functions, there is a balance between performance, security, etc. and so that means including the value in the exception might be undesirable in some classes of application.

However in the systems on which I work this is not a problem and I’d happily trade-off better diagnostics for whatever little extra garbage this might produce, especially given that it’s only in (hopefully) exceptional code paths.

My solution to this problem is the proverbial “extra level of indirection” and so I wrap the underlying TryParse() framework methods with something that produces a more satisfying diagnostic. In C# extension methods are a wonderful mechanism for doing that. For example I’ve wrapped the standard configuration mechanism with little methods that read a string value, try to parse it, and if fails then include the setting name and value in the message. Consequently instead of the usual terse FormatException affair you get when parsing an int with code like this:

int.Parse(settings[“timeoutInMs”]);

You invoke an extension method like so:

settings.ReadInt(“timeoutInMs”);

If this fails you’ll get a much more pleasing error message along these lines:

ERROR: The ‘timeoutInMs’ configuration setting value ‘3.1415’ was not a well formed integer value

Formatting Exceptions in C#

Whenever I create a custom exception type in C# I generally add a variety of overloads so that you can create one directly with a message format and selection of arguments without forcing the caller to use String.Format(). This is how the earlier example worked:

throw new NotFoundException(“Invalid customer ID ‘{0}’”, id);

All the class needs for this kind of use is a signature akin to String.Format’s:

public NotFoundException(string format,
                         params object[] args)
    : base(String.Format(format, args))
{ }

However this is not enough by itself, it’s dangerous. If we pass a raw message that happens to contain formatting instructions but no arguments (e.g. “Invalid list {0, 1, 2}”) it will throw during the internal call to String.Format(), so we need to add a simple string overload as well:

public NotFoundException(string message)
    : base(message)
{ }

As an aside I never add a public default constructor because I’m not convinced you should be allowed to throw an exception without providing some further information.

Due to the framework classes not having a varargs style ctor you might already be happy putting formatting calls in the callee, either with String.Format() directly or via a simple extension method like FormatWith():

throw new NotFoundException(String.Format(“Invalid customer ID ‘{0}’”, id));

throw new NotFoundException(“Invalid customer ID ‘{0}’”.FormatWith(id));

One scenario where the varargs style falls down in C# is when you start adding overloads to capture any inner exception. You can’t just add it on the end of the signature due to the final params-based args parameter, so you have to add it to the beginning instead:

public NotFoundException(Exception inner,
                         string format,
                         params object[] args)
    : base(String.Format(format, args), inner)
{ }

Sadly, this is the exact opposite of what the framework does, it adds them on the end.

In my experience very little effort is put in by developers to simulate faults and validate that exception handling is behaviour correctly, e.g. verifying what is reported and how. Consequently this makes the varargs overload potentially dangerous as adding an exception argument on the end by following the framework style would cause it to be silently swallowed as no placeholder would reference it. This has never been a problem for me in practice because I always throw custom types and always try and format a string, but that’s easy when you’re the one who discovers and instigates the pattern - others are more likely to follow the pattern they’re used to, which is probably how the framework does it.

Testing Error Messages

At the tail end of the previous section I touched briefly on what I think is at the heart of why (server-side) error messages are often so poor - a lack of testing to make sure that they satisfy the main “requirement”, which is to be helpful in diagnosing a problem. When writing unit tests for error scenarios it’s all too common to see something like this:

Assert.That(() => o.DoIt(. . .), Throws.Exception);

This verifies and documents very little about what will is expected to happen. In fact if “o” is a null reference the test will still pass and not even invoke the behaviour we want to exercise! Our expectations should be higher; but of course at the same time we don’t want to over specify and make the test brittle.

If a caller is going to attempt recovery of the failure then they need to know what the type of the exception is, lest they be forced to resort to “message scraping” to infer its type. Consequentially it’s beneficial to verify that the exception is of at least a certain type (i.e. it may also be a derived type). Where an obvious value, such as an input argument, is a factor in the error we can loosely test that the value is contained within the message somewhere too:

const string id = “malformed-value”;

Assert.That(() => o.DoIt(id),
            Throws.InstanceOf<ValidationException>()
                  .And.Message.Contains(is));

If there are many arguments to be validated it is important to try and match on the message too, or some other attribute, to discern which of the many arguments caused the exception to be thrown. It’s all to easy when making a change to a method that you end up finding one or more of your tests are passing because they are detecting a different error to the one they were intended to, as the null reference example above highlighted.

Whilst that takes care of our code-based contractual obligations we also have an obligation to those who will be consuming these messages to make sure they form a useful narrative within any diagnostic output. I tackled this side of the issue some years ago in “Diagnostic & Support User Interfaces” and so if you’ve experienced a sense of déjà vu reading this post then that may be a contributing factor.

Tuesday 13 January 2015

Text Editor Outlining Modes Hide Complexity

I have this theory, but I don’t have a statistically large enough data set [1] with which to publish anything other than a random musing. My theory is that developers who use a text editor with its “Outlining Mode” [2] switched on have a tendency to write more “complex” code.

Of course tarring every developer with the same brush is not exactly endearing myself to my readership and so I fully expect there to be many who use this feature with great aplomb and are aware of its apparent shortcomings. However for those who are unsure as to why I’m proposing this theory let me explain what it is that I’ve seen when working with developers who have this feature switched on.

Duplicate Code

The most obvious thing I’ve noticed is that developers who don’t see all the code on screen at once don’t notice all the duplication. If I’ve spotted a problem in a method then, as a natural by-product of seeing all the code for the class, I’ve also noticed the same pattern appears in a number of other places. In contrast when pairing with someone who can only see the method names, except for the one being currently edited, they do not automatically notice and therefore go looking for any duplication.

Of course the real problem is the duplication itself. I suspect that because I always see all the code I tend to see duplication as it builds up and so will naturally refactor it as soon as the test goes green. When all you see is method names I’m not sure how you are expecting to notice duplication unless you literally cut-and-pasted some other code. At least with code literally cut-and-pasted it looks identical, but when you’ve written code that only appears logically similar to that in another method it can be harder to reason away the duplication and do the necessary refactoring.

Accidental Complexity

In a more general sense what appears to be happening is that, paradoxically, the tool that tries to reduce the noise and clutter whilst editing code has succeeded in creating further complexity by constantly hiding it. If you do not “feel the pain” caused by working with overly complex code you do not feel the need to refactor it, until it becomes a problem, by which point you’ve probably already lost the battle. By having all the code in a source file visible you get to see the big picture so that even if you’re making edits to one method your subconscious can still go to work on the surrounding code and nudge you when it thinks something smelly is appearing.

Debugging is another opportunity where you get to take a walk through code you’ve perhaps not worked on yourself. In effect you have the opportunity to review that code as you’re debugging it, not formally, but you can certainly makes notes and raises any issues later (See “In The Toolbox - Pen & Paper”).

C# Regions

The use of regions in C# is another variation on this problem and it becomes yet another switch to turn off in the IDE to ensure all code is visible all the time. In theory being able to hide auto-generated code from developers when editing is a venerable goal, but the way I’ve seen it being used is to hide big chunks of code which just appears to act as a catalyst for the creation of behemoth source files.

Plausible?

Does this “theory” hold water? When I tweeted about it there were a few murmurs that at least supported vague anecdotal evidence and nobody came straight back and cited a paper, so I don’t think it’s already been debunked. I was pretty sure it’s old news too and amusingly I did get a reply from Len Holgate with a link to a blog post he had written almost exactly 10 years earlier that touched on part of this topic “VS.Net #Region”.

Perhaps a passer-by will leave a comment in future that will provide evidence of a more scientific study of this feature. Or alternatively someone else will point out how the tools being used incorrectly so that I don’t feel the need to turn it off the moment I install Visual Studio.

 

[1] You can read that as “a handful of cases”.

[2] That’s the name used in Visual Studio, but I have also heard the term “folding editor” used for the same feature.

Friday 9 January 2015

Git-TFS Pull Failure - error running command: git rebase --preserve-merges

I’ve been using git-tfs in anger now for the past 6 months and there was an occasional problem that cropped up that had me scratching my head a couple of times. Somehow I managed to fix it the first time by fumbling around; the second time it happened I decided it was time to fix it “scientifically” and then write it up for when it happens again...

TFS Gated Builds

I’m working in a team that has separate TFS projects, some of which I can push commits to directly with git-tfs rcheckin, and others that go through a TFS gated build. The latter requires use of git-tfs checkintool which pops up the standard TFS check-in dialog, but only after squashing any un-pushed commits in the Git repo into a TFS shelveset. It’s this mismatch between the Git master branch, which has separate commits, and the single squashed commit that comes back down on the TFS remote master branch, that causes the error.

As an aside, if I’m working on a repo that I can’t use git-tfs rcheckin with to keep my Git history intact then I manually push after every Git commit to avoid the squashing behaviour of git-tfs checkintool. As such the only times I’ve seen this problem is when I’ve forgotten to push a commit to TFS and am now pushing two commits (squashed as one) without realising it. I normally do all my code reviewing when I commit to my Git repo, I only take a cursory glance at the changes I’m pushing to TFS because I know the changes themselves are what I want; it’s just the impedance mismatch between Git and TFS causing the extra friction.

Git-TFS Pull Failure

After git-tfs checkintool has run the use of a TFS gated build causes the check-in to appear to abort. However in the background the TFS shelveset is passed onto the gated build where it’s built and then merged if successful. Once the gated build completes we can re-sync the local Git repo with the remote TFS repo using git-tfs pull. But sometimes it fails like this:

> git tfs pull --rebase
Working with tfs remote: default
Fetching from TFS remote 'default'...
C12345 = c3b3f9a1339366b918b2bb5c508590bbdcce4c60
rebase in progress; onto c3b3f9a
You are currently rebasing branch 'master' on 'c3b3f9a'.
 
nothing to commit, working directory clean
The previous cherry-pick is now empty, possibly due to conflict resolution.
If you wish to commit it anyway, use:
 
    git commit --allow-empty
 
Otherwise, please use 'git reset'
error running command: git rebase --preserve-merges refs/remotes/tfs/default
Command exited with error code: 1

Squashing the Git Master

If you compare the Git master and TFS remote master branches you’ll probably notice that the remote branch has a squashed version of your last few commits on master. Consequently what we need to do is squash the commits on master to make the local and remote branches look the same before we do the pull. First we need to abort the current rebase so that we put master back to where it was before we attempted the pull:

> git rebase --abort

Once this is done we can squash the commits on master to match the TFS remote. I happen to use TortoiseGit for this because it’s really easy and I’ve already got it open to compare the tips of the two branches (local and remote masters).

Pull Again

With the master and remote now back in sync it’s a simple matter of executing the git-tfs pull again at which point the rebase should now succeed and master will now match the TFS remote again.

> git tfs pull --rebase
Working with tfs remote: default
Fetching from TFS remote 'default'...
Rebasing (1/1)

Note: There shouldn’t be anything to rebase (unless you’ve added more commits since) so the only observable difference will be that the newly squashed commit will have the extra git-tfs-id property as per norm to marry the Git and TFS revision changeset numbers.