Tuesday, 30 September 2014

Single Points of Failure - The SAN

The tactical solution that was put together in my “The Surgical Team” post had being running perfectly for about 6 months when I get another one of those weird mornings where I find myself being escorted to a meeting room. This time though it’s the War Room...

It seemed our simple web service that had been happily servicing requests in under 10 ms had on occasion recently exceeded that by quite some margin - a margin measured in seconds not milliseconds! It seems that the “on occasion” had now turned into “quite a few” and the failure count was now measured in whole percentage points and rising.

Naturally I pulled out the trusty old Unix command line tools (GoW is my current favourite distribution) and grepped out the number of failed requests from the IIS log files (the last column in the default log format is the time taken in ms [1]). I did this across all four production web servers and confirmed that there were web requests taking many seconds - up to 30 seconds in some cases for a request to complete!

I switched to our application logs which log every API request to see what the internal view of the requests looked like. Nope, we still handled the request within microseconds internally (the rest is IIS + MVC overhead). Very occasionally I did see something really odd - the timestamp on the log message didn’t really agree with the measurement we had taken for the length of the call.

As far as I could tell our application, which is so utterly trivial [2], was behaving itself but something either in IIS or the OS or elsewhere was causing the request to hang. It definitely wasn’t Gen 2 garbage collections either because we log when they occur and they happened nowhere near any of the hangs we were seeing. Sniffing the traffic with WireShark confirmed the behaviour but it all made absolutely no sense. There were reports of some odd behaviour in the logs of other services too but they were not nearly so time sensitive and it was much harder to track the anomalies down. In one case there was a task scheduled onto an empty CLR thread pool and it appeared to be seconds before it was picked up.

The fact that it was not isolated to just our service, although the evidence there was weak, coupled with the fact that it affected all our web servers, but not always at the same moment in time suggested to me that it was likely to be infrastructure related. My gut reaction was something security related as I also noticed that the clocks were well out of sync. I had seen Windows servers behave oddly before when their clocks were out of sync, but our servers were still well within the legendary 5 minute window for Kerberos. Perhaps a busted domain controller or a busted NIC in a domain controller?

None of the servers had been patched recently, but just to see whether it might be a virtualization problem one web service guest VM was moved to an entirely different cluster and host VM version. This still made no difference though.

This went on for a few days. The only thing I had unearthed by running grep across the entire set of IIS logs since we went live, was that the problem had really surfaced a month or so earlier during a bank holiday. There had been some parallel investigation to find out exactly what changes were made back then, but we were advised that it really can’t be the source of the problem - just some tangential minor network reconfiguration. It sure looked like a smoking gun, but at the same time it had gone away and never really made any sense.

Someone put together a simple tool to make a web request against a local IIS instance to try and see if it was machine or network related and it proved pretty quickly that it was machine related. The next step was to try and disable everything non-essential, such as logging, and this came up trumps. So logging was disabled on all four web servers and the problem disappeared. But why?

As the title of this post has already given away the answer you won’t be surprised to discover it was the SAN. Something within the product was causing replication to hang and that was causing a cascading failure across the network. All the disks on all the web server VMs were virtualized and replicated via the SAN. Essentially the volume of logging the web servers were doing, whilst not insanely high (IMHO), was causing something within the SAN to get very upset.

Looking back I could see that that timestamp on some log messages didn’t make sense, but it had never occurred to me that the delay would be caused by a (silent) problem writing to the drive [3] hosted via the SAN. I’ve never worked directly with virtual machines before because all the years I worked in finance we were thrashing physical machines 24x7 by running intensive calculations. I’m also somewhat surprised that log files are not written to some form of local storage, even if it’s a virtualized local drive, as the point of writing to local storage is usually to avoid too many side-effects from the act of logging. Plus you have a log to look at if some part of the remote infrastructure fails. Perhaps I’m out of touch here - is it cheaper to throw all your log messages across the wire to some remote storage? I have used remote logging for operations style logging (i.e. info, warning, error), but not for diagnostic logging (i.e. debug/trace). Presumably SANs are supposed to be so reliable and performant that they are as reliable as the sun rising and setting?

The last time I checked, which was a few months after this incident happened, the SAN’s vendor had still not resolved the problem and the application logging was still disabled.

 

[1] The wording for the time-taken field in the IIS log file has a real air of lawyer speak about it. Basically it’s the time from the moment the request is first noticed to the point that the last packet is sent (IIRC). The weasel words seem to be a reminder that network delays outside of IIS’s control means the time may well be longer than that. And for really small responses (i.e. a single network packet in size) it could be even less reliable. Given that the resolution is 15 ms anyway I don’t think anyone would seriously consider it a truly accurate source in the first place.

[2] It was just a straight binary search across a (paged) array of 100 million pairs of fixed-length strings (close to 3 GB in size).

[3] When a physical disk starts to fail you can usually see some evidence in the event log, such as reads or writes failing. But in this case there was nothing in the guest’s event log to indicate the write took an inordinately long time. I can’t say whether the infrastructure team could see anything in the host’s logs but one would hope they would have been checked...

Friday, 26 September 2014

Parameter Mutation or Result Aggregation?

As part of our normal development cycle I did a bit of code reviewing and came across something that I realised I would have done myself a while back, but these days would consider doing a different way. The code was a function that attempted to perform an action on a collection of items, and as part of the error handling would keep a note of the failures and then call itself recursively to process the remainder. Something like this:

public List<Failure> Write(List<Thing> things)

  var failures =  new List<Failure>();  

  Write(things, ref failures);  

  return failures;
}

private void Write(List<Thing> things,
                   ref List<Failure> failures)

  try 
  { 
    // Use bulk API to try to write all the “things”  
  } 
  catch (WriteException e) 
  {  
    failures.Add(new Failure(...)); 

    var unwrittenThings = things.Skip(e.FailedItem+1)
                                .ToList();  

    Write(unwrittenThings, ref failures);  
  }
}

As always the real code was far more complicated than this (mostly due to various degrees of exception handling) which meant that you couldn’t simply manually convert the recursion into a traditional loop. No, what I found I would have done differently was change the way that the “failures” list is passed into a recursive function where it is then mutated as part of the process. Essentially seeing the “ref” keyword set off my spider-senses, which is exactly what it’s supposed to do.

Although only the outer method is public, and so this is purely a discussion about the internal implementation of a method, I still questioned whether it would be preferable to avoid the mutation (i.e. side-effect). As I get more and more used to adopting immutable values and trying to be thread-agnostic I find the style of code above less appealing.

In my earlier C++ days (i.e. prior to the invention of move semantics) this would have (felt like) the most efficient way to do it, i.e. to mutate a std::vector style parameter passed by reference [1]. But in C#, where such costs are different because the collections are all reference types, it wouldn’t cross my mind not to return a potentially large collection via the return value.

This leads me to an alternate approach to the code above, which might be considered slightly more pure (aside from the huge side-effects of whatever the underlying operation is), which is to modify a local collection, aggregate the results from the recursive calls, and then return that:

public List<Failure> Write(List<Thing> things)

  List<Failure> failures = new List<Failure>(); 

  try 
  { 
    // Use bulk API to try to write all the “things” 
  } 
  catch (WriteException e) 
  {  
    failures.Add(new Failure(...)); 

    unwrittenThings = things.Skip(e.FailedItem+1)
                            .ToList();  

    var moreFailures = Write(unwrittenThings);

    faillures.AddRange(moreFailures);  
  }  

  return failures;
}

Note that we also only need a single method now.

The one aspect that would cause me to think twice is whether the recursion could get so deep that it might generate a stack overflow [2]. As someone who lived through the 16-bit DOS\Windows era I am all too aware how easy it could be to blow the stack with recursion. However a 1 MB default stack these days probably gives you plenty of room for normal use. Thinking more specifically about the example above, if the number of failures was so high that every write caused a failure, then a stack overflow exception may well act as a natural Circuit Breaker that highlights something is badly wrong elsewhere.

 

[1] The other C++ alternative of course (still prior to move semantics) might have been to use std::auto_ptr or a shared_ptr. Whilst auto_ptr was generally considered dangerous for use in collections it was suitable for tightly scoped use where objects were returned from factory functions.

[2] Support for tail recursion would alleviate this, but then you’d have to know that the tail recursion optimisation was being applied if you wanted to be absolutely sure of avoiding a stack overflow.

Tuesday, 23 September 2014

Deferring the Database Choice

In my recent post “The Courage to Question” I described how the choice of database technology had been another “challenge” for the team. In a much earlier project we had been working with a NOSQL solution, which was the perfect fit for the very large data set, only to be told (as we geared up for production) that we must use a classic SQL technology instead. You can probably imagine our surprise and the ensuing debacle.

When the next project came up we were a little more wise and the choice of database technology was far less critical, at least on face value. However one of the features touted suggested to me that we might be better off with two different databases - one of each sort - as the usage was inherently different. Either way we didn’t want to get embroiled in any “discussion” before we had a better idea of what it was we were actually building, and so we elected to go for no database at all.

No formal database product that is. Our set of functional acceptance tests only involved a single web API service and so “an in-memory database” was the simplest thing we needed. Obviously by “in-memory” what I mean is a dictionary of key to (typed) object. This also allowed us to begin designing the data storage layer without having to get a database into the build pipeline, etc.

Reasonably soon into the project we reached a level of maturity in the demo system where not having any persistent data became a chore for the non-development staff as we were deploying so quickly the existing state kept getting wiped. It was time to consider whether we needed a real database.

We decided not to. In the background we had expressed a preference for which technology we’d prefer to use from a developer productivity perspective, but that appeared to be of no consequence. So, to minimise the disruption that would likely occur when adopting the real technology, we switched to a really dumb file-system based approach. Basically the persistence layer used the same JSON serializer we were using in the web API to turn the data model in to JSON strings that we then persisted as files, where the filename was the object’s primary key.

This got us over the persistence hurdle which helped during the demo-ing and also gave us time to develop the data model a little further before committing to a representation in a database. It was this latter procrastination around settling on a data model that had an interesting outcome. Where in the past the data model might have been modelled up front with a classical Customers / Orders / LineItems style hierarchical relationship we actually ended up with lots of little data silos that happened to have a similar primary key (e.g. customer ID). However the need to relate them formally with any sort of aggregate root (e.g. a Customer table) simply didn’t exist.

It also became apparent that one aspect of the data could even be viewed as an independent service, especially when looking further into the future to see what the strategic plans were for this part of the data. I’m loathe to appear to be jumping on the micro-services bandwagon [1]  but this felt like it was a very simple data service (e.g. an Address Book) that could be independently developed, deployed and serviced. The cost of splitting the services apart at that point in the project’s evolution would have been too high as we still had a lot of functionality to discover about the system we were building. But it was definitely firmly lodged in our minds that this might be desirable and even bandied around the idea of at least splitting the code into a separate assembly to at least enforce some extra partitioning for the time being.

It was an interesting exercise to live without a real database for as long as possible and it definitely had an enlightening effect on both the persistent data model and system architecture.

 

[1] Although I’ve read Lewis & Fowler’s blog post on the topic I wouldn’t want to presume that I have understood the salient points.

Friday, 19 September 2014

Haunted By Past Mistakes

Do you find yourself haunted by problems that exist in codebases you worked on in previous lives? I do. And I wish I could let them go but something inside gnaws away at me in the vain hope that one day I’ll be able to go back and right those past wrongs. I’m sure it’s no different to wishing you could go back in time and fix anything else you’ve done in your past, such as saying something nasty to a spouse, partner or friend. Of course I have those moments too, but fixing old code somehow seems more achievable. For those of you blessed enough to not suffer from this affliction let me give you an example...

I started my professional programming career just over 20 years ago at a small software house that wrote graphics packages for Windows 3.x. I initially worked on the little tools that they bundled with the main products. One tool, called KeyPad, was designed to help users enter characters not easily accessible from the keyboard. The application’s main window was a grid containing the hard to reach letters or symbols (depending on the font) and the user just clicked a character to send it to the previously focused application. One of the changes I made was to ensure that when the application window was resized it “snapped” to a grid that ensured the client area was an integral number characters in width and height. This meant trapping the WM_SIZE message so that if the window was shrunk or widened then the other dimension was adjusted appropriately. It was in my changes to the WM_SIZE handler that I left a really ugly hack that I’m not proud of.

The problem I faced was that when the WM_SIZE handler fired I then adjusted the dimensions and resized the window again to “snap” it to the nearest desirable size. This caused the WM_SIZE message to be sent again and so the code started to become re-entrant. For some (unknown to me) reason the window was a slightly different size to what I had calculated and that caused an amusing animated effect where the window rapidly disappeared up its own backside. I couldn’t work this out and so I added a hack to the WM_SIZE handler to cause the re-entrant behaviour to just stop on the second call. This “worked” but left a slightly odd visual effect that no one in the QA team seemed to notice. So it got shipped.

Many years after leaving the company I read an article in MSJ (MSDN Magazine as it was back then) or perhaps a blog post by Raymond Chen that described how to correctly calculate the total size of a window based on a desired client window area. And at that moment it dawned on me exactly what I had done wrong all those years before. But by now the company had folded due to competing products and so there wasn’t anyone to tell even if I really wanted to try and help correct my past mistake.

Clearly there is nothing I can do now, and at time I’m sure I was doing the best job I possibly could. So why do I feel so mortified? It’s not just my code either. I have sent an email to one of my previous teams at Christmas time to remind them about someone else’s bug that only shows up once a year when the gap in published market data exceeds 2 working days [1]. I feel bad because I didn’t just get on and fix it, instead leaving it to be prioritised - which it never was.

I realise I should just take the advice from the Bob Newhart clip that J.B. Rainsberger played during his Agile on the Beach keynote this year and “Stop It!” but I’m probably a lost cause. For those younger programmers out there who didn’t get to see Paul Hammond at the ACCU Conference in 2009 give his lightning talk “Caring Will Only Cause You Pain”, let this be your warning so that you might be saved in future from the ghosts of projects past.

 

[1] The potential for two days of holiday to butt up to a weekend and create a 4 day gap also occurs at Easter but the firm publishing the market data did occasionally publish data on one of the holidays so it didn’t always cause it to fail.

Thursday, 18 September 2014

The Courage to Question

A while back we had a retrospective where the exercise was to come up with some values that we as team wanted to adhere too. For those of us that are aware of the principles, values and practices from XP it’s harder to come up with something that isn’t just taken directly from there, but one of the items written down was “courage”.

As we went through and discussed what was meant by each suggestion it was interesting to see that the definition of courage came from the project manager and wasn’t so much about feeling confident in our abilities, but instead it was about being able to question authority. In a number of cases a demand has been made on the team and at least one of us has questioned what benefit it would provide. However, this isn’t just us being bloody-minded. In each case we can’t fathom what value the decision (or policy) would add to either our project or even the company as a whole. If someone could easily point out the value we’d all be happy to go along with it, but often it doesn’t appear to make any sense.

For example we started our current project using a private GitHub repo, largely because it was going to take forever to get set up on the company’s TFS farm (the older 2010 edition, not the newer Git friendly version). We also knew we’d be far more productive with Git as we’d already shown to be the case over the last 9 months on two other projects. Despite this we were asked to move to TFS. We were told “it’s the company standard”. So we asked for clarification on why we should disrupt our current fluid workflow (and give up the ability to work remotely too) when we’re using a service that is also saving them money. If they wanted us to use all the other TFS features, like work items and the build servers we could understand it, but we weren’t sure they did. And it turned out they didn’t. They just needed to have a copy of the source code in TFS for their maintenance team to take over at the time of going live. Of course they would prefer to have sight of it in TFS sooner (very sensible), which is what lead to the use of git-tfs that I wrote about a few months back in “Developing With Git, Pushing to TFS”.

Other things that we’ve headed off have been the desire to re-use one project’s security group on another because they happen to have the same members. We were also asked to switch to their own internal (SharePoint based) version of Trello instead of the real thing, but we went back to Trello because it was just too painful to use and there is no one assigned to fix the bugs and improve the internal version [1]. One result of the TFS battle was that we got to talk to someone in the company’s TFS admin team, explain our position and consequently get ourselves onto their programme for upgrading to the Git-friendly version of TFS. The choice of database technology has also reared it’s ugly head again and experience taught us we would be in for a struggle there too, and did. Apparently, although the development team may have the best understanding of the (scale of the) problem they are trying to solve, it falls to another department with no direct involvement to decide what the best tool for the job is. Developer productivity definitely did not feature in their decision matrix either.

I wrote an article for ACCU’s C Vu journal a few months back titled “Developer Freedom” which recounts the battles developers have in the enterprise to get anything done efficiently. I feel like I’m becoming more militant as I get older, but I guess that’s because as I get older and more experienced I have less tolerance for people towing the party line just for an easy life. If they say they want productive teams then I believe they should actually put some effort into making that happen. Of course what makes it easier to take such as stance is being in a position to ask questions where any fear of reproach can be met with finding another project at another client that respects its team and their values. I appreciate that those early in their professional programming career’s don’t have such luxuries to fall back on.

I’m really glad this behaviour got noticed by other members of the team, and more importantly that they see it as a positive attitude. I suspect that it could easily be viewed by others as some kind of ego trip - “my way or the highway” - but I don’t think we ever “threw our toys out of the pram” when a decision had gone against us. If we did put our spades down, then it was probably because it wasn’t worthy of any further debate.

The point of a retrospective is to “inspect and adapt” so that the team can work more effectively together. There’s little point doing that if you cannot instigate the changes you desire and sometimes you might need to fight pretty hard to get them. Of course one also hopes the cost is worth it in the end.

 

[1] I understand why a company might choose to build its own tooling, because they can integrate in ways a 3rd party tool perhaps can’t. But they can easily end up like many open source projects - abandoned and only half-finished once the initial excitement has worn off.

Wednesday, 17 September 2014

Refactoring Or Re-Factoring?

As per usual on Twitter, a link to an interesting blog post was doing the rounds recently on refactoring. The post was titled “Stop Refactoring and Start Factoring” and it tackled the issue of how much effort you put into the design of your code so that it best portrays your understanding of the problem. I felt that his beef was with the interpretation around “the simplest thing that could possibly work” and how some programmers might choose to put less effort into the initial design and leave it until a refactoring phase later to try and refine the design then. Of course by then the original programmer might have left and so you are left with two jobs - understanding their half-baked design first and then making the change the customer wants.

Martin Fowler’s book on Refactoring was written a long time ago (1999) and I think any interpretation needs to be done within the historical context that surrounded software development back then. Enterprise software in particular has a reputation of being largely “write-once” - changes were made by doing the smallest possible edits - no attempt was made to improve the design at the same time. I believe Fowler’s book was a call-to-arms to try and change the culture of software development so that systems got the TLC they deserved over their entire lifetime - not just the initial development put into version 1.0.

My understanding of refactoring, and TDD for that matter, has never been to put off ensuring that the design always best reflects my understanding of the problem as it exists today. Doing that - leaving the design as sub-optimal - is what we class as Technical Debt. Consequently I’ve always mentally inserted a hyphen into the word “refactoring” as what I believe changes when I come back to a piece of code is the enhanced clarity of the problem. As such what I’m doing when re-factoring is really just doing the factoring I would have done in the first place, if I had known then, what I know now.

If we apply the prime directive from the retrospective process that suggests developers are always doing their best work, and assuming that any conscious sub-standard work is recorded formally as technical debt, then that really just leaves developer ability. Perhaps what the author was really getting at is the lack of development skill to go that extra mile and create a more beautifully crafted design. For example, in my post “Primitive Domain Types - Too Much Like Hard Work?” I lamented the apparent ease with which the simpler choice of using primitive types gets taken instead of creating your own constrained domain types. This kind of behaviour though may not be a conscious decision from the developer if they have not worked in an environment where that design technique has been used. Sadly I suspect the realm of “Unconscious Incompetence” remains far too dominant within many programmers because the drive is for knowledge of tools and technologies [1], which can be easily purchased via the training budget, rather than the more craftsman-like skills that come from mixing with “the right sort”.

Or maybe not. Perhaps my interpretation of refactoring is far too liberal and he sees too many others taking it far too literally. It wouldn’t be the first time this has happened [2].

 

[1] I freely hold my hand up as once believing that that was what was important. Maybe not exclusively but in retrospect I now question my original choice of priorities.

[2] Design Patterns for one. For me they have always been about vocabulary - a means to communicate an idea without having to go into all the gory details. As time has passed (20 years, phew!) the implementation details in the book have aged, but the language still lives on.

Tuesday, 16 September 2014

The Surgical Team

I first read Fred Brooks’ essay The Surgical Team (from his seminal work The Mythical Man Month) over a decade ago. More recently I heard Allan Kelly talk at Agile Cambridge 2013 about Xanpan and the same idea of building a team and flowing the work through them came up again. This has always seemed perfectly logical to me, but as a freelancer who is hired for a project which tends to be quite long lived, I’ve never seen this play out for real. That changed last year and I got to be part of a surgical team.

First On The Scene

It was the Monday after the so-called “hurricane” and I was the first programmer from our team in that morning. Another would be in later when the chaos died down and the other two had the day off. I grabbed a coffee and put my bag down, but before I could unpack my laptop I was told to meet the project manager in a meeting room as soon as I could. Needless to say I was expecting the worst…

On the contrary the meeting took an entirely different tone. Due to an unexplainable performance problem a tactical solution was now needed to provide a low-latency service for doing some simple data mapping on a multi-GB data set. As always the service needed to be in production ASAP, which in this case meant 2 weeks. I was told we should drop what we were working on (a greenfield project in mid-flow) and focus entirely on this new problem.

Backup Arrives

By the time my first fellow programmer had made it in I had already worked out a possible solution with the main architect (it wasn’t rocket science) and we got started on building it. I created a Git repo, Trello board to track the tasks and the skeleton Visual Studio solution. After that was pushed I started working on the data mapping logic and my colleague started on putting together the web API aspect so that we could stay out of each other’s way. When the other two team members came back the following day, one picked up the continuous integration aspect to got the build pipeline working, and the other starting putting together the performance tests as they were essential to our success. Nearly everything was written in a test-first manner with a focus on acceptance tests for the service itself and unit tests for some of the internal machinery that would have been hard to test at system level.

We had a functionally complete demo service up and running within 3 days that the downstream team could code against, and by the end of the week we had started to bottom out some of the deployment issues which were only held up because the test and production infrastructure was not available. The service itself was ready within 2 weeks and what spilled over were the deployment and operations aspects which only became apparent once we had infrastructure to deploy on. The other three developers had already gone back to the original project by this time and I was left to mop up the loose ends.

Post Mortem 

As a team we had been together for less than 6 months, with the 4 (experienced) developers all being new, both to the client and each other. Looking back it had taken us more than two weeks just to get the build pipeline working on our main project, so being able to re-purpose the CI and development servers was a massive leg up. We also had existing security groups and accounts that we could re-use for the services whilst waiting for the project specific ones to be created. Finally the service itself was a web API which we were already creating for the main project and so we could drag over a lot of the scripts and common helper code we had written and tested there. Once of the best things about writing code that is “always ready to ship” is that you can reuse it in a hurry with a large degree of confidence that it’s already been decently tested.

What is probably evident from the second half of the story is that being new to the client we had no first hand experience of what it meant to put a service into production, so once we reached that point we slowed to a crawl. For instance we did not know who we needed to speak to about the support and operational requirements and so couldn’t have the conversation early enough to implement the service heartbeat, sort out the logging format, event log IDs, etc. when we easily had the capacity to handle them.

Epilogue

It was a pleasurable feeling to experience that time pressure again, but only because we knew it had an end close by and ultimately we were going to be heroes if the service was delivered on time. I’m also pleased to see that even under such pressure we stuck to the same development process we were already using - no testing corners were cut either because of the time pressure or the “allegedly” tactical nature of the solution. In essence the fastest way was the right way.