Friday 7 December 2012

To Wait Or Not - Turning Deadlock to Livelock

On the system I'm currently working on we have always seen the occasional failure caused by the database going to sleep for longer than the timeout on a particular write. When I say occasional, we were seeing this once every few months, maybe. There is some polling of the table, but effectively there should never be any overlap between the query used to retrieve data and the one used to update it later[1]. This eventually started to become a little more frequent and so I decided to add a little extra code to make the specific write query a little more resilient by allowing the 30 second default timeout to be configurable. The update is very small and so the actual transaction, when given a chance to do its work, is tiny, so the extended timeout didn't feel too dangerous. I intended to extend it a little bit at a time to ensure I didn't mask any other issues by accident.
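
The change itself is tiny; a rough sketch, assuming ADO.NET and an App.config override (the key name and surrounding variables here are purely illustrative), is just:-

// Fall back to the usual 30 second default if no override is configured.
string setting = ConfigurationManager.AppSettings["TaskWriteTimeoutInSecs"];
int timeoutInSecs = (setting != null) ? int.Parse(setting) : 30;

using (var command = new SqlCommand(updateSql, connection))
{
  command.CommandTimeout = timeoutInSecs;
  command.ExecuteNonQuery();
}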

As is always the way, I never quite got the time to finish plumbing the changes in as one hit, despite their relative simplicity.

When I finally came back to it I noticed a colleague had decided to take a different route - the more classic back-off and retry. This made me feel a little uneasy because I felt this choice was more risky. The first, and probably least important, reason was that the code was now more complex. Alright, so it wasn't reams and reams of code, but it was far more than the one line required to bump the timeout. Yes, it could have been simplified by factoring out the retry logic using Execute Around Method. But what concerned me more was the impact on SQL Server of the constant retries from the numerous calling processes there are...
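
For what it's worth, the complexity objection is easy enough to address; a minimal sketch of the Execute Around Method style (the helper, attempt count and pause below are illustrative, not our production code) might be:-

public static class Retry
{
  // Invokes the action, pausing and retrying if a SqlException occurs.
  // (A real version would also check the error was actually a timeout.)
  public static void WithRetries(int maxAttempts, TimeSpan pause, Action action)
  {
    for (int attempt = 1; ; ++attempt)
    {
      try
      {
        action();
        return;
      }
      catch (SqlException)
      {
        if (attempt == maxAttempts)
          throw;

        Thread.Sleep(pause);
      }
    }
  }
}

// The write then becomes:
Retry.WithRetries(3, TimeSpan.FromSeconds(5), () => UpdateTaskStatus(taskId, status));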

Everyone knows what deadlock is because it's a classic interview question. But many people I've interviewed are far less familiar with its alter ego - livelock. With deadlock nobody makes progress because everyone is standing around waiting; with livelock everyone is so busy trying to avoid one another that they still make no progress. The canonical example I know is two people walking towards each other in a corridor who never manage to pass because they both keep sidestepping to avoid the path of the other. Change the two people into two processes which are both backing-off and retrying and you can hopefully see where I'm going.

If SQL Server was blocking the update whilst some other lengthy read was occurring[2], it felt better to me to wait for longer on the basis that SQL Server will require no more resources to achieve that. In contrast, backing off and retrying means that the client will time out, reconnect, and re-execute the query, which means a plan has to be re-generated, etc. Even with connection pools and query plan caching it must still add some overhead and therefore would be more costly than waiting longer. If you have many clients doing this you will start to approach the effects you see with livelock.

I wrote about similar effects to this in "Service Providers Are Interested In Your Timeouts Too" where the client of a web service would time out and retry, but with the service still stuck executing the previous query it just created more contention. I'm sure SQL Server has been around more than long enough to deal with this admirably, but the constant retries still don't come for free.

In this particular case I'm sure the level of contention was easily dealt with either way, but when you have a performance problem that you cannot definitively pin down it's better to inch your way forward than to change too many things in one go.

 

[1] Use of the READPAST hint helps reduce the readers blocking each other.

[2] We had no obvious candidates for why this might be happening at first. As time went on and the table size grew into many millions of rows there was some correlation with a few support queries that were being run occasionally but nothing really concrete. If it had been a major problem we probably would have stuck the SQL profiler on and tried to force the issue somehow. Intuition is never a good substitute for hard data but sometimes it’s all you have to go on.

Thursday 6 December 2012

Don’t Overload Stored Procedures Using Nullable Parameters

One common technique for minimising change within the database layer is to "overload" a stored procedure's functionality by using null-defaulted parameters. For example, say you want to provide something in your public interface to set the status of a Task (which might be a table where you store the details of "tasks"). You know that you need to mark a task as a "success" or a "failure", but then you get twitchy as you wonder what other states you might need, and what additional attributes you might need to store in the future about a completed task. The uncertainty could cause you to decide to create a "generic" stored procedure like this:-

create procedure dbo.SetTaskStatus
(
  @taskId     dbo.TaskId_t,
  @taskStatus dbo.TaskStatus_t
)
as
  . . .
go

By creating a nice “wide” API you can bend and mould it to all manner of new behaviour without disrupting your clients. But at what cost? You have effectively created a very thin veneer over the underlying tables and as a result you will end up with very little of an abstraction at all. The “generic” name and behaviour hardly conveys any meaning and so you need to understand the calling contexts to understand how it should be used.

The first change is almost certainly going to be to store some kind of error message, so you’ll end up adding 1 extra parameter and giving it a null value by default:-

create procedure dbo.SetTaskStatus
(
  @taskId        dbo.TaskId_t,
  @taskStatus    dbo.TaskStatus_t,
  @errorMessage  dbo.TextMessage_t = null
)
as
  . . .
go

But that new parameter only applies when @taskStatus = FAILED. What are you going to do when someone provides an error message on success? Or what about no error message on a failure? The logic in your procedure has just gone from a simple INSERT to something more complex - assuming that you are going to validate your inputs. If you don't[1] then it could be head-scratching time for the poor developer that calls your API incorrectly by accident. As the number of arguments grows so do the permutations, and yet only a few of them may actually be valid.
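
Even with only two states, being explicit about which combinations are valid means something along these lines inside the procedure (a rough sketch; the error text is made up and the status values simply mirror the delegation example later on):-

create procedure dbo.SetTaskStatus
(
  @taskId        dbo.TaskId_t,
  @taskStatus    dbo.TaskStatus_t,
  @errorMessage  dbo.TextMessage_t = null
)
as
  -- FAILED (2) requires a message, SUCCEEDED (1) must not have one.
  if (@taskStatus = 2 and @errorMessage is null)
  begin
    raiserror('An error message must be supplied for a failed task.', 16, 1);
    return;
  end

  if (@taskStatus = 1 and @errorMessage is not null)
  begin
    raiserror('An error message only applies to a failed task.', 16, 1);
    return;
  end

  . . .
go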

I think it’s far more sensible to be explicit up front, just like you would in your back-end code. So I would have two procedures - one for success and another for failure:-

create procedure dbo.SetTaskToSucceeded
(
  @taskId dbo.TaskId_t
)
as
  . . .
go

create procedure dbo.SetTaskToFailed
(
  @taskId        dbo.TaskId_t,
  @errorMessage  dbo.TextMessage_t = null
)
as
  . . .
go

By creating two separate procedures we have removed one parameter and made it clearer exactly when an error message is expected. Any extra state you need can be added to one, the other or both, and it will still be obvious when you need it. It also means it should be easier to find what callers need updating because you can GREP for the specific “overload” rather than the general one.

In some cases you may be worried about duplicating functionality in both procedures. In which case you can always keep the two separate procedures as the publicly exposed API and internally just delegate to a common internal implementation:-

create procedure pub.SetTaskToSucceeded(. . .)
as
  declare @success pub.TaskStatus_t = 1;

  exec impl.SetTaskStatus @taskId, @success, . . .
go

create procedure pub.SetTaskToFailed(. . .)
as
  declare @failure pub.TaskStatus_t = 2;

  exec impl.SetTaskStatus @taskId, @failure, . . .
go

Don't be afraid to refactor either. If you started with the overloaded case, such as with legacy code, and you find it spiralling out of control, then break it up and expose a new, more focused API. You can initially implement the new API in terms of the old, build up your set of tests and then switch the clients over. Once you have regained control of the API you then have the ability to refactor away the spaghetti with the barrage of tests to back you up.

 

[1] Performance may dictate that you can't. Or perhaps you have a shored-up Data Access Layer with excellent integration test coverage and tight control over all access to the database.

Wednesday 5 December 2012

Interfaces Are Not the Only Abstract Fruit

A while back I came across some code that I personally thought was over-engineered[1] and subsequently transferred my ire via Twitter. I was going to leave it at that and slip away unnoticed but @willhains didn't think I should be allowed to sound off without justifying my position. So, this is my attempt to keep what little credibility I already have intact. This is the third attempt at writing this blog post, which in itself has always felt like a "smell" that I was probably wrong, but writing "Primitive Domain Types" along with a Twitter exchange with @jasongorman this weekend convinces me that there is still some merit in my argument. What I said was this:-

“Designing for the future doesn’t just mean sticking interfaces everywhere in your code. It means considering the cost of change.”

With hindsight it probably should have been kept to just the first sentence, but even 140 characters is still far too much rope for some people it seems. I have addressed the latter part to some degree with my recent post “Putting the Cost of Change Into Perspective” but I’ll try and elucidate a little more specifically later (if you’re still awake by then).

IDo Not IAm

One habit I've noticed (and been guilty of myself) is that when creating a new class there is a knee-jerk reaction to instantly create an interface for it, essentially with the same name as the class. The reason for this is presumably because you suspect you're going to need to mock it in some tests, either now or in the very near future. That statement alone should be enough to convince you that TDD will probably tell you whether that's true or not. So, in practice what you end up coming across are designs like this:-

interface IThing
{ . . . }

public class Thing : IThing
{ . . . }

Take a step back. Does the String class implement IString? Of course not, it implements various behaviours defined in interfaces that are usually common to many types and are therefore non-type-specific in name[2]. Of course none of this is news and in fact has been written about by far more intelligent people than me a long time ago - namely in “Mock Roles, Not Objects”.

This particular practice does make some sense when you’re implementing a service[3], although even then I would be mindful because a File usually doesn’t implement IFile - it is more likely to implement separate Reader and Writer interfaces[4]. No, it’s value and simple entity types where this practice feels wrong by default.
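
To illustrate the File point, using the IReader/IWriter naming from footnote [4] (the bodies are elided), the shape you'd hope to see instead of a single, class-shaped IFile is more like:-

public interface IReader
{ . . . }

public interface IWriter
{ . . . }

// The concrete type implements the roles its clients actually depend on,
// rather than one interface that simply mirrors the class name.
public class File : IReader, IWriter
{ . . . }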

Although I've never experienced the pre-generics era of C#, I have lived through the "generic" containers in C and C++ (pre-template era) and even contributed something nastily similar myself[5] in the distant past. The kind of design this leads to is interfaces with names like IKey, IValue, IIdentity, etc. that effectively have no salient attributes. This may be because you don't have a second use case from which to draw any commonality, or you feel the bare interface alone satisfies the need for static type safety and that's enough. The problem I have is that this makes the design hard to understand because you cannot see the true relationships between the types - an IDictionary<IKey, IValue> doesn't say much[6].

This was exactly the scenario I found myself describing when writing about “Primitive Domain Types” last week. I left the advice as just this:-

“There might be a temptation to factor out an ISurrogateKey interface from SurrogateKey ... I suggest you resist it on the grounds that these are simple value types”

The Cost of Change

One reason to use interfaces in the first place is exactly to isolate code from change and is embodied in the Open/Closed Principle. As long as all your new types can satisfy the existing interface contract you can in theory just slot them in. But to do this the interface must be designed with this kind of extensibility in mind and that often involves gazing into a crystal ball. Get it wrong and you save yourself very little, if anything.

When it comes to published interfaces you’ve got no choice but to think really hard, but for internal interfaces does it really matter? If you’ve no idea what the future holds and you’ve no obvious need to mock the type, why bother creating an interface at all? Maybe the type is just a simple Abstract Data Type or perhaps the interface is really just a single callback function?

Programmers brought up on a diet of C# and Java would do well to digest the following tweet from John Carmack:-

“Sometimes, the elegant implementation is just a function. Not a method. Not a class. Not a framework. Just a function.”

 

[1] Trying to remain faithful to The Simplest Thing That Could Possibly Work is hard, very hard. It’s even harder for older software engineers because we were brought up in a time when code was less malleable and “fixing” what wasn’t actually “broken” was heresy.

[2] The classic OO naming scheme is nouns for classes and verbs for interfaces. The example that always comes to my mind is a class called Thread but a behaviour called Runnable. The cheaper alternative to thinking seems to be to just stick the suffix “Service” onto both the class and interface.

[3] What a wonderfully overloaded term “service” is. In one sense you can be talking about the “services” a class provides through its interfaces and in another you might be referring to the “service” (or daemon) that is the process hosting the class.

[4] An aggregate interface like IFile (i.e. interface IFile : IReader, IWriter) can be useful, but should ideally be restricted to interfacing with legacy code. Also at system-test level you probably want to be able to read real data but redirect your writing to a test-specific location. I touched on the configuration side of this in “Testing Drives the Need for Flexible Configuration”.

[5] In C# this is realised with collections of type Object. In C and C++ it means holding values as void* pointers, or using structs in C that have a memory layout that simulates inheritance.

[6] The lack of typedefs in C# is one of my biggest bugbears coming from C++, at least as far as making code using Generics more readable.

Monday 3 December 2012

Value Type or Reference Type? An Acid Test

Life was simple under C++ - everything is a value type. As someone who grew up via Assembler and then C on their way to C++, the notion of pointers always made perfect sense, after all it's just code and memory. When I moved to C# a few years ago I had to learn all about the differences between Reference Types and Value Types. Of course references are pretty close to pointers, but types like String, which are implemented as reference types yet masquerade as value types, seemed confusing at first. That's probably why it's such a common interview question. Closely related, though probably not obviously so at first, is the subject of immutability. Once again, in C++ you have "const" which allows you to make individual instances read-only, but in C# immutability has to be expressed at the type level.

So, the String type is implemented as a "reference type", but has "value type" semantics. So far so easy, but now let's move on to the classic Customer type so beloved by sample code writers everywhere:-

public class Customer
{
  public Identity Id { get; }
  public string Name { get; }
  public Address Address { get; }
}

Now imagine you want to write a unit test that checks the type can be serialised correctly because you're going to pass it over a WCF channel. How do you check to make sure that an instance of the type has been rehydrated correctly? One answer might be to see if the original and deserialised objects "are equal":-

[Test]
public void CustomerCanBeSerialised()
{
  var input = new Customer(. . .);
  var output = deserialise(serialise(input));

  Assert.That(output, Is.EqualTo(input));
}

To make this work means implementing Equals() on the Customer type. And best practice states that if you implement Equals() then you almost certainly should be looking to override GetHashCode() too.

Hmmm, what a hole we’re digging for ourselves here. It seems logical to use an equality comparison here because we might also have some non-public state that we also need to verify has been serialised correctly. But what’s that got to do with comparing two arbitrary instances of the same type to see if they are equal? Surely in that case only the observable state and behaviour matters?

This is where parts of the codebase I’m working on at the moment have ended up. What I suspect has made this situation worse is the apparent ease with which these member functions can be cranked out with a tool like ReSharper[1]. Some blame might also be laid at the door of Unit Testing because the tests were written with good, honest intentions and it’s easy to be swept along by the desire to do The Right Thing.

Ultimately the question I seem to be asking myself when creating a new type is not "is this a value or reference type?", but "do I need to implement Equals()?". And yet when googling around the subject the two terms seem inextricably linked and often create a circular argument:-

“Implement Equals() if you want your reference type to have value type semantics” vs “if your reference type implements Equals() then its like a value type”

How about another use case - detecting changes to an object. A common pattern for handling updates to objects involves code like this:-

string newValue = m_control.GetValue();
if (newValue != m_oldValue)
{
  // value changed do something…
}

Here we are again using equality as a means to detect change. Of course String is a simple value type, but why shouldn’t I use the same mechanism to decide if it’s worth sending my Customer object over the network to update the database?

var newCustomer = GetCustomerDetails();

if (newCustomer != m_oldCustomer)
  m_database.SaveChanges(newCustomer);

Ah, you say, but Customer is an entity, not a value. Thanks, as if getting your head around this whole idea isn't hard enough someone's just added another term to try and throw you off the scent. Using the logic that an entity is not a value implies it's a reference type then, no? Oh, so value vs entity is not the same as value vs reference - no wonder this programming lark is so hard!

So, my first rule of thumb seems to be to never implement Equals() or any sort of comparator by default. This makes perfect sense as there is no "natural ordering" for most custom types. Implementing Equals() as a means to another end (e.g. testing) is fraught with danger and will only confound and confuse future generations of maintenance programmers. Plus, once you've spent it, your one chance to override Equals() is gone for good.

But what of my original question - whether a type should be implemented as (or like) a value type or a reference type? The default answer must surely be "reference type", which is probably no surprise to anybody. And what of the aforementioned acid test? What I have learnt is that pretty much the only time I should ever have actually implemented Equals()/GetHashCode() is because I wanted to use the type as a key in a HashMap, Set, Dictionary, etc. And in cases where I've felt inclined to store a Set<T>, and therefore scratched my head over the lack of a comparator, I've realised that I should instead be creating a Dictionary<Key, T>, or to apply the Customer example, a Dictionary<Identity, Customer>. Who would have thought the choice of container could have such an effect on the way a type might be implemented?

This nicely allows me to extend my previous post about "Primitive Domain Types". I suggested that all the "Identity" type needed was a constructor, but of course it was also missing the comparison support and that's another burden when defining the type. There are quite a few unit tests that need to be written if you want to cover all the cases that the contract for overriding Equals() spells out. Sticking with the underlying types is looking so much more attractive than cranking out your own value types!
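
As a rough sketch of that burden (ignoring some of the null and type-mismatch edge cases the full contract covers), the comparison support bolted onto the earlier Identity type would look something like this, with the entity itself still never needing Equals():-

public class Identity : SurrogateKey<int>, IEquatable<Identity>
{
  . . .

  public bool Equals(Identity other)
  {
    return (other != null) && (Value == other.Value);
  }

  public override bool Equals(object obj)
  {
    return Equals(obj as Identity);
  }

  public override int GetHashCode()
  {
    return Value;
  }
}
. . .
// The container is keyed on the identity, not the entity.
var customers = new Dictionary<Identity, Customer>();
customers[customer.Id] = customer;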

At the beginning I may have suggested that with C++ it’s all a bed of roses. Of course it’s not. The equality model does seem to be much simpler though, but that’s probably just 15 years of C++ experience talking. Although you can compare pointers as a form of reference equality you don’t have pointers masquerading as values. Just imagine what would happen if std::shared_ptr<> provided an == operator that compared the de-referenced values instead of the actual pointers. Hey, actually that could work, couldn’t it?

 

[1] This is clearly not ReSharper’s fault but it does illustrate how tooling (or lack thereof) can affect the implementation.

Friday 30 November 2012

Whatever Happened to UML?

I first came across UML in the late ‘90s. One of the chaps I was working with downloaded a demo for a tool called Select. We watched as he drew a few boxes and connected them. I can’t remember much more about that particular product but up until that time I had only seen the occasional model in magazines like Dr Dobbs Journal and wondered what it was all about. It inspired me though to pick up a book[1] on the subject (UML Toolkit) and delve into the topic a bit more.

On my next contract I discovered the company had some licenses for Rational Rose ‘98. I initially used it to knock up a few sketches of some stuff I was working on; as much to help me put a legible drawing together as anything else. When we finally started working on a new project I thought I’d try my hand at using some of the other diagrams, to see what else we could get out of it. In particular I was interested in how the Use Case models might play out. I also gave the code generation options a spin to see whether it would be possible to create a skeleton from the model and then keep them in sync.

Of course none of this really panned out. I’m not sure how much was down to the tool and how much was my misunderstanding of what I thought I could achieve, but either way it felt a waste of time. The Use Case modelling was probably more my fault, but it seemed far more work than a set of bullet points and rough notes; I didn’t really see what the model added. The code generation was more likely down to the tool because it just couldn’t represent the relationships in the way I wanted. Useful though the code generation options were for formatting and file structure it just didn’t generate code I liked (for starters it wanted to use pointers everywhere and I wanted references).

But it's not all bad news. If I look back at my first experiences what really attracted me wasn't the formal notion of modelling an entire system up front and then pushing a button to generate the code, but the fact that you could knock up a quick diagram that used a representation everyone could understand (eventually)[2]. Seeing the Design Patterns book using UML for its diagrams also seemed to give credence to its usefulness. The very idea of Design Patterns seemed to be about taking the "template" and then knocking up a quick sketch where you replaced their class names with the ones from your problem and then standing back to see whether it made sense, or needed further refinement.

I remember having a discussion back then with a colleague about a logging framework we were planning on writing. The Design Patterns book had also taught me to start thinking in interfaces and I was trying to explain how the various parts could fit together and where we could use the Composite and Adaptor patterns to tackle certain problems. My colleague couldn’t immediately see the beauty that this was bringing so I knocked up a UML sketch of the classes and interfaces and gestured about how they would interact. Seeing the light bulb go on as all these concepts fell into place was pretty neat.

Both then and in the intervening years I have found the Class and Deployment Diagrams the most useful. Although I find TDD suffices for most of my day-to-day work because of the need to refactor and take small steps, when I have a significant new piece to work on I go back to the sketches and see how it might fit in. I still find that I need to have at least a vague idea of what I'm supposed to be designing before jumping in; it's more a case of Small Design Up Front than BDUF though.

I’m not suggesting that it’s UML in particular either that I find useful. Really it’s just about drawing pictures to convey an idea. It just happens though that if I’m going to do any diagrams at all then they might as well be done in a tool and format that has some life in it.

That last statement is one that bothers me though. The problem with any tool more niche than something like MS Word is that you wonder if it'll still be around long enough for you to open the documents again. The kind of diagrams I draw only get an airing every now and then and so you wonder when they will finally pass their best before date. I've been using Star UML for the last few years because it's a native app so it's very responsive, unlike the Java offerings which lag badly even on modern hardware. Although supposedly Open Source it's seen little activity since 2005 with just a couple of sideline projects trying to resurrect it in Java.

One of the recent'ish changes to the Doxygen landscape is that you can embed UML diagrams alongside your comments, which Doxygen can then render using PlantUML. At the recent ACCU Conference one of the last sessions I went to was by Diomidis Spinellis about UML and declarative diagramming tools. The thought of being able to keep the models inside the source code seems to be about the best chance we have of things staying in step. Comments do go stale, but not nearly as easily as a separate document or one outside the source repository.
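
As a flavour of what that looks like (a sketch, assuming a Doxygen build configured with PLANTUML_JAR_PATH; the class and messages are purely illustrative), the diagram source just lives in an ordinary comment right next to the code it describes:-

/** The calculation engine entry point.
 *
 *  @startuml
 *  Client -> Engine : Calculate()
 *  Engine -> Database : FetchRates()
 *  Engine --> Client : results
 *  @enduml
 */
public class Engine
{
  . . .
}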

This also tackles one of the other major problems I've had with UML models in the past - merging. Although it's nice to keep the related diagrams together so that you can browse them, this increases the chances that you'll have to merge them as they get bigger. The way I saw it you either avoid branching (which is not a bad idea anyway), only update the model on the mainline (easy to forget) or keep the model as a bunch of separate smaller files instead (less navigable). I like to avoid branching anyway and actually I found that keeping the deployment and class models in separate files wasn't so bad because they are only sketches after all. The prospect of keeping them in the source code makes the problem pretty much go away. Now the only merge conflicts you have to deal with are the same kind you already deal with anyway.

I am surprised that UML has pretty much disappeared. But then maybe I'm just one of those people who needs pretty pictures to help them visualise a design. Or maybe everyone else just draws on scraps of paper and shoves them in a drawer - I have some of those too. I guess that I fundamentally believe that anything[3] that is of value to me is also probably of value to my teammates and successors, and therefore it's at least worth a little extra effort to formalise it in some way.

 

[1] If you’re interested in a book on UML you can’t go far wrong with UML Distilled by Martin Fowler. Here is a link to the book review I did for the ACCU.

[2] The whole "aggregation" thing still confuses me. I'm way too literal, so talk of wheels being wholly "owned" by a car just doesn't make sense to me. What does it then mean when you need to replace a puncture and you get the spare out of the boot and give the broken one to the man at Kwik-Fit to fix?

[3] It used to surprise me when I came across developers that didn't share their test code and test harnesses (not through any malicious intent I'm sure), especially when you were working on a codebase with no automated tests. Mind you, after the incident I had on one team perhaps they had just learned to keep them to themselves.

Thursday 29 November 2012

Primitive Domain Types - Too Much Like Hard Work?

How many books on programming start out by showing you how to define simple domain types and then use them throughout their examples? I'd wager very few, if any (possibly a few specialist "computer science" books might). And I'm not talking about composite domain types either, like an address; I'm talking about the (seemingly) primitive ones like time, date, product ID, URL, etc. In example code it's so much easier to just write:-

string url = "http://example.com/thingy";
. . .
int secs = 10;
int millis = secs * 1000;

If you were brought up on a language like BASIC or C you could be forgiven for sticking to the built-in types by default. After all, that's all you had available. In the performance-conscious days gone by there was also a natural fear of extra complexity (irrespective of whether it could be measured and proven to show no degradation in a release build) that caused you to sacrifice type safety, just in case. You'd rather stick to the good old faithful "int" type than declare a simple C++ class called "date"[1] to encapsulate all those quirks you just know will appear the moment you delve into the murky waters of calendars. After all, your current use case is really simple, right?

But what about declaring a typedef, doesn’t that count?

typedef int date_t;

In C that might be the best you can do without full blown classes[2]. At least any interfaces can now express their intent more clearly because both the type and parameter names are now working for you:-

datespan_t calculateInterval(date_t start, date_t end);

Now you don’t have to use “Reverse Hungarian Notation” and append the type name as a suffix to the variable name (which doesn’t work for the return type anyway) if you don’t want to:-

int calculateInterval(int startDate, int endDate);

In C++, C# and pretty much any other OO language you can do better - you have “the power” to create new types! I suspect though many developers don’t instantly reach for the “class” tool when their database is returning them a surrogate key typed as an int. In fact I’d wager again that the int will remain as the type used throughout the OO proficient codebase too. After all, if it’s an int in the database, why shouldn’t it be an int in the C# code too? It’s better to be consistent, right?

Did you stop to consider whether the reason an int was used in the database is because it’s only the closest possible representation the RDBMS provides for that concept? There might also be a bunch of constraints on the columns where it’s used; have you replicated those in your C++/C# code too? If the RDBMS supports User Defined Types (UDTs) you can at least do as well as the C scenario and use a consistent alias across-the-board. Before SQL Server 2008 added a native DATE type this was one possible approach to providing some expression of that concept in your schema:-

CREATE TYPE date FROM smalldatetime;

Coming back to the excessive use of plain ints as the default type for surrogate keys, dates, etc., it reminds me of what some of our interfaces would look like if we didn't create some simple domain types:-

List<Stuff> FetchStuff(int customerId, int start, int end)

You might argue that the names of the parameters could be better and that IntelliSense will guide you to the correct invocation. If you stick to a common pattern (ID, start date, end date) for methods taking similar sets of arguments[3] you lessen the chances of getting it wrong. But we don’t all notice such “obvious” patterns in our function signatures[4] and so someone is bound to switch them round by accident in a different interface:-

Other[] GetOthers(int start, int end, int customerId)

If you don’t have any decent tests you’re going to be reaching for the debugger when it finally does show up that somebody’s code calls it with the arguments the wrong way round.

There is one other gripe I have about using bare arithmetic types in this way - what exactly does this mean?

int meh = customerId + managerId;

The only time I can see where the use of arithmetic might be required is when generating IDs:-

int customerId = ++nextCustomerId;

But this code should probably be encapsulated somewhere else anyway. The use of the int type to generate IDs does not imply that the resulting “key” type must also be the same primitive type, but what we’re really talking about now is immutability:-

ID customerId = new ID(++nextCustomerId);

If you’re using C++ then you could use an enum as a strongly-typed way to handle an integer based ID[5]. Yes, you need to use cast-like syntax to initialise values but you get an awful lot else in return. If you don’t need any constraints on the value this may be enough, but if you do then you’re probably looking to create a class instead; perhaps with a dash of templates/generics thrown in to help factor out some commonality.

So, let’s continue with the surrogate key problem. I can think of two constraints on an ID that I’d like to enforce which reduces the set of allowable values from +-2 billion to just +1 to +2 billion. The first is that negative IDs are out on the premise of just being weird. The second is that C# uses 0 for an uninitialised value so I’d like to avoid that too. The former constraint might suggest the use of unsigned integers, but then we’re faced with the problem of adding a different constraint at the top end to ensure we don’t enter the 2 billion+ range because our code might support it but the database doesn’t.

I could decide to apply this constraint just to the database and that would at least protect my persisted data, assuming of course everyone remembers to use the same constraint everywhere it's needed. However I'm a fan of code failing fast, as soon as possible in fact, and I can't think of any moment sooner than construction with a duff value.

So, here is a first-order approximation to get the ball rolling:-

public class SurrogateKey<T>
{
  public T Value { get; protected set; }
}

The use of a base class is probably overkill but I thought it might be an interesting first step as I’ve already stated that ints and strings are two common types used for surrogate keys. Our first use case then is the ID field for customers, products, orders, etc. which I suggested a moment ago should have a range less than that of a full integer. So, let’s capture those constraints in a derived class:-

public class Identity : SurrogateKey<int>
{
  public Identity(int value)
  {
    if (value < 1)
      throw new InvalidIdentityException(...);

    Value = value;
  }
}

Now, this is where in C++ and C# the notion of strongly-typed typedefs/aliases would be really useful. We have a whole family of types that are all implemented the same way, but whose values are not implicitly interchangeable. In his first book, Imperfect C++, Matthew Wilson describes one technique for creating such strongly-typed beasts. And what prompted me to write this post was Martin Moene's recent post on the Whole Value Idiom which touches on the same ground but from a slightly different perspective[7].

Anyway, given that we don't have strongly-typed typedefs we're stuck with creating a very thin class that contains nothing more than a constructor:-

public class ProductId : Identity
{
  public ProductId(int identity)
    : base(identity)
  { }
}
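
The payoff comes at the call sites. Given a second, hypothetical sibling such as CustomerId defined in exactly the same way, the earlier FetchStuff() example can say what it actually means and the compiler will spot a transposed argument for us (a sketch only):-

public class CustomerId : Identity
{
  public CustomerId(int identity)
    : base(identity)
  { }
}

List<Stuff> FetchStuff(CustomerId customerId, DateTime start, DateTime end)
{
  . . .
}

// Passing a ProductId, or one of the dates, where the CustomerId belongs
// is now a compile-time error rather than a runtime surprise.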

I'll leave it up to you to decide whether the notion of sealing classes is a good or a bad thing (opinions seem to have changed over time). Given that we're after implementation inheritance and not polymorphism it probably should be sealed. You could of course swap the use of inheritance for composition but then you're giving yourself more work to do to expose the underlying value. Given our aim is to provide the ability to create lots of similar value types easily, I'm happy with the current approach from the maintenance angle.

There might be a temptation to factor out an ISurrogateKey interface from SurrogateKey. This is an urge I believe comes from programmers more au fait with the pre-template/generics era or who are overly obsessed with mocking. I suggest you resist it on the grounds that these are simple value types[8].

So, how well does this idea translate to our other, similar problems? Well, for string keys I can easily imagine that you might want to restrict the character set. The same constraint could be applied to file paths and URLs. And I’ve already written about how I’m struggling with the “null reference vs empty string” duality. A C# Date class would almost certainly be a far richer abstraction that might leverage the DateTime class through composition. Unlike the more trivial surrogate key problem, the Date class has far more complex behaviours to implement, such as formatting as a string. And that’s before you decide whether to add the arithmetic operators too, or just worry about the storage aspects.

This, I feel, is where the “too much like hard work” comes in. It’s really easy to convince yourself that you can just pass around the “built-in” types and resort to “being careful” instead. After all you probably have a barrage of unit tests to ensure that you use them correctly internally. But the kinds of problems you are heading for cannot be found at the unit level; they occur at integration level and above. This means that the place where the duff data goes in - for example the UI - could be miles away from where the problem first shows up - the database if you’re still really lucky, but days later if the data is queued and there is no immediate processing of it.

I reckon most developers have no qualms about using enums for a type-safe set of limited values, so why does it feel like there is such a leap from that to a much larger, but still restricted, set of primitive values?

 

[1] There was no Boost when I started learning C++. There was no Java either for that matter. Let’s not discount an alternative either such as a simple value type and a set of free standing functions - OO is not the only fruit.

[2] One alternative for C is to create an opaque type with a struct, such as Windows uses for its various handles - HBRUSH, HPEN, HFONT, etc.

[3] There is an argument that this tuple of values (ID, start date, end date) should perhaps have a type of its own because they are clearly related in some way. The hard part would be trying to avoid the much overused suffix "Context"…

[4] In the eternal struggle to define what “cohesive” looks like, this would probably be in there, somewhere.

[5] Thanks to fellow ACCU member and C++ expert Jonathan Wakely for pointing this idiom out.

[6] I’ve mostly focused on the much simpler problem of “unitless” non-arithmetic values, but clearly I’m on very shaky ground because a date is normally measured in days and is usually anything but non-arithmetic.

[7] I’ve been working on a post titled “Interfaces Are Not The Only Abstract Fruit” but never been happy with the example I chose. I think I may have just discovered a better one.

Tuesday 20 November 2012

Using Nullable Columns to Handle Large Table Schema Changes

One of the problems with updating the schema of a database table that has a significant amount of data is that it can take ages, and whilst the data is being fixed up your database is in a half-baked state. A faster approach is to make a simpler metadata-only change up-front, plug the missing data in during slow time, and then complete the metadata change afterwards to close the loop. Using nullable columns makes the process more manageable.

Say for example I have a large customer table and I want to add a new non-nullable column with a default value. If I add that column as-is the database is going to have to touch every page to set the default value on all existing rows. However, if I add a nullable column then SQL Server, at least, only has to update the table's metadata; it touches no data pages.

ALTER TABLE Customer ADD NewColumn INT NULL;

If the public interface to your database is raw access to the database tables then you’re pretty well hosed at this point without fixing the queries that touch the tables. But if you’ve built yourself a facade using views, functions, etc. to abstract the underlying data model you can use ISNULL() to fill in the default values on-the-fly when the client queries old data:-

SELECT c.Name, 
       ISNULL(c.NewColumn, @defaultValue) 
FROM   Customer c

When writing data your facade can manually apply the not null constraint and fill in the default values when not provided by the client. This may not be quite as performant as letting the database handle it natively but, bulk inserts aside, this should be more than adequate for most scenarios.
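
As a sketch of the write side (the procedure and column names are made up for illustration), the facade simply fills in the blanks before the row hits the table:-

create procedure pub.AddCustomer
(
  @name      dbo.CustomerName_t,
  @newColumn int = null
)
as
  declare @defaultValue int = 0;

  -- Apply the "not null" semantics here until the underlying column catches up.
  insert into dbo.Customer(Name, NewColumn)
  values (@name, ISNULL(@newColumn, @defaultValue));
go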

With the facade doing its job and encapsulating the data you're free to update the old data in slow time. SQL Server (and probably other RDBMSs too) supports using the TOP clause with the UPDATE statement to allow you to fix your data in small batches, which helps keep the transaction time down. It also means you can avoid blowing the transaction log if it's of a fixed size.

UPDATE TOP(1000) Customer
SET   NewColumn = @defaultValue
WHERE NewColumn is null
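
If there is a lot of data to back-fill, the same statement can be driven in a loop (a sketch) so that each transaction stays small and stops once nothing is left to fix:-

WHILE (1 = 1)
BEGIN
    UPDATE TOP(1000) Customer
    SET   NewColumn = @defaultValue
    WHERE NewColumn is null;

    IF (@@ROWCOUNT = 0)
        BREAK;
END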

Theoretically speaking the nullability of the column is correctly handled, at least from the client's perspective[1], so you don't need to actually alter the underlying table again. But if you do want/need to enforce the constraint more naturally you're going to have to bite the bullet. SQL Server supports the WITH NOCHECK option that still allows you to make a metadata-only change, but that comes with its own drawbacks, and so you may just need to accept a final hit. However, at least you can split the whole task up into smaller chunks and execute them as capacity permits rather than panicking over how you're going to squeeze the entire change into a smaller window.

 

[1] When I talk about Database Development Using TDD, this is the kind of requirement I have in mind as it’s perfect unit test fodder.

From Test Harness To Support Tool

Before I discovered the wonders of unit testing I used to follow the suggestion from Steve Maguire in his book “Writing Solid Code” about stepping through your code in the debugger to test it. In particular he was suggesting that you look at the common pitfalls, such as off-by-one errors when traversing arrays and writing loops. This had a profound effect on the way that I partitioned my code because to perform this practice efficiently I needed to get my code executing in the debugger as quickly as possible. That in turn meant that code was generally packaged into small (static) libraries and then linked into both the main product and, more importantly for this discussion, a test harness.

The test harnesses may have started out as trivial applications, but the library often has some potentially reusable[1] value outside the main product. Consequently I often ended up putting more than a cursory amount of effort into them. Yes, sometimes I even contrived a feature just to allow a particular code path to be executed and therefore tested. In hindsight it was not an overly productive use of my time compared to, say, unit testing, but it did have the benefit of creating both a system/integration-test tool, and sometimes as a by-product a support tool. From what I’ve read about some of the support tools Microsoft ships, they also started out as internal utilities that others eventually found useful and the word spread.

Take for example my own, simple DDE and ODBC libraries. These weren’t written with any particular application in mind, but I created a simple GUI based test harness for each of them - DDEQuery came from the former and PQT from the latter. Although only intended as personal test tools I have used them both on many occasions to aid in the development and support of my real work[2]. OK, these are my tools and therefore I have more than a vested interest in using and maintaining them. But the same idea has paid dividends many times over on the other systems I’ve been paid to work on.

I began to realise how useful the test tools could be outside the team when I started working on a multi-language project for the first time. The front-end was written in PowerBuilder and the back-end in C. I was part of a team tasked with making substantial performance improvements, but the first problem I could see was that I had to waste more than 4 minutes loading the GUI every time before I even hit any of my team’s code. What I felt we needed was our own cut-down tool, not necessarily UI based, that would allow us to invoke the entry points more rapidly for testing and more importantly provide an alternate means through which we could invoke the profiler.

The watershed moment was when the tool got bundled with the main GUI during testing. Although it presented a very Matrix-esque view of the data it was enough to allow them to work around some UI bugs, such as when an item was missing from the dashboard so you couldn't even select it, let alone manipulate it. Nowadays even though I'd expect automated unit and collaboration tests I'd still want some sort of harness just to enable profiling and code checking, such as with a tool like BoundsChecker. Unit testing does not make any other sort of testing redundant, but you might get away with doing less of it.

I ended up in a similar kind of position many years later, but this time it was a VB front-end and C++ back-end with a good dose of COM thrown in the middle. This time they already had a command line tool for driving the calculation engine but still nothing to act as a front-end. Once again I was somewhat amazed that the first port of call when debugging was to invoke the VB front-end, which took an age to start up. This also implied that the COM interface was never tested in isolation either - it must have been painful working on the front/back-end interface.

I was particularly pleased with one simple little tool that made my life much easier. It allowed me to extract portions of a large data set so that I could turn my local testing around much more quickly. What I never expected, when broadcasting the existence of this tool to the rest of the team, was that I would be chastised for using company time to write a tool that helped me do my job more efficiently! I also built a simple GUI to act as a mock front-end, just like before, but this time I was far more careful about who I mentioned it to.

There were other simple test harnesses in the VCS repository from days gone by, but given the attitude of those up above it’s not surprising that they were left to rot. When I came to try them out I found none of them even built. Only one was of any vague use and so I nurtured that and made sure it became part of the standard build along with any other new tools I wrote. I then packaged it along with the others into a separate Utils package that could be deployed into an environment on demand to aid in any manual testing or support.

I eventually felt vindicated when the tool I was chastised over started to be used by both the BAs and the BAU team to generate custom data sets for testing specific scenarios. In one of the essays that Fred Brooks published as part of The Mythical Man Month he suggests:-

“It is not unreasonable for there to be half as much code in scaffolding as there is in product”

Nowadays with unit testing I would expect that figure to be even higher. Of course there is a cost to all this “scaffolding” as it needs to be maintained just like the production code. I suspect that the more naive managers believe that if they skimp on test code they will have less waste to maintain and therefore more time can be spent on building new features instead. In my experience those projects I worked on that had a plethora of tests and tools delivered new releases an order of magnitude more frequently than those without.

 

[1] I, along with many others, genuinely believed that we would be writing lots of reusable, pluggable components that would save ourselves time as we approached each new project. For a software house this may well make sense, and it certainly helped in some cases when I worked at one. But for in-house development trying to share code across projects and teams is just too hard it seems.

[2] The DDE tool rarely sees the light of day now, but PQT is still used by me virtually every day for testing and support.

Monday 19 November 2012

The File-System Is An Implementation Detail

Using a native file-system to store your data can be a double-edged sword. On the one hand the great thing about a file-system is that the concept of files and folders is well understood, so there's a low learning curve for new developers and a potential backdoor for support issues too. The downside of using the file-system is that it's a well understood concept and there is a potential backdoor for the "time conscious" developer to use instead of any carefully designed "abstraction".

The ability to manipulate the file-system with your everyday tools such as shell commands and scripts makes the allure of the "easy win" just too tempting sometimes. This is compounded by the fact that it may not be obvious whether the choice to use the file-system in such a way was a conscious one, exactly because of its flexibility, or imposed for some other reason[1] and not intended to be exploited. Once that power has been abused it can become the rationale for doing it again and again as people then follow its lead "to fit in with the existing code". Then, any notion of "implementation detail" becomes obscured by the sheer volume of exploitative code.

But building an “abstraction” over the file-system doesn’t have to be some gigantic API with SQL like semantics to query and update data, it could be as simple as a bunch of static methods that take individual query parameters that you then turn into a path inside the abstraction. For example, say you have the following legacy code that loads customer data from a file in a value dated folder:-

public void ProcessCustomer(string name, DateTime date)
{
  string datedFolder = date.ToString("yyyy-MM-dd");
  string path = Path.Combine(@"\\server\share", datedFolder);
  path = Path.Combine(path, name);

  Customer customer = Customer.Load(path);
  . . .
}

A big problem with this code is that it makes testing hard because you’re in the territory of having to mock the file-system API - notably some static methods in C#. Fortunately it’s reasonably easy to build a thin Facade[2] that can serve you well when testing your error recovery code too. But it is another barrier on the path (of least resistance) to a more wholesome codebase.

One objection you might immediately raise is that the Customer class takes a file-system path anyway and so you’re already tied to it, i.e. you’re doing no worse. What you’re missing is that the Customer class actually takes a string, the semantics of which happens to be a path. In future it could be a RESTful URL or some other “location”. The Load() method could also be overloaded to take different types of “location”. To support that agnosticism you’re best off getting the “location” to the class loader without needing to know yourself what it is, i.e. you want to start treating the “location” as an opaque type[3].

The simplest change is to just move the file handling code into a separate class:-

public static string FormatCustomerPath(string root, string name, DateTime date)
{
  string datedFolder = date.ToString("yyyy-MM-dd");
  string path = Path.Combine(root, datedFolder);
  path = Path.Combine(path, name);

  return path;
}
. . .
public void ProcessCustomer(string name, DateTime date)
{
  string path = FS.FormatCustomerPath(@"\\server\share", name, date);

  Customer customer = Customer.Load(path);
  . . .
}

This very simple change already divorces the caller from many of the grungy details of where the data is stored. For starters you could change the order of the name and the date in the path. You could even insert other "random" elements into the path such as environment name, version number, etc. Or you could format the dates as DDMMMYYYY instead of yyyy-MM-dd. The very fact that you're passing the date as a rich DateTime type also makes the API less prone to error because you're not juggling lots of raw string parameters[4]. Also, if your fellow developers have a habit of using format strings like "{0}\\{1}\\{2}" for paths then this simple change will improve portability too.

The root of the path (“\\server\share”) might be hard-coded but more likely is stored in a variable so the next step could be to extract that out entirely by storing it in a static member of the FS class instead. This can then be done once as part of the bootstrapping process in main(), leaving the client code virtually unaware of the existence of the file-system at all:-

public void ProcessCustomer(string name, DateTime date)
{
  string path = FS.FormatCustomerPath(name, date);

  Customer customer = Customer.Load(path);
  . . .
}
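
For completeness, the facade itself stays tiny; a sketch (the member names are illustrative) might be:-

public static class FS
{
  private static string s_root;

  // Called once from main() during bootstrapping.
  public static void SetRoot(string root)
  {
    s_root = root;
  }

  public static string FormatCustomerPath(string name, DateTime date)
  {
    string datedFolder = date.ToString("yyyy-MM-dd");
    string path = Path.Combine(s_root, datedFolder);

    return Path.Combine(path, name);
  }
}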

The final bare minimum change might be to add an extra level of indirection and hoist these two lines out into a separate class, or rename and change the responsibility of the FS facade to solely be about the file-system aspects of loading a variety of objects:-

public void ProcessCustomer(string name, DateTime date)
{
  Customer customer = ObjectLoader.LoadCustomer(name, date);
  . . .
}

I think all these steps were trivial, and if you've got a mountain of code to refactor you can't always get everything done in one hit. Dependencies often have a habit of getting in the way too, so you can end up either having to back out changes or keep going to the bitter end. By initially sticking with a static facade you can improve things one piece at a time without making your testing position any worse. In fact you'll probably make it slightly easier.

Remember, I’m not suggesting the use of a static facade over a “proper service” that has a “proper mockable interface” and a “proper factory”. Static facades are still fairly unpleasant beasts because they spark thoughts of global variables and Singletons which are generally considered a bad smell. Yes, there is tooling available that will allow you to mock this kind of behaviour but you should probably be looking to a much higher level of abstraction anyway when you have any sort of persistence layer to deal with. Just don’t forget that sometimes you can make a big difference with a much smaller, localised set of changes.

 

[1] One system I worked on uses it because the manual mocks we built for integration testing ended up becoming the production implementations.

[2] Tim Barrass wrote about one way of doing this for C# in his post “Using static classes in tests”. The same technique could be used in C++ but substitute “static method” with “free function” and “delegate” with “function pointer”.

[3] I appreciate that I’m skating on thin ice with the suggestion that you can play games with the all-singing-all-dancing String type. Ultimately I’d expect to switch to some Abstract Data Type at some time later, but only if an even richer abstraction hasn’t come into play before that to mitigate all this.

[4] There are two very common types used for surrogate keys - int and string. It’s tempting to pass these around using their primitive types but that makes an API much harder to use correctly as it’s all too easy to switch parameters by accident and get some weird behaviour.

Tuesday 13 November 2012

When Does a Transient Failure Stop Being Transient?

Depending on the type of system you work on the definition of “transient”, when talking about errors, varies. Although I’ve never worked in the embedded arena I can imagine that it could be measured on a different scale to distributed systems. I work in the latter field where the notion of “transient” varies to some degree depending on the day of the week. There is also an element of laziness around dealing with transient failures that means you might be tempted to just punt to the support team on any unexpected failure rather than design recovery into the heart of the system.

The reason I said that the definition of transient varies depending on the day of the week is because, like many organisations, my current client performs their infrastructure maintenance during the weekend. There are also other ad-hoc long outages that occur then such as DR testing. So, whereas during the week there might be the occasional blip that lasts for seconds, maybe minutes tops, at the weekend the outage could last longer than a day. Is that still transient at that point? I’m inclined to suggest it is, at least from our perspective, because the failure will correct itself without direct intervention from our team. To me permanent failures occur when the system cannot recover automatically.

For many systems the weekend glitches probably don’t matter as there are no users around anyway, but the systems I work on generally chug through other work at the weekend that would not be possible to squeeze in every day due to lack of resources. This means that the following kinds of failures are all expected during the weekend and the architecture just tries to allow progress to be made whenever possible:-

  • The database cluster is taken offline or runs performance-sapping maintenance tasks
  • The network shares appear and disappear like Cheshire Cats
  • Application servers are patched and bounced in random orders

Ensuring that these kinds of failures only remain transient is not rocket science; by-and-large you just need to remember not to hold onto resources longer than necessary. So, for example, don’t create a database connection and then cache it, as you’ll have to deal with the need to reconnect all over the place. The Pooling pattern from POSA 3 is your friend here as it allows you to delegate the creation and caching of sensitive resources to another object. At a basic level you’ll then be able to treat each request failure independently without it affecting subsequent requests. There is a corollary to this, which is Eager Initialisation at start-up, which you might use to detect configuration issues.
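
As a trivial example of what that looks like in practice (the connection string, table and query are all made up), acquire the connection per request and let the ADO.NET pool worry about reuse behind the scenes:-

using System.Data.SqlClient;

public string LoadSetting(string connectionString, string name)
{
  // A fresh SqlConnection per request: the pool makes Open() cheap when the
  // database is healthy, and a failure here affects only this one request.
  using (var connection = new SqlConnection(connectionString))
  using (var command = new SqlCommand("SELECT Value FROM Setting WHERE Name = @name", connection))
  {
    command.Parameters.AddWithValue("@name", name);
    connection.Open();
    return (string)command.ExecuteScalar();
  }
}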

The next level up is to enable some form of retry mechanism. If you’re only expecting a database cluster failover or a minor network glitch you can ride over it by waiting a short period and then retrying. What you need to be careful of is that you don’t busy-wait or retry for too long (i.e. indefinitely), otherwise the failure cascades and the system ends up performing no work at all. If a critical resource goes down permanently then there is little you can do, but if not all requests rely on the same resources then it’s still possible for progress to be made. In some cases you might be able to handle the retry locally, such as by using Execute Around Method, as in the sketch below.
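
Something along these lines would do; it’s only a sketch, with the Retrier name, the blanket exception handling and the fixed delay all invented for the example rather than lifted from the system in question:-

using System;
using System.Threading;

public static class Retrier
{
  // Execute Around Method: the boilerplate back-off-and-retry lives here,
  // the interesting work is passed in as a delegate.
  public static void Execute(Action operation, int maxAttempts, TimeSpan delay)
  {
    for (int attempt = 1; ; ++attempt)
    {
      try
      {
        operation();
        return;
      }
      catch (Exception) // in reality, catch only the transient failures you expect
      {
        if (attempt == maxAttempts)
          throw; // give up and let the caller take the bigger-picture view

        Thread.Sleep(delay); // wait a short period rather than busy-retrying
      }
    }
  }
}

// e.g. Retrier.Execute(() => WriteJobStatus(jobId), 3, TimeSpan.FromSeconds(5));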

Handling transient failures locally reduces the burden on the caller, but at the expense of increased complexity within the service implementation. You might also need to pass configuration parameters down the chain to control the level of back-off and retry which makes the invocation messy. Hopefully though you’ll be able to delegate all that to main() when you bootstrap your service implementations. The alternative is to let the error propagate right up to the top so that the outermost code gets to take a view and act. The caller always has the ability to keep an eye on the bigger picture, i.e. number and rate of overall failures, whereas local code can only track its own failures. As always Raymond Chen has some sage advice on the matter of localised retries.

Eventually you will need to give up and move on in the hope that someone else will get some work done. At this point we’re talking about rescheduling the work for some time later. If you’re already using a queue to manage your workload then it might be as simple as pushing it to the back again and giving someone else a go. The blockage will clear in due course and progress will be made again. Alternatively you might suspend the work entirely and then resubmit suspended jobs every so often. Just make sure that you track the number of retries to ensure that you don’t have jobs in the system bouncing around that have long outlived their usefulness. In his chapter of Beautiful Architecture Michael Nygard talks about “fast retries” and “slow retries”. I’ve just categorised the same idea as “retries” and “reschedules” because the latter involves deactivating the job which feels like a more significant change in the job’s lifecycle to me.
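
The rescheduling side can be as dumb as bumping a retry count and pushing the job to the back of the queue, suspending it once it has outstayed its welcome. A rough sketch, where Job, MaxRetries and Suspend() are all stand-ins for whatever your system actually has:-

public void Reschedule(Job job, Queue<Job> queue)
{
  job.Retries += 1;

  if (job.Retries <= MaxRetries)
    queue.Enqueue(job);   // push it to the back and give someone else a go
  else
    Suspend(job);         // deactivate it for later resubmission rather than let it bounce around forever
}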

Testing this kind of non-functional requirement[1] at the system level is difficult. At the unit test level you can generally simulate certain conditions, but even then throwing the exact type of exception is tricky because it’s usually an implementation detail of some 3rd party library or framework. At the system level you might not be able to pull the plug on an app server because it’s hosted and managed independently in some far off data centre. Shutting an app server down gracefully allows clean-up code to run and so you need to resort to TerminateProcess() or the moral equivalent to ensure a process goes without being given the chance to react. I’m sure everyone has heard of The Chaos Monkey by now but that’s the kind of idea I still aspire to.

I suggested earlier that a lazier approach is to just punt to a support team the moment things start going south. But is that a cost-effective option? For starters you’ve got to pay for the support staff to be on call. Then you’ve got to build the tools the support staff will need to fix problems, which have to be designed, written, tested, documented, etc. Wouldn’t it make more financial sense to put all that effort into building a more reliable system in the first place? After all the bedrock of a maintainable system is a reliable one - without it you’ll spend your time fire-fighting instead.

OK, so the system I’m currently working on is far from perfect and has its share of fragile parts but when an unexpected failure does occur we try hard to get agreement on how we can handle it automatically in future so that the support team remains free to get on and deal with the kinds of issues that humans do best.

 

[1] I’m loath to use the term “non-functional” because robustness and scalability imply a functioning system and being able to function must therefore be a functional requirement. Tom Gilb doesn’t settle for a wishy-washy requirement like “robust” - he wants it quantified - and why not? It may be the only way the business gets to truly understand how much effort is required to produce reliable software.

Monday 12 November 2012

The Cost of Defensive Programming

They say that the first rule of optimisation is to measure. Actually they say “don’t do it” and the third rule is to measure, but let’s invoke artistic license and just get right on and measure. Another rule when it comes to optimising code is that intuition usually sucks and the bottleneck won’t be where you think it is. Even for experienced developers this still holds true.

My current system has a component that is not unit testable; in fact the only way to test it is to run the entire process and either debug it or compare its outputs. As a consequence the feedback loop between changes is insanely long, and if you throw some infrastructure performance problems into the mix too you can probably guess why I decided to let the profiler have a look at it during one lunchtime. From the opening paragraph you can probably guess that what I found was not what I expected…

The code I stumbled upon probably deserves to be submitted to The Daily WTF because it almost manages to achieve functional correctness in a wonderfully obtuse way. It looked something like this:-

var bugs = new Dictionary<int, string>();

string sql = "SELECT * FROM Bug ORDER BY version";

// execute query...

while (reader.Read())
{
  int id = (int)reader["BugId"];
  string title = (string)reader["Title"];

  if (!bugs.ContainsKey(id))
    bugs.Add(id, title);
  else
    bugs[id] = title;
}

Hopefully you can deduce from my attempt to disguise and distil the guilty code that it executes a query to load some data and then builds a map of ID to some attribute value. The reason I say that it only “almost” works is that what you can’t see from it is that something fundamental is missing: a parameter that defines a context (specifically a point in time) for which the data should be loaded.

However, before I get to the real subject of this post let’s just get the other obvious criticisms out of the way. Yes, the component has SQL statements embedded in it. And yes, the SELECT statement uses ‘*’ when it only requires two of the columns in the table. OK, let’s move on…

So, what prompted me to write this post is actually the use of the “upsert[1] style anti-pattern” in the loop at the end:-

if (!bugs.ContainsKey(id))
  bugs.Add(id, title);
else
  bugs[id] = title;

This code is nasty. It says to me that the programmer is not sure what invariants they are expecting to handle or maintain within this method/class. What probably happened (and this is more evident if you know the data model) is that the Add() started throwing exceptions due to duplicates so it was silenced by turning it into an “Add or Update”[2]. Of course the knock-on effect is that the same value can now be silently updated many times and the observable outcome is the same. The addition of the ORDER BY clause in the SQL is what elevates it from non-deterministically broken to nearly always correct because it is never invoked in production in a manner where the tail-end rows are not the ones that matter.

Putting aside the embedded SQL for the moment, the correct implementation only needs to select the data for the context in question and there should only be one row per item so just doing the Add() is sufficient. The invariant in the data model is that for any point in time there is only one version of an item “active”. The only reason the Add() could throw is if the initial query is wrong, the data is broken or the invariant has changed and the code is now broken. If the latter happens you’ve got real problems and the least of them is that some code has failed very loudly and clearly.
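
In other words, something closer to this (just as disguised and distilled as the original, and the date columns are purely illustrative):-

var bugs = new Dictionary<int, string>();

string sql = "SELECT BugId, Title FROM Bug WHERE ValidFrom <= @date AND ValidTo > @date";

// execute query with @date bound to the context...

while (reader.Read())
{
  int id = (int)reader["BugId"];
  string title = (string)reader["Title"];

  bugs.Add(id, title); // one row per bug for any point in time, so a duplicate fails loudly
}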

Now to unwind the stack and return to the original performance problem. This code turns what should have been a query that returns 40,000 rows into one that actually sucks in 1,200,000 rows instead. Using the aforementioned “back of the envelope calculation” that is over an order of magnitude more rows than required. Luckily the table is fairly short on columns and so in terms of data retrieved it’s not insanely huge, which is probably why it’s crept up largely unnoticed.

As an aside I mentioned that it’s never run retrospectively in production and that is another reason why it has remained hidden. It has however caused me to scratch my head on a few occasions during testing as I’ve observed an unanticipated change in the output but not had the time to pin down the exact cause. A bug that only appears occasionally during testing is never going to make it up the TODO list when there are far more pressing features to implement, and the costly way the component needs to be tested pretty much ensures that it will never get fixed in isolation - a double whammy.

 

[1] For those lucky enough to avoid databases an “upsert” is short for update-or-insert. You attempt an UPDATE and if no rows were modified you perform an INSERT instead. SQL Server has proper support for this idiom these days in the shape of the MERGE keyword.

[2] This “pattern” is sprinkled throughout the component’s code so it’s not an isolated case.

Friday 9 November 2012

Include the Units in Configuration Data

[I have no right to chastise anyone else about this because I do it myself - it just always seems so “unnecessary” at the time… Hence this post is an attempt to keep me honest by giving my colleagues the opportunity to shove it back in my face every time I do it in the future.]

Say I asked you to change a “timeout” setting to “3 minutes”, what value would you instinctively use?

  1. 0.05
  2. 3
  3. 180
  4. 180000

Although I’m tempted to say that milliseconds are still the de-facto resolution in most low-level APIs I deal with these days, I have seen nanoseconds creeping in, even if the actual resolution of the implementation is coarser than that. However, that’s what the internal API deals with. These kinds of numbers are often way too big and error-prone for humans to have to manipulate; if you miss a single zero off that last example it could make a rather important difference.

If you don’t need anywhere near millisecond precision, such as dealing with a run-of-the-mill database query, you can reduce the precision to seconds and still have ample wiggle room. Dealing with more sane numbers, such as “3”, feels so much easier than “0.05” (hours) and is probably less likely to result in a configuration error.

There is of course a downside to not using a uniform set of units for all your configuration data - it now becomes harder to know what the value “3” actually represents when looking at an arbitrary setting. You could look in the documentation if you (a) have some and (b) it’s up to date. If you’re privileged enough to be a developer you can look in the source code, but even then it may not have a comment or be obvious.

One answer is to include the units in the name of the setting, much like you probably already do when naming the same variable in your code. Oh, wait, you don’t do it there either? No, it always seems so obvious there, doesn’t it? And this I suspect is where I go wrong - taking the variable and turning it into an externally configurable item. The variable name gets copied verbatim, but of course the rest of the source code context never travels with it and consequently disappears.

So, the first solution might be to name the setting more carefully and include any units in it. Going back to my original example I would call it “TimeoutInMinutes”. Of course, one always feels the urge to abbreviate and so “TimeoutInMins” may be acceptable too, but that’s a whole different debate. Applying the same idea to another common source of settings - file sizes - you might have “MaxLogFileSizeInKB”. The abbreviation of Kilobytes to KB though is entering even murkier waters because KB and Kb are different units that only differ by case. If you’re dealing with case-insensitive key names you better watch out[1].
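
To see the difference it makes at the point of consumption, here is a small sketch (the configuration keys and the use of AppSettings are only for illustration):-

using System.Configuration;

// "TimeoutInMinutes" carries its units in the name, so the conversion is obvious.
string rawTimeout = ConfigurationManager.AppSettings["TimeoutInMinutes"];
TimeSpan timeout = TimeSpan.FromMinutes(int.Parse(rawTimeout));

// Likewise "MaxLogFileSizeInKB": convert to bytes once, at the edge.
string rawMaxSize = ConfigurationManager.AppSettings["MaxLogFileSizeInKB"];
long maxLogFileSizeInBytes = long.Parse(rawMaxSize) * 1024;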

Another alternative is to look at adding units to the value instead. So, in my timeout example the value might be “3mins”, or you could say “180 secs” if you preferred. If you’re used to doing web stuff (i.e. I’m thinking CSS) then this way of working might seem more natural. The expense is that it adds a burden to the consuming code, which must now invoke a more complicated parser than just strtoul() or int.Parse()[2]. It’s not hard to add this bit of code to your framework but it’s still extra work that needs thinking about, writing and testing[3].
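
The parser doesn’t have to be clever either; a rough sketch, where the accepted suffixes and the ParseDuration name are invented for the example:-

using System;

public static class ConfigParser
{
  public static TimeSpan ParseDuration(string value)
  {
    value = value.Trim().ToLowerInvariant();

    if (value.EndsWith("mins"))
      return TimeSpan.FromMinutes(double.Parse(value.Substring(0, value.Length - 4)));
    if (value.EndsWith("secs"))
      return TimeSpan.FromSeconds(double.Parse(value.Substring(0, value.Length - 4)));
    if (value.EndsWith("ms"))
      return TimeSpan.FromMilliseconds(double.Parse(value.Substring(0, value.Length - 2)));

    throw new FormatException("Unrecognised duration: '" + value + "'");
  }
}

// ParseDuration("3mins") and ParseDuration("180 secs") both come out as 3 minutes.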

For timeouts in particular .Net has the TimeSpan type, which has a pretty flexible string representation that takes hours, minutes, seconds, etc. into account. For the TimeSpan type our example would be “00:03:00”. But would you actually call it “TimeoutInTimeSpan”? That sounds pretty weird; “TimeoutAsTimeSpan” sounds better. I guess you could drop the “InTimeSpan” suffix if you were to use it consistently for timeouts, but then we’re back to where we started because I’m not aware of any similar scheme for representing bytes, kilobytes, megabytes, etc.

 

[1] This is one of those situations where case-sensitivity is often not formally defined either. Explicit use of the == operator or implicit use via a Dictionary means no one cares until someone else sticks a different-cased duplicate in by accident and things don’t quite work as expected.

[2] Personally I prefer to wrap this kind of parsing because a raw “invalid format” style exception often tells you pretty much nothing about which value it was that choked. Extracting related settings into a richly-typed “Settings” class up-front means you don’t end up with the parsing exception being thrown later at some horribly inconvenient time. And it reduces the need to mock the configuration mechanism because it’s a simple value-like type.
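
As a sketch of what I mean (the setting names, the Settings shape and the helper are all hypothetical), the wrapper’s only real job is to say which value choked and to hand back a simple value-like object:-

using System;
using System.Collections.Specialized;

public class Settings
{
  public TimeSpan QueryTimeout { get; private set; }
  public long MaxLogFileSizeInBytes { get; private set; }

  public static Settings Parse(NameValueCollection raw)
  {
    return new Settings
    {
      QueryTimeout = ParseSetting(raw, "TimeoutInMinutes", v => TimeSpan.FromMinutes(int.Parse(v))),
      MaxLogFileSizeInBytes = ParseSetting(raw, "MaxLogFileSizeInKB", v => long.Parse(v) * 1024)
    };
  }

  private static T ParseSetting<T>(NameValueCollection raw, string name, Func<string, T> parse)
  {
    try
    {
      return parse(raw[name]);
    }
    catch (Exception e)
    {
      // Say which setting choked, not just "invalid format".
      throw new FormatException("Invalid value for setting '" + name + "'", e);
    }
  }
}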

[3] Not necessarily in that order. It’s perfect TDD/unit test fodder though, so what’s not to like…

Wednesday 7 November 2012

Sensible Defaults

If you were writing a product, say, a service, that allows remote clients to connect and submit requests, what would you choose as the default value for the service hostname? Given that your service will no doubt be fault-tolerant it will probably allow the remote end to disappear and reappear in the background. Of course you might decide to allow a client to configure itself so that after ‘N’ failed attempts to connect it returns an error, allowing some other (manual) action to take place. What would you pick as the default value for that setting too?

Put your hand up if you said “localhost” and “infinite” as the answer to those two questions. Are you really sure they are sensible values to use by default?

Unsurprisingly, I had to work with a service that had exactly those as its default settings. To make matters worse there were other questionable defaults, such as no logging of any sort[1]. Besides programmatic configuration[2] there was also a config file based mechanism that used the Current Working Directory (CWD) by default. Next question. What is the CWD for a Windows service? No, it’s not the application folder, it’s the system32 folder. By now you can probably tell where this is heading…

Basically we installed our NT Service, fired it up and absolutely nothing happened. Not only that but the service API returned no errors. Naturally we checked and double-checked the installation and obvious output locations but could find no reported problems. In the end I attached the Visual Studio remote debugger and watched in awe at the exceptions bouncing around inside the service API as it tried repeatedly in vain to attach to a service running on “localhost”. No “3 strikes and you’re out” either, it just kept trying over-and-over again.

When you’re developing features hosted within a service you’ll rarely run it as an actual service, you’ll probably tend to run it as a normal console application for speed. The natural side-effect of this is that the CWD will likely be set to the same as where the binary resides, unless you start it with a relative path. The large monolithic service I was working on was always developed in that way as all the installation stuff had preceded me by many years. Yes, we caught it the first time we deployed it formally to a DEV system-test environment, but by then so much time had passed that the start-up shenanigans were far behind us[3][4].

My personal choice of defaults would be “nothing” for the hostname and 0 for the retries. How many systems do you know where the client and middleware run on the same machine? The out-of-the-box configuration should assume a synchronous connect call by default because that is what most other services that developers are used to dealing with do. And by most I mean databases and middle tier services. Yes, asynchrony is gaining ground, even in UIs, but as a developer you already have a hard enough time dealing with learning a new product and API without having to fight it too - I can tell you it gains you no friends in the development world. Once you’re comfortable with the API you could look at its more advanced features.

I’m tempted to suggest that the decision on what to use by default was made by a developer of the product, not a consumer of it. No doubt this configuration made their testing easier exactly because they run the entire product on their development machines. That’s why dogfooding and TDD are such important ideas - they force you into your customers’ shoes. I’m a big advocate of flexible configuration mechanisms and so don’t see why you can’t also adhere to The Principle of Least Surprise too.

 

[1] No console or file based logging is possibly a sensible default, but only when your API has other ways of telling you about failures. At the very least I would opt for using OutputDebugString() as a last resort so that I could fire up the wonderful DbgView tool from the Sysinternals suite. Even the trusty old Windows event log is better than nothing, just so long as you don’t spam it.

[2] After this incident we switched to using the programmatic interface and set more useful default values inside the facade we built.

[3] The first rule of debugging is to look to the most recent changes for the source of a bug. It’s not always obvious but it’s usually there, somewhere. The bug may not be directly in the change itself but be a side-effect of it; either way the recent changes should be the clue.

[4] This was supposed to be “a simple port”. What should have taken a couple of weeks turned into a couple of months and the project was eventually abandoned. Ultimately an impedance mismatch at the architecture level between our system and the service was to blame.