Friday 30 November 2012

Whatever Happened to UML?

I first came across UML in the late ‘90s. One of the chaps I was working with downloaded a demo for a tool called Select. We watched as he drew a few boxes and connected them. I can’t remember much more about that particular product but up until that time I had only seen the occasional model in magazines like Dr Dobbs Journal and wondered what it was all about. It inspired me though to pick up a book[1] on the subject (UML Toolkit) and delve into the topic a bit more.

On my next contract I discovered the company had some licenses for Rational Rose ‘98. I initially used it to knock up a few sketches of some stuff I was working on; as much to help me put a legible drawing together as anything else. When we finally started working on a new project I thought I’d try my hand at using some of the other diagrams, to see what else we could get out of it. In particular I was interested in how the Use Case models might play out. I also gave the code generation options a spin to see whether it would be possible to create a skeleton from the model and then keep them in sync.

Of course none of this really panned out. I’m not sure how much was down to the tool and how much was my misunderstanding of what I thought I could achieve, but either way it felt a waste of time. The Use Case modelling was probably more my fault, but it seemed far more work than a set of bullet points and rough notes; I didn’t really see what the model added. The code generation was more likely down to the tool because it just couldn’t represent the relationships in the way I wanted. Useful though the code generation options were for formatting and file structure it just didn’t generate code I liked (for starters it wanted to use pointers everywhere and I wanted references).

But it’s not all bad news. If I look back at my first experiences what really attracted me wasn’t the formal notion of modelling an entire system up front and then pushing a button to generate the code, but the fact that you could knock up a quick diagram that used a representation everyone could understand (eventually)[2]. Seeing the Design Patterns book using UML for its diagrams also seemed to give credence to its usefulness. The very idea of Design Patterns seemed to be about taking the “template”, knocking up a quick sketch with their class names replaced by the ones from your problem, and then standing back to see whether it made sense or needed further refinement.

I remember having a discussion back then with a colleague about a logging framework we were planning on writing. The Design Patterns book had also taught me to start thinking in interfaces and I was trying to explain how the various parts could fit together and where we could use the Composite and Adaptor patterns to tackle certain problems. My colleague couldn’t immediately see the beauty that this was bringing so I knocked up a UML sketch of the classes and interfaces and gestured about how they would interact. Seeing the light bulb go on as all these concepts fell into place was pretty neat.

Both then and in the intervening years I have found the Class and Deployment Diagrams the most useful. Although I find TDD satisfies for most of my day-to-day work because of the need to refactor and take small steps, when I have a significant new piece to work on I then go back to the sketches and see how it might fit in. I still find that I need to have at least a vague idea of what I’m supposed to be designing before jumping in; it’s more a case of Small Design Up Front than BDUF though.

I’m not suggesting that it’s UML in particular either that I find useful. Really it’s just about drawing pictures to convey an idea. It just happens though that if I’m going to do any diagrams at all then they might as well be done in a tool and format that has some life in it.

That last statement is one that bothers me though. The problem with any tool more niche than something like MS Word is that you wonder if it’ll still be around long enough for you to open the documents again. The kind of diagrams I draw only get an airing every now and then and so you wonder when they will finally pass their best before date. I’ve been using Star UML for the last few years because it’s a native app and so it’s very responsive, unlike the Java offerings which lag badly even on modern hardware. Although supposedly Open Source it’s seen little activity since 2005, with just a couple of sideline projects trying to resurrect it in Java.

One of the recent’ish changes to the Doxygen landscape is that you can embed UML diagrams (as PlantUML markup) alongside your comments, which Doxygen then renders via PlantUML. At the recent ACCU Conference one of the last sessions I went to was by Diomidis Spinellis about UML and declarative diagramming tools. The thought of being able to keep the models inside the source code seems to be about the best chance we have of things staying in step. Comments do go stale, but not nearly as easily as a separate document or one outside the source repository.
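
As a rough illustration of the idea (not code from the post, and assuming a Doxygen new enough to support \startuml with PlantUML configured), a class diagram can live in the comment right next to the code it describes:-

/// The logging abstraction from earlier, with its diagram kept alongside
/// the code and rendered by Doxygen via PlantUML.
///
/// \startuml
///   interface ILogger
///   class FileLogger
///   class CompositeLogger
///   ILogger <|.. FileLogger
///   ILogger <|.. CompositeLogger
///   CompositeLogger o-- "many" ILogger
/// \enduml
public interface ILogger
{
  void Write(string message);
}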

This also tackles one of the other major problems I’ve had with UML models in the past - merging. Although it’s nice to keep the related diagrams together so that you can browse them, this increases the chances that you’ll have to merge them as they get bigger. The way I saw it you either avoid branching (which is not a bad idea anyway), only update the model on the mainline (likely to be forgotten) or keep the model as a bunch of separate smaller files instead (less navigable). I like to avoid branching anyway and actually I found that keeping the deployment and class models in separate files wasn’t so bad because they are only sketches after all. The prospect of keeping them in the source code just makes the problem pretty much go away. Now the only merge conflicts you have to deal with are the same kind you already deal with anyway.

I am surprised that UML has pretty much disappeared. But then maybe I’m just one of those people who needs pretty pictures to help them visualise a design. Or maybe everyone else just draws on scraps of paper and shoves them in the drawer - I have some of those too. I guess that I fundamentally believe that anything[3] that is of value to me is also probably of value to my teammates and successors, and therefore it’s at least worth a little extra effort to formalise it in some way.

 

[1] If you’re interested in a book on UML you can’t go far wrong with UML Distilled by Martin Fowler. Here is a link to the book review I did for the ACCU.

[2] The whole “aggregation” thing still confuses me. I’m way too literal, so the idea of wheels being wholly “owned” by a car just doesn’t make sense to me. What does it then mean when you need to replace a puncture and you get the spare out of the boot and give the broken one to the man at Kwik-Fit to fix?

[3] It used to surprise me when I came across developers that didn’t share their test code and test harnesses (not through any malicious intent I’m sure), especially when you were working on a codebase with no automated tests. Mind you, after the incident I had on one team perhaps they just learned to keep to themselves.

Thursday 29 November 2012

Primitive Domain Types - Too Much Like Hard Work?

How many books on programming start out by showing you how to define simple domain types and then use them throughout their examples? I’d wager very few; if not none (possibly a few specialist “computer science” books might). And I’m not talking about composite domain types either like an address, I’m talking about the (seemingly) primitive ones like time, date, product ID, URL, etc. In example code it’s so much easier to just write:-

string url = "http://example.com/thingy";
. . .
int secs = 10;
int millis = secs * 1000;

If you were brought up on a language like BASIC or C you could be forgiven for sticking to the built-in types by default. After all that’s all you had available. In the performance conscious days gone by there was also a natural fear of extra complexity (irrespective of whether it could be measured and proven to show no degradation in a release build) that caused you to sacrifice type safety; just in case. You’d rather stick to the good old faithful “int” type rather than declare a simple C++ class called “date”[1] to encapsulate all those quirks you just know will appear the moment you delve into the murky waters of calendars. After all your current use case is really simple, right?

But what about declaring a typedef, doesn’t that count?

typedef int date_t;

In C that might be the best you can do without full blown classes[2]. At least any interfaces can now express their intent more clearly because both the type and parameter names are now working for you:-

datespan_t calculateInterval(date_t start, date_t end);

Now you don’t have to use “Reverse Hungarian Notation” and append the type name as a suffix to the variable name (which doesn’t work for the return type anyway) if you don’t want to:-

int calculateInterval(int startDate, int endDate);

In C++, C# and pretty much any other OO language you can do better - you have “the power” to create new types! I suspect though many developers don’t instantly reach for the “class” tool when their database is returning them a surrogate key typed as an int. In fact I’d wager again that the int will remain as the type used throughout the OO proficient codebase too. After all, if it’s an int in the database, why shouldn’t it be an int in the C# code too? It’s better to be consistent, right?

Did you stop to consider whether the reason an int was used in the database is because it’s only the closest possible representation the RDBMS provides for that concept? There might also be a bunch of constraints on the columns where it’s used; have you replicated those in your C++/C# code too? If the RDBMS supports User Defined Types (UDTs) you can at least do as well as the C scenario and use a consistent alias across-the-board. Before SQL Server 2008 added a native DATE type this was one possible approach to providing some expression of that concept in your schema:-

CREATE TYPE date FROM smalldatetime;

Coming back to the excessive use of plain ints as the default type for surrogate keys, dates, etc. reminds me what some of our interfaces would look like if we didn’t create some simple domain types:-

List<Stuff> FetchStuff(int customerId, int start, int end)

You might argue that the names of the parameters could be better and that IntelliSense will guide you to the correct invocation. If you stick to a common pattern (ID, start date, end date) for methods taking similar sets of arguments[3] you lessen the chances of getting it wrong. But we don’t all notice such “obvious” patterns in our function signatures[4] and so someone is bound to switch them round by accident in a different interface:-

Other[] GetOthers(int start, int end, int customerId)

If you don’t have any decent tests you’re going to be reaching for the debugger when it finally does show up that somebody’s code calls it with the arguments the wrong way round.

There is one other gripe I have about using bare arithmetic types in this way - what exactly does this mean?

int meh = customerId + managerId;

The only time I can see where the use of arithmetic might be required is when generating IDs:-

int customerId = ++nextCustomerId;

But this code should probably be encapsulated somewhere else anyway. The use of the int type to generate IDs does not imply that the resulting “key” type must also be the same primitive type, but what we’re really talking about now is immutability:-

ID customerId = new ID(++nextCustomerId);

If you’re using C++ then you could use an enum as a strongly-typed way to handle an integer based ID[5]. Yes, you need to use cast-like syntax to initialise values but you get an awful lot else in return. If you don’t need any constraints on the value this may be enough, but if you do then you’re probably looking to create a class instead; perhaps with a dash of templates/generics thrown in to help factor out some commonality.
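
As an aside, and purely as a sketch of my own rather than anything from the discussion above, a similar cast-like trick works in C# because an enum is a distinct type even when it has no named members:-

enum CustomerKey : int { }            // no members needed - any int can be cast in
enum OrderKey : int { }

var customerKey = (CustomerKey)42;    // cast-like syntax to initialise
// OrderKey orderKey = customerKey;   // does not compile - the two key types don't mix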

So, let’s continue with the surrogate key problem. I can think of two constraints on an ID that I’d like to enforce which reduces the set of allowable values from +-2 billion to just +1 to +2 billion. The first is that negative IDs are out on the premise of just being weird. The second is that C# uses 0 for an uninitialised value so I’d like to avoid that too. The former constraint might suggest the use of unsigned integers, but then we’re faced with the problem of adding a different constraint at the top end to ensure we don’t enter the 2 billion+ range because our code might support it but the database doesn’t.

I could decide to apply this constraint just to the database and that would at least protect my persisted data, assuming of course everyone remembers to use the same constraint everywhere it’s needed. However I’m a fan of code failing fast - as early as possible, in fact - and I can’t think of any moment sooner than construction with a duff value.

So, here is a first-order approximation to get the ball rolling:-

public class SurrogateKey<T>
{
  public T Value { get; protected set; }
}

The use of a base class is probably overkill but I thought it might be an interesting first step as I’ve already stated that ints and strings are two common types used for surrogate keys. Our first use case then is the ID field for customers, products, orders, etc. which I suggested a moment ago should have a range less than that of a full integer. So, let’s capture those constraints in a derived class:-

public class Identity : SurrogateKey<int>
{
  public Identity(int value)
  {
    if (value < 1)
      throw new InvalidIdentityException(...);

    Value = value;
  }
}

Now, this is where in C++ and C# the notion of strongly-typed typedefs/aliases would be really useful. We have a whole family of types that are all implemented the same way, but whose values are not implicitly interchangeable. In his first book, Imperfect C++, Matthew Wilson describes one technique for creating such strongly-typed beasts. And what prompted me to write this post was Martin Moene’s recent post on the Whole Value Idiom, which touches on the same ground but from a slightly different perspective[7].

Anyway, given that we don’t have strongly-typed typedefs we’re stuck with creating a very thin class that contains nothing more than a constructor:-

public class ProductId : Identity
{
  public ProductId(int identity)
    : base(identity)
  { }
}
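
To see what that thin class buys us, here is a hypothetical sibling type and a reworking of the earlier fetch method (the CustomerId class and the new signature are mine, purely for illustration):-

public class CustomerId : Identity
{
  public CustomerId(int identity)
    : base(identity)
  { }
}

// List<Stuff> FetchStuff(CustomerId customerId, ProductId productId);
//
// FetchStuff(productId, customerId);   // no longer compiles - the arguments are transposed
// FetchStuff(customerId, productId);   // fine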

I’ll leave it up to you to decide whether the notion of sealing classes is a good or a bad thing (opinions seemed to have changed over time). Given that we’re after implementation inheritance and not polymorphism it probably should be. You could of course swap the use of inheritance for composition but then you’re giving yourself more work to do to expose the underlying value. Given our aim is to provide the ability to create lots of similar value types easily I’m happy with the current approach from the maintenance angle.

There might be a temptation to factor out an ISurrogateKey interface from SurrogateKey. This is an urge I believe comes from programmers more au fait with the pre-template/generics era or who are overly obsessed with mocking. I suggest you resist it on the grounds that these are simple value types[6].

So, how well does this idea translate to our other, similar problems? Well, for string keys I can easily imagine that you might want to restrict the character set. The same constraint could be applied to file paths and URLs. And I’ve already written about how I’m struggling with the “null reference vs empty string” duality. A C# Date class would almost certainly be a far richer abstraction that might leverage the DateTime class through composition. Unlike the more trivial surrogate key problem, the Date class has far more complex behaviours to implement, such as formatting as a string. And that’s before you decide whether to add the arithmetic operators too, or just worry about the storage aspects.
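
For the string-based keys, a minimal sketch of what such a constrained type might look like (the class name and the character rules here are mine, not from any real schema):-

public class OrderRef : SurrogateKey<string>
{
  public OrderRef(string value)
  {
    if (String.IsNullOrEmpty(value))
      throw new InvalidIdentityException(...);

    foreach (char c in value)
    {
      if (!Char.IsLetterOrDigit(c) && (c != '-'))
        throw new InvalidIdentityException(...);
    }

    Value = value;
  }
}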

This, I feel, is where the “too much like hard work” comes in. It’s really easy to convince yourself that you can just pass around the “built-in” types and resort to “being careful” instead. After all you probably have a barrage of unit tests to ensure that you use them correctly internally. But the kinds of problems you are heading for cannot be found at the unit level; they occur at integration level and above. This means that the place where the duff data goes in - for example the UI - could be miles away from where the problem first shows up - the database if you’re still really lucky, but days later if the data is queued and there is no immediate processing of it.

I reckon most developers have no qualms about using enums for a type-safe set of limited values, so why does it feel like there is such a leap from that to a much larger, but still restricted, set of primitive values?

 

[1] There was no Boost when I started learning C++. There was no Java either for that matter. Let’s not discount an alternative either such as a simple value type and a set of free standing functions - OO is not the only fruit.

[2] One alternative for C is to create an opaque type with a struct, such as Windows uses for its various handles - HBRUSH, HPEN, HFONT, etc.

[3] There is an argument that this tuple of values (ID, start date, end date) should perhaps have a type of its own because they are clearly related in some way. The hard part would be trying to avoid the much overused suffix “Context”…

[4] In the eternal struggle to define what “cohesive” looks like, this would probably be in there, somewhere.

[5] Thanks to fellow ACCU member and C++ expert Jonathan Wakely for pointing this idiom out.

[6] I’ve mostly focused on the much simpler problem of “unitless” non-arithmetic values, but clearly I’m on very shaky ground because a date is normally measured in days and is usually anything but non-arithmetic.

[7] I’ve been working on a post titled “Interfaces Are Not The Only Abstract Fruit” but never been happy with the example I chose. I think I may have just discovered a better one.

Tuesday 20 November 2012

Using Nullable Columns to Handle Large Table Schema Changes

One of the problems with updating the schema of a database table that has a significant amount of data is that it can take ages. And whilst the data is being fixed up your database is in a half-baked state. A faster approach is to make a simpler metadata-only change up-front, backfill the missing data in slow time and then complete the metadata change afterwards to close the loop. Using nullable columns makes the process more manageable.

Say for example I have a large customer table and I want to add a new non-nullable column with a default value. If I add that column as-is the database is going to have to touch every page to set the default values on all existing rows. However, if I add a nullable column, SQL Server at least, only has to update the table’s meta-data, it touches no data pages.

ALTER TABLE Customer ADD NewColumn INT NULL;

If the public interface to your database is raw access to the database tables then you’re pretty well hosed at this point without fixing the queries that touch the tables. But if you’ve built yourself a facade using views, functions, etc. to abstract the underlying data model you can use ISNULL() to fill in the default values on-the-fly when the client queries old data:-

SELECT c.Name, 
       ISNULL(c.NewColumn, @defaultValue) 
FROM   Customer c

When writing data your facade can manually apply the not null constraint and fill in the default values when not provided by the client. This may not be quite as performant as letting the database handle it natively but, bulk inserts aside, this should be more than adequate for most scenarios.

With the facade doing its job and encapsulating the data you’re free to update the old data in slow time. SQL Server (and probably other RDBMSs too) supports using the TOP clause with the UPDATE statement to allow you to fix your data in small batches, which helps keep the transaction time down. It also means you can avoid blowing the transaction log if it’s of a fixed size.

UPDATE TOP(1000) Customer
SET   NewColumn = @defaultValue
WHERE NewColumn IS NULL

Theoretically speaking the nullability of the column is correctly handled, at least from the client’s perspective[1], so you don’t need to actually alter the underlying table again. But if you do want/need to enforce the constraint more naturally you’re going to have to bite the bullet. SQL Server supports the WITH NOCHECK option that still allows you to make a metadata only change, but that comes with its own drawbacks, and so you may just need to accept a final hit. However, at least you can split the whole task up into smaller chunks and execute them as capacity permits rather than panicking over how you’re going to squeeze the entire change into a smaller window.

 

[1] When I talk about Database Development Using TDD, this is the kind of requirement I have in mind as it’s perfect unit test fodder.

From Test Harness To Support Tool

Before I discovered the wonders of unit testing I used to follow the suggestion from Steve Maguire in his book “Writing Solid Code” about stepping through your code in the debugger to test it. In particular he was suggesting that you look at the common pitfalls, such as off-by-one errors when traversing arrays and writing loops. This had a profound effect on the way that I partitioned my code because to perform this practice efficiently I needed to get my code executing in the debugger as quickly as possible. That in turn meant that code was generally packaged into small (static) libraries and then linked into both the main product and, more importantly for this discussion, a test harness.

The test harnesses may have started out as trivial applications, but the library often has some potentially reusable[1] value outside the main product. Consequently I often ended up putting more than a cursory amount of effort into them. Yes, sometimes I even contrived a feature just to allow a particular code path to be executed and therefore tested. In hindsight it was not an overly productive use of my time compared to, say, unit testing, but it did have the benefit of creating both a system/integration-test tool, and sometimes as a by-product a support tool. From what I’ve read about some of the support tools Microsoft ships, they also started out as internal utilities that others eventually found useful and the word spread.

Take for example my own, simple DDE and ODBC libraries. These weren’t written with any particular application in mind, but I created a simple GUI based test harness for each of them - DDEQuery came from the former and PQT from the latter. Although only intended as personal test tools I have used them both on many occasions to aid in the development and support of my real work[2]. OK, these are my tools and therefore I have more than a vested interest in using and maintaining them. But the same idea has paid dividends many times over on the other systems I’ve been paid to work on.

I began to realise how useful the test tools could be outside the team when I started working on a multi-language project for the first time. The front-end was written in PowerBuilder and the back-end in C. I was part of a team tasked with making substantial performance improvements, but the first problem I could see was that I had to waste more than 4 minutes loading the GUI every time before I even hit any of my team’s code. What I felt we needed was our own cut-down tool, not necessarily UI based, that would allow us to invoke the entry points more rapidly for testing and more importantly provide an alternate means through which we could invoke the profiler.

The watershed moment was when the tool got bundled with the main GUI during testing. Although it presented a very Matrix-esque view of the data it was enough to allow them to work around some UI bugs, such as when an item was missing from the dashboard so you couldn’t even select it, let alone manipulate it. Nowadays even though I’d expect automated unit and collaboration tests I’d still want some sort of harness just to enable profiling and code checking, such as with a tool like BoundsChecker. Unit testing does not make any other sort of testing redundant, but you might get away with doing less of it.

I ended up in a similar kind of position many years later, but this time it was a VB front-end and C++ back-end with a good dose of COM thrown in the middle. This time they already had a command line tool for driving the calculation engine but still nothing to act as a front-end. Once again I was somewhat amazed that the first port of call when debugging was to invoke the VB front-end, which took an age to start up. This also implied that the COM interface was never tested in isolation either - it must have been painful working on the front/back-end interface.

I was particularly pleased with one simple little tool that made my life much easier. It allowed me to extract portions of a large data set so that I could turnaround my local testing much quicker. What I never expected, when broadcasting the existence of this tool to the rest of the team, was that I would be chastised for using company time to write a tool that helped me do my job more efficiently! I also built a simple GUI to act as a mock front-end, just like before, but this time I was far more careful about who I mentioned it to.

There were other simple test harnesses in the VCS repository from days gone by, but given the attitude of those up above it’s not surprising that they were left to rot. When I came to try them out I found none of them even built. Only one was of any vague use and so I nurtured that and made sure it became part of the standard build along with any other new tools I wrote. I then packaged it along with the others into a separate Utils package that could be deployed into an environment on demand to aid in any manual testing or support.

I eventually felt vindicated when the tool I was chastised over started to be used by both the BAs and BAU team to generate custom data sets for testing specific scenarios. In one of the essays that Fred Brooks published as part of The Mythical Man Month he suggests that:-

“It is not unreasonable for there to be half as much code in scaffolding as there is in product”

Nowadays with unit testing I would expect that figure to be even higher. Of course there is a cost to all this “scaffolding” as it needs to be maintained just like the production code. I suspect that the more naive managers believe that if they skimp on test code they will have less waste to maintain and therefore more time can be spent on building new features instead. In my experience those projects I worked on that had a plethora of tests and tools delivered new releases an order of magnitude more frequently than those without.

 

[1] I, along with many others, genuinely believed that we would be writing lots of reusable, pluggable components that would save ourselves time as we approached each new project. For a software house this may well make sense, and it certainly helped in some cases when I worked at one. But for in-house development trying to share code across projects and teams is just too hard it seems.

[2] The DDE tool rarely sees the light of day now, but PQT is still used by me virtually every day for testing and support.

Monday 19 November 2012

The File-System Is An Implementation Detail

Using a native file-system to store your data can be a double-edged sword. On the one hand the great thing about a file-system is that the concept of files and folders is well understood, so there’s a low learning curve for new developers and a potential backdoor for support issues too. The downside is that it’s a well understood concept and there is a potential backdoor for the “time conscious” developer to use instead of any carefully designed “abstraction”.

The ability to manipulate the file-system with your everyday tools such as shell commands and scripts makes the allure of the “easy win” just too tempting sometimes. This is compounded by the fact that it may not be obvious whether the choice to use the file-system in such a way was a conscious one, exactly because of its flexibility, or imposed for some other reason[1] and not intended to be exploited. Once that power has been abused it can become the rationale for doing it again and again as people then follow its lead “to fit in with the existing code”. Then, any notion of “implementation detail” becomes obscured by the sheer volume of exploitative code.

But building an “abstraction” over the file-system doesn’t have to be some gigantic API with SQL like semantics to query and update data, it could be as simple as a bunch of static methods that take individual query parameters that you then turn into a path inside the abstraction. For example, say you have the following legacy code that loads customer data from a file in a value dated folder:-

public void ProcessCustomer(string name, DateTime date)
{
  string datedFolder = date.ToString("yyyy-MM-dd"); 
  string path = Path.Combine(@"\\server\share", datedFolder); 
  path = Path.Combine(path, name); 

  Customer customer = Customer.Load(path);
  . . .

A big problem with this code is that it makes testing hard because you’re in the territory of having to mock the file-system API - notably some static methods in C#. Fortunately it’s reasonably easy to build a thin Facade[2] that can serve you well when testing your error recovery code too. But it is another barrier on the path (of least resistance) to a more wholesome codebase.
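
To give a flavour of what I mean by a thin facade, here is a sketch along the lines of the technique in [2]; the class and delegate names are my own choosing:-

public static class FileSystem
{
  // Production code calls through these; a test just swaps the delegate out.
  public static Func<string, bool> FileExists = System.IO.File.Exists;
  public static Func<string, string[]> ReadAllLines = System.IO.File.ReadAllLines;
}

// In a test:
// FileSystem.FileExists = path => false;   // simulate the file going missing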

One objection you might immediately raise is that the Customer class takes a file-system path anyway and so you’re already tied to it, i.e. you’re doing no worse. What you’re missing is that the Customer class actually takes a string, the semantics of which happens to be a path. In future it could be a RESTful URL or some other “location”. The Load() method could also be overloaded to take different types of “location”. To support that agnosticism you’re best off getting the “location” to the class loader without needing to know yourself what it is, i.e. you want to start treating the “location” as an opaque type[3].

The simplest change is to just move the file handling code into a separate class:-

public static string FormatCustomerPath(string root, string name, DateTime date)
{
  string datedFolder = date.ToString("yyyy-MM-dd");
  string path = Path.Combine(root, datedFolder);
  path = Path.Combine(path, name); 
 
  return path;
}
. . .
public void ProcessCustomer(string name, DateTime date)
{
  string path = FS.FormatCustomerPath(@"\\server\share", name, date);

  Customer customer = Customer.Load(path);
  . . .
}

This very simple change already divorces the caller from many of the grungy details of where the data is stored. For starters you could change the order of the name and the date in the path. You could even insert other “random” elements into the path such as environment name, version number, etc. Or you could format the dates in DDMMMYYYY instead of YYYY-MM-DD. The very fact that you’re passing the date as a rich DateTime type also makes the API less prone to error because you’re not juggling lots of raw string parameters[4]. Also, if your fellow developers have a habit of using format strings like “{0}\\{1}\\{2}” for paths then this simple change will improve portability too.

The root of the path (“\\server\share”) might be hard-coded but more likely is stored in a variable so the next step could be to extract that out entirely by storing it in a static member of the FS class instead. This can then be done once as part of the bootstrapping process in main(), leaving the client code virtually unaware of the existence of the file-system at all:-

public void ProcessCustomer(string name, DateTime date)
{
  string path = FS.FormatCustomerPath(name, date);

  Customer customer = Customer.Load(path);
  . . .
}
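
For completeness, the FS facade behind that call might now look something like this; the Root property and its initialisation during bootstrapping are my own guess at the detail:-

public static class FS
{
  public static string Root { get; set; }   // set once in main(), e.g. to @"\\server\share"

  public static string FormatCustomerPath(string name, DateTime date)
  {
    string datedFolder = date.ToString("yyyy-MM-dd");
    string path = Path.Combine(Root, datedFolder);

    return Path.Combine(path, name);
  }
}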

The final bare minimum change might be to add an extra level of indirection and hoist the two lines in ProcessCustomer out into a separate class, or rename and change the responsibility of the FS facade to solely be about the file-system aspects of loading a variety of objects:-

public void ProcessCustomer(string name, DateTime date)
{
  Customer customer = ObjectLoader.LoadCustomer(name, date);
  . . .
}

I think all these steps were trivial and if you’ve got a mountain of code to refactor you can’t always get everything done in one hit. Dependencies often have a habit of getting in the way too, so you can end up either having to back out changes or keep going to the bitter end. By initially sticking with a static facade you can improve things one piece at a time without making your testing position any worse. In fact you’ll probably make it slightly better.

Remember, I’m not suggesting the use of a static facade over a “proper service” that has a “proper mockable interface” and a “proper factory”. Static facades are still fairly unpleasant beasts because they spark thoughts of global variables and Singletons which are generally considered a bad smell. Yes, there is tooling available that will allow you to mock this kind of behaviour but you should probably be looking to a much higher level of abstraction anyway when you have any sort of persistence layer to deal with. Just don’t forget that sometimes you can make a big difference with a much smaller, localised set of changes.

 

[1] One system I worked on uses it because the manual mocks we built for integration testing ended up becoming the production implementations.

[2] Tim Barrass wrote about one way of doing this for C# in his post “Using static classes in tests”. The same technique could be used in C++ but substitute “static method” with “free function” and “delegate” with “function pointer”.

[3] I appreciate that I’m skating on thin ice with the suggestion that you can play games with the all-singing-all-dancing String type. Ultimately I’d expect to switch to some Abstract Data Type at some time later, but only if an even richer abstraction hasn’t come into play before that to mitigate all this.

[4] There are two very common types used for surrogate keys - int and string. It’s tempting to pass these around using their primitive types but that makes an API much harder to use correctly as it’s all too easy to switch parameters by accident and get some weird behaviour.

Tuesday 13 November 2012

When Does a Transient Failure Stop Being Transient?

Depending on the type of system you work on the definition of “transient”, when talking about errors, varies. Although I’ve never worked in the embedded arena I can imagine that it could be measured on a different scale to distributed systems. I work in the latter field where the notion of “transient” varies to some degree depending on the day of the week. There is also an element of laziness around dealing with transient failures that means you might be tempted to just punt to the support team on any unexpected failure rather than design recovery into the heart of the system.

The reason I said that the definition of transient varies depending on the day of the week is because, like many organisations, my current client performs their infrastructure maintenance during the weekend. There are also other ad-hoc long outages that occur then such as DR testing. So, whereas during the week there might be the occasional blip that lasts for seconds, maybe minutes tops, at the weekend the outage could last longer than a day. Is that still transient at that point? I’m inclined to suggest it is, at least from our perspective, because the failure will correct itself without direct intervention from our team. To me permanent failures occur when the system cannot recover automatically.

For many systems the weekend glitches probably don’t matter as there are no users around anyway, but the systems I work on generally chug through other work at the weekend that would not be possible to squeeze in every day due to lack of resources. This means that the following kinds of failures are all expected during the weekend and the architecture just tries to allow progress to be made whenever possible:-

  • The database cluster is taken offline or runs performance sapping maintenance tasks
  • The network shares appear and disappear like Cheshire Cats
  • Application servers are patched and bounced in random orders

Ensuring that these kinds of failures only remain transient is not rocket science, by-and-large you just need to remember not to hold onto resources longer than necessary. So for example don’t create a database connection and then cache it, as you’ll have to deal with the need to reconnect all over the place. The Pooling pattern from POSA 3 is your friend here as it allows you to delegate the creation and caching of sensitive resources to another object. At a basic level you will then be able to treat each request failure independently without it affecting subsequent requests. There is a corollary to this - Eager Initialisation at start-up - which you might use to detect configuration issues.
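
By way of example, and assuming SQL Server with ADO.NET rather than anything specific to the systems above, the simplest way to stay out of trouble is to let the provider's connection pool do the caching and just open and close around each request:-

using (var connection = new SqlConnection(connectionString))
{
  connection.Open();   // fails here, and only here, if the cluster is down

  // ... execute this one request's queries ...
}
// closed and returned to the pool, so the next request gets a fresh chance to connect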

The next level up is to enable some form of retry mechanism. If you’re only expecting a database cluster failover or minor network glitch you can ride over it by waiting a short period and then retrying. What you need to be careful of is that you don’t busy wait or retry for too long (i.e. indefinitely), otherwise the failure cascades into the system performing no work at all. If a critical resource goes down permanently then there is little you can do, but if not all requests rely on the same resources then it’s possible for progress to be made. In some cases you might be able to handle the retry locally, such as by using Execute Around Method.
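
A sketch of the kind of localised retry I have in mind; the numbers and names are illustrative rather than values from any real system:-

public static T ExecuteWithRetries<T>(Func<T> operation, int maxAttempts, TimeSpan delay)
{
  for (int attempt = 1; ; ++attempt)
  {
    try
    {
      return operation();
    }
    catch (Exception)
    {
      if (attempt == maxAttempts)
        throw;                 // give up and let the caller take the wider view

      Thread.Sleep(delay);     // back off briefly rather than busy wait
    }
  }
}

// var rows = ExecuteWithRetries(() => RunQuery(sql), 3, TimeSpan.FromSeconds(5));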

Handling transient failures locally reduces the burden on the caller, but at the expense of increased complexity within the service implementation. You might also need to pass configuration parameters down the chain to control the level of back-off and retry which makes the invocation messy. Hopefully though you’ll be able to delegate all that to main() when you bootstrap your service implementations. The alternative is to let the error propagate right up to the top so that the outermost code gets to take a view and act. The caller always has the ability to keep an eye on the bigger picture, i.e. number and rate of overall failures, whereas local code can only track its own failures. As always Raymond Chen has some sage advice on the matter of localised retries.

Eventually you will need to give up and move on in the hope that someone else will get some work done. At this point we’re talking about rescheduling the work for some time later. If you’re already using a queue to manage your workload then it might be as simple as pushing it to the back again and giving someone else a go. The blockage will clear in due course and progress will be made again. Alternatively you might suspend the work entirely and then resubmit suspended jobs every so often. Just make sure that you track the number of retries to ensure that you don’t have jobs in the system bouncing around that have long outlived their usefulness. In his chapter of Beautiful Architecture Michael Nygard talks about “fast retries” and “slow retries”. I’ve just categorised the same idea as “retries” and “reschedules” because the latter involves deactivating the job which feels like a more significant change in the job’s lifecycle to me.

Testing this kind of non-functional requirement[1] at the system level is difficult. At the unit test level you can generally simulate certain conditions, but even then throwing the exact type of exception is tricky because it’s usually an implementation detail of some 3rd party library or framework. At the system level you might not be able to pull the plug on an app server because it’s hosted and managed independently in some far off data centre. Shutting an app server down gracefully allows clean-up code to run and so you need to resort to TerminateProcess() or the moral equivalent to ensure a process goes without being given the chance to react. I’m sure everyone has heard of The Chaos Monkey by now but that’s the kind of idea I still aspire to.

I suggested earlier that a lazier approach is to just punt to a support team the moment things start going south. But is that a cost-effective option? For starters you’ve got to pay for the support staff to be on call. Then you’ve got to build the tools the support staff will need to fix problems, which have to be designed, written, tested, documented, etc. Wouldn’t it make more financial sense to put all that effort into building a more reliable system in the first place? After all the bedrock of a maintainable system is a reliable one - without it you’ll spend your time fire-fighting instead.

OK, so the system I’m currently working on is far from perfect and has its share of fragile parts but when an unexpected failure does occur we try hard to get agreement on how we can handle it automatically in future so that the support team remains free to get on and deal with the kinds of issues that humans do best.

 

[1] I’m loath to use the term “non-functional” because robustness and scalability imply a functioning system and being able to function must therefore be a functional requirement. Tom Gilb doesn’t settle for a wishy-washy requirement like “robust” - he wants it quantified - and why not? It may be the only way the business gets to truly understand how much effort is required to produce reliable software.

Monday 12 November 2012

The Cost of Defensive Programming

They say that the first rule of optimisation is to measure. Actually they say “don’t do it” and the third rule is to measure, but let’s invoke artistic license and just get right on and measure. Another rule when it comes to optimising code is that intuition usually sucks and the bottleneck won’t be where you think it is. Even for experienced developers this still holds true.

My current system has a component that is not unit testable, in fact the only way to test it is to run the entire process and either debug it or compare its outputs. As a consequence the feedback loop between changes is insanely long and if you throw some infrastructure performance problems into the mix too you can probably guess why I decided to let the profiler have a look at it during one lunch time. From the opening paragraph you can probably guess that what I found was not what I expected…

The code I stumbled upon probably deserves to be submitted to The Daily WTF because it almost manages to achieve functional correctness in a wonderfully obtuse way. It looked something like this:-

var bugs = new Dictionary<int, string>();

string sql = "SELECT * FROM Bug ORDER BY version";

// execute query...

while (reader.Read())
{
  int id = (int)reader["BugId"];
  string title = (string)reader["Title"];

  if (!bugs.ContainsKey(id))
    bugs.Add(id, title);
  else
    bugs[id] = title;
}

Hopefully you should be able to deduce from my attempt to disguise and distil the guilty code that it executes a query to load some data and then it builds a map of ID to some attribute value. The reason I say that it only “almost” works is because what you can’t see from it is that there is something fundamentally missing which is a parameter that defines a context (specifically a point in time) for which the data should be loaded.

However, before I get to the real subject of this post let’s just get the other obvious criticisms out of the way. Yes, the component has SQL statements embedded in it. And yes, the SELECT statement uses ‘*’ when it only requires two of the columns in the table. OK, let’s move on…

So, what prompted me to write this post is actually the use of the “upsert[1] style anti-pattern” in the loop at the end:-

if (!bugs.ContainsKey(id))
  bugs.Add(id, title);
else
  bugs[id] = title;

This code is nasty. It says to me that the programmer is not sure what invariants they are expecting to handle or maintain within this method/class. What probably happened (and this is more evident if you know the data model) is that the Add() started throwing exceptions due to duplicates so it was silenced by turning it into an “Add or Update”[2]. Of course the knock-on effect is that the same value can now be silently updated many times and the observable outcome is the same. The addition of the ORDER BY clause in the SQL is what elevates it from non-deterministically broken to nearly always correct because it is never invoked in production in a manner where the tail-end rows are not the ones that matter.

Putting aside the embedded SQL for the moment, the correct implementation only needs to select the data for the context in question and there should only be one row per item so just doing the Add() is sufficient. The invariant in the data model is that for any point in time there is only one version of an item “active”. The only reason the Add() could throw is if the initial query is wrong, the data is broken or the invariant has changed and the code is now broken. If the latter happens you’ve got real problems and the least of them is that some code has failed very loudly and clearly.
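
For contrast, a sketch of what the corrected version boils down to; the exact shape of the context filter is my guess as the real schema isn't shown here:-

string sql = "SELECT BugId, Title FROM Bug WHERE ...";   // filtered to the point in time in question

// execute query...

while (reader.Read())
{
  bugs.Add((int)reader["BugId"], (string)reader["Title"]);   // throws loudly if the invariant is ever broken
}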

Now to unwind the stack and return to the original performance problem. This code turns what should have been a query that returns 40,000 rows into one that actually sucks in 1,200,000 rows instead. A quick back of the envelope calculation shows that is over an order of magnitude more rows than required. Luckily the table is fairly short on columns and so in terms of data retrieved it’s not insanely huge, which is probably why it’s crept up largely unnoticed.

As an aside I mentioned that it’s never run retrospectively in production and that is another reason why it has remained hidden. It has however caused me to scratch my head on a few occasions during testing as I’ve observed an unanticipated change in the output but not had the time to pin down the exact cause. A bug that only appears occasionally during testing is never going to make it up the TODO list when there are far more pressing features to implement, and the costly way the component needs to be tested pretty much ensures that it will never get fixed in isolation - a double whammy.

 

[1] For those lucky enough to avoid databases an “upsert” is short for update-or-insert. You attempt an UPDATE and if no rows were modified you perform an INSERT instead. SQL Server has proper support for this idiom these days in the shape of the MERGE keyword.

[2] This “pattern” is sprinkled throughout the component’s code so it’s not an isolated case.

Friday 9 November 2012

Include the Units in Configuration Data

[I have no right to chastise anyone else about this because I do it myself - it just always seems so “unnecessary” at the time… Hence this post is an attempt to keep me honest by giving my colleagues the opportunity to shove it back in my face every time I do it in the future.]

Say I asked you to change a “timeout” setting to “3 minutes” what value would you instinctively use?

  1. 0.05
  2. 3
  3. 180
  4. 180000

Although I’m tempted to say that milliseconds are still the de-facto resolution in most low-level APIs I deal with these days, I have seen nanoseconds creeping in, even if the actual resolution of the implementation is coarser than that. However that’s what the internal API deals with. These kinds of numbers are often way too big and error prone for humans to have to manipulate; if you miss a single zero off that last example it could make a rather important difference.

If you don’t need anywhere near millisecond precision, such as dealing with a run-of-the-mill database query, you can reduce the precision to seconds and still have ample wiggle room. Dealing with more sane numbers, such as “3”, feels so much easier than “0.05” (hours) and is probably less likely to result in a configuration error.

There is of course a downside to not using a uniform set of units for all your configuration data - it now becomes harder to know what the value “3” actually represents when looking at an arbitrary setting. You could look in the documentation if you (a) have some and (b) it’s up to date. If you’re privileged enough to be a developer you can look in the source code, but even then it may not have a comment or be obvious.

One answer is to include the units in the name of the setting, much like you probably already do when naming the same variable in your code. Oh, wait, you don’t do it there either? No, it always seems so obvious there, doesn’t it? And this I suspect is where I go wrong - taking the variable and turning it into an externally configurable item. The variable name gets copied verbatim, but of course the rest of the source code context never travels with it and consequently disappears.

So, the first solution might be to name the setting more carefully and include any units in it. Going back to my original example I would call it “TimeoutInMinutes”. Of course, one always feels the urge to abbreviate and so “TimeoutInMins” may be acceptable too, but that’s a whole different debate. Applying the same idea to another common source of settings - file sizes - you might have “MaxLogFileSizeInKB”. The abbreviation of Kilobytes to KB though is entering even murkier waters because KB and Kb are different units that only differ by case. If you’re dealing with case-insensitive key names you better watch out[1].

Another alternative is to look at adding units to the value instead. So, in my timeout example the value might be “3mins”, or you could say “180 secs” if you preferred. If you’re used to doing web stuff (i.e. I’m thinking CSS) then this way of working might seem more natural. The expense is that it adds a burden to the consuming code, which must now invoke a more complicated parser than just strtoul() or int.Parse()[2]. It’s not hard to add this bit of code to your framework but it’s still extra work that needs thinking about, writing and testing[3].
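
To show it really isn't that much code, here is a sketch of the sort of parser I mean; the set of suffixes is mine and deliberately small:-

public static TimeSpan ParseTimeout(string setting)
{
  var match = Regex.Match(setting.Trim(), @"^(\d+)\s*(ms|secs?|mins?)$");

  if (!match.Success)
    throw new FormatException("Invalid timeout: '" + setting + "'");

  int value = int.Parse(match.Groups[1].Value);

  switch (match.Groups[2].Value)
  {
    case "ms":   return TimeSpan.FromMilliseconds(value);
    case "sec":
    case "secs": return TimeSpan.FromSeconds(value);
    default:     return TimeSpan.FromMinutes(value);
  }
}

// ParseTimeout("3mins") == ParseTimeout("180 secs")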

For timeouts in particular .Net has the TimeSpan type which has a pretty flexible string representation that takes hours, minutes, seconds, etc into account. For the TimeSpan type our example would be “00:03:00”. But would you actually call it “TimeoutInTimeSpan”? It sounds pretty weird though. Better sounding would be “TimeoutAsTimeSpan”. I guess you could drop the “InTimeSpan” suffix if you were to use it consistently for timeouts, but then we’re back to where we started because I’m not aware of any similar scheme for representing bytes, kilobytes, megabytes, etc.

 

[1] This is one of those situations where case-sensitivity is often not formally defined either. Explicit use of the == operator or implicit use via a Dictionary means no one cares until someone else sticks a different-cased duplicate in by accident and things don’t quite work as expected.

[2] Personally I prefer to wrap this kind of parsing because a raw “invalid format” style exception often tells you pretty much nothing about which value it was that choked. Extracting related settings into a richly-typed “Settings” class up-front means you don’t end up with the parsing exception being thrown later at some horribly inconvenient time. And it reduces the need to mock the configuration mechanism because it’s a simple value-like type.

[3] Not necessarily in that order. It’s perfect TDD/unit test fodder though, so what’s not to like…

Wednesday 7 November 2012

Sensible Defaults

If you were writing a product, say, a service, that allows remote clients to connect and submit requests, what would you choose as the default value for the service hostname? Given that your service will no doubt be fault-tolerant it will probably allow the remote end to disappear and reappear in the background. Of course you might decide to allow a client to configure itself so that after ‘N’ failures to connect it will return an error so that some other (manual) action can take place. What would you pick as the default value for this setting too?

Put your hand up if you said “localhost” and “infinite” as the answer to those two questions. Are you really sure they are sensible values to use by default?

Unsurprisingly I had to work with a service that had exactly those as its default settings. To make matters worse there were other dubious defaults too, such as not having any sort of logging[1]. Besides programmatic configuration[2] there was also a config file based mechanism that used the Current Working Directory (CWD) by default. Next question. What is the CWD for a Windows service? No, it’s not the application folder, it’s the system32 folder. By now you can probably tell where this is heading…

Basically we installed our NT Service, fired it up and absolutely nothing happened. Not only that but the service API returned no errors. Naturally we checked and double-checked the installation and obvious output locations but could find no reported problems. In the end I attached the Visual Studio remote debugger and watched in awe at the exceptions bouncing around inside the service API as it tried repeatedly in vain to attach to a service running on “localhost”. No “3 strikes and you’re out” either, it just kept trying over-and-over again.

When you’re developing features hosted within a service you’ll rarely run it as an actual service, you’ll probably tend to run it as a normal console application for speed. The natural side-effect of this is that the CWD will likely be set to the same folder as where the binary resides, unless you start it with a relative path. The large monolithic service I was working on was always developed in that way as all the installation stuff had preceded me by many years. Yes, we caught it the first time we deployed it formally to a DEV system-test environment, but by then so much time had passed that the start-up shenanigans were far behind us[3][4].
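
In hindsight the CWD trap is easy enough to sidestep; this isn't what the product did, just an illustration of anchoring the config file to the binary's own folder instead of the current directory:-

// Resolve the config file against where the binary lives, not wherever
// the Service Control Manager happened to start us (the file name is made up).
string folder = AppDomain.CurrentDomain.BaseDirectory;
string configPath = Path.Combine(folder, "service.config");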

My personal choice of defaults would be “nothing” for the hostname and 0 for the retries. How many systems do you know where the client and middleware run on the same machine? The out-of-the-box configuration should assume a synchronous connect call by default because that is what most other services that developers are used to dealing with do. And by most I mean databases and middle tier services. Yes, asynchrony is gaining ground, even in UIs, but as a developer you already have a hard enough time dealing with learning a new product and API without having to fight it too - I can tell you it gains you no friends in the development world. Once you’re comfortable with the API you could look at its more advanced features.

I’m tempted to suggest that the decision on what to use by default was made by a developer of the product, not a consumer of it. No doubt this configuration made their testing easier exactly because they run the entire product on their development machines. That’s why dogfooding and TDD are such important ideas - they force you into your customers’ shoes. I’m a big advocate of flexible configuration mechanisms and so don’t see why you can’t also adhere to The Principle of Least Surprise too.

 

[1] No console or file based logging is possibly a sensible default, but only when your API has other ways of telling you about failures. At the very least I would opt for using OutputDebugString() as a last resort so that I could fire up the wonderful DbgView tool from the Sysinternals suite. Even the trusty old Windows event log is better than nothing, just so long as you don’t spam it.

[2] After this incident we switched to using the programmatic interface and set more useful default values inside the facade we built.

[3] The first rule of debugging is to look to the most recent changes for the source of a bug. It’s not always obvious but it’s usually there, somewhere. The bug may not be directly in the change itself but be a side-effect of it; either way the recent changes should be the clue.

[4] This was supposed to be “a simple port”. What should have taken a couple of weeks turned into a couple of months and the project was eventually abandoned. Ultimately an impedance mismatch at the architecture level between our system and the service was to blame.

Friday 2 November 2012

Service Providers Are Interested In Your Timeouts Too

There are established figures for the attention span of humans and so when it comes to serving up web pages you can pretty well put a figure on how long you think a person will wait for a response. If you’re in a horizontal market serving up stuff to buy then that’s going to be pretty short - seconds I would have thought. On the other hand, when it comes to submitting your tax return I should imagine these users are a little more forgiving because ultimately it’s their wallet on the line. What this means is that, if in the act of handling a request you have to contact another service, you might be able to get away with a simple timeout to protect yourself.

When it comes to services that provide a much more variable set of data it’s preferable to be able to tell them more about what your expectations are, especially if they have no way of detecting and handling a client-side disconnect (it might be part of the protocol). The other obvious alternative is to support some form of cancellation mechanism. But that can be tricky to implement on the client as it’ll likely involve threads or re-entrancy or asynchronous style event handling. Many UIs still have the classic single-threaded/synchronous mindset and so cancellable I/O may well be considered a “more advanced feature”.

To give an example, say you want all the orders for a particular customer and you have a large variation in result set sizes, it may be that the majority are small and return very quickly, but you have a few biggies that take considerably longer. Once the service starts to be put under load the larger requests may start to take much longer, so much so that the client times-out. If the user (real or another system) is impatient, or has some form of naive retry logic, it may just try again. The problem is that internally the system might well still be servicing the original request, whilst the retries come in. If the request servicing is poorly written and has no timeout of its own it will just end up competing with dead requests somewhat akin to a feedback loop.

In Vol 2 of Pattern Languages of Program Design (PLOPD) Gerard Meszaros published a few patterns for dealing with reactive systems. One of these is “Fresh Work Before Stale”. The premise is that if you have a long queue and you operate it as a FIFO then everybody waits and so everybody’s experience sucks. Instead, if you operate it something more like a LIFO then, when the system is busy, at least some clients get a decent level of service at the expense of really poor service for others. I’m sure PhDs have been written on the subject and I’m probably trivialising it here, but I like the simplicity of this particular idea.

Another alternative is to give the service provider more information about your expectations so that they have the opportunity to expend no more effort on your behalf than is absolutely necessary. In essence it’s a variation of the Fail Fast pattern because if you know at the point of starting the task that you either can’t complete it in time, or even that you’ve already exceeded the Time To Live (TTL), then there is little point in continuing. One additional difference in the latter case is that the client will have already given up. Either way you have been given the opportunity to protect yourself from performing unnecessary work, which will improve matters when under load.
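
A sketch of the idea, with the request type and property names invented purely for illustration:-

public class CalculationRequest
{
  public DateTime Deadline { get; set; }   // absolute time, so it survives sitting in a queue
}

public void Process(CalculationRequest request)
{
  if (DateTime.UtcNow >= request.Deadline)
    return;                                // the client has already given up - fail fast, do no work

  // ... do the work, passing the deadline on to any downstream service calls ...
}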

The system at my current client provides a calculation framework that can be used by both real people and scheduled jobs to perform a variety of workloads that include user-submitted and bulk batch processing style requests. The length of a calculation is unknown up front but generally speaking the users are after a quick turnaround whilst some of the batch jobs take many hours. Although the batch processing work is generally predictable, the user submitted requests are largely out of our control and so they could submit a job that takes hours by accident. In fact they may not realise it and keep submitting it, much like the impatient web surfer who keeps hitting “refresh” in their browser in the hope of a quicker response. For really long running batch jobs it’s possible they could miss the reporting deadline, but they’re still not terminated because a late answer still has value for other reasons.

Balancing this kind of workload is hard because you need to ensure that system resources are not being wasted due to a hung job (e.g. the code has gone into an infinite loop) but at the same time you need to cater for a massive variation in processing time. If the work is queued up you also probably want to distinguish between “time spent processing” and “wall clock time”. A user will only wait for, say, 5 minutes from the time of submission, so if it doesn’t even make it out of the queue by then it should be binned. In contrast the batch jobs are more likely to be controlled by the amount of time spent processing the request. If you submit thousands of jobs up front you expect some waiting to occur before they get actioned. If there is some kind of business deadline you’ll need to prioritise the work and maybe use a fixed-point-in-time as the timeout, say, 9 am.

Naturally these various techniques can be composed so that if you have a hard deadline of 7:00 am, and you pick up a job at 6:58, then the maximum timeout for any service call is the minimum of 2 minutes and the normal service call timeout. It doesn’t matter that you might normally allow 5 minutes - if you can’t get an answer within 2 you’re stuffed.
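
In code that composition is just a min() over the two limits (the variable names are mine):-

TimeSpan remaining = deadline - DateTime.UtcNow;   // e.g. 7:00 am minus 6:58 am
TimeSpan timeout = (remaining < normalTimeout) ? remaining : normalTimeout;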

The flip-side to all this is, of course, increased complexity. Re-runability is also affected when you have hard deadlines because you have to adjust the deadline first, which makes support that little bit more difficult. For UIs you have a whole other world of pain just trying to manage the users’ expectations and impatience. But they were never the target here - this is very much about helping the links in the back-end chain to help you by being more aware of what you’re asking them to do.