Thursday 29 November 2012

Primitive Domain Types - Too Much Like Hard Work?

How many books on programming start out by showing you how to define simple domain types and then use them throughout their examples? I’d wager very few; if not none (possibly a few specialist “computer science” books might). And I’m not talking about composite domain types either like an address, I’m talking about the (seemingly) primitive ones like time, date, product ID, URL, etc. In example code it’s so much easier to just write:-

string url = “http://example.com/thingy”;
. . .
int secs = 10;
int millis = secs * 1000;

If you were brought up on a language like BASIC or C you could be forgiven for sticking to the built-in types by default. After all that’s all you had available. In the performance conscious days gone by there was also a natural fear of extra complexity (irrespective of whether it could be measured and proven to show no degradation in a release build) that caused you to sacrifice type safety; just in case. You’d rather stick to the good old faithful “int” type rather than declare a simple C++ class called “date”[1] to encapsulate all those quirks you just know will appear the moment you delve into the murky waters of calendars. After all your current use case is really simple, right?

But what about declaring a typedef, doesn’t that count?

typedef date_t int;

In C that might be the best you can do without full blown classes[2]. At least any interfaces can now express their intent more clearly because both the type and parameter names are now working for you:-

datespan_t calculateInterval(date_t start, date_t end);

Now you don’t have to use “Reverse Hungarian Notation” and append the type name as a suffix to the variable name (which doesn’t work for the return type anyway) if you don’t want to:-

int calculateInterval(int startDate, int endDate);

In C++, C# and pretty much any other OO language you can do better - you have “the power” to create new types! I suspect though many developers don’t instantly reach for the “class” tool when their database is returning them a surrogate key typed as an int. In fact I’d wager again that the int will remain as the type used throughout the OO proficient codebase too. After all, if it’s an int in the database, why shouldn’t it be an int in the C# code too? It’s better to be consistent, right?

Did you stop to consider whether the reason an int was used in the database is because it’s only the closest possible representation the RDBMS provides for that concept? There might also be a bunch of constraints on the columns where it’s used; have you replicated those in your C++/C# code too? If the RDBMS supports User Defined Types (UDTs) you can at least do as well as the C scenario and use a consistent alias across-the-board. Before SQL Server 2008 added a native DATE type this was one possible approach to providing some expression of that concept in your schema:-

CREATE TYPE date AS smalldatetime;

Coming back to the excessive use of plain ints as the default type for surrogate keys, dates, etc. reminds me what some of our interfaces would look like if we didn’t create some simple domain types:-

List<Stuff> FetchStuff(int customerId, int start, int end)

You might argue that the names of the parameters could be better and that IntelliSense will guide you to the correct invocation. If you stick to a common pattern (ID, start date, end date) for methods taking similar sets of arguments[3] you lessen the chances of getting it wrong. But we don’t all notice such “obvious” patterns in our function signatures[4] and so someone is bound to switch them round by accident in a different interface:-

Other[] GetOthers(int start, int end, int customerId)

If you don’t have any decent tests you’re going to be reaching for the debugger when it finally does show up that somebody’s code calls it with the arguments the wrong way round.

There is one other gripe I have about using bare arithmetic types in this way - what exactly does this mean?

int meh = customerId + managerId;

The only time I can see where the use of arithmetic might be required is when generating IDs:-

int customerId = ++nextCustomerId;

But this code should probably be encapsulated somewhere else anyway. The use of the int type to generate IDs does not imply that the resulting “key” type must also be the same primitive type, but what we’re really talking about now is immutability:-

ID customerId = new ID(++nextCustomerId);

If you’re using C++ then you could use an enum as a strongly-typed way to handle an integer based ID[5]. Yes, you need to use cast-like syntax to initialise values but you get an awful lot else in return. If you don’t need any constraints on the value this may be enough, but if you do then you’re probably looking to create a class instead; perhaps with a dash of templates/generics thrown in to help factor out some commonality.

So, let’s continue with the surrogate key problem. I can think of two constraints on an ID that I’d like to enforce which reduces the set of allowable values from +-2 billion to just +1 to +2 billion. The first is that negative IDs are out on the premise of just being weird. The second is that C# uses 0 for an uninitialised value so I’d like to avoid that too. The former constraint might suggest the use of unsigned integers, but then we’re faced with the problem of adding a different constraint at the top end to ensure we don’t enter the 2 billion+ range because our code might support it but the database doesn’t.

I could decide to apply this constraint just to the database and that would at least protect my persisted data, assuming of course everyone remembers to use the same constraint everywhere it’s needed. However I’m a fan of code failing fast, in fact as soon as possible and I can’t think of any moment sooner than construction with a duff value.

So, here is a first-order approximation to get the ball rolling:-

public class SurrogateKey<T>
{
  public T Value { get; private set; }
}

The use of a base class is probably overkill but I thought it might be an interesting first step as I’ve already stated that ints and strings are two common types used for surrogate keys. Our first use case then is the ID field for customers, products, orders, etc. which I suggested a moment ago should have a range less than that of a full integer. So, let’s capture those constraints in a derived class:-

public class Identity : SurrogateKey<int>
{
  public Identity(int value)
  {
    if (value < 1)
      throw new InvalidIdentityException(...);

    Value = value;
  }
}

Now, this is where in C++ and C# the notion of strongly-typed typedefs/aliases would be really useful. We have a whole family of types that are all implemented the same way, but whose values are not implicitly interchangeable. In his first book, Imperfect C++, Matthew Wilson describes one technique for creating such strongly-typed beasts. And what prompted me to write this post was Martin Moene’s recent post on the Whole Value Idiom which touches on the the same ground but from a slightly different perspective[7].

Anyway, given that we don’t have strongly-typed typedefs we’re stuck with creating a very thin class that contains noting more than a constructor:-

public class ProductId : Identity
{
  public ProductId(int identity)
    : base(identity)
  { }
}

I’ll leave it up to you to decide whether the notion of sealing classes is a good or a bad thing (opinions seemed to have changed over time). Given that we’re after implementation inheritance and not polymorphism it probably should be. You could of course swap the use of inheritance for composition but then you’re giving yourself more work to do to expose the underlying value. Given our aim is to provide the ability to create lots of similar value types easily I’m happy with the current approach from the maintenance angle.

There might be a temptation to factor out an ISurrogateKey interface from SurrogateKey. This is an urge I believe comes from programmers more au fait with the pre-template/generics era or who are overly obsessed with mocking. I suggest you resist it on the grounds that these are simple value types[8].

So, how well does this idea translate to our other, similar problems? Well, for string keys I can easily imagine that you might want to restrict the character set. The same constraint could be applied to file paths and URLs. And I’ve already written about how I’m struggling with the “null reference vs empty string” duality. A C# Date class would almost certainly be a far richer abstraction that might leverage the DateTime class through composition. Unlike the more trivial surrogate key problem, the Date class has far more complex behaviours to implement, such as formatting as a string. And that’s before you decide whether to add the arithmetic operators too, or just worry about the storage aspects.

This, I feel, is where the “too much like hard work” comes in. It’s really easy to convince yourself that you can just pass around the “built-in” types and resort to “being careful” instead. After all you probably have a barrage of unit tests to ensure that you use them correctly internally. But the kinds of problems you are heading for cannot be found at the unit level; they occur at integration level and above. This means that the place where the duff data goes in - for example the UI - could be miles away from where the problem first shows up - the database if you’re still really lucky, but days later if the data is queued and there is no immediate processing of it.

I reckon most developers have no qualms about using enums for a type-safe set of limited values, so why does it feel like there is such a leap from that to a much larger, but still restricted, set of primitive values?

 

[1] There was no Boost when I started learning C++. There was no Java either for that matter. Let’s not discount an alternative either such as a simple value type and a set of free standing functions - OO is not the only fruit.

[2] One alternative for C is to create an opaque type with a struct, such as Windows uses for its various handles - HBRUSH, HPEN, HFONT, etc.

[3] There is an argument that this tuple of values (ID, start date, end date) should perhaps have a type of it’s own because they are clearly related in some way. The hard part would be trying to avoid the much overused suffix “Context”…

[4] In the eternal struggle to define what “cohesive” looks like, this would probably be in there, somewhere.

[5] Thanks to fellow ACCU member and C++ expert Jonathan Wakely for pointing this idiom out.

[6] I’ve mostly focused on the much simpler problem of “unitless” non-arithmetic values, but clearly I’m on very shaky ground because a date is normally measured in days and is usually anything but non-arithmetic.

[7] I’ve been working on a post titled “Interfaces Are Not The Only Abstract Fruit” but never been happy with the example I chose. I think I may have just discovered a better one.

1 comment:

  1. Thanks for posting that link to Wilson's true typedefs. Very useful and what I've been wanting for ages!

    ReplyDelete