Wednesday, 29 June 2011

Unit Tests Obscure Dead Code

I’ve been having a bit of a clean-up in our codebase as I know of various large bits of functionality that for one reason or another never made it into production and now they are just becoming a distraction. Even though we are using the thoroughly useful ReSharper, which can point out some obvious dead code, it doesn’t perform enough analysis to see past code that is only referenced by unit tests.

Oh the irony… I spend all that time banging on about writing unit tests and now I’m cursing their existence because they stop me from seeing dead code! But my team is using NUnit, not some obscure framework, and ReSharper has native support for NUnit, so it’s not beyond the realms of possibility that JetBrains could “join the dots” and provide a facility to highlight dead code that is only reachable from unit tests. If you’re writing a library I presume you’d want this feature off by default :-) but if you’re writing an application you’d probably like it on. Of course, if you have a really large codebase and you’ve split it so that you only build what you need, then I guess the analysis becomes less accurate, and false positives are not exactly a confidence builder. I’d still prefer to have the choice of enabling a “be aggressive” flag when I’m in cleanup mode.

Clearly there are other analysis tools, like NCover, that could help determine dead code, but I’m not talking about squeezing every last line of redundant code out (which would probably involve actually running it and then deciding whether you covered enough scenarios), but just pointing out the really obvious stuff like unused classes, methods, interface implementations etc.

Anyway, until such time as JetBrains (or a suitable alternative) comes to my rescue entirely, I’m using the following semi-automatic, ReSharper-assisted approach:-

  1. Load the solution and remove all unit test projects from it
  2. Use ReSharper to show up dead code
  3. Delete dead code
  4. Goto 2 to find what new dead code ReSharper has unearthed[+]
  5. Reload the solution with unit test projects
  6. Delete redundant unit tests
  7. Fix any false positives[*]

Of course having said all that someone will go and point out that I should RTFM or show me how FxCop already does this… Please do!

 

[+] ReSharper uses the term “heuristically determined unused method”, and yet it still doesn’t point out a lot of dead code - you often have to go and delete the unused callers before it will mark something as dead. And these aren’t methods with deep call graphs, maybe only one or two levels deep. Perhaps it’s more pessimistic when interfaces are involved?

[*] Yes, we have other dependent code (such as various PowerShell scripts), which means you can’t just carelessly swing the axe about, but it’s pretty obvious what would be called externally.

Tuesday, 28 June 2011

User Defined Types in SQL Server

In my recent post The Public Interface of a Database I talked about using User Defined Types and how I believe they help make the interface more maintainable. Now that we’re in production, with real data stored in our tables, changing them is going to be more difficult. The other day I needed to do just that and I found it’s not quite as simple as I had hoped…

Domain Types (Theory)

Although I’ve never studied database theory in any formal way I can see how User Defined Types are probably meant to be used - to let you express the commonality between data of the same nature, and to attach constraints that capture restrictions the underlying storage type cannot express by itself.

For example, if you were storing a percentage (as an integer) you could use any of the integer types - tinyint, smallint, int, etc. - however it makes sense to use the smallest amount of storage required, i.e. tinyint (1 byte). But this type has a range of 0-255, which is wider than a percentage (0-100), so you need to bind a rule to say the value is actually limited to just (0 <= value <= 100).
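By way of illustration, here’s a minimal sketch of how that might look under SQL Server - the type and rule names are mine, and, as the footnote below mentions, rules themselves are now deprecated:-

create type dbo.Percentage_t from tinyint not null;
go

create rule Percentage_Range as @value >= 0 and @value <= 100;
go

-- sp_bindrule treats a dotted name as table.column, hence no schema prefix
exec sp_bindrule 'Percentage_Range', 'Percentage_t';
go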

Domain Types (Practice)

The problem is that the rule (at least under SQL Server) only applies to a column in a table; it is effectively a Check Constraint[+]. This means that declaring UDTs in your stored procedure parameter lists and using them within your SQL code only has a partial effect - you get the benefit of abstracting the underlying storage type, but not the additional validation rules.
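You can see this for yourself with the hypothetical percentage type from the sketch above:-

create table dbo.Scores
(
    score dbo.Percentage_t
);
go

insert into dbo.Scores values (150); -- fails: the bound rule kicks in

declare @score dbo.Percentage_t;
set @score = 150; -- succeeds: only the tinyint range (0-255) applies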

To make matters worse, when you try to change the UDT and go searching for the ALTER TYPE statement you’ll find it doesn’t exist[*]. The only way you can change a UDT is to drop it and then create it again. But to drop it you need to have nothing referencing it - no tables, functions or procedures.
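At least you can discover what is referencing a type before you start swinging the axe; something along these lines should do it (a sketch against the catalog views, with the type name purely illustrative):-

select o.name, o.type_desc
from sys.columns c
join sys.objects o on o.object_id = c.object_id
join sys.types t on t.user_type_id = c.user_type_id
where t.name = 'CustomerName_t'
union all
select o.name, o.type_desc
from sys.parameters p
join sys.objects o on o.object_id = p.object_id
join sys.types t on t.user_type_id = p.user_type_id
where t.name = 'CustomerName_t';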

Putting aside the difficulties in changing them for the moment let’s see what benefits they can bestow…

Type Aliases

Even though they do not enforce their constraints when used as variables or arguments, they still have a marked effect on the maintainability of your code. Using the kind of example I cited in my previous post, which do you think is more expressive?

declare @name varchar(100)
vs
declare @name dbo.CustomerName_t

With good variable naming it is arguable how much the type adds in this example. If I changed the variable name to “@customerName” then they would both be just as explicit; but at least in the latter case you have the choice to let the type name enhance the readability.

There is also the temptation, especially with varchar columns, to just pick a size that is “big enough”. This means you might see varchar(100) in some places, varchar(128) in others and perhaps a very specific varchar(33) elsewhere. User defined types as simple aliases take all that guesswork away so that you don’t have to think nearly so hard. Other common sources of discrepancy are the various numbers used in financial scenarios, which have a habit of being stored with different precisions and scales, e.g. float(n), decimal(p, s), depending on the paranoia level of the developer.
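The alias only needs declaring once, and the guesswork goes with it; a one-line sketch using the name from the earlier example:-

create type dbo.CustomerName_t from varchar(100) not null;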

Abstracting Identities

Entities, such as customers and products, are commonly referred to by some integer based surrogate key (perhaps generated by an IDENTITY column), but they may also be based on a natural key if, say, a short code exists. Once again, using a type alias means that you don’t have to think so hard; the example below could be based on an underlying integer or varchar type:-

declare @productId dbo.ProductId_t
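Either of the following definitions (both illustrative) would satisfy that declaration, and the calling code need never know which one is in force:-

create type dbo.ProductId_t from int not null;
-- or, if a natural short code exists instead:
-- create type dbo.ProductId_t from varchar(10) not null;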

This is a big win because, unlike some cases where you could argue that you need to know what the underlying storage type is (perhaps you’re going to perform some manipulation of its contents), an entity ID is (in my experience) nearly always a unique & opaque value - its contents are unimportant. I’ve come to find this construct very unsatisfactory (in any language):-

declare @id int

The integer type supports so many operations, and allows so many unnecessary promotions, that it’s an accident waiting to happen[#] when used for the identity of an entity. Sadly, using UDTs will not disable the undesirable arithmetic operations, but it should help avoid unexpected type mismatches by virtue of not being able to get it wrong in the first place.
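For example, nothing stops you from mixing up two entirely unrelated identities when they’re all just ints (a contrived sketch):-

declare @customerId int = 42;
declare @productId int = 7;

-- all of this compiles and runs, yet none of it makes any sense
if (@customerId = @productId) print 'ids matched';
set @customerId = @productId + 1;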

Mechanisms for Change

Let’s now come back to that gnarly issue I put aside earlier - how to change them - because that’s what prompted this post in the first place…

As mentioned earlier there is no magical ALTER TYPE statement, which means you have two real choices:-

  1. Drop every referenced object, drop the old type and then recreate everything
  2. Introduce a new type, migrate everything across to that and then drop the old type

The second option may sound simpler, and it is, slightly; but only because you can recreate objects in an order that is less susceptible to dependency problems. Functions and procedures are the easy part anyway; what’s hard is altering a table’s schema, especially a big one!

Ultimately your choice for updating a UDT that is used in a table depends very much on the change you need to make to the UDT and whether it requires data in the existing table to be touched. For example, if you widen a column from varchar(10) to varchar(20), SQL Server only has to change the table metadata to say the maximum length has increased, because the actual length of each value does not change. Shrinking the column size or changing an integral type will likely cause data to be touched, or at least examined to ensure it still conforms. According to one of the Inside SQL Server books you can tell a priori whether data will need to be touched just by looking at the estimated execution plan.
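To make the cheap case concrete (with illustrative table and column names):-

-- widening: a metadata-only change, the existing rows are untouched
alter table dbo.Customer alter column Name varchar(20) not null;

-- shrinking it back would force the data to be examined first
-- alter table dbo.Customer alter column Name varchar(10) not null;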

So far my needs to change a UDT have only revolved around varchar columns and so I’ve been lucky that the table change has been the simple one. The changes to the functions and procedures were also pretty easy because we can rebuild the database from scratch (our Continuous Integration process does this on each build), so I could ensure I caught and checked every place where the type was used. It’s a somewhat laborious task, but with the right development process around you it’s a fairly painless one.

Updating Tables

As I understand it the most efficient way to update an existing table schema (where you’re not just appending a new column) is to rename it, create the new one and then copy the data back over. When I first came to change my UDT I knew that I could either try to fix up the existing one or create a new “version” of it. When I considered what actually needed to be done to fix the type, and how big the table was, I buckled and created a new type called “MyType2_t”. This was because it felt like there were far fewer ordering issues and I didn’t have to generate scripts to drop things first. So, instead of this lengthy rigmarole:-

  1. Drop dependent functions and procedures
  2. Create copies of tables without the UDT to be modified
  3. Copy data to backup tables
  4. Recreate the UDT with the new definition
  5. Create new tables with the new UDT
  6. Copy the data back again
  7. Recreate the functions and procedures

…I could just use ALTER TABLE and switch the types, taking whatever hit was necessary to adjust the data in the tables (which was none in this case):-

  1. Create the new version of the UDT
  2. Alter the tables to use the new version of the UDT
  3. Recreate the functions and procedures[$]
  4. Drop the old version of the UDT
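In T-SQL the middle two steps boil down to something like this (the names are illustrative, and I’ve assumed the varchar widening from earlier):-

create type dbo.MyType2_t from varchar(20) not null;
go

alter table dbo.SomeTable alter column SomeColumn dbo.MyType2_t not null;
go

-- ...recreate the functions and procedures against dbo.MyType2_t...

drop type dbo.MyType_t;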

Yes, the deployment was easier, but we’re now left with a wart - a type with a version suffix (e.g. MyType2_t). Having gone through all this, it dawned on me soon after that if you can use ALTER TABLE to switch types then surely you can just do this instead:-

  1. Drop dependent functions and procedures
  2. Alter the table to use a non-UDT based equivalent type
  3. Recreate the UDT with the new definition
  4. Alter the table to use the new UDT
  5. Recreate the functions and procedures
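Sketched out with the same illustrative names, the experiment would look like this:-

-- step 2: switch the column to the old type's underlying definition
alter table dbo.SomeTable alter column SomeColumn varchar(10) not null;
go

-- step 3: recreate the UDT with its new definition
drop type dbo.MyType_t;
create type dbo.MyType_t from varchar(20) not null;
go

-- step 4: switch the column back to the (new) UDT
alter table dbo.SomeTable alter column SomeColumn dbo.MyType_t not null;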

If SQL Server is smart enough to just apply a metadata change when growing a varchar(n), then it should be able to just replace the metadata when switching between a UDT that is effectively a varchar(n) and a plain varchar(n). This would remove the dependency on the UDT without any upheaval at all. Sadly it’s an experiment I cannot run on this netbook, but I intend to.

Unnecessary Context Switching

Having worked for some time on a codebase that uses UDTs virtually everywhere, when I now see an explicit type I find myself questioning whether it’s correct, and that is very distracting. Was any thought put into it, or was the author just being lazy? Perhaps it’s a new type and they just didn’t have time to create a UDT for it? Or maybe they did, but missed fixing up this case? Any way you look at it I’m being diverted from my real goal - but that’s just the age-old argument about being consistent, I suppose.

 

[+] According to the documentation for CREATE RULE, rules are now deprecated anyway in favour of Check Constraints. Was that because they just weren’t as useful as expected?

[*] There was a “petition” on Microsoft Connect about the lack of an ALTER TYPE statement which I discovered whilst trawling StackOverflow for alternate solutions.

[#] This is one area where C++ enumerations are a boon because they disallow the arithmetic operations and also give the value a type-safe holder. This jolly useful use of enumerations was pointed out to me by fellow ACCU member Jonathan Wakely.

[$] Each function and procedure script has the common “if (object_id('Xxx') is not null) drop Xxx” prologue.