Monday, 29 October 2012

Constraints Are Logical, Indexes Are Physical

A few times, both verbally and through comments in code, I have heard developers talk about adding constraints or indexes when what they really meant was the opposite object. For example, I see comments in our SQL index scripts that suggest the index was added to ensure the column data is unique, or the converse, which is a developer suggesting we add a primary key constraint so that we can generate an index on a table.

The confusion is not surprising because the two concepts are tightly bound and literature on the subject of query optimisation and tuning often focuses heavily on the topic of indexes as they are dealing with the issues of efficient storage and retrieval rather than the logical properties of your data. It just so happens that a database engine such as SQL Server will use an index to enforce a uniqueness constraint because it is the most efficient way to achieve it. After all, what other choices are there? If the table is sufficiently small a simple table scan may be adequate enough[1]. This would suggest that the creation and use of an index may well be driven by the volume of data. Another technique might be to use a cut-down variant of an index where you only store the column values and not the actual row pointers[2] so that you can determine uniqueness alone, but it would probably be of little use elsewhere.

If you look at it from a testing perspective you can write a functional test in one case but not the other, because an index is an optimisation and optimisations should, by definition, not be detectable at a functional level. When putting together a new SQL feature that requires a table I’ll use TDD to drive out the constraints[3]. Here’s an example test in SS-Unit for testing a unique constraint:-

create procedure
    exec pub.Customer_Add @id=1, @name=‘TEST’;
    exec pub.Customer_Add @id=2, @name=‘TEST’;

create procedure
  exec AssertThrows ‘_@Helper@_InsDupCustomer’; 
  exec AssertTableRowCountIsEqualTo 0, ‘Customer’;

This separation of concerns between the logical uniqueness of the data and the presence of an index is highlighted in SQL Server by the fact that it will only allow you to drop an index that implements a constraint by dropping the constraint itself. Without it SQL Server couldn’t enforce the constraint efficiently. And this I suspect is where the purity of the data model gives way to the reality of database performance as the need to finely tune the index parameters may mean the constraint just gets in the way. Fair enough, in those cases where it’s required maybe the uniqueness of the data is just a note on a UML model and not implemented in the schema, but remember that should be the exception, not the norm[4].


[1] The lookup tables we use used for our early implementation of enumerations fit this perfectly as they are just a handful of small rows, i.e. the single data page is bigger than all the table data itself. We now use a UDF with a CASE statement instead, which is also what I use for the TestOutcome enumerations in SS-Unit.

[2] Recent versions of SQL Server store the key values when the table contains a clustered index so the row pointer is essentially as wide as the number of columns in your primary key.

[3] As I pointed out in “The Public Interface of a Database”, you do not always have to use table based constraints when you have a facade. Yes, it’s by far the most convenient and conventional but there are times when it needs to be done another way (blog post already queued up).

[4] I once worked on a system where the database had a large entity model showing all the relationships with foreign keys. It turned out though none of these actually existed in the database schema. There were murmurs of “foreign keys cause performance problems” and also the static data was updated by reloading the entire table. If you’re going to tune your production environment for speed I suggest you tune your development environment for finding bugs.

No comments:

Post a Comment