Tuesday, 30 October 2012

Don’t Rewrite, Refactor

All systems eventually reach a point where they are either no longer needed or the cost of maintenance outweighs their benefit to the business. If the system was originally built using hardware and tools that are now unavailable, what choice do you have but to rewrite from scratch? Hopefully you’ve seen this position coming and have instead made provision to migrate to newer tools and technologies whilst at least maintaining the usefulness of the system. I’ve heard this kind of change referred to as an Architectural Refactoring. The use of the term refactoring has the same meaning as you’d expect - there is no change in behaviour, only in the underlying mechanism.

In one sense refactoring is rewriting, after all, code has changed. But the difference is that the change is still part of the same feature and is therefore still the code providing the same business value it did before, but hopefully in a “better” way. What rewriting seems to have come to mean is developing the same feature, but to one side of the production codebase. In essence there is a switchover from the old to the new, and probably a change in behaviour too as it’s likely that new features are what has driven it to occur in the first place.

I’ve seen plenty written about the fallacy of rewriting, but usually from the perspective of whether it actually delivers the business benefits it was expected to. One commonly cited problem is that the same mistakes are often repeated, or the system ends up doing exactly what the old one did, which is probably not what the business wanted in the first place!

However, my beef here is with rewriting internal parts of the system, especially when that rewrite does not involve a major change in tool or technology. If you already have a feature providing value, then the desire should be to evolve the implementation of that feature whilst continuing to draw on its value. By evolving through refactoring you get to continue extracting value whilst at the same time being in a position to put your spade down and switch focus to more important matters when necessary. If you choose the rewriting path you gain no value out of the change until you have switched over to it in production and decommissioned the original code.

Some time ago, at a previous client, I wrote a lengthy diatribe about how COM was not the problem in itself, but the way COM was used within the system was. I declared that one of the barriers to delivering new features was the inability to do unit testing because the excessive use of COM within the components made it intrinsically hard[1]. To try and turn the tide I wrote a couple of simple helper functions[2] that allowed us to reach into the COM component and extract the native C++ object inside, as we owned both sides of the interface. This gave us a clear path for the future where we would push the logic from the COM component down into a native C++ class, which would then be unit testable. Initially we would use the dirty helper functions to overcome the COM boundary and then start fixing up the call graphs by taking the native type instead of the COM interface as the parameter. This would then push COM back out to the module boundary where it belongs.

More recently I’ve been hit a couple of times by the worst of both worlds - a rewrite that is only partially wired in - so we are currently maintaining two versions of the same feature. Worse still is that the end state almost certainly means ripping out the temporary code used to partially wire it in, so there is yet more disruption ahead of us. One obvious question is whether refactoring would have just led to the same outcome? Personally, I don’t believe so and it’s because refactoring keeps you in permanent contact with all the current consumers of your feature. When you rewrite you have the opportunity to focus on a single use case and then try and evolve it as you reel in the remainder; with refactoring you are always dealing with real use cases[3].

Aside from the mental gymnastics required to juggle two implementations there is another cost which is commonly missed - source code continuity. As is probably already apparent from my other posts I’m a huge fan of using Software Configuration Management (SCM) tools to perform Software Archaeology when I need to track down why a change was made. Writing new code makes that task much harder because you have to trace back through the current implementation, whilst keeping an eye out for where the previous one was removed; just so that you can then trace back through that too. Although refactoring moves code around it generally remains within the same set of source files and commits and so you stand a better chance of piecing together the history.

In a modern project every commit should expect to deliver some value today. Refactoring may take longer in the long run, but at least every commit is a step forward. Rewriting stores value up in the hope that it will deliver in the future, but it stores up risk too, along with creating a disconnect between past and present.


[1] Registration-free COM was yet to come. Solutions involving registering every component before running any tests are brittle at best and disallow running parallel builds on the same machine. Registering the components using relative paths is possible, but you have to do a little work yourself.

[2] Essentially I created a marker COM interface called ISupportDynamicCast that any natively implemented COM class would “implement”. The presence of this interface means the caller knows it’s safe to attempt a dynamic_cast<> on the object from ISupportDynamicCast to the underlying C++ type. There are plenty of caveats to doing this, but if you implement both the caller and callee, and are using the same compiler for both, it just may dig you out of a hole.

[3] Yes, it’s possible that one or more of those use cases is no longer required, if it is you get to rip it out there and then rather than discover later that you’ve been living with dead code all along.

Monday, 29 October 2012

Constraints Are Logical, Indexes Are Physical

A few times, both verbally and through comments in code, I have heard developers talk about adding constraints or indexes when what they really meant was the opposite object. For example, I see comments in our SQL index scripts that suggest the index was added to ensure the column data is unique, or the converse, which is a developer suggesting we add a primary key constraint so that we can generate an index on a table.

The confusion is not surprising because the two concepts are tightly bound, and literature on the subject of query optimisation and tuning often focuses heavily on the topic of indexes as they are dealing with the issues of efficient storage and retrieval rather than the logical properties of your data. It just so happens that a database engine such as SQL Server will use an index to enforce a uniqueness constraint because it is the most efficient way to achieve it. After all, what other choices are there? If the table is sufficiently small a simple table scan may be adequate[1]. This would suggest that the creation and use of an index may well be driven by the volume of data. Another technique might be to use a cut-down variant of an index where you only store the column values and not the actual row pointers[2] so that you can determine uniqueness alone, but it would probably be of little use elsewhere.

If you look at it from a testing perspective you can write a functional test in one case but not the other, because an index is an optimisation and optimisations should, by definition, not be detectable at a functional level. When putting together a new SQL feature that requires a table I’ll use TDD to drive out the constraints[3]. Here’s an example test in SS-Unit for testing a unique constraint:-

create procedure _@Helper@_InsDupCustomer -- inserts two customers with the same name
as
    exec pub.Customer_Add @id=1, @name='TEST';
    exec pub.Customer_Add @id=2, @name='TEST';
go

create procedure _@Test@_CustomerNameIsUnique -- test name is illustrative
as
    exec AssertThrows '_@Helper@_InsDupCustomer';
    exec AssertTableRowCountIsEqualTo 0, 'Customer';
go

This separation of concerns between the logical uniqueness of the data and the presence of an index is highlighted in SQL Server by the fact that it will only allow you to drop an index that implements a constraint by dropping the constraint itself. Without it SQL Server couldn’t enforce the constraint efficiently. And this I suspect is where the purity of the data model gives way to the reality of database performance as the need to finely tune the index parameters may mean the constraint just gets in the way. Fair enough, in those cases where it’s required maybe the uniqueness of the data is just a note on a UML model and not implemented in the schema, but remember that should be the exception, not the norm[4].


[1] The lookup tables we used for our early implementation of enumerations fit this perfectly as they are just a handful of small rows, i.e. the single data page is bigger than all the table data itself. We now use a UDF with a CASE statement instead, which is also what I use for the TestOutcome enumerations in SS-Unit.

[2] Recent versions of SQL Server store the key values as the row pointer when the table contains a clustered index, so the row pointer is essentially as wide as the number of columns in your primary key.

[3] As I pointed out in “The Public Interface of a Database”, you do not always have to use table based constraints when you have a facade. Yes, it’s by far the most convenient and conventional but there are times when it needs to be done another way (blog post already queued up).

[4] I once worked on a system where the database had a large entity model showing all the relationships with foreign keys. It turned out though none of these actually existed in the database schema. There were murmurs of “foreign keys cause performance problems” and also the static data was updated by reloading the entire table. If you’re going to tune your production environment for speed I suggest you tune your development environment for finding bugs.

Tuesday, 16 October 2012

Instrument Everything You Can Afford To

One of the decisions we made right back at the beginning of my current project was to measure everything that we could afford to. That meant every database call, I/O request, complex task etc. We all knew from past experience that it’s much harder to diagnose performance problems later if you don’t have your code telling you what the performance of your system is, in your terms. Attaching performance tools to databases and measuring network traffic is much harder when you don’t really know the context within which the events are occurring. Also, without any sort of diagnostic logging, when a problem does occur it will no doubt have passed before you’ve had time to start monitoring it.

The main problem with this kind of code is that it is often seen to get in the way of the business logic it’s measuring. My belief is that if that’s the case then you probably need to do a bit of refactoring because your methods are way too big. In the past I’ve found that the logging and instrumentation logic usually sits in the public methods of the service class which then invokes a private method or even another class to do the actual work - a separation of concerns.

In C++ you have RAII & macros and in C# you have the Dispose pattern to help you reduce the clutter. The following are two examples of the kind of code I’m talking about from previous systems:-

ThingPtr requestThing(int id)
{
    DO_REMOTE_CALL("Request Remote Thing");

    return requestThingImpl(id);
}

public Thing RequestThing(int id)
{
    using (Scope.Measure(Instrument.WallClock | Instrument.Memory,
                         "Remote Request Thing"))
        return proxy.RequestThing(id);
}

The __FUNCTION__ pre-processor macro in C/C++ and the StackFrame class in .Net make it easy to reach out and grab the current method’s name without resorting to the error-prone copy-and-paste techniques of the past. However, that said, I still prefer to give the task (or operation) a very short name as it allows you to refactor the implementation without disturbing the saved metrics, as they should represent a slightly larger goal than just a minor implementation detail.

The tasks will often be grouped too, so I might have a “Remote Request Thing” for the proxy/client-side, “Handle Request Thing” in the service stub, and just “Request Thing” as the underlying in-process action. Using the first two I can then measure the infrastructure costs of remoting the call. Background growth in the number of connections and volumes of data has a habit of creeping up on you and so it’s important to keep a close eye on them. In fact for the past two weeks a colleague and I have been looking very closely at the instrumentation data around talking to a 3rd party grid. From the data we can see ourselves occasionally waiting 30 secs for the results of a calculation that only took 50 secs in the first place. Ideally we want the size of the computation to dwarf any infrastructure costs and so we need to look into this further. However this problem is compounded by the fact that the payload is being optimised by another team to overcome a different set of performance problems - so we need these metrics to ensure the run-of-the-mill case doesn’t suddenly become sub-optimal by mistake.

Many tasks are treated as singular in nature, i.e. they just are what they are. When it comes to dealing with reading/writing large collections though I’ve found it useful to attach extra metrics that help quantify the size of the inputs and/or outputs. For example when extracting a trade portfolio for a customer you might attach the trade count as an “external measurement” so you can get a rough per-trade cost for retrieval:-

public List<Trade> RequestPortfolio(int customerId)
{
    using (Scope.Measure(Instrument.WallClock | Instrument.Memory,
                         "Fetch Portfolio"))
    {
        var portfolio = RequestPortfolioImpl(customerId);

        Scope.Attach("Trade:Count", portfolio.Count);

        return portfolio;
    }
}

This does start to make the code look unwieldy but at least it all remains in the method stub whilst the actual business logic is held cleanly in a separate method or class. All of this extra jiggery-pokery should also only have minimal impact on the performance because the calls across the network or disk accesses will likely be orders of magnitude greater than starting/stopping a timer and calculating the difference. As always you should let the profiler guide you if you think you might be instrumenting tasks that are too fine-grained. In principle, at the level of code I’m generally working, the resolution of the standard clock should be enough to measure what I’m doing. Any more than that I would start to think a bit more carefully about both the granularity of the task and the potential volume of data I’d be producing.

I’m a big fan of text log files[1] that you can pull apart with the usual array of off-the-shelf tools. Many years ago I discovered Microsoft’s LogParser whilst I was involved in some IIS based work. Since then it has been my tool of choice for extracting instrumentation data from our log files because of its SQL-like syntax. It’s getting pretty old these days but still has excellent performance which is extremely useful when you’re trying to pull ad-hoc data from a plethora of grid nodes.

My current system uses the following log file format[2]:-

<datetime> <pid> <tid> <severity> <message>

We have a special message severity for instrumentation data called “PRF” just to make it a little easier to extract. It also has the benefit of allowing you to turn down the highlighting for these messages when you’re tailing a log file. The <message> itself is then split into three sections:-

<task> <metrics> <context>

The task name, as I said above, is a short name without any context that describes a high-level goal. The metrics are a classic list of key/value pairs with the “:” character used as a simple namespace separator (ala XML). Finally the context is a similar set of key/value pairs that describes the instance specific details of the task. Putting this all together a very simple example instrumentation message might be (allowing for word-wrap):-

<datetime, etc.> PRF [Fetch Portfolio] [WallClock:Duration=00:05:00;Trade:Count=100] [RequestId=1234;RemoteHost=ServerXxx]

For our grid-based tasks there can be upwards of a dozen or more metrics covering CPU usage, wall clock time, memory consumption and payload sizes covering both the client and service sides. Although we can get some of this data from the grid vendor’s own reporting database and logs it’s much easier to deal with it when it’s already in-line as you don’t have to try and match it up.

As an aside the salient attributes of the context travel across the wire on remote calls to help stitch multiple threads together. We call this salient context “causality” and it’s based on a similar technique in DCOM along with the Diagnostic Context pattern from PLOPD vol 3. But that’s for another time…


[1] I once made the mistake of going with a binary log file format in the hope that it would buy better performance and allow us to write lots more system-level data like the current “tick count”, etc. by default. It may have been more performant, but we then ended up writing our own tools to manipulate the files. I totally underestimated the power you get from a simple column aligned text log file with some core fields.

[2] I’ve stuck with this format (or something very similar) ever since my disappointment with the binary log file above as it just seems perfect by default. The four leading columns are all fixed-width which makes it very easy on the eye when tailing the file. My choice of GUI based tail program (with highlighting, natch) for many years has been BareTail. I also use their GUI grep tool BareGrep too as it also supports highlighting. For automation it’s the usual array of sed, awk, egrep, PowerShell, etc. although my colleague Tim Barras recently highlighted that “cut” was an important omission from my toolbox.

Friday, 12 October 2012

Regular Expressions Should Only Be An Implementation Detail

Last week I had a spot of déjà-vu as I found myself questioning another developer’s suggestion that they might use regular expressions as a way of defining some simple rules. It’s easy to see why experienced developers might choose to use regular expressions in this way, and if it was purely an implementation detail I wouldn’t have paid it any more attention. But these rules were going in a database where they are probably going to be consumed by support staff and business analysts as well in the future[1].

Immediately it might sound as though I am denigrating support staff and business analysts by suggesting that they are not au fait with regular expressions. On the contrary, I’m sure there are many that know bucket loads more than I do about them. No, what I’m suggesting is that regular expressions are probably not the best default choice for this kind of feature.

I have gone through this conversation a couple of times before, and in both cases the requirements to support anything more than some simple matches never materialised in the meantime. In the current case we just needed to support very simple whole token inclusive matches, but support was added for a simple NOT style operator as well (~token). The domain of the strings was such that this is extremely unlikely to cause a conflict in the foreseeable future, so much so that I’d suggested the “escape” character be elided to keep the code really simple. Also the set of tokens to match are unordered by default which instantly makes any regex more complex than a simple CSV style list where more than one token needs matching. Taking a cue from my previous post though we quickly went through a thought exercise to ensure that we weren’t creating any problems that would be costly to change later.

Of course that doesn’t mean that the implementation of the rule matching can’t be done with regular expressions. The best solution may well be to transform the (now) very simple rules into regexes and then apply them to do the matching. In one of the previous cases that’s exactly what we did.

My reason for disliking exposing regular expressions by default is that the wildcards are subtly different from the other sorts of wildcards you are more likely to encounter when you start in IT. For example you don’t write “SELECT .* FROM Table” to select all columns from a table and you don’t search for all files in a folder by using “DIR .*\..*”. Throw into the mix the need to escape common textual elements such as parentheses and it becomes a much trickier prospect. In my opinion keeping the DSL (of sorts) minimal and focused up front will lead to far fewer surprises in the future as the data changes.

To those jumping up-and-down at the back shouting about testing, of course that is key to making any change successful - code or configuration. But we all know where this path leads… Any post about regular expressions is not complete without an obligatory reference to this quote from Jamie Zawinski[2]:-

“Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems.”

Not allowing things to get too complicated in the first place is a good step on the road to keeping this kind of feature under control. The moment complexity sets in and it starts getting painful you’ll have the chance to re-architect long before you’re dealing with the kind of horrendous regexes that only appear on The Daily WTF.

As an aside, one of the features of PowerShell that I’m particularly fond of is that it supports both “-like” and “-match” comparison operators. The former uses the same semantics as the shell for matching files, whilst the latter is for use with regular expressions. I’m sure this seems abhorrent to others (having two operators that do a similar job) and it has the potential to cause just the kind of confusion that I’m suggesting we avoid. In practice (so far) it seems to work because the context that you use it in (files or strings) is normally clear from the code. Time will tell though...


[1] Yes, the database is internal to the system and by that definition it is still an implementation detail. However enterprise systems have a habit of leaking like a sieve, especially data in databases, and so I treat it in a much looser way than constants and configuration defaults that are baked into code and scripts.

[2] The book Coders at Work has a fine interview with him as I found out a couple of years back.

Monday, 8 October 2012

Putting the Cost of Change Into Perspective

One of the topics in Programming Pearls by Jon Bentley that I still find most valuable on a daily basis is the Back of the Envelope Calculation. Often just a rough calculation can help you see the orders of magnitude involved with a task and let you decide whether to categorise it as “small stuff” that’s not worth sweating about or “potentially costly” that deserves more thought.

Refactoring is a good case in point. Sometimes there is a direct need to refactor code to allow you to make a safe change, such as when dealing with legacy code. But there is also the “boy scout” bit where you might have a little extra time to just do a bit of litter picking and delete some unused code, or improve the names of some related types/methods. The conscientious programmer always feels uneasy about doing this because the value of the change is far less tangible and therefore it can be seen by some to be approaching the realm of gold-plating.

On a different, but vaguely related note, I read an article in the June 2012 edition of MSDN Magazine about the benefits of the new code-based configuration support in MEF. Now, I still struggle to see the advantages of an Inversion of Control (IoC) container and this article did nothing to show me what significant savings I’d gain from adopting such a technology[*]. Anyway, about half way through the piece the author suggests that if he needed to add support for “a new weather service provider” he could implement the service and then plumb the changes into the app with just a line or two of code. His use of “That’s It!” may well be intended to be ironic because implementing most new functionality is clearly non-trivial in comparison to the mechanism needed to plumb it in; but it’s not an isolated case where I’ve seen an IoC container presented in the same light as a Silver Bullet.

What both of these cases have in common though is that the costs are often presented within the perspective of the “development time” of a feature - the time taken to “write the code”, and hopefully (at least) unit test it. But there can be so much more to make a feature not just done, but “done done”. Even if you’re using Continuous Integration and Continuous Deployment so that your build and deployment times are minimal, the level of system-testing coupled with the original discussions that lead up to the decision to implement it may be orders of magnitude more than the time you’ll spend doing a little cleaning-up or hand-crafting a factory[#].

Of course no change is made in isolation and so it could be said that the build, deployment and system testing costs are shared across all the features currently being developed. Even then I still think doing the maths is worth it as it may well show that the effort is just a drop in the ocean. Where you do need to be more careful is holding up a build & deployment to get “one more tiny change in” as this can be more disruptive than productive. It also means your process is probably not as fluid as it could be.

By way of an opposite example a request was once made by a project manager to look into making it easier to add support for new calculation types. In his mind it should be simple to configure the inputs, parse the output and then persist it without needing a code change; and in essence he was right. We could have re-designed the code to allow the whole process to be “more dynamic” - swapping code complexity for configuration complexity. My personal objection though was not on technical grounds (it sounded entirely useful after all) but on the grounds that we spent all our time trying to work out why the structure of the inputs and outputs didn’t match our expectations. Or having a few round trips with our 3rd party as we point out a variety of cases where it failed. Or writing test cases to make sure we can detect when this structure changes again unexpectedly in the future. In other words changing the production code is just a small part of the overall cost of getting the entire feature into production.

I guess the flipside to all this is the old saying about taking care of the pennies and the pounds taking care of themselves; after all, nobody goes out of their way to be inefficient on purpose. Of course if it’s the same penny every time then the usual rules about automation apply. Either way just be sure that you know what it is you’re giving up in return - the lunch will almost certainly not be free and the best-before date will eventually arrive.


[*] I’ve worked on both desktop applications and back-end systems and I still don’t see the allure. But I’ve not worked on a modern large-scale web site or heavily “screens” based desktop app so that may explain it. My only experience is with helping someone remove their dependency on one as a by-product of making their code more unit testable so I’m definitely not “feeling” it. However, it’s still lodged in my Conscious Incompetence ready to be put to good use when/if that day finally comes.

[#] That’s one of the things about IoC containers that appears snake-oil-like to me - the age old GoF Design Patterns such as Abstract Factory, Factory Method & Facade are trivial to implement and have the added bonus of not forcing any of my code to have a direct dependency on a framework. One of the techniques I used in the footnote above when improving unit test coverage, was to use Facade to factor out the IoC code.