Friday 12 October 2012

Regular Expressions Should Only Be An Implementation Detail

Last week I had a spot of déjà vu as I found myself questioning another developer’s suggestion that they might use regular expressions as a way of defining some simple rules. It’s easy to see why experienced developers might choose to use regular expressions in this way, and if it were purely an implementation detail I wouldn’t have paid it any more attention. But these rules were going in a database where, in the future, they are probably going to be consumed by support staff and business analysts as well[1].

Immediately it might sound as though I am denigrating support staff and business analysts by suggesting that they are not au fait with regular expressions. On the contrary, I’m sure there are many that know bucket loads more than I do about them. No, what I’m suggesting is that regular expressions are probably not the best default choice for this kind of feature.

I have been through this conversation a couple of times before, and in both cases the requirement to support anything more than some simple matches never materialised. In the current case we just needed to support very simple, inclusive, whole-token matches, although support was added for a simple NOT-style operator as well (~token). The domain of the strings was such that the “~” prefix is extremely unlikely to conflict with a real token in the foreseeable future, so much so that I’d suggested the “escape” character be elided to keep the code really simple. Also, the set of tokens to match is unordered by default, which instantly makes any regex that has to match more than one token more complex than a simple CSV-style list. Taking a cue from my previous post, though, we quickly went through a thought exercise to ensure that we weren’t creating any problems that would be costly to change later.

Of course that doesn’t mean that the implementation of the rules matching can’t be done with regular expressions. The best solution may well be to transform the (now) very simple rules into regexes and then apply them to do the matching. In one of the previous cases that’s exactly what we did.
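To make that transformation concrete, here is a minimal sketch in Python. It is not the code we wrote; the function names, the comma as the rule separator and the definition of a "whole token" (not adjacent to another word character) are all assumptions made for illustration. Each token becomes a lookahead, which is what lets the tokens appear in any order.

```python
import re


def rule_to_regex(rule: str) -> re.Pattern:
    """Translate a simple rule such as "fx,~swap" into a compiled regex.

    Assumed rule syntax (hypothetical): comma-separated tokens, with a
    leading "~" meaning the token must NOT be present.
    """
    parts = []
    for raw in rule.split(","):
        token = raw.strip()
        negated = token.startswith("~")
        if negated:
            token = token[1:]
        if not token:
            continue  # ignore empty entries such as a trailing comma
        # Whole-token match: not adjacent to another word character.
        body = r"(?<!\w)" + re.escape(token) + r"(?!\w)"
        # A lookahead per token means the order of tokens is irrelevant.
        parts.append(("(?!.*" if negated else "(?=.*") + body + ")")
    return re.compile("^" + "".join(parts), re.DOTALL)


def matches(rule: str, subject: str) -> bool:
    """Apply a simple rule to a subject string via the generated regex."""
    return rule_to_regex(rule).search(subject) is not None
```

The person maintaining the rules only ever sees something like "fx,~swap", whilst the lookaheads, anchors and escaping stay hidden inside the code where they can be changed later.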

My reason for disliking exposing regular expressions by default is that the wildcards are subtly different from the other sorts of wildcards you are more likely to encounter when you start in IT. For example you don’t write “SELECT .* FROM Table” to select all columns from a table and you don’t search for all files in a folder by using “DIR .*\..*”. Throw into the mix the need to escape common textual elements such as parentheses and it becomes a much trickier prospect. In my opinion keeping the DSL (of sorts) minimal and focused up front will lead to far fewer surprises in the future as the data changes.
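A small Python sketch shows both traps. The file name and the "total (net)" text are made-up examples, and is_valid_regex is just a helper for this illustration; fnmatch stands in here for the shell-style wildcards that most people meet first.

```python
import fnmatch
import re


def is_valid_regex(pattern: str) -> bool:
    """Return True if the text compiles as a regular expression."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False


# Shell-style wildcard: "*" means "any run of characters".
glob_hit = fnmatch.fnmatchcase("report.txt", "*.txt")

# The same text is not a legal regex, as the "*" has nothing to repeat.
star_is_regex = is_valid_regex("*.txt")

# The regex spelling needs ".*" for the wildcard and an escaped dot.
regex_hit = re.fullmatch(r".*\.txt", "report.txt") is not None

# Unescaped parentheses form a group, so the literal text no longer matches.
unescaped_hit = re.search("total (net)", "total (net)") is not None
escaped_hit = re.search(re.escape("total (net)"), "total (net)") is not None

print(glob_hit, star_is_regex, regex_hit, unescaped_hit, escaped_hit)
```

Only the first, third and last checks succeed, which is exactly the kind of surprise awaiting someone who types what looks like perfectly ordinary text into a rules table.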

To those jumping up-and-down at the back shouting about testing, of course that is key to making any change successful - code or configuration. But we all know where this path leads… Any post about regular expressions is not complete without an obligatory reference to this quote from Jamie Zawinski[2]:-

“Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems.”

Not allowing things to get too complicated in the first place is a good step on the road to keeping this kind of feature under control. The moment complexity sets in and it starts getting painful you’ll have the chance to re-architect long before you’re dealing with the kind of horrendous regexes that only appear on The Daily WTF.

As an aside, one of the features of PowerShell that I’m particularly fond of is that it supports both “-like” and “-match” comparison operators. The former uses the same semantics as the shell for matching files, whilst the latter is for use with regular expressions. I’m sure this seems abhorrent to others (having two operators that do a similar job) and it has the potential to cause just the kind of confusion that I’m suggesting we avoid. In practice (so far) it seems to work because the context in which you use it (files or strings) is normally clear from the code. Time will tell though...

 

[1] Yes, the database is internal to the system and by that definition it is still an implementation detail. However enterprise systems have a habit of leaking like a sieve, especially data in databases, and so I treat it in a much looser way than constants and configuration defaults that are baked into code and scripts.

[2] The book Coders at Work has a fine interview with him as I found out a couple of years back.
