Thursday 20 June 2013

Don’t Use ORDER BY as a Progress Indicator

I’ve had the misfortune to be involved with investigating a few performance problems with existing SQL queries that just suddenly went AWOL for no apparent reason. In one case it was the same query again. On the first occurrence bad statistics looked to be the blame as an import process had just finished dumping a whole bucket load of data in and so a manual invocation to update the statistics was instigated whenever the import process had run. It shouldn’t have been necessary of course, but that’s just one aspect of the Black Art that is database maintenance.

As an aside I saw similar problems at a different organisation and each time it happened the developer’s immediate reaction was to question why Auto Update Statistics was disabled. It was “the policy”, apparently. The “story” goes that the company lost money because of a borked query that failed due to “bad statistics”. As a consequence no one was allowed to enable Auto Update Statistics without special dispensation. Of course this all happened “years ago” on an earlier version of the RDBMS product. Even more amusing was that a statistics update maintenance job had been added by the DBAs but the timing was such that it ran after most of the data had been deleted by the archiving process!

Anyway, during the investigation a second time I noticed that the query ended with an ORDER BY clause. I remembered from the first time around that removing the sort was another way to avoid the query performing badly. Back then though, what with it being production and following on from a large import, it seemed more fitting to add a task to refresh the statistics to avoid similar problems elsewhere. This time around the sort still seemed to be causing the problem and removing it needed to be more seriously considered.

The main reason I thought it needed serious consideration was because I couldn’t understand why the data was being sorted in the first place. The processing that took place on the 100,000+ results was completely order independent. Luckily the original developer was still around and a quick question revealed that by sorting the results the log output would be in alphabetical order and therefore it would be easier to determine how far along the processing was. A quick grep of the SQL codebase showed that there were a number of queries that had ORDER BY clauses for which I was sure there was absolutely no business need for them. In fact, I think I could count on one hand the cases where the sort was actually a necessity.

As a rule of thumb sorting in the database is a bad idea. You are very likely to have far less database instances than clients and so putting additional load on the database to sort the data is only adding to what is probably already an overloaded beast. Sorting on the client is invariably going to be cheaper and it’s not like you need to write an efficient sort algorithm yourself these days either. I don’t disagree that sorting the data to provide an easy way of seeing progress is altogether a bad idea, it’s just where you do the sorting that matters most. A comment in the client code wouldn’t go a miss either as its not exactly obvious.

This particular piece of advice (not sorting server-side) also caused somewhat of a stir when this was posted: “7 Things Developers Should Know About SQL Server”. The comments (and subsequent updates) are as ever quite revealing about how literally some people take advice like this without thinking.

Wednesday 19 June 2013

A Tail of Two Lists

On the one hand I like to celebrate the fact that I come from an engineering background (I did a degree in Electronic System Engineering) and prefer to put an emphasis on the applied rather than the theoretical. But as I get older and more competent in the practical side I am becoming more curious about the theoretical side of programming so I can appreciate more about where so much of it comes from. That’s another post in itself though, whereas this one is specially about how that lack of Computer Science knowledge hindered me from properly understanding a fundamental construct - lists.

Linked Lists

Like many programmers I learnt C as my first high-level language (if you ignore the introductions to Pascal at A Level and Modula 2 in University) where the only built-in container type is the array. If you wanted something more dynamic, or you were “learning programming”, then the first structure you’re probably shown is the list; or as it’s often commonly known, the linked list. In fact a singly-linked list is often glossed over because it’s generally more awkward to use and so a doubly-linked list becomes the default (implementation of) choice. What could be simpler than this:-

typedef struct Item
{
    struct Item* m_next;
    struct Item* m_prev;
    const char*  m_data;
} Item;

typedef struct StringList
{
    struct Item* m_head;
    struct Item* m_tail;
} StringList;

Convert StringList to a class and throw templates into the mix and you’ve got yourself the (somewhat naive) C++ equivalent. This kind of list also serves very nicely as the implementation for a queue (FIFO) or stack (LIFO), with items being added and removed at the head and/or tail depending on what container you’re aiming for. What is important to realise for this post is that the head and tail only refer to a single item, e.g. in ASCII art something like this:-

A <-> B <-> C <-> D
^                 ^
|                 |
head           tail

If the first and last elements are the head and tail respectively, then what is the bit in the middle called - the body?

Type Lists

Fast forward a few years and I’ve been “doing” C++ for a while now. Naturally I’ve stopped using custom list classes and am instead relying on the library implementations, most notably the one that come with MFC [1]. Of course being the journeyman that I (hope I) am I discover this really amazing book called Modern C++ Design by Andrei Alexandrescu which describes these weird new concepts, like “template meta-programming” that relies on heavy template programming voodoo. At this point I’m still only just about getting my head around the “iterator model” and the standard containers. Oh yeah, and I’m on v5 or v6 of Visual C++ which barely seems to understand any more about Standard C++ than I do, so nothing in the book compiles.

In chapter 3 Andrei introduces a concept called Type Lists, which is a precursor to implementing some more “computer science-y” type stuff called Tuples. Try as I might I just can’t seem to understand what’s going on. I originally had 3 attempts at getting my head around the way he manipulates these “type lists” and each time I understood how he splits off the head and the tail but I just can’t see where the rest (the body?) of the list goes. I found myself getting really frustrated because I mentally stepped through his template code again and again and again, but I just couldn’t see why he didn’t end up with a head, body and a tail! But I also knew that wasn’t right either because the definition of a type list is trivial:-

<template typename H, typename T>
struct TypeList
{
    H head;
    T tail;
}

Of course there were a few other minor details that added slightly to the noise, like the last element in the list being the NullType that he introduced in an earlier chapter, but essentially a type list laid out long-hand would be like you might expect:-

typedef MyList TypeList< Widget, TypeList< int, TypeList< char, NullType> > >;

So, where was I going wrong?

Mental Models

I can’t remember exactly what it was that finally caused the light bulb to go on, I think it was whilst starting to play with Python and trying to understand the differences between lists and tuples. Anyway, what surprised me was that the documentation I was reading seemed to be suggesting that when you slice a list you get a head, which contains one or more items, and a tail which contains the rest of the list. Eh? So, the tail is not a single item but everything except the head! My mental model was more like that of a mammal that has a head, body and tail, when I should have been thinking of something more like a snake [2].

When you’re working on an architecture like DOS or 16-bit Windows where your data and stack share a measly 64K, you quickly learnt that recursion was something to be avoided at all costs. Consequently the only model for traversing arrays, lists, etc. is iteratively, from head to tail - one item at a time. Conversely the “functional programming” model [3] suggests a more recursive way of working where you effectively keep picking the head off the list at each step until the tail becomes empty.

As our computing resources continue to lose such performance constraints and our tools get smarter we have to remember that what we have been working with so far is sometimes just a pale imitation of a bigger concept brought about by the usual engineering trade-offs.

 

[1] When you spend all day writing MFC based applications it’s easy to forget that the default container type is std::vector<> and you end up using CArray<> instead. There was enough of an impedance mismatch between the STL and MFC containers that meant the path of least resistance was simply to just stick with the MFC implementations. Sadly that also meant the beauty of the container/iterator/algorithm model went unnoticed by me for many years.

[2] I’m sure someone well versed in reptile physiology will no doubt point out that snakes are more than a head and tail, pardon my ignorance.

[3] I’m sure I’m going to get some flack for abusing such a term and/or over simplifying it, but like many others I’m only just starting out on the road to understanding more about the functional and declarative paradigms.

Tuesday 18 June 2013

PowerShell Command Line Quoting Weirdy

My last post “Handling Paths with Spaces & Parenthesis in Batch Files” was really just a precursor to this one which is about another weird thing I bumped into recently with executing simple PowerShell one-liners via a batch file. I applied the same logic I always do to quoting long filenames [1] and ran into confusing behaviour that seems to create the exception to my rule of thumb…

PowerShell Strings

In PowerShell you can define strings with either double quotes (“string”) or single quotes (‘string’). The main difference is that the former will perform variable substitution (“Hello $FullName”) whereas the latter won’t. Essentially, if the string contains no variables or expressions they are identical and as such if I create a PowerShell script with the following two statements I should get the same behaviour:-

Get-ChildItem "C:\Program Files" | Measure-Object | Select Count
Get-ChildItem 'C:\Program Files' | Measure-Object | Select Count

Both do execute fine and I get a simple table with the same answer, twice:-

Count
-----
   55
   55

PowerShell One-Liners

So far so good. But what about if I want to execute the same simple one liner as part of a batch file? I don’t want to have to ship another .ps1 script so I can put the whole thing on the PowerShell command line and execute it directly like this:-

C:\> PowerShell Get-ChildItem C:\ ^| Measure-Object ^| Select Count

Obviously the pipe symbol means something to the CMD.EXE shell and so I need to escape it with the caret (^) character. This works a treat and I get the output I expect. The next step is to generalise it, then I can use it in a batch file. So I’ll create a variable and then follow my own advice and surround it with double quotes to allow for long filenames:-

set folder=%~1

PowerShell Get-ChildItem "%folder%" ^| Measure-Object ^| Select Count

Huh? When I run this I get a rather unexpected error:-

Get-ChildItem : Cannot find path 'C:\Program' because it does not exist.
At line:1 char:14
+ Get-ChildItem <<<<  C:\Program Files | Measure-Object | Select Count

Originally my gut reaction was to try the second form - using single quotes - to see if that worked instead:-

PowerShell Get-ChildItem '%folder%' ^| Measure-Object ^| Select Count

And it does. However, another way to deal with escaping the pipe symbols is to provide the entire command as a single argument by surrounding it with double quotes, like so:-

PowerShell "Get-ChildItem C:\ | Measure-Object | Select Count"

Good, we’re back to green again. A common technique for embedding double quotes within an argument is to escape them with a backslash. And that, it seems, also works:-

PowerShell "Get-ChildItem \"%folder%\" | Measure-Object | Select Count"

Hmmm, unwinding the trial-and-error stack and applying the same escaping policy to the original command line now gets us what we wanted from the beginning:-

set folder=%~1

PowerShell Get-ChildItem \"%folder%\" ^| Measure-Object ^| Select Count

. . .

Count
-----
   55

Who’s Escaping From Whom?

The whole trial-and-error thing really makes me nervous as I don’t know what other error scenario I might have just traded to get this one working. When it comes to escaping command lines you’re running from a shell you need to factor in what escaping policy the shell uses along with what policy the application’s runtime uses. Just to make sure I know where this escaping is being handled I can take the CMD.EXE shell out of the equation by running the command via Explorer’s “Run…” option (Win+R) or via Task Manager’s “File | New Task…” command.

PowerShell Start-Transcript C:\Temp\Test.txt; Get-ChildItem \"C:\Program Files\" | Measure-Object | Select Count

Running the actual statement causes a PowerShell window to appear and disappear within a blink of an eye, so by prefixing the command with the Start-Transcript statement I can capture the output to a text file as if I was redirecting via a shell. Lo and behold it needs the same escape sequence - \” - to work and so I now know it’s PowerShell that for some reason needs the extra escaping (or use of single quotes) instead of the shell.

 

[1] I wonder how many developers today remember the (good|bad) old days of “short” filenames? For the uninitiated the term long filename is often synonymous with a path that contains spaces as short filenames (8.3 - filename.ext) never [officially] supported them.

Monday 17 June 2013

Handling Paths with Spaces & Parenthesis in Batch Files

Whoever it was that decided the 32-bit Program Files folder on 64-bit windows was called “Program Files (x86)” was clearly having a bad day. Not only does it still have spaces in, which has already caused many a developer grief, but it now contains parenthesis too. This second addition can be bad for batch file programmers because that’s what you use to create scopes with IF statements and FOR loops.

Error Handling

For example this is a pattern that I use in batch file programming to handle errors at each step [1]:-

tool.exe --do-something %folder%
if errorlevel 1 (
  echo ERROR: Oops processing ‘%folder%’ 
  exit /b %errorlevel%
)

I always like to enclose my string parameters in quotes so that you can easily see when the parameter is empty; which is a common problem that can be hard to spot at first. As a rule of thumb I’ve tended to use single quotes mostly because they end up being pasted into a SQL query and they look marginally less noisy than double quotes. So, what happens when the error handler is invoked and %folder% is something like the aforementioned 32-bit programs folder? Well, you’d hope to see this:-

ERROR: Oops processing 'C:\Program Files (x86)'

Of course what you actually get is this:-

' was unexpected at this time.

If you’re writing a deployment script, then you probably have another bit of a path tacked on the end, which if it’s also got spaces, such as C:\Program Files (x86)\Chris Oldwood\My Tool, then you’re going to see something more freaky like this:-

\Chris was unexpected at this time.

The error message “was unexpected at this time.” is now fairly well ingrained as meaning I screwed up something in the quoting (or un-quoting as we’ll see later) somewhere in an IF block. The solution of course is to use double quotes, around paths when printing them out:-

echo ERROR: Oops processing "%folder%"

As always there is no substitute for testing your error handling, even in a batch file! What might make this situation worse is that the script will likely have been written and initially tested on a 32-bit desktop and then it fails some time later when it’s finally used on a 64-bit machine. Hopefully as the take-up of 64-bit Windows increases for desktops the feedback loop will diminish and it’ll show up much sooner (assuming you do the testing).

Quoted Arguments

The Windows command interpreter does not behave like “normal” application runtimes when it comes to handling script arguments. In a C/C++ or C# application the arguments you get passed via main() have any outer quotes removed. When you need to pass an argument, such as a path that has spaces, you generally need to put double quotes around it to have the application treat it as a single argument. But with CMD.EXE what you’ll get instead is the argument with the quotes still surrounding it.

For example, if you create a batch file with the single line @ECHO %1 and invoke it with “Hello World” (with the quotes) what you’ll see is:-

"Hello World"

This causes a problem if you use the following common pattern to detect if an argument has been provided or not:-

if "%1" == "" (
  call :usage
  exit /b 1
)

The use of double quotes around the argument %1 to ensure you don’t get a syntax error when the string is empty will instead give you a slightly cryptic error message when fed with a path like “C:\Program Files (x86)”:-

Files was unexpected at this time.

One alternative I’ve seen a few people use is to replace the double quotes with a different character, such as parenthesis, e.g.

if (%1) == ()  (
  call :usage
  exit /b 1
)

Personally I’d be tempted to use single quotes as it looks a little less weird these days what with HTML/XML allowing both “” & ‘’ for empty strings.

Stripping Quotes

Of course I’d argue that doing that actually solves the wrong problem, which is that you might receive a path with or without surrounding quotes. If you intend to manipulate the path then you’ll want to remove the quotes first. After you’ve manipulated it you’ll want to put the quotes back on again before passing it on to another script or process. The interpreter already has support for doing this with the “~” variable modifier [2].

As an aside I dislike using %1, %2, etc directly in my scripts and prefer to copy them to a named variable as soon as possible. This is the ideal time to strip any quotes too, e.g.

@echo off
setlocal

set installFolder=%~1

if "%installFolder%" == "" (
  call :usage
  exit /b 1
)

As long as you use SETLOCAL before creating any variables they won’t pollute the current context when the script terminates. From then on you always use the variable which of course makes the script easier to understand.

if not exist "%installFolder%" (
  echo ERROR: Invalid root folder "%installFolder%"
  exit /b 1
)

set installFolder=%installFolder%\Chris Oldwood\My Tool

By stripping any quotes that are present in (path) arguments you then make it easier on yourself by having a consistent policy within the script - only append quotes when the path itself needs evaluating, not the variable. So for the example above in the existence check I needed to add the quotes, but in the manipulation below it I didn’t. Of course logging the path in the case of an error is the exception to the rule :-).

[1] For this pattern to nest correctly you need to enable delayed expansion of variables (see HELP SETLOCAL) and use the “!” instead of “%” when evaluating variables.

[2] If you type HELP FOR you’ll get a help page for the FOR command, the end of which contains the various modifiers.