Wednesday 7 November 2012

Sensible Defaults

If you were writing a product, say a service that allows remote clients to connect and submit requests, what would you choose as the default value for the service hostname? Given that your service will no doubt be fault-tolerant, it will probably allow the remote end to disappear and reappear in the background. Of course you might decide to allow a client to configure itself so that after ‘N’ failures to connect it will return an error so that some other (manual) action can take place. What would you pick as the default value for this setting?

Put your hand up if you said “localhost” and “infinite” as the answers to those two questions. Are you really sure they are sensible values to use by default?

Not surprisingly, I had to work with a service that had exactly those as its default settings. To make matters worse there were other questionable defaults, such as not having any sort of logging enabled[1]. Besides programmatic configuration[2] there was also a config-file-based mechanism that looked for its file in the Current Working Directory (CWD) by default. Next question: what is the CWD for a Windows service? No, it’s not the application folder, it’s the system32 folder. By now you can probably tell where this is heading…

Basically we installed our NT Service, fired it up and absolutely nothing happened. Not only that, but the service API returned no errors. Naturally we checked and double-checked the installation and the obvious output locations but could find no reported problems. In the end I attached the Visual Studio remote debugger and watched in awe as exceptions bounced around inside the service API while it tried repeatedly, and in vain, to connect to a service running on “localhost”. With no config file to be found in the system32 folder it had silently fallen back on its defaults. No “3 strikes and you’re out” either; it just kept trying over and over again.

When you’re developing features hosted within a service you’ll rarely run it as an actual service; you’ll probably run it as a normal console application for speed. The natural side-effect of this is that the CWD will likely be the folder where the binary resides, unless you start it from another folder using a relative path. The large monolithic service I was working on was always developed in that way, as all the installation stuff had preceded me by many years. Yes, we caught it the first time we deployed it formally to a DEV system-test environment, but by then so much time had passed that the start-up shenanigans were far behind us[3][4].
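One way to take the CWD out of the equation is to resolve the config file relative to the binary rather than relative to wherever the process happens to have been started. What follows is only a minimal sketch in Python, not the product's actual mechanism: the file name `service.cfg` and both function names are made up for illustration.

```python
import os
from pathlib import Path

CONFIG_NAME = "service.cfg"  # hypothetical file name, not the real product's


def config_path(program: str, name: str = CONFIG_NAME) -> Path:
    """Return the config file path next to the program, whatever the CWD is."""
    # abspath() only consults the CWD when 'program' is relative, so call this
    # once at start-up with the path the process was launched with.
    return Path(os.path.abspath(program)).parent / name


def load_config_text(program: str) -> str:
    """Read the config file, or fail loudly and say where we looked."""
    path = config_path(program)
    if not path.is_file():
        raise FileNotFoundError(
            f"config file not found: {path} (CWD is {os.getcwd()})"
        )
    return path.read_text(encoding="utf-8")
```

Calling `load_config_text(sys.argv[0])` at start-up would then behave the same whether the process runs as a console application or as a service, and a missing file becomes an error that names the folder that was searched instead of a silent fallback to the defaults.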

My personal choice of defaults would be “nothing” for the hostname and 0 for the retries. How many systems do you know where the client and middleware run on the same machine? The out-of-the-box configuration should assume a synchronous connect call because that is how most other services that developers are used to dealing with behave, and by most I mean databases and middle-tier services. Yes, asynchrony is gaining ground, even in UIs, but as a developer you already have a hard enough time learning a new product and API without having to fight it too - I can tell you it gains you no friends in the development world. Once you’re comfortable with the API you can look at its more advanced features.
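To make those defaults concrete, here is a rough sketch of what I have in mind, written in Python rather than against the real API. Everything in it is an assumption made for illustration: the `ClientSettings` and `connect` names, the port number and the timeout are all invented. The `opener` parameter exists only so the retry logic can be exercised without a network.

```python
import socket
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ClientSettings:
    host: Optional[str] = None    # no default host: the caller must choose one
    port: int = 9000              # hypothetical port number
    max_retries: int = 0          # 0 = a single attempt, then report the failure
    timeout_seconds: float = 5.0  # hypothetical timeout for each attempt


def connect(settings: ClientSettings, opener=socket.create_connection):
    """Connect synchronously; raise once the configured attempts are used up."""
    if not settings.host:
        raise ValueError("no service host configured")
    attempts = settings.max_retries + 1
    last_error = None
    for _ in range(attempts):
        try:
            return opener(
                (settings.host, settings.port),
                timeout=settings.timeout_seconds,
            )
        except OSError as error:
            last_error = error  # remember why the last attempt failed
    raise ConnectionError(
        f"could not connect to {settings.host}:{settings.port} "
        f"after {attempts} attempt(s)"
    ) from last_error
```

With these defaults a forgotten hostname fails on the very first call, and an unreachable server produces an error the caller can act on. Anyone who really wants the retry-forever behaviour can still opt in to it explicitly.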

I’m tempted to suggest that the decision on what to use by default was made by a developer of the product, not a consumer of it. No doubt this configuration made their testing easier precisely because they ran the entire product on their development machines. That’s why dogfooding and TDD are such important ideas - they force you into your customer’s shoes. I’m a big advocate of flexible configuration mechanisms and so don’t see why you can’t adhere to The Principle of Least Surprise as well.


[1] No console or file-based logging is possibly a sensible default, but only when your API has other ways of telling you about failures. At the very least I would opt for using OutputDebugString() as a last resort so that I could fire up the wonderful DbgView tool from the Sysinternals suite. Even the trusty old Windows event log is better than nothing, just so long as you don’t spam it.
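As a sketch of that last resort (in Python via ctypes, purely for illustration), something like the following would do. The `last_resort_log` name and the fallback to stderr on non-Windows platforms are my own assumptions; only `OutputDebugStringW` is the real Win32 call.

```python
import sys


def last_resort_log(message: str, stream=None) -> str:
    """Emit a message somewhere visible and return the sink that was used."""
    if sys.platform == "win32":
        import ctypes

        # OutputDebugStringW is the Win32 call whose output DbgView captures.
        ctypes.windll.kernel32.OutputDebugStringW(message + "\n")
        return "debugger"
    # Anywhere else, stderr (or the supplied stream) is the nearest equivalent.
    out = stream if stream is not None else sys.stderr
    out.write(message + "\n")
    return "stderr"
```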

[2] After this incident we switched to using the programmatic interface and set more useful default values inside the facade we built.

[3] The first rule of debugging is to look to the most recent changes for the source of a bug. It’s not always obvious but it’s usually there, somewhere. The bug may not be directly in the change itself but be a side-effect of it; either way the recent changes should be the clue.

[4] This was supposed to be “a simple port”. What should have taken a couple of weeks turned into a couple of months and the project was eventually abandoned. Ultimately an impedance mismatch at the architecture level between our system and the service was to blame.
