Tuesday 20 December 2011

Diagnostic & Support User Interfaces

When people talk about user interfaces they are invariably talking about the face of a product or system that the end-users see and/or interact with. But within a complex system there are other user interfaces too that are not seen or interacted with by end-users at all; instead it is fellow developers and, more importantly, support staff that get to see this less salubrious side.

In a small, well-established team of developers there will naturally be a lot of cross-fertilisation, meaning that each of them will be well versed in significant portions of the system. In contrast, a larger team will have skills spread across different disciplines, and so any individual’s knowledge of the system as a whole starts to approach that of the support team - far less. In my experience support teams do not get the luxury of devoting themselves to learning one system inside out; instead they invariably have to support a whole bunch of in-house systems written by different teams, using different technologies and with different ideas about what is required to make a product or system “supportable”.

So what kinds of interfaces am I referring to? Well, the most traditional are status monitoring pages and custom admin tools - the kind of utilities built specifically for the purpose of administering and supporting the system. Due to the time constraints imposed and the lack of direct business value they are usually not given any real TLC but are just thrown together, presumably under the misguided assumption that they will never really be used by anyone other than their authors or close colleagues. But these scripts and command-line tools are not the only faces of the system you’ll see when doing support duties; before that you’ll probably be faced with hunting through the custom-format log files, which may or may not be easily consumable. Stretching the analogy somewhat further, you could encompass configuration files and the database schema as interfaces of sorts. They’re not graphical in themselves, but careful thought about their structure can ensure that off-the-shelf tools stand a better chance of making interaction with them a productive experience.

Anyway, here is my list of things to think about. I’ve not done any “production” UI programming for a few years and it’s been even longer since I read About Face (the 1st edition no less) so I’ll no doubt contradict some modern wisdom along the way. Your target audience may not be novice computer users, but they do have the clock against them and they won’t thank you for making an already difficult situation any harder.

Command-Line Tools

Server-side developers aren’t renowned for their user interface programming. After all, that’s why you have UX designers and developers whose major skills lie in the area of UI programming. But that doesn’t mean we have to be sloppy about the interfaces for the scripts and tools we do create. I’m not suggesting for a moment that we spend hours gold-plating our utilities, but I am proposing that you adhere to some of the common conventions that have been established over the years:-

Provide a help switch (-?, -h, --help) so that the user can see what the options are. Documentation, if you can even find it for a utility, can easily get out of step with the tool’s usage, so you need to be able to see which switch it is that you’re apparently missing, or which one is no longer relevant for the task at hand. Even if you’re au fait with the tool you might just want to double-check its usage before going ahead and wiping out last year’s data.

Provide a version switch (-v, --version) so the user can check its compatibility, or if it’s become out of date. Many tools are installed on the application servers or alongside client binaries, but sometimes those tools are just “passed around” and then become available on some desktop share without you realising. You can even dump the version out on every invocation as the first line; this can be particularly useful if you’re capturing stdout when running it as a scheduled task.
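
Taking those last two points together, here is a minimal sketch in C++ (the CustomerAdmin tool and its version number are hypothetical, not a real utility): both switches are recognised before any real work starts, and the version is echoed on every invocation:-

// Minimal sketch only: recognise the help and version switches before doing
// any real work, and echo the version on every invocation.
#include <cstdio>
#include <cstring>

static const char* VERSION = "1.2.3.4";   // hypothetical version number

int main(int argc, char* argv[])
{
    std::printf("CustomerAdmin v%s\n", VERSION);

    for (int i = 1; i != argc; ++i)
    {
        if (!std::strcmp(argv[i], "-?") || !std::strcmp(argv[i], "-h")
            || !std::strcmp(argv[i], "--help"))
        {
            std::printf("Usage: CustomerAdmin <command> [--switch-name <value>]...\n");
            return 0;
        }

        if (!std::strcmp(argv[i], "-v") || !std::strcmp(argv[i], "--version"))
            return 0;   // version already written above
    }

    // ... the real work would go here ...
    return 0;
}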

Provide long switch names (--switch-name) by default and use them in documentation so it is obvious what the example command is doing. Short switch names are for advanced users and so should be considered the optional case. Positional parameters may save the user some typing but when you use concepts like integer based surrogate keys it’s not exactly obvious what an example may be suggesting:-

C:\> DC 1234 2001-02-03 1

compared to:-

C:\> CustomerAdmin delete --customer-id 1234
  --date 2001-02-03 --force

Use a consistent switch format. I’ve had to use utilities where one developer has made the switch names case-sensitive and another case-insensitive. Sometimes they use ‘/’ as the switch marker and other times it’s a ‘-’. Often the case-sensitivity issue is down to lazy programming - using == or Equals() instead of the correct method. If you’re on Windows try and support both styles as even Microsoft’s developers aren’t consistent. Better yet download a 3rd party library that lets you use either style as there are plenty to choose from[#].
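
As a sketch of what a more forgiving comparison might look like in C++ (isSwitch and equalNoCase are hypothetical helpers, not from any particular library), compare switch names case-insensitively and accept either marker style:-

#include <algorithm>
#include <cctype>
#include <string>

bool equalNoCase(char lhs, char rhs)
{
    return std::tolower(static_cast<unsigned char>(lhs))
        == std::tolower(static_cast<unsigned char>(rhs));
}

// Treats "/force", "-force" and "--force" as the same switch, ignoring case.
bool isSwitch(std::string arg, const std::string& name)
{
    arg.erase(0, arg.find_first_not_of("-/"));   // strip the switch marker(s)

    return arg.size() == name.size()
        && std::equal(arg.begin(), arg.end(), name.begin(), equalNoCase);
}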

Return a meaningful result code. The convention (on Windows) is that zero means success and non-zero means failure or partial success. Some tools do provide more detailed result codes, but fundamentally you should assume that anyone trying to automate your tool on Windows is going to write something along these lines as a first-order approximation:-

CustomerAdmin.exe --do-something-funky
if errorlevel 1 exit /b 1

Sadly, 0 is also the default result code that most frameworks will return for an unhandled exception which really makes the whole process rather messy and means that you need to get a Big Outer Try Block in place in main() as part of your bootstrap code.
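
A sketch of that bootstrap code (realMain is a hypothetical stand-in for the tool’s actual logic) might look like this:-

#include <cstdlib>
#include <exception>
#include <iostream>

int realMain(int argc, char* argv[]);   // the tool's actual logic; throws on failure

int main(int argc, char* argv[])
{
    try
    {
        return realMain(argc, argv);   // 0 on success, non-zero otherwise
    }
    catch (const std::exception& e)
    {
        std::cerr << "ERROR: " << e.what() << std::endl;
    }
    catch (...)
    {
        std::cerr << "ERROR: unhandled unknown exception" << std::endl;
    }

    return EXIT_FAILURE;   // guarantee a non-zero result code for errorlevel checks
}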

Support a read-only/test/what-if mode. Most of my support tools have grown out of test harnesses and other development tools and so I’ve naturally added switches that allow you to run them against production in a benign mode so that you can see what would happen. PowerShell has taken this idea forward formally with its common -WhatIf switch. When you’re not 100% sure of what the expected action is or how many items it will act upon this can be a life saver.
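
A sketch of the idea (the function and flag names are hypothetical): the destructive call is only made when the what-if flag is off, but the commentary is written either way:-

#include <iostream>

void removeCustomerFromDatabase(int customerId);   // the real, destructive operation

void deleteCustomer(int customerId, bool whatIf)
{
    std::cout << (whatIf ? "Would delete" : "Deleting")
              << " customer " << customerId << std::endl;

    if (!whatIf)
        removeCustomerFromDatabase(customerId);
}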

Most of the same comments apply to scripts, but they have the added advantage that they are deployed in source code form and so you have the ability to see the logic first hand. Of course when you’re in the thick of it you need to be able to trust the tool to do what it says it will.

Logging

When faced with a production issue, log files will often become the first port of call after you’ve established which process it is that has failed. But log files[+] often serve two different masters, and the latter one can get forgotten once the development work has finished. During development it is probably the developer that spends the most time poring over the output, and therefore they have a habit of dumping every last bit of detail out. After the feature is complete, though, the main consumer becomes the support team. What they don’t want to see is reams of output in which the proverbial wood is lost for the trees.

Make sure your log messages make sense. This may seem obvious, but spelling mistakes, whilst cute in a time of stability, are a distraction when you’re up to your eyes in it. Log messages should form a narrative that leads you up to the moment of interest, and that narrative should be written with the reader in mind - most likely a support engineer, someone who doesn’t know the code intimately. By all means be succinct, but ensure you’re not being ambiguous either.

Use the console output for high-level commentary, warnings and errors only. If extra output is required it should be enabled by an extra --verbose switch (which can easily be added temporarily under automation) or by having a companion log file with more detail. I tend to favour the second approach so that the detail is always there if needed but so that it’s not in your face the majority of the time. Remember we’re talking about in-house stuff here and so you have the luxury of controlling the environment in ways a 3rd party can often only dream of.

Keep to a low number of severity levels. If you have a large number of severity levels developers will not be able to log messages consistently. I recently saw a tweet that suggested you use only the following levels - FYI, WTF & OMG. If you put yourself in the support person’s shoes that’s actually a pretty good scale. Once the issue is triaged and the development team are involved the TRACE or DEBUG level output suddenly becomes of more use, but until then keep it simple.
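
A quick sketch of such a scale in code (the names here are my own choice, mapping loosely onto the FYI/WTF/OMG idea, plus a level for the developer detail):-

enum Severity
{
    Trace,     // developer-level detail, normally switched off
    Info,      // FYI - the high-level narrative of what the process is doing
    Warning,   // WTF - something odd happened, but the process carried on
    Error      // OMG - the process could not do what was asked of it
};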

Write the messages as text. A simple line-based text format may seem very old-fashioned, but there are a plethora of tools out there designed to slice-and-dice text-based output any way you want, e.g. grep, awk & sed. Plus you have modern GUI-based tail programs that support highlighting to really make those important messages shine, e.g. BareTail & LogExpert. Layer on top something like LogParser, which supports a SQL-like syntax and many formats out of the box, and you’re in good shape to triage quickly. Also don’t underestimate the importance of Notepad as a tool of last resort.

Make the log message format simple. If you’re writing a separate log file include at least the date & time, and preferably the process ID (PID). If the tool is multi-threaded then the thread ID (TID) will be invaluable as well, along with the highest precision[*] of time (e.g. milliseconds or better) to spot potential race conditions. I’ve found the following format to be very easy to scan by eye and parse by tool as the left-hand side is of fixed width (the PID & TID are padded appropriately):-

<ISO date & time> <PID> <TID> <severity> <message...>

2001-01-01 01:02:03.456 1234 3456 INF Starting...
2001-01-01 01:02:04.123 1234 5678 ERR Stuff broke...
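
Here is a sketch of one way to produce that left-hand side using just the C runtime (writeLogLine is a hypothetical helper; the milliseconds are omitted for brevity - see the footnote on timer resolution before trusting them anyway):-

#include <cstdio>
#include <ctime>

void writeLogLine(std::FILE* log, unsigned long pid, unsigned long tid,
                  const char* severity, const char* message)
{
    std::time_t now = std::time(0);
    std::tm local = *std::localtime(&now);

    char timestamp[32];
    std::strftime(timestamp, sizeof(timestamp), "%Y-%m-%d %H:%M:%S", &local);

    // Pad the PID & TID so the left-hand side stays fixed width.
    std::fprintf(log, "%s %6lu %6lu %s %s\n", timestamp, pid, tid, severity, message);
}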

Enclose string values in some form of quotes or braces to make empty strings more visible. As I posted very recently, empty strings are a form of Null Object and so they have a habit of being silently propagated around before finally causing odd behaviour. Personally I like to enclose strings in single quotes because I often have to use those values in SQL queries and so then it’s already quoted for me. As an example I once spent far longer than necessary investigating an issue because the log message said this:-

Valuing trade ...

instead of:-

Valuing trade ''...

Include all relevant context, preferably automatically. Messages that just tell you something bad happened are useless without the context in which the operation was taking place. There is nearly always some higher-level context, such as a request ID or customer name, that you can associate with the PID or TID to help you narrow down the data set. Debugging a complex or lengthy process just to recover the set of inputs, only to discover that the data is dodgy, is not a good use of time and costs the business money - especially if the relevant variables were in scope at the point of writing the error message![^]
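
As a sketch (the function and variable names are hypothetical, echoing the war story in the footnote below), including the values already in scope costs only a line or two:-

#include <sstream>
#include <stdexcept>
#include <string>

void correlatePairs(const std::string& firstCcy, const std::string& secondCcy)
{
    bool correlated = false;   // ... the real calculation would go here ...

    if (!correlated)
    {
        std::ostringstream message;
        message << "Failed to correlate pairs '" << firstCcy
                << "' and '" << secondCcy << "'";
        throw std::runtime_error(message.str());
    }
}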

Don’t spam the log, write a summary. If you have a process that deals with particularly dirty data, or a service that is temperamental, try not to log every single indiscretion as a warning or error. Instead try to batch them up, perhaps by item count or time window, and write a single message that gives a summary of the issue. At a high level you’ll just be interested in seeing whether the problem is getting worse, so applying a tolerance is even better: once you have an idea of what "normal operation" looks like you can save the scary messages for when they’re really needed.
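
Here is a sketch of the batching idea (the class, the reporting interval and the tolerance are all illustrative, not from a real system):-

#include <cstddef>
#include <iostream>

class RejectedRowCounter
{
public:
    RejectedRowCounter() : m_rejected(0), m_total(0) { }

    void onRow(bool rejected)
    {
        ++m_total;
        if (rejected)
            ++m_rejected;

        if ((m_total % REPORT_EVERY) == 0)
            report();
    }

private:
    void report()
    {
        const double tolerancePercent = 1.0;   // what "normal operation" looks like
        const double rate = (100.0 * m_rejected) / m_total;

        // Only shout when the rejection rate strays outside normal operation.
        if (rate > tolerancePercent)
            std::cout << "WRN Rejected " << m_rejected << " of " << m_total
                      << " rows (" << rate << "%)" << std::endl;
    }

    static const std::size_t REPORT_EVERY = 10000;

    std::size_t m_rejected;
    std::size_t m_total;
};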

One of the reviews I do during system testing is to look at the log output. I try to put myself in the position of the support engineer and read the log messages to see if they make sense to me (I often have only a vague appreciation of the process in question). The thing I’m most looking out for is log spam, which is usually solved by downgrading messages from, say, INFO to TRACE so that the main output is all about the big picture. It should contain virtually no warnings or errors, to avoid support staff becoming desensitised by noise, and when they are present they should hopefully be backed by more detailed diagnostics somewhere else.

 

[*] Don’t be fooled into thinking that because a datetime value can be formatted with a milliseconds part it is actually that accurate. IIRC 16-bit Windows used to have a resolution of 50 ms whilst the NT lineage was better at 10 ms. If you wanted better accuracy than that you needed to resort to the Performance Counters API.

[#] My own free C++ Core library has a reasonably decent parser should you want to tap into the 3rd party option.

[+] I have mixed sentiments about the use of log files. On the one hand I feel they should largely be unnecessary, because we should be able to anticipate, test for and handle most types of failure gracefully. But I’m also aware that in-house systems rarely have that level of robustness as a requirement, and so we’re often forced into using them as a defensive measure against inadequate test environments and the other in-house and external systems we depend on.

[^] I clearly remember spending a few hours at the weekend debugging a financial process that was failing with the message “Failed to correlate pairs”. What annoyed me most was that the developer had the two failing currency names in variables in scope but had not bothered to include them in the message! I fixed the error message first thing Monday morning...

1 comment:

  1. An interesting insight on a real cost and quality issue.

    I guess exception and error logging needs better planning and design from the outset. It often gets 'beefed up' when issues emerge during the development process.

    Perhaps formally incorporating this aspect into testing would be the 'sanity check' that could avoid the 'Valuing trade ...' and 'Failed to correlate pairs' type of gaffes. Or at least some code review of trace during development?

    It might be achieved if the developers had a range of functions designed to anticipate and handle the application's error and warning conditions. An abstraction layer of some sort would provide more flexibility in tweaking the actual mechanics and semantics of the logging etc.

    Maybe a formal approach would be too rigid and a moving target in any case, perhaps a more generic XML approach to logging would allow enough information to be logged using something like an XDocument object which would allow later analysis to use Linq to XML querying to the degree of focus required (assuming a .NET platform). The API approach could be substituted with formal guidance and a dictionary of well known property types to log by convention.

    Of course once this type of standard is in place across the organisation, a GUI to help view and interrogate the output would be a good investment and interesting to develop. Presumably by your department!
