Friday, 22 July 2011

Every Solution Starts With “FOR /F”

Our team has recently grown and as part of that process I’ve been showing how I’ve tended to do some of the sysadmin type stuff, such as deployment, managing log files[*], checking system health etc. For a brief period it felt as if the answer to every question started with “FOR /F” - the Swiss Army Knife of Windows batch file programming. One week it also ended up being the answer to two questions via Twitter!

Given that I work on Distributed Systems, is that such a shock? After all, whatever I want to do, I’m likely to want to do the same thing to every server in the farm, and that implies the use of some looping construct - either to generate a batch file with the same command repeated multiple times:-

@echo off
psexec \\SERVER-1 cmd.exe /c . . .
psexec \\SERVER-2 cmd.exe /c . . .
psexec \\SERVER-3 cmd.exe /c . . .
. . .

Or just one command executed multiple times, controlled by a simple list:-

for /f %h in (machines.txt) do @psexec \\%h cmd.exe /c . . .

This latter variant requires a text file (called machines.txt) with one server hostname per line like so:-

SERVER-1
SERVER-2
SERVER-3

Of course rather than crafting one on-the-fly every time I have a number of files, one called DEV-machines.txt, another called UAT-machines.txt and finally a PROD-machines.txt which are pre-configured.

Deleting Folders

Whilst PSEXEC pretty much makes an appearance in every one-liner I write, usually to invoke a command on a remote host, the simpler task of cleaning up folders in a non-sequential way doesn’t feature it for once. Sometimes you’d like a little more progress than a straight “RMDIR /S /Q” will give you (i.e. none) and so a sprinkling of FOR /D allows you to iterate a bunch of folders and output their name before recursively deleting them:-

for /d %d in (folder\2010_??_??) do @echo %d & rmdir /s /q %d

A little extra parallelism can then also easily be achieved by firing up a few more command prompts with START (which will default to the CWD of the parent) and splitting up the work:-

Prompt-1> for /d %d in (folder\2010_01_*) do @echo %d & rmdir /s /q %d

Prompt-2> for /d %d in (folder\2010_02_*) do @echo %d & rmdir /s /q %d

For those that have successfully avoided batch file programming, the “&” separates commands so that both always run. A “&&” ensures the second is only executed if the first succeeded, whilst a “||” does the opposite and executes the second only if the first fails (a single “|” is the usual pipe and something else entirely).
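A quick way to see the difference at the prompt (the folder name is just a throwaway example):-

mkdir c:\temp\example & echo this always runs
mkdir c:\temp\example && echo this only runs if the folder was created
mkdir c:\temp\example || echo this only runs if MKDIR failed, e.g. the folder already exists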

Reading Variables

Probably one of the most unintuitive aspects of Windows batch file programming is creating variables from data in a file. For example, our current Heath Robinson overnight batch scheduler uses a combination of the Windows Task Scheduler and batch files to sequence the overnight jobs[#]. The “batch date” is stored in a text file (BatchDate.txt) like so:-

2011-01-01

To read this into a variable so that it can then be passed onto the child scripts and processes requires the following:-

for /f %v in (\\dfs\share\BatchDate.txt) do @set BatchDate=%v

Who would have thought you’d need a loop to read a single value! If you want to store it as a “key=value” pair instead you need a little more magic to parse the line:-

for /f "delims== tokens=1,2" %i in (c:\temp\test.txt) do @set %i=%j

There are various options that you can pass to FOR to control the parsing, such as “eol=;” if you want to allow comments in your input file like so:-

D-SERVER-1
D-SERVER-2
; NB: Server being repaired
; D-SERVER-3
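
The loop that consumes a file like this then just needs the extra option (as it happens the semicolon is also the default end-of-line character, so it’s shown here purely for illustration):-

for /f "eol=;" %h in (DEV-machines.txt) do @psexec \\%h cmd.exe /c . . .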

PowerShell Integration

PowerShell is clearly a better tool for doing any serious scripting under Windows, but it suffers one major drawback when you’re using it, as we are, to sequence a set of operations - a PowerShell script can’t modify the environment variables of the calling process. If a PowerShell script forms the root process this isn’t a problem, but when old school batch files are leading the way it’s harder work. If it wasn’t for the annoying problems PowerShell has with doing simple stdout redirection[$] it would be a no-brainer to switch.

Anyway, going back to our previous example of reading in variables, you can invoke a PowerShell one-liner and read the output into a variable, as I showed in this StackOverflow answer to get yesterday’s date:-

for /f "usebackq" %i in (`PowerShell ^(get-date^).adddays^(-1^).tostring^('yyyy-MM-dd'^)`) do set Yesterday=%i

However there is a fair bit of ugliness in this technique: you need to escape the parentheses with the ^ (hat) character, and to allow the date format string to be passed through correctly (it needs to be in either single or double quotes) you need to enclose the command with the ` (backtick) character instead of the usual ‘ (apostrophe) - hence the “usebackq” option.

You Can Remember This Stuff?

Even after all this time I still can’t remember half the switches and options that you can use with “FOR” so it’s lucky that I do remember this command to bring up the manual:-

help for

Bizarrely enough it also lists the set of modifiers you can use on the loop variables to munge paths, such as getting the parent folder, file extension etc.
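
For example, here are a couple of sketches at the prompt, using the Windows folder purely because it’s guaranteed to exist (remember to double the % signs if you put these in a batch file):-

rem Print just the folder (drive and path) of each executable
for %i in (C:\Windows\*.exe) do @echo %~dpi

rem Print just the file name and extension
for %i in (C:\Windows\*.exe) do @echo %~nxi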

 

[*] This should of course be automated, but with limited resources it’s been well down the priority list.

[#] See [*] for the reasons why this came about :-)

[$] I already have a post queued up on this as I think it’s one of the most annoying things you discover when trying to switch from using batch files to PowerShell scripts.

Thursday, 7 July 2011

Recovering From Unknown Exceptions

In my last post I talked about using the Execute Around Method pattern to implement a common Big Outer Try Block to handle exceptions that propagate up as far as the service handler. But what do you do with them? Can you do anything with them? After all, if they’re unhandled you clearly never expected them so you need to recover from something you never anticipated...

Recoverability

The point of an exception handler is to allow you to recover from an exceptional condition. Now, there is much debate about what “exceptional” means but in this post I’m talking about scenarios for which no recovery has been attempted. That could be due to lack of knowledge about how to recover or, more likely, because the scenario never came up during testing.

There are really only two ways to behave after catching an unhandled exception - continue or terminate yourself. The former is often classified as the “best effort” approach, while the latter is described as “fail fast”. These are often seen as two opposing views, but in reality they are both valid approaches so long as you can decide how to loosely classify the error to ensure that the stability of the service isn’t compromised as a result of an unanticipated failure.

Systemic vs Domain Exceptions

What we are concerned with at the service handler entry point is trying to decide if the exception we caught indicates that the service has become unstable and will therefore cause problems with subsequent requests or if there will be no residual effects and we can carry on as normal (perhaps after logging a message to notify someone etc). For example a divide by zero exception is not likely to indicate that the process is stuffed whereas an access violation probably is[*].

We can split these two types of errors into two broad categories - Systemic and Domain. The former implies that there is something technically wrong with the process while the latter implies that there was something wrong with the request. We can then create two exception hierarchies rooted on them (ignoring the ultimate base such as System.Exception in .Net) - ServerException and DomainException. Then our Big Outer Try Block will look something like this:-

try
{
  . . .
}
catch (ServerException e)
{
  // Process borked - shutdown service
}
catch (DomainException e)
{
  // Dodgy request - return error
}

3rd Party Exceptions

Of course .Net and the rest of the world won’t know about our exception hierarchy and so we’ll need to categorise any known 3rd party exceptions into the Systemic or Domain groups and then extend the list of catch handlers appropriately. Taking a leaf out of my last post you can also encapsulate the exception handlers into a separate method that will catch and translate so you reduce the amount of cutting-and-pasting:-

public void TranslateCommonErrors(Action method)
{
  try
  {
    method();
  }
  catch (OutOfMemoryException e)
  {
    throw new ServerException(e);
  }
  . . .
  catch (ArgumentOutOfRangeException e)
  {
    throw new DomainException(e);
  }
  . . .
}

We can use nested exceptions to ensure the caller has a chance to react appropriately. One of the first extension methods I (and no doubt many others) wrote was to flatten an exception hierarchy so it could be logged in all its glory or passed across the wire.
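
By way of illustration, a minimal sketch of that kind of extension method might look something like this (the name and message format are just my choices, not a definitive implementation):-

using System;
using System.Collections.Generic;

public static class ExceptionExtensions
{
  // Walk the InnerException chain and flatten the messages into a
  // single string suitable for logging or sending across the wire.
  public static string FlattenMessages(this Exception top)
  {
    var messages = new List<string>();

    for (var e = top; e != null; e = e.InnerException)
      messages.Add(String.Format("[{0}] {1}", e.GetType().Name, e.Message));

    return String.Join(" --> ", messages.ToArray());
  }
}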

Remote Services

If we’re saying that a ServerException signifies the unstable nature of the service process then we need to translate that when it crosses a remote boundary. Because exception types must be serializable (at least with .Net & WCF) to cross the wire, and we will be catching everything, including 3rd party exceptions which may not be serializable, we need to marshal the exception chain ourselves.

So, for remote service calls I like to have a mirror pair of exceptions - RemoteServerException and RemoteDomainException. These act as both the container for the marshalled exception chain and more importantly a signal to the client that the service is unstable. This gives the client a chance to perform recovery such as retrying the same request via a different server:-

while (!serverList.Empty())
{
  try
  {
    m_connection.DoTrickyStuff(parameters);
    break; // Success - drop out of the retry loop
  }
  catch (RemoteServerException e)
  {
    // Out of servers!
    if (serverList.Empty())
      throw new ServerException("No servers left", e);

    // Server borked - try another one
    m_connection.Close();
    m_connection.Open(serverList.NextServer());
  }
  catch (RemoteDomainException e)
  {
    // Request stuffed - never gonna work...
    throw new DomainException(e);
  }
}

Another approach could be to immediately translate the RemoteServerException into a DomainException and throw that, because an unstable remote service does not imply any instability within the client. However, as I mentioned last time, you need to be careful here because a technical error can grow from being a local problem on one server to a bigger one affecting the entire system when all the load balancing and retry logic starts kicking in for many failed requests.

If there is one thing the recent Amazon outage shows it’s that even with lots of smart people and decent testing you still can’t predict how a complex system is going to behave under duress.

The Lazy Approach to Exception Discovery

The eagle-eyed will have noticed that I’ve managed to avoid writing one particular catch block - the ‘catch all’ handler (catch(...) in C++ and catch(System.Exception) in C#). This is the ultimate handler and it’s going to handle The Unknown Unknown, so what do you do? Experience has probably taught you what you would consider a systemic error (e.g. Out of Memory) and so you’ll likely have those bases already covered leaving you to conclude that you can apply the “best effort” rule for everything else. And that is the approach I’ve taken.
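
So the earlier Big Outer Try Block simply gains a final catch-all that errs on the side of carrying on. Roughly speaking (the handler bodies are only indicative):-

try
{
  . . .
}
catch (ServerException e)
{
  // Known to be systemic - shut the service down
}
catch (DomainException e)
{
  // Known to be benign - fail just this request
}
catch (Exception e)
{
  // The Unknown Unknown - log the full chain, fail the request,
  // but apply "best effort" and keep the service running
}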

In theory (given adequate documentation[#]) for every line of code you write you should be able to determine what the set of exceptions are and come up with a plan for recovering from all of them. Or deciding not to recover. Either way you can make an informed choice about what you’re going to do. But quite frankly who actually has the time to do this (except maybe for programmers writing software where lives are at stake)? What is more likely is that you’ll know the general pitfalls of what you’re invoking, such as when dealing with file-systems, networks or databases, and you’ll explicitly test those scenarios up-front but leave out the crystal-ball gazing. I think the effort is better spent ensuring the integration test environment runs continuously and is suitably enabled for injecting plausible faults than trying to second guess what is going to go bang under any given scenario.

This lazy approach also serves a secondary purpose and that is to see if your diagnostic tooling is up to the job when the time comes to use it. Have you just logged some simple message like “it failed” or do you have a message with the entire chain of exceptions, and do they tell you enough? Did you remember to log the stack trace or create a memory dump? Did the event register on your “system console” so you get a heads-up before it goes viral?

When a new scenario does come up that you realise you can recover from more gracefully, that’s the time you’ll be glad you spent the effort mocking the underlying file-system, database and networking APIs so that you can write an automated test for it.

 

[*] Although I’d argue that an attempt to de-reference a NULL pointer probably is recoverable because it is the most common default initialisation state. In my experience access violations involving NULL pointers have nearly always been due to some logic error and not an indication of a wayward service that has started doing random stuff that eventually ended up with a NULL pointer. On the contrary, an access violation at an arbitrary memory location nearly always signals that something pretty bad has gone on beforehand.

[#] Exception specifications don’t appear to have alleviated that.

Monday, 4 July 2011

Execute Around Method - The Subsystem Boundary Workhorse

I first came across the pattern Execute Around Method at an ACCU London talk by Paul Grenyer. The talk was about factoring out all the boilerplate code you need when dealing with databases in Java at the lower level. The concept revolves around the ubiquitous Extra Level of Indirection[*], with you passing either an object interface (Java/C++) or a free/member function (C++) into another function so that the callee can perform all the boilerplate stuff on your behalf whilst your method gets on with focusing on the real value. The original article, like Paul’s talk, looks at resource management but many see it as a more general purpose pattern than that. There is also some correlation with the classic Gang of Four pattern Template Method, but without the use of inheritance. In my mind they are two sides of the same coin; it just depends on whether the discussion focuses on the encapsulation of the surrounding code or the inner behaviour being invoked.

Going The Extra Mile

Here’s a C++ example from the start-up code I’ve been employing for many moons[+]:-

#include <cstdlib>    // EXIT_SUCCESS / EXIT_FAILURE
#include <exception>  // std::exception

typedef int (*EntryPoint)(int argc, const char* argv[]);

int bootstrap(EntryPoint fn, int argc, const char* argv[])
{
  try
  {
    return fn(argc, argv);
  }
  catch (const std::exception& e)
  {
    // Log it...
  }
  . . .
  catch (...)
  {
    // Log it...
  }

  return EXIT_FAILURE;
}

int applicationMain(int argc, const char* argv[]);

int main(int argc, const char* argv[])
{
  return bootstrap(applicationMain, argc, argv);
}

int applicationMain(int argc, const char* argv[])
{
  // Parse command line, run app etc.
  . . .
  return EXIT_SUCCESS;
}

The physical process entry point (main) invokes the logical process entry point (applicationMain) via the start-up code (bootstrap). This leaves applicationMain() to do application specific stuff whilst leaving the grunge work of dealing with unhandled exceptions and other common low-level initialisation to bootstrap().

Naturally the advent of Boost means that you can adapt the code above to pass member functions or free functions with the same construct (Boost::Function). You could also do it with interfaces and the Command pattern but it’s even more verbose. Either way the extra level of indirection is still in your face.

A Brave New World

Now that I’m firmly entrenched in the world of C# I have the opportunity to write exactly the same code once more. And right back at the beginning that’s exactly what I did. Of course they’re called Delegates in C# and not Function Pointers, but the extra level of indirection was still there... Until I realised that I was finally working with a language that supported Closures. So now I can eschew the extra method and do it all inline:-

public abstract class ConsoleApplication
{
  protected static int bootstrap(Func<string[], int> main,
                                 string[] args)
  {
    try
    {
      return main(args);
    }
    catch
    . . .
    return ExitFailure;
  }
  public const int ExitSuccess = 0;
  public const int ExitFailure = 1;
}

public class Program : ConsoleApplication
{
  public static int Main(string[] args)
  {
    return bootstrap(arguments =>
    {

      // Parse command line, run app etc. 
      . . .
      return ExitSuccess;
    }, args);
  }
}

The use of a closure with Execute Around Method has made many of those ugly parts of the system (most notably the subsystem boundaries) far more readable and maintainable. What follows are the scenarios where I find it’s been put to constant use…

SQL execution

On the face of it executing a SQL query is quite trivial, but when trying to write a robust and scalable system you soon start to deal with many thorny issues such as transient errors and performance problems. These are the behaviours I tend to wrap around SQL queries:-

  • Instrument the query execution time
  • Drop and reconnect the connection on a cluster failover
  • Catch and retry when a timeout (or other recoverable transient error) occurs

Dealing with transient errors is particularly tricky because you can make matters worse if you keep retrying a query when the database is already under heavy load. It’s also important to ensure you’re dealing with the error that you think you are by catching specific exception types; SQL exceptions come through as one exception type and so you also need to inspect its properties carefully.
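
To give a feel for the shape of such a wrapper, here is a rough sketch. It assumes the usual System, System.Data.SqlClient and System.Diagnostics usings, and the IsTransient, Reconnect and Log helpers plus the retry policy are simplified assumptions rather than the production code:-

public T ExecuteQuery<T>(string description, Func<SqlConnection, T> query)
{
  const int maxAttempts = 3;

  for (int attempt = 1; ; ++attempt)
  {
    var timer = Stopwatch.StartNew();

    try
    {
      return query(m_connection);
    }
    catch (SqlException e)
    {
      // Only retry errors we believe to be transient (e.g. a timeout
      // or cluster failover), and only a limited number of times.
      if (!IsTransient(e) || attempt == maxAttempts)
        throw;

      Reconnect();
    }
    finally
    {
      // Instrument every attempt, successful or not
      Log("{0} took {1}ms (attempt {2})", description,
          timer.ElapsedMilliseconds, attempt);
    }
  }
}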

Remote Service Stub

Working back up the call stack we end up at the server-side stub to the remote service call. The caller of that method will probably be some framework code (e.g. WCF) and so it’s our last chance saloon to handle anything. Once again duties are in the realms of performance and error recovery:-

  • Instrument the remote call
  • Use a Big Outer Try Block to catch all exceptions and handle as appropriate

As a rule of thumb you shouldn’t let exceptions propagate across a “module boundary” because you don’t know what the caller will be able to do with them. In practice this is aimed at COM components, which can be written in, say, C++, while the application may be written in the “exception-less” C. It applies to services too because anything you pass (or throw) back has to be marshalled, and if that can’t be done then you’ll get a significantly less helpful error to work with.

The Big Outer Try Block in the remote stub is also the point where you get to decide what “best effort” means. It may mean anything from “catch everything and do nothing” to “catch nothing and let everything through” but I would suggest you pick some middle ground. You want to avoid something trivial like an invalid argument causing the server to fail whilst at the same time not letting a systemic issue such as an access violation cause the server to enter a toxic state. If possible I like systemic issues to cause the server to take itself down ASAP to avoid causing more harm, but sometimes the cure can be worse than the disease[$].

Remote Service Proxy

Stepping back once more across the wire we come to the client-side invocation of a remote request. This is really just a generalised version of the SQL execution scenario but with one extra facility available to us:-

  • Instrument the request invocation time
  • Drop and reconnect to the service on a transport failure
  • Catch and retry when a timeout (or other recoverable transient error) occurs

One obvious question might be why you would instrument the service request from both the client proxy and the service stub. Well, the difference tells you how much overhead there is in the transport layer. When talking using a low-level transport like a point-to-point socket there’s not much code in-between yours, but once you start using non-trivial middleware like DCOM, MSMQ and 3rd party grid computing products knowing where the bottleneck is becomes much harder. However if you can measure and monitor the request and response latencies automatically you’ll be in much better shape when the inevitable bottleneck appears.

Aspect Orientated Programming?

Not that it really matters, but is Execute Around Method just an implementation vehicle for AOP? I’ve never really got my head around where AOP truly fits into the picture; it always seems like you need external tooling to make it work, and I’m not a big fan of “hidden code” as it makes it harder to comprehend what’s really going on without using the debugger. The wrappers listed above tend to have all the aspects woven together rather than as separate concerns so perhaps that’s a code smell that true AOP helps highlight. I’m not going to lose sleep over it but I am aware it’s a technique I don’t understand yet.

 

[*] And I don’t mean Phil Nash’s blog :-)

[+] You would have thought this problem would have been solved by now, but most runtimes seem to insist on letting an unhandled exception be treated as a successful invocation by the calling application!

[$] In a grid computing system you can easily create a “black hole” that sucks up and toasts your entire workload if you’re not careful. This is created by an engine process that requests work and then fails it very quickly, either because it or one of its dependent services has failed, thereby causing it to request more work much quicker than its siblings.