Tuesday, 16 October 2012

Instrument Everything You Can Afford To

One of the decisions we made right back at the beginning of my current project was to measure everything that we could afford to. That meant every database call, I/O request, complex task etc. We all knew from past experience that it’s much harder to diagnose performance problems later if you don’t have your code telling about what the performance of your system is in your terms. Attaching performance tools to databases and measuring network traffic is much harder when you don’t really know the context within which the events are occurring. Also, without any sort of diagnostic logging when a problem does occur it will no doubt pass before you’ve had time to start monitoring it.

The main problem with this kind of code is that it is often seen to get in the way of the business logic it’s measuring. My belief is that if that’s the case then you probably need to do a bit of refactoring because your methods are way too big. In the past I’ve found that the logging and instrumentation logic usually sits in the public methods of the service class which then invokes a private method or even another class to do the actual work - a separation of concerns.

In C++ you have RAII & macros and in C# you have the Dispose pattern to help you reduce the clutter. The following are two examples of the kind of code I’m talking about from previous systems:-

ThingPtr requestThing(int id)
  DO_REMOTE_CALL(“Request Remote Thing”)
    return requestThingImpl(id);

public Thing RequestThing(int id)
  using(Scope.Measure(Instrument.WallClock |
        Instrument.Memory), “Remote Request Thing”)
    return proxy.RequestThing(id);

The __FUNCTION__ pre-processor macro in C/C++ and StackFrame class in .Net make it easy to reach out and grab the called method name without resorting to the copy-and-paste error prone techniques of the past. However, that said, I still prefer to give the task (or operation) a very short name as it allows you to refactor the implementation without disturbing the saved metrics as they should represent a slightly larger goal than just a minor implementation detail.

The tasks will often be grouped too, so I might have a “Remote Request Thing” for the proxy/client-side, “Handle Request Thing” in the service stub, and just “Request Thing” as the underlying in-process action. Using the first two I can then measure the infrastructure costs of remoting the call. Background growth in the number of connections and volumes of data have a habit of creeping up on you and so it’s important to keep a close eye on them. In fact for the past two weeks myself and a colleague have been looking very closely at the instrumentation data around talking to a 3rd party grid. From the data we can see ourselves occasionally waiting 30 secs for the results to a calculation that only took 50 secs in the first place. Ideally we want the the size of the computation to dwarf any infrastructure costs and so we need to look into this further. However this problem is compounded by the fact that the payload is being optimised by another team to overcome a different set of performance problems - so we need these metrics to ensure the run-of-the-mill case doesn’t suddenly become sub-optimal by mistake.

Many tasks are treated as singular in nature, i.e. they just are what they are. When it comes to dealing with reading/writing large collections though I’ve found it useful to attach extra metrics that help quantify the size of the inputs and/or outputs. For example when extracting a trade portfolio for a customer you might attach the trade count as an “external measurement” so you can get a rough per-trade cost for retrieval:-

public List<Trade> RequestPortfolio(int customerId)

  using(Scope.Measure(Instrument.WallClock |
        Instrument.Memory), “Fetch Portfolio”) 
    var portfolio = RequestPortfolioImpl(customerId); 

    Scope.Attach(“Trade:Count”, portfolio.Count); 

    return portfolio; 

This does start to make the code look unwieldy but at least it all remains in the method stub whilst the actual business logic is held cleanly in a separate method or class. All of this extra jiggery-pokery should also only have minimal impact on the performance because the calls across the network or disk accesses will likely be orders of magnitude greater than starting/stopping a timer and calculating the difference. As always you should let the profiler guide you if you think you might be instrumenting tasks that are too fine-grained. In principle, at the level of code I’m generally working, the resolution of the standard clock should be enough to measure what I’m doing. Any more than that I would start to think a bit more carefully about both the granularity of the task and the potential volume of data I’d be producing.

I’m a big fan of text log files[1] that you can pull apart with the usual array of off-the-shelf tools. Many years ago I discovered Microsoft’s LogParser whilst I was involved in some IIS based work. Since then it has been my tool of choice for extracting instrumentation data from our log files because of its SQL-like syntax. It’s getting pretty old these days but still has excellent performance which is extremely useful when you’re trying to pull ad-hoc data from a plethora of grid nodes.

My current system uses the following log file format[2]:-

<datetime> <pid> <tid> <severity> <message>

We have a special message severity for instrumentation data called “PRF” just to make it a little easier to extract. It also has the benefit of allowing you to turn down the highlighting for these messages when you’re tailing a log file. The <message> itself is then split into three sections:-

<task> <metrics> <context>

The task name, as I said above, is a short name without any context that describes a high-level goal. The metrics are a classic list of key/value pairs with the “:” character used as a simple namespace separator (ala XML). Finally the context is a similar set of key/value pairs that describes the instance specific details of the task. Putting this all together a very simple example instrumentation message might be (allowing for word-wrap):-

<datetime, etc.> PRF [Fetch Portfolio] [WallClock:Duration=00:05:00;Trade:Count=100] [RequestId=1234;RemoteHost=ServerXxx]

For our grid-based tasks there can be upwards of a dozen or more metrics covering CPU usage, wall clock time, memory consumption and payload sizes covering both the client and service sides. Although we can get some of this data from the grid vendor’s own reporting database and logs it’s much easier to deal with it when it’s already in-line as you don’t have to try and match it up.

As an aside the salient attributes of the context travel across the wire on remote calls to help stitch multiple threads together. We call his salient context “causality” and it’s based on a similar technique in DCOM along with the Diagnostic Context pattern from PLOPD vol 3. But that’s for another time…


[1] I once made the mistake of going with a binary log file format in the hope that it would buy better performance and allows us to write lots more system-level data like the current “tick count”, etc. by default. It may have been more performant, but we then ended up writing our own tools to manipulate the files. I totally underestimated the power you get from a simple column aligned text log file with some core fields.

[2] I’ve stuck with this format (or something very similar) ever since my disappointment with the binary log file above as it just seems perfect by default. The four leading columns are all fixed-width which makes it very easy on the eye when tailing the file. My choice of GUI based tail program (with highlighting, natch) for many years has been BareTail. I also use their GUI grep tool BareGrep too as it also supports highlighting. For automation it’s the usual array of sed, awk, egrep, PowerShell, etc. although my colleague Tim Barras recently highlighted that “cut” was an important omission from my toolbox.

No comments:

Post a Comment