Thursday 19 November 2020

Planning is Inevitable

Like most programmers I’ve generally tried to steer well clear of getting involved in management duties. The trouble is that, as you get older, I think this becomes harder and harder to avoid. Once you have the mechanics of programming under control you may find you have more time to ponder the other duties that go into delivering software, not least because they begin to frustrate you.

The Price of Success

Around the turn of the millennium I was working in a small team for a small financial organisation. The management structure was flat and we had the blessing of the owner to deliver what we thought the users needed, and when. With a small but experienced team of programmers we could adapt to the ever-growing list of feature requests from our users. Much of what we were doing at the time was trying to work out how certain financial markets were being priced, so there was plenty of experimentation, which led to the writing and rewriting of the pricing engine as we learned more.

The trouble with the team being successful and managing to reproduce prices from the other, more expensive, 3rd party pricing software was that we were then able to replace it. But of course that software also had some other, less important, features that the users then decided they needed too. Being in-house and responsive to their changes just meant the backlog grew and grew and grew…

The Honeymoon Must End

While those users at the front of the queue are happy their needs are being met, you’ll end up pushing others further down the queue, and then they start asking when you’re going to get around to them. If you’re lucky the highs from the wins can outweigh the lows from those you have to disappoint.

The trouble for me was that I didn’t like having to keep disappointing people by telling them they weren’t even on the horizon, let alone next on the list. The team was doing well at delivering features and reacting to change but we effectively had no idea where we stood in terms of delivering all those other features that weren’t being worked on.

MS Project Crash Course

The company had one of those MSDN Universal licences which included a ton of other Microsoft software that we never used, including Microsoft Project. I had a vague idea of how to use it after seeing some plans produced by previous project managers, and set about ploughing through our “backlog” [1], estimating every request with a wild guess. I then added the five of us programmers in the team as the “resources” [2] and got the tool to help distribute the work amongst ourselves as best it could.

I don’t remember how long this took but I suspect it was spread over a few days while I did other stuff. At the end, though, I had a lovely Gantt chart that told us everything we needed to know – we had far too much work and not enough people to do it in any meaningful timeframe. If I remember correctly we had something like a year’s worth of work even if nothing else was ever added to the “TODO list”, which of course is ridiculous – software is never done until it’s decommissioned.

For a brief moment I almost felt compelled to publish the plan and even try and keep it up-to-date; after all I’d spent all that effort creating it, why wouldn’t I? Fortunately I fairly quickly realised that the true value in the plan was knowing that we had too much work and therefore something had to change. Maybe we needed more people, whether that was actual programmers or some form of manager to streamline the workload. Or maybe we just needed to accept the reality that some stuff was never going to get done and we should ditch it. Product backlogs are like the garage or attic where “stuff” just ends up, forgotten about but taking up space in the faint hope that one day it’ll be useful.

Saying No

The truth was uncomfortable and I remember it led to some very awkward conversations between the development team and the users for a while [3]. There is only so long that you can keep telling people “it’s on the list” and “we’ll get to it eventually” before their patience wears out. It was unfair to string people along when we pretty much knew in our hearts we’d likely never have the time to accommodate them, but being the eternal optimists we hoped for the best all the same.

During that period of turmoil having the plan was a useful aid because it allowed us to have those awkward conversations about what happens if we take on new work. Long before we knew anything about “agility” we were doing our best to respond to change but didn’t really know how to handle the conflict caused by competing choices. There was definitely an element of “he who shouts loudest” that had a bearing on what made its way to the top of the pile rather than a quantitative approach to prioritisation.

Even today, some 20 years on, it’s hard to convince teams to throw away old backlog items on the premise that if they are important enough they’ll bubble up again. Every time I see an issue on GitHub that has been automatically closed because of inactivity it makes me a little bit sad, but I know it’s for the best; you simply cannot have a never-ending list of bugs and features – at some point you just have to let go of the past.

On the flipside, while I began to appreciate the futility of tracking so much work, I also think going through the backlog and producing a plan made me more tolerant of estimates. Being the person in the awkward situation of trying to manage someone’s expectations has given me a glimpse of what questions people are trying to answer by creating their own plans, and how our schedule might knock on to theirs. I’m in no way saying that I’d gladly sit through sessions of planning poker simply for someone to update some arbitrary project plan because it’s expected of the team, but I do feel more confident asking what decisions are likely to be affected by the information I’m being asked to provide.

Self-Organising Teams

Naturally I’d have preferred someone else to be the one to start thinking about the feature list and work out how we were going to organise ourselves to deal with the deluge of work, but that’s the beauty of a self-organising team. In a solid team people will pick up stuff that needs doing, even if it isn’t the most glamorous task, because ultimately what they want to see is the team succeed [4] – then they get to be part of that shared success.

 

[1] B.O.R.I.S (aka Back Office Request Information System) was a simple bug tracking database written with Microsoft Access. I’m not proud of it but it worked for our small team in the early days :o).

[2] Yes, the air quotes are for irony :o).

[3] A downside of being close to the customer is that you feel their pain. (This is of course a good thing from a process point of view because you can factor this into your planning.)

[4] See “Afterwood – The Centre Half” for more thoughts on the kind of role I seem to end up carving out for myself in a team.

Monday 16 November 2020

Pair Programming Interviews

Let’s be honest, hiring people is hard and there are no perfect approaches. However it feels somewhat logical that if you’re hiring someone who will spend a significant amount of their time solving problems by writing software, then you should probably at least try and validate that they are up to the task. That doesn’t mean you don’t also look for ways to assess their suitability for the other aspects of software development that don’t involve programming, only that being able to solve a problem with code will encompass a fair part of what they’ll be doing on a day-to-day basis [1].

Early Computer Based Tests

The first time I was ever asked to write code on a computer as part of an interview was way back in the late ‘90s. Back then pair programming wasn’t much of a thing in the Enterprise circles I moved in and so the exercise was very hands-off. They left me in the boardroom with a computer (but no internet access) and gave me a choice of exercises. Someone popped in half way through to make sure I was alright but other than that I had no contact with anyone. At the end I chatted briefly with the interviewer about the task but it felt more like a box ticking affair than any real attempt to gain much of an insight into how I actually behaved as a programmer. (An exercise in separating “the wheat from the chaff”.)

I got the job and then watched from the other side of the table as other people went through the same process. In retrospect being asked to write code on an actual computer was still quite novel back then and therefore we probably didn’t explore it as much as we should have.

It was almost 15 years before I was asked to write code on a computer again as part of an interview. In between I had gone through the traditional pencil & paper exercises which I was struggling with more and more [2] as I adopted TDD and refactoring as my “stepwise refinement” process of choice.

My First Pair Programming Interview

Around 2013 an old friend in the ACCU, Ed Sykes, told me about a consultancy firm called Equal Experts who were looking to hire experienced freelance software developers. Part of their interview process was a simple kata done in a pair programming style. While I had done no formal pair programming up to that time [3] it was a core technique within the firm and so any candidates were expected to be comfortable adopting this practice where preferable.

I was interviewed by Ed Sykes, who played a kind of Product Owner role, and Adam Straughan, who was more hands-on in the experience. They gave me the Roman Numerals kata (decimal to roman conversion), which I hadn’t done before, and an hour to solve it. I took a pretty conventional approach but didn’t quite solve the whole thing in the allotted time as I never managed to get the special cases to fall out naturally. Still, the interviewers must have got what they were after as once again I got the job. Naturally I got involved in the hiring process at Equal Experts too because I really liked the process I had gone through and I wanted to see what it was like on the other side of the keyboard. It seemed so natural that I wondered why more companies didn’t adopt something similar, irrespective of whether or not any pair programming was involved in the role.

Whenever I got involved in hiring for the end client I used the same technique, although I tended to be a lone “technical” interviewer rather than having the luxury of the PO + Dev approach that I was first exposed to. Even so, it was still my preferred approach by a wide margin.

Pairing – Interactive Interviewing

On reflection what I liked most about this approach as a candidate, compared to the traditional one, is that it felt less like an exam, which I generally suck at, and more like what you’d really do on the job. Putting aside the current climate of living in a pandemic where many people are working at home by themselves, what mattered was that I had access to other people and was encouraged to ask questions rather than solve the problem entirely by myself. To wit, it felt like I was interviewing to be part of a team of people, not stuck in a booth and expected to work autonomously [4]. Instead of just leaving you to flounder, the interviewers would actively nudge you to help unblock the situation, just like they (hopefully) would do in the real world. Not everyone notices the same things, and as long as they aren’t holding the candidate’s hand the whole time that little nudge should be seen as a positive sign about taking feedback on board rather than a failure to solve the problem. It’s another small, but I feel hugely important, part of making the candidate feel comfortable.

The Pit of Success

We’ve all heard about those interviews where it’s less about the candidate and more about the interviewer trying to show how clever they are. It almost feels like the interviewer is going out of their way to make the interview as far removed from normal operating conditions as possible, as if the pressure of an interview were somehow akin to a production outage. If your goal is to get the best from the candidate, and it should be if you want the best chance of evaluating them fairly, then you need to make them feel as comfortable as possible. You only have a short period of time with them so getting them into the right frame of mind should be uppermost in your mind.

One of the problems I faced in that early programming test was an unfamiliar computer. You have a choice of whether to try and adapt to the keyboard shortcuts you’re given or reconfigure the IDE to make it more natural. You might wonder if that’s part of the test, which wastes yet more time and adds to the artificial nature of the setting. What about the toolset – can you use your preferred unit testing framework or shell? Even in the classic homogeneous environment that is The Windows Enterprise there is often still room for personal preference, despite what some organisations might have you believe [5].

Asking the candidate to bring their own laptop overcomes all of these hurdles and gives them the opportunity to use their own choice of tools, thereby allowing them to focus more on the problem and the interaction with you, and less on yak shaving. They should also have access to the Internet so they can google whatever they need to. It’s important to make this perfectly clear so they won’t feel penalised for “looking up the answer” to even simple things, because we all do that for real, let alone under the pressure of an interview. Letting them get flustered because they can’t remember something seemingly trivial, and then also worrying about how it’ll look if they google it, won’t work in your favour. (Twitter is awash with senior developers pointing out that even they google the simple things sometimes and that you’re not expected to remember everything all the time.)

Unfortunately, simply because there are people out there that insist on interviewing in a way designed to trip up the candidate, I find I have to go overboard when discussing the setup to reassure them that there really are no tricks – that the whole point of the exercise is to get an insight into how they work in practice. Similarly, reassuring the candidate that the problem is open-ended and that solving it in the allotted time is not expected also helps to relax them, so they can concentrate more on enjoying the process and feel comfortable with you stopping to discuss, say, their design choices rather than feeling the need to race to the end of yet another artificial deadline.

The Exercise

I guess it’s to be expected that if you set a programming exercise you’d want the candidate to complete it, but for me the exercise is a means to a different end. I’m not interested in the problem itself; it’s the conversation we have that provides me with the confidence I need to decide if the candidate has potential. This implies that the problem cannot be overly cerebral as the intention is to code and chat at the same time.

While there are a number of popular katas out there, like the Roman Numerals conversion, I never really liked any of them. Consequently I came up with my own little problem based around command line parsing. For starters I felt this was a problem domain likely to be familiar to almost any candidate, even if they’re more GUI oriented in practice. It’s also a problem that can be solved in a procedural, functional, or object-oriented way and may even, as the design evolves, be refactored from one style to another, or come to encompass aspects of multiple paradigms. (Many of the classic katas are very functional in nature.) There is also the potential to touch on I/O with the program usage, which allows the thorny subject of mocking and testability to be broached – a rich seam of discussion with plenty of opinions.

(Even though the first iteration of the problem only requires supporting “-v” to print a version string I’ve had candidates create complex class hierarchies based around the Command design pattern despite making it clear that we’ll introduce new features in subsequent iterations.)
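
To give a feel for how small that first iteration really is, here’s a minimal sketch of it as a shell script. It’s purely illustrative – the kata is done in whatever language the candidate prefers, and the program name and version string are made up:

#!/bin/sh
# Hypothetical first iteration of the kata: support "-v" and nothing else.
VERSION="0.1.0"

case "$1" in
  -v) echo "example-tool v$VERSION" ;;
  *)  echo "usage: example-tool -v" >&2
      exit 1 ;;
esac

The contrast between something this small and a fully-fledged Command-pattern class hierarchy is usually a good starting point for a conversation about speculative design.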

Mechanics

Aside from how a candidate solves a problem from a design standpoint I’m also interested in the actual mechanics of how they program. I don’t mean whether they can touch type or not – I personally can’t so that would be a poor indicator :o) – no, I mean how they use the tools. For example I find it interesting what they use the keyboard or mouse for, what keyboard shortcuts they use, how they select and move text, whether they use snippets or prefer the editor not to interfere. While I don’t think any of the candidate’s choices says anything significant about their ability to solve the problem, it does provide an interesting avenue for conversation.

It’s probably a very weak indicator, but programmers are often an opinionated bunch and one area they can be highly opinionated about is the tools they use. Some people love to talk about what things they find useful, in essence what they feel improves or hinders their productivity. This in turn raises the question of what they believe “productivity” is in a software development context.

Reflection

What much of this observation and conversation boils down to is not whether they do things the same way I do – on the contrary, I really hope they don’t, as diversity is important – it’s about the “reflective” nature of the person. How much of what they do is through conscious choice, and how much is simply the result of doing things by rote?

In my experience the better programmers I have worked with tend to be more aware of how they work. While many actions may fall into the realm of unconscious competence when “in the zone”, they can likely explain their rationale because they are still (subconsciously) evaluating it in the background in case a better approach is more suitable.

(Naturally this implies the people I tend to interview are, or purport to be, experienced programmers where that level of experience is assumed to be over 10 years. I’m not sure what you can expect to take away from this post when hiring those just starting out on their journey.)

An Imperfect Process

Right back at the start I said that interviewing is an imperfect process and while I think pairing with someone is an excellent way to get a window into their character and abilities, so much still comes down to a gut feeling and therefore a subjective assessment.

I once paired with someone in an interview and, while they were probably technically competent, I felt just a tinge of uneasiness about them personally. Ultimately the final question was “would I be happy to work with this person?” and I said “yes” because I felt I would be nit-picking to say “no”. As it happens I did end up working with this person, and a couple of months into the contract I had to have an awkward conversation with my other two colleagues to see if they felt the same way I did about this team mate. They did, and the team mate was “swapped out” after a long conversation with the account manager.

What caused us to find working with this person unpleasant wasn’t something we felt could easily and quickly be rectified. They had a general air of negativity about them and had a habit of making disparaging, sweeping remarks which showed they looked down on database administrators and other non-programming roles. They also lacked attention to detail, causing the rest of us to dot their i’s and cross their t’s. Even after bringing this up directly it didn’t get any better; they really just wanted to get on and write new code and leave the other tasks like reviewing, documenting, deploying, etc. to other people.

I doubt there is anything you can do in an hour of pairing to unearth these kinds of undesirable traits [6] to a level where you can adequately assess them, which is why the gut still has a role to play. (I suspect it was my many years in the industry working with different people that originally set my spider senses tingling.)

Epilogue

The hiring question I may find myself putting to the client is whether they would prefer to accidentally let a good candidate slip away because the interview let them (the candidate) down, or accidentally hire a less suitable candidate who appeared to “walk the walk” as well as “talk the talk” but potentially becomes a liability. Since doing pairing interviews this question has come up very rarely with a candidate as it’s been much clearer from the pairing experience what their abilities and attitude are.

 

[1] This doesn’t just apply to hiring individuals but can also work for whole teams, see “Choosing a Supplier: The Hackathon”.

[2] See “Afterwood – The Interview” for more on how much I dislike the pen & paper approach to coding interviews.

[3] My first experience was in a Cyber Dojo evening back in September 2010 that Jon Jagger ran at Skills Matter in London. I wrote it up for the ACCU: “Jon Jagger’s Coding Dojo”.

[4] Being a long-time freelancer this mode of operation is not unexpected as you are often hired into an organisation specifically for your expertise; your contributions outside of “coding” are far less clear. Some like the feedback on how the delivery process is working while others do not and just want you to write code.

[5] My In The Toolbox article “Getting Personal” takes a look at the boundary between team conventions and personal freedom for choices in tooling and approach.

[6] I’m not saying this person could not have improved if given the right guidance, they probably could have and I hope they actually have by now; they just weren’t right for this particular environment which needed a little more sensitivity and rigour.


Sunday 18 October 2020

Fast Hardware Hides Many Sins

Way back at the beginning of my professional programming career I worked for a small software house that wrote graphics software. Although it had a desktop publisher and line-art based graphics package in its suite it didn’t have a bitmap editor and so they decided to outsource that to another local company.

A Different User Base

The company they chose to outsource to had a very high-end bitmap editing product and so the deal – to produce a cut-down version – suited both parties. In principle they would take their high-end product, strip out the features aimed at the more sophisticated market (professional photographers), and throw in a few others that the lower end of the market would find beneficial instead. For example their current product only supported 24-bit video cards, which were pretty unusual in the early to mid ‘90s due to their high price, so supporting 8-bit paletted images was new to them. And because their high-end product handled large images using its own virtual memory system, it also demanded a large, fast hard disk.

Even though I was only a year or two into my career at that point I was asked to look after the project, and so I would get the first drop of each version as they delivered it so that I could evaluate their progress and also keep an eye on quality. The very first drop I got contained various issues that in retrospect did not bode well for the project, which ultimately fell through, although that was not until much later. (Naturally I didn’t have the experience I have now that would probably have caused me to pull the alarm cord much sooner.)

Hard Disk Disco

One of the features that they partially supported, but we wanted to make a little more prominent, was the ability to see the RGB value of the pixel under the cursor – often referred to now as a colour dropper or eye dropper. When I first used the feature on my 486DX PC I noticed that it was somewhat laggy; this surprised me as I had implemented algorithms like Floyd-Steinberg dithering, so I knew a fair bit about image manipulation and which algorithms were expensive, and this definitely wasn’t one of them! I had also noticed that the hard disk light on my PC was pretty busy too, which made no sense, but was probably worth mentioning to them as an aside.

After feeding back to them about this and various other things I’d noticed, they suggested that their virtual memory system was probably overly aggressive as the product was designed for beefier hardware. That kind of made sense and I waited for the next drop.

On the next drop they had apparently made various changes to their virtual memory system which helped it cope much better with smaller images so it didn’t page unnecessarily, but I still found the feature laggy. As I played with it some more I noticed that the hard disk light was definitely flashing a lot when I moved the mouse, although it didn’t stop flashing entirely when I stopped moving it. For our QA department, who only had somewhat smaller 386SX machines, it was even more noticeable.

DBWIN – Airing Dirty Laundry

At our company all the developers ran the debug version of Windows 3.1 in enhanced mode, with a second mono monitor to display messages from the Windows APIs to point out bugs in our software, but it was also very interesting to see what errors other software generated too [1]. You probably won’t be surprised to discover that the bitmap editor generated a lot of warnings. For example Windows complained about the amount of extra (custom) data it was storing against a window handle (hundreds of bytes), which I later discovered was caused by them constantly copying image attribute data back-and-forth as individual values instead of allocating a single struct with the data and copying that single pointer around.

Unearthing The Truth

Anyway, back to the performance problem. Part of the deal enabled our company to gain access to the bitmap editor source code which they gave to us earlier than originally planned so that I could help them by debugging some of their gnarlier crashes [2]. Naturally the first issue I looked into was the colour dropper and I quickly discovered the root cause of the dreadful performance – they were reading the application’s .ini file every time [3] the mouse moved! They also had a timer which simulated a WM_MOUSEMOVE message for other reasons which was why it still flashed the hard disk light even when the mouse wasn’t actually moving.

When I spoke to them about it they explained that once upon a time they ran into a Targa video card where the driver returned the RGB values as BGR when calling GetPixel(). Hence what they were doing was checking the .ini file to see if there was an application setting there to tell them to swap the GetPixel() result. Naturally I asked them why they didn’t just read this setting once at application start-up and cache the value, given that the user couldn’t swap the video card while the machine (let alone the application) was running. Their response was simply a shrug, which wasn’t surprising by that time as it was becoming ever more apparent that the quality of the code was making it hard to implement the features we wanted, and our QA team was turning up other issues which the mostly one-man team was never going to cope with in a reasonable time frame.

Epilogue

I don’t think it’s hard to see how this feature ended up this way. It wasn’t a prominent part of their high-end product and, given the kit their users ran on and the kind of images they were dealing with, it probably never even registered with all the other swapping going on. I’d like to think it was just an oversight – and one should never optimise until they have measured and prioritised – but there were too many other signs in the codebase suggesting they were relying heavily on the hardware to compensate for poor design choices. The other factor is that with pretty much only one full-time developer [5] the pressure was surely on to focus on new features first, with quality further down the list.

The project was eventually canned and, with the company I was working for struggling too due to the huge growth of Microsoft Publisher and CorelDraw, I only just missed the chop myself. Sadly neither company is around today, despite quality playing a major part at the company I worked for and its products being significantly better than many of the competition.

 

[1]  One of the first pieces of open source software I ever published (on CiX) was a Mono Display Adapter Library.

[2] One involved taking Windows “out at the knees” – not even CodeView or BoundsChecker would trap it – the machine would just restart. Using SoftICE I eventually found the cause – calling EndDialog() instead of DestroyWindow() to close a modeless dialog.

[3] Although Windows cached the contents of the .ini file it still needed to stat() the file on every read access to see if it had changed and disk caching wasn’t exactly stellar back then [4].

[4] See this tweet of mine about how I used to grep my hard disk under Windows 3.1 :o).

[5] I ended up moonlighting for them in my spare time by writing them a scanner driver for one of their clients while they concentrated on getting the cut-down bitmap editor done for my company.

Sunday 9 August 2020

Simple Tables From JSON Data With JQ and Column

My current role is more of a DevOps role and I’m spending more time than usual monitoring and administering various services, such as the GitLab instance we use for source control, build pipelines, issue management, etc. While the GitLab UI is very useful for certain kinds of tasks, the rich RESTful API allows you to easily build your own custom tools to monitor, analyse, and investigate the things you’re particularly interested in.

For example one of the first views I wanted was an alphabetical list of all runners with their current status so that I could quickly see if any had gone AWOL during the night. The alphabetical sorting requirement is not something the standard UI view provides, hence I needed to use the REST API or hope that someone had already done something similar first.

GitLab Clients

I quickly found two candidates: python-gitlab and go-gitlab-client which looked promising but they only really wrap the API – I’d still need to do some heavy lifting myself and understand what the GitLab API does. Given how simple the examples were, even with curl, it felt like I wasn’t really saving myself anything at this point, e.g.

curl --header "PRIVATE-TOKEN: $token" "https://gitlab.example.com/api/v4/runners"

So I decided to go with a wrapper script [1] approach instead and find a way to prettify the JSON output so that the script encapsulated a shell one-liner that would request the data and format the output in a simple table. Here is the kind of JSON the GitLab API would return for the list of runners:

[
  {
   "id": 6,
   "status": "online"
   . . .
  }
,
  {
   "id": 8,
   "status": "offline"
   . . .
  }
]

JQ – The JSON Tool

I’d come across the excellent JQ tool for querying JSON payloads many years ago so that was my first thought for at least simplifying the JSON payloads to the fields I was interested in. However on further reading I found it could do some simple formatting too. At first I thought the compact output using the -c option was what I needed (perhaps along with some tr magic to strip the punctuation), e.g.

$ echo '[{"id":1, "status":"online"}]' |\
  jq -c
[{"id":1,"status":"online"}]

but later I discovered the -r option provided raw output which formatted the values as simple text and removed all the JSON punctuation, e.g.

$ echo '[{"id":1, "status":"online"}]' |\
  jq -r '( .[] | "\(.id) \(.status)" )'
1 online

Naturally my first thought for the column headings was to use a couple of echo statements before the curl pipeline, but I also discovered that you can mix-and-match string literals with the output from the incoming JSON stream, e.g.

$ echo '[{"id":1, "status":"online"}]' |\
   jq -r '"ID Status",
          "-- ------",
          ( .[] | "\(.id) \(.status)" )'
ID Status
-- ------
1 online

This way the headings were only output if the command succeeded.
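
Putting the pieces together, the heart of the wrapper script became a single pipeline along these lines (the $token variable and hostname are the same illustrative ones as earlier; the output shown assumes the two runners from the JSON snippet above):

$ curl --silent --header "PRIVATE-TOKEN: $token" \
    "https://gitlab.example.com/api/v4/runners" |\
  jq -r '"ID Status",
         "-- ------",
         ( .[] | "\(.id) \(.status)" )'
ID Status
-- ------
6 online
8 offline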

Neater Tables with Column

While these crude tables were readable, and simple enough for further processing with grep and awk, they were still pretty unsightly when the values of a column varied too much in length, such as a branch name or description field. Putting them on the right hand side kind of worked but I wondered if I could create fixed-width fields à la printf via jq.

At this point I stumbled across the StackOverflow question How to format a JSON string as a table using jq? where one of the later answers mentioned a command line tool called “column” which takes rows of text values and arranges them as columns of similar width by adjusting the spacing between elements.

This almost worked except for the fact that some fields had spaces in their input and column would treat them by default as separate elements. A simple change of field separator from a space to a tab meant that I could have my cake and eat it, e.g.

$ echo '[ {"id":1, "status":"online"},
          {"id":2, "status":"offline"} ]' |\
  jq -r '"ID\tStatus",
         "--\t-------",
         ( .[] | "\(.id)\t\(.status)" )' |\
  column -t -s $'\t'
ID  Status
--  -------
1   online
2   offline

Sorting and Limiting

While for many of the views I was happy to order by ID, which is often the default for the API and, in the case of jobs and pipelines, acts as a proxy for “start time”, there were cases where I needed to control the sorting. For example we used the runner description to store the hostname (or host + container name) so it made sense to order by that, e.g.

jq 'sort_by(.description|ascii_downcase)'

For the runner’s jobs the job ID ordering wasn’t that useful as the IDs were allocated up front but a job might start much later if it was in a later part of the pipeline, so I chose to order by the job start time instead, in descending order so the most recent jobs were listed first, e.g.

jq 'sort_by(.started_at) | reverse'

One final trick that proved useful occasionally, when there was no limiting in the API, was to do it with jq instead, e.g.

jq "sort_by(.name) | [limit($max; .[])]"
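
For completeness, here is a sketch of the kind of wrapper script [1] all of this added up to for the runners view. It is illustrative rather than the exact script I used – the script name, token variable, and hostname are all assumptions – but it combines the pieces above: curl for the request, jq for the sorting and tab-separated formatting, and column for the final layout.

#!/bin/bash
# list-runners.sh: a hypothetical example of the wrapper script approach.
# Assumes $GITLAB_TOKEN holds a valid personal access token and that the
# hostname below is replaced with your own GitLab instance.
set -eu

curl --silent --fail --header "PRIVATE-TOKEN: ${GITLAB_TOKEN}" \
     "https://gitlab.example.com/api/v4/runners" |
  jq -r 'sort_by(.description|ascii_downcase) |
         "ID\tDescription\tStatus",
         "--\t-----------\t------",
         ( .[] | "\(.id)\t\(.description)\t\(.status)" )' |
  column -t -s $'\t'

The same shape of script covered the jobs and pipelines views simply by swapping the endpoint, the sort key, and the columns.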

 

[1] See my 2013 article “In The Toolbox – Wrapper Scripts” for more about this common technique of simplifying tools.

Saturday 8 August 2020

Weekend Maintenance as Chaos Engineering

I was working on a new system – a grid based calculation engine for an investment bank – and I was beginning to read about some crazy ideas by Netflix around how they would kill off actual production servers to test their resilience to failure. I really liked this idea as it had that “put your money where your mouth is” feel to it and I felt we were designing a system that should cope with this kind of failure, and if it didn’t, then we had learned something and needed to fix it.

Failure is Expected

We had already had a few minor incidents during its early operation, caused by dodgy data flowing down from upstream systems. We had tackled those by temporarily remediating the data to get the system working, but then immediately fixed the code so that the same kind of problem would not cause an issue in future. The project manager, who had also worked on a sister legacy system to one I’d worked on before, had made it clear from the start that he didn’t want another “support nightmare” like we’d both seen before [1] and pushed the “self-healing” angle, which was a joy to hear. Consequently reliability was always foremost in our minds.

Once the system went live and the business began to rely on it, the idea of randomly killing off services and servers in production was a hard prospect to sell. While the project manager had fought to help us get a UAT environment that almost brought us parity with production, and was okay with us using that for testing the system’s reliability, he was less happy about going the whole hog and adopting the Netflix approach. (The organisation was already very reserved and, despite our impeccable record, some other teams had had some nasty failures that caused it to become more risk averse rather than address the root problems.)

Planned Disruption is Good!

Some months after we had gone live I drew the short straw and was involved with a large-scale DR test. We were already running active/active by making use of the DR facilities during the day and rotated the database cluster nodes every weekend [2] to avoid a node getting stale, hence we had a high degree of confidence that we would cope admirably with the test. Unfortunately there was such a problem with one of the bank’s main trade systems – it wouldn’t start after failover to DR – that we never really got to do a full test and show that it was a no-brainer for us.

While the day was largely wasted for me as I sat around waiting for our turn, it did give me time to think a bit more about how we would show that the system was working correctly, and also, once the DR test was finished and we had failed back over again, that it had recovered properly. At that point I realised we didn’t need to implement any form of Chaos Engineering ourselves as the Infrastructure team were already providing it, every weekend!

It’s common for large enterprises to only perform emergency maintenance during the week and then make much more disruptive changes at the weekend, e.g. tearing parts of the network up, patching and rebooting servers, etc. At that time it was common for support teams to shut systems down and carefully bring them back up after the maintenance window to ensure they were operating correctly when the eastern markets opened late Sunday evening [3]. This was the perfect opportunity to do the complete opposite – drive the system hard over the weekend and see what state it was in after the maintenance had finished – if it wasn’t still operating normally we’d missed some failure modes.

An Aria of Canaries

We were already pushing through a simple canary request every few minutes, which allowed us to spot when things had unexpectedly gone south, but we wanted something heavier that might drive out subtler problems. So we started pushing through heavy loads during the weekend too and then looked at what state they were in at the end of it. These loads always had a lower priority than any real work so we could happily leave them to finish in the background rather than needing to kill them off before the working week started. (This is a nice example of using the existing features of the system to avoid it disrupting the normal workload.)

This proved to be a fruitful idea as it unearthed a couple of places where the system wasn’t quite as reliable as we’d thought. For example we were leaking temporary files when the network was glitching and the calculation was restarted. Also the load pushed the app servers over the edge memory-wise and highlighted a bug in the nanny process when the machine was short of memory. There was also a bug in some exponential back-off code that backed off a little too far as it never expected an outage to last most of the weekend :o).

Order From Chaos

When they finally scheduled a repeat DR test some months later, after supposedly ironing out the wrinkles with their key trade capture systems, our test was a doddle: the system just carried on after being brought back to life in the DR environment and, similarly, after reverting back to PROD it picked up where it had left off and retried those jobs that had failed when the switchover started. Rather than shying away from the weekend disruption we had used it to our advantage to help improve the system’s reliability.

 

[1] Eventually the team spends so much time fire-fighting there is no time left to actually fix the system and it turns into an endless soul-destroying job.

[2] Rotating the database cluster primary causes the database to work with an empty cache which is a great way to discover how much your common queries rely on heavily cached data. In one instance a 45-second reporting query took over 15 minutes when faced with no cached pages!

[3] See Arbitrary Cache Timeouts for an example where constant rebooting masked a bug.

Monday 3 February 2020

Blog Post #300

I signed off “My 200th Blog Post” in November 2014 with the following words:

“See you again in a few years.”

At the time I didn’t think it would take me over 5 years to write another 100 blog posts, but it has. Does this mean I’ve stopped writing and gone back to coding, reading, and gaming more on my daily commute? No, the clue is also in that blog post:

“My main aspiration was that writing this blog would help me sharpen my writing skills and give me the confidence to go and write something more detailed that might then be formally published.”

No, I haven’t stopped writing; on the contrary, since my first “proper” [1] article for ACCU in late 2013 I’ve spent far more of my time writing further articles, somewhere around the 60 mark at the last count. These have often been longer and required more care and attention, but I’ve probably still written a similar number of words in the last five years to the previous five.

Columnist

My “In The Toolbox” column for C Vu was a regular feature from 2013 to 2016 but that has tailed off for now and been replaced by a column on the final page of ACCU’s Overload. After its editor Frances Buontempo suggested the title “Afterwood” in the pub one evening how could I not accept?

In my very first Afterwood, where I set out my stall, I described how the final page of a programming journal has often played host to some entertaining writers in the past (when printed journals were still all the rage) and, while perhaps a little late to the party given the demise of the printed page, I’m still glad to have a stab at such a role.

This 300th blog post almost coincided with the blog’s 10th anniversary 9 months ago but I had a remote working contract at the time so my long anticipated “decade of writing” blog post was elevated to an Afterwood instead due to the latter having some semblance of moral obligation unlike the former [2]. That piece, together with this one which focuses more on this blog, probably forms the whole picture.

Statistics

I did wonder if I’d ever get bored of seeing my words appear in print and so far I haven’t; it still feels just that little bit more special to have to get your content past some reviewers, something you don’t have with your own blog. Being author and editor for my blog was something I called out as a big plus in my first anniversary post, “Happy Birthday, Blog”. 

Many of us programmers aren’t as blessed in the confidence department as people in some other disciplines so we often have to find other ways to give ourselves that little boost every now and then. The blog wins out here as you can usually see some metrics and even occasionally the odd link back from other people’s blogs or Stack Overflow, which is a nice surprise. (Metrics only tell you someone downloaded the page, whereas a link back is a good indication they actually read it too :o). They may also have agreed, which would be even more satisfying!)

While we’re on the subject of “vanity” metrics I’ve remained fairly steadfast and ignored them. I did include a monthly “page views” counter on the sidebar just to make sure that the blog hadn’t got lost in the ether, search-engine wise. It’s never been easy searching for my own content; I usually have to add “site:chrisoldwood.blogspot.com” into the query, but it’s not that big an issue as first and foremost it’s notes for myself; other readers are always a bonus. For a long time my posts about PowerShell exit codes (2011) and Subversion mergeinfo records (2010) held the top spots, but for some totally unknown reason my slightly ranty post around NTLM HTTP proxies (2016) is now dominating and will likely take over the top spot. Given there are no links to it (that I can find) I can only imagine it turns up in search engine queries and it’s not what people are really looking for. Sorry about that! Maybe there are devs and sysadmins out there looking for NTLM HTTP proxy therapy and this page is it? :o) Anyway, here are the top posts as of today:



Somewhat amusingly the stats graph on my 200th blog post shows a sudden meteoric rise in page views. Was I suddenly propelled to stardom? Of course not. It just so happened that my most recent post at the time got some extra views after the link was retweeted by a few people whose follower counts are measured in the thousands. It happened again a couple of years later, but in between it’s sat at around 4,500 views / month from what I can tell.




The 1 million views mark is still some way off, probably another 2.5 years, unless I manage to write something incredibly profound before then. (I won’t hold my breath though as 10 years of sample data must be statistically valid and it hasn’t happened so far.)

The Future

So, what for the future? Hopefully I’m going to keep plodding along with both my blog and any other outlets that will accept my written word. I have 113 topics in my blog drafts folder so I’m not out of ideas just yet. Naturally many of those should probably be junked as my opinion has undoubtedly changed in the meantime, although that in itself is something to write about which is why I can’t bring myself to bin them just yet – there is still value there, somewhere.

Two things I have realised I’ve missed, due to spending more time writing, are reading books (both technical and fiction) and writing code outside of work, i.e. my free tools. However, while I’ve sorely missed both of these pursuits I have in no way regretted spending more time writing, as software development is all about communication and therefore it was a skill that I felt I definitely needed to improve. My time can hardly be considered wasted.

Now that I feel I’ve reached an acceptable level of competency in my technical writing I’m left wondering whether I’m comfortable sticking with that or whether I should try and be more adventurous. Books like The Goal show that technical subjects can be presented in more entertaining ways and I’m well aware that my writing is still far too dry. My suspicion is that I need to get back to reading more fiction, and with a more critical eye, before I’ll truly feel confident enough to branch out more regularly into other styles [3].

Whereas I signed off my 200th post with a genuine expectation that I’d be back again for my 300th, I’m less sure about the future. Not that I’ll have given up writing, more that I’m less sure this blog will continue to be the place where I express myself most. Here’s to the next 100 posts.


[1] I wrote a few branch meeting write-ups and book reviews before then, but that didn’t feel quite the same to me as writing about technical aspects of the craft itself. The latter felt like you were exposing more of your own thoughts rather than “simply” recording the opinions of others.

[2] See “Missing the Daily Commute by Train” about why my volume of writing is highly correlated with where I’m working at the time.

[3] To date my efforts to be more adventurous have been limited to my Afterwood left-pad spoof “Knocked for Six” and the short poem “Risk-a-Verse”.

Wednesday 22 January 2020

Cargo Culting GitFlow

A few years back I got to spend a couple of weeks consulting at a small company involved in the production of smart cards. My team had been brought in by the company’s management to cast our critical eye over their software development process and provide a report on what we found along with any recommendations on how it could be improved.

The company only had a few developers and while the hardware side of the business seemed to be running pretty smoothly the software side was seriously lacking. To give you some indication of how bad things used to be, they weren’t even using version control for their source code. Effectively when a new customer came on board they would find the most recent and relevant existing customer’s version of the system (stored in a .zip file), copy it, and then start hacking out a new one just for the new customer.

As you can imagine in a set-up like this, if a bug was found it would need to be fixed in every version, and therefore it only got fixed if a customer noticed and reported it. This led to more divergence. Also, as the software usually went in a kiosk, the hardware and OS out in the wild were often ancient (Windows 2000 in some cases) [1].

When I say “how bad things used to be” this was some months before we started our investigation. The company had already brought in a previous consultant to do an “Agile Transformation” and they had recognised these issues and made a number of very sensible recommendations, like introducing version control, automated builds, unit testing, more collaboration with the business, etc.

However, we didn’t think they looked too hard at the way the team were actually working and only addressed the low hanging fruit by using whatever they found in their copy of The Agile Transformation Playbook™, e.g. Scrum. Naturally we weren’t there at the time but through the course of our conversations with the team it became apparent that a cookie-cutter approach had been prescribed despite it being (in our opinion) far too heavyweight for the handful of people in the team.

As the title of this post suggests, the one choice I found particularly amusing was the introduction of VSTS (Visual Studio Team Services; since rebranded Azure DevOps) and a GitFlow style workflow for the development team. While I applaud the introduction of version control and isolated, repeatable builds to the company, this feels like another heavyweight choice. Given that they were already using Visual Studio and writing their web service in C#, it’s perhaps not that surprising that they picked a Big Iron product.

The real kicker though was the choice of a GitFlow style workflow for the new product team where there were only two developers – one for the front-end and another for the back-end. They were using feature branches and pull requests despite the fact that they were the only people working in their codebase. While the company might have hired another developer at some point in the future they had no immediate plans to grow the team to any significant size [2], so there would never be any merge conflicts to resolve in the short to medium term! Their project was a greenfield one to create a configurable product instead of the many one-offs to date, so they had no regressions to worry about at this point either – it was all about learning and building a prototype.

It’s entirely possible the previous consultant was working on different information to us, but there was nothing in our conversations with the team or management that suggested they previously had different goals to what they were asking from us now. Sadly this is all too common an occurrence – a company hires an agile coach or consultant who may know how to handle the transformation from the business end [3] but doesn’t really know the technical side. Adopting an agile mindset also requires the XP technical practices to be successful and so, unless the transformation team really knows its development onions, the practices get rolled out and applied with a cargo cult mentality instead of being taught in a way that helps the team understand which practices are most pertinent to their situation and why.

In contrast, the plan we put forward was to strip out much of the fat and focus on making it easy to develop something which could be easily demo’d to the stakeholders for rapid feedback. We also proposed putting someone who was “more developer than scrum master” into the team for a short period so they could really grok the XP practices and see why they matter. (This was something I personally pushed quite hard for because I’ve seen how this has played out before when you’re not hands-on, see “The Importance of Leading by Example”.)

 

[1] Luckily these kiosks weren’t connected to a network; upgrades were a site visit with a USB stick.

[2] Sadly there were cultural reasons for this – a topic for another day.

[3] This is debatable but I’m trying to be generous here as my expertise is mostly on the technical side of the fence.