Sunday 9 August 2020

Simple Tables From JSON Data With JQ and Column

My current role is more of a DevOps role and I’m spending more time than usual monitoring and administrating various services, such as the GitLab instance we use for source control, build pipelines, issue management, etc. While the GitLab UI is very useful for certain kinds of tasks the rich RESTful API allows you to easily build your own custom tools to to monitor, analyse, and investigate the things you’re particularly interested in.

For example one of the first views I wanted was an alphabetical list of all runners with their current status so that I could quickly see if any had gone AWOL during the night. The alphabetical sorting requirement is not something the standard UI view provides hence I needed to use the REST API or hope that someone had already done something similar first.

GitLab Clients

I quickly found two candidates: python-gitlab and go-gitlab-client which looked promising but they only really wrap the API – I’d still need to do some heavy lifting myself and understand what the GitLab API does. Given how simple the examples were, even with curl, it felt like I wasn’t really saving myself anything at this point, e.g.

curl --header "PRIVATE-TOKEN: $token" "https://gitlab.example.com/api/v4/runners"

So I decided to go with a wrapper script [1] approach instead and find a way to prettify the JSON output so that the script encapsulated a shell one-liner that would request the data and format the output in a simple table. Here is the kind of JSON the GitLab API would return for the list of runners:

[
  {
   "id": 6,
   "status": "online"
   . . .
  }
,
  {
   "id": 8,
   "status": "offline"
   . . .
  }
]

JQ – The JSON Tool

I’d come across the excellent JQ tool for querying JSON payloads many years ago so that was my first thought for at least simplifying the JSON payloads to the fields I was interested in. However on further reading I found it could do some simple formatting too. At first I thought the compact output using the –c option was what I needed (perhaps along with some tr magic to strip the punctuation), e.g.

$ echo '[{"id":1, "status":"online"}]' |\
  jq -c
[{"id":1,"status":"online"}]

but later I discovered the –r option provided raw output which formatted the values as simple text and removed all the JSON punctuation, e.g.

$ echo '[{"id":1, "status":"online"}]' |\
  jq -r '( .[] | "\(.id) \(.status)" )'
1 online

Naturally my first thought for the column headings was to use a couple of echo statements before the curl pipeline but I also discovered that you can mix-and match string literals with the output from the incoming JSON stream, e.g.

$ echo '[{"id":1, "status":"online"}]' |\
   jq -r '"ID Status",
          "-- ------",
          ( .[] | "\(.id) \(.status)" )'
ID Status
-- ------
1 online

This way the headings were only output if the command succeeded.

Neater Tables with Column

While these crude tables were readable and simple enough for further processing with grep and awk they were still pretty unsightly when the values of a column were too varied in length such as a branch name or description field. Putting them on the right hand side kind of worked but I wondered if I could create fixed width fields ala printf via jq.

At this point I stumbled across the StackOverflow question How to format a JSON string as a table using jq? where one of the later answers mentioned a command line tool called “column” which takes rows of text values and arranges them as columns of similar width by adjusting the spacing between elements.

This almost worked except for the fact that some fields had spaces in their input and column would treat them by default as separate elements. A simple change of field separator from a space to a tab meant that I could have my cake and eat it, e.g.

$ echo '[ {"id":1, "status":"online"},
          {"id":2, "status":"offline"} ]' |\
  jq -r '"ID\tStatus",
         "--\t-------",
         ( .[] | "\(.id)\t\(.status)" )' |\
  column -t -s $'\t'
ID  Status
--  -------
1   online
2   offline

Sorting and Limiting

While many of the views I was happy to order by ID, which is often the default for the API, or in the case of jobs and pipelines was a proxy for “start time”, there were cases where I needed to control the sorting. For example we used the runner description to store the hostname (or host + container name) so it made sense to order by that, e.g.

jq 'sort_by(.description|ascii_downcase)'

For the runner’s jobs the job ID ordering wasn’t that useful as the IDs were allocated up front but the job might start much later if it’s a latter part of the pipeline so I chose to order by the job start time instead with descending order so the most recent jobs were listed first, e.g.

jq ‘sort_by(.started_at) | reverse’

One other final trick that proved useful occasionally when there was no limiting in the API was to do it with jq instead, e.g

jq "sort_by(.name) | [limit($max; .[])]"

 

[1] See my 2013 article In The Toolbox – Wrapper Scripts” for more about this common technique of simplifying tools.

Saturday 8 August 2020

Weekend Maintenance as Chaos Engineering

I was working on a new system – a grid based calculation engine for an investment bank – and I was beginning to read about some crazy ideas by Netflix around how they would kill off actual production servers to test their resilience to failure. I really liked this idea as it had that “put your money where your mouth is” feel to it and I felt we were designing a system that should cope with this kind of failure, and if it didn’t, then we had learned something and needed to fix it.

Failure is Expected

We had already had a few minor incidents during its early operation caused by dodgy data flowing down from upstream systems and had tackled that by temporarily remediating the data to get the system working but then immediately fixed the code so that the same kind of problem would not cause an issue in future. The project manager, who had also worked on a sister legacy system to one I’d worked on before, had made it clear from the start that he didn’t want another “support nightmare” like we’d both seen before [1] and pushed the “self-healing” angle which was a joy to hear. Consequently reliability was always foremost in our minds.

Once the system went live and the business began to rely on it the idea of randomly killing off services and servers in production was a hard prospect to sell. While the project manager had fought to help us get a UAT environment that almost brought us parity with production and was okay with us using that for testing the system’s reliability he was less happy about going to whole hog and adopting the Netflix approach. (The organisation was already very reserved and despite our impeccable record some other teams had some nasty failures that caused the organisation to become more risk adverse rather than address then root problems.)

Planned Disruption is Good!

Some months after we had gone live I drew the short straw and was involved with a large-scale DR test. We were already running active/active by making use of the DR facilities during the day and rotated the database cluster nodes every weekend [2] to avoid a node getting stale, hence we had a high degree of confidence that we would cope admirably with the test. Unfortunately there was a problem with one of the bank’s main trade systems such that it wouldn’t start after failover to DR that we never really got to do a full test and show that it was a no-brainer for us.

While the day was largely wasted for me as I sat around waiting for our turn it did give me time to think a bit more about how we would show that the system was working correctly and also when the DR test was finished and failed back over again that it had recovered properly. At that point I realised we didn’t need to implement any form of Chaos Engineering ourselves as the Infrastructure team were already providing it, every weekend!

It’s common for large enterprises to only perform emergency maintenance during the week and then make much more disruptive changes at the weekend, e.g. tearing parts of the network up, patching and rebooting servers, etc. At that time it was common for support teams to shut systems down and carefully bring them back up after the maintenance window to ensure they were operating correctly when the eastern markets opened late Sunday evening [3]. This was the perfect opportunity to do the complete opposite – drive the system hard over the weekend and see what state it was after the maintenance had finished – if it wasn’t still operating normally we’d missed some failure modes.

An Aria of Canaries

We were already pushing through a simple canary request every few minutes which allowed us to spot when things had unusually gone south but we wanted something heavier that might drive out subtler problems so we started pushing through heavy loads during the weekend too and then looked at what state they were in at the end of the weekend. These loads always had a lower priority than any real work so we could happily leave them to finish in the background rather than need to kill them off before the working week started. (This is a nice example of using the existing features of the system to avoid it disrupting the normal workload.)

This proved to be a fruitful idea as it unearthed a couple of places where the system wasn’t quite as reliable as we’d thought. For example we were leaking temporary files when the network was glitching and the calculation was restarted. Also the load pushed the app servers over the edge memory-wise and highlighted a bug in the nanny process when the machine was short of memory. There was also a bug in some exponential back-off code that backed off a little too far as it never expected an outage to last most of the weekend :o).

Order From Chaos

When they finally scheduled a repeat DR test some months later after supposedly ironing out the wrinkles with their key trade capture systems our test was a doddle as it just carried on after being brought back to life in the DR environment and similarly after reverting back to PROD it just picked up where it had left off and retried those jobs that had failed when the switchover started. Rather than shying away from the weekend disruption we had used it to our advantage to help improve its reliability.

 

[1] Eventually the team spends so much time fire-fighting there is no time left to actually fix the system and it turns into an endless soul-destroying job.

[2] Rotating the database cluster primary causes the database to work with an empty cache which is a great way to discover how much your common queries rely on heavily cached data. In one instance a 45-second reporting query took over 15 minutes when faced with no cached pages!

[3] See Arbitrary Cache Timeouts for an example where constant rebooting masked a bug.