Monday, 23 May 2016

The Tortoise and the Hare

[This was my second contribution to the 50 Shades of Scrum book project. It was also written back in late 2013.]

One of the great things about being a parent is having time to go back and read some of the old parables again; this time to my children. Recently I had the pleasure of re-visiting that old classic about a race involving a tortoise and a hare. In the original story the tortoise wins, but surely in our modern age we'd never make the same kinds of mistakes as the hare, would we?

In the all-too-familiar gold rush that springs up in an attempt to monetize any new, successful idea, a fresh brand of snake oil goes on sale. The particular brand on sale now suggests that this time the tortoise doesn't have to win. Finally we have found a sure-fire way for the hare to conquer: by sprinting continuously to the finish line we will out-pace the tortoise and raise the trophy!

The problem is it's simply not possible in the physical or software development worlds to continue to sprint for long periods of time. Take Usain Bolt, the current 100m and 200m Olympic champion. He can cover the 100m distance in just 9.63 secs and the 200m in 19.32 secs. Assuming a continuous pace of 9.63 secs per 100m the marathon should have been won in ~4063 secs, or 1:07:43. But it wasn't. It was won by Stephen Kiprotich in almost double that — 2:08:01. Usain Bolt couldn't maintain his sprinting pace even if he wanted to over a short to medium distance, let alone a longer one.
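For anyone who wants to check the extrapolation, here is a minimal sketch of the arithmetic. It assumes the standard marathon distance of 42,195 m; the helper name is purely illustrative.

```python
# Extrapolate a 100m sprint pace over the full marathon distance.
MARATHON_METRES = 42_195        # standard marathon distance (assumed)
SPRINT_SECS_PER_100M = 9.63     # the 100m time quoted above


def as_h_mm_ss(total_seconds: float) -> str:
    """Format a duration as H:MM:SS, dropping any fraction of a second."""
    seconds = int(total_seconds)
    hours, remainder = divmod(seconds, 3600)
    minutes, secs = divmod(remainder, 60)
    return f"{hours}:{minutes:02d}:{secs:02d}"


projected_secs = MARATHON_METRES / 100 * SPRINT_SECS_PER_100M
projected_time = as_h_mm_ss(projected_secs)
print(projected_time)
```

The projected figure is the imaginary "sprint all the way" time, which is what the real winning time should be compared against.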

The term "sprint" comes loaded with expectations, the most damaging of which is that there is a finish line at the end of the course. Rarely in software development does a project start and end in such a short period of time. The more likely scenario is that it will go on in cycles, where we build, reflect and release small chunks of functionality over a long period. This cyclic nature is more akin to running laps on a track where we pass the eventual finishing line many times (releasing) before the ultimate conclusion is reached (decommissioning).

In my youth my chosen sport was swimming; in particular the longer distances, such as 1500m. At the start of a race the thought of 16 minutes of hard swimming used to fill me with dread. So I chose to break it down into smaller chunks of 400m, which felt far less daunting. As I approached the end of each 400m leg I would pay more attention to where I was with respect to the pace I had set, but I wouldn't suddenly sprint to the end of the leg just to make up a few seconds of lost time. If I did this I’d start the next one partially exhausted which really upsets the rhythm, so instead I’d have to change my stroke or breathing to steadily claw back any time. Of course when you know the end really is in sight a genuine burst of pace is so much easier to endure.

I suspect that one of the reasons management like the term "sprint" is for motivation. In a more traditional development process you may be working on a project for many months, perhaps years before what you produce ever sees the light of day. It’s hard to remain focused with such a long stretch ahead and so breaking the delivery down also aids in keeping morale up.

That said, what does not help matters is when the workers are forced to make a choice about how to meet what is an entirely arbitrary deadline – work longer hours or skimp on quality. And a one, two or three week deadline is almost certainly exactly that — arbitrary. There are just too many daily distractions to expect progress to run smoothly, even in a short time-span. This week alone my laptop died, internet connectivity has been variable, the build machine has been playing up, a permissions change refuses to have the desired effect and a messaging API isn't doing what the documentation suggests it should. And this is on a run-of-the-mill in-house project only using supposedly mature, stable technologies!

Deadlines like the turn of the millennium are real and immovable and sometimes have to be met, but allowing a piece of work to roll over from one sprint to the next when there is no obvious impediment should be perfectly acceptable. The checks and balances should already be in place to ensure that any task is not allowed to mushroom uncontrollably, or that any one individual does not become bogged down or "go dark". The team should self-regulate to ensure just the right balance is struck between doing quality work to minimise waste whilst also ensuring the solution solves the problem without unnecessary complexity.

As for the motivational factors of "sprinting" I posit that what motivates software developers most is seeing their hard work escape the confines of the development environment and flourish out in production. Making it easy for developers to continually succeed by delivering value to users is a far better carrot-and-stick than "because it's Friday and that's when the sprint ends".

In contrast, the term "iteration" comes with far less baggage. In fact it speaks the developer's own language – iteration is one of the fundamental constructs used within virtually any program. It accentuates the cyclic nature of long term software development rather than masking it. The use of feature toggles as an enabling mechanism allows partially finished features to remain safely integrated whilst hiding them from users as the remaining wrinkles are ironed out. Even knowing that your refactorings have gone live without the main feature is often a small personal victory.
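As a rough illustration of the feature toggle idea, the sketch below keeps an unfinished code path integrated but switched off. Everything in it (the toggle name, the checkout functions, the in-memory dictionary) is hypothetical; a real system would more likely read its toggles from configuration.

```python
# Hypothetical toggle store; in practice this might live in configuration.
FEATURE_TOGGLES = {
    "new_checkout_flow": False,  # integrated and deployed, but hidden from users
}


def is_enabled(feature: str) -> bool:
    """Unknown features default to off so unfinished work stays hidden."""
    return FEATURE_TOGGLES.get(feature, False)


def legacy_checkout(basket: list) -> str:
    return f"legacy checkout of {len(basket)} item(s)"


def new_checkout(basket: list) -> str:
    return f"new checkout of {len(basket)} item(s)"


def checkout(basket: list) -> str:
    """Route to the unfinished feature only when its toggle is switched on."""
    if is_enabled("new_checkout_flow"):
        return new_checkout(basket)
    return legacy_checkout(basket)
```

With the toggle off, users only ever see the legacy path, yet any refactoring done to support the new one still ships with every release.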

That doesn't mean the term sprint should never be used. I think it can be used when it better reflects the phase of the project, e.g. during a genuine milestone iteration that will lead to a formal rather than internal release. The change in language at this point might aid in conveying the change in mentality required to see the release out the door, if such a distinction is genuinely required. However the idea of an iteration "goal" is already in common use as an alternative approach to providing a point of focus.

If continuous delivery is to be the way forward then we should be planning for the long game and that compels us to favour a sustainable pace where localized variations in scope and priority will allow us to ride the ebbs and flows of the ever changing landscape.

Debbie Does Development

[This was my initial submission to the 50 Shades of Scrum book project, hence its slightly dubious sounding title. It was originally written way back in late 2013.]

Running a software development project should be simple. Everyone has a well-defined job to do: Architects architect, Analysts analyse, Developers develop, Testers test, Users use, Administrators administer, and Managers manage. If everyone just got on and did their job properly the team would behave like a well-oiled machine and the desired product would pop out at the end — on time and on budget.

What this idyllic picture describes is a simple pipeline where members of one discipline take the deliverables from their predecessor, perform their own duty, throw their contribution over the wall to the next discipline, and then get on with their next task. Sadly the result of working like this is rarely the desired outcome.

Anyone who believes they got into software development so they could hide all day in a cubicle and avoid interacting with other people is sorely mistaken. In contrast, the needs of a modern software development team demand continual interaction between its members. There is simply no escaping the day-to-day, high-bandwidth conversations required to raise doubts, pass knowledge, and reach consensus so that progress can be made efficiently and, ultimately, value can be delivered.

Specializing in a single skill is useful for achieving the core responsibilities your role entails, but for a team to work most effectively requires members that can cross disciplines and therefore organize themselves into a unit that is able to play to its strengths and cover its weaknesses. My own personal preference is to establish a position somewhat akin to a centre half in football — I'm often happy taking a less glamorous role watching the backs of my fellow team mates whilst the "strikers" take the glory. Enabling my colleagues to establish and sustain that all important state of "flow" whilst I context-switch helps overall productivity.

To enable each team member to perform at their best they must be given the opportunity to ply their trade effectively and that requires working with the highest quality materials to start with. Rather than throwing poor-quality software over a wall to get it off their own plate, the team should take pride in ensuring they have done all that they can to pass on quality work. The net effect is that with less need to keep revisiting the past they have more time to focus on the future. This underlies the notion of "done done" — when a feature is declared complete it comes with no caveats.

The mechanics of this approach can clearly be seen with the technical practices such as test-driven development, code reviews, and continuous integration. These enable the development staff to maintain a higher degree of workmanship that reduces the traditional burden on QA staff caused by trivial bugs and flawed deployments.

Testers will themselves write code to automate certain kinds of tests as they provide a more holistic view of the system which covers different ground to the developers. In turn this grants them more time to be spent on the valuable pursuits that their specialised skills demand, like exploratory testing.

This skills overlap also shows through with architects who should contribute to the development effort too. Being able to code garners a level of trust that can often be missing between the developer and architect due to an inability to see how an architecture will be realized. This rift between the "classes" is not helped either when you hear architects suggesting that their role is what developers want to be when they "grow up".

Similar ill feelings can exist between other disciplines as a consequence of a buck-passing mentality or mistakenly perceived job envy. Despite what some programmers might believe, not every tester or system administrator aspires to be a developer one day either.

Irrespective of what their chosen technical role is, the one thing everyone needs to be able to do is communicate. One of the biggest hurdles for a software project is trying to build what the customer really wants and this requires close collaboration between them and the development team. A strict chain of command will severely reduce the bandwidth between the people who know what they want and the people who will design, build, test, deploy and support it. Not only do they need to be on the same page as the customer, but also the same page as their fellow team mates. Rather than strict lines of communication there should be ever changing clusters of conversation as those most involved with each story arrive at a shared understanding. It can be left to a daily stand-up to allow the most salient points of each story to permeate out to the rest of the team.

Pair Programming has taken off as a technique for improving the quality of output from developers, but there is no reason why pairing should be restricted to two team members of the same skill-set. A successful shared understanding comes from the diversity of perspectives that each contributor brings, and that comes about more easily when there is a fluidity to team interactions. For example, pairing a developer with a tester will help the tester improve their coding skills, whilst the developer improves their ability to test. Both will improve their knowledge of the problem. If there are business analysts within the team too this can add even further clarity as "the three amigos" cover more angles.

One only has to look to the recent emergence of the DevOps role to see how the traditional friction between the development and operations teams has started to erode. Rather than have two warring factions we've acknowledged that it’s important to incorporate the operational side of the story much earlier into the backlog to avoid those sorts of late surprises. With all bases covered from the start there is far more chance of reaching the nirvana of continuous delivery.

Debbie doesn't just do development any more than Terry just does testing or Alex only does Architecture. The old-fashioned attitude that we are expected to be autonomous to prove our worth must give way to a new ideal where collaboration is king and the output of the team as a whole takes precedence over individual contributions.

Thursday, 28 April 2016

Stand-Up and Deliver

At the ACCU Conference last year I gave a five minute lightning talk titled “The Daily Stand-Up”, which was a short comedy routine of one-liners around programming/IT/science related topics. This year I was back again, but this time I had more material so could take the opportunity to do more than one lightning talk (they have three slots – one per evening).

In the subsequent year I’ve used my Agile on the Beach and Equal Experts Christmas party “performances” as an excuse to trawl all 14 thousand of my tweets looking for material. After separating the wheat from the chaff (or rather the needles from the haystack) I unearthed plenty enough for two sets.

It also felt as though the name “The Daily Stand-Up” had been done now and that I should try and organise the material into separate themes which I ended up doing on the way down to this year’s Pipeline conference. As a result I came up with two new titles and grouped the material appropriately, along with creating a holding slide with a suitably pithy picture.

Here then are my two sets (61 one-liners in total) for the ACCU 2016 Conference lightning talks – “Continuous Delivery” and “Becoming a Bitter Programmer” [1].

Continuous Delivery

“I visited the opticians the other day after I started seeing printers, keyboards and mice out the corner of my eye. She said it’s okay, it’s just peripheral vision.”

“We recently needed to break into my boss’s account so I thought I’d try a dictionary attack – I just kept hitting the system administrator over the head with a large copy of the OED until he let me in.”

“My wife and I have been together for twenty-five years so I thought I ought to get her a token ring. Turns out you can only get 100BASE-TX these days.”

“If you think keeping up with the Kardashians is hard, you should try JavaScript frameworks!”

“I once listened to an audio book about structs that only contained values of primitive types. It was a POD cast.”

“When Sherlock Holmes talks about a ‘three pipe problem’, does he mean one that requires grep, awk, sed and sort?”

“Is the removal of a dependency injection framework from a Java codebase known as spring cleaning?”

“Was the Tower of Pisa built using lean manufacturing?”

“I know it’s all the rage these days but I reckon the writing’s on the wall for Kanban.”

“The last time I put my phone into airplane mode it promptly assumed the crash position.”

“When I turned thirty-five I blew a whole load of cash on a monster gaming rig. I think I was suffering from a half-life crisis.”

“When creating a diagram of a micro-services architecture I never know where to draw the line.”

“If we want to adopt an agile release train, does that mean we need to start using Ruby on Rails?”

“The problem with the technical debt metaphor is that in the end everyone begins to lose interest.”

“If you send data using the Kermit protocol is the transmission speed measured in ribbits per second?”

“How do you change the CMOS battery in a virtual machine?”

“For Christmas my wife bought me some jigsaws of famous British computer scientists. When she asked how I was getting on, I replied that it was Turing complete.”

“When working from home I like to get my kids involved in my coding. I call it Au Pair Programming.”

“My son was being hassled by his school friends to share his music collection via BitTorrent. I said he shouldn’t give in to peer-to-peer pressure.”

“Our local pub has started selling a new draught beer called Git. The only way to get served is to make a pull request.”

“When they collected their Turing Award I recognised Diffie on the left and Hellman on the right, but who was the man in the middle?”

“I decided it was about time I upgraded to fibre, so I’ve started eating All Bran for breakfast.”

“If you’re looking to purchase some recreational drugs do you first consult Trip Advisor?”

“Should a microphone type be mutable?”

“When software developers work at New Scotland Yard do they have to use a special branch?”

“I recently visited the dentist and he told me I had a scaling problem. I said that’s a problem as I don’t have room for any more teeth.”

“We recently had a poll for the best escape character. I picked Steve McQueen’s Hicks.”

“I’m never quite sure, is it Heisenbergs uncertainty principle?”

“I was utterly convinced that I had found the answer to my SSL problem. In fact I thought it was a dead cert.”

“When writing software for estate agents, should you use property based testing?”

“The other day I struggled to upload a picture of Marcel Marceau. Every time I tried the server returned the error: 415 unsupported mime type.”

“Probably the hardest problem in computer science is dealing with nans. They always seem to want you to come round and fix their computer.”

Becoming a Bitter Programmer

“Is a cross functional team just a bunch of grumpy Lisp programmers?”

“I’ve given up playing poker with functional programmers, they just spend all their time folding.”

“My company recently adopted mob programming. They hired a bunch of beefy guys to stand around with baseball bats to make sure you did an 80 hour week.”

“Some people are like the char type in C++, they seem to get promoted for no apparent reason.”

“Is it any wonder modern programmers are obese when they depend so heavily on syntactic sugar?”

“The product owner asked me why all our Cucumber tests only covered the happy paths. I told him they’re rose tinted specs.”

“Every time I try and generate a digital signature I just make a hash of it.”

“C++ comes with complexity guarantees – if you use C++ it’s guaranteed to be complex.”

“Some people say C# & Java programmers overuse reflection, but given the quality of their code I’d say they’re not reflecting enough.”

“Talking of C# & Java, is it just me or are they harder to distinguish these days than The Munsters and The Addams Family?”

“Our team isn’t very good at this agile stuff – our successes are just stories and our failures are all epics.”

“Anyone who says you shouldn’t use synchronisation objects is just being racist.”

“Whenever I find a static in multi-threaded code it makes my hair stand on end.”

“If you don’t keep your promises, how will you know what the future holds?”

“Our zero tolerance approach to floating-point comparison errors didn’t work out quite as we’d hoped.”

“If technical debt is the result of poor quality code, does that make bad programmers loan sharks?”

“Don’t bother upgrading your database, the SQL is never as good as the original.”

“I blame Facebook for the declining quality of SQL code, young programmers are obsessed with likes.”

“My current system has five-nines reliability - it usually works about 45% of the time.”

“Our DR approach is less active/passive and more passive/aggressive. When anything fails we just sit around loudly tutting until someone goes and fixes it.”

“The company said it was moving all our hardware to EC2. It turned out they’d bought a dingy office in London near the Bank of England.”

“Some people just don’t know when to stop bashing Windows.”

“When I found out my new job was effectively working with legacy code I started spitting Feathers!”

“I reckon my team has taken a hypocritic oath – the comments say one thing but the code does something completely different.”

“You can always tell when you’ve got a Christmas or birthday present from an enterprise software developer – there is an excessive amount of wrapping.”

“C# supports impure methods and mutable types, but doesn’t support tail recursion or currying. Does that make it a dysfunctional programming language?”

“Finding the square root of a negative number is just not natural.”

“My team is using homeopathic unit testing. They write one test and dilute it with a thousand lines of production code.”

“The Wildlife Trust has declared our codebase a conservation area on account of the number of bugs.”

[1] The latter title was inspired by the recent book from the compere of the ACCU Conference lightning talks – Pete Goodliffe – which is called Becoming a Better Programmer.

Sunday, 14 February 2016

Get Considered Harmful

[I’ve tried really hard to come up with a better title than this, e.g. Thesaurus Driven Development, but none of them quite conveyed my feelings. Surely we must all be entitled to use the “??? Considered Harmful” template at least once in our careers? Well, this is mine.]

They say that one of the hardest problems in computer science is naming things [1]. We struggle over what to call our functions, classes, variables, filenames, servers, etc. in the hope that they adequately convey what it is that they do or contain.

In my own programming endeavours I’ve found the use of the Thesaurus a vital ingredient in staving off the boring repetition that so often litters the names in a codebase. In fact one of the latest additions to my “In The Toolbox” column for C Vu touched on this very topic (see “Dictionary & Thesaurus”).

Bueller, Bueller, Anyone…

The travesty that is appending the word “Service” as a suffix to some noun in the hope of making it sound “behavioural” (and therefore also suitable for repurposing as an equivalently named interface) is a common blight, but it’s not the one I’m interested in this time.

No, in this instance the word that is the target of my ire is “get”. It’s often as if no other word exists in the English language for helping to name what a function (or method) returning a value does. I sometimes wonder if so much effort is expended by programmers trying to decide what it is that they’re trying to obtain that they have no time or energy left to describe how they will be obtaining it.

The word has become so utterly bland and devoid of any usefulness that, by definition, it could be used for absolutely anything and therefore it is a de-facto choice that could never be wrong. Unless, of course, descriptiveness is one of your criteria for a good name. Sadly it is one of mine.

A Smorgasbord of Alternatives

As I mentioned in that C Vu article, and more recently on Twitter, I can think of a whole bunch of words that could almost certainly describe more accurately what it is that “some function” might be trying to achieve. Here once again is a similar, off-the-cuff list of choices:

build, create, allocate, make, acquire, format, locate, request, fetch, retrieve, find, calculate, derive, pull, process, transform, generate.

Within that simple list of words there are clearly a number of categories under which we could group our choices. For example if I’ve got a factory function then I might choose: create, build, make or allocate.

If the factory is some kind of object pool I might not want to use the create/destroy or allocate/free pair as they potentially suggest too much and so I might go for the slightly looser acquire/release pair.
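To make the object pool case concrete, here is a toy sketch. The class and its string "connections" are invented for illustration; the point is only that acquire/release promises nothing about creation or destruction, whereas create/destroy would.

```python
class ConnectionPool:
    """A toy object pool; the 'connections' are just strings for illustration."""

    def __init__(self, size: int) -> None:
        self._idle = [f"connection-{n}" for n in range(size)]
        self._in_use = set()

    def acquire(self) -> str:
        """Hand out an idle object; nothing is created on this path."""
        if not self._idle:
            raise RuntimeError("pool exhausted")
        connection = self._idle.pop()
        self._in_use.add(connection)
        return connection

    def release(self, connection: str) -> None:
        """Return an object to the pool; nothing is destroyed."""
        self._in_use.remove(connection)
        self._idle.append(connection)
```

A caller reading `pool.acquire()` is told they are borrowing something, which is a more honest description than `pool.get()` or `pool.create()`.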

If the creation is for a computed sequence of items rather than a single value or entity, then generate might be more applicable.

Another group centres around the kind of calls made on an external (or remote) service, e.g. request, fetch, retrieve, locate, find, query and search. Some of these words are equally fitting for searching a local collection too and therefore the degree of transparency may have a bearing on the choice. There is an old programming adage “Don’t hide the network”, and naming is one clue we can use to honour it.
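One possible convention, sketched below, is to reserve "find" for local searches and "fetch" for anything that crosses the network. The customer records and the URL layout are hypothetical; only the standard library's `urllib.request.urlopen` and `json.load` are real.

```python
import json
import urllib.request


def find_customer(customers: list, customer_id: int):
    """Search a local, in-memory collection; cheap and cannot time out."""
    return next((c for c in customers if c["id"] == customer_id), None)


def fetch_customer(base_url: str, customer_id: int, timeout_secs: float = 5.0) -> dict:
    """Cross the network; callers should expect latency and failure."""
    url = f"{base_url}/customers/{customer_id}"  # hypothetical URL layout
    with urllib.request.urlopen(url, timeout=timeout_secs) as response:
        return json.load(response)
```

Neither name says much about the implementation, but the second at least warns the reader that a timeout and an exception handler may be needed.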

The third obvious group I can make out are words for describing the more common notion of what a traditional function is: the transformation of one value into another. Typically we put one value in and get the same value out, but changed in an interesting way, e.g. a different representation. Or maybe we do get a different value out, but with the same representation. Or even a change in both, but nevertheless one that is still similar in more ways than not [2].

For this style of (pure) function I would naturally look to words like: transform, convert, format, parse, calculate, derive, etc. Words like execute or process, which might also convey the notion of some transformation come (in my mind) with thoughts of side-effects too, rather than the purity of a mathematical function.
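A few small pure functions show how these verbs read in practice. This is only a sketch with made-up function names; each one describes the transformation it performs, which "get" never could.

```python
from datetime import date


def parse_iso_date(text: str) -> date:
    """Same value, different representation: a string in, a date out."""
    return date.fromisoformat(text)


def format_uk_date(value: date) -> str:
    """Same value again, rendered as day/month/year text."""
    return value.strftime("%d/%m/%Y")


def calculate_age_in_days(born: date, today: date) -> int:
    """A different value derived from the inputs, with no side effects."""
    return (today - born).days
```

Compare `format_uk_date(parse_iso_date(text))` with the equally valid but far less informative `get_date(get_date(text))`.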

Accessor or Factory?

If you take a very rigid stance that says every function can either return a value it already holds, or can create a new one, then every function must either be an accessor or a factory. It’s easy to see how this kind of myopia leads to a codebase littered with “factory” classes when programmers base their naming on a simplistic view of the function’s implementation.

The flipside to being overly generic is being overly specific. When you have a concrete method it’s easy to name it after what it does. But when the method is part of an interface which can, by definition, be implemented in many different ways we run the risk of potentially sending the reader down the wrong path.

Which is worse though: forcing the reader to look inside the box because it has no meaningful label, or asking them to suspend their disbelief when the contents eventually turn out to be blander than advertised? Personally I’d rather be surprised to find a simple calculation when I wasn’t expecting it rather than a network hop to a remote machine.

Blandness Over Configuration

I do wonder if part of the problem might not lie with modern tooling. In our desire to embrace the notion of Convention over Configuration we restrict ourselves to using only those words that the tool can recognise. Those words are very likely to be the ones that have the broadest stroke exactly because we expect them to cater for a wide degree of diversity – it’s the same issue as with interfaces, but on an even wider scale.

Another source of inspiration for such vocabulary are books on Design Patterns, and that usually means the seminal one from the Gang of Four. One might argue that the raison d'être of the Design Patterns movement was to create a common vocabulary, and it was, but that doesn’t mean you have to use the pattern names and examples in your own types and functions.

Haven’t We Been Here Before?

If any of this coming from me sounds at all familiar it might be because I’ve trodden a similar path before right back in the infancy of this very blog, see “Standard Method Name Verb Semantics”. Back then I was particularly interested in trying to overcome the horribly overloaded nature of certain words, of which “get” happened to be just one.

In a sense this blog post is a step backwards. Rather than trying to overcome the subtleties of whether Get or Find is more suitable for a method that can also return the absence of a value [3], or Allocate versus Acquire for an object pool, I’m just trying to convince programmers to use almost anything [4] other than “get” once again.

[1] The other one of course is cache invalidation (and off-by-one errors).

[2] For example I once saw an F# example that “mapped” a URL into a JSON document. While a lawyer might argue that it’s a perfectly valid “function”, I’d contend that it is hiding a huge amount of complexity. This is both a blessing and a curse depending on whether or not it’s your turn with the support pager.

[3] The answer of course is probably “either”, but return an Option instead of a value / null reference; i.e. make the return type the defining characteristic.

[4] By applying Cunningham’s Law the function stands a chance of being renamed when the mistake is noticed, whereas giving it a dreary name pretty much guarantees it’ll never be changed.
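As a small illustration of the point in footnote [3], the sketch below lets the return type carry the "might be absent" signal. Python has no built-in Option type, so `Optional` from the `typing` module stands in for one here; the price table and function names are made up.

```python
from typing import Optional

PRICES = {"apple": 30, "pear": 45}  # hypothetical lookup table, prices in pence


def find_price(item: str) -> Optional[int]:
    """The return type tells the caller that 'no value' is a normal outcome."""
    return PRICES.get(item)


def describe_price(item: str) -> str:
    """Callers are nudged into handling the missing case explicitly."""
    price = find_price(item)
    if price is None:
        return f"{item}: not stocked"
    return f"{item}: {price}p"
```

Whether the function is called find or get matters rather less once the signature itself admits that the answer may be nothing.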

Friday, 15 January 2016

Man Cannot Live by Unit Testing Alone

Way back in May 2012 I wrote a blog post titled “Beware the Complacency Unit Testing Brings”. This was a reaction to a malaise that I began to see developing as the team appeared to rely more heavily on the feedback it was getting from unit tests. This in turn appeared to cause some “trivial” bugs that should also have been picked up early, to be detected somewhat later.

This post looks at a couple of other examples I’ve seen in the past of problems that couldn’t have been solved by unit testing alone.

Unit Tests Are Self-Reinforcing

A colleague and I once had a slightly tortuous conversation with a project manager about our team’s approach to testing. There was a “suggestion” that as the organisation began to make more decisions based on the results of the system we had built, the more costly “a mistake” could become. We didn’t know where this was coming from but the undertone was one of “work harder”.

Our response was that if the business was worried about the potential for losses in millions due to a software bug, then they should have no problem funding a few tens of thousands of pounds of hardware to give us the tools we need to automate more testing. To us, if the risks were high, then the investment should be too, as this helps us to ensure we keep the risks down to a minimum. In essence we advocated working smarter, not harder.

His response was that unit tests should be fast and easy to run, and therefore he questioned why we needed any more hardware. What he failed to understand about unit testing was its self-reinforcing nature [1]. Unit tests are about a programmer verifying that the code they wrote works as they intended it to. What they fail to address is whether it meets the demands of the customer. In the case of an API “that customer” is possibly just another developer on the same team providing another piece of the same jigsaw puzzle.

As if to prove a point this scenario was beautifully borne out not long after. Two developers working on either side of the same feature (front-end and back-end) both wrote their parts with a full suite of unit tests and pushed to the DEV environment only to discover it didn’t work. It took a non-trivial amount of time of the three of us (the two devs in question and myself) before I happened to notice that the name of the configuration setting which the front-end and back-end were using was slightly different. Each developer had created their own constant for the setting name, but the constant’s value was different and hence the back-end didn’t believe it was ever being provided.

This kind of integration problem is common. And we’re not talking about junior programmers here either, both were smart and very experienced developers. They were also both TDD-ers and it’s easy to see how this kind of problem occurs when your mind-set is focused around the simplest thing that could possibly work. We always look for the mistake in our most recent changes and both of them created the mismatched constant right back at the very beginning, hence it becomes “out of mind” by the time the problem is investigated [2].
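The mismatch is easy to reproduce in miniature. In the sketch below the setting name, the functions and the two "modules" are all invented for illustration; the real code was of course rather larger. Each side defines its own constant, each side's unit test passes, and only a check that joins the two halves reveals the problem.

```python
# Front-end module: its own constant for the setting name (hypothetical).
FRONT_END_SETTING = "report.cutoff-date"


def build_request(cutoff: str) -> dict:
    return {FRONT_END_SETTING: cutoff}


# Back-end module: a separately defined constant with a subtly different value.
BACK_END_SETTING = "report.cutoff_date"


def read_cutoff(request: dict):
    return request.get(BACK_END_SETTING)


# Each side's unit test uses its own constant, so both pass in isolation.
def test_front_end_sends_cutoff():
    assert build_request("2016-01-15")[FRONT_END_SETTING] == "2016-01-15"


def test_back_end_reads_cutoff():
    assert read_cutoff({BACK_END_SETTING: "2016-01-15"}) == "2016-01-15"


# Only a check that wires the two halves together exposes the mismatch.
def cutoff_survives_round_trip() -> bool:
    return read_cutoff(build_request("2016-01-15")) == "2016-01-15"
```

Both unit tests are green, yet `cutoff_survives_round_trip()` returns `False` because a hyphen on one side is an underscore on the other.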

Performance Tests

Unit tests are about verifying functional behaviour, so ensuring performance is not in scope at that point. I got a nice reminder of this not long afterwards when I refactored a stored procedure to remove some duplication, only to send execution times through the roof. The SQL technique I used was “slightly” less performant (I later discovered) and it added something like another 100 ms to every call to the procedure.

Whilst all the SQL unit tests passed with flying colours in their usual timescale [3], when it was deployed into the test environment, the process it was part of nosedived. The extra 100 ms in the 100,000 calls [4] that the process made to the procedure started to add up and a 30 minute task now took over 8 hours!
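A back-of-envelope sketch shows how quickly a small per-call cost compounds across sequential calls. It uses only the rounded figures quoted above (roughly 100,000 calls and a 30 minute baseline), so treat it as an order-of-magnitude guide and not a reconstruction of the actual run; the function names are illustrative.

```python
def added_hours(calls: int, extra_ms_per_call: float) -> float:
    """Total extra wall-clock time, in hours, for strictly sequential calls."""
    return calls * extra_ms_per_call / 1000 / 3600


def implied_extra_ms(calls: int, before_hours: float, after_hours: float) -> float:
    """Per-call overhead, in milliseconds, implied by an observed slowdown."""
    return (after_hours - before_hours) * 3600 * 1000 / calls


print(round(added_hours(100_000, 100), 2))
print(round(implied_extra_ms(100_000, 0.5, 8.0)))
```

A flat 100 ms per call adds a little under three hours on its own, and a run stretching past eight hours would imply something nearer 270 ms per call. Either way the overhead is invisible to a single unit test and ruinous at scale.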

Once again I was grateful to have “continuous” deployments to a DEV environment where this showed up right away so that I could easily diagnose and fix it. This just echoes what I wrote about recently in “Poor Performance of log4net Context Properties”.

A Balance

The current backlash against end-to-end testing is well justified as there are more efficient approaches you can take. But we must remember that unit testing is no panacea either. Last year we had these two competing views going head-to-head with each other: Why Most Unit Testing is Waste and Just Say No to More End-to-End Tests. It’s hard to know what to do.

As always the truth probably lies somewhere in between, and shifts either way depending on the kind of product, people and architecture you’re dealing with. The testing pyramid gets trotted out as the modern ideal but personally I’m still not convinced about how steep the sides of it should be for a monolith versus a micro-service, or a thick client versus a web API.

What I do know is that I find value in all different sorts of tests. One size never fits all.

[1] This is one of the things that pair and mob programming tackles because many eyes help make many kinds of mistakes less common.

[2] Yes, I could also go on about better collaboration and working outside in from a failing system test, but this didn’t deserve any massive post mortem.

[3] Database unit tests aren’t exactly speedy anyway so they increased the entire test suite time by an amount of time that could easily have been passed off as noise.

[4] Why was this a sequential operation? Let’s not go there...

Tuesday, 12 January 2016

Tribalism or Marketing?

I had a brief conversation with Paulmichael Blasucci (@pblasucci) on Twitter after someone re-tweeted the following from him:

“Open Q: what will cause #SoftDev, as an industry, to stop thinking in tribal terms? (e.g. "scala dev", "linux guru", "sql server admin")”

Naturally I wanted to understand what it was he was observing, as in my experience this wasn’t the case, or perhaps I just didn’t understand what he was getting at. I’m not going to pretend that I have any significant grasp of psychology, philosophy, sociology or any (social) science for that matter, and so I expected this to quickly go way over my head. I probably only persevered because I’ve finally got around to reading Jerry Weinberg’s seminal classic “The Psychology of Computer Programming” which he wrote way back in the 1970s and has been on my reading list for far too long.

Twitter is not the best medium for holding any sort of proper conversation but I got enough out of it to start me thinking about how I describe myself, such as in the profiles I have on various sites like StackOverflow, my blog, Linked-In, etc. In all those cases I seem to do exactly what @pblasucci was observing – I appear to be pigeonholing myself with a specific subset of the programming community. But I too wondered why that was, as I don’t consciously try to feel closer to one group or another, although it’s possible I might have unconsciously chosen to try and distance myself from certain other groups.

What’s in a Name?

The debate about how to describe ourselves professionally seems to be never ending. Are we programmers, computer scientists, software developers, software engineers, solution architects, etc? At the recent “bake-off” I wrote about in “Choosing a Supplier – The Hackathon” the various teams all introduced themselves as software engineers and solution architects. When it came to our team we all just described ourselves simply as “a dev”.

When people outside the industry ask me what I do for a living I find it easiest to describe myself simply as “a programmer”. There is little point in being any more specific with them about which “area” I tend to work in unless they try and dig deeper. The person on the street probably hasn’t got the foggiest idea about the differences between developing video games, mobile apps, web sites, back office tools, etc. And given that a large part of my working life has been spent on financial middle-office and back-office type systems I don’t really even have a customer facing product that I could tangentially associate myself with [1].

My father-in-law wrote software back in the days when “programmers” were seen as just code monkeys that turned the analyst’s carefully worked out flowchart into computer code. When programmers started playing the analyst’s role too they became known as analyst/programmers. Hence to avoid confusion (I was in my first job doing the whole lot: analysis, design, implementation, test, deployment, etc.) I said I was a “software engineer”. When I slip up on Twitter and say that I’m a “programmer” he still likes to remind me of my place in the software development hierarchy :o).

Going Freelance

To other people inside the industry I think I stopped being just “a” programmer and started qualifying it a bit more when I went freelance (i.e. contracting). At this point I stopped being interested in “the company” per se or career progression and just wanted to get paid a decent wage for doing what I loved – programming.

As a contractor you are essentially seen as a mercenary (which is where the term “free-lance” comes from). You are primarily hired for your expertise in a particular language or technology and when the project using that finishes, so do you. You and the company part ways and move on to the next gig. Only, we all know software projects often have a lifetime somewhat longer than this simplified view suggests.

Perhaps in an ideal world we’d all just be “journeyman programmers” and would pick up whatever extra skills we needed on the job, whether they be big or small. This happens for permanent employees, but occasionally even for freelancers too. For instance, my introduction to the world of C# came about because the technology stack of the project I interviewed for switched from C++ to C# right at the last moment. Even though I was after a contract position they still asked me if I wanted to pick up C# and .Net on-the-fly. They were more interested in the experience I brought to the project and clearly thought I’d have no trouble learning the language and framework relatively quickly.

By-and-large though this doesn’t happen because the expectation is that you’re hiring a temporary worker that can “hit the ground running”. This means you’re already expected to be well versed in the primary language and toolchain (the so called “must haves”), but may not have much knowledge of the ancillary tooling. Knowledge of the problem domain is a whole other ball game as that can often be traded off against someone’s technical abilities.

Marketing

And so whereas the programmer in the world of permanent employment (where the employer is happy to invest time and money in their education) probably thinks of themselves as more of a generalist, the independent programmer has no such luxury. We have to perform a degree of marketing to ensure that we don’t get continually glossed over by the recruiters every time we switch contracts. Their searches (and those of the hiring organisation’s HR department) are often (at least initially) driven by simple keyword matches. These days the client can easily look you up too on the internet and, if you have a presence, it really helps if you fit the pigeon-hole they’re looking to fill. This is invariably described by the main programming language you have the most recent experience of [2].

This all becomes so much easier when you get older and have far more experience under your belt. When the mechanics of programming start falling into the realms of Unconscious Competence it’s a lot easier to focus on the problem you actually need to solve. Hence it’s easier for a client to take a punt on someone less well versed in one particular toolchain if they have other experiences they can readily apply. But first you have to get past the HR wall to even be considered for that, if you don’t have a way in via a recommendation.

More Than a Language

The need to define ourselves through the programming language with which we have the most recent experience seems a little simplistic, after all it’s just a language, right? Not really. A programming language probably always has been, and still is, linked heavily with the entire toolchain to write, build, test and deploy the software. Compare the Java, Eclipse, Maven, JUnit toolset to its C#, Visual Studio, NUnit counterpart. The front-end world of HTML, CSS, JavaScript, Node, is another. An average programmer just doesn’t move from one world to another without taking some time to pick things up, and a large part of what needs picking up are the libraries that support each culture. As the ports of xUnit to CppUnit and JUnit to NUnit show, getting-by in a language is not the same as writing code idiomatically. That said the problem is probably eased somewhat these days as languages cluster around a “VM”, for example the JVM has Java, Scala, Groovy, etc. whereas .Net has C#, F#, PowerShell, etc. which reduces the impedance mismatches somewhat.

Conclusion

Hence I guess the outcome of this thought exercise is that I see the world through the eyes of the recruitment process. My life as a run-of-the-mill freelance programmer means that I generally describe myself based on the major traits that I expect clients will look for. Although I’ve dabbled in the likes of Python, Ruby, D, F#, Go, etc. over the years, I would never expect to be hired (as a contractor) to work on a production codebase using one of these languages as my skills are too weak [3]. Learning them has brought the benefits of a multi-paradigm education (other ways of seeing the world and solving problems) which ultimately makes me more marketable as a programmer. But this is still within the confines of my most defining skills – the ones that help me put (nice) food on the table.

Maybe though this is exactly what he meant by tribalism.

[1] I started out working on PC shrink-wrapped graphics applications, but that was a long time ago. More recently since moving away from finance I now have something more tangible once again to point the kids to (although it’s still far from being “rock-and-roll”).

[2] Interestingly my own GitHub repo has over a decade of C++ code in it but nothing for C# and .Net which has been my primary professional language for the last 5 years.

[3] Even so they can often be found in the tooling used within the build pipeline, system administration, analysis, etc. on the projects I find myself involved in.

Tuesday, 22 December 2015

The Cost of Not Designing the Database Schema

The tale I wrote about in “Single Points of Failure - The SAN” didn’t entirely conclude at the point the issue was identified and apparently resolved. Whilst the vast majority of problems disappeared there was still a spike every now and then that caused the simple web service we wrote to take hundreds of milliseconds to respond, way more than a gen 2 garbage collection would take. We also logged when garbage collections occurred and they were never in sight when this glitch showed up.

After taking some time off I ended up joining the team who were responsible for calling that tactical web service and so I became privy to the goings-on upstream. It turned out the remaining blips were often occurring when an early morning batch process was run. It made little sense at the time that it could affect an entirely unrelated service, but with what I now knew about the SAN I felt the evidence pointed to a smoking gun. But how to truly explain it?

More Performance Woes

One of the changes being made when I joined this team was increased visibility (for the team) about how the services they owned were behaving in production. One service in particular was beginning to show signs of trouble and with the Christmas period looming it was felt something needed to be done about it pronto.

Interestingly the investigation of timeouts caused me to start correlating data with the other service we had had problems with earlier. On one particular day this daily batch process was delayed by a couple of hours and on that very same day the unexplained timeouts in the downstream service shifted too. Whilst correlation does not imply causality, the smoke from the gun was thickening. But it still didn’t make sense how the problem was “jumping the cracks”.

The investigation for my current team’s service turned to the Oracle database and it unearthed some stats that showed the database was making quite a few reads to satisfy the most common query type – retrieving the transactions for an account.

The Mists Begin to Clear

I started to apply the “5 Whys” technique to see if I could piece together a coherent picture that would address the immediate concern, but might also encompass the other one too. The question I started with was this:

“Why are the upstream service HTTP requests timing out?”
  1. Because they are waiting for a database connection. Why?
  2. Because each query is taking much longer. Why?
  3. Because the database is constantly hitting the SAN. Why?
  4. Because the database has to read so many pages. Why?
  5. Because the table being queried is badly organised.

Switching to the problem of unexplained timeouts in the other service for a moment it all started to make sense. This batch process that runs in the early morning generates a huge amount of “non-cacheable” reads (essentially a table scan) which saturates the SAN and therefore causes SAN-related problems similar to those we had before.

Sadly my hypothesis was never acknowledged or discussed outside the team as they had stopped asking questions when they realised the database query was taking too long. However within the team it was accepted as highly plausible so I felt comfortable that at least we had some closure, and more importantly a theory to consider if things showed up again.

The temporary solution to the database problem was to stick a whole load more RAM in it to vastly improve caching and therefore reduce query times enough during the day to avoid the bottlenecks for now.

I posited that this change would also fix (or at least heavily reduce) the problems of unknown timeouts in the other service because Oracle would need to perform far fewer physical reads, and therefore the load on the SAN would also be reduced. This is exactly what I observed, so the gun was smoking even more now.

Addressing the Root Cause

Fundamentally the problem was down to the database having to do way more I/O work than should be necessary to satisfy the query. The table in question is essentially a set of transactions for an account which are being queried by the account’s ID.

The table was implemented as a simple heap with an index for the account ID. Whilst this meant that the transactions for an account could be found by the index, due to the heap structure the transactions were spread right across the table’s entire set of pages. Essentially the database did a few reads of the table index to find the rows in question and then (pathologically speaking) did one read per row to get the data itself. Hence, for accounts with many transactions that was a huge number of random I/Os.

I wasn’t there when the table was designed and so I have no knowledge about what the rationale was. Maybe it was just “the simplest thing that would possibly work” and they thought they’d have time to address scalability later? Or maybe they expected a different read / write pattern? Either way it’s not the structure I would have expected out-of-the-box for this kind of table.

Given that the table stores data for an account, and the key for that account is the primary means of lookup, we should be looking to keep all the data for an account close together. Hence using a table physically structured around the account ID (a “clustered index” on SQL Server and “index-organised table” on Oracle) will provide fast access and excellent locality of reference because all the pages for each account will be stored together. This way the database only has to navigate the index to the start of the specific account’s data and then do a few sequential page reads to get the rest.
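
The difference in I/O between the two layouts can be sketched with a very crude model. The function names and the numbers (three index reads, 100 rows per page, 1,000 transactions) are illustrative assumptions, not measurements from the system described, and the heap figure is the pathological worst case where no two rows share a page.

```python
import math


def heap_reads(rows, index_reads=3):
    """Worst case for a heap: every row lives on a different page."""
    return index_reads + rows


def clustered_reads(rows, rows_per_page, index_reads=3):
    """Rows for one account are packed into adjacent pages."""
    return index_reads + math.ceil(rows / rows_per_page)


rows = 1_000  # transactions for one busy account (illustrative)
print(heap_reads(rows))                          # 1003
print(clustered_reads(rows, rows_per_page=100))  # 13
```

The model ignores caching entirely, but it shows why the gap widens as accounts accumulate transactions: one layout scales with the row count, the other with the page count.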

No Time to Fix It

The problem with modern businesses is that they run 24x7 these days and so there is no time for downtime and maintenance. So whilst a differently organised table may well now be the best approach, the cost of implementing that change may be too high. Due to the current volume of data, taking the database offline and rebuilding it was not considered possible given the current state of the business and market.

Instead the DBAs decided to add a covering index that could be built online which included all the data so the query optimiser could satisfy the main query solely from the index. Essentially they created the clustered table via an index. Of course every write now had to update the table, original index and the new one. It should have been possible at that point to drop the original index, but I’m not sure if that happened as they’d also have to prove it wasn’t being used by another query.
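
The effect of a covering index can be seen on a small scale with SQLite from Python. This is only a stand-in: the real system was Oracle, SQLite tables are not heaps, and the table, column and index names here are invented for the example. It does show the planner switching to answering the query from the index alone once the index carries every column the query needs.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE txn (id INTEGER PRIMARY KEY, account_id INTEGER,
                      posted_at TEXT, amount REAL);
    CREATE INDEX idx_txn_account ON txn (account_id);
""")

QUERY = "SELECT posted_at, amount FROM txn WHERE account_id = ?"


def plan(connection):
    """Return the planner's description of how QUERY would be run."""
    rows = connection.execute("EXPLAIN QUERY PLAN " + QUERY, (42,)).fetchall()
    return " ".join(row[-1] for row in rows)


before = plan(conn)  # finds rows via idx_txn_account, then visits the table

# The covering index carries every column the query needs.
conn.execute(
    "CREATE INDEX idx_txn_cover ON txn (account_id, posted_at, amount)")
after = plan(conn)   # should now be satisfied from the index alone

print(before)
print(after)
```

The exact wording of the plan varies between SQLite versions, so the sketch prints it rather than assuming a particular string.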

Back to the SAN

In the meantime I was asked to investigate some other unexplained timeouts that occurred well outside the morning batch processing window. Knowing what we did now about the database and the SAN someone questioned whether the DBAs were already implementing this new index in production.

They weren’t, but they were testing the approach in the QA environment. The correlation again was very strong, so someone investigated the topology of the databases in the QA environment and discovered that some of the storage pools shared a portion of the SAN with production, which was clearly unintentional. Oops.

Early Warning Indicators

Hindsight is a wonderful thing and it’s good that they were gaining visibility of their service’s behaviour, but that was only able to identify immediate glitches. There also needs to be some element of trend analysis to spot when things are beginning to head south.

For me the stance on instrumentation is that you measure everything you can afford to. Any lengthy computation or external I/O (i.e. anything that could block) should be recorded so that you can get a handle on what operations are behaving strangely now, and how they are changing over time as the service ages and adapts to new loads. It’s pretty easy to add too (see “Simple Instrumentation”).
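
As a rough idea of how little code this can take, here is a minimal Python sketch. It is not the approach from the post referenced above; the `measure` helper, the `timings` list and the operation name are all invented for illustration, and a real service would send the measurements to a log or metrics sink rather than a list.

```python
import time
from contextlib import contextmanager

timings = []  # (operation name, elapsed seconds); a stand-in for a real sink


@contextmanager
def measure(operation):
    """Record how long the wrapped block took, even if it raises."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings.append((operation, time.perf_counter() - start))


with measure("db.fetch_transactions"):
    time.sleep(0.01)  # placeholder for the real query or remote call

print(timings)
```

Wrapping each blocking call this way gives one timing per operation, which is the raw material for spotting both immediate glitches and slow drift.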

Without some form of trend analysis you become like a slow-boiled frog that isn’t noticing how the surroundings are changing. All of a sudden what once took milliseconds now takes tens of milliseconds but you haven’t noticed it creep up. Everything appears to be normal right up to the point that performance drops off the cliff and you’re fire-fighting to bring it back under control.
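
One very simple form of trend check is to compare recent timings against an earlier baseline. The sketch below is an assumption about what such a check might look like, not something taken from the system described; the function name, window size, threshold factor and sample data are all made up.

```python
from statistics import median


def has_drifted(samples_ms, window=5, factor=2.0):
    """True when the recent median exceeds the earlier median by `factor`."""
    if len(samples_ms) < 2 * window:
        return False  # not enough history to compare
    baseline = median(samples_ms[:-window])
    recent = median(samples_ms[-window:])
    return recent > factor * baseline


steady = [4, 5, 4, 6, 5, 5, 4, 5, 6, 5]
creeping = [4, 5, 4, 6, 5, 12, 14, 15, 17, 20]
print(has_drifted(steady))    # False
print(has_drifted(creeping))  # True
```

Using the median rather than the mean keeps a single slow outlier from triggering the check, so it responds to a sustained shift rather than a one-off spike.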


You also cannot just monitor everything and expect to make sense of it all when a crisis hits. The data by itself is no use if you don’t understand how it relates to the moving parts of the system – you need to know why certain things change together, or not. From this you can build a heartbeat so that you really know how the system is evolving over time.