Friday 20 October 2017

Good Stories Assure the Architecture

One of the problems a team can run into when they adopt a more agile way of working is that they struggle to frame their backlog in terms of user-focused stories. This is a problem I’ve written about before in “Turning Technical Tasks Into User Stories”, which looked at it for smaller units of work. Even if the team can buy into that premise for the more run-of-the-mill features, it can still be a struggle to see how it works for the big-ticket items like the system’s architecture.

The Awkward Silence

What I’ve experienced is that the team can start to regress when faced with discussions around what kind of architecture to aim for. With a backlog chock-full of customer-pleasing functionality, the architectural conversations can begin to take a back seat as the focus shifts to fleshing out the walking skeleton with features. Naturally the nervousness starts to set in as the engineers begin to wonder when the architecture is going to get the attention it rightly deserves. It’s all very well supporting a handful of “friendly” users, but what about when you have real customers who’ve entrusted you with their data and want to make use of it without a moment’s notice at any hour of the day?

The temptation, which should be resisted, is to treat architectural work as outside the scope of the core backlog – stuff “the business does not understand”. This path leads to a split backlog, potentially even two separate ones – a functional and a non-functional backlog – which makes prioritisation impossible. Burying the work also kills transparency, eventually erodes trust, and still doesn’t get you the answers you really need.

Instead, the aim should be to frame the architectural concerns in terms the stakeholders do understand, so that the business can be better informed about their actual benefits. Moreover, when “The Architecture” is a journey rather than a single destination, there is no longer one set of benefits to aim for; there are multiple trade-offs as the architecture evolves over time, changing at each step to satisfy the ongoing needs of the customer(s) along the way. In essence there is no “final solution”, only “what we need for the foreseeable future”.

Tell Me a Story

So, what do I mean by “good stories”? Well, the traditional way this goes is for an analyst to solicit some non-functional requirements for some speculative eventual system behaviour. If we’re really lucky it might end up in the right ballpark at one particular point in the future. What’s missing from this scene is a proper conversation, a proper story – one with a beginning, a middle, and an end – covering where we are today, the short-term needs, and the longer-term vision.

But not only do we need to get a feel for their aspirations, we also need quantifiable metrics for how the system needs to perform. Vague statements like “fast enough” are just not helpful. A globally accessible system with an anticipated latency in the tens of milliseconds will need to break the laws of physics unless we trade off something else. We also need to know how exceptional events like Cyber Monday are to be factored into the operational side.

It’s not just about performance either. In many cases end users care that their data is secure, both in flight (over the network) and at rest, although they likely have no idea what this actually means in practice. Patching servers is a technical task, but the bigger story is about how the team responds to a vulnerability, which may make patching irrelevant. Similarly, database backups are not the real issue; it’s about service availability – you cannot be highly available if the loss of an entire data centre potentially means waiting for a database to be restored from scratch elsewhere.

Most of the traditional conversations around non-functional requirements focus entirely on the happy path; for me the conversation doesn’t really get going until you start talking about what needs to happen when the system is down. It’s never a case of “if” but “when” it fails, and therefore mitigating these problems features heavily in our architectural choices. It’s an uncomfortable conversation, as we never like discussing failure, but that’s what having “grown-up” conversations means.

Incremental Architecture

Although I’ve used the term “story” in this post’s title, many of the issues that need discussing are really in the realm of “epics”. However, we shouldn’t get bogged down in the terminology; the essence is to focus on the outcome from the user’s perspective. Ask yourselves how fast, how secure, how available, etc. it needs to be now, and how those needs might change in response to the system’s, and the business’s, growth.

With a clearer picture of the potential risks and opportunities we are better placed to design and build in small increments such that the architecture can be allowed to emerge at a sustainable rate.

Friday 13 October 2017

The User-Agent is not Just for Browsers

One of the trickiest problems when you’re building a web service is knowing who your clients are. I don’t mean your customers – that’s a much harder problem – no, I literally mean you don’t know what client software is talking to you.

Although it shouldn’t really matter who your consumers are from a technical perspective, once your service starts to field requests and you’re working out what and how to monitor it, knowing this becomes far more useful.

Proactive Monitoring

For example, on the last API I worked on we were generating 404s for a regular stream of requests because the consumer had a bug in their URL formatting and erroneously appended an extra space to one of the segments. We could see this at our end but didn’t know who to tell. We had to spam our “API Consumers” Slack channel in the hope the right person would notice [1].

We also had consumers sending us the wrong kind of authorisation token, which again we could see but didn’t know which team to contact. Although having a Slack channel for the API helped, we found that people only paid attention to it when they noticed a problem. It also appeared, from our end, that devs would prefer to fumble around rather than pair with us on getting their client end working quickly and reliably.

Client Detection

Absent any other information, a cloud-hosted service pretty much only has the client IP to go on. If you’re behind a load balancer then you’re looking at the X-Forwarded-For header instead, which might give you a clue. Of course, if many of your consumers are also services running in the cloud or behind the on-premises firewall, they all look pretty much the same.

Hence as part of our API documentation we strongly encouraged consumers to supply a User-Agent field with their service name, purpose, and version, e.g. MyMobileApp:Test/1.0.56. This meant that we would now have a better chance of talking to the right people when we spotted them doing something odd.
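As a rough illustration only, a consumer written in C# might stamp the header once on its HttpClient – the service name, purpose and version shown here are purely made up:

using System.Net.Http;

var client = new HttpClient();

// "Service:Purpose/Version" – whatever convention is documented, the point
// is that the provider can identify the caller from it.
client.DefaultRequestHeaders.TryAddWithoutValidation(
    "User-Agent", "MyMobileApp:Test/1.0.56");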

From a monitoring perspective we can then use the User-Agent in various ways to slice-and-dice our traffic. For example we can now successfully attribute load to various consumers. We can also filter out certain behaviours from triggering alerts when we know, for example, that it’s their contract tests passing bad data on purpose.

By providing us with a version number we can also see when they release a new version and help them ensure they’ve deprecated old versions. Whilst you would expect service owners to know exactly what they’ve got running where, you’d be surprised how many don’t know they have old instances lying around. It also helps identify who the laggards are that are holding up removal of your legacy features.
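On the provider side, a minimal sketch of pulling those pieces apart for monitoring – assuming nothing more than the “Name:Purpose/Version” convention above – might look like this:

// Splits e.g. "MyMobileApp:Test/1.0.56" into name, purpose and version so
// they can be attached to metrics or used to filter alerts.
static (string Name, string Purpose, string Version) ParseUserAgent(string userAgent)
{
    var slash = userAgent.IndexOf('/');
    var version = slash >= 0 ? userAgent.Substring(slash + 1) : "unknown";
    var product = slash >= 0 ? userAgent.Substring(0, slash) : userAgent;

    var colon = product.IndexOf(':');
    var name = colon >= 0 ? product.Substring(0, colon) : product;
    var purpose = colon >= 0 ? product.Substring(colon + 1) : "unknown";

    return (name, purpose, version);
}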

Causality

A somewhat related idea is the use of “trace” or “correlation” IDs, which is something I’ve covered before in “Causality - A Mechanism for Relating Distributed Diagnostic Contexts”. These are unique IDs for diagnosing problems with requests and it’s useful to include a prefix for the originating system. However that system may not be your actual client if there are various other services between you and them. Hence the causality ID covers the end-to-end where the User-Agent can cover the local client-server hop.
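By way of a sketch only – the header name and prefix convention here are assumptions rather than any standard – a client might stamp each outgoing request like so:

using System;
using System.Net.Http;

var request = new HttpRequestMessage(HttpMethod.Get, "https://api.example.com/orders");

// A per-request ID prefixed with the originating system's name so that the
// provider can tie together the hops the request makes downstream.
request.Headers.TryAddWithoutValidation("Causality-Id", $"MyMobileApp/{Guid.NewGuid():N}");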

You would think that the benefit of passing it was fairly clear – it allows providers to proactively help consumers fix their problems. And yet, like so many non-functional requirements, it sits lower down their backlog because it’s only optional [2]. Not only that, but by masking themselves consumers actually hamper delivery of new features, because the provider ends up working harder than necessary just to keep the existing lights on.

 

[1] Ironically the requests were for some automated tests which they didn’t realise were failing!

[2] We wanted to make the User-Agent header mandatory on all non-production environments [3] to try and convince our consumers of the benefits but it didn’t sit well with the upper echelons.

[3] The idea being that its use in production then becomes automatic but does not exclude easy use of diagnostic tools like CURL for production issues.

Thursday 12 October 2017

Don’t Hide the Solution Structure

Whenever you join an existing team and start work on their codebase you need to orientate yourself so that you have a feel for the system’s architecture and design. If you’re lucky there is some documentation, perhaps nice diagrams to give you an overview. Hopefully you also have an extensive suite of tests to tell you how the system behaves.

More than likely there is nothing or very little to go on, and if it’s a truly legacy system any documentation could well be way out of date. At this point you pretty much only have the source code to work from. Whilst this is the source of truth, the amount of code you need to read to become au fait with all the various high-level concepts depends in part on how well it’s laid out.

Static Structure

Irrespective of whether you like to think of your layers in terms of onions or brick walls, all code essentially gets organised on disk and that means the solution structure is hierarchical in nature. In the most popular languages that support namespaces, these are also hierarchical and are commonly laid out on disk to reflect the same hierarchy [1].

Although the compiler is happy to just hoover up source code from the entire solution and largely ignore the relative position of callers and callees, there are useful conventions which, if honoured, allow you to reason about and refactor the code more easily due to lower coupling. For example, defining an interface in the same source file as a class that implements it suggests a different usage than when the interface sits externally, further up the hierarchy. Also, seeing code higher up the hierarchy referencing types deeper down in an unrelated branch is another smell – of an abstraction potentially depending on an implementation detail.
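As a contrived example (the names are entirely invented), the convention might play out like this, with the abstraction sitting above the branch that implements it:

namespace MyCompany.Billing
{
    public class Invoice { }

    // The abstraction the rest of the solution depends on sits near the
    // top of the hierarchy...
    public interface IInvoiceStore
    {
        void Save(Invoice invoice);
    }
}

namespace MyCompany.Billing.Persistence
{
    // ...while the implementation detail sits deeper down in its own
    // branch, where nothing higher up needs to reference it directly.
    public class SqlInvoiceStore : IInvoiceStore
    {
        public void Save(Invoice invoice) { /* write to the database */ }
    }
}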

Navigating the Structure

One of the things I’ve noticed in recent years whilst pairing is that many developers appear to navigate the source code solely through their IDE, and within the IDE by using features like “go to definition (implementation)”. Some very rarely see the solution structure because they hide it to gain more screen real estate for the source file of current interest [2].

Hence the only time the solution structure is visible is when there is a need to add a new source file. My purely anecdotal evidence suggests that the new file will be added without a great deal of thought, as the code can easily be located later by the author through its class name or another reference; they never have to consider where it “logically” resides.

Sprawling Suburbs

The net result is that namespaces and packages suffer from urban sprawl as they slowly accrete more and more code. This newer code adds more dependencies and so the package as a whole acquires an ever increasing number of dependencies. Left unchecked this can lead to horrible cyclic dependencies that are a nightmare to resolve.

I recently had the opportunity to revisit the codebase for a greenfield system I had started a few years before. We initially partitioned the code into a few key assemblies to get ourselves going, and so I was somewhat surprised to see the same assemblies still there a few years later, albeit massively overgrown with extra responsibilities. As a consequence even the simple home-grown tools had bizarre dependencies dragged in through bloated shared libraries [3].

Take a Stroll

So in future, instead of taking the Underground (subway) through your codebase every day, stop and take a stroll every now and then along the paths. The same rules about cohesion within the methods of a class also apply at the higher levels of design – classes in a namespace, namespaces in an assembly, assemblies in a solution, etc. Then you’ll find that as the system grows it’s easier to refactor at the package level [4].

(For more on this topic see my older post “Who’s Maintaining the 100 Foot View?”.)

 

[1] Annoyingly this is not a common practice in the C++ codebases I’ve worked on.

[2] If I was being flippant I might suggest that if you really need the space the code may be too complicated, as I once did on Twitter here.

[3] I once dragged in a project’s shared library for a few useful extension methods to use in a simple console app and found I had pulled in an IoC container and almost a dozen other NuGet dependencies!

[4] In C# the internal access modifier has zero effect if you stick all your code into one assembly.

Wednesday 11 October 2017

Every Commit Needs the Rationale to Support It

Each and every change to a codebase should be performed for a very specific reason – we shouldn’t just change some code because we feel like it. If you follow a checklist (mental or otherwise), such as the one I described in “Commit Checklist”, then each commit should be as cohesive as possible with any unintentional edits reverted to spare our blushes.

However, whilst the code can say what behaviour has changed, we also need to say why it was changed. The old adage “use the source, Luke” is great for reminding us that the only source of truth is the code itself, but changes made without any supporting documentation make software archaeology [1] incredibly difficult in the future.

The Commit Log

Take the following one line change to the JSON serialization settings used when persisting to a database:

DateTimeZoneHandling = DateTimeZoneHandling.Utc;

This single-line edit appeared in a commit all by itself. Now, any change which has the potential to affect the storage or retrieval of the system’s data is something which should not be entered into lightly. Even if the change was done to make what is currently a default setting explicit, this fact still needs to be recorded – the rationale is important.
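For reference, the line is one of Json.NET’s JsonSerializerSettings; in context it would look something like the following, although the surrounding configuration here is a guess rather than the actual code from the commit:

using Newtonsoft.Json;

var settings = new JsonSerializerSettings
{
    // Treat DateTimes as UTC: local times are converted when written out
    // and values read back are assumed to be UTC.
    DateTimeZoneHandling = DateTimeZoneHandling.Utc
};

// e.g. JsonConvert.SerializeObject(document, settings) when persisting.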

The first port of call for any documentation around a change is probably the commit message. Given that it lives with the code and is (usually) immutable it stands the best chance of remaining intact over time. In the example above the commit message was simply:

“Bug Fix: added date time zone handling to UTC for database json serialization”

In the same way that poor code comments have a habit of simply stating what the code does, the same malaise can affect commit messages by merely restating what was changed. Our example largely suffers from this, but it also teases us by additionally mentioning that it was done to fix a bug. Suddenly we have so many more unanswered questions about the change.
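A purely invented rewrite – the real rationale is exactly what’s missing, so this is illustrative only – might have read something like:

“Bug Fix: treat DateTimes as UTC when serializing documents to the database.

Documents written by machines configured with a non-UTC time zone were stored with local timestamps, so date-range queries returned the wrong results. Making DateTimeZoneHandling explicit ensures every value is converted to UTC on the way in. See JIRA-1234.”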

Code Change Comments

In the dim and distant past it was not unusual to use code comments to annotate changes as well as to describe the behaviour of the code. Before the advent of version control features like “blame” (aka annotate) it was non-trivial to track down the commit where any particular line of code changed. As such it seemed easier to embed the change details in the code itself rather than the VCS tool, especially if the supporting documentation lived in another system; you could just use the Change Request ID as the comment.

As you can imagine this sorta worked okay at first, but as the code continued to change and refactoring became more popular these comments became as distracting and pointless as the more traditional kind. It also did nothing to help reduce the overhead of tracking the how-and-why in different places.

Feature Trackers

The situation originally used to be worse than this as new features might be tracked in one place by the business whilst bugs were tracked elsewhere by the development team. This meant that the “why” could be distributed right across time and space without the necessary links to tie them all together.

The desire to track all work in one place in an Enterprise tool like JIRA has at least reduced the number of places you need to look for “the bigger picture”, assuming you use the tool for more than just recording estimates and time spent, but of course there are lightweight alternatives [2]. Hence recording the JIRA number or Trello card number in the commit message is probably the most common approach to linking these two sides of the change.

As an aside, one of the reasons many teams haven’t historically put all their documentation in their source code repo is because it’s often been inaccessible to non-developer colleagues, either due to lack of permissions or technical ability. Fortunately tools like GitHub have started to bridge this divide.

Executable Specifications

One of the oldest problems in software development has been keeping the supporting documentation and code in sync. As features evolve it becomes harder and harder to know what the canonical reason for any change is because the current behaviour may be the sum of all previous related requirements.

An increasingly popular technique for combating this has been to express the documentation, i.e. the requirements, in code too, in the form of tests. At a high level these are acceptance tests, with more technical behaviours expressed as unit or integration tests.

This brings me back to my earlier example. It’s incredibly rare that any code change would be committed without some kind of corresponding change to the automated tests. In this instance the bug must have manifested itself in the persistence layer and I’d expect at least one new test to be added (or an existing one fixed) to illustrate what the bug is. Hence the rationale for the change is to fix a bug, and the rationale can largely be described through the use of one or more well written tests rather than in prose.
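To illustrate the point – this test is entirely invented, and assumes NUnit and Json.NET rather than whatever the real codebase used – it might look something like:

using System;
using Newtonsoft.Json;
using NUnit.Framework;

[TestFixture]
public class PersistenceTests
{
    [Test]
    public void timestamps_are_stored_as_utc()
    {
        var settings = new JsonSerializerSettings
        {
            DateTimeZoneHandling = DateTimeZoneHandling.Utc
        };
        var document = new { Timestamp = new DateTime(2017, 10, 11, 9, 0, 0, DateTimeKind.Local) };

        var json = JsonConvert.SerializeObject(document, settings);

        // The local time should have been converted to UTC on the way out.
        Assert.That(json, Does.Contain("Z\""));
    }
}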

Exceptions

There are of course no absolutes in life and fixing a spelling mistake should not require pages of notes, although spelling incorrectly on purpose probably does [3].

The point is that there is a balance to be struck if we are to trade-off the short and long term maintenance of the system. It might be tempting to rely on tribal knowledge or the product owner’s notes to avoid thinking about how the rationale is best expressed, but finding a way to encode that information in executable form, such as through tests, provides both the present reviewer and the future software archaeologist with the most usable representation.

 

[1] See my “Software Archaeology” article for more about spelunking a codebase’s history.

[2] I’ve written about the various tools I’ve used in the past in “Feature Tracking”.

[3] The HTTP “referer” header being a notable exception; see Wikipedia.