Thursday, 27 October 2016

Unmatched REST Resources – 400, 404 or 405?

There is always a tension in programming between creating something that is hard to misuse and adhering to standards to leverage the Principle of Least Surprise. One area where I personally struggle with this conflict is how to communicate to a client (of the software kind) that they have made a request for something which doesn’t currently exist, and almost certainly never will.

As a general rule, when someone requests a resource that doesn’t exist you should return a 404 (Not Found). That makes perfect sense in production, once all the bugs have been ironed out, but during development, when we’re still exploring the API, it’s all too easy to make a silly mistake and not realise that it’s down to a bug in our own code.

An Easy Mistake

Imagine you’re looking up all orders for a customer; you might design your API something like this:

GET /orders/customer/12345

For starters you have the whole singular vs plural noun debate, which means you’ll almost certainly try this by accident:

GET /order/customer/12345

or make the inverse mistake

GET /orders/customers/12345

By the standard HTTP rules you should return a 404 as the resource does not exist at that address. But does it actually help your fellow developers to stick to the letter of the law?

Frameworks

What makes this whole issue much thornier is that if you decide you want to do the right thing by your fellow programmers you will likely have to fight any web framework you’re using because they usually take the moral high ground and do what the standard says.

What then ensues is a fight between the developer and framework as they try their hardest to coerce the framework to send all unmatched routes through to a handler that can return their preferred non-404 choice.

A colleague who is also up for the good fight recently tried to convince the Nancy .Net framework to match the equivalent of “/.*” (the lowest weighted expression), only to find they had to define one route for each possible number of segments, i.e. “/.*”, “/.*/.*”, “/.*/.*/.*”, etc. [1].
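
In Nancy terms the workaround amounted to a module of capture-only routes, one per segment depth. This is just a sketch of the idea, assuming Nancy 1.x; the module name and the choice of status code are mine:

using Nancy;

// A catch-all module: capture segments are the lowest weighted routes, so
// any literal route defined elsewhere still wins. One route is needed for
// each level of nesting we want to cover.
public class UnmatchedRoutesModule : NancyModule
{
    public UnmatchedRoutesModule()
    {
        Get["/{a}"] = _ => HttpStatusCode.BadRequest;
        Get["/{a}/{b}"] = _ => HttpStatusCode.BadRequest;
        Get["/{a}/{b}/{c}"] = _ => HttpStatusCode.BadRequest;
        // ...and so on, one entry per extra path segment.
    }
}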

Even then he still got some inconsistent behaviour. Frameworks also make it really easy to route based on value types, which gives you a form of validation. For example, if I know my customer ID is always an integer I could express my route like this:

/orders/customer/{integer}

That’s great for me, but when someone using my API accidentally formats a URL incorrectly and puts the wrong type of value for the ID, say the customer’s name, they get a 404 because no route matches a non-integer ID. I think this is a validation error and should probably be a 400 (Bad Request) as it’s a client programmer bug, but the framework has caused it to surface in a way that’s no different from a completely invalid route.
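
One possible way to claw that 400 back, again assuming Nancy and its route segment constraints, is to pair the typed route with an untyped sibling that soaks up the validation failures (FindOrders is a placeholder for the real handler):

using Nancy;

public class OrdersModule : NancyModule
{
    public OrdersModule()
    {
        // The constrained route matches when the ID parses as an integer...
        Get["/orders/customer/{id:int}"] = p => FindOrders((int)p.id);
        // ...the unconstrained sibling catches everything else as a 400.
        Get["/orders/customer/{id}"] = _ => HttpStatusCode.BadRequest;
    }
}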

Choice of Status Code

So, assuming we want to return something other than Not Found for what is clearly a mistake on the client’s part, what are our choices?

In the debates I’ve seen on this, 400 (Bad Request) seems like a popular choice as the request, while perhaps not technically malformed, is often synonymous with “client screwed up”. I also like Phil Parker’s suggestion of using 405 (Method Not Allowed) because it feels like less of an abuse of the 4XX status codes and, being less common than a 400, it also shows up a bit more.

 

[1] According to this StackOverflow post it used to be possible; maybe our Google-fu was letting us down.

PUT vs POST and Idempotency

In RESTful APIs there is often a question mark around the use of PUT versus POST when creating and updating resources. The canonical example is the blogging engine where we wish to add new posts and comments. The default choice appears to be POST, which I guess is because we tend to shy away from the contentious discussion about when PUT is the more suitable verb.

As I’ve understood it, when the client can determine the address of the resource you can use PUT, whereas when only the server knows it (i.e. generates it) then a POST is more appropriate. Hence you could say the question boils down to whether or not the client can create a unique ID for the resource or has to leave that to the server.

POST

If we look at the blogging engine it’s probably easier on the client if the server just takes care of generating the ID:

POST /blog/create
{
  title: "My New Blog Post"
}

Here the server returns the URL for the freshly minted blog post, e.g. via the Location header. An obvious choice would probably be to generate a GUID and use that as the permanent ID:

GET /blog/1234-1234-1234-1234
{
  title: "My New Blog Post"
}

Idempotency

The problem with this approach is what happens when the request fails. We don’t know if the blog post was created because we don’t know its ID yet, and if we retry the request then, unless the server has some record of the original, we’ll end up with a duplicate whenever the server did complete on its side. A common solution is to include a client-generated ID for the request that the server can use to detect when a request is being replayed:

POST /blog/create
{
  requestId: "9876-9876-9876-9876",
  title: "My New Blog Post"
}
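
On the server side the replay check can be as simple as an atomic lookup-or-create keyed on that request ID. Here’s a minimal in-memory sketch; the class and member names are mine, and a real service would persist the map in a durable store rather than hold it in memory:

using System.Collections.Concurrent;
using System.Threading;

public class BlogPostService
{
    // Maps a client-generated request ID to the URL of the post it created.
    private readonly ConcurrentDictionary<string, string> _seenRequests =
        new ConcurrentDictionary<string, string>();
    private int _lastId;

    public string Create(string requestId, string title)
    {
        // A replayed request ID short-circuits to the original URL
        // instead of minting a duplicate post.
        return _seenRequests.GetOrAdd(requestId, _ =>
        {
            var id = Interlocked.Increment(ref _lastId);
            // ...persist the new post itself here...
            return "/blog/" + id;
        });
    }
}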

But wait, haven’t we just solved the client-generated ID problem? If we need to come up with a unique ID for the request for the purposes of idempotency, why don’t we just PUT the resource with that ID in the first place?

PUT /blog/9876-9876-9876-9876
{
  title: "My New Blog Post"
}

Natural Keys

When the client is a single machine creating the resource from scratch it has plenty of latitude in choosing the resource’s ID, but if you’re transforming data in a distributed system it gets a little trickier.

If your event comes in the form of an upstream message you cannot just mint a fresh GUID, because when the message gets replayed (which it will, eventually) you’ll generate a different request ID and end up with a duplicate. Hence you need to look for something in the upstream message itself that can be used as a more natural key for the resource.

Going back to our blogging engine example we already had one in the guise of the blog post’s title:

PUT /blog/my-new-blog-post
{
  title: "My New Blog Post"
}

Yes, if the title changes then the ID will no longer match, but that was only a convenience anyway. Hopefully there is enough data in the request itself to make an ID clash extremely unlikely, e.g.

PUT /blog/2016-10-01-chris-oldwood-my-new-blog-post
{
  author: "Chris Oldwood",
  title: "My New Blog Post",
  created: "2016-10-01T08:34:00"
}
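
Deriving that key is just a little string mangling. Here’s a sketch of one way to do it, assuming the date, author and title are always present in the upstream message:

using System;
using System.Text.RegularExpressions;

static class NaturalKeys
{
    // Builds a URL-friendly natural key, e.g.
    // "2016-10-01-chris-oldwood-my-new-blog-post".
    public static string For(DateTime created, string author, string title)
    {
        return string.Format("{0:yyyy-MM-dd}-{1}-{2}",
                             created, Slugify(author), Slugify(title));
    }

    // Lower-cases the text and collapses runs of non-alphanumerics to hyphens.
    private static string Slugify(string value)
    {
        return Regex.Replace(value.ToLowerInvariant(), "[^a-z0-9]+", "-")
                    .Trim('-');
    }
}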

Mutable Resources

If the resource is immutable then you only need to guard against duplicate attempts to create it, but if it’s mutable then you may already be considering using PUT to mutate it [1]. In that case you’ll probably have come up with a versioning scheme that allows you to detect concurrency conflicts from multiple sources, e.g. an ETag. A duplicate create request will not carry an “original version” tag to compare against, so it surfaces as a conflict just the same as one that occurs when updating later.
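
For example, a genuine update guarded by a version might look like this (the tag value is illustrative); a replayed create carries no such tag, so the server can reject it just as it would a stale update, with a 409 (Conflict) or 412 (Precondition Failed):

PUT /blog/9876-9876-9876-9876
If-Match: "v2"
{
  title: "My Revised Blog Post"
}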

Degrees of Idempotency

It might sound like an oxymoron to say that there are varying degrees of idempotency, but there can be. Essentially we’re making the usual time/space trade-off, balancing the need to verify the details of a potential duplicate or erroneous request against the need to persist more state so the comparison can be made at some indeterminate point in the future. Whether you are doing a POST or a PUT also changes the assumptions about what counts as a duplicate.

For example, at the weakest end of the spectrum, if you’re doing a POST with a simple unique request ID you might just say that if you’ve seen the ID before then, irrespective of the content, it’s a duplicate request. And if you’re happy to miss a replay arriving, say, one month later, you can even put a time-to-live (TTL) on the request ID to save space. Either way you’re treating idempotency very much as a temporal anomaly, which is essentially what it usually is: a fast or slow retry [2].
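
In .Net terms that weakest scheme is close to a one-liner over MemoryCache. This sketch (the class name and the 30-day expiry are my choosing) leans on Add() returning false when the key already exists:

using System;
using System.Runtime.Caching;

public class RequestIdLog
{
    private readonly MemoryCache _cache = MemoryCache.Default;

    // True if this request ID has already been seen within the TTL window.
    public bool IsDuplicate(string requestId)
    {
        return !_cache.Add(requestId, true, DateTimeOffset.UtcNow.AddDays(30));
    }
}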

If you’re looking for a little extra peace of mind you might decide to generate a hash of the salient headers and content too, which would allow you to detect when a request ID has been reused with different content, either through programmer naivety or an unexpected change in the system. That is not a duplicate but a bad request. Sadly textual formats like XML and JSON have many equally valid representations [3], so a straight text hash is imprecise, but it may well be the pragmatic choice.
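
A sketch of such a fingerprint, hashing the raw body alongside the request ID; as per [3], two equivalent-but-differently-formatted payloads will still hash differently:

using System;
using System.Security.Cryptography;
using System.Text;

static class RequestFingerprint
{
    // Hashes the request ID and raw body together; comparing the stored
    // fingerprint on replay spots an ID reused with different content.
    public static string Compute(string requestId, string body)
    {
        using (var sha = SHA256.Create())
        {
            var bytes = Encoding.UTF8.GetBytes(requestId + "\n" + body);
            return BitConverter.ToString(sha.ComputeHash(bytes)).Replace("-", "");
        }
    }
}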

At the other end of the spectrum you might not be happy to silently discard false positives and so you need to persist enough about the request to be sure it really is a duplicate. This likely involves doing a proper comparison of the attributes in the request, which implies that you must still have them. If you save each change as an event rather than just mutating the state to provide an updated snapshot, then you’ll have what you need, but the extra storage is going to cost you. Of course there are other benefits to storing event streams but that’s a different story.

Just Pedantry?

Hence one advantage of using a PUT over a POST is that you push the responsibility onto the client to define what it means to be talking about the same resource. This is done by elevating the traditional unique request ID (or some other data) to become the permanent resource ID. Consequently idempotency becomes implicit in the design rather than appearing as an afterthought.

In scenarios where requests come from a single source this is largely an academic exercise but when you have the potential for the same request to come via multiple sources (or be replayed at different times), then I think it’s useful to try and formalise the way we address resources to push the issue to the forefront.

 

[1] For partial updates you might choose to use the less well known PATCH verb.

[2] See “When Does a Transient Failure Stop Being Transient”. I’ve recently heard the term “effectively once” to describe this notion of compensating for the inability to actually guarantee “once only” delivery.

[3] My current system receives XML messages in both compact and pretty printed versions. Why, who knows? Perhaps one of the upstream system nodes has a debugging flag left switched on?

Tuesday, 25 October 2016

Every Software System is Bespoke

Contrary to what some business folk, project managers and programmers may believe, every software system we build is essentially bespoke. They may well think this LOB application, REST API or web site is very similar to many others that have already been built, but that is almost certainly because the feature set and user interface have been dreamt up by copying what has gone before rather than by thinking about what is actually needed this time around.

It’s an easy trap to fall into. After all, isn’t the point of hiring experienced people that you want them to leverage all the knowledge they already have to quickly build your system? But what knowledge do they actually bring with them to each new product or project? Unless each one is only a couple of weeks long, which is almost unheard of in my experience [1], there are way too many different variables to consider this new venture the same as any predecessor.

Okay, so they may be “similar”, but not the same. At least, not unless you and your organisation have absolutely zero desire to learn anything new about the process and tools used to create software. And that’s essentially my point – the industry moves so incredibly fast that something, probably many things, will have changed between projects. But even then that assumes no learning occurs during the project itself and, once again, unless it’s only a few weeks long, that is also highly unlikely.

In the last few years I’ve been mostly working in the enterprise arena on web services using the classic enterprise-grade offerings. The enterprise is generally renowned for its glacial pace and yet in that time the technology stack alone has moved along in leaps and bounds. For example, that first API was in C# 4 / .Net 3.5 and now we’re looking at C# 6 / .Net Core, with one eye on running on Linux machines too. Service hosting has changed from IIS / MVC to self-hosting / OWIN, with Nancy in there somewhere. The database too has switched from the relative safety of Oracle & SQL Server to MongoDB & Couchbase, and in one instance [2] has been a hybrid of the relational and document paradigms. Even the ancillary tooling like the VCS, CI product and testing & mocking frameworks has all changed as well, either to a different product or through a non-trivial upgrade.

At the same time as the technology stack is evolving, so too is the development process. The various organisations I’ve been working in recently have all undergone, or should I say are still undergoing, a major change to a more agile way of working. This in itself is not a one-time switch but a change in mind-set to one of continual learning, and therefore by definition is subject to relentless change. Admittedly those changes become more gradual as the bigger problems are addressed, but even so the way the teams around you change can still have an effect on your own ways of working – the DevOps message may have yet to reach the parts of the organisation you interact with.

Even if the toolchain and process largely stay the same, the way we apply the technology changes too as what was once “best practice” gets replaced by a new “best practice”, thereby making somewhat of a mockery of the whole notion. Maybe once we were happy to return null references, but now we wish to use an Optional type instead, or we realise how inappropriate the Singleton pattern is in a highly testable codebase.

With all the change going on around us you might rightly question what being “experienced” actually means in this industry, if we apparently can’t carry over much from one project to the next. Clearly that is a little extreme, as there is plenty we do carry over. In reality, although everything eventually changes, it does not all change at the same time. Hence at the beginning I said that no system is ever the same; it can be very similar, but it will be far from identical.

As experienced programmers what we carry over are the battle scars. These should lead us to ask the questions that the less experienced don’t yet know to ask, and often only discover the hard way. We should never assume that just because we did something one particular way before, it was the only way, or will even still be the best way in this new setting.

It might be a good way to start out, but we should always be looking for ways to improve it, or to detect when the problem has diverged such that our first-order approximation is no longer fit for purpose. It’s all too easy to set about solving the problems we had in the past, second time around, and completely fail to notice that we’re really solving a different problem, or at least one with enough differences that we should let go of the past. In the end you may be right and actually converge on a similar design to before; congratulations, you were right this time, as long as you didn’t sacrifice delivering more important stuff just to satisfy your hunch.

By all means bring and share your experiences on your next venture, but be careful you do not get blindsided by them. Only solve today’s problems with yesterday’s solutions if that really is the best thing to do. You might be surprised just how much has changed in the world of programming since then.

 

[1] I’ve spent longer than that just trying to fix a single bug before!

[2] See “Deferring the Database Choice” which also highlights the design process changes too.

Friday, 21 October 2016

When Mocks Became Production Services

We were a brand new team of 5 (PM + devs) tasked with building a calculation engine. The team was just one part of a larger programme that encompassed over a dozen projects in total. The intention was for those other teams to build some of the services that ours would depend on.

Our development process was somewhat DSDM-like in nature, i.e. iterative. We built a skeleton based around a command-line calculator and fleshed it out from there [1]. This skeleton naturally included vague interfaces for some of the services that we knew we’d need and that we believed would be fulfilled by some of the other teams.

Fleshing Out the Skeleton

Time marched on. Our calculator was now being parallelised and we were trying to build out the distributed nature of the system. Ideally we would have liked to be integrating with the other teams by now, but the programme RAG status wasn’t good. Every other team apart from us was at “red” and therefore well behind schedule.

To compensate for the lack of collaboration, and of integration with the other services we needed, we resorted to building our own naïve mocks. We found other sources of the same data and built some noddy services that used the file-system in a dumb way to store and serve it up. We also added some simple steps to the overnight batch process to create a snapshot of the day’s data using these sources.

Programme Cuts

In the meantime we discovered that one of the services we were to depend on had been cancelled, and some initial testing with another cast serious doubts on its ability to deliver what we needed. Of course time was marching on and our release date was approaching fast. It was quickly dawning on us that these simple test mocks we’d built might well have to become our production services.

One blessing that came out of building the simple mocks so early on was that we now had quite a bit of experience of how they would behave in production. Hence we managed to shore things up a bit by adding some simple caches and removing some unnecessary memory copying and serialization. The one remaining service we still needed to invoke had found a more performant way for us to at least bulk extract a copy of the day’s data, so we retrofitted that into our batch preparation phase. (Ideally they’d have served it on demand but that just wasn’t there for the queries we needed.)

Release Day

The delivery date arrived. We were originally due to go live a week earlier but got pushed back because an important data migration was bumped, and so we were bumped too. Hence we would have delivered on time and, somewhat unusually, our PM said we were well under budget [2].

So the mocks we had initially built just to keep the project moving along were now part of the production codebase. The naïve underlying persistence mechanism was now a production data store that needed high-availability and backing up.

The Price

Whilst the benefits of what we did were great, because we delivered a working system on time (not that there was any other real choice in the end), there were a few problems due to the simplicity of the design.

The first one was down to the fact that we stored each data object in its own file on the file-system, and each day added over a hundred thousand new files. Although we had partitioned the data to avoid the obvious 400K files-per-folder limit in NTFS, we didn’t anticipate running out of inodes on the volume when it quickly migrated from a simple Windows server file share to a Unix-style DFS. The calculation engine was also using the same share to persist checkpoint data, which added to the mess of small files. We limped along for some time through monitoring and zipping up old data [3].

The other problem we hit was that using the file-system directly meant that the implementation details became exposed. Naturally we had carefully set ACLs on the folders to ensure that only the environment had write access and our special support group had read access. However one day I noticed by accident that someone had granted read access to another group and it then transpired that they were building something on top of our naïve store.

Clearly we never intended this to happen and I’ve said more about this incident previously in “The File-System Is An Implementation Detail”. Suffice to say that an arms race then developed as we fought to remove access from everyone outside our team whilst others got wind of it [4]. I can’t remember whether it happened in the end or not, but I had put together a scheduled task that would use CACLS to list the permissions and fail if there were any we didn’t expect.

I guess we were a victim of our own success. If you were happy with data from the previous COB, which many of the batch systems were, you could easily get it from us because the layout was obvious.

Epilogue

I have no idea whether the original versions of these services are still running to this day but I wouldn’t be surprised if they are. There was a spike around looking into a NoSQL database to alleviate the inode problem, but I suspect the ease with which the data store could be directly queried and manipulated would have created too much inertia.

Am I glad we put what were essentially our mock services into production? Definitely. Given the choice between not delivering, delivering much later, and delivering on time with a less than perfect system that does what’s important – I’ll take the last one every time. In retrospect I wish we had delivered sooner and not waited for a load of other stuff we built as the MVP was probably far smaller.

The main thing I learned out of the experience was a reminder not to be afraid of doing the simplest thing that could work. If you get the architecture right each of the pieces can evolve to meet the ever changing requirements and data volumes [5].

What we did here fell under the traditional banner of Technical Debt – making a conscious decision to deliver a sub-optimal solution now so it can start delivering value sooner. It was the right call.

 

[1] Nowadays you’d probably look to include a slice through the build pipeline and deployment process up front too but we didn’t get any hardware until a couple of months in.

[2] We didn’t build half of what we set out to, e.g. the “dashboard” was a PowerShell generated HTML page and the work queue involved doing non-blocking polling on a database table.

[3] For regulatory reasons we needed to keep the exact inputs we had used and couldn’t guarantee on being able to retrieve them later from the various upstream sources.

[4] Why was permission granted without questioning anyone in the team that owned and supported it? I never did find out, but apparently it wasn’t the first time it had happened.

[5] Within reason of course. This system was unlikely to grow by more than an order of magnitude in the next few years.

Thursday, 20 October 2016

Confusion Over Waste

When looking at the performance of our software we often have to consider both first-order and second-order effects. For example, when profiling a native application where memory management is handled explicitly, we can directly see the cost of allocations and deallocations because it all happens at the moment we make them. In contrast, the world of garbage collected languages like C# exhibits different behaviour. The cost of a memory allocation there is minimal because the algorithm is simple, but the deallocation story is far more complex, and it happens at a non-deterministic time later.

A consequence of this difference is that it is much harder to see the effect that localised memory churn is having on your application. For example, I once worked on a C# data transformation tool where the performance was appalling. Profiling didn’t immediately reveal the problem, but closer inspection showed that the garbage collector was running full tilt. Looking much closer at the hottest part of the code I realised it was spending all its time splitting strings and throwing them away. The allocations were cheap, so there were no first-order effects, but the clean-up was really expensive and happened later, surfacing as a second-order effect that was much harder to trace back.
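
The hot loop was shaped something like this sketch (the file format and field index are illustrative, not the real code); the second method shows one way to dodge the churn by extracting only the field that’s needed:

using System;
using System.IO;

static class Churn
{
    // Splitting every line allocates an array plus one string per field,
    // all garbage a moment later: cheap to allocate, costly to collect.
    static void Wasteful(string inputPath)
    {
        foreach (var line in File.ReadLines(inputPath))
        {
            var fields = line.Split(',');   // fresh allocations every iteration
            Console.WriteLine(fields[3]);
        }
    }

    // A lower-allocation alternative: locate just the field we care about.
    // (Assumes every line has at least four fields.)
    static void Leaner(string inputPath)
    {
        foreach (var line in File.ReadLines(inputPath))
        {
            int start = 0;
            for (int i = 0; i != 3; ++i)            // skip the first three commas
                start = line.IndexOf(',', start) + 1;
            int end = line.IndexOf(',', start);
            if (end < 0) end = line.Length;
            Console.WriteLine(line.Substring(start, end - start));
        }
    }
}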

Short Term Gains

We see the same kind of effects occurring during the development process too. They are often masked, though, by the mistaken belief that time is being saved; it is, but only in the short term. The problem is that the second-order effect of such a saving is time lost later, when it’s more precious.

This occurs because the near-term activity is seen as wasteful of a certain person’s time, on the premise that the activity is of low value (to them). But what is being missed are the second-order effects of doing it, such as the learning about the context, people and product involved. When crunch time comes, that missed learning suddenly has to happen, potentially under time pressure or after money has already been spent; then you’re heading into sunk cost territory.

In essence what is being perceived as waste is the time spent in the short term, when the real waste is time lost in the future due to rework caused by the missed opportunity to learn sooner.

All Hail “Agile”

Putting this into more concrete terms, consider a software development team where the developers’ time is assumed to be best spent designing and writing code. The project manager assumes that having conversations, perhaps with ops or parts of the business, is of low value from the developer’s perspective, and therefore decides it’s better if someone “less expensive” has them instead.

Of course we’re all “agile” now and we don’t do that anymore. Or do we? I’ve worked in supposedly agile teams and this problem still manifests itself, maybe not quite to the same extent as before, but it still happens, and I believe that’s because we are confused about what the real waste is that we’re trying to avoid.

Even in teams where we’ve tried to ensure this kind of problem is addressed, it’s only addressed locally; it still happens further up the food chain. For example, a separate architecture team might be given the role of doing a spike around a piece of technology that a development team will be using. This work needs to happen inside the team so that those who will be developing and, more importantly, supporting the product get the most exposure to it. Yes, there needs to be some governance around it, but the best people to know whether it even solves their problem in the first place are the development team.

Another manifestation of this is when two programme managers are fed highlights about potential changes on their side of the fence. If there is any conflict there can be a temptation to resolve it without going any lower. What this does is cut out the people who not only know most about the conflict, but are also the best placed to negotiate a way out. For example, instead of trying to compensate for a potential breaking change with a temporary workaround, which pushes the product away from its eventual goal, see if the original change can be de-prioritised instead. If a system is built in very small increments it’s much easier to shuffle the high priority items around to accommodate what’s happening at the boundaries of the team.

Time for Reflection

How many times have you said, or heard someone else say, “if only you’d come to us earlier”? This happens because we try to cut people out of the loop in the hope that we’ll save time by resolving issues ourselves, but we rarely reflect on whether we really did save time in the long run once the thread eventually started to unravel and the second-order effects kicked in.

Hence, don’t just assume you can cut people out of the loop because you think you’re helping them out; you might not be. They might want to be included because they have something to learn or contribute over and above the task at hand. Autonomy is about choice: they might not always want it, but if you don’t provide it in the first place it can never be leveraged.