Thursday 27 October 2016

PUT vs POST and Idempotency

In RESTful APIs there is often a question mark around the use of PUT versus POST when creating and updating resources. The canonical example is the blogging engine where we wish to add new posts and comments. The default choice appears to be POST which I guess is because we tend to shy away from that contentious discussion about whether PUT or POST is more suitable.

As I’ve understood it, when the client can determine the address of the resource you can use PUT, whereas when only the server knows it (i.e. generates it) a POST is more appropriate. Hence you could say that the question boils down to whether the client can create a unique ID for the resource or has to leave that to the server.

POST

If we look at the blogging engine it’s probably easier on the client if the server just takes care of it:

POST /blog/create
{
  title: “My New Blog Post”
}

Here the server returns the URL for the freshly minted blog post. An obvious choice would probably be to generate a GUID for it and use that as the permanent ID:

GET /blog/1234-1234-1234-1234
{
  title: “My New Blog Post”
}

Idempotency

The trouble with this approach shows up when the request fails. We don’t know if the blog post was created because we don’t know its ID yet. If we retry the request then, unless the server has a cache of the request, we’ll end up with a duplicate if the server completed on its side. A common solution is to include a client-generated ID for the request that the server can use to detect when a request is being replayed:

POST /blog/create
{
  requestId: “9876-9876-9876-9876”,
  title: “My New Blog Post”
}
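
On the server side, that replay detection might look something like the following. It is only a minimal in-memory sketch: the BlogService class, its create method and the seen_requests map are names I have made up for illustration, and a real service would need to persist the request IDs alongside the posts.

```python
import uuid


class BlogService:
    """Hypothetical in-memory stand-in for the server behind POST /blog/create."""

    def __init__(self):
        self.posts = {}          # post ID -> post body
        self.seen_requests = {}  # request ID -> ID of the post created for it

    def create(self, request_id, title):
        # A replayed request gets the original result back rather than a duplicate post.
        if request_id in self.seen_requests:
            return self.seen_requests[request_id]
        post_id = str(uuid.uuid4())  # server-generated permanent ID
        self.posts[post_id] = {"title": title}
        self.seen_requests[request_id] = post_id
        return post_id


service = BlogService()
first = service.create("9876-9876-9876-9876", "My New Blog Post")
retry = service.create("9876-9876-9876-9876", "My New Blog Post")
```

Here the retry returns the same post ID as the first attempt and only one post exists.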

But wait, haven’t we just solved the client-generated ID problem? If we need to come up with a unique ID for the request for the purposes of idempotency, why don’t we just PUT the resource with that ID in the first place?

PUT /blog/9876-9876-9876-9876
{
  title: “My New Blog Post”
}
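
To see why the PUT is naturally safe to retry, here is a small sketch of a store keyed by the client-supplied ID. BlogStore and its put method are hypothetical names, and the 201/200 return values simply mirror the usual HTTP convention for "created" versus "replaced".

```python
class BlogStore:
    """Hypothetical in-memory store addressed by client-chosen resource IDs."""

    def __init__(self):
        self.posts = {}  # resource ID -> post body

    def put(self, post_id, body):
        # Writing the same body to the same ID twice leaves the same end state.
        created = post_id not in self.posts
        self.posts[post_id] = dict(body)
        return 201 if created else 200


store = BlogStore()
first_status = store.put("9876-9876-9876-9876", {"title": "My New Blog Post"})
retry_status = store.put("9876-9876-9876-9876", {"title": "My New Blog Post"})
```

No separate table of request IDs is needed because the resource ID does the same job.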

Natural Keys

When the client is just a single machine and it’s creating the resource itself from scratch, it has far more latitude in choosing the resource’s ID, but if you’re transforming data in a distributed system it can get a little trickier.

If your event comes in the form of an upstream message you cannot just use a GUID because when a message gets replayed (which it will, eventually) you’ll end up generating a duplicate as the request IDs will be different. Hence you need to look for something in the upstream message that can be used as a more natural key for the resource.

Going back to our blogging engine example we already had one in the guise of the blog post’s title:

PUT /blog/my-new-blog-post
{
  title: “My New Blog Post”
}

Yes, if the title changes then the ID will no longer match, but that was only a convenience anyway. Hopefully there is enough data in the request itself to make an ID clash extremely unlikely, e.g.

PUT /blog/2016-10-01-chris-oldwood-my-new-blog-post
{
  author: “Chris Oldwood”,
  title: “My New Blog Post”,
  created: “2016-10-01T08:34:00”
}
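
Deriving such a key from the fields of the message could be sketched as below. The natural_key function and its slug rules (lower-case, runs of non-alphanumeric characters collapsed to a hyphen) are my own assumptions, not a standard. The important property is that replaying the same upstream message always produces the same ID.

```python
import re
from datetime import datetime


def natural_key(author, title, created):
    """Build a deterministic resource ID from fields already in the message."""
    date = datetime.fromisoformat(created).date().isoformat()
    text = f"{author} {title}".lower()
    # Collapse anything that is not a letter or digit into a single hyphen.
    slug = re.sub(r"[^a-z0-9]+", "-", text).strip("-")
    return f"{date}-{slug}"


key = natural_key("Chris Oldwood", "My New Blog Post", "2016-10-01T08:34:00")
replayed = natural_key("Chris Oldwood", "My New Blog Post", "2016-10-01T08:34:00")
```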

Mutable Resources

If the resource is immutable then you only need to guard against duplicate attempts to create it, but if it’s mutable then you may already be considering using PUT to mutate it [1]. In this case you’ll probably already have come up with a versioning scheme that will allow you to detect concurrency conflicts from multiple sources, e.g. ETag. A request to create the resource will not contain an “original version” tag to compare against, so if the resource already exists it’s a conflict, just the same as one that occurs when updating later.
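
A rough sketch of that version check follows. The integer versions are a stand-in for ETag values, and VersionedStore, Conflict and expected_version are all hypothetical names. A create is simply a PUT with no expected version, so a second create against an existing resource fails the same comparison as a stale update.

```python
class Conflict(Exception):
    """Raised when the caller's expected version does not match the stored one."""


class VersionedStore:
    def __init__(self):
        self.items = {}  # resource ID -> (version, body)

    def put(self, resource_id, body, expected_version=None):
        current = self.items.get(resource_id)
        current_version = current[0] if current else None
        # A create passes no version, so it only succeeds if nothing exists yet.
        if expected_version != current_version:
            raise Conflict(f"expected {expected_version}, found {current_version}")
        new_version = (current_version or 0) + 1
        self.items[resource_id] = (new_version, dict(body))
        return new_version


store = VersionedStore()
v1 = store.put("my-new-blog-post", {"title": "My New Blog Post"})
v2 = store.put("my-new-blog-post", {"title": "My Newer Blog Post"}, expected_version=v1)
```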

Degrees of Idempotency

It might sound like an oxymoron to say that there are varying levels of idempotency, but there can be. Essentially we’re making the usual time/space trade-off as we balance the need to verify the details of a potential duplicate or erroneous request against the need to persist more state to allow the comparison to be made at an indeterminate point in the future. Also, whether you are doing a POST or a PUT affects the assumptions you can make about what a duplicate can be.

For example, at the weakest end of the spectrum, if you’re doing a POST with a simple unique request ID you might just say that if you’ve seen it before, irrespective of the content, then it’s a duplicate request. Also, if you are happy to miss a replay, say, one month later, then you can even put a time-to-live (TTL) on the unique request ID to save space. Either way you’re treating a duplicate very much as a temporal anomaly, which is essentially what it usually is: a fast or slow retry [2].
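
This weakest form, with a TTL, might be sketched as follows. RequestIdCache is a made-up name and the clock is passed in explicitly (as plain seconds) purely to keep the example deterministic. In practice you would more likely lean on the expiry feature of whatever cache or database you already use.

```python
class RequestIdCache:
    """Remembers request IDs for a limited time; content is never inspected."""

    def __init__(self, ttl_seconds):
        self.ttl_seconds = ttl_seconds
        self.expiry = {}  # request ID -> time at which it is forgotten

    def is_duplicate(self, request_id, now):
        # Forget anything whose time-to-live has elapsed.
        self.expiry = {rid: t for rid, t in self.expiry.items() if t > now}
        if request_id in self.expiry:
            return True
        self.expiry[request_id] = now + self.ttl_seconds
        return False


cache = RequestIdCache(ttl_seconds=60)
results = [
    cache.is_duplicate("9876-9876-9876-9876", now=0),   # first sighting
    cache.is_duplicate("9876-9876-9876-9876", now=30),  # fast retry inside the TTL
    cache.is_duplicate("9876-9876-9876-9876", now=61),  # slow replay after expiry is missed
]
```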

If you’re looking for a little extra peace of mind you might decide to generate a hash of the salient headers and content too, which would allow you to detect if your request ID has been reused with different content, either through programmer naivety or an unexpected change in the system. This is not a duplicate but a bad request. Sadly textual formats like XML and JSON have many equally valid representations [3] so a straight text hash is imprecise, but that may well be the pragmatic choice.
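
As an illustration, the sketch below hashes a canonical re-serialisation of a JSON body rather than the raw text, which sidesteps the compact versus pretty-printed problem at the cost of parsing every request. That is a swap-in for the straight text hash mentioned above. The content_fingerprint and classify functions are hypothetical, and headers are ignored for brevity.

```python
import hashlib
import json


def content_fingerprint(body_text):
    """Hash the parsed JSON so that whitespace and key order do not matter."""
    canonical = json.dumps(json.loads(body_text), sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def classify(seen, request_id, body_text):
    """Return 'new', 'duplicate' or 'bad request' for an incoming request."""
    fingerprint = content_fingerprint(body_text)
    if request_id not in seen:
        seen[request_id] = fingerprint
        return "new"
    return "duplicate" if seen[request_id] == fingerprint else "bad request"


seen = {}
compact = '{"title":"My New Blog Post"}'
pretty = '{\n  "title": "My New Blog Post"\n}'
outcomes = [
    classify(seen, "9876", compact),
    classify(seen, "9876", pretty),
    classify(seen, "9876", '{"title":"Something Else"}'),
]
```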

At the other end of the spectrum you might not be happy to silently discard false positives and so you need to persist enough about the request to be sure it really is a duplicate. This likely involves doing a proper comparison of the attributes in the request, which implies that you must still have them. If you save each change as an event rather than just mutating the state to provide an updated snapshot, then you’ll have what you need, but the extra storage is going to cost you. Of course there are other benefits to storing event streams but that’s a different story.

Just Pedantry?

Hence one advantage with using a PUT over a POST is that you push the responsibility onto the client to define what it means to be talking about the same resource. This is done by elevating the traditional unique request ID (or some other data) to become the permanent resource ID. Consequently idempotency starts to become implicit in the design rather than appearing more like an afterthought.

In scenarios where requests come from a single source this is largely an academic exercise but when you have the potential for the same request to come via multiple sources (or be replayed at different times), then I think it’s useful to try and formalise the way we address resources to push the issue to the forefront.

 

[1] For partial updates you might choose to use the less well known PATCH verb.

[2] See “When Does a Transient Failure Stop Being Transient”. I’ve recently heard the term “effectively once” to describe this notion of compensating for the inability to actually guarantee “once only” delivery.

[3] My current system receives XML messages in both compact and pretty printed versions. Why, who knows? Perhaps one of the upstream system nodes has a debugging flag left switched on?
