Tuesday 24 January 2017

Weak and Strong Idempotency

In my recent post “PUT vs POST and Idempotency” I touched on the different degrees of idempotency. In retrospect I think it’s better to describe our idempotency guarantee as weak or strong. Either way this post looks at why you might choose one over the other.

Weak Idempotency

At one end of the spectrum we can choose to classify our unique operations with a single scalar value such as a large number, UUID or complex string value (with other embedded values like the date & time). In this scenario we assume that a simple value represents the identity of the operation in much the same way that a reference (or pointer) is the unique address of a value in languages like C# and Java (or C++).

In SQL terms your write with an idempotency check might look something like this:

insert into Operations values(...)
where not exists
(
  select 1
  from Operations
  where OperationId <> @operationId
)

For a document oriented database like MongoDB where you might store everything in a single object to achieve atomic updates you could do an equivalent find-and-modify-where style query. If the data model requires multiple writes you might still be able to order them carefully to achieve the desired effect (See “Observable State versus Persisted State”).

Strong Idempotency

In contrast, at the other end, we define an idempotent operation based not only on some unique operation identifier but also the arguments to the operation itself, i.e. the request body in HTTP. In this instance we are not willing to accept that a replay can be distinguished solely by it’s handle but also wish to ensure the actual details match too, just like a “deep” value equality comparison.

In SQL terms we potentially now have two steps, one to attempt the write and, if a conflict occurs, a subsequent read to validate it:

insert into Operations values(...)
where not exists
(
  select 1
  from Operations
  where OperationId <> @operationId
)

select *
from Operations
where OperationId = @operationId

If replays are rare (and I’d expect them to be) you should find this adds only a little extra overhead but also allows you to report broken requests properly.

Identity and Equality

Although it might seem like these weak and strong guarantees are directly analogous to the concept of identity and equality in normal programming there is a subtle difference.

Whilst in the weak case we are happy for the unique operation identifier to act as our sole definition of equality, in the strong guarantee both the identifier and the input values [1] take part in the comparison. In a programming language we tend to choose either the reference based comparison or the value based approach but not both [2].

Failure Modes

In theory it should be adequate for us to take the simpler approach, and as the provider of a service it’s almost certainly slightly cheaper for us to do that. However the downside is that it makes it harder to diagnose missed updates caused by the service erroneously treating them as replays. But that’s just an edge case right?

During development when your clients are coding against your API for the first time they may make a simple mistake and not correctly generate unique requests. This is not quite as uncommon as you may think.

Another potential source of mistaken uniqueness can come from transforming requests, e.g. from a message queue. If you’re integrating with a legacy system what you may think is unique might in fact turn out to be an artefact of the small data set you sampled and it could turn out only to be mostly unique [3].

Answering these kinds of support queries (“what happened to my request?“) can easily chip away at a team’s time. The more effort you put into making it harder for consumers to incorrectly use your API the less time you’ll need to support them.

Debug Switch

Although the need to verify a replay should be pretty infrequent you could easily wrap the validation logic in a feature switch so that development and test environments provide better diagnostics whilst production has less work to do. But you should measure before considering dropping such a low-cost feature.

Monitoring

What will help both the service and it’s consumers is if you can afford to capture when replays do occur so that the monitoring can alert if there appears to be a sudden rise in them. For example a client may go into an failure state that keeps replaying the same message over and over again. Alternatively the rise might suggest some configuration problem where upstream traffic is being duplicated. From a functional point of view the entire system could be behaving correctly but wasting resources in the process that could lead to a more catastrophic failure state.

Although a client should never need to behave differently from a functional perspective when a replay occurs, the flip side to the service tracking the event is for the client to report it as well. They could do this if, say, the return code changes in the replay case. In HTTP for example there is 200 OK and 201 Created which could be used to tell the client what happened.

Middle Ground

As I described in my previous post there can also be points between the two extremes. One of the problems with modern document oriented databases is that they have limits on the size of a single document. Depending on the anticipated depth of history this may exceed the practical document limit and therefore require some compromises to be made.

If no false replays can be tolerated than perhaps the model needs splitting across documents which removes the ability to do simple atomic updates. If the history grows slowly and data retention is negotiable then maybe the fine details of much older operations can be discarded to make room for the new.

Brave New World

The new era of databases continues to bring interesting challenges and trade offs due to their differing goals from a traditional big iron RDBMS. Pushing back on some requirements (e.g. “Don’t Be Afraid to Throw Away Data”) is essential if we are to balance their constraints with correctness.

 

[1] You may store server derived properties as well, like a “created at” timestamp. These are of course ignored in any comparison.

[2] Of course breaking an invariant like the relationship between Equals() and GetHashCode() is a popular mistake which can make objects appear to go missing in maps & sets.

[3] I know of one organisation where many people mistakenly believed their customer identifiers were unique, but we discovered they were only truly unique if you took their legacy back-office sources into account too.

No comments:

Post a Comment