Thursday, 26 November 2015

The Cost of Not Starting

The idea of emergent design is uncomfortable to those at the top and it’s pretty easy to see why. Whilst there are no real physical barriers to overcome if the software architecture goes astray, there is the potential for some significant costs if the rework is extensive (think change of platform / paradigm / language). In times gone by there was a desire to analyse the problem to death in an attempt to try and ensure the “correct” design choices were made early and would therefore (theoretically) minimise rework.

In a modern agile world however we see the fallacy in that thinking and are beginning to rely more on emergent design as we make better provision for adapting to change. It’s relatively easy to see how this works for the small-scale stuff, but ultimately there has to be some up-front architectural choices that will shape the future of the system. Trying to minimise the number of up-front choices to remain lean, whilst also deciding enough to make progress and learn more, is a balancing act. But the cost of not actually starting the work and even beginning to learn can definitely be dear if the desire is to move to a newer platform.

A Chance to Learn

I recently had some minor involvement in a project that was to build a new, simple, lookup-style data service. Whilst the organisation had built some of these in the past they have been on a much older platform, and given the loose timescale at the project’s inception it was felt to be a great opportunity to try and build this one on a newer, more sustainable platform.

Essentially the only major decision to make up-front was about the platform itself. There had already been some inroads into both Java and .Net, with the former already being used to provide more modern service endpoints. So it seemed eminently sensible to go ahead and use it again to build a more SOA style service where it owns the data too. (Up to that point the system was a monolith where data was shared through the database.)

Due to there being an existing team familiar with the platform they already knew plenty about how to build Java-based services, so there was little risk there, aside from perhaps choosing a RESTful approach over SOAP. Where there would be an opportunity to learn was in the data storage area as a document-oriented database seemed like a good fit and it was something the department hadn’t used before.

Also as a result of the adapter-style nature of the work the team had done before they had never developed a truly “independent” service, so they had a great opportunity to try building something more original in an ATDD/BDD manner. And then there was the chance to make it independently serviceable too which would give them an initial data point on moving away from a tightly-coupled monolithic architecture to something looser [1].

Just Enough Design

In my mind there was absolutely no reason why the project could not be started based on the knowledge and decisions already made up to that point. The basic platform had been chosen and therefore the delivery team was known and so it would be possible to begin scheduling the work.

The choice of protocol and database were yet to be finalised, but in both cases they would be relying heavily on integrating a 3rd party product or library – there was little they had to write themselves. As such the risk was just in evaluating and choosing an approach, and they already had experience with SOAP and their existing database to fall back on if things didn’t pan out.

Admittedly the protocol was a choice that would affect the consumer, but the service was a simple data access affair and therefore there was very little complexity at this stage. The database was going to be purely an implementation detail and therefore any change in direction here would be of no interest to the consumers.

The only other design work might be around what is needed to support the various types of automated tests, such as test APIs. This would all just come out “in the wash”.

Deferring the Decision to Start

The main reason for choosing the project as a point of learning was its simplicity. Pretty much everything about it allowed for work in isolation (i.e. minimal integration) so that directions could be explored without fear of breaking the existing system, or development process.

What happened was that some details surrounding the data format of the 3rd party service were still up in the air. In a tightly-coupled system where the data is assumed to be handled almost verbatim, not knowing this kind of detail has the potential to cause rework and so it is seen as preferable to defer any decision it affects. But in a loosely-coupled system where we decide on a formal service contract between the consumer and producer that is independent of the underlying implementation [2], we have less reason to defer any decisions as the impact will be minimal.

As a consequence of delaying doing any actual development on the service the project reached a point well passed the Last Responsible Moment and as such a decision was implicitly made for it. The looming deadline meant that there was no time or resources to confidently deliver the project on time and so it was decided that it would be done the old way instead.

Cost versus Value

One of the reasons I feel that the decision to do it the old way was so easy to make was down to the cost based view of the project. Based solely on the amount of manpower required, it likely appears to be much cheaper to deliver when you’ve done similar work before and have a supply of people readily available. But that only takes the short-term cost into account – the longer term picture is different.

For a start it’s highly likely that the service will have to be rewritten on a newer platform at some point in the future. That means some of the cost to build it will be duplicated. It’s possible many of the same learning's could be done on another project and then leveraged in the rebuild, but what are the chances they’ll have the same deadline luxuries next time?

In the meantime it will be running on a platform that is more costly to run. It may only add a small overhead, but when you’re already getting close to the ceiling it has the potential to affect the reliably of the entire monolithic system. Being done on the old platform also opens the door to any maintenance being done using the “culture” of that platform, which is to tightly-couple things. This means that when the time finally comes to apply The Strangler Pattern it won’t just be a simple lift-and-shift.

Whilst it might be easy to gauge and compare the short-term costs of the two approaches it’s pretty hard to put a tangible value on them. Even so it feels as though you could make a judgment call as to whether doing it on a newer platform was “worth” twice or three times the cost if you knew you were going to be gaining a significant amount of knowledge about how to build a more sustainable system that can also be continuously delivered.

Using Uncertainty as a Driver

One of Kevlin Henney’s contributions to the book “97 Things Every Software Architect Should Know” discusses how we can factor uncertainty into our architecture and design so that we can minimise the disruption caused when the facts finally come to light.

In this particular case I see the uncertainty around the external data format as being a driver for ensuring we encapsulate the behaviour behind a service and instead formalise a contract with the consumer to shield them from the indecision. Whilst Kevlin might have largely been alluding to design decisions the notion “use uncertainty as a driver” is also an allegory for “agile” itself.

Eliminating Waste

There is undoubtedly an element of poetic justice in this tale. The reason we have historically put more effort into our analysis is to try and avoid wasting time and money on building the wrong thing. In this instance all the delays waiting for the analysis and design phases to finish meant that there was no time left to do it “right” and so we will in all likelihood end up generating more waste by doing it twice instead.

Also instead of moving forward the knowledge around building a more sustainable platform we now know no more than we do today, which means maintenance will continue to be more costly too, both in terms of time & money and, potentially more importantly, morale.

[1] Whilst a monolithic architecture is very likely to be tightly-coupled, it doesn’t have to be. The problem was not being monolithic per-se, but being tightly-coupled.

[2] Yes, it’s possible that such as change could cause a major re-evaluation of the tech stack, but if that happens and we had no way of foreseeing it I’m not sure what else we could have done.

No comments:

Post a Comment