Monday, 12 November 2018

Feeling Isolated

By and large I think I’ve been fairly lucky with my time as a contract programmer. Virtually all the teams I’ve worked in and systems I’ve worked on have been pretty decent. None of them are going to change the world but they’ve been enjoyable, which is probably why I’ve ended up working on them for a decent length of time [1].

I can only say “virtually all” because one contract sadly fell way short of the mark. Although I was technically part of a team it only really felt that way from a managerial perspective, even though we shared a codebase. I felt somewhat isolated both physically and mentally. Aside from the morning stand-up I could easily have gone the rest of the day without speaking to my teammates if I had chosen to do so.

Physical Isolation

I started the contract on a separate floor from the rest of my team with a couple of other recent joiners [2]. We were the only people on that floor with the air conditioning on full blast so we had to wear our coats in the afternoon to stay warm. None of the rest of my team had an office pass that could access the floor either, should they want to talk face-to-face while getting us up to speed.

Even when they moved us onto the same floor a month later we were still on the opposite side of the room. In the next desk shuffle I got to swap colleagues, although they were working on an entirely separate area of the system with a totally different bunch of people, so we had little need to collaborate per se, only to make small talk. Also, the two desks next to me only ever seemed to be used for a game of Tower of Hanoi by the office movers, given how quickly the occupants came and went.

Even my “customer”, or at least the one I knew about because they were paying for the project, was situated in a different country and spoke a different language. Although their English was way better than my knowledge of any second language, I quickly discovered why most communication happened via email or IM rather than out loud.

Project Isolation

In an enterprise-scale organisation like this the work was all about projects, and who was sponsoring how many “resources”. Nowhere was this more apparent than on the Scrum Board with its project-oriented swim-lanes. Each swim-lane had the names of the team members assigned to that project, and the stand-up walked down the board a project at a time with each member of the sub-team providing an update.

It was fairly apparent right from the moment I started, just by reading the body language of the team members, that there was often little real interest in what the rest of the team was doing. Those who did show an interest tended to be the ones whose work cut across projects to some degree because they nursed the build system, deployments and monitoring. A couple of team members never attended our stand-up at all because they already attended a different one that encompassed their project.

To be fair, some of the apathy at the stand-up was almost certainly down to its excessive length. And with little reason to attend except to provide a status update for the managers, it’s no surprise that those mostly on the periphery zoned out. Sometimes the only common goal of the team seemed to be not to break the system.

Code Isolation

During my short stint I effectively had one feature to work on. There were a couple of other minor tweaks to begin with but ultimately my project was one feature (nay, user story) and it took 5 months to deliver. That one feature involved making a change in an area of the codebase that nobody else knew except one of the tech leads who I soon discovered was leaving. In fact, taking away his days off after the announcement of his departure, I effectively had 3 days for any handover.

Not only were there no docs to work from, there were no tests either. The only real knowledge about how the service was expected to behave had left firmly inside the head of its author. That pretty much just left doing a spot of software archaeology with the VCS in the hope that the commit messages might contain some extra clues. Many features had been tracked in a feature tracking tool, but there were not enough licenses to go round so I had to hassle a teammate to look things up. Even then it often wasn’t worth it as there were no useful details; it felt like the ticket was just there to “tick a box”.

The code relied heavily on the caller “doing the right thing” so any understanding only made sense if you already knew what the caller was supposed to do, and that relied heavily on knowledge of the problem domain and the organisation’s other systems. (At the interview I made it perfectly clear that I still knew little about the problem domain, despite the many years I have worked in it [3].)

Methodology Isolation

Ever since I had my epiphany [4] around testing all those years ago I have become a firm believer in TDD and automated testing as the preferred approach to the sustainable delivery of quality software. Being told early in the project that “you won’t have time to write tests”, despite being asked at the interview what my approach was, did not bode well.

It soon became apparent that the previous approach had been to rush something out and rely on manual, end-to-end testing and the customer doing things “right”. Validation was almost entirely left to the underlying maths library and so bizarre errors manifested and needed investigating by the developers due to a lack of basic error handling and reporting [5].

With no way of knowing if I had broken anything, because I didn’t know for sure what anything was supposed to do, my only recourse was to write new code with tests and then refactor later when someone (potentially me) could be sure that it was safe to do so. For existing code that I had to change or understand I would write a barrage of tests first to try and ensure I didn’t accidentally break anything. In some cases it was hard to know what was “by design” and what was “by accident”.
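
To make that concrete, below is the shape of the “pin it down first” test I have in mind. It’s only a minimal sketch assuming xUnit; the PriceCalculator class and its behaviour are hypothetical stand-ins rather than anything from the actual codebase.

using Xunit;

// Hypothetical stand-in for an undocumented legacy class whose rules
// are only knowable by observing what it currently does.
public class PriceCalculator
{
    public decimal Calculate(int quantity) => quantity * 1.2m;
}

public class PriceCalculatorTests
{
    [Fact]
    public void Calculate_ZeroQuantity_ReturnsZero()
    {
        // Pin down the current behaviour, whether “by design” or
        // “by accident”, before daring to change the implementation.
        var calculator = new PriceCalculator();

        Assert.Equal(0m, calculator.Calculate(quantity: 0));
    }
}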

Clearly not everyone took this approach, as you can see in “It Compiles, Ship It!”. My pessimism paid off though once the edge cases and little extras started appearing as I could turn around a fix or improvement (safely) in minutes due to my suite of automated unit and regression tests.

Environment Isolation

Sadly, despite my ability to push through changes quickly into the integration test environment, it still took weeks for them to actually appear in the production environment. When my first task, a handful of lines of boilerplate code, took 6 weeks to make it into production I assumed continuous delivery was not something they cared about.

On the contrary, for one aspect of the business releases were very frequent. It was just that I was on the other side, and due to some (IMHO) poor architecture and deployment decisions my part of the distributed system was tightly coupled to another (major) system’s release cycle.

While it might seem great having my own integration test environment to play with, I ran into issues no one else knew about and I had no idea who was really using it and for what. Once again that information pretty much departed with the author.

Parting Thoughts

On reflection I have to look at my own behaviour first and ask myself whether I was at least partly responsible for feeling left out. Once we moved onto the same floor it was definitely easier to wander over and ask people questions, which I did. However when the response is “well I worked all this out by myself originally” and “that’s more than anyone ever gave me” I think it’s not entirely unfair to assume that knowledge sharing isn’t high on some people’s agenda.

I believe I was as welcoming as I normally am and was happy to help out where possible, given the limited knowledge I had acquired. I guess that culturally there was such a strong drive for autonomy that the idea of simply chatting about stuff to see what improvements to the system or process might be beneficial just wasn’t on the cards. A couple of times what should have been a constructive comment or question definitely came out of me more as a snide remark, which is never a good sign. I’ve been trying hard to be more aware of my sarcasm, which unfortunately comes all too easily to me, so as not to add to any unnecessary negativity, but I know I failed a few times.

Ultimately I think it says a lot about an organisation that rejects your approach because “they are not a start-up” when your application of that approach has only ever been in large enterprises and none of them has ever had an issue with it before. On the contrary they have often been grateful for the insights and improvements that I’ve brought.

Maybe if I were a lot younger I’d not have known any better and would have stuck it out a bit longer, but these days I know it’s just not worth the effort. I feel comfortable that I left the place in a better state than I found it by documenting various things and writing tests for the code I wrote. After a slightly rocky start my customer seemed pretty pleased with everything I delivered, which I guess is largely what matters most.

As ever, my main regret is leaving behind some people that I wish I could have gotten to know better. Maybe I will, in another life, one where the benefits of collaboration are more positively encouraged.

 

[1] Mostly my tenure has been measured in years, not months.

[2] Only one of which was left when I called it a day – the other two barely lasted a month or so.

[3] See “Problem Domain Expert or Technical Expert or Even Both” for more on this recurring theme.

[4] See “My [Unit] Testing Epiphany” and my more recent ACCU / Agile on the Beach talk “A Test of Strength” for what led to my enlightenment.

[5] Poor error messages are a popular topic of mine; see “Terse Exception Messages”. Also “The Perils of DateTime.Parse()” covers one specific example.

Thursday, 8 November 2018

Proxy Weirdness – Socket Closed on 404

While investigating the issue that led to the discovery of the strange default behaviour of the .Net HttpClient class which I wrote up in “Surprising Defaults – HttpClient ExpectContinue” we also unearthed some other weirdness in a web proxy that sat between our on-premise adapter and our cloud hosted service.

Web proxies are something I’ve had cause to complain about before (see “The Curse of NTLM Based HTTP Proxies”) as they seem to interfere in unobvious ways and the people you need to consult to resolve them are almost always out of reach [1]. In this particular instance nobody we spoke to in the company’s networks team knew anything about it, and establishing whether the problem lies with your own on-premise proxy rather than with one of the other intermediaries that sit between you and the endpoint is often hard.

The Symptoms

Whilst trying to track down where the “Expect: 100-Continue” header was coming from, as we didn’t initially believe it was from our code, we ran a Wireshark trace to see if we could capture the traffic from, and to, our box. What was weird in the short trace we captured was that the socket kept closing after every request. Effectively we would send a PUT request, the response would come back, and immediately afterwards the socket would be closed (RST).

Naturally we put this on the yak stack. Some time later, when checking the number of connections to the TIBCO server, I used Sysinternals’ TCPView tool to see what the service was doing and again I noticed that sockets were being opened and closed repeatedly. As we had 8 threads concurrently processing the message queue it was easy to see 8 sockets open and close again, as TCPView shows them in green on creation and briefly in red on termination.

At least, that appeared to be true for the HTTP requests which went out to the cloud, but not for the HTTP requests that went sideways to the internal authentication service. However they also had an endpoint hosted in the cloud which our cloud service used, and we didn’t see that behaviour there (i.e. cloud-to-cloud), or when we re-configured our on-premise service to use it either (i.e. on-premise-to-cloud). This suggested it was somehow related to our service, but how?

The HttpClient we were using for both sets of requests was the same [2] and so we were pretty sure that it wasn’t our fault, this time, although as the old saying goes, “once bitten, twice shy”.

Naturally, when it comes to working with HTTP one of the main diagnostic tools you reach for is curl, and so we replayed our requests through that to see if we could reproduce the problem with a different (i.e. non-.Net based) technology.

Phased Switchover

While the service we were writing was new, it was intended to replace an existing one and so part of the rollout plan was to phase it in slowly. This meant that all reads and writes would go to both versions of the service but only the one where any particular customer’s data resided would succeed. The consumers of the service would therefore get a 404 from us if the data hadn’t been migrated, which in the early stages of development applied to virtually every request.
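
To illustrate the mechanics, here’s a rough sketch of how a consumer read might fan out to both versions during the switchover, assuming the migration state is signalled purely by which service returns the 404. The service URLs and types are hypothetical, not the real ones.

using System.Linq;
using System.Net.Http;
using System.Threading.Tasks;

static class CustomerReader
{
    private static readonly HttpClient Client = new HttpClient();

    // Hypothetical endpoints; the real service names were different.
    private const string OldServiceUrl = "https://old-service.example.com/customers/";
    private const string NewServiceUrl = "https://new-service.example.com/customers/";

    public static async Task<string> FetchAsync(string customerId)
    {
        // Send the read to both versions; only the one where the
        // customer's data resides succeeds, the other returns a 404.
        var responses = await Task.WhenAll(
            Client.GetAsync(OldServiceUrl + customerId),
            Client.GetAsync(NewServiceUrl + customerId));

        var winner = responses.Single(r => r.IsSuccessStatusCode);
        return await winner.Content.ReadAsStringAsync();
    }
}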

A few experiments later, comparing the behaviour of requests for migrated versus unmigrated data, and we had our answer. For some reason a proxy between our on-premise adapter and our web hosted service endpoint was injecting a “Connection: Close” header when a PUT or DELETE [3] request returned a 404. The HttpClient naturally honoured the response and duly closed the underlying socket.

However it did not exhibit this behaviour for a GET or HEAD request that returned a 404 (I can’t remember about POST). Hence we didn’t see it with the authentication service: we only sent GETs and, anyway, it returned a 200 with a JSON error body instead of a 404 for invalid tokens [4].
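
The following is a sketch of the kind of probe that demonstrates the verb-specific behaviour, not the code we actually ran; the URL is a hypothetical placeholder for an unmigrated resource. The give-away is HttpResponseMessage.Headers.ConnectionClose, which is true when the response carries a “Connection: close” header.

using System;
using System.Net.Http;
using System.Threading.Tasks;

static class ProxyProbe
{
    private static readonly HttpClient Client = new HttpClient();

    public static async Task RunAsync()
    {
        // Hypothetical URL for a resource known to return a 404
        // (i.e. a customer whose data has not been migrated yet).
        var url = "https://service.example.com/customers/unmigrated-id";

        foreach (var method in new[] { HttpMethod.Get, HttpMethod.Head,
                                       HttpMethod.Put, HttpMethod.Delete })
        {
            using (var request = new HttpRequestMessage(method, url))
            {
                if (method == HttpMethod.Put)
                    request.Content = new StringContent("{}");

                var response = await Client.SendAsync(request);

                // Did the proxy ask for the socket to be torn down?
                Console.WriteLine("{0} -> {1} (Connection: close = {2})",
                    method, (int)response.StatusCode,
                    response.Headers.ConnectionClose == true);
            }
        }
    }
}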

Epilogue

I wish I could say that we tracked down the source of the behaviour and provide some closure but I can’t. The need for the on-premise adapter flip-flopped between being essential and merely a performance test aid, and then back again. The issue remained as a product backlog item so we wouldn’t forget it, but nothing more happened while I was there.

We informed the network team that we were opening and closing sockets like crazy, which these days with TLS is somewhat more expensive and therefore generates extra load, but we had to leave it with them, along with an offer of help if they wanted to investigate further, as much for our own sanity as anything.

It’s problems like these which cause teams to deviate from established conventions: ultimately one side is within their control while the other is outside it, and the path of least resistance is nearly always seen as the winner from the business perspective.

 

[1] I’m sure they’re not hidden on purpose but unless you have a P1 incident it’s hard to get their attention as they’re too busy dealing with actual fires to worry about a bit of smoke elsewhere.

[2] The HttpClient should be treated as a Singleton and not disposed per request, which is a common mistake.
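
In other words something along these lines, a minimal sketch of sharing one instance for the lifetime of the process:

using System.Net.Http;

// One shared HttpClient for the process; creating and disposing one
// per request leaves sockets lingering in TIME_WAIT under load.
static class Http
{
    public static readonly HttpClient Client = new HttpClient();
}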

[3] See “PUT vs POST and Idempotency” for more about that particular choice.

[4] The effects of this style of API response on monitoring and how you need to refactor to make the true outcome visible are covered in my recent Overload article “Monitoring: Turning Noise into Signal”.