We all know that some prequels are a bit of a disappointment, but this should be a fairly interesting tale for you, full of woe and tech debt. Names have been changed to protect the guilty and innocent alike. If any.
OK, so I would say that I left the SQL Server community and moved on to other pastures around 5 years ago. What happened to provoke such a move, you may (hopefully!) ask?
Well, picture the scene: there I was one day, minding my own business, working with a data team, joyfully moving data in and out of a data warehouse on SQL Server and tackling fun challenges, when there was a tap on my shoulder from the CIO. “Apparently you know something about SQL Server, can I borrow you for 10 minutes?” And with that, I didn’t see the data team for three months.
What I found when I got there, was the most catastrophic of failures that one could ever imagine. This was real, “when computers go bad” territory, the stuff of real nightmares that no one wants to be associated with.
It’s hard enough to understand a system that is working, let alone one with this much fire going on. To summarise the architecture, SQL Server was used as the backing for what I can only describe as a microservices-based architecture. It was either genius or insanity.
Spoiler alert: It doesn’t turn out to be the former.
You will never have seen so many extended stored procedures in your life. If that wasn’t enough, the large stored procedures, which were mainly string handling, concatenation and parsing (you know, the stuff SQL Server is so good at handling), were not managed in any particular way: no source control, and automated testing was pretty much non-existent.
Not only that, but (wait for it!) the XPs made HTTP calls out from SQL Server to other APIs, which mostly proxied the calls back down to SQL Server!
Ironically, perhaps, Microsoft is adding exactly this capability now.
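For context, the capability in question is `sp_invoke_external_rest_endpoint`, currently available in Azure SQL Database rather than on-premises SQL Server. A minimal sketch (the endpoint URL here is hypothetical):

```sql
-- Azure SQL Database only: call an external REST API from T-SQL.
DECLARE @response NVARCHAR(MAX);

EXEC sp_invoke_external_rest_endpoint
    @url      = N'https://example.com/api/status',  -- hypothetical endpoint
    @method   = N'GET',
    @response = @response OUTPUT;

-- @response holds a JSON envelope with the status code, headers and body.
SELECT @response;
```

The difference, of course, is that this is a supported, sandboxed feature rather than a hand-rolled extended stored procedure running native code inside the server process.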
This was a simplified view of how things really were. The SQL Servers (a mixture of Standard and Enterprise editions) are the blue cylinders, split into three groupings, which I’ll address as A, S and C. The API layers are in yellow. Comms to the servers were as you would expect, standard SQL TDS streams, and the comms between the SQL Servers themselves were via Service Broker; if that wasn’t enough, the S servers had two-way Service Broker conversations. All very complicated, but when a request came through, it was routed to the A or S servers depending on the payload. The C servers (out on the right) were used to encrypt and decrypt data; to access this, the A and S servers would make HTTP calls out to the C servers’ API layer.
With me? Phew! You can breathe now! And unbelievable as that is, *that* is genuinely a simplified version.
The code base of the stored procedures was not the same between the A, S and C servers either, which meant a complete rabbit hole of working out which was the real source of truth for the code. Each copy was similar enough to lull you into a false sense of security, but not quite similar enough to merge them all together.
The processing was such that the customer would make one call to the API, which would return a unique ID. This came from an identity column, so all the servers had to be seeded to different ranges of IDs, and all hell would break loose if they overlapped. The customer would then use that same ID to make another call to the API, which would be serviced on the S servers. Most of the time Service Broker won the race and the data was already there, but sometimes it was not; this was the normal BAU state of affairs.
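To give a flavour of the identity-range juggling, here is a minimal sketch of the kind of seeding involved; the table name and band sizes are hypothetical, not the actual system's values:

```sql
-- Hypothetical: each server owns a non-overlapping identity band.
-- Server 1 owns 1..99,999,999; server 2 owns 100,000,000..199,999,999; etc.
-- On server 2, reseed the shared table into its band:
DBCC CHECKIDENT ('dbo.Requests', RESEED, 100000000);

-- A check that can be scheduled to alert before a band is exhausted:
SELECT IDENT_CURRENT('dbo.Requests') AS current_id;
-- If current_id approaches the top of this server's band, someone needs
-- paging before the ranges overlap and all hell breaks loose.
```

Every reseed, and every alert threshold, was another plate that had to be kept spinning by hand.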
Everything combined, it was a genuine plate-spinning act for the support staff to keep the show on the road.
Now the brown stuff hits the fan.
As anyone who has ever worked with Service Broker will know, it is a tricky beast to will into submission. It can seem overly complicated, but once you wrap your head around it, as with everything else, it’s navigable: you can see why the complexity is there, and you can create a solid solution.
However, most implementations I’ve seen over the years have been of the one-message-one-dialog variety. So, something like:
DECLARE @dh UNIQUEIDENTIFIER;

BEGIN DIALOG CONVERSATION @dh
    FROM SERVICE [Initiator]
    TO SERVICE N'Target'
    ON CONTRACT [DEFAULT];

SEND ON CONVERSATION @dh
    MESSAGE TYPE [DEFAULT] (N'<payload/>');

END CONVERSATION @dh;
That is bad, very, very bad. If you have something simplistic that is not business critical (and ideally a list of what is and isn’t, which is where the real rub is!), it’ll do; but for a robust, 24/7, multi-thousand-TPS system, forget it.
The clue to the problem here is that Service Broker is designed to be a dialogue-based messaging system. Cry all you want about it being overblown and complicated if you must; this is not the simpler technology you are after. So, what does a dialogue mean? Well, it’s a two-way conversation, and a long-running two-way conversation at that. The cost of initiating and cleanly closing a dialogue is immense compared to that of the SEND itself. Remus Rusanu (who, as far as I’m aware, was one of the architects of Service Broker) makes the point here in much more detail than is worth repeating.
Another anti-pattern is the “queue in a table”; again, don’t do that. Service Broker is complex precisely because it solves the problems (locking, ordering, poison messages) that a table-as-queue runs into.
Bottom line: if you want to queue inside SQL Server, learn Service Broker and build a proper implementation that reuses conversations.
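A minimal sketch of what conversation reuse can look like; the pool table, service and contract names here are hypothetical, and a production version would also recycle dialogs on error or on a schedule:

```sql
-- Hypothetical pool of reusable dialog handles, keyed by target service.
CREATE TABLE dbo.DialogPool
(
    target_service SYSNAME          NOT NULL PRIMARY KEY,
    dialog_handle  UNIQUEIDENTIFIER NOT NULL
);
GO

-- Reuse an existing dialog if we have one; otherwise begin one and cache it.
DECLARE @dh UNIQUEIDENTIFIER;

SELECT @dh = dialog_handle
FROM dbo.DialogPool WITH (UPDLOCK, HOLDLOCK)
WHERE target_service = N'Target';

IF @dh IS NULL
BEGIN
    BEGIN DIALOG CONVERSATION @dh
        FROM SERVICE [Initiator]
        TO SERVICE N'Target'
        ON CONTRACT [DEFAULT];

    INSERT dbo.DialogPool (target_service, dialog_handle)
    VALUES (N'Target', @dh);
END

-- The SEND is now cheap; the expensive begin/end cost is paid rarely,
-- when a pooled dialog is first created or later recycled.
SEND ON CONVERSATION @dh
    MESSAGE TYPE [DEFAULT] (N'<payload/>');
```

The point is simply to amortise the dialogue setup and teardown across thousands of SENDs rather than paying it once per message.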
Suffice it to say that our implementation wasn’t that, and due to a series of unfortunate events, the queues on the ‘A’ servers started to grow. A snowball that was to cause the later avalanche was, gently and without much notice, pushed from the top of the mountain.
What did this mean for Service Broker as the snowball gathered momentum on its now unstoppable and doomed journey to the bottom of the mountain?
Put simply, as requests came in to send messages, there was a lot of activity in sys.conversation_endpoints. Why was that? Well, Service Broker could now not remove the old conversations, and as it couldn’t remove them, it couldn’t start new ones, and as it couldn’t start new ones, it couldn’t remove the old ones…
This is not, IMHO, a Service Broker design flaw; this is an implementation design flaw. As the world burnt and the conversations backed up more and more, mitigations were made: apps were created to extract the messages from the queues and play them directly against the destination servers, and replication was added between the source tables and the destination servers. All of which helped, but did nothing to keep our heads above water.
What other help did you have in attempting to solve the problem? I hear you cry. Well, we had the SQL Server product team on site, and we had some of the best UK consultants on site, but the momentum was unstoppable.
Then, at 3 AM, on what felt like an hourly update call with Microsoft, everyone was out of ideas.
“Right,” I said, “how long to get us to the point where we can ‘new broker’ systematically on each machine, truncate the queues and get back to the BAU state?” This was not a simple job, but now everyone had an agreed course of action. The queues now held the single version of truth for some transactions, as the “cleanup” jobs had already purged the versions on the ‘A’ servers to preserve space on the databases, another one of the many spinning plates that had to be kept in the air. So close to the wind was this run that on one occasion a USB drive had to be purchased from Best Buy and used on a production server!
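For reference, the “new broker” step is a real database option. A sketch of the reset involved (database name hypothetical); note that NEW_BROKER removes every existing conversation, which is exactly why the queued messages had to be copied out first:

```sql
-- Copy out anything in the queues that is still the only copy of the data
-- (done beforehand in our case), then reset the broker instance:
ALTER DATABASE [AppDb]
    SET NEW_BROKER WITH ROLLBACK IMMEDIATE;
-- This assigns the database a new service_broker_guid and removes all
-- existing conversations, giving Service Broker a clean slate.
```

One statement per database, but weeks of preparation to make it safe to run.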
After much delay, due to copying all the data out of the queues and to other machines that were not even on our radar going pop, we (again, working overnight) slowly and diligently cleared the queues. By this time all the traffic was running via replication, a barely-good-enough solution to keep up with the business case, and our trials using synthetic transactions proved that the comms were now good.
The big day finally came. We allowed traffic to trickle into Service Broker; at this point we still had the bad implementation, but we were back online, and over the next few hours we allowed more and more traffic to flow through Service Broker.
“Is it operating correctly now?” asked the CIO. “I don’t know, I’ve only ever seen this thing broken” was my reply.
But there it was, we were stable again.
All fixed, now what?
Before I knew it, I was asked to take architectural control of the platform, and we set about stabilising the system for the longer term. The first port of call was, of course, source control (SSDT) and building a CI/CD pipeline. Then came a proper implementation of Service Broker; this was surprisingly quick, given that there are numerous examples of how to do it. The only mistake was that we introduced some transactional control here when there was none in the main product, which caused issues. But once bedded in, there was no more worrying about queues, operational control was easier, and everything felt less taut and fragile.
We started to turn the department around. We devised a method, using the strangler pattern, to migrate the logic from SQL Server into APIs hosted on Service Fabric (for my money, a more straightforward hosting platform than k8s, though it never got real traction in the market), and good progress was made.
However, the decision to close the platform had already been made, and that’s what happened. Unfortunately, all the staff were let go, and a small team was kept on to allow a graceful sunsetting.
As you can imagine, this was a genuine turning point for me from many points of view: as a developer, as a systems architect, as a leader, and as a human too. But I vowed never to let the above happen again if I could avoid it.
So, a good few years on, where does that leave me now? Well, SSMS is now a stranger to me, and I spend most of my time either in strategy and architecture calls or getting my hands dirty guiding teams to increase productivity and quality in the .NET stack generally. Lessons from those years were aplenty: running multiple teams, running multiple uncooperative teams, reducing tech debt, keeping the focus on the basics, SOLID principles, living and breathing agile, being metrics- and data-driven, and, last but not least, stabilising systems for the long-term good and health of organisations.