One Patch, Twelve Hours Down: The Software End-Of-Life Lesson I'll Never Forget
This is a story from early in my career, when I was a junior system administrator at a commercial bank. The events here happened in 2020 — six years ago at the time of writing — and most of the technical specifics have faded, but the lessons haven't. If anything, they've gotten sharper with experience.
I'm writing this one in the same spirit as the WAF post: a war story, not a tutorial. The lesson at the end applies to anyone running anything past end of life, which, if you're honest about your environment, is probably more software than you'd like to admit.
What we had
Our entire internet egress went through a single Microsoft Forefront TMG 2010 server. Every staff member's outbound traffic, whether web browsing, third-party API calls, software updates, or anything else you can think of, passed through it. We used it as a forward proxy with URL filtering: which sites users could reach, which categories were blocked, which destinations needed authentication. It had been deployed long before I joined the bank, running on Windows Server 2008 R2 in a Hyper-V VM, with replicas going to our DR site on a frequent schedule.
Microsoft had announced the end of TMG years earlier. By 2020, the product was in its final months of extended support — Microsoft would officially stop shipping security updates for it in April that year. We knew this. We had not done anything about it.
The Saturday night patch
Our standard operating procedure for Windows servers was the same as it had been for years: apply patches once a month, in a window starting Saturday at midnight and ending early Sunday morning. The TMG server was patched alongside everything else. Reboot, wait for services to come up, verify, sign off, go home.
Just like always, the patches installed without complaint. The server rebooted.
The TMG service did not start.
The Microsoft Forefront TMG service was configured to start automatically at boot. After a reboot, the management console should reconnect to a running service within a minute or two. This time, the service simply wasn't running. The service control manager reported it as stopped.
I tried starting it manually. It failed.
I tried starting it from the command line with verbose output. It failed.
I checked the event logs. There were errors, but nothing that pointed at a clear root cause, and the service dependencies were deep enough that anything and everything could have been the actual problem.
This was not, in itself, panic-inducing. We had been here before with other services after monthly patches — usually the answer was to uninstall the offending update, reboot, and move on. I uninstalled the most recently applied patches. Rebooted.
The service still did not start.
The twelve hours
What followed was about twelve hours of working through everything we could think of, and then everything my boss could think of, and then everything the small group of teammates who had been pulled in could think of.
We uninstalled patches in different orders. We rebooted between each attempt. We re-registered TMG's components. We tried repairing the TMG installation from the original media. We tried restarting in safe mode. We checked file permissions on the TMG installation directory, the IIS components, the SQL Server Express instance that backed the configuration. We confirmed the network adapters were healthy, the routing table looked right, the firewall rules were in place.
Nothing brought the service back.
The patch we had applied wasn't even a TMG patch. TMG hadn't received a patch that month. What we had applied was a Windows Server 2008 R2 cumulative update, the kind of update that touches dozens of OS components, any of which TMG might depend on. Somewhere inside that bundle was a change (to a low-level library, a registry key, a service dependency, a permission, or something else entirely) that TMG's startup sequence didn't know how to handle. And because TMG was a discontinued product, there was no support article describing the issue, no hotfix from Microsoft, no community thread with the answer. We were on our own with software that no one was meaningfully maintaining anymore.
The sun came up on Sunday. Staff would be off duty for most of the day, but Monday morning was coming fast. Without TMG, no one in the bank would have internet access. That included internal applications that called external APIs, software performing license checks, and basic operations like checking email through external services.
We pivoted from "fix it" to "get something running."
The replica trap
This is the part of the story that still bothers me when I think about it, because the failure was so well-engineered.
We had a disaster recovery setup. It was thoughtfully designed. Hyper-V replicas of the TMG VM were being shipped to a DR server at a high frequency — recent enough that if the production VM was lost, we would lose only minutes of state. By the standards of the time, this was good DR.
We started failing over to the most recent replica.
The replica's TMG service also refused to start.
It took me a moment to understand why, and then it was obvious in a way that felt almost insulting. The replicas weren't snapshots of TMG as it had been before the patch. They were replicas of TMG as it was now — that is, replicas of the patched, broken VM. Hyper-V Replica had been doing exactly what it was designed to do: it had faithfully copied the production state to the DR site, including the OS patches I had just applied. The DR copy was an accurate reproduction of the disaster.
We walked the replicas backward in time. The previous one: also patched, also broken. The one before that: same. We had to go days back — past the boundary of the most recent patch cycle — to find a snapshot from before the update had landed. That one started cleanly.
The catch: it was an older snapshot. Several days of rule changes were missing from it. The recovered TMG was functional, but it was the TMG of a week ago, not the TMG of yesterday. We brought it up anyway. By Sunday afternoon, traffic was flowing, and we would spend the coming days reconstructing the missing rules from change requests and memory.
The aftermath
The conversation on Monday morning was short. We weren't going to try to fix the original TMG VM. We weren't going to keep operating on the recovered older replica indefinitely. We had been carried this far by luck and a DR setup that almost worked — neither was a strategy.
A FortiGate 1101E was ordered as the replacement. (I took ownership of that project, and I fell in love with the device along the way.) Procurement took weeks, partly because the device had to be shipped internationally from the United States, and partly because the budget approval needed to go up the chain. (The chain moved unusually fast in this case, which is what happens when "the device that controls all our internet access is currently running on a snapshot from last week" is in the justification.) The FortiGate was attractive because it was modern, actively supported, and offered things TMG never could.
In parallel, the bank started an upgrade campaign for the Windows Server 2008 R2 fleet. Within a few months, the vast majority of them had been moved to Windows Server 2012 R2. This wasn't a formal policy change — the bank's official patching and EOL practices didn't really change. It was just an unspoken acknowledgement that running anything on 2008 R2 was now uncomfortable, and people moved as quickly as their backlogs allowed.
The TMG VM itself was eventually shut down and archived. To my knowledge, no one ever figured out exactly which component the Windows patch had broken, and at that point, no one cared.
What I took from this
Several lessons, in roughly the order I learned them.
The OS patch can kill the layered product. TMG was a stack of components sitting on top of Windows — IIS, SQL Server Express, COM libraries, kernel-mode drivers — and any of those getting touched by an OS patch could break TMG without "TMG" appearing in the patch notes anywhere. If you're running a layered product on top of an OS that's still receiving patches, you're exposed to every OS patch as if it were a patch to that layered product.
Replica frequency is not the same as recovery depth. Our DR strategy optimized for low RPO — losing as little data as possible if the production system failed. That's the metric we measured ourselves on. What we hadn't measured was recovery depth — how far back in time we could go if the production system was, itself, the source of the disaster. A 5-minute-old replica is worthless if 5 minutes ago was already broken. After this, we started thinking about backup and replication as two different things: replication for high availability, and retained, point-in-time backups — kept far enough back that you can step over the patch cycle — for disaster recovery. The two solve different problems, and the second one is the one that saves you when the first one has faithfully replicated the disaster.
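As a rough illustration of the difference between RPO and recovery depth, here is a minimal Python sketch. Everything in it is an assumption made for the example: the timestamps are invented, the five-minute replica interval and the seven daily backups are placeholders rather than what the bank actually ran, and the names RecoveryPoint, rpo, recovery_depth, and latest_clean_point are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass(frozen=True)
class RecoveryPoint:
    taken_at: datetime
    source: str  # illustrative label: "replica" or "backup"


def rpo(points: List[RecoveryPoint], failure_at: datetime) -> timedelta:
    """Age of the newest recovery point at the moment of failure."""
    newest = max(p.taken_at for p in points if p.taken_at <= failure_at)
    return failure_at - newest


def recovery_depth(points: List[RecoveryPoint], failure_at: datetime) -> timedelta:
    """How far back in time the oldest retained recovery point reaches."""
    oldest = min(p.taken_at for p in points if p.taken_at <= failure_at)
    return failure_at - oldest


def latest_clean_point(
    points: List[RecoveryPoint], bad_change_at: datetime
) -> Optional[RecoveryPoint]:
    """Newest point taken strictly before the change that broke the system."""
    clean = [p for p in points if p.taken_at < bad_change_at]
    return max(clean, key=lambda p: p.taken_at, default=None)


# Made-up timeline: the patch lands, and the failure is noticed two hours later.
patch_applied = datetime(2020, 1, 1, 0, 30)
failure_noticed = patch_applied + timedelta(hours=2)

# Assumed: twelve replicas, five minutes apart, all newer than the patch.
replicas = [
    RecoveryPoint(failure_noticed - timedelta(minutes=5 * k), "replica")
    for k in range(1, 13)
]

# Assumed: seven daily backups, the newest taken six hours before the patch.
backups = [
    RecoveryPoint(patch_applied - timedelta(hours=6, days=d), "backup")
    for d in range(7)
]

print("RPO:", rpo(replicas, failure_noticed))
print("Depth (replicas only):", recovery_depth(replicas, failure_noticed))
print("Clean point (replicas only):", latest_clean_point(replicas, patch_applied))
print("Clean point (with backups):", latest_clean_point(replicas + backups, patch_applied))
```

With these invented numbers the replicas give an excellent RPO and still contain no point from before the patch; only the retained backups do. That is the whole lesson in four function calls.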
Patching software past end of life is not safer than not patching it. This is the one that took longest to internalize, because it cuts against the standard security advice that you should always apply patches. However, once a product reaches EOL, the patches you're still receiving aren't patches to the product. They're patches to the platform underneath it. The vendor isn't testing those patches against the EOL product anymore. Each one is a small unsupervised experiment. The "safe" thing to do — counterintuitively — is to freeze the EOL system at its last-known-good state, isolate it as much as possible, and treat the migration to a supported product as the actual security work.
Don't wait for EOL to plan the migration. We knew TMG was being discontinued years before this happened. We had years to plan and execute a replacement. We didn't, because there was always a more pressing problem on the list. The cost of that procrastination was a twelve-hour outage at 2 AM and a week of frantic rule reconstruction. These days, when I see a product approaching EOL, anywhere from 12 to 24 months out, I treat the migration as a now-problem, not a later-problem. The work is the same either way; the only variable is whether you do it on your schedule or on the patch's.
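The 24-month rule of thumb is easy to turn into a recurring check against an inventory. The sketch below is one hedged way to do it in Python. The function names, the status labels, the "today" value, and the third inventory entry are all hypothetical. The two Microsoft dates are the extended-support end dates as I understand them, so verify them against the vendor's lifecycle pages before relying on them.

```python
from datetime import date


def months_until(eol: date, today: date) -> int:
    """Whole calendar months between today and the EOL date (day of month ignored)."""
    return (eol.year - today.year) * 12 + (eol.month - today.month)


def migration_status(eol: date, today: date, horizon_months: int = 24) -> str:
    """Classify a product by how close it is to end of life."""
    if eol < today:
        return "past EOL"
    if months_until(eol, today) <= horizon_months:
        return "migrate now"
    return "monitor"


# Illustrative inventory. Check every date against the vendor's own lifecycle page.
inventory = {
    "Forefront TMG 2010": date(2020, 4, 14),
    "Windows Server 2008 R2": date(2020, 1, 14),
    "Hypothetical Product X": date(2030, 1, 1),  # placeholder entry
}

today = date(2018, 6, 1)  # made-up "now" for the example

for product, eol in sorted(inventory.items(), key=lambda item: item[1]):
    print(f"{product}: EOL {eol.isoformat()} -> {migration_status(eol, today)}")
```

Run on a schedule, something like this turns "we knew for years" into a line item that gets louder as the date approaches.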
I learned more from that one Saturday night than from probably six months of routine work. My boss, who was unreasonably calm throughout the whole thing — the kind of calm that comes from having seen worse — said something I've remembered since: the systems that scare you most are the ones nobody's been forced to think hard about in a long time.