
Cloud & DevOps Insights

Practical notes on platform engineering, Kubernetes, AWS, Terraform, and the realities of managing infrastructure at scale.

One Patch, Twelve Hours Down: The Software End-Of-Life Lesson I'll Never Forget

May 23, 2026 · dsamist

This is a story from early in my career, when I was a junior system administrator at a commercial bank. The events here happened in 2020, six years ago at the time of writing, and most of the technical specifics have faded, but the lessons haven't. If anything, they've gotten sharper with experience.

I'm writing this one in the same spirit as the [WAF post](https://blog.semmanuel.com/posts/22): a war story, not a tutorial. The lesson at the end applies to anyone running anything past end of life, which, if you're honest about your environment, is probably more software than you'd like to admit.

---

## What we had

Our entire internet egress went through a single Microsoft Forefront TMG 2010 server. Every staff member's outbound traffic, whether web browsing, third-party API calls, or software updates, passed through it. We used it as a forward proxy with URL filtering: which sites users could reach, which categories were blocked, which destinations needed authentication. It had been deployed long before I joined the bank, running on Windows Server 2008 R2 in a Hyper-V VM, with replicas going to our DR site on a frequent schedule.

Microsoft had announced the end of TMG years earlier. By 2020, the product was in its final months of extended support: Microsoft would officially stop shipping security updates for it in April that year. We knew this. We had not done anything about it.

---

## The Saturday night patch

Our standard operating procedure for Windows servers was the same as it had been for years: apply patches once a month, in a window starting Saturday at midnight and ending early Sunday morning. The TMG server was patched alongside everything else. Reboot, wait for services to come up, verify, sign off, go home.

Just like always, the patches installed without complaint. The server rebooted. The TMG service did not start.

The Microsoft Forefront TMG service was configured to start automatically at boot.
After a reboot, the management console should reconnect to a running service within a minute or two. This time, the service simply wasn't running. The service control manager reported it as stopped. I tried starting it manually. It failed. I tried starting it from the command line with verbose output. It failed. I checked the event logs. There were errors, but nothing that pointed at a clear root cause, and the service dependencies were deep enough that almost anything could have been the actual problem.

This was not, in itself, panic-inducing. We had been here before with other services after monthly patches, and usually the answer was to uninstall the offending update, reboot, and move on.

I uninstalled the most recently applied patches. Rebooted. The service still did not start.

---

## The twelve hours

What followed was about twelve hours of working through everything we could think of, and then everything my boss could think of, and then everything the small group of teammates who had been pulled in could think of.

We uninstalled patches in different orders. We rebooted between each attempt. We re-registered TMG's components. We tried repairing the TMG installation from the original media. We tried restarting in safe mode. We checked file permissions on the TMG installation directory, the IIS components, the SQL Server Express instance that backed the configuration. We confirmed the network adapters were healthy, the routing table looked right, the firewall rules were in place.

Nothing brought the service back.

The patch we had applied wasn't even a TMG patch. TMG hadn't received a patch that month. What we had applied was a Windows Server 2008 R2 cumulative update, the kind of update that touches dozens of OS components, any of which TMG might have depended on. Somewhere inside that bundle was, presumably, a change to a low-level library, a registry key, a service dependency, or a permission that TMG's startup sequence didn't know how to handle.
And because TMG was a discontinued product, there was no support article describing the issue, no hotfix from Microsoft, no community thread with the answer. We were on our own with software that no one was meaningfully maintaining anymore.

The sun came up on Sunday. Staff would be off duty for most of the day, but Monday morning was coming fast. Without TMG, no one in the bank would have internet access. That included internal applications that called external APIs, software pulling licensing checks, and basic operations like checking email through external services.

We pivoted from "fix it" to "get something running."

---

## The replica trap

This is the part of the story that still bothers me when I think about it, because the failure was so well-engineered.

We had a disaster recovery setup. It was thoughtfully designed. Hyper-V replicas of the TMG VM were being shipped to a DR server at a high frequency, recent enough that if the production VM was lost, we would lose only minutes of state. By the standards of the time, this was *good* DR.

We started failing over to the most recent replica. The replica's TMG service also refused to start.

It took me a moment to understand why, and then it was obvious in a way that felt almost insulting. The replicas weren't snapshots of TMG as it had been *before* the patch. They were replicas of TMG as it was *now*, that is, replicas of the patched, broken VM. Hyper-V Replica had been doing exactly what it was designed to do: it had faithfully copied the production state to the DR site, including the OS patches I had just applied. The DR copy was an accurate reproduction of the disaster.

We walked the replicas backward in time. The previous one: also patched, also broken. The one before that: same. We had to go *days* back, past the boundary of the most recent patch cycle, to find a snapshot from before the update had landed.

That one started cleanly.

The catch: it was an older snapshot.
Several days of rule changes were missing from it. The recovered TMG was functional, but it was the TMG of a week ago, not the TMG of yesterday. We brought it up anyway. By Sunday afternoon, traffic was flowing, and we would spend the coming days reconstructing the missing rules from change requests and memory.

---

## The aftermath

The conversation on Monday morning was short. We weren't going to try to fix the original TMG VM. We weren't going to keep operating on the recovered older replica indefinitely. We had been carried this far by luck and a DR setup that almost worked, and neither was a strategy.

A FortiGate 1101E (a device I grew very fond of, because I took ownership of this project) was ordered as the replacement. Procurement took weeks, partly because the device had to be shipped internationally from the United States, and partly because the budget approval needed to go up the chain. (The chain moved unusually fast in this case, which is what happens when "the device that controls all our internet access is currently running on a snapshot from last week" is in the justification.) The FortiGate was attractive because it was modern, actively supported, and offered things TMG never could.

In parallel, the bank started an upgrade campaign for the Windows Server 2008 R2 fleet. Within a few months, the vast majority of those servers had been moved to Windows Server 2012 R2. This wasn't a formal policy change; the bank's official patching and EOL practices didn't really change. It was just an unspoken acknowledgement that running anything on 2008 R2 was now uncomfortable, and people moved as quickly as their backlogs allowed.

The TMG VM itself was eventually shut down and archived. To my knowledge, no one ever figured out exactly which component the Windows patch had broken, and at that point, no one cared.

---

## What I took from this

Several lessons, in roughly the order I learned them.
**The OS patch can kill the layered product.** TMG was a stack of components sitting on top of Windows (IIS, SQL Server Express, COM libraries, kernel-mode drivers), and any of those getting touched by an OS patch could break TMG without "TMG" appearing in the patch notes anywhere. If you're running a layered product on top of an OS that's still receiving patches, you're exposed to every OS patch as if it were a patch to that layered product.

**Replica frequency is not the same as recovery depth.** Our DR strategy optimized for low RPO: losing as little data as possible if the production system failed. That's the metric we measured ourselves on. What we hadn't measured was *recovery depth*: how far back in time we could go if the production system was, itself, the source of the disaster. A 5-minute-old replica is worthless if 5 minutes ago was already broken. After this, we started thinking about backup and replication as two different things: replication for high availability, and *retained, point-in-time backups*, kept far enough back that you can step over the patch cycle, for disaster recovery. The two solve different problems, and the second one is the one that saves you when the first one has faithfully replicated the disaster.

**Patching software past end of life is not safer than not patching it.** This is the one that took longest to internalize, because it cuts against the standard security advice that you should always apply patches. However, once a product reaches EOL, the patches you're still receiving aren't patches to *the product*. They're patches to the platform underneath it. The vendor isn't testing those patches against the EOL product anymore. Each one is a small unsupervised experiment. The "safe" thing to do, counterintuitively, is to freeze the EOL system at its last-known-good state, isolate it as much as possible, and treat the migration to a supported product as the actual security work.
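
To make the recovery-depth point concrete, here is a minimal Python sketch. It is not anything we ran at the bank; the function names, the timeline, and the retention numbers are all made up for illustration. It asks one question of a set of restore points: what is the newest one taken before the change that broke the system?

```python
from datetime import datetime, timedelta

def latest_clean_restore_point(restore_points, breaking_change_at):
    """Return the newest restore point taken strictly before the breaking change, or None."""
    clean = [p for p in restore_points if p < breaking_change_at]
    return max(clean) if clean else None

def recovery_depth(restore_points, now):
    """How far back in time the oldest retained restore point reaches."""
    return now - min(restore_points) if restore_points else timedelta(0)

# Hypothetical timeline: a patch lands at 00:30 and we look for a way back at 06:00.
patch_applied = datetime(2020, 1, 12, 0, 30)
now = datetime(2020, 1, 12, 6, 0)

# Assumed replication setup: a point every 5 minutes, only the last hour retained.
replica_points = [now - timedelta(minutes=5 * i) for i in range(1, 13)]

# Assumed backup setup: one point per day at 23:00, retained for 14 days.
backup_points = [datetime(2020, 1, 12, 23, 0) - timedelta(days=d) for d in range(1, 15)]

print(latest_clean_restore_point(replica_points, patch_applied))  # None
print(latest_clean_restore_point(backup_points, patch_applied))   # 2020-01-11 23:00:00
print(recovery_depth(replica_points, now))                        # 1:00:00
```

In this toy setup the replicas have excellent RPO and no clean point at all, while the daily backups lose more data but can step back over the patch. That is the distinction between frequency and depth.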
**Don't wait for EOL to plan the migration.** We knew TMG was being discontinued years before this happened. We had years to plan and execute a replacement. We didn't, because there was always a more pressing problem on the list. The cost of that procrastination was a twelve-hour outage at 2 AM and a week of frantic rule reconstruction. These days, when I see a product approaching EOL, anywhere from 12 to 24 months out, I treat the migration as a now-problem, not a later-problem. The work is the same either way; the only variable is whether you do it on your schedule or on the patch's.

---

I learned more from that one Saturday night than from probably six months of routine work. My boss, who was unreasonably calm throughout the whole thing (the kind of calm that comes from having seen worse), said something I've remembered since: *the systems that scare you most are the ones nobody's been forced to think hard about in a long time.*


Vendor-Recommended Doesn't Mean Safe to Apply: A WAF Security Story

May 16, 2026 · dsamist

This is a story from an organization I once worked with whose infrastructure was hosted on AWS.

One day, our team got an email from the AWS Security Incident Response team. The subject line was the kind that makes you sit up. They had detected a distributed attack on one of our CloudFront distributions, lasting about 45 minutes, against a high-traffic content platform we operate. The attack had since subsided, but they had attached a `rule.json` containing a WAF rule they recommended we apply to block the traffic if it came back.

The rule was simple: block any request matching a specific JA4 fingerprint.

```json
{
  "Name": "block-ja4-AMSSEC-XXXXX",
  "Statement": {
    "ByteMatchStatement": {
      "SearchString": "t13d1517h2_8daaf6152771_b6f405a00624",
      "FieldToMatch": {
        "JA4Fingerprint": { "FallbackBehavior": "NO_MATCH" }
      },
      "TextTransformations": [{ "Priority": 0, "Type": "NONE" }],
      "PositionalConstraint": "EXACTLY"
    }
  },
  "Action": { "Block": {} }
}
```

It came from AWS. Our site had been attacked. The obvious move was to apply it.

We didn't. And after a week of investigation, AWS Support eventually agreed we shouldn't. This post is the story of what happened in between.

---

## A quick primer on JA4 fingerprints

If you've never come across JA4 before: it's a fingerprint derived from the TLS Client Hello that a connecting client sends to your server. Different TLS clients (Chrome on Windows, curl, Python's `requests`, a botnet's custom client) produce different Client Hello shapes: different cipher suites in different orders, different extensions, different ALPN values. JA4 turns those differences into a fingerprint string you can match in your WAF.

The format looks like this:

```
t13d1517h2_8daaf6152771_b6f405a00624
└───┬────┘ └─────┬────┘ └─────┬────┘
    │            │            │
    │            │            └── Hash of extensions and signature algorithms
    │            └── Hash of cipher suites
    └── TLS metadata: protocol version, cipher count, ALPN, etc.
t13 = TLS 1.3 over TCP
d   = SNI present (the client connected by domain name)
15  = 15 cipher suites
17  = 17 extensions
h2  = HTTP/2 ALPN
```

JA4 is genuinely useful for blocking custom-built attack clients, because attackers rarely take the time to make their tooling fingerprint-identical to a real browser. But it has a known weakness: **modern browsers all look very similar at the TLS layer**, and so do many popular HTTP libraries. The same JA4 fingerprint can match Chrome, Edge, Firefox, and Postman simultaneously.

That's the trap we were about to walk into.

---

## Step one: deploy in Count mode

One thing I have learnt: except for extreme, urgent, well-understood issues, always deploy a new WAF rule in `Count` mode before flipping it to Block. Count mode tells WAF to log every request the rule *would have* blocked, without actually blocking anything. It's cheap, harmless, and tells you whether the rule does what you think it does.

We applied AWS's rule scoped to the host header of the affected site, with `Action: Count`:

```yaml
- Action:
    Count: {}
  Name: ja4-count-AMSSEC-XXXXX
  Priority: 8
  Statement:
    AndStatement:
      Statements:
        - ByteMatchStatement:
            FieldToMatch:
              SingleHeader:
                Name: host
            PositionalConstraint: EXACTLY
            SearchString: app.example.com
            TextTransformations:
              - Type: NONE
                Priority: 0
        - ByteMatchStatement:
            SearchString: t13d1517h2_8daaf6152771_b6f405a00624
            FieldToMatch:
              JA4Fingerprint:
                FallbackBehavior: NO_MATCH
            TextTransformations:
              - Priority: 0
                Type: NONE
            PositionalConstraint: EXACTLY
  VisibilityConfig:
    SampledRequestsEnabled: true
    CloudWatchMetricsEnabled: true
    MetricName: count-ja4-AMSSEC-XXXXX
```

Within 24 hours, the `CountedRequests` metric on this rule had crossed 17 million.

That's when things started feeling off.

---

## Three findings that killed the rule

We ran a series of Athena queries against the WAF logs to break down what was actually matching.
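
A quick aside before the findings: the first section of a JA4 string, the part covered in the primer, can be decoded mechanically. The sketch below is my own illustration, not part of AWS's tooling. The function name is hypothetical, and the field meanings follow the public JA4 description as I understand it (transport, TLS version, SNI flag, cipher count, extension count, ALPN).

```python
def decode_ja4_prefix(fingerprint: str) -> dict:
    """Decode the human-readable first section of a JA4 fingerprint."""
    prefix = fingerprint.split("_")[0]
    if len(prefix) != 10:
        raise ValueError(f"unexpected JA4 prefix length: {prefix!r}")

    transports = {"t": "TCP", "q": "QUIC"}
    versions = {"13": "TLS 1.3", "12": "TLS 1.2", "11": "TLS 1.1", "10": "TLS 1.0"}
    sni_flags = {"d": "SNI present (domain)", "i": "no SNI (IP)"}

    return {
        "transport": transports.get(prefix[0], "unknown"),
        "tls_version": versions.get(prefix[1:3], "unknown"),
        "sni": sni_flags.get(prefix[3], "unknown"),
        "cipher_count": int(prefix[4:6]),
        "extension_count": int(prefix[6:8]),
        "alpn": prefix[8:10],  # first and last character of the first ALPN value
    }

decoded = decode_ja4_prefix("t13d1517h2_8daaf6152771_b6f405a00624")
for field, value in decoded.items():
    print(f"{field}: {value}")
```

Nothing in that output describes an attacker. It describes a TLS stack, which is why the prefix alone says very little about who is on the other end.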
### Finding 1: The fingerprint matches normal browser traffic

```sql
SELECT
  httprequest.clientip,
  httprequest.country,
  httprequest.uri,
  httprequest.httpmethod,
  count(*) AS request_count
FROM wafLogTable
WHERE year = 'xxxx'
  AND month = 'xx'
  AND day IN ('xx', 'xx')
  AND ja4fingerprint = 't13d1517h2_8daaf6152771_b6f405a00624'
GROUP BY 1, 2, 3, 4
ORDER BY request_count DESC
LIMIT 100;
```

The results showed traffic from a wide range of ordinary locations, hitting normal user-facing URLs, with a wide spread of source IPs, most of them sending only a handful of requests each. This is what *legitimate* traffic to a popular public-facing site looks like.

Decoding the JA4 itself confirmed the suspicion: `t13d1517h2` is TLS 1.3 with HTTP/2, which is the default for modern Chrome, Edge, and Firefox. In other words, the fingerprint matched a property of "Chrome on Windows," not a property of "the attacker."

### Finding 2: The fingerprint predates the attack

If a fingerprint is genuinely tied to an attack, you'd expect to see it appear or spike *around* the attack window. We checked the 5 days before the reported attack date:

```sql
SELECT
  day,
  count(*) AS request_count
FROM wafLogTable
WHERE year = 'xxxx'
  AND month = 'xx'
  AND day IN ('xx', 'xx', 'xx', 'xx', 'xx')
  AND ja4fingerprint = 't13d1517h2_8daaf6152771_b6f405a00624'
GROUP BY day
ORDER BY day ASC;
```

The daily counts in the *quiet* period before the attack were already in the millions. This fingerprint had been present in our baseline traffic for as long as we could see in the logs. It wasn't introduced by the attackers. They just happened to use a TLS stack that produced the same fingerprint as every Chrome user on the internet.

### Finding 3: We were about to block our own monitoring

This was the part that made me actually stop and re-read the query.

Among the IPs matching the fingerprint, three stood out because they were sending steady, low-volume, evenly-paced requests from a specific public range.
A quick lookup against AWS's published IP ranges identified them as the public probes used by one of AWS's own synthetic monitoring services, which our team uses to detect uptime issues on the platform.

So if we'd flipped the rule to Block, the first thing we would have blocked was AWS's monitoring of our own site. Every health check would have started failing. Our pager would have lit up. And we would have spent the next hour debugging "why is the site down?" while the site was, in fact, completely fine.

---

## Going back to AWS

At this point we had enough evidence to push back. I wrote up the findings and sent them back to the AWS Security engineer who had originally proposed the rule:

> Thank you for the details provided regarding the DDoS event on our CloudFront distribution.
>
> We have applied the recommended WAF rule in Count mode as advised, and have conducted a thorough investigation of the traffic matching the provided JA4 fingerprint. Unfortunately, we are unable to proceed with blocking based on this indicator alone. Here are our findings:
>
> 1. **The fingerprint is too broad.** The JA4 corresponds to a standard TLS 1.3 / HTTP/2 configuration used by modern browsers (Chrome, Edge, Firefox) and common HTTP client libraries. Over a 24-hour period, we observed over 17 million requests matching this fingerprint, the vast majority of which are legitimate user traffic.
>
> 2. **The fingerprint predates the attack.** Analysis of our WAF logs confirms that this fingerprint was already present in our normal baseline traffic from at least 5 days before the reported attack, coming from a wide range of IPs across multiple European countries.
>
> 3. **Our own AWS infrastructure matches this fingerprint.** Our synthetic monitoring probes also match this fingerprint. Blocking on this rule would disrupt our own site availability monitoring.
>
> Could you please provide a more specific indicator to isolate the attack traffic?
> For example: a combination of JA4 fingerprint + specific source IP ranges or ASNs, a more unique fingerprint, or URI patterns and request rate thresholds specific to the attack.

A few days later, an AWS WAF Support specialist took over the case. Her reply was honest: the original recommendation had come from the Security Incident Response team's own traffic analysis during the attack window, but they didn't have additional indicators to share. And since several days had passed, the attackers had likely rotated their technique anyway. A static fingerprint block would now be both **too broad** (catching legitimate traffic) and **too narrow** (an attacker today would look different).

What she recommended instead was much more useful:

1. **Tighten the existing managed rules.** Specifically, set `AWSManagedIPDDoSList` inside the `AWSManagedRulesAmazonIpReputationList` group from `Count` to `Block`. This list is maintained by AWS threat intelligence and targets IPs actively participating in DDoS activity *right now*. It self-updates, unlike a static fingerprint.
2. **Lower the rate-based rule thresholds.** Our existing per-IP limit was xxx,xxx requests per 5 minutes, which is far too permissive against a distributed attack. Real human users on this platform rarely exceed a few hundred requests in that window.
3. **Increase Anti-DDoS sensitivity from LOW to MEDIUM.** We had been running it at the most permissive level.
4. **Enable AWS WAF Bot Control at Targeted inspection level.** This adds active detection for coordinated automated traffic, which is most of what a distributed attack looks like.
5. **Set `HostingProviderIPList` to Challenge.** Most legitimate users don't come from datacenter IP ranges. A challenge filters out automated traffic from cloud and VPS providers without blocking the rare real user behind one.

We deployed those changes over the following week, with each rule going through the same `Count → validate → Block` cycle.
None of them would have broken our monitoring.

---

## What I took away from this

A few things worth writing down so I remember them next time.

**Count mode is non-negotiable.** Every new WAF rule, every time, no exceptions, even when the rule came from AWS themselves. The cost of a day in Count mode is logging volume. The cost of a bad Block rule is your site going down for as long as it takes you to realize *you* did it, not the attackers.

**A vendor recommendation is an input, not an instruction.** AWS Security wasn't wrong to send us the rule. They were acting on their own traffic data and trying to help. But the data *they* see during a 45-minute attack window is much less than the data *we* see in our logs across days and weeks. The party with more context has to do the validation work. That's us.

**Static indicators decay fast.** A JA4 fingerprint, an IP list, a specific URI pattern: these are all snapshots of how an attacker behaved at one point in time. They're useful for the next 24 hours, sometimes less. Dynamic protections (managed rule groups that self-update, rate-based rules, behavioral challenges) are what actually carry weight over time. We had been under-using them.

**Block your own monitoring once, and you'll never forget to check again.** If we had skipped Count mode, the first metric to go red after deploying the Block rule would have been our own synthetic checks, which we would naturally have assumed meant the site was actually down. The detection-and-response confusion alone would have cost us at least an hour before anyone thought to look at WAF logs.

The DDoS attack itself, in the end, was the easy part. AWS Shield absorbed most of it. The harder part was responding to the *response*: making sure that whatever we put in place to "prevent it next time" didn't quietly become a bigger outage than the attack ever was.
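
As a footnote to Finding 3, here is a minimal Python sketch of the kind of check that caught our monitoring probes: take the client IPs a Count-mode rule matched and flag any that fall inside ranges you already know belong to your own tooling. The function name is hypothetical, and the addresses below are documentation-only placeholders, not the real probe ranges.

```python
import ipaddress

def flag_known_ips(client_ips, known_cidrs):
    """Return the client IPs that fall inside any of the known CIDR ranges."""
    networks = [ipaddress.ip_network(cidr) for cidr in known_cidrs]
    flagged = []
    for ip in client_ips:
        addr = ipaddress.ip_address(ip)
        if any(addr in net for net in networks):
            flagged.append(ip)
    return flagged

# Placeholder data: IPs a Count-mode rule matched, and an assumed monitoring range.
matched_ips = ["203.0.113.10", "192.0.2.44", "203.0.113.11"]
monitoring_cidrs = ["203.0.113.0/28"]

print(flag_known_ips(matched_ips, monitoring_cidrs))
# ['203.0.113.10', '203.0.113.11']
```

If that list is ever non-empty for a rule you are about to flip to Block, you have found your next outage before causing it.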


Welcome

May 14, 2026 · dsamist

For the longest time, I have wanted a place where I could document the things I learn while working across IT infrastructure and technology systems: cloud infrastructure, system administration, Kubernetes, AWS, Terraform, databases, automation, and lots more, alongside the beautiful chaos that comes with managing technology at scale.

So this blog is finally that place.

This has been on my mind for a while now, but laziness combined with some unknown factor has held me back. However, I guess it's time to at least start something. Not because I know everything (far from it), but because I have realized that some of the best lessons in tech come from the little things:

* the weird errors that make no sense at 2 AM,
* the deployment that suddenly breaks after a "minor" upgrade,
* the database issue that appears out of nowhere,
* the debugging session that teaches you more than an entire course,
* the workflow that behaves differently in production,
* or the random Eureka 💡 moment (my favorite over the years) after hours of frustration.

A lot of technical knowledge gets lost in chats, tickets, notebooks, and unfinished drafts. I wanted a place where experiences, lessons, mistakes, fixes, research, and practical insights from real-world engineering and system management could be properly documented and shared.

This blog will contain thoughts, notes, and practical experiences around:

* Cloud & DevOps
* Kubernetes
* AWS
* Terraform
* Platform Engineering
* Databases & System Administration
* Automation
* CI/CD
* Monitoring & Observability
* Infrastructure at scale
* And occasionally, the realities of navigating tech and continuous learning.

And while this started as a personal initiative, I intentionally designed the platform to accept contributions from others as well.
Technology grows through shared knowledge, and I strongly believe there are countless engineers, administrators, researchers, and builders with valuable experiences worth documenting.

Whether you are a software engineer, cloud engineer, DBA, AI engineer, system administrator, platform engineer, researcher, or simply someone trying to grow in tech, I hope you find something useful here, and perhaps someday contribute your own insights too.

If anything here helps even one person debug faster, think better, or feel less alone while battling technical problems, then this blog has served its purpose.

Welcome to the journey!

