- cloudflare
- outage
- infrastructure
- postmortem
Introduction
On November 18, 2025, a significant chunk of the internet stopped working for about three hours. If you tried to access a site powered by Cloudflare between 11:20 AM and 2:30 PM UTC, you probably saw an error page instead of the website you wanted.
This wasn't a cyber attack. It wasn't a hardware failure. It was a database permission change that made a configuration file twice as big as expected, which then broke the software that reads it.
Here's what happened, why it matters, and what we can learn from it.
What Cloudflare Does
Before we get into the outage, it helps to understand what Cloudflare actually does. When you visit a website, your request doesn't go directly to that website's server. Instead, it often goes through Cloudflare first. Cloudflare sits between you and the website, handling things like:
- Blocking malicious bots and attacks
- Caching content to make sites faster
- Protecting against DDoS attacks
- Managing SSL certificates
Millions of websites use Cloudflare (including this one), which means when Cloudflare goes down, a massive portion of the internet becomes unreachable. That's exactly what happened on November 18.
The Chain of Events
At 11:05 AM UTC, someone made what seemed like a reasonable change to improve security in Cloudflare's ClickHouse database system. ClickHouse is a database that Cloudflare uses to store and query massive amounts of data quickly.
The change was about permissions. The goal was to make queries more secure by having them run under specific user accounts instead of a shared system account. This would let Cloudflare enforce better access controls and limits.
Sounds good, right? The problem was what this change did to an existing query that nobody thought to check.
The Bot Management System
Cloudflare has a Bot Management system that uses machine learning to figure out if incoming traffic is from real humans or automated bots. This system needs a configuration file that tells it what "features" to look for when analyzing traffic. A feature is just a characteristic of the request that helps determine if it's a bot or not.
Every few minutes, a query runs against the ClickHouse database to generate this configuration file. The query looks something like this:
SELECT name, type
FROM system.columns
WHERE table = 'http_requests_features'
ORDER BY name;
Notice something? This query doesn't filter by database name. It just grabs columns from a specific table.
Before the permission change, the query only returned columns from one database. After it, the accounts running the query could suddenly see metadata for an additional, underlying database that stores the same tables, so the query started returning duplicate entries: the same columns, listed once for each database that exposed them. A filter on the database name would have prevented this, but the query had never needed one before.
This made the configuration file twice as big as it was supposed to be.
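Here's a minimal sketch of that failure mode, with hypothetical feature names and a made-up build_feature_list helper rather than anything from Cloudflare's codebase. If the metadata query hands back the same columns once per database, deduplicating by name is what keeps the list at its intended size:
// Hypothetical sketch, not Cloudflare's code: duplicated column metadata
// doubles the feature list unless it is deduplicated by name.
use std::collections::BTreeSet;
// Each row is a (name, type) pair as returned by the metadata query.
fn build_feature_list(rows: &[(&str, &str)]) -> Vec<String> {
    // A BTreeSet keeps exactly one entry per feature name, in sorted order.
    let unique: BTreeSet<&str> = rows.iter().map(|(name, _)| *name).collect();
    unique.into_iter().map(String::from).collect()
}
fn main() {
    // Before the permission change: one copy of each column.
    let before = [("bot_score", "UInt8"), ("ja3_hash", "String")];
    // After: the same columns, visible a second time through another database.
    let after = [
        ("bot_score", "UInt8"), ("ja3_hash", "String"),
        ("bot_score", "UInt8"), ("ja3_hash", "String"),
    ];
    assert_eq!(build_feature_list(&before).len(), 2);
    // Without deduplication the second list would contribute four entries.
    assert_eq!(build_feature_list(&after).len(), 2);
}
The original pipeline never needed that guard, because the query had only ever seen one copy of each column.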
When Hard Limits Bite You
Cloudflare's software has a hard limit on how many features can be in that configuration file. The limit was set to 200 features, which seemed reasonable since they were only using about 60 features.
The limit exists for performance reasons. The system preallocates memory based on the expected number of features. It's an optimization to make things faster.
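Here's a rough, hypothetical illustration of that trade-off (the constant name and numbers are assumptions, not Cloudflare's code). Reserving memory for the maximum up front makes appends cheap, but it also means the cap has to be enforced somewhere:
// Hypothetical sketch of why a hard cap exists: memory for the feature
// table is reserved once, up front, so the hot path never reallocates.
const MAX_FEATURES: usize = 200;
fn main() {
    // Reserve room for the maximum number of features at startup.
    let mut features: Vec<String> = Vec::with_capacity(MAX_FEATURES);
    // Appending up to the cap reuses that reserved block: no reallocation,
    // no pauses while handling live traffic.
    for i in 0..60 {
        features.push(format!("feature_{i}"));
    }
    assert!(features.capacity() >= MAX_FEATURES);
    // Anything beyond the cap breaks the assumption the memory layout was
    // built around, which is why the loader checks the count at all.
}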
When the doubled configuration file showed up with more than 200 features, the software hit that limit and panicked. In Rust, a panic means the program gives up and stops on the spot rather than handling the error and carrying on.
The specific error was in Rust code that looks like this:
let features = load_features()?;
if features.len() > MAX_FEATURES {
    // Exceeding the preallocated limit is treated as unrecoverable:
    // the thread panics instead of rejecting the bad file.
    panic!("Too many features!");
}
When this panic happened, the entire request processing system crashed. Instead of serving websites, Cloudflare started returning HTTP 5xx errors, the infamous "something went wrong on the server" messages.
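A more defensive loader treats an oversized file as bad input rather than a fatal condition. The sketch below is purely illustrative, with hypothetical names and numbers, and is not Cloudflare's actual fix: it validates each candidate configuration and keeps serving with the last one that passed:
// Hypothetical sketch, not Cloudflare's actual fix: validate a candidate
// configuration and keep the last known-good one if validation fails.
const MAX_FEATURES: usize = 200;
struct Config {
    features: Vec<String>,
}
fn validate(candidate: Vec<String>) -> Result<Config, String> {
    if candidate.len() > MAX_FEATURES {
        return Err(format!(
            "rejected config with {} features (limit is {})",
            candidate.len(),
            MAX_FEATURES
        ));
    }
    Ok(Config { features: candidate })
}
fn main() {
    // Start from a configuration that passed validation.
    let mut current = validate((0..60).map(|i| format!("f{i}")).collect())
        .expect("initial config should be valid");
    // An oversized file arrives: log it and keep serving with the old config.
    match validate((0..250).map(|i| format!("f{i}")).collect()) {
        Ok(new_config) => current = new_config,
        Err(reason) => eprintln!("keeping previous config: {reason}"),
    }
    assert_eq!(current.features.len(), 60);
}
The trade-off is running on slightly stale bot-detection data for a while, which is a much smaller problem than a crashed proxy.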
Why It Was Hard to Debug
What made this particularly nasty to diagnose was that the problem fluctuated. The configuration file was regenerated every five minutes, and the database permission change was being rolled out gradually across the ClickHouse cluster. Sometimes the query ran on a node that had already been updated and produced the bad file; sometimes it ran on a node that hadn't been updated yet and produced a good file.
This meant Cloudflare's network would fail, then recover, then fail again every five minutes. The engineering team initially thought they were under a DDoS attack because the pattern was so unusual for an internal error.
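To get a feel for how strange that looks from the outside, here's a toy simulation with entirely made-up numbers. With only part of the fleet updated, each five-minute regeneration succeeds or fails depending on which node happens to answer the metadata query:
// Toy simulation with made-up numbers: during a gradual rollout, some
// metadata nodes are updated (and produce the doubled file) and some are not.
fn main() {
    // Hypothetical fleet: 4 of 10 nodes already have the permission change.
    let node_updated = [true, false, false, true, false,
                        true, false, false, true, false];
    // A simple deterministic picker stands in for whichever node happens
    // to answer the query on a given five-minute cycle.
    for cycle in 0..8usize {
        let node = (cycle * 7) % node_updated.len();
        let outcome = if node_updated[node] {
            "doubled file -> 5xx errors"
        } else {
            "good file -> traffic flows"
        };
        println!("cycle {cycle} (~{} min): node {node}: {outcome}", cycle * 5);
    }
}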
Making things worse, Cloudflare's status page went down at the same time due to an unrelated issue. This made the team suspect they might be dealing with a coordinated attack on both their infrastructure and their status page.
They weren't. It was just unfortunate timing.
The Ripple Effects
When Cloudflare's core traffic routing broke, it didn't just affect websites. It broke other Cloudflare services that depend on the core proxy system:
- Workers KV, a key-value storage system, started returning errors
- Cloudflare Access, which handles authentication, failed for most users
- Turnstile, their CAPTCHA replacement, stopped loading
- The Cloudflare Dashboard became mostly inaccessible because it uses Turnstile for login
Even after the team identified the problem and started fixing it around 2:30 PM UTC, there was a long tail of recovery. Services that had entered a bad state needed to be restarted. A backlog of login attempts overwhelmed the dashboard. It wasn't until 5:06 PM UTC that everything was completely back to normal.
What This Teaches Us
This outage is a perfect example of how complex systems fail. No single decision was wrong on its own. The permission change was reasonable. The query had worked fine before. The hard limit on features made sense. But combine them in a way nobody anticipated, and the whole system falls apart.
The Bigger Picture
Cloudflare is a critical piece of internet infrastructure. When they go down, millions of websites become unreachable. This isn't the first major outage they've had, but it's the first one since 2019 that affected core traffic routing at this scale.
What's interesting is that this wasn't caused by complexity for complexity's sake. The permission change was trying to improve security. The hard limit was trying to improve performance. The frequent configuration updates were trying to keep the Bot Management system responsive to new threats.
These were all reasonable engineering decisions. The failure came from how they interacted in ways nobody predicted.
This is the reality of building and operating systems at internet scale. You can't predict every interaction. You can't test every scenario. All you can do is build systems that fail gracefully, give yourself tools to recover quickly, and learn from what breaks.
Conclusion
Three hours of downtime across a significant portion of the internet because a database permission change made a configuration file twice as big as expected. That's the kind of failure that keeps infrastructure engineers up at night.
The good news is that Cloudflare caught it, fixed it, and published a detailed postmortem explaining exactly what happened. They're not hiding behind vague "technical difficulties" language. They're being specific about what broke and why.
That transparency matters. Every engineer who reads their postmortem learns something. Every company that operates similar systems can check if they have similar vulnerabilities. The entire industry gets a little bit better at not making the same mistakes.
But it's also a reminder that no matter how good your engineering team is, no matter how much redundancy you build, complex systems will find new and creative ways to fail. The best you can do is be ready to respond when they do.
And maybe check what happens to your queries when you change database permissions.