When Cloudflare's three-hour disruption took down ChatGPT, X, and Spotify simultaneously on November 18, 2025, the company blamed an oversized configuration file. But that technical explanation obscures a more troubling reality: roughly 20% of the internet depends on one company's infrastructure staying functional, and we're discovering this dependency only when systems crash.

The official story is that a configuration file exceeded its expected size. The actual story is more specific, and more instructive.
On November 17, Cloudflare's engineers made a routine database permission change in their ClickHouse system, granting users explicit rather than implicit access to underlying metadata tables. The query that generated bot management feature files now returned data from both the default schema and the underlying shard schema simultaneously, more than doubling the rows in its output. Cloudflare's bot management module carried a hard-coded limit of 200 machine learning features. When the oversized feature file arrived, a dormant bug in the Rust code attempted to unwrap a value that wasn't there, a type of unhandled error that causes an immediate crash. The file regenerated every five minutes. Through Cloudflare's Quicksilver configuration distribution system, the oversized file, and the crashes it triggered, propagated across 330-plus data centers globally before any human operator could intervene. Cloudflare's post-mortem documents all of this in detail, including the confirmation that the failure had no connection to any cyberattack or malicious activity.
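Cloudflare's post-mortem describes the mechanism but does not reproduce the code, so the following is a hypothetical reconstruction of the failure mode rather than the actual module: a loader with a hard-coded 200-feature capacity returns nothing when the incoming file exceeds it, and a caller that assumes success unwraps that missing value and panics, taking the process down with it.

```rust
// Hypothetical reconstruction of the failure mode described in the
// post-mortem; names and structure are illustrative, not Cloudflare's code.

const MAX_FEATURES: usize = 200; // hard-coded capacity in the bot module

/// Parse a feature file into a fixed-capacity list.
/// Returns None if the file carries more features than the module supports.
fn load_features(file: &str) -> Option<Vec<String>> {
    let features: Vec<String> = file.lines().map(str::to_owned).collect();
    if features.len() > MAX_FEATURES {
        return None; // the "impossible" case the caller never handled
    }
    Some(features)
}

fn main() {
    // Before the permission change: a file well under the limit.
    let normal_file: String = (0..60).map(|i| format!("feature_{i}\n")).collect();
    let features = load_features(&normal_file).unwrap(); // fine for years
    println!("loaded {} features", features.len());

    // After the change: the query returns rows from both schemas,
    // more than doubling the file. load_features now returns None,
    // and the unwrap below panics, crashing the process.
    let oversized_file: String = (0..260).map(|i| format!("feature_{i}\n")).collect();
    let features = load_features(&oversized_file).unwrap(); // panics here
    println!("loaded {} features", features.len());
}
```

Handling the oversized case explicitly, for example by logging the error and keeping the last known-good file, would have turned a fleet-wide crash loop into a contained failure; the five-minute regeneration cycle meant the panic kept recurring until the underlying query was fixed.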
Services running on Cloudflare's newer FL2 proxy engine experienced hard crashes, returning HTTP 500 errors. Services on the legacy proxy engine didn't crash; they defaulted to zero bot threat scores, which caused bot-blocking rules to drop legitimate traffic. Either path meant failure. Independent monitoring from Cisco ThousandEyes confirmed the failures were occurring at the server level, not the network level, which meant Cloudflare's network infrastructure was physically available; the service sitting on top of it was not.
The detail that stands out in Cloudflare's own account is what happened to the dashboard. Cloudflare's bot management failure took with it Turnstile, its no-CAPTCHA bot solution. Turnstile is used in the Cloudflare dashboard login flow. At the precise moment customers most needed to log in and make emergency configuration changes, the system protecting against automated access was also preventing legitimate human access. The protective system was fighting its own operators.
Full service restoration wasn't confirmed until 17:06 UTC. Core traffic routing recovered earlier, around 14:30 UTC, but the downstream services affected by the bot management failure required additional hours to stabilize. The total disruption window, from first failure to complete resolution, was nearly six hours.
What happened across both the November and December 2025 incidents is qualitatively different from an ordinary single point of failure. A single point of failure means one component breaks and its function is lost. Here, the failure of one component disabled the tools designed to repair it, extending the damage window and reducing human operators' ability to intervene. That's an architectural property of the overall system, not a bug in any one component. This pattern of compounding infrastructure failures, where protective systems become obstacles to recovery, is something we have tracked across multiple Cloudflare incidents as the dependency risks have deepened.
The question worth asking isn't why Cloudflare's bot management system had a latent bug. Every system at sufficient scale carries latent bugs. The question is why a latent bug in one company's bot management module affected ChatGPT, X, Spotify, Canva, and thousands of other services simultaneously.
The Internet Society's Pulse tracker has been measuring CDN market concentration using the Herfindahl-Hirschman Index, the same metric the US Department of Justice and Federal Trade Commission use to assess antitrust risk in mergers. The data shows the CDN market HHI rising from 2,448 to 3,410 for the top 10,000 most-visited websites since June 2021. An HHI above 2,500 places a market in the "highly concentrated" category by standard economic definition. The CDN market has been there for years and is moving further in that direction.
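For readers unfamiliar with the metric, the HHI is simply the sum of the squared market shares of every firm in the market, expressed in percentage points, so a pure monopoly scores 10,000 and a perfectly fragmented market approaches zero. The short sketch below computes it for two invented share distributions; the numbers are illustrative placeholders, not the Internet Society's dataset.

```rust
// Herfindahl-Hirschman Index: sum of squared market shares (in percent).
// The share figures below are illustrative placeholders, not real CDN data.
fn hhi(shares_percent: &[f64]) -> f64 {
    shares_percent.iter().map(|s| s * s).sum()
}

fn main() {
    // Ten equal competitors: 10 * 10^2 = 1,000 -> unconcentrated.
    let fragmented = [10.0; 10];
    // A few large providers and a short tail -> "highly concentrated" (> 2,500).
    let concentrated = [45.0, 25.0, 15.0, 10.0, 5.0];

    println!("fragmented market HHI:   {:.0}", hhi(&fragmented));
    println!("concentrated market HHI: {:.0}", hhi(&concentrated));
}
```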
Five providers (Cloudflare, Amazon, Google, Akamai, and Fastly) host the landing pages for 60% of the most popular domains. The same Internet Society research, drawing on academic analysis of third-party dependencies, found that 89% of the top 100,000 websites critically depend on at least one external DNS, CDN, or Certificate Authority provider. The reach of the top three providers in each category extends further than their direct customer counts suggest, because indirect dependencies (a site's payment processor, analytics platform, or identity provider sharing the same CDN infrastructure) can amplify a single provider's effective reach by roughly 25 times its apparent market share.
This concentration has a structural amplifier that market share numbers don't fully capture. When a provider bundles DNS resolution, content delivery, DDoS mitigation, edge computing, and bot management into a single integrated platform, a configuration error in any one of those functions can propagate through all of them simultaneously. The Internet Society calls this a breakdown of the "separation of concerns" principle. What it means in practice: Cloudflare isn't just a CDN that failed. It's a platform where a bot management failure could simultaneously affect routing, access controls, and dashboard availability, which it did.
Cloudflare's market share figure is visible and frequently cited. The more alarming number is the HHI trajectory. Concentration at 3,410 with an upward trend represents a market that is becoming more concentrated, not stabilizing. Every rational economic decision a company makes (choosing Cloudflare for cost efficiency, its reliability track record, and the convenience of consolidating DNS, CDN, and security under one provider) individually adds to a collective exposure that no individual company is weighing when it makes that choice. The Atlantic Council's analysis of cloud infrastructure risk names this "compounded dependence": the aggregate of individually rational choices producing systemic fragility that no single actor intended and no single actor can individually solve.
Three weeks after the November 18 incident, a second Cloudflare failure affected 28% of applications globally for approximately 25 minutes. The technical specifics differed. The institutional cause did not.
In its Code Orange: Fail Small post, Cloudflare identified what both incidents had in common: configuration changes deployed instantaneously to data centers across hundreds of cities, without the staged rollout protections applied to formal software releases. Software updates at Cloudflare go through gated deployments, canary testing, and automated rollbacks. Configuration changes had been treated as categorically lighter than code: cheaper to push, faster to apply, safe enough to send everywhere at once. The Quicksilver system makes this physically possible: a configuration update can reach every Cloudflare data center globally within seconds. When that configuration is correct, this speed is valuable. When it isn't, it means a problem propagates at machine speed, well ahead of any human response.
The Code Orange designation is significant on its own terms. Cloudflare had previously declared a Code Orange only once in its history. The declaration halts all other engineering priorities and redirects the company's engineering capacity to a single program. Their remediation plan centers on Health Mediated Deployment: a new system that applies the same staged, monitored rollout logic to configuration changes that software releases already receive.
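Cloudflare hasn't published the internals of Health Mediated Deployment, but the principle it describes, widening a rollout only after the previous stage looks healthy, can be sketched in a few lines. Everything in the following example (the stage sizes, the health signal, the rollback behavior) is an assumption for illustration, not Cloudflare's implementation.

```rust
// Minimal sketch of a health-gated staged rollout for configuration changes.
// Stage sizes, thresholds, and the health check are illustrative assumptions.

struct DataCenter {
    name: String,
    config_version: u32,
}

/// Stand-in for real telemetry: does this data center look healthy
/// after receiving the new configuration?
fn healthy_after_update(dc: &DataCenter) -> bool {
    // In a real system: error rates, crash loops, latency percentiles, etc.
    !dc.name.starts_with("bad") // placeholder signal for the sketch
}

fn staged_rollout(fleet: &mut [DataCenter], new_version: u32, stage_sizes: &[usize]) -> bool {
    let mut applied = 0;
    for &stage in stage_sizes {
        let end = (applied + stage).min(fleet.len());
        for dc in &mut fleet[applied..end] {
            dc.config_version = new_version;
        }
        // Gate: only widen the rollout if every data center in this stage is healthy.
        if !fleet[applied..end].iter().all(healthy_after_update) {
            // Roll back this stage (assumes the previous version is new_version - 1)
            // and stop, instead of pushing the change globally.
            for dc in &mut fleet[applied..end] {
                dc.config_version = new_version - 1;
            }
            return false;
        }
        applied = end;
    }
    true
}

fn main() {
    let mut fleet: Vec<DataCenter> = (0..12)
        .map(|i| DataCenter { name: format!("dc{i}"), config_version: 1 })
        .collect();
    fleet[5].name = "bad-dc5".into(); // this one will reject the new config

    // Canary first, then progressively larger stages.
    let ok = staged_rollout(&mut fleet, 2, &[1, 3, 8]);
    println!("rollout completed: {ok}");
    let updated = fleet.iter().filter(|dc| dc.config_version == 2).count();
    println!("data centers on new config: {updated} of {}", fleet.len());
}
```

The contrast with an instantaneous global push is the point: the bad data center stops the rollout at its stage instead of taking the whole fleet with it.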
Less than a month before the November Cloudflare incident, AWS experienced an outage in its US-EAST-1 region that lasted approximately 15 hours. The root cause was a DNS race condition in DynamoDB's internal management system: two automated processes running concurrently wrote conflicting versions of endpoint DNS records, causing DynamoDB service endpoint records to effectively disappear. Services that depended on DynamoDB couldn't locate it. ThousandEyes documented more than 140 AWS services affected across more than 60 countries; separate tracking by Ookla recorded over 17 million user reports globally.
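AWS's own write-up has far more detail, but the core of a last-writer-wins race on a shared record is easy to show in miniature. The sketch below is a deliberately contrived illustration, not the DynamoDB DNS management system: two automated writers touch the same endpoint record with no version check, and the "cleanup" writer lands last, leaving the record empty.

```rust
// Simplified illustration of a last-writer-wins race on a shared DNS record.
// Not AWS's actual DNS management system; names and timing are contrived.

use std::collections::HashMap;
use std::sync::{Arc, Mutex};
use std::thread;
use std::time::Duration;

fn main() {
    // Shared record store: endpoint name -> list of IPs (empty = unresolvable).
    let records: Arc<Mutex<HashMap<String, Vec<String>>>> =
        Arc::new(Mutex::new(HashMap::new()));
    let endpoint = "dynamodb.example.internal".to_string();

    // Writer A: applies a fresh DNS plan with current endpoint IPs.
    let (recs_a, ep_a) = (Arc::clone(&records), endpoint.clone());
    let planner = thread::spawn(move || {
        thread::sleep(Duration::from_millis(50));
        recs_a.lock().unwrap().insert(ep_a, vec!["10.0.0.1".into(), "10.0.0.2".into()]);
    });

    // Writer B: a cleanup process removing what it believes is a stale plan.
    // With no version check, it can also wipe the plan A just applied,
    // leaving the endpoint with no records at all.
    let (recs_b, ep_b) = (Arc::clone(&records), endpoint.clone());
    let cleaner = thread::spawn(move || {
        thread::sleep(Duration::from_millis(60));
        recs_b.lock().unwrap().insert(ep_b, vec![]); // "cleans up" the active plan
    });

    planner.join().unwrap();
    cleaner.join().unwrap();

    // Depending on ordering, the endpoint ends up empty: services that
    // depend on it can no longer locate it, even though it is running.
    let final_state = records.lock().unwrap();
    println!("{endpoint} -> {:?}", final_state.get(&endpoint));
}
```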
The structural parallel to the Cloudflare incidents is the automation gap. The DynamoDB DNS race condition was not a software release; it was an automated management process operating without the safeguards applied to deliberate human-initiated changes. IAM authentication, which engineers needed in order to reach the management consoles used to diagnose and remediate the outage, was itself affected by the DynamoDB failure, recreating the same circular dependency problem that Cloudflare's Turnstile failure produced. Engineers couldn't access the tools they needed because the tools depended on the infrastructure that had failed.
Three separate incidents at two major providers. Three separate root causes. The same underlying assumption in each case: automated processes and configuration changes exist in a safety tier below formal code releases and therefore don't require the same staged validation. At the scale these providers operate, that assumption doesn't hold. A configuration update affecting one service can propagate across hundreds of locations faster than any human team can assess whether it's safe, creating an exposure window between deployment and detection that scales directly with the size of the infrastructure.
Cloudflare's post-mortem explicitly states the November 18 incident had no connection to a cyberattack or malicious activity. The same was true of the December 5 Cloudflare outage. The same was true of the AWS October incident. The same was true of the CrowdStrike failure in July 2024.
CrowdStrike provides the clearest financial measurement of what non-adversarial infrastructure failure actually costs. Parametrix Insurance estimated $5.4 billion in direct losses to Fortune 500 companies from the July 2024 outage, with healthcare absorbing roughly $1.94 billion and banking $1.15 billion. Fewer than 8.5 million devices were affected, less than 1% of active Windows machines globally. The losses reached $5.4 billion not because of the device count but because those devices were concentrated in the critical operational infrastructure of large enterprises. The insured portion of those losses: between $540 million and $1.08 billion, or 10-20% of total damages.
The "not an attack" framing matters because it defines the boundary of what security investment can protect against. Organizations can add firewall layers, improve access controls, harden their configurations, and achieve genuine security improvements none of which would have prevented the Cloudflare outage, the AWS outage, or the CrowdStrike failure. The November 18 incident began with a database permission change that was technically correct in its intent and incorrect in its downstream effects. There was no vulnerability to patch, no attack surface to reduce. The exposure was operational.
The insurance data suggests a market that has not yet priced systemic concentration risk. When $5.4 billion in losses generates between $540 million and $1.08 billion in insured claims, the financial signal to cloud and CDN providers is muted. Cyber insurance premiums reflect the risk of adversarial events far better than they reflect the risk of operational failures at concentrated providers. Without actuarial pressure driving providers toward resilience investment, the financial incentives all point toward continued concentration rather than away from it.
The policy environment for internet infrastructure risk has changed more in the past eighteen months than in the prior decade, but the changes are uneven.
The EU's Digital Operational Resilience Act (DORA) took full effect in January 2025. It creates specific requirements for financial sector organizations around their cloud provider dependencies: concentration risk assessments, exit strategy planning, and ICT provider oversight. The Network and Information Security Directive 2 (NIS2) explicitly covers digital infrastructure providers, including DNS operators and data center operators, as entities within scope, a meaningful expansion from the original NIS framework. The effect is that European financial firms now face genuine regulatory scrutiny of the concentration risks their cloud choices create.
In the United States, the picture is less clear. The Cyber Incident Reporting for Critical Infrastructure Act (CIRCIA) final rule, which would create mandatory reporting requirements covering cloud infrastructure incidents, has been delayed to May 2026. CISA's December 2025 update to its Cybersecurity Performance Goals explicitly calls out risks from third-party providers with deep system access, but CPG guidance is voluntary. The Atlantic Council has documented Congressional proposals to designate major cloud providers as systemically important financial market utilities, the same regulatory category applied to financial clearinghouses, but these proposals have not advanced.
The gap that stands out is not the absence of regulatory attention but the mismatch between the regulatory tool and the actual risk. DORA addresses how organizations manage their cloud dependencies. It doesn't address the concentration itself. An organization can achieve full DORA compliance while continuing to route all traffic through a single CDN provider, as long as it has documented the risk and planned an exit strategy. NIS2 creates obligations for digital infrastructure providers to maintain resilience, but its mechanism for addressing market-level concentration is limited. Whether DORA and CIRCIA will meaningfully slow the trajectory of CDN concentration, or simply require more documentation alongside it, is genuinely uncertain at this stage.
What has changed is the acknowledgment. For most of the past decade, the regulatory conversation treated cloud infrastructure as a procurement concern rather than a systemic risk concern. The current language in DORA, NIS2, and even CISA's guidance frames it explicitly as systemic. That's a different starting premise, even if the regulatory tools available to act on it are still catching up.
The honest framing is that no individual organization can solve concentration risk through its own procurement decisions. Cloudflare serves 20% of web traffic precisely because the economics of scale and integration create genuine advantages that competitors can't easily replicate at equivalent cost. Telling organizations to "diversify their CDN" is correct advice that most organizations can't fully act on without accepting meaningful performance and cost trade-offs.
That said, the November 18 incident sorted organizations into two groups: those who had tested what to do when Cloudflare was unavailable, and those who discovered their options in real time during the outage. Independent monitoring confirmed that some organizations executed DNS failover to origin infrastructure within the first thirty minutes. Others waited three to six hours. The difference wasn't security investment or architectural sophistication; it was whether someone had previously run through the failure scenario and knew which steps to take in what order.
The most valuable preparation most organizations haven't done is mapping what actually depends on what: not their own services, but the services their services depend on. Payment processors, identity providers, monitoring tools, collaboration platforms, and CRM systems all carry their own CDN and cloud dependencies. Organizations that believed they had diversified across multiple vendors discovered during the November outage that their vendors' vendors all shared the same underlying Cloudflare infrastructure. The failure cascaded through the entire dependency tree, not just the first layer.
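A minimal sketch of what that mapping exercise produces, assuming a small and entirely invented vendor inventory: walk each vendor's declared sub-dependencies transitively and group every service by the infrastructure provider it ultimately sits behind. The vendor names and provider assignments below are made up for illustration.

```rust
// Minimal sketch of a third-party dependency rollup: given a (made-up)
// inventory of vendors, their sub-dependencies, and which infrastructure
// provider each ultimately runs behind, compute the effective exposure.

use std::collections::{BTreeMap, HashSet};

fn main() {
    // vendor -> (infrastructure provider, direct sub-dependencies)
    // All names and assignments here are invented for illustration.
    let inventory: BTreeMap<&str, (&str, Vec<&str>)> = BTreeMap::from([
        ("payments", ("Cloudflare", vec!["fraud-scoring"])),
        ("fraud-scoring", ("Cloudflare", vec![])),
        ("identity", ("AWS", vec!["mfa-service"])),
        ("mfa-service", ("Cloudflare", vec![])),
        ("analytics", ("Fastly", vec![])),
        ("status-page", ("Cloudflare", vec![])),
    ]);

    // Walk every vendor's full dependency tree, not just the first layer.
    let mut exposure: BTreeMap<&str, HashSet<&str>> = BTreeMap::new();
    for root in inventory.keys() {
        let mut stack = vec![*root];
        let mut seen = HashSet::new();
        while let Some(vendor) = stack.pop() {
            if !seen.insert(vendor) {
                continue;
            }
            if let Some((provider, deps)) = inventory.get(vendor) {
                exposure.entry(*provider).or_default().insert(*root);
                stack.extend(deps.iter().copied());
            }
        }
    }

    // "Diversified across vendors" can still mean concentrated at one provider.
    for (provider, services) in &exposure {
        println!("{provider}: {} of {} services exposed", services.len(), inventory.len());
    }
}
```

In this invented inventory, the "identity" service looks like an AWS dependency at the first layer but inherits Cloudflare exposure through its MFA sub-dependency, which is exactly the pattern the November outage surfaced.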
Cloudflare's own Code Orange program identified the same dependency problem in its own operations: break-glass emergency procedures that relied on Cloudflare infrastructure were unavailable during the outage itself. Any emergency procedure that depends on the failed infrastructure to execute isn't an emergency procedure; it's a hypothetical. Viable recovery paths have to be reachable through out-of-band channels that don't share infrastructure with the primary failure.
The organizations that recovered fastest shared a specific characteristic: they had defined, in advance, what "recovery" looked like for each service tier, who owned each decision, and what the acceptable performance trade-off was when operating on origin infrastructure without CDN caching and protection. That's not a technical architecture problem. It's an operational readiness problem that doesn't require solving concentration risk to address.
Multi-CDN architecture remains the technically correct long-term answer for organizations where availability is a hard requirement. The practical constraint is that truly active-active multi-CDN configurations require significant engineering investment and introduce their own complexity. For most organizations, a more achievable target is a tested, documented failover path to a secondary provider that can handle critical traffic, with regular drills to confirm the path actually works when needed.
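As a deliberately simplified illustration of what a tested failover path has to decide, the sketch below models only the decision logic: probe the primary provider, and after a threshold of consecutive failures switch the serving target to the secondary. The provider names, the probe, and the threshold are assumptions for the example; a real failover also involves DNS TTLs, origin capacity, and the performance trade-offs described above.

```rust
// Deliberately simplified failover decision logic for a primary/secondary
// CDN setup. Provider names, thresholds, and the probe are illustrative.

#[derive(Debug, Clone, Copy, PartialEq)]
enum Target {
    PrimaryCdn,
    SecondaryCdn,
}

/// Stand-in for a real health probe (e.g., an HTTP check against the edge).
fn primary_healthy(check_number: u32) -> bool {
    check_number < 3 // in this run, the primary starts failing at check 3
}

fn main() {
    const FAILURE_THRESHOLD: u32 = 2; // consecutive failures before failing over

    let mut target = Target::PrimaryCdn;
    let mut consecutive_failures = 0;

    for check in 0..6 {
        let healthy = primary_healthy(check);
        consecutive_failures = if healthy { 0 } else { consecutive_failures + 1 };

        if target == Target::PrimaryCdn && consecutive_failures >= FAILURE_THRESHOLD {
            // In a real drill this is the step that must be rehearsed:
            // lowering DNS TTLs in advance, updating records, accepting the
            // performance trade-off of serving critical traffic elsewhere.
            target = Target::SecondaryCdn;
            println!("check {check}: failing over to secondary");
        } else {
            println!("check {check}: healthy={healthy}, serving via {target:?}");
        }
    }
}
```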
The structural risk Cloudflare's outage exposed isn't going away. The CDN market is more concentrated than it was five years ago, the regulatory tools to address concentration are still developing, and the economic incentives that drive businesses toward major integrated platforms haven't changed. What organizations control is their readiness when the next outage arrives, and given the frequency of incidents in 2025, planning on there not being a next one is no longer a reasonable assumption.
Why did so many unrelated services go down when Cloudflare failed?
Most users experience Cloudflare's role as invisible: it sits between their browser and the websites they visit, handling content delivery, bot filtering, and DDoS protection without any visible indication. When that layer fails, requests never reach the underlying servers, even though the servers themselves are fully operational. The breadth of the November outage reflects how many services had concentrated their infrastructure behind a single provider.
Does using multiple CDN providers protect against this kind of outage?
Active-active multi-CDN configurations, where traffic is distributed across two providers simultaneously, provide meaningful protection, but they require significant engineering investment and ongoing maintenance. A more practical starting point for most organizations is a documented, tested failover path to a secondary provider, combined with a third-party dependency audit to identify where hidden Cloudflare exposure exists in vendor chains.
Is Cloudflare more failure-prone than other major infrastructure providers?
The November and December 2025 incidents are notable but not unique to Cloudflare. AWS experienced a 15-hour US-EAST-1 outage in October 2025. Microsoft Azure had its own major disruptions in 2025. CrowdStrike's July 2024 failure caused $5.4 billion in Fortune 500 losses. The pattern is consistent across providers: infrastructure at sufficient scale encounters failure modes that smaller systems don't, and the concentration of services behind a few providers amplifies the blast radius of any individual failure.
What is Cloudflare doing to prevent recurrence?
Following the November and December 2025 outages, Cloudflare declared a Code Orange, only the second such declaration in the company's history. Their remediation plan centers on Health Mediated Deployment, a new system that applies the same staged, monitored rollout logic to configuration changes that software releases already receive. They have also identified circular dependencies in emergency procedures (specifically, tools used during incident response that themselves depended on Cloudflare infrastructure) as a priority to resolve.
Why doesn't the government regulate internet infrastructure the way it regulates utilities?
The EU has moved in this direction: the Digital Operational Resilience Act (DORA) took full effect in January 2025, creating specific requirements for financial sector cloud dependencies, and NIS2 now covers digital infrastructure providers directly. The US has moved more slowly: CIRCIA's mandatory reporting requirements are delayed to 2026, and broader regulatory frameworks for cloud infrastructure concentration remain in the proposal stage. The gap between internet infrastructure's systemic importance and its regulatory treatment reflects both the speed at which the concentration developed and the political difficulty of applying utility-style regulation to technology companies.