How we built Cloudflare’s data platform and an AI agent on top of it

Cloudflare processes more than a billion events every second. Our network spans 330+ cities in 120+ countries. Behind every HTTP request, every Worker invocation, every R2 read operation, there is data, and a lot of it.

For years, that data was not easy to access. It lived in dozens of production databases, ClickHouse clusters, Kafka streams, Google Cloud buckets, BigQuery datasets, and a long tail of pipelines. To answer a simple question like "How many domains that signed up today are in the Top 100 by traffic?", an analyst at Cloudflare had to know which system to ask, what credentials to use, what query language to write, and whether the data they were looking at was sampled, fresh, or seven days stale. As a result, it was difficult to draw informed insights from the data.

To solve this problem, we built two in-house tools: Town Lake, Cloudflare's unified data analytics platform, and Skipper, an AI data agent that runs on top of it. Town Lake is a single SQL interface to everything Cloudflare knows, and Skipper is how anyone at Cloudflare can ask questions in plain English and get correct, auditable answers back in seconds.

This is the story Continue reading

The AI Agent Accountability Gap: Why Network Policies, API Gateways, And RBAC Are Not Enough

In The Five Pillars of AI Agent Accountability: A Diagnostic Framework for Engineering Leaders, we walked through each pillar of AI agent accountability (traceability, authorization provenance, identity and ownership, policy at scale, and human oversight) and argued that most enterprises today sit at Level 0 or Level 1 of the Accountability Maturity Model.

The most common reaction we get when we share that framework is some version of: “We’re already covered. We have network policies. We have an API gateway. We have RBAC.”

This article is for that reaction.

Enterprises aren’t starting from zero. Most have invested in security, networking, and identity infrastructure that works well for traditional workloads. The problem isn’t a lack of tools. It’s that existing tools were designed for model outputs, not autonomous actions: for a world where services are deterministic, communication patterns are predictable, and humans make all the decisions.

Agentic AI breaks every one of those assumptions. Here’s where the most common approaches each leave a critical accountability gap.

Network policies: the wrong abstraction level

Kubernetes Network Policies are essential for securing any cluster. They restrict which pods can communicate with which other pods at the network level, and they should absolutely Continue reading

Iran’s Internet is partially restored, Cloudflare Radar data shows

On Tuesday, May 26, Iran’s vice president announced that Internet access had started to be restored in the country after being cut off almost three months ago, following the launch of U.S. and Israeli attacks on February 28.

Cloudflare Radar data confirms increased activity and indicates a partial restoration of the Internet in Iran. In this blog post, we’ll examine a range of data points that provide a lens into this prolonged shutdown – and the signs that Iran’s citizens are increasingly able to connect once again. As the situation continues to unfold, Radar will have the latest data on Iran’s connectivity.

The first shutdown

Iranian citizens have experienced two national Internet shutdowns this year. The first began on January 8 around 16:30 UTC (20:00 local time), and we explored the impact seen over the first few days in a blog post. Traffic from Iran remained near zero until January 21, when a small amount of traffic returned, only to disappear a little over 24 hours later. A similar brief restoration also occurred on January 25, before traffic recovered more fully beginning on January 27.

The second shutdown

In late February, as military strikes on Iran escalated, a second Continue reading

Chesterton’s Fence

Imagine yourself walking down a country lane, lush green grass around you, no farm animals anywhere, when suddenly you see a fence right in the middle of the path. You think, now, that’s a bit silly, that fence is blocking the path, somebody should have this fence removed. And by thinking that you’d fall right into the predicament known as Chesterton’s Fence. That is, you see something that you instinctively feel does not belong and you want to remove it. And perhaps that is exactly what needs to be done, but not before you ask a very important question: “Why?” Why is the fence here? What function does it serve? Who put it there? What were they trying to achieve?

In any complex system, and most of the systems we work with these days are complex, problems often arise as a result of relationships and interactions between components. Our systems contain many components, some with special optimizations, some acting as local stabilizers, that might appear inefficient and unintuitive. Other components, or parts of the system, seem to serve no apparent purpose at all.

Any given component is usually self-contained and can be understood, reasoned about, modified and improved by one Continue reading

DNS Centrality

As a collection of intertwined markets, aspects of the Internet have been prone to excessive market distortions, where one provider, or a small clique of providers, in a market sector becomes so dominant that there is no effective competition and no possibility of admitting additional market entrants. This form of market dominance is often termed "centrality". How centralised is the DNS?

The Case for VM and Container Consolidation in 2026

Two platforms, two teams, two procurement relationships, all doing one job. There’s a reason it ended up this way. There isn’t a reason it has to stay this way.

Ask anyone at a typical enterprise why the VM platform and the container platform are separate, and they’ll give you a sensible answer. The VM estate has been there for fifteen years. It runs the workloads the business depends on. Kubernetes got stood up later, when application teams started building microservices, and giving them their own environment made more sense than retrofitting one onto VMware. Two platforms, two teams, two roadmaps.

That’s how most enterprises got here.

The reasoning was sound at the time. The question is whether it still is.

This is the consolidation question most enterprises haven’t actually revisited, and it’s the one quietly absorbing more of your budget each year.

Figure 1. The current state most enterprises operate today.

Why VM and container platforms ended up separate

If you operate both platforms, you know the shape of this already. There’s a VMware team: vSphere admins, network engineers who know NSX, storage specialists, plus a separate procurement relationship for the underlying virtualisation stack. Then there’s a Kubernetes team: platform Continue reading

PP111: New HPE Mist Features Validate NAC Changes, Enable Inline Microsegmentation (Sponsored)

HPE has announced new features in its Juniper Mist portfolio. On today’s sponsored Packet Protector, we dig into those features, including a dry run option that lets organizations test and refine Network Access Control (NAC) policies before pushing them out, a policy validation feature that can identify shadow NAC rules, and a microsegmentation capability aimed... Read more »

NB576: IBM Gets Big Bucks to Build Quantum Chip Fab; AT&T Sues to Hang Up on Copper Phone Lines

Take a Network Break! We sound the alarm about a critical vulnerability in an on-prem Azure stack. On the news side, AI NetOps startup Selector adds public cloud observability to its portfolio, Versa Networks adds zero trust capabilities to its AI assistant, and IBM gets a billion-dollar investment to build a foundry to fabricate quantum... Read more »

Worth Reading: Your Code Is Worthless

Did you manage not to stumble on a dramatic post explaining how someone generated 10,000 lines of code with AI while wasting time on your LinkedIn feed? Congratulations, you’re lucky.

However, as Nathaniel Fishel explained in his Your Code Is Worthless article, “lines of code” is a useless vanity metric that sounds great in LinkedIn self-promotion, but doesn’t matter when one has to maintain the product one has shipped to customers. Add natural laziness, and you have a perfect storm. As he wrote:

RustRadio UI improved

This is just a short followup to the last RustRadio post. If you came for more rants about C, you’ll be disappointed.

I’ve never been that interested in writing UI code, including HTML. You can see the “programmer art” in the screenshots linked from www.habets.pp.se.

And then the slightly different tech section, that doesn’t serve much of a purpose now that we have GitHub.

I’ve not been happier with GTK, Qt, and the others either.

But RustRadio needs a UI.

I feel like the browser is the most stable and portable UI. So I’d already decided on that. So now I have to manually do a bunch of DOM manipulation, to create an interactive UI? Or worse, learn the React/Angular/Whatever flavor of the day, that will be obsolete by next afternoon? Gag me with a spoon.

LLM to the rescue

For now I’m just continuing to focus on the SDR and architectural parts of RustRadio, and I’m letting the LLM-written code do the HTML manipulation.

Yeah, it’s kinda vibe coding. But it doesn’t use unsafe, and it demonstrably outputs what I want (I mean, sure, it may require some follow-up prompts), so who cares?

The Continue reading

Kubernetes Operational Maturity: Secure and Resilient Cluster Federation with Cluster Mesh

Practically no one runs a single Kubernetes cluster in production these days. Maybe that’s how it started, but data sovereignty requirements, acquisitions, AI initiatives and the need for edge servers, among other considerations, have pulled most enterprises into multi-cluster territory whether they planned for it or not. Reaching Kubernetes operational maturity, the point at which a fleet of clusters operates as one secure, observable, policy-consistent system, depends entirely on how those clusters are connected. Operating in a multi-cluster environment has become the unspoken standard, one that requires a careful re-evaluation of the network architectures used to link clusters together.

That re-evaluation rarely happens. Most enterprises connect their clusters with the same networking patterns they were using before Kubernetes existed: load balancers fronting internal services, DNS records published to external zones, and IP-based firewall rules. Those patterns were built for north-south traffic moving in and out of a traditional data center perimeter, not for east-west traffic moving between internal workloads.

Running east-west traffic on north-south plumbing

The conventional way to make services in one cluster reachable from another is to expose them externally: a load balancer in front, a DNS name registered in a public zone, and a firewall rule allowing traffic in. Continue reading

SONiC Part III: SONiC Introduction

SONiC is a vendor-neutral, Linux-based network operating system (NOS) that uses a database-driven architecture. Its software components run in multiple containers and exchange information through Redis. In SONiC, several named databases are defined for different functions, and these databases are mapped to Redis logical database IDs. Through this design, configuration data, application state, operational state, and ASIC-related state move between software layers by means of specialized processes.

Different hardware vendors may add their own platform integrations, transceiver support, monitoring utilities, or management workflows. However, the core SONiC architecture remains the same. This is one of the main reasons why SONiC knowledge, troubleshooting methods, and automation practices are transferable across different hardware platforms.

Vendor neutrality does not mean that every SONiC-based implementation behaves exactly the same in every operational detail. It means that different implementations follow the same architectural model. To organize information clearly, SONiC defines several named databases, each of which is mapped to a Redis logical database ID:

  • CONFIG_DB (Redis DB 4): Stores the user’s intended configuration.
  • APPL_DB (Redis DB 0): Stores application-level objects that are ready for processing by lower software layers.
  • STATE_DB (Redis DB 6): Stores operational state information about system Continue reading
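
The name-to-ID mapping listed above can be written down as a small lookup table. The following is a minimal Python sketch, not part of SONiC itself: the helper names `redis_db_id` and `redis_cli_command` are hypothetical, only the three databases named in the text are included, and the command is built but never executed. It assumes the standard `redis-cli -n <id>` option for selecting a logical database.

```python
# Named SONiC databases and their Redis logical database IDs.
# Only the three databases listed in the text are included here.
SONIC_DATABASES = {
    "CONFIG_DB": 4,  # the user's intended configuration
    "APPL_DB": 0,    # application-level objects for lower software layers
    "STATE_DB": 6,   # operational state information
}


def redis_db_id(name: str) -> int:
    """Return the Redis logical database ID for a named SONiC database."""
    try:
        return SONIC_DATABASES[name]
    except KeyError:
        raise ValueError(f"unknown SONiC database: {name!r}") from None


def redis_cli_command(name: str, *args: str) -> list[str]:
    """Build (but do not run) a redis-cli argument list for a named database.

    Hypothetical helper: it only prepends `-n <id>` so that the Redis
    command in `args` targets the right logical database.
    """
    return ["redis-cli", "-n", str(redis_db_id(name)), *args]


if __name__ == "__main__":
    # Illustrative only: list every key in CONFIG_DB.
    print(redis_cli_command("CONFIG_DB", "KEYS", "*"))
```

Keeping the mapping in one table means a troubleshooting script can refer to databases by name and stay readable across platforms that follow the same architectural model.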

Scaling Akvorado BMP RIB with sharding

To associate routing information, like AS paths or BGP communities, with flows, Akvorado can import routes through the BGP Monitoring Protocol (BMP). As the Internet routing table contains more than 1 million routes, Akvorado needs to scale to tens of millions of routes. This has been a long-standing challenge, but I expect this issue is now fixed by using RIB sharding, a method that splits the routing database into several parts to enable concurrent updates.
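
The general idea of sharding a route store can be sketched in a few lines. The Python below is an illustration only and is not Akvorado's implementation: the `ShardedRIB`, `Shard`, and `Route` names are hypothetical, a plain dictionary stands in for the prefix tree (so lookups are exact-match, not longest-prefix), shards are chosen by hashing the prefix, and a simple mutex replaces a read/write lock because Python's standard library has none.

```python
import ipaddress
import threading
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Route:
    peer: int
    next_hop: str
    as_path: tuple


@dataclass
class Shard:
    # Each shard has its own lock, so updates to different shards
    # do not wait on each other.
    lock: threading.Lock = field(default_factory=threading.Lock)
    routes: dict = field(default_factory=dict)  # prefix -> {peer: Route}


class ShardedRIB:
    def __init__(self, shard_count: int = 8):
        self.shards = [Shard() for _ in range(shard_count)]

    def _shard_for(self, net) -> Shard:
        # Assumption: shards are picked by hashing the prefix.
        return self.shards[hash(net) % len(self.shards)]

    def add_route(self, prefix: str, route: Route) -> None:
        net = ipaddress.ip_network(prefix)
        shard = self._shard_for(net)
        with shard.lock:
            # One route per peer for a given prefix; a new one replaces it.
            shard.routes.setdefault(net, {})[route.peer] = route

    def routes_for(self, prefix: str) -> list:
        net = ipaddress.ip_network(prefix)
        shard = self._shard_for(net)
        with shard.lock:
            return list(shard.routes.get(net, {}).values())


if __name__ == "__main__":
    rib = ShardedRIB(shard_count=4)
    rib.add_route("2001:db8:1::/48", Route(3, "2001:db8::3:1", (65402,)))
    rib.add_route("2001:db8:1::/48", Route(4, "2001:db8::4:1", (65402,)))
    print(len(rib.routes_for("2001:db8:1::/48")))
```

With a single lock, every writer blocks every other writer; with one lock per shard, only updates that land in the same shard contend with each other.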

Previous implementation

Akvorado connects two elements to build its RIB:

  1. a prefix tree, and
  2. a list of routes attached to each prefix.

Figure: Akvorado BMP RIB implementation without sharding, showing the memory layout of each structure and one single read/write lock.

In the diagram above, the RIB stores five IPv4 prefixes and two IPv6 prefixes. One of them, 2001:db8:1::/48, contains three routes:

  • from peer 3, next hop 2001:db8::3:1, AS 65402, AS path 65402, community 65402:31,
  • from peer 4, next hop 2001:db8::4:1, same ASN, AS path, and community,
  • from peer 5, next hop 2001:db8::5:1, AS 65402, AS path 65401 65402 Continue reading

The Five Pillars of AI Agent Accountability: A Diagnostic Framework for Engineering Leaders

You’re in a board meeting. The CISO is presenting on AI risk. The CFO asks a simple question:

“When that finance agent we deployed last quarter accessed a customer payment record, can we tell who authorized it, what policy permitted it, and produce the full audit trail?”

The CISO looks at the head of the platform. The head of the platform looks at security. Nobody answers.

If you can picture that meeting happening at your company, you’re not alone. McKinsey found that only one-third of organizations have AI agent governance maturity at level 3 or higher. The other two-thirds are exactly the silence in that boardroom.

This post is the diagnostic framework that closes that gap. It’s part 2 of a five-part series on AI agent accountability, and if you only have time to read one post in the series, read this one. By the end you’ll have a five-question assessment to run with your team this week, and a maturity model to score where you stand today.

Not all governance equals AI agent accountability. Many enterprises believe they’re covered because they have network policies or an API gateway, but governance without accountability is security theater: it Continue reading

HN828: How Selector Unifies Cloud and On-Prem Network Observability (Sponsored)

Selector is extending its AI-driven network observability capabilities into public clouds. On today’s sponsored episode, we dig into how Selector gathers and analyzes public cloud network telemetry, how it integrates cloud and on-prem network data to provide end-to-end visibility, how it integrates with third-party Application Performance Monitoring (APM) systems to correlate network and application performance,... Read more »

Hedge 306: RPKI Transport

Synchronizing information across the Internet, at an initial glance, looks like a fairly simple problem to solve. Just copy a file to a host and create a magic protocol, right? Not really. Each kind of data has a fairly unique set of requirements–and RPKI data, used to provide security information for BGP, is no different. Job Snijders joins Tom and Russ to talk about ERIK, a protocol developed to synchronize RPKI records.
 
For more information, check out Job’s web site and the IETF draft.
 

 