Lab: EVPN Asymmetric IRB with Anycast Gateways

I postponed the discussion of ARP issues with EVPN anycast gateways to keep yesterday’s blog post reasonably short. If you’re impatient and want to try that out, I have just the right lab exercise for you; you’ll have to extend VLANs into end-to-end MAC-VRF instances and add IRB and anycast gateways:

You can run the lab on your own netlab-enabled infrastructure (more details), but also within a free GitHub Codespace or even on your Apple-silicon Mac (installation, using Arista cEOS container, using VXLAN/EVPN labs).

Cisco AUTOCOR Passed

Yesterday I took and passed the Cisco AUTOCOR (previously DEVCOR) exam, which is the core exam for CCNP Automation. That means I still need to pass a specialist exam to become CCNP Automation certified. It also means I’m qualified to sit the CCIE Automation lab.

What did I think of the exam?

As with any exam, there is good and bad. I’ll start with the good.

The exam aligned well with the blueprint. I didn’t feel there were any real surprises or questions on items that weren’t part of the blueprint.

There wasn’t a lot of trivia. No memorization of specific API endpoints or anything like that.

The exam experience was fine. I took it in a testing facility, which I prefer, and I had no issues. I was provided with earplugs, which helped me stay focused, although this is a small facility and there was only one other candidate.

I liked the different types of questions. You have your standard multiple-choice single-answer and multiple-choice multiple-answer questions, but also fill-in-the-blanks and lablets. It’s nice that there is quite a bit of code in the exam; it is an automation exam after all. I also think it’s Continue reading

KubeVirt Live Migration Done Right: What it Takes to Run VMs on Kubernetes

Running VMs in Kubernetes sounds like a crazy workaround for avoiding vendor lock-in, and standardizing legacy applications and newer containerized workloads on one control plane with one set of security policies to govern them all. It is, however, a rapidly growing pattern, and KubeVirt live migration — moving running VMs between nodes without downtime — is increasingly central to platform engineering use cases that require full VMs, like on-demand CI/CD pipelines.

KubeVirt is gaining traction as a way to bring VMs into Kubernetes as first-class workloads, managed with the same tools and primitives that platform teams already use for containers. It has, however, introduced some unique challenges.

Here’s the uncomfortable truth about that migration: compute and storage are the easy parts. Networking is where migrations stall, roadblocks multiply, and platform teams start questioning whether KubeVirt was the right call in the first place.

If your VMs have no fixed IP dependencies, no VLAN memberships, and no upstream firewall rules scoped to specific subnets, you can migrate them into Kubernetes without losing sleep over the networking layer. If you’re running hundreds or thousands of VMs with IP addresses hardcoded into application configs, DNS entries, and firewall ACLs — and you need Continue reading

The AI Agent Accountability Crisis: Why Governance Isn’t Keeping Up With Deployment

Every enterprise is building AI agents. Marketing has one summarizing campaign performance. Engineering has one triaging incidents. Customer support has one resolving tickets. Finance has one processing invoices. Each was built by a different team, using a different framework, with different assumptions about security.

Now those agents are talking to each other through agent-to-agent (A2A) communication. The incident-triage agent calls the customer-support agent to check affected accounts. The invoice agent calls an external payment API. The marketing agent queries a data warehouse with customer records.

When something goes wrong (and at this scale of deployment, it will), can you answer:

  • Who authorized the action?
  • What policy permitted it?
  • What was the full chain of events?

If you can’t, you have an accountability gap.

This is part one of a five-part series on AI agent accountability for engineering and security leaders. We’ll work through the gap between agent deployment and governance, the diagnostic framework that exposes it, why your existing tools won’t close it, and the principles you’ll need to evaluate any solution that claims it can.

What is AI agent accountability?

AI agent accountability is the ability to trace, prove, and audit every action an AI agent takes. This includes Continue reading

Our billing pipeline was suddenly slow. The culprit was a hidden bottleneck in ClickHouse

At Cloudflare, we are heavy users of ClickHouse, an open-source analytical database management system. We redesigned one of our largest ClickHouse tables to add a column to the partitioning key. The change enabled per-tenant retention on a table that serves hundreds of internal teams. The design went through several rounds of revision and review with engineers across multiple teams before we landed on the final approach. But a few weeks after rollout, the jobs that produce most of Cloudflare's bills were running up against their hard daily deadline.

All the usual suspects looked clean: I/O, memory, rows scanned, parts read. Everything we would normally check when a ClickHouse query is slow appeared to be normal. The problem turned out to be lock contention in query planning, something we'd never had reason to look for before.

This is the story of how this migration exposed a hidden bottleneck in ClickHouse's internals, and the patches we wrote to fix it.

The setup: a petabyte-scale analytics platform

We use ClickHouse to store over a hundred petabytes of data across a few dozen clusters. To simplify onboarding for our many internal teams, we built a system called "Ready-Analytics" in early 2022.

The premise is Continue reading

ARP with EVPN Asymmetric IRB

In a previous blog post, I described the ARP issues you’ll encounter when using centralized routing (on a spine switch) between two EVPN MAC-VRF instances (a fancy name for a VLAN encapsulated in VXLAN or MPLS).

That blog post established a baseline that will help us unravel the ARP behavior in a more realistic scenario: asymmetric Integrated Routing and Bridging (IRB). That’s a mouthful, but it’s really quite a simple concept; the following diagram explains the asymmetric forwarding behavior:

Packet forwarding in an EVPN asymmetric IRB design
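
As a rough illustration of the asymmetric forwarding behavior, here is a minimal Python sketch. It assumes two hypothetical VLANs (red and blue) with made-up VNIs, and that every leaf has both VLANs and the anycast gateway configured. The ingress leaf routes the packet into the destination VLAN and sends it across the fabric in that VLAN's VNI; the egress leaf only bridges. All names and numbers are invented for the example.

```python
from dataclasses import dataclass

# Hypothetical VLAN-to-VNI mapping; the names and numbers are made up.
VNI = {"red": 10100, "blue": 10200}


@dataclass(frozen=True)
class Host:
    name: str
    vlan: str  # VLAN (MAC-VRF) the host is attached to
    leaf: str  # leaf switch the host is connected to


def forward(src: Host, dst: Host) -> list:
    """Return the forwarding steps for a packet sent from src to dst."""
    steps = []
    if src.vlan != dst.vlan:
        # The ingress leaf routes between VLANs (it owns the anycast gateway).
        steps.append(f"{src.leaf}: route {src.vlan} -> {dst.vlan}")
    if src.leaf != dst.leaf:
        # The packet crosses the fabric in the DESTINATION VLAN's VNI.
        steps.append(f"{src.leaf}: VXLAN encapsulate in VNI {VNI[dst.vlan]}")
    # The last hop is always plain bridging in the destination VLAN.
    steps.append(f"{dst.leaf}: bridge in {dst.vlan}")
    return steps


h1 = Host("h1", vlan="red", leaf="leaf1")
h2 = Host("h2", vlan="blue", leaf="leaf2")
for step in forward(h1, h2) + forward(h2, h1):
    print(step)
```

The two directions use different VNIs (the blue VNI from h1 to h2, the red VNI on the way back), which is where the "asymmetric" in the name comes from.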

What’s New in Calico v3.32

We’re excited to announce the release of Calico Open Source v3.32! 🎉

This release corresponds with Kubernetes v1.36 (codename Haru), and it goes beyond sharing a cat as the release mascot: it extends Kubernetes capabilities and features to keep you up to date with the latest cloud innovations.

This release brings some of the most significant architectural changes in Calico, from live-migrating KubeVirt VMs to an eBPF-based Maglev load balancer.
Here’s a quick look at everything that’s new:

🚨 Breaking Changes & Deprecations

  • ClusterNetworkPolicy (Alpha2) replaces Admin and Baseline Admin Network Policies: AdminNetworkPolicy and BaselineAdminNetworkPolicy have been removed. You must migrate to ClusterNetworkPolicy before upgrading to v3.32, as Calico will no longer enforce the old resources.
  • calico-apiserver Deprecated: The aggregated API server is deprecated and will be removed in a future release. It is being replaced by Native v3 CRDs. (Requires MutatingAdmissionPolicy feature gate, Kubernetes 1.30+).

🚀 Key Feature Updates

1. KubeVirt VM Live Migration Support

  • What it does: Allows live-migrating KubeVirt VMs between nodes without dropping TCP connections.
  • How it works: Achieves IP persistence by binding the IP to the VM name rather than the ephemeral pod.
  • Activation: Set kubeVirtVMAddressPersistence: Enabled Continue reading

OpenClaw Ruined AI and It Makes Me Happy

The biggest AI story of 2026 isn’t the growing need for electrical power or the ridiculous way the market sold out for RAM based on a letter of intent to acquire. No, the biggest AI story of the year so far is how a scrappy little project completely upset the AI apple cart. OpenClaw (nee ClaudeBot, nee OpenMolt) set the world on fire. And it destroyed how people were trying to direct AI. I’m sitting over here giggling about it.

Round The Clock

The basics of OpenClaw are simple enough. You have a system of agents that do things. It can read your texts or email and triage the flow of information. It can send you a text summary of the news or the weather every morning. But it can also be configured to monitor things as they arrive to deal with them on the fly. That’s where the real narrative shift has happened.

When you open a browser window to talk to an LLM you are creating a session that has a finite time limit. You are saying that you are going to work on a project for a specific period of time and that’s that. Once you complete Continue reading

Browser Run: now running on Cloudflare Containers, it’s faster and more scalable

We’ve enabled higher usage limits, faster performance, and better reliability for Browser Run by rebuilding on top of Cloudflare’s Containers.

You can now spin up 60 browsers per minute via the Workers binding and run up to 120 concurrently — 4x the previous limit. Also, Quick Action response times dropped more than 50%. You don't need to change anything: these improvements are live today. On top of that, we’re shipping fixes and new features faster than before. Read on to learn how we did it and see the data.

Remind me: what is Browser Run?

Browser Run enables developers to programmatically control and interact with headless browser instances running on Cloudflare’s global network. That’s useful for end-to-end testing of web applications, securely investigating suspicious URLs, and rendering PDF documents, amongst other quick actions like capturing screenshots and extracting content. More recently, it’s become a critical enabler for AI agents that interact with the web. We’re building Browser Run to be the go-to platform for using automated browsers responsibly and securely at massive scale.

Outgrowing our bunk bed

Before adopting Cloudflare Containers, we shared infrastructure with Browser Isolation (BISO). While technically similar, BISO’s larger container images slowed Continue reading

Meet NFA v26.02, featuring BGP visibility tools, extended threshold matching, and SNMP reporting enhancements.

We’re excited to announce the release of Noction Flow Analyzer v26.02. This version includes a focused set of improvements that enhance BGP visibility, expand threshold-monitoring options, improve flow-processing performance, and refine the SNMP reporting experience. This update builds on the foundation of v26.01 and introduces new tools for network engineers who rely on real-time routing intelligence and traffic analysis.

BGP diagnostics and visibility tools

The biggest addition in v26.02 is a complete set of BGP diagnostics and visibility tools. These give network administrators new insights into routing behavior directly within NFA. The new BGP diagnostics panel introduces ping and traceroute checks, allowing engineers to run connectivity and path diagnostics without leaving the NFA interface. Additionally, a BGP Data Lookup feature enables direct queries against NFA’s internal BGP tables, supporting exact-match and more-specific match modes for precise prefix investigations. Finally, BGP History Lookup provides access to historical route events, including key attributes such as prefix, next-hop, AS path, and more. This makes it easier to trace routing changes over time and connect them with traffic events.
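
To illustrate the difference between the two lookup modes, here is a minimal Python sketch using the standard `ipaddress` module. It is not NFA's implementation or API: the table contents and function names are hypothetical, and it assumes that a more-specific lookup returns only prefixes strictly longer than the queried one.

```python
import ipaddress

# Hypothetical BGP table; the prefixes are made up for the example.
TABLE = [
    ipaddress.ip_network(p)
    for p in ("10.0.0.0/8", "10.1.0.0/16", "10.1.2.0/24", "192.0.2.0/24")
]


def exact_match(table, prefix):
    """Return table entries identical to the queried prefix."""
    target = ipaddress.ip_network(prefix)
    return [net for net in table if net == target]


def more_specifics(table, prefix):
    """Return table entries strictly contained in the queried prefix.

    Assumption: the queried prefix itself is not part of the result.
    """
    target = ipaddress.ip_network(prefix)
    return [
        net
        for net in table
        # subnet_of() raises TypeError on mixed IPv4/IPv6, so check first.
        if net.version == target.version
        and net != target
        and net.subnet_of(target)
    ]


print(exact_match(TABLE, "10.1.0.0/16"))
print(more_specifics(TABLE, "10.0.0.0/8"))
```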

Pytest for Automated Network Testing (II)

In part one, we covered the basics of pytest and wrote our first network tests. We tested BGP and OSPF on a single device, then extended it to multiple devices. We also looked at parametrization and how it helps treat each device and each neighbour as an independent test.

In this part, we will cover inventory management with Nornir and pytest fixtures.

Nornir Introduction

Nornir is a Python automation framework designed for network engineers. Instead of writing your own logic to connect to devices, manage inventory, and run tasks in parallel, Nornir handles all of that for you. We have a dedicated series on Nornir, which you can check out here, so we are not going to do a deep dive in this post.

The reason we are using Nornir here is for inventory and task management. Instead of hardcoding a list of IP addresses in our collection file, we define our devices in a hosts file with groups, credentials, and Continue reading

When “idle” isn’t idle: how a Linux kernel optimization became a QUIC bug

CUBIC, standardized in RFC 9438, is the default congestion controller in Linux, and as a result governs how most TCP and QUIC connections on the public Internet probe for available bandwidth, back off when they detect loss, and recover afterward. At Cloudflare, our open-source implementation of QUIC, quiche, uses CUBIC as its default congestion controller, meaning this code is in the critical path for a significant share of the traffic we serve.

In this post, we’ll tell the story of a bug in which CUBIC's congestion window (cwnd) gets permanently pinned at its minimum and never recovers from a congestion collapse event.

The story starts with a Linux kernel change aimed at bringing CUBIC into line with the app-limited exclusion described in RFC 9438 §4.2-12 — a fix to a real problem in TCP that, when ported to our QUIC implementation, surfaced unexpected behaviors in quiche. It has a happy ending: an elegant (near-)one-line fix that broke the cycle.

CUBIC's logic in a nutshell

Before we dive into the core problem, a quick refresher on Congestion Control Algorithms (CCAs) may help to set the stage.

The central knob a CCA turns is the congestion window (cwnd Continue reading
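
For readers who prefer the refresher in executable form, here is a minimal Python sketch of the window growth function from RFC 9438, W_cubic(t) = C * (t - K)^3 + W_max, using the RFC's constants C = 0.4 and beta = 0.7. It is not the quiche or Linux kernel code. The function names are made up, windows are expressed in segments and time in seconds, and the sketch ignores the Reno-friendly region, app-limited handling, and everything else a real implementation needs.

```python
import math

C = 0.4     # RFC 9438 constant that scales the cubic growth
BETA = 0.7  # RFC 9438 multiplicative decrease factor


def cbrt(x):
    """Real cube root that also works for negative inputs."""
    return math.copysign(abs(x) ** (1.0 / 3.0), x)


def cubic_k(w_max, cwnd_epoch):
    """Time (seconds) the cubic curve needs to climb back to w_max."""
    return cbrt((w_max - cwnd_epoch) / C)


def w_cubic(t, w_max, cwnd_epoch):
    """Target window (segments) t seconds into the current epoch."""
    k = cubic_k(w_max, cwnd_epoch)
    return C * (t - k) ** 3 + w_max


# Loss detected at cwnd = 100 segments: the window is cut to beta * 100,
# then grows back along the cubic curve.
w_max = 100.0
cwnd_epoch = BETA * w_max
for t in (0.0, 2.0, 4.0, 6.0):
    print(f"t={t:.0f}s  target cwnd={w_cubic(t, w_max, cwnd_epoch):.1f}")
```

The curve starts at the reduced window, flattens out as it approaches W_max (the window at which the last loss occurred), and speeds up again once it passes that point.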
