Stop Designing Route Reflectors Like It's 2005 (Part 2: The Fast-Twitch Muscle)

Hervé Hildenbrand

13 May 2026 — 10 min read

In Part 1 we structurally isolated the Route Reflector onto its own compute. The brain was protected. But protecting the brain does not protect the body. That is exactly what this part is about.

The natural next step is optimization: tuning Optimal Route Reflection (ORR) and maximizing RIB Sharding to squeeze every possible millisecond of control-plane performance out of your x86 instances.

It is a logical move. But relying purely on a smarter brain for survivability is a dangerous trap. A network with a faster brain and the same legacy nervous system will still watch its own outages happen in slow motion.

Consider what happens during a standard topology change. A PE-facing link goes down, removing an active BGP next-hop. Even a highly optimized Route Reflector processing the WITHDRAW exactly as designed (best-path recompute, RIB-OUT serialization, TCP transmission to every client) takes time.

The Propagation Tax

In a legacy RR design lacking a pre-loaded alternate, the operational math is unforgiving. Every millisecond your Route Reflector spends processing a BGP failure is a millisecond your edge routers spend firing packets into a black hole.

Let's call this the Propagation Tax. It is the time delta between the physical reality changing and your control plane finishing its update cycle. It manifests as:

Packet loss during routine topology changes.

Customer escalations for micro-outages.

SLA breaches that defy explanation because "we have full redundancy."

This tax is structurally unavoidable as long as your PEs remain blind to anything except the single path the RR has chosen for them. This exposes the core flaw in legacy design.

The Single-Path Lie

When a traditional Route Reflector receives multiple paths to a prefix, it runs its best-path algorithm and normally advertises only the selected best path to each client. The alternates may still exist inside the RR, but they are hidden from the PEs. That is path hiding.

Path hiding once made sense. Memory was expensive and FIBs were small. More importantly, the early internet was built for best-effort delivery. In a mostly text-based world, it was perfectly acceptable for packets to stop flowing for a few seconds while the control plane caught up.

None of those things are true anymore.

When the single advertised path goes down, the PE blackholes traffic until the RR detects the change, recomputes, serialises the update, and pushes it. With full internet tables, policy-heavy RIB-OUT, and many RR clients, this can easily move from milliseconds into seconds under real failure conditions. In modern networking a second is a geological era.

BGP Add-Paths (RFC 7911) gives you a way to end the lie. It changes the conversation between the RR and the PE from a monologue into a briefing. The RR can now send multiple paths for the same prefix. The PE receives a primary path and a pre-loaded backup. It does not have to wait for the RR to recalculate and propagate a new answer. It already has the answer.
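To make the difference concrete, here is a minimal sketch of what a client PE holds under classic reflection versus bounded Add-Paths. It is a toy model, not a BGP implementation: the `Path` fields, the two-step ranking, and the PE names are assumptions made for illustration, and real best-path selection has many more tie-breakers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Path:
    next_hop: str    # BGP next-hop, i.e. the exit PE (hypothetical names)
    local_pref: int  # higher wins
    igp_cost: int    # RR's IGP cost to the next-hop; lower wins


def rank(paths):
    """Simplified ordering: highest local-pref, then lowest IGP cost."""
    return sorted(paths, key=lambda p: (-p.local_pref, p.igp_cost, p.next_hop))


def advertise(paths, send_count=1):
    """Paths the RR sends to a client for one prefix.

    send_count=1 models classic reflection (path hiding);
    send_count>1 models Add-Paths with a bounded path count.
    """
    return rank(paths)[:send_count]


def survives_failure(advertised, failed_next_hop):
    """True if the client already holds a usable path after a next-hop dies,
    without waiting for a new UPDATE from the RR."""
    return any(p.next_hop != failed_next_hop for p in advertised)


# Three exits for the same prefix, as seen in the RR's RIB-IN.
rib_in = [Path("PE1", 100, 10), Path("PE2", 100, 20), Path("PE3", 100, 30)]

legacy = advertise(rib_in)                   # only PE1 reaches the client
add_paths = advertise(rib_in, send_count=2)  # PE1 plus a pre-loaded PE2

print(survives_failure(legacy, "PE1"))     # False: the client must wait for the RR
print(survives_failure(add_paths, "PE1"))  # True: the backup is already there
```

The only thing that changed between the two cases is how many entries the RR was willing to send. The RR still ranked the paths and still decided what the client sees.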

The RR is still your control plane. It still decides what to send. But when the inevitable happens it is the PE that fires the reflex. It is the PE that saves your traffic. The RR catches up later.

The Danger of Unbounded Add-Paths

I have watched a well-meaning Add-Paths rollout drive PE control planes to 100% CPU in ninety seconds. Production hardware is never uniform. If you blindly flood every alternate path to every PE, older, CPU-constrained boxes will collapse under the BGP churn. The fix is not "more compute." The fix is sending less, adapting your advertisement policy to the physical limits of the receiving PE.

The mechanism is straightforward. When you enable Add-Paths with "send all" or a high path count, every RR client receives up to N times the BGP UPDATE volume on every churn event, where N is the number of paths advertised per prefix. The RIB-OUT generation work on the RR scales with paths × clients. The BGP-IN parsing work on each PE scales with paths × prefixes touched. On a modern PE with a dedicated BGP control-plane CPU, this is absorbed. On a five-year-old PE running BGP on a shared route processor, it is not.

The lesson is to treat Add-Paths advertisement policy as a per-platform contract. Use add-path send-count, per-neighbor filters, or AFI-specific scoping. The goal is the minimum path diversity required for PIC Edge to function (typically two), not the maximum the protocol allows.
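A rough way to reason about that contract is to count path entries per churn event. The sketch below is a back-of-the-envelope model only: the platform classes, the path budgets, and the client names are hypothetical placeholders, not vendor guidance, and "work" is counted as advertised path entries rather than measured CPU.

```python
# Hypothetical per-platform path budgets (placeholders, not recommendations).
PATH_BUDGET = {
    "modern-pe": 4,
    "legacy-pe": 2,  # the minimum diversity PIC Edge needs
}
DEFAULT_BUDGET = 2


def send_count(platform, available_paths):
    """Paths the RR sends per prefix to a client of this platform class."""
    return min(PATH_BUDGET.get(platform, DEFAULT_BUDGET), available_paths)


def churn_cost(clients, available_paths, prefixes_touched, send_all=False):
    """Path entries generated by one churn event.

    Returns (per-PE parse load, total RR RIB-OUT load). Each PE parses
    paths x prefixes touched; the RR generates the sum across all clients.
    """
    per_pe = {}
    for name, platform in clients.items():
        if send_all:
            paths = available_paths
        else:
            paths = send_count(platform, available_paths)
        per_pe[name] = paths * prefixes_touched
    return per_pe, sum(per_pe.values())


clients = {
    "pe-new-1": "modern-pe",
    "pe-old-1": "legacy-pe",
    "pe-old-2": "legacy-pe",
}

flood, flood_total = churn_cost(clients, 8, 1000, send_all=True)
scoped, scoped_total = churn_cost(clients, 8, 1000)

print(flood["pe-old-1"], flood_total)    # 8000 24000
print(scoped["pe-old-1"], scoped_total)  # 2000 8000
```

In this toy case the oldest boxes parse a quarter of the entries they would under "send all", and they still hold the two paths PIC Edge needs.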

But Add-Paths alone is just data. The reflex still has to fire.

The Muscle: The Hierarchical FIB

To understand how the reflex actually fires, you have to look at the data plane. Prefix Independent Convergence is structurally impossible without a Hierarchical Forwarding Information Base (FIB).

In a legacy flat FIB every BGP prefix is mapped directly to a physical outgoing interface and MAC address. If a link fails, the router CPU must individually rewrite every single prefix entry in the hardware. With a full table, this takes seconds.

In a Hierarchical FIB the architecture is decoupled into a strict chain of pointers: a BGP Prefix points to a Protocol Next-Hop, the Next-Hop points to an IGP Adjacency and the Adjacency points to a Physical Interface.

This decoupling changes the physics of a network failure. If a physical core link dies the silicon only updates the Adjacency pointer. The million BGP prefixes sitting at the top of the chain do not move. They do not even know the failure happened. The recovery is instant.
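The difference is easiest to see by counting writes. The following is a conceptual sketch in plain Python objects, not a description of any vendor's forwarding silicon; the class names, interface names, and prefix count are assumptions chosen for illustration.

```python
class Adjacency:
    """Bottom of the chain: the physical interface used to reach a neighbor."""

    def __init__(self, interface):
        self.interface = interface


class NextHop:
    """Middle of the chain: a protocol next-hop resolved to an adjacency."""

    def __init__(self, address, adjacency):
        self.address = address
        self.adjacency = adjacency


def fail_over_flat(fib, dead_interface, alternate_interface):
    """Flat FIB: every affected prefix entry is rewritten individually."""
    writes = 0
    for prefix in list(fib):
        if fib[prefix] == dead_interface:
            fib[prefix] = alternate_interface
            writes += 1
    return writes


def fail_over_hierarchical(next_hop, alternate_adjacency):
    """Hierarchical FIB: one shared pointer is updated, prefixes are untouched."""
    next_hop.adjacency = alternate_adjacency
    return 1


def lookup(fib, prefix):
    """Walk the chain: prefix -> next-hop -> adjacency -> interface."""
    return fib[prefix].adjacency.interface


prefixes = [f"10.{i // 256}.{i % 256}.0/24" for i in range(10_000)]

# Flat: each prefix maps straight to an interface.
flat_fib = {p: "et-0/0/0" for p in prefixes}
flat_writes = fail_over_flat(flat_fib, "et-0/0/0", "et-0/0/1")

# Hierarchical: every prefix shares the same NextHop object.
nh = NextHop("192.0.2.1", Adjacency("et-0/0/0"))
hier_fib = {p: nh for p in prefixes}
hier_writes = fail_over_hierarchical(nh, Adjacency("et-0/0/1"))

print(flat_writes, hier_writes)        # 10000 1
print(lookup(hier_fib, prefixes[-1]))  # et-0/0/1
```

Scale the prefix list to a full table and the flat count grows with it, while the hierarchical count stays at one. That is the "prefix independent" part of the name.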

Without a Hierarchical FIB, PIC is mechanically impossible. It does not matter what your RR does. It does not matter how many alternate paths Add-Paths delivers. If the FIB is flat, the router physically cannot fail over without walking every prefix. Every vendor shipping PIC today relies on this hierarchical pointer chain inside their forwarding silicon. It is the enabling hardware architecture.

Once the FIB is hierarchical, two forms of PIC are possible. They protect different things.

PIC Core: The IGP Reflex

PIC Core protects against failures inside your IGP domain. It relies entirely on BGP next-hop recursion using this pointer chain.

Millions of BGP prefixes point to a single BGP Next-Hop IP address. That BGP Next-Hop is resolved by the IGP. When a core link dies, the IGP detects the topology change and calculates a new shortest path. The router then updates the single IGP pointer for that specific BGP Next-Hop in the FIB.

Because all the BGP prefixes are just pointing to the Next-Hop, they do not need to be touched. They instantly ride the new IGP path with zero per-prefix churn. The IGP handles the crisis, removing BGP from the synchronous repair path entirely.

But PIC Core only handles failures between PEs. It cannot save you when the PE itself, or its external eBGP peering, is what died. The IGP cannot route around an exit point that no longer exists.

PIC Edge: The Pre-Loaded Reflex

That is what PIC Edge is for. It protects against the failure of the exit point itself.

Instead of relying on the IGP to find a new path, the router pre-programs a special pointer structure (often called a PathList) into the FIB. Millions of BGP prefixes point to this single PathList.

Inside this PathList, there is a Primary Next-Hop and a Backup Next-Hop. When the primary eBGP peer or remote PE goes offline, the hardware detects the failure and flips the pointer inside the PathList from Primary to Backup. Every BGP prefix pointing to that PathList instantly fails over in the data plane without waiting for BGP to withdraw and reconverge.
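A minimal sketch of that structure follows. `PathList`, its method name, and the PE names are illustrative assumptions; real implementations differ per vendor, and the failure trigger (link-down, BFD, or next-hop tracking) sits outside this model.

```python
class PathList:
    """Shared forwarding object holding a primary and a pre-loaded backup."""

    def __init__(self, primary, backup):
        self.primary = primary
        self.backup = backup
        self.active = primary

    def on_next_hop_down(self, next_hop):
        """Flip to the backup if the active next-hop just failed.

        Returns True when a flip happened. This is a single update,
        regardless of how many prefixes reference the PathList.
        """
        if next_hop == self.active and self.backup not in (None, next_hop):
            self.active = self.backup
            return True
        return False


# Many prefixes, one shared PathList (learned via Add-Paths in this story).
shared = PathList(primary="PE1", backup="PE2")
fib = {f"203.0.{i}.0/24": shared for i in range(256)}

ignored = shared.on_next_hop_down("PE9")  # unrelated failure: nothing changes
flipped = shared.on_next_hop_down("PE1")  # primary dies: one pointer flip

print(ignored, flipped)                         # False True
print({pl.active for pl in fib.values()})       # {'PE2'}
```

Note what the sketch does not contain: no per-prefix loop on failure, and no message from the RR. The backup was installed before the failure, which is exactly why Add-Paths has to deliver it in advance.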

This is the pre-loaded reflex. The PE knew the backup before the failure happened. The RR gave it the intelligence (via Add-Paths); PIC Edge gave it the muscle (via the Hierarchical FIB).

The control plane still catches up eventually. BGP will eventually withdraw the dead path, the RR will recalculate, and the PE will receive the proper updated best-path. But the traffic never stopped. The data plane handled it.

In enterprise backbones where uptime is contractual, this is often the difference between a silent recovery and a customer escalation.

It is also why, in RR-based backbones, Add-Paths and PIC Edge must be deployed as one architecture in two halves, not as separate features. Add-Paths is the briefing. PIC Edge is the reflex. Without Add-Paths, PIC Edge has no backup to pre-load. Without PIC Edge, Add-Paths is just a briefing with no muscle to act on it. The RR provides the intelligence; the PE executes the response.

ORR vs. Add-Paths: Different Time Domains

If you have been optimising your RRs, you have heard of BGP Optimal Route Reflection (ORR). I will be completely transparent here: while I have the operational experience for Add-Paths and PIC Edge, I have never deployed ORR in a live production backbone. But theoretically, it fits into this architecture perfectly.

ORR is sometimes confused with Add-Paths because both involve the RR being smarter about what it sends. They solve completely different problems.

A legacy RR calculates the best path from its own physical perspective in the topology. This forces remote PEs into sub-optimal routing. They exit the network from wherever the RR thinks is closest, which is rarely where they actually are. Hot-potato routing breaks.

ORR makes the Route Reflector calculate the best path from the perspective of each client PE. Instead of one global best path, the RR computes a per-client best path using each PE's actual IGP position. The traffic exits where it should rather than where the RR is sitting.
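The change of vantage point can be sketched in a few lines. This is a toy model under stated assumptions: the IGP cost table and node names are invented, all other BGP attributes are assumed equal so that IGP cost to the next-hop is the deciding tie-breaker, and it says nothing about how a real RR obtains each client's IGP view.

```python
# Hypothetical IGP costs from each vantage point to each exit PE.
IGP_COST = {
    "RR":   {"PE-East": 10, "PE-West": 50},  # the RR sits in the east
    "PE-A": {"PE-East": 40, "PE-West": 5},   # PE-A sits in the west
    "PE-B": {"PE-East": 5,  "PE-West": 45},  # PE-B sits in the east
}


def best_exit(vantage, exits):
    """Lowest IGP cost from the given vantage point (name breaks ties)."""
    return min(exits, key=lambda e: (IGP_COST[vantage][e], e))


def legacy_rr_choice(client, exits):
    """Classic reflection: the RR's own position decides for every client."""
    return best_exit("RR", exits)


def orr_choice(client, exits):
    """ORR: the RR computes the choice from the client's position."""
    return best_exit(client, exits)


exits = ["PE-East", "PE-West"]
for client in ("PE-A", "PE-B"):
    print(client, legacy_rr_choice(client, exits), orr_choice(client, exits))
# PE-A PE-East PE-West
# PE-B PE-East PE-East
```

PE-A is the client that benefits: under classic reflection it is told to haul traffic across the backbone to the east exit, while the per-client computation hands it the exit five cost units away.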

Here is the important distinction: ORR is a steady-state optimisation. It improves where traffic goes during normal operations. It does not change what happens during a failure.

ORR does not provide the PE with a backup path. It does not pre-load a reflex. When the selected path fails, ORR does not help the PE fail over faster. It only ensures the original path selection was correct.

Add-Paths and PIC Edge operate in a completely different time domain. They exist for the crisis moment. ORR exists for the normal moment.

Both are worth deploying. Neither replaces the other. If you deploy ORR without Add-Paths, you get better routing in steady state but the same propagation tax during failures. If you deploy Add-Paths without ORR, you get fast failover but sub-optimal routing in normal operations.

The ideal end state is all three: ORR for correct steady-state path selection, Add-Paths for pre-loaded alternatives, and PIC Edge for instant data-plane failover.

The Window of Vulnerability

To understand why all of this matters and why we route around the control plane instead of trying to make it faster, walk through what actually happens during a remote failure in a network that lacks a Pre-Loaded Reflex:

Detection. A PE notices its peering is down. (Fast, milliseconds.)

Transmission. The PE generates a BGP WITHDRAW and sends it to the RR.

Processing. The RR queues the update, ingests it into RIB-IN, runs best-path across potentially millions of paths, and writes RIB-OUT.

Propagation. The RR serialises updates into TCP sockets and transmits them to every other client PE.

Execution. Remote PEs receive, parse, and finally update their FIBs.

In an idle network, this completes in milliseconds. During a real topology change, when the RR is already churning through a BGP storm, it can stretch into full seconds. I have seen it stretch further than that, on networks I will not name, during incidents I am still asked not to discuss.
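The stages add up serially, which is the whole problem. The sketch below only does that addition; every number in it is an invented placeholder chosen to show the shape of the sum, not a measurement from any network, and `local_repair_ms` is a hypothetical parameter standing in for the pointer flip on a PE that already holds a backup.

```python
STAGES = ("detection", "transmission", "processing", "propagation", "execution")


def blackhole_window_ms(timings, pre_loaded_backup=False, local_repair_ms=0.0):
    """Time during which traffic is forwarded toward the dead exit.

    Without a pre-loaded backup, the PE waits for all five stages.
    With one, only detection plus the local repair gates forwarding;
    the remaining stages still run, but traffic no longer waits on them.
    """
    missing = [s for s in STAGES if s not in timings]
    if missing:
        raise ValueError(f"missing stage timings: {missing}")
    if pre_loaded_backup:
        return timings["detection"] + local_repair_ms
    return sum(timings[s] for s in STAGES)


# Placeholder values in milliseconds for an RR under churn (illustrative only).
storm = {
    "detection": 50,
    "transmission": 10,
    "processing": 1500,
    "propagation": 800,
    "execution": 300,
}

print(blackhole_window_ms(storm))                          # 2660
print(blackhole_window_ms(storm, pre_loaded_backup=True,
                          local_repair_ms=5))              # 55
```

Whatever real numbers you plug in, the structure holds: tuning the RR shrinks two of the five terms, while a pre-loaded backup removes four of them from the forwarding path.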

During every millisecond of that window, the rest of the network is forwarding traffic into a route that no longer exists. The remote PEs are not negligent. They are simply operating on the last truth the RR sent them. The RR has not yet tapped them on the shoulder.

This is the Propagation Tax, paid in dropped packets. And it is paid by your customers, not by your control plane.

You cannot make BGP fast enough to outrun this. BGP is stateful, serialised, and gated by TCP. No amount of compute on your vRR fundamentally changes its character. Throwing CPU at the control plane is like making a postal worker run faster. The letter still has to be written, stamped, and delivered.

With Add-Paths and PIC Edge, the architecture is different. The PE has already received the alternate path and pre-programmed it. The failure triggers a data-plane pointer flip. Convergence is no longer gated by the RR: it is bounded by failure detection plus a single pointer update in the forwarding hardware, independent of prefix count. The control plane eventually catches up and confirms the failover. But the traffic never stopped.

The Architecture

This is the nervous system we have been building across both parts. Let me lay it out as a single logical architecture:

Structural Isolation. The RR runs on its own compute (vRR on x86), deployed as a cluster (at least two instances per domain, more for scale). It is not co-resident with any forwarding path. Its failure is a control-plane event, not a data-plane event.

RIB Sharding. The workload is partitioned across RR instances using address-family or policy-based splits. This is horizontal scaling for the brain, not redundancy.

BGP Add-Paths. The RR advertises multiple paths per prefix to each PE, scoped and filtered per platform capability. This is the intelligence briefing.

PIC Edge. Each PE pre-programs a primary and backup next-hop into the Hierarchical FIB. When the primary fails, the silicon flips the pointer instantly. This is the reflex.

PIC Core. Core link failures are absorbed by the IGP through the hierarchical pointer chain. BGP is not involved.

ORR. The RR calculates per-client best paths based on each PE's actual IGP position, ensuring correct hot-potato routing in steady state.

The brain is isolated. The nervous system is pre-loaded. The reflexes fire without waiting for the brain. That is the design.

In Part 1 we protected the brain. In Part 2 we gave the body its own reflexes. Together, they form a backbone that does not just survive failures. It routes around its own control plane.

That is the architecture.