The Paranoia Audit: Why I Checked a 'Perfect' Network

We love to lie to ourselves about redundancy. We configure BFD. We enable TI-LFA. We type commit. Then we sleep, assuming the network will heal itself in milliseconds.

I recently spent days trying to prove myself wrong. Two years ago, I wrote about our migration to Segment Routing and our achievement of sub-50ms convergence. But a question started nagging at me: Is it actually fast? Or do we just think it is?

I audited the entire stack. I wanted to see if we had "Green Screen Syndrome" - where the dashboard says yes but the traffic says no. What I found was instructive. The architecture held up. The homework we did two years ago paid off.

Here is the breakdown of the "Convergence Triangle" and how you can verify your own network is as solid as you think it is.

The "Magic Signpost" Theory

To understand why this matters, you have to understand how a router thinks. Imagine a pizza delivery driver. If a bridge collapses, the old way (Standard BGP) requires the driver to pull over, open a new map, and recalculate the entire route from scratch. That can take 30 to 60 seconds.

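BGP PIC (Prefix Independent Convergence) is like a GPS with a pre-loaded alternate route ready to activate the instant the bridge collapses. The router performs a single pointer flip in a hierarchical FIB structure. Many prefixes share one indirect next hop, so repointing that one object redirects all of them at once, no matter how many prefixes sit behind it. There is no per-prefix recalculation - just an instant redirect to a pre-installed backup path.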
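To make the pointer flip concrete, here is a toy Python model of the idea. It is only a sketch: the class and function names, the PE1/PE2 labels, and the 5000-prefix table are all made up for illustration, and a real FIB lives in hardware, not in a dictionary. It counts how many writes a failover costs when prefixes share one next-hop object versus when each prefix holds its own copy.

```python
from dataclasses import dataclass, field


@dataclass
class IndirectNextHop:
    """Hypothetical shared next-hop object that many prefixes point to."""
    primary: str
    backup: str
    active: str = field(init=False)

    def __post_init__(self):
        # Start out forwarding via the primary path.
        self.active = self.primary

    def fail_over(self) -> int:
        """Flip to the pre-installed backup; returns the number of writes."""
        self.active = self.backup
        return 1


def flat_failover(flat_fib: dict, failed: str, backup: str) -> int:
    """Flat FIB: every prefix stores its own next hop, so each is rewritten."""
    writes = 0
    for prefix, hop in flat_fib.items():
        if hop == failed:
            flat_fib[prefix] = backup  # value update only, keys are unchanged
            writes += 1
    return writes


# Illustrative table size; any number works.
prefixes = [f"10.{i // 256}.{i % 256}.0/24" for i in range(5000)]

# Hierarchical: all prefixes reference the same next-hop object.
shared = IndirectNextHop(primary="PE1", backup="PE2")
hier_fib = {p: shared for p in prefixes}
hier_writes = shared.fail_over()

# Flat: each prefix carries its own copy of the next hop.
flat_fib = {p: "PE1" for p in prefixes}
flat_writes = flat_failover(flat_fib, "PE1", "PE2")

print(f"hierarchical writes: {hier_writes}, flat writes: {flat_writes}")
```

The hierarchical case costs one write whether the table holds five prefixes or five hundred thousand, which is what "prefix independent" is getting at.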

The Convergence Triangle is the three-layer stack that makes this work. I audited each layer to confirm we didn't have any silent failures.

The Convergence Triangle

Layer 1: Transport (TI-LFA)

This is the road itself. I verified coverage using show isis interface extensive.

Pass Criteria: You must see Post convergence Protection: Enabled on every interface. In the inet.3 table, look for backup next-hops with weight 0xf000.

My Result: Coverage was solid. Every interface showed TI-LFA enabled.

Layer 2: Service (BGP PIC Edge)

This is where most implementations fail silently. This is the layer that protects you when an Egress PE dies. TI-LFA cannot help you there: it repairs the path toward a next hop that still exists, and a dead PE is a next hop that is gone.

Run this command on Junos (or find the equivalent on your NOS):

show route <prefix> extensive

You are looking for two lines:

  • Indirect next hop: weight 0x1 (Primary)
  • Indirect next hop: weight 0x4000 (Backup)

If you only see 0x1, you have no backup.
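If you want to run this check across many prefixes, a small script can do the eyeballing for you. The sketch below is an assumption-heavy illustration: the sample strings are abridged, hypothetical stand-ins for real CLI output, and the helper names are mine. It only looks for the two weights described above on lines that mention an indirect next hop.

```python
import re

# Weights taken from the check described above.
PRIMARY_WEIGHT = 0x1
BACKUP_WEIGHT = 0x4000

# Match a hex weight on any line that mentions an indirect next hop.
# Real output carries more fields per line; this pattern tolerates them.
WEIGHT_RE = re.compile(
    r"Indirect next hop:.*?weight\s+(0x[0-9a-f]+)", re.IGNORECASE
)


def indirect_weights(cli_output: str) -> set:
    """Return the set of indirect next-hop weights found in the output."""
    return {int(m.group(1), 16) for m in WEIGHT_RE.finditer(cli_output)}


def has_pic_backup(cli_output: str) -> bool:
    """True only if both the primary and the backup weight are present."""
    weights = indirect_weights(cli_output)
    return PRIMARY_WEIGHT in weights and BACKUP_WEIGHT in weights


# Hypothetical, abridged samples (not captured from a real device).
SAMPLE_PROTECTED = """\
Indirect next hop: weight 0x1
Indirect next hop: weight 0x4000
"""

SAMPLE_UNPROTECTED = """\
Indirect next hop: weight 0x1
"""

print(has_pic_backup(SAMPLE_PROTECTED))
print(has_pic_backup(SAMPLE_UNPROTECTED))
```

Feed it the text of show route <prefix> extensive for each prefix you care about and flag anything that comes back False. Check the pattern against your own platform's output first, since field layout varies by release.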

My Result: The layer was in good shape. Our config showed the complete Add-Path setup working perfectly, with backup paths installed in the kernel.

The Hidden Killer: Label Allocation

This is the part of PIC that catches most people. PIC pre-computation requires that your label space be allocated per-next-hop, not per-prefix. Per-prefix means that during a failure the router has to update thousands of individual entries; per-next-hop means it updates one entry per next hop. I verified we were running per-next-hop. We were safe. No legacy headaches.

The Speed of Light vs. BFD

We obsess over BFD timers. We tune them down to 300ms intervals. With the common detect multiplier of 3, that is still about 900ms before BFD declares a failure. For direct fiber, BFD is a safety net, not the primary detector.

The Detection Hierarchy:

  • LOS (Loss of Signal): 2 to 5ms
  • ISIS Reaction: 5 to 10ms
  • BFD: 900ms+

When a fiber is cut, the loss of light on the interface triggers the failover before BFD even notices a packet is missing. The real detection happens at the speed of light.
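The arithmetic behind that gap is simple enough to sketch. BFD detection time is the transmit interval multiplied by the detect multiplier. The multiplier of 3 below is my assumption (it is a common default and matches the 900ms figure above); the LOS number is the worst case from the list.

```python
def bfd_detection_ms(interval_ms: int, multiplier: int) -> int:
    """BFD declares a failure after `multiplier` consecutive missed packets."""
    return interval_ms * multiplier


# 300ms interval from the text; multiplier of 3 is an assumed default.
bfd_ms = bfd_detection_ms(interval_ms=300, multiplier=3)

# Worst-case LOS detection from the hierarchy above.
LOS_WORST_MS = 5

ratio = bfd_ms / LOS_WORST_MS
print(f"BFD: {bfd_ms}ms, LOS worst case: {LOS_WORST_MS}ms, ratio: {ratio:.0f}x")
```

Even against the slowest LOS figure, BFD at these timers is two orders of magnitude behind, which is why it belongs in the safety-net column for direct fiber.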

The Verdict

I didn't find a bug. I didn't find a misconfig. So was the audit a failure? No.

This audit confirmed that the layered protection model is working as designed. When that 3 AM fiber cut happens, we know the sequence:

  • T+0ms: Fiber cut.
  • T+10ms: Traffic flowing on backup path.

There is peace of mind in knowing exactly how your network fails. If you haven't audited your PIC implementation recently, I'd encourage you to run the same checks.
