Network Automation

From 'Hello World' to 'Agent Ready': Why We Built Instead of Bought

Hervé Hildenbrand

07 Apr 2026 — 3 min read

Last month I shared a screenshot of a single switch validation: 12 tests, 1.46 seconds. It was cool, but it was a toy example.

Scaling that to a nearly 300-device fabric requires more than just a loop. It requires architecture.

The obvious question is: "Why not just use Arista CloudVision Portal (CVP)?"

CVP is an incredible product. But we are builders. We don't want a black box; we want a toolbox. We need the ability to customize our validation logic down to the packet level, tailoring it to our specific storage behaviors, our OOB quirks, and our exact compliance standards.

This is why ANTA (Arista Network Test Automation) is a gift from Arista. Most vendors lock their validation logic inside a proprietary dashboard. Arista open-sourced theirs. They didn't just sell us the platform; they gave us the engine.

Here is how we moved from a simple script to a production-grade validation engine.

1. The "One Size Fits None" Problem

A generic validation script is a liability. A Spine isn't supposed to run VXLAN, so a generic VXLAN test fails there for no good reason. And if I run only a generic "interface up" check on a Storage switch, I miss the checks that actually matter: Jumbo Frames and OutDiscards.

We realized we needed Contextual Validation. We built 10 specific catalogs to match our topology:

Spines (sp*): Validate BGP EVPN overlay, but skip VXLAN tests (since Spines don't terminate VTEPs).
Leafs (lf*): Full validation: EVPN + VXLAN + MLAG consistency.
Border Leafs (blf*): Complex routing checks (EVPN + OSPF + IS-IS), split into prod, pre-prod, and sandbox profiles.
Storage (sws*): High-throughput checks: buffer depth, Jumbo Frames, and strict STP validation.
OOB (swo*): Basic hygiene: Reachability, SSH security, NTP (no routing protocols expected here).
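
A minimal sketch of how a hostname could be mapped to its catalog, assuming the prefixes listed above. The catalog file names are hypothetical, and the prod / pre-prod / sandbox split for Border Leafs is left out for brevity.

```python
from fnmatch import fnmatchcase

# Hostname patterns come from the list above; the catalog file names are
# hypothetical placeholders, not the real files.
CATALOG_BY_PATTERN = [
    ("blf*", "border_leaf.yml"),
    ("sp*", "spine.yml"),
    ("lf*", "leaf.yml"),
    ("sws*", "storage.yml"),
    ("swo*", "oob.yml"),
]


def select_catalog(hostname: str) -> str:
    """Return the catalog file for a device, based on its hostname pattern."""
    name = hostname.lower()
    for pattern, catalog in CATALOG_BY_PATTERN:
        if fnmatchcase(name, pattern):
            return catalog
    # Refuse to guess: an unknown role should never get a generic catalog.
    raise ValueError(f"no catalog defined for device {hostname!r}")
```

Raising on an unknown prefix is a deliberate choice in this sketch: falling back to a generic catalog would bring back the "one size fits none" problem.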

2. Building the API Layer (The "Wrapper")

This is the most critical part. ANTA is the engine. But we needed a driver.

We didn't want engineers manually running scripts. We wanted a service. So we built a custom API wrapper.

Instead of typing commands, our orchestration system calls the API with device names (e.g., sp01, lf04, sws12), selects the correct test catalog (Spine vs. Leaf vs. Storage), and executes the validation.
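
The request flow described above can be sketched roughly as follows. This is an illustration, not our actual service: the function names, the catalog names, and the shape of the response are assumptions, and the ANTA run itself is represented by an injected run_catalog callable rather than real ANTA calls.

```python
from collections import defaultdict
from typing import Callable

# Assumed prefix-to-catalog table; the catalog names are hypothetical.
CATALOGS = {"blf": "border_leaf", "sp": "spine", "lf": "leaf",
            "sws": "storage", "swo": "oob"}

# A runner takes a catalog name and its devices, and returns pass/fail per device.
Runner = Callable[[str, list[str]], dict[str, bool]]


def catalog_for(device: str) -> str:
    # Longest prefix first, so a more specific prefix wins if two ever overlap.
    for prefix in sorted(CATALOGS, key=len, reverse=True):
        if device.startswith(prefix):
            return CATALOGS[prefix]
    raise ValueError(f"unknown device role: {device!r}")


def handle_validation_request(devices: list[str], run_catalog: Runner) -> dict:
    """Group devices by catalog, run each group once, and merge the results."""
    groups: dict[str, list[str]] = defaultdict(list)
    for device in devices:
        groups[catalog_for(device)].append(device)

    results: dict[str, dict] = {}
    for catalog, members in groups.items():
        outcome = run_catalog(catalog, members)  # stand-in for the ANTA run
        for device in members:
            results[device] = {"catalog": catalog, "passed": outcome[device]}
    return {"devices": results}
```

Because the runner is injected, the same handler can be exercised with a fake runner in tests and wired to the real engine in production.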

And for the humans? Of course, we built a web dashboard that consumes this API. But crucially, the dashboard is just a "dumb" client. It hits the exact same API endpoints as our automation scripts.

3. The Metrics: 97.8% Success Rate (And Why 100% is a Myth)

When we ran this against 288 devices, we finished in 121 seconds. The result: 97.8% success.
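
For scale: 288 devices in 121 seconds is about 0.42 seconds per device. If every device took roughly the 1.46 seconds of the single-switch run, a plain sequential loop would need about 420 seconds (288 × 1.46), so the work has to overlap. The sketch below shows one common pattern for that, bounded concurrency with asyncio. The validate_device coroutine and the concurrency limit of 50 are placeholders, not a description of the actual implementation.

```python
import asyncio


async def validate_device(name: str) -> bool:
    """Placeholder for one device's validation; sleeps instead of doing I/O."""
    await asyncio.sleep(0.01)
    return True


async def validate_fabric(devices: list[str], max_concurrency: int = 50) -> dict[str, bool]:
    """Validate all devices concurrently, with at most max_concurrency in flight."""
    limit = asyncio.Semaphore(max_concurrency)

    async def bounded(name: str) -> tuple[str, bool]:
        async with limit:
            return name, await validate_device(name)

    pairs = await asyncio.gather(*(bounded(d) for d in devices))
    return dict(pairs)


if __name__ == "__main__":
    fabric = [f"lf{i:02d}" for i in range(1, 21)]
    print(asyncio.run(validate_fabric(fabric)))
```

The semaphore keeps the run from opening hundreds of sessions at once, which matters more for the management plane than for the machine running the tests.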

The remaining gap wasn't outages. It was the dust of a living network:

Signal vs. Noise: Interface counters showing 50 errors in a sea of 10 billion packets, too small for an alert but enough to fail a strict validation test.
The "Server-Side" Gap: MLAG interfaces configured on our leafs but down on the server side.
Provisioning Delta: Devices with config templates applied but not yet connected to servers.
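
The "Signal vs. Noise" case comes down to two different questions asked of the same counter. A small sketch, where the one-in-a-million alert threshold is an assumed value chosen only for illustration:

```python
def strict_check(error_count: int) -> bool:
    """Strict validation: any non-zero error counter is a failure."""
    return error_count == 0


def should_alert(error_count: int, packet_count: int, max_ratio: float = 1e-6) -> bool:
    """Alerting: only fire when errors exceed a fraction of total traffic."""
    if packet_count == 0:
        return error_count > 0
    return error_count / packet_count > max_ratio


errors, packets = 50, 10_000_000_000
print(strict_check(errors))           # False: the validation test fails
print(should_alert(errors, packets))  # False: ratio is 5e-9, below the threshold
```

Both answers are correct. The strict test reports that the counter is not clean, and the alert logic reports that nobody needs to be woken up for it.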

Every "failure" was valid data. We don't hide these results, because hiding them hurts hygiene. We accept the number, and now we know exactly what makes up that gap.

4. The Philosophy: Machine-to-Machine (M2M) First

Why did we go through the trouble of building an API wrapper?

Network Engineering is stuck in a "Machine-to-Human" world. We love our CLIs. We love parsing text. But you cannot build robust automation on top of text streams designed for eyeballs.

We operate on a Machine-to-Machine first principle. If it doesn't have an API, it doesn't exist in our ecosystem.

This prepares us for the future of Agentic AI. Nobody is confident letting an AI agent use the CLI on its own: a raw CLI is an open minefield. The API is a guardrail. It forces the agent into a safe lane. It defines exactly what is allowed (Validation) and what isn't (Destruction).
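
One way to picture the guardrail is an explicit allowlist: the API exposes a closed set of read-only actions and refuses everything else. A minimal sketch, with hypothetical action names:

```python
from enum import Enum


class Action(Enum):
    """The only operations the API exposes; both are read-only (hypothetical names)."""
    VALIDATE = "validate"
    GET_RESULTS = "get_results"


def parse_action(requested: str) -> Action:
    """Map a caller's request to an allowed action, or refuse it."""
    try:
        return Action(requested)
    except ValueError:
        # Anything outside the allowlist is rejected, whoever the caller is.
        raise PermissionError(
            f"action {requested!r} is not exposed by this API"
        ) from None
```

With this shape, an agent asking for "validate" gets an Action back, while a request such as "reload" raises PermissionError before anything touches a device.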

By wrapping ANTA in our own API, we have created a "tool" that any machine can use safely. Today, it's our dashboard. Tomorrow, it's an AI agent: Agent calls API → Agent proposes fix.

We are actively removing the human from the "Read-Eval-Print" loop. Not because we want to replace engineers, but because we want engineers building systems, not typing "show ip int brief" 300 times a day.

The API isn't just a wrapper. It is a declaration that our network is software, not hardware.