nanog mailing list archives

Re: Validating multi-path in production?

From: Jeff Tantsura <jefftant.ietf () gmail com>
Date: Fri, 12 Nov 2021 13:47:17 -0800

LAG - Micro BFD (RFC 7130) provides per-constituent liveness. MLAG is much more complicated (there’s a proposal in 
the IETF, but it is not progressing), so LACP is pretty much the only option.
ECMP could use good old single-hop BFD per pair.
Practically - if you introduce enough flows with one of the hash keys monotonically changing, eventually you’d exercise 
every available path;
on its own this would not help with end2end testing, so it is usually integrated with some form of sFlow/NetFlow to provide “proof of transit”.
Inband telemetry (choose your poison) does provide the basic ID of each device a packet has traversed, as well as POT in some cases. 
Finally - there are public Microsoft presentations on how we use IPinIP encap to traverse a particular path on wide-radix 
ECMP fabrics.
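
As a rough illustration of the "monotonically changing hash key" idea, here is a minimal Python sketch that steps the UDP source port by one per flow and counts how many flows it takes before every path index has been selected. Everything in it is an assumption made for illustration: the `Flow` tuple, `toy_ecmp_hash` and `sweep_src_port` are hypothetical names, the addresses are documentation prefixes, and SHA-256 merely stands in for whatever hash function and key fields the real hardware uses. It simulates path selection only; it sends no packets and proves nothing about a real device.

```python
import hashlib
from typing import Dict, List, NamedTuple, Tuple


class Flow(NamedTuple):
    src_ip: str
    dst_ip: str
    proto: int
    src_port: int
    dst_port: int


def toy_ecmp_hash(flow: Flow, n_paths: int) -> int:
    """Stand-in for a device hash: SHA-256 of the 5-tuple, modulo the path count."""
    key = "|".join(str(field) for field in flow).encode()
    digest = hashlib.sha256(key).digest()
    return int.from_bytes(digest[:4], "big") % n_paths


def sweep_src_port(n_paths: int, start_port: int = 49152,
                   max_flows: int = 1000) -> Tuple[int, Dict[int, List[Flow]]]:
    """Step the source port by one per flow until every path index has been hit.

    Returns (flows_generated, {path_index: [flows that hashed to it]}).
    """
    by_path: Dict[int, List[Flow]] = {}
    for i in range(max_flows):
        # Documentation addresses (RFC 5737); UDP to an arbitrary port.
        flow = Flow("192.0.2.10", "198.51.100.20", 17, start_port + i, 33434)
        by_path.setdefault(toy_ecmp_hash(flow, n_paths), []).append(flow)
        if len(by_path) == n_paths:
            return i + 1, by_path
    raise RuntimeError(
        f"only {len(by_path)} of {n_paths} paths hit after {max_flows} flows"
    )


if __name__ == "__main__":
    for n in (2, 4, 16):
        sent, by_path = sweep_src_port(n)
        print(f"{n} paths: all exercised after {sent} flows")
```

The per-path flow lists are the useful output: once you know (from flow export or telemetry) which 5-tuples land on which member, you can keep reusing those exact tuples as targeted probes.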

Cheers,
Jeff

On Nov 12, 2021, at 07:55, Adam Thompson <athompson () merlin mb ca> wrote:

Hello all.
Over time, we've run into occurrences of both bugs and human error, both in our own gear and in our partner networks'
gear, specifically affecting multi-path forwarding, at pretty much all layers: Multi-chassis LAG, ECMP, and BGP MP.
(Yes, I am a corner-case magnet. Lucky me.)

Some of these issues were fairly obvious when they happened, but some were really hard to pin down.

We've found that typical network monitoring tools (Observium & Smokeping, not to mention plain old ping and
traceroute) can't really detect a hashing-related or multi-path-related problem: either the packets get through or
they don't.

Can anyone recommend either tools or techniques to validate that multi-path forwarding either is, or isn't, working
correctly in a production network? I'm looking to add something to our test suite for when we make changes to
critical network gear. Almost all the scenarios I want to test only involve two paths, if that helps.

The best I've come up with so far is to have two test systems (typically VMs) that use adjacent IP addresses and
adjacent MAC addresses, and test both inbound and outbound to/from those, blindly trusting/hoping that hashing
algorithms will probably exercise both paths.

Some of the problems we've seen show that merely looking at interface counters is insufficient, so I'm trying to find
an explicit proof, not implicit.
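
One way to turn "explicit proof" into a pass/fail check is sketched below in Python. It assumes you already have, from some other source (flow export, in-band telemetry, or similar), a mapping of each test flow to the path it actually took, plus a measured loss fraction per flow. The function name `validate_paths`, the flow and path labels, and the input shapes are all hypothetical, chosen only for illustration.

```python
from typing import Dict, Iterable


def validate_paths(expected_paths: Iterable[str],
                   flow_to_path: Dict[str, str],
                   flow_loss: Dict[str, float],
                   max_loss: float = 0.0) -> Dict[str, str]:
    """Return {path: problem}; an empty dict means every path was proven good."""
    problems: Dict[str, str] = {}
    for path in expected_paths:
        flows = [f for f, p in flow_to_path.items() if p == path]
        if not flows:
            # No test flow was observed on this path, so it is unproven.
            problems[path] = "not exercised by any test flow"
            continue
        # A flow with no loss measurement is treated as fully lost.
        worst = max(flow_loss.get(f, 1.0) for f in flows)
        if worst > max_loss:
            problems[path] = f"loss {worst:.0%} on at least one flow"
    return problems


if __name__ == "__main__":
    # Two "adjacent" test VMs that both happened to hash onto the same link.
    observed = {"vm1->gw": "linkA", "vm2->gw": "linkA"}
    loss = {"vm1->gw": 0.0, "vm2->gw": 0.0}
    print(validate_paths(["linkA", "linkB"], observed, loss))
```

In this example both flows are loss-free, yet the check still fails because nothing was ever seen on `linkB`, which is exactly the case that ping-style monitoring and interface counters tend to miss.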

Any suggestions? Surely other vendors and/or admins have screwed this up in subtle ways enough times that this
knowledge exists? (My Google-fu is usually pretty good, but I'm striking out - maybe I'm using the wrong terms.)

-Adam

Adam Thompson
Consultant, Infrastructure Services

100 - 135 Innovation Drive
Winnipeg, MB, R3T 6A8
(204) 977-6824 or 1-800-430-6404 (MB only)
athompson () merlin mb ca
www.merlin.mb.ca
