nanog mailing list archives

Re: Do you care about "gray" failures? Can we (network academics) help? A 10-min survey


From: "Vanbever Laurent" <lvanbever () ethz ch>
Date: Thu, 8 Jul 2021 14:43:22 +0000

Hi Jörg,

Thanks for sharing your gray failure! Having lasted a few years, it might well be the oldest gray failure ever
monitored continuously :-) I'm pretty sure you guys have exhausted all options already but... did you check for
micro-bursts that may cause sudden buffer overflows? Or is your probing traffic perhaps already high priority?

Best,
Laurent

On 8 Jul 2021, at 15:58, Jörg Kost <jk () ip-clear de> wrote:

We have a similar gray issue, where switches in a virtual chassis configuration with a Layer 3 setup seem to
randomly lose transit ICMP messages such as echo or echo-reply. We once estimated the loss at around 0.00012%
(leaving aside variance and measurement error).

We noticed this when we replaced Nagios with some burstier, more trigger-happy monitoring software a few years back.
Since then, it has been reporting false positives from time to time, and this can become annoying.

Despite spending a lot of time debugging this, we never had a breakthrough in finding the root cause; we are just
looking to replace things in the next year.

On 8 Jul 2021, at 15:28, Mark Tinka wrote:

On 7/8/21 15:22, Vanbever Laurent wrote:

Did you folks manage to understand what was causing the gray issue in the first place?

Nope, still chasing it. We suspect a FIB issue on a transit device, but we are currently building a test to confirm.

Mark.

