nanog mailing list archives

RE: A survey about networking incidents


From: "Aaron Gould" <aaron1 () gvtc com>
Date: Thu, 24 Jan 2019 11:31:40 -0600

This seems even harder in a MEF/SP-type Layer 2 emulated network built from E-Line, E-LAN, and E-Tree type 
services…

 

Yeah, it seems that you have to generate synthetic traffic and insert it into the data path to have something to measure…

 

Isn’t CFM/Ethernet OAM supposed to segment the network into management domains of responsibility with MIPs/MEPs, etc., 
so that you can monitor your system in real time and others can monitor theirs? I have not set this up, but I thought 
that was one way of knowing the ongoing state of the network, link by link and endpoint to endpoint. I think CCMs 
(continuity check messages) flow continuously to give you an idea of whether links and services are good or not.
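
A minimal Python sketch of the continuity-check idea, for illustration only. It assumes the 802.1ag/Y.1731 rule that loss of continuity is declared when no CCM arrives from a peer MEP within 3.5 times the CCM interval; the class and field names are hypothetical and do not come from any vendor API.

```python
from dataclasses import dataclass
from typing import Optional

# IEEE 802.1ag / ITU-T Y.1731: loss of continuity is declared when no CCM
# arrives from a peer MEP within 3.5 times the CCM transmission interval.
LOC_MULTIPLIER = 3.5


@dataclass
class RemoteMepState:
    """Hypothetical tracker for one remote MEP (names are illustrative)."""
    mep_id: int
    ccm_interval_s: float
    last_ccm_s: Optional[float] = None  # time the last CCM was seen, if any

    def on_ccm(self, now_s: float) -> None:
        """Record the arrival time of a CCM from this remote MEP."""
        self.last_ccm_s = now_s

    def is_up(self, now_s: float) -> bool:
        """Return True while the last CCM is newer than 3.5 intervals."""
        if self.last_ccm_s is None:
            return False  # never heard from this MEP
        return (now_s - self.last_ccm_s) < LOC_MULTIPLIER * self.ccm_interval_s


if __name__ == "__main__":
    mep = RemoteMepState(mep_id=101, ccm_interval_s=1.0)
    mep.on_ccm(now_s=10.0)
    print(mep.is_up(now_s=12.0))  # 2 s since last CCM, inside the window
    print(mep.is_up(now_s=14.0))  # 4 s since last CCM, outside the window
```

A timestamped log of when each remote MEP went up or down is the kind of record you could hand to someone who is blaming the network.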

 

Perhaps that’s the proof you could point to when anyone tries to blame the network.

 

I’m sure there are other ways… like Cisco’s IP SLA, Accedian’s PAA, TWAMP (I just remembered TWAMP, and I think 
it’s perhaps an IP-layer counterpart to Ethernet-layer CFM/OAM; I could be wrong)… but as I think about it, I 
recall MPLS OAM, and perhaps others too.
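
For the TWAMP-style approach, a rough Python sketch of the arithmetic behind a two-way probe. It assumes the usual four timestamps (sender transmit, reflector receive, reflector transmit, sender receive); the type and function names are made up for illustration and this is not a TWAMP implementation.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class ProbeSample:
    """Four timestamps, in seconds, for one reflected test packet."""
    t1: float  # sender transmit
    t2: float  # reflector receive
    t3: float  # reflector transmit
    t4: float  # sender receive


def round_trip_delay(s: ProbeSample) -> float:
    """Round-trip time minus the time the packet sat in the reflector."""
    return (s.t4 - s.t1) - (s.t3 - s.t2)


def summarize(samples: List[ProbeSample], sent: int) -> Dict[str, float]:
    """Loss ratio and min/avg/max round-trip delay for the replies received."""
    delays = [round_trip_delay(s) for s in samples]
    loss = 1.0 - len(delays) / sent if sent else 0.0
    if not delays:
        return {"loss": loss}
    return {
        "loss": loss,
        "min": min(delays),
        "avg": sum(delays) / len(delays),
        "max": max(delays),
    }


if __name__ == "__main__":
    replies = [ProbeSample(t1=0.000, t2=0.010, t3=0.012, t4=0.025)]
    print(summarize(replies, sent=4))
```

Subtracting the reflector's processing time (t3 - t2) means the two clocks do not need to be synchronized for the round-trip figure, only for one-way delay.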

 

Yes, as network engineers, I/we continually have to clear my name (clear the network) of blame.

 

-Aaron

 

p.s. I’ll try to look at the survey later

 

 

 

From: NANOG [mailto:nanog-bounces+aaron1=gvtc.com () nanog org] On Behalf Of Yu, Minlan
Sent: Wednesday, January 23, 2019 9:32 AM
To: nanog () nanog org
Subject: A survey about networking incidents

 

Hi Nanog,

 

We all know that networks are at the heart of many of the systems we use today. When these systems break, the 
underlying networks are often the first suspects. Networks are hard to diagnose and they are most likely to be blamed 
for problems even if they are completely healthy. As networking engineers, we have all seen cases where another part of 
the system was causing an issue but the network remained the suspect until the problem was resolved.

 

We are researchers from Harvard and The University of Pennsylvania who are interested in understanding this problem and 
its impact better in order to build a solution. Our goal is to be able to quickly rule out the network as a root cause 
for incidents in order to be able to speed up diagnosis and also to improve operator efficiency. We are interested in 
learning the answer to a few questions. Specifically, we would like to know: How often do you see problems where the 
network is blamed but after investigating you find the problem to be caused by some other part of the system? How often 
have you had incidents where the cause of the incident was outside of the boundary of your organization? How much do 
you think fixing this problem can help you and your organization more quickly diagnose problems?

 

We have created a *very* short survey to be able to get an operator's perspective on these questions. It should take 
less than 15 minutes to finish. The findings should help us as well as the research community at large to be able to 
build a solution that can benefit all types of networks, of different sizes, to improve how they do the diagnosis. We 
will be presenting the results of this anonymous survey in a scientific article later this year. We will report back 
on our research once it's finished.

 

Survey URL: https://docs.google.com/forms/d/e/1FAIpQLScx-U54eQFQi5AdBCOOucMaI6BVmLwcMFiZl2HVZ9bHi1q8bA/viewform

 

We would greatly appreciate it if you could help us with this research. Please feel free to forward this survey to other 
operators you know. Thank you!

 

Minlan Yu

http://minlanyu.seas.harvard.edu/

