nanog mailing list archives

Re: tools and techniques to pinpoint and respond to loss on a path


From: Jared Mauch <jared () puck nether net>
Date: Mon, 15 Jul 2013 17:30:42 -0400


On Jul 15, 2013, at 5:18 PM, Andy Litzinger <Andy.Litzinger () theplatform com> wrote:

 I'd like to be able to collect enough relevant data to pinpoint the trouble spot as much as possible so I can take 
it to the ISPs and request a solution.  The blackouts are so quick that it's impossible to log in and get a trace, 
hence the desire to automate it.

I can provide more details off list if helpful. I'm trying not to vilify anyone, especially without copious amounts 
of data points.

As a side question, what should my expectation be regarding packet loss when sending packets from point A to point B 
across multiple providers across the internet?  Is 30 seconds to a minute of blackout between two destinations every 
couple of weeks par for the course?  My directly connected ISPs offer me an SLA, but what should I reasonably expect 
from them when one of their upstream peers (or a peer of their peers) has issues?  If this turns out to be BGP 
reconvergence or similar do I have any options?

I think there are a number of tools available to detect if something is happening:

1) iperf (test network/bw usage)
2) owamp (one-way ping) - you can use this to detect when reordering or other events happen.  this will collect nearly 
continuous data.  requires good NTP references, or accepting that you may see skewed data.
3) some other udp/low latency responder.  I've built something of my own that does this, I can provide a pointer if you 
are interested.  I have graphs of my connection at home to someplace remote that crosses 3 carriers.  you can see the 
queuing delay increase throughout the day until peak times and taper off at night.  no loss, but the increase is quite 
visible.
4) some vendor SLA/SAA product.  Cisco and others have SAA responders that work on their devices you can configure to 
collect data.
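
As an illustration of item 3, here is a minimal sketch of a UDP echo responder and prober.  It is not the tool 
mentioned in this message; the names run_responder, probe and loss_runs, the 12-byte packet layout and the 
default timings are all assumptions made for the example.  It sends sequence-numbered probes, records the round 
trip time of each, and groups consecutive losses into runs so a short blackout shows up as one event.

```python
import socket
import struct
import threading
import time

PROBE = struct.Struct("!Id")  # hypothetical layout: sequence number, send time


def run_responder(sock, stop):
    """Echo every datagram back to its sender until `stop` is set."""
    sock.settimeout(0.1)
    while not stop.is_set():
        try:
            data, addr = sock.recvfrom(64)
        except socket.timeout:
            continue
        sock.sendto(data, addr)


def probe(target, count=20, interval=0.01, timeout=0.2):
    """Send `count` probes to `target`; return {seq: rtt_seconds or None}."""
    results = {}
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
        s.settimeout(timeout)
        for seq in range(count):
            results[seq] = None
            s.sendto(PROBE.pack(seq, time.monotonic()), target)
            try:
                while True:
                    data, _ = s.recvfrom(64)
                    if len(data) != PROBE.size:
                        continue  # not one of our probes
                    rseq, rsent = PROBE.unpack(data)
                    if rseq == seq:
                        results[seq] = time.monotonic() - rsent
                        break
                    # a late reply to an earlier probe is ignored
            except socket.timeout:
                pass  # no reply in time: counted as lost
            time.sleep(interval)
    return results


def loss_runs(results):
    """Group consecutive lost sequence numbers into (first, last) runs."""
    runs, start, prev = [], None, None
    for seq in sorted(results):
        if results[seq] is None:
            if start is None:
                start = seq
            prev = seq
        elif start is not None:
            runs.append((start, prev))
            start = None
    if start is not None:
        runs.append((start, prev))
    return runs


if __name__ == "__main__":
    # Loopback demo only; on a real path the responder runs at the far end.
    srv = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    srv.bind(("127.0.0.1", 0))
    stop = threading.Event()
    worker = threading.Thread(target=run_responder, args=(srv, stop), daemon=True)
    worker.start()
    rtts = probe(srv.getsockname(), count=5)
    stop.set()
    worker.join()
    srv.close()
    print("rtts:", rtts)
    print("loss runs:", loss_runs(rtts))
```

Because the prober timestamps with its own clock and measures round trips, it does not need the NTP accuracy 
that one-way measurement does, but it also cannot tell which direction lost the packet.  Logging wall-clock 
time alongside each loss run would make it easier to line an outage up with a provider's records.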

That being said, I would expect losing connectivity for 30 seconds once every 2 weeks to be fairly common.  Someone 
will be doing network upgrades/work, or there will be a hardware/transmission error, etc.

30 seconds sounds a lot like BGP convergence, and on older platforms, e.g. 6500/sup720, expect about 8k prefixes/second 
max to be downloaded into the TCAM/FIB.  with 400k+ prefixes, it takes a while to pump the tables into the forwarding 
side.
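
The arithmetic behind those figures can be checked with a short calculation.  This is only a sketch: it 
assumes a constant install rate and that the whole table has to be reprogrammed, and the function names are 
made up for the example.

```python
def fib_download_seconds(prefixes, prefixes_per_second):
    """Seconds to program `prefixes` routes at a constant install rate."""
    return prefixes / prefixes_per_second


def prefixes_installed(seconds, prefixes_per_second):
    """Routes programmed after `seconds` at a constant install rate."""
    return seconds * prefixes_per_second


print(fib_download_seconds(400_000, 8_000))  # 50.0
print(prefixes_installed(30, 8_000))         # 240000
```

So a full 400k-prefix table at 8k prefixes/second takes about 50 seconds, and a 30 second gap corresponds to 
roughly 240k prefixes at that rate, which is the same order of magnitude as the blackouts described.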

- Jared
