nanog mailing list archives

Re: CenturyLink RCA?


From: Töma Gavrichenkov <ximaera () gmail com>
Date: Sun, 30 Dec 2018 20:46:05 +0300

There's a Reddit user claiming he works at CL who says the cause was some
faulty Infinera DTN-X instances.

https://www.reddit.com/r/centurylink/comments/aa2qa4/comment/ecovgab

(dunno though why the user posted that to Reddit and not here)

On 30 Dec 2018, 20:19, Saku Ytti <saku () ytti fi> wrote:

Hey John,

Your criticism is warranted, but it would also be addressed by the
explanation that DCN/OOB was the source of the problem.

At any rate, I am looking forward to stopping the speculation and
reading a post-mortem written by someone who knows how networks work.

On Sun, 30 Dec 2018 at 18:28, John Von Essen <john () essenz com> wrote:

One thing that is troubling when reading that URL is that it appears
several steps of restoration required teams to go onsite for local login,
etc. Granted, to troubleshoot hardware you need to be physically present
to pop a line card in and out, but CTL/LVL3 should have full out-of-band
console and power control to all core devices; we shouldn't be waiting for
someone to drive to a location to get console or do power cycling. And I
would imagine the first step in a lot of the troubleshooting was power
cycling and local console logs.


-John



On 12/30/18 10:42 AM, Mike Hammett wrote:

It's technical enough so that laypeople immediately lose interest, yet
completely useless to anyone that works with this stuff.



-----
Mike Hammett
Intelligent Computing Solutions
http://www.ics-il.com

Midwest-IX
http://www.midwest-ix.com

________________________________
From: "Saku Ytti" <saku () ytti fi>
To: "nanog list" <nanog () nanog org>
Sent: Sunday, December 30, 2018 7:42:49 AM
Subject: CenturyLink RCA?

Apologies for the URL, I do not know the official source and I do not
share the URL's sentiment.
https://fuckingcenturylink.com/

Can someone translate this into IP-engineer terms? What actually happened?
From my own history, I rarely recognise the problem I fixed from
reading the public RCA. I hope CenturyLink will do better.

Best guess so far that I've heard is:

a) CenturyLink runs a global L2 DCN/OOB
b) there was a HW fault which caused an L2 loop (perhaps the HW dropped
BPDUs; I've had this failure mode)
c) DCN had direct access to the control-plane, and the L2 loop congested
control-plane resources, causing it to deprovision waves
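
To illustrate why the L2 loop guessed at in (b) can swamp anything attached to it, here is a toy Python sketch. It is not a model of CenturyLink's network or of any vendor's equipment: the four-switch topologies, the switch names, and the function `flood_steps` are all hypothetical. It only counts copies of a single broadcast frame under plain flooding, assuming no spanning tree is blocking any port (for example because BPDUs are being dropped).

```python
def flood_steps(links, start, steps):
    """Count copies of one broadcast frame in flight after each hop.

    links: dict mapping a (hypothetical) switch name to its neighbours.
    A frame copy is modelled as (current_switch, arrived_from).
    """
    # The ingress switch floods the frame to all of its neighbours.
    in_flight = [(nbr, start) for nbr in links[start]]
    counts = [len(in_flight)]
    for _ in range(steps - 1):
        nxt = []
        for here, came_from in in_flight:
            # Flood out of every port except the one the copy arrived on.
            # Ethernet has no TTL, so nothing ever retires a copy.
            for nbr in links[here]:
                if nbr != came_from:
                    nxt.append((nbr, here))
        in_flight = nxt
        counts.append(len(in_flight))
    return counts


# Loop-free chain: A - B - C - D
chain = {"A": ["B"], "B": ["A", "C"], "C": ["B", "D"], "D": ["C"]}
# Full mesh of four switches with no port blocked (assumed failure case)
mesh = {s: [t for t in "ABCD" if t != s] for s in "ABCD"}

chain_counts = flood_steps(chain, "A", 6)
mesh_counts = flood_steps(mesh, "A", 6)
print("chain:", chain_counts)
print("mesh: ", mesh_counts)
```

In the loop-free chain the frame dies out once it reaches the far end; in the looped mesh the number of copies doubles every hop, which is the kind of load that could plausibly starve an unprotected control-plane as speculated in (c).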

Now of course this is entirely speculation, but intended to show what
type of explanation is acceptable and can be used to fix things.
Hopefully CenturyLink does come out with an explanation readable by IP
engineers, so that we may use it as leverage to support work in our
own domains to remove such risks.

a) do not run L2 DCN/OOB
b) do not connect MGMT ETH (it is unprotected access to control-plane,
it cannot be protected by CoPP/lo0 filter/LPTS etc.)
c) do add an item to your RFP scoring for a proper OOB port (like Cisco CMP)
d) do fail optical network up

--
  ++ytti




