nanog mailing list archives

Re: Spanning tree melt down ?


From: Daniel Golding <dgold () FDFNet Net>
Date: Fri, 29 Nov 2002 11:46:10 -0600 (CST)



Well, yes, they were. But don't blame Cisco - it's not like they held a gun
to anyone's head. Of course, there is also the possibility that the
hospital IT folks said "if you had just agreed to our capital requests
last year, none of this would have happened" and the money tap got turned
on. This could have been long-deferred equipment. Or, it could have been a
panic-buy. Either way, the buck stops with the hospital, rather than the
vendor.

- Dan

On Thu, 28 Nov 2002, Robert A. Hayden wrote:


I'm still failing to see why this required a $3M forklift of new equipment
to correct the problem.  Was this just Cisco sales pouncing on someone's
misfortune as a way to push new stuff?

On Thu, 28 Nov 2002, Stephen J. Wilcox wrote:


Heh, so they kept bolting stuff on, and a failure somewhere caused a spanning
tree change which, because of over-complexity and out-of-date config, was
unable to converge.

Ah yes, Occam also applies to switch topology :)

Steve

On Fri, 29 Nov 2002, Simon Lyall wrote:


On Thu, 28 Nov 2002, Garrett Allen wrote:
speculating on cause and effect, my first bet would be that someone turned off
spanning tree on a trunk or trunks immediately prior to the flood.  my next
bet would be a babbling device - i've seen an unauthorized hub on a flat
layer 2 net basically shut the network down.  it was after a power hit.
when we found the bugger and power cycled it, all was well.  i don't think
that the researcher was the culprit.  more likely the victim.

This article had some more information:

http://www.nwfusion.com/news/2002/1125bethisrael.html

This slashdot article also seems to have some details:

http://slashdot.org/comments.pl?sid=46238&cid=4770093

Text as follows:

 I contacted Dr. John D. Halamka to see if he could provide more detail on
the network outage. Dr. Halamka is the chief information officer for
CareGroup Health System, the parent company of the Beth Israel Deaconess
medical center. His reply is as follows: "Here's the technical explanation
for you. When TAC was first able to access and assess the network, we
found the Layer 2 structure of the network to be unstable and out of
specification with 802.1D standards. The management VLAN (VLAN 1) had in
some locations 10 Layer 2 hops from root. The conservative default values
for the Spanning Tree Protocol (STP) impose a maximum network diameter of
seven. This means that two distinct bridges in the network should not be
more than seven hops away from each other. Part of this restriction
comes from the age field that Bridge Protocol Data Units (BPDUs) carry: when
a BPDU is propagated from the root bridge towards the leaves of the tree,
the age field is incremented each time it goes through a bridge.
Eventually, when the age field of a BPDU goes beyond max age, it is
discarded. Typically, this will occur if the root is too far away from
some bridges of the network. This issue will impact convergence of the
spanning tree. A major contributor to this STP issue was the PACS network
and its connection to the CareGroup network. To eliminate its influence on
the CareGroup network we isolated it with a Layer 3 boundary. All
redundancy in the network was removed to ensure no STP loops were
possible. Full connectivity was restored to remote devices and networks
that were disconnected in troubleshooting efforts prior to TAC's
involvement. Redundancy was returned between the core campus devices.
Spanning Tree was stabilized and localized issues were pursued. Thanks for
your support. CIO Magazine will devote the February issue to this event
and Harvard Business School is doing a case study."
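
For anyone who wants to see the arithmetic behind the "diameter of seven"
remark, here is a rough Python sketch. It is not from the quoted reply. It
assumes the timer formulas commonly given in vendor STP tuning guides
(max age = 4*hello + 2*diameter - 2, forward delay = (4*hello +
3*diameter)/2), the 802.1D defaults of hello 2 s and max age 20 s, and a
message age that grows by one second per bridge hop. The function names are
made up for illustration.

```python
# Sketch only: the formulas, defaults, and one-second-per-hop ageing below
# are assumptions for illustration, not taken from the quoted explanation.

HELLO_TIME = 2   # seconds, assumed 802.1D default
MAX_AGE = 20     # seconds, assumed 802.1D default


def timers_for_diameter(diameter, hello=HELLO_TIME):
    """Return (max_age, forward_delay) in seconds needed for a diameter."""
    max_age = 4 * hello + 2 * diameter - 2
    forward_delay = (4 * hello + 3 * diameter) / 2
    return max_age, forward_delay


def message_age_at(hops_from_root, increment=1):
    """Message age carried by a BPDU after crossing hops_from_root bridges."""
    return hops_from_root * increment


def bpdu_survives(hops_from_root, max_age=MAX_AGE, increment=1):
    """A BPDU is treated as discarded once its age reaches max_age."""
    return message_age_at(hops_from_root, increment) < max_age


if __name__ == "__main__":
    for diameter in (7, 10):
        max_age, forward_delay = timers_for_diameter(diameter)
        print(f"diameter {diameter}: max age {max_age}s, "
              f"forward delay {forward_delay}s")
```

Under these assumptions a diameter of 7 gives exactly the default max age of
20 s (and 14.5 s forward delay, rounded up to the default 15 s), while a
10-hop diameter would call for 26 s and 19 s. So a 10-hop VLAN running
default timers is outside what the defaults were sized for, even though a
BPDU's age alone would not reach 20 until much further out.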


 --
Simon Lyall.                |  Newsmaster  | Work: simon.lyall () ihug co nz
Senior Network/System Admin |  Postmaster  | Home: simon () darkmere gen nz
ihug, Auckland, NZ          | Asst Doorman | Web: http://www.darkmere.gen.nz








