nanog mailing list archives

Re: Heads-Up: GoDaddy Broke the Interwebs...


From: Jared Mauch <jared () puck nether net>
Date: Tue, 11 Sep 2012 17:08:08 -0400


On Sep 11, 2012, at 4:53 PM, Rubens Kuhl <rubensk () gmail com> wrote:

> That doesn't mean that their description of the internal error fits
> what happened

Any time I've seen a real RFO, it has taken more than 24 hours to collect the data.  Sometimes you actually don't know
what happened.  There's a reason for this comic: http://www.dilbert.com/strips/comic/1999-08-04/  (the reboot cleared
the problem).

I've seen many odd device behaviors that nobody could explain, including the vendors.  Sometimes it takes a few
years to understand what happened.  I recall a case where, 2-3 years after a major outage, someone made a minor
comment about their architecture and a light came on.

I welcome more information about mistakes and errors that we can all learn from.  Sharing that information can be hard
or uncomfortable at times, but it helps others learn and avoid making the same mistakes.  On the recommendation of
others, I have started to read "Normal Accidents".  Amazon link: http://tinyurl.com/9dc6x98

The whole multiple-failures problem really makes me concerned about cascading system failures when things go wrong.

- Jared
