nanog mailing list archives
RE: FYI Netflix is down
From: "Dan Golding" <dgolding () ragingwire com>
Date: Mon, 2 Jul 2012 12:25:54 -0700
-----Original Message-----
From: Leo Bicknell [mailto:bicknell () ufp org]
I want to emphasize _and test_.
[snip]
I used to work with a guy who had a simple test for these things, and if I was a VP at Amazon, Netflix, or any other large company I would do the same. About once a month he would walk out on the floor of the data center and break something. Pull out an ethernet. Unplug a server. Flip a breaker.
*DING DING* - we have a winner! In a previous life, I used to spend a lot of time in other people's data centers. The key question to ask was how often they pulled the plug - i.e. disconnected utility power without having backup generators running. Simulating an actual failure. That goes for pulling out an Ethernet cord or unplugging a server, or flipping a breaker. It's all the same.
The problem is that if you don't do this for a while, you get SCARED of doing it, and you stop doing it. The longer you go without, the scarier it gets, to the point where you will never do it, because you have no idea what will happen, other than that you will probably get fired. This is called "horrible engineering management", and is very common.
The other problem, of course, is that people design under the assumption that everything will always work, and that failure modes, when they occur, are predictable and fall into a narrow set. Multiple failure modes? Not tested. Failure modes including operator error? Never tested. When was the last time you had a drill?
- Dan
Then he would wait, to see how long before a technician came to fix it.
If these activities were service impacting to customers, the engineering or implementation was faulty, and remediation was performed. Assuming they acted as designed and the customers saw no faults, the team was graded on how quickly they detected and corrected the outage.
I've seen too many companies whose "test" is planned months in advance, and who exclude the parts they think aren't up to scratch from the test. Then an event occurs, and they fail, and take down customers.
TL;DR If you're not confident your operation could withstand someone walking into your data center and randomly doing something, you are NOT redundant.
-- Leo Bicknell - bicknell () ufp org - CCIE 3440 PGP keys at http://www.ufp.org/~bicknell/