nanog mailing list archives

Re: Cloudflare is down


From: George Herbert <george.herbert () gmail com>
Date: Mon, 4 Mar 2013 15:00:26 -0800

On Mon, Mar 4, 2013 at 10:40 AM, Saku Ytti <saku () ytti fi> wrote:
On (2013-03-04 13:23 -0500), Jeff Wheeler wrote:

We have lots of stupid people in our industry because so few
understand "The Way Things Work."

We tend to view the mistakes we make as unavoidable human errors and
the mistakes other people make as avoidable stupidity.

We should actively plan for mistakes/errors. If you actively plan for no
'stupid mistakes', you're gonna have a bad time.

From my point of view, outages are caused by:
1) operator
2) software defect
3) hardware defect

Most people design only against 3), often with a design that actually
increases the likelihood of 2) and 1), reducing overall MTBF in a design
that, strictly in theory, increases it.
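
A rough way to see the arithmetic behind this point, as a sketch only: if the three outage causes are treated as independent with constant failure rates, the rates add, and the overall MTBF is the reciprocal of the sum. The per-cause figures below are hypothetical and chosen purely for illustration; they are not measurements from any real network.

```python
def combined_mtbf(mtbf_hours_by_cause):
    """Overall MTBF when independent outage causes each have a constant rate.

    Failure rates (1 / MTBF) add, so the combined MTBF is 1 / sum(rates).
    """
    total_rate = sum(1.0 / hours for hours in mtbf_hours_by_cause.values())
    return 1.0 / total_rate


# Hypothetical numbers: hours between outages attributed to each cause.
simple_design = {"operator": 20000.0, "software": 40000.0, "hardware": 10000.0}

# Assumed effect of adding hardware redundancy: hardware outages become rare,
# but the extra complexity makes operator and software outages more frequent.
redundant_design = {"operator": 8000.0, "software": 15000.0, "hardware": 200000.0}

simple_mtbf = combined_mtbf(simple_design)
redundant_mtbf = combined_mtbf(redundant_design)

print(f"simple design MTBF:    {simple_mtbf:.0f} hours")
print(f"redundant design MTBF: {redundant_mtbf:.0f} hours")
```

With these assumed inputs the redundant design ends up with the lower overall MTBF, even though its hardware MTBF is twenty times better. Different inputs can easily reverse that result.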

...And a lot of people who know the hierarchy solve 3 and then solve 2
with multiple parallel environments running different vendors'
equipment, only to find that 1 increased due to the additional
complexity.

On the other hand, I've seen people who had horrible explosions of 2
or 3 due to ignoring all but 1.

If you ACTUALLY need that many 9s, you need all of redundancy,
diversity of vendors, and suitably trained, exercised,
process-supported net admins.  That costs a few multiples of 2 more
than nearly anyone wants to pay.


-- 
-george william herbert
george.herbert () gmail com

