nanog mailing list archives
Re: Mitigating human error in the SP
From: Jared Mauch <jared () puck nether net>
Date: Tue, 2 Feb 2010 12:33:50 -0500
We have solved 98% of this with standard configurations and templates. To deviate from them requires management approval/exception approval after an evaluation of the business risks.

Automation of config building is not too hard, and certainly things like peer-groups (Cisco) and regular groups (Juniper) make it easier.

If you go for the holy grail, you want something that takes into account the following:

1) each phase in the provisioning/turn-up state
2) each phase in infrastructure troubleshooting (turn-up, temporary outage/temporary testing, production)
3) automated pushing of config via load override/commit replace to your config space.

Obviously testing, etc. is important. I've found that whenever a human is involved, mistakes happen.

There is also the "Software is imperfect" mantra that should be repeated. I find vendors at times have demanding customers who want perfection.

Bugs happen and outages happen; the question is how you respond to these risks. If you have poor handling of bugs, outages, etc. in your process, or are decision gridlocked, very bad things happen.

- Jared

On Feb 1, 2010, at 9:21 PM, Chadwick Sorrell wrote:
Hello NANOG,

Long time listener, first time caller.

A recent organizational change at my company has put someone in charge who is determined to make things perfect. We are a service provider, not an enterprise company, and our business is doing provisioning work during the day. We recently experienced an outage when an engineer, troubleshooting a failed turn-up, changed the ethertype on the wrong port losing both management and customer data on said device. This isn't a common occurrence, and the engineer in question has a pristine track record.

This outage, of a high profile customer, triggered upper management to react by calling a meeting just days after. Put bluntly, we've been told "Human errors are unacceptable, and they will be completely eliminated. One is too many."

I am asking the respectable NANOG engineers....

What measures have you taken to mitigate human mistakes?

Have they been successful?

Any other comments on the subject would be appreciated, we would like to come to our next meeting armed and dangerous.

Thanks!
Chad
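As a rough illustration of the "standard configurations and templates" approach described in the reply, the sketch below renders a peer stanza from one fixed template and refuses a non-standard value unless an exception reference is supplied. Everything in it is an assumption made for illustration: the names `PEER_TEMPLATE`, `STANDARD_PEER_GROUPS`, `render_peer`, and `approval_id` are hypothetical, the config lines are only Cisco-style examples, and none of it is the tooling actually used by anyone in this thread.

```python
from string import Template

# Hypothetical standard template. The lines are Cisco-style and purely
# illustrative; a real deployment would keep its own vetted templates.
PEER_TEMPLATE = Template(
    "neighbor $peer_ip remote-as $peer_as\n"
    "neighbor $peer_ip peer-group $peer_group\n"
    "neighbor $peer_ip description $description\n"
)

# Values allowed without an exception (an assumed policy for this sketch).
STANDARD_PEER_GROUPS = {"CUSTOMER", "PEER", "TRANSIT"}


class DeviationError(Exception):
    """Raised when a request deviates from the standard without approval."""


def render_peer(params, approval_id=None):
    """Render one peer stanza from the standard template.

    A non-standard peer group is accepted only when an approval
    reference is supplied, mirroring the exception process.
    """
    if params["peer_group"] not in STANDARD_PEER_GROUPS and not approval_id:
        raise DeviationError(
            "non-standard peer group %r needs an approved exception"
            % params["peer_group"]
        )
    # substitute() raises KeyError if any template field is missing,
    # so an incomplete request never produces a partial config.
    return PEER_TEMPLATE.substitute(params)


if __name__ == "__main__":
    print(render_peer({
        "peer_ip": "192.0.2.1",      # documentation address range
        "peer_as": "64500",          # documentation ASN
        "peer_group": "CUSTOMER",
        "description": "example-customer",
    }))
```

The point of the design is that the engineer supplies only parameters, never free-form config, so the set of possible mistakes shrinks to the set of possible parameter values. Pushing the rendered output to a device, and testing it first, is deliberately left out of the sketch.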