nanog mailing list archives

RE: Data Center testing


From: Deepak Jain <deepak () ai net>
Date: Mon, 24 Aug 2009 16:03:51 -0400


Thanks for the kind words Ken.

Power failure testing and network testing are very different disciplines. 

We operate from the point of view that if a failure occurs because we have scheduled testing, it is far better since we 
have the resources on-site to address it (as opposed to an unplanned event during a hurricane). Not everyone has this 
philosophy. 

This is one of the reasons we do monthly or bimonthly, full live load transfer tests on power at every facility we own 
and control during the morning hours (~10:00am local time on a weekday, run on gensets for up to two hours). Of course 
there is sufficient staff and contingency planning on-site to handle almost anything that comes up. The goal is to have 
a measurable "good" outcome at our highest reasonable load levels [temperature, data load, etc].

We don't hesitate to show our customers and auditors our testing and maintenance logs, go over our procedures, etc. 
They can even watch events if they want (we provide the ear protection). I don't think any facility of any significant 
size can operate differently and do it well.

This is NOT advisable for folks who do not do proper preventative maintenance on their transfer busways, PDUs, 
switches, batteries, transformers and of course generators. The goal is to identify questionable relays, switches, 
breakers and other items that may fail in an actual emergency.

On the network side, during scheduled maintenance we do live failovers -- sometimes as dramatic as pulling the cable 
without preemptively removing traffic. Part of *our* procedures is to make sure it reroutes and heals the way it is 
supposed to before the work actually starts. Often network and topology changes happen over time and no one has had a 
chance to actually test that all the "glue" works right. Regular planned maintenance (if you have a fast reroute capability 
in your network) is a very good way to handle it. 
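
To make "reroutes and heals the way it is supposed to" something you can measure, here is a minimal sketch in Python. It assumes you already collect timestamped reachability probes across the path while the cable is pulled. The function names, the probe format and the outage budget are all hypothetical and not taken from any particular tool, so adapt them to whatever you actually run.

```python
from typing import List, Tuple

# Hypothetical probe record: (timestamp in seconds, probe succeeded?)
Probe = Tuple[float, bool]


def longest_outage(probes: List[Probe]) -> float:
    """Longest gap from the first failed probe to the next successful one.

    Probes must be sorted by timestamp. If the path never recovers
    within the sample window, infinity is returned.
    """
    longest = 0.0
    outage_start = None
    for ts, ok in probes:
        if not ok and outage_start is None:
            outage_start = ts  # outage begins at the first failure
        elif ok and outage_start is not None:
            longest = max(longest, ts - outage_start)
            outage_start = None  # path healed
    if outage_start is not None:
        return float("inf")  # still down when sampling ended
    return longest


def failover_passed(probes: List[Probe], budget_s: float) -> bool:
    """True if the worst observed outage fits the (assumed) budget."""
    return longest_outage(probes) <= budget_s


if __name__ == "__main__":
    # Illustrative samples only, not real measurements.
    samples = [(0.0, True), (1.0, True), (2.0, False),
               (3.0, False), (4.0, True), (5.0, True)]
    print(longest_outage(samples), failover_passed(samples, budget_s=5.0))
```

The measured outage is only as fine-grained as the probe interval, so probe faster than the reroute time you expect to see.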

For sensitive trunk links and non-invasive maintenance, it is nice to softly remove traffic via local pref or whatever 
in advance of the maintenance to minimize jitter during a major event. 
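
Before touching a link you have drained this way, it is worth confirming that the traffic has really moved. A minimal sketch, assuming you can pull recent utilization samples for the link (from SNMP counters or whatever you use). The function name, the threshold and the number of consecutive samples are hypothetical choices, not a recommendation.

```python
from typing import Sequence


def is_drained(samples_bps: Sequence[float],
               threshold_bps: float,
               required_consecutive: int = 3) -> bool:
    """True if the most recent samples are all at or below the threshold.

    samples_bps is ordered oldest to newest. Requiring several
    consecutive low samples avoids acting on a momentary dip.
    """
    if required_consecutive < 1:
        raise ValueError("required_consecutive must be at least 1")
    if len(samples_bps) < required_consecutive:
        return False  # not enough data to decide
    recent = samples_bps[-required_consecutive:]
    return all(s <= threshold_bps for s in recent)


if __name__ == "__main__":
    # Illustrative numbers only: traffic falling off after a local-pref change.
    history = [9e8, 4e8, 1e6, 5e5, 2e5]
    print(is_drained(history, threshold_bps=2e6))
```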

As part of your plan, be prepared for things like connectors (or cables) breaking and have a plan for what you do if 
that occurs. Have a plan or a rain-date if a connector takes a long time to get out or the blade it sits in gets 
damaged. This stuff looks pretty while it's running and you don't want something that has been friction-frozen to ruin 
your window.

All of this works swimmingly until you find a vendor (X) bug. :) Not for the faint-of-heart. 

Anyone who has more specific questions, I'll be glad to answer off-line. 

Deepak Jain
AiNET

I know Peer1 in Vancouver regularly send out notifications of
"non-impacting" generator load testing, like monthly. Also InterXion
in Dublin, Ireland have occasionally sent me notification that there
was a power outage of less than a minute however their backup
successfully took the load.

I only remember one complete outage in Peer1 a few years ago... Never
seen any outage in InterXion Dublin.

Also I don't ever remember any power failure at AiNET (Deepak will
probably elaborate).

2009/8/24 Dan Snyder <sliplever () gmail com>:
Does anyone know of any data centers that do failure testing of their
networking equipment regularly? I mean to verify that everything fails
over properly after changes have been made over time. Are there any
best-practice guides for doing this?

Thanks,
Dan



