nanog mailing list archives

Re: What to expect after a cooling failure


From: Stefan Förster <cite+nanog () incertum net>
Date: Wed, 10 Jul 2013 08:46:51 +0200

* Erik Levinson <erik.levinson () uberflip com>:

[cooling failure]

> For those who have gone through such events in the past, what can
> one expect in terms of long-term impact...should we expect some
> premature component failures? Does anyone have any stats to share?

We had a similar event this January (temperatures were a bit higher at
49°C, duration was a bit shorter, 10am to 3pm). In the two days after
the event, two of our HP servers had drives that went from "OK" to
"Predictive Failure", which is the SmartArray controller's way of
reporting high error rates. Two weeks after the event, a single DIMM
had an uncorrectable ECC error, causing a server reboot. Three weeks
after the event, a single PSU failed.

In our opinion, the disk problems were caused by the cooling failure,
while the ECC error and the faulted PSU were probably not related.

I believe that your hardware will be fine, but it would be a good idea
to check whether you have current maintenance contracts or warranty
coverage for your servers, or some other way of obtaining replacement
drives at short notice.


Cheers
Stefan

