nanog mailing list archives

Re: "Hypothetical" Datacenter Overheating


From: Warren Kumari <warren () kumari net>
Date: Tue, 16 Jan 2024 08:37:09 -0800

On Mon, Jan 15, 2024 at 9:55 AM, William Herrin <bill () herrin us> wrote:

On Mon, Jan 15, 2024 at 6:08 AM Mike Hammett <nanog () ics-il net> wrote:

Let's say that hypothetically, a datacenter you're in had a cooling
failure and escalated to an average of 120 degrees before mitigations
started having an effect. What should be expected in the aftermath?

Hi Mike,

A decade or so ago I maintained a computer room with a single air
conditioner because the boss wouldn't go for n+1. It failed in exactly this
manner several times.



And in the early 2000s I worked at a (very crappy) ISP/Colo provider which
had their primary location in a small, brick garage. It *did* have
redundant AC — in the form of two large window units, stuck into a hole
which had been hacked through the brick wall. They were redundant — there
were two of them, and they were on separate circuits. What more could you
ask for?!

At 2AM one morning I'm awakened from my slumber by a warning page from the
monitoring system (Whatsup Gold. Remember Whatsup Gold?) letting me know
that the temperature is out of range. This is a fairly common occurrence,
so I ack it and go back to sleep. A short while later I'm awakened again,
and this time it's a critical alert and the temperature is really high.

So, I grumble, get dressed, and drive over to the location. I open the
door, and, yes, it really *is* hot. This is because the AC units have been
vibrating over the years, and the entire row of bricks above has popped
out. There is now an even larger hole in the wall, and both AC units are
lying outside, still running.

'Twas not a good day….
W




After the overheat was detected by the monitoring system, it would be
brought under control with a combination of a spot cooler and powering down
to a minimal configuration. But of course it takes time to get people there
and set up the mitigations, during which the heat continues to rise.

The main thing I noticed was a modest uptick in spinning drive failures
for the couple months that followed. If there was any other consequence it
was at a rate where I'd have had to be carefully measuring before and after
to detect it.

Regards,
Bill Herrin

--
William Herrin
bill () herrin us
https://bill.herrin.us/

