nanog mailing list archives

Re: availability and resiliency


From: Adrian Chadd <adrian () creative net au>
Date: Sun, 1 Oct 2000 03:13:50 +0800


On Fri, Sep 29, 2000, Valdis.Kletnieks () vt edu wrote:
On Fri, 29 Sep 2000 18:42:12 EDT, Andrew Brown said:
um...is an smp cpu configuration really going to help your uptime?  or
are there operating systems or hardware out there that can say to
themselves "hmph!  cpu 2 seems not to be working correctly...i'd
better spin it down."

IBM mainframes have been doing this for decades.  I believe that both
OS/VS1 and VM/370 for the S/370-158 supported this back in the 1973 timeframe.

About 10 years ago, our 3090-300 blew a TCM and lost one of the 3 CPUs.  As
I was sitting there diagnosing the problem at the console, I got a popup
dialog box from the onboard support processor.  Basically, it wanted to phone
IBM Hardware Support and tell them to send a guy with a new TCM, but it had
detected that it was more than 7 digits and therefore probably a long distance
phone call, was this OK?

Yes, it asked permission to rack up the phone bill before it called for
repairs itself.

Current mainframe state of the art is described in the IBM Journal of
Research and Development - Vol 43, Number 5/6 (Sep/Nov 99), which was
devoted to the G5 and G6 chipsets used in current IBM S/390 big iron.
The article "RAS strategy for IBM S/390 G5 and G6" (page 875) talks about
the system's ability to not only detect a failing CPU, but on detection
it will latch out the last known good state from the previous instruction,
and retry the failing machine instruction on a hot-spare.  That's after a
reset-and-retry on the failing processor has proven it's a hard failure and
not a soft one.
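
As a rough illustration of the sequence described above (retry on the same
CPU after a reset, then replay from the last known good state on a hot
spare), here is a minimal software sketch in Python. It is only an analogy:
the real G5/G6 mechanism is in hardware, and every name below (Cpu, CpuFault,
run_with_sparing, the "r1" register) is hypothetical and not taken from the
IBM article.

```python
class CpuFault(Exception):
    """Raised when a simulated CPU fails to complete an instruction."""


class Cpu:
    """Toy CPU model; 'healthy' and 'transient_faults' are illustrative knobs."""

    def __init__(self, name, healthy=True, transient_faults=0):
        self.name = name
        self.healthy = healthy                    # False simulates a hard failure
        self.transient_faults = transient_faults  # soft errors still to inject
        self.online = True

    def reset(self):
        # Stand-in for a hardware reset; a hard failure survives it.
        pass

    def execute(self, instruction, checkpoint):
        # Always start from a copy of the last known good state.
        state = dict(checkpoint)
        if not self.healthy:
            raise CpuFault(self.name)
        if self.transient_faults > 0:
            self.transient_faults -= 1
            raise CpuFault(self.name)
        return instruction(state)


def run_with_sparing(instruction, checkpoint, primary, spare):
    """Return (cpu_name, new_state) for one instruction.

    1. Try the primary CPU.
    2. On a fault, reset the primary and retry once (covers soft errors).
    3. If it faults again, treat it as a hard failure, take the primary
       offline, and replay the instruction on the spare from the checkpoint.
    """
    try:
        return primary.name, primary.execute(instruction, checkpoint)
    except CpuFault:
        primary.reset()
    try:
        return primary.name, primary.execute(instruction, checkpoint)
    except CpuFault:
        primary.online = False
    return spare.name, spare.execute(instruction, checkpoint)


def increment_r1(state):
    # Example "instruction": bump a hypothetical register r1.
    state["r1"] = state.get("r1", 0) + 1
    return state


if __name__ == "__main__":
    checkpoint = {"r1": 41}
    soft = Cpu("cpu0", transient_faults=1)
    hard = Cpu("cpu1", healthy=False)
    print(run_with_sparing(increment_r1, checkpoint, soft, Cpu("spare")))
    print(run_with_sparing(increment_r1, checkpoint, hard, Cpu("spare")))
```

The checkpoint is never modified in place, so the instruction can be replayed
on another CPU without seeing partial results from the failed attempt.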

The mind boggles.... ;)

.. and the concept of this happening on Wintel hardware running anything
is sheer ludicrousness. Whoever mentioned that SMP can help you get high-uptime
boxes is smoking heavy crack in most cases.

Note that the big-end Alpha and Sun gear is NUMA, not SMP. Different kettle
of fish there, and if you need an explanation as to why it's more likely to
happen with NUMA and not SMP, there are lots of hardware books out there. :-)





Adrian, who notes a lot of "5 9's" computing problems were solved in the 70s
and yet don't appear in most equipment in the naughties.

-- 
Adrian Chadd                    "If a butterfly flaps its wings in China,
<adrian () creative net au>         will a women get naked in Amsterdam?"
                                      -- Ashley Penney on Chaos Theory


