
Re: [nylug-talk] The Small Company's Guide to Hard Drive Failure and Linux


From: Brian Smith-Sweeney <tinbox () nyct net>
Date: Thu, 18 Nov 2004 11:56:34 -0500

Dave Aitel wrote:

So I learned a lesson about reliability and I thought I'd share. Recently, the main hard drive, a little 40 gigger that runs www.immunitysec.com, which also happens to be mail.immunitysec.com and dns.immunitysec.com, started to display read errors in the kernel logs (viewable via "dmesg" if you are root). This also caused data corruption in a few cases, and some other badness such as long pauses during writes. Jeremy says that he's never had a hard drive fail on him, but it happens all the time, so it'll probably happen to you, probably the day after you sign a big contract with someone and need to do something other than mess with your hard drive.

This is a great post, full of really good info; I've added some comments where I thought them appropriate.

Let me start by saying "Sorry, Dave; that really hurts." Hard drives fail; they always fail, and even the priciest drives will croak eventually, so anyone who hasn't had it happen yet is just playing against the odds. Hard drive failure, like any kind of system failure, is not an if, it is a when. Again, I feel for you :(. Someone's already responded with the technical comments, so I'll leave most of those out and focus on the general point(s), with a few exceptions.
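
For anyone who hasn't watched a drive start to go, the read errors Dave describes look something like this in the kernel log on a 2.4-era box (the device and sector numbers here are made up):

    hda: dma_intr: status=0x51 { DriveReady SeekComplete Error }
    hda: dma_intr: error=0x40 { UncorrectableError }, LBAsect=7116334, sector=7116271
    end_request: I/O error, dev 03:01 (hda), sector 7116271

A few of those plus the long pauses on writes Dave mentions, and it's time to order drives.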


<SNIP>

You'll want to get a few spare drives, since the first drive I tried to restore onto was bad as well. This is an important note - never buy recertified drives. Always get spankin' new drives. It might be fun to do a strings on some of those recertified drives, but I didn't have time.

For a production machine, no question. Recertified are great for test/unimportant systems. But if it's production, and has any worth to you at all, I would stick with new as well.
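
If you do get curious about what a recertified drive's previous life left behind, the strings trick Dave mentions really is a one-liner; reading straight off the raw device (the device name below is just an example) shows you the printable leftovers:

    strings -n 8 /dev/hdb | less    # -n 8 skips short junk; run this as root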


One thing my lilo did that was weird was rewrite the fstab to use "LABEL=/" instead of /dev/hda1. If you happen to be hosted at Pilosoft (or another co-lo that is run by someone on the local linux users group list - very good idea!) they might jump in and save your butt when you load it up and it doesn't work and you're too tired to figure out why.

Unless I'm confused, the command you ran above (lilo -r /mnt/hda1) installs lilo onto the appropriate drive, but uses the lilo binary that's on the Knoppix system (depending on your path). This might be why you got a rewritten fstab file (though an interesting question, if the drive partition was labeled properly, is why this *didn't* work). I've had good luck using "chroot" for this sort of thing in the past: basically you "chroot <your_root_partition>" and run lilo as normal. At least, I think that's what I did; it's been a while since I've had to do this.
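
If memory serves, the dance from the Knoppix rescue environment went roughly like this; treat it as a sketch, and adjust the device names (/dev/hda1 here is just the example from Dave's post):

    mount /dev/hda1 /mnt/hda1
    mount --bind /dev /mnt/hda1/dev     # lilo wants the real device nodes
    mount -t proc proc /mnt/hda1/proc
    chroot /mnt/hda1 /bin/sh
    lilo -v                             # now running the restored system's own lilo
    exit

That way the lilo you run, the lilo.conf it reads, and the fstab it sees all come from the restored system instead of the rescue CD.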

On a side note, you can use e2label to write those labels that lilo was trying to use onto your partitions. The command's pretty straightforward, and the man page has the details.
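
A quick sketch, assuming the usual /dev/hda1 root partition:

    e2label /dev/hda1        # with no label argument, prints the current label
    e2label /dev/hda1 /      # sets the label, so an fstab line like
                             #   LABEL=/   /   ext3   defaults   1 1
                             # will resolve to /dev/hda1 at boot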

The next step after doing all this is typically to make a plan that involves not having to ever do this again. For those of you not in the know - you want a hardware supported (get a good modern motherboard) RAID-1 solution and you want to be able to swap out one of your two drives (mirrored) when Linux tells you that one is bad. You also want to have some sort of backup solution running (of course), and you want to have a secondary DNS server and a backup machine somewhere in another state (or country) that can take over if your main CO-LO goes under or something. Something that can provide basic mail and web services is nice. It might be good to hire an admin who is not you.
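
On the backup piece of that plan: even a bare-bones nightly rsync to the out-of-state box covers a surprising amount of ground. Something like this, run out of cron (the host and paths are obviously placeholders):

    rsync -az --delete -e ssh /etc /var/named /var/www \
        backup@offsite.example.com:/backups/www1/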

On to my general point (finally =) ). Or rather, I should say, on to reinforcing Dave's point, because I think it's an incredibly important one.

RAID=good.  RAID=very, very good.

If you have a production machine, run a hardware RAID, even if you're trying to control costs. I've had good luck with 3ware and Adaptec IDE RAID cards in Linux, and bad luck with Arco and Promise IDE; SCSI is on the whole more reliable but can be much more expensive. But this was several years ago, and times have changed, so don't take my word for it. The important thing is that for <$500 you can have a hardware RAID card and a second drive ready to go, and a primary drive failure does nothing but make you put up some "scheduled maintenance" time. Every time I've had a drive failure occur on an important production system that was not RAIDed, the time and resources put into restoring the host far exceeded what it would have cost to set up a decent hardware RAID initially. (You can use a software RAID in a pinch; see the quick sketch below. But I find it's usually worth it to shell out the extra bucks for the hardware: software RAID = more time/work on your part, which is exactly what you're trying to avoid.)

RAID is not a silver bullet: it won't help with a power surge that fries the whole system, or filesystem corruption unrelated to hardware (which will happily get mirrored onto both drives), or someone burning your building down, or you trying to get rid of all the files on your root partition that have a question mark in them and deciding to run "rm -rf *?". That's why you also need good backups and a solid recovery plan in place ahead of time (backup system - restore plan & testing = "Well, I *thought* my files were here...").
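
Since I mentioned software RAID as the in-a-pinch option, here's roughly what a two-disk mirror looks like with mdadm (device names are examples, and this is a sketch, not a recipe):

    mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/hda1 /dev/hdc1
    mkfs.ext3 /dev/md0
    cat /proc/mdstat                  # watch the mirror sync up
    # when a drive dies: mark it failed, remove it, swap the hardware, re-add
    mdadm /dev/md0 --fail /dev/hdc1 --remove /dev/hdc1
    mdadm /dev/md0 --add /dev/hdc1

It works, but you're the RAID controller now, which is exactly the extra time/work I was talking about.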

I worked with a great network engineer once while he was doing a big gear switch for the backbone of the university I worked for. When it came time to make the final changes, they'd documented every detail of every step they were going to take. When I remarked on this, he said "Yeah, well, when the time comes to use it, pulses will be racing and everyone will be nervous. We don't want them to have to think, because then they can make a mistake. We like to keep it as stupid as possible." That's how I like to think of recovery plans: make them stupid. Make them so you know exactly what you have to do, every tweak and detail you had to go through to get the system running the first time. You'll still have to do some thinking ("hey, this OS is outdated... should I take this opportunity to upgrade, or minimize the number of changes?"), but the more you limit that, the better. Remember, you're going to be very, very sad when this happens, and it will most likely happen at 3:15am on Saturday when you're exhausted, or at 9:15am on Monday before a meeting of your company's board of directors. Everyone knows this is when computers commit suicide. Once again Dave, my sympathies. Glad you had a good colo to help you out! =)

Cheers,
Brian
_______________________________________________
Dailydave mailing list
Dailydave () lists immunitysec com
https://lists.immunitysec.com/mailman/listinfo/dailydave

