
Re: [nylug-talk] The Small Company's Guide to Hard Drive Failure and Linux


From: Brian Smith-Sweeney <tinbox () nyct net>
Date: Thu, 18 Nov 2004 11:56:34 -0500

Dave Aitel wrote:

So I learned a lesson about reliability and I thought I'd share. Recently, the main hard drive, a little 40 gigger that runs www.immunitysec.com, which also happens to be mail.immunitysec.com and dns.immunitysec.com, started to display read errors in the kernel logs (viewable via "dmesg" if you are root). This also caused data corruption in a few cases, and some other badness such as long pauses during writes. Jeremy says that he's never had a hard drive fail on him, but it happens all the time, so it'll probably happen to you, probably the day after you sign a big contract with someone and need to do something other than mess with your hard drive.

This is a great post, full of really good info; I've added some comments where I thought them appropriate.

Let me start by saying "Sorry, Dave; that really hurts." Hard drives fail; they always fail, and even the priciest drives will croak eventually, so anyone who hasn't had it happen yet is just playing against the odds. Hard drive failure, like any kind of system failure, is not an if, it is a when. Again, I feel for you :(. Someone's already responded with the technical comments, so I'll leave most of those out and focus on the general point(s), with a few exceptions.
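
For anyone who hasn't watched a drive start to go, the read errors Dave describes look something like this in the kernel log on a 2.4-era box (the device and sector numbers here are made up):

    hda: dma_intr: status=0x51 { DriveReady SeekComplete Error }
    hda: dma_intr: error=0x40 { UncorrectableError }, LBAsect=7116334, sector=7116271
    end_request: I/O error, dev 03:01 (hda), sector 7116271

A few of those plus the long pauses on writes Dave mentions, and it's time to order drives.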


<SNIP>

You'll want to get a few spare drives, since the first drive I tried to restore onto was bad as well. This is an important note - never buy recertified drives. Always get spankin' new drives. It might be fun to do a strings on some of those recertified drives, but I didn't have time.

For a production machine, no question. Recertified are great for test/unimportant systems. But if it's production, and has any worth to you at all, I would stick with new as well.
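
If you do get curious about what a recertified drive's previous life left behind, the strings trick Dave mentions really is a one-liner; reading straight off the raw device (the device name below is just an example) shows you the printable leftovers:

    strings -n 8 /dev/hdb | less    # -n 8 skips short junk; run this as root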


One thing my lilo did that was weird was rewrite the fstab to use "LABEL=/" instead of /dev/hda1. If you happen to be hosted at Pilosoft (or another co-lo that is run by someone on the local linux users group list - very good idea!) they might jump in and save your butt when you load it up and it doesn't work and you're too tired to figure out why.

Unless I'm confused, the command you ran above (lilo -r /mnt/hda1) installs lilo onto the appropriate drive, but uses the lilo binary that's on the Knoppix system (depending on your path). This might be why you got a rewritten fstab file (though an interesting question, if the drive partition was labeled properly, is why this *didn't* work). I've had good luck using "chroot" for this sort of thing in the past: basically you "chroot <your_root_partition>" and run lilo as normal. At least, I think that's what I did; it's been a while since I've had to do this.
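
If memory serves, the dance from the Knoppix rescue environment went roughly like this; treat it as a sketch, and adjust the device names (/dev/hda1 here is just the example from Dave's post):

    mount /dev/hda1 /mnt/hda1
    mount --bind /dev /mnt/hda1/dev     # lilo wants the real device nodes
    mount -t proc proc /mnt/hda1/proc
    chroot /mnt/hda1 /bin/sh
    lilo -v                             # now running the restored system's own lilo
    exit

That way the lilo you run, the lilo.conf it reads, and the fstab it sees all come from the restored system instead of the rescue CD.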

On a side note, you can use e2label to write those labels that lilo was trying to use onto your partitions. The command's pretty straightforward, and the man page has the details.
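
A quick sketch, assuming the usual /dev/hda1 root partition:

    e2label /dev/hda1        # with no label argument, prints the current label
    e2label /dev/hda1 /      # sets the label, so an fstab line like
                             #   LABEL=/   /   ext3   defaults   1 1
                             # will resolve to /dev/hda1 at boot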

The next step after doing all this is typically to make a plan that involves not having to ever do this again. For those of you not in the know - you want a hardware supported (get a good modern motherboard) RAID-1 solution and you want to be able to swap out one of your two drives (mirrored) when Linux tells you that one is bad. You also want to have some sort of backup solution running (of course), and you want to have a secondary DNS server and a backup machine somewhere in another state (or country) that can take over if your main CO-LO goes under or something. Something that can provide basic mail and web services is nice. It might be good to hire an admin who is not you.
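
On the backup piece of that plan: even a bare-bones nightly rsync to the out-of-state box covers a surprising amount of ground. Something like this, run out of cron (the host and paths are obviously placeholders):

    rsync -az --delete -e ssh /etc /var/named /var/www \
        backup@offsite.example.com:/backups/www1/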

On to my general point (finally =) ). Or rather, I should say, on to reinforcing Dave's point, because I think it's an incredibly important one.

RAID=good.  RAID=very, very good.

If you have a production machine, run a hardware RAID, even if you're trying to control costs. I've had good luck with 3ware and Adaptec IDE RAID cards in Linux, and bad luck with Arco and Promise IDE; SCSI is on the whole more reliable but can be much more expensive. But this was several years ago, and times have changed, so don't take my word for it. The important thing is that for <$500 you can have a hardware RAID card and a second drive ready to go, and a primary drive failure does nothing but make you put up some "scheduled maintenance" time. Every time I've had a drive failure occur on an important production system that was not RAIDed, the time and resources put into restoring the host far exceeded what it would have cost to set up a decent hardware RAID initially. (You can use a software RAID in a pinch; see the quick sketch below. But I find it's usually worth it to shell out the extra bucks for the hardware: software RAID = more time/work on your part, which is exactly what you're trying to avoid.)

RAID is not a silver bullet: it won't help with a power surge that fries the whole system, or filesystem corruption unrelated to hardware (which will happily get mirrored onto both drives), or someone burning your building down, or you trying to get rid of all the files on your root partition that have a question mark in them and deciding to run "rm -rf *?". That's why you also need good backups and a solid recovery plan in place ahead of time (backup system - restore plan & testing = "Well, I *thought* my files were here...").
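
Since I mentioned software RAID as the in-a-pinch option, here's roughly what a two-disk mirror looks like with mdadm (device names are examples, and this is a sketch, not a recipe):

    mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/hda1 /dev/hdc1
    mkfs.ext3 /dev/md0
    cat /proc/mdstat                  # watch the mirror sync up
    # when a drive dies: mark it failed, remove it, swap the hardware, re-add
    mdadm /dev/md0 --fail /dev/hdc1 --remove /dev/hdc1
    mdadm /dev/md0 --add /dev/hdc1

It works, but you're the RAID controller now, which is exactly the extra time/work I was talking about.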

I worked with a great network engineer once while he was doing a big gear switch for the backbone of the university I worked for. When it came time to make the final changes, they'd documented every detail of every step they were going to take. When I remarked on this, he said "Yeah, well, when the time comes to use it, pulses will be racing and everyone will be nervous. We don't want them to have to think, because then they can make a mistake. We like to keep it as stupid as possible." That's how I like to think of recovery plans: make them stupid. Make them so you know exactly what you have to do, every tweak and detail you had to go through to get the system running the first time. You'll still have to do some thinking ("hey, this OS is outdated... should I take this opportunity to upgrade, or minimize the number of changes?"), but the more you limit that, the better. Remember, you're going to be very, very sad when this happens, and it will most likely happen at 3:15am on Saturday when you're exhausted, or at 9:15am on Monday before a meeting of your company's board of directors. Everyone knows this is when computers commit suicide. Once again Dave, my sympathies. Glad you had a good colo to help you out! =)

Cheers,
Brian
_______________________________________________
Dailydave mailing list
Dailydave () lists immunitysec com
https://lists.immunitysec.com/mailman/listinfo/dailydave

