Disk Doom: or How I learned to Stop Worrying and Love the Backup
Posted on 2008-10-23. Tags: Failure.

The last week has been very annoying indeed, assuming you look at it purely from a computer health point of view. Which is the only sensible way to look at these things. It started innocently enough - I was merrily pimping Ramdaq's new site when there was a power cut here at the towers. When asus (as the computer is descriptively known) came back up, one of the drives in the main RAID5 had been marked as faulty.

No problem. Using RAID storage is one of the ways to guard against this kind of thing. Hard disks are miraculous little things; if you take the time to really learn what they're doing it'll boggle your mind. But even more boggling are the horrendous failure rates they suffer. And most boggling of all - we keep using them.

I've had 2 disk failures already this year, although admittedly both of those were directly tied to misuse (being knocked off a table while running etc). So back to my use of RAID - this is, in simple terms, a kind of insurance for hard disks. You buy more disks than you strictly need to store your data and you arrange for the computer to spread things around and keep spare copies of things in the spare space. If a disk dies then you're OK. Everything can be recovered from the spare bits. That's the general idea, as a gross simplification and ignoring "RAID for speed" type stuff.
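For the curious, here is a minimal sketch of what the "spare bits" look like in a RAID5-style setup. It assumes plain XOR parity over three data blocks plus one parity block; the block contents and the `xor_blocks` helper are made up for illustration, and a real RAID5 also rotates the parity around the disks, which this ignores.

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR equal-length byte blocks together, one byte column at a time."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

# Three data blocks, as if striped across three disks (made-up contents).
data = [b"photos..", b"video...", b"code...."]

# The fourth disk holds the parity: the XOR of the three data blocks.
parity = xor_blocks(data)

# Pretend the second disk died. XOR-ing the survivors with the parity
# gives back exactly the block that was lost.
survivors = [data[0], data[2], parity]
rebuilt = xor_blocks(survivors)

print(rebuilt == data[1])  # True
```

Lose two blocks from the same stripe, though, and there is only one parity equation for two unknowns, so nothing can be rebuilt.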

So I'm laughing. I just check the failed disk out, and if it's really dead I replace it.

Only, while I'm checking it out, another disk gets marked faulty. And when 2 of your 4 disks in a RAID5 are faulty you're in trouble, because there aren't enough spare copies to cover the failures. So I had to hope they weren't really faulty at all and I could just keep using them. No such luck - they were both riddled with unreadable sectors and generally well on their way to paperweight status.

Am I the most unlucky person in the world? Not according to Robin Harris: I'm just ahead of my time. Trying to force the faulty disks into a read-only array resulted in a moderately usable file system, but it was pretty painful to use (lots of IO errors) and you'd have gone crazy trying to recover from it.

So I had to get 4 similar disks out of another machine (wz01) and ddrescue the goosed RAID5 partitions onto them. Ddrescue is a brilliant bit of code; it has saved many clients' data over the years. Ddrescue doesn't do anything you can't do with other tools, but it makes the job so simple and reliable. Simple and reliable are great assets when you're panicking about dubious disks.
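The core idea behind this kind of rescue copy can be sketched in a few lines. This is a toy illustration only, not how ddrescue is actually implemented: the `FlakyDisk` class, the `rescue` function and the 512-byte sector size are all assumptions invented for the example, and the real tool does a great deal more (multiple passes, retries, a map file so it can resume).

```python
SECTOR = 512  # bytes per sector (assumed for this sketch)

class FlakyDisk:
    """Hypothetical stand-in for a failing disk: some sectors raise OSError."""
    def __init__(self, data, bad_sectors):
        self.data = data
        self.bad = set(bad_sectors)

    def read(self, sector):
        if sector in self.bad:
            raise OSError(f"unreadable sector {sector}")
        start = sector * SECTOR
        return self.data[start:start + SECTOR]

def rescue(disk, n_sectors):
    """Copy every readable sector; zero-fill and record the ones that fail."""
    image = bytearray(n_sectors * SECTOR)  # starts out as all zeros
    unreadable = []
    for sector in range(n_sectors):
        try:
            chunk = disk.read(sector)
        except OSError:
            unreadable.append(sector)  # note it and carry on, don't abort
            continue
        image[sector * SECTOR:(sector + 1) * SECTOR] = chunk
    return bytes(image), unreadable

original = bytes(range(256)) * 16  # 4096 bytes, i.e. 8 sectors
disk = FlakyDisk(original, bad_sectors=[2, 5])
image, unreadable = rescue(disk, n_sectors=8)

print(unreadable)                            # [2, 5]
print(len(unreadable) * SECTOR, "bytes lost")  # 1024 bytes lost
```

The point is the behaviour on errors: a bad sector costs you that sector and nothing else, and you end up with a list of exactly what is missing.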

Several ZZTops later I had a new set of disks to use, and ddrescue was confident it'd recovered everything apart from a few hundred KB. I forced the array to start despite the two components marked faulty and copied everything off to a large external disk. This took about 9 further ZZTops. I then copied it all back onto a brand new SATA disk. Alas, no RAID smugness any more... the volume of data is just too much to justify the cost for a home machine. Instead I'll just use the external drive for nightly backups and hope for the best.

The key thing about all this is that it was never really that much of a disaster. Even if the data had been lost, things wouldn't have been that bad, as I already had a good backup policy in place for all my "created" data (photos, video, code etc). Anything else (mainly music, movies and junk) can either be reacquired or forgotten about. Happiness is a recent backup.

The second thing to remember is that with tools like mdadm and a filesystem like ext3 you're probably going to get (most of) your data back at some point. Everything is so well documented and so flexible that as long as some of the disk can be read you'll be able to recover something. They're excellent tools once you've learned them, and the only real way to learn them is in a datacentre at 3am. It focuses the mind, so a little home hiccup like this becomes a breeze.

The final, and most important, point is that if you have 16 disks in use and they're an average of 2 years old (or more) then you're due failures right about now. So I only have myself to blame. I know hard disks are crap and don't last... but they're so seductive and once you're using them you forget how old they are. I'm going to preemptively junk all my old RAIDs (mainly 4 or 5 200GB to 250GB drives per machine) and replace them with single 1TB drives, coupled with an external drive for snapshot backups.
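To put a rough number on "due failures", here is a back-of-envelope sketch. The annual failure rates are purely assumed figures picked for illustration, not measurements from my disks, and treating the disks as failing independently is itself a generous assumption (a power cut hits them all at once). The `p_any_failure` name is made up for the example.

```python
def p_any_failure(n_disks, annual_failure_rate):
    """Chance that at least one of n independent disks fails within a year."""
    return 1 - (1 - annual_failure_rate) ** n_disks

# 16 disks in use; the per-disk rates below are assumptions, not data.
for afr in (0.02, 0.05, 0.08):
    chance = p_any_failure(16, afr)
    print(f"per-disk rate {afr:.0%}: {chance:.0%} chance of at least one failure")
```

Even with a fairly optimistic per-disk rate, 16 disks add up to a decent chance of losing at least one every year.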

But it's so hard to bin disks... I might just find space for them in a big RAID10 for /tmp ;)

Hard Disks


[2008-10-23 at 17:50 (updated 2 times)]
big cheese
samworm AT gmail DOT com

© samworm