Planet HantsLUG

June 15, 2018

Andy Smith

Another disappointing btrfs experience

I’ve been using btrfs on my home fileserver for about 4½ years. I am not entirely happy with it and kind of wish I had never started; I will certainly not be introducing it anywhere else. I’m also pretty lazy though, which probably explains why I haven’t ripped it out and replaced it with something else yet.

I’ve had a few problems with it over the years. To be fair I’ve never lost any data; it’s really the availability aspects of it which I feel just aren’t ready yet. When I use multiple storage devices it’s generally to increase availability. I don’t expect device failure to stop me doing what I need to do, at least for small amounts of device failure.

Unfortunately btrfs has consistently not lived up to these expectations. Almost every single-disk failure I’ve had in the past has resulted in an “outage” of some sort. As this is just our data, at home, it may be strange to think of it as an outage, but that’s what it is. Our data became unavailable in some way for some period of time.

This time around, one of the drives started throwing up “Currently unreadable (pending)” and “Offline uncorrectable” sectors a few days ago. That means that there are areas of the drive that it cannot read. Initially there were just a small number, and a scrub came back clean, which suggested the problem sectors were at that time outside of anything the filesystem was using.

In a more critical setting I’d have spare drives available and would just swap them, but for home use I’m usually comfortable with getting the drive to reallocate these sectors by forcing a write, and only ordering a replacement if the problem doesn’t go away. Worst case, I have backups.

After a day or so though, the number of problem sectors was increasing and it was obvious the drive was going to die fairly soon. I ordered a replacement. About 6 hours before the replacement arrived the drive completely stopped responding.

Now, this drive was at the time one of five in the btrfs filesystem, and the filesystem has a raid1 storage policy so there should have been no issue with one device going missing. But apparently there was a problem. btrfs sits spewing errors into the kernel log about lost writes to a device that’s no longer there; the filesystem goes read-only.

The replacement drive arrives, but with the filesystem read-only I can’t add it. I can’t even unmount the filesystem (says it is busy but lsof doesn’t see any users). Nope, I had to reboot the fileserver, at which point the filesystem wouldn’t mount at all because you have to give it the degraded mount option if you want it to mount with any devices missing.

Add the replacement drive, then run btrfs device remove missing /path/to/fs to kick off removal of the dead device. Things are at least up and running read-write while this is going on. In fact it’s still going on, because there was 1.2TiB of data on the dead device and reconstructing it is painfully slow. As I write this we’re now about 9 hours in and there’s still about 421GiB to go.
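For a rough idea of how much longer that means, here is a back-of-envelope estimate from those figures. It is only a sketch: it assumes the whole 1.2TiB had to be moved and that the rebuild carries on at its average rate so far, neither of which btrfs promises. The function name is made up for illustration.

```python
def estimate_remaining_hours(total_gib, remaining_gib, elapsed_hours):
    """Linear ETA: assumes the rebuild keeps going at its average rate so far."""
    done_gib = total_gib - remaining_gib
    if done_gib <= 0 or elapsed_hours <= 0:
        raise ValueError("need some progress and some elapsed time to estimate")
    rate_gib_per_hour = done_gib / elapsed_hours
    return remaining_gib / rate_gib_per_hour


# Figures from the post: 1.2TiB to move, 421GiB left after about 9 hours.
total_gib = 1.2 * 1024  # TiB -> GiB
eta = estimate_remaining_hours(total_gib, 421, 9)
print(f"about {eta:.1f} more hours")
```

On those numbers the average works out to roughly 90GiB an hour, so a bit under five more hours if nothing changes.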

So, it’s not terrible. No data was lost (probably). A short outage due to a required reboot. But it is kind of disappointing and not really how I want to be spending my time just because a single HDD shuffled off its mortal coil. I am massively thankful that the operating system of that fileserver is still on four other HDDs on ext4+lvm+md, which never give me any trouble. Otherwise I’d have to be booting into a rescue OS to fix this sort of thing. When the thing you’re glad of is that you didn’t use a filesystem, that isn’t a great advert for that filesystem.

I should probably try to find some time to play (again) with ZFS-on-Linux. I did actually give it a go last year but got bogged down trying to compare its performance against btrfs and ext4+lvm+md using fio, which proved quite difficult to do, and I moved on to other things.

One of the things that initially attracted me to btrfs is the possibility of using a mish-mash of differently-sized drives. Due to BitFolk constantly replacing hardware I have in my possession plenty of HDDs of differing sizes that are individually perfectly serviceable, but would be awkward to try to match up into identical sizes for conventional RAID arrays. This btrfs filesystem started out with mostly 250G drives, and just before this failure it was 1x 1TB, 3x 2TB and 1x 3TB.
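To put a number on what that mish-mash buys, here is a rough sketch of the usual rule of thumb for two-copy btrfs raid1 over unequal devices. It ignores metadata overhead and chunk granularity, so treat the result as an approximation; the function name is mine, not anything btrfs provides.

```python
def btrfs_raid1_usable(sizes):
    """Approximate usable space for two-copy raid1 across unequal devices.

    Every block lives on two different devices, so usable space is capped
    both by half the raw total and by how much the other devices can
    mirror for the largest one.
    """
    if len(sizes) < 2:
        return 0
    total = sum(sizes)
    return min(total / 2, total - max(sizes))


# The drive mix just before the failure, in TB.
drives_tb = [1, 2, 2, 2, 3]
print(btrfs_raid1_usable(drives_tb))  # 5.0
```

So that set of drives gives about 5TB usable out of 10TB raw, with nothing stranded by the size mismatch.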

I had thought that ZFS required every device in a pool to be the same capacity, but I’ve since been informed that the limit is per vdev: ZFS will just use the capacity of the smallest device in each vdev. So assuming mirror vdevs, I’d just need to pair the drives up by size (or accept that each pair’s capacity will be that of the smaller of the two).
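As a sketch of what that pairing would look like with the drives I had, assuming two-way mirror vdevs, ignoring any ZFS overhead, and taking pool capacity as the sum of the vdev capacities. The helper below is purely illustrative and not part of any ZFS tooling.

```python
def mirror_pool_usable(sizes):
    """Pair drives largest-first into two-way mirrors.

    Each mirror contributes the size of its smaller drive. With an odd
    number of drives the smallest one is left unpaired.
    Returns (pairs, usable, leftover).
    """
    ordered = sorted(sizes, reverse=True)
    pairs = [(ordered[i], ordered[i + 1]) for i in range(0, len(ordered) - 1, 2)]
    leftover = ordered[-1] if len(ordered) % 2 else None
    usable = sum(min(a, b) for a, b in pairs)
    return pairs, usable, leftover


# Same drive mix as the btrfs filesystem, in TB.
pairs, usable, leftover = mirror_pool_usable([1, 2, 2, 2, 3])
print(pairs)     # [(3, 2), (2, 2)]
print(usable)    # 4
print(leftover)  # 1
```

With 1x 1TB, 3x 2TB and 1x 3TB that comes to 4TB usable: 1TB of the 3TB drive goes unused and the 1TB drive has no partner.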

That doesn’t seem too onerous at all, when considering the advantages that ZFS would bring. I’m most interested in the self-healing (checksums) and the storage tiering (through using faster devices like SSDs for L2ARC and ZIL). btrfs doesn’t have a good solution for tiering yet, unless you are insane and want to play with bcache(fs).

So, yeah, should stop being lazy and crack on with ZFS again. In my copious free time.

by Andy at June 15, 2018 12:45 AM