No notification on disk failure

Hi there,

As I am evaluating Rockstor and currently testing the basic features, I did the following:

  • created a RAID5 on two SATA disks (pool, share and Samba share)
  • copied data onto it
  • now I simulate a disk failure: I unplug disk 2
  • data is still accessible - good!
  • /var/log/messages says what it should say:

BTRFS: bdev /dev/sdb errs: wr 10, rd 0, flush 0, corrupt 0, gen 0
BTRFS: bdev /dev/sdb errs: wr 11, rd 0, flush 0, corrupt 0, gen 0
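
Those counters can also be queried on demand with “btrfs device stats”. Here is a minimal sketch of how one could poll them and flag non-zero values (my own little test script, not Rockstor code; the mount point /mnt2/mypool is just an example):

import subprocess

def check_pool_errors(mountpoint):
    # run "btrfs device stats <mountpoint>" and collect any non-zero counters
    out = subprocess.run(["btrfs", "device", "stats", mountpoint],
                         capture_output=True, text=True, check=True).stdout
    errors = {}
    for line in out.splitlines():
        # lines look like: "[/dev/sdb].write_io_errs   10"
        if not line.strip():
            continue
        counter, value = line.rsplit(None, 1)
        if int(value) != 0:
            errors[counter] = int(value)
    return errors

if __name__ == "__main__":
    errs = check_pool_errors("/mnt2/mypool")
    if errs:
        print("WARNING: btrfs device errors detected:", errs)

Run from cron, that would at least give a crude alert.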

An email is sent 10 minutes later with a warning from smartd: “Device: /dev/sdb [SAT], unable to open device”

But: I get no notification via the Dashboard or an instant email about the btrfs errors.
Only when I go to Storage > Disks do I see a trash bin icon with the tooltip “Disk is unusable because it is offline.”

I guess it should normally display a warning if there is something wrong with a disk. Am I missing something, or is it a bug?

Regards,
maxhq


Did you really mean RAID5? That should have failed to create in the first place, seeing as RAID5 requires a minimum of 3 disks…

It doesn’t matter for btrfs.

“Note that the minimum number of devices required for RAID5 is 2. In case of a 2 device RAID5 filesystem, one device has data and the other has parity data. Similarly, for RAID6, the minimum is 3 devices.” from https://btrfs.wiki.kernel.org/index.php/Using_Btrfs_with_Multiple_Devices#Raid_5_and_Raid6

Interesting, given that when I tried to remove a dead disk (only dead because I did something stupid to it) from a 3-disk RAID5, it complained that the minimum level of redundancy wouldn’t be met.

Was it balanced? (extra words here for spam filter)

Nope, the drive replacement is still ongoing; about 105GB left, I think.

Edit

Also @suman, it would be nice if we could force the GUI to mount a degraded array, since doing it manually doesn’t mount the subvols automatically (I don’t think it does, unless I missed a trick), and there might be some data you want to copy off in the meantime.
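
For the record, the manual route I mean looks roughly like this (just a sketch; /dev/sda, /mnt2/mypool and the share names are placeholders, not paths Rockstor generates):

import os
import subprocess

def mount_degraded(device, pool_mnt, shares):
    # mount the pool itself in degraded mode
    subprocess.run(["mount", "-o", "degraded", device, pool_mnt], check=True)
    # the per-share subvolumes are not mounted automatically, so do it by hand
    for share in shares:
        share_mnt = os.path.join("/mnt2", share)
        os.makedirs(share_mnt, exist_ok=True)
        subprocess.run(["mount", "-o", "degraded,subvol=" + share,
                        device, share_mnt], check=True)

mount_degraded("/dev/sda", "/mnt2/mypool", ["share1", "share2"])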

@Dragon2611 Can you please add two details: 1. Where did it complain about the minimum redundancy level? 2. Was that after a restart or instantly?
Thanks a lot!

It was in the CLI; I don’t remember precisely, so btrfs-progs rather than Rockstor.

I think I accidentally tried to remove the missing disk before adding the replacement; the docs aren’t exactly stellar on rebuilding an array.

I have just done a similar test: 4 disks in a RAID1 mirror config. I pulled the power and data cords from disk 3. I am able to access all of my data and shares, but did not get any notification of disk failure from the GUI. If I go to the Storage > Disks menu I see the trash can with the same tooltip (“Disk is unusable because it is offline.”)

It has only been a few minutes and I have not performed a balance yet. I am waiting for an e-mail or some form of error message to appear before I continue. Is there any kind of fix in the works for this yet?


Yes, the disk failure notification feature is one of the top issues and we’ll be working on it soon. @lakshmipathi_g may find your test useful as he’s looking into using udev/pyudev for this.
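
For anyone curious, the pyudev idea is roughly the following (only a sketch of the approach, not the actual implementation): watch block device events and raise an alert when a disk disappears.

import pyudev

context = pyudev.Context()
monitor = pyudev.Monitor.from_netlink(context)
monitor.filter_by(subsystem="block")

# poll() blocks until the next udev event for a block device
for device in iter(monitor.poll, None):
    if device.action == "remove" and device.device_type == "disk":
        # this is where a dashboard alert / email notification would be triggered
        print("Disk removed:", device.device_node)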

Great @suman!

I waited a few hours and started a balance. The three remaining disks rebuilt the RAID1 in a few hours. The status page now reads 100% and zero errors; however, my pool size has not changed from when I had four disks.

Before: 4x500GB SATA in one pool. Capacity: 931GB usable
Now: 3x500GB SATA in one pool. Capacity: 931GB usable

How do I get the pool/share capacity to update after balancing?

What does a RAID1 with more than 2 disks mean? Doesn’t the capacity make sense, then?

RAID1 with 2+ disks means that you’ll always be on a working RAID1 as long as total disks - failed disks > 1 (so with total disks = 4 and 2 simultaneously failed disks you’re still on a good RAID1; with total disks = 2 and 1 failed disk… bye bye RAID1).
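
Spelled out literally (just illustrating the condition above, nothing more):

def still_a_working_raid1(total_disks, failed_disks):
    # the rule of thumb from this post: more than one disk must remain
    return total_disks - failed_disks > 1

print(still_a_working_raid1(4, 2))  # True  -> still on a good RAID1
print(still_a_working_raid1(2, 1))  # False -> bye bye RAID1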

Flyer

On RAID1 you can have 2 disks and 1 failure and still be good. On RAID5/6 things are a bit different.

Edit: Hot-spare and hot-swap support is in the works, so a bit of patience; I think it will get here. Patches have been submitted on the mailing list, so now it is down to bugfixing etc.

Oooh, hot-spare/hot-swap should REALLY help a lot. According to the memtest calculator it really reduces the risk inherent in RAID5/6 rebuilds (other than the RAID write hole).


Here is the link to the mailing list post describing hot-spares etc.:
http://www.spinics.net/lists/linux-btrfs/msg48916.html


Just subscribed to stable updates (we’ve got to support the project, right?).

Started testing in a VBox VM with one boot vdisk and two data vdisks and saw the same behavior on RAID1. I hope the notification feature will be ready soon, as this is important.


Really appreciate your support! This is an important feature indeed. Hope to put it behind us soon.


Hey guys,

I’ve been away from the project for about a year now and I am back considering Rockstor again for a new deployment. The missing disk failure notification feature was one of the reasons I didn’t use Rockstor last time around.

Can anyone update me on this now? Does the current release support disk failure detection & notification?