I think a hard drive died. Should I see some sort of warning?

slavetothesound · September 5, 2016, 2:39am

I have a RAID5 array of 5 x 3TB drives. I think one (or two?) has failed. I’ve been away from home for a month. About 3 weeks ago I noticed things were off when Deluge started reporting errors and couldn’t download any torrents, but I didn’t have things configured for ssh or troubleshooting over the internet. Plex was still running and I was too busy to think about it.

I came home today and rebooted the NAS, hoping for a ‘turn it off and on again’ fix. I still don’t see any warnings in the web panel, but things are definitely worse now. I had to reboot a second time before the rock-ons would turn on. Now… no files in any folders. And the Storage -> Disks page is a little off. One of the drives in the pool is listed as 094859c5f4ec420fb422a7c5162049e0 instead of sdx like the rest. It now says SMART is not supported (unlike the other drives of the same model), and if I click on it I get an error message that it doesn’t exist, and then it appears with a new name when I refresh the page.

I was hoping to throw my spare 3TB drive in and rebuild, but if my folders appear totally empty now, does that mean something else has gone wrong? Should I have a warning message somewhere?

I think about 6TB of torrents is all that is lost, nothing I can’t recover. But the internet is slow in this part of the world. Is there anything I should try before I just start over from scratch with whichever drives are still good?

phillxnet · September 5, 2016, 2:01pm

@slavetothesound A belated welcome to the Rockstor community. Just quoting from your slightly more recent post with your new findings reported there so the two are tied together.

Yes, the strange long random name is a placeholder for ‘missing’ devices, i.e. devices that were once known to be connected but are no longer detected. In the testing channel updates this has been improved a tad from the user’s point of view by showing the name as “detached-(long-random-string)”. SMART showing as not supported is intended, as this device is no longer connected. The user messaging for this is also improved a tad in the testing channel updates.

So, as you suspect, you have a dead / missing / detached drive. To help forum members advise you, the output of the following command, run in a terminal as the root user, would be helpful:

btrfs fi show

I suspect it will identify a ‘missing’ device. Also check the SMART status of the remaining attached drives if you are able.
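As a rough illustration of what to look for in that output, the sketch below scans `btrfs fi show` text for the "Some devices missing" marker and prints the label of each affected pool. The function name `pools_with_missing_devices` is made up for this example, and it assumes every pool has a quoted label as in the usual `Label: 'name'` line.

```shell
# Hypothetical helper: reads `btrfs fi show` output on stdin and prints the
# label of every pool that btrfs flags with "Some devices missing".
# Assumes labels are wrapped in single quotes, e.g.  Label: 'weiland'  uuid: ...
pools_with_missing_devices() {
    # Split fields on the single quote so that $2 is the bare label.
    awk -F"'" '
        /^Label:/              { label = $2 }
        /Some devices missing/ { print label }
    '
}

# Possible usage, as root:
#   btrfs fi show | pools_with_missing_devices
# For the SMART check, smartmontools can report overall health per drive:
#   smartctl -H /dev/sdX
```

This only reports which pool is affected; identifying the physical drive still means comparing the listed devids against the drives that are attached.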

Currently btrfs doesn’t have the concept of a ‘bad device’. When a device does go bad, the mount command simply refuses to mount the pool, thereby forcing a manual repair procedure via the command line. Hence the empty shares and the lack of warnings (bar the missing drive on the Disks page): we depend on the subsystems to provide these, and given there isn’t yet a clear indication of such a bad device we are ‘caught between a rock and a hard place’. There are, however, patches to provide this functionality working their way through the linux-btrfs mailing list, authored by Anand, but due to their significant nature it is taking some time to see them incorporated.

Also note that the btrfs parity raid levels of raid5 and raid6, which work very differently from the other raid levels, are not considered ready for production use.

I would advise that you first attempt to retrieve what data you can, then nuke and pave to a raid1 or raid10 arrangement and, obviously, lose the faulty drive. The btrfs restore function may well be your friend in this case, especially given your use of raid5 in this pool.

If on the other hand you would like to attempt a repair of this pool, then please see the recently revised “Data loss Prevention and Recovery in RAID5/6 Pools” section of the official Rockstor docs.
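For anyone unfamiliar with btrfs restore, here is a minimal sketch of how it might be invoked. It reads from an unmounted pool member and copies whatever it can into a directory on a different, healthy filesystem. The helper names `run` and `restore_pool`, the `DRY_RUN` switch, and the target path `/mnt/rescue` are all assumptions made for this illustration; `-v` (verbose) and `-i` (ignore errors) are standard btrfs restore options.

```shell
# Hypothetical helper: with DRY_RUN=1 it only prints the command it would run.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# restore_pool DEVICE TARGET_DIR
# Copies the files btrfs restore can still read from the unmounted pool member
# DEVICE into TARGET_DIR, which must be on a separate, healthy filesystem.
restore_pool() {
    dev=$1
    target=$2
    run mkdir -p "$target" &&
    run btrfs restore -v -i "$dev" "$target"
}

# Possible usage, as root (device and target are examples only):
#   DRY_RUN=1; restore_pool /dev/sdb /mnt/rescue    # preview the commands
#   DRY_RUN=0; restore_pool /dev/sdb /mnt/rescue    # actually copy the data
```

btrfs restore also accepts `-D` for its own dry run, which lists what would be restored without writing anything. How much it can recover from a degraded raid5 pool is not something this sketch can promise.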

Hope that helps and apologies for the slow response.

slavetothesound · September 5, 2016, 4:44pm

Thank you for the detailed response! I was at a loss for a starting point.

Output of ‘btrfs fi show’ confirms that a drive is ‘missing’. I’ve determined which one it is and replaced it with a good drive. Instead of rebuilding, I think I’ll see if I can mount the array as is and copy the things I most want to keep. Then I’ll start over with raid 10 as you suggest. Thanks!

[root@weiland mnt2]# btrfs fi show
Label: 'rockstor_rockstor'  uuid: 150c532d-d963-463e-9c2d-c806f46655aa
	Total devices 1 FS bytes used 8.13GiB
	devid    1 size 105.80GiB used 46.02GiB path /dev/sde3

warning, device 1 is missing
warning devid 1 not found already
checksum verify failed on 3137863680 found BA8DD541 wanted 5E6E68F7
checksum verify failed on 3137863680 found BA8DD541 wanted 5E6E68F7
bytenr mismatch, want=3137863680, have=0
Label: 'weiland'  uuid: 7aa74f0e-55a2-4f4e-8b65-83d39a5463f4
	Total devices 5 FS bytes used 3.23TiB
	devid    2 size 2.73TiB used 895.50GiB path /dev/sdb
	devid    3 size 2.73TiB used 895.50GiB path /dev/sdc
	devid    4 size 2.73TiB used 895.50GiB path /dev/sdd
	devid    5 size 2.73TiB used 895.50GiB path /dev/sdf
	*** Some devices missing

slavetothesound · September 5, 2016, 5:06pm

I’m not having much luck mounting in degraded mode.

[root@weiland mnt2]# mount -o degraded,device=/dev/sdb,device=/dev/sdc,device=/dev/sdd,device=/dev/sdf /dev/sdb /mnt2/weiland
mount: wrong fs type, bad option, bad superblock on /dev/sdb,
       missing codepage or helper program, or other error

       In some cases useful info is found in syslog - try
       dmesg | tail or so.
[root@weiland mnt2]# dmesg | tail
[ 1921.047994] BTRFS: failed to read chunk tree on sdd
[ 1921.063130] BTRFS: open_ctree failed
....

I get the same error when specifying only one device in the pool, and again when I specify the btrfs filesystem type:
mount -o degraded -t btrfs /dev/sdd /mnt2/weiland

phillxnet · September 5, 2016, 5:50pm

@slavetothesound If you are on step 4 of the referenced “Data loss Prevention and Recovery in RAID5/6 Pools” doc, make sure you aren’t yet including the ‘fresh’ replacement drive in the list, as that hasn’t yet been added to this pool. Note also that you may only get one shot at this, as subsequent degraded mounts may not allow write access. And remember that drives can have a different name from boot to boot.
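Because device names can change between boots, one way to avoid typing stale names is to build the device list from the current `btrfs fi show` output each time. The sketch below is only an illustration: the function names `pool_devices` and `degraded_mount_cmd` are invented here, and it prints the mount command rather than running it. A freshly inserted replacement drive will not appear in the list, since it is not yet a member of the pool.

```shell
# pool_devices LABEL: reads `btrfs fi show` output on stdin and prints the
# device paths currently listed under the pool LABEL, one per line.
pool_devices() {
    awk -v want="$1" -F"'" '
        /^Label:/          { in_pool = ($2 == want) }
        in_pool && / path / { n = split($0, f, " "); print f[n] }
    '
}

# degraded_mount_cmd LABEL MOUNTPOINT: prints (does not run) a degraded mount
# command using the devices found for LABEL in the text read from stdin.
degraded_mount_cmd() {
    devs=$(pool_devices "$1")
    [ -n "$devs" ] || return 1
    first=$(printf '%s\n' "$devs" | head -n 1)
    opts=$(printf '%s\n' "$devs" | sed 's/^/device=/' | paste -sd, -)
    printf 'mount -t btrfs -o degraded,%s %s %s\n' "$opts" "$first" "$2"
}

# Possible usage, as root:
#   btrfs fi show | degraded_mount_cmd weiland /mnt2/weiland
```

Review the printed command before running it. Getting the names right does not guarantee the mount succeeds; a chunk tree error like the one above can still occur.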

Hope that helps.