BTRFS FS RAID Failure

b8two · May 27, 2021, 5:45am

Hi All,

First of all I’ll state that this BTRFS issue is from an OMV 32bit installation BTRFS Version V4.20.1

I’m very frustrated with the outcome of events where I think I have lost 6.23TiB of content.

I’m going to Vent / document the process I went through here as it might help others or I might get some guidance on how to get access to my data again. My memory of what I started working on a couple of days ago is not as fresh, so I may not have the details quite right.

I had 2x 8TB Drives in RAID 1 (Mirror), One Drive started to “Fail” and caused performance issues but I was still able to access the disks, etc.
Hence I managed to Convert the BTRFS from Mirror to Single. I did this by using OMV’s “WIPE” option on the drive that was failing, which was quick, so it might have just deleted the partition information. Then I could run the BTRFS convert utility to convert from RAID1 to single. This completed after a day or so. (I don’t understand why this took so long.)
This was fine and continued to work. I had left it like this for a few days.
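For reference, a minimal sketch of what a RAID1 to single conversion usually looks like on the command line. The post does not say which exact command OMV ran, so the function name, the mount point argument and the choice of `dup` for metadata are assumptions. A balance with convert filters rewrites every block group on the filesystem, which is consistent with a run time of a day or so for several TiB.

```shell
#!/bin/sh
# Sketch only: convert_to_single and its argument are hypothetical names.
# Usage: convert_to_single /path/to/mounted/btrfs
convert_to_single() {
    mnt="$1"
    if [ ! -d "$mnt" ]; then
        echo "not a directory: $mnt" >&2
        return 1
    fi
    # Show the current data and metadata profiles (e.g. RAID1).
    btrfs filesystem df "$mnt" || return 1
    # Rewrite all data chunks as single and all metadata chunks as dup.
    # Every allocated block group is read and written again, so this is slow.
    btrfs balance start -dconvert=single -mconvert=dup "$mnt"
}
```

Progress of a running balance can be watched from another shell with `btrfs balance status` on the same mount point.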

Eventually the drive that was “Failing” started to cause IO issues for the whole machine and I needed to pull it out. I tried to power cycle to fix the IO issues and drive power down commands, etc. Basically it had to be removed.
After this the only way to mount the now Single BTRFS volume (I think it complained of a missing volume) was to mount it in degraded mode.
This worked for a while but I found it frustrating that it would not mount on boot.

I tried the command to remove the missing device. I think that after this it either said that the missing device was the device in use OR it no longer indicated a missing device. I’m not sure now.

It still would not just mount, only in degraded mode.
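The degraded mount and the "remove missing device" step described above can be sketched roughly as follows. The function name and its two arguments are hypothetical; `/dev/sda1` in the usage line is the device named in the error output later in this post, and `/mnt/pool` is a placeholder mount point.

```shell
#!/bin/sh
# Sketch only: mount_degraded_and_clean is a hypothetical helper name.
# Usage (as root): mount_degraded_and_clean /dev/sda1 /mnt/pool
mount_degraded_and_clean() {
    dev="$1"
    mnt="$2"
    if [ ! -b "$dev" ]; then
        echo "not a block device: $dev" >&2
        return 1
    fi
    # Allow the mount to proceed even though a member device is absent.
    mount -o degraded "$dev" "$mnt" || return 1
    # List member devices; an absent one is reported as missing.
    btrfs filesystem show "$mnt"
    # Drop the absent device from the filesystem's device list.
    btrfs device remove missing "$mnt"
}
```

As long as the filesystem still records a member device that is not present, a plain mount is expected to be refused, which would match the behaviour described here.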

Typically a check disk utility would be the “go-to” for other file systems, hence I started searching around.

At some point I ran;

btrfs inspect-internal dump-super

btrfs rescue super-recover

I tried btrfs rescue zero-log
https://btrfs.wiki.kernel.org/index.php/Manpage/btrfs-rescue
indicates possible loss of data in the past 30 seconds but since I have not written data to the drive for a while and it was not mounted, I thought this should be safe.

I’ve run a btrfs check --check-data-csum but after a long time of running and no error outputs I canceled it.

I’ve run a btrfs check --init-extent-tree but that ran for days and ended up “Aborted”, so I don’t know if that had done anything or was just reporting what it could do.
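For comparison, a minimal sketch of an inspection pass that only reads from the device. The function name is hypothetical. In the btrfs-check man page, `--init-extent-tree` is listed with the dangerous options, so unlike the commands below it does write to the filesystem.

```shell
#!/bin/sh
# Sketch only: inspect_readonly is a hypothetical helper name.
# Run against an unmounted filesystem, e.g.: inspect_readonly /dev/sda1
inspect_readonly() {
    dev="$1"
    if [ ! -b "$dev" ]; then
        echo "not a block device: $dev" >&2
        return 1
    fi
    # Print all superblock copies in full; nothing is written.
    btrfs inspect-internal dump-super -fa "$dev"
    # Check metadata consistency without attempting any repair.
    btrfs check --readonly "$dev"
}
```

Redirecting the output to a file makes it easier to attach to a forum post or mailing list report.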

I found this man page for BTRFS;

It indicated under SAFE OR ADVISORY OPTIONS that I could run this command.
btrfs check --clear-space-cache v1|v2
I read online that this space-cache consists of entries that indicate where empty space on the drive exists and is used to speed up free space checks. Also something I had read indicated to me that perhaps this could be the reason.

I ran v1 but that Failed,
I ran V2 and it indicated nothing to do
I ran v1 and it indicated something different this time
so I ran this command a couple more times.
Then I think it indicated an error

After this btrfs would no longer mount the drive.
I really don’t understand what is wrong with BTRFS that a “safe” check tool can make things worse than they were.

This is the error output to the SSH Prompt

mount: /srv/dev-disk-by-id-ata-ST8000AS0002-1NA17Z_Z8409EZA-part1: wrong fs type, bad option, bad superblock on /dev/sda1, missing codepage or helper program, or other error.

the Console indicates

BTRFS error (device sda1) : block=7630187954176 read time tree block corruption detected
BTRFS error (device sda1) : failed to read block groups: -5
BTRFS error (device sda1) : open_ctree failed

I ran;

btrfs rescue super-recover

responds with;

All supers are valid, no need to recover

BTW, I removed the screws on the “Failed” drive and cleaned the contacts on the PCB. It now appears to work okay: it spins up and read checks pass, but I have not written to the disk. It might be possible to use it for some recovery if I can undo the “WIPE”.

Any suggestions?

b8two · May 27, 2021, 2:14pm

I checked my Rockstor’s BTRFS version and it’s V4.19.1
My assumption was that the version on OMV was old/older.

BTRFS is Version 5.12 but how can anyone run the latest version?
https://btrfs.wiki.kernel.org/index.php/Changelog#By_feature
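When comparing versions, note that the kernel's Btrfs driver and the userspace tools (btrfs-progs) are versioned separately; the V4.x numbers quoted in this thread look like btrfs-progs versions. A small sketch for printing both on a given machine follows. The function names are hypothetical, and the parsing assumes the first line of `btrfs --version` has the form `btrfs-progs v4.20.1`.

```shell
#!/bin/sh
# Sketch only: progs_version and show_versions are hypothetical helper names.

# Extract the bare version number from `btrfs --version` style output on stdin.
progs_version() {
    # Only the line starting with "btrfs-progs v" is printed, minus the prefix.
    sed -n 's/^btrfs-progs v\([0-9][0-9.]*\).*/\1/p'
}

# Print the running kernel release and the installed btrfs-progs version.
show_versions() {
    echo "kernel:      $(uname -r)"
    if command -v btrfs >/dev/null 2>&1; then
        echo "btrfs-progs: $(btrfs --version | progs_version)"
    else
        echo "btrfs-progs: not installed"
    fi
}
```

Calling `show_versions` on both the OMV and the Rockstor box gives a like-for-like comparison of kernel and tools.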

GeoffA · May 27, 2021, 5:03pm

@b8two sorry to hear of your woes with this, it can’t be much fun (understatement!).

I’m not heavily into OMV, my only experience of it was with EXT4 formatted drives. However as it’s a 32-bit version that you say you are using, I can only assume it’s not current and so uses an older BTRFS version. I’m not up to speed with the lifecycle and release versions of BTRFS, so I could be wrong there.

My advice would be that if you wish to stay with BTRFS, I’d use a NAS solution that is built around it from the ground up. Any takers?

So, assuming you have backup(s) of your 6.23 TB of data, I’d replace the potentially faulty drive and build a fresh RAID1 using a current NAS solution and restore the data from your backup.

Now, if you are using a 32 bit version of OMV because of your hardware, your options are more limited I’m afraid.

Good luck!

b8two · May 27, 2021, 6:49pm

Thank you for your input @GeoffA.

@GeoffA, what version BTRFS are you running?

I used OMV due to the 32bit hardware limitation of the MB I had available and RockStor isn’t compatible. That being said, my RockStor box is 64bit and I recently updated Rockstor to stable (Linux RockStor 5.3.18-lp152.72-default #1) but I think I had started with a LEAP install and its BTRFS is (Dec 2018). Compared to OMV’s BTRFS (Jan 2019), a month newer.

Hence the problem I have with BTRFS would most likely have occurred if the drives were in my Rockstor box. In the past I had changed my 6 Drive RAID 10 to a 5 Drive RAID 10 in rockstor when a Drive started to show signs of old age but the process was very different.

If I was running openSUSE: Tumbleweed, what version BTRFS would that be running?

Would anyone recommend an upgrade from Leap to Tumbleweed while in use as suggested here?
https://en.opensuse.org/openSUSE:Tumbleweed_upgrade

I don’t have an issue with Rockstor currently but I have noticed the BTRFS is very old and after my experience with out-of-date tools, it concerns me.

Flox · May 28, 2021, 1:43am

Hi @b8two,

Sorry you’re having issues with your filesystem; I’ll let people more knowledgeable than me chip in here as I don’t want to risk data loss on your side as a result of my words. I do want to clarify the situation with regards to the version of Btrfs in Rockstor. In particular:

Although the btrfs-progs package in Leap 15.2 is indeed based on 4.19.1, it is continuously updated and receives a ton of backports from the newer versions to this package. The same goes for the kernel, which means that a fully updated Leap 15.2 is not running a Btrfs stack from Dec 2018. You can see an example with more details in a now older post related to Leap 15.1 (the same principles apply):

This process can actually be considered as a big advantage of openSUSE Leap as it receives backports of important newer Btrfs updates once they have been deemed ready (secure, stable, etc…) to be backported.

Even though I’m not helping with your situation, I hope I could clarify a bit the situation about the version of the Btrfs stack you are running under Rockstor.

GeoffA · May 28, 2021, 6:21am

I’m running Rockstor 4.0.7, so my BTRFS version will be the one baked in there - @Flox has the info above.

Hooverdan · May 28, 2021, 10:05pm

On my LEAP 15.2 test install of Rockstor (4.0.6 upgraded to 4.0.7) I am running btrfs-progs v5.11. On my CentOS version of Rockstor (3.9.2-57) I have recently forced a newer kernel along with btrfs-progs v5.12.1, but of course that’s bleeding edge and not necessarily recommended :).