Warning: you might want to hold off on Rockstor 3.9.1 and the new kernel/progs

I just updated Rockstor the other day and now my btrfs pool is completely gone. I was running raid10 and now when I try to mount my pool, my entire system locks up with the following errors:

[ 716.902506] BTRFS error (device sdb): failed to read the system array: -5
[ 716.918284] BTRFS error (device sdb): open_ctree failed

Another Rockstor user posted to the btrfs mailing list with a very similar error message; he had also updated to Rockstor 3.9.1 with the newer 4.10 kernel and newer btrfs-progs.

https://www.mail-archive.com/linux-btrfs@vger.kernel.org/msg65734.html

His error is very similar in behavior and error messages:
[ 23.393973] BTRFS error (device sdf): failed to read block groups: -5
[ 23.419807] BTRFS error (device sdf): open_ctree failed

This could entirely just be a coincidence. But two separate users losing their entire btrfs array to the same error running the same Rockstor install right after the same kernel/progs update? The hair on my neck is standing up. Be careful.

@chicago First off a very belated welcome to the Rockstor community and thanks for helping to support Rockstor development via subscription to the stable updates channel.

The likelihood here is that the mount process is often when ‘problems’ show in the most obvious way. You may well find ‘tell-tale’ signs further back in your logs that may help with the diagnosis. And often our stable channel subscribers will only end up rebooting upon new stable channel releases, such as we have just had. The btrfs mailing list thread you link to, started by Daniel (sorry, I don’t know Daniel’s user name on this forum), concerned, in part, a hardware disk failure on Raid5/6 (not Raid10 as you report), ie:

https://www.mail-archive.com/linux-btrfs@vger.kernel.org/msg65754.html
quoting Roman Mamedov’s reply in that thread:

"Also one of your disks or cables is failing (was /dev/sde on that boot, but may
get a different index next boot), check SMART data for it and replace."

[   21.230919] BTRFS info (device sdf): bdev /dev/sde errs: wr 402545, rd 234683174, flush 194501, corrupt 0, gen 0
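For anyone following along, a failing disk or cable like the one Roman flags can be checked from the command line. A minimal sketch, assuming the suspect device name from that thread (`/dev/sde`) and an example mount point; substitute your own:

```shell
# Query SMART health and attributes for the suspect drive
# (look for reallocated / pending sector counts and read error rates):
smartctl -a /dev/sde

# btrfs also keeps per-device error counters, matching the
# wr/rd/flush numbers quoted in the log line above
# (requires the pool to be mounted; /mnt/pool is an example path):
btrfs device stats /mnt/pool
```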

This is obviously a tense time, but I do find those comments a tad inflammatory / accusatory, shall we say. Also:

“failed to read the system array: -5” is not ‘the same’ as “failed to read block groups: -5”, though you do acknowledge this earlier in your post here, as you do in your recent linux-btrfs mailing list post. Plus btrfs Raid1/10 are entirely different from the parity raids of btrfs Raid5/6, so hopefully this bodes well for your data (assuming no backups). Another note on language use: it is perplexing that you and Daniel share the “… same Rockstor install …” :slight_smile:. Presumably you meant version. Plus you haven’t yet confirmed that your pool or its data are actually ‘lost’.

Linking to your linux-btrfs post for context and hopefully we can avoid duplicating effort:

https://www.mail-archive.com/linux-btrfs@vger.kernel.org/msg65772.html

As a hopefully helpful comment on your linux-btrfs posting: I think it would be as well for you to report there the exact steps you took to attempt a manual mount, as there you quote your own log entries thus:

[  716.902506] BTRFS error (device sdb): failed to read the system array: -5
[  716.918284] BTRFS error (device sdb): open_ctree failed
[  717.004162] BTRFS warning (device sdb): 'recovery' is deprecated, use 'usebackuproot' instead
[  717.004165] BTRFS info (device sdb): trying to use backup root at mount time
[  717.004167] BTRFS info (device sdb): disk space caching is enabled
[  717.004168] BTRFS info (device sdb): has skinny extents
[  717.005673] BTRFS error (device sdb): failed to read the system array: -5
[  717.020248] BTRFS error (device sdb): open_ctree failed

which indicates the use of the ‘recovery’ mount option. This deprecated mount option is NOT one that Rockstor exercises or even allows via the UI. I would as a consequence suggest that you report ‘the whole story’, at least when talking to the canonical authors of the filesystem you are experiencing trouble with. Similarly, in your same linux-btrfs mailing list thread (and not here) we have:
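For reference, a manual read-only recovery attempt with the current option name might look like this. A sketch only; the device and mount point are examples, not taken from the report above:

```shell
# 'recovery' was deprecated around kernel 4.10; 'usebackuproot' replaces it.
# Mount read-only first so nothing is written while diagnosing:
mkdir -p /mnt/recovery
mount -t btrfs -o ro,usebackuproot /dev/sdb /mnt/recovery
```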

“I had also added a new 6TB disk a few days ago but I’m not sure if the balance finished as it locked up sometime today when I was at work. Any ideas how I can recover?”

https://www.mail-archive.com/linux-btrfs@vger.kernel.org/msg65772.html
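Whether that balance actually completed can be checked once the pool mounts. A sketch, with an example mount point:

```shell
# Reports whether a balance is running, paused, or not in progress:
btrfs balance status /mnt/pool

# A balance interrupted by a crash normally resumes on the next
# read-write mount; the 'skip_balance' mount option prevents that
# while investigating, e.g.:
#   mount -t btrfs -o ro,skip_balance /dev/sdb /mnt/pool
```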

accompanied by your other log entry excerpt:

… [ 969.375296] BUG: unable to handle kernel NULL pointer dereference at 00000000000001f0
[ 969.376583] IP: can_overcommit+0x1d/0x110 [btrfs] …

All core Rockstor developers are obviously avid readers of the linux-btrfs mailing list. Please try in future to take more care with premature attribution of blame or the casting of aspersions. We all make mistakes, and the important thing is to work together to try and make better software / file systems (which are hard), and that is what I hope we are all trying to do here. Belittling the work of all Rockstor developers is probably not the best route to gaining assistance where and when you might need it. We have to work together to succeed.

Let’s hope the btrfs experts can shed some light on your experience with mounting your pool.

Where in my post was I being inflammatory? My exact wording was “two users at the same time had the same problem after updating to the same kernel and the same btrfs-progs”. It does seem a little fishy. I didn’t say “HEY EVERYONE ABANDON ROCKSTOR!” I didn’t say “HEY EVERYONE ROCKSTOR IS CRAP DON’T USE IT!” I merely said that there is some concern about two people using the same software, applying the same update, both losing the ability to mount their arrays. Perhaps the failures are related, so here is a warning from someone who had issues with the update.

Regarding the addition of a disk, yes, I added one. But a rebalance onto one new disk in a raid10 scenario should not cause the entire filesystem to become unmountable like that. Btrfs shouldn’t just say “ah well, I lost a disk, can’t see the filesystem, sorry!”

Perhaps be a little less paranoid and guarded when someone posts a caution after losing their array following an update? If you think I’m being inflammatory, I do apologize, but I assure you 100% my concern was with the update.
