Howto recover a Raid 1 pool from a disrupted balance

wihe1 · October 7, 2018, 5:46pm

Hello,
I have been using Rockstor for more than a year now, but still feel I am a newbie to the system.
I recently received the message Pool device errors, or something like it. I managed to find out that one of the two 3GB disks in my Raid 1 configuration had read errors. I bought a new 4GB disk, added it, and resized the pool to include the new disk, which in turn initiated a balance of the pool. That was 5 days ago. Progress was awfully slow. Until yesterday I saw progress when asking for the balance status on the command line, but in the last 24 hours there has been no progress. Just an hour ago I lost contact with the network drives in my NAS. I could still use the webui. I first looked at samba and decided to switch it off and on again to see whether that would restore the connection.
Then I got a Houston message. Sorry, I was unable to copy it. I decided to do a reboot from the webui.

The reboot failed, so I connected a screen and keyboard to the server. I saw a long trace of messages ending in the message that the system was ready to reboot, but was not able to send that message to a ???.

I decided to do a hard reboot. Oops… the first message was that the system failed to start btrfs. Fortunately the start continued and I was able to open the web ui.
My pool however was unmounted and the automatic restart of the balance failed with the message Error running a command. cmd = btrfs balance start -mconvert=raid1 -dconvert=raid1 /mnt2/Pool01. rc = 1. stdout = [‘’]. stderr = [“ERROR: error during balancing ‘/mnt2/Pool01’: Invalid argument”, ‘There may be more info in syslog - try dmesg | tail’, ‘’]. It took ages to get information on the pool screen and also on the Disks screen.
In the meantime the web GUI is offline again and the IP address cannot be found.

So… now I am lost. Could you please help me solve these issues?

regards,
Willem

I just found that after I started the balance on Wednesday 3 October, an upgrade to .41 was done on Friday 5 October. Could this be the reason?

I have added three screen photos illustrating what happens on the console after starting the system.

@phillxnet: Philip, can you help me with this issue? It seems you have expertise on how Rockstor handles this, and you worked on the .41 release. I cannot reach my data anymore and am unsure how to proceed in this matter.

phillxnet · October 18, 2018, 8:24pm

@wihe1 Welcome to the Rockstor community and thanks for helping to support Rockstor’s development via a stable subscription.

Yes, balance requires a mounted pool and so given yours was not mounted the ‘Invalid argument’ is the result.
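As a minimal sketch of that precondition (not Rockstor's own code), a shell helper can check the mounts table before any balance command is attempted. The function name `pool_is_mounted` and its optional second argument are made up for illustration; `/mnt2/Pool01` is the mount point from the error message above.

```shell
#!/bin/sh
# Hypothetical helper: succeed only if the given mount point is listed
# as a btrfs filesystem in the mounts table.
# $1 = mount point, $2 = mounts file (defaults to /proc/mounts).
pool_is_mounted() {
    mount_point=$1
    mounts_file=${2:-/proc/mounts}
    # In /proc/mounts, field 2 is the mount point and field 3 the fs type.
    awk -v mp="$mount_point" '
        $2 == mp && $3 == "btrfs" { found = 1 }
        END { exit !found }
    ' "$mounts_file"
}

# Example use (commented out so nothing runs by accident):
# if pool_is_mounted /mnt2/Pool01; then
#     btrfs balance status /mnt2/Pool01
# else
#     echo "Pool01 is not mounted, so a balance cannot run" >&2
# fi
```

If the check fails, the pool has to be mounted (or repaired until it can be mounted) before a balance is of any use.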

I’m assuming you used the Web-UI to add the new 4GB disk to the pool, but you also mention resizing the pool. Could you, if possible, state the commands you used? They may still be in your command line history if you did indeed go the command line route.

Also we need to know more about the pool in question as raid1 can only handle failure on a single drive, you mention having write errors on more than a single drive.

Did you also add the new 4GB drive live or was that attached when the system was powered off?

Yes, you will find disabling quotas on the pool speeds such things up quite considerably. The more recent stable channel update versions can handle pools with disabled quotas; you just lose share (subvol) size reporting currently.
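For anyone going the command line route, disabling quotas could look roughly like the sketch below. The wrapper names `run` and `disable_pool_quotas` and the `DRY_RUN` variable are made up for illustration; `btrfs quota disable` is the btrfs-progs subcommand, and `/mnt2/Pool01` is assumed to be the mounted pool.

```shell
#!/bin/sh
# Hypothetical wrapper: run a command, or only print it when DRY_RUN=1,
# so the exact command can be reviewed before touching the pool.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Hypothetical helper: disable quotas on a mounted btrfs pool.
# $1 = pool mount point, e.g. /mnt2/Pool01 (assumed).
disable_pool_quotas() {
    run btrfs quota disable "$1"
}

# Example use (commented out so nothing runs by accident):
# DRY_RUN=1
# disable_pool_quotas /mnt2/Pool01   # prints the command only
```

Quotas can be switched back on later with `btrfs quota enable` on the same mount point if share size reporting is wanted again.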

An upgrade mid balance shouldn’t affect the balance, and no kernel update was included. But you can, under some circumstances, end up running some new code and some old which can cause bugs but a reboot, or a rockstor service restart should resolve them. The 41 release was a big one and is detailed here:
https://github.com/rockstor/rockstor-core/pull/1971
but mainly concerned adding the ability to replace a missing disk in a Rockstor managed pool and added some stuff to the pool details page (ie the temp name of type /dev/sd* for instance).

From the pics it looks like your pool, as you say, has issues. And they have in turn caused a kernel panic.

Yes I have done some work on rockstor in the disks, pools, and shares areas and have worked on almost all of the individual releases for the past 3 years. But your current issue is with btrfs crashing / or at least upsetting our now rather old kernel. As to how to proceed, your best bet is to try another reboot and hope it can at least resurrect itself, but it looks like the issue may be, as you surmised, an auto continuation of the balance (this happens at the btrfs level; rockstor simply has a process to monitor what it expects to be an ongoing balance if one was initiated via the Web-UI). Btrfs itself will auto resume a balance that did not complete due to an interruption.

If you can reboot and gain command line access again it would be useful to get some info on your pool:

btrfs fi show

and

cat /proc/mounts

and if your pool does end up mounting this time, which can happen

btrfs balance status /mnt2/pool-label

So it looks like you have a poorly pool that is in turn throwing the kernel. In which case you may have to use a live boot cd to repair the pool. One possibility could be an openSUSE Leap 15 or Tumbleweed live variant and then use its command line (with much newer kernel and btrfs tools) to attempt to retrieve what data you can (if needed) or in fact repair the pool. Take a look at the following openSUSE guide on repairing a btrfs volume (pool in Rockstor speak):

https://en.opensuse.org/SDB:BTRFS
You want the “How to repair a broken/unmountable btrfs filesystem” section.

The other earlier sections on subvol layout are all different for Rockstor, although we are intending to move over to openSUSE as time permits for the necessary preparations.

This is a pain, but if your (our) existing kernel is failing you then it may be a viable workaround to effect a repair / data retrieval so that you can return to Rockstor’s normal function.

Hope that helps and let us know how you get on.

wihe1 · November 17, 2018, 10:32am

In the end using the openSUSE procedure helped to get the Pool accessible. I had to run the procedure up until the last step. That finally gave results, although the pool is still degraded, RW. My next step is to get the pool fully operational again and get rid of the disk with the RO errors.

phillxnet · November 17, 2018, 3:03pm

@wihe1 Thanks for the update and well done getting that pool at least partially sorted.

If you now have access you might first want to refresh your backups, just in case there is a backwards slide in the pool’s health during the final repair procedures.

Keep us updated, and remember that at least the Tumbleweed iso is updated very frequently, so if something doesn’t work out with your existing version it’s always worth trying the latest version, as btrfs is very actively developed and they are improving things all the time, with Tumbleweed incorporating the latest version with each release.

wihe1 · November 18, 2018, 9:45am

I copied all the content to two different drives. I deleted the mount options of the pool, which succeeded. I ran a scrub and later a balance on the pool. They both succeeded.
Then I tried to remove the disk which signaled the read errors. All this from the webui. The process returned the following message.

By the way, the link in step 3 points to a URL that does not exist.

OK, just refreshed the webui and checked the drives. The remove succeeded after all.

Willem

phillxnet · November 18, 2018, 6:10pm

@wihe1 Thanks for the update.

Yes this one is on my list and I hope to get to it soon, we have it as a GitHub issue here:

github.com/rockstor/rockstor-core

pool resize disk removal unknown internal error and no UI counterpart

opened 06:19PM - 01 Jun 17 UTC

closed 02:10PM - 09 Jul 19 UTC

phillxnet

Thanks to forum member Noggin for highlighting this behaviour. Occasionally when removing a disk from a pool there can be a UI time out directly after the last dialog entitled "Resize Pool / Change RAID level for ..." which acts as last confirmation of the configured operation:

![harmless-put-timeout-on-dev-remove](https://cloud.githubusercontent.com/assets/2521585/26693598/a6a7f03a-46fc-11e7-80d2-f45df1f32e28.png)

There is then no UI 'balance' indicated while the removal is in progress, yet the UI indicates that a balance is in progress when a balance is attempted (only attempted by Noggin as I did not attempt to execute a balance whilst the removal was in progress).

```
btrfs balance status /mnt2/time_machine_pool/
No balance found on '/mnt2/time_machine_pool/'
```

The pool resize is however indicated by the requested disks having their size 'demoted' to zero and showing a reduced usage with subsequent executions of **btrfs fi show**:

```
Label: 'time_machine_pool'  uuid: 8f363c7d-2546-4655-b81b-744e06336b07
	Total devices 4 FS bytes used 31.57GiB
	devid    3 size 149.05GiB used 17.03GiB path /dev/sdd
	devid    4 size 0.00B used 5.00GiB path /dev/sda
	devid    5 size 149.05GiB used 23.03GiB path /dev/mapper/luks-d36d39ea-c0b3-4355-b0c5-bd3248e6bbfe
	devid    6 size 149.05GiB used 23.00GiB path /dev/mapper/luks-d7524e90-4d9e-4772-932f-d1407b6b5fe7
```

and then later on:

```
Label: 'time_machine_pool'  uuid: 8f363c7d-2546-4655-b81b-744e06336b07
	Total devices 4 FS bytes used 32.57GiB
	devid    3 size 149.05GiB used 18.03GiB path /dev/sdd
	devid    4 size 0.00B used 2.00GiB path /dev/sda
	devid    5 size 149.05GiB used 24.03GiB path /dev/mapper/luks-d36d39ea-c0b3-4355-b0c5-bd3248e6bbfe
	devid    6 size 149.05GiB used 24.00GiB path /dev/mapper/luks-d7524e90-4d9e-4772-932f-d1407b6b5fe7
```

As can be seen devid 4 is having its pool usage reduced (from 5 to 2 GiB) between runs.

In the above example the disk removal completed successfully; however, there was never a UI indication of its 'in progress' nature or any record of a balance having taken place at that time. Reference to Noggin's forum thread, suspected as indicating the same as my observations in final testing of pr #1716, which also led to this issue's creation (details of the preceding steps available in that pr): https://forum.rockstor.com/t/cant-remove-failed-drive-from-pool-rebalance-in-progress/3319 where a 3.8.16-16 (3.9.0 iso install) version exhibited the same behaviour (pre #1716 merge).

Its cause is understood, quoting from that issue:

“Essentially when running a ‘btrfs device delete’ (dev-name and/or missing) an internal balance is initiated. This, in almost all ‘real hardware’ cases, causes the indicated error. Essentially a ‘time out’ as a balance will typically take hours and we currently fail to run the associated code resize_pool() as an async task like we do with regular balance operations and so our db commit code is ‘frozen’ until such time as the ‘btrfs device delete’ completes its internal balance.”

Apologies for that as I meant to warn you of this Rockstor Web-UI shortfall but alas I did not. In short the Web-UI element times out waiting and we have no ‘progress’ mechanism, and unlike a scrub or a user initiated balance btrfs currently provides no ‘status’ feature for a disk removal initiated internal balance. But I have notes from when I last updated that issue (as it was a functional caveat to another I did recently which was released in version 3.9.2-41) and they contain a mechanism I think we can key to abstract our own ‘status’ report; so I should be able to pick this one up from there when next time permits.
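Until such a status report exists, the `btrfs fi show` output quoted in the issue above suggests one rough way to watch a removal from the command line: the device being removed shows `size 0.00B`, and its `used` figure shrinks between runs. The helper below is only an illustrative sketch under that assumption; it is not Rockstor's implementation and not necessarily the mechanism in the notes mentioned above, and the name `removal_progress` is made up.

```shell
#!/bin/sh
# Hypothetical helper: read `btrfs fi show` output on stdin and report
# every device whose size has been set to 0.00B (assumed here to mean it
# is being removed), along with the space still allocated on it.
removal_progress() {
    # Expected device line layout:
    #   devid <id> size <size> used <used> path <path>
    awk '$1 == "devid" && $3 == "size" && $4 == "0.00B" {
        printf "devid %s (%s): %s still to relocate\n", $2, $8, $6
    }'
}

# Example use (commented out so nothing runs by accident):
# btrfs fi show | removal_progress
```

Running it every few minutes and comparing the reported figure gives a crude progress indication; no output means no device currently shows a zero size.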

Yes, that may come back in time but is currently awol, I think it’s basically down to time availability currently. This element of Rockstor’s ‘business’ is in @suman’s wheel house, while I’m currently focusing on the code.

Yes, we successfully initiate it so it will continue and ultimately finish (if all is well) but the Web-UI code just gets stuck waiting for it and hits the observed timeout. And upon it finishing a web page refresh initiates a pool info update which then self corrects the previously stale info we have due to the stuck state.

So I’m assuming you are now out the other side of this disk failure / kernel panic event? If so well done for persevering and hopefully sooner rather than later there should be more capable Rockstor offerings. But unfortunately they just aren’t quite at feature parity (but fairly close as it goes). All in good time hopefully.