Reboot, although still in pool balancing / Error running a command. cmd

@musugru Hello there.
As per @Hooverdan statement:

You may be swimming against the tide somewhat with the older CentOS-based Rockstor variant; even if you upgraded to our Stable offering in that now-legacy variant, the kernel is not updated. So you might well be better off attempting this ‘recovery’ scenario using an instance of our Rockstor 4 ‘Built on openSUSE’. No installer downloads as of yet (available soon hopefully), but you can build your own fully updated version via the instructions available in our GitHub repo for this:

This will put you in better ‘hands’ re the capabilities of the less well developed parity raids within btrfs, i.e. 5 & 6, but it obviously doesn’t address any hardware strangeness such as your reported SMART drive error where smartmontools couldn’t find the stated drive. It may be that you have a flaky connection to the drive; this would also explain the ‘rough’ detached devices. It may well help those here to help you if we can also see the name mapping between what Rockstor references as the ‘temp’ name of sd* and the by-id names. The temp name is used by Rockstor as these names can change from boot to boot.
So if you could post the output of the following two commands executed on the same boot:

btrfs fi show
ls -la /dev/disk/by-id/

Normally if there is a detached disk then btrfs fi show indicates a missing disk. And the fact that smartmontools reported a missing device by its by-id name, which doesn’t change, suggests that you may have an intermittent drive connection. Or, as you suggested, a drive that comes and goes.
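If it helps, the by-id to sd* mapping can also be printed directly; a minimal sketch using only standard coreutils:

```shell
# Print each stable by-id name alongside the kernel (sd*) device it
# currently resolves to, so the two listings above can be correlated.
for link in /dev/disk/by-id/*; do
    printf '%s -> %s\n' "$link" "$(readlink -f "$link")"
done
```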

Drives in Rockstor are tracked via their serial number, see:

But that serial number doesn’t look like any of the others. So I’m a little confused by that currently.
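To cross-check what serials the kernel itself is reporting (compare these against the Web-UI’s Disks page), lsblk can list them per device; standard util-linux columns, nothing Rockstor-specific assumed:

```shell
# Show kernel name, serial number, model and size for every block device.
lsblk -o NAME,SERIAL,MODEL,SIZE
```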

Best to see about building a Rockstor 4 installer and using a resulting install of that to do any pool repair, I think. You will likely want to continue using this instance thereafter, but you can always revert back to your existing install if need be. Just don’t have both system disks attached at the same time, as this has known confusion issues.

Also, if you are to use a newer instance for the pool recovery (advised, due to parity raid having known problems in repair) and want to use the Web-UI to do this, as opposed to the command line only, then you will first have to mount the poorly pool via the command line, as Rockstor can’t yet import poorly pools that can only mount ro, which may be the case with your current pool.

mount -o ro /dev/disk/by-id/<any-pool-member-by-id-name> /mnt2/<pool-label/name-here>

N.B. as the balance was ongoing when you shut down, it may want to resume, and in turn cause problems. So you could add skip_balance to the options (the -o above), so that you have “-o ro,skip_balance” (no space after the comma); and, out of interest, if you also have a missing disk, indicated by btrfs fi show, then you will also need a ‘degraded’ in there too.

Once you have a successful mount in place, a Rockstor import (via any disk member) should work as expected. This is helpful as you can then ‘deal’ with things in the newer install, where the Web-UI has far greater capabilities for reporting issues and helping with repair etc., and sits on top of a years-newer kernel and btrfs-progs.

Yes, any hardware enforcement / correction can only help. Also check your memory:
http://rockstor.com/docs/pre-install-howto/pre-install-howto.html#memory-test-memtest86
and re-seat all your SATA cables and make sure they are up to spec.

I am unfortunately no expert in any of these areas, so if anyone else has further suggestions here then please chip in. But the simple act of a shutdown during a balance is not expected to fail, so there may well be something else afoot here.

Repeating what our Web-UI and docs suggest: the btrfs parity raid levels of 5/6 are not considered ready for production and actually lack some features in comparison to the raid 1/10 variants. They are simply younger within the fs. Hence the suggestion to use a Rockstor 4 instance, to gain whatever advantages / fixes you can from the aggressive btrfs back-port work done by the SuSE/openSUSE folks that we base our Rockstor 4 instance on. Plus that kernel gets updates, and our legacy CentOS one in Rockstor 3 does not.

And be sure to also look at your general system log via the “journalctl” command; e.g. for a live feed of this from a terminal you can do

journalctl -f

and there are many other options to that command.
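For example, to pull out only the btrfs-related kernel messages from the current boot (standard journalctl flags, assuming a systemd-based install such as Rockstor 4):

```shell
# Kernel messages from the current boot that mention btrfs --
# handy for spotting mount, balance, or device errors after the fact.
journalctl -k -b | grep -i btrfs
```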

Hope that helps and let us know how you get along.
