Drop enforcing the raid level when balancing?

s.ma · February 1, 2022, 2:29pm

With the plan to add another disk later I read your documentation about raid1c3.

Rockstor ‘allows’ these raid levels but is currently un-aware of them. As such if any Pool modifications are enacted via the Web-UI, e.g Balance a pool or Pool Resize/ReRaid the Rockstor defaults will be reasserted.
(Installing the Stable Kernel Backport — Rockstor documentation)

This problem surprised me generally, because I thought there’s no need to explicitly specify a (new) raid level to do a (re)balance. And concerning Rockstor, it especially surprised me, because I thought you already tried to introduce the separate word “ReRaid” for UI choices that change the “raid level” or “redundancy profile”. (To me the latter or “redundancy level” may sound the most clear.)

I guess there must be a reason you chose to enforce the redundancy level, wouldn’t it be possible for you to let the UI show a selection “keep” or “unchanged” by default and just drop (or blank out) that option from the balance command?

(So that even without fully knowing/supporting all available raid levels, it would not have to interfere or break such setups.)
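
A minimal sketch of that idea, not Rockstor's actual code: a command builder that only adds the convert filters when a target profile is explicitly chosen, so "keep" simply means no filter. The helper name `build_balance_cmd` and its parameters are hypothetical; `btrfs balance start`, `-dconvert=` and `-mconvert=` are the upstream btrfs-progs options.

```python
from typing import List, Optional


def build_balance_cmd(
    mount_point: str,
    data_profile: Optional[str] = None,
    metadata_profile: Optional[str] = None,
) -> List[str]:
    """Build a 'btrfs balance start' argument list.

    Hypothetical helper: None for a profile means "keep/unchanged",
    so no convert filter is added for that block group type.
    """
    cmd = ["btrfs", "balance", "start"]
    if data_profile is not None:
        cmd.append(f"-dconvert={data_profile}")
    if metadata_profile is not None:
        cmd.append(f"-mconvert={metadata_profile}")
    cmd.append(mount_point)
    return cmd


# "Keep" both profiles: no convert filters are emitted.
print(build_balance_cmd("/mnt2/pool-name-here"))
# Explicit ReRaid to raid1 for data and metadata.
print(build_balance_cmd("/mnt2/pool-name-here", "raid1", "raid1"))
```

With this shape, a pool using a profile the UI does not know about would be left as it is unless the user picks a new level.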

phillxnet · February 1, 2022, 8:48pm

@s.ma Hello again.
Re:

There is (for ease of use and to avoid inadvertent multiple raid levels co-existing), and we do plan to allow the flexibility you indicate once we move into being more flexible re differing data/metadata levels. We just haven’t gotten that far yet. The first step will be to just expose this to the user. And then take it from there. Step by step.

We have a core issue open for this first step but I can’t seem to locate it just now.

But yes, in time we should become more flexible but will probably have to default to re-asserting raid levels to their current setting. It’s just those settings re data/metadata are currently set. And if one doesn’t re-assert there can be issues down the line, i.e. change from single to raid1 by adding a disk and if you don’t do a full balance the initial single data is not changed to raid1 until it’s re-written. Not what folks unknowing of this ‘nature’ of btrfs will expect. So we brute force it whenever it’s changed.

I like the ReRaid myself actually. And if folks know the term ‘raid’ they will likely be able to infer re-raid was my thinking.

Hope that helps and thanks for the input. We are however a small team with a large task. So we all have to be way more patient than most of us would prefer to be. Plus we have the more pressing matter of addressing our Python 2 → 3 change to think about before doing too many ‘improvements’ on the feature front.

s.ma · February 5, 2022, 1:12pm

Thanks, I certainly didn’t know those parts of btrfs ‘nature’ yet.

In the meantime I actually read about a related possible surprise in this area: “we have noticed that applying one of these [raid level changing] balance filters to a completely empty volume leaves some data extents with the previous profile. The solution is to simply run the same balance again. We consider this to be a btrfs bug and if no solution is forthcoming we’ll add the second balance to the code by default. For now, it’s left as-is.” Home | Unraid Docs

I like the ReRaid myself actually. And if folks know the term ‘raid’ they will likely be able to infer re-raid was my thinking.

Yes, I can fully confirm. I got your idea, and it even stuck well (positively).

Coming to think it’s quite fine as a shorthand (button or menu item) for one aspect of “balance”, when speaking of raid and redundancy level in the details.

It was probably the term “balance” that introduces quite a lot of confusion, while that operation seems to actually be a “re-write” of the filesystem, to a specified degree and with some shaping options available (ReRaid, ReSpace, … but not do compression…).

phillxnet · February 5, 2022, 6:43pm

@s.ma Thanks for some more nice input/sharing here.
Re:

My reading of that passage is that the [raid level changing] in this case concerns raid1c3/c4 which we don’t yet offer. And given they are years younger than even the parity raids of 5/6 they are more likely to suffer from such edge cases. Interesting observation/find though. And in context we recommend installing the stable Backport kernel to use such raid levels, and the parity raid actually. See the following new how-to in our docs on that front:
https://rockstor.com/docs/howtos/stable_kernel_backport.html

However in a similar situation but on quotas when they were years younger than they are now, we also had to do a double enable to have them stick. We still do this actually and need to re-visit. But it causes no harm other than a spurious log entry. We can now remove that is my thinking, now that quotas have aged some since then.

Cheers. I also wanted something pretty short to avoid buttons with walls of text on :).

Yes, we inherit balance from upstream directly. It’s a tricky one as few raid systems allow online raid profile changes, which balance can instantiate, so it’s new to many folks not previously familiar with btrfs. Most old raid systems were set for life in whatever you first built them in. Of course the flexibility brings with it complexity and that is one of our challenges within the Web-UI. As you see we don’t do ‘all of btrfs’ and are not likely to soon but we try to cover the basics and most used / useful elements. All depends on what’s requested and the developer resource/contribution side of things really. And our CentOS to openSUSE move has taken quite the toll on those resources so we haven’t added much for the last couple of years.

If one adds a compression flag then rebalances then all re-written data during the balance will be compressed. Compression is another level of complexity best avoided in production if one can help it. Just adds more moving parts to an already large stack of ‘magic’. But folks experience varies. I’ve just seen many compression related bugs on the btrfs mailing list over the years. Likely it also shows up other not directly related stuff but still.

Hope that helps and thanks for sharing your findings and ideas/impressions. Much appreciated.

s.ma · February 6, 2022, 4:42pm

Well, the question would be if all data blocks are going to be re-written. Could be my info was old or I was confusing it with scrub not compressing already written data.

The wiki Compression - btrfs Wiki only mentions btrfs filesystem defrag -r (recursive on directory)

Compression is not a must have at all, though.

This is more important, and it’s not working here. I tried that now, adding a raid1 pool on two luks disks in Rockstor on a VM, and adding a network share.

1. Device remove fails, because UI complains about min. drive number.
2. UI to re-raid to single seems to succeed, but removing is not possible (balance log: can’t go below raid1 min. number)
3. Second re-raid (balance) is not possible in UI, as “single” is not selectable anymore.
4. Manual balances on the commandline seem to succeed (but repeatedly report the same number of modifications)
5. While all device deletes continue to fail with can’t go below min. raid1 number… (Even after balance --force -sconvert=single -mconvert=single -dconvert=single)

phillxnet · February 6, 2022, 7:27pm

@s.ma
Re:

Because a raid1 has 2 disk minimum, you can’t remove a disk from it without resorting to a degraded mount. That’s btrfs. We guide you through this in more real, rather than contrived circumstances. I.e. if a disk was missing due to failure.

The balance involved with a move from raid1 to single has to complete. It’s probably still working in the background so there is still raid1 content that is awaiting the full balance that has to complete. You have to be patient here as the last re-raid wizard indicates. And if you watch the data allocated to each drive and free space you likely will see it move. Also the pool details should show the new raid level once the operation is complete.

We have not changed this area of this code since it was last tested. You may have an issue but you may also have found a corner case. Difficult to tell without a reproducer. How much data did you have on these drives? A nice info checker is the usage command:

btrfs fi usage /mnt2/pool-name-here

It should tell you the amount of differing raid blocks on the pool. Also note that in some settings where there is no or very little data there can be corner cases.
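
As a rough sketch of how that output could be checked for leftover raid1 block groups: the parser below collects the profile named on each "Data", "Metadata" and "System" header line. The function name `profiles_in_use` and the `SAMPLE` text are illustrative assumptions; the sample is only shaped like `btrfs fi usage` output and is not a real capture.

```python
import re
from typing import Dict, Set

# Matches block group header lines such as "Data,RAID1: Size:1.00GiB, Used:..."
_HEADER = re.compile(r"^(Data|Metadata|System),([A-Za-z0-9]+):")


def profiles_in_use(usage_text: str) -> Dict[str, Set[str]]:
    """Map each block group type to the set of profiles reported for it."""
    found: Dict[str, Set[str]] = {}
    for line in usage_text.splitlines():
        match = _HEADER.match(line.strip())
        if match:
            kind, profile = match.groups()
            found.setdefault(kind, set()).add(profile.lower())
    return found


# Illustrative sample only, shaped like 'btrfs fi usage' output.
SAMPLE = """\
Data,single: Size:1.00GiB, Used:512.00KiB
Metadata,DUP: Size:256.00MiB, Used:112.00KiB
System,RAID1: Size:8.00MiB, Used:16.00KiB
"""

for kind, profiles in profiles_in_use(SAMPLE).items():
    print(kind, sorted(profiles))
```

Any type that still lists raid1, or more than one profile, points at a conversion that has not finished or did not fully apply.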

For the:

You must still have some raid1 content. Take a closer look via that command to see. Btrfs wise we are default openSUSE and rely on them to make-stuff-so. And given they employ a few of the major contributors I think we are in good hands. When in doubt: more info. If you are testing the limits look for a simple reproducer from blanked disk so we have an exact case to look out for and to see if this is a known upstream issue.

Hope that helps and let us know if you can find the raid1 entry here. But Rockstor will not allow removal of a disk that results in bringing the pool below its minimum as that would necessarily enter the pool into degraded territory and so the degraded proviso (custom mount option) would be required to do that.
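
The minimum-device rule can be sketched roughly as follows. This is an illustration under stated assumptions, not the check Rockstor performs: `MIN_DEVICES` and `removal_allowed` are hypothetical names, and only the profiles discussed in this thread are listed.

```python
from typing import Iterable

# Minimum device counts for the profiles discussed in this thread.
# Other profiles (raid0, raid10, raid5, raid6) are deliberately left out.
MIN_DEVICES = {"single": 1, "dup": 1, "raid1": 2, "raid1c3": 3, "raid1c4": 4}


def removal_allowed(profiles: Iterable[str], device_count: int) -> bool:
    """True if removing one device keeps every listed profile at or above its minimum.

    Raises KeyError for a profile that is not in MIN_DEVICES.
    """
    remaining = device_count - 1
    if remaining < 1:
        return False
    return all(remaining >= MIN_DEVICES[p.lower()] for p in profiles)


# Two-disk pool fully converted to single/dup: removal is fine.
print(removal_allowed(["single", "dup"], 2))
# Same pool with one leftover raid1 block group type: removal is refused.
print(removal_allowed(["single", "dup", "raid1"], 2))
```

The second call shows why a single remaining raid1 entry is enough to block the removal on a two-disk pool.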

Or we may just have a failure to update the raid level in the db. Single is not a common raid profile given it’s not raid. But this raid level is informed by a simple bit of code here:

https://github.com/rockstor/rockstor-core/blob/master/src/rockstor/fs/btrfs.py#L405-L419

    return total_devices - attached_devids


def degraded_pools_found():
    """
    Primarily intended to indicate the existence of any degraded pools, managed
    or otherwise. Originally used by data_collector to feed real time Web-UI
    indicators. Non-managed pool coverage allows for the indication of a
    degraded mount requirement pre-import or on fresh disaster recovery
    installs.
    :return: Number of degraded pools as indicated by any line ending in
    "missing" following an associated "Label" line.
    """
    # --raw used to minimise pre-processing of irrelevant 'used' info (units).
    cmd = [BTRFS, "fi", "show", "--raw"]

And that is called on each Web-UI initiated refresh of the pool. You may just have to refresh the Pools page for it to allow the removal.

Sorry to not be of more help here, lots of things thrown together. If you end up having what looks like a specific issue it’s best to focus on that issue in a specific thread in the forum to help folks follow just that issue. Here we began discussing a design decision then moved to raid level change bugs. Also note that command line operations can interfere with running Rockstor processes and no raid level change is immediate. Snapshots are almost. But raid level changes with consequent balances (what we do for the reasons discussed earlier and years of doing it this way) take time. Almost all data has to be re-written. In almost all cases on the pool front the pool is the source of truth. But the Web-UI indicates the user preference which we are obliged to enforce.

Cheers and keep the analysis coming. All good. But remember we need exact and ideally small reproducers for bug reports. And we have had no other reports of this type for a few years now. But hey, they have to start somewhere so if this is a corner case or an obvious thing that has been missed then great. But exact reproducers from clean drives is the way to expose an issue and help anyone trying to fix it to have the ability to know their fix has worked. Bar tracking down the exact cause of course.

Hope that helps and again thanks for your engagement here. I myself am on a hiatus of sorts after the years long push to get to our new v4 “Built on openSUSE” base. But there are still things happening in the background, some of which will help to prove the function of all key capabilities which will be nice. And we hope to use this new testing setup (based on openQA) to establish a better safety net going forward where we have to immediately make thousands of changes to up-our-ante re technology versions. All good and all in good time hopefully. This will be the start of our next testing channel release which I have yet to announce bar a mention in our first v4 stable release notes:

s.ma · February 6, 2022, 11:35pm

Yes, after reading that the raid level can’t be dropped (left as is) with the disk add/remove gui, I tried out the add/remove gui feature (“delete” seems only for backwards compatibility), just what we talked about.

What I posted above is a minimal gui test case with blank disks in a VM.

I understand that observation 1. above could be considered expected. Though, without a hint it likely leaves users puzzled what to do, or just seeing this thing as broken if failing to release a disk.
My VM runs purely in ram, the balances don’t take a second, and I saw the result status in the UI before proceeding.

To confirm I tried the raid level enforcing balance on the command line (without luks)

mkfs.btrfs -d raid1 -m raid1 <dev1> <dev2>
mount <dev1> /mnt
btrfs balance start -mconvert=dup -dconvert=single  /mnt
btrfs device remove <dev1> /mnt

It seems to be the Leap 15.3 kernel that fails with releasing a disk from a raid1 (btrfs fi usage report says system data is still raid1), while a 5.15 kernel (Debian backports VM) produced no problems with the above.
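
For anyone wanting to repeat the test case above on other kernels, here is a small dry-run sketch that only builds the command sequence and executes nothing. The function name `reproducer_commands` and the device paths `/dev/vdb` and `/dev/vdc` are placeholders chosen for illustration; the `btrfs filesystem usage` step is an addition for diagnosis and was not part of the original four commands.

```python
from typing import List


def reproducer_commands(dev1: str, dev2: str, mount_point: str = "/mnt") -> List[List[str]]:
    """Return the test case as argument lists; nothing is executed here."""
    return [
        ["mkfs.btrfs", "-d", "raid1", "-m", "raid1", dev1, dev2],
        ["mount", dev1, mount_point],
        ["btrfs", "balance", "start", "-mconvert=dup", "-dconvert=single", mount_point],
        # Added for diagnosis: shows which profiles remain before the removal attempt.
        ["btrfs", "filesystem", "usage", mount_point],
        ["btrfs", "device", "remove", dev1, mount_point],
    ]


# Placeholder device names; substitute the blank test disks of the VM.
for cmd in reproducer_commands("/dev/vdb", "/dev/vdc"):
    print(" ".join(cmd))
```

Printing the list first makes it easy to attach the exact sequence, together with the kernel and btrfs-progs versions, to a bug report.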

So for this one, hopefully a fixed kernel version will reach the stable repositories rather sooner than later.

Thanks for your patience Phil.

phillxnet · February 7, 2022, 10:39am

@s.ma Nice, and thanks for the tight summary.

Try your command line test case after first loading the pool with some data in its raid1 case. I’ve seen corner case strange behaviour with completely empty pools. You may find that once you have a few Gig in there the behaviour is more sane. Also keep an eye on using sane drive sizes during these tests. Anything below 5 GB can also behave strangely sometimes.

So, in short, try the same with say 2 x 15 GB drives and load say 4 GB into the raid1 before doing your test. You may find all is OK in this. Very small drives in btrfs end up having a ‘unit’ size of 256 MB rather than the usual 1 GB size. Giving a pool enough room to manoeuvre is important and required for raid level changes (ReRaids). And of course loading it with some data rather than nothing to again approach a ‘real’ test.

As always feedback is super welcome. We sometimes take a while to root out a base cause but always best to get there in the end. Also regarding kernel versions, and if you are in the mind for experimenting, we have the new doc entry to install far newer kernels where many corner cases are already fixed (same links as before but contextual given your deb kernel reference):

https://rockstor.com/docs/howtos/stable_kernel_backport.html

This HowTo is a step by step for installing the openSUSE backport of the latest stable kernel version and btrfs progs (the user space part of btrfs). It’s how folks can enable raid 5/6 write access for example.

Hope that helps.

phillxnet · February 7, 2022, 10:42am

@s.ma Oh I almost forgot.

You may also find that after a reboot (after the re-balance/ReRaid) all will behave as expected. Sometimes there are ‘left-overs’ and an unmount/remount makes all nice again. I’ve also seen that. Far less so in later kernels but you could also try that in your nice little test setup.

Hope that helps as well. Drive management is receiving work all the time and there are known ‘characteristics’ such as unmount/re-mount requirements at times. Especially after major changes like a ReRaid. But best let any balance finish first if you can.