Can't add/clone new share due to qgroup inconsistencies (with quotas off)

This is still on my CentOS-based 3.9.2-57 version with a recent kernel update.

I was trying to perform some cleanup of Rock-on shares.
I cloned an existing share using the WebUI. Once the cloning process was done, I removed the original share that I had cloned from. So far so good. 24 hours later I tried to continue by cloning another share into a new one, and then ran into the issue below. I also tried creating a new share directly on the same pool (not the OS drive, by the way), but that failed as well.


Traceback (most recent call last):
  File "/opt/rockstor/src/rockstor/rest_framework_custom/generic_view.py", line 41, in _handle_exception
    yield
  File "/opt/rockstor/src/rockstor/storageadmin/views/share.py", line 180, in post
    pqid = qgroup_create(pool)
  File "/opt/rockstor/src/rockstor/fs/btrfs.py", line 1127, in qgroup_create
    max_native_qgroup = qgroup_max(mnt_pt)
  File "/opt/rockstor/src/rockstor/fs/btrfs.py", line 1082, in qgroup_max
    o, e, rc = run_command([BTRFS, 'qgroup', 'show', mnt_pt], log=True)
  File "/opt/rockstor/src/rockstor/system/osi.py", line 176, in run_command
    raise CommandException(cmd, out, err, rc)
CommandException: Error running a command. cmd = /usr/sbin/btrfs qgroup show /mnt2/<pool name removed>. rc = 1. stdout = ['']. stderr = ['WARNING: qgroup data inconsistent, rescan recommended', 'ERROR: cannot find the qgroup 0/7301', "ERROR: can't list qgroups: No such file or directory", '']

Note: I scrubbed my pool name from the error message.

No quotas are enabled (and haven't been for years).
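For anyone wanting to check the same thing outside the Web-UI, here is a rough Python sketch of what the failing call amounts to (this is not Rockstor's actual code; `POOL_MNT` is a placeholder for your own mount point). It just runs the same `btrfs qgroup show` command the traceback shows and reports the qgroup warnings/errors instead of raising:

```python
#!/usr/bin/env python3
# Rough sketch only: run the same command the traceback shows failing and
# report the qgroup warnings/errors rather than raising an exception.
# POOL_MNT is a placeholder (assumption), not taken from the Rockstor code.
import subprocess

BTRFS = "/usr/sbin/btrfs"
POOL_MNT = "/mnt2/<pool name>"  # replace with the real mount point


def qgroup_show(mnt_pt):
    """Return (stdout, stderr, returncode) of 'btrfs qgroup show <mnt_pt>'."""
    result = subprocess.run(
        [BTRFS, "qgroup", "show", mnt_pt],
        capture_output=True, text=True,
    )
    return result.stdout, result.stderr, result.returncode


if __name__ == "__main__":
    out, err, rc = qgroup_show(POOL_MNT)
    if rc != 0:
        # With a stale/orphaned qgroup entry this is where the
        # "cannot find the qgroup 0/NNNN" error shows up, which is what
        # trips up qgroup_create() during share creation in the Web-UI.
        print(f"qgroup show failed (rc={rc}):\n{err}")
    else:
        print(out)
```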

I suspect that a clone/delete share operation the day before might have done something to the consistency of the quota definitions (even if they are not active). I wasn't entirely sure how to perform a rescan from the UI (since the error message suggested that), so I used the command line:

/usr/sbin/btrfs quota rescan -s /mnt2/<pool name>
to see whether any rescans were running, and when there weren't any:

/usr/sbin/btrfs quota rescan /mnt2/<pool name>

After completion (a couple of minutes of checking with the above -s option), this didn't seem to resolve the issue.
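For reference, the same check/rescan sequence can be scripted. A minimal sketch follows, using the standard btrfs-progs -s (status) and -w (start and wait) options and the same placeholder mount point as above:

```python
#!/usr/bin/env python3
# Minimal sketch of the rescan sequence used above: check whether a rescan is
# already running (-s), then start one and block until it finishes (-w).
# POOL_MNT is a placeholder; adjust for your own pool.
import subprocess

BTRFS = "/usr/sbin/btrfs"
POOL_MNT = "/mnt2/<pool name>"  # replace with the real mount point


def rescan_status(mnt_pt):
    """Return the output of 'btrfs quota rescan -s <mnt_pt>'."""
    return subprocess.run([BTRFS, "quota", "rescan", "-s", mnt_pt],
                          capture_output=True, text=True).stdout.strip()


def rescan_and_wait(mnt_pt):
    """Start a quota rescan and wait for completion; return the exit code."""
    return subprocess.run([BTRFS, "quota", "rescan", "-w", mnt_pt],
                          capture_output=True, text=True).returncode


if __name__ == "__main__":
    print(rescan_status(POOL_MNT))
    if rescan_and_wait(POOL_MNT) == 0:
        print("rescan finished:", rescan_status(POOL_MNT))
    else:
        # A non-zero exit here can also mean quotas are disabled on the pool.
        print("rescan could not be started or failed")
```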

In the meantime, I started a scrubbing operation on the pool, as I realized that my scheduled scrub hadn't been running for quite some time for some reason (separate topic, I think).

Any suggestions?

@Hooverdan Hello again.

I've addressed your other issue re:

and it seems that from:

All bets are off, I'm afraid. The output from kernels, and their often but not always associated btrfs-progs, changes all the time, and our Web-UI has to move with that. Hence my question about kernels in that other thread. Newer kernels have also changed how they deal with quotas, and we have made some progress on that front; but again only in the v4 variant. The btrfs parsing in our CentOS variant is now years behind what it is in our v4 'Built on openSUSE' variant, and in the latter we have a known kernel version to work against. We had that in the CentOS variant too, but we had to do that kernel maintenance ourselves and frankly we failed at it, so it is now, in v4, back in the hands of those most expert at it: upstream. We still have a moving target of sorts though, e.g.:

So what we need are reports such as yours (highly detailed), but on our current efforts and using what we 'dish out'. Changes in the output of kernel/btrfs-progs can basically break our Web-UI. That is bad, and we try to fail elegantly wherever possible, such as we did in your last post re scrub, presumably using the newer kernel you mention.
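To give a rough idea of what I mean by failing elegantly, something along these lines (illustration only, not our actual code; the message strings are taken from your traceback above) recognises the known qgroup-inconsistency errors and degrades to a safe default instead of surfacing an unhandled exception in the Web-UI:

```python
# Illustration only (not the actual Rockstor code): tolerate known qgroup
# inconsistency messages from 'btrfs qgroup show' so a share create can
# degrade gracefully rather than crash when kernel/btrfs-progs output changes.
import subprocess

BTRFS = "/usr/sbin/btrfs"
KNOWN_QGROUP_ERRORS = (
    "qgroup data inconsistent",
    "cannot find the qgroup",
    "can't list qgroups",
)


def qgroup_show_tolerant(mnt_pt):
    """Return qgroup show stdout lines, or None if the qgroups look broken."""
    result = subprocess.run([BTRFS, "qgroup", "show", mnt_pt],
                            capture_output=True, text=True)
    if result.returncode != 0:
        if any(msg in result.stderr for msg in KNOWN_QGROUP_ERRORS):
            # Degrade gracefully: the caller can skip qgroup handling and
            # suggest a quota rescan instead of failing the share create.
            return None
        raise RuntimeError(f"unexpected btrfs failure: {result.stderr}")
    return result.stdout.splitlines()
```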

Ensure you get your pool into as healthy a state as possible (scrub) and make sure it will mount rw on Rockstor reboot. Then migrate to the v4 "Built on openSUSE" variant, where you can hopefully take advantage of improvements we have made there. And if the kernels there are still not new enough, we are at least much closer to them and interested in, and capable of, releasing updates to address up-and-coming changes, such as the above GitHub issue.

Hope that helps, and I know it's a pain, but we had to move OS for a number of reasons. And in doing so we have dodged a few disasters along the way, one of which was our poor record on maintaining our own kernels, albeit unmodified ELRepo kernel-ml releases.


@Hooverdan Re:

And given your penchant for such things :slight_smile: , I looked up @kageurufu's recent post on hoicking up the kernels within an openSUSE Leap install, such as we only slightly deviate from, re their JeOS, in our own v4 installers.

Wasn't sure if you had caught this, and it looks to be relevant to your own potential migration if you are uncomfortable with downgrading your kernel. However, I would point out that the openSUSE folks do fairly aggressively backport a load of btrfs stuff, so the actual kernel version no longer really reflects the contained btrfs version. But that is another story, and you may already be ahead, btrfs-wise, in your CentOS instance with its custom kernel version compared to what is found in these backports. Hence the above reference.

Hope that helps. Also, you might as well go with our Leap 15.3 installer profiles now, as 15.2 approaches EOL.


Yes, like in the other thread on scrubbing, I promise I will move shortly :). Like @GeoffA, I just need to get my not unsubstantial backup in order first. I thought it interesting that it would pop up on a sub-version change of the kernel. btrfs-progs actually was not updated along with the latest kernel, as there wasn't a newer version, so that's why I thought it was strange that I would now suddenly run into a quota issue. But then again, I hadn't deleted a share in a while, which might have done something ā€¦
The scrub completed with no major errors, and a couple of reboots addressed some sudden docker permission issues that I had encountered. So, the pool looks healthy.
Before I play around more with share cloning and creating new ones, I will probably prioritize the openSUSE upgrade, and hope I won't have to go down @kageurufu's rabbit hole.

Thanks.


Well, like I mentioned in another thread, I took the step! Running my main NAS now on openSUSE-based Rockstor 4.0.9.
I didn't have to downgrade kernels or change btrfs-progs versions, fortunately (for once I did not run into Murphy's law everywhere I went :slight_smile: ).

In the new installation, after adding the RAID pool back in using the WebUI, I noticed this output on the terminal about orphan qgroup relation entries.

And, interestingly, the share I had tried to clone under the 3.9.2-57 version, the one that gave me the error messages, was automagically there in the WebUI after the import ā€¦
So, I guess for now, I can consider this resolved. Letā€™s see what happens.

As usual, thanks for all the support with my self-inflicted problems, because I couldn't leave well enough alone :slight_smile:


Excellent move @Hooverdan - welcome to the World Of 4 :slight_smile:
So, backups all ok and in order I trust? :slight_smile:

Thank you.
So far, so good! Did some comparisons (fortunately not all the data on the NAS is backup worthy) and all seems to be in order. Will have to try the share cloning/creating a new one soon.
