Scrubs all end with status unknown

I’m having a small problem with my Rockstor install.

When a new scrub is started, the UI almost instantly returns a scrub with a status of "unknown", instead of the normal "Running" (followed by "finished"). And if I click the status, there is no data to display.

In the CLI, I can see the scrub being started, and after some time (depending on what I scrub) it finishes with status 0, which is a normal, uneventful scrub AFAIR.
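For reference, I'm watching it from the CLI with the usual status command against the pool's mount point, e.g.:

btrfs scrub status /mnt2/RSPool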

What could be going on?

I am on 4.5.9-1.

@KarstenV does it make a difference in behavior if you manually start one versus observing one that was scheduled?


Behavior is the same either way.

Scheduled scrubs start at the scheduled time, but an end time is not shown in the task history, and the status is unknown.

Under the pool's scrub history it's shown like this:

@KarstenV Hello again.
Is the pool mounted degraded, or does it have a sticky degraded mount option?
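If you want to check from the command line, findmnt (part of util-linux, so it should already be on the system) can list the active options for all btrfs mounts; a degraded pool would show "degraded" in the OPTIONS column:

findmnt -t btrfs -o TARGET,SOURCE,OPTIONS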

I’m thinking of the following issue:

But this doesn't match your observed "unknown" status. Might be worth giving that issue a quick read, however.

Hope that helps.


They're not mounted degraded.

I am able to use the pool as usual and write to it.

The rockstor boot pool and my main RAID1 pool are mounted normally, and both show the same behaviour.

Here are the mount options for my RAID1 pool: rw,relatime,space_cache,subvolid=5,subvol=/

I have never had to force a mount or something like that.

"btrfs fi sh" (short for "btrfs filesystem show") gives this:

Label: 'ROOT' uuid: 4ac51b0f-afeb-4946-aad1-975a2a26c941
Total devices 1 FS bytes used 10.42GiB
devid 1 size 72.46GiB used 20.06GiB path /dev/sdg4

Label: 'RSPool' uuid: 12bf3137-8df1-4d6b-bb42-f412e69e94a8
Total devices 7 FS bytes used 5.62TiB
devid 9 size 2.73TiB used 2.31TiB path /dev/sde
devid 10 size 1.82TiB used 1.40TiB path /dev/sdc
devid 11 size 1.82TiB used 1.40TiB path /dev/sdd
devid 12 size 1.82TiB used 1.40TiB path /dev/sdf
devid 13 size 2.73TiB used 2.31TiB path /dev/sdb
devid 15 size 2.73TiB used 2.31TiB path /dev/sdh
devid 16 size 3.64TiB used 1.23TiB path /dev/sda

And even though the GUI shows the status as unknown, the scrub runs fine in the background.

"btrfs scrub status" gives this for, e.g., my root drive:

Scrub started: Thu May 25 21:05:05 2023
Status: finished
Duration: 0:01:27
Total to scrub: 10.61GiB
Rate: 124.85MiB/s
Error summary: no errors found

The last scrub for my main pool (a scheduled one):

Scrub started: Sat Apr 1 22:00:03 2023
Status: finished
Duration: 7:25:34
Total to scrub: 11.24TiB
Rate: 440.87MiB/s
Error summary: no errors found

There is nothing wrong with my pools, I think; the GUI just doesn't catch that the scrub started, and therefore can't report the status or end result.


@KarstenV Thanks for the extra info.

These days we use both the regular btrfs scrub status and its raw variant btrfs scrub status -R, and our most recent work in this area was the following, which added more tests to our scrub interpretation code:

Could you also give the output of the raw scrub command for your data drive, plus the output of the btrfs version command, so we can modify our scrub status command parser accordingly:

btrfs version
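And for the raw variant, something like the following, substituting your pool's mount point:

btrfs scrub status -R /mnt2/RSPool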

I.e. see:

What I'm thinking here is that there has been a btrfs backport since we last visited that code, and we are then misinterpreting the btrfs output as if it were from a legacy btrfs version - i.e. the proposed backport has now outdated our assumptions based on the btrfs versioning.
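A quick way to compare the two sides of that pairing, the userspace btrfs-progs on one hand and the running kernel (and hence the kernel-side btrfs) on the other, is:

btrfs version
uname -r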

Hope that helps, and thanks for the engagement on this issue.


I'll try to provide the requested data 🙂

I ran a scrub overnight on my RAID1 Pool, so the data should be more up to date.

Here are the -R results of my boot pool and my RAID1 pool.

Boot Pool:

btrfs scrub status -R /dev/sdg4
UUID: 4ac51b0f-afeb-4946-aad1-975a2a26c941
Scrub started: Thu May 25 21:05:05 2023
Status: finished
Duration: 0:01:27
data_extents_scrubbed: 286399
tree_extents_scrubbed: 24498
data_bytes_scrubbed: 10987966464
tree_bytes_scrubbed: 401375232
read_errors: 0
csum_errors: 0
verify_errors: 0
no_csum: 226394
csum_discards: 2456215
super_errors: 0
malloc_errors: 0
uncorrectable_errors: 0
unverified_errors: 0
corrected_errors: 0
last_physical: 43911217152

RAID1 pool:

btrfs scrub status -R //mnt2/RSPool
UUID: 12bf3137-8df1-4d6b-bb42-f412e69e94a8
Scrub started: Thu May 25 21:20:22 2023
Status: finished
Duration: 7:19:25
data_extents_scrubbed: 188784274
tree_extents_scrubbed: 850863
data_bytes_scrubbed: 12348365783040
tree_bytes_scrubbed: 13940539392
read_errors: 0
csum_errors: 0
verify_errors: 0
no_csum: 801856
csum_discards: 3013935884
super_errors: 0
malloc_errors: 0
uncorrectable_errors: 0
unverified_errors: 0
corrected_errors: 0
last_physical: 1361572790272

"btrfs version" gives this:

btrfs-progs v6.2.1

I think I am still on Leap 15.3.


As a side note, if I do a zypper refresh I get errors:

zypper refresh
Repository 'Kernel_stable_Backport' is up to date.
Repository 'Leap_15_3' is up to date.
Repository 'Leap_15_3_Updates' is up to date.
Repository 'Rockstor-Testing' is up to date.
Retrieving repository 'filesystems' metadata ...[error]
Repository 'filesystems' is invalid.
[filesystems|https://download.opensuse.org/repositories/filesystems/15.3/] Valid metadata not found at specified URL
History:

Please check if the URIs defined for this repository are pointing to a valid repository.
Skipping repository 'filesystems' because of the above error.
Repository 'home_rockstor' is up to date.
Repository 'home_rockstor_branches_Base_System' is up to date.
Repository 'Update repository of openSUSE Backports' is up to date.
Repository 'Update repository with updates from SUSE Linux Enterprise 15' is up to date.

Could this have something to do with the problems?

I’ll quickly answer that one: I personally do not think so.
You're seeing this error because this repository is no longer published by upstream for Leap 15.3; they only publish for Leap 15.4 and 15.5 now. The only consequence, however, is that you will no longer receive updates for the packages from that repository (which include btrfsprogs).

If you have not updated your kernel since then, you may still be on kernel 6.2, but it looks like the kernel in the Kernel_stable_Backport repo is now on 6.3, so you may be out of sync between btrfsprogs and kernel versions. I can't remember for sure, but the name of the package may be kernel-default if you'd like to verify:

zypper info kernel-default
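As an aside, "cat /etc/os-release" will confirm which Leap release you are actually on. And if the repository error gets noisy in the meantime, disabling the dead repo should be harmless and reversible (its alias is 'filesystems', as shown in your zypper output):

zypper modifyrepo --disable filesystems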

That being said, it may not be the cause of your issue... I don't remember the exact format that Rockstor expects from btrfs scrub status -R, so we'll have to wait for @phillxnet's expertise here.

Hope this helps,


"zypper info kernel-default" reported this:

zypper info kernel-default
Retrieving repository 'filesystems' metadata ...[error]
Repository 'filesystems' is invalid.
[filesystems|https://download.opensuse.org/repositories/filesystems/15.3/] Valid metadata not found at specified URL
History:

Please check if the URIs defined for this repository are pointing to a valid repository.
Warning: Skipping repository 'filesystems' because of the above error.
Some of the repositories have not been refreshed because of an error.
Loading repository data...
Warning: Repository 'Leap_15_3_Updates' appears to be outdated. Consider using a different mirror or server.
Warning: Repository 'Update repository of openSUSE Backports' appears to be outdated. Consider using a different mirror or server.
Reading installed packages...

Information for package kernel-default:

Repository : Kernel_stable_Backport
Name : kernel-default
Version : 6.3.4-lp154.2.1.gc5b4604
Arch : x86_64
Vendor : obs://build.opensuse.org/Kernel
Installed Size : 299.7 MiB
Installed : Yes
Status : up-to-date
Source package : kernel-default-6.3.4-lp154.2.1.gc5b4604.nosrc
Upstream URL : https://www.kernel.org/
Summary : The Standard Kernel
Description :
The standard kernel for both uniprocessor and multiprocessor systems.

Source Timestamp: 2023-05-25 04:46:56 +0000
GIT Revision: c5b4604852c852b09d2b5d0753f5c34058b4f1c3
GIT Branch: stable

So it looks like the kernel is perhaps on a newer version than btrfs-progs. I'll let @phillxnet comment on that.


I just checked on my environment (with the kernel backport).

The kernel is the same as yours:
Version : 6.3.4-lp154.2.1.gc5b4604

"btrfs version" returns 6.3

(and that’s on Leap 15.4)
so that would indicate that, with the filesystems repo no longer updating under Leap 15.3, the btrfs-progs are out of sync, since they come from that repo...
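If you want to double-check that on your system, zypper reports which repository a package was installed from (the openSUSE package name is btrfsprogs):

zypper info btrfsprogs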

For fun, I started a new scrub to make sure one was running on the latest versions above, and the status correctly adjusted to "running", and the stats output was populated correctly... again, not solving it, but just providing some additional observation.

WebUI output:

vs. terminal output (I was not synced when refreshing, so the values obviously differ a bit)



Today I finally got the time to update my Rockstor system to Leap 15.4.

I followed the guide available here:

https://rockstor.com/docs/howtos/15-3_to_15-4.html

That went fine, and the system now reports Leap 15.4 and no longer errors out on the filesystems repo during "zypper refresh".

So everything seems fine.

Except my scrubs still error out immediately...
So it does not seem that the out-of-sync repository was the problem.

Any new ideas?
Any new Ideas?