Pool is read-only

My primary data pool, “Pool1”, has become read only. Recent changes include:

  1. I enabled compression and tried running a balance but it failed: Error running a command. cmd = btrfs balance start --full-balance /mnt2/Pool1. rc = 1. stdout = [’’]. stderr = [“ERROR: error during balancing ‘/mnt2/Pool1’: Read-only file system”, ‘There may be more info in syslog - try dmesg | tail’, ‘’]
    Compression is now Off again. (Based on another of my recent discussions, I was going to enable quotas but have not yet done so.)

  2. I installed the Vaultwarden rockon but left it turned off until I have time to play with it. I subsequently found that I can’t delete one of the new shares (this was part of troubleshooting).

  3. I had rescheduled some snapshots and a couple of them were set to trigger at the same time (midnight). Was this dumb and dangerous? For now, I’ve disabled snapshots tasks on that pool

  4. I backed up 100GB from other computers to a share on Pool1.That’s just normal activity but it’s the only other thing I was up to yesterday.

I do have a backup but I would like to recover access to Pool1. Below is what I’ve tried so far but now i’m at a bit of a loss.

Thanks for any guidance.

rockstor:~ # btrfs filesystem show
Label: 'ROOT'  uuid: d97982eb-fc92-4c35-8993-8bd2eb032704
        Total devices 1 FS bytes used 5.73GiB
        devid    1 size 230.85GiB used 6.80GiB path /dev/sdf4

Label: 'Pool2'  uuid: 3c242b00-dfc5-4aa6-8e9c-5cff96a1c338
        Total devices 3 FS bytes used 1019.78GiB
        devid    1 size 931.51GiB used 730.00GiB path /dev/sdb
        devid    2 size 232.89GiB used 31.00GiB path /dev/sde
        devid    3 size 465.76GiB used 264.02GiB path /dev/sdd

Label: 'Pool1'  uuid: 97e4fc2c-581e-401f-ba42-ef7b0e53a536
        Total devices 2 FS bytes used 1.06TiB
        devid    2 size 3.64TiB used 1.06TiB path /dev/sdg
        devid    6 size 3.64TiB used 1.06TiB path /dev/sdc



rockstor:~ # btrfs fi df /mnt2/Pool1
Data, RAID1: total=1.06TiB, used=1.06TiB
System, RAID1: total=32.00MiB, used=192.00KiB
Metadata, RAID1: total=3.00GiB, used=1.77GiB
GlobalReserve, single: total=512.00MiB, used=0.00B


rockstor:~ # btrfs fi usage /mnt2/Pool1
Overall:
    Device size:                   7.28TiB
    Device allocated:              2.13TiB
    Device unallocated:            5.15TiB
    Device missing:                  0.00B
    Used:                          2.12TiB
    Free (estimated):              2.58TiB      (min: 2.58TiB)
    Data ratio:                       2.00
    Metadata ratio:                   2.00
    Global reserve:              512.00MiB      (used: 0.00B)

Data,RAID1: Size:1.06TiB, Used:1.06TiB
   /dev/sdc        1.06TiB
   /dev/sdg        1.06TiB

Metadata,RAID1: Size:3.00GiB, Used:1.77GiB
   /dev/sdc        3.00GiB
   /dev/sdg        3.00GiB

System,RAID1: Size:32.00MiB, Used:192.00KiB
   /dev/sdc       32.00MiB
   /dev/sdg       32.00MiB

Unallocated:
   /dev/sdc        2.57TiB
   /dev/sdg        2.57TiB


rockstor:~ # btrfs scrub status -R /mnt2/Pool1
scrub status for 97e4fc2c-581e-401f-ba42-ef7b0e53a536
        scrub started at Mon Nov 15 15:36:23 2021 and was aborted after 00:00:00
        data_extents_scrubbed: 0
        tree_extents_scrubbed: 0
        data_bytes_scrubbed: 0
        tree_bytes_scrubbed: 0
        read_errors: 0
        csum_errors: 0
        verify_errors: 0
        no_csum: 0
        csum_discards: 0
        super_errors: 0
        malloc_errors: 0
        uncorrectable_errors: 0
        unverified_errors: 0
        corrected_errors: 0
        last_physical: 0
rockstor:~ #

rockstor:~ # btrfs scrub start /mnt2/Pool1
scrub started on /mnt2/Pool1, fsid 97e4fc2c-581e-401f-ba42-ef7b0e53a536 (pid=5351)
rockstor:~ # btrfs scrub status -R /mnt2/Pool1
scrub status for 97e4fc2c-581e-401f-ba42-ef7b0e53a536
        scrub started at Mon Nov 15 15:57:44 2021 and was aborted after 00:00:00
        data_extents_scrubbed: 0
        tree_extents_scrubbed: 0
        data_bytes_scrubbed: 0
        tree_bytes_scrubbed: 0
        read_errors: 0
        csum_errors: 0
        verify_errors: 0
        no_csum: 0
        csum_discards: 0
        super_errors: 0
        malloc_errors: 0
        uncorrectable_errors: 0
        unverified_errors: 0
        corrected_errors: 0
        last_physical: 0

rockstor:~ # dmesg | tail
[ 1209.087029] BTRFS info (device sdg): use no compression, level 0
[ 1209.087260] BTRFS info (device sdg): disk space caching is enabled
[ 1209.114138] BTRFS info (device sdg): use no compression, level 0
[ 1209.114370] BTRFS info (device sdg): disk space caching is enabled
[ 1209.141664] BTRFS info (device sdg): use no compression, level 0
[ 1209.141893] BTRFS info (device sdg): disk space caching is enabled
[ 1209.169733] BTRFS info (device sdg): use no compression, level 0
[ 1209.169965] BTRFS info (device sdg): disk space caching is enabled
[ 1209.197352] BTRFS info (device sdg): use no compression, level 0
[ 1209.197584] BTRFS info (device sdg): disk space caching is enabled
rockstor:~ #

One more note… it is mounted RO (first screen shot) but there is no RO mount option (2nd screen shot)… so I’m trying to figure out how the heck I did this to myself and how to mount it RW again.

Yes, I rebooted the machine but it didn’t change.

image
image

And lastly, my apologies for this triple post.
After another reboot, I found in dmesg that it is being forced to RO on error. I’m just no closer to a next step.

[    8.976125] r8169 0000:03:00.0 eth0: Link is Down
[   11.681259] r8169 0000:03:00.0 eth0: Link is Up - 1Gbps/Full - flow control rx/tx
[   11.698821] NET: Registered protocol family 17
[   13.439059] device-mapper: uevent: version 1.0.3
[   13.441362] device-mapper: ioctl: 4.40.0-ioctl (2019-01-18) initialised: dm-devel@redhat.com
[   34.669233] BTRFS info (device sdb): use lzo compression, level 0
[   34.669733] BTRFS info (device sdb): disk space caching is enabled
[   34.670226] BTRFS info (device sdb): has skinny extents
[   39.302102] BTRFS info (device sdg): use no compression, level 0
[   39.302386] BTRFS info (device sdg): disk space caching is enabled
[   39.302629] BTRFS info (device sdg): has skinny extents
[   41.890060] BTRFS warning (device sdg): orphan qgroup relation 0x3e4->0x7df00000000000e
[   41.890569] BTRFS warning (device sdg): orphan qgroup relation 0x3e4->0x7df000000000018
[   41.891040] BTRFS warning (device sdg): orphan qgroup relation 0x3e4->0x7df000000000023
[   41.891507] BTRFS warning (device sdg): orphan qgroup relation 0x3e4->0x7df000000000032
[   41.891963] BTRFS warning (device sdg): orphan qgroup relation 0x3e4->0x7df00000000003d
[   41.939928] BTRFS info (device sdg): checking UUID tree
[   43.104220] BTRFS info (device sdg): balance: resume -dusage=90 -musage=90 -susage=90
[   43.106984] BTRFS info (device sdg): relocating block group 3206324158464 flags system|raid1
[   44.103722] BTRFS info (device sdg): found 9 extents
[   45.722625] BTRFS info (device sdg): relocating block group 2225964318720 flags metadata|raid1
[   65.937998] audit: type=1400 audit(1637019170.459:2): apparmor="STATUS" operation="profile_load" profile="unconfined" name="docker-default" pid=5111 comm="apparmor_parser"
[   65.972923] ------------[ cut here ]------------
[   65.973297] WARNING: CPU: 1 PID: 3584 at ../fs/btrfs/extent-tree.c:863 lookup_inline_extent_backref+0x5a0/0x640 [btrfs]
[   65.973549] Modules linked in: dm_mod af_packet rfkill nls_iso8859_1 nls_cp437 vfat fat nouveau edac_mce_amd snd_hda_codec_via snd_hda_codec_generic snd_hda_codec_hdmi ledtrig_audio mxm_wmi ccp wmi video snd_hda_intel ppdev kvm ttm drm_kms_helper saa7164 snd_hda_codec irqbypass pcspkr tveeprom snd_hda_core snd_hwdep k10temp sp5100_tco dvb_core snd_pcm i2c_piix4 parport_pc drm parport joydev snd_timer r8169 realtek videodev fb_sys_fops snd syscopyarea asus_atk0110 sysfillrect libphy sysimgblt i2c_algo_bit soundcore mc button acpi_cpufreq hid_logitech_hidpp hid_logitech_dj hid_generic usbhid btrfs libcrc32c xor raid6_pq ohci_pci ata_generic serio_raw pata_atiixp ohci_hcd ehci_pci ehci_hcd usbcore sg
[   65.975762] CPU: 1 PID: 3584 Comm: btrfs-balance Not tainted 5.3.18-lp152.102-default #1 openSUSE Leap 15.2
[   65.976594] Hardware name: System manufacturer System Product Name/M3A78-CM, BIOS 2801    08/23/2010
[   65.977445] RIP: 0010:lookup_inline_extent_backref+0x5a0/0x640 [btrfs]
[   65.978271] Code: 48 8b 5c 24 38 4c 8b 74 24 48 e9 17 fe ff ff 48 8b 5c 24 38 b8 8b ff ff ff e9 62 fe ff ff 48 c7 c7 38 7b 4c c0 e8 5e 87 f0 c6 <0f> 0b b8 fb ff ff ff e9 4a fe ff ff 48 8b 7c 24 18 48 c7 c6 b8 7b
[   65.979183] RSP: 0018:ffffa34cc03af910 EFLAGS: 00010282
[   65.980098] RAX: 0000000000000024 RBX: ffff98283d7aaf50 RCX: 0000000000000000
[   65.980998] RDX: 0000000000000000 RSI: ffff9828efa59a18 RDI: ffff9828efa59a18
[   65.981904] RBP: ffff9827c78872d8 R08: 00000000000003ac R09: ffff9828ec6209a0
[   65.982814] R10: ffff9826c0000000 R11: 0000000000000010 R12: 0000000000004000
[   65.983731] R13: 0000000000000000 R14: 00000206525b8000 R15: 0000000000000009
[   65.984644] FS:  0000000000000000(0000) GS:ffff9828efa40000(0000) knlGS:0000000000000000
[   65.985559] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   65.986473] CR2: 00007f0d3b7fde38 CR3: 0000000228fb2000 CR4: 00000000000006e0
[   65.987390] Call Trace:
[   65.988326]  ? set_extent_bit+0x19/0x20 [btrfs]
[   65.989222]  insert_inline_extent_backref+0x55/0xe0 [btrfs]
[   65.990119]  __btrfs_inc_extent_ref.isra.44+0x88/0x260 [btrfs]
[   65.990998]  ? btrfs_merge_delayed_refs+0x307/0x3e0 [btrfs]
[   65.991871]  __btrfs_run_delayed_refs+0x67b/0x1190 [btrfs]
[   65.992749]  btrfs_run_delayed_refs+0x62/0x1f0 [btrfs]
[   65.993622]  btrfs_commit_transaction+0x50/0xa60 [btrfs]
[   65.994499]  prepare_to_merge+0x246/0x260 [btrfs]
[   65.995373]  relocate_block_group+0x1f8/0x760 [btrfs]
[   65.996240]  btrfs_relocate_block_group+0x16f/0x2e0 [btrfs]
[   65.997107]  btrfs_relocate_chunk+0x31/0xb0 [btrfs]
[   65.997970]  __btrfs_balance+0x8e4/0xad0 [btrfs]
[   65.998826]  btrfs_balance+0x346/0x710 [btrfs]
[   65.999683]  ? btrfs_balance+0x710/0x710 [btrfs]
[   66.000537]  balance_kthread+0x34/0x50 [btrfs]
[   66.001379]  kthread+0x10d/0x130
[   66.002229]  ? kthread_park+0xa0/0xa0
[   66.003080]  ret_from_fork+0x22/0x40
[   66.003931] ---[ end trace aec56495399420e3 ]---
[   66.004861] BTRFS: error (device sdg) in btrfs_run_delayed_refs:2129: errno=-5 IO failure
[   66.005655] BTRFS info (device sdg): forced readonly
[   66.017091] BTRFS info (device sdg): balance: ended with status: -30

@wdc Hello again.
Yes, if a pool is poorly or suspected of being poorly btrfs forces it into being read-only hence your:

So it may be that this is a product of a failure of sorts.

Also note that if a balance has been instigated and not completed at reboot time it will resume on subsequent boot/remount. That looks to be what’s happening here:

That is your manually instigated balance resuming. And then running into an issue.

You could try mounting with an option to not resume the balance which may given you a window of rw within which you could try a scrub. It may be able to sort things out. But as always, given you have ro access ensure you backups are up-to-date before proceeding further as a forced ro is usually to protect the pool from further corruption. And before anything else at all check you system memory.

Pre-Install Best Practice (PBP): https://rockstor.com/docs/installation/pre-install-howto.html#
section:
“Memory Test (memtest86+)”
As bad memory can lead to corruption prior or future so best ensure this is good before proceeding.

First:

  • Ensure your PSU is non suspect (tricky).
  • Test your systems memory extensively.

And in short the problem pool is mounting OK then the balance is restarting and this activity is pushing it to ro when it encounters a fault that it can’t correct during the balance. So skip the auto resume of the balance via a
https://btrfs.wiki.kernel.org/index.php/Manpage/btrfs(5)

skip_balance

mount option which we support as from 2017:

Then try a scrub. But only do anything else once you have ensure good power and good memory.

And yes keep compression off at least for now. It’s just another complexity and you need to keep things as simple as possible while you establish the issue.

Incidentally was this pool imported from a prior CentOS Rockstor instance? It may have had issues for a while that only the newer kernel is picking up on. We get a lot more checks in the several years difference between the two kernels.

Hope that helps. Going RO is ‘standard’ if a fault is found that can’t be instantly fixed in btrfs.

2 Likes

re: [quote=“phillxnet, post:4, topic:8085”]
Incidentally was this pool imported from a prior CentOS Rockstor instance?
[/quote]
I’m actually not certain - I recall copying everything off to several other machines when I did the v4 installation jbut I don’t recall if Pool1 survived the install or I built it fresh and copied everything back. I did recently modify the pool after one drive threw a SMART wobbly.

When I found the errno=-5 in the log, my heart sank a bit and I was beginning to realize that the patient’s condition might be fatal. So I’ll run the memtest, and remount skipping the balance, scrub, etc. My worst-case scenario is that I have to trash and recreate the array & copy back from the backup. Time-consuming but not a disaster.

Thank you for the guidance here, Philip. I’ll post the results. It will be a couple of days or more as working interferes with troubleshooting.

2 Likes

It’s about time that I posted the outcome of this little adventure. I’ll try to keep it concise.

The PSU has not been tested. Running a test on the PSU is more difficult as I don’t have a tester and I’m not certain that the ones I would be willing to pay for would actually test under load. Hence there remains a risk that @phillxnet identified. I did replace it circa 2014 so it is one of the new components (a very false sense of security here).

Memtest86+ ran for 20 hours and completed 7 passes with 0 errors. I’m rather pleased with this, given the age of the components.

I added the skip_balance option in the Rockstor UI and a reboot brought the pool back in RW. (I know, I could have remounted – I’m being cautious). On reboot, the qgroup message remained. A new scrub ran to completion with no errors.

I removed the skip_balance and rebooted. I was logged into the Rockstor UI and got logged in via ssh. Moments later, connectivity dropped – no ssh, no UI, no ping response (there was a physiological response on my side of the keyboard). Fast forward a little while: after a power button shutdown, I found that I had just enough time to login and run btrfs balance cancel mnt2/Pool1. Now I’m running again.

During the lockups, the console was showing the errors in the photo below (yes, my console is the TV): the most pertinent may be the watchdog: BUG: soft lockup – CPU#0 stuck for XYZs! [dockerd:2905] alternating CPU 0 and CPU 1. I did some googling on that and it may be a btrfs kernel bug. Or it may not be…. I also found some indications that a bad snapshot could cause this issue or others. Unfortunately, nothing concrete and specific.

I have yet to try another balance. I plan to do so but I want the time available to recover if need be. In preparation, I’ve taken a couple of other actions: in addition to my internal backup to Pool2, I spent $60 on a USB drive formatted ntfs and I’m rsync’ing nightly to it. That gives me an out if if the whole system goes up in smoke even if the thing is as slow as molasses in Scotland.

I’ve also deleted all the snapshots related to Pool1; there were 90ish. I noted that those for my “Documents” share were very slow to delete and I could delete no more than 2 at a time without Rockstor throwing an error. The snapshots associated with my “Pictures”, “Backups”, “Music” and “Emby” shares could be deleted in a flash.

If the next rebalance fails, I will probably delete Pool1 and rebuild from scratch.

So that’s where I stand today: up & running with some open questions. I’m still speculating about root cause. PSU? Toggling quotas or compression on & off? Interaction of the toggling with snapshots? Leftover silliness from Rockstor 3? I don’t think I’ll find any empirical evidence for the true cause of the event but I have lost no data.

Thanks for your suggestions on this one. As always, I appreciate the guidance and the learning.

Boot_balance_failure

1 Like

@wdc Thanks for the update, interesting.
Re:

I’d leave compression off myself. It’s just another layer of complexity best avoided if at all possible.

Actually quite likely as it goes. We have had reports, as has the btrfs mailing list, of folks having issues importing older pools that ‘grew’ issues that earlier kernels allowed and ignored. In the interim kernel/btrfs-progs developments grew in turn to avoid/recognise these and there is a massive difference now between our earli 4.10 / 4.12 kernels and the btrfs backported versions of the same in our new Leap bases.

Another idea is to run long smart tests on the drives. It can end up hammering them and can take quite a few horse in some cases but given you now have backup and are experience strangeness it’s definitely worth a try. This can be done from within the Rockstor Web-UI. But I’d stick to running only one test at a time, and let that one finish before starting another test. And only one test per system at a time also.
See our S.M.A.R.T doc entry here: https://rockstor.com/docs/howtos/smart.html
And look for the “Perform test” tab on the drive you want to test.

Yes we need to pop that process into a Huey task at some point as although snapshot creation is near instantaneous, deletion can be a very expensive process; consuming a lot of read/writes and cpu to calculate the new checksums of all the affected changes. And it’s worse the more linked associated data blocks there are.

Nice.

:slight_smile: .

Yes, when I was last shopping for PSU’s a saw some testers having brand new PSU’s let out the magic smoke when they were under load clearly within their supposed specifications. I ended up getting a few Be Quit PSU’s in the end from around their mid range I think it was.

Hope that helps.

1 Like