Balance causes the filesystem to be remounted read-only

As the topic title says, I am getting this when I attempt to balance the pool:

[218462.262604] ------------[ cut here ]------------
[218462.263189] WARNING: CPU: 3 PID: 40710 at ../fs/btrfs/extent-tree.c:863 lookup_inline_extent_backref+0x5a4/0x650 [btrfs]
[218462.263489] Modules linked in: nfsd auth_rpcgss nfs_acl lockd grace sunrpc xt_conntrack xt_MASQUERADE nf_conntrack_netlink nfnetlink xfrm_user xfrm_algo xt_addrtype iptable_filter iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip_tables x_tables bpfilter br_netfilter bridge stp llc dm_mod af_packet bonding tls rfkill intel_rapl_msr intel_rapl_common sb_edac x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm irqbypass crc32_pclmul nls_iso8859_1 nls_cp437 ghash_clmulni_intel vfat fat iTCO_wdt intel_pmc_bxt iTCO_vendor_support dcdbas(X) mgag200 aesni_intel crypto_simd cryptd glue_helper pcspkr i2c_algo_bit drm_kms_helper cec rc_core tg3 lpc_ich syscopyarea sysfillrect sysimgblt fb_sys_fops joydev libphy mei_me mei ipmi_ssif ioatdma dca ipmi_si ipmi_devintf ipmi_msghandler button drm fuse configfs btrfs libcrc32c xor raid6_pq sd_mod t10_pi hid_generic usbhid ahci libahci libata ehci_pci ehci_hcd mpt3sas usbcore crc32c_intel raid_class scsi_transport_sas wmi sg scsi_mod
[218462.265863] Supported: Yes, External
[218462.266127] CPU: 3 PID: 40710 Comm: btrfs Tainted: G               X    5.3.18-150300.59.106-default #1 SLE15-SP3
[218462.266410] Hardware name: Dell Inc. PowerEdge R520/051XDX, BIOS 2.9.0 01/09/2020
[218462.266704] RIP: 0010:lookup_inline_extent_backref+0x5a4/0x650 [btrfs]
[218462.266982] Code: 48 8b 5c 24 38 4c 8b 74 24 48 e9 13 fe ff ff 48 8b 5c 24 38 b8 8b ff ff ff e9 5e fe ff ff 48 c7 c7 50 3d 46 c0 e8 4e 6b b7 e6 <0f> 0b b8 fb ff ff ff e9 46 fe ff ff 48 8b 7c 24 18 48 c7 c6 d0 3d
[218462.267569] RSP: 0018:ffffb9ab03707790 EFLAGS: 00010286
[218462.267862] RAX: 0000000000000024 RBX: ffff9f64ee49c770 RCX: 0000000000000000
[218462.268159] RDX: 0000000000000000 RSI: ffff9f66e3659558 RDI: ffff9f66e3659558
[218462.268456] RBP: ffff9f66e08f46e8 R08: 00000000000024e0 R09: 0000000000aaaaaa
[218462.268757] R10: ffff9f60c0000000 R11: ffff9f63e1841ca0 R12: 0000000000004000
[218462.269058] R13: 0000000000000000 R14: 00001b1d88424000 R15: 0000000000000009
[218462.269359] FS:  00007f4b61d839c0(0000) GS:ffff9f66e3640000(0000) knlGS:0000000000000000
[218462.269666] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[218462.269967] CR2: 00007f53cd3eed98 CR3: 0000000588dc4004 CR4: 00000000001706e0
[218462.270272] Call Trace:
[218462.270591]  insert_inline_extent_backref+0x5d/0x110 [btrfs]
[218462.270913]  __btrfs_inc_extent_ref.isra.44+0x88/0x260 [btrfs]
[218462.271242]  ? btrfs_merge_delayed_refs+0x30a/0x3e0 [btrfs]
[218462.271561]  __btrfs_run_delayed_refs+0x67f/0x1180 [btrfs]
[218462.271881]  ? btrfs_set_path_blocking+0x49/0x50 [btrfs]
[218462.272199]  ? btrfs_search_slot+0x8c5/0xa40 [btrfs]
[218462.272514]  btrfs_run_delayed_refs+0x62/0x200 [btrfs]
[218462.272834]  btrfs_commit_transaction+0x50/0xa60 [btrfs]
[218462.273160]  prepare_to_merge+0x24a/0x260 [btrfs]
[218462.273482]  relocate_block_group+0x20d/0x790 [btrfs]
[218462.273807]  btrfs_relocate_block_group+0x173/0x2e0 [btrfs]
[218462.274133]  btrfs_relocate_chunk+0x31/0xc0 [btrfs]
[218462.274454]  btrfs_balance+0xa1c/0x11f0 [btrfs]
[218462.274777]  btrfs_ioctl_balance+0x2f6/0x3a0 [btrfs]
[218462.275098]  btrfs_ioctl+0x16d8/0x3030 [btrfs]
[218462.275398]  ? __handle_mm_fault+0xf23/0x1260
[218462.275696]  ? __fput+0x150/0x270
[218462.276171]  ? ksys_ioctl+0x96/0xb0
[218462.276785]  ? btrfs_ioctl_get_supported_features+0x30/0x30 [btrfs]
[218462.277378]  ksys_ioctl+0x96/0xb0
[218462.277956]  __x64_sys_ioctl+0x16/0x20
[218462.278534]  do_syscall_64+0x5b/0x1e0
[218462.279001]  entry_SYSCALL_64_after_hwframe+0x61/0xc6
[218462.279434] RIP: 0033:0x7f4b60e02c27
[218462.279991] Code: 90 90 90 48 8b 05 69 c2 2d 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 39 c2 2d 00 f7 d8 64 89 01 48
[218462.281178] RSP: 002b:00007ffed47e1108 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[218462.281767] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f4b60e02c27
[218462.282349] RDX: 00007ffed47e11a0 RSI: 00000000c4009420 RDI: 0000000000000005
[218462.282908] RBP: 0000000000000005 R08: 0000000000000013 R09: 0000000000000013
[218462.283258] R10: 00007f4b60cfd468 R11: 0000000000000246 R12: 0000000000000002
[218462.283818] R13: 00007ffed47e11a0 R14: 00007ffed47e1e79 R15: 0000000000000002
[218462.284372] ---[ end trace 8c578dd76b86700a ]---
[218462.284935] BTRFS: error (device sdd) in btrfs_run_delayed_refs:2147: errno=-5 IO failure
[218462.285500] BTRFS info (device sdd): forced readonly
[218462.286124] BTRFS info (device sdd): balance: ended with status: -30
[218462.295205] BTRFS error (device sdd): fail to start transaction for status update: -30

All disks are passing SMART, and I have already run a scrub, which fixed 11 errors that it found.

My assessment is that the partition is somehow damaged, but I am not sure what to do about it or what caused it, e.g. whether I need to replace a disk, etc.

If someone can help me understand what is going on, that would help me a lot.

Hi @acd

These error messages:

indicate that your drive (/dev/sdd) has I/O errors, which forced the BTRFS volume into read-only mode.

This raises a serious suspicion that your drive is dying …


  • Maybe first post the SMART parameters here, either through the GUI or via CLI:
sudo smartctl -iA /dev/sdd
  • And print some information about the filesystem, usage, and error counter:
sudo btrfs filesystem show
sudo btrfs filesystem usage <mount-point>
sudo btrfs device stats <mount-point>

That’s also an indication that there are some hardware issues with your drives … :grimacing:

Can you please also provide the output of your last scrub:

sudo btrfs scrub status -d <mount-point>

Can you maybe also briefly elaborate on what you have already done, and why?

  • Did you run the scrub before the balance?
  • Why do you want to run a balance?
  • How old is / are the disk(s) of the btrfs pool?
  • How long have you been using that pool?

Cheers Simon

3 Likes

Hi Simon, thank you for your reply. Here is the output:

sudo smartctl -iA /dev/sdd
smartctl 7.2 2021-09-14 r5237 [x86_64-linux-5.3.18-150300.59.106-default] (SUSE RPM)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Seagate IronWolf
Device Model:     ST4000VN008-2DR166
Serial Number:    ZDH1Q99Q
LU WWN Device Id: 5 000c50 0a3278415
Firmware Version: SC60
User Capacity:    4,000,787,030,016 bytes [4.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    5980 rpm
Form Factor:      3.5 inches
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-3 T13/2161-D revision 5
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Sun Dec 14 13:48:48 2025 EST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   083   064   044    Pre-fail  Always       -       219186024
  3 Spin_Up_Time            0x0003   093   093   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       108
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   095   060   045    Pre-fail  Always       -       3059991600
  9 Power_On_Hours          0x0032   021   021   000    Old_age   Always       -       69557 (182 108 0)
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       108
184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
188 Command_Timeout         0x0032   100   100   000    Old_age   Always       -       1
189 High_Fly_Writes         0x003a   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   085   047   040    Old_age   Always       -       15 (Min/Max 7/16)
191 G-Sense_Error_Rate      0x0032   100   100   000    Old_age   Always       -       0
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       0
193 Load_Cycle_Count        0x0032   001   001   000    Old_age   Always       -       635116
194 Temperature_Celsius     0x0022   015   053   000    Old_age   Always       -       15 (0 7 0 0 0)
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
240 Head_Flying_Hours       0x0000   100   253   000    Old_age   Offline      -       66375h+28m+41.342s
241 Total_LBAs_Written      0x0000   100   253   000    Old_age   Offline      -       289656266128
242 Total_LBAs_Read         0x0000   100   253   000    Old_age   Offline      -       2173836410965

 sudo btrfs filesystem show
Label: 'ROOT'  uuid: 4ac51b0f-afeb-4946-aad1-975a2a26c941
        Total devices 1 FS bytes used 5.80GiB
        devid    1 size 236.41GiB used 7.80GiB path /dev/sdg4

Label: 'storage_1'  uuid: 503def00-660b-4cbd-9f5e-5643b18f45a4
        Total devices 6 FS bytes used 10.17TiB
        devid    1 size 3.64TiB used 3.39TiB path /dev/sdd
        devid    5 size 3.64TiB used 3.39TiB path /dev/sdf
        devid    6 size 3.64TiB used 3.39TiB path /dev/sdb
        devid    7 size 3.64TiB used 3.39TiB path /dev/sda
        devid    8 size 3.64TiB used 3.39TiB path /dev/sdc
        devid    9 size 3.64TiB used 3.39TiB path /dev/sde
sudo btrfs filesystem usage /mnt2/storage_1
Overall:
    Device size:                  21.83TiB
    Device allocated:             20.36TiB
    Device unallocated:            1.48TiB
    Device missing:                  0.00B
    Used:                         20.35TiB
    Free (estimated):            759.10GiB      (min: 759.10GiB)
    Data ratio:                       2.00
    Metadata ratio:                   2.00
    Global reserve:              512.00MiB      (used: 0.00B)

Data,RAID1: Size:10.16TiB, Used:10.16TiB
   /dev/sda        3.38TiB
   /dev/sdb        3.39TiB
   /dev/sdc        3.39TiB
   /dev/sdd        3.38TiB
   /dev/sde        3.39TiB
   /dev/sdf        3.39TiB

Metadata,RAID1: Size:18.00GiB, Used:16.80GiB
   /dev/sda        9.00GiB
   /dev/sdb        6.00GiB
   /dev/sdc        3.00GiB
   /dev/sdd       11.00GiB
   /dev/sde        2.00GiB
   /dev/sdf        5.00GiB

System,RAID1: Size:32.00MiB, Used:1.44MiB
   /dev/sda       32.00MiB
   /dev/sdc       32.00MiB

Unallocated:
   /dev/sda      251.99GiB
   /dev/sdb      252.02GiB
   /dev/sdc      251.99GiB
   /dev/sdd      252.02GiB
   /dev/sde      252.02GiB
   /dev/sdf      252.02GiB

For sudo btrfs device stats: I reset the counters after the last scrub, so they are all 0.

sudo btrfs scrub status -d /mnt2/storage_1/
scrub status for 503def00-660b-4cbd-9f5e-5643b18f45a4
scrub device /dev/sdd (id 1) history
        scrub started at Thu Dec 11 23:02:13 2025 and finished after 07:04:05
        total bytes scrubbed: 3.39TiB with 0 errors
scrub device /dev/sdf (id 5) history
        scrub started at Thu Dec 11 23:02:13 2025 and finished after 06:44:46
        total bytes scrubbed: 3.39TiB with 11 errors
        error details: read=11
        corrected errors: 11, uncorrectable errors: 0, unverified errors: 0
scrub device /dev/sdb (id 6) history
        scrub started at Thu Dec 11 23:02:13 2025 and finished after 06:34:28
        total bytes scrubbed: 3.39TiB with 0 errors
scrub device /dev/sda (id 7) history
        scrub started at Thu Dec 11 23:02:13 2025 and finished after 06:45:16
        total bytes scrubbed: 3.39TiB with 0 errors
scrub device /dev/sdc (id 8) history
        scrub started at Thu Dec 11 23:02:13 2025 and finished after 06:19:27
        total bytes scrubbed: 3.39TiB with 0 errors
scrub device /dev/sde (id 9) history
        scrub started at Thu Dec 11 23:02:13 2025 and finished after 06:13:19
        total bytes scrubbed: 3.39TiB with 0 errors

Originally I wanted to run a balance because the “usage“ of the pool does not match the amount of files in the shares + snapshots at all (1-2 TB difference). When it failed, I rebooted and it mounted RW again. I turned off NFS (this is mainly how I use my shares), ran a scrub that found nothing, and after a few hours the filesystem mounted RO again…

So I ran a btrfs check --repair (it is ok if I lose “some“ data); it fixed a bunch of stuff, but the size still did not look right, so: scrub, balance, and then I posted on the forum :slight_smile:

If the issue is that I need to replace /dev/sdd, I am happy to do so, but since it is the first drive in the filesystem and it passed SMART, I wasn’t sure it was a failing disk.

The pool has been used since end of 2017 roughly.

Thank you for your time!

2 Likes

Could you please also share the SMART parameters of sdf:

sudo smartctl -iA /dev/sdf
1 Like

There are 2 lines in the SMART parameters that catch my attention:

The normalized value (“VALUE”) is usually 100 or 200 (as with the other parameters) when the attribute is “good”, and it decreases when there are issues.

Although the current value and the historically lowest value (“WORST”) are still above the critical value according to the manufacturer (“THRESH”), they are still outliers. If you have other Seagate IronWolf drives, feel free to compare these two SMART parameters across all the drives :slightly_smiling_face:


At least your RAID1 pool is fully intact, although I find it strange that the scrub detected read errors on /dev/sdf (id 5) but not on /dev/sdd (id 1) …

Just a quick side note (/question):

Have you rebooted between the output in your initial post and the output in your 2nd post?
… could it have happened that the errors attributed to /dev/sdd in your initial post were actually produced by the disk with id 5, which now happens to be /dev/sdf :thinking:


How do you know that the btrfs usage is off?
Do you have the btrfs quotas enabled?

At first glance, there are 2 drives that could need a replacement: /dev/sdf and /dev/sdd.

The fact that in your first post the drive with I/O errors was /dev/sdd, while the scrub detected the errors on /dev/sdf, raises some suspicion …


You should definitely check the SMART parameters of both drives again (id 1 & id 5), and I would also get at least one replacement drive ready.

Is the pool currently read-only or read-write?

4 Likes
sudo smartctl -iA /dev/sdf
smartctl 7.2 2021-09-14 r5237 [x86_64-linux-5.3.18-150300.59.106-default] (SUSE RPM)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Seagate IronWolf
Device Model:     ST4000VN008-2DR166
Serial Number:    ZDHB4MNY
LU WWN Device Id: 5 000c50 0e40c19d2
Firmware Version: SC60
User Capacity:    4,000,787,030,016 bytes [4.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    5980 rpm
Form Factor:      3.5 inches
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-3 T13/2161-D revision 5
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Sun Dec 14 15:37:55 2025 EST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   077   060   044    Pre-fail  Always       -       47239496
  3 Spin_Up_Time            0x0003   093   093   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       59
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       48
  7 Seek_Error_Rate         0x000f   090   060   045    Pre-fail  Always       -       1037721097
  9 Power_On_Hours          0x0032   066   066   000    Old_age   Always       -       30515 (15 58 0)
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       59
184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   045   045   000    Old_age   Always       -       55
188 Command_Timeout         0x0032   100   100   000    Old_age   Always       -       0
189 High_Fly_Writes         0x003a   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   086   063   040    Old_age   Always       -       14 (Min/Max 7/16)
191 G-Sense_Error_Rate      0x0032   100   100   000    Old_age   Always       -       0
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       0
193 Load_Cycle_Count        0x0032   001   001   000    Old_age   Always       -       368841
194 Temperature_Celsius     0x0022   014   040   000    Old_age   Always       -       14 (0 7 0 0 0)
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
240 Head_Flying_Hours       0x0000   100   253   000    Old_age   Offline      -       29092h+55m+57.999s
241 Total_LBAs_Written      0x0000   100   253   000    Old_age   Offline      -       118157627848
242 Total_LBAs_Read         0x0000   100   253   000    Old_age   Offline      -       995524849488

I know the size is off because shares + snapshots are well below 10 TB, and I have a RAID1 pool with 24 TB of raw capacity.
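As a sanity check on the capacity figures quoted earlier (the 21.83 TiB "Device size" with data ratio 2.00), here is the rough RAID1 math as a quick sketch; the 24 TB figure is raw capacity, and roughly half of it is usable:

```shell
# Sanity check on the pool numbers reported above:
# 6 drives x 3.64 TiB raw; RAID1 (data ratio 2.0) stores every block twice,
# so only about half of the raw capacity is usable.
awk 'BEGIN {
    raw = 6 * 3.64            # raw capacity in TiB (matches the 21.83 TiB above)
    usable = raw / 2.0        # divide by the data ratio
    printf "raw = %.2f TiB, usable ~= %.2f TiB\n", raw, usable
}'
```

So roughly 10.9 TiB of usable space, which puts the reported 10.17 TiB of used data much closer to capacity than the raw 24 TB suggests.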

1 Like

Currently the partition is RO. Do you think my next move should be to get it to mount RW again and run a balance to exclude the 2 disks?

1 Like

I’ll let @simon-77 share his opinion on your question.

On a side note, if you’re running mostly or exclusively Seagate drives, you can install the Linux version of SeaTools, which is specific to Seagate drives. Actually, @simon-77 brought it up in a past post about adjusting spin-down timers.

In any case, it’s available via zypper (should be zypper in seachest), which gives somewhat clearer SMART info (you can look up the command-line options, etc., in the GitHub repository’s wiki).

3 Likes

Alright, the SMART attributes of /dev/sdf (Serial Number: ZDHB4MNY) are much more concerning, in particular:

The RAW_VALUEs show that there were a number of sector reallocations and uncorrectable errors, which is also the reason for the 11 errors encountered and corrected by the scrub.

I would focus on replacing this drive first. The only strange thing is that the errors in your initial post were produced by /dev/sdd …

Did you reboot between your initial post and your second post?

I am asking because the device path (i.e. /dev/sdd) can change from one boot to another. The serial number (reported by smartctl -i <device-path>) and the BTRFS ID (reported by btrfs filesystem show) do not change, but neither of these is present in your initial post for cross-referencing.

So my suspicion would be, if you have rebooted in between, that the drive with Serial Number ZDHB4MNY, which is now /dev/sdf and was producing the errors during the scrub, was previously appearing as /dev/sdd …
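To pin each device path to a physical drive across reboots, the serial number can be extracted from smartctl output. A small sketch (the demo parses a sample of the smartctl output quoted earlier; the commented loop over device paths is an assumption that should be adjusted to the members listed by btrfs filesystem show):

```shell
# Sketch: extract the serial number from 'smartctl -i' output so each
# device path can be matched to a physical drive across reboots.
# get_serial expects 'smartctl -i' output on stdin.
get_serial() {
    awk -F': *' '/^Serial Number/{print $2}'
}

# Demo on a sample of the smartctl output quoted earlier in the thread:
sample='Device Model:     ST4000VN008-2DR166
Serial Number:    ZDHB4MNY'
printf '%s\n' "$sample" | get_serial    # prints ZDHB4MNY

# On the live system (assumption: device list must match your pool members):
#   for dev in /dev/sda /dev/sdb /dev/sdc /dev/sdd /dev/sde /dev/sdf; do
#       echo "$dev -> $(sudo smartctl -i "$dev" | get_serial)"
#   done
```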


Nevertheless, moving forward:

  • Focus on /dev/sdf (Serial Number: ZDHB4MNY) first; maybe the other drive is actually still good and has in fact not produced any errors at all.
  • If you intend to replace the disk and have a way to connect the additional disk to your server, it is always recommended to use the btrfs replace command with both the old and the new disk (and the others as well) present at the same time.
  • Otherwise, you can use the btrfs device remove command to move the data away from the failing disk onto the other disks of the pool. But be aware that you are already using a lot of the pool’s space.
  • Please note that the device path (i.e. /dev/sdd) can change after a reboot. Double-check the SMART attributes and/or use the BTRFS device ID (id 5 produced the errors during the scrub) for the replace/remove commands.
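The replace-by-devid workflow from the bullets above can be sketched like this. The new-disk path /dev/sdh is a placeholder, and the commands are echoed rather than executed so they can be reviewed and adapted first:

```shell
# Sketch of the btrfs replace workflow described above.
# Assumptions: devid 5 is the failing member (per the scrub output) and the
# freshly attached disk appears as /dev/sdh -- verify both before running.
DEVID=5
NEW_DISK=/dev/sdh
MOUNT=/mnt2/storage_1

# Echo the commands for review instead of executing them directly.
echo "btrfs replace start $DEVID $NEW_DISK $MOUNT"
echo "btrfs replace status $MOUNT"
```

Using the numeric devid as the source avoids any ambiguity if the /dev/sdX letters have shuffled since the last boot.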

Cheers Simon

3 Likes

Thank you so much for your help, I will start by replacing that one and see how it goes.

2 Likes

So the disks have been replaced (devid 5 and devid 1, which kept giving I/O errors), but now my partition keeps remounting read-only a while after each reboot. Here is what was in dmesg after the last scrub:

[Jan 9 09:25] BTRFS error (device sdd): tree first key mismatch detected, bytenr=32760846909440 parent_transid=6428956 key expected=(18446744073709551606,128,44683300773888) has=(18446744073709551606,128,45138665185280)
[  +0.168537] BTRFS info (device sdd): scrub: not finished on devid 9 with status: -117
[Jan 9 09:31] BTRFS error (device sdd): tree first key mismatch detected, bytenr=32760846909440 parent_transid=6428956 key expected=(18446744073709551606,128,44683300773888) has=(18446744073709551606,128,45138665185280)
[  +0.191085] BTRFS error (device sdd): tree first key mismatch detected, bytenr=32760846909440 parent_transid=6428956 key expected=(18446744073709551606,128,44683300773888) has=(18446744073709551606,128,45138665185280)
[  +0.180223] BTRFS error (device sdd): tree first key mismatch detected, bytenr=32760846909440 parent_transid=6428956 key expected=(18446744073709551606,128,44683300773888) has=(18446744073709551606,128,45138665185280)
[Jan 9 09:32] BTRFS error (device sdd): tree first key mismatch detected, bytenr=32760846909440 parent_transid=6428956 key expected=(18446744073709551606,128,44683300773888) has=(18446744073709551606,128,45138665185280)
[  +0.181168] ------------[ cut here ]------------
[  +0.060103] BTRFS: Transaction aborted (error -117)
[  +0.058305] WARNING: CPU: 7 PID: 16330 at ../fs/btrfs/extent-tree.c:2880 __btrfs_free_extent+0x1093/0x11c0 [btrfs]
[  +0.114455] Modules linked in: af_packet rfkill ipmi_ssif intel_rapl_msr intel_rapl_common sb_edac x86_pkg_temp_thermal intel_powerclamp coretemp nls_iso8859_1 nls_cp437 vfat fat kvm_intel iTCO_wdt intel_pmc_bxt iTCO_vendor_support kvm dcdbas(X) ipmi_si irqbypass pcspkr lpc_ich tg3 mei_me ipmi_devintf i2c_algo_bit ioatdma mfd_core libphy mei dca joydev ipmi_msghandler button loop fuse dm_mod configfs dmi_sysfs ip_tables x_tables hid_generic usbhid ahci libahci mpt3sas libata crc32_pclmul raid_class scsi_transport_sas ghash_clmulni_intel sd_mod sha512_ssse3 t10_pi sha256_ssse3 sha1_ssse3 crc64_rocksoft_generic aesni_intel ehci_pci crc64_rocksoft sg ehci_hcd crypto_simd crc64 cryptd usbcore scsi_mod wmi btrfs blake2b_generic libcrc32c crc32c_intel xor raid6_pq
[  +0.589450] Supported: Yes, External
[  +0.059348] CPU: 7 PID: 16330 Comm: btrfs-transacti Not tainted 6.4.0-150600.23.81-default #1 SLE15-SP6 7a638265cb74f5894e77323b38d5130ae84e62a5
[  +0.119729] Hardware name: Dell Inc. PowerEdge R520/051XDX, BIOS 2.9.0 01/09/2020
[  +0.117941] RIP: 0010:__btrfs_free_extent+0x1093/0x11c0 [btrfs]
[  +0.059516] Code: 44 89 fa 48 c7 c6 50 a7 86 c0 48 8b 78 60 e8 44 53 0d 00 41 b8 01 00 00 00 eb 83 44 89 fe 48 c7 c7 20 a7 86 c0 e8 4d e8 5b e0 <0f> 0b 41 b8 01 00 00 00 e9 2a ff ff ff 44 89 fe 48 c7 c7 20 a7 86
[  +0.176074] RSP: 0018:ffffd456423f3b68 EFLAGS: 00010286
[  +0.057317] RAX: 0000000000000000 RBX: ffff8cf874673230 RCX: 0000000000000000
[  +0.112035] RDX: 0000000000000001 RSI: ffff8cfb633a3500 RDI: ffff8cfb633a3500
[  +0.112329] RBP: 000028a3a4712000 R08: 0000000000000000 R09: c0000000fffeffff
[  +0.111950] R10: ffffd456423f3a80 R11: ffffd456423f3990 R12: ffff8cf874673270
[  +0.114187] R13: 000000000000008e R14: 0000000000002265 R15: 00000000ffffff8b
[  +0.114359] FS:  0000000000000000(0000) GS:ffff8cfb63380000(0000) knlGS:0000000000000000
[  +0.114451] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  +0.057389] CR2: 00007f2cbfb18002 CR3: 0000000281234005 CR4: 00000000001706e0
[  +0.112243] Call Trace:
[  +0.054705]  <TASK>
[  +0.052948]  __btrfs_run_delayed_refs+0x228/0xfd0 [btrfs e26e7ab6eb47ce133050dbef307416b024c87d16]
[  +0.106351]  btrfs_run_delayed_refs+0x69/0x110 [btrfs e26e7ab6eb47ce133050dbef307416b024c87d16]
[  +0.107933]  btrfs_write_dirty_block_groups+0x16d/0x3b0 [btrfs e26e7ab6eb47ce133050dbef307416b024c87d16]
[  +0.108376]  commit_cowonly_roots+0x1ea/0x260 [btrfs e26e7ab6eb47ce133050dbef307416b024c87d16]
[  +0.108643]  btrfs_commit_transaction+0x42d/0xf20 [btrfs e26e7ab6eb47ce133050dbef307416b024c87d16]
[  +0.109825]  ? start_transaction+0xd6/0x7e0 [btrfs e26e7ab6eb47ce133050dbef307416b024c87d16]
[  +0.111468]  ? __pfx_autoremove_wake_function+0x10/0x10
[  +0.056654]  transaction_kthread+0x169/0x1c0 [btrfs e26e7ab6eb47ce133050dbef307416b024c87d16]
[  +0.111865]  ? __pfx_transaction_kthread+0x10/0x10 [btrfs e26e7ab6eb47ce133050dbef307416b024c87d16]
[  +0.112211]  kthread+0xe1/0x120
[  +0.054864]  ? __pfx_kthread+0x10/0x10
[  +0.053876]  ret_from_fork+0x2c/0x50
[  +0.052628]  </TASK>
[  +0.051001] ---[ end trace 0000000000000000 ]---
[  +0.050735] BTRFS: error (device sdd: state A) in do_free_extent_accounting:2880: errno=-117 Filesystem corrupted
[  +0.000033] BTRFS error (device sdd: state A): fail to start transaction for status update: -117
[  +0.100918] BTRFS info (device sdd: state EA): forced readonly
[  +0.000004] BTRFS error (device sdd: state EA): failed to run delayed ref for logical 44683303657472 num_bytes 262144 type 178 action 2 ref_mod 1: -117
[  +0.309880] BTRFS: error (device sdd: state EA) in btrfs_run_delayed_refs:2186: errno=-117 Filesystem corrupted
[  +0.106417] clocksource: Long readout interval, skipping watchdog check: cs_nsec: 2890454516 wd_nsec: 2890449057
[  +0.106376] BTRFS warning (device sdd: state EA): Skipping commit of aborted transaction.
[  +0.104254] BTRFS: error (device sdd: state EA) in cleanup_transaction:2051: errno=-117 Filesystem corrupted
[  +0.120722] BTRFS info (device sdd: state EA): scrub: not finished on devid 7 with status: -125
[  +0.000697] BTRFS info (device sdd: state EA): scrub: not finished on devid 6 with status: -125
[  +0.000601] BTRFS info (device sdd: state EA): scrub: not finished on devid 8 with status: -125
[  +0.001145] BTRFS info (device sdd: state EA): scrub: not finished on devid 5 with status: -125
[  +0.103009] BTRFS info (device sdd: state EA): scrub: not finished on devid 1 with status: -125

If anybody can help with that one that would be really appreciated.

Umm, I don’t know what your hardware setup is, but it has been my experience that unexplained intermittent errors are often caused by things one would never suspect.

I have experienced the following many times:

  1. PSU seems okay, measures okay but slightly low in voltage.

  2. SATA cables went bad or simply didn’t work well with one drive vs. others.

  3. Worn or dirty or corroded pins in back-plane connectors or weak cable pins etc.

  4. I have had a motherboard that stopped working on 2 SATA ports for some reason.

  5. Memory pins, connections, or RAM modules going bad. (Lost two ECC sticks last year)

  6. Had a fan running a tiny bit slow (bearings not yet bad), but it would glitch the motherboard pretty regularly.

Good Luck!

:smiling_face_with_sunglasses:

3 Likes

Hi, thank you for your answer. Maybe it is a good time to get the server out of the rack and clean it then :slight_smile:

Yes, you should probably run an extensive memtest to ensure you don’t have any issues there before proceeding with further recovery. From what I know of the errors you have listed, this is a serious fault in the filesystem, often caused by a memory error (in a previous boot) that ends up corrupting portions of the filesystem. But all of @Tex1954’s recommendations are sensible things to look into as well.

I kind of suspect you might have to recover your filesystem from a backup, or attempt a btrfs restore to a new location. But none of that should happen until you have done some more hardware checks, to ensure that you’re not propagating errors any further.
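For reference, the btrfs restore path mentioned above copies files out of a damaged, unmounted filesystem to a separate destination. A sketch with placeholder paths (both the source device and the destination are assumptions; the commands are echoed for review, not executed):

```shell
# Sketch of the 'btrfs restore' recovery route described above.
# btrfs restore reads a damaged filesystem without mounting it and copies
# files to a destination on a *different*, healthy filesystem.
SRC_DEV=/dev/sdd            # placeholder: any member device of the damaged pool
DEST=/mnt/recovery          # placeholder: destination with enough free space

# Echo the commands for review instead of executing them directly.
echo "btrfs restore --dry-run $SRC_DEV $DEST"
echo "btrfs restore $SRC_DEV $DEST"
```

The dry-run pass lists what would be recovered, which is a cheap way to gauge how much is salvageable before committing to a full copy.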

Are you using ECC memory in your server? (Not that it’s the be-all and end-all, but it can certainly be beneficial in reducing the risk of bitrot and general memory read/write errors.)

2 Likes

This is a PowerEdge R520 and it is all ECC:

Part Number: HMT351R7CFR8A-H9
Part Number: HMT351R7CFR8A-H9
Part Number: HMT351R7CFR8A-H9
Part Number: 18KSF51272PDZ1G4D1
Part Number: HMT351R7CFR8A-H9
Part Number: HMT351R7CFR8A-H9

All hardware diagnostics are reporting OK, but the reason I replaced the second disk was that devid 1 was going missing during some reboots. Is it fair to assume it could be an issue with the disk array controller, in your opinion?

1 Like

That could be an issue as well. Doesn’t the PowerEdge have something like iDRAC that logs hardware errors or alerts if the CMOS battery is low/dead? Though if the battery were dead, I assume you would lose all kinds of config at every reboot. Sorry, I have never owned one, so I have no idea how “manageable” they are …

1 Like