Problem replacing a damaged disk

the answer is:

Data, RAID1: total=201.00GiB, used=190.72GiB
System, RAID1: total=32.00MiB, used=64.00KiB
Metadata, RAID1: total=3.00GiB, used=2.21GiB
GlobalReserve, single: total=270.98MiB, used=0.00B
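For reference, output in this shape is what btrfs filesystem df prints; a minimal sketch, assuming the pool is mounted at /mnt2/Dati:

# show allocated vs. used space per block-group type (Data/Metadata/System) and profile
sudo btrfs filesystem df /mnt2/Dati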

And dmesg | grep -i btrfs shows:

[ 2.032619] Btrfs loaded, assert=on, zoned=yes, fsverity=yes
[ 58.196321] BTRFS info (device sdb): using crc32c (crc32c-intel) checksum algorithm
[ 58.196810] BTRFS info (device sdb): use no compression
[ 58.197269] BTRFS info (device sdb): disk space caching is enabled
[ 58.198253] BTRFS error (device sdb): devid 1 uuid 781edd51-ed1e-4118-a0db-2bfb99a18cee is missing
[ 58.198989] BTRFS error (device sdb): failed to read the system array: -2
[ 58.200807] BTRFS error (device sdb): open_ctree failed
[ 70.182288] BTRFS info (device sdb): using crc32c (crc32c-intel) checksum algorithm
[ 70.182808] BTRFS info (device sdb): use no compression
[ 70.183302] BTRFS info (device sdb): disk space caching is enabled
[ 70.184147] BTRFS error (device sdb): devid 1 uuid 781edd51-ed1e-4118-a0db-2bfb99a18cee is missing
[ 70.184661] BTRFS error (device sdb): failed to read the system array: -2
[ 70.185416] BTRFS error (device sdb): open_ctree failed
[ 71.235425] BTRFS info (device sdb): using crc32c (crc32c-intel) checksum algorithm
[ 71.236249] BTRFS info (device sdb): use no compression
[ 71.237019] BTRFS info (device sdb): disk space caching is enabled
[ 71.238462] BTRFS error (device sdb): devid 1 uuid 781edd51-ed1e-4118-a0db-2bfb99a18cee is missing
[ 71.239300] BTRFS error (device sdb): failed to read the system array: -2
[ 71.240367] BTRFS error (device sdb): open_ctree failed
The messages above repeated many times, followed by the ones below; I cut out the parts for the other disks, which did not give any errors.
[ 888.760949] BTRFS: error (device sdb: state EA) in btrfs_finish_one_ordered:3094: errno=-5 IO failure
[ 888.761221] BTRFS error (device sdb: state EA): parent transid verify failed on logical 2795422236672 mirror 1 wanted 739658 found 741427
[ 888.763379] BTRFS: error (device sdb: state EA) in btrfs_finish_one_ordered:3094: errno=-5 IO failure
[ 888.763590] BTRFS error (device sdb: state EA): parent transid verify failed on logical 2795422236672 mirror 1 wanted 739658 found 741427
[ 888.766387] BTRFS: error (device sdb: state EA) in btrfs_finish_one_ordered:3094: errno=-5 IO failure
[ 888.766592] BTRFS error (device sdb: state EA): parent transid verify failed on logical 2795422236672 mirror 1 wanted 739658 found 741427
[ 888.769324] BTRFS: error (device sdb: state EA) in btrfs_finish_one_ordered:3094: errno=-5 IO failure
[ 888.769734] BTRFS: error (device sdb: state EA) in btrfs_finish_one_ordered:3094: errno=-5 IO failure

If I run btrfs scrub start -B /mnt2/Dati, I get:

ERROR: scrubbing /mnt2/Dati failed for device id 1: ret=-1, errno=30 (Read-only file system)
ERROR: scrubbing /mnt2/Dati failed for device id 2: ret=-1, errno=30 (Read-only file system)
ERROR: scrubbing /mnt2/Dati failed for device id 3: ret=-1, errno=30 (Read-only file system)
scrub canceled for c4e9692b-c4a6-4bc6-b452-63be422d8375
Scrub started: Wed Dec 10 18:35:24 2025
Status: aborted
Duration: 0:00:00
Total to scrub: 0.00B
Rate: 0.00B/s
Error summary: no errors found
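scrub needs a read-write mount, so the errno=30 above means the pool is currently mounted (or has been forced) read-only. A quick hedged check, assuming the pool is mounted at /mnt2/Dati:

# look for "ro" in the active mount options
findmnt -no OPTIONS /mnt2/Dati
# the most recent kernel messages usually say why btrfs forced read-only
sudo dmesg | grep -i btrfs | tail -n 20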

Another problem I’ve encountered: the Rockstor interface shows two disks, but if I check from the command line (btrfs filesystem show /mnt2/Dati), the missing disk, the broken one, still appears.

Label: 'Dati' uuid: c4e9692b-c4a6-4bc6-b452-63be422d8375
Total devices 3 FS bytes used 192.93GiB
devid 1 size 0 used 0 path MISSING
devid 2 size 931.51GiB used 204.03GiB path /dev/sdb
devid 3 size 953.87GiB used 1.00GiB path /dev/nvme0n1

I managed to remove it without needing any btrfs command (such as btrfs device remove missing).
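For reference, the usual command-line way to drop a missing device is shown below; a minimal sketch, assuming the pool is mounted at /mnt2/Dati:

# remove the record of the failed/absent device from the pool
sudo btrfs device remove missing /mnt2/Dati
# verify that the MISSING entry is gone
sudo btrfs filesystem show /mnt2/Dati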

Honestly, it would be easier to destroy the pool and rebuild it from scratch, but it would be laborious to properly reconfigure the Rock-ons running on it. Still, with your help I’m learning a lot, and somehow I feel like I can always fix it. Thank you so much for your support and your time, and you can always tell me to STOP and throw everything away.

Davide


As a quick side note: create a config backup via the WebUI; it will then restore the Rock-ons (and other settings) accordingly after the initial install and pool import, or, as in your case, after you have recreated the pool and the appropriate shares and restored their contents from backup (if you have one).


did you run that as well?

Honestly, the Segmentation fault error is very strange and should not happen, no matter how many drives are failing.

I think that can actually be called a bug …


It’s also great to hear that you managed to remove the failed drive from the pool. What’s a bit strange, though, is that you are getting a lot of errors for your other SATA SSD (sdb):


Moving forward, I would (see the combined sketch after this list):

  1. Reboot the system.
  2. Reset the btrfs error counters (the -z flag clears the values; when you run it again, it will only show zeros):
sudo btrfs device stats -z <mount-point>
  3. Verify that the old SSD is really removed from the pool:
sudo btrfs filesystem show <mount-point>
  4. Check whether a balance or scrub is still running:
sudo btrfs balance status <mount-point>
sudo btrfs scrub status <mount-point>
  5. Try resuming the balance operation to restore RAID1 redundancy:
sudo btrfs balance start -dconvert=raid1,soft -mconvert=raid1,soft <mount-point>
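A minimal combined sketch of steps 2-5, with the mount point from this thread assumed to be /mnt2/Dati:

sudo btrfs device stats -z /mnt2/Dati      # clear the per-device error counters
sudo btrfs filesystem show /mnt2/Dati      # the old SSD should no longer be listed
sudo btrfs balance status /mnt2/Dati       # any balance still running?
sudo btrfs scrub status /mnt2/Dati         # any scrub still running?
sudo btrfs balance start -dconvert=raid1,soft -mconvert=raid1,soft /mnt2/Dati   # restore RAID1 redundancy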

If you encounter some errors, it is a good idea to have a look into the kernel log via dmesg as you already did multiple times.

  • When the balance operation does not work, try running a scrub
  • When the scrub also fails, use the btrfs check command (see the sketch after this list); note that the btrfs filesystem must not be mounted for this command
  • When you run into some “read-only filesystem” errors again, try mounting the filesystem manually:
sudo mount /dev/sdb /mnt
  • When you encounter some problems mounting the filesystem, try degraded mode
sudo mount -o degraded /dev/sdb /mnt
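A hedged sketch of that unmounted check, assuming the pool device is /dev/sdb and the mount point is /mnt2/Dati:

# btrfs check must run on an unmounted filesystem; read-only mode is the safe default
sudo umount /mnt2/Dati
sudo btrfs check --readonly /dev/sdb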

Cheers Simon


Hi,
I have regular data backups on a commercial external NAS. I also have configuration backups, saved separately, but honestly, when I needed them, the restore didn’t even reconfigure the Samba shares.
I’ll take this opportunity to reinstall the Rock-ons one by one and optimize them, since some were created during my first experiments with Rockstor.

Thanks for your help.
Davide



Hi Simon,

I’ve done all the steps you suggested, and the scrubbing seems to have started. I’m starting to suspect that the SSD is also having problems, but it was purchased in September 2025. I think I’ll have it replaced just in case.

UUID: c4e9692b-c4a6-4bc6-b452-63be422d8375
Scrub started: Thu Dec 11 09:02:58 2025
Status: running
Duration: 0:02:25
Time left: 0:16:25
ETA: Thu Dec 11 09:21:52 2025
Total to scrub: 384.28GiB
Bytes scrubbed: 49.27GiB (12.82%)
Rate: 347.92MiB/s
Error summary: verify=40
Corrected: 0
Uncorrectable: 40
Unverified: 0

Thanks again.

p.s.
No, after 4 minutes this happened:

UUID: c4e9692b-c4a6-4bc6-b452-63be422d8375
Scrub started: Thu Dec 11 09:02:58 2025
Status: aborted
Duration: 0:04:34
Total to scrub: 384.28GiB
Rate: 348.85MiB/s
Error summary: verify=40
Corrected: 0
Uncorrectable: 40
Unverified: 0
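When a scrub aborts like this, the kernel log and the per-device counters usually point at the drive that raised the I/O errors; a quick sketch, assuming the pool is mounted at /mnt2/Dati:

sudo dmesg | grep -i btrfs | tail -n 30    # reason for the scrub abort
sudo btrfs device stats /mnt2/Dati         # read/write/corruption counters per device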

Davide


That’s very strange and not looking good.

Have you had a chance to look into the output of dmesg?

Maybe really run the btrfs check command now; note that the filesystem must not be mounted for this.


btrfs check
It gave many, many errors on sdb, including open_ctree failures like these:

[ 505.555532] BTRFS error (device sdb): failed to read the system array: -2
[ 505.556436] BTRFS error (device sdb): open_ctree failed
[ 526.934766] BTRFS info (device sdb): using crc32c (crc32c-intel) checksum algorithm
[ 526.935387] BTRFS info (device sdb): use no compression
[ 526.935982] BTRFS info (device sdb): disk space caching is enabled
[ 526.937470] BTRFS error (device sdb): devid 1 uuid 781edd51-ed1e-4118-a0db-2bfb99a18cee is missing

And then it ended with this:

root 4722 inode 15961 errors 1000, some csum missing
ERROR: errors found in fs roots
found 207154229248 bytes used, error(s) found
total csum bytes: 194094524
total tree bytes: 2370945024
total fs tree bytes: 2086141952
total extent tree bytes: 74055680
btree space waste bytes: 396683207
file data blocks allocated: 2411587784704
referenced 268150251520

I didn’t notice or save anything during the scrub, but immediately afterward sdb was forced to read-only, and I’d have to unmount all the shares on it, then remount it in degraded mode and try again.

Or is there a way I don’t know of to force an already mounted /mnt from ro to rw?

Davide


Hey @dadozts

part of this information makes me a bit optimistic, part of it not so much …

Let’s start with the better news:

If this was your only error line in the output of the btrfs check:

root 4722 inode 15961 errors 1000, some csum missing

it means that “only” a single inode (= probably one file) is damaged. To find out which file, run:

find <mount-point> -inum 15961
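With the mount point used earlier in this thread (an assumption), that would be:

# prints the path(s) of the file using inode 15961; btrfs inode numbers repeat per subvolume,
# so check that the hit belongs to the subvolume (root 4722) named in the check output
sudo find /mnt2/Dati -inum 15961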

The other error message I don’t like is that /dev/sdb is (or was) missing …
Can you check if this drive is still listed via the lsblk command?

And also, please show again how the data is split across the devices by running:

sudo btrfs filesystem show
sudo btrfs filesystem usage <mount-point>

Unmounting & mounting is fine.
If you prefer, you can also use the remount option, something like:

sudo mount -o remount,rw <device> <mount-point>
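With the names used earlier in this thread (device and mount point are assumptions), that would be roughly:

sudo mount -o remount,rw /dev/sdb /mnt2/Dati
# note: if btrfs itself forced the filesystem read-only after an error, the remount is usually
# refused and a full unmount/mount (or reboot) is needed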

Hi, Simon, and Hooverdan
first of all, thanks for all the support. Finally, after almost 10 days with my PC set up on the floor in the middle of my office and constant errors, I proceeded like this.

I already had a backup of that data on another system, so I mounted the /dev/sdb disk to /mnt2/Data and copied all the data from the pool to another pool on the system via the command line.
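For anyone following along, a copy like that can be done with rsync; a sketch only, where the destination pool mount point is an assumption:

# archive mode, preserving hard links, ACLs and extended attributes, with overall progress
sudo rsync -aHAX --info=progress2 /mnt2/Data/ /mnt2/BackupPool/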

I deleted the pool and recreated it with the new NVMe. All the errors (SMART, btrfs, etc.) were gone.
I loaded the data back, everything was fine, and reloaded the Rock-ons in the same locations. Unfortunately, that left Jellyfin working, but it always appeared in “starting” mode, with this recurring error:

vethc8c54fe: entered allmulticast mode
[64657.455462] vethc8c54fe: entered promiscuous mode
[64657.481607] eth0: renamed from vethd329985
[64657.530529] docker0: port 2(vethc8c54fe) entered blocking state
[64657.531144] docker0: port 2(vethc8c54fe) entered forwarding state
[64658.727450] docker0: port 2(vethc8c54fe) entered disabled state
[64658.728204] vethd329985: renamed from eth0
[64658.812278] docker0: port 2(vethc8c54fe) entered disabled state
[64658.814849] vethc8c54fe (unregistering): left allmulticast mode
[64658.815728] vethc8c54fe (unregistering): left promiscuous mode
[64658.816579] docker0: port 2(vethc8c54fe) entered disabled state
[65074.303004] docker0: port 2(vethba37825) entered blocking state
[65074.303458] docker0: port 2(vethba37825) entered disabled state
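For a Rock-on stuck in “starting”, a couple of generic Docker checks usually narrow it down (the container name here is an assumption):

sudo docker ps -a                      # is the container restarting in a loop?
sudo docker logs --tail 50 jellyfin    # last log lines of the (assumed) jellyfin container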

At that point, I decided to do something drastic and fix the system thoroughly. With my experience, and with your help, I deleted the Rock-ons’ shares, left the data shares nice and clean, reinstalled the Rockstor system from scratch, imported the pools, and reinstalled the necessary Rock-ons.

The system works perfectly, no errors, no data loss.

I learned a lot about BTRFS that will be very useful.

Thank you for your support and your time.
Davide

P.S.
Prosecco always available :wink:
