RAID6 btrfs replace stuck; read-only mount

Hi,

I started replacing drives (my pool is called “Top”); I currently have RAID6 and want more capacity. I inserted a new disk (6TB), used “btrfs replace start”, all went well and it finished. I powered down, removed the “old” (4TB) disk, and turned it back on. All is good :slightly_smiling:
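For reference, the sequence I used each time was roughly this (device names here are only examples; the mount point is the one Rockstor uses for the pool):

btrfs replace start /dev/sdX /dev/sdY /mnt2/Top   # sdX = old 4TB member, sdY = new 6TB disk (example names)
btrfs replace status /mnt2/Top                    # watch until it reports the replace has finished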

I did the same thing a second time; that also went perfectly.

On the third disk, the replace didn't seem to finish successfully; it said it was cancelled after a few hours.

I powered down, removed the “old” disk, and powered back up. It looks like it's all running… but the UI shows there as being no data.

btrfs fi show says that there is 72TB in use (which is the amount I would expect)

btrfs fi df /mnt2/Top shows “Data, Single: 3GB”, which is NOT what I want to see!
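For completeness, these are the commands I'm running to look at the pool (the mount point is the one Rockstor created):

btrfs fi show                  # devices in the pool and bytes used per device
btrfs fi df /mnt2/Top          # Data/Metadata/System breakdown and their profiles
btrfs fi usage /mnt2/Top       # combined overview, if your btrfs-progs version has it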

Accessing my two shares over the network - they both appear totally empty

Accessing /mnt2/Top/TV and running ls also shows no files in the shares.

I can boot Rockstor fine with EITHER the 4TB or the 6TB disk inserted, and it seems pretty happy, but btrfs fi show still says 72TB, the UI shows no data, and I can't see any files.

If I put both drives in at once, btrfs fi show says there is a mismatch in the IDs.

Advice on how to proceed please?

I'm not sure whether to try a balance (on the basis that some data may be missing), or a scrub, or whether to put both the 4TB and 6TB drives in and try “btrfs replace resume” on them?

Or should I simply remove both the 4TB and 6TB drives, on the basis that I have RAID6 and it should be able to start without either of them and repair itself from there?

Thanks in advance!

I've done some follow-up, and I think the reason I see the shares as empty is that “Top” isn't mounted, or at least isn't mounted correctly.

umount /mnt2/Top

shows:

/mnt2/Top: not mounted

but then

mount /dev/sdl /mnt2/Top

gives:

mount: wrong fs type, bad option, bad superblock on /dev/sdl, missing codepage or helper program
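That generic error usually hides the real cause, which ends up in the kernel log; this is the standard way I know to check right after a failed mount attempt (the exact messages will vary):

dmesg | tail -n 50             # the btrfs-specific reason for the mount failure is usually near the end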

If the replace did not succeed, I would not have removed the old/source disk just yet. Is the new disk part of the set in the output of btrfs fi show? If so, then I'd think that the new device is part of the pool now but chunks are not balanced (further speculation: that's where the replace failed). This may be preventing the Pool from being mounted. I'd then mount the Pool in degraded mode and run a balance.
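Something along these lines, assuming the pool label is still Top and /dev/sdl is one of its member devices (adjust to your setup):

mount -o degraded /dev/sdl /mnt2/Top   # degraded allows mounting with a device missing
btrfs fi show /mnt2/Top                # confirm which devices it sees and which it reports missing
btrfs balance start /mnt2/Top          # redistribute chunks across the devices that are present
btrfs balance status /mnt2/Top         # check progress from another shell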

I don’t think there’s such a thing as btrfs replace resume, is there? If the pool was fully balanced prior to the replacement of the third drive, then I’d be pretty optimistic. Ok, I’ll stop here, too much speculation with very little data.

Hi - Thanks for getting back to me.

I have mounted in degraded mode and my data does appear to be there (Yay/Thanks!)

So now, in theory, all I have to do is run a balance and it will “repair” the array back to a healthy state where it will auto-mount?

I tried the balance; it says it can't balance because the filesystem is read-only.

So I went to re-mount it… and I can't, because it is busy.

I checked, and it thinks the replace is running again (but it's not making any progress).
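For reference, these are the commands I'm using to see what the pool thinks is running:

btrfs replace status /mnt2/Top   # percentage done plus read/write error counters
btrfs balance status /mnt2/Top   # whether a balance is actually in progress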

So I think when I get home (I'm accessing it remotely right now) I'll put both disks in and see whether the replace continues.

Hi,

With all the disks in, it boots, it mounts, and the data is accessible. But it has mounted read-only.

That means I can't scrub or balance or delete the disk I want to remove.

I can't unmount because it says it is busy.

btrfs replace status says that a replace is in progress, but it is 0.0% done (and has been like that overnight), with 394 read errors, which also isn't changing. None of the disks actually appears to be doing anything, and “btrfs replace cancel” has no effect.

So what do I do now?

Thanks

Chris

I’ve updated the title of this post to be more specific.

I think this is a good time to share specifics on the btrfs mailing list and ask for guidance. Perhaps the suggestion would be to unmount (I know you are not able to, but have you tried unmounting all Shares in the Pool first? Perhaps umount -l?) and run a btrfs check. The output of dmesg, btrfs fi show, btrfs fi df, and btrfs fi usage may also have important clues in it to help further troubleshooting.
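Roughly this kind of sequence, with Share and device names adjusted to yours; btrfs check without --repair only reads, so it should be safe:

umount /mnt2/<share-name>        # repeat for each Share carved out of the Pool
umount /mnt2/Top                 # or umount -l /mnt2/Top if it stays busy
btrfs check /dev/sdl             # read-only check, run against any one member device
dmesg | grep -i btrfs            # kernel-side errors
btrfs fi show
btrfs fi df /mnt2/Top            # these two need the Pool mounted
btrfs fi usage /mnt2/Top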

Hi,

Thanks for getting back to me (and updating the title).

I’ve been doing some further investigation (and learning a lot about Linux/BTRFS in the process!)

I have managed to unmount it now… and run btrfs check …

I think the issue is that the journal is out of sync with what is on the disks:

parent transid verify failed on 57637091033088 wanted 93587 found 93343

and I’ve found this:

http://ram.kossboss.com/btrfs-transid-issue-explained-fix/

But that talks about being one or two transactions out, not 250! So I'm reluctant to try that fix for now.

(Plus, everything I have tried so far has been non-destructive/read-only.)
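In that spirit, the only things I'm considering for now are ones that, as far as I can tell, don't write anything (the mount option name depends on kernel version):

btrfs check /dev/sdl                            # read-only by default; not adding --repair for now
mount -o ro,usebackuproot /dev/sdl /mnt2/Top    # try an older tree root; on older kernels the option is "recovery"
btrfs restore -D /dev/sdl /tmp/restore-test     # dry run of btrfs restore; lists files without writing them, destination path is just a placeholder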

Also, I've noticed that Rockstor mounts it as rw, but as soon as anything tries a write there is an error and the mount flips to ro, which is, I presume, what is making the “replace” get stuck, since you can't change a ro filesystem.

Ha, that's a key observation. I suggest you turn off the Rockstor services altogether while you are troubleshooting, to eliminate interference.

  1. stop services with systemctl stop rockstor-pre rockstor rockstor-bootstrap

  2. disable them with systemctl disable rockstor-pre rockstor rockstor-bootstrap

After these services are stopped and disabled, if you unmount all Shares and just mount that Pool in question, I wonder if replace will pick up again.
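i.e. something like this, with your Share names substituted:

umount /mnt2/<share-name>        # for every Share exported from the Pool
umount /mnt2/Top                 # make sure nothing still has the Pool mounted
mount /dev/sdl /mnt2/Top         # then mount just the Pool itself
btrfs replace status /mnt2/Top   # and see whether the suspended replace picks up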

You can enable and start them again once the Pool is back to health.

It was a nice idea, but without doing anything at all, it still goes ro. I suspect it's the “write” of the stuck replace operation that is causing it to error when it realises the journal is out of sync.

On the basis that it's RAID6, and that in theory I'm only moving data from one disk to another (so only one device is affected, so to speak), I did try removing both of the drives, the 4TB and the 6TB, as that should still leave me with a mountable but degraded array. But I can't mount that at all.
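(For reference, this is the shape of the degraded mount I'm attempting with both of those drives pulled, with sdX standing in for any member that is still present; in theory RAID6 should tolerate two missing devices:

mount -o degraded /dev/sdX /mnt2/Top       # degraded mount with the missing devices absent
mount -o degraded,ro /dev/sdX /mnt2/Top    # same, but read-only)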

Hello,
I am new here and also a newbie with Rockstor as well as btrfs.
However, I have a similar problem which I am not able to solve.
I created a RAID6 pool (6x8TB); everything was OK until yesterday, when my pool suddenly went ro.
I don't know how to analyze the problem further; here is what I get from dmesg (here is a video of the full dmesg output; the BTRFS errors start from about 10s in).
I have not tried anything yet. I am not sure if I can find what caused this (a failed HDD, or is it a SW problem?).
(I have a backup, but I don't want to make a new pool, since the copying took a few days.)
Thanks for any help and suggestions.

You say you are using 8TB drives; are they SMR (Seagate Archive drives)? If so, then I suspect that is your problem.
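You can usually tell from the model string reported by the drive, e.g. (device name is just an example):

smartctl -i /dev/sdX             # the Device Model / Model Family lines identify Seagate Archive (SMR) drives
lsblk -d -o NAME,MODEL           # quick model overview for all drives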