Problems replacing/upgrading a disk

I currently have 12 disks in my array, set up in a BTRFS raid5 config: 4x8TB disks and 8x2TB disks. I wanted to upgrade 2 of my 2TB disks and I am having a nightmare of a time doing it.

I ran into pretty much exactly the difficulty described in this forum post: Can't remove failed drive from pool - Rebalance in progress. I tried the remove disk option in the GUI and got an error message.

I then tried mounting the array as degraded, pulled the drive that I wanted to replace, and attempted to remove the missing disk, but that didn't go over well: immediately I was given warnings of a degraded pool, a re-balance seemed to happen, and when it looked like it was done I attempted a btrfs delete missing /mnt2/Pool1 command and got an Input/Output error.
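
For reference, the kind of sequence involved was roughly the following (device name is just an example, not necessarily the exact one I used):

    # mount the pool writable with a member absent
    mount -o degraded,rw /dev/sdb /mnt2/Pool1

    # ask btrfs to drop the absent device from the pool
    btrfs device delete missing /mnt2/Pool1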

So, long story short, I put the original disk back in and rebooted. It was added back to the pool, some drive errors showed on the disk (not sure why), I performed a btrfs dev stats reset and the errors went away, but I am back to square one.
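
The stats reset was done with something along these lines (exact invocation from memory):

    # show per-device error counters for the pool
    btrfs device stats /mnt2/Pool1

    # print and then zero the counters
    btrfs device stats -z /mnt2/Pool1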

It would seem to me that I did something wrong here. What is the best method to upgrade the drives without risking the data?

Getting back to this, I am also noticing that when I reboot Rockstor, things seem to mount rw, but after a few moments it flips to ro and there is nothing I can do to stop this behavior.

After a few panic attacks I began to fear the worst and started to copy the data off to a new drive, but when in ro, the read speeds I am getting when copying data off this volume are very slow, bearing in mind that everything was working pretty normally before I tried to replace the drive.

I'm at a loss here; hopefully something can be done short of copying all the data to another drive and rebuilding everything.

@cferra Hello again and sorry I couldn’t chip in earlier.

This very much indicates a file system integrity problem. Btrfs will often go ro as a precautionary measure if it encounters issues.

I'm afraid I can't spend much time on my answer, as I'm actually working on the exact issue that threw you in the first place, i.e.:

which incidentally is only a cosmetic issue: if you had simply waited, the disk removal would most likely have completed. Apologies for not getting to this sooner, but we have a lot going on in Rockstor currently.

Removing a missing disk was added and tested recently in:

which had my current issue as a caveat. We have to break things up into manageable parts, but yes, it was a shame you were caught by this one. The above was released in stable channel 3.9.2-41.

Another problem here is your use of a parity raid, which is recommended against within the UI in the tooltip and in other places. It's definitely getting better, but we don't have the most recent improvements in our kernel / btrfs-progs. Hence the openSUSE move you commented on recently; they very much do keep their btrfs kernel parts and userland programs updated. And one of the shortcomings of the btrfs parity raid levels (5/6) is their repair capability. There are known issues there.

Anyway, given that you removed a disk, then mounted degraded,rw, then initiated a remove missing, then in turn interrupted that delete missing (first error, but salvageable), and then re-attached the prior member: it was this last step that was the worst move, and if you had not done that your pool may have been fine. You have, through both shortcomings in Rockstor's UI (I'm working on that now) and impatience on your part (understandable given the lack of UI feedback: hence the "…and no UI counterpart." in my current issue), been caught between things.

Essentially you should not have re-attached that disk and should have let the missing disk removal complete; and prior to that, the initial disk removal via the pool resize had that UI error message as a known issue and would also most likely have completed, as @Noggin's did in the forum thread you cited. So all in all this was rather poking btrfs's soft spots, and I would say your pool is not to be relied upon.

Sorry, and I feel for you, and I am working to improve this situation, but it is always far better to ask before making major changes, especially if you then compound them by interruption (first attached removal attempt) / additional intervention (physical disk removal while it was mid logical removal) / disk re-insertion (mid prior 2 events) etc.

Note also that clearing the drive errors report only clears the record of what has happened. To repair a pool one uses the scrub feature. But given your pool is 1) raid 5/6 and 2) has endured enough already :slight_smile: and is as a consequence going read only, I think it's backup-restore time, or get what you can off.
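
For future reference, on a healthy rw-mounted pool a scrub can be driven from the command line roughly as follows (using your /mnt2/Pool1 mount point):

    # read everything and repair from good copies where the redundancy allows
    btrfs scrub start /mnt2/Pool1

    # check progress and results
    btrfs scrub status /mnt2/Pool1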

Yes, a degraded pool is one where a device is missing; you had just removed a device. The "seemed to happen" balance is part of the problem we face here, as when one removes a disk there is an internal balance that does not show up in the usual 'btrfs balance status' command as one would expect. So I am having to develop an algorithm of sorts that looks for the negative unallocated figures within 'btrfs dev usage' output. Crazy really, but that is what is required and I'm in the process of developing it. So in short, removing a disk, prior attached or missing, causes an almost 'invisible' balance which can take hours.
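
If you want to watch one of these internal balances by hand, the sort of thing I have in mind is (Pool1 mount point assumed):

    # the usual balance status typically reports nothing during a device removal
    btrfs balance status /mnt2/Pool1

    # per-device allocation; a member being removed shows a negative Unallocated figure
    btrfs device usage /mnt2/Pool1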

I’m guessing that was due to an already active internal re-arrangement.

I would also recommend that you use raid1 or raid10 (raid1 preferred), as these are far more robust and much better at self repair, especially given our older kernel and btrfs-progs.
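
On a future healthy pool (not this one in its current state), a conversion away from the parity levels is a balance with convert filters, for example:

    # convert both data and metadata to raid1; this can take many hours
    btrfs balance start -dconvert=raid1 -mconvert=raid1 /mnt2/Pool1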

So in short, if you had simply left the machine alone after:

or asked on the forum, then you would probably still be OK. I know this is a real pain and seems like a silly error to be tripped up by, but the problem is actually quite complex, as the associated issues expand upon. But I have a far better behaved version of Rockstor working in house in this regard, and I have marked this forum thread as one to update when this very rough edge of Rockstor is fixed by way of released code.

Hope that helps re what may have gone wrong on all sides, and I will update this thread when I've sorted at least a little of our end of things. Apologies again for not getting there sooner, but these things are always more complicated than they at first appear. And do please ask if you encounter any further issues, as it is quite likely that others will have encountered either the same or something very similar, e.g. @Noggin's experience.

It was actually exactly what you did at the very beginning: Pool details page, Resize, remove the smaller disk. Wait (ignore the UI timeout related error). Once that has settled: details page, Resize, add the new disk. Or add then remove, if ports / bays allow. It's just a shame that I didn't have my current issue ready in time. But as we put more capabilities into Rockstor we also have more to maintain / keep workable, and we have had mention of late of slowdowns to the point of other timeouts, so I also have to do some performance work at the same time as fixing this last major UI fumble which caught you out so unfavourably.
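
Under the hood that UI flow amounts to something like the following (device names are examples only, and this assumes a healthy pool):

    # remove the old member; btrfs migrates its data off first, which can take hours
    btrfs device remove /dev/sdl /mnt2/Pool1

    # once that returns, add the replacement
    btrfs device add /dev/sdm /mnt2/Pool1

    # a follow-up balance spreads existing data onto the new member
    btrfs balance start /mnt2/Pool1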

So sorry again for not having this ready in time for your 'event', and thanks for helping to support Rockstor's development via a stable subscription. Good luck with the data; you may have use for the btrfs restore command.
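
btrfs restore works against an unmounted device and copies what it can find onto another filesystem, roughly like this (source device and target path are examples only):

    # pull files from the (unmounted) pool member onto a separate healthy disk
    btrfs restore -v /dev/sdb /mnt/recovery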

Well, from the sounds of that, it doesn't look good… I am currently able to get the volume mounted ro, I can see the data, and I can copy it to some spare drives; however, the volume seems to go down after about 30 minutes or so and read speeds are very slow. Using some btrfs wizardry, is there a way to make this more stable so that I can get the data off, or am I resigned to rebooting the server every 30 minutes?

Forgive my noobishness. I'm unfortunately in uncharted waters here.

dmesg, for what it's worth, is throwing me:
parent transid verify failed
and an alarming:
btrfs_run_delayed_refs:2971: error=-5 IO failure

Currently showing on /dev/sda, but I've seen transid warnings on /dev/sdb, /dev/sdc etc…

@cferra

I missed this topic when it came up, but found it in your post history after @bar1’s post.

Just thought I'd weigh in here that, as per the other topic, mounting in recovery mode may indeed help you.
In particular, mounting in recovery mode skips the log replay, so your transid mismatches should not error out and drop your mountpoint.
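
As a sketch, on Rockstor's current kernel that would be something like the following (device name is an example; newer kernels rename 'recovery' to 'usebackuproot', and 'nologreplay' is the option that specifically skips log replay):

    # mount read-only, falling back to backup tree roots if the current ones fail
    mount -o ro,recovery /dev/sda /mnt2/Pool1

    # newer kernel equivalent, with log replay explicitly skipped
    mount -o ro,usebackuproot,nologreplay /dev/sda /mnt2/Pool1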

I tried to get it to go in recovery mode but was never able to. Ultimately, after many, many reboots, I was able to get 99.9 percent of the data off, but it was an exercise in frustration.