Advice needed on All-Drives replacement

Hi Rockstor aficionados,
I am planning to upgrade the drives in my current Rockstor appliance to higher-capacity HDDs. Since the chassis can only hold 4 disks, I intend to replace all of them, and I would like to ask the community for advice on how best to go about it. I am running RAID5/6 (yeah, I know, I know, but I took the risk a few years ago and, knock on wood, have not had any catastrophic failure).

My current box looks like this:
Rockstor Version - 3.9.2-48
Linux: 5.4.1-1.el7.elrepo.x86_64 (BTRFS tools @ 5.4)
Disks: 4 x 4TB RAID5
Current Fill Level:

[image: current fill level]

I want to move to 4 x 10TB with the same RAID setup.
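
For a rough sense of what the move buys, the usable-capacity arithmetic for equal-sized disks can be sketched as below. This is a simplified model, not what btrfs itself reports: it ignores metadata overhead and the TB vs TiB difference, and the function name `usable_tb` is just an illustrative helper.

```python
# Rough usable-capacity model for pools built from equal-sized disks.
# Assumption: ignores metadata overhead and TB/TiB rounding.
def usable_tb(disk_count, disk_tb, profile):
    if profile == "raid5":
        # One disk's worth of space goes to parity.
        return (disk_count - 1) * disk_tb
    if profile == "raid6":
        # Two disks' worth of space go to parity.
        return (disk_count - 2) * disk_tb
    if profile in ("raid1", "raid10"):
        # Every block is stored twice.
        return disk_count * disk_tb / 2
    raise ValueError(f"unsupported profile: {profile}")


for label, count, size in (("current 4 x 4TB", 4, 4), ("planned 4 x 10TB", 4, 10)):
    print(f"{label}: about {usable_tb(count, size, 'raid5')} TB usable as raid5")
```

The same helper also gives a quick comparison against raid1/raid10 on the new disks, should a profile change come up later.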

Any words of wisdom on the procedure to get there? I don’t really have the option to pull all data off first and then just wipe and go.

As I’ve read in a few other threads about some issues during single disk replacements, I wanted to ensure I don’t fall into the same rabbit holes.

Thanks in advance for your help.

@Hooverdan Hello again. Just a caution re:

Have you ensured that Rockstor is referencing your newer btrfs tools to match your newer kernel? Currently Rockstor is hard-wired to use the btrfs at:

Also note that this is a pretty bleeding edge version of kernel and btrfs tools: latest Tumbleweed (20191202) is running:

tumbleweed:~ # /usr/sbin/btrfs --version
btrfs-progs v5.3.1 
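
A quick way to compare the two version strings is sketched below. It only compares major.minor, which is a rule of thumb rather than a hard requirement, and the helper names `version_tuple` and `progs_lag_kernel` are made up for this example.

```python
import re
from typing import Tuple


def version_tuple(text: str) -> Tuple[int, ...]:
    """Extract the first dotted number run, e.g. 'v5.3.1' -> (5, 3, 1)."""
    match = re.search(r"\d+(?:\.\d+)+", text)
    if match is None:
        raise ValueError(f"no version found in {text!r}")
    return tuple(int(part) for part in match.group().split("."))


def progs_lag_kernel(progs_output: str, kernel_release: str) -> bool:
    """True when the btrfs-progs major.minor is older than the kernel's."""
    return version_tuple(progs_output)[:2] < version_tuple(kernel_release)[:2]


# Strings taken from this thread: the box's own tools and kernel.
print(progs_lag_kernel("btrfs-progs v5.4", "5.4.1-1.el7.elrepo.x86_64"))
```

On a live system the two inputs would come from `btrfs --version` and `uname -r`.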

Another consideration is you are missing some fairly significant improvements in your Rockstor version:
see the following:

Specifically:

  • pool resize disk removal unknown internal error and no UI counterpart. Fixes #1722 @phillxnet

so you should consider using at least 3.9.2-50. Plus I would disable quotas before you do this fairly massive shuffle.

There is also another consideration re the format of the scrub output that has changed in more recent versions of btrfs such as you are hopefully now using. @kupan787 has a thread on this here:

So just chipping in ducks in a row stuff really. Let me know if your system isn’t offering you 3.9.2-50 (assuming this is an rpm based install) and I can look into it.

On the wires side, do you have a spare SATA port so you can at least add one of the larger drives before you remove one of the smaller ones, just to get more working space? You mention 4 max but I didn’t know if that was just drive caddies, as you can always jury-rig another drive somehow during the migration: but only if you have a port. And don’t use a USB-to-SATA adaptor for this if at all possible. The USB bus is just not as stable, which is particularly problematic in multi-disk btrfs volumes (pools), most notably when adding / removing disks, and in turn also most sensitive with the parity btrfs raid levels of 5/6.

Hope that helps.

Thanks for your details on things I should consider.

Yes, I think I got them to reference the new tools. I actually changed the pointers to the tools, as for some reason I can only get them to install under /usr/local/bin. (I worked through the other thread on kernel updates, and finally realized that while I had the new tools, they were never installing in the right location, despite using things like --prefix during the configure step.)
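
When more than one copy of the tools may be installed, it can help to list every `btrfs` executable found in the usual directories. This is a minimal sketch; the directory list is an assumption about typical locations and not a statement of where Rockstor looks.

```python
import os
from typing import Iterable, List


def find_executables(name: str, directories: Iterable[str]) -> List[str]:
    """Return every executable file called `name` in the given directories."""
    found = []
    for directory in directories:
        candidate = os.path.join(directory, name)
        if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
            found.append(candidate)
    return found


# Assumed candidate locations; adjust to the system at hand.
for path in find_executables("btrfs", ["/usr/sbin", "/sbin", "/usr/bin", "/usr/local/bin"]):
    print(path)
```

Running `--version` against each path printed then shows which copy is the newer one.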

Unfortunately, it seems I am suffering from the same issue as posted in this thread. The latest version is not offered to me. It is an RPM install (i.e. a loooong time ago I used the ISO to install). I checked in appman and my subscription is still active and valid, so I am not sure what’s going on there. I will PM you my appliance ID, and hopefully you can figure it out:

I am currently not using quotas, so I don’t think I have to specifically turn them off.

Thanks for referencing the btrfs-progs thread, I remember reading about that a while ago, but forgot again that this might be an issue.

I think you are correct that I still have a SATA port open (just not a caddy to permanently hold a drive). So, you’re suggesting to basically add one of the larger drives as a fifth one to extend the pool and then start with replacing the others?

@Hooverdan Re:

Super, we can continue this one in the PM once started. Thanks.

But Rockstor defaults to turning them on (btrfs quotas that is), so if they show as on in the Web-UI just turn them off there: Pool page.

I think adding a new drive first ensures you will have plenty of room. Running out during a balance is something you don’t want to happen, especially with the parity raid levels of 5 & 6, as you are then into repair territory and that is weaker in the parity raid levels.
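
To put rough numbers on the headroom argument, here is a simplified estimate of raid5 usable space with mixed disk sizes. It uses a greedy model (stripe across every disk that still has free space, one member's worth per stripe going to parity), ignores metadata overhead, and assumes about 8 TB of data purely for illustration.

```python
from typing import Iterable


def raid5_usable_tb(disk_sizes_tb: Iterable[float]) -> float:
    """Greedy estimate of raid5 capacity for a mix of disk sizes."""
    free = sorted(size for size in disk_sizes_tb if size > 0)
    usable = 0.0
    while len(free) >= 2:  # raid5 needs at least two devices per stripe
        step = free[0]  # the smallest disk fills up first
        usable += step * (len(free) - 1)  # one member per stripe is parity
        free = [f - step for f in free if f - step > 0]
    return usable


data_tb = 8  # assumed amount of data, for illustration only
scenarios = {
    "start, 4 x 4TB": [4, 4, 4, 4],
    "remove a 4TB first": [4, 4, 4],
    "add a 10TB first": [4, 4, 4, 4, 10],
    "then remove a 4TB": [4, 4, 4, 10],
}
for label, disks in scenarios.items():
    usable = raid5_usable_tb(disks)
    print(f"{label}: {usable:.0f} TB usable, {usable - data_tb:.0f} TB headroom")
```

Under these assumptions, removing a small disk first leaves no headroom at all, while adding the large disk first keeps the pool at or above its original capacity throughout.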

I would recommend against the replace as I believe it’s younger code. It’s supposed to be quicker, but given it’s younger and (I think) less well tested, I would go for disk add, let everything settle, then disk remove, and let everything settle again. And the settle in this case could take a very long time. I’m also not sure if the balance formatting is going to mess things up here. But I’m pretty sure that, if using 3.9.2-50, the pool percent usage per drive member should help you see when it’s done. That and drive activity. When adding / removing drives there is what I’ve referred to in Rockstor code as an ‘internal’ balance, and these don’t show up like normal balances anyway. The Web-UI should account for this, however, and indicate its suspicion of these balances being ‘internal’. I’m hoping this caveat is done away with in time upstream, but when I last checked it was still invisible but very much active!
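
The add / settle / remove / settle rhythm can be written down as a dry-run plan. The sketch below only builds the underlying `btrfs device add` and `btrfs device remove` command lines and prints them; it does not run anything. The device names and the mount point are hypothetical placeholders, and in Rockstor itself the equivalent steps would normally be done through the Web-UI pool resize.

```python
from typing import List


def plan_add_then_remove(old_devs: List[str], new_devs: List[str],
                         mount_point: str) -> List[List[str]]:
    """Pair each new disk with an old one: add the new, then remove the old."""
    if len(old_devs) != len(new_devs):
        raise ValueError("need exactly one new disk per old disk")
    plan = []
    for old_dev, new_dev in zip(old_devs, new_devs):
        plan.append(["btrfs", "device", "add", new_dev, mount_point])
        plan.append(["btrfs", "device", "remove", old_dev, mount_point])
    return plan


# Hypothetical names: the real device nodes and pool mount point will differ.
old = ["/dev/sda", "/dev/sdb", "/dev/sdc", "/dev/sdd"]
new = ["/dev/sde", "/dev/sdf", "/dev/sdg", "/dev/sdh"]
for step, command in enumerate(plan_add_then_remove(old, new, "/mnt2/example_pool"), 1):
    # Let the pool settle completely before moving on to the next step.
    print(f"step {step}: {' '.join(command)}")
```

With only one spare port, each new disk reuses the slot freed by the previous removal, so the device names would shift between steps.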

But let’s get your update to 3.9.2-50 sorted first.

Hope that helps.

Alright - thanks @phillxnet. The update “delay” is now sorted, and in my case required me to update the Appliance ID using https://appman.rockstor.com/

At least I am now on 3.9.2-50. Phew!

As for this:

That makes sense. The alternative (in this case) could also be to add the large-capacity disk as a separate backup disk, move all the data from the pool onto it (since it’s still under 10TB), and then do radical surgery:

  • pull all 4 old disks, install the other 3 new ones, set up the RAID, etc.

  • copy data back to the new (3-disk) pool/share

  • wipe the backup disk I originally added (low level format I assume) and add it to the RAID/pool, then have it perform the re-balance.

Seems safer to me, and possibly faster?

@Hooverdan You’re welcome, and glad you’re now sorted re the update.

Yes. But if I read this right you will be adding the new disk as a single-device, single-raid btrfs pool. Then if, as you explain, the 4 smaller disks are removed and put aside (data still intact), and you slot in the 3 new ones and create a fresh pool ready for re-population, your data will still be on the old pool (put aside and disconnected) and on the single, no-redundancy ‘transfer’ pool. I like your idea better.

Agreed and yes much faster.

Re:

I’d delete that temporary transition pool from within the Web-UI as you can then use the wipe disk facility within the Web-UI.

Do not wipe the smaller disks’ pool as that will delete their contents (mostly). This may well leave Rockstor hanging on for them as detached disks, but that’s mostly cosmetic. When you delete a pool its subvols/data are removed, and we don’t want that in this case as they are the backup during this process. A warning is given to this effect.

One major caveat to your approach with regard to Rockstor is that you must use a different name for all your new pools (to be safe). And before you create a new pool, make sure Rockstor has no prior knowledge of a pool by that name (i.e. showing as having detached drives, for instance). We have a current weakness with managing pools by their name; this must be moved to managing by their btrfs uuid. Work has begun on this and you will see the uuid now displayed in your Pool details page, but that is not yet canonical; the pool name currently is. So Rockstor can get into some tricky spots if you have different pools, prior or current, that share the same name. So all new names is best.
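
One way to stay clear of name clashes is to write down every pool name Rockstor has ever shown (current or detached) and check new names against that list. The helper below is only an illustration of that bookkeeping: `unused_pool_name` is a made-up function, the names are invented, and comparing case-insensitively is a deliberately cautious assumption rather than documented Rockstor behaviour.

```python
from typing import Iterable


def unused_pool_name(base: str, known_names: Iterable[str]) -> str:
    """Return `base`, or `base_2`, `base_3`, ... if the name was already used."""
    taken = {name.lower() for name in known_names}
    if base.lower() not in taken:
        return base
    suffix = 2
    while f"{base}_{suffix}".lower() in taken:
        suffix += 1
    return f"{base}_{suffix}"


# Invented example names for pools Rockstor already knows about.
known = ["main_pool", "transfer"]
print(unused_pool_name("main_pool", known))
```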

Maybe this is a good time for you to move over to a raid1 / raid10 profile :slight_smile: It’s significantly quicker and far better on the self repair front.

Hope that helps.

Great information. Thanks. I will mull this over. Fortunately, I do have time, as this will be self-inflicted pain of my choosing as opposed to having a failed disk/array to replace. And maybe I will go back to another RAID level.

The naming piece is definitely an important tidbit; I would likely have screwed myself if I didn’t know that now.

Let’s see whether I get additional recommendations and I will eventually post an update of either a glorious or a disastrous outcome :slight_smile:

An update: over a period of a couple of weeks I have successfully migrated my original 4x4TB to a 4x10TB (well, technically 9.4TB each). I was too lazy to go through all of the renaming, so I figured I’d go with remove and add for each of the drives … well, I think a backup with a straight-up replacement would have been way faster, even though I would have had to maneuver through the obstacles that @phillxnet had laid out.

I started with adding a new 10TB drive to ensure I would not run out of space, and then subsequently did the remove/add/remove/add, etc.
Each remove and add action took a little over 24 hours :wink: and we’re talking about 8 TB total data across the RAID …
I almost screwed up by removing a drive that was still being “removed” in … fortunately, this test showed me that even a RAID 5 on BTRFS can recover from such a dumbass move. As far as I know I have not lost any data (a scrub told me so, and I did a rough comparison to my backup), but who knows…
I also moved over to a RAID10 setup now as part of that. I have not had any issues with the RAID5 since I started using RockStor, but I’ll try the RAID10 for a little bit to see how that goes.
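
For anyone wanting something a bit firmer than a rough comparison against a backup, a checksum-based tree comparison is sketched below. It is a generic approach and not what was used in this migration; it reads every file in full, so on several TB it will take a long time.

```python
import hashlib
import os
from typing import Dict, List


def tree_digests(root: str) -> Dict[str, str]:
    """Map each regular file's path (relative to root) to its SHA-256 digest."""
    digests = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for filename in filenames:
            full = os.path.join(dirpath, filename)
            if os.path.islink(full) or not os.path.isfile(full):
                continue  # skip symlinks and special files
            sha = hashlib.sha256()
            with open(full, "rb") as handle:
                for chunk in iter(lambda: handle.read(1 << 20), b""):
                    sha.update(chunk)
            digests[os.path.relpath(full, root)] = sha.hexdigest()
    return digests


def compare_trees(original: str, copy: str) -> Dict[str, List[str]]:
    """Report files missing from, extra in, or different in the copy."""
    a, b = tree_digests(original), tree_digests(copy)
    return {
        "missing": sorted(set(a) - set(b)),
        "extra": sorted(set(b) - set(a)),
        "changed": sorted(p for p in set(a) & set(b) if a[p] != b[p]),
    }
```

Calling `compare_trees` with the share's mount point and the backup's mount point returns three lists; all three empty means every file matched by content.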

So, thanks everybody for the tips. I still made it the most drawn-out process, but it’s done for now. Now what to do with the 4x4TB drives … maybe build another NAS/off-site backup …
