As one of the drives in my Raid1 arrays has started showing some errors (most likely age-related), I need to replace it and would like to take the opportunity to expand my storage capacity as well.
Because I see multiple options for how to proceed, I wanted to ask for the community’s feedback based on people’s experience with such a procedure. I have read through several relevant posts in the forum and the Github repositories, but given these span a rather wide period of time, their comments and recommendations may no longer be accurate due to recent improvements in Rockstor and Btrfs itself.
I thus thought I would lay out the options in front of me and see what the consensus would be on each of them. By providing as much information and as many resources as possible, I’m hoping this can benefit other users as well.
In this spirit, thanks in advance to any experienced user who corrects any inaccuracy I may have written.
Aims and requirements
My current data pool consists of:
- Drive A: 3 TB HDD
- Drive B: 3 TB HDD
Drives A and B are combined in a single Raid1 pool. Note also that the pool is rather full, so I have about 2.7 TB of data to deal with.
Unfortunately, Drive A needs to be replaced. In the end, I would thus like to have the following pool (still in Raid1):
- Drive B: 3 TB HDD
- Drive C: 8 TB HDD
- Drive D: 8 TB HDD
I would also like to try doing everything from the Rockstor webUI and avoid the command line, as an exercise and test of Rockstor. Overall time to complete the move is paramount, however, so if a CLI approach has a substantial advantage over a webUI-only approach, I’ll pick that.
Finally, this would be conducted once the openSUSE rebase has been completed, meaning it would be running the kernel and Btrfs versions of Leap 15.2.
Options
Quotas would be disabled prior to any operation.
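For reference, the CLI equivalent of that preliminary step would be a one-liner (assuming the pool is mounted at /mnt2/pool_name, as in the commands further below):

```
# Disable quotas on the pool; this avoids quota accounting overhead
# during the upcoming balance/replace operations.
btrfs quota disable /mnt2/pool_name
```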
Because I need to both add and replace disks, there are several strategies combining replacement and addition of a drive. Notably, as I currently only have one SATA port free on my motherboard, I see the following options available to me:
- Option A:
  - Remove Drive A from the pool
  - Add Drive C and Drive D to the pool in one go
- Option B:
  - Add Drive C to the pool
  - Remove Drive A from the pool
  - Add Drive D to the pool
- Option C:
  - Replace Drive A with Drive C in the pool
  - Add Drive D to the pool
Comments
Option A
While this option seems straightforward, removing Drive A would leave the pool with a single device at the end of the first step, which implies a conversion from Raid1 to single; that conversion is time-consuming and demanding in terms of IO as well (if I’m correct, at least). Furthermore, it would also require me to convert back from single to Raid1 at the end of step 2, thereby adding even more time and IO wear on the drives. I would thus consider option A the least favorable.
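For reference, a minimal sketch of what I believe these two conversions would look like on the command line, assuming the pool is mounted at /mnt2/pool_name and using hypothetical device nodes (sdx = Drive A, sdy = Drive C, sdz = Drive D):

```
# Step 1: convert to single so the pool can run on one device,
# then remove Drive A; reducing metadata redundancy may need -f.
btrfs balance start -f -dconvert=single -mconvert=single /mnt2/pool_name
btrfs device delete /dev/sdx /mnt2/pool_name

# Step 2: add the two new drives and convert back to Raid1, which
# rewrites and re-duplicates all data and metadata a second time.
btrfs device add /dev/sdy /dev/sdz /mnt2/pool_name
btrfs balance start -dconvert=raid1 -mconvert=raid1 /mnt2/pool_name
```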
Option B
If my understanding is correct, this procedure would trigger a balance at the end of each step listed, resulting in a total of 3 balances. A big advantage is that no Raid level conversion would be required.
In detail, the procedure would be (a rough CLI equivalent is sketched after the list):
- Turn off the machine
- Plug in Drive C
- Turn on Rockstor and use the “Resize/ReRaid” feature to add Drive C to the Raid1 pool
- Monitor progress of the triggered Balance procedure using the “Balance” tab.
- Once the balance has completed, use the “Resize/ReRaid” feature to remove Drive A from the pool.
- Monitor progress of the triggered Balance procedure using the “Balance” tab.
- Once the balance has completed, turn OFF the machine, unplug Drive A, plug in Drive D, turn ON Rockstor, and then use the “Resize/ReRaid” feature to add Drive D to the pool.
- Monitor progress of the triggered Balance procedure using the “Balance” tab.
- Once the balance has completed, use Rockstor as usual.
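For comparison, here is a minimal sketch of what I believe the CLI equivalent of these webUI steps would be, again assuming the pool is mounted at /mnt2/pool_name and the same hypothetical device nodes (sdx = Drive A, sdy = Drive C, sdz = Drive D):

```
# Step 1: add Drive C, then spread the existing chunks over all devices.
btrfs device add /dev/sdy /mnt2/pool_name
btrfs balance start /mnt2/pool_name
btrfs balance status /mnt2/pool_name   # CLI counterpart of the "Balance" tab

# Step 2: remove Drive A; the delete itself relocates its chunks onto
# the remaining devices, which is the second "balance" in the count above.
btrfs device delete /dev/sdx /mnt2/pool_name

# Step 3: after the physical swap, add Drive D and rebalance once more.
btrfs device add /dev/sdz /mnt2/pool_name
btrfs balance start /mnt2/pool_name
```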
Option C
Similar to option B, option C would not require a Raid level conversion. Furthermore, if my understanding is correct, option C would only imply two balances, an advantage over option B. Nevertheless, it would not be possible to do it entirely through the webUI, as disk replacement is not yet implemented there (see the related Github tracking issue).
Although I still need to test it in a VM, the procedure would be similar to:
- Turn off the machine
- Plug in Drive C
- Turn on Rockstor
- Open an SSH session and start the replacement with (see the note after this list for how to look up the device IDs):
btrfs replace start -r <Btrfs-device-id-of-DriveA> /dev/sd<letter-of-DriveC> /mnt2/pool_name
- Monitor status with:
btrfs replace status /mnt2/pool_name
- Once completed, resize the filesystem to take advantage of the bigger drive:
btrfs fi resize <Btrfs-device-id-of-DriveC>:max /mnt2/pool_name
Note: here, I’m not sure how disks and pools would look in the Rockstor webUI… I still need to test that one.
- Once completed, turn OFF the machine, unplug Drive A, plug in Drive D, turn Rockstor ON, and use the “Resize/ReRaid” feature to add Drive D to the pool.
- Monitor progress of the triggered Balance procedure using the “Balance” tab.
- Once the balance has completed, use Rockstor as usual.
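As a side note on the commands above, the Btrfs device ID placeholders can be looked up beforehand; a minimal sketch, again assuming the pool is mounted at /mnt2/pool_name:

```
# List the pool members together with their Btrfs device IDs ("devid").
btrfs filesystem show /mnt2/pool_name

# Alternative view with per-device allocation details (also lists the devid).
btrfs device usage /mnt2/pool_name
```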
Overall
Between options B and C, it seems to me that the biggest difference lies in how efficient the add_C+remove_A procedure is compared to replace_A_with_C. As I haven’t tested it yet, that’s something I still wonder about.
On one hand, the Github issue linked above reads:
N.B. it is generally considered to be a longer process to use replace rather than:
“btrfs dev add” and then “btrfs dev delete”, it might make sense to suggest this course of action in the same UI.
On the other hand, and about a year more recent, the following forum post reads:
I think that the general opinion is that a ‘btrfs replace’ is the more preferred, read efficient, method to btrfs dev add, btrfs dev delete, or the other way around.
This is further supported by the Btrfs wiki, which reads:
https://btrfs.wiki.kernel.org/index.php/Using_Btrfs_with_Multiple_Devices
Replacing failed devices
Using btrfs replace
When you have a device that’s in the process of failing or has failed in a RAID array you should use the btrfs replace command rather than adding a new device and removing the failed one. This is a newer technique that worked for me when adding and deleting devices didn’t; however, it may be helpful to consult the mailing list or irc channel before attempting recovery.
Interpretations
Based on the information above, the btrfs replace route (option C) now seems to be the preferred method from an efficiency perspective, but I’m unsure of how Rockstor would “deal” with it. Due to recent improvements in drive removal and pool attribution, however, I believe I should be able to remove any physically removed disk that is detected as detached without too much of a problem.
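For completeness, if anything looks off in the webUI after the physical removal, a quick way to cross-check the pool membership from the CLI (assuming /mnt2/pool_name as above):

```
# Shows the pool members and warns if any device is missing.
btrfs filesystem show /mnt2/pool_name

# Overall allocation per device, useful to confirm the pool is healthy.
btrfs filesystem usage /mnt2/pool_name
```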
If the gain in efficiency over option B is not that substantial, though, it may not be worth the additional “hassle”.
As mentioned, I still plan on testing options B and C in a VM and comparing the overall time needed to complete each, for instance, but I would appreciate it if others could share any recent experience with similar operations.
In advance, thanks!