Cannot remove "detached" disks

Azzazel · December 8, 2019, 5:08pm

[Please complete the below template with details of the problem reported on your Web-UI. Be as detailed as possible. Community members, including developers, shall try and help. Thanks for your time in reporting this issue! We recommend purchasing commercial support for expedited support directly from the developers.]

Brief description of the problem

I had to restore rockstor running as a VM on ESXi from backup via Veeam B&R. Process completed successfully and all data is available, but in “Disks” I now see old disks (before restore) as “detached” along with new disks. When I try to remove them via “resize/reraid”, i get an error.

Detailed step by step instructions to reproduce the problem

Backup Rocktor VM via Veeam B&R
Delete Rockstor VM
Restore Rockstor VM via Veeam B&R
Go to Pools and try to delete “detached” disk

Web-UI screenshot

Error Traceback provided on the Web-UI


            Traceback (most recent call last):
  File "/opt/rockstor/eggs/gunicorn-19.7.1-py2.7.egg/gunicorn/workers/sync.py", line 68, in run_for_one
    self.accept(listener)
  File "/opt/rockstor/eggs/gunicorn-19.7.1-py2.7.egg/gunicorn/workers/sync.py", line 27, in accept
    client, addr = listener.accept()
  File "/usr/lib64/python2.7/socket.py", line 202, in accept
    sock, addr = self._sock.accept()
error: [Errno 11] Resource temporarily unavailable

Azzazel · December 8, 2019, 5:23pm

Additional info:

[root@astr-rst-01 ~]# yum info rockstor
Loaded plugins: changelog, fastestmirror
Loading mirror speeds from cached hostfile

base: dedic.sh

epel: fedora-mirror02. rbc. ru

extras: centos-mirror. rbc. ru

updates: dedic.sh
Installed Packages
Name : rockstor
Arch : x86_64
Version : 3.9.2
Release : 50
Size : 85 M
Repo : installed
From repo : Rockstor-Stable
Summary : Btrfs Network Attached Storage (NAS) Appliance.
URL : http ://rockstor. com/
License : GPL
Description : Software raid, snapshot capable NAS solution with built-in file integrity protection.
: Allows for file sharing between network attached devices.

[root@astr-rst-01 ~]# yum list installed | grep kernel
kernel-ml.x86_64 4.10.6-1.el7.elrepo @anaconda/3
kernel-ml.x86_64 4.12.4-1.el7.elrepo @rockstor

[root@astr-rst-01 ~]# uname -a
Linux astr-rst-01 4.12.4-1.el7.elrepo.x86_64 #1 SMP Thu Jul 27 20:03:28 EDT 2017 x86_64 x86_64 x86_64 GNU/Linux

phillxnet · December 8, 2019, 9:37pm

@Azzazel Welcome to the Rockstor community.

Thanks for this report.

From the message displayed:

“Disks cannot be removed from a pool with this raid (raid0) configuration”

It would seem, as these detached disks are still considered as part of the pool, and that the pool is raid0, which can have zero disks removed, we may be blocking on that basis prematurely. I’m due to look at this area of the code again soonish so will see if I can work out what’s going wrong.

But on a related note, from a quick look, the detached disks are no longer found as their appears to be no matching serial number. Rockstor tracks drives by their serial number. Does the backup system you used some how change device serial numbers. In which case it is not going to be suitable for Rockstor use, without caveats. Did you for instance have to re-import this pool post the restore. I’m just trying to fathom what’s happened here. Rockstor tracks devices it is asked to manage (pools created within the Web-UI and imports there in also) by their serial number so all the missing drives I’m assuming are the prior instances of what we see the newly attached disks as. But if as I suspect their serial numbers have changed they are to all intents and purposes, from Rockstor’s perspective, new disks. Hence the confusion.

This is an interesting one and I will try and fathom whats happened but it you could comment on the devices serial number stability through this ‘backup restore’ process that would help.

And if this backup restore of devices is configurable to maintain serial numbers through this process then that would probably avoid all these ghost devices appearing in the first place.

The following technical Wiki entry is our stance on device serial numbers and why:

Thanks for the details report and I look forward to hearing confirmation of if this VM based backup restore systems does in fact change drive serial numbers. If so then if their is an option to persist these that would be a pre-requisit from Rockstor’s perspective.

Hope that helps and thanks again for the report. Haven’t seen that one before.

Azzazel · December 8, 2019, 11:40pm

Hi, @phillxnet!
The backup system has no option to manipulate disk’s serial numbers, or at least I was not able to find any reference to such ability in the manuals or forum posts - what it does is preserve virtual SCSI bus address. I performed a full restore (deleted the VM and restored it from a backup) and the VM just booted without requiring me to re-import the pool.

phillxnet · December 9, 2019, 12:59pm

@Azzazel Thanks for this this info. So does look like it changes the serial numbers then.

If your up for a bit of experimentation you could try the same experiment, starting from a similar install but with the pool in btrfs raid1 with no detached disks. This may help to see if the raid0 warning in this case is a red herring. Btrfs raid1 does allow for disk removal from a pool, within limits, so in your case we side step this potentially bogus error message. If you can then remove the currently unavoidable with this backup/restore method ‘detached’ disks that result post restore we will have more information to work with.

Incidentally if you take the output of the following command pre and post restore method we may be able to see more what is happening with regard to drive info:

lsblk -P -p -o NAME,MODEL,SERIAL,SIZE,TRAN,VENDOR,HCTL,TYPE,FSTYPE,LABEL,UUID

Which is the first pass of device info retrieval:

github.com

rockstor/rockstor-core/blob/master/src/rockstor/system/osi.py#L124-L139


def scan_disks(min_size, test_mode=False):
    """
    Using lsblk we scan all attached disks and categorize them according to
    if they are partitioned, their file system, if the drive hosts our / mount
    point etc. The result of this scan is used by:-
    view/disk.py _update_disk_state
    for further analysis / categorization.
    N.B. if a device (partition or whole dev) hosts swap or is of no interest
    then it is ignored.
    :param min_size: Discount all devices below this size in KB
    :param test_mode: Used by unit tests for deterministic 'fake-serial-' mode.
    :return: List containing drives of interest
    """
    base_root_disk = root_disk()  # /dev/sda if /dev/sda3, or md126 if md126p2
    cmd = [LSBLK, '-P', '-p', '-o',
           'NAME,MODEL,SERIAL,SIZE,TRAN,VENDOR,HCTL,TYPE,FSTYPE,LABEL,UUID']

Although I suspect these drives end up not presenting their serial via lsblk and in which case via:

github.com

rockstor/rockstor-core/blob/master/src/rockstor/system/osi.py#L426-L430


# First try harder with udev if lsblk failed on serial retrieval.
if (dmap['SERIAL'] == '' or always_use_udev_serial):
    # lsblk fails to retrieve SERIAL from VirtIO drives and some
    # sdcard devices and md devices so try specialized function.
    dmap['SERIAL'] = get_disk_serial(dmap['NAME'], dmap['TYPE'])

we retrieve it from via the following udev command:

udevadm info --name=/dev/sda

within the linked code:

github.com

rockstor/rockstor-core/blob/master/src/rockstor/system/osi.py#L1052-L1066


def get_disk_serial(device_name, device_type=None, test=None):
    """Returns the serial number of device_name using udevadm to match that
    returned by lsblk. N.B. udevadm has been observed to return the following:-
    ID_SCSI_SERIAL  rarely seen
    ID_SERIAL_SHORT  often seen
    ID_SERIAL        thought to always be seen (see note below)
    N.B. if used in this order the serial is most likely to resemble that shown
    on the device label as well as that returned by the lsblk. ID_SERIAL seems
    always to appear but is sometimes accompanied by one or both of the others.
    When ID_SERIAL is accompanied by ID_SERIAL_SHORT the short variant is
    closer to lsblk and physical label. When they are both present the
    ID_SERIAL appears to be a combination of the model and the ID_SERIAL_SHORT
    --------- Additional personality added for md devices ie md0p1 or md126,
    these devices have no serial so we search for their MD_UUID and use that
    instead.

So you could use this latter command on a single device pre and post its restore. Just to confirm what I suspect that it’s serial changes over that period.

Thanks again for the additional info re that VM orientated backup/restore method. I suspect that if it does in deed not persist device serials then it is essentially not compatible with how Rockstor works. But their may, within limits, be work-a-rounds. Rockstor sees one of it’s prime directives as ‘tracking’ devices irrespective of their content, to account for corruption or catastrophic failure there in, and the serials are the only way we can do that. If they get changed under it, it’s kind of all-at-sea and has to re-jig as best as it can work out.

Hope that helps.

Azzazel · December 10, 2019, 12:38am

Hi, @phillxnet!
I love experiments
So I spun up a new rockstor VM and created a raid1 from two 200G virtual drives, then backed it up with Veeam, deleted from vSphere and restored from a backup.
One thing of note is that this test VM has and reports as the latest rockstor version 3.9.1-16, probably due to the testing updates channel, while my main rockstor VM has version 3.9.2-50 with stable updates channel enabled.

After restore the detached drives were there, along with the new ones but this time I could successfully remove them.

Now, to the requested lsblk outputs.
Before restore:

[root@rockstor-test ~]# lsblk -P -p -o NAME,MODEL,SERIAL,SIZE,TRAN,VENDOR,HCTL,TYPE,FSTYPE,LABEL,UUID
NAME=“/dev/sdb” MODEL=“Virtual disk " SERIAL=“6000c29a33240a77529fccf8ce00ff19” SIZE=“200G” TRAN=”" VENDOR=“VMware " HCTL=“3:0:1:0” TYPE=“disk” FSTYPE=“btrfs” LABEL=“test_raid1” UUID=“7123d669-be9c-4ca0-bf5c-c3a18622836d”
NAME=”/dev/sr0" MODEL=“VMware SATA CD00” SERIAL=“00000000000000000001” SIZE=“791M” TRAN=“sata” VENDOR=“NECVMWar” HCTL=“2:0:0:0” TYPE=“rom” FSTYPE=“iso9660” LABEL=“Rockstor 3 x86_64” UUID=“2017-07-02-03-11-01-00”
NAME=“/dev/sdc” MODEL=“Virtual disk " SERIAL=“6000c2964bd0af6aa755d22714f7e770” SIZE=“200G” TRAN=”" VENDOR=“VMware " HCTL=“3:0:2:0” TYPE=“disk” FSTYPE=“btrfs” LABEL=“test_raid1” UUID=“7123d669-be9c-4ca0-bf5c-c3a18622836d”
NAME=”/dev/sda" MODEL=“Virtual disk " SERIAL=“6000c2994c673296d1cc4f7a9a87e24d” SIZE=“16G” TRAN=”" VENDOR=“VMware " HCTL=“3:0:0:0” TYPE=“disk” FSTYPE=”" LABEL=“” UUID=“”
NAME=“/dev/sda2” MODEL=“” SERIAL=“” SIZE=“1.6G” TRAN=“” VENDOR=“” HCTL=“” TYPE=“part” FSTYPE=“swap” LABEL=“” UUID=“2c7814b9-a807-45f4-b6db-ce9eeab30180”
NAME=“/dev/sda3” MODEL=“” SERIAL=“” SIZE=“13.9G” TRAN=“” VENDOR=“” HCTL=“” TYPE=“part” FSTYPE=“btrfs” LABEL=“rockstor_rockstor” UUID=“b2f9da06-7a2e-461f-abc2-755f020b4f31”
NAME=“/dev/sda1” MODEL=“” SERIAL=“” SIZE=“500M” TRAN=“” VENDOR=“” HCTL=“” TYPE=“part” FSTYPE=“ext4” LABEL=“” UUID=“a0a2fad0-fe63-445a-8cbd-84d3f5fe8c9c”

And after restore:

[root@rockstor-test ~]# lsblk -P -p -o NAME,MODEL,SERIAL,SIZE,TRAN,VENDOR,HCTL,TYPE,FSTYPE,LABEL,UUID
NAME=“/dev/sdb” MODEL=“Virtual disk " SERIAL=“6000c2981f1f665fa00609faff46782f” SIZE=“200G” TRAN=”" VENDOR=“VMware " HCTL=“2:0:1:0” TYPE=“disk” FSTYPE=“btrfs” LABEL=“test_raid1” UUID=“7123d669-be9c-4ca0-bf5c-c3a18622836d”
NAME=”/dev/sr0" MODEL=“VMware SATA CD00” SERIAL=“00000000000000000001” SIZE=“1024M” TRAN=“sata” VENDOR=“NECVMWar” HCTL=“3:0:0:0” TYPE=“rom” FSTYPE=“” LABEL=“” UUID=“”
NAME=“/dev/sdc” MODEL=“Virtual disk " SERIAL=“6000c2944e04e183a591cc26bd7c7a36” SIZE=“200G” TRAN=”" VENDOR=“VMware " HCTL=“2:0:2:0” TYPE=“disk” FSTYPE=“btrfs” LABEL=“test_raid1” UUID=“7123d669-be9c-4ca0-bf5c-c3a18622836d”
NAME=”/dev/sda" MODEL=“Virtual disk " SERIAL=“6000c29d42b475d90dd0df22a2391814” SIZE=“16G” TRAN=”" VENDOR=“VMware " HCTL=“2:0:0:0” TYPE=“disk” FSTYPE=”" LABEL=“” UUID=“”
NAME=“/dev/sda2” MODEL=“” SERIAL=“” SIZE=“1.6G” TRAN=“” VENDOR=“” HCTL=“” TYPE=“part” FSTYPE=“swap” LABEL=“” UUID=“2c7814b9-a807-45f4-b6db-ce9eeab30180”
NAME=“/dev/sda3” MODEL=“” SERIAL=“” SIZE=“13.9G” TRAN=“” VENDOR=“” HCTL=“” TYPE=“part” FSTYPE=“btrfs” LABEL=“rockstor_rockstor” UUID=“b2f9da06-7a2e-461f-abc2-755f020b4f31”
NAME=“/dev/sda1” MODEL=“” SERIAL=“” SIZE=“500M” TRAN=“” VENDOR=“” HCTL=“” TYPE=“part” FSTYPE=“ext4” LABEL=“” UUID=“a0a2fad0-fe63-445a-8cbd-84d3f5fe8c9c”

So what do we do now? I would very much like these detached drives gone as they are messing up the reporting (rockstor thinks it has double the amount of disk capacity it has).
I will also keep the test VM powered off in case we need it.

phillxnet · December 10, 2019, 1:14pm

@Azzazel Thanks for the tests.

Yes currently the testing channel, for our legacy CentOS base, is way behind. 3.9.1-16 was the last CentOS based testing channel release but is now over 2 years old:

It’s essentially the same as our 3.9.2 stable channel release, linked above, at that time.See the following thread for the background and some more links relating to this more recent history:

So this is good news and confirms my suspicion that we may be putting at least one cart before at least one horse; but ideally we want to confirm this with 3.9.2-50 code. For this Appman is at your service:

https://appman.rockstor.com/

As a Stable subscriber you can now change your Appliance ID as and when you fancy with immediate affect. So in your case you could change it to your temporary btrfs raid1 test setup such that it can be updated to latest stable via your existing subscription: just use your existing activation code after the Appliance ID change has been entered. You would of course want to change this back to your main machine their after as otherwise it would no longer receive updates. So given this facility you can easily upgrade your temporary test machine to latest stable for the purpose of this test.

Note however that since this test machine is already on latest testing, if you move it now to the Stable subscription it will say it’s done it (instantly) but that is a bug. It’s now fixed but only in the code that it doesn’t offer to upgrade to because it thinks it’s already updated. It’s a rather dark bug that way
So make sure to run

yum update rockstor

when going from last CentOS testing (3.9.1-16) to Stable to ensure you are actually running 3.9.2-50 or later as of now. As otherwise the Web-UI states available version as installed. Not good. If going direct from a prior stable install, ie the iso itself, then now worries it should be good and should update as expected. Confirm via:

yum info rockstor

So in short, via our yet to be officially announced here on the forum (though it’s referenced in the Update Channels doc entry, through Appman you can upgrade your testing machine to latest Stable. Just remember to change your Appliance ID in Appman back to your regular machine their after. Otherwise it will receive no more updates. This new Appman facility is meant as a convenience for those that help to support Rockstor’s development and who, in doing so, have incurred an inconvenience: the whole activation code rigmarole. This was found to be required as a prior donations only system just didn’t work. I didn’t donate for one, before I became a contributor that is. So I’m hoping Appman helps to re-address this balance at least a bit.

Anyway if you could please repeat this experiment but with the 2 years newer code of 3.9.2-50 which, most relevant in this case, has very many changes to device management (hopefully only improvements), that would be great. Sorry to ask this of you but you do seem to be up for it

Thanks for these:
using different quotes to ease comparing them here and omitting the partitions for clarity,
and ordering them as sda, sdb, sdc we have:

## Before
NAME=”/dev/sda" MODEL=“Virtual disk " SERIAL=“6000c2994c673296d1cc4f7a9a87e24d” SIZE=“16G” TRAN=”" VENDOR=“VMware " HCTL=“3:0:0:0” TYPE=“disk” FSTYPE=”" LABEL="" UUID=""
NAME="/dev/sdb" MODEL=“Virtual disk " SERIAL=“6000c29a33240a77529fccf8ce00ff19” SIZE=“200G” TRAN=”" VENDOR=“VMware " HCTL=“3:0:1:0” TYPE=“disk” FSTYPE=“btrfs” LABEL=“test_raid1” UUID=“7123d669-be9c-4ca0-bf5c-c3a18622836d”
NAME="/dev/sdc" MODEL=“Virtual disk " SERIAL=“6000c2964bd0af6aa755d22714f7e770” SIZE=“200G” TRAN=”" VENDOR=“VMware " HCTL=“3:0:2:0” TYPE=“disk” FSTYPE=“btrfs” LABEL=“test_raid1” UUID=“7123d669-be9c-4ca0-bf5c-c3a18622836d”
## after
NAME=”/dev/sda" MODEL=“Virtual disk " SERIAL=“6000c29d42b475d90dd0df22a2391814” SIZE=“16G” TRAN=”" VENDOR=“VMware " HCTL=“2:0:0:0” TYPE=“disk” FSTYPE=”" LABEL="" UUID=""
NAME="/dev/sdb" MODEL=“Virtual disk " SERIAL=“6000c2981f1f665fa00609faff46782f” SIZE=“200G” TRAN=”" VENDOR=“VMware " HCTL=“2:0:1:0” TYPE=“disk” FSTYPE=“btrfs” LABEL=“test_raid1” UUID=“7123d669-be9c-4ca0-bf5c-c3a18622836d”
NAME="/dev/sdc" MODEL=“Virtual disk " SERIAL=“6000c2944e04e183a591cc26bd7c7a36” SIZE=“200G” TRAN=”" VENDOR=“VMware " HCTL=“2:0:2:0” TYPE=“disk” FSTYPE=“btrfs” LABEL=“test_raid1” UUID=“7123d669-be9c-4ca0-bf5c-c3a18622836d”

We have confirmation that all 3 drives have effectively been replace by, from Rockstor’s perspective, 3 new drives. And this is what it indicates within the Web-UI also. So it is behaving exactly as expected. This situation does not, as far as I’m aware, have a parallel with real disks/devices. And Rockstor’s remit is to track real devices. So yes your backup restore system is at least confusing for Rockstor as it works to track devices irrespective of their content, at least once a pool is request as managed via creation within the Web-UI or import.

An interesting test case as it goes. Do please consider re-running, via the Appman trick, on our latest code if you would. I’m expecting the same behaviour re detached disks but would like to have confirmed that the newer code also lets you remove those ‘detached’ disks post restore as per your testing code experiement, ie with btrfs raid1.

You could, for the time being, by way of a work around, convert the pool to raid1 via the Web-UI and once that has finished it’s potentially lengthy balance, remove the detached disks their after. I’m not likely to get to look more closely at that code for a bit but your report has highlighted a potential improvement. But the consequences in your case are due to a corner case of all drives being supplanted by re-incarnations of themselves. And our current blanket ban on removing a device from a raid0 pool is legitimate as that is not something that is possible under any circumstances (other than the group simultaneous re-incarnation scenario your have reproduced). If we were to allow the removal of a device from a btrfs raid0 only when it was detached then in the scenario where a SATA cable becomes detached a user may inadvertently think they can just remove that disk. But they cannot. Re-attaching it and then changing the raid profile to one with redundancy is their option. I’ll have a think as their are likely many more common corner cases than yours that we are probably missing and I’m hoping to start the testing channel off again soon to explore these, as well as required modernisation, and as these are potentially non trivial critical code changes we have to tread carefully.

But if you are really game (read you have good and proven backups) the following code segment is what is giving your grief (correct as of posting this but may change later):

github.com

rockstor/rockstor-core/blob/master/src/rockstor/storageadmin/views/pool.py#L569-L572


      
          )
          for d in disks:
              if d.pool is not None:
                  e_msg = (

this is true for stable channel only as there have been many changes throughout since testing.

You might also note a further constraint in place just before the above code here:

github.com

rockstor/rockstor-core/blob/master/src/rockstor/storageadmin/views/pool.py#L559-L564


      
              ]
          
          num_disks_selected = len(disks)
          dnames = self._role_filter_disk_names(disks, request)
          new_raid = request.data.get("raid_level", pool.raid)

Hope that helps and thanks for helping to test Rockstor’s limits re drive re-incarnation

Azzazel · December 16, 2019, 9:21am

Hi, @phillxnet!
I’ve reinstalled and updated my test VM to the latest stable:

[root@rockstor-test ~]# yum info rockstor
Loaded plugins: changelog, fastestmirror
Loading mirror speeds from cached hostfile

base: mirror.reconn .ru

epel: mirror.yandex .ru

extras: mirror.logo .ru

updates: mirror.logol .ru
Installed Packages
Name : rockstor
Arch : x86_64
Version : 3.9.2
Release : 50
Size : 85 M
Repo : installed
From repo : Rockstor-Stable
Summary : Btrfs Network Attached Storage (NAS) Appliance.
URL : http ://rockstor.com/
License : GPL
Description : Software raid, snapshot capable NAS solution with built-in file integrity protection.
: Allows for file sharing between network attached devices.

And attached two 200G-drives:

[root@rockstor-test ~]# lsblk -P -p -o NAME,MODEL,SERIAL,SIZE,TRAN,VENDOR,HCTL,TYPE,FSTYPE,LABEL,UUID
NAME=“/dev/sdb” MODEL=“Virtual disk " SERIAL=“6000c29195c9b7bd6da33d072ffc9e93” SIZE=“200G” TRAN=”" VENDOR=“VMware " HCTL=“3:0:1:0” TYPE=“disk” FSTYPE=”" LABEL=“” UUID=“”
NAME=“/dev/sr0” MODEL=“VMware SATA CD00” SERIAL=“00000000000000000001” SIZE=“791M” TRAN=“sata” VENDOR=“NECVMWar” HCTL=“2:0:0:0” TYPE=“rom” FSTYPE=“iso9660” LABEL=“Rockstor 3 x86_64” UUID=“2017-07-02-03-11-01-00”
NAME=“/dev/sdc” MODEL=“Virtual disk " SERIAL=“6000c2934a84d330cfa1e51cbb1385ac” SIZE=“200G” TRAN=”" VENDOR=“VMware " HCTL=“3:0:2:0” TYPE=“disk” FSTYPE=”" LABEL=“” UUID=“”
NAME=“/dev/sda” MODEL=“Virtual disk " SERIAL=“6000c294a6968d6eea88ce6b8fa37767” SIZE=“16G” TRAN=”" VENDOR=“VMware " HCTL=“3:0:0:0” TYPE=“disk” FSTYPE=”" LABEL=“” UUID=“”
NAME=“/dev/sda2” MODEL=“” SERIAL=“” SIZE=“1.6G” TRAN=“” VENDOR=“” HCTL=“” TYPE=“part” FSTYPE=“swap” LABEL=“” UUID=“d0290533-2287-4b11-8624-cd2add21c4c4”
NAME=“/dev/sda3” MODEL=“” SERIAL=“” SIZE=“13.9G” TRAN=“” VENDOR=“” HCTL=“” TYPE=“part” FSTYPE=“btrfs” LABEL=“rockstor_rockstor” UUID=“52e366e4-fd43-4e2b-9f7c-8a983e1a7c94”
NAME=“/dev/sda1” MODEL=“” SERIAL=“” SIZE=“500M” TRAN=“” VENDOR=“” HCTL=“” TYPE=“part” FSTYPE=“ext4” LABEL=“” UUID=“2965d40d-0e23-4020-811a-51a40fbe8156”

Now we back it up and restore it.
After restore we once more have “detached” and “new” disks.
lsblk:

[root@rockstor-test ~]# lsblk -P -p -o NAME,MODEL,SERIAL,SIZE,TRAN,VENDOR,HCTL,TYPE,FSTYPE,LABEL,UUID
NAME=“/dev/sdb” MODEL=“Virtual disk " SERIAL=“6000c296606e8df2b18b85d789111362” SIZE=“200G” TRAN=”" VENDOR=“VMware " HCTL=“3:0:1:0” TYPE=“disk” FSTYPE=“btrfs” LABEL=“test_raid1” UUID=“7ad0e38b-ba82-4a24-a9a3-fa9297e0536f”
NAME=”/dev/sr0" MODEL=“VMware SATA CD00” SERIAL=“00000000000000000001” SIZE=“1024M” TRAN=“sata” VENDOR=“NECVMWar” HCTL=“2:0:0:0” TYPE=“rom” FSTYPE=“” LABEL=“” UUID=“”
NAME=“/dev/sdc” MODEL=“Virtual disk " SERIAL=“6000c2985e435f212ff6034df3402b44” SIZE=“200G” TRAN=”" VENDOR=“VMware " HCTL=“3:0:2:0” TYPE=“disk” FSTYPE=“btrfs” LABEL=“test_raid1” UUID=“7ad0e38b-ba82-4a24-a9a3-fa9297e0536f”
NAME=”/dev/sda" MODEL=“Virtual disk " SERIAL=“6000c2908419326f8cfdc897d6a7d100” SIZE=“16G” TRAN=”" VENDOR=“VMware " HCTL=“3:0:0:0” TYPE=“disk” FSTYPE=”" LABEL=“” UUID=“”
NAME=“/dev/sda2” MODEL=“” SERIAL=“” SIZE=“1.6G” TRAN=“” VENDOR=“” HCTL=“” TYPE=“part” FSTYPE=“swap” LABEL=“” UUID=“d0290533-2287-4b11-8624-cd2add21c4c4”
NAME=“/dev/sda3” MODEL=“” SERIAL=“” SIZE=“13.9G” TRAN=“” VENDOR=“” HCTL=“” TYPE=“part” FSTYPE=“btrfs” LABEL=“rockstor_rockstor” UUID=“52e366e4-fd43-4e2b-9f7c-8a983e1a7c94”
NAME=“/dev/sda1” MODEL=“” SERIAL=“” SIZE=“500M” TRAN=“” VENDOR=“” HCTL=“” TYPE=“part” FSTYPE=“ext4” LABEL=“” UUID=“2965d40d-0e23-4020-811a-51a40fbe8156”

But now, even though I can initiate disk removal without errors, the disks do not get removed and persist, even after restart.

phillxnet · December 16, 2019, 12:42pm

@Azzazel Thanks for the further testing. Much appreciated. And glad Appman served as intended.

OK, thank. That’s good to know. Pretty sure I can reproduce this scenario here and this looks like a bug in our newer disk handling re removal of detached disks. As such I’ve created the following issue:

github.com/rockstor/rockstor-core

failure to remove detached disks via pool resize

opened 12:20PM - 16 Dec 19 UTC

closed 03:41PM - 24 Jun 20 UTC

phillxnet

Thanks to form member Azzazel for highlighting this issue. Post our recent disk …management code revamp in 3.9.2-49 we fail to deal appropriately with detached prior pool members. The sighted forum thread details a reproducer but it is believed to be a generic problem with detached disk removal. This has been demonstrated on 3.9.2-50. The observed behaviour is a pool resize detached disk removal operation silently failing; leaving the requested detached disks in place. Please update the following forum thread with this issues resolution: https://forum.rockstor.com/t/cannot-remove-detached-disks/6572/8

On first guess I would say this newer code has taken 2 steps forward and one back. Our prior disk removal code failed for attached disk from a Web-UI perspective, but worked a treat for the detached disks. And the new code now works for attached disks and can further monitor the (attached) disk removal’s data egress. But I now suspect fails for detached disks.

Once I’m done with my current release backlog and doing a couple of announcements I due again to look at this code. This issue looks like a good place to start.

Thanks again for this report, in a round about way I think you’ve hit on a relatively easy to reproduce issue. Apologies for not catching this earlier but I’m hoping it should be a relatively easy fix. My focus on the last fairly major revamp, and one of it’s dependencies, was on sorting the attached disk removal and I now suspect I’ve overlooked making a special case for detached disks and failing through silently without actually removing the said detached disk.

I’ve put a note on the issue to update this thread upon it’s resolution so ‘watch this space’.

Thanks again for your diligent testing and reporting, I’m really hoping to get some openQA based automated end to end testing in place in the not to distant future so that silly omissions such as this can be caught earlier, or at least so that anyone can submit new tests to catch such things and serve as regression testing for the future.

Azzazel · December 16, 2019, 10:37pm

Hi, @phillxnet!
Glad to help.
In the meantime, what about raid0 detached disks on my main VMs? What is the course of action for them?

phillxnet · December 17, 2019, 2:03pm

@Azzazel, Re removing detached disks in a raid0, the existing block is a valid safeguard but even if defeated, such as removing the indicated code doing this block, it’s the same code later on that is failing to remove detached drives of any sort. At least that’s my initial guess before I look properly into the issue I’ve now created for this bug. But I hope to get to this when time allows and will update here appropriately. We have to tread carefully here as this is sensitive code in a key area. So in short I now suspect it’s the same problem, ultimately, where we fail on pool resize remove detached disks. I intend to look to this in our up and coming testing channel; still working towards my:

Hopefully you will be up for a spot more testing once I’ve had a look? As this backup restore arrangement you are using is quite the full on test as it goes given that it requires Rockstor to re-jig it’s entire device db. But I suspect you are always going to have to remove ‘detached’ disks their after given how Rockstor tracks disks. But assuming the buggy removal of detached disks can be sorted you may have a workable arrangement which would be good.