ZTASK device removal task continues trying to execute long after the device has been successfully removed

Earlier this year I replaced all my drives with higher-capacity ones (Advice needed on All-Drives replacement - #8 by Hooverdan). I followed the approach of first adding one new drive to have enough capacity overhead, and then removing and replacing my other drives one after the other (performing all the required steps in between using the WebUI).
While investigating another issue 9 months later, I found that one of the btrfs tasks for removing a device keeps attempting to do so (and obviously failing, because the device has not been in the system for a long time).
Looking at ztask.log (which I don't think I had ever looked at before), I am seeing something curious:

2020-10-09 07:57:48,306 - ztaskd - INFO - Calling fs.btrfs.start_resize_pool
2020-10-09 07:57:48,308 - ztaskd - ERROR - Error calling fs.btrfs.start_resize_pool. Details:
Error running a command. cmd = /usr/sbin/btrfs device delete /dev/disk/by-id/ata-HGST_HDN724040ALE640_PK1334PEHZB3ZS /dev/disk/by-id/ata-HGST_HDN724040ALE640_PK1334PEHZNA2S /mnt2/4xRAID5. rc = 1. stdout = ['']. stderr = ['ERROR: not a block device: /dev/disk/by-id/ata-HGST_HDN724040ALE640_PK1334PEHZB3ZS', 'ERROR: not a block device: /dev/disk/by-id/ata-HGST_HDN724040ALE640_PK1334PEHZNA2S', '']

I have not triggered or scheduled any pool resize since I replaced all my hard drives earlier in the year. I don't even have these HGST drives physically in the system anymore (I replaced them with WDs). The WebUI has not surfaced any errors related to this, and obviously this is not the kind of task one would set up on a regular schedule (even if that were possible).
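
As a sanity check, something like the following small Python sketch (the paths are simply copied from the log excerpt above; this is only an illustration, not anything Rockstor itself runs) would confirm that those by-id paths no longer resolve to block devices on this system:

```python
import os
import stat

# Paths copied from the ztask.log error above; both HGST drives were
# physically removed from the system months ago.
paths = [
    "/dev/disk/by-id/ata-HGST_HDN724040ALE640_PK1334PEHZB3ZS",
    "/dev/disk/by-id/ata-HGST_HDN724040ALE640_PK1334PEHZNA2S",
]

for p in paths:
    try:
        is_block = stat.S_ISBLK(os.stat(p).st_mode)
    except OSError:
        is_block = False  # the path does not exist at all anymore
    print("%s -> %s" % (p, "block device" if is_block else "missing / not a block device"))
```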

@phillxnet you had already mentioned in another topic that you would look into it, just not right now with everything going on for the rebase. I can alternatively create an issue on GitHub, but if I'm the only one experiencing this I don't want to clutter up the issue list with it.


@Hooverdan Thanks for breaking this find out into its own topic. It would be good to get to the bottom of it.

As for a GitHub issue, I think ideally we would need a reproducer first, so let's keep it here for now and work on removing these entries from the db, or on whatever is re-triggering them. They may even be a previously unidentified side effect of:

Which was a large, long-awaited change to do with disk removals. Maybe we have some more clean-up to do after that. Anyway, I can't look into it just yet myself, but it's good we have this observation broken out now.

Can you remember what version of Rockstor you were running when you did these removals? I know it's a big ask, but just in case you happen to remember.

Cheers.


@phillxnet, thank you. I installed 3.9.2-50 on 12/4/2019 (you helped me get to that level in the post I linked above), then updated to 3.9.2-51, and took any updates that came along between that and my January 2020 disk exchange project.
Shortly before that I moved to a newer kernel, 5.4.1-1 (and the associated btrfs-tools), to take advantage of the latest btrfs improvements at the time.
And I believe I also installed 3.9.2-57 shortly after it came out, but that was after I had completed the exchange project.

So it looks like the GitHub issue you're quoting was not part of the release under which I did this activity.
Thanks


I logged into the storageadmin db and looked at the entries of the django_ztask_task table. There are 3 entries in there, the last one related to the device removal task. I am not sure whether the other 2 should be in that table either, though:

I left off the last_exception column for space reasons, but for the last entry it contains the error message quoted above.
The first two entries had “not a gzipped file” and “ConfigBackup matching query does not exist” as their exception messages, respectively.
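
For reference, the same rows can also be listed without hand-typing SQL in psql. A rough sketch from Rockstor's Django shell (the exact way to launch that shell may differ by version) that only relies on the django_ztask_task table name would be:

```python
# Rough sketch: list the queued django-ztask entries via Django's default
# db connection; only the django_ztask_task table name is assumed here.
from django.db import connection

cur = connection.cursor()
cur.execute("SELECT * FROM django_ztask_task")
columns = [col[0] for col in cur.description]
for row in cur.fetchall():
    print(dict(zip(columns, row)))
```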

I presume that none of the three should be there.
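
If they are indeed all obsolete, an untested sketch for clearing just the device-removal entry (matching on the error text quoted above) from the same Django shell could look like the following; I would back up the database first:

```python
# Untested sketch: delete only the stale device-removal task by matching on
# the "not a block device" error text stored in last_exception.
# Back up the storageadmin database before trying anything like this.
from django.db import connection

cur = connection.cursor()
cur.execute(
    "DELETE FROM django_ztask_task WHERE last_exception LIKE %s",
    ["%not a block device%"],
)
print("rows deleted: %d" % cur.rowcount)
```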

