ZTASK device removal task continues to try to execute long after device has been successfully removed

Hooverdan · October 12, 2020, 10:34pm

Earlier this year I replaced all my drives with higher-capacity ones (Advice needed on All-Drives replacement - #8 by Hooverdan). My approach was to add one new drive first to gain enough capacity headroom, and then remove and replace my other drives one after the other (with all the steps required in between using the WebUI).
While investigating another issue 9 months later, I found that one of the btrfs tasks for removing a device keeps attempting to run (and obviously failing, because the device has not been there for a long time).

Looking at ztask.log (a log I don't think I had ever looked at before), I am seeing something curious:

2020-10-09 07:57:48,306 - ztaskd - INFO - Calling fs.btrfs.start_resize_pool
2020-10-09 07:57:48,308 - ztaskd - ERROR - Error calling fs.btrfs.start_resize_pool. Details:
Error running a command. cmd = /usr/sbin/btrfs device delete /dev/disk/by-id/ata-HGST_HDN724040ALE640_PK1334PEHZB3ZS /dev/disk/by-id/ata-HGST_HDN724040ALE640_PK1334PEHZNA2S /mnt2/4xRAID5. rc = 1. stdout = ['']. stderr = ['ERROR: not a block device: /dev/disk/by-id/ata-HGST_HDN724040ALE640_PK1334PEHZB3ZS', 'ERROR: not a block device: /dev/disk/by-id/ata-HGST_HDN724040ALE640_PK1334PEHZNA2S', '']

I have not triggered or scheduled a pool resize since I replaced all my hard drives earlier in the year. I don’t even have these HGST drives physically in the system anymore (I replaced those with WDs). In the WebUI I have not seen any errors surfaced related to this, and obviously this is not a regularly scheduled task one would want to set up (if it were possible).
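As a rough illustration of how one might confirm which devices the failing task still refers to, here is a minimal Python sketch. It pulls the device paths out of the "not a block device" messages in the log entry above and checks whether each path currently exists as a block device. The helper names (`missing_devices`, `is_block_device`) are made up for this example and are not part of Rockstor.

```python
import os
import re
import stat

# stderr fragment copied from the ztask.log entry quoted above.
LOG_LINE = (
    "stderr = ['ERROR: not a block device: "
    "/dev/disk/by-id/ata-HGST_HDN724040ALE640_PK1334PEHZB3ZS', "
    "'ERROR: not a block device: "
    "/dev/disk/by-id/ata-HGST_HDN724040ALE640_PK1334PEHZNA2S', '']"
)

# Captures the device path that follows the "not a block device" message.
NOT_BLOCK_RE = re.compile(r"not a block device: (/dev/[^'\s]+)")


def missing_devices(log_text):
    """Return the device paths reported as 'not a block device' (hypothetical helper)."""
    return NOT_BLOCK_RE.findall(log_text)


def is_block_device(path):
    """True only if the path exists and resolves to a block device node."""
    try:
        return stat.S_ISBLK(os.stat(path).st_mode)
    except OSError:
        # Path does not exist (or cannot be read), so it is not usable.
        return False


if __name__ == "__main__":
    for dev in missing_devices(LOG_LINE):
        state = "present" if is_block_device(dev) else "absent"
        print(dev, state)
```

On a system where the old drives have been pulled, both paths would be expected to show as absent, which matches the error btrfs reports.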

@phillxnet you had already mentioned in another topic that you will look into it, just not now with everything going on for the rebase. I can alternatively create an issue on GitHub, but if I’m the only one experiencing this then I don’t want to clutter up the issue lists with this.

phillxnet · October 13, 2020, 10:40am

@Hooverdan Thanks for breaking this find out. It would be good to get to the bottom of it.

As for a GitHub issue, I think ideally we would need a reproducer first, so let's keep it here for now and work here on removing these from the db, or on finding whatever is re-triggering them. They may even be a previously unidentified side effect of:

github.com/rockstor/rockstor-core

pool resize disk removal unknown internal error and no UI counterpart. Fixes #1722

rockstor:master ← phillxnet:1722_pool_resize_disk_removal_unknown_internal_error_and_no_UI_counterpart

opened 05:46PM - 27 Jan 19 UTC

phillxnet

+871 -414

Fix disk removal timeout failure re "Unknown internal error doing a PUT .../remove" by asynchronously executing 'btrfs dev remove'.

The pool_balance model was extended to accommodate for what are arbitrarily named (within Rockstor) 'internal' balances: those automatically initiated upon every 'btrfs dev delete' by the btrfs subsystem itself. A complication of 'internal' balances is their invisibility via 'btrfs balance status'. An inference mechanism was thus constructed to 'fake' the output of a regular balance status so that our existing Web-UI balance surfacing mechanisms could be extended to serve these 'internal' variants similarly.

The new state of device 'in removal' and the above mentioned inference mechanism required that we now track and update devid and per device allocation. These were added as disk model fields and surfaced appropriately at the pool details level within the Web-UI.

Akin to regular balances, btrfs dev delete 'internal' balances were found to negatively impact Web-UI interactivity. This was in part alleviated by refactoring the lowest levels of our disk/pool scan mechanisms. In essence this refactoring significantly reduces the number of system and python calls required to attain the same system wide dev / pool info and simplifies low level device name handling. Existing unit tests were employed to aid in this refactoring. Minor additional code was required to account for regressions (predominantly in LUKS device name handling) that were introduced by these low level device name code changes.

Summary:

- Execute device removal asynchronously.
- Monitor the consequent 'internal' balances by existing mechanisms where possible.
- Only remove pool members pool associations once their associated 'internal' balance has finished.
- Improve low level efficiency/clarity re device/pool scanning by moving to a single call of the lighter get_dev_pool_info() rather than calling the slower get_pool_info() btrfs disk count times; get_pool_info() is retained for pool import duties as its structure is ideally suited to that task. Multiple prior temp_name/by-id conversions are also avoided.
- Improve user messaging re system performance / Web-UI responsiveness during a balance operation, regular or 'internal'.
- Fix bug re reliance on "None" disk label removing a fragility concerning disk pool association within the Web-UI.
- Improve auto pool labeling subsystem by abstracting and generalising ready for pool renaming capability.
- Improve pool uuid tracking and add related Web-UI element.
- Add Web-UI element in balance status tab to identify regular or 'internal' balance type.
- Add devid tracking and related Web-UI element.
- Use devid Disk model info to ascertain pool info for detached disks.
- Add per device allocation tracking and related Web-UI element.
- Fix prior TODO: re btrfs in partition failure point introduced in git tag 3.9.2-32.
- Fix prior TODO: re unlabeled pools caveat.
- Add pool details disks table ‘Page refresh required’ indicator keyed from devid=0.
- Add guidance on common detached disk removal reboot requirement (only affects older kernels).
- Remove a low level special case for LUKS dev matching (mapped devices) which affected the performance of all dev name by-id look-ups.
- Add TODO re removing legacy formatted disk raid role pre openSUSE move.
- Update scan_disks() unit tests for new 'path included' output.
- Address TODO in scan_disks() unit tests and normalise on pre-sort method.

Fixes #1722

And by way of a trivial application of the added per device allocation:

Fixes #1918 "Incorrect size calculation while removing disk from disk pool"

@suman Ready for review.

Please note that this pr assumes the prior merge of:

- "regression in unit tests - environment outdated since 3.9.2-45. Fixes #1993" #1994 (Fixes unit tests)
- "pin python-engineio to 2.3.2 as recent 3.0.0 update breaks gevent. Fixes #1995" #1996 (Fixes basic build fail)

and:

- "Implement Add Labels feature for already-installed Rock-Ons. Fixes #1998" #1999 (has a prior storageadmin db migration 0007_auto_20181210_0740.py - I’m trying to keep our migrations path simple)

Testing: All existing osi and btrfs unit tests were confirmed to pass prior to and post pr (given #1994) however as indicated above the scan_disks() unit tests required modification but only to accommodate the new behaviour introduced in scan_disks() where we request from lsblk all device paths. From the osi unit tests point of view this was a cosmetic change in test data: and no functional changes were made bar a trivial robustness improvement by way of an existing TODO. Many of the system configurations used to originally generate the osi unit test data were also tested in their install instance counterparts (ie bios raid system disk, LUKS, btrfs in partition, etc) and were also used during development to help ensure minimal regression. A full functional test on real hardware was also conducted over multiple cycles of removing (and re-adding a post 'wipefs -a' disk where appropriate). These tests are detailed in the comments below and indicate expected behaviour in both legacy CentOS and openSUSE (Tumbleweed in this case) installs.

Caveats: Our keying from devid = 0 (for 'Page refresh required' UI element) may cause confusion during a disk replace (as yet unimplemented: see issue #1611 ) as it is understood that currently within btrfs one of the two disks involved during a 'btrfs replace start ...' operation is temporarily assigned a devid of 0. The cited issue can address this as and when needed.

Which was a large, long awaited change to do with disk removals. Maybe we have some more clean up to do after that. Anyway can’t look just yet myself but it’s good we have this observation broken out now.

Can you remember what version of Rockstor you were running when you did these removals? I know it’s a big ask but just in case you happen to remember.

Cheers.

Hooverdan · October 13, 2020, 2:31pm

@phillxnet, Thank you. I installed 3.9.2-50 on 12/4/2019 (you helped me get to that level in the post I linked above), then moved to 3.9.2-51, and took any updates that came along between then and my January 2020 disk exchange project.
Shortly before that I went with a newer kernel 5.4.1-1 (and associated btrfs-tools) to take advantage of the latest btrfs improvements at that time.
I believe I also installed 3.9.2-57 shortly after it came out, but that was after I had completed the exchange project.

So, it looks like the github issue you’re quoting was not part of the release under which I did the activity.
Thanks

Hooverdan · October 13, 2020, 5:08pm

I logged into the storageadmin db and looked at the entries of the django_ztask_task table. There are 3 entries in there, the last one related to the device removal task. Not sure whether the other 2 should be in that table either, though:

I left off the last_exception column for space reasons, but it contains the error message pointed out above.
The first two had “not a gzipped file” and “ConfigBackup matching query does not exist” as exception message in there, respectively.

I presume that none of the three should be there …
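For anyone wanting to repeat this kind of inspection, here is a small Python sketch of the query involved. It uses an in-memory SQLite table as a stand-in, since Rockstor's real storageadmin db is PostgreSQL. Only the table name, the last_exception column, the three exception messages and the fs.btrfs.start_resize_pool function name come from this thread; the other column names, the task ids and the two placeholder function names are assumptions for illustration.

```python
import sqlite3

# In-memory SQLite stand-in for the django_ztask_task table.
# Assumption: columns other than last_exception are illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE django_ztask_task ("
    "uuid TEXT PRIMARY KEY, function_name TEXT, last_exception TEXT)"
)
conn.executemany(
    "INSERT INTO django_ztask_task VALUES (?, ?, ?)",
    [
        # The first two function names are hypothetical placeholders.
        ("task-1", "hypothetical.task_one", "not a gzipped file"),
        ("task-2", "hypothetical.task_two",
         "ConfigBackup matching query does not exist"),
        # Function name taken from the ztask.log entry; message shortened.
        ("task-3", "fs.btrfs.start_resize_pool",
         "Error running a command. ... not a block device ..."),
    ],
)


def failing_tasks(connection):
    """Return (function_name, last_exception) for tasks with a recorded error."""
    cur = connection.execute(
        "SELECT function_name, last_exception FROM django_ztask_task "
        "WHERE last_exception IS NOT NULL AND last_exception != '' "
        "ORDER BY uuid"
    )
    return cur.fetchall()


if __name__ == "__main__":
    for name, exc in failing_tasks(conn):
        print(name, "->", exc)
```

This only reads the table. Whether such rows can simply be deleted is exactly the open question here, so the sketch deliberately stops at listing them.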