phillxnet
December 15, 2019, 12:22pm
@iecs Hello again, and thanks for the report.
However, as this is from our now retired CentOS testing channel, it has been superseded by our Stable channel. Your issue may in part be due to a still-ongoing balance, but as the legacy testing channel was unable to monitor / assess this from the Web-UI, it is difficult to tell whether one is still running; especially so in the case of an ongoing disk removal.
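If you are comfortable on the command line, the usual way to check is shown below; the pool mount path is only an example and should be replaced with your own:
```
# A regular balance would show up here, but the 'internal' balance that
# follows a disk removal typically reports nothing:
btrfs balance status /mnt2/your_pool_name

# An in-progress removal instead shows the departing device at size 0.00B
# with a 'used' figure that shrinks between runs:
btrfs fi show /mnt2/your_pool_name
```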
As from Stable release version 3.9.2-49:

3.9.2-49
Merged end September 2019
Released 2nd October 2019
I am chuffed, and somewhat relieved, to finally release what is one of our largest and longest awaited releases for some time now. I present the 81st Rockstor. 
In this release we welcome and thank first time Rockstor contributor @Psykar  for a 2 in one fix for long standing issues of table sorting; nice: 
Fix size sorting on snapshots / shares pages. Fixes #1368  #1878  @Psykar  on-Github 
Thanks also to @Flox  for a wide range of non …
   
 
specifically: #1722 @phillxnet
where #1722 was a biggie:

rockstor:master ← phillxnet:1722_pool_resize_disk_removal_unknown_internal_error_and_no_UI_counterpart
opened 05:46PM - 27 Jan 19 UTC

Fix disk removal timeout failure re "Unknown internal error doing a PUT .../remove" by asynchronously executing 'btrfs dev remove'. The pool_balance model was extended to accommodate what are arbitrarily named (within Rockstor) 'internal' balances: those automatically initiated upon every 'btrfs dev delete' by the btrfs subsystem itself. A complication of 'internal' balances is their invisibility via 'btrfs balance status'. An inference mechanism was thus constructed to 'fake' the output of a regular balance status so that our existing Web-UI balance surfacing mechanisms could be extended to serve these 'internal' variants similarly. The new state of device 'in removal' and the above-mentioned inference mechanism required that we now track and update devid and per device allocation. These were added as disk model fields and surfaced appropriately at the pool details level within the Web-UI.
Akin to regular balances, btrfs dev delete 'internal' balances were found to negatively impact Web-UI interactivity. This was in part alleviated by refactoring the lowest levels of our disk/pool scan mechanisms. In essence this refactoring significantly reduces the number of system and python calls required to attain the same system wide dev / pool info and simplifies low level device name handling. Existing unit tests were employed to aid in this refactoring. Minor additional code was required to account for regressions (predominantly in LUKS device name handling) that were introduced by these low level device name code changes.
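By way of a purely command-line analogue of that mechanism (a minimal sketch only, not the PR's code; the device and pool mount path are example values):
```
# Start the removal in the background so the caller is not blocked:
btrfs device remove /dev/sda /mnt2/your_pool_name &

# 'btrfs balance status' does not report the resulting 'internal' balance:
btrfs balance status /mnt2/your_pool_name

# ...so progress is instead inferred from the departing device's 'used'
# figure shrinking towards zero until it leaves the listing:
while btrfs fi show /mnt2/your_pool_name | grep -q '/dev/sda'; do
    btrfs fi show /mnt2/your_pool_name | grep '/dev/sda'
    sleep 60
done
```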
Summary:
- Execute device removal asynchronously.
- Monitor the consequent 'internal' balances by existing mechanisms where possible.
- Only remove pool members' pool associations once their associated 'internal' balance has finished.
- Improve low level efficiency/clarity re device/pool scanning by moving to a single call of the lighter get_dev_pool_info() rather than calling the slower get_pool_info() once per btrfs disk; get_pool_info() is retained for pool import duties as its structure is ideally suited to that task. Multiple prior temp_name/by-id conversions are also avoided.
- Improve user messaging re system performance / Web-UI responsiveness during a balance operation, regular or 'internal'.
- Fix bug re reliance on "None" disk label, removing a fragility concerning disk-pool association within the Web-UI.
- Improve auto pool labeling subsystem by abstracting and generalising ready for pool renaming capability.
- Improve pool uuid tracking and add related Web-UI element.
- Add Web-UI element in balance status tab to identify regular or 'internal' balance type.
- Add devid tracking and related Web-UI element.
- Use devid Disk model info to ascertain pool info for detached disks.
- Add per device allocation tracking and related Web-UI element.
- Fix prior TODO: re btrfs in partition failure point introduced in git tag 3.9.2-32.
- Fix prior TODO: re unlabeled pools caveat.
- Add pool details disks table ‘Page refresh required’ indicator keyed from devid=0.
- Add guidance on common detached disk removal reboot requirement (only affects older kernels).
- Remove a low level special case for LUKS dev matching (mapped devices) which affected the performance of all dev name by-id look-ups.
- Add TODO re removing legacy formatted disk raid role pre openSUSE move.
- Update scan_disks() unit tests for new 'path included' output.
- Address TODO in scan_disks() unit tests and normalise on pre-sort method.
Fixes #1722 
And by way of a trivial application of the added per device allocation:
Fixes #1918 
"Incorrect size calculation while removing disk from disk pool"
@suman Ready for review.
Please note that this pr assumes the prior merge of:
"regression in unit tests - environment outdated since 3.9.2-45. Fixes #1993" #1994 (Fixes unit tests)
"pin python-engineio to 2.3.2 as recent 3.0.0 update breaks gevent. Fixes #1995" #1996 (Fixes basic build fail)
and:
"Implement Add Labels feature for already-installed Rock-Ons. Fixes #1998" #1999 (has a prior storageadmin db migration 0007_auto_20181210_0740.py - I’m trying to keep our migrations path simple)
Testing:
All existing osi and btrfs unit tests were confirmed to pass prior to and post this pr (given #1994). However, as indicated above, the scan_disks() unit tests required modification, but only to accommodate the new behaviour introduced in scan_disks() where we now request all device paths from lsblk. From the osi unit tests' point of view this was a cosmetic change in test data: no functional changes were made bar a trivial robustness improvement by way of an existing TODO.
Many of the system configurations used to originally generate the osi unit test data were also tested in their install instance counterparts (ie bios raid system disk, LUKS, btrfs in partition, etc) and were also used during development to help ensure minimal regression.
A full functional test on real hardware was also conducted over multiple cycles of removing (and, where appropriate, re-adding a disk after 'wipefs -a'). These tests are detailed in the comments below and indicate expected behaviour in both legacy CentOS and openSUSE (Tumbleweed in this case) installs.
Caveats:
Our keying from devid = 0 (for 'Page refresh required' UI element) may cause confusion during a disk replace (as yet unimplemented: see issue #1611 ) as it is understood that currently within btrfs one of the two disks involved during a 'btrfs replace start ...' operation is temporarily assigned a devid of 0. The cited issue can address this as and when needed. 
 
which in turn required quite a few other improvements, made after the last legacy testing channel release, to be in place first.
The main linked issue in that pull request was:

opened 06:19PM - 01 Jun 17 UTC; closed 02:10PM - 09 Jul 19 UTC

Thanks to forum member Noggin for highlighting this behaviour. Occasionally, when removing a disk from a pool, there can be a UI timeout directly after the last dialog entitled "Resize Pool / Change RAID level for ..." which acts as the last confirmation of the configured operation:

There is then no UI 'balance' indicated while the removal is in progress, yet the UI indicates that a balance is in progress when a balance is attempted (only attempted by Noggin as I did not attempt to execute a balance whilst the removal was in progress).
```
btrfs balance status /mnt2/time_machine_pool/
No balance found on '/mnt2/time_machine_pool/'
```
The pool resize is, however, indicated by the requested disks having their size 'demoted' to zero and showing a reduced usage with subsequent executions of **btrfs fi show**:
```
Label: 'time_machine_pool'  uuid: 8f363c7d-2546-4655-b81b-744e06336b07
	Total devices 4 FS bytes used 31.57GiB
	devid    3 size 149.05GiB used 17.03GiB path /dev/sdd
	devid    4 size 0.00B used 5.00GiB path /dev/sda
	devid    5 size 149.05GiB used 23.03GiB path /dev/mapper/luks-d36d39ea-c0b3-4355-b0c5-bd3248e6bbfe
	devid    6 size 149.05GiB used 23.00GiB path /dev/mapper/luks-d7524e90-4d9e-4772-932f-d1407b6b5fe7
```
and then later on:
```
Label: 'time_machine_pool'  uuid: 8f363c7d-2546-4655-b81b-744e06336b07
	Total devices 4 FS bytes used 32.57GiB
	devid    3 size 149.05GiB used 18.03GiB path /dev/sdd
	devid    4 size 0.00B used 2.00GiB path /dev/sda
	devid    5 size 149.05GiB used 24.03GiB path /dev/mapper/luks-d36d39ea-c0b3-4355-b0c5-bd3248e6bbfe
	devid    6 size 149.05GiB used 24.00GiB path /dev/mapper/luks-d7524e90-4d9e-4772-932f-d1407b6b5fe7
```
As can be seen, devid 4 is having its pool usage reduced (from 5.00GiB to 2.00GiB) between runs. In the above example the disk removal completed successfully, however there was never a UI indication of its 'in progress' nature, or any record of a balance having taken place at that time.
For reference, Noggin's forum thread is suspected of indicating the same behaviour as my observations during final testing of pr #1716, which also led to the creation of this issue (details of the preceding steps are available in that pr):
https://forum.rockstor.com/t/cant-remove-failed-drive-from-pool-rebalance-in-progress/3319
where a 3.8.16-16 (3.9.0 iso install) version exhibited the same behaviour (pre #1716 merge). 
 
which in turn links to the following forum thread:

[Please complete the below template with details of the problem reported on your Web-UI. Be as detailed as possible. Community members, including developers, shall try and help. Thanks for your time in reporting this issue! We recommend purchasing commercial support for expedited support directly from the developers.]

In effect you just have to wait for the initial 'remove disk' internal balance to finish. Your system should then return to normal function as-is. It was a non-trivial task for Rockstor's Web-UI to track these 'internal' disk removal events, but this is now accomplished in versions 3.9.2-49 and later. Our last legacy CentOS testing channel release, however, predates this improvement.
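Should you wish to confirm that the removal is still progressing while you wait, re-running the command quoted in the issue above (the pool path is again only an example) should show the departing device's 'used' figure falling until the device drops out of the listing:
```
# Refreshes every 60 seconds; completion is indicated by the device
# disappearing from the output.
watch -n 60 btrfs fi show /mnt2/your_pool_name
```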
Hope that helps.