Can't remove disk even though I have enough space

juchong · October 9, 2018, 5:51am

Hi, I’m trying to remove a failing disk from my storage array (RAID 1), but I keep getting the error below:

        Traceback (most recent call last):
          File "/opt/rockstor/eggs/gunicorn-19.7.1-py2.7.egg/gunicorn/workers/sync.py", line 68, in run_for_one
            self.accept(listener)
          File "/opt/rockstor/eggs/gunicorn-19.7.1-py2.7.egg/gunicorn/workers/sync.py", line 27, in accept
            client, addr = listener.accept()
          File "/usr/lib64/python2.7/socket.py", line 202, in accept
            sock, addr = self._sock.accept()
        error: [Errno 11] Resource temporarily unavailable

The error states that I don’t have enough room to remove a drive, but it looks like I do. Here’s the error text:
Removing disks ([u'ata-WDC_WD60EFRX-68MYMN1_WD-WX11D74RH1NA']) may shrink the pool by 5905580032 KB, which is greater than available free space 19137753968 KB. This is not supported.

phillxnet · October 9, 2018, 6:53pm

@juchong Hello again.

From your screen grab you look to be running the latest stable channel release. It’s always best to state the version you are running, however, just in case we can’t see it.

We do have an outstanding issue on this ‘play safe’ calculation and although the message indicates one space value the pie chart is giving 4.12 TB free. It may be that this message is keying from one free space value and displaying another. The issue for reference is:

https://github.com/rockstor/rockstor-core/issues/1918

The plan there is to base the check on the space used on the device rather than on the entire size of the device, i.e. the check currently counts the full 5.5 TB device when only 4.2 TB is free (from the pie chart).
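To make the difference concrete, here is a minimal sketch of the two rules side by side. It is not Rockstor’s code: the function names and the numbers are hypothetical, and it ignores raid-level details such as RAID 1 needing room for both copies.

```python
def size_based_check(device_size_kb, pool_free_kb):
    """Conservative rule: the whole device size must fit in the free space."""
    return device_size_kb <= pool_free_kb


def usage_based_check(device_allocated_kb, pool_free_kb):
    """Planned rule: only the data actually allocated on the device must fit."""
    return device_allocated_kb <= pool_free_kb


# Illustrative values only, not taken from the pool in this thread.
device_size_kb = 6000
device_allocated_kb = 4000
pool_free_kb = 5000

print(size_based_check(device_size_kb, pool_free_kb))         # False
print(usage_based_check(device_allocated_kb, pool_free_kb))   # True
```

With the same free space, the size-based rule refuses the removal while the usage-based rule allows it, which is the kind of false refusal the issue describes.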

The issue references the code concerned but it’s essentially here:

https://github.com/rockstor/rockstor-core/blob/master/src/rockstor/storageadmin/views/pool.py#L563-L572

                  pool.raid, new_raid
              )
              handle_exception(Exception(e_msg), request)

          if new_raid == "raid10" and num_total_attached_disks < 4:
              e_msg = (
                  "A minimum of 4 drives are required for the "
                  "raid level: raid10."
              )
              handle_exception(Exception(e_msg), request)

You could, for the time being, carefully alter that code so that the check doesn’t trigger, if you are certain you have enough free space to perform this action. I.e. you could comment out those lines (adding a “#” to the start of each), being careful to maintain the indentation (Python is very fussy that way). The check will then not be performed at all once a reboot or Rockstor service restart has picked up the code change.

I’ve added your report to that issue and hopefully someone will step up to that one soon.

I’d also confirm the actual pool usage, given the conflicting values in the pie chart and the error dialog:

btrfs fi usage /mnt2/MAIN

Hope that helps and do post your usage figures as we may be looking at a different issue / bug here.

Thanks again for the report and let us know how it goes.

juchong · October 10, 2018, 4:57am

Thanks for the info Phil, but it looks like we’ve gone from bad to worse (see error below). I should add that I opted to clear out more space instead of modifying files:

juchong · October 10, 2018, 5:00am

Closing that error window out and attempting to remove the drive again yields the following:

        Traceback (most recent call last):
          File "/opt/rockstor/src/rockstor/rest_framework_custom/generic_view.py", line 41, in _handle_exception
            yield
          File "/opt/rockstor/src/rockstor/storageadmin/views/pool.py", line 581, in put
            resize_pool(pool, dnames, add=False) # None if no action
          File "/opt/rockstor/src/rockstor/fs/btrfs.py", line 365, in resize_pool
            return run_command(resize_cmd)
          File "/opt/rockstor/src/rockstor/system/osi.py", line 121, in run_command
            raise CommandException(cmd, out, err, rc)
        CommandException: Error running a command. cmd = /sbin/btrfs device delete /dev/disk/by-id/ata-WDC_WD60EFRX-68MYMN1_WD-WX11D74RH1NA /mnt2/MAIN. rc = 1. stdout = ['']. stderr = ["ERROR: error removing device '/dev/disk/by-id/ata-WDC_WD60EFRX-68MYMN1_WD-WX11D74RH1NA': add/delete/balance/replace/resize operation in progress", '']

I suspect something is happening in the background, but now I have no way of tracking progress.

juchong · October 10, 2018, 5:11am

Trying to start a balance using the terminal yields a balance that ends immediately.

But it looks like there’s still data present.

juchong · October 10, 2018, 5:37am

Update: The web gui refuses to show the “Disks” or “Pools” pages and just threw an unknown error. Not sure whether something is going on in the background, but this post from 2017 seems to indicate that this is normal. I would’ve thought that this issue would have been taken care of sooner rather than later, but maybe that’s not the case?

phillxnet · October 10, 2018, 9:03am

@juchong So it does look like you have progress on this one as:

and your cited image containing the:

“Unknown internal error doing a PUT to /api/pools/9/remove”

is, as you later surmised, a known issue which has recently received some updates that pin down exactly why it’s happening (see my latest comments) in the following issue:

github.com/rockstor/rockstor-core

pool resize disk removal unknown internal error and no UI counterpart

opened 06:19PM - 01 Jun 17 UTC

closed 02:10PM - 09 Jul 19 UTC

phillxnet

Thanks to forum member Noggin for highlighting this behaviour. Occasionally when… removing a disk from a pool there can be a UI time out directly after the last dialog entitled "Resize Pool / Change RAID level for ..." which acts as last confirmation of the configured operation:

![harmless-put-timeout-on-dev-remove](https://cloud.githubusercontent.com/assets/2521585/26693598/a6a7f03a-46fc-11e7-80d2-f45df1f32e28.png)

There is then no UI 'balance' indicated while the removal is in progress, yet the UI indicates that a balance is in progress when a balance is attempted (only attempted by Noggin as I did not attempt to execute a balance whilst the removal was in progress).

```
btrfs balance status /mnt2/time_machine_pool/
No balance found on '/mnt2/time_machine_pool/'
```

The pool resize is however indicated by the requested disks having their size 'demoted' to zero and showing a reduced usage with subsequent executions of **btrfs fi show**:

```
Label: 'time_machine_pool'  uuid: 8f363c7d-2546-4655-b81b-744e06336b07
	Total devices 4 FS bytes used 31.57GiB
	devid    3 size 149.05GiB used 17.03GiB path /dev/sdd
	devid    4 size 0.00B used 5.00GiB path /dev/sda
	devid    5 size 149.05GiB used 23.03GiB path /dev/mapper/luks-d36d39ea-c0b3-4355-b0c5-bd3248e6bbfe
	devid    6 size 149.05GiB used 23.00GiB path /dev/mapper/luks-d7524e90-4d9e-4772-932f-d1407b6b5fe7
```

and then later on:

```
Label: 'time_machine_pool'  uuid: 8f363c7d-2546-4655-b81b-744e06336b07
	Total devices 4 FS bytes used 32.57GiB
	devid    3 size 149.05GiB used 18.03GiB path /dev/sdd
	devid    4 size 0.00B used 2.00GiB path /dev/sda
	devid    5 size 149.05GiB used 24.03GiB path /dev/mapper/luks-d36d39ea-c0b3-4355-b0c5-bd3248e6bbfe
	devid    6 size 149.05GiB used 24.00GiB path /dev/mapper/luks-d7524e90-4d9e-4772-932f-d1407b6b5fe7
```

As can be seen, devid 4 is having its pool usage reduced (from 5 to 2 GiB) between runs.

In the above example the disk removal completed successfully; however, there was never a UI indication of its 'in progress' nature or any record of a balance having taken place at that time. Reference to Noggin's forum thread, suspected as indicating the same as my observations in final testing of pr #1716, which also led to this issue's creation (details of the preceding steps are available in that pr): https://forum.rockstor.com/t/cant-remove-failed-drive-from-pool-rebalance-in-progress/3319 where a 3.8.16-16 (3.9.0 iso install) version exhibited the same behaviour (pre #1716 merge).

The code essentially times out when waiting for a device removal (missing or otherwise) and so ends up not being able to update the database appropriately, which in turn throws a number of things out. The behaviour should self-correct once you are able to refresh the Disks / Pools pages. Part of that database update, however, is establishing the new state of affairs and, as you also surmised later on, that is still in flux (it can take hours for a drive to be removed from a pool).

Disk removal (existing or missing) kicks off a balance internal to btrfs, in which the data that was on that disk is redistributed, according to the raid level, to the ‘to be remaining’ disks. But this is not a balance that, at least currently, can be read from a ‘btrfs balance status’ command; hence one of the difficulties in this arrangement. I do have plans to abstract a status summary from the following command and surface that in the Web-UI:

btrfs dev usage /mnt2/pool-name-here

The thing to look for is a negative Unallocated value that should change over time. For example, in the case of removing a missing device we have the following:

btrfs dev usage /mnt2/rock-pool/
/dev/sda, ID: 1
   Device size:           465.76GiB
   Device slack:              0.00B
   Data,RAID1:             33.00GiB
   Metadata,RAID1:          1.00GiB
   Unallocated:           431.76GiB

/dev/sdb, ID: 2
   Device size:           465.76GiB
   Device slack:              0.00B
   Data,RAID1:             34.00GiB
   Metadata,RAID1:          1.00GiB
   System,RAID1:           32.00MiB
   Unallocated:           430.73GiB

missing, ID: 3
   Device size:               0.00B
   Device slack:              0.00B
   Data,RAID1:             15.00GiB
   System,RAID1:           32.00MiB
   Unallocated:           -15.03GiB

I.e. the Device size is updated to 0.00B and its allocated data is, by the internal balance, redistributed bit by bit.
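As a rough idea of how that negative Unallocated ‘flag’ could be picked out of the command output, here is a minimal Python 3 sketch. The function names are hypothetical and this is not Rockstor code. It assumes the layout shown above (a `name, ID: n` header followed by indented fields), which may differ between btrfs-progs versions.

```python
import re

# Binary unit multipliers as printed by btrfs-progs in its default output.
UNITS = {"B": 1, "KiB": 1024, "MiB": 1024 ** 2, "GiB": 1024 ** 3, "TiB": 1024 ** 4}


def to_bytes(text):
    """Convert a size string such as '-15.03GiB' or '0.00B' to bytes (float)."""
    match = re.fullmatch(r"(-?\d+(?:\.\d+)?)([KMGT]iB|B)", text.strip())
    if match is None:
        raise ValueError("unrecognised size: %r" % text)
    return float(match.group(1)) * UNITS[match.group(2)]


def unallocated_by_device(dev_usage_output):
    """Map each device name to its Unallocated value in bytes."""
    result = {}
    current = None
    for line in dev_usage_output.splitlines():
        header = re.match(r"^(\S+), ID: \d+", line)
        if header:
            current = header.group(1)
            continue
        field = re.match(r"^\s+Unallocated:\s+(\S+)", line)
        if field and current is not None:
            result[current] = to_bytes(field.group(1))
    return result


def devices_being_removed(dev_usage_output):
    """Return the devices whose Unallocated value is negative."""
    usage = unallocated_by_device(dev_usage_output)
    return [name for name, value in usage.items() if value < 0]


# Abbreviated copy of the output quoted in this post.
SAMPLE = """\
/dev/sda, ID: 1
   Device size:           465.76GiB
   Unallocated:           431.76GiB

missing, ID: 3
   Device size:               0.00B
   Unallocated:           -15.03GiB
"""

print(devices_being_removed(SAMPLE))  # ['missing']
```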

So we have 2 elements to this. First, run our ‘btrfs device delete’ asynchronously (as we already do with balance operations); see the caveat in the recently merged code by way of the following pull request:

https://github.com/rockstor/rockstor-core/pull/1971

And secondly, develop a way to surface the progress of these internal balance operations, for which there are also plans that include surfacing the usage of each device within a pool.
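To illustrate the first element in general terms, the sketch below starts a long-running command without waiting for it and checks on it later. It uses only Python’s standard `subprocess` module with hypothetical helper names. It is an assumption about one possible shape, not how Rockstor runs its balance or delete tasks.

```python
import subprocess
import sys


def start_in_background(cmd):
    """Start cmd and return immediately with a handle to the running process."""
    return subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE)


def exit_code_if_finished(proc):
    """Return the exit code once the process has ended, or None while it runs."""
    return proc.poll()


# Stand-in command so the sketch runs anywhere; a real caller would pass
# something like ["btrfs", "device", "delete", device_path, mount_point].
proc = start_in_background([sys.executable, "-c", "print('removal finished')"])

# A web request handler would return here and poll later instead of blocking.
out, err = proc.communicate()
print(exit_code_if_finished(proc), out.decode().strip())
```

The point of the design is that the web request returns straight away, so a removal that takes hours no longer trips the request timeout.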

And from your second attempt to remove the device you see that it is actually in operation:

To that I would try the aforementioned:

btrfs dev usage /mnt2/pool-name-here

and then again a little while later to see what has changed. I suspect you will see the negative Unallocated space reducing against your ‘in progress’ resize / dev delete operation.
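Two such readings, taken some time apart, can be turned into a rough rate and time-to-finish estimate. The sketch below is only an illustration: the function name and the readings are hypothetical, and it assumes the negative Unallocated value shrinks towards zero at a steady pace, which a real removal may not do.

```python
def removal_progress(earlier_gib, later_gib, elapsed_seconds):
    """Estimate progress from two negative Unallocated readings given in GiB.

    Returns None when no progress can be measured between the readings.
    """
    remaining_before = -earlier_gib  # data still on the device at the first reading
    remaining_now = -later_gib       # data still on the device at the second reading
    moved = remaining_before - remaining_now
    if moved <= 0 or elapsed_seconds <= 0:
        return None
    rate = moved / elapsed_seconds   # GiB relocated per second
    return {
        "moved_gib": moved,
        "rate_gib_per_s": rate,
        "eta_seconds": remaining_now / rate,
    }


# Hypothetical readings: -15.03GiB, then -12.03GiB ten minutes later.
estimate = removal_progress(-15.03, -12.03, 600)
print(round(estimate["moved_gib"], 2), round(estimate["eta_seconds"]))
```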

It’s a tricky one as one would expect this internal balance to show in a command line btrfs balance status command but it doesn’t. However if one tries:

Because there is one in progress, it’s just an internal one. Oh well, hopefully this will improve over time, but in the meantime we should be able to surface the negative Unallocated device ‘flag’ value and use that.

Give it time to complete that existing device delete / resize operation and then get back to us as the Web-UI should then be able to gather what it needs and correct itself.

And incidentally, the cited pool resize issue is in turn linked in the forum post you correctly cited, and in fact grew out of that and my own experience dealing with a disk management issue early on.

Hope that helps, and yes, this is a very rough edge. But if you take a look at the cited pull request #1700 above and its in turn cited “Fixed” labels, you will see that we are creeping up on these last rough-edge bugs. We do have to prioritise carefully. For instance, the pool devices table that informed you of your pool errors seemed like it should come first. It’s always a juggle as to what order to do things in, but we are getting closer to managing the error states and providing guidance.

This time out issue when removing devices is ‘in my queue’ but I have a couple of other more commonplace issues to address first. We do at least know its cause and a possible solution. However, as stated, the difficulty in identifying this internal balance / resize operation does complicate things, as we now need an additional monitoring sub-system to get around the fact that a btrfs balance status will not tell us its state, yet subsequent balance operations are blocked. We will see if this inconsistency is rectified going forward with our newer kernel once that is in place. And we are in the throes of rebasing onto a distro that has more of an interest in btrfs (mid term, that one).

Thanks again for reporting your findings and do remember that the output requested for various commands can help to speed up the development of issues referenced, ie you didn’t post your prior pool usage command output that may have helped to identify if you were actually experiencing the same issue suspected as the cause or another one. We depend in part on user reports such as yours to make things better so do try and respond with the information requested as it should help to make Rockstor better for all users as we go.

juchong · October 25, 2018, 4:25am

Hi,

I eventually did discover that the balance was happening in the background very slowly. I ended up stopping the balance operation and decided to move all of my data to external drives. I then killed the old array and created a new one using Rockstor’s default configuration. At the time, I had disk quotas and compression enabled which likely was the cause of my very slow balance. 24 hours into the balance process, Rockstor had only processed about 1TB worth of data (on a 22TB array) which is why I opted to lean the array as much as possible.

All is well after re-building the array and re-populating the data.

phillxnet · October 25, 2018, 7:22pm

@juchong Thanks for the update and glad you got things sorted in the end.

Yes, quotas can slow things down quite considerably, especially as the snapshot / subvol count increases, and of course as the amount of data to track increases.

But there is constant work going on upstream for the quotas so hopefully things will only improve.

You could also have disabled quotas mid balance, as an option.