Brief description of the problem
Can’t remove a device from the pool due to a rebalance in progress, but I don’t think a rebalance is actually in progress.
One of my hard drives has failed. I have a RAID 1 setup containing 6x2TB and 2x6TB drives. I'm attempting to remove 2 of the 2TB drives: the failed one and one other, to keep the RAID configuration happy. When I attempt to do that, I get the error message shown below. However, if I run "btrfs balance status /mnt2/main_pool" at the bash prompt, I see "No balance found on '/mnt2/main_pool'".
My pool has been mounted using the command “mount -t btrfs -o degraded /dev/sdb /mnt2/main_pool”
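For reference, a read-only check (a sketch, using the mount point above) to confirm the degraded option actually took effect:

[code]# the options field for the pool should include "degraded"
grep main_pool /proc/mounts[/code]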
From the last time I performed a balance, I recall that it didn't handle progress indication well. However, I think it just said it was at 0% progress until it completed and then jumped to 100%, so this doesn't match what I recall happening in the past.
Should I unmount and remount the pool and try again? Should I reboot? I’m just worried about interrupting a rebalance and causing everything to fail.
Detailed step by step instructions to reproduce the problem
@Noggin Hello again.
This is something that needs to be improved, but it is known behaviour. When you remove a disk it can look like a balance is in progress, and in effect it is, but more accurately it's a 'btrfs device delete'; and yes, 'btrfs balance status' indicates no balance. However, I think if you execute the following command:
btrfs fi show
you will hopefully see that the size of the removed disks is 0 and the used is decreasing from one run of the command to another (if given enough time between runs).
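A minimal way to do that re-running, assuming the watch utility is available:

[code]# refresh btrfs fi show every 10s; the 0-sized devid's 'used' figure should shrink between runs
watch -n 10 btrfs fi show[/code]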
I came across this behaviour just recently and meant to open an issue, but I haven't had time to pin down exactly what's happening yet. It seems that sometimes (I'm not sure how often) a pool resize involving removed devices is misinterpreted as a balance; I think the code actually tries to kick off a balance directly after the resize, and sometimes it sticks and sometimes it doesn't. Sorry to be so vague, but I haven't had time to look at exactly what happens. However, it is definitely something I've seen.
Let us know if you also see this decreasing used on the 0 sized devices.
The example I noted for the issue I have yet to open on this behaviour was as follows:
[code]Label: 'time_machine_pool'  uuid: 8f363c7d-2546-4655-b81b-744e06336b07
        Total devices 4 FS bytes used 31.57GiB
        devid    3 size 149.05GiB used 17.03GiB path /dev/sdd
        devid    4 size 0.00B used 5.00GiB path /dev/sda
        devid    5 size 149.05GiB used 23.03GiB path /dev/mapper/luks-d36d39ea-c0b3-4355-b0c5-bd3248e6bbfe
        devid    6 size 149.05GiB used 23.00GiB path /dev/mapper/luks-d7524e90-4d9e-4772-932f-d1407b6b5fe7[/code]
and then shortly after:
[code]Label: 'time_machine_pool'  uuid: 8f363c7d-2546-4655-b81b-744e06336b07
        Total devices 4 FS bytes used 32.57GiB
        devid    3 size 149.05GiB used 18.03GiB path /dev/sdd
        devid    4 size 0.00B used 2.00GiB path /dev/sda
        devid    5 size 149.05GiB used 24.03GiB path /dev/mapper/luks-d36d39ea-c0b3-4355-b0c5-bd3248e6bbfe
        devid    6 size 149.05GiB used 24.00GiB path /dev/mapper/luks-d7524e90-4d9e-4772-932f-d1407b6b5fe7[/code]
Note devid 4's used value decreasing (from 5.00GiB to 2.00GiB in the two runs above). Shortly after that the device was successfully removed, during which there was also no balance indicated.
Yes this has been improved some since then via pr:
Thereafter the current balance state is reflected, but not live: you still have to refresh the browser; so there are improvements to be had, but at least it's only a browser refresh currently. But of course this only works if the command line 'btrfs balance status' also indicates a balance, as that is what is run in the background to attain balance progress info.
If you try the above command and you see the in-progress removed (0 sized) devices having their used space decreasing, then I think it's definitely best to leave them to it until they are finally removed from the pool. Essentially all is well, but we just don't have an indication of this removal happening. Odd really, but stick with it and let us know how it goes.
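Roughly, the cross-check while such a removal is in flight looks like this (a sketch using the pool path from this thread):

[code]# balance status can report nothing even while a device removal is relocating data
btrfs balance status /mnt2/main_pool   # e.g. "No balance found on '/mnt2/main_pool'"
# so cross-check with fi show: a devid at size 0.00B whose 'used' shrinks from run to run is being removed
btrfs fi show /mnt2/main_pool[/code]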
I'll open the issue, as I should have done previously, and update this thread accordingly.
@Noggin I meant to ask: when you kicked off this disk removal via the UI, did you receive a UI error to the effect of a red message:
"Houston, we've had a problem.
Unknown internal error doing a PUT to /api/pools/pool-name/remove"
?
I DID receive an error; I don't recall what it said though. Would it still be in the logs? I looked there, but I'm having another issue with Sonarr that is packing the logs with errors, so I didn't spend much time searching for it.
From memory, here’s what happened:
1. I attempted to remove the failed drive and another from the pool.
2. An error reported that the failed drive didn't exist.
3. I found the thread where you suggested to someone else to rescan then try again.
4. I rescanned and tried again.
5. This time, if I remember correctly, it hung for a relatively long time showing me a white popup box containing a table. The left side of the table had the current state of the pool while the right side showed the pool after removing the disks. There was a "confirm" or a "next" button on the bottom right that initiated the procedure.
6. After hanging for maybe 3 minutes, I think I got an error? I don't remember what it said, but it was NOT the same error as I posted about at the top of this thread.
7. I tried again.
8. It didn't hang, but an error came back quickly. The error message was blank. I got the "Houston, we've had a problem" prompt, but the actual error was blank.
9. I tried again and got the error listed in my initial post of this thread.
The above happened last night. I tried again this morning and got the same error as the initial post of this thread. We’ve since determined that although not ideal, this is the expected result at this time.
The errors presented in steps 6 and 8 above might be swapped. In other words, the error at step 6 might have been blank and the error in step 8 might have had text in it.
@Noggin OK, yes, I think this is consistent with my experience; yours was just complicated by the additional errors. No worries, I've documented what I've found and it very much looks like you have run into the same issue, only with additional retries etc.
Linking to the issue I’ve opened up to at least address the disk removal time out / lack of UI representation whilst in progress:
This would definitely be a nice one to get sorted so thanks for reporting it and consequently reminding me of my prior notes to open an issue for it. I’ve made a note in the above issue to update this thread once it’s sorted.
@Noggin I've noticed that you are removing a single device from a pool that has 7 devices attached but thinks it should have 8; hence your stated requirement to mount degraded. Once this single disk removal is done you should have a pool that has 6 disks attached but thinks it should have 7. Once the current operation is over, check via "btrfs fi show" whether it still thinks it has a missing disk (which is my suspicion), in which case you will then need a:
btrfs device delete missing /mnt2/main_pool
In order to return things to a fully healthy state. Please see the outstanding issue for this:
as there is no UI way to accomplish this.
Tread carefully here as you may only have this one mount to get this pool into a healthy state.
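Sketching the sequence suggested above in one place (same commands, just in order; the delete can run for many hours):

[code]# 1) once the current removal has finished, confirm the pool still reports a missing member
btrfs fi show /mnt2/main_pool          # look for "*** Some devices missing"
# 2) if it does, rebuild redundancy for the missing member's chunks onto the remaining drives
btrfs device delete missing /mnt2/main_pool[/code]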
The pool resize appears to have completed successfully. I deleted the detached drive from the UI before seeing your post here, but I then went to the CLI and issued the command as you specified. My SSH session was disconnected, but when I reconnected it looked like the delete was still occurring. It appears to be working, just slowly.
I’m assuming that it might take a few hours to complete.
Is it expected to take this long to delete a detached drive? I’ve already removed the detached device as well as another 2TB drive. That took about 14 hours. Deletion of the detached drive has been going for 16 hours now. It seems like it should have already been pretty much removed at this point and that deleting it shouldn’t be much more than a little housekeeping.
If you are referring to a device named detached- within the UI and your removal was via the bin icon, then that does nothing on the btrfs front but simply removes it from Rockstor's db. Hence my suggestion, and double-check, to do the delete missing command to bring things up to scratch again.
It could well take longer than removing an existing device as you did earlier. Is there evidence of progress, i.e., as before when you were removing an existing device, can you see the usage changing on the remaining devices? Essentially it has to work backwards to find what was on the missing device and re-create it from the other copy on the remaining devices. Pretty sure this will make it slower than removing an existing device.
It’s a little more than that really as it has to return the appropriate redundancy to all chunks / blocks that were previously stored on the now missing device.
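A rough way to see that redundancy being rebuilt (a sketch, assuming watch is available): the per-device 'used' totals on the remaining drives should creep upwards while the delete missing runs.

[code]# the missing device never appears in this output, so progress shows up indirectly
# as growing 'used' values on the remaining devids
watch -n 60 btrfs fi show /mnt2/main_pool[/code]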
Interesting. I figured that since I removed it via the UI by resizing the pool that it had already done that.
btrfs fi show
[code]Every 10.0s: btrfs fi show Fri Jun 2 11:43:23 2017
Label: 'rockstor_rockstor' uuid: 5c2b67f2-9f3a-4e52-b979-32e9dc159423
Total devices 1 FS bytes used 1.67GiB
devid 1 size 51.57GiB used 4.04GiB path /dev/sda3
Label: 'main_pool' uuid: 07f188d4-44d6-49fd-86ac-0380734fe1d2
Total devices 7 FS bytes used 6.15TiB
devid 2 size 1.82TiB used 849.00GiB path /dev/sdb
devid 4 size 1.82TiB used 848.00GiB path /dev/sdd
devid 5 size 1.82TiB used 848.00GiB path /dev/sde
devid 6 size 1.82TiB used 771.03GiB path /dev/sdg
devid 7 size 5.46TiB used 4.47TiB path /dev/sdh
devid 8 size 5.46TiB used 4.47TiB path /dev/sdf
*** Some devices missing[/code]
The sizes aren’t changing, at least not significantly.
Yesterday, at around 5 PM I issued the command “btrfs device delete missing /mnt2/main_pool”, this was after resizing the pool from the UI to remove the detached device and another 2TB drive.
Just now, I issued the "btrfs device delete missing /mnt2/main_pool" command again, and it started doing something, based on the fact that it didn't immediately return to the prompt. I opened a new SSH window and ran the same command, and it gave me an error stating that a deletion was in progress. Yesterday, when I did the delete missing command and did it again shortly thereafter, it gave me the same error message that a deletion was in progress.
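Since the SSH session dropping mid-delete came up earlier, one hedged precaution for a command this long-running (assuming screen is installed; tmux or nohup would serve the same purpose):

[code]# keep the long-running delete in a re-attachable session rather than on the bare SSH tty
screen -S pool-delete
btrfs device delete missing /mnt2/main_pool
# detach with Ctrl-a d; re-attach later with: screen -r pool-delete[/code]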
Yes, that was for the devid 3 device (at the time /dev/sdc), which has now gone, so we are at:
and the delete missing is our attempt to address just the missing device, which would appear to be devid 1 (of the main_pool) and which has never featured in the btrfs fi show output here (presumably the fully dead drive).
Rockstor’s UI can’t do a removal of a device that is detached (or missing in btrfs fi show parlance). Hence the suggestion to use the delete missing. This I think is where we differ in what’s happened. Presumably you tried to re-size by removing a device called detached, and presumably this accounts for one of the errors you received. That operation essentially did nothing. Only the resize involving the devid 3 device was effective and kicked off the first and only resize to date. I’m also not quite sure what you meant by the
But that was already underway when this thread was started. Btrfs raid 1 can work with an odd number of drives!
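If it helps to double check that, the active profiles are visible from the command line (a sketch):

[code]# raid1 keeps two copies of every chunk rather than needing an even drive count;
# the profile per block-group type can be confirmed with:
btrfs fi df /mnt2/main_pool    # look for "Data, RAID1" / "Metadata, RAID1" lines[/code]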
Sizes shouldn't change, but hopefully the used column is changing as the copies that were on the missing device are re-created on the remaining attached devices. The size changed on the device that was deleted by name as it was present for its size to be reduced to 0, but we can't see the size of a missing device from the command line.
Oh well, it appears that we are getting there anyway (at least on the devid 3 removal). It may well be that an unmount, rescan, and remount via reboot is going to be necessary; but we must make sure we do all that we can first to get things in order. But 2 attempts to remove a missing device should be enough.
Given the difficulties and strangeness re delete missing here you might want to make sure you have this data elsewhere prior to the reboot though, just in case. Especially given you have a successful mount as is.
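If it does come to the unmount/rescan/remount route, a minimal sketch of that cycle (only once the in-flight delete has finished or clearly stalled, and with the backup above in place):

[code]# last resort only, after backing up and after the current delete has stopped
umount /mnt2/main_pool
btrfs device scan                                     # re-enumerate btrfs member devices
mount -t btrfs -o degraded /dev/sdb /mnt2/main_pool   # same degraded mount as before[/code]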
Once I get everything set up to my satisfaction, I'm going to buy a new license for Rockstor instead of restoring my old one, as a way to say, "Thank you!"
I bought 2x 8TB Reds to replace all of my 2TB drives. I'll copy all of my data over to them before I reboot. Most of it is backed up to a 5TB drive already anyway. It's also on ACD through rclone, but as you most likely know, that is pretty much gone.