Added Disk - Rebalance doesn't seem to be happening

HarryHUK · December 2, 2020, 11:41pm

Brief description of the problem

Hi again. I’m replacing another disk (4TB) with a larger one (8TB) as it has become available.
Over 90 minutes after the rebalance started, nothing seems to be happening. I know it’s going to take a while, because I’ve already replaced one.

Anyway, nearly 2 hours after starting, 0% progress, and the new drive is still showing 1GB allocated. The new pool size is showing correctly though.

Detailed step by step instructions to reproduce the problem

While shut down, physically added ZA1FZPBY as /dev/sdc, having moved the 4TB drive it will replace (Z306JZ39) to /dev/sde.
Wiped the new disk, as last time, but this time on the Rockstor box, not my Ubuntu desktop (fdisk to remove the partitions, then wiped the disk from the Assign role page, under Disks).
In Pools, Resize/Re-raid, Add disk, select the new drive and Submit

(The last time, I added the 8TB (WKD391VC) to /dev/sde first and, after removing the 4TB it replaced and rebalancing, shut down and moved the new drive to the old drive’s position. From this I figured that the drives are identified by their serial number, not by which file in /dev they are mapped to… maybe I was wrong… but everything started fine before adding the disk to the pool.)

Web-UI screenshots

Both the existing 8TB drive (WKD391VC, not the one being replaced, but the previous one added) and the new one (ZA1FZPBY) are showing unknown power status.

New disk still only has 1 GB allocated.

Balance is running, still 0% progress

See how quiet the disks are. I would expect a flurry of writing going on.

ssh’d in. btrfs fi us (short for btrfs filesystem usage) shows the new drive, /dev/sdc, has no data and 1GB of metadata (the other 8TB drive, /dev/sdd, has 8GB of metadata).
Existing data seems not to have budged a bit.
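
The check described above can be repeated from a root shell with something like the following minimal sketch. The mount point /mnt2/mypool is a hypothetical example (it assumes the pool is mounted under /mnt2/ with its pool name), and pool_state is just an illustrative helper name, not part of Rockstor or btrfs-progs.

```shell
# Hypothetical mount point: substitute the real pool mount point.
POOL_MNT="${POOL_MNT:-/mnt2/mypool}"

# Print per-device allocation, then whatever the balance status reports.
# $1 = pool mount point
pool_state() {
    btrfs filesystem usage "$1"   # long form of "btrfs fi us"
    btrfs balance status "$1"
}

# Usage (as root): pool_state "$POOL_MNT"
```

Running it twice a few minutes apart and comparing the per-device figures is a simple way to tell whether data is actually moving onto the new disk.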

Error Traceback provided on the Web-UI

What’s the best way to deal with a hung re-balance?
(Fortunately I have backup copies of the data…)

I’ve tried a btrfs balance pause, and that just hung.

HarryHUK · December 3, 2020, 12:54am

Update

As btrfs balance pause, and btrfs balance cancel both hung, I tried shutting down, but that seemed to be hung on the balance that wasn’t progressing. So…
power button with extreme prejudice.

Restarted, and the logs show that it noticed things weren’t right, and seems to be fixing sda at the moment. And data seems to be moving onto the new disk.
I’m going to leave it overnight.

HarryHUK · December 3, 2020, 10:09am

Things always look better in the morning

See the difference in disk IO and CPU.

I haven’t found anything in the logs yet for the failed balance start cmd, but I’m positive things were stalled last night. I really tried to look for signs of life.

I did find this thread on Reddit. One user says a single block can take a long time, and to wait up to an hour after cancel. The OP there waited several.
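
Given that advice to wait, one option is to sample the balance status at intervals from a second shell rather than power-cycling straight away. The sketch below is only an illustration: it assumes the status output contains a line ending in "N% left" (other btrfs-progs versions may word it differently), and balance_left, watch_balance and /mnt2/mypool are hypothetical names.

```shell
# Extract N from a line ending in "N% left", read from stdin.
balance_left() {
    sed -n 's/.*[^0-9]\([0-9][0-9]*\)% left.*/\1/p'
}

# Sample the balance status repeatedly.
# $1 = pool mount point, $2 = seconds between samples, $3 = number of samples
# Returns 0 as soon as no "% left" figure is reported, 1 if samples run out.
watch_balance() {
    mnt=$1
    interval=${2:-60}
    tries=${3:-60}
    while [ "$tries" -gt 0 ]; do
        left=$(btrfs balance status "$mnt" | balance_left)
        if [ -z "$left" ]; then
            echo "no running balance reported on $mnt"
            return 0
        fi
        echo "$(date '+%H:%M:%S') ${left}% left"
        tries=$((tries - 1))
        sleep "$interval"
    done
    return 1
}

# Usage (as root): watch_balance /mnt2/mypool 60 60
```

If the figure has not changed after an hour of samples, that is at least some evidence of a real stall rather than one slow block.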

Just given me an idea for a dashboard widget to show task progress (e.g. balance, scrub) and maybe a ‘top’ widget.

Anyway, not much to go on, really, but now that it’s moving, I’m happy, so we can leave it.
Just sharing my experience as a new user (who is learning).

Question

Would it be worth upgrading to the latest 4.0 RC? Is it at least as stable as 3.9.2-57 on 4.12 kernel?

phillxnet · December 3, 2020, 10:47pm

@HarryHUK Thanks for the detailed report and status update.
Re:

Take a look at my recent response to a similar question from @Noggin:

And the following forum thread references the changes made from 3.9.2-57 to 4.0.4:

Hope that helps.

HarryHUK · December 4, 2020, 9:55am

Many thanks for your reply. The disk add is nearly done now,
and after that I’ll remove the 4TB disk. Will I still need to scrub, given I will have done 2 re-balances?

(I hope my issue wasn’t hardware… it hasn’t crashed)

Will have a look at 4.0 soon.

Noggin · December 4, 2020, 2:21pm

FYI, I did several rebalances when swapping out disks and when changing from RAID10 to RAID6. Disabling quotas made everything go much faster.
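
For anyone wanting to try the quota tip from the command line, here is a minimal sketch. It assumes a hypothetical pool mount point of /mnt2/mypool, and with_quotas_disabled is an illustrative helper name, not a Rockstor or btrfs-progs command. It turns quotas off, runs whatever command is given, then turns them back on. Whether re-enabling straight away is desirable is a judgment call, since it may trigger a quota rescan.

```shell
# Run a command with btrfs quotas disabled on a pool, then re-enable them.
# $1 = pool mount point, remaining arguments = command to run
with_quotas_disabled() {
    mnt=$1
    shift
    btrfs quota disable "$mnt" || return 1
    "$@"
    rc=$?                        # remember the wrapped command's exit status
    btrfs quota enable "$mnt"
    return "$rc"
}

# Usage (as root):
#   with_quotas_disabled /mnt2/mypool btrfs balance start /mnt2/mypool
```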