Error running a command. cmd = /usr/sbin/btrfs device add /dev/disk/by-id/wwn-0x5000c500e2005763 /mnt2/04.NET_8.0TB. rc = 1. stdout = ['']. stderr = ['ERROR: use the -f option to force overwrite of /dev/disk/by-id/wwn-0x5000c500e2005763', '']

phemmy22 · April 6, 2024, 12:31am

ROCKSTOR Version 4.6.1-0

Brief description of the problem

I cannot add more disks to the pool.

Detailed step by step instructions to reproduce the problem

Adding disk to pool

Web-UI screenshot

Error Traceback provided on the Web-UI


  Traceback (most recent call last):
  File "/opt/rockstor/src/rockstor/rest_framework_custom/generic_view.py", line 41, in _handle_exception
    yield
  File "/opt/rockstor/src/rockstor/storageadmin/views/pool.py", line 632, in put
    self._resize_pool_start(pool, dnames)
  File "/opt/rockstor/src/rockstor/storageadmin/views/pool.py", line 356, in _resize_pool_start
    start_resize_pool.call_local(cmd)
  File "/opt/rockstor/.venv/lib/python2.7/site-packages/huey/api.py", line 784, in call_local
    return self.func(*args, **kwargs)
  File "/opt/rockstor/src/rockstor/fs/btrfs.py", line 2059, in start_resize_pool
    raise e
CommandException: Error running a command. cmd = /usr/sbin/btrfs device add /dev/disk/by-id/wwn-0x5000c500e2005763 /mnt2/04.NET_8.0TB. rc = 1. stdout = ['']. stderr = ['ERROR: use the -f option to force overwrite of /dev/disk/by-id/wwn-0x5000c500e2005763', '']

Hooverdan · April 6, 2024, 4:42pm

@phemmy22, does the disk you want to add already have a file system on it? If so, you might want to wipe the entire drive (including all existing partitions and boot sector).
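As a rough illustration of the wipe suggestion, here is a minimal sketch that builds a `wipefs` (util-linux) command for the new disk. The helper name `wipefs_command` and the dry-run default are assumptions made for this example, not anything Rockstor provides. Wiping is destructive, so confirm the device path belongs to the new disk and not to an existing pool member before dropping the dry run.

```python
def wipefs_command(device, dry_run=True):
    """Build a wipefs command that targets all signatures on `device`.

    With dry_run=True the --no-act flag is added, so wipefs only reports
    what it would erase and writes nothing to the disk.
    """
    cmd = ["wipefs", "--all"]
    if dry_run:
        cmd.append("--no-act")
    cmd.append(device)
    return cmd


if __name__ == "__main__":
    # Device path copied from the error message in this thread.
    dev = "/dev/disk/by-id/wwn-0x5000c500e2005763"
    # Print the dry-run command for review; it would need to be run as root.
    print(" ".join(wipefs_command(dev)))
```

Keeping the dry run as the default means the destructive form has to be requested explicitly with `dry_run=False`.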

phemmy22 · April 7, 2024, 9:19am

I have gone through it again, and I don’t think the issue is the disk I am adding.
It looks like I corrupted something in the pool, but I do not know which disk it is because the whole pool reports corruption.

I ran ‘btrfs dev stats -z /mnt2/04.NET_8.0TB’ to reset the cumulative pool errors per device, as suggested in the screenshot, and had a clean pool for a few seconds.

The below shows a clean pool for a few seconds

How can I distinguish which disk is faulty in that pool? The whole pool is not usable, and I can’t add or delete files from the pool.

The network mapping for that drive appears to be filled to the brim but should have 1.21 TB free, as shown below.

Hopefully, it will be resolved without losing the data.

Thank you!

Hooverdan · April 7, 2024, 10:01pm

I have to defer to @phillxnet on this…

I think you only had the “clean” pool because you reset the stats, which only resets the counters for these errors, not the underlying errors themselves.

I am also wondering whether one or more of your disks are actually completely full (which can cause issues under btrfs; less so today, but far more disastrous a few years ago). Have you run automatic or manual balances (and scrubs) in the past? Don’t start any right now, as that might cause more issues if full disks are not the root cause.

If you don’t see any error messages in the pool (beyond the one you have right now for device errors) you can also check out the error logs under each disk (cumbersome, I know, since you have quite a few disks) to see whether any of them are showing reported SMART failures.

EDIT:
Somewhere I did read that one should plan for ~10% of a btrfs pool to be unallocated for it to stay healthy, but I do not remember whether that’s based on anecdotal or other evidence. You can also check at the command line:

btrfs filesystem usage -h /mnt2/04.NET_8.0TB

And free space is not the same as unallocated, as I realized.
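To make the free-versus-unallocated distinction concrete, here is a small sketch that computes the unallocated share of a pool from the output of `btrfs filesystem usage -b <mount>` (raw byte counts). It assumes the "Overall" section prints `Device size:` and `Device unallocated:` lines, each followed by a plain integer. The function name and the sample figures are invented for illustration and are not taken from this pool.

```python
import re

# Matches the two "Overall" lines of interest in raw-byte (-b) output.
FIELD = re.compile(r"^\s*(Device size|Device unallocated):\s+(\d+)\s*$")

# Made-up sample, NOT real output from the pool in this thread.
SAMPLE_USAGE = """Overall:
    Device size:                 8000000000000
    Device allocated:            7900000000000
    Device unallocated:           100000000000
"""


def unallocated_fraction(usage_text):
    """Return 'Device unallocated' divided by 'Device size'."""
    fields = {}
    for line in usage_text.splitlines():
        match = FIELD.match(line)
        if match:
            fields[match.group(1)] = int(match.group(2))
    return fields["Device unallocated"] / fields["Device size"]


if __name__ == "__main__":
    share = unallocated_fraction(SAMPLE_USAGE)
    print("unallocated: {:.2%}".format(share))
```

A result well under the ~10% rule of thumb mentioned above would point towards the pool being short of unallocated space, even if the reported free space looks healthy.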

Finally, do you have a lot of snapshots on your system? When looking at the list, ignore the ones related to Rockons, as there’s a special treatment for docker containers and their layers on a btrfs system.

And does your system use ECC memory? Memory can oftentimes be the culprit for corruption as well.

phemmy22 · April 8, 2024, 4:51am

The output for btrfs filesystem usage -h /mnt2/04.NET_8.0TB is below

Disk /dev/sdbc is possibly the corrupted disk from the last two screenshots below.

I am not able to run Scrub on Pool, as shown below.

Does the error in the Balance outlined in red below mean anything?
Looking at the date, I added the last disk on that day, and that might have been the issue.

Can you suggest any resolution, please?

I need help removing the corrupt disk, and I don’t mind losing just the data on that single disk. Is that possible?
I am considering converting the raid level from RAID 0 to a single disk raid. Do you think that may help?

OR

Is there a command to force disk extension by adding more disks?

I need a workaround.

Thank you

phillxnet · April 8, 2024, 9:35am

@phemmy22 Hello there, I just wanted to chip in on this a little:

Thereafter the system displays errors across all drives. These errors are all corruption, not read or write. That suggests that you have an issue higher up than the individual drives themselves, i.e. @Hooverdan's question about whether your system uses ECC memory.

Failure in memory could give errors across all drives. If it was just the newly added drive, the errors would be confined to that drive. But they are not. Likewise the system history, from my only quick read here, was that this all started when you added the last drive. The error you noted from that time in the Web-UI balance logs, by the way, was due to a read-only pool status. Btrfs goes read-only to protect data integrity if it finds something iffy. Thousands of corruption errors in seconds across all pool members definitely counts as iffy. So back to that drive addition. You may have overly stretched your PSU. This then causes a system-wide instability that could well cause power fluctuations that lead to corruption.

Do nothing else with the pool until you prove your memory good. Then look into your PSU capacity/health (not easy) and the not insubstantial load it is now under with this drive count. Our following doc may be of help on the memory test front:

Pre-Install Best Practice (PBP) - Rockstor documentation

Another cause could be a drive controller interface: but again; power instability (voltage fluctuations etc) could also show up as all sorts of hardware failure as all components need a steady power supply. Just thinking that the adding of this last drive could have been the last straw for the PSU and pushed it into failure (maybe only at that load but potentially all loads there-after).

With the system in this state: unreliable with all these across the board errors on all drives, you can do nothing with the pool: it likely is the victim here; not the cause.

From the output no individual disk is any more suspect than any other. They are all receiving corruption reports: ergo not likely an individual disk is my initial thought here.
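The comparison described above can be sketched in a few lines. This assumes the usual `btrfs device stats <mount>` line format of `[device].counter value`. The function names and the sample counter values are made up for illustration and are not the real figures from this pool.

```python
import re
from collections import defaultdict

# One line of `btrfs device stats` output: "[/dev/sda].corruption_errs  12"
STAT_LINE = re.compile(r"^\[(?P<dev>[^\]]+)\]\.(?P<counter>\w+)\s+(?P<value>\d+)$")

# Invented sample: every device shows corruption errors, none show I/O errors.
SAMPLE_STATS = """\
[/dev/sda].write_io_errs    0
[/dev/sda].read_io_errs     0
[/dev/sda].flush_io_errs    0
[/dev/sda].corruption_errs  1200
[/dev/sda].generation_errs  0
[/dev/sdb].write_io_errs    0
[/dev/sdb].read_io_errs     0
[/dev/sdb].flush_io_errs    0
[/dev/sdb].corruption_errs  1150
[/dev/sdb].generation_errs  0
"""


def parse_dev_stats(text):
    """Return {device: {counter_name: value}} from device stats output."""
    stats = defaultdict(dict)
    for line in text.splitlines():
        match = STAT_LINE.match(line.strip())
        if match:
            stats[match.group("dev")][match.group("counter")] = int(match.group("value"))
    return dict(stats)


def nonzero_counters(stats):
    """Keep only the counters that are above zero for each device."""
    return {dev: {k: v for k, v in counters.items() if v}
            for dev, counters in stats.items()}


if __name__ == "__main__":
    for dev, counters in sorted(nonzero_counters(parse_dev_stats(SAMPLE_STATS)).items()):
        print(dev, counters)
```

If only one device carried non-zero counters, that drive would be the obvious suspect. Similar corruption counts on every member, with no read or write errors, fit a shared cause such as memory, PSU, or controller.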

Always the hope, but btrfs-raid0, and btrfs-single for that matter, have no meaningful redundancy, so they are only appropriate for disposable data. See our:

https://rockstor.com/docs/interface/storage/pools-btrfs.html#redundancyprofiles

But as stated earlier: your pool has gone read-only (the “ro” in the mount status reports). That is likely a result of the corruptions across the board. But with no redundancy (btrfs-raid0), if there is a failure anywhere then the entire pool can be at risk. But check your memory and PSU health first, as it is commonplace for additional load on a system to surface failure across the board.
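For anyone wanting to confirm the read-only state from the command line, here is a minimal sketch that looks for the “ro” option in `/proc/mounts`-style text (fields: device, mount point, filesystem type, options, and two numeric fields). The function name and the sample line are assumptions for illustration only.

```python
# Invented sample in /proc/mounts format; the device name is hypothetical.
SAMPLE_MOUNTS = (
    "/dev/sda1 / btrfs rw,relatime 0 0\n"
    "/dev/sdb /mnt2/04.NET_8.0TB btrfs ro,relatime,space_cache,subvolid=5,subvol=/ 0 0\n"
)


def mount_is_read_only(mounts_text, mount_point):
    """Return True/False for the last matching entry, or None if not mounted."""
    result = None
    for line in mounts_text.splitlines():
        parts = line.split()
        if len(parts) >= 4 and parts[1] == mount_point:
            # Later entries for the same mount point override earlier ones.
            result = "ro" in parts[3].split(",")
    return result


if __name__ == "__main__":
    print(mount_is_read_only(SAMPLE_MOUNTS, "/mnt2/04.NET_8.0TB"))
```

On a live system the text would come from reading `/proc/mounts` itself, for example `open("/proc/mounts").read()`.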

Hope that helps, at least for some context.