Disk structure under /mnt2 and replication question

That is what I did. I have a small number of files in the share ‘testing’. I replicated three times successfully, and the fourth attempt failed trying to create the share. I do not have a share named ‘testing’ on the receiver.

1 Like

Also, on the sending system, I suddenly have this:
image

I rebooted and it is the same. HELP

@phillxnet I set up a new share with a different name so there would not be any overlap, and replication still fails on the 4th and 5th runs with the same error: unable to copy rep to share.

Once it fails, it deletes one of the weird shares (all numbers with the share name at the end), but it does not remove the replication share; the file system for that share ends up RO, and everything after that is garbage. The only way I can get rid of that replication is to use the btrfs command @Hooverdan gave me.
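
(For context, manually removing such a leftover read-only replication subvolume generally comes down to something like the sketch below; the path is purely illustrative, and this is not necessarily the exact command referenced above.)

# delete the leftover replication subvolume by hand (path illustrative)
btrfs subvolume delete /mnt2/<pool_name>/<leftover_replication_subvol>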

Just tried going from valt2 to Vault, which is the opposite direction from all of the problems reported above, so I’m thinking it is a bug.

I set up a similar scenario just now. In my case it also failed after the third replication. I will post my setup and whatever details I can find tomorrow.

1 Like

Once I set this up at around 19:45, about 5 minutes before the first replication would start, I didn’t touch the system for an hour or so, to ensure that I would not interfere with anything during the replication process. As one can see, during the second and third replications essentially no data is sent, since I did not touch the data set I had generated prior to configuring replication.

Sender:

Receiver:

Sending machine: quotas enabled, shares set up as below with some data (the warning is a quota warning: the disk itself has 6GB, of which 5.74GB were consumed, but the quota is set to 1GB):

Receiving machine only had these shares:


rockwurst:/mnt2/Replication # tree
.
├── dir_1
│   ├── dir_1
│   │   ├── file_1
│   │   ├── file_2
│   │   └── file_3
│   ├── dir_2
│   │   ├── file_1
│   │   ├── file_2
│   │   └── file_3
│   └── dir_3
│       ├── file_1
│       ├── file_2
│       └── file_3
└── index.sh

image
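
For reference, a data set like the one in the tree listing above can be generated with a few lines of shell. This is only a sketch of what the index.sh in that listing might do, since the actual script isn't shown here:

# illustrative only: recreate a similar nested test layout under /mnt2/Replication
for d in dir_1 dir_2 dir_3; do
  mkdir -p /mnt2/Replication/dir_1/"$d"
  for f in file_1 file_2 file_3; do
    dd if=/dev/urandom of=/mnt2/Replication/dir_1/"$d"/"$f" bs=1M count=10 status=none
  done
done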

Replication Send Process setup:

Send Process:

Receive process:

After the first replication there were two additional shares on the receiving machine; however, I forgot to capture that. After disabling the replication, this is what the share picture looks like:

Replication Receiver History:

(After the first three replications it still showed as successful; the screenshot is from after the last attempt):

Snapshots on sending system after last failure (I don’t remember whether there were more initially, or not):

Snapshots list on receiving system (same here, not sure whether/how those listings changed over time):

Log on sending machine between 7:50 and 8:50pm (/opt/rockstor/var/log/rockstor.log):

...
[16/Sep/2024 19:50:04] INFO [smart_manager.replication.sender:341] Id: 6f32cb58-f849-4c93-bc65-6ebda422c66d-5. Sending full replica: /mnt2/rockwurst/.snapshots/Replication/Replication_5_replication_1

[16/Sep/2024 20:00:03] INFO [smart_manager.replication.sender:335] Id: 6f32cb58-f849-4c93-bc65-6ebda422c66d-5. Sending incremental replica between /mnt2/rockwurst/.snapshots/Replication/Replication_5_replication_1 -- /mnt2/rockwurst/.snapshots/Replication/Replication_5_replication_24

[16/Sep/2024 20:10:02] INFO [smart_manager.replication.sender:335] Id: 6f32cb58-f849-4c93-bc65-6ebda422c66d-5. Sending incremental replica between /mnt2/rockwurst/.snapshots/Replication/Replication_5_replication_24 -- /mnt2/rockwurst/.snapshots/Replication/Replication_5_replication_25

[16/Sep/2024 20:20:03] ERROR [smart_manager.replication.sender:79] Id: 6f32cb58-f849-4c93-bc65-6ebda422c66d-5. b'unexpected reply(b\'receiver-error\') for 6f32cb58-f849-4c93-bc65-6ebda422c66d-5. extended reply: b"b\'Failed to promote the oldest Snapshot to Share.\'. Exception: 500 Server Error: Internal Server Error for url: http://127.0.0.1:8000/api/shares/15/snapshots/Replication_5_replication_1/repclone". Aborting'. Exception: b'unexpected reply(b\'receiver-error\') for 6f32cb58-f849-4c93-bc65-6ebda422c66d-5. extended reply: b"b\'Failed to promote the oldest Snapshot to Share.\'. Exception: 500 Server Error: Internal Server Error for url: http://127.0.0.1:8000/api/shares/15/snapshots/Replication_5_replication_1/repclone". Aborting'

[16/Sep/2024 20:30:02] ERROR [storageadmin.util:44] Exception: Snapshot (Replication_5_replication_26) already exists for the share (Replication). NoneType: None

[16/Sep/2024 20:30:02] ERROR [smart_manager.replication.sender:79] Id: 6f32cb58-f849-4c93-bc65-6ebda422c66d-5. b'Failed to create snapshot: Replication_5_replication_26. Aborting.'. Exception: 500 Server Error: Internal Server Error for url: http://127.0.0.1:8000/api/shares/14/snapshots/Replication_5_replication_26

[16/Sep/2024 20:40:02] ERROR [storageadmin.util:44] Exception: Snapshot (Replication_5_replication_26) already exists for the share (Replication). NoneType: None

[16/Sep/2024 20:40:02] ERROR [smart_manager.replication.sender:79] Id: 6f32cb58-f849-4c93-bc65-6ebda422c66d-5. b'Failed to create snapshot: Replication_5_replication_26. Aborting.'. Exception: 500 Server Error: Internal Server Error for url: http://127.0.0.1:8000/api/shares/14/snapshots/Replication_5_replication_26

[16/Sep/2024 20:50:02] ERROR [storageadmin.util:44] Exception: Snapshot (Replication_5_replication_26) already exists for the share (Replication). NoneType: None

[16/Sep/2024 20:50:02] ERROR [smart_manager.replication.sender:79] Id: 6f32cb58-f849-4c93-bc65-6ebda422c66d-5. b'Failed to create snapshot: Replication_5_replication_26. Aborting.'. Exception: 500 Server Error: Internal Server Error for url: http://127.0.0.1:8000/api/shares/14/snapshots/Replication_5_replication_26

[16/Sep/2024 21:00:02] ERROR [storageadmin.util:44] Exception: Snapshot (Replication_5_replication_26) already exists for the share (Replication). NoneType: None

[16/Sep/2024 21:00:02] ERROR [smart_manager.replication.sender:79] Id: 6f32cb58-f849-4c93-bc65-6ebda422c66d-5. b'Failed to create snapshot: Replication_5_replication_26. Aborting.'. Exception: 500 Server Error: Internal Server Error for url: http://127.0.0.1:8000/api/shares/14/snapshots/Replication_5_replication_26
…

Log for receiving machine (/opt/rockstor/var/log/rockstor.log):

...
[16/Sep/2024 20:20:02] INFO [storageadmin.views.snapshot:61] Supplanting share (6f32cb58-f849-4c93-bc65-6ebda422c66d_Replication) with snapshot (.snapshots/6f32cb58-f849-4c93-bc65-6ebda422c66d_Replication/Replication_5_replication_1).

[16/Sep/2024 20:20:03] ERROR [system.osi:287] non-zero code(1) returned by command: ['/usr/sbin/btrfs', 'qgroup', 'destroy', '0/257', '/mnt2/fresse_storage']. output: [''] error: ['ERROR: unable to destroy quota group: Device or resource busy', '']

[16/Sep/2024 20:20:03] ERROR [storageadmin.util:44] Exception: Error running a command. cmd = /usr/sbin/btrfs qgroup destroy 0/257 /mnt2/fresse_storage. rc = 1. stdout = ['']. stderr = ['ERROR: unable to destroy quota group: Device or resource busy', '']

Traceback (most recent call last):
  File "/opt/rockstor/src/rockstor/storageadmin/views/clone_helpers.py", line 92, in create_repclone
    remove_share(share.pool, share.name, PQGROUP_DEFAULT)
  File "/opt/rockstor/src/rockstor/fs/btrfs.py", line 1196, in remove_share
    qgroup_destroy(qgroup, root_pool_mnt)
  File "/opt/rockstor/src/rockstor/fs/btrfs.py", line 1501, in qgroup_destroy
    return run_command([BTRFS, "qgroup", "destroy", qid, mnt_pt], log=True)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/rockstor/src/rockstor/system/osi.py", line 289, in run_command
    raise CommandException(cmd, out, err, rc)
system.exceptions.CommandException: Error running a command. cmd = /usr/sbin/btrfs qgroup destroy 0/257 /mnt2/fresse_storage. rc = 1. stdout = ['']. stderr = ['ERROR: unable to destroy quota group: Device or resource busy', '']

[16/Sep/2024 20:20:03] ERROR [smart_manager.replication.receiver:100] b'Failed to promote the oldest Snapshot to Share.'. Exception: 500 Server Error: Internal Server Error for url: http://127.0.0.1:8000/api/shares/15/snapshots/Replication_5_replication_1/repclone

[16/Sep/2024 20:30:02] ERROR [smart_manager.replication.receiver:100] b'Failed to promote the oldest Snapshot to Share.'. Exception: Share matching query does not exist.

[16/Sep/2024 20:40:02] ERROR [smart_manager.replication.receiver:100] b'Failed to promote the oldest Snapshot to Share.'. Exception: Share matching query does not exist.

[16/Sep/2024 20:50:02] ERROR [smart_manager.replication.receiver:100] b'Failed to promote the oldest Snapshot to Share.'. Exception: Share matching query does not exist.

[16/Sep/2024 21:00:02] ERROR [smart_manager.replication.receiver:100] b'Failed to promote the oldest Snapshot to Share.'. Exception: Share matching query does not exist.
...

Journalctl sending system for first replication:

...
Sep 16 19:50:00 fresse CRON[13351]: (root) CMD (/opt/rockstor/.venv/bin/st-snapshot 2 \*-*-*-*-*-*)
Sep 16 19:50:01 fresse CRON[13350]: (root) CMDEND (/opt/rockstor/.venv/bin/st-snapshot 2 \*-*-*-*-*-*)
Sep 16 19:50:01 fresse postfix/pickup[9736]: 3855A50354: uid=0 from=<root>
Sep 16 19:50:01 fresse postfix/cleanup[13441]: 3855A50354: message-id=<20240917025001.3855A50354@localhost.localdomain>
Sep 16 19:50:01 fresse postfix/qmgr[1306]: 3855A50354: from=<root@localhost.localdomain>, size=1290, nrcpt=1 (queue active)
Sep 16 19:50:01 fresse postfix/local[13443]: 3855A50354: to=<root@localhost.localdomain>, orig_to=<root>, relay=local, delay=0.18, delays=0.16/0.01/0/0.02, dsn=2.0.0, status=sent (delivered to mailbox)
Sep 16 19:50:01 fresse postfix/qmgr[1306]: 3855A50354: removed
Sep 16 19:51:56 fresse systemd[1]: Starting Hostname Service...
Sep 16 19:51:56 fresse systemd[1]: Started Hostname Service.
Sep 16 19:52:26 fresse systemd[1]: systemd-hostnamed.service: Deactivated successfully.
Sep 16 19:53:00 fresse NetworkManager[866]: <info> [1726541580.7696] dhcp4 (enp0s8): state changed new lease, address=192.168.56.112
Sep 16 19:53:00 fresse systemd[1]: Starting Network Manager Script Dispatcher Service...
Sep 16 19:53:00 fresse systemd[1]: Started Network Manager Script Dispatcher Service.
Sep 16 19:53:01 fresse dns-dnsmasq.sh[14517]: <debug> NETWORKMANAGER_DNS_FORWARDER is not set to "dnsmasq" in /etc/sysconfig/network/config -> exit
Sep 16 19:53:01 fresse nm-dispatcher[14528]: /etc/NetworkManager/dispatcher.d/ntp: line 11: netconfig: command not found
Sep 16 19:53:01 fresse nm-dispatcher[14498]: req:1 'dhcp4-change' [enp0s8], "/etc/NetworkManager/dispatcher.d/ntp": complete: process failed with Script '/etc/NetworkManager/dispatcher.d/ntp' exited with status 127
Sep 16 19:53:01 fresse NetworkManager[866]: <warn> [1726541581.2191] dispatcher: (20) /etc/NetworkManager/dispatcher.d/ntp failed (failed): Script '/etc/NetworkManager/dispatcher.d/ntp' exited with status 127
Sep 16 19:53:11 fresse systemd[1]: NetworkManager-dispatcher.service: Deactivated successfully.
Sep 16 19:55:00 fresse CRON[15233]: (root) CMD (/opt/rockstor/.venv/bin/st-snapshot 2 \*-*-*-*-*-*)
Sep 16 19:55:02 fresse CRON[15232]: (root) CMDEND (/opt/rockstor/.venv/bin/st-snapshot 2 \*-*-*-*-*-*)
Sep 16 19:55:02 fresse postfix/pickup[9736]: 5654E503BF: uid=0 from=<root>
Sep 16 19:55:02 fresse postfix/cleanup[15367]: 5654E503BF: message-id=<20240917025502.5654E503BF@localhost.localdomain>
Sep 16 19:55:02 fresse postfix/qmgr[1306]: 5654E503BF: from=<root@localhost.localdomain>, size=1290, nrcpt=1 (queue active)
Sep 16 19:55:02 fresse postfix/local[15369]: 5654E503BF: to=<root@localhost.localdomain>, orig_to=<root>, relay=local, delay=0.16, delays=0.13/0.01/0/0.02, dsn=2.0.0, status=sent (delivered to mailbox)
Sep 16 19:55:02 fresse postfix/qmgr[1306]: 5654E503BF: removed
Sep 16 19:58:00 fresse NetworkManager[866]: <info> [1726541880.7556] dhcp4 (enp0s8): state changed new lease, address=192.168.56.112
Sep 16 19:58:00 fresse systemd[1]: Starting Network Manager Script Dispatcher Service...
Sep 16 19:58:00 fresse systemd[1]: Started Network Manager Script Dispatcher Service.
Sep 16 19:58:00 fresse dns-dnsmasq.sh[16366]: <debug> NETWORKMANAGER_DNS_FORWARDER is not set to "dnsmasq" in /etc/sysconfig/network/config -> exit
Sep 16 19:58:00 fresse nm-dispatcher[16375]: /etc/NetworkManager/dispatcher.d/ntp: line 11: netconfig: command not found
Sep 16 19:58:00 fresse nm-dispatcher[16353]: req:1 'dhcp4-change' [enp0s8], "/etc/NetworkManager/dispatcher.d/ntp": complete: process failed with Script '/etc/NetworkManager/dispatcher.d/ntp' exited with status 127
Sep 16 19:58:00 fresse NetworkManager[866]: <warn> [1726541880.8460] dispatcher: (21) /etc/NetworkManager/dispatcher.d/ntp failed (failed): Script '/etc/NetworkManager/dispatcher.d/ntp' exited with status 127
Sep 16 19:58:10 fresse systemd[1]: NetworkManager-dispatcher.service: Deactivated successfully.
...

This is essentially repeated every 10 minutes for the first three replications. For the fourth replication at 8:20pm (the first one failing), some additional messages are logged during that time (not sure whether they’re related or not):

...
Sep 16 20:20:00 fresse CRON[23607]: (root) CMD (/opt/rockstor/.venv/bin/st-snapshot 2 \*-*-*-*-*-*)
Sep 16 20:20:00 fresse CRON[23606]: (root) CMDEND (/opt/rockstor/.venv/bin/st-snapshot 2 \*-*-*-*-*-*)
Sep 16 20:20:00 fresse postfix/pickup[9736]: DD97E505ED: uid=0 from=<root>
Sep 16 20:20:00 fresse postfix/cleanup[23696]: DD97E505ED: message-id=<20240917032000.DD97E505ED@localhost.localdomain>
Sep 16 20:20:00 fresse postfix/qmgr[1306]: DD97E505ED: from=<root@localhost.localdomain>, size=1290, nrcpt=1 (queue active)
Sep 16 20:20:00 fresse postfix/local[23698]: DD97E505ED: to=<root@localhost.localdomain>, orig_to=<root>, relay=local, delay=0.18, delays=0.15/0.01/0/0.02, dsn=2.0.0, status=sent (delivered to mailbox)
Sep 16 20:20:00 fresse postfix/qmgr[1306]: DD97E505ED: removed
Sep 16 20:20:03 fresse systemd[1]: mnt2-6f32cb58\x2df849\x2d4c93\x2dbc65\x2d6ebda422c66d_Replication.mount: Deactivated successfully.
Sep 16 20:23:00 fresse NetworkManager[866]: <info> [1726543380.7602] dhcp4 (enp0s8): state changed new lease, address=192.168.56.112
Sep 16 20:23:00 fresse systemd[1]: Starting Network Manager Script Dispatcher Service...
Sep 16 20:23:00 fresse systemd[1]: Started Network Manager Script Dispatcher Service.
Sep 16 20:23:00 fresse dns-dnsmasq.sh[24626]: <debug> NETWORKMANAGER_DNS_FORWARDER is not set to "dnsmasq" in /etc/sysconfig/network/config -> exit
Sep 16 20:23:00 fresse nm-dispatcher[24635]: /etc/NetworkManager/dispatcher.d/ntp: line 11: netconfig: command not found
Sep 16 20:23:00 fresse nm-dispatcher[24612]: req:1 'dhcp4-change' [enp0s8], "/etc/NetworkManager/dispatcher.d/ntp": complete: process failed with Script '/etc/NetworkManager/dispatcher.d/ntp' exited with status 127
Sep 16 20:23:00 fresse NetworkManager[866]: <warn> [1726543380.8472] dispatcher: (26) /etc/NetworkManager/dispatcher.d/ntp failed (failed): Script '/etc/NetworkManager/dispatcher.d/ntp' exited with status 127
Sep 16 20:23:10 fresse systemd[1]: Started Daily Cleanup of Snapper Snapshots.
Sep 16 20:23:10 fresse systemd[1]: NetworkManager-dispatcher.service: Deactivated successfully.
Sep 16 20:23:10 fresse systemd[1]: Starting DBus interface for snapper...
Sep 16 20:23:10 fresse systemd[1]: Started DBus interface for snapper.
Sep 16 20:23:10 fresse systemd[1]: snapper-cleanup.service: Deactivated successfully.
Sep 16 20:24:11 fresse systemd[1]: snapperd.service: Deactivated successfully.
Sep 16 20:25:00 fresse CRON[25204]: (root) CMD (/opt/rockstor/.venv/bin/st-snapshot 2 \*-*-*-*-*-*)
Sep 16 20:25:01 fresse CRON[25203]: (root) CMDEND (/opt/rockstor/.venv/bin/st-snapshot 2 \*-*-*-*-*-*)
Sep 16 20:25:01 fresse postfix/pickup[9736]: BB32B50655: uid=0 from=<root>
Sep 16 20:25:01 fresse postfix/cleanup[25294]: BB32B50655: message-id=<20240917032501.BB32B50655@localhost.localdomain>
Sep 16 20:25:01 fresse postfix/qmgr[1306]: BB32B50655: from=<root@localhost.localdomain>, size=1290, nrcpt=1 (queue active)
Sep 16 20:25:01 fresse postfix/local[25296]: BB32B50655: to=<root@localhost.localdomain>, orig_to=<root>, relay=local, delay=0.16, delays=0.13/0.01/0/0.02, dsn=2.0.0, status=sent (delivered to mailbox)
Sep 16 20:25:01 fresse postfix/qmgr[1306]: BB32B50655: removed
Sep 16 20:28:00 fresse NetworkManager[866]: <info> [1726543680.7618] dhcp4 (enp0s8): state changed new lease, address=192.168.56.112
Sep 16 20:28:00 fresse systemd[1]: Starting Network Manager Script Dispatcher Service...
Sep 16 20:28:00 fresse systemd[1]: Started Network Manager Script Dispatcher Service.
Sep 16 20:28:00 fresse dns-dnsmasq.sh[26163]: <debug> NETWORKMANAGER_DNS_FORWARDER is not set to "dnsmasq" in /etc/sysconfig/network/config -> exit
Sep 16 20:28:00 fresse nm-dispatcher[26172]: /etc/NetworkManager/dispatcher.d/ntp: line 11: netconfig: command not found
Sep 16 20:28:00 fresse nm-dispatcher[26149]: req:1 'dhcp4-change' [enp0s8], "/etc/NetworkManager/dispatcher.d/ntp": complete: process failed with Script '/etc/NetworkManager/dispatcher.d/ntp' exited with status 127
Sep 16 20:28:00 fresse NetworkManager[866]: <warn> [1726543680.8576] dispatcher: (27) /etc/NetworkManager/dispatcher.d/ntp failed (failed): Script '/etc/NetworkManager/dispatcher.d/ntp' exited with status 127
...

For some reason, no journalctl entries from before today are visible on the receiving system (the receiving system runs Tumbleweed with kernel 6.10.9 and btrfs-progs 6.10.1, while the sending system runs Leap 15.6 with kernel 6.4.0 and btrfs-progs 6.5.1; both are on Rockstor 5.0.14-0).

@Stevek from your recollection, did you observe anything in addition to the above?

2 Likes

To ensure that there wasn’t any quota processing getting in the way, I recreated the scenario, but this time I disabled quotas at the pool level on the receiving system.

It did not seem to make a difference. It still failed at the fourth replication with the same error message on the receiving system:

[17/Sep/2024 13:55:02] INFO [storageadmin.views.snapshot:61] Supplanting share (6f32cb58-f849-4c93-bc65-6ebda422c66d_Replication) with snapshot (.snapshots/6f32cb58-f849-4c93-bc65-6ebda422c66d_Replication/Replication_6_replication_1).
[17/Sep/2024 13:55:03] ERROR [storageadmin.util:44] Exception: Error running a command. cmd = /usr/sbin/btrfs property set /mnt2/fresse_storage/.snapshots/6f32cb58-f849-4c93-bc65-6ebda422c66d_Replication/Replication_6_replication_1 ro false. rc = 1. stdout = ['']. stderr = ['ERROR: cannot flip ro->rw with received_uuid set, use force option -f if you really want unset the read-only status. The value of received_uuid is used for incremental send, consider making a snapshot instead. Read more at btrfs-subvolume(8) and Subvolume flags.', '']
Traceback (most recent call last):
  File "/opt/rockstor/src/rockstor/storageadmin/views/clone_helpers.py", line 94, in create_repclone
    set_property(snap_path, "ro", "false", mount=False)
  File "/opt/rockstor/src/rockstor/fs/btrfs.py", line 2314, in set_property
    return run_command(cmd)
           ^^^^^^^^^^^^^^^^
  File "/opt/rockstor/src/rockstor/system/osi.py", line 289, in run_command
    raise CommandException(cmd, out, err, rc)
system.exceptions.CommandException: Error running a command. cmd = /usr/sbin/btrfs property set /mnt2/fresse_storage/.snapshots/6f32cb58-f849-4c93-bc65-6ebda422c66d_Replication/Replication_6_replication_1 ro false. rc = 1. stdout = ['']. stderr = ['ERROR: cannot flip ro->rw with received_uuid set, use force option -f if you really want unset the read-only status. The value of received_uuid is used for incremental send, consider making a snapshot instead. Read more at btrfs-subvolume(8) and Subvolume flags.', '']
[17/Sep/2024 13:55:03] ERROR [smart_manager.replication.receiver:100] b'Failed to promote the oldest Snapshot to Share.'. Exception: 500 Server Error: Internal Server Error for url: http://127.0.0.1:8000/api/shares/18/snapshots/Replication_6_replication_1/repclone

Using the -f option in the above command manually seems to change the ro property to rw:

/usr/sbin/btrfs property get /mnt2/fresse_storage/.snapshots/6f32cb58-f849-4c93-bc65-6ebda422c66d_Replication/Replication_6_replication_1
ro=false
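
For reference, the forced variant that produced the ro=false state above would be along these lines (same snapshot path, with -f added as the btrfs error suggests):

# force-clear the read-only flag despite received_uuid being set
btrfs property set -f /mnt2/fresse_storage/.snapshots/6f32cb58-f849-4c93-bc65-6ebda422c66d_Replication/Replication_6_replication_1 ro false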

Interestingly, when I shoehorned the -f into the code and tried this again, it suddenly didn’t seem to recognize the -f flag.

system.exceptions.CommandException: Error running a command. cmd = /usr/sbin/btrfs property set -f /mnt2/fresse_storage/.snapshots/6f32cb58-f849-4c93-bc65-6ebda422c66d_Replication/Replication_7_replication_1 ro false. rc = 1. stdout = ['']. stderr = ["btrfs property: unknown token 'set -f'", 'usage: btrfs property get/ ...

but in any case, still failing after the third replication.

@Hooverdan Dan, you seem to have done a great job documenting exactly what I am seeing, including the disappearing number.replication-name share being removed, the replication share going RO, and having to use btrfs to clean up. One thing I noticed is that I started getting messages like "BTRFS info (device xxxx): qgroup scan completed (inconsistency flag cleared)", especially if I touch any files on my systems. Before the failing replication I was not getting these. Re: your last comment… it seems that once the failure happens, the system acts differently.

Both my systems are on the latest code, so I would expect the same behavior, but even the logwatch command acts differently: entered with no parameters, it behaves one way on one system, while on the other it runs with no output.

Let me know if I can do anything to help figure this out. I would still like to understand why it waits until the third rep to move to a share. When I replicate something, I expect to see the data when the job completes, not after three runs, which might be days apart.

UPDATE… I also have this issue now: I cannot re-enable quotas. If I do it in the GUI it completes, but then it comes right back as disabled. Any thoughts on how I can fix this?

Name Size Usage Quotas Raid Active mount options / Status Compression Extra mount options Disks Actions
Data-2_1TB 931.51 GB 251.20 GB (26.97 %) Enabled single rw,relatime,space_cache=v2,subvolid=5,subvol=/ no None "ata-ST31000528AS_5VP1AAS4 "
Data-3_1TB 931.51 GB 251.20 GB (26.97 %) Enabled single rw,relatime,space_cache=v2,subvolid=5,subvol=/ no None "ata-ST31000528AS_6VP3GE7X "
Data_1 931.51 GB 220.22 GB (23.64 %) Disabled single rw,relatime,space_cache=v2,subvolid=5,subvol=/ no None "ata-WDC_WD10EARX-00N0YB0_WD-WMC0S0882589 "
1 Like

You could try to do this at the command line. I seem to remember a thread where that was happening in the UI, but it was possible to do it directly from the command line, e.g.:

btrfs quota enable /mnt2/Data-1

1 Like

Ran the command and there was no change. I have made copies of the data. Should I delete the original shares and then delete and re-add the disk, or is there another option?

Documents 931.51 GB 0 bytes 0 bytes rw,relatime,space_cache=v2,subvolid=261,subvol=/Documents Data_1 (rw,relatime,space_cache=v2,subvolid=5,subvol=/) Quotas Disabled Same as Pool
Documents-1 150.00 GB 10.02 GB 10.02 GB rw,relatime,space_cache=v2,subvolid=262,subvol=/Documents-1 Data-3_1TB (rw,relatime,space_cache=v2,subvolid=5,subvol=/) Quotas Enabled Same as Pool
home 57.56 GB 16.00 KB 16.00 KB rw,relatime,ssd,discard=async,space_cache,subvolid=259,subvol=/@/home ROOT (rw,relatime,ssd,discard=async,space_cache,subvolid=256,subvol=/@) Quotas Enabled Same as Pool
Music 931.51 GB 0 bytes 0 bytes rw,relatime,space_cache=v2,subvolid=259,subvol=/Music Data_1 (rw,relatime,space_cache=v2,subvolid=5,subvol=/) Quotas Disabled Same as Pool
Music-1 250.00 GB 193.72 GB 193.72 GB rw,relatime,space_cache=v2,subvolid=263,subvol=/Music-1 Data-3_1TB (rw,relatime,space_cache=v2,subvolid=5,subvol=/) Quotas Enabled Same as Pool
Newone 1.00 GB 824.00 KB 824.00 KB rw,relatime,space_cache=v2,subvolid=257,subvol=/Newone Data-3_1TB (rw,relatime,space_cache=v2,subvolid=5,subvol=/) Quotas Enabled Same as Pool
Photos 931.51 GB 0 bytes 0 bytes rw,relatime,space_cache=v2,subvolid=260,subvol=/Photos Data_1 (rw,relatime,space_cache=v2,subvolid=5,subvol=/) Quotas Disabled Same as Pool
Photos-1 200.00 GB 46.50 GB 46.50 GB rw,relatime,space_cache=v2,subvolid=264,subvol=/Photos-1 Data-3_1TB (rw,relatime,space_cache=v2,subvolid=5,subvol=/) Quotas Enabled Same as Pool
ROCK-ONS 5.00 GB 232.00 KB 232.00 KB rw,relatime,space_cache=v2,subvolid=256,subvol=/ROCK-ONS Data-3_1TB (rw,relatime,space_cache=v2,subvolid=5,subvol=/) Quotas Enabled Same as Pool

Fundamentally, Rockstor (still) does not enforce quotas as a constraint; however, they help the dashboard show more relevant usage for a given share. So you don’t have to go down the path you described above, unless you really want to.

Did you get any output when you enabled the quotas via command line?

Since I am not sure whether that quota status is persisted somewhere or re-determined every time you refresh that screen, can you run:

btrfs qgroup show -reF /mnt2/Data-1 - as an aside, are you running the command line as the root user (e.g. via PuTTY, or an elevated prompt in the built-in shell of the UI)?

Thx for all your help…

no output after cmd.
Vault:~ # btrfs quota enable /mnt2/Data_1
Vault:~ # btrfs quota

Here is output from your suggestion.

Vault:~ # btrfs qgroup show -reF /mnt2/Data_1
ERROR: cannot find the qgroup 0/266
ERROR: can't list qgroups: No such file or directory

I do get output if I list the other drives but not on the one with the issue. Should I create one?

I am using TERMIUS and logged on as root.
Vault:~ # whoami
root

Mhm, again reaching the end of my knowledge here. I know, based on some of your reporting around resizing, that something was done on the quota side, but I suspect no fundamental changes recently… I’ll think some more on it.

@Hooverdan Thanks for the detailed exposition here. Nice.
Re:

This has definitely helped to narrow down what’s going on. The strangely named early shares are a known artefact of the first 3 replication events, however. Thereafter it is supplanted (promoted).

But I do see a difference here re the failure report:

Quotas enabled:

Receiving machine, which is where the failure appears:

I.e. a quota problem, hence your further investigation and disabling quotas.

Quotas disabled

Receiving machine again:

which looks different, and likely the difference is informative for us.

As an initial guess, the former may be down to a quota change we made (can’t think of anything else currently), and the latter could be a btrfs change, especially given you are running Tumbleweed on the receiver; note the -f advice in its error. And given we no longer use the oldest snapshot as the send/receive difference target (the reason we use a cascade of snapshots), we should be able to safely use this -f option, which we have never required before. Take a peek and see if you can jury-rig a -f in there as a proof of function for the target (destination) supplanting process. That would at least leave only the initial quotas blocker.

@Stevek Thanks for your continued endeavour on this one; as always, much appreciated. This would be a great find if we can get to the bottom of it. We are still in testing, if late, so it would be great to see this functional, as earlier in the testing phase. We do juggle several moving targets here, so it is always a challenge, but this type of contribution is super useful for highlighting these issues. The quota strangeness you saw tallies with @Hooverdan’s finding regarding a new quota sensitivity. Try disabling quotas via the Web-UI and then re-enabling them: they can take a few minutes to re-calculate the share usage after re-enabling. If the Web-UI disable/re-enable fails, try via the command line: it’s a Pool-level thing.

Thanks folks, we do have some concrete info here now, so all good. This was not observed after the last round of replication maintenance, but it definitely exists now!

Hope that helps.

@Hooverdan my apologies if I’ve mis-quoted your logs there. But it does look different after quotas are disabled on the target (receive) end: it just fails a little later on. The two could be related, however. I.e. the quota change is blocked as a protection against the sort of thing we want to do later anyway: the snapshot is known to be associated with a replication. But we no longer use the older/oldest snapshots, only the most recent; we then cascade the orphaned oldest snapshot to being a real Share (cosmetic actually, as snapshots, shares, and clones are all subvols). That cascade mechanism was intended to divorce that receiving snapshot from the replication process. But it looks like we may need a little tweak re newer btrfs.
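
As a rough conceptual sketch (not the actual Rockstor code path, and with placeholder names), the receive-side cascade amounts to something like this:

# each scheduled run receives a new read-only snapshot (it carries a received_uuid)
btrfs receive /mnt2/<pool>/.snapshots/<replication_share>/ < replica_stream
# once the oldest snapshot is no longer needed as an incremental parent, it is
# promoted ("supplanted") to a regular share; on newer btrfs-progs this needs the
# force flag because of that received_uuid
btrfs property set -f /mnt2/<pool>/.snapshots/<replication_share>/<oldest_snapshot> ro false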

2 Likes

Thanks for providing your thoughts on this. I did notice that difference, but assumed that, while a symptom of something not working, it wasn’t at the core of the failure after the third replication (or at least I think so).

I added the force flag in btrfs.py on the receiver side here:

def set_property(mnt_pt, name, val, mount=True):
    if mount is not True or is_mounted(mnt_pt):
        cmd = [BTRFS, "property", "set", mnt_pt, name, val]
        return run_command(cmd)

like this:

def set_property(mnt_pt, name, val, mount=True):
    if mount is not True or is_mounted(mnt_pt):
        cmd = [BTRFS, "property", "set -f", mnt_pt, name, val]
        return run_command(cmd)

but that unfortunately produced an error. Interestingly, when I copied the command string printed in the error log exactly as it was captured, it worked, so I’m not sure what happened. I will try again differently.

Hah, my lack of Python development experience shows. I made a slight change in the code to:

def set_property(mnt_pt, name, val, mount=True):
    if mount is not True or is_mounted(mnt_pt):
        cmd = [BTRFS, "property", "set", "-f", mnt_pt, name, val]
        return run_command(cmd)

and I now have 5 successful replications and counting!
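
For anyone else tripping over the same thing: run_command hands the list to the OS as individual arguments, so each token has to be its own list element; my first attempt passed btrfs the literal argument "set -f". In shell terms, the two attempts amount to roughly the following (the subvolume path is illustrative):

# first attempt: "set -f" quoted as a single token, which btrfs rejects
# ("btrfs property: unknown token 'set -f'")
/usr/sbin/btrfs property "set -f" /mnt2/some_subvol ro false
# second attempt: separate tokens, as btrfs expects
/usr/sbin/btrfs property set -f /mnt2/some_subvol ro false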

And as I found somewhere else in the replication coding, only the last three snapshots are kept:

This was all with quotas disabled on the receiving system:

Turning them on after the fifth replication led to a failure to promote the oldest snapshot to a share. This is (unfortunately) in line with the errors described earlier in this thread, where the system struggles with deleting the quota groups (qgroups):

[19/Sep/2024 14:55:03] INFO [storageadmin.views.snapshot:61] Supplanting share (6f32cb58-f849-4c93-bc65-6ebda422c66d_Replication) with snapshot (Replication_8_replication_47).
[19/Sep/2024 14:55:03] ERROR [system.osi:287] non-zero code(1) returned by command: ['/usr/sbin/btrfs', 'qgroup', 'destroy', '0/268', '/mnt2/fresse_storage']. output: [''] error: ['ERROR: unable to destroy quota group: Device or resource busy', '']
[19/Sep/2024 14:55:03] ERROR [storageadmin.util:44] Exception: Error running a command. cmd = /usr/sbin/btrfs qgroup destroy 0/268 /mnt2/fresse_storage. rc = 1. stdout = ['']. stderr = ['ERROR: unable to destroy quota group: Device or resource busy', '']
Traceback (most recent call last):
  File "/opt/rockstor/src/rockstor/storageadmin/views/clone_helpers.py", line 92, in create_repclone
    remove_share(share.pool, share.name, PQGROUP_DEFAULT)
  File "/opt/rockstor/src/rockstor/fs/btrfs.py", line 1196, in remove_share
    qgroup_destroy(qgroup, root_pool_mnt)
  File "/opt/rockstor/src/rockstor/fs/btrfs.py", line 1501, in qgroup_destroy
    return run_command([BTRFS, "qgroup", "destroy", qid, mnt_pt], log=True)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/rockstor/src/rockstor/system/osi.py", line 289, in run_command
    raise CommandException(cmd, out, err, rc)
system.exceptions.CommandException: Error running a command. cmd = /usr/sbin/btrfs qgroup destroy 0/268 /mnt2/fresse_storage. rc = 1. stdout = ['']. stderr = ['ERROR: unable to destroy quota group: Device or resource busy', '']
[19/Sep/2024 14:55:03] ERROR [smart_manager.replication.receiver:100] b'Failed to promote the oldest Snapshot to Share.'. Exception: 500 Server Error: Internal Server Error for url: http://127.0.0.1:8000/api/shares/25/snapshots/Replication_8_replication_47/repclone

So there are apparently two root causes that were masking each other.

3 Likes

@phillxnet @Hooverdan WWOOOOWWWW, you guys are amazing…
Let me know when you have a fix and I’ll test it. I run from the testing channel.

A couple of questions

Again, why the need to do the 3 reps before seeing data?

If I have an existing share on both machines, because I manually made a copy from Windows, is there a way to start replicating those same shares? In other words, I have a share named photos on both machines and I want the second machine to always be in sync with the first. Do I need to delete the share on the receiving machine to use replication? Is there a better way of doing this?

2 Likes

I am not aware of a way to leverage that existing copy as part of the btrfs send/receive. The btrfs manpage states here:

You must not specify clone sources unless you guarantee that these snapshots are exactly in the same state on both sides—both for the sender and the receiver. For implications of changed read-write status of a received snapshot please see section SUBVOLUME FLAGS in btrfs-subvolume(8).

This really means that not only do the files have to match exactly, but also everything in the “btrfs sense”: subvolume layout, checksums, etc.

If you wanted to leverage the existing copies, then I think something like rsync or another file-level sync/backup solution might be the choice for that scenario. You won’t get the benefits (and challenges, as seen above) of btrfs send/receive, and with very large photo libraries rsync might take some time to analyze everything each time it needs to sync.
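
A minimal rsync sketch for that photos scenario might look like the following; the hostname, paths, and options are purely illustrative:

# one-way mirror of an existing "photos" share to the second machine
# (-a preserves attributes, -HAX hard links/ACLs/xattrs, --delete keeps the target in sync)
rsync -aHAX --delete /mnt2/photos/ root@vault2:/mnt2/photos/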

1 Like

For now I’ve created 2 separate Issues, differentiating between the quota destroy error and the property set error during send/receive. We can restructure/combine as needed:

2 Likes

@Stevek Re:

We now have at least one of the issues addressed as a result of this thread and your and @Hooverdan’s perseverance :).

See:

In that release we now have the official variant of @Hooverdan’s ‘force’ flag implementation, given we know we are cascading our snapshots in the replication setup.

So thus far we may still have an issue with the newer kernels re changed behaviour on quotas. There are notes on this in the above release notes.

Thanks again for your invaluable input here; the quota issue is still open, as I’ve yet to get a reproducer. It likely requires a certain quota payload that was not evident on my current test machines. Without a reproducer I could not approach a fix, and so I moved on to the force flag issue. But bit by bit.

Hope that helps.

2 Likes