Replication fails

Hi,

I added both appliances (the old Rockstor as sender and the new 4.0 as receiver). I get the following when trying to replicate the Configs share (I would like to just replicate the whole pool).

[27/Nov/2020 17:52:01] ERROR [storageadmin.middleware:32] Exception occurred while processing a request. Path: /api/commands/refresh-share-state method: POST
[27/Nov/2020 17:52:01] ERROR [storageadmin.middleware:33] Error running a command. cmd = /usr/sbin/btrfs property get /mnt2/Data/.snapshots/Configs/Configs_4_replication_1 ro. rc = 1. stdout = ['']. stderr = ['ERROR: failed to open /mnt2/Data/.snapshots/Configs/Configs_4_replication_1: No such file or directory', 'ERROR: failed to detect object type: No such file or directory', 'usage: btrfs property get [-t <type>] <object> [<name>]', '', ' Gets a property from a btrfs object.', '', ' If no name is specified, all properties for the given object are', ' printed.', ' A filesystem object can be a the filesystem itself, a subvolume,', " an inode or a device. The '-t <type>' option can be used to explicitly", ' specify what type of object you meant. This is only needed when a', ' property could be set for more then one object type. Possible types', ' are s[ubvol], f[ilesystem], i[node] and d[evice].', '', '']
Traceback (most recent call last):
  File "/opt/rockstor/eggs/Django-1.8.16-py2.7.egg/django/core/handlers/base.py", line 132, in get_response
    response = wrapped_callback(request, *callback_args, **callback_kwargs)
  File "/opt/rockstor/eggs/Django-1.8.16-py2.7.egg/django/views/decorators/csrf.py", line 58, in wrapped_view
    return view_func(*args, **kwargs)
  File "/opt/rockstor/eggs/Django-1.8.16-py2.7.egg/django/views/generic/base.py", line 71, in view
    return self.dispatch(request, *args, **kwargs)
  File "/opt/rockstor/eggs/djangorestframework-3.1.1-py2.7.egg/rest_framework/views.py", line 452, in dispatch
    response = self.handle_exception(exc)
  File "/opt/rockstor/eggs/djangorestframework-3.1.1-py2.7.egg/rest_framework/views.py", line 449, in dispatch
    response = handler(request, *args, **kwargs)
  File "/opt/rockstor/eggs/Django-1.8.16-py2.7.egg/django/utils/decorators.py", line 145, in inner
    return func(*args, **kwargs)
  File "/opt/rockstor/src/rockstor/storageadmin/views/command.py", line 348, in post
    import_shares(p, request)
  File "/opt/rockstor/src/rockstor/storageadmin/views/share_helpers.py", line 86, in import_shares
    shares_in_pool = shares_info(pool)
  File "/opt/rockstor/src/rockstor/fs/btrfs.py", line 721, in shares_info
    snap_idmap[vol_id])
  File "/opt/rockstor/src/rockstor/fs/btrfs.py", line 768, in parse_snap_details
    writable = not get_property(full_snap_path, 'ro')
  File "/opt/rockstor/src/rockstor/fs/btrfs.py", line 1844, in get_property
    o, e, rc = run_command(cmd)
  File "/opt/rockstor/src/rockstor/system/osi.py", line 176, in run_command
    raise CommandException(cmd, out, err, rc)
CommandException: Error running a command. cmd = /usr/sbin/btrfs property get /mnt2/Data/.snapshots/Configs/Configs_4_replication_1 ro. rc = 1. stdout = ['']. stderr = ['ERROR: failed to open /mnt2/Data/.snapshots/Configs/Configs_4_replication_1: No such file or directory', 'ERROR: failed to detect object type: No such file or directory', 'usage: btrfs property get [-t <type>] <object> [<name>]', '', ' Gets a property from a btrfs object.', '', ' If no name is specified, all properties for the given object are', ' printed.', ' A filesystem object can be a the filesystem itself, a subvolume,', " an inode or a device. The '-t <type>' option can be used to explicitly", ' specify what type of object you meant. This is only needed when a', ' property could be set for more then one object type. Possible types', ' are s[ubvol], f[ilesystem], i[node] and d[evice].', '', '']
[27/Nov/2020 17:52:01] ERROR [storageadmin.middleware:32] Exception occurred while processing a request. Path: /api/commands/refresh-snapshot-state method: POST
[27/Nov/2020 17:52:01] ERROR [storageadmin.middleware:33] Error running a command. cmd = /usr/sbin/btrfs property get /mnt2/Data/.snapshots/Configs/Configs_4_replication_1 ro. rc = 1. stdout = ['']. stderr = ['ERROR: failed to open /mnt2/Data/.snapshots/Configs/Configs_4_replication_1: No such file or directory', 'ERROR: failed to detect object type: No such file or directory', 'usage: btrfs property get [-t <type>] <object> [<name>]', '', ' Gets a property from a btrfs object.', '', ' If no name is specified, all properties for the given object are', ' printed.', ' A filesystem object can be a the filesystem itself, a subvolume,', " an inode or a device. The '-t <type>' option can be used to explicitly", ' specify what type of object you meant. This is only needed when a', ' property could be set for more then one object type. Possible types', ' are s[ubvol], f[ilesystem], i[node] and d[evice].', '', '']
Traceback (most recent call last):
  File "/opt/rockstor/eggs/Django-1.8.16-py2.7.egg/django/core/handlers/base.py", line 132, in get_response
    response = wrapped_callback(request, *callback_args, **callback_kwargs)
  File "/opt/rockstor/eggs/Django-1.8.16-py2.7.egg/django/views/decorators/csrf.py", line 58, in wrapped_view
    return view_func(*args, **kwargs)
  File "/opt/rockstor/eggs/Django-1.8.16-py2.7.egg/django/views/generic/base.py", line 71, in view
    return self.dispatch(request, *args, **kwargs)
  File "/opt/rockstor/eggs/djangorestframework-3.1.1-py2.7.egg/rest_framework/views.py", line 452, in dispatch
    response = self.handle_exception(exc)
  File "/opt/rockstor/eggs/djangorestframework-3.1.1-py2.7.egg/rest_framework/views.py", line 449, in dispatch
    response = handler(request, *args, **kwargs)
  File "/opt/rockstor/eggs/Django-1.8.16-py2.7.egg/django/utils/decorators.py", line 145, in inner
    return func(*args, **kwargs)
  File "/opt/rockstor/src/rockstor/storageadmin/views/command.py", line 353, in post
    import_snapshots(share)
  File "/opt/rockstor/src/rockstor/storageadmin/views/share_helpers.py", line 209, in import_snapshots
    share.name)
  File "/opt/rockstor/src/rockstor/fs/btrfs.py", line 819, in snaps_info
    stripped_path)
  File "/opt/rockstor/src/rockstor/fs/btrfs.py", line 768, in parse_snap_details
    writable = not get_property(full_snap_path, 'ro')
  File "/opt/rockstor/src/rockstor/fs/btrfs.py", line 1844, in get_property
    o, e, rc = run_command(cmd)
  File "/opt/rockstor/src/rockstor/system/osi.py", line 176, in run_command
    raise CommandException(cmd, out, err, rc)
CommandException: Error running a command. cmd = /usr/sbin/btrfs property get /mnt2/Data/.snapshots/Configs/Configs_4_replication_1 ro. rc = 1. stdout = ['']. stderr = ['ERROR: failed to open /mnt2/Data/.snapshots/Configs/Configs_4_replication_1: No such file or directory', 'ERROR: failed to detect object type: No such file or directory', 'usage: btrfs property get [-t <type>] <object> [<name>]', '', ' Gets a property from a btrfs object.', '', ' If no name is specified, all properties for the given object are', ' printed.', ' A filesystem object can be a the filesystem itself, a subvolume,', " an inode or a device. The '-t <type>' option can be used to explicitly", ' specify what type of object you meant. This is only needed when a', ' property could be set for more then one object type. Possible types', ' are s[ubvol], f[ilesystem], i[node] and d[evice].', '', '']
[27/Nov/2020 17:52:01] ERROR [smart_manager.replication.sender:74] Id: 00000000-0000-0000-0000-AC1F6B14152E-4. Failed to create snapshot: Configs_4_replication_1. Aborting… Exception: HTTPConnectionPool(host='127.0.0.1', port=8000): Max retries exceeded with url: /api/shares/5/snapshots/Configs_4_replication_1 (Caused by <class 'httplib.BadStatusLine'>: '')

For testing I made it run every 5 mins, and now I just get:

[27/Nov/2020 18:00:02] ERROR [storageadmin.util:44] Exception: Snapshot (Configs_4_replication_1) already exists for the share (Configs).
Traceback (most recent call last):
  File "/opt/rockstor/eggs/gunicorn-19.7.1-py2.7.egg/gunicorn/workers/sync.py", line 68, in run_for_one
    self.accept(listener)
  File "/opt/rockstor/eggs/gunicorn-19.7.1-py2.7.egg/gunicorn/workers/sync.py", line 27, in accept
    client, addr = listener.accept()
  File "/usr/lib64/python2.7/socket.py", line 202, in accept
    sock, addr = self._sock.accept()
error: [Errno 11] Resource temporarily unavailable
[27/Nov/2020 18:00:02] ERROR [smart_manager.replication.sender:74] Id: 00000000-0000-0000-0000-AC1F6B14152E-4. Failed to create snapshot: Configs_4_replication_1. Aborting… Exception: 500 Server Error: INTERNAL SERVER ERROR

@Jorma_Tuomainen Hello again.
Re:

If this is not a Stable variant but our last released testing channel in the CentOS flavour (3.9.1-16), then replication was broken in that run. Replication only works in the Stable variant of the CentOS flavour and in our more recent 4.0 'Built on openSUSE' variant. That last testing channel ended up making a lot of changes that broke replication, and we only got to fixing it in the Stable branch. But those fixes were carried through to the 4.0 testing channel.

Also note that if replication fails due to network issues then we have a failure much like:

Where the sender creates a snapshot and leaves it there when it can't contact the receiver. Deleting this snapshot by hand, on the sender only, can free up the next replication event. This is a known bug. Not sure if your situation is just this or the broken replication in CentOS testing.
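
If you need to clear that blocking snapshot by hand, a minimal sketch from a terminal on the sender would be the following (the pool/share names are taken from your logs above, so adjust to suit; btrfs subvolume delete works on read-only snapshots directly):

# Find the left-over replication snapshot on the pool.
btrfs subvolume list /mnt2/Data | grep replication

# Delete it so the next replication event can start clean.
btrfs subvolume delete /mnt2/Data/.snapshots/Configs/Configs_4_replication_1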

Hope that helps.


Old meaning 3.9.2-57; the logs are from that machine. Probably not a network error: the replication shares do appear, via replication receive, on the other end. I have tried deleting the shares multiple times, and that is the result.

Might be a problem on the newer one; these logs are from the offsite NAS (and before that I deleted the share and confirmed that there was no such subvolume, and that it was created by Rockstor):

[28/Nov/2020 09:40:02] ERROR [smart_manager.replication.receiver:95] Failed to create the replica metadata object for share: 00000000-0000-0000-0000-AC1F6B14152E_Configs… Exception: 500 Server Error: INTERNAL SERVER ERROR for url: http://127.0.0.1:8000/api/sm/replicas/rshare
[28/Nov/2020 09:40:02] ERROR [storageadmin.util:45] Exception: Replicashare(00000000-0000-0000-0000-AC1F6B14152E_Backup) already exists.
Traceback (most recent call last):
  File "/opt/rockstor/eggs/gunicorn-19.7.1-py2.7.egg/gunicorn/workers/sync.py", line 68, in run_for_one
    self.accept(listener)
  File "/opt/rockstor/eggs/gunicorn-19.7.1-py2.7.egg/gunicorn/workers/sync.py", line 27, in accept
    client, addr = listener.accept()
  File "/usr/lib64/python2.7/socket.py", line 206, in accept
    sock, addr = self._sock.accept()
error: [Errno 11] Resource temporarily unavailable

Any idea how to do the replication manually so it'll hopefully work in the future when I update the main NAS to Rockstor 4 (where it's hopefully fixed)?

I'm a little time-pressed, so I can do the initial transfer over LAN (I have 1000/1000 but the other end does not :frowning:).

@Jorma_Tuomainen, Hello again.
Re:

Is that at both the sender and receiver?

If either one is the old CentOS testing then it will fail. But from a quick look, your report looks like it could be the known issue.

It's not the shares that need deleting, it's the last snapshot created on the sender. Note that both sender and receiver report on each replication event; sometimes one is more informative than the other. The only known fault in replication from 3.9.2-57 and newer is this blocking snapshot that gets left behind when there is a failure to 'see' the receiver. See:

Which specifies our Leap variant but is shared with our soon-to-be-legacy CentOS variant.

Rockstor's replication takes 5 iterations to settle down into its stable state. Before that it can look quite confusing, with a strangely named 'share' that is meant to be hidden but is, due to a cosmetic bug, visible. But once it settles in, this share/snapshot is removed.

We need a proper technical write-up/wiki for our replication, and a while ago I almost finished preparing one. But unfortunately I've not managed to return to it to complete it. I'll try harder when I next get the time to prioritise this.

Rockstor's replication is based on btrfs send/receive, but due to its specific structure/naming etc. it would be very tricky, without the previously indicated 'missing' technical write-up, to duplicate this. But that does not stop you doing a regular btrfs send/receive via cron or the like; it just won't integrate with the Web-UI.
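
As a rough sketch of that manual route (plain btrfs only, nothing Rockstor-specific; the receiver hostname and destination pool path here are made up, so adjust to your setup):

# On the sender: send requires a read-only (-r) snapshot.
btrfs subvolume snapshot -r /mnt2/Data/Configs /mnt2/Data/.snapshots/Configs/manual_1

# Initial full transfer to the receiver over ssh.
btrfs send /mnt2/Data/.snapshots/Configs/manual_1 | ssh root@receiver btrfs receive /mnt2/Pool

# Subsequent runs (e.g. from cron): snapshot again, then send only the
# differences relative to the previous (-p parent) snapshot.
btrfs subvolume snapshot -r /mnt2/Data/Configs /mnt2/Data/.snapshots/Configs/manual_2
btrfs send -p /mnt2/Data/.snapshots/Configs/manual_1 /mnt2/Data/.snapshots/Configs/manual_2 | ssh root@receiver btrfs receive /mnt2/Pool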

Again, sorry for not having the technical doc done on this. All a matter of time and priorities, I'm afraid.

Hope that helps.


The sender is 3.9.2-57, the receiver is 4.0.4. And I've been deleting shares on the receiver and snapshots on the sender and re-enabling replication. Does not work; I'll probably need to fall back to btrbk.

OK. Once you delete shares and associated snapshots on the receiver, you may very well have to re-set-up replication from scratch, as the existing one expects a certain structure at the receiver. It's basically built around a share and 3 read-only snapshots (btrfs send/receive requires read-only snapshots). The Rockstor code then initiates a replication send of the differences since the last replication, using the most recent snapshot at each end. So if it's failing because the last snapshot on the sender didn't get replicated, once you delete anything other than that snapshot you pretty much have to start afresh.
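
As an aside, you can check whether any given snapshot is read-only with the same command Rockstor itself runs (it appears in your tracebacks above); substitute your own snapshot path:

btrfs property get /mnt2/Data/.snapshots/Configs/Configs_4_replication_1 ro
# A snapshot usable by btrfs send should report: ro=true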

Yes, I think I've heard good things about that project. Let us know how you get on. It may be a nice thing to integrate in the future, as they specialise in replication.

We definitely have work to do on the replication front, but it's just not a priority, as it's at feature parity ready for our next Stable release. And once everything else is likewise, we can move forward on technical debt and enhancements as we go. One of which will hopefully be the cited issue. Assuming someone finds the time/inclination, that is.

Apologies for not being of more help here. I am myself using replication without issue between two Rockstor 4s, and have been doing so for quite some time now during our move to the 'Built on openSUSE' variant. A hindrance re replication is that it's entirely non-trivial within our current setup, so it's tricky for folks to just drop in and do fixes. Hopefully in time we will receive more contributions in this area of the code.

Thanks again for your reports and let us know how you get on with btrbk.


Ok, I redid everything and it went on for a while (I put it on 5 min intervals); it worked, then I got:

[29/Nov/2020 05:20:01] ERROR [storageadmin.util:44] Exception: Snapshot (Configs_5_replication_122) already exists for the share (Configs).
Traceback (most recent call last):
  File "/opt/rockstor/eggs/gunicorn-19.7.1-py2.7.egg/gunicorn/workers/sync.py", line 68, in run_for_one
    self.accept(listener)
  File "/opt/rockstor/eggs/gunicorn-19.7.1-py2.7.egg/gunicorn/workers/sync.py", line 27, in accept
    client, addr = listener.accept()
  File "/usr/lib64/python2.7/socket.py", line 202, in accept
    sock, addr = self._sock.accept()
error: [Errno 11] Resource temporarily unavailable

Sorted it by deleting the snapshot, and it seemed to re-create it and upload it.

The logs are also full of stuff like:
[29/Nov/2020 02:49:09] ERROR [system.osi:174] non-zero code(1) returned by command: ['/usr/sbin/btrfs', 'qgroup', 'show', '/mnt2/Data/.snapshots/Backup/bb_202010042300']. output: [''] error: ["ERROR: can't list qgroups: quotas not enabled", '']

Now doing bigger shares (does this feature have any resume capability?).

@Jorma_Tuomainen Re:

OK, so that’s at least a workaround.

This looks to be just quotas being disabled. If the Web-UI shows them as enabled, this could be a bug; I have seen this happen when a pool is imported. If you want quotas enabled, just enable them via the command line. The Web-UI should then be back in sync and will be able to successfully disable/re-enable them from there on. Pool quota state can get out of sync on pool imports, and this command-line switch can then help re-sync it.
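
That is, something like the following sketch, assuming 'Data' as the pool name from your logs:

# Enable quotas on the pool from the command line.
btrfs quota enable /mnt2/Data

# Verify: this should now list qgroups rather than erroring out.
btrfs qgroup show /mnt2/Data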

No: btrfs send/receive has no such capability, and as Rockstor's replication is a light convenience wrapper around that, we also have no resume capability.

Hope that helps.
