Replication failing

Stefan · May 4, 2020, 3:53am

Hi All

I am setting up replication and I am getting a receiver-error.

Primary: rockstor
Secondary: backup

I noticed that when I add a receiver, the backup appliance shows up as Rockstor in the interface. However, I set up the appliance communication using IP addresses.

Receiver:
[04/May/2020 14:14:02] ERROR [storageadmin.util:44] Exception: Share (1E484D56-64A3-9411-8BCA-716EC6808932_data) already exists. Choose a different name.
Traceback (most recent call last):
  File "/opt/rockstor/eggs/gunicorn-19.7.1-py2.7.egg/gunicorn/workers/sync.py", line 68, in run_for_one
    self.accept(listener)
  File "/opt/rockstor/eggs/gunicorn-19.7.1-py2.7.egg/gunicorn/workers/sync.py", line 27, in accept
    client, addr = listener.accept()
  File "/usr/lib64/python2.7/socket.py", line 202, in accept
    sock, addr = self._sock.accept()
error: [Errno 11] Resource temporarily unavailable
[04/May/2020 14:14:02] ERROR [smart_manager.replication.receiver:91] Failed to verify/create share: 1E484D56-64A3-9411-8BCA-716EC6808932_data… Exception: 500 Server Error: INTERNAL SERVER ERROR

Sender:

[04/May/2020 14:14:05] ERROR [system.osi:119] non-zero code(1) returned by command: ['/usr/sbin/btrfs', 'qgroup', 'assign', '0/6155', '-1/-1', '/mnt2/data_pool']. output: [''] error: ['ERROR: unable to assign quota group: Invalid argument', '']
[04/May/2020 14:14:05] ERROR [fs.btrfs:1189] "ERROR: unable to assign quota group: Invalid argument" received on fs (/mnt2/data_pool), skipping qgroup assign: child (0/6155), parent (-1/-1). This may be related to an undetermined quota state.
[04/May/2020 14:14:05] ERROR [smart_manager.replication.sender:74] Id: 1E484D56-64A3-9411-8BCA-716EC6808932-2. unexpected reply(receiver-error) for 1E484D56-64A3-9411-8BCA-716EC6808932-2. extended reply: Failed to verify/create share: 1E484D56-64A3-9411-8BCA-716EC6808932_data… Exception: 500 Server Error: INTERNAL SERVER ERROR. Aborting. Exception: unexpected reply(receiver-error) for 1E484D56-64A3-9411-8BCA-716EC6808932-2. extended reply: Failed to verify/create share: 1E484D56-64A3-9411-8BCA-716EC6808932_data… Exception: 500 Server Error: INTERNAL SERVER ERROR. Aborting

Best regards

phillxnet · May 4, 2020, 11:13am

@Stefan H Hello again.

We have a naming bug in this regard, but if you’ve used the IP you should be good.

There is an outstanding fragility in our replication where a network / communication failure can result in a snapshot being created ready for send that is then not sent. That unsent snapshot then blocks the next scheduled send and all subsequent sends. The workaround is to delete that snapshot so that a fresh one can be created and then sent ‘in one go’ as it were. It’s a workaround, but it does free up subsequent replication steps.

I’ve not seen this reported in the wild but it’s actually something I’ve just fixed as part of a pull request I’m currently working on and is, as indicated, down to an

And a bug in how we protect against making such a call. -1/-1 is our shorthand ‘flag’ for the pool native quota group not being available, whether because quotas are disabled (try enabling them if they are), because the file system is read-only, or because of the indicated ‘undetermined quota state’.
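To illustrate the kind of guard meant here, a minimal sketch follows. It is not Rockstor's actual code: the function name and constant name are hypothetical, and only the -1/-1 value and the command shape are taken from the sender log above. The idea is simply to skip the qgroup assign call whenever the parent is the -1/-1 flag.

```python
# Sketch only: QGROUP_UNAVAILABLE and build_qgroup_assign_cmd are hypothetical
# names, not taken from the Rockstor code base.
QGROUP_UNAVAILABLE = "-1/-1"  # the 'flag' value seen in the sender log


def build_qgroup_assign_cmd(child, parent, mount_point):
    """Return the btrfs command as a list, or None if the assign should be skipped."""
    if parent == QGROUP_UNAVAILABLE:
        # Quotas disabled, read-only file system, or undetermined quota state:
        # there is no usable parent qgroup, so do not call btrfs at all.
        return None
    return ["/usr/sbin/btrfs", "qgroup", "assign", child, parent, mount_point]
```

With the values from the log, build_qgroup_assign_cmd("0/6155", "-1/-1", "/mnt2/data_pool") returns None, so the failing command would never be run.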

So in short, your current quota state on one of the pools may be throwing off the replication, at least until I get to publish this fix. I’ve actually only seen this -1/-1 error on an openSUSE install. Is either of your machines using our new ‘Built on openSUSE’ testing channel?

Could you also indicate your setup? For replication to work, both machines must be running a Stable subscription, as bugs in replication have been fixed since we terminated the testing channel for our CentOS rpms. Also, due to a bug in our now legacy CentOS testing channel, if you have updated from that channel to the stable channel you may not actually have stable installed, even if the Web-UI indicates this. See the following forum thread for a one-off fix if that is the case:

Also note that, as replication uses the Appliance ID, each machine’s Appliance ID must be unique.

So in short, make sure both machines are actually running the latest Stable, or the latest openSUSE based testing rpms (though they are less well tested in this regard currently), checking via the command line (above forum post), and that both machines have unique Appliance IDs.

Have you had replication working in this setup before? If so, you may just be affected by the known failure due to a network outage, where the sender can’t see the receiver but makes the initial ‘blocking’ snapshot anyway.

As a potentially related element here, I believe one of your machines may be affected by an issue we had when migrating to Appman, where older subscriptions are unfortunately not correctly authenticated if their Appliance ID has changed since install within our migration period. Could you sign up to our Appman service and ensure both your machines’ Appliance IDs are current (you can edit them if not)? If they are current and one machine still doesn’t receive the latest 3.9.2-57 update, then please Private Message me here on the forum and I will sort it out there, where we can more easily exchange new credentials if required.

Let us know how it goes, and additionally whether both machines are current via the command line, as that is really the first step: our last CentOS based testing channel release had a known broken replication subsystem.

Hope that helps, and remember to PM me if both machines are not current and updating/correcting their Appliance IDs in Appman doesn’t sort their updates, so we can get that sorted and return to this public discussion on the replication issue.

Stefan · May 5, 2020, 11:52am

Thanks @phillxnet

I had replication working until then. Now it starts and then stops, but I suspect my switch has issues.

I PMed you regarding my primary; it is currently not updating to the latest version. The Appliance ID changed a while ago when I migrated to new HW, but I received a new license for it.

Best regards
Stefan

phillxnet · May 5, 2020, 11:57am

@Stefan Cheers, that clears things up a little.

And yes, a network issue can currently break the replication in the manner I detailed.

Let’s first get your machine sorted with the Appman issue in the PM and then return to this and your other thread’s replication issue.

Linking to your other thread for context: