Replication failing

@Stefan H Hello again.

We do have a naming bug in this regard, but if you’ve used the IP you should be good.

There is an outstanding fragility in our replication where a network / communication failure can result in a snapshot being created, ready for send, that is then never sent. That stale snapshot then blocks the next scheduled send and all subsequent sends. The workaround is to delete that snapshot so that a fresh one can be created and sent ‘in one go’, as it were. It’s only a workaround, but it does free up subsequent replication steps.
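By way of a rough sketch of that workaround (the snapshot path below is purely illustrative; locate the real one with `btrfs subvolume list <pool-mount>`):

```python
import subprocess

# Illustrative path only: the stale, never-sent replication snapshot.
# Rockstor mounts pools under /mnt2/<pool-name>; find the actual
# snapshot via 'btrfs subvolume list /mnt2/<pool-name>'.
STALE_SNAP = "/mnt2/mypool/.snapshots/myshare/stale_replication_snap"

# Deleting the blocking snapshot lets the next scheduled run create
# and send a fresh one 'in one go'.
subprocess.run(["btrfs", "subvolume", "delete", STALE_SNAP], check=True)
```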

I’ve not seen this reported in the wild, but it’s actually something I’ve just fixed as part of a pull request I’m currently working on. It is, as indicated, down to an undetermined quota state, and a bug in how we protect against making such a call: -1/-1 is our shorthand ‘flag’ for ‘pool native quota group not available’, either via quotas being disabled (try enabling them if they are), a read-only file system, or the aforementioned undetermined quota state.
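To illustrate the kind of guard involved (a sketch only, with invented names, not our actual code):

```python
# "-1/-1" is the shorthand 'flag' described above: pool native quota
# group not available (quotas disabled, read-only fs, or an
# undetermined quota state).
PQGROUP_UNAVAILABLE = "-1/-1"

def pqgroup_usable(pqgroup: str) -> bool:
    """Guard qgroup-dependent calls: only proceed when we have a real
    pool native quota group, not the 'unavailable' flag value."""
    return pqgroup != PQGROUP_UNAVAILABLE
```

The bug class being fixed is, in essence, making such a call without first consulting a guard of this kind.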

So in short, your current quota state on one of the pools may be breaking the replication, at least until I get to publish this fix. I’ve actually only seen this -1/-1 error on an openSUSE install. Is either of your machines using our new ‘Built on openSUSE’ testing channel?
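In the meantime, if you want to check the quota state on each pool, something along the following lines should do (the pool mount path is illustrative):

```python
import subprocess

POOL_MOUNT = "/mnt2/mypool"  # illustrative; adjust to your pool's mount point

# 'btrfs qgroup show' fails when quotas are disabled on that pool;
# 'btrfs quota enable' switches them back on.
state = subprocess.run(["btrfs", "qgroup", "show", POOL_MOUNT])
if state.returncode != 0:
    subprocess.run(["btrfs", "quota", "enable", POOL_MOUNT], check=True)
```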

Could you also indicate your setup? For replication to work, both machines must be running a Stable subscription, as bugs in replication have been fixed since we terminated the testing channel for our CentOS rpms. Also, due to a bug in that now legacy CentOS testing channel, if you have updated from it to the stable channel you may not actually have stable installed, even if the Web-UI indicates that you do. See the following forum thread for a one-off fix if that is the case:

Also note that, as replication uses the Appliance ID, each machine must be unique in this regard.

So in short, make sure both machines are actually running the latest Stable rpms, or the latest openSUSE-based testing rpms (though the latter are currently less well tested in this regard), verified via the command line (see the above forum post), and that both machines have unique Appliance IDs.
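As an example of that command-line check (a sketch; the referenced forum post has the authoritative one-off fix):

```python
import subprocess

# On CentOS-based installs, 'yum info rockstor' reports the installed
# package version: trust this over the Web-UI if the two disagree.
subprocess.run(["yum", "info", "rockstor"], check=True)

# On the 'Built on openSUSE' installs the equivalent would be:
# subprocess.run(["zypper", "info", "rockstor"], check=True)
```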

Have you had replication working in this setup before? If so, it may be that you are just affected by the known failure following a network outage, where the sender can’t see the receiver but makes the initial ‘blocking’ snapshot anyway.

As a potentially related element here, I believe one of your machines may be affected by an issue we had when migrating to Appman, where older subscriptions are unfortunately not correctly authenticated if their Appliance ID changed post-install, within our migration period. If you could sign up to our Appman service and ensure both your machines’ Appliance IDs are current (you can edit them if not), and if they are current and one still doesn’t receive the latest 3.9.2-57 update, then please Private Message me here on the forum; I will sort it out there, where we can more easily exchange new credentials if required.

Let us know how it goes, and also whether both machines are current, as checked via the command line: that is really the first step, since our last CentOS-based testing channel release had a known-broken replication subsystem.

Hope that helps, and remember to PM me if both machines are not current and updating/correcting their Appliance IDs in Appman doesn’t sort their updates, so we can get that resolved and return to this public discussion of the replication issue.
