Replication is failing on v3.9.1-16

I have successfully paired 2 rockstor servers. Both are running v3.9.1-16.
Unfortunately, I can’t get replication to work. I am seeing the following error:

[08/Nov/2017 14:00:36] ERROR [smart_manager.replication.sender:73] Id: 423B4A9B-3B81-99CA-55E5-1A7620713001-2. Failed to create snapshot: aim_2_replication_1. Aborting… Exception: [u’Invalid api end point: http://127.0.0.1:8000/api/shares/aim/snapshots/aim_2_replication_1’]

@clawes Thanks for the report. It may be that this is an as-yet-unreported regression from a recent share API change. At least that is what it looks like on first examination.

As such I have created the following Github issue to address this:

Thanks again for the report. I thought we had managed to deal with all of these regressions during the testing release, but it would appear not. Apologies. Hopefully this should be picked up pretty quickly, and I have made a note in that issue to update this forum thread upon the issue's resolution. I'm not personally familiar with that area of the code, but I am due to be, so I may well be able to pick this one up myself in a bit (if nobody else beats me to it, of course). But I can't adopt it directly, unfortunately.

Hope that helps.

I have the exact same issue after pairing 2 recent installs and attempting replication. They can see each other, and the remote share is created successfully, but every attempt fails with the same log entries. Both are on version 4.12.4-1.

Sender log

[24/Nov/2017 13:00:02] ERROR [smart_manager.replication.sender:73] Id: 00000000-0000-0000-0000-0xxxxxxxxxxx-1. Failed to create snapshot: test_1_replication_1. Aborting… Exception: [u’Invalid api end point: http://127.0.0.1:8000/api/shares/test/snapshots/test_1_replication_1’]

Receiver log

[24/Nov/2017 13:00:02] ERROR [smart_manager.replication.receiver:89] Failed to create the replica metadata object for share: 00000000-0000-0000-0000-0xxxxxxxxxxx_test… Exception: 500 Server Error: INTERNAL SERVER ERROR

@clawes, @ecook280, and @Prabu_Raj (from a thread linking to this one). As per:

I finally got around to working on this replication regression issue.

As per Rockstor stable channel release version 3.9.12-13 (edit: 3.9.2-13), the indicated issue is now closed via its associated pull request:

There is more robustness work to be done in this area but the indicated

is now fixed, along with some code and UI improvements that should help with this feature going forward.

Hope that helps and thanks for your reports and patience on this one.

I encountered this issue today and found this discussion. I then subscribed both of my Rockstor instances to the stable channel (3-year subscriptions), expecting an update. But after I activated the stable channel, it says I am now on 3.9.2-17 and that this is the latest version… My replication is still broken by this bug…

@niu_lin Hello again, and thanks for helping to support Rockstor development.

Pretty sure you are running into an issue we have with displaying the available version as the installed version when switching from testing to stable. It's a pain, as the fix is itself in an update that is not actually applied yet: a bit of a chicken-and-egg situation that will be resolved when the next ISO is released. The relevant issue is:
https://github.com/rockstor/rockstor-core/issues/1870
and its fix (from 10th December):

Essentially you need to update manually this first time. To confirm you have this issue, run the following command:

yum info rockstor

You should see that your Web-UI's indicated ‘installed’ version is actually the available version, not the installed one. Bit of a nasty one really; hence the almost instantaneous ‘update’ directly after subscribing to stable. Anyway, if you run:

yum update

on each machine as the root user, you should be up and running with all of the last few months of goodness. Thereafter the Web-UI should function normally, displaying the installed version and prompting when a newer version is available. You will probably have to remove your current replication config and its associated snapshots, but see how it goes.
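
For those who like to double check from the command line, something like the following (run as root on each machine) should confirm the before and after; the exact output will of course vary with your repos and versions:

# show the package version actually installed on disk
rpm -q rockstor

# show what yum sees as installed vs available from Rockstor-Stable
yum info rockstor

# pull in the pending stable updates, Web-UI indicator fix included
yum update

# re-check: this should now match what the Web-UI reports as installed
rpm -q rockstor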

Hope that helps and let us know how you get on.

Thanks @phillxnet for your quick reply. As suggested, I did a manual yum update and it brought me to version 3.9.2.
However, the Web-UI still says I am on 3.9.2-17 and that it is the latest. As you mentioned, the replication fix is in 3.9.12-13. What should I do to receive the up-to-date version?
Following is my yum info output:

[root@nas2 ~]# yum info rockstor
Loaded plugins: changelog, fastestmirror
Loading mirror speeds from cached hostfile
 * base: centos.ustc.edu.cn
 * epel: mirror01.idc.hinet.net
 * extras: mirrors.nju.edu.cn
 * updates: mirrors.nju.edu.cn
Installed Packages
Name        : rockstor
Arch        : x86_64
Version     : 3.9.2
Release     : 17
Size        : 77 M
Repo        : installed
From repo   : Rockstor-Stable
Summary     : RockStor -- Store Smartly
License     : GPL
Description : RockStor -- Store Smartly

@niu_lin

I believe he meant 3.9.2-13, as I don’t believe they’re up to 3.9.12.
Hell, according to GitHub, they’re not even up to 3.9.2-17, so…

Anyway, I am still getting the same error in replication. I even rebooted both instances without luck. Any suggestions?

@niu_lin as per:

Thanks @Haioken.

Yes, sorry about that. Your Web-UI indicator should now be working as expected and you should now have the updated replication code.

Yes:

So remove the remains of the previously failed replication:

On sender:

  • Delete the replication-related snapshots of the share you are replicating (they are obviously named).
  • Delete the replication send task.

On receiver:

  • Delete the replication-related snapshots of the below shares:
      • all replication-related shares (obviously named, as they have really long names which include the sender appliance id).
  • Delete the receiver task.

Given the prior code failed early and hadn’t worked for a while, there is really nothing but a skeleton to remove.

This way the fixed code gets a clean start so when you configure a fresh send task it should be good.

Let us know how it goes. And please be patient, as it takes 5 scheduled replication task events to fully ‘settle in’ to its final state: 3 snapshots of the source share, plus a destination share (one that doesn’t begin with ‘.’) with 3 snapshots of its own.
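
If you want to peek at what is happening on disk while it settles, btrfs itself can list the shares and snapshots as they appear. A rough sketch, assuming Rockstor’s usual /mnt2 mount point and a pool I’ve called ‘mypool’ purely for illustration:

# list all subvolumes (shares and snapshots) on the pool; the replication
# related ones have long, obviously generated names
btrfs subvolume list /mnt2/mypool

# narrow it down to anything replication related
btrfs subvolume list /mnt2/mypool | grep -i replication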

Hope that helps. Also note that btrfs send / receive, the system we use for the replication, is not currently robust to interruptions.

Thank you very much @phillxnet. I followed your steps one by one and finally got it working! My replication is running right now; I really appreciate your patience.

@niu_lin Glad it’s on its way. We do need to update our docs on this and add a technical wiki to explain what’s going on in the background. Oh well, bit by bit; I’ll hopefully have a go at this myself soon, while it’s still fresh.

Keep an eye on it, and once the initial transfer is done it should be a little less touchy, as it then only has to send the changes, which are far less time-consuming. We do have improvements planned in this area, and my tests prior to releasing that last regression bug fix indicated a lower-than-expected transfer rate, so do keep that in mind. I expect everything to improve with time.
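
For the curious, the underlying mechanism is plain btrfs send / receive: a full send on the first pass and parent-based incremental sends thereafter, which is why later runs are so much quicker. Driven by hand it looks roughly like the sketch below; the snapshot names and paths are made up purely to illustrate, and in practice Rockstor drives this for you and pipes the stream over the network rather than locally:

# first pass: send an entire read-only snapshot to the destination pool
btrfs send /mnt2/srcpool/.snapshots/share/snap_1 | btrfs receive /mnt2/dstpool/

# later passes: send only the difference from the previous (parent) snapshot
btrfs send -p /mnt2/srcpool/.snapshots/share/snap_1 \
    /mnt2/srcpool/.snapshots/share/snap_2 | btrfs receive /mnt2/dstpool/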

Do report any errors and issues you have with this feature: as with all non-trivial projects it’s a perpetual ‘work in progress’ and there are already improvements earmarked.

Thanks for the update.

Now I have another issue. After several replications, it promoted the oldest snapshot to a share; however, the receiver responded with an error, ‘No response received from the broker, remaining tries 0, terminating the receiver’, and the replication is marked as failed on both the sender and the receiver side. However, on the receiver side everything actually looks good: I can mount the replicated share and read the data normally. Any suggestions?

And here’s the logs:

[07/Mar/2018 05:40:08] ERROR [smart_manager.replication.receiver:354] Id: ******-3B45-4F96-B68D-******-8. No response received from the broker. remaining tries: 9
[07/Mar/2018 05:40:14] ERROR [smart_manager.replication.receiver:354] Id: ******-3B45-4F96-B68D-******-8. No response received from the broker. remaining tries: 8
[07/Mar/2018 05:40:20] ERROR [smart_manager.replication.receiver:354] Id: ******-3B45-4F96-B68D-******-8. No response received from the broker. remaining tries: 7
[07/Mar/2018 05:40:26] ERROR [smart_manager.replication.receiver:354] Id: ******-3B45-4F96-B68D-******-8. No response received from the broker. remaining tries: 6
[07/Mar/2018 05:40:32] ERROR [smart_manager.replication.receiver:354] Id: ******-3B45-4F96-B68D-******-8. No response received from the broker. remaining tries: 5
[07/Mar/2018 05:40:38] ERROR [smart_manager.replication.receiver:354] Id: ******-3B45-4F96-B68D-******-8. No response received from the broker. remaining tries: 4
[07/Mar/2018 05:40:44] ERROR [smart_manager.replication.receiver:354] Id: ******-3B45-4F96-B68D-******-8. No response received from the broker. remaining tries: 3
[07/Mar/2018 05:40:50] ERROR [smart_manager.replication.receiver:354] Id: ******-3B45-4F96-B68D-******-8. No response received from the broker. remaining tries: 2
[07/Mar/2018 05:40:56] ERROR [smart_manager.replication.receiver:354] Id: ******-3B45-4F96-B68D-******-8. No response received from the broker. remaining tries: 1
[07/Mar/2018 05:41:02] ERROR [smart_manager.replication.receiver:354] Id: ******-3B45-4F96-B68D-******-8. No response received from the broker. remaining tries: 0
[07/Mar/2018 05:41:02] ERROR [smart_manager.replication.receiver:91] No response received from the broker. remaining tries: 0. Terminating the receiver.. Exception: No response received from the broker. remaining tries: 0. Terminating the receiver.

I restarted the receiver NAS and the errors disappeared.
Now I have encountered another issue: I put a new file on the sender’s share, and after a while the replication task started and everything went well. However, the new file is not visible on the receiver’s share… No error reported; I’m still studying it.

Sorry, my mistake: I needed to do a cd . in the directory so it would refresh… things are going well now. Thank you. I think I have now finished all the basic functionality tests.

@niu_lin

Afraid not. Do both machines have a stable IP, or better still a static one? It could be that one has changed its IP mid session. You will probably find that one or other task has disabled itself, as they do that after a set number of failures to avoid log/email spamming (if Email Notifications is configured). It can definitely work for an extended period, although I only went up to around the 200 scheduled rep task mark before creating the pull request. Also note that it may complain about a snapshot already existing (on the sender) after a failure of this type (auto-off from repeated fails). This can be worked around by simply deleting the indicated snapshot; it’s the next thing on the list to sort in this area (see TODO comments):

Also make sure to look at the error indicated on the first failed replication task (click on the task to get its history) at both the sender and the receiver; later indicators are often less informative. And depending on the issue, one side may be more informative than the other, though not necessarily the end one would expect, as they pass messages back and forth.
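
On the stable IP point: if you keep seeing ‘No response received from the broker’, it’s also worth confirming the two machines can still reach each other on the replication listener port. That port is set in the Replication service settings; I believe 10002 is the default, but do check what yours is configured to. A quick sketch, with a made-up IP:

# on the machine running the listener, confirm something is listening on the port
ss -ltn | grep 10002

# from the other machine, confirm that port is reachable over the network
nc -vz 192.168.1.10 10002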

Take care with this, as that share will be supplanted by the oldest snapshot at every replication task, so you would probably want to share it, if at all, as read-only. I’m unsure, but it could be that sharing it and having a file open at the time of an oldest-snap-to-share promotion could block that promotion process and consequently break the replication. You could test for that: try leaving the receiver share unshared to rule that out. You could also try snapping the share once the rep process has settled down, between rep tasks, and then cloning that snap (if it’s rw); then you have a second, unmanaged copy. Think of the target / receiver share as a managed off-line copy, not as a live duplicate; it’s not a clustered file system solution but a btrfs send / receive maintained, scheduled-copy type thing.
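
To illustrate the snap-then-clone idea: the Web-UI’s snapshot and clone functions are the supported route (so Rockstor’s database stays in step), but the raw btrfs equivalent looks roughly like this, with made-up pool and share names:

# between rep tasks, take a read-only snapshot of the settled receiver share
btrfs subvolume snapshot -r /mnt2/dstpool/replicated_share /mnt2/dstpool/replicated_share_snap

# clone it into an independent, writable subvolume you can export freely
btrfs subvolume snapshot /mnt2/dstpool/replicated_share_snap /mnt2/dstpool/my_unmanaged_copy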

So try not exporting the receiver share initially and doing nothing with it until you have stability, and take it from there. I did do multiple runs of >150 cycles with varying data sizes to test its stability, though that was without network changes and with no exporting on the receiver side.

See how you get on.

OK, it seems our posts on this thread have overlapped; I’ll leave what I wrote above anyway, in case it helps others catch known gotchas.

It will take 4 rep task events, I think, for new data to appear on the receiver share, as it gets promoted through the 3 snapshots.

I hope it is working for you, and yes, this ‘feature’ could definitely do with updated docs / explanation. I’ll try and get to that, as it’s ripe to be misunderstood / confusing otherwise.

I was quite excited to get this up and running again, as it did work for a goodly while before the recent regression, and it’s quite a nice bit of code all in (by @suman), but it would obviously benefit from more attention. Take careful notes if you fancy reporting anything, and hopefully we can weed out the weak points one at a time.

Thanks for persevering.