Replication is failing on v3.9.1-16

clawes · November 8, 2017, 10:21pm

I have successfully paired 2 rockstor servers. Both are running v3.9.1-16.
Unfortunately, I can’t get replication to work. I am seeing the following error:

[08/Nov/2017 14:00:36] ERROR [smart_manager.replication.sender:73] Id: 423B4A9B-3B81-99CA-55E5-1A7620713001-2. Failed to create snapshot: aim_2_replication_1. Aborting… Exception: [u’Invalid api end point: http://127.0.0.1:8000/api/shares/aim/snapshots/aim_2_replication_1’]

phillxnet · November 9, 2017, 5:37pm

@clawes Thanks for the report. It may be that this is an as yet unreported regression from a recent share api change. At least this is what it looks like on first examination.

As such I have created the following Github issue to address this:

Thanks again for the report. I thought we had managed to deal with all these regressions during the testing release, but it would appear not. Apologies. Hopefully this should be picked up pretty quickly, and I have made a note in that issue to update this forum thread upon the issue’s resolution. I’m not personally familiar with that area of the code but I am due to be, so I may well be able to pick this one up myself in a bit (if nobody else beats me to it of course). But I can’t adopt it directly unfortunately.

Hope that helps.

ecook280 · November 24, 2017, 9:04pm

I have the exact same issue after pairing 2 recent installs and attempting replication. They can see each other and the remote share is created successfully, but replication fails on every attempt with the same log entries. Both are on version 4.12.4-1

Sender log

[24/Nov/2017 13:00:02] ERROR [smart_manager.replication.sender:73] Id: 00000000-0000-0000-0000-0xxxxxxxxxxx-1. Failed to create snapshot: test_1_replication_1. Aborting… Exception: [u’Invalid api end point: http://127.0.0.1:8000/api/shares/test/snapshots/test_1_replication_1’]

Receiver log

[24/Nov/2017 13:00:02] ERROR [smart_manager.replication.receiver:89] Failed to create the replica metadata object for share: 00000000-0000-0000-0000-0xxxxxxxxxxx_test… Exception: 500 Server Error: INTERNAL SERVER ERROR

phillxnet · January 30, 2018, 12:09pm

@clawes, @ecook280, and @Prabu_Raj (from a thread linking to this one). As per:

I finally got around to working on this replication regression issue.

As per Rockstor stable channel release version ~~3.9.12-13~~ 3.9.2-13 the indicated issue is now closed via its associated pull request:

github.com/rockstor/rockstor-core

Fix replication regression re share api change. Fixes #1853

rockstor:master ← phillxnet:1853_suspected_replication_regression_re_share_api_change

opened 07:24PM - 24 Jan 18 UTC

phillxnet

+270 -70

Prior function was restored by updating the relevant api urls in the replication system. But in one, non critical check this was not managed. As a result the non critical but desirable internal sanity check was, for the time being, remarked out and TODOs added to signify required future attention.

Also includes a number of replication UI fix ups including table formatting, sort capability, appropriate ordering, and consistency with the rest of the UI.

A core mechanic of the replication system was also abstracted such that it could benefit from the existing share / snapshot management system. This alteration made more explicit, in code, a quirk involving the initial subvol transferred and how it requires special treatment given its unique nature in the context of the existing share / snapshot / clone / import structures.

Summary:

- Update internal replication api urls to fit new share by id scheme.
- Remove problematic non critical internal check (TODO added for later).
- Create new repclone command to improve replication share / snap handling, ie more transactional by using existing structures via api.
- Minor improvements were added in the new repclone system, ie sys refresh qgroup transfer, auto mounting etc.
- Ensure we import initial btrfs receive quirk subvol with replica flag and add logging to indicate its quirk nature.
- Improve replication tables formatting, text, sort ability / order.
- Reduce and mirror (between send and receive) rep-snap-count.
- Fix bugs in receive status update, includes utilising the pending state.
- Improve user communication re data transferred / rate table cell text.
- Remove broken pool link in replication overview table.
- Improve debug logging re share / snap import.
- Minor steps / TODOs towards future share name/subvol_name separation.
- TODOs towards centralising mount path creation.
- Various TODOs added for future consideration.

N.B. this pr assumes the prior application of the following (currently pending) pull request: "remove immutable flag prior to share delete. Fixes #1882" pr #1883, as that fix also addresses an observed breakage during replication runs and was used (pre-applied) during all testing of this pr's code.

Fixes #1853 (if #1883 is pre-applied)

@schakrava Ready for review.

Testing included several hundred individual replication events (btrfs send receive hosts) with runs up to 150 ish. A real hardware arrangement was also tested where the source machine's (sender) share had more data than could be replicated in the chosen scheduled replication interval (10 mins). Additional send receive pairs were not activated and all data was successfully transferred.

Note that with the in pr rep-snap-count change to 2, 3 snapshots are stored both on the sender and against the receiver's share counterpart. And that for a replication cycle to reach its final state, which will in turn thereafter be maintained by rotation, 5 replication events must be completed: the initial full-send followed by 4 incremental sends.

There are still known limitations and bugs in the replication code post pr but these can be addressed more specifically in pending issues to be opened in due course, and depend on the review outcome of this pr.

There is more robustness work to be done in this area but the indicated

is now fixed; along with some code and UI improvements that should help with this feature going forward.
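To illustrate the kind of change involved, here is a minimal sketch contrasting a share-by-name snapshot URL (the form rejected in the logs earlier in this thread) with a share-by-id form. The by-id layout, the numeric id type, and both function names are assumptions for illustration only; they are not taken from rockstor-core.

```python
# Hypothetical sketch: URL layouts are illustrative, not copied from rockstor-core.
BASE_URL = "http://127.0.0.1:8000/api"  # host and port as seen in the logs


def snapshot_url_by_name(share_name, snap_name):
    """Pre-fix style: the share is addressed by name (rejected in the logs)."""
    return "%s/shares/%s/snapshots/%s" % (BASE_URL, share_name, snap_name)


def snapshot_url_by_id(share_id, snap_name):
    """Post-fix style (assumed layout): the share is addressed by a numeric id."""
    return "%s/shares/%d/snapshots/%s" % (BASE_URL, share_id, snap_name)


print(snapshot_url_by_name("aim", "aim_2_replication_1"))
print(snapshot_url_by_id(7, "aim_2_replication_1"))  # 7 is a made-up share id
```

The first call reproduces the URL from the original error report; the second shows what a by-id equivalent might look like under the stated assumptions.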

Hope that helps and thanks for your reports and patience on this one.

niu_lin · March 6, 2018, 2:51pm

I encountered this issue today and found this discussion. Then I subscribed to stable channel for both my 2 rockstor instances for 3 years expecting an update. But after I activated the stable channel, it says now I am on 3.9.2-17 and is the latest version… My replication is still in trouble with this bug…

phillxnet · March 6, 2018, 3:26pm

@niu_lin Hello again, and thanks for helping to support Rockstor development.

Pretty sure you are running into an issue we have with displaying the available version as the installed version when switching from testing to stable. It’s a pain as the fix is available in an update that is not actually applied, bit of a chicken and egg that will be resolved when the next iso is released. The relevant issue is:
https://github.com/rockstor/rockstor-core/issues/1870
and its fix (from 10th December):

github.com/rockstor/rockstor-core

version and date incorrectly reported re update info. Fixes #1870

rockstor:master ← phillxnet:1870_version_and_date_incorrectly_reported_re_update_info

opened 06:24PM - 10 Dec 17 UTC

phillxnet

+7 -6

A recent move to yum from rpm inadvertently caused a misreporting of installed version for that of available version. This was due to the differing outputs and the inclusion of both installed and available in the newer yum format. The command switch set was revised to output only the installed version and to include the previously missing and required build date/time info. Consequent parsing adjustments were made to accommodate the now changed date format. Time info was excluded as it is not used.

@schakrava Ready for review.

Fixes #1870

Please see issue text for stable package test / validation procedure.

As there are currently existing anomalies in package change log dates and contents re recently added minor version numbers this must be resolved in future package releases for the intended function of the "List of changes in this update" to correctly populate. See issue text for details. Currently it will be blank for the more recent stable package releases but should work as intended when the date anomalies are resolved in future package releases (ie when package changelogs indicate the time of release).

Also tested by applying to a fresh 3.9.1-0 install pre and post testing channel subscription and thereafter updating via command line to 3.9.1-15. In all instances installed and available were displayed correctly. An update from 3.9.1-15 to 3.9.1-16 was then successfully completed via the command line.

Essentially you need to manually update initially. To confirm you have this issue you will see from the following command:

yum info rockstor

That your Web-UI indicated ‘installed’ version is actually the available version, not the installed version. Bit of a nasty one really. Hence the almost instantaneous update directly after subscribing to stable. Anyway if you:

yum update

on each machine as the root user you should be up and running with all the last few months of goodness. Thereafter the Web-UI should function normally re displaying the installed version and prompting when a newer version is available. You will probably have to remove your current replication config and its associated snapshots but see how it goes.

Hope that helps and let us know how you get on.

niu_lin · March 7, 2018, 1:12am

Thanks @phillxnet for your quick reply. As suggested, I made a manual yum update and it brought me to version 3.9.2.
However, on web ui, it still says I am on 3.9.2-17 and is the latest. As you mentioned the replication fix is in 3.9.12-13. What should I do to receive the up-to-date version?
Following is my yum info output:

[root@nas2 ~]# yum info rockstor
Loaded plugins: changelog, fastestmirror
Loading mirror speeds from cached hostfile
 * base: centos.ustc.edu.cn
 * epel: mirror01.idc.hinet.net
 * extras: mirrors.nju.edu.cn
 * updates: mirrors.nju.edu.cn
Installed Packages
Name        : rockstor
Arch        : x86_64
Version     : 3.9.2
Release     : 17
Size        : 77 M
Repo        : installed
From repo   : Rockstor-Stable
Summary     : RockStor -- Store Smartly
License     : GPL
Description : RockStor -- Store Smartly
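
For anyone wanting to check this kind of output mechanically, the following is a rough sketch that pulls the Version and Release fields out of `yum info rockstor` text and compares them with 3.9.2-13, the stable release named in this thread as carrying the replication fix. The function names are hypothetical, and the sketch assumes any "Available Packages" section follows the installed one.

```python
import re

FIXED_IN = (3, 9, 2, 13)  # stable release named in this thread as carrying the fix


def installed_version(yum_info_text):
    """Return (major, minor, patch, release) from `yum info rockstor` output."""
    # Assumption: only text before any "Available Packages" heading is the installed entry.
    installed_part = yum_info_text.split("Available Packages")[0]
    version = re.search(r"^Version\s*:\s*(\S+)", installed_part, re.MULTILINE)
    release = re.search(r"^Release\s*:\s*(\d+)", installed_part, re.MULTILINE)
    if version is None or release is None:
        raise ValueError("Version/Release fields not found")
    parts = [int(p) for p in version.group(1).split(".")]
    return tuple(parts) + (int(release.group(1)),)


def has_replication_fix(yum_info_text):
    """True when the installed package is at or past FIXED_IN."""
    return installed_version(yum_info_text) >= FIXED_IN


sample = """Installed Packages
Name        : rockstor
Version     : 3.9.2
Release     : 17
Repo        : installed
"""
print(installed_version(sample), has_replication_fix(sample))
```

On the sample above this reports version 3.9.2 release 17, which is later than 3.9.2-13.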

Haioken · March 7, 2018, 2:58am

@niu_lin

I believe he meant 3.9.2-13, as I don’t believe they’re up to 3.9.12
Hell, according to github, they’re not even up to 3.9.2-17, so…

niu_lin · March 7, 2018, 5:09am

Anyway, I am still getting the same error in replication. I even rebooted both instances without luck. Any suggestions?

phillxnet · March 7, 2018, 9:46am

@niu_lin as per:

Thanks @Haioken.

Yes, sorry about that. Your Web-UI indicator should now be working as expected and you should now have the updated replication code.

Yes:

So remove the remains of the previously failed replication:

On sender:

Delete replication related snapshots of the share you are replicating (they are named obviously)
Delete the replication send task.

On receiver:

Delete replication related snapshots of the below shares:
all replication related shares (obviously named as they have really long names which include the sender application id).
Delete the receiver task.

Given the prior code failed early and hadn’t worked for a while there is nothing really but a skeleton to remove.

This way the fixed code gets a clean start so when you configure a fresh send task it should be good.
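
As a small aid for the clean-up above, here is a sketch that picks replication-related snapshot names out of a list. It assumes the names follow the shape seen in the logs in this thread (`aim_2_replication_1`, `test_1_replication_1`), that is `<share>_<number>_replication_<number>`; the function name is hypothetical and the pattern is inferred from those two examples only.

```python
import re


def replication_snapshots(share_name, snapshot_names):
    """Pick out names shaped like '<share>_<n>_replication_<n>' (assumed pattern)."""
    pattern = re.compile(r"^%s_\d+_replication_\d+$" % re.escape(share_name))
    return [name for name in snapshot_names if pattern.match(name)]


# Made-up snapshot list for illustration.
names = ["aim_2_replication_1", "aim_2_replication_2", "before_upgrade", "aim_manual"]
print(replication_snapshots("aim", names))
```

This only lists candidates; the deletion itself is still done by hand as described above.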

Let us know how it goes. And please be patient as it takes 5 scheduled replication task events to fully ‘settle in’ to its final state of 3 snapshots of the source share and a destination share that doesn’t begin with ‘.’ that also has 3 snapshots of its own.

Hope that helps. Also note that btrfs send / receive, the system we use for the replication, is not currently robust to interruptions.

niu_lin · March 7, 2018, 10:40am

Thank you very much @phillxnet, I followed your steps one by one, and finally got it working! My replication is going on right now, really appreciate your patience.

phillxnet · March 7, 2018, 11:19am

@niu_lin Glad it’s on its way. We do need to update our docs on this and add a technical wiki to explain what’s going on in the background. Oh well, bit by bit; I’ll hopefully have a go at this myself soon, while it’s still fresh.

Keep an eye on it and once the initial transfer is done it should be a little less touchy as it then only has to send the changes, which are far less time consuming. We do have improvements planned in this area and my tests, prior to releasing that last regression bug fix, indicated a lower than expected transfer rate so do keep that in mind. I expect everything to improve with time.

Do report any errors and issues you have with this feature: as with all non trivial projects it’s a perpetual ‘work in progress’ and there are already improvements earmarked.

Thanks for the update.

niu_lin · March 7, 2018, 1:42pm

Now I have another issue. After several replications, it promoted the oldest snapshot to share. However, the receiver responded with the error “No response received from the broker. remaining tries: 0. Terminating the receiver.” and the replication is marked as failed on both the sender and receiver side. On the receiver side, though, everything actually seems fine: I can mount the replicated share and read the data normally. Any suggestions?

niu_lin · March 7, 2018, 1:47pm

And here’s the logs:

[07/Mar/2018 05:40:08] ERROR [smart_manager.replication.receiver:354] Id: ******-3B45-4F96-B68D-******-8. No response received from the broker. remaining tries: 9
[07/Mar/2018 05:40:14] ERROR [smart_manager.replication.receiver:354] Id: ******-3B45-4F96-B68D-******-8. No response received from the broker. remaining tries: 8
[07/Mar/2018 05:40:20] ERROR [smart_manager.replication.receiver:354] Id: ******-3B45-4F96-B68D-******-8. No response received from the broker. remaining tries: 7
[07/Mar/2018 05:40:26] ERROR [smart_manager.replication.receiver:354] Id: ******-3B45-4F96-B68D-******-8. No response received from the broker. remaining tries: 6
[07/Mar/2018 05:40:32] ERROR [smart_manager.replication.receiver:354] Id: ******-3B45-4F96-B68D-******-8. No response received from the broker. remaining tries: 5
[07/Mar/2018 05:40:38] ERROR [smart_manager.replication.receiver:354] Id: ******-3B45-4F96-B68D-******-8. No response received from the broker. remaining tries: 4
[07/Mar/2018 05:40:44] ERROR [smart_manager.replication.receiver:354] Id: ******-3B45-4F96-B68D-******-8. No response received from the broker. remaining tries: 3
[07/Mar/2018 05:40:50] ERROR [smart_manager.replication.receiver:354] Id: ******-3B45-4F96-B68D-******-8. No response received from the broker. remaining tries: 2
[07/Mar/2018 05:40:56] ERROR [smart_manager.replication.receiver:354] Id: ******-3B45-4F96-B68D-******-8. No response received from the broker. remaining tries: 1
[07/Mar/2018 05:41:02] ERROR [smart_manager.replication.receiver:354] Id: ******-3B45-4F96-B68D-******-8. No response received from the broker. remaining tries: 0
[07/Mar/2018 05:41:02] ERROR [smart_manager.replication.receiver:91] No response received from the broker. remaining tries: 0. Terminating the receiver.. Exception: No response received from the broker. remaining tries: 0. Terminating the receiver.
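
The countdown in these logs (ten messages, roughly 6 seconds apart, then termination) is the typical shape of a bounded poll-and-retry loop. The following is a generic sketch of that pattern, not the Rockstor receiver code; `wait_for_reply` and `poll` are hypothetical names, and `poll` is assumed to block for one timeout period and return `None` when nothing arrives.

```python
def wait_for_reply(poll, tries=10, log=print):
    """Call poll() until it returns a reply or the tries run out.

    poll is a hypothetical callable that waits for one timeout period
    (about 6 seconds in the logs above) and returns None on timeout.
    """
    remaining = tries
    while remaining > 0:
        reply = poll()
        if reply is not None:
            return reply
        remaining -= 1
        log("No response received from the broker. remaining tries: %d" % remaining)
    raise RuntimeError("No response received from the broker. Terminating the receiver.")


# Usage with a fake poll that times out three times and then answers.
replies = iter([None, None, None, "ack"])
print(wait_for_reply(lambda: next(replies)))
```

With the default of 10 tries and a poll that never answers, the loop logs the counts 9 down to 0 and then raises, which matches the sequence shown in the logs.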

niu_lin · March 7, 2018, 2:15pm

I restarted the receiver nas, the errors disappeared.
I now encountered another issue, I put a new file on the sender’s share, after a while the replication task starts and everything goes well. However, the new file was not visible on the receiver’s share… No error reported, I’m still studying it.

niu_lin · March 7, 2018, 2:16pm

Sorry, it’s my mistake, I should do a cd . in the directory so it can refresh… things are going good now. Thank you. Finally I think I have finished all the basic functionality test.

phillxnet · March 7, 2018, 2:29pm

@niu_lin

Afraid not. Do both machines have a stable IP, or better still a static one? It could be that one has changed its IP mid-session. You will probably find that one or other task has disabled itself, as they do that after a set number of failures to avoid log/email spamming (if Email Notifications is configured). It can definitely work for an extended period although I only went up to around the 200 scheduled rep task mark before creating the pull request. Also note that it may complain about a snapshot already existing (on the sender) after a failure of this type (auto off from repeated fails). This can be worked around by simply deleting the indicated snapshot: it’s the next thing on the list to sort in this area (see TODO comments):

github.com

rockstor/rockstor-core/blob/master/src/rockstor/smart_manager/replication/sender.py#L225-L231


      
          )
          if is_subvol(snap_path):
              return self.rt
          raise Exception(
              "Parent Snapshot(%s) to use in btrfs-send does not "
              "exist in the system." % snap_path
          )

Also make sure to look at the error indicated on the first failed replication task (click on the task to get its history) at both the sender and the receiver; later indicators are often less informative. And depending on the issue one may be more informative than the other, though not necessarily at the end one would expect as they pass messages back and forth.

Take care with this as that share will be supplanted by the oldest snapshot at every replication task, so you would probably want to share it, if at all, as read only. I’m unsure but it could be that sharing it and having a file open at the time of an oldest snap to share promotion could block that promotion process and consequently break the replication. You could test for that. Try leaving the receiver share unshared to rule that out. You could try snapping the share once the rep process has settled down and between rep tasks and then clone that snap (if it’s rw) and you have a second, unmanaged copy. Think of the target / receiver share as a managed off-line copy not as a live duplicate, it’s not a clustering file system solution but a btrfs send / receive maintained scheduled copy type thing.

So try not exporting the receiver share initially and doing nothing with it until you have stability and take it from there. I did do multiple >150 cycles of varying data sizes to test its stability though. Although that was without network changes and with no exporting on the receiver side.

See how you get on.

OK it seems we have an overlap on the thread posts, I’ll post what I have above anyway in case it helps others catch known gotchas.

It will take 4 rep task events I think to appear on the receiver share as it gets promoted through the 3 snapshots.

I hope it is working for you and yes this ‘feature’ could definitely do with updated docs / explanation. I’ll try and get to that as it’s ripe to be misunderstood / confusing otherwise.

I was quite excited to get this up and running again as it did work for a goodly while before the recent regression and it’s quite a nice bit of code all in (by @suman) but would obviously benefit from more attention. Take careful notes if you fancy reporting anything and hopefully we can weed out the weak points one at a time.

Thanks for persevering.