phillxnet (Philip Guyton), February 17, 2018, 12:50pm
@dvanremortel Thanks for the report, but this has already been fixed as of stable channel release 3.9.2-13 (the current stable channel release is 3.9.2-15):
See GitHub tags:
via the linked issue:
Issue #1853, opened 05:31PM - 09 Nov 17 UTC, closed 01:33PM - 29 Jan 18 UTC, labelled: bug
Thanks to forum member clawes in the following thread for highlighting this issue. It would appear that we have a share api change related regression in the replication code, as indicated by the submitted error message:
```
[08/Nov/2017 14:00:36] ERROR [smart_manager.replication.sender:73] Id: 423B4A9B-3B81-99CA-55E5-1A7620713001-2. Failed to create snapshot: aim_2_replication_1. Aborting... Exception: [u'Invalid api end point: http://127.0.0.1:8000/api/shares/aim/snapshots/aim_2_replication_1']
```
Linking to the potentially related api change pr for context:
#1808 "change share and snapshot endpoints to by id and not by name."
Please update the following forum thread with this issue’s resolution:
https://forum.rockstor.com/t/replication-is-failing-on-v3-9-1-16/4005/1
and its associated pull request:
rockstor:master ← phillxnet:1853_suspected_replication_regression_re_share_api_change (opened 07:24PM - 24 Jan 18 UTC)
Prior function was restored by updating the relevant api urls in the replication system. But in one non-critical check this was not managed. As a result the non-critical but desirable internal sanity check was, for the time being, commented out and TODOs added to signify required future attention. Also includes a number of replication UI fix-ups including table formatting, sort capability, appropriate ordering, and consistency with the rest of the UI.
A core mechanic of the replication system was also abstracted so that it could benefit from the existing share / snapshot management system. This alteration made more explicit, in code, a quirk involving the initial subvol transferred and how it requires special treatment given its unique nature in the context of the existing share / snapshot / clone / import structures.
Summary:
- Update internal replication api urls to fit new share by id scheme.
- Remove problematic non-critical internal check (TODO added for later).
- Create new repclone command to improve replication share / snap handling, i.e. more transactional by using existing structures via api (see the sketch after this list).
- Minor improvements were added in the new repclone system, i.e. sys refresh, qgroup transfer, auto mounting etc.
- Ensure we import the initial btrfs receive quirk subvol with the replica flag and add logging to indicate its quirk nature.
- Improve replication tables’ formatting, text, and sort ability / order.
- Reduce and mirror (between send and receive) rep-snap-count.
- Fix bugs in receive status update, including utilising the pending state.
- Improve user communication re data transferred / rate table cell text.
- Remove broken pool link in replication overview table.
- Improve debug logging re share / snap import.
- Minor steps / TODOs towards future share name/subvol_name separation.
- TODOs towards centralising mount path creation.
- Various TODO added for future consideration.
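To make the repclone bullet above a little more concrete, here is a loose sketch of the “more transactional via existing structures” idea; the endpoint path, function name, and use of the requests library are all assumptions for illustration, not the actual Rockstor implementation:
```python
# Purely illustrative sketch: the endpoint path and names below are
# invented for this example, not Rockstor's real repclone code.
import requests

BASE_URL = "http://127.0.0.1:8000/api"


def repclone(share_id, snap_name):
    """Promote a received replication snapshot to become the live share
    content via a single api command, so that db state, mounting and
    qgroup transfer are handled by the existing share / snapshot
    machinery rather than by ad hoc btrfs calls in the receiver."""
    url = "%s/shares/%d/snapshots/%s/repclone" % (BASE_URL, share_id, snap_name)
    response = requests.post(url)
    # Fail loudly rather than leave share / snapshot state half-changed.
    response.raise_for_status()
    return response.json()
```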
N.B. this pr assumes the prior application of the following (currently pending) pull request:
"remove immutable flag prior to share delete. Fixes #1882" pr #1883
As that fix also addresses an observed breakage during replication runs and was used (pre-applied) during all testing of this pr's code.
Fixes #1853 (if #1883 in pre-applied)
@schakrava Ready for review.
Testing included several hundred individual replication events (btrfs send/receive hosts) with runs of up to 150 or so. A real hardware arrangement was also tested where the source machine’s (sender) share had more data than could be replicated in the chosen scheduled replication interval (10 mins). No additional send/receive pairs were activated and all data was successfully transferred.
Note that with the in-pr rep-snap-count change to 2, 3 snapshots are stored both on the sender and against the receiver’s share counterpart. And for a replication cycle to reach its final state, which will thereafter be maintained by rotation, 5 replication events must be completed: the initial full-send followed by 4 incremental sends.
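As a rough illustration of that snapshot arithmetic, here is a toy counting model only; the real code treats the initial full-send subvol specially (part of why 5 events are needed), and none of that is modelled here:
```python
# Toy model of replication snapshot rotation with rep-snap-count = 2.
# Counting illustration only; not the actual replication code.
REP_SNAP_COUNT = 2

snaps = []
for event in range(1, 6):  # 5 replication events in total
    kind = "full-send" if event == 1 else "incremental"
    snaps.append("replication_%d" % event)
    # Rotation keeps the newest snapshot plus REP_SNAP_COUNT older ones,
    # i.e. 3 in total, mirrored on sender and receiver.
    while len(snaps) > REP_SNAP_COUNT + 1:
        snaps.pop(0)
    print("event %d (%s): retained %s" % (event, kind, snaps))
```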
There are still known limitations and bugs in the replication code post-pr, but these can be addressed more specifically in pending issues to be opened in due course, depending on the review outcome of this pr.
The indicated api URL has the share name in it, whereas we now use share ids. It was quite a bumpy path changing the whole project over (pools were also changed similarly) but replication was, as far as I’m aware, the last regression candidate; although we still have some code tidying and ‘nice to have’ stuff that needs to be re-asserted.
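For illustration, a minimal sketch of the by-name versus by-id URL difference; the helper functions and the id value are assumptions for this example, not Rockstor’s actual code (only the failing URL shape comes from the error quoted above):
```python
# Illustrative only: these helpers and the id value are made up; just the
# name-based URL shape is taken from the error message quoted earlier.
base_url = "http://127.0.0.1:8000/api"


# Old scheme: shares (and snapshots) addressed by name, as in the
# now-rejected URL from the replication sender.
def snapshot_url_by_name(share_name, snap_name):
    return "%s/shares/%s/snapshots/%s" % (base_url, share_name, snap_name)


# New scheme (per pr #1808): shares addressed by their database id.
def snapshot_url_by_id(share_id, snap_name):
    return "%s/shares/%d/snapshots/%s" % (base_url, share_id, snap_name)


print(snapshot_url_by_name("aim", "aim_2_replication_1"))  # old, now invalid
print(snapshot_url_by_id(7, "aim_2_replication_1"))  # new form; id 7 is hypothetical
```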
See also in-progress changes re testing channel releases discussed towards the end of the following thread:
Hope that helps.