Replication fails: Failed to create snapshot

MarcelW · May 27, 2019, 7:00pm

Hello there,

I have two Rockstor appliances, both on stable updates and both at version:
Installed Packages
Name : rockstor
Arch : x86_64
Version : 3.9.2
Release : 48
Size : 85 M
Repo : installed
From repo : Rockstor-Stable
Summary : RockStor – Store Smartly
License : GPL
Description : RockStor – Store Smartly

When I create the replication task on the sender, it always fails with the message:
Failed to create snapshot: kd1015-1_5_replication_1. Aborting… Exception: 500 Server Error: INTERNAL SERVER ERROR

Can you help me please?

phillxnet · May 27, 2019, 7:20pm

@MarcelW Welcome to the Rockstor community.

If there is a network interruption when a send task is initiated, it can end up getting stuck.
The ‘hold up’/bug is often that the indicated snapshot already exists from the prior failed send task. The intervention required is to delete the indicated replication-specific snapshot so that it no longer ‘holds up’ the next send task.
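
To check whether such a leftover snapshot exists before deleting anything, something like the following Python sketch could help. It pulls the snapshot name out of the error message and looks for it in the output of `btrfs subvolume list -s`. The helper names, the sample share name, and the pool path `/mnt2/mypool` are assumptions for illustration, and the error-message pattern is taken only from the single message quoted in this thread. The sketch only reads; it does not delete anything.

```python
import re
import subprocess
from typing import List, Optional


def snapshot_name_from_error(message: str) -> Optional[str]:
    """Extract the snapshot name from a 'Failed to create snapshot' message."""
    match = re.search(r"Failed to create snapshot: (\S+?)\. Aborting", message)
    return match.group(1) if match else None


def find_snapshot_paths(listing: str, snap_name: str) -> List[str]:
    """Return subvolume paths from 'btrfs subvolume list' output whose
    last path component equals snap_name."""
    paths = []
    for line in listing.splitlines():
        # Each listing line ends with ' path <relative/path>'.
        _, sep, path = line.partition(" path ")
        path = path.strip()
        if sep and path.split("/")[-1] == snap_name:
            paths.append(path)
    return paths


def list_snapshots(pool_mount: str) -> str:
    """Run 'btrfs subvolume list -s' (snapshots only); needs root."""
    result = subprocess.run(
        ["btrfs", "subvolume", "list", "-s", pool_mount],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


# Example use on the sender (pool path is an assumption, adjust to your pool):
#   name = snapshot_name_from_error(error_text)
#   print(find_snapshot_paths(list_snapshots("/mnt2/mypool"), name))
```

If a matching path turns up, that is the snapshot to remove, for example via the share's snapshot list in the Web-UI.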

Have you had this replication setup working previously, and then it stopped working? If so you may just be subject to this bug.

Let us know how you get on and whether the above suggestion ‘unblocks’ your replication setup. Also note that it’s best to look at the first failed send task for the actual error, as subsequent failures are a little less informative.

Hope that helps and thanks for helping to support Rockstor’s development.

MarcelW · May 27, 2019, 7:22pm

@phillxnet thank you for your answer.

There is no network interruption; both appliances are up and running perfectly.

The replication has never worked.

What can I do now?

phillxnet · May 27, 2019, 7:29pm

@MarcelW We will need to see the exact setup of both of your machines: if it has never worked, then it is very likely that you have a configuration issue.
I’d first make sure that you have the replication service enabled on both machines, as that can be easily forgotten.

Best if you send details of how you set up this replication, along with images of the relevant configurations within the Web-UI. Our docs on replication are a little outdated, unfortunately.

This way others can chip in with possible misconfigurations. Also include some details of how the machines are connected, i.e. presumably over a local LAN or the like.

Hopefully that should help others on the forum who have successfully set up replication to chip in with ideas of where things may have gone wrong.

MarcelW · May 27, 2019, 7:37pm

OK.
On both systems the replication service is turned on, and the IP and the port are the default ones.
I have checked that the defined ports are open (via https://www.yougetsignal.com/tools/open-ports/).
Both ports (sender/receiver) are open.
Both machines are directly connected to the WAN (public IP address).
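
An external port checker only shows that the ports are reachable from the internet. A test run on the sender itself would show whether the sender can actually open a TCP connection to the receiver. A minimal Python sketch of such a check follows; the function name is made up for illustration, and the host and port in the usage comment are placeholders to be replaced with the receiver's address and the configured replication port.

```python
import socket


def tcp_port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


# Example use from the sender (placeholder values, not real settings):
#   print(tcp_port_open("receiver.example.com", 443))
```

A `True` result only proves that something accepts connections on that port, not that the replication service itself is answering.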

First, I added the appliances to both systems, with IP address and mgmt port 443 (default).
Then, on the sender machine, I added a replication task to the receiver machine and specified.

The connection to the receiver node works. When the sender node tries to replicate to the receiver, a receiver task is created and a new share on the receiver machine is created.

But the task always fails.

MarcelW · May 27, 2019, 8:08pm

I rebooted the sender machine, and now nothing is reachable.
I cannot log in via SSH, but I see some messages with “audit: backlog limit exceeded”.