Replication Fails with Exception

I’m trying to replicate over a Tailscale network. Both machines can communicate with each other, and when both are connected to the local network, replication works fine. It’s only when I bring Tailscale into the mix that replication fails. The only error I’m getting is:

[21/Apr/2025 22:18:04] ERROR [smart_manager.replication.sender:79] Id: a8c0ea01-d2ecf432-c76c-4643-8413-5a816b4e53b4-9. b"Failed while send-recv-ing command(b'sender-ready')". Exception:

Annoyingly the exception gets cut off.

@ironlenny welcome to the Rockstor community. Is the error message you’ve shown above from the Rockstor log (/opt/rockstor/var/log/rockstor.log)?

Can you provide a bit more detail on the system(s) you’re using, e.g. Rockstor version(s) and kernel version? Do you use Rockstor on both the sender and the receiver? The error message makes it pretty clear that at least the sending system is Rockstor-based, but I’d like to make sure the receiving system is as well.

Can you determine whether the listening port is reachable between the sender and receiver? Does the service/send/receive setup complete without errors?
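
For example, something along these lines (a rough sketch; substitute your own host names/IPs and the listener port configured in your Replication service, 10002 if you are on the default):

	# On the sender: check whether the receiver's listener port is reachable,
	# once over the LAN address and once over the tailnet address
	nc -zv <receiver-lan-hostname-or-ip> 10002
	nc -zv <receiver-tailnet-hostname-or-ip> 10002

	# On the receiver: check which address the replication listener is actually bound to
	ss -ltnp | grep 10002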

I am running Rockstor 5.0.15-0 on both the sender and receiver.

The log error is from /opt/rockstor/var/log/rockstor.log.

Running nc -zv on the sender and connecting to port 10002 on the receiver returns nc: connect to lexington port 10002 (tcp) failed: Connection refused. Neither machine is restricted to listen on a particular interface, so I don’t know why the connection would be refused over the tailnet when it’s not over the local network.

Yes, the wonders of Tailscale. I know this is a generic troubleshooting guide, but at this point it might be an issue within the tailnet preventing the connection:

Unless you took manual steps to activate a firewall on the Rockstor boxes, there should not be a restriction on that front.

Is each of the Rockstor instances on a physically separate network connected to the tailnet?


No, they are connected to the same local net. No firewall is active on either. I want to prove that replication over Tailscale works before I separate them.


Trying to connect with tailscale nc lexington 10002 returns Dial("lexington", 10002): unexpected HTTP response: 502 Bad Gateway, dial failure: dial tcp 100.122.101.104:10002: connect: connection refused

From lexington’s rockstor.log:

[23/Apr/2025 12:44:43] ERROR [storageadmin.views.network:212] NetworkConnection matching query does not exist.
Traceback (most recent call last):
  File "/opt/rockstor/src/rockstor/storageadmin/views/network.py", line 205, in update_connection
    dconfig["connection"] = NetworkConnection.objects.get(
                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/rockstor/.venv/lib/python3.11/site-packages/django/db/models/manager.py", line 87, in manager_method
    return getattr(self.get_queryset(), name)(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/rockstor/.venv/lib/python3.11/site-packages/django/db/models/query.py", line 637, in get
    raise self.model.DoesNotExist(
storageadmin.models.network_interface.NetworkConnection.DoesNotExist: NetworkConnection matching query does not exist.

But if you use tailscale ping <hostname-or-ip> it works, correct?

Yes. tailscale ping returns a pong.

Could this be a bug? Replication is the only thing failing over Tailscale. I can use the web interface. I can manually send snapshots. I can ping. I can SSH.

It certainly could be. If you don’t run the systems on tailscale, does replication start working using the WebUI?

One more thing, but I suspect you implicitly already answered it above:

In the tailscale admin console under Access control, all ports are set to accessible as well, correct?

From line 18 of the standard ACL file:

	"acls": [
		// Allow all connections.
		// Comment this section out if you want to define specific restrictions.
		{"action": "accept", "src": ["*"], "dst": ["*:*"]},

		// Allow users in "group:example" to access "tag:example", but only from
		// devices that are running macOS and have enabled Tailscale client auto-updating.
		// {"action": "accept", "src": ["group:example"], "dst": ["tag:example:*"], "srcPosture":["posture:autoUpdateMac"]},
	],


Yes, I verified that replication works across the local net. I did that by shutting down the Tailscale daemon on both ends.

Edit:
On the sender:

	"acls": [
		// Allow all connections.
		// Comment this section out if you want to define specific restrictions.
		{"action": "accept", "src": ["*"], "dst": ["*:*"]},
	],

On the receiver:

	"acls": [
		// Allow all connections.
		// Comment this section out if you want to define specific restrictions.
		{"action": "accept", "src": ["*"], "dst": ["*:*"]},

Hi @ironlenny ,
Thanks a lot for the report and for the time and effort you’ve put into trying to figure this one out.

The connection between the sender and receiver during replication uses a pyzmq socket, and I now wonder whether it has trouble establishing or maintaining that connection under Tailscale. I believe we do have some detailed logging, introduced when @phillxnet recently worked in that area, but it is only written under DEBUG mode. Would you be OK with enabling debug mode and checking the logs then? It would imply the following (a consolidated sketch of these commands follows the list):

  1. Turn off replication
  2. cd /opt/rockstor
  3. poetry run debug-mode ON
  4. Continuously tail the logs in a terminal window (tail -f /opt/rockstor/var/log/*), then turn the replication back ON and watch what happens when a replication is attempted.
  5. To turn OFF debugging (when desired): cd /opt/rockstor && poetry run debug-mode OFF
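
Put together, the sketch looks roughly like this (the same commands as in the list above):

	cd /opt/rockstor
	poetry run debug-mode ON           # enable DEBUG-level logging
	tail -f /opt/rockstor/var/log/*    # keep this running while you turn replication back ON
	# ... let a replication attempt happen and watch what gets logged ...
	cd /opt/rockstor && poetry run debug-mode OFF   # turn debugging back off when done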

Hopefully this may provide some helpful clues.


I tried to recreate the scenario using two of my VMs on my home network:

  • I connected both machines to the tailscale network

  • Set up the remote appliances respectively on both machines

  • Configured the Replication service to use the tailscale interface on each machine

  • Turned on replication service on both sides

  • Defined a replication job:

I first defined one job without the Remote Listener Address populated and let it run through the first replication (which worked), and then one with the Remote Listener Address populated. In both cases, the first replication started and sent the first snapshot to the receiving appliance.

Looking at my rockstor log, I think the error message you’re showing above is not related to the replication. I see the same one, but its timestamp is from well before the first replication (shortly after bootup, once the sending system had been sitting idle for a few minutes):

[26/Apr/2025 09:04:36] ERROR [smart_manager.data_collector:1017] Failed to update disk state.. exception: Exception while setting access_token for url(http://127.0.0.1:8000): HTTPConnectionPool(host='127.0.0.1', port=8000): Read timed out. (read timeout=2). content: None
[26/Apr/2025 09:04:36] ERROR [storageadmin.views.network:212] NetworkConnection matching query does not exist.
Traceback (most recent call last):
  File "/opt/rockstor/src/rockstor/storageadmin/views/network.py", line 205, in update_connection
    dconfig["connection"] = NetworkConnection.objects.get(
                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/rockstor/.venv/lib/python3.11/site-packages/django/db/models/manager.py", line 87, in manager_method
    return getattr(self.get_queryset(), name)(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/rockstor/.venv/lib/python3.11/site-packages/django/db/models/query.py", line 637, in get
    raise self.model.DoesNotExist(
storageadmin.models.network_interface.NetworkConnection.DoesNotExist: NetworkConnection matching query does not exist.
[26/Apr/2025 09:04:38] ERROR [smart_manager.data_collector:1017] Failed to update pool state.. exception: Exception while setting access_token for url(http://127.0.0.1:8000): HTTPConnectionPool(host='127.0.0.1', port=8000): Read timed out. (read timeout=2). content: None
[26/Apr/2025 09:04:40] ERROR [smart_manager.data_collector:1017] Failed to update share state.. exception: Exception while setting access_token for url(http://127.0.0.1:8000): HTTPConnectionPool(host='127.0.0.1', port=8000): Read timed out. (read timeout=2). content: None
[26/Apr/2025 09:20:05] INFO [smart_manager.replication.sender:341] Id: <removed appliance ID>-1. Sending full replica: /mnt2/rocksalami/.snapshots/config1/config1_1_replication_1
[26/Apr/2025 09:30:04] INFO [smart_manager.replication.sender:341] Id: <removed appliance ID>-2. Sending full replica: /mnt2/rocksenf/.snapshots/2FAuth-storage/2FAuth-storage_2_replication_1
[26/Apr/2025 09:35:04] INFO [smart_manager.replication.sender:335] Id: <removed appliance ID>-2. Sending incremental replica between /mnt2/rocksenf/.snapshots/2FAuth-storage/2FAuth-storage_2_replication_1 -- /mnt2/rocksenf/.snapshots/2FAuth-storage/2FAuth-storage_2_replication_3
[26/Apr/2025 09:40:04] INFO [smart_manager.replication.sender:335] Id: <removed appliance ID>-2. Sending incremental replica between /mnt2/rocksenf/.snapshots/2FAuth-storage/2FAuth-storage_2_replication_3 -- /mnt2/rocksenf/.snapshots/2FAuth-storage/2FAuth-storage_2_replication_4
[26/Apr/2025 09:45:07] INFO [smart_manager.replication.sender:335] Id: <removed appliance ID>-2. Sending incremental replica between /mnt2/rocksenf/.snapshots/2FAuth-storage/2FAuth-storage_2_replication_4 -- /mnt2/rocksenf/.snapshots/2FAuth-storage/2FAuth-storage_2_replication_5
[26/Apr/2025 09:50:06] INFO [smart_manager.replication.sender:335] Id: <removed appliance ID>-2. Sending incremental replica between /mnt2/rocksenf/.snapshots/2FAuth-storage/2FAuth-storage_2_replication_5 -- /mnt2/rocksenf/.snapshots/2FAuth-storage/2FAuth-storage_2_replication_6

I also tried

tailscale nc rockwurst.tailcXXXX.ts.net

as well as

tailscale nc rockwurst

which in both cases did not return any error message.

In my scenario I let it run for a few incremental sends after the initial one. So, it seems to work in this setup.

That obviously doesn’t mean your issue isn’t real; it is likely just related to something else (and could also be a bug that only shows up under certain conditions).

Let’s see whether @Flox’s suggestion of turning on debug mode provides some more clarity.


Neither machine has poetry installed. Is it supposed to be?

If it weren’t, Rockstor at this version wouldn’t work at all. zypper will not show it as installed, but it’s part of the virtual environment setup during the installation of Rockstor …

Poetry is set up as part of the build (see build.sh), and based on that setup, the poetry commands are tied to the project root (hence the cd /opt/rockstor/ before executing poetry-related commands).


So do I need to enter the virtual environment before I can run the command?

Hi @ironlenny,

Following the steps I listed should work without you needing to do anything extra. Have you gotten errors while doing so? If yes, please list exactly what they are so we can see what is happening here.

When I cd /opt/rockstor, poetry is not a binary I can run. On either machine.

Edit: I assume you are using BASH. When I switch to BASH from FISH it works.
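
For anyone else hitting this from fish: routing the commands through a bash login shell should also do the trick (an unverified sketch; I haven’t dug into exactly why poetry resolves under bash but not under fish, presumably shell-specific PATH/profile setup):

	# Run the debug-mode command via a bash login shell from a fish session
	bash -lc 'cd /opt/rockstor && poetry run debug-mode ON'

	# Check where bash actually finds poetry (for comparison with fish)
	bash -lc 'type -a poetry'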

Edit the second: The cron job is firing, but I’m not seeing any logs show up when it does.

Edit the third: It’s working! I’m embarrassed to say that I had the receiving server listening on the wrong interface.

Thank you for your help. You guys were so helpful, I’ve decided to subscribe.

Again thank you!

Edit the fourth: If I were to offer a critique of my replication experience, it would be the lack of clear and useful information.

The replication process is pretty opaque. To start, the emails from cron weren’t helpful—in fact, they were misleading. All I was receiving was:

rcommand=b'SUCCESS', reply=b'A new Sender started successfully for Replication Task(10).'
b'A new Sender started successfully for Replication Task(10).'

If I had relied solely on those emails, I wouldn’t have known that the replication jobs were actually failing.

The replication interface showing the time elapsed since the last successful replication also isn’t very informative. As a sysadmin, I don’t care how long it’s been since a task ran—I care whether it succeeded. A simple “Success/Failed” status would be immediately more useful than elapsed time. If I want to know when the last successful backup was, I can always click the task and view the log.

And about those logs: the task logs shown in the UI could use improvement. Right now, they show successful runs—“happy-path” logging. They should also show failures with meaningful error messages like failed to connect to <receiver host> or snapshot already exists.

On top of that, the replication system itself doesn’t appear to generate system-level logs. While some underlying parts (like btrfs) might produce logs, the replication system doesn’t seem to emit its own. That makes it harder to trace problems or automate monitoring beyond what the UI shows.
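
For now, the closest thing to automatable monitoring seems to be grepping the application log directly, along these lines (a sketch based on the log format shown earlier in this thread):

	# Surface replication-related errors from the Rockstor application log
	grep -E 'ERROR \[smart_manager\.replication' /opt/rockstor/var/log/rockstor.log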

Finally, in my particular case, the root cause of the failure was a simple configuration issue: the wrong network interface was listening. If the replication UI displayed the interface in use, this would have been trivial to diagnose.

In conclusion:

  • Replace ambiguous “time since last success” with a clear success/failure indicator.
  • Improve task logs in the UI—show failures, not just successes, and include helpful error messages.
  • Add system-level logging to the replication system.
  • Show the listening network interface in the UI to simplify troubleshooting.

Edit the fifth: I jumped the gun on some of my criticism. I did trigger that exception again. And it did show up as “failed” in the replication UI.

