Replication over WAN slow. Error: "No response received from the broker"

Hi,
I’m trying to replicate over WAN to a distant destination. The average ping response is around 400 ms. The source looks fine, but every few seconds I get the following error in the destination’s rockstor.log:

[10/Jan/2016 12:35:01] ERROR [smart_manager.replication.receiver:315] Id: c8513240-674b808e-b510-4896-8998-5036f72689c0-6. No response received from the broker. remaining tries: 9

This appears irregularly, but every 2-3 mins on average. I notice receiver.py has:

if (socks.get(self.dealer) == zmq.POLLIN):
    ...
else:
    ...
    msg = ('No response received from the broker. remaining tries: %d' % num_tries)

Is the timeout on that poll too short (too few milliseconds) for this scenario? Replication seems to be progressing, because I also regularly get:

[10/Jan/2016 12:44:21] DEBUG [smart_manager.replication.listener_broker:208] Active Receiver: c8513240-674b808e-b510-4896-8998-5036f72689c0-6. Messages processed: 27000

and the “Messages processed” count steadily increases. However, I should be achieving close to the 3 Mbps upload bandwidth of the source NAS, and I’m currently only getting around 0.68 Mbps (“Rate: 85 KB/sec”), so I’m wondering whether these errors are slowing replication down.

I’ve allowed ports 443 and 10002 (my designated Remote Listener Port) between the source and destination. Are any others needed?

Many thanks.

It’s exciting to see replication being tested over WAN. Do note that data is not encrypted but hopefully we can get to that soon.

That error message is OK. As long as there are “remaining tries”, it just means that a datagram from the sender hasn’t made it to the receiver yet. That is totally expected over WAN.
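
For illustration, the retry pattern behind that message can be sketched roughly as follows. This is a minimal sketch, not the actual receiver.py code: `receive_loop`, `poll_once`, `handle` and `max_tries` are hypothetical names, and resetting the counter after every successful receive is an assumption based on the log repeatedly showing "remaining tries: 9". In the real receiver the wait would be a ZeroMQ poll on the dealer socket with a timeout.

```python
def receive_loop(poll_once, handle, max_tries=10):
    """Hypothetical sketch of a poll-with-retries loop.

    poll_once() stands in for a socket poll with a timeout: it returns a
    message, or None if nothing arrived before the timeout expired.
    handle(msg) returns False once the transfer is complete.
    """
    num_tries = max_tries
    while True:
        msg = poll_once()
        if msg is not None:
            # Assumed behaviour: a successful receive restores the full budget.
            num_tries = max_tries
            if not handle(msg):
                return "done"
        else:
            # Timed out: spend one try and log it.
            num_tries -= 1
            print("No response received from the broker. "
                  "remaining tries: %d" % num_tries)
            if num_tries == 0:
                return "terminated"
```

Under these assumptions, an occasional timeout only costs one try and is harmless; the receiver terminates only when every try in a row times out.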

I am pretty sure the slowness is not related to that mechanism, but I’ll be setting up a transatlantic replication task soon, so I might gain more insight into the matter.

What are your speeds like between those two boxes if you were to, say, scp a file over?

I just returned to check this after a couple of hours away and it had failed. The log on the receiver suggests it lost contact with the sender:

[10/Jan/2016 13:22:35] ERROR [smart_manager.replication.receiver:315] Id: c8513240-674b808e-b510-4896-8998-5036f72689c0-6. No response received from the broker. remaining tries: 9
[10/Jan/2016 13:22:41] ERROR [smart_manager.replication.receiver:315] Id: c8513240-674b808e-b510-4896-8998-5036f72689c0-6. No response received from the broker. remaining tries: 8
[10/Jan/2016 13:22:47] ERROR [smart_manager.replication.receiver:315] Id: c8513240-674b808e-b510-4896-8998-5036f72689c0-6. No response received from the broker. remaining tries: 7
[10/Jan/2016 13:22:53] ERROR [smart_manager.replication.receiver:315] Id: c8513240-674b808e-b510-4896-8998-5036f72689c0-6. No response received from the broker. remaining tries: 6
[10/Jan/2016 13:22:59] ERROR [smart_manager.replication.receiver:315] Id: c8513240-674b808e-b510-4896-8998-5036f72689c0-6. No response received from the broker. remaining tries: 5
[10/Jan/2016 13:23:05] ERROR [smart_manager.replication.receiver:315] Id: c8513240-674b808e-b510-4896-8998-5036f72689c0-6. No response received from the broker. remaining tries: 4
[10/Jan/2016 13:23:11] ERROR [smart_manager.replication.receiver:315] Id: c8513240-674b808e-b510-4896-8998-5036f72689c0-6. No response received from the broker. remaining tries: 3
[10/Jan/2016 13:23:17] ERROR [smart_manager.replication.receiver:315] Id: c8513240-674b808e-b510-4896-8998-5036f72689c0-6. No response received from the broker. remaining tries: 2
[10/Jan/2016 13:23:17] DEBUG [smart_manager.replication.listener_broker:79] Active Receiver: c8513240-674b808e-b510-4896-8998-5036f72689c0-6. Total messages processed: 29962
[10/Jan/2016 13:23:23] ERROR [smart_manager.replication.receiver:315] Id: c8513240-674b808e-b510-4896-8998-5036f72689c0-6. No response received from the broker. remaining tries: 1
[10/Jan/2016 13:23:29] ERROR [smart_manager.replication.receiver:315] Id: c8513240-674b808e-b510-4896-8998-5036f72689c0-6. No response received from the broker. remaining tries: 0
[10/Jan/2016 13:23:29] ERROR [smart_manager.replication.receiver:88] No response received from the broker. remaining tries: 0. Terminating the receiver… Exception: No response received from the broker. remaining tries: 0. Terminating the receiver.
[10/Jan/2016 13:23:29] DEBUG [smart_manager.replication.listener_broker:274] Identitiy: c8513240-674b808e-b510-4896-8998-5036f72689c0-6 command: receiver-error
[10/Jan/2016 13:23:29] DEBUG [smart_manager.replication.receiver:108] Id: c8513240-674b808e-b510-4896-8998-5036f72689c0-6. Response from the broker: ACK
[10/Jan/2016 13:24:17] DEBUG [smart_manager.replication.listener_broker:75] Receiver(c8513240-674b808e-b510-4896-8998-5036f72689c0-6) exited. exitcode: 3. Total messages processed: 29962. Removing from the list.
[10/Jan/2016 13:50:18] DEBUG [smart_manager.replication.listener_broker:291] Replica trails are truncated successfully.
[10/Jan/2016 14:50:22] DEBUG [smart_manager.replication.listener_broker:291] Replica trails are truncated successfully.
[10/Jan/2016 15:50:26] DEBUG [smart_manager.replication.listener_broker:291] Replica trails are truncated successfully.

How long can replication tolerate a loss of connectivity? I can query from the receiver to the sender on port 10002 fine now. I’d hope replication would keep retrying, if not indefinitely, then perhaps for 24 hours. Or could this be made configurable?
I’ll try SCP tomorrow.
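
As a rough check on how long the receiver waited before giving up, the timestamps in the log above can be subtracted. This is only a sketch using the first and last "No response received" entries copied from that log; the per-try wait it produces is inferred from the spacing of the log entries, not read from the source code.

```python
from datetime import datetime

FMT = "%d/%b/%Y %H:%M:%S"  # timestamp format seen in rockstor.log

first_timeout = datetime.strptime("10/Jan/2016 13:22:35", FMT)  # remaining tries: 9
last_timeout = datetime.strptime("10/Jan/2016 13:23:29", FMT)   # remaining tries: 0

span = (last_timeout - first_timeout).total_seconds()
interval = span / 9  # nine gaps between ten consecutive entries

print("seconds between first and last timeout:", span)
print("average seconds per try:", interval)
```

On this evidence each try waited about 6 seconds, so the receiver appears to have given up after roughly one minute without a message.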

An SCP copy of around 500 MB between the two NASs achieves 368.2 KB/s, which is approximately four times faster than Rockstor replication. This is what I would expect from the 3 Mbps bandwidth of this connection. Can anything be done to optimise this or remove overhead?
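
The comparison can be checked with a quick unit conversion. A small sketch, assuming decimal kilobytes (1 KB = 1000 bytes); the helper name `kb_per_sec_to_mbps` is just for illustration, and with 1024-byte kilobytes the figures shift by about 2%.

```python
KB = 1000  # assumption: decimal kilobytes

def kb_per_sec_to_mbps(rate_kb_s):
    """Convert kilobytes per second to megabits per second."""
    return rate_kb_s * KB * 8 / 1e6

replication = kb_per_sec_to_mbps(85.0)  # "Rate: 85 KB/sec" reported for replication
scp = kb_per_sec_to_mbps(368.2)         # rate measured with scp

print("replication: %.2f Mbps" % replication)
print("scp:         %.2f Mbps" % scp)
print("scp is %.1fx faster" % (368.2 / 85.0))
```

This gives about 0.68 Mbps for replication against about 2.95 Mbps for scp, a ratio of roughly 4.3.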

My testing so far within LAN is going pretty well. I have some ideas about improving speed but wanted to get the functionality correct first. So, it’s definitely happening soon.