Replication doesn't handle failures

Hi,

When a replication fails, the next replication does not recover from the failure. For example:

The replication on Jan 10th failed after a connection loss; only 2.74GB of the 7.5GB in the share had been transferred. The next replication attempt on Jan 15th took only a few seconds, whereas I’d expect either a full re-attempt or a replication of the missing ~5GB. Log for the second replication on the sender:

[15/Jan/2016 19:45:02] DEBUG [smart_manager.replication.listener_broker:254] new-send request received for 6
[15/Jan/2016 19:45:02] DEBUG [smart_manager.replication.listener_broker:143] previous backup failed for Sender(c8513240-674b808e-b510-4896-8998-5036f72689c0_6). Starting a new one. Attempt 0/10.
[15/Jan/2016 19:45:02] DEBUG [smart_manager.replication.listener_broker:148] No record of last successful ReplicaTrail for Sender(c8513240-674b808e-b510-4896-8998-5036f72689c0_6). Will start a new Full Sender.
[15/Jan/2016 19:45:02] DEBUG [smart_manager.replication.sender:102] Id: c8513240-674b808e-b510-4896-8998-5036f72689c0-6 Initial greeting: {'snap': u'ISO_share_6_replication_1', 'incremental': False, 'share': u'ISO_share', 'uuid': u'c8513240-674b808e-b510-4896-8998-5036f72689c0', 'pool': u'<POOL_NAME>'}
[15/Jan/2016 19:45:02] ERROR [storageadmin.util:38] request path: /api/shares/ISO_share/snapshots/ISO_share_6_replication_1 method: POST data: <QueryDict: {u'snap_type': [u'replication']}>
[15/Jan/2016 19:45:02] ERROR [storageadmin.util:39] exception: Snapshot(ISO_share_6_replication_1) already exists for the Share(ISO_share).
Traceback (most recent call last):
File "/opt/rockstor/eggs/gunicorn-0.16.1-py2.7.egg/gunicorn/workers/sync.py", line 34, in run
client, addr = self.socket.accept()
File "/usr/lib64/python2.7/socket.py", line 202, in accept
sock, addr = self._sock.accept()
error: [Errno 11] Resource temporarily unavailable
[15/Jan/2016 19:45:02] DEBUG [smart_manager.replication.util:120] Snapshot(ISO_share_6_replication_1) already exists for the Share(ISO_share).
[15/Jan/2016 19:45:05] DEBUG [smart_manager.replication.sender:232] Id: c8513240-674b808e-b510-4896-8998-5036f72689c0-6. snap-exists received. Not sending fsdata
[15/Jan/2016 19:45:32] DEBUG [smart_manager.replication.listener_broker:60] Sender(c8513240-674b808e-b510-4896-8998-5036f72689c0_6) exited. exitcode: 0

The pertinent lines above are "Snapshot… already exists" and "snap-exists received. Not sending fsdata". Rockstor seems to assume that if the snapshot exists on the receiver, even an incomplete one, then the previous replication succeeded. Log for the second replication on the receiver:

[15/Jan/2016 11:45:03] DEBUG [smart_manager.replication.listener_broker:210] initial greeting from c8513240-674b808e-b510-4896-8998-5036f72689c0-6
[15/Jan/2016 11:45:03] DEBUG [smart_manager.replication.listener_broker:236] New Receiver(c8513240-674b808e-b510-4896-8998-5036f72689c0-6) started.
[15/Jan/2016 11:45:03] DEBUG [smart_manager.replication.receiver:144] Id: c8513240-674b808e-b510-4896-8998-5036f72689c0-6. Starting a new Receiver for meta: {u'snap': u'ISO_share_6_replication_1', u'incremental': False, u'share': u'ISO_share', u'uuid': u'c8513240-674b808e-b510-4896-8998-5036f72689c0', u'pool': u'<POOL_NAME>'}
[15/Jan/2016 11:45:03] ERROR [storageadmin.util:38] request path: /api/shares method: POST data: {u'sname': u'c8513240-674b808e-b510-4896-8998-5036f72689c0_ISO_share', u'replica': True, u'pool': u'<POOL_NAME>'}
[15/Jan/2016 11:45:03] ERROR [storageadmin.util:39] exception: Share(c8513240-674b808e-b510-4896-8998-5036f72689c0_ISO_share) already exists. Choose a different name
Traceback (most recent call last):
File "/opt/rockstor/eggs/gunicorn-0.16.1-py2.7.egg/gunicorn/workers/sync.py", line 34, in run
client, addr = self.socket.accept()
File "/usr/lib64/python2.7/socket.py", line 202, in accept
sock, addr = self._sock.accept()
error: [Errno 11] Resource temporarily unavailable
[15/Jan/2016 11:45:03] DEBUG [smart_manager.replication.util:145] Share(c8513240-674b808e-b510-4896-8998-5036f72689c0_ISO_share) already exists. Choose a different name
[15/Jan/2016 11:45:03] ERROR [storageadmin.util:38] request path: /api/sm/replicas/rshare method: POST data: <QueryDict: {u'appliance': [u'SENDER_NAS'], u'share': [u'c8513240-674b808e-b510-4896-8998-5036f72689c0_ISO_share'], u'src_share': [u'ISO_share']}>
[15/Jan/2016 11:45:03] ERROR [storageadmin.util:39] exception: Replicashare(c8513240-674b808e-b510-4896-8998-5036f72689c0_ISO_share) already exists.
Traceback (most recent call last):
File "/opt/rockstor/eggs/gunicorn-0.16.1-py2.7.egg/gunicorn/workers/sync.py", line 34, in run
client, addr = self.socket.accept()
File "/usr/lib64/python2.7/socket.py", line 202, in accept
sock, addr = self._sock.accept()
error: [Errno 11] Resource temporarily unavailable
[15/Jan/2016 11:45:05] DEBUG [smart_manager.replication.receiver:223] Id: c8513240-674b808e-b510-4896-8998-5036f72689c0-6. Snapshot to be sent(/mnt2/<POOL_NAME>/.snapshots/c8513240-674b808e-b510-4896-8998-5036f72689c0_ISO_share/ISO_share_6_replication_1) already exists. Not starting a new receive process
[15/Jan/2016 11:45:29] DEBUG [smart_manager.replication.listener_broker:79] Active Receiver: c8513240-674b808e-b510-4896-8998-5036f72689c0-6. Total messages processed: 1
[15/Jan/2016 11:46:05] DEBUG [smart_manager.replication.receiver:131] Id: c8513240-674b808e-b510-4896-8998-5036f72689c0-6 command: snap-exists rcommand: None
[15/Jan/2016 11:46:05] DEBUG [smart_manager.replication.receiver:80] Id: c8513240-674b808e-b510-4896-8998-5036f72689c0-6. meta: {u'snap': u'ISO_share_6_replication_1', u'incremental': False, u'share': u'ISO_share', u'uuid': u'c8513240-674b808e-b510-4896-8998-5036f72689c0', u'pool': u'<POOL_NAME>'} Receive successful
[15/Jan/2016 11:46:29] DEBUG [smart_manager.replication.listener_broker:75] Receiver(c8513240-674b808e-b510-4896-8998-5036f72689c0-6) exited. exitcode: 0. Total messages processed: 1. Removing from the list.
[15/Jan/2016 11:57:29] DEBUG [smart_manager.replication.listener_broker:291] Replica trails are truncated successfully.

Again, note "Share… already exists" and "exitcode: 0. Total messages processed: 1". I'd expect far more than one message to have been transferred. Is this a bug?
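For illustration, something like the following sender-side guard is what I'd have expected. This is just a sketch; latest_trail_for() and the status values are my assumptions, not Rockstor's actual API:

    # Sketch only: on a "snap-exists" reply, cross-check the sender's own
    # trail history before treating the replication as complete.
    def handle_snap_exists(replica):
        last_trail = latest_trail_for(replica)  # assumed helper: most recent ReplicaTrail
        if last_trail is not None and last_trail.status == 'succeeded':
            return 'skip-fsdata'   # the snapshot really was delivered in full
        # The snapshot exists on the receiver but our last run failed, so it
        # is almost certainly partial: have it discarded and resend.
        return 'resend-full'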

It seems Rockstor would be far from the first tool that doesn't gracefully recover from a partial/failed btrfs send/receive: http://linux-btrfs.vger.kernel.narkive.com/A4yoKEmB/announcement-buttersink-like-rsync-for-btrfs-snapshots

“Since btrfs would keep a partial receive - with no easy way to tell whether a receive WAS partial or full - I would inevitably end up with interrupted sends causing a problem that couldn’t be resolved without manually deleting snapshots on the target end haphazardly until I nailed the incomplete one.”

However, the author of buttersink claims to have cracked it: “buttersink usually detects the errors and will either move the destination snapshot to mark it as partial/failed (for btrfs), or cancel and delete the partial upload. I’ve also found that it helps to wait a while (e.g. 30 seconds) after any volume deletes before trying the send/sync.”
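For what it's worth, one way to tell a partial receive from a full one is that btrfs receive only stamps the "Received UUID" on the subvolume once the whole stream has been applied; an interrupted receive leaves it unset. A rough sketch of such a check (my own, not Rockstor or buttersink code):

    import subprocess

    def snapshot_fully_received(subvol_path):
        # An interrupted `btrfs receive` leaves the subvolume without a
        # Received UUID (shown as '-'), so its presence means the stream
        # was applied to the end.
        out = subprocess.check_output(
            ['btrfs', 'subvolume', 'show', subvol_path]).decode()
        for line in out.splitlines():
            if line.strip().startswith('Received UUID:'):
                return line.split(':', 1)[1].strip() not in ('', '-')
        return False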

Really helpful feedback, @grizzly. I've created a blanket issue for this here.

Can you let more replication tasks run and report back on the pattern? Logs are going to be very useful. Some key questions I'd be interested in are:

  1. After a few replication tasks, do they all succeed, or is there a pattern of successes and failures? We do have some auto-recovery logic to send a full backup; I'd be interested in what (if any) low-level btrfs errors look like.

  2. Again, after a few replication tasks finish, what do the contents of the target Share look like? Totally corrupted content, or missing content?

Thanks for testing it out thoroughly!

The 7GB test replication worked. As an aside, this eventually ended up as a new share on the receiver. It would be nice to understand what is going on here, i.e. how and why the initial replications result in snapshots, and then eventually a share emerges.

However, the full replication of ~5TB keeps failing. The best I've managed is ~2100GB. Sender log:

...
[25/Jan/2016 04:40:01] DEBUG [smart_manager.replication.sender:119] Id: c8513240-674b808e-b510-4896-8998-5036f72689c0-8 Server: <Receiver_NAS>:10002 scommand:  rcommand: receiver-error
[25/Jan/2016 04:40:01] ERROR [smart_manager.replication.sender:73] Id: c8513240-674b808e-b510-4896-8998-5036f72689c0-8. Got null or error command(receiver-error)
message(Failed to send receiver-ready. Exception: [Errno 32] Broken pipe) from the Receiver while transmitting fsdata. Aborting.. Exception: Failed to send receiver-ready. Exception: [Errno 32] Broken pipe

Receiver log:

[25/Jan/2016 04:39:33] DEBUG [smart_manager.replication.receiver:295] Id: c8513240-674b808e-b510-4896-8998-5036f72689c0-8. Receiver alive. Data transferred: 2157.26 GB. Rate: 39.44 MB/sec.
[25/Jan/2016 04:40:01] ERROR [smart_manager.replication.receiver:88] Failed to send receiver-ready. Exception: [Errno 32] Broken pipe
[25/Jan/2016 04:40:01] DEBUG [smart_manager.replication.receiver:108] Id: c8513240-674b808e-b510-4896-8998-5036f72689c0-8. Response from the broker: ACK

On resumption on sender:

[28/Jan/2016 22:18:01] DEBUG [smart_manager.replication.listener_broker:254] new-send request received for 8
[28/Jan/2016 22:18:01] DEBUG [smart_manager.replication.listener_broker:143] previous backup failed for Sender(c8513240-674b808e-b510-4896-8998-5036f72689c0_8). Starting a new one. Attempt 0/10.
[28/Jan/2016 22:18:01] DEBUG [smart_manager.replication.sender:102] Id: c8513240-674b808e-b510-4896-8998-5036f72689c0-8 Initial greeting: {'snap': u'share01_8_replication_34', 'incremental': True, 'share': u'share01', 'uuid': u'c8513240-674b808e-b510-4896-8998-5036f72689c0', 'pool': u'pool01'}
[28/Jan/2016 22:18:02] ERROR [storageadmin.util:38] request path: /api/shares/share01/snapshots/share01_8_replication_34 method: POST data: <QueryDict: {u'snap_type': [u'replication']}>
[28/Jan/2016 22:18:02] ERROR [storageadmin.util:39] exception: Snapshot(share01_8_replication_34) already exists for the Share(share01).
Traceback (most recent call last):
File "/opt/rockstor/eggs/gunicorn-0.16.1-py2.7.egg/gunicorn/workers/sync.py", line 34, in run
client, addr = self.socket.accept()
File "/usr/lib64/python2.7/socket.py", line 202, in accept
sock, addr = self._sock.accept()
error: [Errno 11] Resource temporarily unavailable
[28/Jan/2016 22:18:02] DEBUG [smart_manager.replication.util:120] Snapshot(share01_8_replication_34) already exists for the Share(share01).
[28/Jan/2016 22:18:02] DEBUG [smart_manager.replication.sender:232] Id: c8513240-674b808e-b510-4896-8998-5036f72689c0-8. snap-exists received. Not sending fsdata

On resumption on receiver:

[28/Jan/2016 22:18:01] DEBUG [smart_manager.replication.listener_broker:210] initial greeting from c8513240-674b808e-b510-4896-8998-5036f72689c0-8
[28/Jan/2016 22:18:01] DEBUG [smart_manager.replication.listener_broker:236] New Receiver(c8513240-674b808e-b510-4896-8998-5036f72689c0-8) started.
[28/Jan/2016 22:18:01] DEBUG [smart_manager.replication.receiver:144] Id: c8513240-674b808e-b510-4896-8998-5036f72689c0-8. Starting a new Receiver for meta: {u'snap': u'share01_8_replication_34', u'incremental': True, u'share': u'share01', u'uuid': u'c8513240-674b808e-b510-4896-8998-5036f72689c0', u'pool': u'pool01'}
[28/Jan/2016 22:18:01] ERROR [smart_manager.replication.receiver:139] Id: c8513240-674b808e-b510-4896-8998-5036f72689c0-8. There are no replication snapshots on the system for Share(c8513240-674b808e-b510-4896-8998-5036f72689c0_share01).
[28/Jan/2016 22:18:02] DEBUG [smart_manager.replication.receiver:223] Id: c8513240-674b808e-b510-4896-8998-5036f72689c0-8. Snapshot to be sent(/mnt2/pool01/.snapshots/c8513240-674b808e-b510-4896-8998-5036f72689c0_share01/share01_8_replication_34) already exists. Not starting a new receive process
[28/Jan/2016 22:18:50] DEBUG [smart_manager.replication.listener_broker:79] Active Receiver: c8513240-674b808e-b510-4896-8998-5036f72689c0-8. Total messages processed: 1

Any timescales for a fix? Many large replications, and replications over slow links, will fail, so I wouldn't describe this as a "corner" case. New replications should detect previous failures and ideally continue from where they left off, or else delete the partial snapshot on the receiver and try again from scratch.
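Rough sketch of the fallback I mean, given some way of knowing whether the snapshot on the receiver is complete (e.g. the Received UUID check above); names are illustrative, not Rockstor code:

    import subprocess

    def receiver_action(subvol_path, snapshot_is_complete):
        if snapshot_is_complete:
            return 'snap-exists'   # genuinely done; safe to skip fsdata
        # Leftover partial receive from a failed run: delete it so the next
        # attempt performs a clean full send/receive from scratch.
        subprocess.check_call(['btrfs', 'subvolume', 'delete', subvol_path])
        return 'ready-for-full-send'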

Soon is all I can say; it's near the front of the queue. Thanks for your patience and for letting us know how important this issue is.