[SOLVED] Replication broken by upgrade

Hi All,

After upgrading two Rockstor instances to the latest release (from a very old one), I have had a number of issues with replication.

I originally had two shares being replicated between the two Rockstor instances, which had been working for a very long time without issues. Immediately following the upgrade, replication stopped.

I tried to set up a new replication job as a test, and that failed with the same issue.

Deleting the entire replica from the receiving node and trying again changed the error for each new replication attempt. It looks like it has re-created the share, but it now complains: "No response received from the broker. remaining tries: 9".

I’ve included below the key log snippets for the above points, but can’t get replication working again now.

Initial Network Configuration complaint:

[18/Oct/2016 20:09:58] DEBUG [smart_manager.replication.listener_broker:291] Replica trails are truncated successfully.
[18/Oct/2016 21:10:02] DEBUG [smart_manager.replication.listener_broker:291] Replica trails are truncated successfully.
[18/Oct/2016 22:10:06] DEBUG [smart_manager.replication.listener_broker:291] Replica trails are truncated successfully.
[18/Oct/2016 23:10:10] DEBUG [smart_manager.replication.listener_broker:291] Replica trails are truncated successfully.
Package upgrade
[18/Oct/2016 17:33:13] DEBUG [smart_manager.data_collector:808] Listening on port http://127.0.0.1:8080 and on port 10843 (flash policy server)
[18/Oct/2016 17:33:39] ERROR [smart_manager.data_collector:710] Exception while gathering kernel info: You are running an unsupported kernel(4.3.3-1.el7.elrepo.x86_64). Some features may not work properly. Please reboot and the system will automatically boot using the supported kernel(4.6.0-1.el7.elrepo.x86_64)
[18/Oct/2016 17:33:41] ERROR [storageadmin.views.network:151] NetworkConnection matching query does not exist.
Traceback (most recent call last):
  File "/opt/rockstor/src/rockstor/storageadmin/views/network.py", line 149, in update_connection
    dconfig['connection'] = NetworkConnection.objects.get(name=dconfig['connection'])
  File "/opt/rockstor/eggs/Django-1.6.11-py2.7.egg/django/db/models/manager.py", line 151, in get
    return self.get_queryset().get(*args, **kwargs)
  File "/opt/rockstor/eggs/Django-1.6.11-py2.7.egg/django/db/models/query.py", line 310, in get
    self.model._meta.object_name)
DoesNotExist: NetworkConnection matching query does not exist.

Example of replication job failing with Share(xxxxxxxxxxx) already exists

[18/Oct/2016 23:36:30] DEBUG [storageadmin.util:48] Current Rockstor version: 3.8-14.22
[18/Oct/2016 23:38:02] DEBUG [smart_manager.replication.listener_broker:214] initial greeting from a8c0c901-d36a76c4-8126-4453-b804-3ec41c8651d2-3
[18/Oct/2016 23:38:02] DEBUG [smart_manager.replication.listener_broker:240] New Receiver(a8c0c901-d36a76c4-8126-4453-b804-3ec41c8651d2-3) started.
[18/Oct/2016 23:38:02] DEBUG [smart_manager.replication.receiver:144] Id: a8c0c901-d36a76c4-8126-4453-b804-3ec41c8651d2-3. Starting a new Receiver for meta: {u'snap': u'Narrowbeam_3_replication_1', u'incremental': False, u'share': u'Narrowbeam', u'uuid': u'a8c0c901-d36a76c4-8126-4453-b804-3ec41c8651d2', u'pool': u'BACKUP'}
[18/Oct/2016 23:38:02] ERROR [storageadmin.util:47] exception: Share(a8c0c901-d36a76c4-8126-4453-b804-3ec41c8651d2_Narrowbeam) already exists. Choose a different name
Traceback (most recent call last):
  File "/opt/rockstor/eggs/gunicorn-0.16.1-py2.7.egg/gunicorn/workers/sync.py", line 34, in run
    client, addr = self.socket.accept()
  File "/usr/lib64/python2.7/socket.py", line 202, in accept
    sock, addr = self._sock.accept()
error: [Errno 11] Resource temporarily unavailable
[18/Oct/2016 23:38:02] DEBUG [storageadmin.util:48] Current Rockstor version: 3.8-14.22
[18/Oct/2016 23:38:02] ERROR [smart_manager.replication.receiver:88] Failed to verify/create share: a8c0c901-d36a76c4-8126-4453-b804-3ec41c8651d2_Narrowbeam.. Exception: 500 Server Error: INTERNAL SERVER ERROR
[18/Oct/2016 23:38:02] DEBUG [smart_manager.replication.listener_broker:278] Identitiy: a8c0c901-d36a76c4-8126-4453-b804-3ec41c8651d2-3 command: receiver-error
[18/Oct/2016 23:38:02] DEBUG [smart_manager.replication.receiver:108] Id: a8c0c901-d36a76c4-8126-4453-b804-3ec41c8651d2-3. Response from the broker: ACK
[18/Oct/2016 23:38:20] DEBUG [smart_manager.replication.listener_broker:75] Receiver(a8c0c901-d36a76c4-8126-4453-b804-3ec41c8651d2-3) exited. exitcode: 3. Total messages processed: 1. Removing from the list.
[18/Oct/2016 23:38:21] DEBUG [smart_manager.replication.listener_broker:295] Replica trails are truncated successfully.
[18/Oct/2016 23:39:21] ERROR [smart_manager.replication.listener_broker:298] Parent exited. Aborting.
[18/Oct/2016 23:40:23] DEBUG [smart_manager.replication.listener_broker:295] Replica trails are truncated successfully.
[18/Oct/2016 23:40:34] ERROR [storageadmin.middleware:35] Exception occured while processing a request. Path: /api/sm/replicas/trail/replica/3 method: GET
[18/Oct/2016 23:40:34] ERROR [storageadmin.middleware:36] Replica matching query does not exist.
Traceback (most recent call last):
  File "/opt/rockstor/eggs/Django-1.6.11-py2.7.egg/django/core/handlers/base.py", line 112, in get_response
    response = wrapped_callback(request, *callback_args, **callback_kwargs)
  File "/opt/rockstor/eggs/Django-1.6.11-py2.7.egg/django/views/decorators/csrf.py", line 57, in wrapped_view
    return view_func(*args, **kwargs)
  File "/opt/rockstor/eggs/Django-1.6.11-py2.7.egg/django/views/generic/base.py", line 69, in view
    return self.dispatch(request, *args, **kwargs)
  File "/opt/rockstor/eggs/djangorestframework-3.1.1-py2.7.egg/rest_framework/views.py", line 452, in dispatch
    response = self.handle_exception(exc)
  File "/opt/rockstor/eggs/djangorestframework-3.1.1-py2.7.egg/rest_framework/views.py", line 449, in dispatch
    response = handler(request, *args, **kwargs)
  File "/opt/rockstor/eggs/djangorestframework-3.1.1-py2.7.egg/rest_framework/generics.py", line 241, in get
    return self.list(request, *args, **kwargs)
  File "/opt/rockstor/eggs/djangorestframework-3.1.1-py2.7.egg/rest_framework/mixins.py", line 39, in list
    queryset = self.filter_queryset(self.get_queryset())
  File "/opt/rockstor/src/rockstor/smart_manager/views/replica_trail.py", line 34, in get_queryset
    replica = Replica.objects.get(id=self.kwargs['rid'])
  File "/opt/rockstor/eggs/Django-1.6.11-py2.7.egg/django/db/models/manager.py", line 151, in get
    return self.get_queryset().get(*args, **kwargs)
  File "/opt/rockstor/eggs/Django-1.6.11-py2.7.egg/django/db/models/query.py", line 310, in get
    self.model._meta.object_name)
DoesNotExist: Replica matching query does not exist.
[18/Oct/2016 23:41:23] ERROR [smart_manager.replication.listener_broker:298] Parent exited. Aborting.

Example of new replication failure:

[19/Oct/2016 08:26:25] DEBUG [storageadmin.util:48] Current Rockstor version: 3.8-14.22
[19/Oct/2016 08:28:02] DEBUG [smart_manager.replication.listener_broker:214] initial greeting from a8c0c901-d36a76c4-8126-4453-b804-3ec41c8651d2-3
[19/Oct/2016 08:28:02] DEBUG [smart_manager.replication.listener_broker:240] New Receiver(a8c0c901-d36a76c4-8126-4453-b804-3ec41c8651d2-3) started.
[19/Oct/2016 08:28:02] DEBUG [smart_manager.replication.receiver:144] Id: a8c0c901-d36a76c4-8126-4453-b804-3ec41c8651d2-3. Starting a new Receiver for meta: {u'snap': u'Narrowbeam_3_replication_1', u'incremental': False, u'share': u'Narrowbeam', u'uuid': u'a8c0c901-d36a76c4-8126-4453-b804-3ec41c8651d2', u'pool': u'BACKUP'}
[19/Oct/2016 08:28:03] DEBUG [smart_manager.replication.listener_broker:278] Identitiy: a8c0c901-d36a76c4-8126-4453-b804-3ec41c8651d2-3 command: receiver-ready
[19/Oct/2016 08:28:03] DEBUG [smart_manager.replication.receiver:131] Id: a8c0c901-d36a76c4-8126-4453-b804-3ec41c8651d2-3 command: receiver-ready rcommand: ACK
[19/Oct/2016 08:28:09] ERROR [smart_manager.replication.receiver:315] Id: a8c0c901-d36a76c4-8126-4453-b804-3ec41c8651d2-3. No response received from the broker. remaining tries: 9
[19/Oct/2016 08:28:15] ERROR [smart_manager.replication.receiver:315] Id: a8c0c901-d36a76c4-8126-4453-b804-3ec41c8651d2-3. No response received from the broker. remaining tries: 8
[19/Oct/2016 08:28:21] ERROR [smart_manager.replication.receiver:315] Id: a8c0c901-d36a76c4-8126-4453-b804-3ec41c8651d2-3. No response received from the broker. remaining tries: 7
[19/Oct/2016 08:28:27] ERROR [smart_manager.replication.receiver:315] Id: a8c0c901-d36a76c4-8126-4453-b804-3ec41c8651d2-3. No response received from the broker. remaining tries: 6
[19/Oct/2016 08:28:33] ERROR [smart_manager.replication.receiver:315] Id: a8c0c901-d36a76c4-8126-4453-b804-3ec41c8651d2-3. No response received from the broker. remaining tries: 5
[19/Oct/2016 08:28:39] DEBUG [smart_manager.replication.listener_broker:79] Active Receiver: a8c0c901-d36a76c4-8126-4453-b804-3ec41c8651d2-3. Total messages processed: 1
[19/Oct/2016 08:28:39] ERROR [smart_manager.replication.receiver:315] Id: a8c0c901-d36a76c4-8126-4453-b804-3ec41c8651d2-3. No response received from the broker. remaining tries: 4
[19/Oct/2016 08:28:45] ERROR [smart_manager.replication.receiver:315] Id: a8c0c901-d36a76c4-8126-4453-b804-3ec41c8651d2-3. No response received from the broker. remaining tries: 3
[19/Oct/2016 08:28:51] ERROR [smart_manager.replication.receiver:315] Id: a8c0c901-d36a76c4-8126-4453-b804-3ec41c8651d2-3. No response received from the broker. remaining tries: 2
[19/Oct/2016 08:28:57] ERROR [smart_manager.replication.receiver:315] Id: a8c0c901-d36a76c4-8126-4453-b804-3ec41c8651d2-3. No response received from the broker. remaining tries: 1
[19/Oct/2016 08:29:03] ERROR [smart_manager.replication.receiver:315] Id: a8c0c901-d36a76c4-8126-4453-b804-3ec41c8651d2-3. No response received from the broker. remaining tries: 0
[19/Oct/2016 08:29:03] ERROR [smart_manager.replication.receiver:88] No response received from the broker. remaining tries: 0. Terminating the receiver.. Exception: No response received from the broker. remaining tries: 0. Terminating the receiver.
[19/Oct/2016 08:29:03] DEBUG [smart_manager.replication.listener_broker:278] Identitiy: a8c0c901-d36a76c4-8126-4453-b804-3ec41c8651d2-3 command: receiver-error
[19/Oct/2016 08:29:03] DEBUG [smart_manager.replication.receiver:108] Id: a8c0c901-d36a76c4-8126-4453-b804-3ec41c8651d2-3. Response from the broker: ACK
[19/Oct/2016 08:29:39] DEBUG [smart_manager.replication.listener_broker:75] Receiver(a8c0c901-d36a76c4-8126-4453-b804-3ec41c8651d2-3) exited. exitcode: 3. Total messages processed: 1. Removing from the list.
[19/Oct/2016 08:44:40] DEBUG [smart_manager.replication.listener_broker:295] Replica trails are truncated successfully.
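
For anyone comparing this against their own log, here is a rough Python sketch that tallies ERROR lines by module. It assumes the line layout seen in the excerpts above (timestamp, level, module:line, message). The names `LOG_LINE` and `summarize_errors` are made up for illustration and are not part of Rockstor.

```python
import re
from collections import Counter

# Assumed layout, taken from the excerpts above:
# [19/Oct/2016 08:28:09] ERROR [smart_manager.replication.receiver:315] message
LOG_LINE = re.compile(
    r"^\[(?P<ts>[^\]]+)\] (?P<level>[A-Z]+) "
    r"\[(?P<module>[\w.]+):(?P<lineno>\d+)\] (?P<msg>.*)$"
)


def summarize_errors(lines):
    """Count ERROR entries per module (hypothetical helper).

    Lines that do not match the layout, such as traceback lines,
    are skipped rather than treated as errors.
    """
    counts = Counter()
    for line in lines:
        match = LOG_LINE.match(line.strip())
        if match and match.group("level") == "ERROR":
            counts[match.group("module")] += 1
    return counts


if __name__ == "__main__":
    sample = [
        "[19/Oct/2016 08:28:03] DEBUG [smart_manager.replication.listener_broker:278] command: receiver-ready",
        "[19/Oct/2016 08:28:09] ERROR [smart_manager.replication.receiver:315] No response received from the broker. remaining tries: 9",
        "Traceback (most recent call last):",
        "[18/Oct/2016 23:40:34] ERROR [storageadmin.middleware:36] Replica matching query does not exist.",
    ]
    for module, count in summarize_errors(sample).most_common():
        print(module, count)
```

To run it over a real log, pass an open file object to `summarize_errors` instead of the sample list.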

Any help would be greatly welcome. Replication is my only backup method, and I don’t want to risk losing any data while I have no backup. Each Rockstor has only a single disk, so I’m exposed.

Is anyone else experiencing broken replication on the latest dev builds?

Oh well, I guess not given no replies - shame.

Anyway, I had to delete all replication sets and my backup data before being able to set them up again. Deleting only some of the replication sets didn’t work; I had to delete all of them and then restart.