Replication to remote rockstor fails

kysdaddy · January 1, 2019, 2:18pm

Failed to create Snapshot: HPRS_Movies_2_replication_1 aborting… Exception: 500 Server Error: INTERNAL SERVER ERROR

I searched the forum and found issues on 3.9… dealing with Rockstor versions. Below are my versions, and they seem to be identical.

Once again I am at a loss!

Last login: Mon Dec 31 18:15:19 2018 from xxx
[root@rockstor ~]# yum info rockstor
Loaded plugins: changelog, fastestmirror
Loading mirror speeds from cached hostfile

base: linux.cc.lehigh.edu
epel: mirror.math.princeton.edu
extras: linux.cc.lehigh.edu
updates: linux.cc.lehigh.edu
Installed Packages
Name : rockstor
Arch : x86_64
Version : 3.9.2
Release : 44
Size : 79 M
Repo : installed
From repo : Rockstor-Stable
Summary : RockStor – Store Smartly
License : GPL
Description : RockStor – Store Smartly

[root@rockstor ~]#

Last login: Fri Dec 21 14:01:17 2018 from xxx
[root@lakerockstor ~]# yum info rockstor
Loaded plugins: changelog, fastestmirror
Loading mirror speeds from cached hostfile

base: repo1.ash.innoscale.net
epel: www.gtlib.gatech.edu
extras: repo1.ash.innoscale.net
updates: repo1.ash.innoscale.net
Installed Packages
Name : rockstor
Arch : x86_64
Version : 3.9.2
Release : 44
Size : 79 M
Repo : installed
From repo : Rockstor-Stable
Summary : RockStor – Store Smartly
License : GPL
Description : RockStor – Store Smartly

[root@lakerockstor ~]#

kysdaddy · January 1, 2019, 4:03pm

Hello all.
I found out what “I” was doing wrong: once I forwarded the listening port on the remote server, it seems to be working.
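
For anyone hitting the same error, a quick reachability check from the sending machine can rule the port forward in or out before digging further. This is only a minimal sketch using Python's standard socket module; the host name and port passed in are whatever you configured for the remote appliance's replication listener, and nothing here is Rockstor-specific.

```python
import socket


def port_is_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False
```

Calling `port_is_open("remote.example.org", 12345)` (both values are placeholders) returning False would suggest the forward or a firewall rule is still missing.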

phillxnet · January 1, 2019, 8:36pm

@kysdaddy Thanks for the report and update, and glad you got it sorted in the end.

We do still have some known fragility in our replication code that is due for some attention. At least one bug was introduced when we had to adapt (read mend) the code to our necessary share and pool API changes a while back. At that time I wasn’t able to re-establish the complete prior robustness, and I hope to get around to this in time, unless someone beats me to it of course.

The following TODO that I left in the code at the time springs to mind as a known failure point:

github.com/rockstor/rockstor-core/blob/master/src/rockstor/smart_manager/replication/sender.py#L225-L231

#  create a snapshot only if it's not already from a previous
#  failed attempt.
# TODO: If one does exist we fail which seems harsh as we may be
# TODO: able to pickup where we left of depending on the failure.
self.msg = ('Failed to create snapshot: %s. Aborting.'
            % self.snap_name)
self.create_snapshot(self.replica.share, self.snap_name)

with create_snapshot() defined as:

github.com/rockstor/rockstor-core/blob/master/src/rockstor/smart_manager/replication/util.py#L117-L128

def create_snapshot(self, sname, snap_name, snap_type='replication'):
    try:
        share = Share.objects.get(name=sname)
        url = ('shares/%s/snapshots/%s' % (share.id, snap_name))
        return self.law.api_call(url, data={'snap_type': snap_type, },
                                 calltype='post', save_error=False)
    except RockStorAPIException as e:
        # Note snapshot.py _create() generates this exception message.
        if (e.detail == ('Snapshot ({}) already exists for the share '
                         '({}).').format(snap_name, sname)):
            return logger.debug(e.detail)
        raise e
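
To make the control flow of the quoted create_snapshot() easier to follow in isolation, here is a minimal self-contained sketch. It is not Rockstor code: FakeAPIException, make_fake_api, and the injected api_call are stand-ins invented for illustration, and only the message format and the "log and return, otherwise re-raise" shape are taken from the excerpt above.

```python
import logging

logger = logging.getLogger(__name__)


class FakeAPIException(Exception):
    """Stand-in for RockStorAPIException (hypothetical, illustration only)."""

    def __init__(self, detail):
        super().__init__(detail)
        self.detail = detail


def already_exists_msg(snap_name, sname):
    # Same message format that the quoted create_snapshot() compares against.
    return ('Snapshot ({}) already exists for the share '
            '({}).').format(snap_name, sname)


def create_snapshot(api_call, sname, snap_name):
    """Mirror the quoted control flow with an injected api_call (assumption)."""
    try:
        return api_call(sname, snap_name)
    except FakeAPIException as e:
        if e.detail == already_exists_msg(snap_name, sname):
            # Tolerated case: log it and return None (logger.debug returns None).
            return logger.debug(e.detail)
        # Any other API error is passed on to the caller unchanged.
        raise


def make_fake_api():
    """Hypothetical in-memory API that rejects duplicate snapshot names."""
    existing = set()

    def api_call(sname, snap_name):
        if (sname, snap_name) in existing:
            raise FakeAPIException(already_exists_msg(snap_name, sname))
        existing.add((sname, snap_name))
        return {'name': snap_name, 'share': sname}

    return api_call
```

In this sketch a second call with the same share and snapshot name returns None rather than raising, while any other error detail propagates to the caller.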

The bug surfaces as a ‘blocking’ snapshot on the sender following a failed send attempt, which in turn blocks all future sends. The workaround for now is to simply delete this snapshot so that a fresh one can be auto-created upon the next send attempt. It’s for sure an inelegance and buggy behaviour, but it may just need a little attention to resolve. During my last stint in this area of the code I didn’t quite get around to resolving exactly why this failure happens.

However, if anyone cares to take a look, they may find this code quite interesting. It uses zeromq for messaging between the sender and the receiver. It was originally crafted by @suman, with a little maintenance on my part thereafter. It would be good to return all its prior safeguards/robustness, but it is, unfortunately, not a task to be undertaken lightly: the code exercises quite a lot of Rockstor’s ‘systems’, and a full replication cycle takes 5 replication events, so it can be quite time-consuming to test whatever changes are made in this mechanism to ensure it is not left worse than one finds it.

Anyway just thought I’d drop this note here by way of contextual relevance and I’ll try to create a Github issue of the exact reproducer from my notes on this known fragility when time permits. Plus you can then look out for this failure with the knowledge of how to ‘work around’ it if / when the time comes.

Thanks for helping to support Rockstor’s development via your stable subscriptions, by the way, and well done for getting replication set up. It’s a nice feature that deserves more time. However, I have tested it as working (within its current limits) on our future openSUSE base via full cycle tests between CentOS based Rockstor instances and openSUSE based instances. So there’s that at least.

Hope that helps. And remember that it takes 5 replication events to settle into the final stable state. Prior to that there is a very strangely named share, early on, that was at least initially meant to be hidden but does serve as an initial indicator that something happened. Give it 5 events and you will have the final stable state re share / snapshot names on both the sender and the receiver, and the various associated share / snapshot arrangements should then make more sense.