Replication Error v3.9.2-23

KLSsandman · April 25, 2018, 10:07am

Hi

I created a replication to another box and it was in the middle of the replication over a WAN link. My link lost connection and when the replication job tries to start it now errors with:

Failed to create snapshot. Exception 500 internal server error

I assumed the replication would just restart? I searched the forums and found this error reported as a bug, but it's in a previous version.

Many thanks
Simon

phillxnet · April 25, 2018, 11:18am

@KLSsandman Welcome to the Rockstor community and thanks for reporting your findings.

Yes, we do still have some fragility on our side of replication with regard to network interruptions, and I’m afraid I didn’t quite get around to addressing this in my last update of this code, following a large API URL refactoring a few months back.

Essentially, if a snapshot has already been made prior to a network interruption / failed replication, we fail to deal elegantly with this and instead fall over. The workaround until we address this bug is to identify the snapshot in question and simply delete it. Thereafter, when the next replication task is scheduled (assuming not too many have failed), it should resume as normal. If too many have failed in a row, the associated replication task is automatically turned off, so it will need turning on again.
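
As a rough illustration of the "too many failures in a row" behaviour, here is a toy Python model. It is not Rockstor's code: the class, the method, and the threshold value are all hypothetical, chosen only to show how a run of consecutive failures can switch a task off while a single success resets the count.

```python
# Illustrative only: a toy model of a scheduled task that disables itself
# after too many consecutive failures. MAX_CONSECUTIVE_FAILURES is a
# made-up value, not Rockstor's actual setting.
MAX_CONSECUTIVE_FAILURES = 3


class ReplicationTask:
    def __init__(self):
        self.enabled = True
        self.consecutive_failures = 0

    def record_result(self, succeeded):
        """Update the failure streak; disable the task if it gets too long."""
        if succeeded:
            self.consecutive_failures = 0
            return
        self.consecutive_failures += 1
        if self.consecutive_failures >= MAX_CONSECUTIVE_FAILURES:
            self.enabled = False


task = ReplicationTask()
for _ in range(MAX_CONSECUTIVE_FAILURES):
    task.record_result(succeeded=False)
print(task.enabled)
```

In this model a disabled task stays off until someone sets `enabled` back to `True`, which mirrors having to turn the replication task on again by hand.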

To identify the particular snapshot that is holding up proceedings, you would need to look more closely at the log reports of both the send and receive side. Also note that the very first task to fail will have the most information in this regard; check out its mouse-over tooltip report of the error in the scheduled tasks history. Then, once that snapshot is correctly identified and removed, you should be OK again.

If my first guess is right I think this is the problem area of code that needs attention with regard to this fragile scenario (note the TODO: notes):

https://github.com/rockstor/rockstor-core/blob/master/src/rockstor/smart_manager/replication/sender.py#L225-L233

#  create a snapshot only if it's not already from a previous
#  failed attempt.
# TODO: If one does exist we fail which seems harsh as we may be
# TODO: able to pickup where we left of depending on the failure.
self.msg = ('Failed to create snapshot: %s. Aborting.'
            % self.snap_name)
self.create_snapshot(self.replica.share, self.snap_name)


retries_left = settings.REPLICATION.get('max_send_attempts')

The error does not match your report but may still be logged as well.

So if that is the case, you may well find that logged on the first failed attempt. Also, sometimes the ‘other end’ of the replication has better reporting, so do make sure to check both sender and receiver error reports (the tooltips) for diagnostic info.
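
To make the TODO in that excerpt a little more concrete, here is a minimal Python sketch of the "create a snapshot only if one doesn't already exist" idea. It is not the code in sender.py: `SnapshotStore`, `ensure_snapshot`, and the share and snapshot names are hypothetical stand-ins, and the store is just an in-memory set. Whether reusing a leftover snapshot is actually safe depends on how the earlier attempt failed, as the TODO itself notes.

```python
# Hypothetical names throughout; this is not the code in sender.py.
class SnapshotStore:
    """In-memory stand-in for the real snapshot layer."""

    def __init__(self):
        self.snapshots = set()

    def exists(self, share, snap_name):
        return (share, snap_name) in self.snapshots

    def create(self, share, snap_name):
        # Mimic the fragile behaviour: creating a duplicate is an error.
        if self.exists(share, snap_name):
            raise RuntimeError('snapshot %s already exists' % snap_name)
        self.snapshots.add((share, snap_name))


def ensure_snapshot(store, share, snap_name):
    """Create the snapshot unless a previous attempt left one behind.

    Returns True if a new snapshot was created, False if an existing
    one was reused.
    """
    if store.exists(share, snap_name):
        return False
    store.create(share, snap_name)
    return True


store = SnapshotStore()
first = ensure_snapshot(store, 'share1', 'share1_replication_1')
second = ensure_snapshot(store, 'share1', 'share1_replication_1')
print(first, second)
```

The second call returns `False` instead of raising, which is the "pick up where we left off" path rather than the current abort.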

Note that if you enable email notifications you will be notified of these events, along with some details.

Hope that helps and thanks again for reporting your findings.

KLSsandman · April 26, 2018, 10:06am

Many thanks for your reply. I did some calculations based on the amount of data to replicate versus my upload speed, and based on that I have decided to do a local replication and then use a much faster connection to replicate that to my original destination.
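
A rough sketch of that kind of estimate, in Python. The figures are made up for illustration (they are not the actual data size or link speeds from this thread), the function name is hypothetical, and the calculation uses decimal units and ignores protocol overhead.

```python
def transfer_hours(data_gb, link_mbps):
    """Estimate hours to send data_gb gigabytes over a link_mbps link.

    Uses decimal units (1 GB = 8000 megabits) and ignores overhead.
    """
    megabits = data_gb * 8000
    seconds = megabits / link_mbps
    return seconds / 3600


# Made-up figures for illustration only.
wan = transfer_hours(2000, 10)    # 2 TB over a 10 Mbit/s upload
lan = transfer_hours(2000, 1000)  # 2 TB over a 1 Gbit/s local link
print(round(wan, 1), round(lan, 1))
```

With these example numbers the WAN transfer comes out at roughly 444 hours against about 4.4 hours locally, which is the sort of gap that makes seeding locally first attractive.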

I have removed the snapshot following your instructions, and my local replication is cracking along nicely.

Many thanks
Simon