Rockstor 3.8-10 Replication "Failed To Promote The Oldest Snapshot To Share"

Hello,

One of the primary features I’ve been interested in testing with Rockstor is asynchronous replication between two appliances. My last attempt at testing was hindered by this bug. Upon hearing that replication was a primary focus of the latest release, I decided to give it another go.

Unfortunately, I’m still unable to perform an initial replication between two newly installed appliances in my VirtualBox lab, this time for different reasons. The following errors were reported in the WebUI status fields on the respective appliances for a 5-minute scheduled job:

Replication Source: unexpected reply(receiver-error) for a8c04a01-b64b9a59-08f1-4fad-bf8d-528ac78a011f-4. extended reply: Failed to promote the oldest Snapshot to Share.. Exception: Error running a command. cmd = ['/sbin/btrfs', 'subvol', 'list', '-o', u'/mnt2/Test_Snapshot_Pool/.snapshots/a8c04a01-b64b9a59-08f1-4fad-bf8d-528ac78a011f_Default_Share']. rc = 1. stdout = ['']. stderr = ["ERROR: can't access '/mnt2/Test_Snapshot_Pool/.snapshots/a8c04a01-b64b9a59-08f1-4fad-bf8d-528ac78a011f_Default_Share'", '']

Replication Target: “Failed To Promote Oldest Snapshot To Share”

From what I can gather in the rockstor.log file, I’m seeing a consistent error:

ERROR [smart_manager.replication.receiver:88] Failed to promote the oldest Snapshot to Share.. Exception: Error running a command. cmd = ['/sbin/btrfs', 'subvol', 'list', '-o', u'/mnt2/Test_Snapshot_Pool/.snapshots/a8c04a01-b64b9a59-08f1-4fad-bf8d-528ac78a011f_Default_Share']. rc = 1. stdout = ['']. stderr = ["ERROR: can't access '/mnt2/Test_Snapshot_Pool/.snapshots/a8c04a01-b64b9a59-08f1-4fad-bf8d-528ac78a011f_Default_Share'", '']

When I run the same /sbin/btrfs command manually on both the replication source and the replication target, I can reproduce the same error message in the console.
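For reference, this is the command I ran as root on each appliance, copied straight from the log entry above (the snapshot path is of course specific to my lab):

/sbin/btrfs subvol list -o /mnt2/Test_Snapshot_Pool/.snapshots/a8c04a01-b64b9a59-08f1-4fad-bf8d-528ac78a011f_Default_Share

It returns the same "ERROR: can't access ..." message, which suggests the snapshot path doesn't exist (or isn't accessible) on either appliance.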

So, I just wanted to highlight this issue on the forums based on my lab testing, since I didn’t find an existing post about it. I can also send the full logs to the support email for review following this post.

Please let me know if any further information is needed to troubleshoot the error.

Thanks for testing @jason1. Are you able to reproduce the behavior with a different share?

During my troubleshooting, I deleted and recreated the shares and replication jobs under a couple of different names, and I was able to reproduce the behavior each time. I’ve only had one share active at a time, though. I also made sure to add some test data to the shares, but that didn’t seem to make any difference.

I’ve left the VMs running since the original post and the jobs are still failing for the same reason, so unfortunately it doesn’t look like it has sorted itself out yet.

I have yet to install Rockstor on physical hardware to see if that makes a difference, primarily because I don’t have much hardware or space at this time.

Let me know if there is anything else I can clarify or test and I’ll try to get to it this week.

Thanks @jason1. I haven’t had a chance to reproduce this behavior yet, but there shouldn’t be any difference between physical and VM for this. I did 80% of my testing on VMs too.

My WILD guess is that there is some regression due to the hundreds of package updates from upstream (CentOS) right after we released. I am seeing some API calls fail; it may or may not be related. We’ll know soon, and I’ll update this thread. I don’t see a reason to create a GitHub issue at this time, but feel free to do so.

Same here. I just installed two VMware machines with a couple of drives to test replication, but I instantly get this error.

Thanks for testing @jason1 and @indi. I’ve fixed a minor bug with the latest Testing update (3.8-10.02), and replication is working again. The reason I hadn’t noticed it on my machines here is that the bug was not an issue for ongoing snapshots. Anyway, it turned out to be a one-line fix that I should have caught in the first place.

This update may take a while because of upstream changes. After the update, I also recommend either a reboot or a systemctl restart rockstor-pre rockstor rockstor-bootstrap to ensure you don’t run into an upstream-related Python issue.
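For example, on each appliance something along these lines after the update should be enough (a plain reboot works just as well):

systemctl restart rockstor-pre rockstor rockstor-bootstrap
systemctl status rockstor

The status check is only there to confirm the services came back up cleanly; it isn’t strictly required.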

Let me know how it goes!

Hi Suman,

So far so good.

I kicked off the update, performed a graceful reboot from the web interface on each appliance, and re-enabled replication. Within seconds, the first job began and completed successfully according to the status field. So far I’ve gone through three replication jobs successfully.

So I can confirm the error messages are gone and replication is now working in my lab after grabbing update 3.8-10.02 and rebooting.

Thanks a lot for looking into this for us!
