@kysdaddy Thanks for the report and update, and glad you got it sorted in the end.
We do still have some known fragility in our replication code that is due for some attention. At least one bug was introduced when we had to adapt (read: mend) the code to our necessary share and pool API changes a while back. At that time I wasn’t able to re-establish the complete prior robustness. I hope to get around to this in time, unless someone beats me to it of course.
The following TODO that I left in the code at the time springs to mind as a known failure point:
with create_snapshot() as:
The bug surfaces as a ‘blocking’ snapshot on the sender following a failed send attempt, which in turn blocks all future sends. The workaround for now is to simply delete this snapshot so that a fresh one can be auto-created upon the next send attempt. It’s for sure inelegant and buggy behaviour, but it may just need a little attention to resolve. During my last stint in this area of the code I didn’t quite get around to establishing exactly why this failure happens.
However, if anyone cares to take a look, they may find this code quite interesting. It uses zeromq for messaging between the sender and the receiver. It was originally crafted by @suman, with a little maintenance on my part thereafter. It would be good to restore all its prior safeguards/robustness, but it is, unfortunately, not a task to be undertaken lightly: the code exercises quite a lot of Rockstor’s ‘systems’, and a full replication cycle takes 5 replication events, so it can be quite time consuming to test whatever changes are made to this mechanism and ensure it is not left worse than one finds it.
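For anyone who ends up doing this workaround from a shell rather than the Web-UI, here is a minimal sketch of the delete step. It is not Rockstor code: the helper name `delete_blocking_snapshot` and the placeholder path are purely illustrative assumptions, and the only real command it relies on is `btrfs subvolume delete`. Deleting via the Web-UI is likely the safer route, as Rockstor also keeps its own record of snapshots, so treat this as a sketch under those assumptions and leave it in dry-run mode until you are sure of the path.

```python
import subprocess
from typing import List


def delete_blocking_snapshot(snapshot_path: str, dry_run: bool = True) -> List[str]:
    """Build, and optionally run, the btrfs command that removes one snapshot.

    snapshot_path is a hypothetical absolute path to the 'blocking'
    snapshot on the sender; find the real one on your own system.
    """
    if not snapshot_path.startswith("/"):
        raise ValueError("expected an absolute snapshot path")
    cmd = ["btrfs", "subvolume", "delete", snapshot_path]
    if not dry_run:
        # Needs root; raises CalledProcessError if btrfs reports a failure.
        subprocess.run(cmd, check=True)
    return cmd


# Dry run: only shows the command that would be executed.
print(" ".join(delete_blocking_snapshot("/path/to/blocking-snapshot")))
```

With the snapshot gone, the next scheduled send attempt should create a fresh one as described above.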
Anyway, I just thought I’d drop this note here for context, and I’ll try to create a GitHub issue with the exact reproducer from my notes on this known fragility when time permits. Plus you can then look out for this failure with the knowledge of how to ‘work around’ it if / when the time comes.
Thanks for helping to support Rockstor’s development via your stable subscriptions by the way, and well done for getting replication set up. It’s a nice feature that deserves more time. However, I have tested it as working (within its current limits) on our future openSUSE base via full cycle tests between CentOS based Rockstor instances and openSUSE based instances. So there’s that at least.
Hope that helps. And remember that it takes 5 replication events to settle into the final stable state. Prior to that there is a very strangely named share, early on, that was at least initially meant to be hidden but does serve as an initial indicator that something happened. Give it 5 events and you will have the final stable state regarding share / snapshot names on both the sender and the receiver, and the various associated share / snapshot arrangements should then make more sense.