Monitoring and recovery features

sprint · January 19, 2015, 9:34pm

Hi, just wondering what are the monitoring and recovery features planned for Rockstor, and is there any sort of ETA?

suman · January 20, 2015, 5:26pm

What kind of monitoring would you like? SMART support will be added soon, and we are also working on an alerts mechanism that shows them in the web-ui. But I suppose you may be talking about a different kind of support. Please elaborate.

Similarly, I’d like to hear ideas about recovery. You may be referring to recovery from failed drives etc… Please elaborate.

As of now, there are no issues in any of our milestones, but I hope to add them to our roadmap based on what users want.

sprint · January 20, 2015, 10:41pm

Well, in addition to SMART, also filesystem read/write/checksum errors and other btrfs related messages in the system log and periodic scrub (I know you can currently schedule the scrub but the results are not monitored) and device stats. Plus general system health indicators e.g. temperature; cpu, memory & network utilisation; process monitoring. These need alerts when there is an issue, plus summary reports to pick up developing issues.
There should be remote alerts, e.g. email and SNMP as not many people are going to be monitoring the web-ui very often.
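To illustrate the device stats part of this request, here is a minimal sketch (not Rockstor code) of how the per-device error counters could be checked. It assumes the text comes from `btrfs device stats <mountpoint>` in its usual `[device].counter value` layout; the function names and the sample text are hypothetical and only for illustration.

```python
import re
from collections import defaultdict

# Matches lines such as "[/dev/sda].write_io_errs   0"
STAT_LINE = re.compile(r"^\[(?P<dev>[^\]]+)\]\.(?P<counter>\w+)\s+(?P<value>\d+)$")


def parse_device_stats(text):
    """Turn `btrfs device stats` style text into {device: {counter: value}}."""
    stats = defaultdict(dict)
    for line in text.splitlines():
        match = STAT_LINE.match(line.strip())
        if match:
            stats[match.group("dev")][match.group("counter")] = int(match.group("value"))
    return dict(stats)


def devices_with_errors(stats):
    """Keep only devices that have at least one non-zero counter."""
    return {
        dev: {name: value for name, value in counters.items() if value > 0}
        for dev, counters in stats.items()
        if any(value > 0 for value in counters.values())
    }


# Hypothetical sample text, standing in for real command output.
SAMPLE = """\
[/dev/sda].write_io_errs   0
[/dev/sda].read_io_errs    0
[/dev/sda].flush_io_errs   0
[/dev/sda].corruption_errs 0
[/dev/sda].generation_errs 0
[/dev/sdb].write_io_errs   0
[/dev/sdb].read_io_errs    3
[/dev/sdb].flush_io_errs   0
[/dev/sdb].corruption_errs 1
[/dev/sdb].generation_errs 0
"""

if __name__ == "__main__":
    print(devices_with_errors(parse_device_stats(SAMPLE)))
```

A scheduled job could run a check like this after each scrub and pass any non-empty result to whatever alert channel (email, SNMP) is configured.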

As for recovery, there should be a guided system for replacing a failed drive. Something that a technically minded person with no specific knowledge of btrfs or Rockstor could do. Also hot-spare features. (This can be done by keeping a full drive's worth of spare capacity in the pool and triggering a balance if a device fails, or by having a spare drive which gets added to the pool on failure, in which case the spare can be shared between multiple pools.)
Note that after an error the pool may not be mountable normally, so there should be provision to mount the pool degraded, recovery or RO as necessary, along with temporarily switching off exports relating to the pool or turning them RO.
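As a rough sketch of the mount modes mentioned above, the helper below builds a `mount` command line from the standard btrfs mount options `degraded`, `ro` and `recovery`. The function name, device and mount point are hypothetical, and on newer kernels the `recovery` option has been superseded by `usebackuproot`, so treat this as an illustration under those assumptions and not as a tested recovery procedure.

```python
def build_mount_command(device, mountpoint, degraded=False, read_only=False, recovery=False):
    """Return an argv list for mounting a btrfs pool in the requested mode.

    The command is only built here, not executed; a caller could pass the
    list to subprocess.run() once the choice has been confirmed by the user.
    """
    options = []
    if read_only:
        options.append("ro")        # do not allow writes to a suspect pool
    if degraded:
        options.append("degraded")  # allow mounting with a missing device
    if recovery:
        options.append("recovery")  # try older tree roots (usebackuproot on newer kernels)

    command = ["mount", "-t", "btrfs"]
    if options:
        command += ["-o", ",".join(options)]
    command += [device, mountpoint]
    return command


if __name__ == "__main__":
    # Hypothetical device and mount point, for illustration only.
    print(" ".join(build_mount_command("/dev/sdb", "/mnt/pool", degraded=True, read_only=True)))
```

Building the command separately from running it would let a web-ui show the user exactly what is about to happen before anything touches the pool.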
In case of filesystem corruption it may be necessary to run btrfsck, but I would hesitate to add that to the web-ui, as it can also make problems worse. Support for btrfs restore to copy files to a new pool would be good.
There is also a need for backup & restore of Rockstor settings etc. Maybe it would be a good idea to store a copy in each pool under the root or a special subvolume, to allow for easy recovery if the system needs to be reinstalled, but the pool devices are OK.

suman · January 21, 2015, 1:28pm

Thanks for the great feedback, sprint. You’ve raised many possibilities here and I think we should approach this piecemeal. Naturally, we’d like to do the more critical things (bearing in mind the current state of BTRFS) before the others.

Lack of recovery (guided disk replacement) support has bothered us from the beginning, given the criticality of data recovery. So, I think that category of features should be driven first. We’ve definitely arrived at a higher level of btrfs stability to test recovery use cases (at least for up to raid5) again and implement the necessary support. I’ll create appropriate issues on github soon. (Feel free to add your own if you like.)

The second category of recovery you mention is the recovery of Rockstor state. There’s already an issue for that: https://github.com/rockstor/rockstor-core/issues/392 We hope to get to it soon. Thanks for suggesting some implementation ideas.

My last thought is about SNMP. We’ve added support for SNMP. Have you tried using it? It may be inadequate in which case let us know how we can improve its usability.

You’ve also mentioned many other things in great detail in your comment. We are sure to refer back to it as we continue development.