Can't access the pool, web GUI just gives an error

Henning · July 3, 2016, 5:59pm

I already emailed support, but I thought I’d ask here too.
I have set up Rockstor in Hyper-V on Windows 10. I created a raid5 pool with four disks (vhdx files), and all was fine.
Then I shut down, deleted one file to simulate a disk failure, created a new vhdx and started Rockstor again.

On the CLI I can see that btrfs correctly reports that one disk is missing.

My questions: can you reproduce this? I haven’t found the time to set up a second virtual machine…

Is this a known bug? I see it as a big bummer that this has to be fixed on the CLI because the GUI can’t handle a missing drive in the pool.

Btw, in the Disks section everything was reported correctly.

Can you lend me a hand to repair the pool? I would like to add the new disk to the pool and balance it. And obviously remove the old disk from the pool’s configuration.

Henning · July 3, 2016, 6:13pm

Just checked this: why didn’t I get an email telling me that the disk went missing?

Henning · July 4, 2016, 5:54pm

Any help on this would be much appreciated!

f_l_a · July 4, 2016, 6:02pm

Dunno. I haven’t configured any notification settings (yet). And I simulated / performed a recent disk replacement via the CLI as well.

Henning · July 12, 2016, 2:32pm

I dropped the case. No support reached me after ONE week…
I rebuilt the pool from scratch from my backup…

f_l_a · July 13, 2016, 4:48pm

Ok, one week was not comfortable for you. Understandable. I have also waited for replies here. But… have you bought a support incident plan? Otherwise, it’s best effort only, sorry. These guys are busy, trust me. And they put in a good part of their time.

This is basically filed as an issue here, so we will have to see; maybe enter it on their GitHub as well, if you happen to have an account there.

And nevertheless: thank you, for testing and reporting!

(My profession is related, so I know about managing (our own) customers’ expectations, quite frequently…)

Henning · July 13, 2016, 9:09pm

I’m a bit confused / there seems to have been a change in the UI. 10 days ago the UI asked me to send an email. Yesterday it asked me to raise a ticket. In both cases I attached the logcat provided by the UI.

(I have a 5-year sub)

The problem got worse two days ago. With no clue as to what had happened (I didn’t change anything), my raid wasn’t readable anymore. So there was no point in restoring the raid5 on the CLI. (Maybe there still was, but I didn’t want to go in too deep with no help here.)

Overall, this isn’t about subscriptions or people having time.

This is about the most crucial function for a NAS system and it didn’t work.

All I did was remove a disk, as can happen at any time in a raid5…

suman · July 14, 2016, 5:36pm

I assume you’ve set up the e-mail notifications. In addition to that, you can configure S.M.A.R.T to send you an e-mail upon errors. However, we don’t have any notifications integrated with BTRFS events just yet. It’s planned.

suman · July 14, 2016, 5:57pm

In the past, users were instructed to e-mail support@rockstor.com, which was monitored by a huge team of me, myself and I. It was manageable with a smaller user base but not anymore. That’s why we split support into two streams. Subscribers are directed to the support portal. Testing channel users are instructed to open a forum post with a specific template. Support portal issues are obviously dealt with at a higher priority, but there’s really no SLA we promise, unless of course you are a business user who purchased a support incident bundle. We are evolving and plan to come up with affordable options for home users and better documentation. This forum has been really helpful, with users who know BTRFS well, perhaps more than us devs.

Ok, back to the real technical issue…

[quote=“Henning, post:7, topic:1701”]
This is about the most crucial function for a NAS system and it didn’t work.

All I did was remove a disk, as can happen at any time in a raid5…
[/quote]

Couldn’t agree more. Rockstor code is partially to blame in that we need to improve the notification mechanism. But I think a lot of us here are patiently waiting for BTRFS raid5/6 to mature. A stronger warning has been added to the BTRFS wiki recently. We have also updated our docs and the UI, thanks to @phillxnet.

I’m aware none of this helps to get your data back, and you were smart to have backups. I wish I had tried to troubleshoot with you earlier, but it’s too late now. If you plan to continue to use RAID5/6, please read through the BTRFS wiki and familiarise yourself with btrfs-progs if you haven’t already. If this is a commercial use case you can also purchase tech support.

phillxnet · July 14, 2016, 5:58pm

@Henning Welcome to the Rockstor community and thanks for helping to support Rockstor development via your stable channel subscription from the Update Channels.

This is the result of a very recent ‘value add’ on stable channel installs. It came about as of the Rockstor version 3.8-14 stable channel release and was initially trialled in testing channel updates from version 3.8-13.15 onwards:

So hopefully that explains the change, i.e. stable channel subscriptions now get directed to a proper ticketing system (osTicket based, as it happens), while testing channel updates are directed to the forum. This is entirely intended to help and not hinder. Of course all are welcome on the public forum, but it is not always appropriate, especially for commercial concerns, to be pasting log entries and the like.

I think in the case of this forum thread it would have been beneficial to include additional information, such as excerpts from the logs, so that others here could offer more informed guidance. I did see your post when it first arrived, but I’m personally outclassed by many of our generous and active forum members on the lowdown of btrfs, especially when it comes to disaster scenarios. So I tend to leave such threads to those more experienced in these areas, as with such a large project one is best advised to pick one’s battles. There are some guides within the official documentation that address some of these situations, such as the Data loss Prevention and Recovery in Rockstor guide.

So at a guess I would say that somehow you got caught between the two reporting systems. Either way, as @f_l_a kindly and eloquently points out, there is no promise of support / response time beyond that of the incident plans, which are in themselves more than many open source projects have anyway.

To the nub of your issue as you succinctly put it:

When referring to drive removal notification.

Exactly, and few would argue that this is not up there with appliance NAS features on the importance scale. However, an important point here is that we are wholly dependent on btrfs’s current (but very rapidly improving) facilities on this one. Although we may well be able to poll logs and the like, these solutions are generally unattractive to those who want to do things right, or wait until they can be done right, and so on occasion such features are unavailable for far longer than many would like. I would hazard a guess this includes the majority of the main contributors to Rockstor, and to btrfs for that matter. In this vein, the indication of ‘detached disk’ that you did see in the Disks menu is a result of polling, and in turn this is another technical debt that needs to be addressed. But in this case it did correctly represent the disk status.

There are currently ongoing discussions on the linux-btrfs mailing list on how best to address / surface user level notification of such things as degraded volume status. As soon as such things are resolved there, and with the help of our contributors and stable channel subscribers such as yourself, Rockstor will be looking to make best use of these facilities. And while in this area, please see the following forum thread for some helpful links along these lines:

where others have also discussed this shortcoming.
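As an illustration of the polling approach discussed above, here is a minimal sketch in Python. It is not how Rockstor detects a detached disk. It assumes the text passed in is shaped like the output of `btrfs filesystem show` (a "Total devices N" line, one "devid" line per attached device, and a "Some devices missing" marker when a member is absent). The function name `pool_device_status` and the sample text are hypothetical.

```python
import re
from typing import Dict, Optional, Union


def pool_device_status(show_output: str) -> Dict[str, Union[int, bool, None]]:
    """Summarise device presence from text shaped like `btrfs filesystem show`.

    Assumption: the text has a "Total devices N" line, one "devid" line per
    attached device, and a "Some devices missing" marker for absent members.
    """
    total_match = re.search(r"Total devices\s+(\d+)", show_output)
    expected: Optional[int] = int(total_match.group(1)) if total_match else None
    # Count the device lines that are actually listed.
    present = len(re.findall(r"^\s*devid\s+\d+\s", show_output, flags=re.MULTILINE))
    flagged = "Some devices missing" in show_output
    # Treat the pool as degraded if the marker is present or the counts disagree.
    degraded = flagged or (expected is not None and present < expected)
    return {"expected": expected, "present": present, "degraded": degraded}


# Hypothetical sample text, written by hand for this sketch.
SAMPLE = """Label: 'pool'  uuid: 00000000-0000-0000-0000-000000000000
\tTotal devices 4 FS bytes used 1.00GiB
\tdevid    1 size 10.00GiB used 2.00GiB path /dev/sdb
\tdevid    2 size 10.00GiB used 2.00GiB path /dev/sdc
\tdevid    3 size 10.00GiB used 2.00GiB path /dev/sdd
\t*** Some devices missing
"""

if __name__ == "__main__":
    print(pool_device_status(SAMPLE))
```

A periodic job could call such a function and send an e-mail when `degraded` is true, with the usual drawbacks of polling noted above.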

Another issue opened recently on this which I have just updated on considering your posts is:

Of note also is that Rockstor’s position has always been that the parity profiles within btrfs (raid5 and raid6) were not recommended. Although this was revised on the btrfs wiki some time ago, Rockstor’s documentation and the advice given by the leading developers here in the forum remained sceptical of these levels, and the Rockstor docs section on Redundancy profiles maintained its original advice against using them. However, with another very recent change to the official status of these raid levels, we updated the prior warnings in our docs and also added a warning to the UI against using raid5 and raid6:

Previously the pool creation raid level selector advised referring to the above docs when considering the raid level.
These changes were a little delayed (by a few days), see:

But they were nevertheless timely, on the playing-it-safe side.

I hope that this helps in understanding how all this comes together (or occasionally doesn’t) and fulfils the personal obligation I feel to open development practice. But all of this takes a long time and all our failures, as well as our successes, are out in the open; my belief is however that such a development model is the only way to go and ultimately leads to a better product.

OK looks like @Suman has also replied here just now.

suman · July 14, 2016, 6:00pm

Thanks for your kind words @f_l_a. It’s tough, but we are steadily improving thanks to our upstanding community!

Henning · July 15, 2016, 9:28am

OK to all of it, no harm done. I wouldn’t have tested this if I hadn’t had a backup.

To the bug itself: I did look at the pool with btrfs-progs on the CLI and there was no big problem with the raid. Progs reported that a drive was missing. The raid was still readable, thanks to raid5 - all fine.

2 points here:
1: I wasn’t sure how the GUI had created the raid, so that kept me from messing with the configuration on the CLI.
2: While progs itself looked good in this case (so development of btrfs itself is not to blame), the output of progs seems to have caused the UI to crash.

All I wanted to do was add a new disk to the pool / raid and do a balance, but the gui kept me from doing this.
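For readers wondering what that CLI route might look like, the sketch below only builds the command lines and runs nothing. It is one common sequence under stated assumptions, not a procedure confirmed anywhere in this thread: the device paths and mount point are hypothetical, and given the raid5/6 warnings mentioned earlier it should not be treated as a reliable recovery method for parity profiles. The helper name `degraded_repair_commands` is also made up for illustration.

```python
from typing import List


def degraded_repair_commands(
    surviving_dev: str, new_dev: str, mount_point: str
) -> List[List[str]]:
    """Build (but do not run) commands to swap a missing pool member.

    All three arguments are hypothetical placeholders supplied by the caller.
    """
    return [
        # Mount the pool with a member absent, via any surviving device.
        ["mount", "-o", "degraded", surviving_dev, mount_point],
        # Add the replacement disk to the mounted pool.
        ["btrfs", "device", "add", new_dev, mount_point],
        # Drop the record of the absent disk; btrfs relocates its data.
        ["btrfs", "device", "delete", "missing", mount_point],
        # Optionally spread data evenly across all members afterwards.
        ["btrfs", "balance", "start", mount_point],
    ]


if __name__ == "__main__":
    # Example values only; substitute the real devices and mount point.
    for cmd in degraded_repair_commands("/dev/sdb", "/dev/sdf", "/mnt2/pool"):
        print(" ".join(cmd))
```

Printing the commands first, rather than executing them, leaves room to check each device name against `btrfs filesystem show` before anything is changed.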

The second bug was that I could not use the three drives after deleting the pool. I tried to use the recovery function to restore the 3 old disks (and in the next step to add the new one) but the UI crashed again. That’s why I opened 2 tickets.

.

I waited one week to provide such info if needed, but no one asked.

.

For the third bug, btrfs may be to blame. After one week of running with a missing drive, the raid died.

For future incidents I installed the check_mk-agent and configured SNMP, and of course email notification was set up right at the beginning.

Check_mk gives you a lot of additional information. I could get more, but I haven’t installed the btrfs plugins (yet).