Rockstor 4.5.6-0 All Pools and Shares Unmounted after Reboot

Rockstor is starting up with every Pool and Share unmounted. Perusing the logs reveals:

  • Error message: “Failed to start Rockstor bootstrapping tasks”;

  • The additional message from “systemctl status rockstor-bootstrap.service” is “OauthApp matching query does not exist”.

The sequence of events that led up to this error was:

  1. (background) An SSD was failing and I decided to take it offline. It was the sole drive in a pool named “SSD1”. Pool SSD1 had only a single share named “Music”.
  2. I deleted NFS and Samba exports for share “Music”.
  3. I deleted share “Music”.
  4. I deleted pool “SSD1”. The drive was not physically removed from the system.
  5. I then created a new share named “Music” on a different, pre-existing pool named “Data_SSD”.
  6. Then I recreated NFS and Samba exports for the new share “Music”.
  7. I tried mounting the new export via NFS on my Linux workstation. The “sudo mount” command hung and never returned. I killed the process and tried again, only to have to kill it again. (Clearly, the Rockstor server was already borked at this stage.)
  8. Rebooted the Rockstor server. It threw the above error message, and from the web UI all pools and shares were unmounted. (I tried to download a log via the web UI, but that hung the browser, so I restarted the browser and didn’t attempt further log examination via the UI.)
  9. Tried rebooting the server again, with the same error. Then used “systemctl” and “journalctl” to obtain the error messages.
  10. I tried manually running “systemctl start rockstor-bootstrap.service” but the same error occurred.

Any ideas? Could the reuse of the name “Music” for a share have exposed a bug, perhaps a failure to clean up properly after deletion of a share or pool?

@Walt Hello there, and nice report.
Re:

Quite possibly. A delete of the failed drive’s pool may first have been required before the same share name was reused on another pool; see below for a potential bug/clash.

I think that error can result from a normal start-up prior to setup, i.e.:

[10/Feb/2023 15:20:29] ERROR [smart_manager.data_collector:994] Failed to update disk state.. exception: OauthApp matching query does not exist.
[10/Feb/2023 15:20:29] ERROR [smart_manager.data_collector:994] Failed to update pool state.. exception: OauthApp matching query does not exist.
[10/Feb/2023 15:20:29] ERROR [smart_manager.data_collector:994] Failed to update share state.. exception: OauthApp matching query does not exist.
[10/Feb/2023 15:20:29] ERROR [smart_manager.data_collector:994] Failed to update snapshot state.. exception: OauthApp matching query does not exist.

So it may not be a pointer in this case.

There may be more info/hints in the:

/opt/rockstor/var/log/rockstor.log

We do have a ‘thing’ where we can’t mount a btrfs volume (Pool) if it has share names repeated from another volume (Pool). It may be that the existing pool in the db, now with no members, is blocking the mount of the remaining pool. But I’m not certain of this in your case.
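As a rough illustration of that potential clash (this assumes Rockstor’s usual layout of mounting each share at /mnt2/<share_name>, and is a sketch rather than a confirmed diagnosis of your case):

# Two pools each carrying a "Music" subvolume would compete for the same mount point:
#   Pool "SSD1"     -> subvol "Music" -> /mnt2/Music
#   Pool "Data_SSD" -> subvol "Music" -> /mnt2/Music   (clash)
#
# To see what is actually on-disk, mount a pool's top level somewhere and list its
# subvolumes (the device name below is an example only):
mount /dev/sdb /mnt
btrfs subvolume list /mnt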

Check that log to see if there is more info on the lead-up and whether any new entries are being created.

There may also be some leads in the system log:

journalctl
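For example (standard journalctl options; the unit name is as reported by your systemctl output):

# Messages from the failing unit, current boot only:
journalctl -u rockstor-bootstrap.service -b --no-pager
# Or everything at error priority and above from the current boot:
journalctl -p err -b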

If this is a legacy db entry breaking the remaining pool mount, the quickest way around may be to do a hard reset of the entire rockstor install:

WARNING: the following will lose all Rockstor settings and require a fresh setup via the Web-UI, as per a new install at this stage: https://rockstor.com/docs/installation/installer-howto.html#rockstor-setup-and-eula

systemctl stop rockstor*
rm /opt/rockstor/.initrock
systemctl start rockstor-bootstrap

However you may then be able to import your existing pool/s, assuming they are all healthy, via the instructions here:
https://rockstor.com/docs/interface/storage/disks.html#import-btrfs-pool
and potentially restore a config backup: https://rockstor.com/docs/interface/system/config_backup.html
Or just re-establish your config.

Do let us know if you find any more leads to this failure. We normally guard against the re-use of an existing share name (btrfs subvolume), but as you say this particular sequence may have highlighted a bug.

Hope that helps.


Hi Phil, Thanks for the detailed reply.

I’ve examined the logs you mentioned. There were something like 4,000 instances of “Failed to update disk/pool/share/snapshot state… exception: OauthApp matching query does not exist.”; apparently it retries every few minutes. But there were no additional clues, error messages or lead-up info.

The way the exceptions read, i.e. “Failed to update disk/pool/share/snapshot state” (and their sheer number), suggests to me that all queries to the db are breaking, not just a single one. It suggests that the entire db is hosed.

Your tip for how to do a hard reset of the system was useful; it meant I didn’t have to install from scratch. It has one slight problem, though: it did not remove the administrative users but demoted them to non-administrative. That meant I could not use the proper admin user name and had to create a different one. And there’s no way to fix this through the web UI. I’ll have to do some low level hacks to clean that up.

Okay, so I seem to be up and running again (barring further issues). (I still need to delete a few more shares and pools but now I’m nervous about that breaking something again.)

Phil, I know you aren’t the person to whine to about this (you’ve been nothing but great help), but I have to mention that I’m concerned over just how easy Rockstor is to break. I’ve encountered this exact problem twice this past month and a third time about 2 years ago. As an Engineering Manager, I’d label this bug as Critical (severe enough to prevent the program from functioning at all) and instruct the developers to drop everything else and fix it NOW. And yes, I know FOSS development is a different sort of culture, but still, that doesn’t make it okay to leave severe bugs unfixed. A Critical bug can cause irreparable harm to the reputation of a software product. When central IT storage goes down it hobbles the entire business, so the first and highest requirement of any storage appliance is bulletproof reliability and stability. Okay. End of rant. Your assistance is appreciated.

BTW, what is this mysterious database that apparently is getting broken? Would it be a good idea to make a quick and dirty backup copy of it before starting future configuration changes?

@Walt Hello again.
Re:

Good Point, we could add a:

userdel admin-user

to those instructions. Then one could re-use the same username. That user is no more an admin on the underlying OS than any other. It’s simply the user that has admin rights on the database, which is what backs the Web-UI. We use Django as our underlying framework to manage our user interface, so the admin user is marked as admin on that system, but is also a regular user on the Linux OS. Hence the issue you experienced via my incomplete instructions on that front.

Yes, that is a work-around. We should have a doc section for this hard-reset for those that run into the need for it. I use this method all the time to reset systems myself during development. And there I use the userdel command to enable re-use of the same username.

Or rinse and repeat, but this time delete the underlying OS user that is also created. My apologies for missing that step; without knowledge of the OS user element the error message would be quite cryptic. I’ve created the following issue to improve our messaging on this front, along with a note to create a doc issue to mirror this advanced reset method. It would of course not be an issue on a re-install: unless folks want to re-use a system default user, of which there are many.

Great, but experimentation is good prior to production deployment. And ultimately you can always reset if you are aware of how to set things up again. Also note that one can do that userdel command and then just re-enter the same username on the initial setup page, as that is the only blocker (the underlying OS user’s existence).
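Putting that together, a sketch of the fuller reset, assuming the original Web-UI admin user was named “admin-user” (substitute your own username):

systemctl stop rockstor*            # stop all Rockstor services
rm /opt/rockstor/.initrock          # force re-initialisation on next service start
userdel admin-user                  # remove the leftover OS user so the name can be reused
systemctl start rockstor-bootstrap  # triggers the initial Web-UI setup again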

Also note that you still always have access to the underlying OS. We should always be able to import a pool if its basic structure has not been altered. The underlying btrfs is the source of truth in this matter, and we are informed of this truth simply by running and parsing btrfs command lines. We are to move to json output from those commands, and eventually adopt an in-development native Python interface, but all in good time, and bit by bit as our resources allow.
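For instance, the same facts we parse are available directly from the stock btrfs tooling (shown here only as an illustration; the pool path is a placeholder):

btrfs filesystem show                   # every btrfs volume (Pool) the OS can see
btrfs subvolume list /mnt2/pool-name    # shares (subvolumes) within a mounted Pool (placeholder path)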

In which case you can, with your engineering background, presumably reproduce it. And if so we are at the GitHub issue stage. If we have a reproducer we are well on the way to an actual fix. We have had around 928 of those to date in the project’s public history. And it is the entire purpose of the testing channel: and exactly how we get to our stable releases. We currently have the following GitHub Milestone:

which denotes the outstanding issues we would like to sort before declaring the next stable release, prior to then embarking on the next testing channel changes.

Reproducers are King and Queen in proving any fix, and the first place to start in understanding where the problem lies. All subvolumes not mounting could have a myriad of causes; each one needs a reproducer. We have fixed quite a few of these, but obviously not all of them. For example, if a disk dies and one reboots, the pool will not mount without a ‘-o degraded’ mount option. We advise this in the Web-UI. We could just always carry that option, but that would be counter to our upstream developers’ (the btrfs developers in that case) advice. We have to tread a narrow and unclear path as we go here. Easy to use means we have to cover up some complexity - but it is not gone. The btrfs file-system is a non-trivial engineering endeavour, and facilitating end-user ease (our task) is similarly fraught with challenges. But I think the endeavour is worthwhile.
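As a concrete sketch of that degraded-mount point (the device and mount path below are placeholders only):

# With a missing or dead member the pool will normally refuse to mount:
mount /dev/sdX /mnt2/pool-name              # fails, complaining of missing devices
# The btrfs escape hatch, intended for recovery/repair only, is the degraded option:
mount -o degraded /dev/sdX /mnt2/pool-name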

So in short, if you can exactly reproduce your observed failure then someone can fix it. Assuming the fix is within the computationally accessible universe :slight_smile: . That person or group of people may or may not be within the Rockstor team, and you may even have flaky hdd/ssd firmware (read about btrfs exposing this to some Facebook engineers for example; sorry, no link for that). Noteworthy in the story is that btrfs was the first to be blamed! But the previously unknown hdd bug was plain as day once proven. Check-summing can be good that way, and it’s a shame it hasn’t been integral in our file-systems of the past. But we do have way more computational resources available these days.

What we ideally want in our GitHub issues are reproducers: I install this, do that, the bug shows up every time. Anyone on the planet with the same setup, assuming no hardware dependency, can then reproduce that exact failure and trace it down through our code and possibly out the other end (though it’s often us of course). That is an ideal outcome of chats here on the forum. You will find many over the years that have led to such outcomes, and they represent a significant fraction of those 928 ‘fixes’ mentioned earlier :). We then have a way to prove the fix and will in many cases, especially where critical code is concerned, produce tests that also reproduce a failure to avoid regressions. We currently have around 245 tests on the go. Take a look at the last time we added more tests concerning upstream changes to the btrfs scrub status command output:

Then first define the bug. Exactly: with a reproducer to prove the proposed fix is actually a fix. At least within the realms of what is known about that fix.

Software is hard (and very complex/deep), and we have an ongoing task to update the deep stack upon which we stand: Python, Django, JS, etc. All of those projects have had years of fixes that we can also be affected by or benefit from, and keeping up takes developer time. Our move from CentOS to the “Built on openSUSE” variant took time away from our modernisation drive, but it had to take priority. It also meant that folks who used Rockstor benefitted from a way newer btrfs stack. That is of course the primary concern for a storage platform - file integrity. Ergo our move there rather than doing other things. I stand by that decision, but it has cost us in other areas. Oh, and CentOS as-was disappeared, so we side-stepped that as well.

Agreed on all points. I was even a little hesitant to become engaged in the Rockstor project when I first became aware of it, due to the central and critical part it can play in many of its chosen deployments. But I found no other project for what I wanted to see in the world.

That was the major reason for us moving away from CentOS actually. So we have that at least under our belt :). SuSE/openSUSE actively finance btrfs; CentOS only ever employed one person to work on btrfs! Yes, FOSS is different; it has come up with the likes of gcc, the Linux kernel, and btrfs. We have to take the rough with the smooth here, and provide exact reproducers.

As mentioned earlier we make extensive use of Django, and it in turn uses PostgreSQL, both of which we have significantly updated (by a few years’ worth of work) in our latest testing channel. We are not yet at the next stable release, but we are getting there.
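On your earlier question about a quick-and-dirty backup of that database before config changes: since it is plain PostgreSQL underneath, a pg_dump is one option. A minimal sketch, assuming the main database is named storageadmin (that naming is my assumption here; check the real names with psql -l first):

# List the databases present on this install:
su - postgres -c "psql -l"
# Dump the assumed main database to a dated file:
su - postgres -c "pg_dump storageadmin" > /root/storageadmin-$(date +%F).sql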

Entirely agree again with your comment on stability by the way, hence our use of btrfs, Django, PostgreSQL, and Nginx, and our recent drive to update all of these, be it by proxy (OS updates/transition) or by our own dependency management (see our recent move to Poetry on that front). They each in turn host some of the largest storage/website endeavours in existence. And their licenses and FOSS nature mean we can use them in our little old project on the internet. But we do still use older versions of some critical things, and this will be the focus of our next testing channel as it goes.

On that front we may just want to do what openSUSE does already and snapshot (almost) the entire OS before and after each software update: see System Rollback by Booting from Snapshots
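Purely to illustrate the idea, openSUSE’s existing tool for this is snapper (the snapshot number below is hypothetical):

snapper list          # show the existing root filesystem snapshots
snapper rollback 42   # roll the root filesystem back to snapshot 42; takes effect on reboot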

But there are technical difficulties in the interplay of, for example, the /var subvol that are beyond this discussion. And the db can change from one second to another depending on Web-UI requirements. What we need here is an exact reproducer of your observed issue to know where to put the effort required. Our database manager is likely behaving just fine. We also have complexities regarding migrations that can cause problems. Again, all a little beyond this discussion and available time, speaking of which I must sign off.

Alternatives are always available; however I was not content with them myself, and adopted Rockstor as my preference as I liked the license and the ability to affect the outcome. That may not be important in your situation, however. Maybe Oracle has an offering appropriate to your requirements.

I fully share your frustrations by the way, and our early days were way rockier :slight_smile: . Rockstor’s public GitHub has been around for quite a while now, but my focus has been on stabilising and modernising all that my time/talent/knowledge allows. And our testing channel has helped hugely with this effort. And again, if an issue can be identified (exactly, with a reproducer to prove the proposed fix) it can be approached, if one can find the time/talent/knowledge. And on this front there are moves I am making to assist folks with sponsoring the project in directions they would rather see: but more on that front when the pudding is ready.

Hope that helps, and thanks again for the engagement. We do have to take care here as it is very easy to discourage folks, and community focused Open Source very much depends on ‘folks’. We are a non-trivial project that actually needs more of them: again this is in the works, re the prior guarded “… moves I am making …” pudding speak.


Phil,
Thanks for the detailed reply and the receptiveness to the issues I’ve raised. I’ve made a note about applying userdel admin-user as part of the system hard-reset procedure. I have read and do understand the challenges you’ve outlined and the forthright measures taken to move the project forward. I’m with you, man.

LOL! We’d all love it if every bug was reliably repeatable and all bug reports came with detailed instructions for how to reproduce it, but sadly, far from all bugs are that cooperative. A colleague of mine used to manage the development of ATM machines at NCR. One time they got a report from a client bank in Europe that one of their machines went crazy and spit out $400,000 in cash. A few months later another bank reported a similar problem. The bug was extremely unlikely to occur, but when it did strike, the losses were astronomical, so it had to be found and fixed. NCR filled a warehouse with machines and created an automated test environment to continually hammer those machines with transactions 24/7. The ATMs required specialized test equipment to log every detail that went on inside, and NCR had to rent every unit available anywhere in the world for this test. It took months, but they eventually captured an instance of the bug in action along with the exact sequence of events that led up to it.

It always used to irk me when I’d get a bug report from the field like, “It doesn’t work!” No mention of what doesn’t work, what the failure symptoms were, and so on. So as a card-carrying member of the Testing Channel here I do my duty to report bugs and issues, along with as much detail as I can provide that might help in reproducing the bug, plus whatever other info might be of help.

I don’t have a “sheepdip” system here which I can set up exclusively as a test platform for Rockstor, to try to deliberately get it to crash, multiple times, until I get a handle on what’s causing it. That’s what I’d have to do in order to provide an exact reproducer. Attempting to deliberately crash the central server on which my office depends (and to risk losing my data) is a bridge too far. I’ll try to be as helpful as I can, but my first priority is to keep the system operational and keep the data safe.

All the best on your “mystery pudding” project.


@Walt Hello again.
Re:

I just wanted to follow up on:

The following doc pull request should help folks in the future re the creation of an unprivileged user during the Web-UI setup (now in the installer howto) and, in our developer docs, re the hard-reset procedure:

If you look to the files-changed tab you can see the changes associated with that pull request.

And the rockstor-core issue is still outstanding on this front. But at least we are clearer, explanation wise, in the docs now: user (installer howto) and dev (contributing to rockstor) side, i.e. those using the testing channel such as yourself.

Thanks again for the thoughtful feedback.

Oh, and nice story there re the NCR bug tracking adventure. Shame we don’t have such resources available to our little project. But at least we sit on the shoulders of tech giants :slight_smile: . Assuming we stick closely to our upstream, that is.

Cheers. Announcements soon hopefully. Ducks in a row and all that.
