@Stefan Hello again. Yes, I'm pretty sure this is a known bug, and one that has been quite difficult to reproduce, although a few people have system configurations that seem to trigger it regularly. It appears to depend on how a system renames its devices from one boot to another, and is essentially harmless but rather irritating and of course highly undesirable. There is a pending pull request that I'm hoping should address the root cause, but I can't confirm this as I no longer have a system that reproduces this behaviour (and even then it was only occasional), and nor do the other developers. We have an open GitHub issue:
on this bug, which in turn links to the associated, in-progress pull request / code changes.
I have linked to this forum thread from within that issue, so this thread should hopefully receive an update once the suspected fix is merged. Depending on which update channel your problematic machine is on, you should then receive what is hopefully a fix via the regular updates, sooner or later. In fact, it would be really helpful if you could confirm any change in behaviour once this next code change is in place and released, especially since it seems to be a fairly predictable event on your particular setup. So if possible, don't change anything, as your system may be one of the only proofs of a fix we currently have; I am uncertain whether prior reporters are still in contact. But otherwise don't worry, of course: hopefully other reporters are still in contact and can confirm the fix, or otherwise, when the time comes.
The main factor seems to be the drives' arrangement and how they are renamed, as the bug is triggered by a very particular re-arrangement.
I like the workaround, by the way.
There is another, much older forum thread, Pool not mounted on reboot, but I believe that dates from when there were other bugs (since addressed) that could cause similar behaviour, and I don't think all reporters there were experiencing the same root cause.
Thanks for the report, and hopefully this one should be fixed soon.
I thought I would chime in as I actually encountered the same problem last night. Everything had been properly recognized, but after I rebooted my RS instance (3.8.14), I found all my disks and shares listed on the Web-UI, yet empty when browsing them over SSH. I rebooted again and then everything was properly mounted and back to normal.
My setup is relatively simple (this is the currently working one):
[root@rockstor ~]# btrfs fi show
Label: 'rockstor_rockstor'  uuid: f23998c3-e834-48e4-a5f4-6a12980e5a15
    Total devices 1 FS bytes used 1.54GiB
    devid    1 size 25.71GiB used 4.04GiB path /dev/sda3

Label: 'main_pool'  uuid: 4aaa19f1-0875-4572-b179-3a522f75a216
    Total devices 2 FS bytes used 868.59GiB
    devid    1 size 2.73TiB used 871.03GiB path /dev/sdb
    devid    2 size 2.73TiB used 871.03GiB path /dev/sdc
The drive carrying sda3 is my OS drive (a USB flash drive), and both HDDs (sdb and sdc) are in RAID1. Before the reboot that caused the problem, the OS drive was sdb3, and the two HDDs were sda and sdc.
I'm not sure what triggered this name change, but I'll be happy to provide any logs that could help pinpoint the cause of the bug.
I don't think there could be a link, but I applied updates a couple of days before the reboot that caused the issue. Here's the list of updates in question, in case it can be helpful (downloaded from the log manager – very convenient!):
Jun 30 22:59:06 Updated: systemd-libs-219-19.el7_2.11.x86_64
Jun 30 23:00:47 Updated: systemd-219-19.el7_2.11.x86_64
Jun 30 23:01:03 Updated: samba-libs-4.2.10-6.2.el7_2.x86_64
Jun 30 23:01:23 Updated: samba-common-tools-4.2.10-6.2.el7_2.x86_64
Jun 30 23:01:40 Updated: samba-common-4.2.10-6.2.el7_2.noarch
Jun 30 23:01:47 Updated: libwbclient-4.2.10-6.2.el7_2.x86_64
Jun 30 23:02:37 Updated: samba-client-libs-4.2.10-6.2.el7_2.x86_64
Jun 30 23:02:58 Updated: samba-common-libs-4.2.10-6.2.el7_2.x86_64
Jun 30 23:03:34 Updated: libgudev1-219-19.el7_2.11.x86_64
Jun 30 23:04:01 Updated: 1:NetworkManager-libnm-1.0.6-30.el7_2.x86_64
Jun 30 23:04:19 Updated: systemd-sysv-219-19.el7_2.11.x86_64
Jun 30 23:04:36 Updated: libxml2-2.9.1-6.el7_2.3.x86_64
Jun 30 23:05:03 Updated: iproute-3.10.0-54.el7_2.1.x86_64
Jun 30 23:05:10 Updated: rpcbind-0.2.0-33.el7_2.1.x86_64
Jun 30 23:05:33 Updated: samba-winbind-modules-4.2.10-6.2.el7_2.x86_64
Jun 30 23:05:40 Updated: samba-winbind-4.2.10-6.2.el7_2.x86_64
Jun 30 23:05:50 Updated: libsmbclient-4.2.10-6.2.el7_2.x86_64
Jun 30 23:06:01 Updated: 7:device-mapper-libs-1.02.107-5.el7_2.5.x86_64
Jun 30 23:06:28 Updated: 7:device-mapper-1.02.107-5.el7_2.5.x86_64
Jun 30 23:06:48 Updated: kpartx-0.4.9-85.el7_2.5.x86_64
Jun 30 23:07:56 Updated: dracut-033-360.el7_2.1.x86_64
Jun 30 23:08:37 Updated: polkit-0.112-7.el7_2.x86_64
Jun 30 23:09:38 Updated: 1:NetworkManager-1.0.6-30.el7_2.x86_64
Jun 30 23:09:40 Updated: 1:nginx-filesystem-1.6.3-9.el7.noarch
Jun 30 23:09:48 Updated: selinux-policy-3.13.1-60.el7_2.7.noarch
Jun 30 23:12:11 Updated: selinux-policy-targeted-3.13.1-60.el7_2.7.noarch
Jun 30 23:12:34 Updated: 1:nginx-1.6.3-9.el7.x86_64
Jun 30 23:13:05 Updated: 1:NetworkManager-team-1.0.6-30.el7_2.x86_64
Jun 30 23:13:06 Updated: 1:NetworkManager-tui-1.0.6-30.el7_2.x86_64
Jun 30 23:13:33 Updated: dracut-network-033-360.el7_2.1.x86_64
Jun 30 23:13:34 Updated: dracut-config-rescue-033-360.el7_2.1.x86_64
Jun 30 23:13:58 Updated: 1:nfs-utils-1.3.0-0.21.el7_2.1.x86_64
Jun 30 23:14:08 Updated: samba-client-4.2.10-6.2.el7_2.x86_64
Jun 30 23:14:11 Updated: samba-winbind-krb5-locator-4.2.10-6.2.el7_2.x86_64
Jun 30 23:14:30 Updated: libxml2-python-2.9.1-6.el7_2.3.x86_64
Jun 30 23:14:35 Updated: 1:NetworkManager-glib-1.0.6-30.el7_2.x86_64
Jun 30 23:16:33 Updated: samba-4.2.10-6.2.el7_2.x86_64
Jun 30 23:17:02 Updated: tzdata-2016e-1.el7.noarch
Jun 30 23:17:10 Updated: epel-release-7-7.noarch
Hope this can help…
EDIT: I just remembered something related to my disks that may help narrow down the issue.
My OS drive is a SanDisk Cruzer Fit flashdrive, and I do see a recurring error message in the logs, similar to what was discussed here:
Would adding the '-d scsi' option as listed in this post limit the chances of this mounting error occurring?
Here’s a screenshot of my Disks page:
@Flox Thanks, and yes, that looks exactly like what I have seen and documented in the dev-name-change-breaking-mounts issue referenced earlier, particularly towards the end of that issue, as early on there were still other bugs that often looked the same.

It seems to depend upon the system drive getting the same name as a prior data drive; this then leads to an attempted mount of the system drive as if it were a data drive. It is also very time sensitive, as there is typically only a very small window (< 2 seconds on the system I had) during boot before the drive database self-corrects. So in your case sda was the system drive, but upon reboot it took the place of a data drive, sdb in this case. The system drive is mounted anyway, and early on, so it is not itself affected, but it does then take the slot / name of a data drive. When the time comes to mount the data drive, the db has an old reference and attempts to mount by it; hence the logs then indicate no btrfs on that disk, as it is now the system disk and so not a whole-disk btrfs like the data disks are. There should also be indications during that problematic boot of attempts to mount, in your case, sda3; but of course on that boot it no longer exists as a partition, since sda, as you most diligently note, is now a data drive and so is a whole-disk btrfs without a partition. That error is akin to '/dev/sda3 does not exist'; however, as it is the system drive it muddles through, but the data drives fare less well.
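To make the failure mode concrete, here is a small hypothetical sketch (not Rockstor code; the device names simply follow the example above) of how a database keyed on kernel names goes stale across a reboot:

```python
# Hypothetical illustration of the stale-name problem; not actual Rockstor code.
# The db records kernel names (sda, sdb, ...) from the previous boot, but the
# kernel is free to hand out those names differently on the next boot.

# Recorded before reboot: sdb and sdc held the whole-disk btrfs data pool,
# and the system drive was sda (mounted from partition sda3).
db = {"main_pool": ["sdb", "sdc"]}

# After reboot the names reshuffle: the system drive is now sdb, and the
# actual whole-disk btrfs data drives this boot are sda and sdc.
whole_disk_btrfs = {"sda", "sdc"}

def try_mount(pool):
    """Attempt to mount a pool using the (possibly stale) recorded names,
    returning the devices where no whole-disk btrfs is found."""
    failures = []
    for dev in db[pool]:
        if dev not in whole_disk_btrfs:
            # sdb is now the partitioned system drive, so the mount
            # attempt finds no whole-disk btrfs there and fails.
            failures.append(dev)
    return failures

print(try_mount("main_pool"))  # ['sdb'] – the stale reference
```

The fix being worked on replaces those transient kernel names with identifiers that survive a reboot.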
This is actually a long-standing and rather deep problem that has required a little re-working at the lowest levels of Rockstor's disk subsystem, so no, I don't think it is related to the updates. It is also hampered by being very hardware-config dependent, and none of the main Rockstor developers has a system that can reliably reproduce it. I did have one for a while, and after very many power cycles I think I understand the issue. In fact that machine was turned on and off so many times in order to provoke the issue that it ended up killing its PSU, which is most stressed at initial power-on.

Anyway, the pending pull request / code change is now awaiting merge and has been tested by both me and @suman, but due to its deep nature we are proceeding very cautiously; as a side effect it has also affected some UI elements that have taken time to resolve, with the smaller ones still pending. I am fairly confident that the move to by-id (i.e. boot-safe) naming in the db should at the very least help further diagnose this problem, and am actually fairly confident that it will address the root cause. However, proving this fix is the tricky part: with no test system in the hands of the developers that reliably displays the issue, we are a little stuck, hence my not yet linking this issue directly with the pending 'revise internal use and format of device names' code changes as a fix, as I can't yet prove it.

So if you and @Stefan could keep an eye on the updates, take note of the move to internal (and possibly also external, on the UI) by-id naming, and then test your previously problematic systems, that would be invaluable, as we are very keen to put this one to bed. And due to some changes @suman put in quite a while ago, the system I had that only occasionally displayed this behaviour now does so almost never, so we are definitely whittling this one down.
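For illustration only, here is a minimal hypothetical sketch of why by-id naming is boot-safe (the by-id strings below are invented; the real changes live in the pending pull request):

```python
# Hypothetical sketch of by-id (boot-safe) device naming; not Rockstor code.
# Serial-derived /dev/disk/by-id names stay fixed across boots, so the db
# no longer cares how the kernel reshuffles the sdX names.

# What by-id symlinks might resolve to on two consecutive boots
# (invented identifiers for illustration):
boot1 = {"ata-WDC_WD30EFRX_AAA": "sdb", "ata-WDC_WD30EFRX_BBB": "sdc"}
boot2 = {"ata-WDC_WD30EFRX_AAA": "sda", "ata-WDC_WD30EFRX_BBB": "sdc"}

# The db now records the stable identifiers instead of kernel names.
db = {"main_pool": ["ata-WDC_WD30EFRX_AAA", "ata-WDC_WD30EFRX_BBB"]}

def devices_for(pool, by_id_map):
    """Resolve stable ids to this boot's kernel names just before use."""
    return [by_id_map[dev_id] for dev_id in db[pool]]

print(devices_for("main_pool", boot1))  # ['sdb', 'sdc']
print(devices_for("main_pool", boot2))  # ['sda', 'sdc'] – still correct
```

Resolving the stable id at the moment of use, rather than trusting a name recorded on a previous boot, is what removes the rename hazard.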
Thanks for the offer to help; it's heartening to have people offer their time and attention to root out these matters for the greater good. They are often tricky to pin down individually, but I think we are getting pretty close on this one.
On your additional point of '-d scsi' for a SanDisk Cruzer Fit: that should make no difference to the mounting issue, but adding it as a custom SMART option on the device should stop those "Unknown USB bridge" messages from appearing. It won't fully enable SMART, as that doesn't fully exist on those devices, so you will still see errors in the Web-UI when attempting to gain further SMART info; but at least the related log issue is then addressed.
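As a rough illustration of what the custom option changes (command shape only; the exact flags Rockstor's wrapper passes may differ, and the helper below is invented for this sketch), the '-d scsi' device-type override simply ends up on the smartctl command line:

```python
import shlex

# Hypothetical helper building a smartctl invocation with custom device
# options appended; Rockstor's actual SMART wrapper code may differ.
def smart_command(device, custom_opts=""):
    cmd = ["smartctl", "--all"]
    cmd += shlex.split(custom_opts)  # e.g. the '-d scsi' override
    cmd.append(device)
    return cmd

# With '-d scsi' applied to the USB flash drive:
print(" ".join(smart_command("/dev/sda", "-d scsi")))
# smartctl --all -d scsi /dev/sda
```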
The following shows it applied to a Cruzer Fit I have here:
@Stefan and @Flox Just to let you know that the just-released Rockstor version 3.8-14.02, via the testing channel updates, contains the code changes that are intended to help with the problem I think you have both encountered, i.e. occasional boots resulting in empty data mount points, which a reboot then restores. If all is well these changes should be included in the next stable channel update, but they are available now in the testing channel if you wish to provide any feedback on whether your systems still display this behaviour once running 3.8-14.02 or newer.
Thanks, and just to let you know that there is hopefully a fix awaiting testing; and if not, then at least the logs should be of greater value, given they should now contain more descriptive device names.