Rockstor fails to boot with kernel 3.18.1

Hello folks, I’m brand spanking new to Rockstor and just installed it on a new system with a 250GB SSD as root and 2x WD Red 2TB drives in RAID 1, all over SATA 6.0Gb/s. I used the 3.5-5 ISO from SourceForge.


The Web UI wanted me to upgrade to the latest version, 3.6-9, which I did. With that came the new 3.18.1 kernel. However, I’m having serious problems getting 3.18.1 to boot my system. Half of the time, after POST I just get a blinking cursor and nothing ever happens. When it does boot, I almost always get various btrfs errors on my terminal (the most recent one I captured was “systemd-udevd: timeout ‘/bin/mknod /dev/btrfs-control c 10 234’”, but there have been others).

During boot I can manually select the older 3.10.0 kernel, and bootup works fine. However, I then get the Web UI warning about using an unsupported kernel.

I’m pretty familiar with Linux in general but quite new to btrfs, and I don’t have much experience swapping out kernels, so any help here would be appreciated. I’m using an ASRock Intel-based mobo. I can provide any other hardware details as needed.

TIA,
Toby

I’ve never seen this exact error, but perhaps I don’t have the exact hardware you do. We have several different hardware combos running Rockstor, and the closest thing to this that I’ve seen is when there were errors on the hard drive: booting took a long time while btrfs repaired things, finishing after an hour or so, and I’d see a lot of btrfs messages on the console. Here are some questions and random suggestions.


1. Remove quiet from the boot params to see more information; maybe something understandable shows up (see the sketch after this list).
2. Have you created any pools already, or is it just a vanilla install?
3. When you boot with 3.10, do you see any disk-related errors in dmesg or logs anywhere?
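
For the boot params, a one-off test is to press ‘e’ at the GRUB menu and delete rhgb quiet from the linux line; to make it stick across boots, something like this should work (grubby ships with CentOS 7, which Rockstor is based on):

# drop the graphical/quiet boot options from every installed kernel entry
grubby --update-kernel=ALL --remove-args="rhgb quiet"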

Since you are pretty familiar with Linux, you might have more ideas. You can also try and see if the latest kernel from elrepo boots (http://elrepo.org/linux/kernel/el7/x86_64/RPMS/kernel-ml-3.19.1-1.el7.elrepo.x86_64.rpm). I’ve been testing this kernel and have not seen any issues. If it works, you can ignore the error on the Web UI. This kernel will become the new default soon anyway.

Thanks for the reply suman. I’ve been very impressed with Rockstor so far and am excited to get everything working.


To answer your last question first: I do not have any errors whatsoever when booting 3.10 (except for a notice that btrfs has skinny extents, which I assume is expected). It boots very quickly and everything seems to work great.

I did indeed create a single RAID 1 pool which makes use of the entire 2TB on the 2x WD drives. Something else to mention: I was actually testing on a different drive before the new ones got here and had the same problem, but assuming I’d messed something up with moving hardware around, I completely wiped and reinstalled Rockstor from the ISO. So this has actually happened twice.

A bit of Googling found this post, which actually sounds pretty similar to my problem, but recreating the initramfs with --no-hostonly did not fix it.
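
For reference, the rebuild I attempted was along these lines (paths assume the stock dracut on CentOS 7):

# regenerate a generic (non-host-only) initramfs for the running kernel
dracut --force --no-hostonly /boot/initramfs-$(uname -r).img $(uname -r)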

I’ll try removing the quiet params, and if that doesn’t shed any light I’ll try the new kernel. I don’t see this elrepo repo in the yum config anywhere; do I just download the RPM and install it manually? (Sorry, it’s been a while since I used CentOS/yum.)

Okay, a bit of follow-up. I installed the 3.19 kernel from the RPM you mentioned (downloaded and installed with rpm -ivh), but it did not fix the issue. The first time it booted the new kernel, it seemed to work great. Then I rebooted again and got the hang on boot.
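
For the record, the steps were roughly these (URL from your earlier post; assuming wget is available):

wget http://elrepo.org/linux/kernel/el7/x86_64/RPMS/kernel-ml-3.19.1-1.el7.elrepo.x86_64.rpm
rpm -ivh kernel-ml-3.19.1-1.el7.elrepo.x86_64.rpm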


However, with quiet and rhgb disabled I was able to get a bit more info. The last error I get is (typing this from a screenshot, so hopefully I transcribe it correctly; I didn’t include the full device UUID):

[ TIME ] Timed out waiting for device dev-disk-by\x2duuid-92447d61\…\xxx.device
[DEPEND] Dependency failed for /home.
[DEPEND] Dependency failed for Local File Systems.
[DEPEND] Dependency failed for Relabel all filesystems, if necessary.
[DEPEND] Dependency failed for Mark the need to relabel after reboot.

So this seems to be either some sort of boot race condition or a device-label issue? I’d think that if it were a kernel bug, it would fail every time, not just sometimes.

Please let me know if you need more info or would like me to perform additional steps, thanks. Unfortunately, it doesn’t look like I can get systemd/journald to save logs from failed boots, since that disk isn’t available yet…
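
One thought: since / itself does mount, a persistent journal might still capture the failure. This is the standard systemd setup, untested here:

# create the persistent journal directory and restart journald
mkdir -p /var/log/journal
systemctl restart systemd-journald
# after the next failed boot, inspect the previous boot's log
journalctl -b -1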

…Sorry for the spam, but in case it isn’t clear: /home is on /dev/sda, the same SSD the machine is booting from. This happens very shortly after the message “Started Remount Root and Kernel File Systems”… so it seems to be having some sort of problem remounting that device?


The fact that it happens so close to the “Mark the need to relabel after reboot” unit, etc., is why I think it’s a race condition of some sort. Or possibly, when remounting, a file is held open, preventing the unmount?
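
If it is an open file, something like this should reveal what’s holding the mount busy once the system is up (fuser comes from the psmisc package):

# list processes with files open under /home
fuser -vm /home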

Thanks for your interest in Rockstor and the detailed troubleshooting trail. I wonder if something got corrupted or changed on your root disk, like the fs labels. Can you boot into 3.10 and paste the output of this command?


/usr/bin/lsblk -P -o NAME,MODEL,SERIAL,SIZE,TRAN,VENDOR,HCTL,TYPE,FSTYPE,LABEL,UUID

Also, perhaps /etc/fstab?



One more question. Is there any hardware raid on this system?

Sure thing, glad to help!


Output of lsblk:
NAME="sda" MODEL="Samsung SSD 840 " SERIAL="S1DBNSCF102321V" SIZE="232.9G" TRAN="sata" VENDOR="ATA     " HCTL="0:0:0:0" TYPE="disk" FSTYPE="" LABEL="" UUID=""
NAME="sda1" MODEL="" SERIAL="" SIZE="500M" TRAN="" VENDOR="" HCTL="" TYPE="part" FSTYPE="ext4" LABEL="" UUID="6d920393-f250-4fca-a5c2-d993f55f0ab8"
NAME="sda2" MODEL="" SERIAL="" SIZE="7.5G" TRAN="" VENDOR="" HCTL="" TYPE="part" FSTYPE="swap" LABEL="" UUID="d21934be-9ed3-479b-ab50-753c77332720"
NAME="sda3" MODEL="" SERIAL="" SIZE="224.9G" TRAN="" VENDOR="" HCTL="" TYPE="part" FSTYPE="btrfs" LABEL="rockstor_nas" UUID="92447d61-73cc-442b-8db4-5c17ae908dbf"
NAME="sdb" MODEL="WDC WD20EFRX-68E" SERIAL="WD-WCC4MKKACPCX" SIZE="1.8T" TRAN="sata" VENDOR="ATA     " HCTL="1:0:0:0" TYPE="disk" FSTYPE="btrfs" LABEL="wd_red_pool" UUID="9fb673ae-1e53-497e-ae8a-c03aa119e9e2"
NAME="sdc" MODEL="WDC WD20EFRX-68E" SERIAL="WD-WCC4MHYZDPVZ" SIZE="1.8T" TRAN="sata" VENDOR="ATA     " HCTL="2:0:0:0" TYPE="disk" FSTYPE="btrfs" LABEL="wd_red_pool" UUID="9fb673ae-1e53-497e-ae8a-c03aa119e9e2"

/etc/fstab:
#
# /etc/fstab
# Created by anaconda on Fri Mar 13 21:01:58 2015
#
# Accessible filesystems, by reference, are maintained under ‘/dev/disk’
# See man pages fstab(5), findfs(8), mount(8) and/or blkid(8) for more info
#
UUID=92447d61-73cc-442b-8db4-5c17ae908dbf /                       btrfs   subvol=root     1 1
UUID=6d920393-f250-4fca-a5c2-d993f55f0ab8 /boot                   ext4    defaults        1 2
UUID=92447d61-73cc-442b-8db4-5c17ae908dbf /home                   btrfs   subvol=home     1 2
UUID=d21934be-9ed3-479b-ab50-753c77332720 swap                    swap    defaults        0 0

I’m still just exploring the system so I don’t mind testing things out and just wiping/starting over if necessary…

Aaaah, I just noticed as I pasted that: is it okay for two different physical disks (sdb and sdc) to have the same LABEL and UUID? And no, I don’t have any hardware RAID enabled.

The output looks normal to me. The LABEL and UUID are the same because those two drives belong to the same pool; a pool in Rockstor is a BTRFS filesystem. Since this is not a core Rockstor issue, I am unable to provide insightful feedback. Sorry, but here are a couple of things I’d try next, plus a quick sanity check after the list.


1. You said you were able to boot from 3.19 once but it hung on reboot. Rockstor automatically sets the default kernel to 3.18.1 on every boot. Is it possible that on reboot you were going back to 3.18.1, or did you manually select 3.19 from GRUB and confirm that 3.19 also has the same problem?

2. During the install, did you give the hostname as nas? If you do reinstall, try not setting the hostname during install; you can do it during the Web UI setup. This is a long shot and I doubt it’s related to the label issue, but it’s a worthy experiment.
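
A couple of quick checks to go with those (commands assume standard CentOS 7 tooling; the pool label is taken from your lsblk output):

# for question 1: show which kernel grub will boot by default
grubby --default-kernel

# confirm sdb and sdc are two devices of one BTRFS filesystem (pool)
btrfs filesystem show wd_red_pool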


Okay, thanks for the input so far. Yes, I was manually selecting 3.19 from GRUB. And yes, I gave the hostname during install, so I’ll try setting it in the Web UI setup if I exhaust other options.

The current kernel version as of the current Rockstor version (3.8-1) is 4.0.2. Please update if this is still an issue.

Hi @suman, unfortunately I was never able to resolve these issues and ended up moving off of Rockstor for my NAS, so I’m unable to test it any further right now. I had tried many different combinations of kernels and install options but kept having the hangup during boot.

Based on similar issues reported by other users, I tried things like reordering/disabling various systemd units but still couldn’t get it to boot reliably. I made some progress by disabling some auto-mount points in /etc/fstab and then writing a script to mount them after boot had finished, but even then it didn’t always boot reliably.
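
In case it helps anyone else, the workaround looked roughly like this (the fstab line is adapted from the one posted earlier in this thread; the script path is just illustrative):

# in /etc/fstab: noauto stops systemd from blocking boot on /home
UUID=92447d61-73cc-442b-8db4-5c17ae908dbf /home btrfs subvol=home,noauto 0 0

#!/bin/sh
# /usr/local/sbin/mount-late.sh, invoked after boot (e.g. from rc.local)
mount /home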

I don’t think any of this really has anything to do with Rockstor, with the possible exception that it seemed like its startup script was running twice (sorry, I can’t find specifics on which script, since I didn’t write it down), but even disabling that didn’t really fix the problem.

I too have had this issue, and it is related to when the system tries to jump to using a GUI boot screen.

I had this same problem on 3.18 and 4.0.2 and solved it by doing the following:

edit /etc/default/grub

change

GRUB_CMDLINE_LINUX="rhgb quiet crashkernel=auto"

to

GRUB_CMDLINE_LINUX="crashkernel=auto"

Then run

grub2-mkconfig -o /boot/grub2/grub.cfg
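
Note: if your machine boots via UEFI rather than legacy BIOS, the generated config lives elsewhere; assuming a stock CentOS 7 layout, that would be:

grub2-mkconfig -o /boot/efi/EFI/centos/grub.cfg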

I would suggest making this change in the distro itself, as it seems to be causing a few issues and there is no need for a GUI boot screen on a server.

Cheers

Chris


Isn’t removing the rhgb and quiet parameters purely cosmetic? Or does rhgb require drivers that may not be available on certain hardware? If so, does it not fall back to a text boot screen?

I am not exactly sure, but I know it fixes it. My thinking is that it requires the frame buffer to do more work at higher resolution, and older hardware with onboard graphics is not liking it.

Hi
I had the same error and found out the problem was… the hard disk was full!
For some strange reason my 500GB system-only (non-shared) disk was full.
I’m moving to a 1TB hard disk and will later find out what could be causing this.
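
If anyone else hits this: plain df can be misleading on btrfs, so it’s worth checking the filesystem’s own accounting (btrfs-progs commands, run as root):

# btrfs's view of allocated vs. used space on the root filesystem
btrfs filesystem df /
# list all btrfs filesystems and their per-device usage
btrfs filesystem show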

Sorry to tack on to this thread.

I see the same error after I added 2 WD Red drives. I see the following errors repeating over and over, with the varying values noted in brackets:
[uptime in s?] systemd-udevd[varying PID]: worker [another PID] /devices/pci0000:00/0000:00:1f.2/ata4/host3/target…/block/sdc timeout: kill it
[uptime in s?] systemd-udevd[varying PID]: timeout ‘/bin/mknod /dev/btrfs-control c 10 234’
<above line repeats 5 times>
[uptime in s?] seq [some number] /devices/pci…/block/sdc killed

The above also repeats with /devices/system/memory/memory65 timeout: kill it

Not sure what is going on here.
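
Since the timeouts point at sdc, my next step is to check that drive’s SMART data (this assumes smartmontools is installed):

# dump health status and SMART attributes for the suspect drive
smartctl -a /dev/sdc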

Hi,

same problem here.

I tried all of the above methods, but none worked. At first, I also thought that removing quiet and rhgb from the kernel line fixed the problem, but that was not the case; after some days the problem occurred again.
Also, my disk is not full. I tried disabling the crashkernel feature as well, but no luck.

I tried this on four completely different hardware setups.
First was an HP MicroServer N54L.
Second was an ASRock Q1900 with a J1900 CPU.
Third was an older AMD E-350 board.
The last one is an ASRock AM1B-ITX with an Athlon 5350.

On all systems, the same problem after some time.
I also tried different SSDs as the system disk, and different hard drives for data, but all of those drives were WD Red.

I tried different SATA cables and different cases. All with the same problem.

The only thing they have in common: all the cases (the HP MicroServer, a 4-bay case from Inter-Tech, and now an NSC-200 from U-NAS) have a backplane for the hard drives.

I have now pulled the drives out and connected them directly to the mainboard, but I don’t have much hope for this.

I am running out of ideas with this. The last remaining possibility is a software problem.

Last info on this: I tried Rockstor versions from 3.8-10 up to the current 3.8-12, all with the same issue.

Additional info again:
I have a production Rockstor system running in a VM on a KVM host. This works without any such problem.

Also, Rockstor installations in other VMs seem to work fine.

So this problem seems to occur only on physical hardware.