Urgent! boot device died

My 40 GB SSD boot device died yesterday :frowning:
I tried to reinstall Rockstor on the same server using a 500 GB HDD, but every time the GRUB boot manager failed to install on the disk. I tried a 1000 GB HDD and got the same error again.
I tried deactivating ACPI, but without any success.
What can I do to get Rockstor running again? Would it be better to install it based on openSUSE Leap instead of CentOS?
Unfortunately I could not find an installer ISO based on openSUSE.

regards
Andreas

@cbits68 Welcome to the Rockstor community.

I can chip in quickly on this bit:

No, there is not yet an installer available for the version 4 ‘Built on openSUSE’ variant, but it may well be a better way to go. You can, however, build your own installer by following the instructions here:

And multiple forum members have successfully accomplished this task, so they will hopefully chip in if you report any issues you run into while building the installer. We don’t plan to release a pre-compiled installer until we reach the official stable release, but we are fairly close now, I think.

Your failure during the install of the older (CentOS-based) ISO is likely due to hardware compatibility. You can, if this works for you, install on one machine and move the system disk to another machine; that way you may sidestep the problem you are experiencing here. You may have to apply all updates on the ‘donor’ machine, which is best done via the command line:

yum update

before you move the system disk from the donor machine to the desired machine.

Note that this is not required with the version 4 DIY installer build, as it pre-applies all updates at the time of build. And 4 is the future, so you might as well try that route if you can / have the time.
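For completeness, on the ‘Built on openSUSE’ variant the same catch-up is done with zypper rather than yum (a sketch; run as root):

```shell
# Refresh repository metadata, then apply all pending updates.
zypper refresh
zypper update
```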

Hope that helps. If anyone else can chip in with ideas here also that would be great.


Thanks Phil,
I’d like to try to go with openSUSE as soon as possible, but first I have to get access to my data again.
I’m wondering why I would have to install on another machine and then move the disk back to the original machine.
This machine has been working for years now, and the only thing I changed is the boot disk.
Is there any limit for the boot disk regarding capacity? Is 500 GB too much for the installation of CentOS?
OK, I tried an older Samsung 2.5" HDD (320 GB) and the installation finished without any problems. It seems the CentOS installation has an issue with newer disks or larger capacities.
I will try to get my data accessible again now, and when the new SSD I ordered arrives, I will set up the server based on openSUSE Leap.

Thanks
Andreas

@cbits68 Re:

Just as a workaround, given that installing directly on the original machine seems to be the problem. By using a donor machine we potentially sidestep an install issue with the problematic machine.

Some BIOS settings can be quite fussy and will refuse to boot from a chosen drive unless everything is just right, so changing a disk may well be the issue here. You could also try changing some settings within the BIOS, for example. Or it may even be that there are hardware issues that have led to this failure.

No, no limit; any drive will do. Too small is an issue, but only because you need enough space to install and then download and install the updates.

Great, so getting somewhere then. The ‘Built on openSUSE’ variant is likely to be far less fussy, is my guess, just based on it being a lot newer.

Glad you’re now sorted, mostly anyway. And if you have issues with creating the new Rockstor 4 installer, then just start a new thread here on the forum and hopefully those issues can be worked out.

Cheers.


Now everything is working again :slight_smile: I changed to a new SSD boot device today and only have a few issues regarding quotas, but I hope to fix those during the next days. Also, Rock-ons like MariaDB, ownCloud and Plex are back up and running.
I have really trusted Rockstor since 2018, and I just saw that I need to prolong my subscription next month.


@cbits68 Thanks for the update and glad you’ve got the re-install import sorted.
Re:

Yes, we have seen quotas getting upset post-import. You could try disabling via the Web-UI and then re-enabling; remember that it will take a good few minutes to rescan after re-enabling. Otherwise, disable via the command line on the problem pool, then re-enable via the Web-UI and wait. That usually sorts it out, and you can then disable / re-enable just fine from the Web-UI thereafter.
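For reference, the command line disable/re-enable mentioned above looks like this (a sketch; ‘mypool’ is a placeholder for your pool’s mount point under /mnt2):

```shell
# Disable quotas on the problem pool (run as root):
btrfs quota disable /mnt2/mypool

# Re-enable (the Web-UI can also do this step) and let the rescan run;
# -s shows the status of the background rescan:
btrfs quota enable /mnt2/mypool
btrfs quota rescan -s /mnt2/mypool
```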

Hope that helps and chuffed Rockstor is of use to you. Thanks for the continued support via a subscription. Much appreciated.


Thanks Phil,
I just built the ISO for the openSUSE installer myself and it was quite easy :slight_smile:
I first started with the installation of a test system in VMware and it was running smoothly.
The next step will be the migration of my production system from CentOS to openSUSE.
As far as I understand, I just have to back up my configuration in the current installation and restore it after the setup on openSUSE.
My Rock-ons share has been on the NAS drives (since setup 3 years ago) and not on the boot device, so there should be no problem, but I think I have to install and configure the Rock-ons again.
Are there any other things I have to consider when migrating the NAS to openSUSE?

best regards
Andreas

@cbits68 Hello again.
Re:

That’s good to hear. Do please chip in here on the forum to help others with their hiccups on this procedure, as it’s one I’m super keen to enter into general Rockstor knowledge. So feel free to pop in on such questions, as we very much depend on community support here on the forum.

Best to apply the restore after having imported the pool. The pool import (which also imports the shares) then ensures the shares are there for the smb settings to be applied to.

So pool import then config restore.

You shouldn’t have to do this if you have done as advised in the wizards and docs and used specific shares for each Rock-on’s config. Rock-on restore is only a version 4 feature, so you won’t have them restored automatically, but as long as you pick the exact same shares when you re-install them (presumably named to help with this), the Rock-ons will pick up where they left off.

You may also have an issue with docker getting confused by a pre-owned rock-ons-root. Assuming you have used it for nothing else, it will only contain the docker images anyway, which will be re-downloaded if not found by the Rock-on install process. So if it all goes pear-shaped with Rock-on re-installs, just create a fresh share, say rock-ons-root2, and it should then be possible to re-install the Rock-ons into that, selecting each one’s prior config/data shares so they resume their prior state.

Re the CentOS base to ‘Built on openSUSE’: our code, outside of managing the far more complex system subvolume arrangement, is the same. But as always, you should have refreshed backups beforehand and take great care with drive selection during install; ideally, disconnect all but the intended system disk when doing the fresh install.

See the excellent post by our intrepid forum Admin and long-time core developer @Flox on this same question:

And in the same thread, an as-ever pertinent contribution by the prolific @GeoffA:

Hope that helps.


Hmmm, unfortunately openSUSE seems to have problems with 2 of my drives connected to the SATA 6 Gbps controller on my hardware build.
I don’t know why, but I get some errors during startup with slow response from these drives.
I have now tried updating from stable to 4.0.6.0 but still get the ‘link slow’ message on SATA ports 7 & 8 (two Seagate 8 TB drives in RAID 1 mode).
In addition, the restore of the Rockstor config did not include recovery of the smb shares, the users and groups managed by Rockstor, or the Rock-ons at all.
I had to configure/install them manually to get everything working again.
OK, it’s beta, but I don’t think that a lot of users can solve these problems on their own after migrating to openSUSE and Rockstor 4.
I also had to configure quotas again and needed to disable and enable them in the pool configuration.
Now I will try to do a UAT during the next days.

best regards
Andreas

@cbits68 Hello again.

This is likely quite serious and should be addressed/resolved prior to any other reports, as these ‘hangs’ / slow responses can end up breaking all manner of other facilities due to time-outs etc. It is also likely outside the remit of Rockstor itself, as we use the default kernel and btrfs stack from openSUSE Leap 15.2, assuming this install is from a DIY installer build. I’d suggest you do generic searches for your hardware/drives on these interfaces. Are these drives SMR by any chance? They are not ‘yet’ btrfs or COW friendly.
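One way to check the drives, assuming smartmontools is installed (/dev/sdX is a placeholder for the drive behind the affected port); note that smartctl does not flag SMR directly, so the reported model number then needs checking against the vendor’s published SMR/CMR lists:

```shell
# Print drive identity info: model, serial, firmware, rotation rate.
smartctl -i /dev/sdX
```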

Re:

N.B. Rock-on restore is only functional if the config save was taken from a version 4 variant. And there is as yet no known issue with user/group restore.

Did you restore the config after first importing the pool/pools? And had you updated fully prior to doing this config restore, i.e. what version of Rockstor were you running when you did the restore? From your post I’m guessing you were restoring a version 3 config onto a 4.0.6 instance. Is that correct?

This is a known issue currently; quotas are still a little tricky within btrfs and Rockstor. And yes, usually a disable/re-enable within the Web-UI sorts this, and if not, a command line disable helps. This quota issue only relates to pools migrated from 3 to 4, where we are ‘catching up’ on a number of years of btrfs development.

Thanks for the reports, it all helps with improving where we can. As stated, we are not yet at a Stable release and are whittling down the final issues re feature parity and migration from 3 to 4. Not all folks will be migrating their pool, though, and a feature such as Rock-on config save/restore can’t be retroactive if it did not exist when the config save was enacted. Again, we need better documentation on this. Pull requests are welcome on the docs repo, as always:

And the relevant section here would be:

http://rockstor.com/docs/config_backup/config_backup.html

Bit by bit. And thanks again for your feedback.


Hi Philip,
Yes, correct. I restored a 3.9.2 config to 4.0.6.0 after a successful import of all pools.
To set up quotas and compression again, I also had to change the ownership of some shares from root back to the original users and groups (from 3.9.2).
After the config restore, the users and groups that are not under Rockstor management were back again, but not the users and groups that were managed by Rockstor before.

With journalctl I found the following messages, created during startup:

ACPI errors (example):
Apr 12 09:47:48 localhost kernel: ACPI BIOS Error (bug): Could not resolve symbol [_SB.PCI0.SAT0.SPT4._GTF.DSSP], AE_NOT_FOUND (20190703/psargs-330)
Apr 12 09:47:48 localhost kernel: ACPI Error: Aborting method _SB.PCI0.SAT0.SPT4._GTF due to previous error (AE_NOT_FOUND) (20190703/psparse-531)
Apr 12 09:47:48 localhost kernel: ACPI BIOS Error (bug): Could not resolve symbol [_SB.PCI0.SAT0.SPT4._GTF.DSSP], AE_NOT_FOUND (20190703/psargs-330)
Apr 12 09:47:48 localhost kernel: ACPI Error: Aborting method _SB.PCI0.SAT0.SPT4._GTF due to previous error (AE_NOT_FOUND) (20190703/psparse-531)

I also get some SRST failed messages for the 2 SATA 6 Gbps drives (Seagate 8 TB):

Apr 12 09:48:47 localhost kernel: ata8: SRST failed (errno=-16)
Apr 12 09:48:47 localhost kernel: ata8: reset failed, giving up
Apr 12 09:48:47 localhost kernel: ata7: SRST failed (errno=-16)
Apr 12 09:48:47 localhost kernel: ata7: reset failed, giving up
Apr 12 09:48:47 localhost systemd-udevd[257]: seq 1547 ‘/devices/pci0000:00/0000:00:1f.5’ is taking a long time

My mainboard has an internal RAID controller but I didn’t use it:
Apr 12 09:48:49 NAS001 systemd[1]: Started udev Wait for Complete Device Initialization.
Apr 12 09:48:49 NAS001 systemd[1]: Starting Activation of DM RAID sets…
Apr 12 09:48:49 NAS001 dmraid[554]: no raid disks
Apr 12 09:48:49 NAS001 systemd[1]: dmraid-activation.service: Main process exited, code=exited, status=1/FAILURE
Apr 12 09:48:49 NAS001 systemd[1]: Failed to start Activation of DM RAID sets.
Apr 12 09:48:49 NAS001 systemd[1]: dmraid-activation.service: Unit entered failed state.
Apr 12 09:48:49 NAS001 systemd[1]: dmraid-activation.service: Failed with result ‘exit-code’.
Apr 12 09:48:49 NAS001 systemd[1]: Reached target Local Encrypted Volumes.
Apr 12 09:48:49 NAS001 systemd[1]: Reached target Local File Systems.
Apr 12 09:48:49 NAS001 systemd[1]: Starting Restore /run/initramfs on shutdown…
Apr 12 09:48:49 NAS001 systemd[1]: Starting Create Volatile Files and Directories…
Apr 12 09:48:49 NAS001 systemd[1]: Started Restore /run/initramfs on shutdown.
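As an aside, kernel messages like those above can be pulled from the journal with a filter; the grep pattern here is just an illustrative sketch:

```shell
# Kernel (-k) messages from the current boot (-b 0), filtered to ATA/ACPI lines:
journalctl -k -b 0 | grep -Ei 'ata[0-9]+:|acpi (bios )?error'
```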

The SRST issue with both Seagate drives could be a BIOS / drive jumper configuration problem.
I haven’t seen that with CentOS 7, but it could occur due to a different kernel with openSUSE…
I will check that by reading the Seagate manual about jumper configuration.

regards
Andreas

@cbits68 Thanks for the feedback.

Keep us posted re that controller / drive issue. Looks like the most pressing currently.

Cheers.

OK, today I swapped the SATA connections of my 2 WD40 NAS drives with the two ST8000 drives.
The WD40 drives produced the same problems on SATA ports 7 & 8 as the ST8000 drives.
These 2 SATA ports support up to 6 Gbps, while all the others only support 3 Gbps.
Normally the controller automatically negotiates the speed with the connected SATA drives during startup.
In this case drives connected to the 3 Gbps SATA ports are connected at 3 Gbps:

SATA link up 3.0 Gbps (SStatus 123 SControl 300)

but the drives connected to the 6.0 Gbps ports produce the messages
link is slow to respond, please be patient (ready=-19)
SRST failed (errno=-16)
and are linked only with
limiting SATA link speed to 1.5 Gbps

This is really irritating.

I am thinking about downgrading to Rockstor 3 on CentOS 7 to check whether this only happens with Rockstor 4 on openSUSE.

warm regards
Andreas


@cbits68 Re:

Given that we use a generic openSUSE Leap 15.2 hardware component in a 4 install, rather than going backwards you might be better off searching for this as a known issue with the controller chips of the given 6 Gbps SATA HBAs. It’s likely a known issue. We used to offer a Tumbleweed variant, but unfortunately our technical debt put paid to that a little while ago (it will be back in due course); we hope to release a Leap 15.3 variant soon.

So check for this as a known issue within Leap 15.2, as there is likely a kernel command line option that can resolve / work around this issue for the time being. The CentOS variant is years older and a major step back in every respect, plus we no longer release any rpm updates for it. So I’d concentrate on finding a fix/workaround for this directly, at least until we update our Leap base in due course.
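One candidate for such a kernel command line option is libata.force, which can cap the link speed on specific ports. A hedged sketch follows: the port numbers 7 and 8 are taken from the earlier logs, and whether this actually helps on this particular controller is an assumption.

```shell
# Append to GRUB_CMDLINE_LINUX_DEFAULT in /etc/default/grub, e.g.:
#   libata.force=7:3.0Gbps,8:3.0Gbps
# then regenerate the grub config and reboot:
grub2-mkconfig -o /boot/grub2/grub.cfg
```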

Hope that helps.


Yes, for sure…
I think the best way will be to change the hardware to a SATA 6 Gbps-only controller (supported by the current kernel) :slight_smile: That will speed up all my drives to 6 Gbps, which should give some improvement.
I built this Rockstor NAS from my son’s old gaming PC (from 2013). Perhaps I can find some used hardware with onboard graphics.
But I expect that I will also have to change the CPU and RAM. Currently my config has 12 GB DDR3 and an Intel® Core™ i5-2500K CPU @ 3.30GHz,
with 28 TB of storage on the NAS drives in all.
This is sufficient, but can be improved :wink:


@cbits68, have you by chance tried replacing the SATA cables for the “slow” drives? I believe I remember reading that this could cause the entire array to slow down and create errors.
Another comment I saw was that it “self-healed” after a cold reboot (i.e. shutdown and power back on), but I suspect that’s more anecdotal evidence and not really a root cause resolution…

Thanks, Dan, for the hint. Yes, I also tried changing cables and disk drives on these ATA ports, unfortunately without success. I also had the idea that cabling could have caused the problem, because I disconnected all data drives during the update to Rockstor 4.
I really suspect this issue is caused by an incompatible driver for this HBA (SATA 6 Gbps), because SATA 3 Gbps is working without any problem and only drives connected to the 2 SATA 6 Gbps ports report a slow link.


OK, now I’m a little bit confused.
I checked the logs once more: all 6 of my drives are identified correctly, and the link speed is also set to the correct value.
The messages/failures I get are for ata7 & 8, which are not present on my mainboard; I only have connectors for SATA 1-6.
So I wonder why I get these messages:
Apr 14 09:24:26 localhost kernel: ata7: SATA max UDMA/133 cmd 0xf090 ctl 0xf080 bmdma 0xf050 irq 19
Apr 14 09:24:26 localhost kernel: ata8: SATA max UDMA/133 cmd 0xf070 ctl 0xf060 bmdma 0xf058 irq 19
Apr 14 09:24:42 localhost kernel: ata7: link is slow to respond, please be patient (ready=-19)
Apr 14 09:24:42 localhost kernel: ata8: link is slow to respond, please be patient (ready=-19)
Apr 14 09:24:46 localhost kernel: ata8: SRST failed (errno=-16)
Apr 14 09:24:46 localhost kernel: ata7: SRST failed (errno=-16)

On the other hand, ata5 & 6 are connected at 3.0 Gbps:
Apr 14 09:24:37 localhost kernel: ata6: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Apr 14 09:24:37 localhost kernel: ata5: SATA link up 3.0 Gbps (SStatus 123 SControl 300)

but are later also limited to 1.5 Gbps, like ata7 & 8:
Apr 14 09:25:22 localhost kernel: ata5: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
Apr 14 09:25:22 localhost kernel: ata6: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
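The negotiated and maximum allowed link speeds can also be read live from sysfs, without trawling the logs (a sketch; the attribute names come from the kernel’s libata transport class):

```shell
# Current negotiated speed and current limit, per SATA link:
grep . /sys/class/ata_link/link*/sata_spd \
       /sys/class/ata_link/link*/sata_spd_limit
```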

Perhaps ata7 & 8 are reserved for eSATA connectors, but the manual for my mainboard states that they should not be present on my version of the MSI P67A-GD53 and are only available on the P67A-GD65.

If I take a look at the disk info (Rockstor Web-UI), I can follow the results exactly:
[screenshots of the Rockstor Web-UI disk info omitted]

The disks connected to SATA 5 & 6 are limited to 1.5 Gbps.
Disks connected to SATA 1-4 work at the expected speed of 3 or 6 Gbps.

After I changed the SATA connections of some drives, Rock-ons like ownCloud (but also others) will not start again, and reinstallation failed.
I really don’t know if it was a good idea to upgrade to Rockstor 4 on openSUSE Leap :frowning:

The following messages are in rockstor.log (the ownCloud official Rock-on cannot be installed again):

[14/Apr/2021 21:01:53] INFO [storageadmin.tasks:55] Now executing Huey task [install], id: 36dffe14-4614-4c67-9aba-6458f18f92f6.
[14/Apr/2021 21:01:53] ERROR [system.osi:199] non-zero code(1) returned by command: ['/usr/bin/docker', 'stop', 'owncloud-official']. output: [''] error: ['Error response from daemon: No such container: owncloud-official', '']
[14/Apr/2021 21:01:53] ERROR [system.osi:199] non-zero code(1) returned by command: ['/usr/bin/docker', 'rm', 'owncloud-official']. output: [''] error: ['Error: No such container: owncloud-official', '']
[14/Apr/2021 21:01:55] ERROR [system.osi:199] non-zero code(127) returned by command: ['/usr/bin/docker', 'run', '-d', '--restart=unless-stopped', '--name', 'owncloud-official', '-v', '/mnt2/owncloud:/var/www/html', '-v', '/etc/localtime:/etc/localtime:ro', '-p', '8080:80/tcp', '-p', '8080:80/udp', 'owncloud:latest']. output: [''] error: ['docker: Error response from daemon: stat /mnt2/ROCK-ONS/btrfs/subvolumes/bf85d96eb6bc3f36e633530611881ca3a8ee8da7bd107aa78341b4b3ae80a7ec: no such file or directory.', "See 'docker run --help'.", '']
[14/Apr/2021 21:01:55] ERROR [storageadmin.views.rockon_helpers:207] Error running a command. cmd = /usr/bin/docker run -d --restart=unless-stopped --name owncloud-official -v /mnt2/owncloud:/var/www/html -v /etc/localtime:/etc/localtime:ro -p 8080:80/tcp -p 8080:80/udp owncloud:latest. rc = 127. stdout = ['']. stderr = ['docker: Error response from daemon: stat /mnt2/ROCK-ONS/btrfs/subvolumes/bf85d96eb6bc3f36e633530611881ca3a8ee8da7bd107aa78341b4b3ae80a7ec: no such file or directory.', "See 'docker run --help'.", '']
Traceback (most recent call last):
  File "/opt/rockstor/src/rockstor/storageadmin/views/rockon_helpers.py", line 204, in install
    globals().get("{}_install".format(rockon.name.lower()), generic_install)(rockon)
  File "/opt/rockstor/src/rockstor/storageadmin/views/rockon_helpers.py", line 390, in generic_install
    run_command(cmd, log=True)
  File "/opt/rockstor/src/rockstor/system/osi.py", line 201, in run_command
    raise CommandException(cmd, out, err, rc)
CommandException: Error running a command. cmd = /usr/bin/docker run -d --restart=unless-stopped --name owncloud-official -v /mnt2/owncloud:/var/www/html -v /etc/localtime:/etc/localtime:ro -p 8080:80/tcp -p 8080:80/udp owncloud:latest. rc = 127. stdout = ['']. stderr = ['docker: Error response from daemon: stat /mnt2/ROCK-ONS/btrfs/subvolumes/bf85d96eb6bc3f36e633530611881ca3a8ee8da7bd107aa78341b4b3ae80a7ec: no such file or directory.', "See 'docker run --help'.", '']
[14/Apr/2021 21:01:55] INFO [storageadmin.tasks:63] Task [install] completed OK

@cbits68 for the Rock-on reinstall, you might have to “completely” uninstall it before reinstalling, using the script and Rock-on name on the command line, e.g.:
/opt/rockstor/bin/delete-rockon 'OwnCloud - Official'

This will remove the metadata for that installation from the database and the underlying docker image.

Deleting Rock-ons via the command line didn’t help.
I solved the issue by removing the Rock-ons share and creating it again.
Now I was able to install the Rock-ons again.

I also tried to roll back to Rockstor 3 and saw that I had the same problem with limited ATA devices there too; I just hadn’t noticed it before.
I really have to think about an upgrade kit with a new mainboard with onboard graphics, CPU and memory.

Does anybody have a recommendation for a new hardware build?