Hi,
in the last two weeks, after long uptime periods (50 or more days), my system has started to randomly reboot (sometimes within hours, sometimes within days). As I am not a real Linux expert I can’t really tell what the issue might be. In the rockstor logs or the messages file I cannot find any indication why it shut down. It only seems to show the details when it came back up. Fortunately, it has come back up every time so far without data corruption, though on one of the last occasions I lost my RockOn setup (Plex Server) and had to follow another forum post to reinstate the entire Rock On service (remove RockOns, remove RockOn Share, create new share, reassign to RockOn service, restart).
Today, the reboot happened right while I was checking through various things (logs, sensors, etc.).
I have installed lm_sensors and checked the CPU temps and they are well within the non-critical temperature ranges. Neither the system disk (SSD) nor the NAS disks are anywhere near full.
I am running an ASROCK Rack board (CS-236) with a Skylake CPU and ECC RAM (from when I was looking at FreeNAS).
What other data should I post/other places to look so somebody can possibly provide some insights?
Hmmm… if the logs aren’t showing anything, that would suggest a hardware issue. I would swap out the power supply and see if that fixes it. If power is being cut, there would be nothing in the logs about a kernel error, aside from the boot messages saying there was a fatal shutdown or whatever it’s called.
That might be a good point. All of the components are less than a year old, but … of course I could have purchased a dud for a power supply …
I will give that a shot as well. I don’t remember seeing anything about an unexpected/fatal shutdown anywhere, but will investigate that again.
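One way to look for an unexpected shutdown is the login accounting records: `last -x` lists boots together with shutdown entries, newest first. The sketch below is a minimal, hedged example of reading that listing. The function name `classify_boots` is hypothetical, and it assumes the usual layout in which a clean reboot leaves a `shutdown` record directly below (older than) the following `reboot` record. If the wtmp file has been rotated, older boots will simply not appear.

```python
def classify_boots(last_x_lines):
    """Label each boot in `last -x` output as following a clean shutdown or not.

    `last -x` prints the newest entries first, so the record that precedes a
    boot in time is the next "reboot" or "shutdown" line further down.
    """
    # Keep only boot and shutdown records; skip runlevel changes and logins.
    records = [
        line for line in last_x_lines
        if line.split()[:1] in (["reboot"], ["shutdown"])
    ]
    results = []
    for i, line in enumerate(records):
        if not line.startswith("reboot"):
            continue
        if i + 1 >= len(records):
            status = "unknown"   # oldest boot on record, nothing earlier to compare
        elif records[i + 1].startswith("shutdown"):
            status = "clean"     # a shutdown was recorded before this boot
        else:
            status = "unclean"   # two boots in a row, no shutdown in between
        results.append((line, status))
    return results

# Possible usage on the affected machine:
#   import subprocess
#   out = subprocess.run(["last", "-x"], capture_output=True, text=True).stdout
#   for line, status in classify_boots(out.splitlines()):
#       print(status, line)
```

A boot marked "unclean" only says that no shutdown record was written first, which fits a power cut but also a hard hang or a reset, so it narrows things down rather than proving the PSU is at fault.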
@Hooverdan I agree with @ScottyEdmonds and I’ve been very impressed with BeQuiet PSUs. Although I’ve only tried a couple of older models of their Straight Power (E9 I think), and on a more budget build, one from their Pure Power range (L8 at the time). 450W and 300W respectively in my case. Super good build quality, fantastic spec (voltage accuracy, ripple, etc.) and measured performance, and really efficient with good standby power consumption. Both models were super quiet but the Straight Power was essentially silent, although the Pure Power was close enough to it. All researched on older models now though. I selected the Straight Power a while ago from reading the following (now old) review where they seemed pretty thorough (back in 2013):
One of these L8s has been used extensively ever since.
All still going strong. Many builds don’t concentrate on the PSU much but I think this is a mistake (Because Science).
It’s at least worth switching the PSU out for another one just by way of diagnosis. Maybe just switch with another machine for the time being, assuming load compatibility of course.
Thanks for the tips. I will try a Corsair that’s been serving me well over the years to see whether that’s the root cause. If it turns out to be stable, then I’ll try out one of the BeQuiet ones. The one I am currently using is a Rosewill (NewEgg) that worked for my small but well ventilated case …
I’ll update back with what I find, if anything.
Just for completeness: here’s what I see in the messages log before the reboot seems to occur:
Jun 14 23:14:23 rockstorw systemd: Stopping user-0.slice.
Jun 15 00:01:01 rockstorw systemd: Created slice user-0.slice.
Jun 15 00:01:01 rockstorw systemd: Starting user-0.slice.
Jun 15 00:01:01 rockstorw systemd: Started Session 13 of user root.
Jun 15 00:01:01 rockstorw systemd: Starting Session 13 of user root.
Jun 15 00:15:15 rockstorw systemd: Removed slice user-0.slice.
Jun 15 00:15:15 rockstorw systemd: Stopping user-0.slice.
--> Here the reboot occurs and, according to the Rockstor log, completes a short time later.
Jun 15 00:56:20 rockstorw rsyslogd: [origin software="rsyslogd" swVersion="7.4.7" x-pid="3243" x-info="http://www.rsyslog.com"] start
Jun 15 00:56:09 rockstorw kernel: Linux version 4.8.7-1.el7.elrepo.x86_64 (mockbuild@Build64R7) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-11) (GCC) ) #1 SMP Thu Nov 10 20:47:24 EST 2016
Jun 15 00:56:09 rockstorw kernel: Command line: BOOT_IMAGE=/vmlinuz-4.8.7-1.el7.elrepo.x86_64 root=UUID=73014235-1edb-4824-b3ac-c0dd8bf26cec ro rootflags=subvol=root crashkernel=128M rhgb quiet bert_disable
Jun 15 00:56:09 rockstorw kernel: x86/fpu: Supporting XSAVE feature 0x001: 'x87 floating point registers'
Jun 15 00:56:09 rockstorw kernel: x86/fpu: Supporting XSAVE feature 0x002: 'SSE registers'
Jun 15 00:56:09 rockstorw kernel: x86/fpu: Supporting XSAVE feature 0x004: 'AVX registers'
…
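To spot the silent window before each boot in a messages file like the excerpt above, something along these lines might help. It is only a sketch: `find_boot_gaps` and `parse_time` are hypothetical names, the `kernel: Linux version` boot marker is taken from the excerpt, and the year has to be passed in because these syslog timestamps do not include one. Since the rsyslogd start line can carry a later timestamp than the first kernel line, the code walks backwards to the last entry that is not newer than the boot line.

```python
from datetime import datetime

BOOT_MARKER = "kernel: Linux version"  # first kernel line of a boot, as in the excerpt

def parse_time(line, year):
    """Parse the leading 'Mon DD HH:MM:SS' syslog timestamp (English month names)."""
    return datetime.strptime(f"{year} {line[:15]}", "%Y %b %d %H:%M:%S")

def find_boot_gaps(lines, year):
    """Return (last_line_before_boot, gap) for every boot marker in `lines`."""
    entries = []
    for line in lines:
        try:
            entries.append((parse_time(line, year), line))
        except ValueError:
            continue  # skip lines without a parsable timestamp
    gaps = []
    for i, (boot_time, line) in enumerate(entries):
        if BOOT_MARKER not in line:
            continue
        # Walk back to the newest earlier entry that is not later than the boot.
        for prev_time, prev_line in reversed(entries[:i]):
            if prev_time <= boot_time:
                gaps.append((prev_line, boot_time - prev_time))
                break
    return gaps

# Possible usage (the path is the usual CentOS location, adjust as needed):
#   with open("/var/log/messages") as fh:
#       for prev_line, gap in find_boot_gaps(fh, year=2017):
#           print(gap, "|", prev_line.rstrip())
```

If the last line before a boot is routine activity rather than systemd stopping services for shutdown, that is consistent with the machine losing power or resetting, though it does not by itself rule out a hard hang.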
In the dmesg file I see some ACPI errors, but after some research they appear not to be too critical (on various forums I found that this message has been around for a long time, and my BIOS is on the latest version, too):
[ 1.312192] ACPI Error: [DSSP] Namespace lookup failure, AE_NOT_FOUND (20160422/psargs-359)
[ 1.312195] ACPI Error: Method parse/execution failed [_SB.PCI0.SAT0.PRT2._GTF] (Node ffff88084d4fae88), AE_NOT_FOUND (20160422/psparse-542)
I also had disabled BERT since it was throwing an error (can see that in the boot parameters).
Today I will swap the power supply to see whether that alleviates the problem.
Just to provide an update: I swapped out the PSU with a smaller one. It has been running for 24 hours without a reboot (though with the old one I had up to 3 days before a spontaneous reboot), so at least it’s not worse than the old one.
I will continue to run it with some loads as well and report back. @phillxnet and @ScottyEdmonds thanks again for your input. While I posted this message in the 3.9.0-0 stable channel, of course the automated update now moved it to 3.9.1-0 in the meantime, but assuming it’s a hardware problem, this shouldn’t matter.
@Hooverdan Thanks for the update.
[quote="Hooverdan, post:7, topic:3370"]
It has been running for 24 hours without a reboot (though with the old one I had up to 3 days before a spontaneous reboot)…
[/quote]
So fingers crossed then.
Agreed, especially given that the problem previously occurred without any software changes, although it is of course not ideal to change more than one variable at a time during diagnosis.
Hi, it appears at this time that indeed the power supply was at fault. The system has been running stable (despite high summer temperatures) and no spontaneous reboots without a trace in the log files have occurred.
For testing I had swapped the Rosewill PSU with a new Corsair SF600 (since I have a small case anyway, and am partial to Corsair products - so far their PSUs have not given me any issues in the last 10 years). It’s small and quiet and has ample power to support my Rockstor setup. If I just jinxed myself, I will post again, but I think at this time I consider the issue resolved.