Crash / System becomes unresponsive under IO load

Hello All,

At first I thought I had a faulty cable, and then a faulty disk, but I have solved the problem now, so I thought I would share my findings.

When the system crashed, I found this in the syslog:

Jul 12 15:36:48 backup kernel: ata6: lost interrupt (Status 0x50)
Jul 12 15:36:48 backup kernel: ata6.01: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
Jul 12 15:36:48 backup kernel: ata6.01: failed command: WRITE DMA EXT
Jul 12 15:36:48 backup kernel: ata6.01: cmd 35/00:00:00:56:e9/00:02:24:00:00/f0 tag 0 dma 262144 out
         res 40/00:00:00:00:00/00:00:00:00:00/10 Emask 0x4 (timeout)
Jul 12 15:36:48 backup kernel: ata6.01: status: { DRDY }
Jul 12 15:36:48 backup kernel: ata6: soft resetting link
Jul 12 15:36:49 backup kernel: ata6.00: configured for UDMA/133
Jul 12 15:36:49 backup kernel: ata6.01: configured for UDMA/133
Jul 12 15:36:49 backup kernel: ata6.01: device reported invalid CHS sector 0
Jul 12 15:36:49 backup kernel: ata6: EH complete
Jul 12 15:37:19 backup kernel: ata6: lost interrupt (Status 0x50)
Jul 12 15:37:19 backup kernel: ata6.01: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
Jul 12 15:37:19 backup kernel: ata6.01: failed command: WRITE DMA EXT
Jul 12 15:37:19 backup kernel: ata6.01: cmd 35/00:80:80:c7:e9/00:05:24:00:00/f0 tag 0 dma 720896 out
         res 40/00:00:00:00:00/00:00:00:00:00/10 Emask 0x4 (timeout)
Jul 12 15:37:19 backup kernel: ata6.01: status: { DRDY }
Jul 12 15:37:19 backup kernel: ata6: soft resetting link
Jul 12 15:37:20 backup kernel: ata6.00: configured for UDMA/133
Jul 12 15:37:20 backup kernel: ata6.01: configured for UDMA/133
Jul 12 15:37:20 backup kernel: ata6.01: device reported invalid CHS sector 0
Jul 12 15:37:20 backup kernel: ata6: EH complete
Jul 12 15:37:51 backup kernel: ata6: lost interrupt (Status 0x50)
Jul 12 15:37:51 backup kernel: ata6.01: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
Jul 12 15:37:51 backup kernel: ata6.01: failed command: WRITE DMA EXT
Jul 12 15:37:51 backup kernel: ata6.01: cmd 35/00:00:80:a5:eb/00:0d:24:00:00/f0 tag 0 dma 1703936 out
         res 40/00:00:00:00:00/00:00:00:00:00/10 Emask 0x4 (timeout)
Jul 12 15:37:51 backup kernel: ata6.01: status: { DRDY }
Jul 12 15:37:51 backup kernel: ata6: soft resetting link
Jul 12 15:37:52 backup kernel: ata6.00: configured for UDMA/133
Jul 12 15:37:52 backup kernel: ata6.01: configured for UDMA/133
Jul 12 15:37:52 backup kernel: ata6.01: device reported invalid CHS sector 0
Jul 12 15:37:52 backup kernel: ata6: EH complete
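
If anyone wants to check their own logs for the same pattern, here is a rough sketch that counts the "failed command" lines per ATA device. The function name count_ata_failures is just something I made up for illustration, and the log path in the usage comment is an assumption (it varies by distro).

```shell
#!/bin/sh
# Sketch: count libata "failed command" events per device in a kernel log.
# count_ata_failures is a hypothetical helper name, not part of any tool.
# Usage (log path is an assumption): count_ata_failures /var/log/messages
count_ata_failures() {
    awk '
        # Matches e.g. "kernel: ata6.01: failed command: WRITE DMA EXT"
        match($0, /ata[0-9]+(\.[0-9]+)?: failed command: /) {
            dev = substr($0, RSTART, RLENGTH)
            sub(/: failed command: $/, "", dev)   # keep only "ata6.01"
            cmd = substr($0, RSTART + RLENGTH)    # rest of line is the command name
            count[dev " " cmd]++
        }
        END { for (k in count) print count[k], k }
    ' "$1" | sort -rn
}
```

Each output line is a count followed by the device and the command that timed out, with the most frequent first.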

I then also found this in the log at boot:

Jul 12 10:04:42 backup kernel: ACPI Warning: SystemIO range 0x0000000000000400-0x000000000000041F conflicts with OpRegion 0x0000000000000400-0x000000000000040F (\SMRG) (20150410/utaddress-254)
Jul 12 10:04:42 backup kernel: ACPI: If an ACPI driver is available for this device, you should use it instead of the native driver
Jul 12 10:04:42 backup kernel: shpchp: Standard Hot Plug PCI Controller Driver version: 0.4
Jul 12 10:04:42 backup kernel: ACPI Warning: SystemIO range 0x00000000000004B0-0x00000000000004BF conflicts with OpRegion 0x0000000000000480-0x00000000000004BF (\GPS0) (20150410/utaddress-254)
Jul 12 10:04:42 backup kernel: ACPI: If an ACPI driver is available for this device, you should use it instead of the native driver
Jul 12 10:04:42 backup kernel: ACPI Warning: SystemIO range 0x0000000000000480-0x00000000000004AF conflicts with OpRegion 0x0000000000000480-0x00000000000004BF (\GPS0) (20150410/utaddress-254)
Jul 12 10:04:42 backup kernel: ACPI: If an ACPI driver is available for this device, you should use it instead of the native driver

So I did some reading / googling and found that ACPI can cause issues on older boards.

Once I added acpi=off to GRUB_CMDLINE_LINUX in my grub config, the problem seems to have gone away.

I edited /etc/default/grub, changing

GRUB_CMDLINE_LINUX="crashkernel=auto"

to

GRUB_CMDLINE_LINUX="crashkernel=auto acpi=off"

and then ran

grub2-mkconfig -o /boot/grub2/grub.cfg

reboot
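
For anyone who would rather script the edit than do it by hand, here is a minimal sketch. The function name add_kernel_param is hypothetical, and it assumes a single GRUB_CMDLINE_LINUX="..." line in double quotes and a simple parameter with no slashes or regex characters in it. It only touches the file you pass in, so it can be tried on a copy first.

```shell
#!/bin/sh
# Sketch: append a kernel parameter to GRUB_CMDLINE_LINUX if it is not already there.
# add_kernel_param is a hypothetical helper; FILE is normally /etc/default/grub.
# Assumes one double-quoted GRUB_CMDLINE_LINUX line and a simple PARAM like acpi=off.
add_kernel_param() {
    file=$1
    param=$2
    # Do nothing if the parameter is already a whole word inside the quotes.
    if grep -Eq "^GRUB_CMDLINE_LINUX=\"([^\"]* )?${param}( [^\"]*)?\"" "$file"; then
        return 0
    fi
    # Insert the parameter just before the closing quote; the original is kept as FILE.bak.
    sed -i.bak -E "s/^(GRUB_CMDLINE_LINUX=\"[^\"]*)\"/\1 ${param}\"/" "$file"
}

# Example (as root), followed by the grub2-mkconfig step and reboot from the post:
#   add_kernel_param /etc/default/grub acpi=off
```

After the reboot, cat /proc/cmdline shows the parameters the running kernel was actually booted with, which is a quick way to confirm the change took effect.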

Ah, @sirhcjw strikes again.
Yes, older systems can be a real pain, and as Rockstor uses pretty new kernels they may start to become unworkable. For instance, I heard the other day, though I don’t know if it is true or not, that no systemd developers are using spinning rust drives on /; fancy that. ACPI is very well established but sometimes very poorly implemented, especially on older motherboards. You could also try a firmware upgrade on your motherboard, as these often contain ACPI fixes. But that in itself is a risky business, so if you are up and running then the “don’t fix what isn’t currently broken” idea might be the way to go. However, ACPI is rather assumed these days, so there may be some strange behaviour with it turned off. Disabling as many things as you can in the BIOS, e.g. the sound card, would also help to simplify things.

Thanks for sharing again.

I have had a further development with this one.

I found ACPI 2.0 was disabled in my BIOS. I have enabled it and removed acpi=off from my kernel boot line, and things seem to be good again.

I will update with any further developments.

Should we start a hardware config thread where people can post details of their systems, so others can see what works and what might have issues?

For example, dmidecode run on the command line will return some good info:

Base Board Information
Manufacturer: ASUSTeK Computer INC.
Product Name: P5KPL-AM EPU
Version: x.0x

Processor Information
Socket Designation: Socket 775
Type: Central Processor
Family: Other
Manufacturer: Intel
ID: 7A 06 01 00 FF FB EB BF
Version: Intel® Xeon® CPU E5450 @ 3.00GHz
Voltage: 1.2 V
External Clock: 333 MHz
Max Speed: 3800 MHz
Current Speed: 3000 MHz

Memory Module Information
Socket Designation: DIMM A1
Bank Connections: 0 1
Current Speed: 30 ns
Type: DIMM SDRAM
Installed Size: 2048 MB (Double-bank Connection)
Enabled Size: 2048 MB (Double-bank Connection)
Error Status: OK

Memory Module Information
Socket Designation: DIMM B1
Bank Connections: 2 3
Current Speed: 30 ns
Type: DIMM SDRAM
Installed Size: 2048 MB (Double-bank Connection)
Enabled Size: 2048 MB (Double-bank Connection)
Error Status: OK

Memory Device
Array Handle: 0x002D
Error Information Handle: Not Provided
Total Width: 64 bits
Data Width: 64 bits
Size: 2048 MB
Form Factor: DIMM
Set: None
Locator: DIMM A1
Bank Locator: BANK0
Type: DDR2
Type Detail: Synchronous
Speed: 667 MHz
Manufacturer: Manufacturer0
Serial Number: SerNum0
Asset Tag: AssetTagNum0
Part Number: PartNum0

Handle 0x0030, DMI type 20, 19 bytes
Memory Device Mapped Address
Starting Address: 0x00000000000
Ending Address: 0x0007FFFFFFF
Range Size: 2 GB
Physical Device Handle: 0x002F
Memory Array Mapped Address Handle: 0x002E
Partition Row Position: 1
Interleaved Data Depth: 1

Memory Device
Array Handle: 0x002D
Error Information Handle: Not Provided
Total Width: 64 bits
Data Width: 64 bits
Size: 2048 MB
Form Factor: DIMM
Set: None
Locator: DIMM B1
Bank Locator: BANK1
Type: DDR2
Type Detail: Synchronous
Speed: 667 MHz
Manufacturer: Manufacturer1
Serial Number: SerNum1
Asset Tag: AssetTagNum1
Part Number: PartNum1

Handle 0x0032, DMI type 20, 19 bytes
Memory Device Mapped Address
Starting Address: 0x00084000000
Ending Address: 0x00103FFFFFF
Range Size: 2 GB
Physical Device Handle: 0x0031
Memory Array Mapped Address Handle: 0x002E
Partition Row Position: 1
Interleaved Data Depth: 1

This may be a bit verbose, but as I look at it, maybe this could be the basis of a system information page in Rockstor. You could have a "collect hardware info and send to Rockstor" button so you guys could start to compile a known working hardware list.
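
To cut the verbosity down, something like the following could trim the dmidecode output to a handful of fields. This is only a sketch: summarize_dmi is a made-up name, the choice of fields is my own assumption about what is useful for a hardware list, and it relies on dmidecode's usual layout of an unindented section title followed by indented "Key: Value" lines.

```shell
#!/bin/sh
# Sketch: reduce dmidecode output (read from stdin) to a short hardware summary.
# summarize_dmi is a hypothetical helper; the selected fields are an assumption.
summarize_dmi() {
    awk '
        /^Handle / { next }                      # skip "Handle 0x..., DMI type ..." lines
        /^[^ \t]/  { section = $0; next }        # unindented line = section title
        {
            line = $0
            sub(/^[ \t]+/, "", line)             # strip the indentation
            if (section == "Base Board Information" && line ~ /^(Manufacturer|Product Name): /)
                print "board " line
            else if (section == "Processor Information" && line ~ /^Version: /)
                print "cpu " line
            else if (section == "Memory Device" && line ~ /^(Size|Type|Speed): /)
                print "memory " line
        }
    '
}

# Example (dmidecode needs root):
#   dmidecode -t baseboard -t processor -t memory | summarize_dmi
```

The -t options limit dmidecode to the baseboard, processor and memory tables, so there is less to filter in the first place.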

@suman @phillxnet

Well, my error is back. Bummer.

Jul 19 20:40:12 backup kernel: ata6: lost interrupt (Status 0x50)
Jul 19 20:40:12 backup kernel: ata6.01: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
Jul 19 20:40:12 backup kernel: ata6.01: failed command: WRITE DMA EXT
Jul 19 20:40:12 backup kernel: ata6.01: cmd 35/00:00:00:27:40/00:26:29:00:00/f0 tag 0 dma 4980736 out
         res 40/00:00:00:00:00/00:00:00:00:00/10 Emask 0x4 (timeout)
Jul 19 20:40:12 backup kernel: ata6.01: status: { DRDY }
Jul 19 20:40:12 backup kernel: ata6: soft resetting link
Jul 19 20:40:12 backup kernel: ata6.00: configured for UDMA/133
Jul 19 20:40:12 backup kernel: ata6.01: configured for UDMA/133
Jul 19 20:40:12 backup kernel: ata6.01: device reported invalid CHS sector 0
Jul 19 20:40:12 backup kernel: ata6: EH complete

I will keep trying to solve it; I might be able to fix it with some more BIOS changes.

I will investigate and report back.

@sirhcjw we have had another recent post relating to ACPI, and I quoted your original findings here in that forum thread and described how I thought they might relate. When I saw that you had effectively reverted your settings to full ACPI on, I thought I’d wait and see how you got on. Drawing the two together now, though they are only tangentially related, I suggest you turn off “ACPI 2.0 support” in the BIOS but leave on “ACPI APIC support”, as that combination, found by @KarstenV starting from your original findings, allowed all their CPU cores to be enabled whilst they were still able to install. But of course you may not have those options in your BIOS. Their situation was a little less instability and a little more “won’t install”, specifically around the grub area. I detail there what I think was happening, but of course that may all be wrong. Their finding was that disabling just ACPI 2.0 Support allowed them to install past the final grub step, which before this BIOS adjustment was failing with an unknown error. I surmised this to be similar to your mostly hidden timeout issue. Give that post a read and see if it helps. Thanks for keeping us all posted, as these kinds of issues are a real pain, and if we could work out a blanket guide that would be great.

Tweaking just BIOS settings is also preferred, as it’s less fiddly than kernel boot options, but we may still need those in some circumstances of course.

Nice idea on the hardware database by the way. I’m just not quite sure how we would establish whether the reports were from “known good” hardware, as many settings, even on the same board, can affect stability. There is definitely something to it though. Or maybe we should just reference CentOS certified hardware, as that is essentially what Rockstor is, Linux-wise at least.