Input/output error (5)

I recently had issues (Dec 2020) and ended up with a file system that was read only. I was able to copy 95% or more of the data to USB drives, reinstalled the OS, remade the pools, and restored the data. Stuff I wasn’t able to copy off I was able to recover from an off-site backup. I also ran MEMTEST which found no errors, replaced the thermal grease on the processor, and replaced the power supply. The system is on a UPS and has not had an improper shutdown.

I’m in a similar situation again. A recent scrub reported many millions of errors. The number was so high, I didn’t bother counting the number of digits. It might have been billions, or even hundreds of billions of errors. I poked around a bit and had the impression that the issue was that a hard drive had just become disconnected. This scrub also happened right after I installed two 10TB drives (though I didn’t add them to a pool). I opened the case, reseated each of the SATA and power cables, put it back together, powered up and everything appeared to be fine.

I started a new scrub, estimated that it’d take 4 days to complete, went to bed, forgot about it. I checked on it after about a week and found that the scrub stopped 2 TB in (out of 13 or so) and the array was read only.

So now, I’m again copying the data back off of the pool, at least everything that I can copy. Rockstor services have been disabled (sudo systemctl disable rockstor-pre rockstor rockstor-bootstrap). I’m getting a significant number of “Input/output error (5)” messages while using rsync to copy the data. Rebooting the server after getting this error and retrying the copy will allow it to succeed. Some files just seem to be dead: no matter how many attempts/reboots, the copy fails.

When I’m done, I want to test the hard drives. Each drive has passed a smartctl long test, but that’s basically a self-contained test; I don’t think it exercises the link between the processor and the drive. Is badblocks the best solution for testing the drives? I’ll probably do a btrfs check or btrfs restore before badblocks to see if I can recover some of the stubborn files.

dmesg is showing a LOT of csum errors on sdc

[ 1692.129176] BTRFS warning (device sdc): csum failed root 4546 ino 11131 off 3971227648 csum 0x8194dd1c expected csum 0xf202924a mirror 1

That makes me think that sdc may be a bad drive, but if the file system just can’t reliably read from sdc, shouldn’t that be something that RAID helps with? sdc does have a new SATA cable and a new power cable.
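A quick tally of the dmesg lines makes the per-device split obvious; this is just a grep/uniq pipeline over the kernel log text, nothing btrfs-specific:

```shell
# Count btrfs csum failures per device from kernel log text on stdin --
# a quick way to confirm whether the errors are confined to one drive.
csum_counts() {
    grep 'csum failed' | grep -o 'device [a-z0-9]*' | sort | uniq -c | sort -rn
}

# usage: dmesg | csum_counts
```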

Edit: I’m contemplating unplugging sdc (after a proper shutdown) to see whether that helps with copying files off of the array.

@Noggin Sorry to hear of your plight here.

As you state later, it can help to reboot and try again. Sometimes it can take a few scrubs to finally get the job done: sometimes possibly due to ordering changes, sometimes due to hardware flakiness. It may well be worth power cycling in the hope that the pool mounts rw, and then attempting more scrubs. I’m assuming a redundant btrfs raid level here, i.e. something above raid0.

This is where I’m hoping a follow-up scrub can help, but that does depend on achieving a rw mount, of course. And it’s probably best done only once you have nothing more to lose, i.e. you’ve got off, via a read-only mount, all you can.

An important element here is the version of Rockstor you are using. Our prior CentOS variant has a much older btrfs stack, and there have been years of improvements made in the meantime. It is very much worth trying to import this pool into a Rockstor 4 instance, as that is now ‘Built on openSUSE’ (Leap 15.2, as it goes). You didn’t specify which version this machine is running, but it’s definitely worth a try to get the improved capabilities of the newer kernel and btrfs userland tools.
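As a quick check of which stack a given install is actually running, something like the following works on either variant:

```shell
# Report the kernel and btrfs userland versions, the two parts of the
# btrfs stack that differ most between the CentOS and Leap based installs.
uname -r
btrfs --version 2>/dev/null || echo "btrfs-progs not installed"
```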

I would agree that you have a dead or failing drive here.

Yes, if raid1 and above. But this failover operation only happens on a read failure, and for it to happen your drives have to honour a sane timeout. Dead or dying drives don’t always do that. Take a look at the following outstanding issue we have regarding drive timeout settings and their relevance to kernel timeouts. In short, the drive must give up before the kernel’s default kicks in; otherwise you have a misconfigured system that stands in the way of the raid ‘healing’ capabilities. These timeouts are the reason it is advisable to use NAS-oriented drives. Many desktop drives do have this setting capability, but it’s not always configured appropriately by default.

Be sure to read all of that issue’s comments, as some drives are known to hang when asked to reconfigure their timeout settings.
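As a rough sketch of the check (device name illustrative; SCT ERC is reported in tenths of a second, the kernel timeout in seconds):

```shell
# Read the two timeouts involved; the drive's error-recovery timeout
# (SCT ERC) must be shorter than the kernel's per-device timeout, or the
# kernel resets the link before btrfs can fail over to the other copy.
#   smartctl -l scterc /dev/sdc          # drive-side, in tenths of a second
#   cat /sys/block/sdc/device/timeout    # kernel-side, in seconds (default 30)

# Helper comparing the two values (ERC in deciseconds, kernel in seconds):
timeouts_sane() {
    local erc_ds=$1 kernel_s=$2
    [ "$erc_ds" -lt $((kernel_s * 10)) ]
}
```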

Hope that helps.


Thanks @phillxnet

At the time of the last failure in December, I wiped all the drives, bought a fresh USB stick, and installed OpenSUSE based Rockstor. I also set up my pools such that the stuff I care a lot about (projects, tax information, photos, etc) are on a RAID1. Stuff I don’t care about much (ripped blu-rays, scraped youtube channels for the kids, etc) are on RAID5.

All of the drives are WD Red or White labels, TLER time reported as 7 seconds. They’re in a RAID5 (and I’m aware that RAID5 isn’t production ready). Luckily, rebooting always puts the pool into RW mode. Here’s my scrub status (emphasis added):

scrub status for 2207e565-5850-4e82-8e92-b3f5ae604cb1
scrub started at Thu Jan 28 06:14:19 2021 and was aborted after 00:08:24
total bytes scrubbed: 25.33GiB with 16 errors
error details: read=16
corrected errors: 0, uncorrectable errors: 16, unverified errors: 0

So a scrub doesn’t seem to fix it. This mirrors my experience when trying to scrub from the UI. But it also hasn’t dropped to RO as I can still touch files in the file system. I think the next step might be btrfs rescue chunk-recover /dev/sdd. However, I’ve read that this can result in files being “wrongly restored” so I’m leaning towards just restoring from offsite or just re-ripping. I have an rsync log which shows all of the files it couldn’t copy. I can just go through it and pick out any files I actually care about to restore or regenerate.
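To pick the failures out of the rsync log, something like this works on my log format; the exact error text can vary between rsync versions, so treat the pattern as a starting point:

```shell
# List the unique paths that hit "Input/output error (5)" in an rsync log.
# Assumes the path appears in double quotes on the error line, as in:
#   rsync: read errors mapping "/path/to/file": Input/output error (5)
failed_files() {
    grep 'Input/output error (5)' "$1" | sed -n 's/.*"\(.*\)".*/\1/p' | sort -u
}

# usage: failed_files /root/rescue.log > /root/dead-files.txt
```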

I swapped the cables between /dev/sdd and /dev/sdc. They swapped in Linux, and the errors reported in dmesg followed the physical drive. I think this suggests that it isn’t a cable issue, and it isn’t a bad connector on the motherboard. The next question is whether the drive is bad, or if it is simply a fubar’ed file system.
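btrfs also keeps per-device error counters that survive the sdc/sdd renaming, which helps confirm that the errors follow the physical drive; a small filter over btrfs device stats output (the pool path below is a placeholder):

```shell
# Show only the non-zero counters from `btrfs device stats <mount>` output
# (read from stdin), so the device accumulating errors stands out.
nonzero_dev_errs() {
    awk '$2 != 0 {print $1, $2}'
}

# usage: sudo btrfs device stats /mnt2/main_pool | nonzero_dev_errs
```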

I have badblocks running a non-destructive test, but I had to force it as it thought /dev/sdd was mounted. umount /dev/sdd reported that it wasn’t mounted, lsof didn’t show any open files, so I forced it. I mean, it’s already corrupted, what do I have to lose? I just want to know if it is reliable.


Switched badblocks to a destructive test so it’d go faster. Completed a single pass (read and write) of badblocks and found no errors on any of the drives that were in the pool. Put the drives back into a pool and am currently restoring the data. Third time’s the charm?