I recently had issues (Dec 2020) and ended up with a file system that was read only. I was able to copy 95% or more of the data to USB drives, reinstalled the OS, remade the pools, and restored the data. Stuff I wasn’t able to copy off I was able to recover from an off-site backup. I also ran MEMTEST which found no errors, replaced the thermal grease on the processor, and replaced the power supply. The system is on a UPS and has not had an improper shutdown.
I’m in a similar situation again. A recent scrub reported many millions of errors. The number was so high, I didn’t bother counting the number of digits. It might have been billions, or even hundreds of billions of errors. I poked around a bit and had the impression that the issue was that a hard drive had just become disconnected. This scrub also happened right after I installed two 10TB drives (though I didn’t add them to a pool). I opened the case, reseated each of the SATA and power cables, put it back together, powered up and everything appeared to be fine.
I started a new scrub, estimated that it’d take 4 days to complete, went to bed, and forgot about it. I checked on it after about a week and found that the scrub had stopped 2 TB in (out of 13 or so) and the array was read only.
So now, I’m again copying the data back off of the pool, at least everything that I can copy. Rockstor services have been disabled (sudo systemctl disable rockstor-pre rockstor rockstor-bootstrap). I’m getting a significant number of “Input/output error (5)” messages while using rsync to copy the data. Rebooting the server after getting this error and retrying the copy will allow it to succeed. Some files just seem to be dead. No matter how many attempts/reboots, the copy just fails.
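To keep track of which files need another attempt after a reboot, I could save rsync's output to a log and pull the failed paths out of it. This is only a rough sketch: it assumes the error lines start with "rsync:", quote the path, and end with the "Input/output error (5)" text, and the function name failed_paths is made up for illustration. The wording in front of the path differs between rsync versions, so the pattern ignores it.

```python
import re

# Assumed line shape: rsync: <some text> "<path>": Input/output error (5)
# Only the quoted path and the trailing error text are relied upon.
IO_ERROR = re.compile(
    r'^rsync: .*"(?P<path>[^"]+)".*: Input/output error \(5\)\s*$'
)


def failed_paths(log_lines):
    """Return unique paths from I/O error lines, in first-seen order."""
    seen = {}
    for line in log_lines:
        match = IO_ERROR.match(line)
        if match:
            # A dict keeps insertion order, so it works as an ordered set.
            seen.setdefault(match.group("path"), None)
    return list(seen)
```

The resulting list could be written one path per line and used as the retry list for the next copy attempt, so files that already copied cleanly are not read again.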
When I’m done, I want to test the hard drives. Each drive has passed a smartctl long test, but that test is basically a self-contained test. I don’t think it tests the link between the processor and the drive. Is badblocks the best solution for testing the drive? I’ll probably do a btrfs check or btrfs restore before badblocks to see if I can recover some of the stubborn files.
dmesg is showing a LOT of csum errors on sdc:
[ 1692.129176] BTRFS warning (device sdc): csum failed root 4546 ino 11131 off 3971227648 csum 0x8194dd1c expected csum 0xf202924a mirror 1
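To see how the errors are spread out instead of eyeballing the log, something like the following could tally them. It is a minimal sketch that assumes every warning has exactly the format of the line above; the function name tally_csum_failures is made up for illustration. It also takes the device name in the message at face value, and I'm not certain that name always points at the physical disk the bad read came from.

```python
import re
from collections import Counter

# Pattern built from the single dmesg line quoted above; other kernel
# versions may word the warning differently.
CSUM_FAILED = re.compile(
    r"BTRFS warning \(device (?P<dev>\w+)\): csum failed "
    r"root (?P<root>\d+) ino (?P<ino>\d+) off (?P<off>\d+) "
    r"csum (?P<got>0x[0-9a-f]+) expected csum (?P<want>0x[0-9a-f]+) "
    r"mirror (?P<mirror>\d+)"
)


def tally_csum_failures(lines):
    """Count csum failures per device name and per (root, inode) pair."""
    by_device = Counter()
    by_inode = Counter()
    for line in lines:
        match = CSUM_FAILED.search(line)
        if match:
            by_device[match.group("dev")] += 1
            key = (int(match.group("root")), int(match.group("ino")))
            by_inode[key] += 1
    return by_device, by_inode
```

Run over the lines of dmesg output, the per-inode counts would show whether the failures cluster in a handful of files or are scattered across the whole pool.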
That makes me think that sdc may be a bad drive, but if the file system just can’t reliably read from sdc, shouldn’t that be something that RAID helps with? sdc does have a new SATA cable and a new power cable.
Edit: I’m contemplating unplugging sdc (after a proper shutdown) and seeing if that helps with copying files off of the array.