Use some basic commands like these to move into the cron.daily folder and create a file called status.sh:
cd /etc
cd cron.daily
vi status.sh
vi starts in command mode; press i to switch to insert mode. Enter the text above, or similar, as required. When you have finished editing, press the following keys to leave insert mode, then save the file and exit:
ESC
SHIFT + :
x
ENTER
Make sure you put the correct email address in the file, and some basic BTRFS, scrub, and SMART status information will be emailed daily. Since this is a shell script, you can put any command you like in the file.
Also put the correct pool and device names in your file. The device names can be obtained using this command,
btrfs fi show
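If you want the script to pick the device names up automatically rather than hard-coding them, one hedged sketch is to parse the `devid` lines of that output. The labels, sizes, and paths below are made up for illustration; on a real system you would capture `btrfs fi show` itself:

```shell
#!/bin/sh
# Illustrative 'btrfs fi show' output -- everything in OUT is made up.
# On a real system you would use: OUT=$(btrfs fi show)
OUT="Label: 'MainPool'  uuid: 00000000-0000-0000-0000-000000000000
	Total devices 2 FS bytes used 1.00GiB
	devid    1 size 1.82TiB used 500.00GiB path /dev/sdb
	devid    2 size 1.82TiB used 500.00GiB path /dev/sdc"

# The device path is the last field of each 'devid' line.
DEVICES=$(printf '%s\n' "$OUT" | awk '/devid/ {print $NF}')
echo "$DEVICES"
```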
I have added some smartctl switches, such as -d sat, as these are required for my external hard drives.
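To give an idea of the overall shape, here is a minimal sketch of such a status.sh. The email address, pool mount point, and device name are placeholders, not values from my system, so substitute your own; it falls back to printing the report if no mailer is present:

```shell
#!/bin/sh
# Sketch only: the three values below are hypothetical placeholders.
MAILTO="you@example.com"       # your email address
POOL="/mnt2/MainPool"          # your pool mount point
DEV="/dev/sdb"                 # one of your devices (from btrfs fi show)

report() {
    echo "== btrfs filesystem show =="
    btrfs fi show 2>&1
    echo "== device stats =="
    btrfs device stats "$POOL" 2>&1
    echo "== scrub status =="
    btrfs scrub status "$POOL" 2>&1
    echo "== SMART health (-d sat needed for some external drives) =="
    smartctl -d sat -H "$DEV" 2>&1
    echo "== end of report =="
}

# Mail the report if a mailer is available; otherwise print it.
if command -v mail >/dev/null 2>&1; then
    report | mail -s "Daily BTRFS status" "$MAILTO"
else
    report
fi
```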
The only other thing I was thinking would be to grep the text and look for failures and summarise that at the start of the email.
That … looks bad. That’s telling you that you have varying numbers of read and write errors across many disks in your pool.
Also, I’m not sure why you’re grepping out the healthy stats (grep -vE ' 0$')
I would also suggest sort -n rather than plain sort, as it compares the counts numerically rather than character by character, so the lines come out in true numeric order.
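A quick illustration of the difference, using made-up counts:

```shell
#!/bin/sh
printf '10\n2\n1\n' | sort      # lexical order: 1, 10, 2
printf '10\n2\n1\n' | sort -n   # numeric order: 1, 2, 10
```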
In your output, something appears very wrong with one of the following:
Your pool
Your disk controller/motherboard
Your disks sd[bcdfkl]
Here is an example of what the stats output should look like (from my own system).
Note that I only have 2 read errors on a single disk.
I grepped out the zeros, because they were things I don’t care about. No need to see a report that shows a bunch of lines with 0. And I used plain sort over sort -n so that it was sorted by the drive labels (sda/b/c).
So things seem to be running fine. I’ve got no SMART errors. I’ve had Rockstor running on this motherboard for 18 months, and with the current PERC H200 card for the last 4 months. My disks are all of varying ages (some pretty new, some quite old). So maybe it is just an aging disk, and SMART hasn’t complained yet?
I just did a btrfs device stats -z /mnt2/MainPool to zero out the stats, and am running a scrub now. I’ll keep checking and see if the error numbers start going up again.
That makes sense, though it does prevent the ability to gain an overview, i.e. how many of the total disks are not happy.
Note that SMART has thresholds when reporting errors, IO errors are typically only reported once you’ve breached the threshold (that said, I’d have expected sdb/sdf to have done so, based on that output!)
Zeroing the stats and scrubbing is a good interim test, but worth noting that a scrub may not need to touch a lot of the data.
I thought a scrub touched all of the data, no? From the man page:
btrfs scrub is used to scrub a btrfs filesystem, which will read all data and metadata blocks from all devices and verify checksums. Automatically repair corrupted blocks if there’s a correct copy available.
In case so many disks are affected… do you use an external controller? It’s unusual for 6 cables or disks to break at the same time… I would vote for the controller being broken…
The good news is that I have now been running for the last 14 days, and all my stats are still 0. I’m guessing that if the controller was bad, or if I had a bad disk/disks, I’d have started to see some numbers.
I’ll keep an eye on it, but I’m thinking it was just an anomaly (knock on wood).