Status Information From Rockstor

I was thinking about a feature to add to Rockstor, and perhaps implement it myself.

How about a weekly emailed report, covering drive space and SMART information for the drives?

How would I go about this? I am sure there are multiple ways to do it on a Linux system.


I’d say most information should be readily available from the command line tools.

Actually I like your idea, so do keep us posted on it!

This is something I have been working on lately. Here is a simple version of what I have come up with in case it helps.

Put a file like this in the /etc/cron.daily/ folder, and it will run daily.

#!/bin/sh
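# Collect BTRFS usage and scrub status plus SMART health, and email the report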
(
    btrfs fi show
    btrfs filesystem df /mnt2/MainPool/
    echo " "
    btrfs scrub status -R /mnt2/MainPool/
    echo " "
    echo "==SDA=="
    smartctl --health /dev/sda
    echo "==SDB=="
    smartctl --health -d sat /dev/sdb
    echo "==SDC=="
    smartctl --health /dev/sdc
    echo "==SDD=="
    smartctl --health -d sat /dev/sdd
    echo "==SDE=="
    smartctl --health /dev/sde
    echo "==SDF=="
    smartctl --health /dev/sdf
) | /bin/mail -s 'NAS Status' youremailaddress@something.com

Use some basic commands like these to move into the cron.daily folder and create a file called status.sh,

cd /etc/cron.daily
vi status.sh

vi starts off in command mode; you hit i to change to insert mode. Put in the text above, or similar as required. Then when you have finished editing, use the following keys to leave insert mode, save the file and exit.

ESC
SHIFT + :
x
ENTER
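One extra step worth noting (scripts in cron.daily generally only run if they are executable): make the script executable,

chmod +x /etc/cron.daily/status.sh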

Make sure you put the correct email address in the file, and some basic BTRFS, scrub, and SMART status information will be emailed daily. This being a shell script means you can put any command you like in the file.

Also put the correct pool and device names in your file. The device names can be obtained using this command,

btrfs fi show

I have added some smartctl switches such as -d sat, as these are required for my external hard drives.
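If you are not sure which devices need a type switch, smartctl can list what it detects (just a pointer; USB bridges in particular may still need -d sat given explicitly),

smartctl --scan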

The only other thing I was thinking of would be to grep the text for failures and summarise them at the start of the email.
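Something along these lines might work as a starting point (an untested sketch: it assumes smartctl prints a line containing FAILED when a drive fails its health check, and that the btrfs stats error counters sit at the end of each line),

#!/bin/sh
# Sketch: build the report first, then prepend a one-line failure summary.
REPORT=$(
    btrfs device stats /mnt2/MainPool/
    echo "==SDA=="
    smartctl --health /dev/sda
    # ...remaining commands from the script above...
)
# Count smartctl FAILED lines plus any btrfs error counters that are not zero.
FAILURES=$(echo "$REPORT" | grep -cE 'FAILED|_errs +[1-9]')
(
    echo "Possible failures found: $FAILURES"
    echo " "
    echo "$REPORT"
) | /bin/mail -s 'NAS Status' youremailaddress@something.com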


Thanks for this!

I had read somewhere else to also check btrfs device stats

I noticed when I do that, I get the following output:

[root@rocknas ~]# btrfs device stats /mnt2/MainPool | grep -vE ' 0$' | sort
[/dev/sdb].read_io_errs 4128
[/dev/sdb].write_io_errs 8534
[/dev/sdc].read_io_errs 48
[/dev/sdc].write_io_errs 21
[/dev/sdd].read_io_errs 138
[/dev/sdd].write_io_errs 1
[/dev/sdf].read_io_errs 2509
[/dev/sdk].read_io_errs 46
[/dev/sdl].read_io_errs 123
[root@rocknas ~]#

Not sure what this really tells me, or if it is something to worry about.

@kupan787,

That … looks bad. That’s telling you that you have varying numbers of read and write errors across many disks in your pool.
Also, I’m not sure why you’re grepping out the healthy stats (grep -vE ' 0$').
I would also suggest sort -n rather than sort, as it sorts on the numbers rather than the text, so you see the error counts in order.
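For example, something like this (a sketch; -k2 assumes the count is the second whitespace-separated field, and -r puts the biggest counts first),

btrfs device stats /mnt2/MainPool | sort -k2 -nr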

In your output, there appears to be something very wrong with one of the following:

  • Your pool
  • Your disk controller/motherboard
  • Your disks sd[bcdfkl]

Here is an example of what the stats output should look like (from my own system).
Note that I only have 2 read errors on a single disk.

[ root@rockout (pass 0s) ~ ]# btrfs device stats /mnt2/tempraid/
[/dev/sdd].write_io_errs    0
[/dev/sdd].read_io_errs     0
[/dev/sdd].flush_io_errs    0
[/dev/sdd].corruption_errs  0
[/dev/sdd].generation_errs  0
[/dev/sde].write_io_errs    0
[/dev/sde].read_io_errs     0
[/dev/sde].flush_io_errs    0
[/dev/sde].corruption_errs  0
[/dev/sde].generation_errs  0
[/dev/sdc].write_io_errs    0
[/dev/sdc].read_io_errs     2
[/dev/sdc].flush_io_errs    0
[/dev/sdc].corruption_errs  0
[/dev/sdc].generation_errs  0
[/dev/sdb].write_io_errs    0
[/dev/sdb].read_io_errs     0
[/dev/sdb].flush_io_errs    0
[/dev/sdb].corruption_errs  0
[/dev/sdb].generation_errs  0
[/dev/sdf].write_io_errs    0
[/dev/sdf].read_io_errs     0
[/dev/sdf].flush_io_errs    0
[/dev/sdf].corruption_errs  0
[/dev/sdf].generation_errs  0
[/dev/sda].write_io_errs    0
[/dev/sda].read_io_errs     0
[/dev/sda].flush_io_errs    0
[/dev/sda].corruption_errs  0
[/dev/sda].generation_errs  0
[/dev/sdh].write_io_errs    0
[/dev/sdh].read_io_errs     0
[/dev/sdh].flush_io_errs    0
[/dev/sdh].corruption_errs  0
[/dev/sdh].generation_errs  0
[/dev/sdg].write_io_errs    0
[/dev/sdg].read_io_errs     0
[/dev/sdg].flush_io_errs    0
[/dev/sdg].corruption_errs  0
[/dev/sdg].generation_errs  0 

Yours scares me.

Ah, bummer. That’s not what I wanted to hear.

I grepped out the zero counts because they were things I don’t care about; no need to see a report that shows a bunch of lines with 0. And I used sort over sort -n so that it was sorted by the drive labels (sda/b/c).

So things seem to be running fine. I’ve got no SMART errors. I’ve had Rockstor running on this motherboard for 18 months, and with the current PERC H200 card for the last 4 months. My disks are all of varying ages (some pretty new, some quite old). So maybe it is just an aging disk, and SMART hasn’t complained yet?

I just did a btrfs device stats -z /mnt2/MainPool to zero out the stats, and am running a scrub now. I’ll keep checking to see if the error numbers start going up again.
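For reference, the rough sequence I’m following (the pool path is mine; the scrub runs in the background),

btrfs device stats -z /mnt2/MainPool    # reset the error counters
btrfs scrub start /mnt2/MainPool        # kick off a scrub
btrfs scrub status /mnt2/MainPool       # check progress later
btrfs device stats /mnt2/MainPool       # see if any counters climb again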

That makes sense, though it does prevent you from getting an overview, i.e. how many of the total disks are not happy.

Note that SMART has thresholds for reporting errors; IO errors are typically only reported once you’ve breached the threshold (that said, I’d have expected sdb/sdf to have done so, based on that output!).
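If you want to dig into the raw counters behind those thresholds, the SMART attribute table can be dumped per drive, e.g. (device name here is just an example),

smartctl -A /dev/sdb    # attribute table: raw values vs. thresholds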

Zeroing the stats and scrubbing is a good interim test, but it’s worth noting that a scrub may not need to touch a lot of the data.

Thomas Krenn has a fantastic write-up on Analyzing failing disks with smartmontools, which may be a useful read.


I thought a scrub touched all of the data, no? From the man page:

btrfs scrub is used to scrub a btrfs filesystem, which will read all data and metadata blocks from all devices and verify checksums. Automatically repair corrupted blocks if there’s a correct copy available.

Sorry, my bad - it won’t touch all of the disk.
Thus, anything marked as deleted (I think) won’t get scrubbed, and neither will free space.

This means that your scrub may ignore some locations that have previously experienced errors, because the data there is now marked as free space.

There does seem to be a problem with your setup @kupan787. btrfs device stats looks useful though.

I put it in the latest version of my script, plus I added the same commands for my rockstor pool. It now looks like this.

#!/bin/sh
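# Same daily report, now with btrfs device stats and the rockstor pool included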
(
    btrfs fi show
    echo "==MainPool=="
    btrfs device stats /mnt2/MainPool/
    echo " "
    btrfs filesystem df /mnt2/MainPool/
    echo " "
    btrfs scrub status -R /mnt2/MainPool/
    echo " "
    echo "==rockstor=="
    btrfs device stats /mnt2/rockstor_rockstor/
    echo " "
    btrfs filesystem df /mnt2/rockstor_rockstor/
    echo " "
    btrfs scrub status -R /mnt2/rockstor_rockstor/
    echo " "
    echo "==SDA=="
    smartctl --health /dev/sda
    echo "==SDB=="
    smartctl --health -d sat /dev/sdb
    echo "==SDC=="
    smartctl --health /dev/sdc
    echo "==SDD=="
    smartctl --health -d sat /dev/sdd
    echo "==SDE=="
    smartctl --health /dev/sde
    echo "==SDF=="
    smartctl --health /dev/sdf
) | /bin/mail -s 'NAS Status' youremailaddress@something.com

Since so many disks are affected… do you use an external controller? It’s unusual that 6 cables or disks break at the same time… I would vote for the controller being broken…

The good news is that I have now been running for the last 14 days, and all my stats are still 0. I’m guessing that if the controller was bad, or if I had a bad disk or disks, I’d have started to see some numbers.

I’ll keep an eye on it, but I’m thinking it was just an anomaly (knock on wood).