Btrfs Scrub - scheduled task got cancelled - no reason being displayed?

I’ve scheduled a “scrub” task to run every Friday at 9am. It seems to have started at 9am but was cancelled immediately, with no hint as to why or what the problem is. Any advice?

Thx,
Roland

@glenngould Hello again.

What version of Rockstor is this happening on and is there any info/clues from the log (System - Logs Manager) at the time it was ‘cancelled’?

@phillxnet Hi, thanks for the fast reply.
I’m on 3.9.1-5. The Rockstor log does not show any entry from today; the last one is from yesterday. Hm…

@glenngould OK so you have the latest scrub fixes. I’ve just tried a freshly scheduled scrub here on 3.9.1-5 and all was OK.

What is the output of the following command:

btrfs scrub status /mnt2/pool-name-here

You could try removing that scrub job and re-creating a new one to see if that helps. It all depends on what version that job was created on, i.e. those created before 3.9.1-2 will error out as they use the old pools API.

Hope that helps.

@phillxnet thx again.

This was the first time I set up a scrub task with 3.9.1-5.

[root@rockstor ~]# btrfs scrub status /mnt2/rm_share
scrub status for d5829eb2-f19f-4329-83a4-c71261a270bf
scrub started at Fri Aug 11 09:00:02 2017 and was aborted after 00:12:44
total bytes scrubbed: 72.97GiB with 28 errors
error details: read=24 csum=4
corrected errors: 0, uncorrectable errors: 28, unverified errors: 0

The 11th of August at 9am was exactly when the scrub task should have started - 28 errors :cry:

@glenngould OK so that explains it then, as “aborted” is the same message that is given by ‘btrfs scrub status pool-mnt-point’ when a scrub is manually cancelled. Hence the cancelled status report. As it goes we already save all the other info returned by scrub status but we just don’t yet have a way of surfacing it in the GUI. But obviously this is planned and will probably take the form of a link on each of the scrub ‘Task history’ table entries, probably as an entry in an additional table column with a single word summary, i.e. OK (Green) and Errors (Red) or the like.

But ‘as is’ we only assess that first line for overall status feedback.
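
A minimal sketch of what such a first-line assessment could look like. This is not the actual Rockstor parser: the function name, the status labels and every phrase other than “was aborted after” (which is the only one confirmed by the output in this thread) are assumptions made for illustration.

```python
import re

# Phrases looked for in the "scrub started at ..." line.
# Only "was aborted after" is confirmed by the output posted in this thread;
# the other phrases and all of the status labels are illustrative assumptions.
_STATE_PATTERNS = [
    (re.compile(r"\bwas aborted after\b"), "cancelled"),
    (re.compile(r"\bfinished after\b"), "finished"),
    (re.compile(r"\brunning for\b"), "running"),
    (re.compile(r"\binterrupted after\b"), "halted"),
]


def scrub_state(status_output):
    """Return a one-word state taken from the 'scrub started at' line only."""
    for line in status_output.splitlines():
        line = line.strip()
        if not line.startswith("scrub started at"):
            continue
        for pattern, label in _STATE_PATTERNS:
            if pattern.search(line):
                return label
        return "unknown"
    return "unknown"


sample = """scrub status for d5829eb2-f19f-4329-83a4-c71261a270bf
scrub started at Fri Aug 11 09:00:02 2017 and was aborted after 00:12:44
total bytes scrubbed: 72.97GiB with 28 errors
"""
print(scrub_state(sample))  # cancelled
```

Because only that one line is inspected, the error counters on the following lines never influence the reported status.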

Out of interest the actual data collected is from the slightly more verbose command:

btrfs scrub status -R /mnt2/pool-name-here
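
As a rough sketch of how the counters in that `-R` output could be collected and reduced to the single word summary suggested above: the function names and the rule for deciding between OK and Errors are hypothetical, not Rockstor's code. The sample is an abridged copy of the output posted in this thread.

```python
def parse_scrub_counters(raw_output):
    """Collect the 'name: integer' lines of `btrfs scrub status -R` output."""
    counters = {}
    for line in raw_output.splitlines():
        name, sep, value = line.strip().partition(": ")
        if sep and value.strip().isdigit():
            counters[name] = int(value)
    return counters


def one_word_summary(counters):
    """Hypothetical rule: any non-zero *_errors counter, other than
    corrected_errors, is reported as 'Errors'; otherwise 'OK'."""
    for name, value in counters.items():
        if name.endswith("_errors") and name != "corrected_errors" and value > 0:
            return "Errors"
    return "OK"


# Abridged from the output posted in this thread.
sample = """scrub status for d5829eb2-f19f-4329-83a4-c71261a270bf
        scrub started at Fri Aug 11 09:00:02 2017 and was aborted after 00:12:44
        data_extents_scrubbed: 1171570
        read_errors: 24
        csum_errors: 4
        uncorrectable_errors: 28
        corrected_errors: 0
        last_physical: 118165733376
"""
counters = parse_scrub_counters(sample)
print(one_word_summary(counters), counters["uncorrectable_errors"])  # Errors 28
```

In this output read_errors (24) plus csum_errors (4) equals uncorrectable_errors (28), so none of the detected errors were repaired.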

Thanks for helping to bring this pending refinement to light again and any comments on how this additional info might be presented, re the above table enhancement example, would be welcome.

Hope that helps.

@phillxnet Is it then to be expected that it will be cancelled again next Friday when the scrub task is scheduled?

I’m getting somewhat nervous with all the errors showing up on my system.
It would be good if the Rockstor GUI showed more pop-ups when errors occur, especially disk related ones.

What do these errors mean - is one of the disks starting to fail?

[root@rockstor ~]# btrfs scrub status -R /mnt2/rm_pool
scrub status for d5829eb2-f19f-4329-83a4-c71261a270bf
        scrub started at Fri Aug 11 09:00:02 2017 and was aborted after 00:12:44
        data_extents_scrubbed: 1171570
        tree_extents_scrubbed: 100936
        data_bytes_scrubbed: 76696780800
        tree_bytes_scrubbed: 1653735424
        read_errors: 24
        csum_errors: 4
        verify_errors: 0
        no_csum: 0
        csum_discards: 0
        super_errors: 0
        malloc_errors: 0
        uncorrectable_errors: 28
        unverified_errors: 0
        corrected_errors: 0
        last_physical: 118165733376

@phillxnet
With the latest testing release 3.9.1-7 it no longer shows cancelled; instead it shows error.

It should run again tomorrow at 9am, let’s see if the message changes.

How can I get more information regarding

uncorrectable_errors: 28

What can I do about those?
Which disk has an error?
Should I change a disk?

thx

@glenngould Thanks for the update, it might be useful to have another output pasted here from the last run, ie another:

 btrfs scrub status -R /mnt2/rm_pool

That is before your scheduled scrub runs again.

Best look in your logs at around the time of that scrub.

System - Logs Manager - Rockstor Logs & Dmesg (Kernel)

These may well point to the problem drive. The errors are reported at the pool level, but the logs will most likely inform you of a drive to ‘blame’. Not necessarily though, as the error could simply be down to a corruption that occurred for other reasons, such as faulty RAM. You could run memtest86+ as instructed in the Pre-Install Best Practice (PBP), though note the warning there.

Re your three questions, in order: 1) find the cause; 2) that depends on the cause (see the above comment re logs); 3) only if one is faulty.

More on 3): You could run a S.M.A.R.T. self test on each of your disks, but note that this does stress them. Given your pool is now ‘poorly’, your backups may well come into play here, so make sure you have them lined up ready, as a SMART self test could push a drive into full failure. But obviously the cause is the first thing to look into.

Hope that helps.

@phillxnet
The old output of

btrfs scrub status -R /mnt2/rm_pool

was posted before - see 2 posts above.
The new output from today is here - same number of errors:

[root@rockstor ~]# btrfs scrub status -R /mnt2/rm_pool
    scrub status for d5829eb2-f19f-4329-83a4-c71261a270bf
            scrub started at Fri Aug 18 07:50:01 2017 and was aborted after 00:13:34
            data_extents_scrubbed: 1171564
            tree_extents_scrubbed: 75978
            data_bytes_scrubbed: 76696780800
            tree_bytes_scrubbed: 1244823552
            read_errors: 24
            csum_errors: 4
            verify_errors: 0
            no_csum: 0
            csum_discards: 0
            super_errors: 0
            malloc_errors: 0
            uncorrectable_errors: 28
            unverified_errors: 0
            corrected_errors: 0
            last_physical: 118165733376

This is what is in the Dmesg (Kernel) log file:

[    0.999427] systemd[1]: systemd 219 running in system mode. (+PAM +AUDIT +SELINUX +IMA -APPARMOR +SMACK +SYSVINIT +UTMP +LIBCRYPTSETUP +GCRYPT +GNUTLS +ACL +XZ -LZ4 -SECCOMP +BLKID +ELFUTILS +KMOD +IDN)
[    0.999561] systemd[1]: Detected architecture x86-64.
[    0.999563] systemd[1]: Running in initial RAM disk.
[    0.999577] systemd[1]: Set hostname to .
[    1.015459] random: systemd: uninitialized urandom read (16 bytes read)
[    1.015473] random: systemd: uninitialized urandom read (16 bytes read)
[    1.015494] random: systemd: uninitialized urandom read (16 bytes read)
[    1.015531] random: systemd: uninitialized urandom read (16 bytes read)
[    1.015940] random: systemd: uninitialized urandom read (16 bytes read)
[    1.016016] random: systemd: uninitialized urandom read (16 bytes read)
[    1.016127] random: systemd: uninitialized urandom read (16 bytes read)
[    1.018438] systemd[1]: Reached target Swap.
[    1.018445] systemd[1]: Starting Swap.
[    1.018460] systemd[1]: Reached target Local File Systems.
[    1.018465] systemd[1]: Starting Local File Systems.
[    1.018474] systemd[1]: Reached target Timers.
[    1.018478] systemd[1]: Starting Timers.
[    1.128531] pps_core: LinuxPPS API ver. 1 registered
[    1.128532] pps_core: Software ver. 5.3.6 - Copyright 2005-2007 Rodolfo Giometti 
[    1.128820] PTP clock support registered
[    1.129950] tg3.c:v3.137 (May 11, 2014)
[    1.131157] libata version 3.00 loaded.
[    1.131970] ahci 0000:00:1f.2: version 3.0
[    1.132095] ahci 0000:00:1f.2: SSS flag set, parallel bus scan disabled
[    1.142266] ahci 0000:00:1f.2: AHCI 0001.0300 32 slots 6 ports 6 Gbps 0x3f impl SATA mode
[    1.142270] ahci 0000:00:1f.2: flags: 64bit ncq sntf ilck stag pm led clo pmp pio slum part ems apst 
[    1.143454] tg3 0000:03:00.0 eth0: Tigon3 [partno(N/A) rev 5720000] (PCI Express) MAC address 34:64:a9:9a:88:a0
[    1.143455] tg3 0000:03:00.0 eth0: attached PHY is 5720C (10/100/1000Base-T Ethernet) (WireSpeed[1], EEE[1])
[    1.143456] tg3 0000:03:00.0 eth0: RXcsums[1] LinkChgREG[0] MIirq[0] ASF[1] TSOcap[1]
[    1.143457] tg3 0000:03:00.0 eth0: dma_rwctrl[00000001] dma_mask[64-bit]
[    1.149600] [TTM] Zone  kernel: Available graphics memory: 5096024 kiB
[    1.149601] [TTM] Zone   dma32: Available graphics memory: 2097152 kiB
[    1.149602] [TTM] Initializing pool allocator
[    1.149606] [TTM] Initializing DMA pool allocator
[    1.155685] tg3 0000:03:00.1 eth1: Tigon3 [partno(N/A) rev 5720000] (PCI Express) MAC address 34:64:a9:9a:88:a1
[    1.155689] tg3 0000:03:00.1 eth1: attached PHY is 5720C (10/100/1000Base-T Ethernet) (WireSpeed[1], EEE[1])
[    1.155690] tg3 0000:03:00.1 eth1: RXcsums[1] LinkChgREG[0] MIirq[0] ASF[1] TSOcap[1]
[    1.155692] tg3 0000:03:00.1 eth1: dma_rwctrl[00000001] dma_mask[64-bit]
[    1.158046] scsi host0: ahci
[    1.159771] scsi host1: ahci
[    1.161862] scsi host2: ahci
[    1.162581] scsi host3: ahci
[    1.162676] scsi host4: ahci
[    1.163077] scsi host5: ahci
[    1.163106] ata1: SATA max UDMA/133 abar m2048@0xfacd0000 port 0xfacd0100 irq 35
[    1.163108] ata2: SATA max UDMA/133 abar m2048@0xfacd0000 port 0xfacd0180 irq 35
[    1.163109] ata3: SATA max UDMA/133 abar m2048@0xfacd0000 port 0xfacd0200 irq 35
[    1.163110] ata4: SATA max UDMA/133 abar m2048@0xfacd0000 port 0xfacd0280 irq 35
[    1.163111] ata5: SATA max UDMA/133 abar m2048@0xfacd0000 port 0xfacd0300 irq 35
[    1.163112] ata6: SATA max UDMA/133 abar m2048@0xfacd0000 port 0xfacd0380 irq 35
[    1.172796] fbcon: mgadrmfb (fb0) is primary device
[    1.199203] tg3 0000:03:00.0 eno1: renamed from eth0
[    1.271388] usb 1-1: new high-speed USB device number 2 using ehci-pci
[    1.279379] usb 2-1: new high-speed USB device number 2 using ehci-pci
[    1.281752] tg3 0000:03:00.1 eno2: renamed from eth1
[    1.327827] Console: switching to colour frame buffer device 128x48
[    1.358451] mgag200 0000:01:00.1: fb0: mgadrmfb frame buffer device
[    1.390239] usb 1-1: New USB device found, idVendor=8087, idProduct=0024
[    1.390240] usb 1-1: New USB device strings: Mfr=0, Product=0, SerialNumber=0
[    1.390647] hub 1-1:1.0: USB hub found
[    1.390850] hub 1-1:1.0: 6 ports detected
[    1.391551] [drm] Initialized mgag200 1.0.0 20110418 for 0000:01:00.1 on minor 0
[    1.422009] usb 2-1: New USB device found, idVendor=8087, idProduct=0024
[    1.422010] usb 2-1: New USB device strings: Mfr=0, Product=0, SerialNumber=0
[    1.422147] hub 2-1:1.0: USB hub found
[    1.422281] hub 2-1:1.0: 6 ports detected
[    1.623399] ata1: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[    1.624445] ata1.00: ATA-8: SanDisk SDSSDHP128G, X2316RL, max UDMA/133
[    1.624447] ata1.00: 250069680 sectors, multi 1: LBA48 NCQ (depth 31/32), AA
[    1.625543] ata1.00: configured for UDMA/133
[    1.625812] scsi 0:0:0:0: Direct-Access     ATA      SanDisk SDSSDHP1 6RL  PQ: 0 ANSI: 5
[    1.695390] usb 1-1.5: new low-speed USB device number 3 using ehci-pci
[    1.695393] tsc: Refined TSC clocksource calibration: 3392.293 MHz
[    1.695403] clocksource: tsc: mask: 0xffffffffffffffff max_cycles: 0x30e5de2a436, max_idle_ns: 440795285127 ns
[    1.695427] usb 2-1.3: new high-speed USB device number 3 using ehci-pci
[    1.773802] usb 2-1.3: New USB device found, idVendor=0424, idProduct=2660
[    1.773805] usb 2-1.3: New USB device strings: Mfr=0, Product=0, SerialNumber=0
[    1.774027] hub 2-1.3:1.0: USB hub found
[    1.774115] hub 2-1.3:1.0: 2 ports detected
[    2.103384] ata2: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[    2.104063] ata2.00: ATA-9: WDC WD40EFRX-68WT0N0, 82.00A82, max UDMA/133
[    2.104066] ata2.00: 7814037168 sectors, multi 0: LBA48 NCQ (depth 31/32), AA
[    2.104710] ata2.00: configured for UDMA/133
[    2.105020] scsi 1:0:0:0: Direct-Access     ATA      WDC WD40EFRX-68W 0A82 PQ: 0 ANSI: 5
[    2.575374] ata3: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
[    2.576010] ata3.00: ATA-9: WDC WD40EFRX-68WT0N0, 82.00A82, max UDMA/133
[    2.576013] ata3.00: 7814037168 sectors, multi 0: LBA48 NCQ (depth 31/32), AA
[    2.576646] ata3.00: configured for UDMA/133
[    2.576871] scsi 2:0:0:0: Direct-Access     ATA      WDC WD40EFRX-68W 0A82 PQ: 0 ANSI: 5
[    2.719556] clocksource: Switched to clocksource tsc
[    3.047357] ata4: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
[    3.047985] ata4.00: ATA-9: WDC WD40EFRX-68WT0N0, 82.00A82, max UDMA/133
[    3.047987] ata4.00: 7814037168 sectors, multi 0: LBA48 NCQ (depth 31/32), AA
[    3.048634] ata4.00: configured for UDMA/133
[    3.048837] scsi 3:0:0:0: Direct-Access     ATA      WDC WD40EFRX-68W 0A82 PQ: 0 ANSI: 5
[    3.364691] ata5: SATA link down (SStatus 0 SControl 300)
[    3.638355] usb 1-1.5: New USB device found, idVendor=0463, idProduct=ffff
[    3.638356] usb 1-1.5: New USB device strings: Mfr=1, Product=2, SerialNumber=4
[    3.638357] usb 1-1.5: Product: Ellipse ECO
[    3.638357] usb 1-1.5: Manufacturer: EATON
[    3.638358] usb 1-1.5: SerialNumber: 000000000
[    3.668700] ata6: SATA link down (SStatus 0 SControl 300)
[    3.675865] sd 0:0:0:0: [sda] 250069680 512-byte logical blocks: (128 GB/119 GiB)
[    3.675867] sd 2:0:0:0: [sdc] 7814037168 512-byte logical blocks: (4.00 TB/3.64 TiB)
[    3.675869] sd 2:0:0:0: [sdc] 4096-byte physical blocks
[    3.675872] sd 1:0:0:0: [sdb] 7814037168 512-byte logical blocks: (4.00 TB/3.64 TiB)
[    3.675874] sd 1:0:0:0: [sdb] 4096-byte physical blocks
[    3.675879] sd 0:0:0:0: [sda] Write Protect is off
[    3.675880] sd 2:0:0:0: [sdc] Write Protect is off
[    3.675881] sd 0:0:0:0: [sda] Mode Sense: 00 3a 00 00
[    3.675882] sd 2:0:0:0: [sdc] Mode Sense: 00 3a 00 00
[    3.675886] sd 1:0:0:0: [sdb] Write Protect is off
[    3.675888] sd 1:0:0:0: [sdb] Mode Sense: 00 3a 00 00
[    3.675901] sd 2:0:0:0: [sdc] Write cache: disabled, read cache: enabled, doesn't support DPO or FUA
[    3.675907] sd 1:0:0:0: [sdb] Write cache: disabled, read cache: enabled, doesn't support DPO or FUA
[    3.675910] sd 3:0:0:0: [sdd] 7814037168 512-byte logical blocks: (4.00 TB/3.64 TiB)
[    3.675912] sd 3:0:0:0: [sdd] 4096-byte physical blocks
[    3.675914] sd 0:0:0:0: [sda] Write cache: disabled, read cache: enabled, doesn't support DPO or FUA
[    3.675921] sd 3:0:0:0: [sdd] Write Protect is off
[    3.675923] sd 3:0:0:0: [sdd] Mode Sense: 00 3a 00 00
[    3.675939] sd 3:0:0:0: [sdd] Write cache: disabled, read cache: enabled, doesn't support DPO or FUA
[    3.680219]  sda: sda1 sda2 sda3
[    3.680843] sd 0:0:0:0: [sda] Attached SCSI disk
[    3.686764] sd 3:0:0:0: [sdd] Attached SCSI disk
[    3.688025] sd 1:0:0:0: [sdb] Attached SCSI disk
[    3.701409] sd 2:0:0:0: [sdc] Attached SCSI disk
[    3.708383] random: fast init done
[    6.947141] hid-generic 0003:0463:FFFF.0001: hiddev96,hidraw0: USB HID v10.10 Device [EATON Ellipse ECO] on usb-0000:00:1a.0-1.5/input0
[    6.984238] raid6: sse2x1   gen()  9875 MB/s
[    7.001239] raid6: sse2x1   xor()  7810 MB/s
[    7.018235] raid6: sse2x2   gen() 12605 MB/s
[    7.035238] raid6: sse2x2   xor()  8886 MB/s
[    7.052236] raid6: sse2x4   gen() 14734 MB/s
[    7.069235] raid6: sse2x4   xor() 10912 MB/s
[    7.069235] raid6: using algorithm sse2x4 gen() 14734 MB/s
[    7.069236] raid6: .... xor() 10912 MB/s, rmw enabled
[    7.069236] raid6: using ssse3x2 recovery algorithm
[    7.069570] xor: automatically using best checksumming function   avx       
[    7.073685] Btrfs loaded, crc32c=crc32c-intel
[    7.074125] BTRFS: device label rockstor_rockstor devid 1 transid 1151796 /dev/sda3
[    7.074649] BTRFS info (device sda3): disk space caching is enabled
[    7.083992] BTRFS info (device sda3): detected SSD devices, enabling SSD mode
[    7.290582] systemd-journald[193]: Received SIGTERM from PID 1 (systemd).
[    7.304812] systemd: 24 output lines suppressed due to ratelimiting
[    7.338910] SELinux:  Disabled at runtime.
[    7.338922] SELinux:  Unregistering netfilter hooks
[    7.353249] audit: type=1404 audit(1501958567.859:2): selinux=0 auid=4294967295 ses=4294967295
[    7.373758] ip_tables: (C) 2000-2006 Netfilter Core Team
[    7.373802] systemd[1]: Inserted module 'ip_tables'
[    7.445316] BTRFS info (device sda3): disk space caching is enabled
[    7.445929] RPC: Registered named UNIX socket transport module.
[    7.445930] RPC: Registered udp transport module.
[    7.445930] RPC: Registered tcp transport module.
[    7.445931] RPC: Registered tcp NFSv4.1 backchannel transport module.
[    7.466444] systemd-journald[1272]: Received request to flush runtime journal from PID 1
[    7.468285] Installing knfsd (copyright (C) 1996 okir@monad.swb.de).
[    7.520193] input: PC Speaker as /devices/platform/pcspkr/input/input4
[    7.520895] shpchp: Standard Hot Plug PCI Controller Driver version: 0.4
[    7.521843] EDAC MC0: Giving out device to module ie31200_edac controller IE31200: DEV 0000:00:00.0 (POLLED)
[    7.522741] pcc-cpufreq: (v1.10.00) driver loaded with frequency limits: 2128 MHz, 3400 MHz
[    7.526388] power_meter ACPI000D:00: Found ACPI power meter.
[    7.526406] power_meter ACPI000D:00: Ignoring unsafe software power cap!
[    7.526410] power_meter ACPI000D:00: hwmon_device_register() is deprecated. Please convert the driver to use hwmon_device_register_with_info().
[    7.527237] ACPI Warning: SystemIO range 0x0000000000000928-0x000000000000092F conflicts with OpRegion 0x0000000000000920-0x000000000000092F (\SGPE) (20170303/utaddress-247)
[    7.527243] ACPI: If an ACPI driver is available for this device, you should use it instead of the native driver
[    7.527264] lpc_ich: Resource conflict(s) found affecting gpio_ich
[    7.539045] hpwdt 0000:01:00.0: HPE Watchdog Timer Driver: NMI decoding initialized, allow kernel dump: ON (default = 1/ON)
[    7.539090] hpwdt 0000:01:00.0: HPE Watchdog Timer Driver: 1.4.0, timer margin: 30 seconds (nowayout=0).
[    7.540346] ipmi message handler version 39.2
[    7.541174] ipmi device interface
[    7.542542] ipmi_si IPI0001:00: ipmi_si: probing via ACPI
[    7.542567] ipmi_si IPI0001:00: [io  0x0ca2-0x0ca3] regsize 1 spacing 1 irq 0
[    7.542568] ipmi_si: Adding ACPI-specified kcs state machine
[    7.542593] IPMI System Interface driver.
[    7.542610] ipmi_si: probing via SMBIOS
[    7.542611] ipmi_si: SMBIOS: io 0xca2 regsize 1 spacing 1 irq 0
[    7.542612] ipmi_si: SMBIOS-specified kcs state machine: duplicate
[    7.542613] ipmi_si: probing via SPMI
[    7.542614] ipmi_si: SPMI: io 0xca2 regsize 2 spacing 2 irq 0
[    7.542615] ipmi_si: SPMI-specified kcs state machine: duplicate
[    7.542616] ipmi_si: Trying ACPI-specified kcs state machine at i/o address 0xca2, slave address 0x20, irq 0
[    7.549092] sd 0:0:0:0: Attached scsi generic sg0 type 0
[    7.549129] sd 1:0:0:0: Attached scsi generic sg1 type 0
[    7.549179] sd 2:0:0:0: Attached scsi generic sg2 type 0
[    7.549211] sd 3:0:0:0: Attached scsi generic sg3 type 0
[    7.588131] RAPL PMU: API unit is 2^-32 Joules, 3 fixed counters, 163840 ms ovfl timer
[    7.588132] RAPL PMU: hw unit of domain pp0-core 2^-16 Joules
[    7.588132] RAPL PMU: hw unit of domain package 2^-16 Joules
[    7.588133] RAPL PMU: hw unit of domain pp1-gpu 2^-16 Joules
[    7.598526] AVX version of gcm_enc/dec engaged.
[    7.598527] AES CTR mode by8 optimization enabled
[    7.609069] Adding 5111804k swap on /dev/sda2.  Priority:-1 extents:1 across:5111804k SSFS
[    7.625827] iTCO_vendor_support: vendor-support=0
[    7.627885] iTCO_wdt: Intel TCO WatchDog Timer Driver v1.11
[    7.628321] gpio_ich: GPIO from 436 to 511 on gpio_ich
[    7.628360] iTCO_wdt: unable to reset NO_REBOOT flag, device disabled by hardware/BIOS
[    7.640412] alg: No test for pcbc(aes) (pcbc-aes-aesni)
[    7.642057] EXT4-fs (sda1): mounted filesystem with ordered data mode. Opts: (null)
[    7.714717] ipmi_si IPI0001:00: Found new BMC (man_id: 0x00000b, prod_id: 0x2000, dev_id: 0x13)
[    7.714767] ipmi_si IPI0001:00: IPMI kcs interface initialized

The Rockstor log file does not show any entry from today.

I’ve performed a SMART conveyance test once (not a full test), but it gave no hint of any error or damage.

@glenngould

Yes the cancelled report here is as intended, i.e. “aborted” translates to cancelled. So currently your report from the Rockstor side is as intended.

Can’t see anything in your dmesg log so it looks like you are going to have to resort to the:

journalctl

command to find those logs for more info on the scrub error cause / scope.

Let us know how it goes. ‘As is’ your issue is now in the realm of btrfs as Rockstor is reporting the last scrub as intended: given the current first line analysis only arrangement.

S.M.A.R.T short tests are often not that helpful. The long ones do a full scan of the surface (roughly) so often show up more info but also stress the drive and can take ages.

Have you tried the memtest86+ suggestion? Note that this is stressful on system cooling.

You should receive the same result if you do a manually initiated scrub so that might help with tracking down log entries, ie do the scrub and watch the logs.

the journalctl equivalent of a log tail is:

journalctl -f
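
To narrow the journal down to kernel messages around the time of a scrub, something along these lines may help. It is only a sketch: the helper name and the size of the time window are arbitrary choices, while `-k`, `--no-pager`, `--since` and `--until` are standard journalctl options. The start time used is the one from the latest scrub status output above.

```python
from datetime import datetime, timedelta


def kernel_journal_cmd(scrub_start, minutes_after=30):
    """Build a journalctl command for kernel messages around scrub_start."""
    fmt = "%Y-%m-%d %H:%M:%S"
    since = scrub_start - timedelta(minutes=5)
    until = scrub_start + timedelta(minutes=minutes_after)
    return [
        "journalctl",
        "-k",          # kernel messages only; this also limits output to the current boot
        "--no-pager",  # print everything instead of opening the pager
        "--since", since.strftime(fmt),
        "--until", until.strftime(fmt),
    ]


# Start time taken from the scrub status output posted in this thread.
cmd = kernel_journal_cmd(datetime(2017, 8, 18, 7, 50, 1))
print(" ".join(cmd))

# To run it on the Rockstor machine itself (as root), for example:
#   import subprocess
#   print(subprocess.run(cmd, capture_output=True, text=True).stdout)
```

Note that the printed line is for reading only; if typed into a shell by hand, the two timestamps need quoting because they contain a space.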

Hope that helps.

@phillxnet thanks again for your prompt reply!

journalctl

gives an endless list of…I stopped after 100.000 lines…every day and every minute the same errors.

Aug 06 17:53:52 rockstor dockerd[13270]: Starting Plex Media Server.
Aug 06 17:53:52 rockstor kernel: BTRFS warning (device sdc): csum failed root 897 ino 266 off 52584448 csum 0xe4014457 expected csum 0x72754285 mirror 1
Aug 06 17:53:52 rockstor kernel: BTRFS warning (device sdc): csum failed root 897 ino 266 off 52584448 csum 0xe4014457 expected csum 0x72754285 mirror 2
Aug 06 17:53:52 rockstor kernel: BTRFS warning (device sdc): csum failed root 897 ino 266 off 52584448 csum 0xe4014457 expected csum 0x72754285 mirror 1
Aug 06 17:53:52 rockstor kernel: BTRFS warning (device sdc): csum failed root 897 ino 266 off 52584448 csum 0xe4014457 expected csum 0x72754285 mirror 2
Aug 06 17:53:53 rockstor dockerd[13270]: Aug  6 17:53:53 bd507ffaba5e zma_m1[485]: INF [HIK: 1900000 - Analysing at 25.00 fps]
Aug 06 17:53:53 rockstor dockerd[13270]: Starting Plex Media Server.
Aug 06 17:53:53 rockstor kernel: BTRFS warning (device sdc): csum failed root 897 ino 266 off 52584448 csum 0xe4014457 expected csum 0x72754285 mirror 1
Aug 06 17:53:53 rockstor kernel: BTRFS warning (device sdc): csum failed root 897 ino 266 off 52584448 csum 0xe4014457 expected csum 0x72754285 mirror 2
Aug 06 17:53:53 rockstor kernel: BTRFS warning (device sdc): csum failed root 897 ino 266 off 52584448 csum 0xe4014457 expected csum 0x72754285 mirror 1
Aug 06 17:53:53 rockstor kernel: BTRFS warning (device sdc): csum failed root 897 ino 266 off 52584448 csum 0xe4014457 expected csum 0x72754285 mirror 2
Aug 06 17:53:54 rockstor dockerd[13270]: Starting Plex Media Server.
Aug 06 17:53:55 rockstor dockerd[13270]: Starting Plex Media Server.
Aug 06 17:53:56 rockstor dockerd[13270]: Starting Plex Media Server.
Aug 06 17:53:56 rockstor kernel: btrfs_print_data_csum_error: 10 callbacks suppressed
Aug 06 17:53:56 rockstor kernel: BTRFS warning (device sdc): csum failed root 897 ino 266 off 52584448 csum 0xe4014457 expected csum 0x72754285 mirror 1
Aug 06 17:53:56 rockstor kernel: BTRFS warning (device sdc): csum failed root 897 ino 266 off 52584448 csum 0xe4014457 expected csum 0x72754285 mirror 2
Aug 06 17:53:57 rockstor dockerd[13270]: Starting Plex Media Server.
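
With that many lines, a small script can condense the repeated warnings. The sketch below is only an illustration (the function name and the grouping key are my own choices); the regular expression follows the format of the lines pasted above, and the sample is taken from them.

```python
import re
from collections import Counter

# Matches the "csum failed" kernel warnings as they appear in the excerpt above.
CSUM_RE = re.compile(
    r"BTRFS warning \(device (?P<dev>\w+)\): csum failed "
    r"root (?P<root>\d+) ino (?P<ino>\d+) off (?P<off>\d+) "
    r"csum (?P<found>0x[0-9a-f]+) expected csum (?P<expected>0x[0-9a-f]+) "
    r"mirror (?P<mirror>\d+)"
)


def tally_csum_failures(journal_text):
    """Count warnings per (device, root, inode, offset)."""
    tally = Counter()
    for line in journal_text.splitlines():
        m = CSUM_RE.search(line)
        if m:
            key = (m["dev"], int(m["root"]), int(m["ino"]), int(m["off"]))
            tally[key] += 1
    return tally


sample = """Aug 06 17:53:52 rockstor dockerd[13270]: Starting Plex Media Server.
Aug 06 17:53:52 rockstor kernel: BTRFS warning (device sdc): csum failed root 897 ino 266 off 52584448 csum 0xe4014457 expected csum 0x72754285 mirror 1
Aug 06 17:53:52 rockstor kernel: BTRFS warning (device sdc): csum failed root 897 ino 266 off 52584448 csum 0xe4014457 expected csum 0x72754285 mirror 2
"""
for key, count in tally_csum_failures(sample).items():
    print(key, count)  # ('sdc', 897, 266, 52584448) 2
```

In the excerpt every warning carries the same root (subvolume ID), ino (inode number) and off (offset), so it looks like one block of one file being read over and over rather than many separate bad blocks. Both mirror 1 and mirror 2 appear, so the excerpt alone may not single out one drive.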

It seems there is always (only) drive (sdc) involved - so maybe this one causes the errors?

Is there a way (command) to jump to the end of journalctl output?

Maybe this is something to consider for future enhancements of Rockstor and the GUI - that any error related to hard disks is flagged visually in the Rockstor GUI, as it is important for data safety!

I’ve not tried the memory test so far due to time constraints.

thx

@glenngould

Ok, so we are getting somewhere, great.
if you do a:

ls -la /dev/disk/by-id/

you will see which device model / serial is currently sdc. N.B. the sdc type names can change from boot to boot but in your case seems to have remained. Just be sure to find the latest report.
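
The same lookup can be scripted. This is a small sketch, assuming the usual layout where the entries in /dev/disk/by-id are symlinks to the kernel device nodes; the function name is made up for the example.

```python
import os


def by_id_names(kernel_name, by_id_dir="/dev/disk/by-id"):
    """Return the by-id link names whose target is /dev/<kernel_name>."""
    names = []
    for entry in sorted(os.listdir(by_id_dir)):
        target = os.path.realpath(os.path.join(by_id_dir, entry))
        if os.path.basename(target) == kernel_name:
            names.append(entry)
    return names


if __name__ == "__main__":
    # Only meaningful on a Linux machine that has this directory.
    if os.path.isdir("/dev/disk/by-id"):
        for name in by_id_names("sdc"):
            print(name)
```

Partition links (for example those pointing at sdc1) are not listed, since only an exact match on the device name is kept.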

To jump to the end of the journalctl output, capital ‘G’ should do it, and little ‘g’ takes you to the beginning, same as with the less command.

man journalctl

Re GUI alerts for disk errors: definitely, and we have an open issue where some discussion has already taken place on what ‘backend’ system we might use for this.

The consensus currently seems to be to integrate logcheck as suggested by @maxhq. You will find some nice instructions re this config in that issue. This issue is part of the current milestone so hopefully it should come to some sort of fruition soon. It is definitely appreciated as a key component, so it needs to be done right.

Hope that helps.

@phillxnet
I’ve identified the right disk ID based on the command you provided and performed an extended SMART test, but again no error was found.

With the errors being reported by the btrfs scrub itself, as an amateur I would expect the file system to be able to tell me more details (what is affected) and a possible root cause, but maybe that is wishful thinking :wink:

@glenngould

Might be worth doing the same on the other disks in the pool, but remember that this does stress them so only do this if you already have your data backed up / or don’t care about it of course. And remember about the memtest86+ thing as all that you do on potentially dodgy hardware will increase the likelihood of further damage.

Re the file system telling you more: that’s the hope. I’d look again at your logs around the time a scrub was performed.

Hope that helps.

@phillxnet Just a quick update from my end.
I’ve now performed an extended SMART test on all 3 disks in my “rm_pool” - no errors on any of them.

Can I assume that none of my disks have any issues?
Any other test I can or should perform to be sure?

The remaining test to perform is the RAM / memory test you indicated earlier - when my time allows :grinning:

Thx

Hello, any further updates on this topic? I am running into the same issue with errors on my scheduled scrubs. Any help is greatly appreciated!

Thanks,

Rich

@rbcross62 Welcome to the Rockstor community.

Have you tried deleting and re-creating the scheduled scrub task itself? This usually sorts the problem, as we had an API URL change way back. It depends on the Rockstor version you are using, but in most cases simply updating should sort things. If not, then try the task re-creation. The stable updates channel received a few fixes that help with migrating older tasks that for a while failed to run as a result of the API changes.

Hope that helps.

Honestly, on my side this error is still ongoing: the scrub task is being cancelled right after start. Even if this has to do with the 28 uncorrectable errors found, the lack of any SMART errors does not make me feel confident at all :thinking:

@glenngould

What version of Rockstor are you running and what is the output of the following command run as root:

btrfs scrub status -R /mnt2/<pool-name-here>

as that is the command Rockstor uses to assess the running or last run scrub task.

And the code that parses the output of that command is at:

It may well be that your last scrub is presenting a report that our parser is confused by.

Let’s see what that output looks like before trying anything else.

Hope that helps.