BTRFS Error (device sda) bdev/sda errs: wr 262

Rockstor version 3.92-15
Linux: 4.12.4-1.el7.elrepo.x86_64
I am using a HW RAID controller on the backend, an Areca ARC-1680. The array shows no errors, with all 24 bays populated w/ 1TB drives in a 20TB RAID 6 volume.
I use Rockstor to serve NFS mounts to my vSphere 6.5 environment.
All of a sudden my 20TB storage pool shows its active mount options as ro,relatime,space_cache,subvolid=5,subvol=/
When I console into my Rockstor server I see BTRFS Error (device sda) bdev/sda errs: wr 262 and rejecting I/O to offline devices.
Needless to say, I cannot do anything with the servers that are living on Rockstor.
Could someone please let me know what this is all about?

A scrub on the pool shows errors. What is going on with BTRFS, given that the controller verifies the disks are all fine?

Pool name: RAID6_VOLUME
Scrub Status: cancelled
Scrub Statistics
Attribute Value
ID 9
Start Time February 18th 2018, 2:52:52 pm
End Time February 18th 2018, 4:27:41 pm
Data Scrubbed 3.35 TB
Data Extents Scrubbed 61707837
Tree Extents Scrubbed 759296
Tree Bytes Scrubbed 12440305664
Read Errors 2336
Csum Errors 0
Verify Errors 0
No Csum 158016
Csum Discards 0
Super Errors 0
Malloc Errors 0
Uncorrectable Errors 2336
Unverified Errors 0
Corrected Errors 0
Last Physical 3700956856320

@dhaman Hello again and thanks for starting a new thread.

When btrfs encounters errors that it can’t deal with automatically it will often go ro (read only). This is further evidenced by your scrub failing to complete with a number of errors. So it would seem that your raid6 pool is poorly and in need of repair. This is not a Rockstor-specific issue, as it relates to the generic case of repairing a btrfs raid6 pool. The wr (write) errors you mention are most likely due to the pool going ro.

Hope that helps.

How could I find out what the errors actually are? What log contains that information? As per my HW controller, all devices are fine.

The problem with it forcing me into RO mode is that I cannot vMotion VMs off the Rockstor storage onto anything else.

@dhaman

You can probably check dmesg for errors, however BTRFS as a general rule is not the most verbose logging filesystem I’ve dealt with.
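As a rough sketch of that check: the kernel log can be filtered for btrfs and block-layer messages, and `btrfs device stats` prints the per-device error counters that a message like `errs: wr 262` refers to. The helper names and the grep keyword list below are my own assumptions, not an exhaustive or official set, and the mount point is assumed to be /mnt2/RAID6_VOLUME.

```shell
# Keep only kernel log lines that mention btrfs or block-layer I/O failures.
# The keyword list is an assumption; widen it as needed.
filter_storage_errors() {
    grep -iE 'btrfs|i/o error|rejecting i/o|offline device'
}

# Keep only btrfs device-stats counters that are not zero.
# Input lines look like: [/dev/sda].write_io_errs   262
nonzero_counters() {
    awk 'NF == 2 && $2 + 0 != 0'
}

# On the affected machine (both typically need root):
#   dmesg | filter_storage_errors
#   btrfs device stats /mnt2/RAID6_VOLUME | nonzero_counters
```

If the filtered log shows SCSI or block-layer errors against sda alongside the btrfs lines, that would point at the device the controller presents rather than at btrfs itself.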

Are you using HW RAID6 on the ARC controller, or SW RAID6 in BTRFS?
If you’re using the latter, there’s a reasonable chance that you’re a bit screwed as BTRFS parity RAID levels are considered unstable, and are often difficult to recover.

The device going RO is a deliberate act to prevent further data loss. You may be able to force read/write once with:

mount -o remount,rw /mnt2/RAID6_VOLUME

However this capability may be limited to RAID1/10 only.
Note that if it does work, be ready to do everything else you need to immediately, as it will likely not allow it again.
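Before and after trying the remount, it may help to confirm which state the kernel actually reports for the mount. A minimal sketch, assuming the pool is mounted at /mnt2/RAID6_VOLUME as in the command above; `mount_mode` is a hypothetical helper name, not a Rockstor or btrfs tool.

```shell
# Print "ro" or "rw" for a mount point, reading /proc/mounts-style text on stdin.
# Fields per line: device mountpoint fstype options dump pass.
# If the mount point appears more than once, the last entry wins.
mount_mode() {
    awk -v mp="$1" '
        $2 == mp {
            found = 1
            mode = "rw"
            n = split($4, opts, ",")
            for (i = 1; i <= n; i++) if (opts[i] == "ro") mode = "ro"
        }
        END { if (found) print mode }'
}

# On the affected machine:
#   mount_mode /mnt2/RAID6_VOLUME < /proc/mounts
```

If this still prints ro right after the remount, the remount did not take effect and dmesg should say why.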

I’m using HW raid 6 on my Areca controller.

Then I suggest you consult the Areca controller and smartmontools for disk status to see where the problem lies.

You can use the Linux Areca CLI tools to query information regarding your RAID health, as described in Thomas Krenn’s Wiki.
You can also query the individual disk via smartmontools, as follows:

smartctl -d areca,[DISK_NUMBER] -a /dev/[RAID_DEVICE]

I can’t really help you determining disk numbers and raid devices as I don’t have an Areca controller to test with.
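Since the disk numbers aren't known, one option is to walk every slot. The sketch below only prints the commands so they can be reviewed first. The device node /dev/sg2 is purely a placeholder assumption, `areca_smart_cmds` is a hypothetical helper name, and the slot count of 24 comes from the bay count mentioned earlier in the thread. smartctl also documents an `areca,N/E` form (slot/enclosure) for Areca SAS controllers, which may be what an ARC-1680 needs.

```shell
# Print one smartctl health query per disk slot behind an Areca controller.
# $1: controller device node (placeholder assumption: /dev/sg2; check your system)
# $2: number of slots to query (24 bays in this thread)
areca_smart_cmds() {
    dev="$1"
    slots="$2"
    i=1
    while [ "$i" -le "$slots" ]; do
        echo "smartctl -d areca,$i -H $dev"
        i=$((i + 1))
    done
}

# Review the printed commands, then run them as root, e.g.:
#   areca_smart_cmds /dev/sg2 24 | sh
```

Swapping `-H` for `-a` gives the full attribute dump per disk instead of just the health verdict.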

@dhaman As per @Haioken's comment:

Rockstor does have Disk Custom S.M.A.R.T Options and more specifically the ability to configure the smart commands used within the UI to be able to do S.M.A.R.T through Hardware RAID Controllers. It’s a little clunky and doesn’t fit all configurations but it is there; mainly for one-to-one disk configs so might not be relevant to your setup given your hw raid6 use. The example given in the docs pertains to a 3ware raid config provided by forum member @kcomer . But from a quick look at the link provided in the custom smart options Web-UI page it looks like the Areca stuff functions similarly. Just popping this in here for those using raid controllers. But again it’s intended for jbod configs really.

@phillxnet

Yup, I see the options you're talking about. However, as @dhaman is using a HW RAID6, will he still see the individual disks exposed in the UI, or only the RAID?

TBH, I haven’t played with HW RAID on Rockstor, so I can’t be sure, but I would’ve thought he’d only have seen the RAID exposed by the HW controller.

@Haioken

Only the single raid6 disk was my understanding, as per your:

but the smart command may still allow exposing at least a single disk’s smart info at a time.

I thought I’d just drop it in for those trying to get smart info through a raid but as per:

it’s more relevant when the individual disks are treated as such by the controller, and hence show through independently. Smart info through raid is a bit of a problem usually it seems hence the custom option/capability and the mention. But not very helpful on this thread unfortunately.

Hi,

wr errors can happen; imho, besides hw defects they are caused by power losses or such. Maybe this can HELP with step-by-step instructions on what tools to use for (maybe) correcting the errors.

good luck!