Time to Complete Scrubs and Balances

Balances
I’m relatively new to the world of NASes and I’ve chosen Rockstor as my OS. This morning, my system consisted of four 2 TB hard drives in RAID 10. I installed two more identical drives this morning at 8:15 AM, used the web interface to add them to my main pool, and started a balance. It is now 1:03 PM, nearly 5 hours later, and the “Percent finished” of the balance is still 0%. I have approximately 1.5 TB of data total. Is this normal?

Edit: The balance is complete now; it just didn’t report the percentage correctly. It went from 0% straight to 100%.
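In case it helps anyone else: the kernel’s own progress for a running balance can also be checked from the command line. This is a generic btrfs example, assuming the pool is mounted under /mnt2/<poolname> (which I believe is Rockstor’s default); the output below is only illustrative:

btrfs balance status /mnt2/main_pool
Balance on '/mnt2/main_pool' is running
3 out of about 12 chunks balanced (5 considered), 75% left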

Scrubs
I have a weekly scrub set up for every Friday night (my drives are refurbs with 3 years on them, so I’m being overly cautious). The task history shows the scrubs completing within 3 seconds. Is this normal? That can’t be the total scrub time; maybe it’s just the time it takes to initiate the scrub?

I’ve only had my system running for about 3 weeks now. One of the scheduled scrubs has a status of “Error” but I wasn’t able to figure out what the error was. Once I noticed the issue, I manually started a scrub. It appears to have taken about 70 minutes to run according to the log, and the status is “finished”, so I don’t think anything is really wrong.
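If anyone wants to double-check what the web UI reports, the scrub duration and error count can also be read straight from btrfs. The mount point below is an assumption, and the exact output format varies with the btrfs-progs version, so treat this as a sketch:

btrfs scrub status /mnt2/main_pool
scrub status for <pool UUID>
scrub started at <date> and finished after 4200 seconds
total bytes scrubbed: 1.50TiB with 0 errors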

SMART data (smartctl -i -a /dev/sdf | grep Rea) shows one drive with a Raw_Read_Error_Rate of 1, but 0 reallocated sectors on all drives. I think my drives are OK.


I noticed on my Rockstor that I would have to leave the pool section (say, go to the homepage) and then come back for the 0% to actually refresh.

Also, for me a scrub of 2.9 TB takes about 33 hours over a 3 Gb/s SAS connection, with 3 Gb/s hard drives on a SAS expander.

[root@nas ~]# smartctl -i -a /dev/sdf | grep Rea
1 Raw_Read_Error_Rate 0x002f 200 200 051 Pre-fail Always - 0
5 Reallocated_Sector_Ct 0x0033 200 200 140 Pre-fail Always - 0
196 Reallocated_Event_Count 0x0032 200 200 000 Old_age Always - 0
[root@nas ~]# smartctl -i -a /dev/sda | grep Rea
1 Raw_Read_Error_Rate 0x002f 100 100 051 Pre-fail Always - 97
5 Reallocated_Sector_Ct 0x0033 252 252 010 Pre-fail Always - 0
196 Reallocated_Event_Count 0x0032 252 252 000 Old_age Always - 0
[root@nas ~]# smartctl -i -a /dev/sdb | grep Rea
1 Raw_Read_Error_Rate 0x002f 100 100 051 Pre-fail Always - 2
5 Reallocated_Sector_Ct 0x0033 252 252 010 Pre-fail Always - 0
196 Reallocated_Event_Count 0x0032 252 252 000 Old_age Always - 0
[root@nas ~]# smartctl -i -a /dev/sdc | grep Rea
1 Raw_Read_Error_Rate 0x002f 100 100 051 Pre-fail Always - 346
5 Reallocated_Sector_Ct 0x0033 252 252 010 Pre-fail Always - 0
196 Reallocated_Event_Count 0x0032 252 252 000 Old_age Always - 0
[root@nas ~]# smartctl -i -a /dev/sdd | grep Rea
1 Raw_Read_Error_Rate 0x002f 100 100 051 Pre-fail Always - 269
5 Reallocated_Sector_Ct 0x0033 252 252 010 Pre-fail Always - 0
196 Reallocated_Event_Count 0x0032 252 252 000 Old_age Always - 0
[root@nas ~]# smartctl -i -a /dev/sde | grep Rea
1 Raw_Read_Error_Rate 0x002f 100 100 051 Pre-fail Always - 1399
5 Reallocated_Sector_Ct 0x0033 252 252 010 Pre-fail Always - 0
196 Reallocated_Event_Count 0x0032 252 252 000 Old_age Always - 0

Not sure how to read this, but this is what mine looks like for you to compare.

Thanks. I think the Reallocated_Sector_Ct is what you need to keep an eye on to catch a drive before a catastrophic failure. If it is ever non-zero, it indicates bad sectors and is often (but not always) a precursor to drive failure.

The Raw_Read_Error_Rate, from what I’ve read, indicates possible communication errors such as bad SATA cables or something of the sort. I don’t know what “normal” is, but I have 6 refurbished Hitachi drives, each with 30k hours on them, and only a single raw read error among them all.
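For what it’s worth, a quick way to keep an eye on those two attributes across all drives at once is a small shell loop. The /dev/sd{a..f} range is just an assumption based on my six drives; adjust it for your system:

for d in /dev/sd{a..f}; do echo "== $d =="; smartctl -A $d | grep -E 'Raw_Read_Error_Rate|Reallocated_Sector_Ct'; done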

Wow… looking at my short logs, that seems REALLY long. Here’s my logs since I built the system less than a month ago:

May 23 - 933 GB - 31 minutes
May 27 - 1.01 TB - 35 minutes
June 3 - 1.23 TB - 42 minutes
June 6 - 1.99 TB - 1 hour 9 minutes
June 10 - 2.45 TB - 1 hour 26 minutes

All of these scrubs are from when I only had four 2 TB SATA drives in RAID 10.

Yes, this needs to be improved. We plan to create appropriate issues and add them to our roadmap soon.

What’s the RAID level of the Pool?

The UI needs a bit of improvement, as I’ve indicated in my other reply. Regarding balance times: when we trigger a balance, btrfs essentially rewrites the entire filesystem, so the time it takes is proportional to the amount of data. The redundancy profile is another factor. We could improve this with smarter use of balance filters, something we are considering.
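As a rough illustration of what balance filters do (a generic btrfs example, not how Rockstor invokes balance today, and the mount point is assumed):

btrfs balance start -dusage=50 -musage=50 /mnt2/main_pool

The usage filters restrict the balance to data and metadata chunks that are at most 50% full, so only partially used chunks get rewritten instead of the whole filesystem. After adding disks you generally still want a full balance to spread existing data across the new devices, but filters like these can make routine maintenance balances much cheaper.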

RAID 6. I also cleaned up some data, so now it’s 1.9 TB in 20 hrs 42 min.

I was not concerned; I originally posted for comparison. But your stepping in does raise my eyebrow.

The main metric I was paying attention to was the sustained performance on the LAN which I am very happy with.

I just wanted to confirm it’s a RAID 5/6 Pool. This is expected behaviour. Hopefully the BTRFS developers will improve the performance soon. Also, I wonder how it compares to HW RAID 5/6 resilver times. It would be great if someone could provide a HW reference point as well.

On other RAID profiles, it should be much faster as mentioned elsewhere in this post.