Scheduled Scrub not running

So I had a scheduled scrub task from when I first built the VM; sadly, at some point it got stuck and I resumed it from the CLI.
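
For reference, resuming a stuck scrub from the CLI is done with something along these lines (the mount point is taken from the status output below):

btrfs scrub resume /mnt2/Pool1/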

The problem is that Rockstor thinks it’s still running and so doesn’t start a new one, meaning the pool hasn’t been scrubbed for nearly a year (Oops!).

[root@rock-home ~]# btrfs scrub status /mnt2/Pool1/
scrub status for 48e9fe04-9f49-4c33-b9d7-12aefe745674
scrub resumed at Fri Aug 19 22:10:20 2016 and finished after 90:42:26
total bytes scrubbed: 5.46TiB with 0 errors

423 | August 1st 2016, 3:00:00 am | August 1st 2016, 3:00:07 am | running

Is there any way to force Rockstor to show the job as complete?

Edit:

Going to delete the task and create a new one and see if that fixes it.

@Dragon2611 How did deleting the scheduled scrub task and creating a new one work out for you?

As per my post replying to @kbogert:

I seem to have omitted that in my communications re the fix:

Although you may be subject to a different bug: have you tried the force option within the UI?
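
For reference, btrfs itself has a force flag that starts a new scrub even when the status record claims one is still in progress, so a rough CLI equivalent (the Web-UI force option may not map directly onto this flag) would be:

btrfs scrub start -f /mnt2/Pool1/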

@Dragon2611 re:

Have a look at the following forum thread started by @Christian_Rost. Does your current issue look like that?

Cheers, just trying to narrow this one down.

Not sure yet, as I set the scrubs for the 1st and the 15th of the month; I don’t really want to schedule one for now as there’s a balance still running.

@Dragon2611 OK, no worries.

I’m having something similar but didn’t think to investigate. I’ll see if it’s the same issue, though in my case it’s scrubs not starting. I power down nightly but made sure scrubs start 30 minutes after wake.

@magicalyak Hello again:

Cheers that would be good.

Just try re-creating those tasks, as they then adopt the new pool API; this assumes you have all the recent testing channel updates in place, of course.

Deleted and re-added, still getting an error. These are the email contents from the failed cron scrub job:

Cron <root@rocky> /opt/rockstor//bin/st-pool-scrub 4 \*-*-*-*-*-*

Traceback (most recent call last):
  File "/usr/lib64/python2.7/logging/handlers.py", line 76, in emit
    if self.shouldRollover(record):
  File "/usr/lib64/python2.7/logging/handlers.py", line 154, in shouldRollover
    msg = "%s\n" % self.format(record)
  File "/usr/lib64/python2.7/logging/__init__.py", line 724, in format
    return fmt.format(record)
  File "/usr/lib64/python2.7/logging/__init__.py", line 464, in format
    record.message = record.getMessage()
  File "/usr/lib64/python2.7/logging/__init__.py", line 324, in getMessage
    msg = str(self.msg)
TypeError: __str__ returned non-string (type list)
Logged from file pool_scrub.py, line 78
Traceback (most recent call last):
  File "/usr/lib64/python2.7/logging/handlers.py", line 76, in emit
    if self.shouldRollover(record):
  File "/usr/lib64/python2.7/logging/handlers.py", line 154, in shouldRollover
    msg = "%s\n" % self.format(record)
  File "/usr/lib64/python2.7/logging/__init__.py", line 724, in format
    return fmt.format(record)
  File "/usr/lib64/python2.7/logging/__init__.py", line 464, in format
    record.message = record.getMessage()
  File "/usr/lib64/python2.7/logging/__init__.py", line 324, in getMessage
    msg = str(self.msg)
TypeError: __str__ returned non-string (type list)
Logged from file pool_scrub.py, line 39

@magicalyak Thanks for trialing the delete and re-add route. Given your cron entry (cheers), you now have the new pool API format (id rather than name), so could you please post the output of the following:

btrfs scrub status -R /mnt2/pool-name-here

But from a quick look we have a bug in error message formatting that is making this a little more tricky to fathom, i.e.:

it expects a string but is receiving a list! I don’t yet see how our pid is a list at this point.
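
For anyone following along, here is a minimal sketch of how this class of TypeError can arise from a logging call; it is not necessarily Rockstor’s exact code path, and the DetailError class below is purely hypothetical:

import logging

logging.basicConfig()


class DetailError(Exception):
    """Hypothetical stand-in for an exception whose __str__ returns a list."""

    def __init__(self, detail):
        self.detail = detail

    def __str__(self):
        # Returning the raw list (rather than a string) is the bug illustrated here.
        return self.detail


try:
    raise DetailError(['some validation detail'])
except DetailError as e:
    # While formatting the record, logging calls str(e), which raises
    # "TypeError: __str__ returned non-string (type list)"; the handler then
    # prints a traceback much like the one in the cron email above.
    logging.error(e)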

Could you see if a manual scrub attempt (using the force option) clears this up, as I’m not sure yet what’s happening? That is assuming there are no currently ongoing scrubs.

As can be seen, the problem seems to be in the simple validation of the pool id, as far as the TypeError goes, but the larger problem is why the pool id lookup is incorrect. I haven’t seen this before. Have you rebooted recently, i.e. since we made the pool API changes? Or at least restarted the rockstor service?

This looks to be a different problem to the others reported here.

I’ll try tonight but I reboot nightly and use wake at 6am (giving the machine a full hour before attempting a scrub). So I’m sure the new stuff is loading. I recall a manual scrub working a few weeks ago but I’ll verify today or tomorrow at the latest.

[root@rocky ~]# btrfs scrub status -R /mnt2/bigdata
scrub status for a16b3b05-763b-419e-ac6b-33297e0529ef
	scrub started at Sat Jul 15 13:14:53 2017 and finished after 03:52:11
	data_extents_scrubbed: 134997767
	tree_extents_scrubbed: 754941
	data_bytes_scrubbed: 7965659951104
	tree_bytes_scrubbed: 12368953344
	read_errors: 0
	csum_errors: 0
	verify_errors: 0
	no_csum: 519680
	csum_discards: 0
	super_errors: 0
	malloc_errors: 0
	uncorrectable_errors: 0
	unverified_errors: 0
	corrected_errors: 0
	last_physical: 8530047270912
[root@rocky ~]# btrfs scrub status -R /mnt2/data
scrub status for f7abfd64-e3e3-46fc-af22-e76ab54ee9db
	scrub started at Sun Jul  9 13:44:52 2017 and finished after 03:57:47
	data_extents_scrubbed: 106519513
	tree_extents_scrubbed: 823435
	data_bytes_scrubbed: 6791621533696
	tree_bytes_scrubbed: 13491159040
	read_errors: 0
	csum_errors: 0
	verify_errors: 0
	no_csum: 411850
	csum_discards: 0
	super_errors: 0
	malloc_errors: 0
	uncorrectable_errors: 0
	unverified_errors: 0
	corrected_errors: 0
	last_physical: 6913581383680

Hi

Seems like I’m having the same problem. I get the same (identical) e-mail message. My last successful scrub was on 24 July:

[root@nitrogen ~]# btrfs scrub status -R /mnt2/pool-1
scrub status for 2815c029-9214-4217-9571-2d7fb501da0c
scrub started at Mon Jul 24 01:42:02 2017 and finished after 02:48:09
data_extents_scrubbed: 58144551
tree_extents_scrubbed: 318195
data_bytes_scrubbed: 3700702498816
tree_bytes_scrubbed: 5213306880
read_errors: 0
csum_errors: 0
verify_errors: 0
no_csum: 223040
csum_discards: 0
super_errors: 0
malloc_errors: 0
uncorrectable_errors: 0
unverified_errors: 0
corrected_errors: 0
last_physical: 3778576121856

@peter:

Thanks for contributing to this forum thread by way of confirmation. Very much looks like you and @magicalyak now have examples of this issue. I have had another quick look at this and have as a result opened an issue:

I now have a suspicion of what has happened but need a little more time to investigate. As I currently don’t have a reproducer here I may well have to ask you both to test a proposed fix (pending); assuming I don’t manage to reproduce this issue locally.

Thanks for your persistence on this one and I hope to have a proposed fix in due course, all interim progress will be noted in the referenced issue.

Happy to test, just let me know. By the way, a scrub run manually on the CLI works fine.
Also, is there a feature request to add a “run now” option for scheduled tasks?
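
For reference, a manual CLI run here would be along the lines of the following, with the mount point assumed from the earlier status output:

btrfs scrub start /mnt2/bigdata
btrfs scrub status /mnt2/bigdata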

Happy to test as well!

@magicalyak and @peter Thanks to both of you for offering to help.

Could you both look to see if you have any scrubs in the ‘Task history’ (by clicking on the “Last run” entry) for the problem pool that are in status ‘running’ or ‘started’? I’ve got a hunch that, if there are multiple instances of these states in the history, we trip ourselves up.

Thanks. Still trying to fathom the cause!

Hi

Happy to help!

I checked the task history, see the appended screen dump. There are only finished and “error” entries. The last four ended with “error”, and there is one more “error” in between successful runs. Another thing I notice is that they all appear to finish in about 2 seconds. However, if I do btrfs scrub status -R it is correctly reported as taking almost 3 hours, so I assume the Web-UI reporting is just a cosmetic issue.

I will check later tonight whether I can run scrub jobs manually and/or whether it helps to delete and reschedule the job. Another thought: what if I manually clear the history file?

[root@nitrogen ~]# btrfs scrub status -R /mnt2/pool-1
scrub status for 2815c029-9214-4217-9571-2d7fb501da0c
scrub started at Mon Jul 24 01:42:02 2017 and finished after 02:48:09
data_extents_scrubbed: 58144551
tree_extents_scrubbed: 318195
data_bytes_scrubbed: 3700702498816
tree_bytes_scrubbed: 5213306880
read_errors: 0
csum_errors: 0
verify_errors: 0
no_csum: 223040
csum_discards: 0
super_errors: 0
malloc_errors: 0
uncorrectable_errors: 0
unverified_errors: 0
corrected_errors: 0
last_physical: 3778576121856

Nothing running; the last runs show “error” only. I recreated them earlier this week.

@peter and @magicalyak Thanks for the feedback / info.

Funny you should say that: if one removes a scheduled scrub via the bin icon in Actions, it removes the history of that scrub, and re-instating the same scrub will not give access to that same scrub history. However, the pool details page retains a list of all scrubs done, whether manual (via the Web-UI) or scheduled. My current thought is that you both have duplicate pid (in this case the process id of the scrub job) entries against a pool in your storageadmin_poolscrub database table. I’ll try and work out an SQL command to extract this info by way of a test. I’m a bit rubbish at SQL so this may take me some time.

Thanks folks, I think that’s given me another angle to explore. Back in a bit.

@peter and @magicalyak

OK, so I think this might be worth a try. Execute the following command (carefully) as root to see if your systems have duplicate pid (process id) entries for a given pool in storageadmin_poolscrub:

psql -U rocky storageadmin

(password: rocky)

SELECT pool_id,pid,count(*) FROM storageadmin_poolscrub GROUP BY pool_id,pid HAVING count(*) >1;

and if you could paste the output (if any) here, that might help; I’m still a little in the dark on this one unfortunately.
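
If that query does return rows, a follow-up along these lines (hypothetical, but against the same table) would show the full duplicate records for closer inspection:

SELECT * FROM storageadmin_poolscrub
  WHERE (pool_id, pid) IN (
    SELECT pool_id, pid FROM storageadmin_poolscrub
    GROUP BY pool_id, pid HAVING count(*) > 1);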

And to exit the PostgreSQL console, enter the following:

\q

Also, do either of you have more than one scrub job listed as ‘started’ or ‘running’ in the pool’s detail page scrub tab?
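
And if the database route is easier than the Web-UI for that check, a hypothetical query against the same table could be used; note that the status column name and its ‘started’/‘running’ values are assumed here, not confirmed:

-- assumption: a status column holding 'started'/'running' state strings
SELECT pool_id, pid, status FROM storageadmin_poolscrub
  WHERE status IN ('started', 'running');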

I’ll have to leave this one for a bit and circle back around to it as this may not be the tree to bark up.

Cheers.