@Dragon2611 re:
Have a look at the following forum thread started by @Christian_Rost. Does your current issue look like that?
Cheers, just trying to narrow this one down.
Not sure yet, as I set the scrubs for the 1st and the 15th of the month; I don’t really want to schedule one for now as there’s a balance still running.
I’m having something similar but didn’t think to investigate. I’ll see if it’s the same issue, but for me it’s scrubs not starting. I power down nightly but made sure scrubs start 30 minutes after wake.
@magicalyak Hello again:
Cheers, that would be good.
Just try re-creating those tasks, as they then adopt the new pool API, assuming of course that you have all the recent testing channel updates in place.
Deleted and re-added, still getting an error. These are the email contents of the failed cron scrub job:
Cron <root@rocky> /opt/rockstor//bin/st-pool-scrub 4 \*-*-*-*-*-*
Traceback (most recent call last):
File "/usr/lib64/python2.7/logging/handlers.py", line 76, in emit
if self.shouldRollover(record):
File "/usr/lib64/python2.7/logging/handlers.py", line 154, in shouldRollover
msg = "%s\n" % self.format(record)
File "/usr/lib64/python2.7/logging/__init__.py", line 724, in format
return fmt.format(record)
File "/usr/lib64/python2.7/logging/__init__.py", line 464, in format
record.message = record.getMessage()
File "/usr/lib64/python2.7/logging/__init__.py", line 324, in getMessage
msg = str(self.msg)
TypeError: __str__ returned non-string (type list)
Logged from file pool_scrub.py, line 78
Traceback (most recent call last):
File "/usr/lib64/python2.7/logging/handlers.py", line 76, in emit
if self.shouldRollover(record):
File "/usr/lib64/python2.7/logging/handlers.py", line 154, in shouldRollover
msg = "%s\n" % self.format(record)
File "/usr/lib64/python2.7/logging/__init__.py", line 724, in format
return fmt.format(record)
File "/usr/lib64/python2.7/logging/__init__.py", line 464, in format
record.message = record.getMessage()
File "/usr/lib64/python2.7/logging/__init__.py", line 324, in getMessage
msg = str(self.msg)
TypeError: __str__ returned non-string (type list)
Logged from file pool_scrub.py, line 39
@magicalyak Thanks for trialing the delete / re-add route, and given your cron entry (cheers) you now have the new pool API format (id rather than name), so could you please post the output of the following:
btrfs scrub status -R /mnt2/pool-name-here
But from a quick look we have a bug in the error message formatting that is making this a little trickier to fathom, i.e. the logger expects a string but is receiving a list! I don’t yet see how our pid could be a list at this point.
Could you see if a manual scrub attempt (using the force option) clears this up, as I’m not sure yet what’s happening? That is assuming there are no currently ongoing scrubs.
As can be seen, the problem, at least as far as the TypeError goes, seems to be in the simple validation of the pool id; but the larger problem is why the pool id lookup is incorrect. I haven’t seen this before. Have you rebooted recently, i.e. since we made the pool API changes? Or at least re-started the rockstor service?
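By way of illustration, here is a minimal sketch (not our actual code) of how that exact TypeError can surface from Python 2.7’s logging module: if whatever object ends up as the log message has a __str__ that returns a list, the handler’s own formatting fails just as in the cron email above.
import logging

logging.basicConfig()

class BadMessage(object):
    # Hypothetical stand-in for whatever ends up as the log message;
    # its __str__ (wrongly) returns a list rather than a string.
    def __str__(self):
        return ['scrub', 'pid']

# The logging machinery catches the resulting
# "TypeError: __str__ returned non-string (type list)" and prints a
# traceback to stderr, much like the one in the cron email above.
logging.error(BadMessage())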
This looks to be a different problem to the others reported here.
I’ll try tonight, but I reboot nightly and wake at 6am (giving the machine a full hour before attempting a scrub), so I’m sure the new stuff is loading. I recall a manual scrub working a few weeks ago, but I’ll verify today or tomorrow at the latest.
[root@rocky ~]# btrfs scrub status -R /mnt2/bigdata
scrub status for a16b3b05-763b-419e-ac6b-33297e0529ef
scrub started at Sat Jul 15 13:14:53 2017 and finished after 03:52:11
data_extents_scrubbed: 134997767
tree_extents_scrubbed: 754941
data_bytes_scrubbed: 7965659951104
tree_bytes_scrubbed: 12368953344
read_errors: 0
csum_errors: 0
verify_errors: 0
no_csum: 519680
csum_discards: 0
super_errors: 0
malloc_errors: 0
uncorrectable_errors: 0
unverified_errors: 0
corrected_errors: 0
last_physical: 8530047270912
[root@rocky ~]# btrfs scrub status -R /mnt2/data
scrub status for f7abfd64-e3e3-46fc-af22-e76ab54ee9db
scrub started at Sun Jul 9 13:44:52 2017 and finished after 03:57:47
data_extents_scrubbed: 106519513
tree_extents_scrubbed: 823435
data_bytes_scrubbed: 6791621533696
tree_bytes_scrubbed: 13491159040
read_errors: 0
csum_errors: 0
verify_errors: 0
no_csum: 411850
csum_discards: 0
super_errors: 0
malloc_errors: 0
uncorrectable_errors: 0
unverified_errors: 0
corrected_errors: 0
last_physical: 6913581383680
Hi
Seems like I’m having the same problem. I get the same (identical) e-mail message. My last successful scrub was on 24 July:
[root@nitrogen ~]# btrfs scrub status -R /mnt2/pool-1
scrub status for 2815c029-9214-4217-9571-2d7fb501da0c
scrub started at Mon Jul 24 01:42:02 2017 and finished after 02:48:09
data_extents_scrubbed: 58144551
tree_extents_scrubbed: 318195
data_bytes_scrubbed: 3700702498816
tree_bytes_scrubbed: 5213306880
read_errors: 0
csum_errors: 0
verify_errors: 0
no_csum: 223040
csum_discards: 0
super_errors: 0
malloc_errors: 0
uncorrectable_errors: 0
unverified_errors: 0
corrected_errors: 0
last_physical: 3778576121856
Thanks for contributing to this forum thread by way of confirmation. Very much looks like you and @magicalyak now have examples of this issue. I have had another quick look at this and have as a result opened an issue:
I now have a suspicion of what has happened but need a little more time to investigate. As I currently don’t have a reproducer here I may well have to ask you both to test a proposed fix (pending); assuming I don’t manage to reproduce this issue locally.
Thanks for your persistence on this one and I hope to have a proposed fix in due course, all interim progress will be noted in the referenced issue.
Happy to test. Just let me know. Btw a scrub run manually on the cli works fine.
Also is there a feature request to add a “run now” option for scheduled tasks?
Happy to test as well!
@magicalyak and @peter Thanks to both of you for offering to help.
Could you both look to see if you have any scrubs in the ‘Task history’ (clicking on the “Last run” entry) for the problem pool that are in status ‘running’ or ‘started’? I’ve got a hunch: it may be that if there are multiple instances of these states in the history, we trip ourselves up.
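Purely as a sketch of that hunch (made-up data, not our actual code): anything that assumes at most one in-progress scrub per pool can end up tracking a stale entry if the history holds more than one ‘started’/‘running’ record.
# Hypothetical task-history entries for one pool; the first 'started'
# row is a stale leftover from an earlier run.
history = [
    {'pid': 1234, 'status': 'finished'},
    {'pid': 2345, 'status': 'started'},   # stale leftover
    {'pid': 3456, 'status': 'started'},   # the scrub actually running
]

in_progress = [e for e in history if e['status'] in ('started', 'running')]
# Code that blindly takes the first match would track pid 2345, the
# stale entry, rather than the scrub that is actually running.
print(in_progress[0]['pid'])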
Thanks. Still trying to fathom the cause!
Hi
Happy to help!
I checked the task history, see the appended screen dump. There are only finished and “error” entries. The last four ended with “error”, plus one more between successful ones. Another thing I notice is that they all appear to finish in about 2 seconds. However, if I do btrfs scrub status -R it is correctly reported as taking almost 3 hours, so I assume the web UI reporting is just a cosmetic issue.
I will check later tonight if I can run scrub jobs manually and/or if it helps to delete and reschedule the job. Another thought: what if I manually clear the history file?
[root@nitrogen ~]# btrfs scrub status -R /mnt2/pool-1
scrub status for 2815c029-9214-4217-9571-2d7fb501da0c
scrub started at Mon Jul 24 01:42:02 2017 and finished after 02:48:09
data_extents_scrubbed: 58144551
tree_extents_scrubbed: 318195
data_bytes_scrubbed: 3700702498816
tree_bytes_scrubbed: 5213306880
read_errors: 0
csum_errors: 0
verify_errors: 0
no_csum: 223040
csum_discards: 0
super_errors: 0
malloc_errors: 0
uncorrectable_errors: 0
unverified_errors: 0
corrected_errors: 0
last_physical: 3778576121856
Not running, last run is error only. I recreated them earlier this week.
@peter and @magicalyak Thanks for the feedback / info.
Funny you should say that: if one removes a scheduled scrub via the bin icon in Actions it removes the history of that scrub, and re-instating the same scrub will not give access to the old scrub history. However, the pool details page retains a list of all scrubs done, whether manual (via the Web-UI) or scheduled. My current thought is that you both have duplicate pid (in this case the process id of the scrub job) entries against a pool in your storageadmin_poolscrub database tables. I’ll try and work out an sql command to extract this info by way of a test. I’m a bit rubbish at sql so this may take me some time.
Thanks folks, I think that’s given me another angle to explore. Back in a bit.
@peter and @magicalyak
OK, so I think this might be worth a try. Could you execute the following command (carefully) as root to see if your systems have duplicate pid (process id) entries for a given pool in storageadmin_poolscrub:
psql -U rocky storageadmin
(password: rocky)
SELECT pool_id,pid,count(*) FROM storageadmin_poolscrub GROUP BY pool_id,pid HAVING count(*) >1;
and if you could paste the output (if any) here, that might help; still a little in the dark on this one unfortunately.
And to exit the postgresql console enter the following:
\q
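If the interactive psql session is awkward, the same duplicate-pid check can be run from Python with psycopg2 (which should already be on a Rockstor box as part of the Django stack); this is just a convenience sketch using the credentials and query given above.
# Same duplicate-pid check as the psql session above, via psycopg2.
# Assumes user 'rocky', password 'rocky', database 'storageadmin' as above;
# a host='127.0.0.1' argument may be needed depending on PostgreSQL auth.
import psycopg2

conn = psycopg2.connect(dbname='storageadmin', user='rocky', password='rocky')
cur = conn.cursor()
cur.execute(
    "SELECT pool_id, pid, count(*) FROM storageadmin_poolscrub "
    "GROUP BY pool_id, pid HAVING count(*) > 1"
)
rows = cur.fetchall()
if not rows:
    print('no duplicate pid entries found')
for pool_id, pid, count in rows:
    print('pool %s: pid %s appears %s times' % (pool_id, pid, count))
cur.close()
conn.close()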
Also, do either of you have more than one scrub job listed as started or running in the pool’s detail page scrub tab?
I’ll have to leave this one for a bit and circle back around to it as this may not be the tree to bark up.
Cheers.
Hi!
So here is the output:
[root@nitrogen ~]# psql -U rocky storageadmin
Password for user rocky:
psql (9.2.18)
Type “help” for help.
storageadmin=# SELECT pool_id,pid,count(*) FROM storageadmin_poolscrub GROUP BY pool_id,pid HAVING count(*) >1;
 pool_id | pid | count
---------+-----+-------
(0 rows)
I’ll be back later with more after rebooting and starting a scrub job
And with a scrub job running:
storageadmin=# SELECT pool_id,pid,count(*) FROM storageadmin_poolscrub GROUP BY pool_id,pid HAVING count(*) >1;
 pool_id | pid | count
---------+-----+-------
(0 rows)
Hmmm?
I noticed a zombie gunicorn process that was started at the same time as the scrub job.
Tasks: 224 total, 2 running, 221 sleeping, 0 stopped, 1 zombie
%Cpu(s): 0.2 us, 12.3 sy, 0.0 ni, 70.5 id, 16.9 wa, 0.0 hi, 0.1 si, 0.0 st
KiB Mem : 7084448 total, 3255612 free, 393192 used, 3435644 buff/cache
KiB Swap: 5862396 total, 5862396 free, 0 used. 6342896 avail Mem
root 14792 0.0 0.0 0 0 ? Z 21:52 0:00 [gunicorn] defunct
[root@nitrogen ~]# pstree
systemd─┬─NetworkManager─┬─dhclient
│ └─2*[{NetworkManager}]
├─agetty
├─atd
├─auditd───{auditd}
├─avahi-daemon───avahi-daemon
├─btrfs───4*[{btrfs}]
├─chronyd
├─crond
├─dbus-daemon
├─docker-wrapper─┬─docker───5*[{docker}]
│ └─2*[{docker-wrapper}]
├─gssproxy───5*[{gssproxy}]
├─irqbalance
├─master─┬─pickup
│ └─qmgr
├─netatalk─┬─afpd
│ ├─cnid_metad
│ └─{netatalk}
├─polkitd───5*[{polkitd}]
├─postgres───15*[postgres]
├─rsyslogd───2*[{rsyslogd}]
├─shellinaboxd───shellinaboxd
├─smartd
├─sshd───sshd───bash───pstree
├─supervisord─┬─data-collector───5*[{data-collector}]
│ ├─django───2*[{django}]
│ ├─gunicorn─┬─gunicorn───2*[{gunicorn}]
│ │ └─gunicorn─┬─gunicorn
│ │ └─2*[{gunicorn}]
│ └─nginx───2*[nginx]
├─systemd-journal
├─systemd-logind
└─systemd-udevd