Scheduled Scrub not running

Hi!

So here is the output:

[root@nitrogen ~]# psql -U rocky storageadmin
Password for user rocky:
psql (9.2.18)
Type "help" for help.

storageadmin=# SELECT pool_id,pid,count(*) FROM storageadmin_poolscrub GROUP BY pool_id,pid HAVING count(*) > 1;
 pool_id | pid | count
---------+-----+-------
(0 rows)

I’ll be back later with more, after rebooting and starting a scrub job.

And with a scrub job running:

storageadmin=# SELECT pool_id,pid,count(*) FROM storageadmin_poolscrub GROUP BY pool_id,pid HAVING count(*) > 1;
 pool_id | pid | count
---------+-----+-------
(0 rows)

Hmmm?

I noticed a zombie gunicorn process that was started at the same time as the scrub job.

Tasks: 224 total, 2 running, 221 sleeping, 0 stopped, 1 zombie
%Cpu(s): 0.2 us, 12.3 sy, 0.0 ni, 70.5 id, 16.9 wa, 0.0 hi, 0.1 si, 0.0 st
KiB Mem : 7084448 total, 3255612 free, 393192 used, 3435644 buff/cache
KiB Swap: 5862396 total, 5862396 free, 0 used. 6342896 avail Mem

root 14792 0.0 0.0 0 0 ? Z 21:52 0:00 [gunicorn] defunct

[root@nitrogen ~]# pstree
systemd─┬─NetworkManager─┬─dhclient
│ └─2*[{NetworkManager}]
├─agetty
├─atd
├─auditd───{auditd}
├─avahi-daemon───avahi-daemon
├─btrfs───4*[{btrfs}]
├─chronyd
├─crond
├─dbus-daemon
├─docker-wrapper─┬─docker───5*[{docker}]
│ └─2*[{docker-wrapper}]
├─gssproxy───5*[{gssproxy}]
├─irqbalance
├─master─┬─pickup
│ └─qmgr
├─netatalk─┬─afpd
│ ├─cnid_metad
│ └─{netatalk}
├─polkitd───5*[{polkitd}]
├─postgres───15*[postgres]
├─rsyslogd───2*[{rsyslogd}]
├─shellinaboxd───shellinaboxd
├─smartd
├─sshd───sshd───bash───pstree
├─supervisord─┬─data-collector───5*[{data-collector}]
│ ├─django───2*[{django}]
│ ├─gunicorn─┬─gunicorn───2*[{gunicorn}]
│ │ └─gunicorn─┬─gunicorn
│ │ └─2*[{gunicorn}]
│ └─nginx───2*[nginx]
├─systemd-journal
├─systemd-logind
└─systemd-udevd

storageadmin=# SELECT pool_id,pid,count(*) FROM storageadmin_poolscrub GROUP BY pool_id,pid HAVING count(*) >1;
 pool_id | pid | count 
---------+-----+-------
(0 rows)

FYI

storageadmin=# SELECT * FROM storageadmin_poolscrub;
 id | pool_id |  status  |  pid  |          start_time           |           end_time            | kb_scrubbed | data_extents_scrubbed |
 tree_extents_scrubbed | tree_bytes_scrubbed | read_errors | csum_errors | verify_errors | no_csum | csum_discards | super_errors | mall
oc_errors | uncorrectable_errors | unverified_errors | corrected_errors | last_physical 
----+---------+----------+-------+-------------------------------+-------------------------------+-------------+-----------------------+
-----------------------+---------------------+-------------+-------------+---------------+---------+---------------+--------------+-----
----------+----------------------+-------------------+------------------+---------------
  1 |       3 | finished | 26203 | 2017-07-15 10:14:53.958678-07 | 2017-07-15 14:07:04.958678-07 |  7778964796 |             134997767 |
                754941 |         12368953344 |           0 |           0 |             0 |  519680 |             0 |            0 |     
        0 |                    0 |                 0 |                0 | 8530047270912
(1 row)

OK, my manually started scrub job finished successfully:

scrub status for 2815c029-9214-4217-9571-2d7fb501da0c
scrub started at Sat Aug 19 21:52:11 2017 and finished after 02:54:54
data_extents_scrubbed: 58687013
tree_extents_scrubbed: 322899
data_bytes_scrubbed: 3732788396032
tree_bytes_scrubbed: 5290377216
read_errors: 0
csum_errors: 0
verify_errors: 0
no_csum: 225408
csum_discards: 0
super_errors: 0
malloc_errors: 0
uncorrectable_errors: 0
unverified_errors: 0
corrected_errors: 0
last_physical: 3778576121856

storageadmin=# SELECT pool_id,pid,count(*) FROM storageadmin_poolscrub GROUP BY pool_id,pid HAVING count(*) > 1;
 pool_id | pid | count
---------+-----+-------
(0 rows)

Gunicorn has lots of temp files open which are already deleted, according to lsof.

Create a scheduled scrub job:

  • Terminated with status error after 2 sec (according to the Web UI)

storageadmin=# SELECT pool_id,pid,count(*) FROM storageadmin_poolscrub GROUP BY pool_id,pid HAVING count(*) > 1;
 pool_id | pid | count
---------+-----+-------
(0 rows)

No zombie process this time

@peter and @magicalyak Thanks for this folks; looks like my db suspicion in that area was indeed the wrong tree. Oh well, I’ll have another look, but I still haven’t seen the exact same thing here yet.

So I’m assuming that you are still experiencing the same log entries / email on that failed scheduled scrub? I.e. the

... returned non-string (type list) ...

one.

Also, could you try running a manually initiated (via Web-UI) scrub (when no others are running or scheduled) but adding the force option, as when used there is a little bit of extra checking that is done. I don’t think it’s going to help, as all it does is turn existing running or started jobs into the Terminated state, and I don’t think you have any of those anyway.

[screenshot: terminated-scrub-task]

I’ll ponder some more and see if anything comes up.

Looks like my mid-month scrub job for the Root (Rockstor drive) ran OK; then again, there’s not a lot on there other than the OS itself, so there isn’t much to scrub.

It will be more interesting to see what happens at the start of the month when it does the main pool.

@Dragon2611 Thanks for the update and do let us know how your main pool schedule scrub goes. Still unable to reproduce the “non-string (type list)” issue but at least we have the GitHub issue and this thread to keep an eye on it.

Hi!

Correct;

The manual/web UI one:

  • ran successfully and with no errors
  • gave me a zombie process (might have been a coincidence)
  • shows up in the web UI/Storage>Pools>Scrubs table
  • gave me no records in the db as above

The scheduled one:

  • did not run successfully, seems to have terminated within 2 sec (however that may or may not be correct)
  • did not start a zombie process
  • not listed in the web UI/Storage>Pools>Scrubs table
  • resulted in the e-mail notification you mentioned

@peter Thanks for the detailed report / update.

No worries on that one; you should see it if you execute the SQL command indicated by @magicalyak earlier re “FYI”, which shows all entries. My command was looking for repeat instances of a pid (the wrong tree to bark up).
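Since the thread keeps coming back to this duplicate-pid query, here is a self-contained sketch of what it checks, using an in-memory sqlite3 stand-in for the storageadmin_poolscrub table (column subset assumed, not the real schema):

```python
import sqlite3

# In-memory stand-in for a subset of storageadmin_poolscrub (assumed columns).
con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE storageadmin_poolscrub (id INTEGER, pool_id INTEGER, pid INTEGER)"
)
con.executemany(
    "INSERT INTO storageadmin_poolscrub VALUES (?, ?, ?)",
    [(1, 3, 26203), (2, 3, 26203), (3, 1, 111)],  # pid 26203 recorded twice
)

# Same shape as the psql query above: flag any (pool_id, pid) pair seen twice.
dupes = con.execute(
    "SELECT pool_id, pid, count(*) FROM storageadmin_poolscrub "
    "GROUP BY pool_id, pid HAVING count(*) > 1"
).fetchall()
print(dupes)  # [(3, 26203, 2)] here; a healthy table gives [] (0 rows)
```

An empty result, as in all the psql outputs above, means no pid was double-recorded for a pool.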

By way of an update on my part, I have opened a new issue that describes a related but different bug. I will tend to this latter issue first, as it also results in blocked scheduled tasks, but I can reproduce it and it should help make the scheduled scrub code a little more robust:

I don’t think it shares the same cause but is an improvement in the same area.

If you are game there might be a clue in the following db table which holds the current task status (largest id) and history:

psql -U rocky smartdb

pass as before

select * from smart_manager_task;

and the task_def_id field there points to the contents of the following table:

select * from smart_manager_taskdefinition;

i.e. the figures in the task_def_id column of the task table indicate the id entries of the taskdefinition table.
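To make the relationship concrete, here is a minimal sqlite3 mock-up of the two tables (columns and values are illustrative, not a dump of the real schema):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE smart_manager_taskdefinition (id INTEGER, name TEXT)")
con.execute(
    "CREATE TABLE smart_manager_task (id INTEGER, task_def_id INTEGER, state TEXT)"
)
con.execute("INSERT INTO smart_manager_taskdefinition VALUES (5, 'bigdata-scrub')")
con.execute("INSERT INTO smart_manager_task VALUES (42, 5, 'error')")

# task.task_def_id references taskdefinition.id, so a task row can be
# resolved to its definition with a join:
rows = con.execute(
    "SELECT t.id, d.name, t.state FROM smart_manager_task t "
    "JOIN smart_manager_taskdefinition d ON t.task_def_id = d.id"
).fetchall()
print(rows)  # [(42, 'bigdata-scrub', 'error')]
```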

We will get there soon, and I have mostly sorted the cause of this latest issue, so bit by bit.

Thanks for your help in this area, much appreciated.

Hi

One thought: since we now know that there are workarounds to both the login issue (schedule a reboot each night) and to the scrub thing (run them manually), I have added a note about that in my original post. To highlight that further for other users, it could be useful to mark the title of the post as [Workaround].

I’ll have a look at the other issue later on


@peter

That’s an idea. That open-files thing is a bit of a bad one, and again it only seems to affect some users. Oh well, just need more info and time really.

Cheers.

I know the manual one will run, but I wanted to add that scheduled scrubs did work in the past. I rebuilt the OS a few months ago and only recently re-added them, noticing the error.
How are we submitting the scrub job? Just in cron? If so, shouldn’t we see the entries there?

If we’re not using cron, I can figure out the commands we need to enter. If the call could write to cron then problem solved :wink: Let me know if you want me to contribute, if that’s the path.

@magicalyak Hello again.

Yes, and we do (sort of), e.g.:

cat /etc/cron.d/rockstortab 

SHELL=/bin/bash
PATH=/sbin:/bin:/usr/sbin:/usr/bin
MAILTO=root
# These entries are auto generated by Rockstor. Do not edit.
*/5 * * * * root /opt/rockstor-dev//bin/st-pool-scrub 7 \*-*-*-*-*-*

And the indicated script is what I’m currently making some bug-fixing changes to. At least that’s the hope.
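For reference, the auto-generated entry above splits into the standard cron.d fields plus two Rockstor arguments; a rough sketch (field meanings inferred from this thread):

```python
# One line from /etc/cron.d/rockstortab (copied from above).
line = r"*/5 * * * * root /opt/rockstor-dev//bin/st-pool-scrub 7 \*-*-*-*-*-*"

fields = line.split()
schedule = " ".join(fields[:5])  # standard five-field cron schedule
user = fields[5]                 # cron.d files carry a user column
command = fields[6]              # the st-pool-scrub wrapper script
task_id = fields[7]              # the taskdefinition id (not the pool id)
window = fields[8]               # the crontabwindow argument

print(schedule, user, task_id)  # */5 * * * * root 7
```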

Should have a pr ready soon, just need to do some more testing. But I haven’t yet tracked down the “… returned non-string (type list)…” thing. Tricky without a reproducer, so fixing what I can for now in that area.

Were it not for the ever-present details :smiley:

Thanks for the offer to help; if you still have a reproducer of the “list when string expected” scheduled scrub failure, it would help to see some additional logging to know what the contents of the list elements were.
I was thinking something along the lines of:

logger.debug('joined presentation of passed pid=%s' % '\n'.join(pid))

in the try clause at:

Your and @peter’s reported emails seem to indicate that something is calling _validate_pool with the pid as a list, such as is the case when more than a single item is returned by a queryset where only a single match is expected (a single instance would then be a string, not a list). That’s my best guess so far. That way we would have more to work with, as we could see where the multiple pids (pool ids) are coming from.
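The reported TypeError is consistent with that guess. A minimal reproducer of the failure mode (my assumption, not Rockstor code): logging calls str() on the message, and str() on any object whose __str__ returns a list raises exactly the message seen in the tracebacks:

```python
# Hypothetical stand-in: a log message object whose __str__ wrongly
# returns a list (e.g. a raw queryset result) instead of a string.
class ListMessage(object):
    def __str__(self):
        return ["pid-one", "pid-two"]  # a list, not a string

try:
    str(ListMessage())  # what logging's getMessage() effectively does
    err = None
except TypeError as exc:
    err = str(exc)

print(err)  # __str__ returned non-string (type list)
```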

Hope that helps, but for the time being I’m looking to the more recent and reproducible scheduled scrub blocker I found while investigating this one. Hope to circle back around to this one though.

Cheers.

I think there may be two issues.
Right now I think it’s the crontab for Rockstor, but this is just me guessing. Note I have two scrubs; does this crontab look right to you? Note the double slashes?

[root@rocky ~]# cat /etc/cron.d/rockstortab 
SHELL=/bin/bash
PATH=/sbin:/bin:/usr/sbin:/usr/bin
MAILTO=root
MAILFROM=rocky@gamull.com
# These entries are auto generated by Rockstor. Do not edit.
0 1 * * * root /opt/rockstor//bin/st-system-power 3 \*-*-*-*-*-*
0 7 * * 3 root /opt/rockstor//bin/st-pool-scrub 4 \*-*-*-*-*-*
30 18 * * 0 root /opt/rockstor//bin/st-pool-scrub 5 \*-*-*-*-*-*

Log only shows

Aug 20 18:30:01 rocky.gamull.com CROND[16439]: (root) CMD (/opt/rockstor//bin/st-pool-scrub 5 \*-*-*-*-*-*)

Manual Run

[root@rocky log]# /opt/rockstor//bin/st-pool-scrub 5 \*-*-*-*-*-*
Traceback (most recent call last):
  File "/usr/lib64/python2.7/logging/handlers.py", line 76, in emit
    if self.shouldRollover(record):
  File "/usr/lib64/python2.7/logging/handlers.py", line 154, in shouldRollover
    msg = "%s\n" % self.format(record)
  File "/usr/lib64/python2.7/logging/__init__.py", line 724, in format
    return fmt.format(record)
  File "/usr/lib64/python2.7/logging/__init__.py", line 464, in format
    record.message = record.getMessage()
  File "/usr/lib64/python2.7/logging/__init__.py", line 324, in getMessage
    msg = str(self.msg)
TypeError: __str__ returned non-string (type list)
Logged from file pool_scrub.py, line 78
Traceback (most recent call last):
  File "/usr/lib64/python2.7/logging/handlers.py", line 76, in emit
    if self.shouldRollover(record):
  File "/usr/lib64/python2.7/logging/handlers.py", line 154, in shouldRollover
    msg = "%s\n" % self.format(record)
  File "/usr/lib64/python2.7/logging/__init__.py", line 724, in format
    return fmt.format(record)
  File "/usr/lib64/python2.7/logging/__init__.py", line 464, in format
    record.message = record.getMessage()
  File "/usr/lib64/python2.7/logging/__init__.py", line 324, in getMessage
    msg = str(self.msg)
TypeError: __str__ returned non-string (type list)
Logged from file pool_scrub.py, line 39

@magicalyak

Yes, that is fine; not elegant, but it does work that way. I have them working here just fine with that arrangement, so no worries there, at least on the functional side.

Could you try turning on debug log level:

/opt/rockstor/bin/debug-mode ON

Note that this command will restart your rockstor service though.
Run it without the “ON” to find out the options.
I don’t think that’s going to help, but it’s worth a try.

It may help, at least to get more info, if you could add that logging line I detailed in my last post inside the try clause, at the same indent (with only spaces) as the return command and just before it. That might tell us what the contents of this mystery list are. You would then have to restart the rockstor service for this to take effect.

The problem with this issue is that only you and @peter have reported it and I can’t reproduce or track down (yet) what is causing it so still at the ‘more info’ stage really.

See how that goes if you are game, but no worries otherwise, as it’s not good to play with code-file editing ‘in production’; but given your history in the rockon-registry repo I figured you might be up for it. Quite frustrating at this end not having a reproducer. But at least it’s leading to other fixes, which I can hopefully get in soon by way of my next (pending) pull request.

Nice on the manual run by the way. But here I simply get a long pause as it does its bit (or waits for my every-5-min job to complete) and then nothing:

[root@rockdev ~]# /opt/rockstor-dev//bin/st-pool-scrub 7 \*-*-*-*-*-*
[root@rockdev ~]# 

with the Web-UI pool details noting its occurrence:

and another in the task history:

[screenshot: task-history-scrub]

The zeroing of the start-time seconds in the latter is intentional, but I’m not sure why yet.

We should also normalise those tables at some point.

Cheers.

Note the top errors are pretty old and likely unrelated, but I included them “just in case”; the “failed to get status” errors (note the dates) are the relevant ones.

[25/Jul/2017 16:09:28] ERROR [storageadmin.middleware:32] Exception occured while processing a request. Path: /api/commands/refresh-snapshot-state method: POST
[25/Jul/2017 16:09:28] ERROR [storageadmin.middleware:33] Error running a command. cmd = /sbin/btrfs property get /mnt2/data/rockon-root/btrfs/subvolumes/e036e54af38ee90556073ed98d90ec8bf1514d919c733f4ac7a455c17e0e1a7a. rc = 1. stdout = ['']. stderr = ['ERROR: failed to open /mnt2/data/rockon-root/btrfs/subvolumes/e036e54af38ee90556073ed98d90ec8bf1514d919c733f4ac7a455c17e0e1a7a: No such file or directory', 'ERROR: failed to detect object type: No such file or directory', 'usage: btrfs property get [-t <type>] <object> [<name>]', '', '    Gets a property from a btrfs object.', '', '    If no name is specified, all properties for the given object are', '    printed.', '    A filesystem object can be a the filesystem itself, a subvolume,', "    an inode or a device. The '-t <type>' option can be used to explicitly", '    specify what type of object you meant. This is only needed when a', '    property could be set for more then one object type. Possible types', '    are s[ubvol], f[ilesystem], i[node] and d[evice].', '', '']
Traceback (most recent call last):
  File "/opt/rockstor/eggs/Django-1.8.16-py2.7.egg/django/core/handlers/base.py", line 132, in get_response
    response = wrapped_callback(request, *callback_args, **callback_kwargs)
  File "/opt/rockstor/eggs/Django-1.8.16-py2.7.egg/django/views/decorators/csrf.py", line 58, in wrapped_view
    return view_func(*args, **kwargs)
  File "/opt/rockstor/eggs/Django-1.8.16-py2.7.egg/django/views/generic/base.py", line 71, in view
    return self.dispatch(request, *args, **kwargs)
  File "/opt/rockstor/eggs/djangorestframework-3.1.1-py2.7.egg/rest_framework/views.py", line 452, in dispatch
    response = self.handle_exception(exc)
  File "/opt/rockstor/eggs/djangorestframework-3.1.1-py2.7.egg/rest_framework/views.py", line 449, in dispatch
    response = handler(request, *args, **kwargs)
  File "/opt/rockstor/eggs/Django-1.8.16-py2.7.egg/django/utils/decorators.py", line 145, in inner
    return func(*args, **kwargs)
  File "/opt/rockstor/src/rockstor/storageadmin/views/command.py", line 323, in post
    import_snapshots(share)
  File "/opt/rockstor/src/rockstor/storageadmin/views/share_helpers.py", line 139, in import_snapshots
    share.name)
  File "/opt/rockstor/src/rockstor/fs/btrfs.py", line 507, in snaps_info
    snap_name, writable = parse_snap_details(mnt_pt, fields)
  File "/opt/rockstor/src/rockstor/fs/btrfs.py", line 469, in parse_snap_details
    '%s/%s' % (mnt_pt, fields[-1])])
  File "/opt/rockstor/src/rockstor/system/osi.py", line 115, in run_command
    raise CommandException(cmd, out, err, rc)
CommandException: Error running a command. cmd = /sbin/btrfs property get /mnt2/data/rockon-root/btrfs/subvolumes/e036e54af38ee90556073ed98d90ec8bf1514d919c733f4ac7a455c17e0e1a7a. rc = 1. stdout = ['']. stderr = ['ERROR: failed to open /mnt2/data/rockon-root/btrfs/subvolumes/e036e54af38ee90556073ed98d90ec8bf1514d919c733f4ac7a455c17e0e1a7a: No such file or directory', 'ERROR: failed to detect object type: No such file or directory', 'usage: btrfs property get [-t <type>] <object> [<name>]', '', '    Gets a property from a btrfs object.', '', '    If no name is specified, all properties for the given object are', '    printed.', '    A filesystem object can be a the filesystem itself, a subvolume,', "    an inode or a device. The '-t <type>' option can be used to explicitly", '    specify what type of object you meant. This is only needed when a', '    property could be set for more then one object type. Possible types', '    are s[ubvol], f[ilesystem], i[node] and d[evice].', '', '']
[26/Jul/2017 07:00:02] ERROR [scripts.scheduled_tasks.pool_scrub:76] Failed to start scrub at pools/data/scrub
[26/Jul/2017 07:00:02] ERROR [scripts.scheduled_tasks.pool_scrub:37] Failed to get scrub status at pools/data/scrub/status
[31/Jul/2017 07:00:02] ERROR [scripts.scheduled_tasks.pool_scrub:76] Failed to start scrub at pools/bigdata/scrub
[31/Jul/2017 07:00:02] ERROR [scripts.scheduled_tasks.pool_scrub:37] Failed to get scrub status at pools/bigdata/scrub/status
[02/Aug/2017 07:00:03] ERROR [scripts.scheduled_tasks.pool_scrub:76] Failed to start scrub at pools/data/scrub
[02/Aug/2017 07:00:03] ERROR [scripts.scheduled_tasks.pool_scrub:37] Failed to get scrub status at pools/data/scrub/status
[07/Aug/2017 07:00:03] ERROR [scripts.scheduled_tasks.pool_scrub:76] Failed to start scrub at pools/bigdata/scrub
[07/Aug/2017 07:00:03] ERROR [scripts.scheduled_tasks.pool_scrub:37] Failed to get scrub status at pools/bigdata/scrub/status
[09/Aug/2017 07:00:02] ERROR [scripts.scheduled_tasks.pool_scrub:76] Failed to start scrub at pools/data/scrub
[09/Aug/2017 07:00:02] ERROR [scripts.scheduled_tasks.pool_scrub:37] Failed to get scrub status at pools/data/scrub/status
[14/Aug/2017 07:00:03] ERROR [scripts.scheduled_tasks.pool_scrub:76] Failed to start scrub at pools/bigdata/scrub
[14/Aug/2017 07:00:03] ERROR [scripts.scheduled_tasks.pool_scrub:37] Failed to get scrub status at pools/bigdata/scrub/status
[16/Aug/2017 07:00:02] ERROR [scripts.scheduled_tasks.pool_scrub:76] Failed to start scrub at pools/data/scrub
[16/Aug/2017 07:00:02] ERROR [scripts.scheduled_tasks.pool_scrub:37] Failed to get scrub status at pools/data/scrub/status
[20/Aug/2017 18:30:04] ERROR [scripts.scheduled_tasks.pool_scrub:76] Failed to start scrub at pools/bigdata/scrub
[20/Aug/2017 18:30:04] ERROR [scripts.scheduled_tasks.pool_scrub:37] Failed to get scrub status at pools/bigdata/scrub/status
[20/Aug/2017 19:33:52] ERROR [scripts.scheduled_tasks.pool_scrub:76] Failed to start scrub at pools/bigdata/scrub
[20/Aug/2017 19:33:52] ERROR [scripts.scheduled_tasks.pool_scrub:37] Failed to get scrub status at pools/bigdata/scrub/status
[20/Aug/2017 19:34:44] ERROR [scripts.scheduled_tasks.pool_scrub:76] Failed to start scrub at pools/bigdata/scrub
[20/Aug/2017 19:34:44] ERROR [scripts.scheduled_tasks.pool_scrub:37] Failed to get scrub status at pools/bigdata/scrub/status
[20/Aug/2017 19:35:41] ERROR [scripts.scheduled_tasks.pool_scrub:76] Failed to start scrub at pools/bigdata/scrub
[20/Aug/2017 19:35:41] ERROR [scripts.scheduled_tasks.pool_scrub:37] Failed to get scrub status at pools/bigdata/scrub/status
[21/Aug/2017 07:30:03] ERROR [scripts.scheduled_tasks.pool_scrub:76] Failed to start scrub at pools/bigdata/scrub
[21/Aug/2017 07:30:03] ERROR [scripts.scheduled_tasks.pool_scrub:37] Failed to get scrub status at pools/bigdata/scrub/status
[21/Aug/2017 09:55:57] ERROR [scripts.scheduled_tasks.pool_scrub:76] Failed to start scrub at pools/bigdata/scrub
[21/Aug/2017 09:55:57] ERROR [scripts.scheduled_tasks.pool_scrub:37] Failed to get scrub status at pools/bigdata/scrub/status
[21/Aug/2017 09:55:57] DEBUG [scripts.scheduled_tasks.pool_scrub:87] task(5) finished with state(error).
[21/Aug/2017 10:02:02] ERROR [scripts.scheduled_tasks.pool_scrub:76] Failed to start scrub at pools/bigdata/scrub
[21/Aug/2017 10:02:02] ERROR [scripts.scheduled_tasks.pool_scrub:37] Failed to get scrub status at pools/bigdata/scrub/status
[21/Aug/2017 10:02:02] DEBUG [scripts.scheduled_tasks.pool_scrub:87] task(5) finished with state(error).

root mail

From rocky@gamull.com  Mon Aug 21 10:02:02 2017
Return-Path: <rocky@gamull.com>
X-Original-To: root
Delivered-To: root@rocky.gamull.com
Received: by rocky.gamull.com (Postfix, from userid 0)
	id DD5332858A8; Mon, 21 Aug 2017 10:02:02 -0400 (EDT)
From: "(Cron Daemon)" <rocky@gamull.com>
To: root@rocky.gamull.com
Subject: Cron <root@rocky> /opt/rockstor//bin/st-pool-scrub 5 \*-*-*-*-*-*
Content-Type: text/plain; charset=UTF-8
Auto-Submitted: auto-generated
Precedence: bulk
X-Cron-Env: <XDG_SESSION_ID=8>
X-Cron-Env: <XDG_RUNTIME_DIR=/run/user/0>
X-Cron-Env: <LANG=en_US.UTF-8>
X-Cron-Env: <SHELL=/bin/bash>
X-Cron-Env: <PATH=/sbin:/bin:/usr/sbin:/usr/bin>
X-Cron-Env: <MAILTO=root>
X-Cron-Env: <MAILFROM=rocky@gamull.com>
X-Cron-Env: <HOME=/root>
X-Cron-Env: <LOGNAME=root>
X-Cron-Env: <USER=root>
Message-Id: <20170821140202.DD5332858A8@rocky.gamull.com>
Date: Mon, 21 Aug 2017 10:02:02 -0400 (EDT)

Traceback (most recent call last):
  File "/usr/lib64/python2.7/logging/handlers.py", line 76, in emit
    if self.shouldRollover(record):
  File "/usr/lib64/python2.7/logging/handlers.py", line 154, in shouldRollover
    msg = "%s\n" % self.format(record)
  File "/usr/lib64/python2.7/logging/__init__.py", line 724, in format
    return fmt.format(record)
  File "/usr/lib64/python2.7/logging/__init__.py", line 464, in format
    record.message = record.getMessage()
  File "/usr/lib64/python2.7/logging/__init__.py", line 324, in getMessage
    msg = str(self.msg)
TypeError: __str__ returned non-string (type list)
Logged from file pool_scrub.py, line 78
Traceback (most recent call last):
  File "/usr/lib64/python2.7/logging/handlers.py", line 76, in emit
    if self.shouldRollover(record):
  File "/usr/lib64/python2.7/logging/handlers.py", line 154, in shouldRollover
    msg = "%s\n" % self.format(record)
  File "/usr/lib64/python2.7/logging/__init__.py", line 724, in format
    return fmt.format(record)
  File "/usr/lib64/python2.7/logging/__init__.py", line 464, in format
    record.message = record.getMessage()
  File "/usr/lib64/python2.7/logging/__init__.py", line 324, in getMessage
    msg = str(self.msg)
TypeError: __str__ returned non-string (type list)
Logged from file pool_scrub.py, line 39

@magicalyak it would appear from your logs:

that you have not re-created your scrub tasks post the 3.9.1-2 release (see later), i.e. from my first comment in this thread and my first to yourself in this thread :face_with_raised_eyebrow::

I.e. the URL for your logged tasks is using the pool name, whereas from the end of the last testing release (last stable), and all throughout the current testing channel release cycle, we have been using the db id of a pool, not its name, in the API calls. And more specifically, since the second update released in the current testing channel, 3.9.1-2:

i.e.:

scripts.scheduled_tasks.pool_scrub - update_state constructs url=pools/2/scrub/status

and then uses that to query the scrub state.
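A sketch of the difference (the function name build_scrub_status_url is hypothetical, purely to illustrate; the path shapes come from the logs above):

```python
def build_scrub_status_url(pool_ident):
    """Return the API path used to query a pool's scrub status.

    3.9.1-2 onward passes the pool's db id; stale scheduled tasks
    created before that still pass the pool name, which no longer works.
    """
    return "pools/%s/scrub/status" % pool_ident


print(build_scrub_status_url(2))          # pools/2/scrub/status        (current)
print(build_scrub_status_url("bigdata"))  # pools/bigdata/scrub/status  (stale task)
```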

Sorry, I should have double-checked that you were on a relevant release; I obviously made a misleading mistake in my post:

Silly me.

I was thinking that the number indicated the pool but, of course, as per my post to @peter:

So for my example here:

cat /etc/cron.d/rockstortab 

SHELL=/bin/bash
PATH=/sbin:/bin:/usr/sbin:/usr/bin
MAILTO=root
# These entries are auto generated by Rockstor. Do not edit.
*/5 * * * * root /opt/rockstor-dev//bin/st-pool-scrub 7 \*-*-*-*-*-*

that “7” points to:

psql -U rocky smartdb

select * from smart_manager_taskdefinition;
 id |         name         | task_type |                    json_meta                    | enabled |   crontab   | crontabwindow 
----+----------------------+-----------+-------------------------------------------------+---------+-------------+---------------
  7 | luks-on-bcache-scrub | scrub     | {"pool_name": "pool-on-luks-dev", "pool": "2"}  | t       | */5 * * * * | *-*-*-*-*-*
  6 | root-scrub3          | scrub     | {"pool_name": "rockstor_rockstor", "pool": "1"} | f       | */5 * * * * | *-*-*-*-*-*
(2 rows)
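The json_meta column above is where the pool reference lives; reading it back is straightforward (contents copied from the query output above):

```python
import json

# json_meta for taskdefinition id 7, copied from the table above.
json_meta = '{"pool_name": "pool-on-luks-dev", "pool": "2"}'
meta = json.loads(json_meta)

print(meta["pool_name"])  # pool-on-luks-dev
print(meta["pool"])       # 2  <- the pool db id used in the API calls
```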

Oh well, my mistake; see how you get on after updating to the latest testing release and then deleting and re-creating your scheduled scrubs.

On my side of the updates wagon we now have the following pending pull request awaiting review:

https://github.com/rockstor/rockstor-core/pull/1801

Thanks @magicalyak and I hope the above has sufficiently explained my misunderstanding.