Scrub won't start (unfinished job in history table)

Once upon a time I started a scrub and it finished fine. The strange thing is that there are two entries in the scrub history for the same scrub, and the first of the two entries never finished (there is NO scrub running for any pool ATM!).

When trying to start a new scrub from GUI it throws the following error:

    Traceback (most recent call last):
      File "/opt/rockstor/src/rockstor/rest_framework_custom/generic_view.py", line 41, in _handle_exception
        yield
      File "/opt/rockstor/src/rockstor/storageadmin/views/pool_scrub.py", line 88, in post
        'and start a new scrub, use force option' % pid)
    TypeError: %d format: a number is required, not unicode

How can I start a new scrub?

@Christian_Rost Hello again.
Thanks for reporting your findings. A complication in this case is that Rockstor’s reporting of the problem state has a very recent bug in it; the %d should be changed to %s:

An issue has been opened for this message formatting bug:

and a pull request to address it is awaiting review:

If you fancy trialling the ‘fix’, at least for the message formatting part of this, then you simply need to change
%d
to
%s
in the highlighted line above. A reboot, or a Rockstor service restart, should then yield the full message.
No worries if you are not game to edit that file on your installation: it’s a risky thing and not advised unless you are happy editing on the command line and are prepared for the (unlikely) possibility of the system becoming unusable as a consequence. We do have nano pre-installed if you want to give it a go, though.

The installed version of this file is in:
/opt/rockstor/src/rockstor/storageadmin/views/pool_scrub.py
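
To make the nature of the fix clearer, here is a minimal, self-contained reproduction of the formatting bug; it is reconstructed from the message quoted in this thread rather than copied from pool_scrub.py, so treat it as a sketch:

    # Minimal reproduction (Python 2) of the formatting bug and its fix.
    pid = u"3"  # in the failing case the pool id arrives as unicode, not an int

    try:
        # Before the fix: %d insists on a number, so formatting itself blows up.
        print('A Scrub process is already running for pool(%d). If you really '
              'want to kill it and start a new scrub, use force option' % pid)
    except TypeError as e:
        print(e)  # %d format: a number is required, not unicode

    # After the fix: %s renders any value, so the intended message appears.
    print('A Scrub process is already running for pool(%s). If you really '
          'want to kill it and start a new scrub, use force option' % pid)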

To your point: the message should read:

A Scrub process is already running for pool (a-number-here). If you really want
to kill it and start a new scrub, use force option

So you could try ticking the force option that is presented just prior to starting the scrub.

Out of curiosity you could also check the scrub state of play via the command line:

btrfs scrub status /mnt2/pool-name

Don’t worry about doing that edit unless you are game, as all it will do is fix the message formatting bug; but you may be interested in dabbling, and the info is presented in that light.

Hope that helps.

Thanks, Phil!

I was not precise enough: the error occurs even when running with the ‘force’ flag.
A scrub started from the command line has been running for some time now. My question about how to start a scrub was about doing so from the GUI.

I’ll give editing the source a shot without starting a vim-over-nano discussion :wink: It might take some time since I’ll be AFK for a few hours, but I’ll report back.

Best regards,

Christian

Sorry, but editing the source file just as you described brings up another error:

    Traceback (most recent call last):
      File "/opt/rockstor/eggs/gunicorn-0.16.1-py2.7.egg/gunicorn/workers/sync.py", line 34, in run
        client, addr = self.socket.accept()
      File "/usr/lib64/python2.7/socket.py", line 202, in accept
        sock, addr = self._sock.accept()
    error: [Errno 11] Resource temporarily unavailable

This error shows up no matter whether I check the force option or not.
My manually started scrub finished successfully after about 4.5 h and nothing is running at the moment, so there should definitely be no reason for a scrub not to start, even without the force option.

I still think the history table is the matter, since I guess that it is what is being looked up rather than issuing a syscall. (See my posted screenshot.)
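
To make that hypothesis concrete, here is a toy illustration (not Rockstor code; the data structure and names are made up) of how a history row that was never marked as finished can make a purely table-based check report a scrub as running, even when btrfs itself shows nothing in progress:

    # Toy model of a scrub history table containing a stale, never-finished row.
    history = [
        {"id": 41, "pool": "DATA", "status": "finished"},
        {"id": 42, "pool": "DATA", "status": "started"},  # never updated
    ]

    def table_says_running(pool):
        # A table-based check trusts the stored status field alone.
        return any(row["pool"] == pool and row["status"] == "started"
                   for row in history)

    print(table_says_running("DATA"))  # True, even if btrfs reports no scrub running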

Best regards,

Christian

@Christian_Rost Thanks for the feedback; as mentioned, your circumstance had the complication of the error-reporting formatting bug.
i.e. (from the quoted pull request) before the formatting change:


and after:

and I needed to get that out of the way first, as it would affect anyone attempting to start a scrub while an existing one was running; it was also standing in the way of the real exception message.

Now that you have the formatting issue out of the way we should be able to see the actual exception rather than the formatting error. As you can see, the Traceback for your case and for the case of a genuinely running scrub are identical; what is the message you receive directly under “Houston, we’ve had a problem.” (in red text)?

Also what is the exact output of:

btrfs scrub status /mnt2/pool-name

as requested before; this output (as well as the table) is also used by the Rockstor code to assess the current scrub status.
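
For anyone following along, that command-line check is easy to script; the sketch below is an illustration only (not Rockstor’s actual code), and its detection heuristic is an assumption about the wording btrfs-progs prints, but it conveys the rough idea of turning that output into a running/not-running answer:

    import subprocess

    def scrub_appears_running(mnt_pt):
        """Rough check of whether `btrfs scrub status` reports a scrub in progress."""
        out = subprocess.check_output(["btrfs", "scrub", "status", mnt_pt])
        # A completed scrub reports e.g. "... and finished after 04:35:19";
        # anything started but neither finished nor aborted is treated as running.
        started = b"scrub started at" in out
        done = b"finished after" in out or b"aborted" in out
        return started and not done

    # Example: print(scrub_appears_running("/mnt2/DATA"))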

Agreed, but the ‘force’ option is intended to apply in exactly such cases, so we just need to narrow down a little more what is not working as intended and why, at least in your case.
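
For context, the intent of the ‘force’ path is essentially “cancel whatever scrub is running, then start a new one”. A minimal sketch of that intent at the btrfs level (not Rockstor’s actual implementation; the function name is made up):

    import subprocess

    def force_restart_scrub(mnt_pt):
        # Cancel any scrub currently running on the pool; `btrfs scrub cancel`
        # returns non-zero when nothing is running, which is tolerated here.
        subprocess.call(["btrfs", "scrub", "cancel", mnt_pt])
        # Start a fresh scrub; without -B it runs in the background.
        subprocess.check_call(["btrfs", "scrub", "start", mnt_pt])

    # Example: force_restart_scrub("/mnt2/DATA")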

It would be great to get to the bottom of this one, so your assistance is appreciated. I’m just not sure yet exactly where the problem is; once it’s more closely identified and we have a reproducer, an issue and consequent fix should follow.

After getting the above command output (for this thread), does the problem persist after a reboot?

Thanks.

The error message shown (with the ‘Force’ option checked and after a reboot) is:

Houston, we’ve had a problem.

A Scrub process is already running for pool(3). If you really want to kill it and start a new scrub, use force option

    Traceback (most recent call last):
      File "/opt/rockstor/eggs/gunicorn-0.16.1-py2.7.egg/gunicorn/workers/sync.py", line 34, in run
        client, addr = self.socket.accept()
      File "/usr/lib64/python2.7/socket.py", line 202, in accept
        sock, addr = self._sock.accept()
    error: [Errno 11] Resource temporarily unavailable

Here is some system output that might be of relevance:

btrfs scrub status /mnt2/DATA/

    scrub status for 2e798215-006d-4c8f-b85f-bcf0ae6c41ca
        scrub started at Sat Jul 29 14:29:44 2017 and finished after 04:35:19
        total bytes scrubbed: 8.87TiB with 0 errors

date

    So 30. Jul 13:36:10 CEST 2017
(Just to point out that the manually started scrub finished a ‘long’ time ago.)

btrfs fi df /mnt2/DATA/

    Data, RAID1: total=4.49TiB, used=4.43TiB
    System, RAID1: total=64.00MiB, used=736.00KiB
    Metadata, RAID1: total=6.00GiB, used=4.74GiB
    GlobalReserve, single: total=512.00MiB, used=0.00B

btrfs fi du -s /mnt2/DATA/

      Total  Exclusive  Set shared  Filename
    4.43TiB    4.43TiB       0.00B  /mnt2/DATA/

BTW: I started a scrub on my BACKUP pool without the ‘Force’ option and it worked just fine.

Sorry that I am not more of a help. I did Perl programming for OTRS for some years, but Python is completely new to me.

@Christian_Rost

Thanks for the outputs. Just to let you know that I have now opened an issue and made some progress on at least one cause of this bug. I hope to track down the other (the ‘force’ option lacking the essence of its name) as part of the same issue, but may have to break that out into another:

@Christian_Rost and @Dragon2611 from:

As per:

I ended up breaking that issue out on its own:

and both of these issues now have associated pull requests, so assuming the review process goes as hoped, these fixes should be in place in the testing channel updates soon.

Hope that helps.