So I just restarted my NAS after 60 days of uptime, and weirdly ztask was very broken afterward. I’ve tried going through potential packages that may have broken it (pyzmq, tornado etc.) but I’ve only been able to fix it by essentially applying the patch from https://github.com/leapcode/bitmask_client/pull/932/files to django_ztask. Obviously this is a bandaid at best, but it stopped my ztask from crash-looping. The stack trace is one that’s shown up on here quite a bit already:
Traceback (most recent call last):
File "/opt/rockstor/bin/django", line 44, in <module>
File "/opt/rockstor/eggs/djangorecipe-1.9-py2.7.egg/djangorecipe/manage.py", line 9, in main
File "/opt/rockstor/eggs/Django-1.8.16-py2.7.egg/django/core/management/__init__.py", line 354, in execute_from_command_line
File "/opt/rockstor/eggs/Django-1.8.16-py2.7.egg/django/core/management/__init__.py", line 346, in execute
File "/opt/rockstor/eggs/Django-1.8.16-py2.7.egg/django/core/management/base.py", line 394, in run_from_argv
File "/opt/rockstor/eggs/Django-1.8.16-py2.7.egg/django/core/management/base.py", line 445, in execute
output = self.handle(*args, **options)
File "/opt/rockstor/eggs/django_ztask-0.1.5-py2.7.egg/django_ztask/management/commands/ztaskd.py", line 43, in handle
File "/opt/rockstor/eggs/django_ztask-0.1.5-py2.7.egg/django_ztask/management/commands/ztaskd.py", line 87, in _handle
self.io_loop.add_handler(socket, _queue_handler, self.io_loop.READ)
File "/usr/lib64/python2.7/site-packages/tornado/ioloop.py", line 727, in add_handler
self._impl.register(fd, events | self.ERROR)
TypeError: argument must be an int, or have a fileno() method
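For what it’s worth, that final TypeError is raised by CPython’s own epoll wrapper, which Tornado’s IOLoop registers handlers through: it only accepts an integer fd or an object with a fileno() method. Raw pyzmq sockets expose their underlying fd differently (via the zmq.FD socket option rather than fileno()), which is presumably why handing one straight to a newer Tornado IOLoop blows up, and why the bitmask patch swaps in pyzmq’s Tornado-compatible ioloop. A minimal stdlib-only illustration of the error (Linux-only, since it uses select.epoll; FakeZmqSocket is just a stand-in, not a real pyzmq object):

```python
import select  # select.epoll is Linux-only


class FakeZmqSocket:
    """Stand-in for a raw pyzmq socket: deliberately has no fileno().
    (Real pyzmq sockets expose their fd via getsockopt(zmq.FD) instead.)"""


poller = select.epoll()
try:
    # Same registration path Tornado's IOLoop.add_handler ends up taking
    poller.register(FakeZmqSocket(), select.EPOLLIN)
except TypeError as exc:
    print(exc)  # e.g. "argument must be an int, or have a fileno() method"
finally:
    poller.close()
```

So anything that passes a bare zmq socket to a Tornado IOLoop built on top of epoll will hit this, independent of django-ztask itself.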
@freaktechnik Hello again,
Thanks for keeping an eye on this one. As it happens, your follow-up is rather timely. I’ve just recently chosen our replacement for the long-orphaned django-ztask and have detailed my proposal for this swap-out in the following issue:
As detailed in that issue and its references, this is going to be a tricky one as we finally embark on the main part of addressing our technical debt, having now pretty much addressed our prior OS concerns via the ‘Built on openSUSE’ move.
If you have a simple, clean reproducer for the failure you are seeing in this service it may well help to prove any fix we hope to instantiate via this planned supplanting of django-ztask with Huey. If you could first post it here it may help to attract others’ reports of functional failure, and we can then prioritise this effort accordingly, and possibly refine the reproducer ready for its addition to the GitHub issue. Initially, as per the issue, I’m tempted to do it all in our next testing development cycle given the significant number of changes required. But I’m still unclear on how your reported failures impact folks in their day-to-day use.
I got the error whenever I tried to start/stop a rock-on. However, as mentioned, I’m not sure if there isn’t some specific package update or similar that broke it, and it would still be fine on a clean install.
I have now made the recent connection to a prior CentOS customisation found by @maxhq way back concerning an update to python-tornado:
As you suspect re:
And if so we need to establish if this is a ‘standard’ update within Leap 15.2. I’m making progress on the previously referenced issue to replace django-ztaskd; it’s slow going currently, but progress nevertheless.
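Given that the earlier CentOS-era fix revolved around a python-tornado update, one quick sanity check is simply to see which Tornado major version is installed before digging further. A hedged sketch (the version cutoff is passed in as a parameter because I haven’t verified exactly which Tornado release broke django-ztask’s IOLoop usage; also note the original stack ran under Python 2.7, whereas this check assumes a modern Python 3 environment):

```python
from importlib.metadata import PackageNotFoundError, version


def tornado_major():
    """Return the installed Tornado major version, or None if not installed."""
    try:
        return int(version("tornado").split(".")[0])
    except PackageNotFoundError:
        return None


def tornado_suspect(cutoff):
    """True if Tornado is installed at or above the (assumed) breaking major."""
    major = tornado_major()
    return major is not None and major >= cutoff


if tornado_suspect(cutoff=4):  # cutoff is an assumption, not a verified value
    print("Newer Tornado installed: django-ztask's IOLoop usage may break")
```

Comparing that output on a broken install versus a clean one would at least confirm or rule out the tornado-update theory.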
Let us know if you manage to track this down further as it currently represents a potential ‘show-stopper’ for our Rockstor 4 release if it doesn’t relate to additional repositories on top of an install resulting from our new DIY rockstor-installer.
Cheers, and thanks again for the follow up report here.
@freaktechnik Thanks for the test/feedback. Much appreciated. And re:
Certainly hope so. Although I may have missed a straggler in some dark corner. But yes, Huey is our new replacement and bang up to date. So one of the oldest parts of Rockstor has just become the newest. But it was a much larger change than I had anticipated, or would have liked to make, ideally, in the RC stage, but never mind. It had to be done as otherwise it was holding back all other updates (no Python 3 or newer Django option with the now-abandoned ztaskd thing).
I suspect we have a new set of bugs, but we can at least work through them, and others, as we move the whole code base onto newer ‘stuff’.
Could you take a look at the report by @greven here:
I have a small suspicion we have one too many threads going on and so have inadvertently created some intermittent issues such as described there. So sometimes ‘stuff’ can reach the db and other times not, or the like. During that Huey PR I had to change the number of workers we run, and I’m wondering if, in the process (or out of it, more like), we occasionally have potentially random failures such as that. It may be completely unrelated though, and we need more info to know.
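On the worker-count suspicion: with Huey’s Django integration that knob lives in the HUEY dict in settings.py, where the 'consumer' options mirror the huey_consumer command-line flags. A sketch of the relevant fragment (values and queue name are illustrative only, not Rockstor’s actual configuration):

```python
# settings.py fragment -- illustrative values, not Rockstor's real config.
# Huey's Django integration (huey.contrib.djhuey) reads this HUEY dict;
# the 'consumer' sub-dict mirrors the huey_consumer CLI flags.
HUEY = {
    "name": "rockstor-tasks",    # assumed queue name, purely illustrative
    "immediate": False,          # run through the consumer, not inline
    "consumer": {
        "workers": 2,            # the knob under suspicion here
        "worker_type": "thread", # threads share one process; shared DB
                                 # connections are a classic source of the
                                 # intermittent failures described above
    },
}
```

Dropping "workers" to 1 temporarily on an affected system might be a cheap way to test whether the intermittent db-reachability failures are worker-concurrency related.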