So I just restarted my NAS after 60 days of uptime, and weirdly ztask was very broken afterward. I’ve tried going through potential packages that may have broken it (pyzmq, tornado etc.) but I’ve only been able to fix it by essentially applying the patch from https://github.com/leapcode/bitmask_client/pull/932/files to django_ztask. Obviously this is a bandaid at best, but it stopped my ztask from crash-looping. The stack trace is one that’s shown up on here quite a bit already:
Traceback (most recent call last):
File "/opt/rockstor/bin/django", line 44, in <module>
File "/opt/rockstor/eggs/djangorecipe-1.9-py2.7.egg/djangorecipe/manage.py", line 9, in main
File "/opt/rockstor/eggs/Django-1.8.16-py2.7.egg/django/core/management/__init__.py", line 354, in execute_from_command_line
File "/opt/rockstor/eggs/Django-1.8.16-py2.7.egg/django/core/management/__init__.py", line 346, in execute
File "/opt/rockstor/eggs/Django-1.8.16-py2.7.egg/django/core/management/base.py", line 394, in run_from_argv
File "/opt/rockstor/eggs/Django-1.8.16-py2.7.egg/django/core/management/base.py", line 445, in execute
output = self.handle(*args, **options)
File "/opt/rockstor/eggs/django_ztask-0.1.5-py2.7.egg/django_ztask/management/commands/ztaskd.py", line 43, in handle
File "/opt/rockstor/eggs/django_ztask-0.1.5-py2.7.egg/django_ztask/management/commands/ztaskd.py", line 87, in _handle
self.io_loop.add_handler(socket, _queue_handler, self.io_loop.READ)
File "/usr/lib64/python2.7/site-packages/tornado/ioloop.py", line 727, in add_handler
self._impl.register(fd, events | self.ERROR)
TypeError: argument must be an int, or have a fileno() method
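For what it’s worth, that final TypeError is raised by CPython’s own epoll wrapper, which Tornado’s IOLoop registers handlers through: it only accepts an integer fd or an object with a fileno() method. Raw pyzmq sockets expose their underlying fd differently (via the zmq.FD socket option rather than fileno()), which is presumably why handing one straight to a newer Tornado IOLoop blows up, and why the bitmask patch swaps in pyzmq’s Tornado-compatible ioloop. A minimal stdlib-only illustration of the error (Linux-only, since it uses select.epoll; FakeZmqSocket is just a stand-in, not a real pyzmq object):

```python
import select  # select.epoll is Linux-only


class FakeZmqSocket:
    """Stand-in for a raw pyzmq socket: deliberately has no fileno().
    (Real pyzmq sockets expose their fd via getsockopt(zmq.FD) instead.)"""


poller = select.epoll()
try:
    # Same registration path Tornado's IOLoop.add_handler ends up taking
    poller.register(FakeZmqSocket(), select.EPOLLIN)
except TypeError as exc:
    print(exc)  # e.g. "argument must be an int, or have a fileno() method"
finally:
    poller.close()
```

So anything that passes a bare zmq socket to a Tornado IOLoop built on top of epoll will hit this, independent of django-ztask itself.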
@freaktechnik Hello again,
Thanks for keeping an eye on this one. As it happens, your follow-up is rather timely. I’ve just recently chosen our replacement for the long-orphaned django-ztask and have detailed my proposal for this swap-out in the following issue:
As detailed in that issue and its references, this is going to be a tricky one as we finally embark on the main part of addressing our technical debt, having now pretty much addressed our prior OS concerns via the ‘Built on openSUSE’ move.
If you have a simple, clean reproducer for the failure you are seeing in this service it may well help to prove any fix we hope to instantiate via this planned supplanting of django-ztask with Huey. If you could first post it here it may help to attract others’ reports of functional failure, and we can then prioritise this effort accordingly, and possibly refine the reproducer ready for its addition to the GitHub issue. Initially, as per the issue, I’m tempted to do it all in our next testing development cycle given the significant number of changes required. But I’m still unclear on how your reported failures impact folks in their day-to-day use.
I got the error whenever I tried to start/stop a rock-on. However, as mentioned, I’m not sure if there isn’t some specific package update or similar that broke it, and it would still be fine on a clean install.
I have now made the recent connection to a prior CentOS customisation found by @maxhq way back concerning an update to python-tornado:
As you suspect re:
And if so we need to establish if this is a ‘standard’ update within Leap 15.2. I’m making progress on the previously referenced issue to replace django-ztaskd; it’s slow going currently, but progress nevertheless.
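Given that the earlier CentOS-era fix revolved around a python-tornado update, one quick sanity check is simply to see which Tornado major version is installed before digging further. A hedged sketch (the version cutoff is passed in as a parameter because I haven’t verified exactly which Tornado release broke django-ztask’s IOLoop usage; also note the original stack ran under Python 2.7, whereas this check assumes a modern Python 3 environment):

```python
from importlib.metadata import PackageNotFoundError, version


def tornado_major():
    """Return the installed Tornado major version, or None if not installed."""
    try:
        return int(version("tornado").split(".")[0])
    except PackageNotFoundError:
        return None


def tornado_suspect(cutoff):
    """True if Tornado is installed at or above the (assumed) breaking major."""
    major = tornado_major()
    return major is not None and major >= cutoff


if tornado_suspect(cutoff=4):  # cutoff is an assumption, not a verified value
    print("Newer Tornado installed: django-ztask's IOLoop usage may break")
```

Comparing that output on a broken install versus a clean one would at least confirm or rule out the tornado-update theory.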
Let us know if you manage to track this down further as it currently represents a potential ‘show-stopper’ for our Rockstor 4 release if it doesn’t relate to additional repositories on top of an install resulting from our new DIY rockstor-installer.
Cheers, and thanks again for the follow up report here.
@freaktechnik Thanks for the test/feedback. Much appreciated. And re:
Certainly hope so. Although I may have missed a straggler in some dark corner. But yes, Huey is our new replacement and bang up to date. So one of the oldest parts of Rockstor has just become the newest. But it was a much larger change than I had anticipated, or would have liked to make, ideally, in the RC stage, but never mind. It had to be done as otherwise it was holding back all other updates (no Python 3 or newer Django option with the now-abandoned ztaskd thing).
I suspect we have a new set of bugs, but we can at least work through them, and others, as we move the whole code base onto newer ‘stuff’.
Could you take a look at the report by @greven here:
I have a small suspicion we have one too many threads going on and so have inadvertently created some intermittent issues such as described there. So sometimes ‘stuff’ can reach the db and other times not, or the like. During that Huey PR I had to change the number of workers we run, and I’m wondering if, in the process (or out of it, more like), we occasionally have potentially random failures such as that. It may be completely unrelated though, and we need more info to know.
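On the worker-count suspicion: with Huey’s Django integration that knob lives in the HUEY dict in settings.py, where the 'consumer' options mirror the huey_consumer command-line flags. A sketch of the relevant fragment (values and queue name are illustrative only, not Rockstor’s actual configuration):

```python
# settings.py fragment -- illustrative values, not Rockstor's real config.
# Huey's Django integration (huey.contrib.djhuey) reads this HUEY dict;
# the 'consumer' sub-dict mirrors the huey_consumer CLI flags.
HUEY = {
    "name": "rockstor-tasks",    # assumed queue name, purely illustrative
    "immediate": False,          # run through the consumer, not inline
    "consumer": {
        "workers": 2,            # the knob under suspicion here
        "worker_type": "thread", # threads share one process; shared DB
                                 # connections are a classic source of the
                                 # intermittent failures described above
    },
}
```

Dropping "workers" to 1 temporarily on an affected system might be a cheap way to test whether the intermittent db-reachability failures are worker-concurrency related.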