Unknown internal error doing a POST to /api/rockons/update (old fix gone)

Brief description of the problem


Rock-Ons are non-functional; I cannot complete the Rock-Ons Update to load the list of docker programs. The suggested fix for this problem is no longer available: Lost ability to start all my rockons after updating system - #8 by sanderweel

The old fix was to install an older version of docker:
sudo zypper install --oldpackage docker-19.03.15_ce-lp152.2.9.1.x86_64

“Package ‘docker-19.03.15_ce-lp152.2.9.1.x86_64’ not found.”

Detailed step by step instructions to reproduce the problem

Installing Rockstor-Leap15.4-generic.x86_64-4.5.8-0.install.iso
I have done three clean installs on an 8 GB ZimaBoard, installing exactly as the YouTube video showed, with the install to the home directory. For another clean install I created a dedicated Rock-Ons-root share on a non-boot SSD, as the Rockstor online documentation suggests. With all the installs I noticed that the Rock-Ons tab page says the Rock-Ons service is not running, even when the Services page and the Rock-Ons tab page say it is on. So I turn it off and back on to make it happy, so I can get to the Update button. I press Update and, after a long wait, I get the unknown error. I tried the suggested fix for the problem, but the old docker package is no longer available.

Even a workaround describing how to remove the Rockstor docker software and install standard docker on its own, avoiding the problem, would be helpful.

Web-UI screenshot


Error Traceback provided on the Web-UI


@jimla1965 welcome to the Rockstor community.

First, let’s check whether docker is actually running at the command line:
systemctl status docker (I assume it does).

Also, to ensure that you can reach the Rockon repository, can you try this:

curl https://rockstor.com/rockons/root.json

Can you take a look at the log entries, starting with

tail -n200 /opt/rockstor/var/log/rockstor.log

and see whether you see any errors regarding calls related to Rockons?

I don’t think the docker version is causing the issue for you, but of course, I might be wrong on that.

I have not seen the ZimaBoard before (which doesn’t mean much); I am curious how you have set it up. SSD for the boot drive and then spinning HDDs for the storage portion? Or all SSDs?


Thank you for the welcome.

I was a lead OS designer at Microsoft for 10 years, but I have been retired for many years; FAT32 was the last feature I shipped. I love BTRFS, but I’m not too good at fixing Linux problems. I am very impressed by the Rockstor OS.

systemctl status docker:
Active: active (running) since Mon 2023-09-25 23:20:54 HST; 17h ago

imla@Zima:~> curl https://rockstor.com/rockons/root.json
“Airsonic Advanced”: “airsonic-advanced.json”,
“Bitcoin”: “bitcoind.json”,
“Booksonic”: “booksonic.json”,
“Collabora Online”: “collabora-online.json”,

Nothing about Rock-Ons appears in this log, and nothing is added to it when I reproduce the error.

Zimaboard is a low-cost single board server with a Celeron N3450 Quad Core.

I booted off of an NVMe M.2 SSD and a built-in 32 GB eMMC drive. I did not try a USB flash drive or a hard drive.

From what I understand, I either need to upgrade docker to version 20.10.7 or later (the best option), or avoid the bug by enabling IPv6 networking, which Rockstor disables by default. I don’t know how to do either, or whether it is safe to do so.

It might just be a timeout issue that has been experienced in a couple of other scenarios (but unfortunately also not really been conclusively resolved):

wondering whether you’re seeing some similar messages around WORKER TIMEOUT

What is the docker version you have? On mine it shows:

~ # docker --version
Docker version 24.0.5-ce, build a61e2b4c9

I have the same version.

Docker version 24.0.5-ce, build a61e2b4c9

I also see the 120-second timeout, and the timeout entries in gunicorn.log, just like the other post.
jimla@Zima:/mnt2/ROOT/opt/rockstor/var/log> cat gunicorn.log
[2023-09-27 13:42:41 +0000] [11550] [CRITICAL] WORKER TIMEOUT (pid:668)
[2023-09-27 13:42:41 +0000] [668] [INFO] Worker exiting (pid: 668)
[2023-09-27 13:42:42 +0000] [4756] [INFO] Booting worker with pid: 4756

It looks like a showstopper for me; if docker won’t work, I can’t use Rockstor for a media server.

The only two docker apps I need to run are Plex and Jellyfin.

Any suggestions for working around Rockstor’s Rock-Ons Update 120-second WORKER TIMEOUT bug?

Until we can figure this out better, maybe you can install the Rock-on configuration locally, as if it were your own Rock-on, as described in:


Manually create the directory:

mkdir /opt/rockstor/rockons-metastore

Download the Plex configuration file here:

and correspondingly for Jellyfin:

Optionally, in case the Rock-on page starts working for you, to avoid a duplicate-name error: in each file, change the name of the Rock-on in line 2, e.g. Plex → Plex local, and, more importantly, change the container name in line 4, e.g. plex-linuxserver.io → plex-linuxserver.io_local


In the Jellyfin file it’s line 14, I believe:

Place both files in the directory created above, then select the Update button. I do hope those 2 will show up and you won’t be stopped by the error message. If the error message pops up, you might be able to clear it by refreshing the page; the 2 “locals” will still be there.

I can’t test this myself, since I am not running into the problem, so your mileage may vary. But maybe it’s worthwhile to try it with at least one of them.
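If it helps, the rename step can be sketched in shell. The snippet below operates on a minimal stand-in file under /tmp so it can be tried safely anywhere; the exact JSON layout is an assumption, and on a real system you would run the same sed against the downloaded files in /opt/rockstor/rockons-metastore:

```shell
# Demonstrate the rename on a minimal stand-in plex.json; the real file,
# downloaded from the links above, goes in /opt/rockstor/rockons-metastore.
mkdir -p /tmp/rockons-metastore
cat > /tmp/rockons-metastore/plex.json <<'EOF'
{
    "Plex": {
        "containers": {
            "plex-linuxserver.io": {}
        }
    }
}
EOF
# Rename the Rock-on (line 2) and its container (line 4) so they do not
# clash with the entries fetched from the remote registry:
sed -i -e 's/"Plex"/"Plex local"/' \
       -e 's/"plex-linuxserver.io"/"plex-linuxserver.io_local"/' \
       /tmp/rockons-metastore/plex.json
grep '"Plex local"' /tmp/rockons-metastore/plex.json
```

The same two substitutions, pointed at the real metastore directory, should give you locally named copies that won’t collide with the registry versions.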

Another interim workaround is, of course, to run these containers from the command line with all of the parameters… you can PM me if it comes to that, and we can possibly construct the command line for that together.
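For what it’s worth, a command-line run for Plex along the lines of the linuxserver.io image conventions usually looks something like the sketch below. The host paths and PUID/PGID values are placeholder assumptions you would adapt; the leading `echo` only prints the command for review rather than running it (remove it once the paths are right):

```shell
# Print (not execute) a candidate `docker run` line for Plex, following
# the linuxserver/plex image conventions; the /mnt2/... host paths and
# PUID/PGID values are examples only.
echo docker run -d --name=plex \
  --net=host \
  -e PUID=1000 -e PGID=1000 \
  -v /mnt2/plex-config:/config \
  -v /mnt2/media:/data \
  --restart unless-stopped \
  linuxserver/plex
```

A Jellyfin equivalent would follow the same pattern with its own image and /config and media volumes.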


One other thought for @phillxnet and @Flox: after reading all over the place on this symptom, could it make sense to increase the gunicorn --timeout parameter to, say, 200 or 300 temporarily, to see whether that would address it?
It currently is using the default timeout (i.e. it’s not set in the configuration file).
I also saw the recommendation that, if one is using it in combination with nginx, there are two parameters that should be set (in nginx) to some higher number (in the example it was 120s for both):


I know that, if anything, it will cure the symptom rather than address the root cause, but it might be a workaround until we can gather more information or revamp the Rock-on processing, or what have you… Here we seem to have an instance where it consistently doesn’t work, vs. the other reported case we’ve had, so it’s at least reproducible…


Thanks a lot, @Hooverdan , for all this work!
I think it’s a good idea and worth trying at the very least. I would maybe increase it to something even higher if there is indeed some substantial slow down along the way.

I’m really eager to have some time to spend again on our Rock-Ons fetching logic so that we can reduce the likelihood of this sort of shortcoming. Once I’m done with my current Tailscale endeavor, I’ll probably have a look at it again.


Ok, @jimla1965, if you are game to try this, you could add the timeout parameter to the gunicorn configuration, and as a second step add the other two parameters to the nginx configuration.

@Flox, again, not being the expert here, but I think, this is how the change would need to be done. Please double-check and I can correct …

EDIT: changing the gunicorn edit based on @Flox’s comment below

nano /opt/rockstor/etc/supervisord.conf

This will open the supervisor configuration file, which holds the gunicorn configuration. Increase the --timeout parameter to, say, 600, like so, in row 47:


; Gunicorn
; TODO Move to using systemd socket invocation of a systemd service.
; See: "Step 7 — Creating systemd Socket and Service Files for Gunicorn"
; https://www.digitalocean.com/community/tutorials/how-to-set-up-django-with-postgres-nginx-and-gunicorn-on-ubuntu-18-04#step-7-creating-systemd-socket-and-service-files-for-gunicorn
command=/opt/rockstor/.venv/bin/gunicorn --bind= --pid=/run/gunicorn.pid --workers=2 --log-file=/opt/rockstor/var/log/gunicorn.log --pythonpath=/opt/rockstor/src/rockstor --timeout=120 --graceful-timeout=120 wsgi:application

so that it looks like this:


; Gunicorn
; TODO Move to using systemd socket invocation of a systemd service.
; See: "Step 7 — Creating systemd Socket and Service Files for Gunicorn"
; https://www.digitalocean.com/community/tutorials/how-to-set-up-django-with-postgres-nginx-and-gunicorn-on-ubuntu-18-04#step-7-creating-systemd-socket-and-service-files-for-gunicorn
command=/opt/rockstor/.venv/bin/gunicorn --bind= --pid=/run/gunicorn.pid --workers=2 --log-file=/opt/rockstor/var/log/gunicorn.log --pythonpath=/opt/rockstor/src/rockstor --timeout=600 --graceful-timeout=120 wsgi:application

Save (<Ctrl>+o, confirm by selecting <Enter>) and exit the file (<Ctrl>+x).

For nginx, you could add the proxy parameters here underneath this configuration block within the http section:

nano /opt/rockstor/etc/nginx/nginx.conf
This should open the existing configuration file.

	log_format main
		'$remote_addr - $remote_user [$time_local] '
		'"$request" $status $bytes_sent '
		'"$http_referer" "$http_user_agent" '

	client_header_timeout	10m;
	client_body_timeout		10m;
	send_timeout			10m;

so it would look something like this:

	log_format main
		'$remote_addr - $remote_user [$time_local] '
		'"$request" $status $bytes_sent '
		'"$http_referer" "$http_user_agent" '

	client_header_timeout	10m;
	client_body_timeout		10m;
	send_timeout			10m;
	proxy_connect_timeout	10m;
	proxy_read_timeout		10m;

Save and Exit.

Easiest is to just reboot the system to pick up these changes, either via the WebUI or at the command line with systemctl reboot.


@Hooverdan , thank you so much, and my apologies for not being able to participate more than that…
We currently use supervisord to control gunicorn so the change would be at:


Duh, thanks @Flox. I have updated the previous response to accurately reflect what you just pointed out. I also changed all three timeout parameters to reflect a 10-minute timeout …


I’m doing a “Modify RAID level” with 16 TB drives. It is taking a lot of time; after it completes, I will start with the timeout changes to see whether they fix the problem.

Then if that does not work I will try manually installing the docker apps.

Thanks for taking the time to give suggestions.


Good luck with the modifying … we will be waiting with bated breath for your update (when and if you can) :slight_smile:

With your changes it is still timing out at 120 seconds with the normal error, not the new 600 seconds. So I also tried changing --graceful-timeout=600 and rebooted to see if that would make a difference. (I’m not sure if this did anything.)

When I went into Rock-Ons before doing an update a second time, the list was populated, so the workaround seems to work. The timeout still happens at 120 seconds, but the container list is loaded after rebooting.

So I would advise others with similar problems:

  1. Make the changes @Hooverdan suggested
  2. Reboot
  3. Go into Rock-Ons and do an update. You may still get an error, but wait 10 minutes to allow the update to happen in the background.
  4. Reboot
  5. Go into Rock-Ons and check whether there is a container list

If there is no list, also try changing --graceful-timeout=600 and rebooting.
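To confirm whether the larger value actually reached the running gunicorn, the process command line can be inspected. The sketch below uses a sample string so it can be run anywhere; on the Rockstor box itself, substitute the output of `ps -eo args | grep '[g]unicorn'` for the sample:

```shell
# Sample gunicorn command line, as it would appear in `ps -eo args`
# after the supervisord.conf edit; substitute the real ps output here.
cmd='gunicorn --workers=2 --timeout=600 --graceful-timeout=120 wsgi:application'
# Extract the --timeout value actually in effect:
echo "$cmd" | grep -o -- '--timeout=[0-9]*'
```

If this still prints --timeout=120 on the live system after a reboot, the edited configuration is not the one supervisord is launching gunicorn from.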

Also, a typo in @Hooverdan’s suggestion:
nano /opt/rockstor/etc/nginx/niginx.conf has a typo and needs sudo:
sudo nano /opt/rockstor/etc/nginx/nginx.conf

I have a 100% reproducible problem, and I don’t think a 10-minute timeout plus an error message is a reasonable permanent fix. I am willing to do more testing if there is a chance the fix will be put into the product. A better customer experience would be to install the Rock-Ons support entirely at install time, using the boot drive by default, ready to use. Then, at the point of installing the first container or running an update, offer to improve performance by suggesting a default share name and the best drive, both of which the user can change. Don’t require the customer to leave Rock-Ons to create the share; just do it for them.

I was in charge of Quality Assurance for the final Windows 95 Beta because of my particular risk/reward judgement. We did not have time or resources to fix all the bugs; any bug fix had a 50% chance of creating a new bug. Part of my job as a Test Lead was making the hard call on which Beta bugs were important to fix from a customer point of view across the entire product, setting their priority level, and assigning them to the proper development team.


@jimla1965 thanks for reporting back on this.

As for:

I have corrected that unfortunate typo. Based on your sudo comment, I assume you were using the shellinabox system shell from the WebUI. If so, you are obviously correct that for the Web-based system shell you need an elevated prompt, either by prefixing every command you execute with sudo, or by elevating the user with su - for the length of the system shell session.
I had assumed you were using something like PuTTY to connect to the system as the root user.

In any case,

agreed, it certainly should be considered a workaround at this time, and as you’ve already reported, this is not 100% foolproof either. As @Flox mentioned, this reproducible situation is a good starting point to revisit the Rock-on processing logic.

Thank you for offering; I am pretty sure that your help will be needed to further flesh out the root cause, and your background in the QA space should prove helpful, too.

I think in this area it’s not necessarily a bug per se; rather, a more efficient way of pulling and processing the Rock-Ons can be found. And, you are correct:

I just hope, it’s actually less than 50% in this context :slight_smile:

Since we have limited dev capacity at this time (which is another call to forum members that if you have some skills and interest in btrfs, django, python, js, etc., you’re more than welcome to contribute!), this might not get addressed immediately, as the team is also focusing on continuing the progression to the newest (or close to that) version of the underlying components (django and python versions), which requires accounting for many structural changes that have occurred over the years.

Thanks again for reporting back. @Flox any further ideas on how to further flesh out the underlying root cause?


@Hooverdan @Flox @jimla1965 Thanks for rooting out more of our issues here.

I know we are still using an older worker type here; we have only recently moved to being able to entertain ‘proper’ threads. I.e., our recent Py3 shift in the current testing channel enables the use of gthread gunicorn workers, which I’m keen to move to. So we have some pending improvements in the pipeline, as it were.


So, all in all, we are creeping up on such things via our ongoing technical-debt approach in current testing. And again, as @Hooverdan stated, all this interaction re field testing of ‘testing’ is invaluable and much appreciated.

Thanks for all your input on this one. I should be back from my stint on Appman soon, so will try to approach some of this then. Next in line, however, for me at least, are our Django updates, so bit-by-bit. But thereafter I hope to jump us to Py3.8.


To keep track of this symptom, I have created an issue on GitHub:


Hi @jimla1965 ,

Pardon the delayed feedback on the following; time has been very scarce for me lately. I wanted to chime in on your input and ideas below:

I see 2 points here:

  • point 1: improving the Rock-Ons update mechanism by, in part, making use of a local resource, possibly available at Rockstor install time.
  • point 2: ease the creation of shares required by Rock-ons during their install by having Rockstor take care of some of that “requirements” work.

Point 1: improve the Rock-Ons update mechanism

Yes. Absolutely agree. Our current mechanism has served users relatively well for a long time, but as the Rock-Ons registry grew rather large, this mechanism started showing some limitations. Improving how we update the list of available Rock-Ons has been, and still is, a rather high-priority item on the list. We are fortunate enough to have a few members of this community who detailed their ideas on how to do just that and implemented a few improvements, so I think we have good ground to build on now. I unfortunately cannot seem to find them all at the moment, but a few examples can be found below:

Although our top priority is to focus on our now-reducing technical debt, reworking the Rock-Ons update mechanism might be initiated at the same time. This would be a rather large change, however, so this will most likely need to occur in steps and I do think we can, for instance, start by reducing our interaction with the remote Rock-On registry and make use instead of local resources. This should greatly improve situations such as yours in which this remote fetching mechanism appears consistently problematic. I can see a path to implement this without needing to alter our current database structure, and this would represent a big advantage in being able to take this on while we are still making big changes to our Django and Python updates.

Point 2: automate share creation for Rock-Ons

Here again: yes… completely agree.
This too is a point on our to-do list, and part of a wider set of improvements to shares/paths management during Rock-Ons install. See the following GitHub issues, for instance:

Hope this helps, and thanks again for sharing your feedback, ideas, and your willingness to help test their implementation!