[SOLVED] RockOn's can't be stopped anymore after Oct' 20 yum updates - API calls fail with 403 error in logs

Hooverdan · October 8, 2020, 11:56pm

Hi,
this is still on the non-OpenSUSE version of Rockstor (version: 3.9.2-57). A few days ago I took some updates which included nginx among other things I believe.
Now I have noticed that I cannot stop RockOns via the WebUI. I can flip the switch, but the usual spinning cog does not appear. Neither can I stop the RockOn Service, nor when I hit the Update button for the RockOn directory anything happens (the blue refresh logo pops up for about a second) and then disappears.

I’ve looked at items like this one: Plex not stoppable

but I don’t see anything in the logs like this. The only message (and that’s why I pointed out nginx as one of the update packages) is this

DoesNotExist: NetworkConnection matching query does not exist.
[07/Oct/2020 16:08:34] ERROR [storageadmin.util:40] (Error while processing Rock-on profile at (http://rockstor.com/rockons/NginxProxyManager.json).). Lower level exception: (HTTPConnectionPool(host='rockstor.com', port=80): Max retries exceeded with url: /rockons/NginxProxyManager.json (Caused by <class 'socket.error'>: [Errno 4] Interrupted system call)).
[07/Oct/2020 16:08:34] ERROR [storageadmin.util:44] Exception: HTTPConnectionPool(host='rockstor.com', port=80): Max retries exceeded with url: /rockons/NginxProxyManager.json (Caused by <class 'socket.error'>: [Errno 4] Interrupted system call)
Traceback (most recent call last):
  File "/opt/rockstor/src/rockstor/rest_framework_custom/generic_view.py", line 41, in _handle_exception
yield
  File "/opt/rockstor/src/rockstor/storageadmin/views/rockon.py", line 440, in _get_available
cur_res = requests.get(cur_meta_url, timeout=10)
  File "/opt/rockstor/eggs/requests-1.1.0-py2.7.egg/requests/api.py", line 55, in get
return request('get', url, **kwargs)
  File "/opt/rockstor/eggs/requests-1.1.0-py2.7.egg/requests/api.py", line 44, in request
return session.request(method=method, url=url, **kwargs)
  File "/opt/rockstor/eggs/requests-1.1.0-py2.7.egg/requests/sessions.py", line 279, in request
resp = self.send(prep, stream=stream, timeout=timeout, verify=verify, cert=cert, proxies=proxies)
  File "/opt/rockstor/eggs/requests-1.1.0-py2.7.egg/requests/sessions.py", line 374, in send
r = adapter.send(request, **kwargs)
  File "/opt/rockstor/eggs/requests-1.1.0-py2.7.egg/requests/adapters.py", line 209, in send
raise ConnectionError(e)

The RockOns that are running, seem to be running fine, and I know I can obviously stop them using the command line, but was curious whether anybody else is encountering this. Yesterday, I did reboot the server because I wanted the Plex RockOn updated, and since the normal stop/start process didn’t work I assumed, that I might just do a reboot, but that obivously did not address the issue.

Here’s the yum history from that update:

Transaction performed with:
Installed rpm-4.11.3-43.el7.x86_64 @base
Installed yum-3.4.3-167.el7.centos.noarch @base
Installed yum-plugin-fastestmirror-1.1.31-54.el7_8.noarch @updates
Packages Altered:
Updated nginx-1:1.16.1-1.el7.x86_64 @epel
Update 1:1.16.1-2.el7.x86_64 @epel
Updated nginx-all-modules-1:1.16.1-1.el7.noarch @epel
Update 1:1.16.1-2.el7.noarch @epel
Updated nginx-filesystem-1:1.16.1-1.el7.noarch @epel
Update 1:1.16.1-2.el7.noarch @epel
Updated nginx-mod-http-image-filter-1:1.16.1-1.el7.x86_64 @epel
Update 1:1.16.1-2.el7.x86_64 @epel
Updated nginx-mod-http-perl-1:1.16.1-1.el7.x86_64 @epel
Update 1:1.16.1-2.el7.x86_64 @epel
Updated nginx-mod-http-xslt-filter-1:1.16.1-1.el7.x86_64 @epel
Update 1:1.16.1-2.el7.x86_64 @epel
Updated nginx-mod-mail-1:1.16.1-1.el7.x86_64 @epel
Update 1:1.16.1-2.el7.x86_64 @epel
Updated nginx-mod-stream-1:1.16.1-1.el7.x86_64 @epel
Update 1:1.16.1-2.el7.x86_64 @epel

Any suggestions?

Thanks.

phillxnet · October 9, 2020, 10:56am

@Hooverdan Hello again.
Re:

which ended up raising:

This looks more like it could have been an exhaustion at the backend on the rockstor.com server so it couldn’t respond in time, or an internet/network outage causing the “Max retires” to be hit.
I’ve taken a look at the logs of that server (Apache) and it looks to be functioning as expected. And the load generally is pretty low. But I may have some improving to do on the Apache config as we grow in users!

So from the logs side I’d call this a network glitch. But it’s wrong that it’s a poor show that it in turn breaks the Web-UI. I know that we currently query this server way too much within the Web-UI and if that query fails we really should fail elegantly.

So in short not much to add here actually. But the log indicates a failure to retrieve form rockstor.com.

Has this situation magically mended itself now?

Hope that helps.

Hooverdan · October 9, 2020, 3:16pm

@phillxnet unfortunately not. When trying to reboot again just now, it gave me the appropriate warning message (are you sure and all that), but it didn’t actually reboot. I then remembered that when I rebooted 2 days ago in an attempt to force the RockOn container update (this was for the Plex server), I also ended up having to SSH into the box and force the reboot at OS level.
After this reboot, the behavior remains the same.

I get the quota error message regularly in the rockstor.log file (but that’s expected at this time and should be unrelated).
The latest entry, right after the reboot is this one:

File “/opt/rockstor/src/rockstor/storageadmin/views/network.py”, line 154, in update_connection
name=dconfig[‘connection’])
File “/opt/rockstor/eggs/Django-1.8.16-py2.7.egg/django/db/models/manager.py”, line 127, in manager_method
return getattr(self.get_queryset(), name)(*args, **kwargs)
File “/opt/rockstor/eggs/Django-1.8.16-py2.7.egg/django/db/models/query.py”, line 334, in get
self.model._meta.object_name
DoesNotExist: NetworkConnection matching query does not exist.

in the nginx stderr log I just get this:

nginx: [warn] the “ssl” directive is deprecated, use the “listen … ssl” directive instead in /opt/rockstor/etc/nginx/nginx.conf:47

But I also assume that this is not related.

When I look at the ztask.log I am seeing something curious (have not looked in that log I think, ever) …

2020-10-09 07:57:48,306 - ztaskd - INFO - Calling fs.btrfs.start_resize_pool
2020-10-09 07:57:48,308 - ztaskd - ERROR - Error calling fs.btrfs.start_resize_pool. Details:
Error running a command. cmd = /usr/sbin/btrfs device delete /dev/disk/by-id/ata-HGST_HDN724040ALE640_PK1334PEHZB3ZS /dev/disk/by-id/ata-HGST_HDN724040ALE640_PK1334PEHZNA2S /mnt2/4xRAID5. rc = 1. stdout = ['']. stderr = ['ERROR: not a block device: /dev/disk/by-id/ata-HGST_HDN724040ALE640_PK1334PEHZB3ZS', 'ERROR: not a block device: /dev/disk/by-id/ata-HGST_HDN724040ALE640_PK1334PEHZNA2S', '']

I have not triggered or scheduled a pool-resizing, since I replaced all my Hard drives earlier in the year. I don’t even have these HGST drives physically in the system anymore (I replace those with WDs). In the WebUI I have not seen any errors surfaced … I can’t imagine this is related, but am listing it here anyway.

Additional Edit: For fun I tried to delete one of my scheduled tasks (a monthly scrub that can’t seem to update its last scheduled run, maybe the pool_resize message is related to that?), and I get the confirmation pop-up as well, but nothing is deleted. So this seems to be affecting other areas of the UI as well …

Thanks

Flox · October 9, 2020, 3:27pm

I tend to lean that way as well as the few things you described seem to indicate that “somehow” something is hanging and blocking Rockstor’s database to be updated. Your attempt at deleting a scheduled task is a perfect example, and the error in the log related to the NetworkConnection is also related as this is a normal log message when the system seems a new network connection for the first time: it sees a new network connection, logs that it doesn’t already exist in the database, and then proceed to update the database with it. The fact that you keep seeing that error would tell me that the last part (updating the database) does not happen.

Now, thanks to your nice tracking down work, it seems we need to solve your stuck start_resize_pool task… I would wait for @phillxnet’s expertise in that, though, as he’s the expert on disks/pools related work.

Hooverdan · October 9, 2020, 3:40pm

Thanks - I slogged through the ztask.log file and found that this resizing task has been failing since January which is right after I finished replacing all of the disks. I guess, the last device delete didn’t complete correctly and has been trying periodically to execute it again …

phillxnet · October 9, 2020, 3:42pm

@Hooverdan, @Flox, Re:

Unfortunately I can’t look into this right now but I don’t think this is what’s holding things up. As @Flox point out, maybe it’s a stuck db. Could it be an update at the same time as the nginx has frozen the db.
@Flox are you under the impression this old task could be blocking current db writes?

I’ll try and look to the ztaskd thing soon thought.

And as I’m currently preparing 4.0.3-0 for release I’ve also fully updated the Leap 15.2 test machines the pre-release rpms are tested on (x86_64 and aarch64) and a reboot request via the Web-UI is working in that setting. Different OS I know but they often share recent security updates and the like.

OK, @Hooverdan I see you’ve just tracked down that task. So that good. And we can for the time being ignore that anyway hopefully.

Flox · October 9, 2020, 3:47pm

Good question… You may be right and maybe that’s just a red herring… @Hooverdan, since this task has been stuck since January, apparently, have you noticed the current “weird/odd” webUI behavior between January and now, or is it new? Were you able to do any kind of rock-ons operation between January and your most recent yum updates?
You first message seems to indicate it is new, so this task might not be the cause of your problem, indeed.

Hooverdan · October 9, 2020, 3:54pm

@Flox, yes, this is new behavior (that’s why I attributed it to the nginx updates). I have installed and been using various RockOns (e.g. Minecraft and my own version of Logitech Media Server among others), created new shares because of those and that has worked all fine. The last major activity (aside from periodic visits to the dashboard) was end of September (restarting Plex RockOn to automatically update the server version).

Hooverdan · October 12, 2020, 6:34pm

@Flox
not sure, whether this helps or distracts, but when tracing the call with httpWatch, it seems to return a “forbidden” status 403. In the screenshot below I have taken two actions after the start page loaded (i.e. the Dashboard):

First, I selected the RockOn page which works fine, and displays.
Then I switched one of the currently running RockOns to “off”. The screen for a fraction of a second shows the blue refresh icon and then shows the RockOn status as “off” (a refresh would show it back in “on” status).

That’s where I see that a “forbidden” (403) response comes back.

Flox · October 12, 2020, 6:50pm

Interesting… thanks a lot for the further details.

I’m unsure why you get a 403 on that call. I suppose you see something similar if you try to delete a scheduled task like you mentioned in one of your earlier posts?

Given your initial thoughts on Nginx due to the yum updates, that could be related but I’m not that familiar with this area so I’m likely to be wrong. Have you tried reinstalling nginx via yum, by any chance? I believe it is yum reinstall nginx (only from memory).

Other than that, there may be something wrong (somehow) with your nginx.conf file. Have you ever modified it by any chance?

Alternatively, you could look further in the logs (if you haven’t already done so) by watching the logs (tail -f /opt/rockstor/var/log/*) and pasting back what happens when you trigger one of the actions on the webUI that fail (turning OFF the rock-on or delete a scheduled task). This is probably overkill but maybe there’s something useful there that we missed.

Hooverdan · October 12, 2020, 7:11pm

Yes, I have also traced an activity like task deletion, same post message:

I have never manually touched the nginx.conf file (I remember seeing some threads about it being empty, but mine was populated), meaning unless an update changed it, I have not done anything to it.

When I run your suggested tail command and then perform some of the above actions (turn off RockOn, delete a scheduled task), I don’t see any changes being added to the terminal output, so doesn’t seem to even get to the core …

I can try the reinstallation of nginx and see what happens, or was your question targeting whether a possible reinstallation caused the problem to appear?

Flox · October 12, 2020, 7:16pm

OK, so we can rule out any specificity to one database over another.

Good, we can rule that out as well.

Oh well, it was worth a try. We could try after turning debug mode on, but I would be curious about the reinstall nginx route first.

I was wondering whether something somehow happened during the recent nginx updates you had that could have possibly changed something relevant in it. I know samba has proven to be quite picky after some of its updates in the past, creating all sorts of issues; I don’t recall reading similar update problem with nginx, but it could be worth a try, especially given it seems to be the only package that was updated in the last round of yum updates.
I thus meant trying yum reinstall nginx and see how it goes.

Hooverdan · October 12, 2020, 7:29pm

Well, I tried the reinstall … performed admirably, however no changes to behavior :).

As I have a VM with OpenSuse and Rockstor, I went and took another look at the /etc/nginx/nginx.conf files. They look rather different, however I don’t know whether that has to do with the versions of nginx employed on Tumbleweed or not (I did this a while ago, when it was not clear whether you will maintain leapfrog AND tumbleweed compatibility, but it still has been working despite the millions of updates it had to absorb since then).
On the other hand I don’t know what the nginx.conf “should” contain on a vanilla Rockstor instance (3.9x version), meaning no manual interference…

I’m just posting my current one … and maybe someone can tell whether this is already screwed up.

[root@rockstorw ~]# cat /etc/nginx/nginx.conf
# For more information on configuration, see:
#   * Official English Documentation: http://nginx.org/en/docs/
#   * Official Russian Documentation: http://nginx.org/ru/docs/

user nginx;
worker_processes auto;
error_log /var/log/nginx/error.log;
pid /run/nginx.pid;

# Load dynamic modules. See /usr/share/doc/nginx/README.dynamic.
include /usr/share/nginx/modules/*.conf;

events {
worker_connections 1024;
}

http {
log_format  main  '$remote_addr - $remote_user [$time_local] "$request" '
                  '$status $body_bytes_sent "$http_referer" '
                  '"$http_user_agent" "$http_x_forwarded_for"';

access_log  /var/log/nginx/access.log  main;

sendfile            on;
tcp_nopush          on;
tcp_nodelay         on;
keepalive_timeout   65;
types_hash_max_size 2048;

include             /etc/nginx/mime.types;
default_type        application/octet-stream;

# Load modular configuration files from the /etc/nginx/conf.d directory.
# See http://nginx.org/en/docs/ngx_core_module.html#include
# for more information.
include /etc/nginx/conf.d/*.conf;

server {
    listen       80 default_server;
    listen       [::]:80 default_server;
    server_name  _;
    root         /usr/share/nginx/html;

    # Load configuration files for the default server block.
    include /etc/nginx/default.d/*.conf;

    location / {
    }

    error_page 404 /404.html;
    location = /404.html {
    }

    error_page 500 502 503 504 /50x.html;
    location = /50x.html {
    }
}

# Settings for a TLS enabled server.
#
#    server {
#        listen       443 ssl http2 default_server;
#        listen       [::]:443 ssl http2 default_server;
#        server_name  _;
#        root         /usr/share/nginx/html;
#
#        ssl_certificate "/etc/pki/nginx/server.crt";
#        ssl_certificate_key "/etc/pki/nginx/private/server.key";
#        ssl_session_cache shared:SSL:1m;
#        ssl_session_timeout  10m;
#        ssl_ciphers HIGH:!aNULL:!MD5;
#        ssl_prefer_server_ciphers on;
#
#        # Load configuration files for the default server block.
#        include /etc/nginx/default.d/*.conf;
#
#        location / {
#        }
#
#        error_page 404 /404.html;
#        location = /404.html {
#        }
#
#        error_page 500 502 503 504 /50x.html;
#        location = /50x.html {
#        }
#    }

}

Cheers.

Hooverdan · October 12, 2020, 7:47pm

Well, checking whether Rockstor’s nginx is actually using the standard config file (which I should have done earlier, sorry about that), it looks like it’s actually using this config file. I ran
ps x | grep nginx

which in return gives me this:

 9253 ?        S      0:00 nginx: master process /usr/sbin/nginx -c /opt/rockstor/etc/nginx/nginx.conf
 24709 pts/0    S+     0:00 grep --color=auto nginx

And here’s the Rockstor specific file:

daemon off;
worker_processes  2;

events {
		worker_connections  1024;
		use epoll;
}

http {
		include         /opt/rockstor/etc/nginx/mime.types;
		default_type    application/octet-stream;

		log_format main
				'$remote_addr - $remote_user [$time_local] '
				'"$request" $status $bytes_sent '
				'"$http_referer" "$http_user_agent" '
				'"$gzip_ratio"';

		client_header_timeout   10m;
		client_body_timeout             10m;
		send_timeout                    10m;

		connection_pool_size            256;
		client_header_buffer_size       1k;
		large_client_header_buffers     4 8k;
		request_pool_size                       4k;

		gzip on;
		gzip_min_length 1100;
		gzip_buffers    4 8k;
		gzip_types      text/plain;

		output_buffers  1 32k;
		postpone_output 1460;

		sendfile        on;
		tcp_nopush      on;
		tcp_nodelay     on;

		keepalive_timeout       75 20;
		ignore_invalid_headers  on;
		index index.html;

		server {
				listen 443 default_server;
				server_name "~^(?<myhost>.+)$";
				ssl on;
				ssl_protocols TLSv1.2 TLSv1.1 TLSv1;
				ssl_certificate /opt/rockstor/certs/rockstor.cert;
				ssl_certificate_key /opt/rockstor/certs/rockstor.key;

				location /site_media  {
						root /media/; # Notice this is the /media folder that we create above
				}
				location ~* ^.+\.(zip|rar|bz2|doc|xls|exe|pdf|ppt|txt|tar|mid|midi|wav|bmp|rtf|mov) {
						access_log   off;
						expires      30d;
				}
				location /static  {
						root /opt/rockstor/;
				}
				location /logs {
						root /opt/rockstor/src/rockstor/;
				}
				location / {
						proxy_pass_header Server;
						proxy_set_header Host $http_host;
						proxy_set_header X-Forwarded-Proto https;
						proxy_redirect off;
						proxy_set_header X-Real-IP $remote_addr;
						proxy_set_header X-Scheme $scheme;
						proxy_connect_timeout 75;
						proxy_read_timeout 120;
						proxy_pass http://127.0.0.1:8000/;
				}
				location /socket.io {
						proxy_pass http://127.0.0.1:8001;
						proxy_redirect off;
						proxy_http_version 1.1;
						proxy_set_header Upgrade $http_upgrade;
						proxy_set_header Connection "upgrade";
				}
				location /shell/ {
						valid_referers server_names;
						if ($invalid_referer) { return 404; }
						proxy_pass http://127.0.0.1:4200;
						proxy_redirect off;
						proxy_http_version 1.1;
						proxy_set_header Upgrade $http_upgrade;
						proxy_set_header Connection "upgrade";
				}
		}
}

It shows as modified yesterday, but the only real activity I performed yesterday was to reboot the server, so I assume the date/time stamp is touched upon reboot --> which I just now confirmed by rebooting the server, and, no, that didn’t change any behavior either.

Flox · October 12, 2020, 9:11pm

Seems like you just ruled out any problem with nginx itself…

I wonder what could cause you to receive a 403 on all DELETE and POST requests (I suspect you would also see the same problem if you were to edit a scheduled task or something along those lines). My knowledge in this area is very limited however, and all I can think of is the CSRF token. If I’m correct, django will respond with a 403 if there’s a problem/mismatch with your CSRF token. This leads me to a probably stupid question: have you tried logging out of your rockstor webUI session and log back in? How about using a different browser that hasn’t logged into the Rockstor webUI? I wonder if it could be due to some issue with your browser cache (it seems to me we need to rule out any problem on the client side this way as well).

Hooverdan · October 12, 2020, 9:30pm

Holy Toledo, Batman

I used a different system on which I know I haven’t logged in from in a long time. and, your hunch is probably correct! I was able to stop a RockOn without a problem, and running this
tail -f /var/log/nginx/access.log
against the access.log entries (essentially the same output as I’ve been tracing with httpWatch) shows no 403 errors in the background. I can’t believe it would have been that simple - well, now I have to figure out why my current system is behaving badly.

I did the logout thing, and attempted to log back in (without closing the tab or clearing the cache), and now I am getting an explicit 403 message:

I will try the other scenarios and see what happens.

Hooverdan · October 12, 2020, 10:01pm

Well, I don’t know what’s wrong with my Chrome or Edge browser on my main system. Neither of the 2 will allow me to log in (same error messages). I cleaned out the browser caches/histories, what have you … no dice. Only way making it work on the current machine was to use a portable version of Firefox that I have handy, so it’s strange. My other machine is running on the same update schedule, etc. so it’s not really configured all that differently…

But since this is not a Rockstor problem per se, I will not add much more to this conversation.

If I find out how to resolve my weird browser behavior without reinstalling them, I will put it here for posterity.

But, in any case, as usual big thanks to @Flox for pointing at the right issue and @phillxnet for helping brainstorm… learned something new again (not the way I wanted to, but …).

Hooverdan · October 12, 2020, 10:22pm

Just to close out. I found that one of my browser extensions (on both browsers) was messing with my privacy settings --> towards extreme clamp down. It was a VPN extension, that tightened all security settings so that when tunneling nothing would leak out. Rockstor’s admin page (as well some of my other appliance pages that I just now tested) fell victim to this behavior. While I like the privacy measures, I’d rather have them turned on during VPN only and relaxed when not. So, in case something similar happens to somebody else, I hope this will help.
I updated the title and amended a bit to point to the greater symptom.

I will open a separate topic for the ztask the keeps hanging around in my logs.

Thanks again!

Flox · October 13, 2020, 12:54am

Nice one tracking that down! I’m sorry I led you on the wrong path at the beginning and made you waste time on it.

Thanks a lot for your diligence and reporting your finding and solution back, that’s always very helpful!

I have indeed seen a few references to harsh browser privacy/security settings creating some issues with CRSF tokens, so I’m glad it was “just” that, indeed.

You’re right that it’s not a Rockstor problem per se, but that highlights improvements we can make on surfacing this kind of things to the user so that it’s more evident.

Glad again you’re all set!