Unknown internal error or Gateway timeout

Brief description of the problem

WebUI is super slow or unresponsive.

Detailed step by step instructions to reproduce the problem

Don’t know how to reproduce, but loading the WebUI is super slow. Some pages show up eventually, others either fail completely, or fail with “Houston, we’ve had a problem” and an empty stacktrace.

Problem seems to be worst when listing Rock-ons, but happens all over.

This is a new installation, but WebUI was working fine after installation yesterday. Today, totally unresponsive and slow. SSH to the machine works, showing reasonable load, nothing sticks out as problematic.

One strange thing, but it was there yesterday too. dmesg contains a stream of errors (repeating every 10 seconds) about I/O error reading fd0, which isn’t surprising, since there isn’t a floppy drive at all. Don’t understand why it is trying to read fd0 though …

[44344.291398] blk_update_request: I/O error, dev fd0, sector 0 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
[44344.291480] floppy: error 10 while reading block 0

Web-UI screenshot

Error Traceback provided on the Web-UI

@mortenlj Welcome to the Rockstor community forum.

Yes, this is strange, and I suspect it is down to some incompatibility between our base OS and your hardware. As you say, there is no floppy disk, and we also have nothing to do with floppy disks!

It may have worked initially because you could still have been on the installer kernel. Could you describe the steps that led to the current state? I.e. did you install using one of our installers, and if so, which one? And did you then apply any updates? Also, what is the system drive used on this install? Sometimes a pending drive failure can cause a major slow-down. We fail all over the place with time-outs such as you indicate if the system drive is just too slow, which is why we recommend against such things as a system drive on USB 2.0: it’s just too slow. We are speeding up and improving on that front, however, such as in our current testing channel.

So let us know if you applied any updates, along with some info about the hardware, its age, etc., as this looks like a major system drive slow-down for some reason. A single slow drive can drag the entire system down, as it can block shared resources.

Hope that helps.

2 Likes

I did a few reboots yesterday (relocating from “under the desk” to the “server room in the basement”), and it was working fine, so I should have been on the installed kernel and not the installer kernel.

I installed using Rockstor-Leap15.4-generic.x86_64-4.5.8-0.install.iso.
After installation I activated the Testing channel, and updated to ROCKSTOR 5.0.8-0.
I also applied all package updates.

Installed onto a Western Digital WDC WD2500JS-00M (232 GB). Created a couple of shares on that pool for rock-ons, mariadb and transmission (mariadb and transmission are installed, but not actually in use).

Created a data pool on two TOSHIBA HDWD130 (2.7 TB) disks using raid1, and created several shares in that pool. Started moving data to those shares yesterday, but things are slow regardless of the data transfer.

The machine is quite old, but should still pack a punch:

  • AMD Phenom™ II X4 955 Processor (4 cores)
  • 8 GB RAM

Digging more, it seems to be the WebUI process that is slow. Nginx seems to be running and returns 504 Gateway timeout. SSH to the machine works and is responsive. Transferring data using Samba and “rsync-over-ssh” works and is responsive, with reasonable speeds.

3 Likes

Do you have any nvidia drivers installed, by any chance? I found some posts where this “phantom” floppy error was caused by nvidia drivers (though it’s not clear why they would cause it).

Not unless the Rockstor installer installs it. There is an ATI Radeon HD 5770 card in the machine.

Rockstor wouldn’t… ok.

Is the floppy controller already turned off in the BIOS? Not that it would explain why you get the errors, because if none is connected (and never has been), I would not expect the system to go looking for one …

1 Like

I have been poking around more, instead of doing my paid work (:scream:), because I thought it was weird that only the WebUI seems to be slow.

Found some logs that I think are strange, but I don’t know what could cause them, or if they are related to the problems I’m seeing.

From /opt/rockstor/var/log/supervisord_data-collector_stderr.log:

Traceback (most recent call last):
  File "/opt/rockstor/.venv/lib/python3.11/site-packages/urllib3/connectionpool.py", line 536, in _make_request
    response = conn.getresponse()
               ^^^^^^^^^^^^^^^^^^
  File "/opt/rockstor/.venv/lib/python3.11/site-packages/urllib3/connection.py", line 461, in getresponse
    httplib_response = super().getresponse()
                       ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib64/python3.11/http/client.py", line 1378, in getresponse
    response.begin()
  File "/usr/lib64/python3.11/http/client.py", line 318, in begin
    version, status, reason = self._read_status()
                              ^^^^^^^^^^^^^^^^^^^
  File "/usr/lib64/python3.11/http/client.py", line 279, in _read_status
    line = str(self.fp.readline(_MAXLINE + 1), "iso-8859-1")
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib64/python3.11/socket.py", line 706, in readinto
    return self._sock.recv_into(b)
           ^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/rockstor/.venv/lib64/python3.11/site-packages/gevent/_socketcommon.py", line 696, in recv_into
    self._wait(self._read_event)
  File "src/gevent/_hub_primitives.py", line 317, in gevent._gevent_c_hub_primitives.wait_on_socket
  File "src/gevent/_hub_primitives.py", line 322, in gevent._gevent_c_hub_primitives.wait_on_socket
  File "src/gevent/_hub_primitives.py", line 313, in gevent._gevent_c_hub_primitives._primitive_wait
  File "src/gevent/_hub_primitives.py", line 314, in gevent._gevent_c_hub_primitives._primitive_wait
  File "src/gevent/_hub_primitives.py", line 46, in gevent._gevent_c_hub_primitives.WaitOperationsGreenlet.wait
  File "src/gevent/_hub_primitives.py", line 46, in gevent._gevent_c_hub_primitives.WaitOperationsGreenlet.wait
  File "src/gevent/_hub_primitives.py", line 55, in gevent._gevent_c_hub_primitives.WaitOperationsGreenlet.wait
  File "src/gevent/_waiter.py", line 154, in gevent._gevent_c_waiter.Waiter.get
  File "src/gevent/_greenlet_primitives.py", line 61, in gevent._gevent_c_greenlet_primitives.SwitchOutGreenletWithLoop.switch
  File "src/gevent/_greenlet_primitives.py", line 61, in gevent._gevent_c_greenlet_primitives.SwitchOutGreenletWithLoop.switch
  File "src/gevent/_greenlet_primitives.py", line 65, in gevent._gevent_c_greenlet_primitives.SwitchOutGreenletWithLoop.switch
  File "src/gevent/_gevent_c_greenlet_primitives.pxd", line 35, in gevent._gevent_c_greenlet_primitives._greenlet_switch
TimeoutError: timed out

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/opt/rockstor/.venv/lib/python3.11/site-packages/requests/adapters.py", line 486, in send
    resp = conn.urlopen(
           ^^^^^^^^^^^^^
  File "/opt/rockstor/.venv/lib/python3.11/site-packages/urllib3/connectionpool.py", line 844, in urlopen
    retries = retries.increment(
              ^^^^^^^^^^^^^^^^^^
  File "/opt/rockstor/.venv/lib/python3.11/site-packages/urllib3/util/retry.py", line 470, in increment
    raise reraise(type(error), error, _stacktrace)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/rockstor/.venv/lib/python3.11/site-packages/urllib3/util/util.py", line 39, in reraise
    raise value
  File "/opt/rockstor/.venv/lib/python3.11/site-packages/urllib3/connectionpool.py", line 790, in urlopen
    response = self._make_request(
               ^^^^^^^^^^^^^^^^^^^
  File "/opt/rockstor/.venv/lib/python3.11/site-packages/urllib3/connectionpool.py", line 538, in _make_request
    self._raise_timeout(err=e, url=url, timeout_value=read_timeout)
  File "/opt/rockstor/.venv/lib/python3.11/site-packages/urllib3/connectionpool.py", line 370, in _raise_timeout
    raise ReadTimeoutError(
urllib3.exceptions.ReadTimeoutError: HTTPConnectionPool(host='127.0.0.1', port=8000): Read timed out. (read timeout=2)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/rockstor/src/rockstor/cli/api_wrapper.py", line 66, in set_token
    response = requests.post(
               ^^^^^^^^^^^^^^
  File "/opt/rockstor/.venv/lib/python3.11/site-packages/requests/api.py", line 115, in post
    return request("post", url, data=data, json=json, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/rockstor/.venv/lib/python3.11/site-packages/requests/api.py", line 59, in request
    return session.request(method=method, url=url, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/rockstor/.venv/lib/python3.11/site-packages/requests/sessions.py", line 589, in request
    resp = self.send(prep, **send_kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/rockstor/.venv/lib/python3.11/site-packages/requests/sessions.py", line 703, in send
    r = adapter.send(request, **kwargs)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/rockstor/.venv/lib/python3.11/site-packages/requests/adapters.py", line 532, in send
    raise ReadTimeout(e, request=request)
requests.exceptions.ReadTimeout: HTTPConnectionPool(host='127.0.0.1', port=8000): Read timed out. (read timeout=2)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "src/gevent/greenlet.py", line 908, in gevent._gevent_cgreenlet.Greenlet.run
  File "/opt/rockstor/src/rockstor/smart_manager/data_collector.py", line 1048, in prune_logs
    self.aw.api_call(
  File "/opt/rockstor/src/rockstor/cli/api_wrapper.py", line 85, in api_call
    self.set_token()
  File "/opt/rockstor/src/rockstor/cli/api_wrapper.py", line 81, in set_token
    raise Exception(msg)
Exception: Exception while setting access_token for url(http://127.0.0.1:8000): HTTPConnectionPool(host='127.0.0.1', port=8000): Read timed out. (read timeout=2). content: None
2024-04-02T13:30:34Z <Greenlet at 0x7f1e4a4e76a0: <bound method SysinfoNamespace.prune_logs of <smart_manager.data_collector.SysinfoNamespace object at 0x7f1e5069f5d0>>> failed with Exception
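
If I’m reading the exception chain right, the data collector needs an access token from the local backend before it can make API calls, and it gives up after a 2-second read timeout. Boiled down, the failing call looks something like this (a sketch based on the traceback; the token endpoint path and payload are my assumptions, not the actual Rockstor code):

import requests

BASE_URL = "http://127.0.0.1:8000"

def set_token():
    # The data collector authenticates against the local gunicorn backend.
    # With the backend taking 40-60s to answer, a 2-second read timeout
    # will trip every time. (Endpoint path and payload are assumptions.)
    try:
        response = requests.post(
            f"{BASE_URL}/o/token/",  # hypothetical token endpoint
            data={"grant_type": "client_credentials"},
            timeout=2,  # the read timeout visible in the traceback
        )
        return response.json()["access_token"]
    except requests.exceptions.RequestException as e:
        raise Exception(f"Exception while setting access_token for url({BASE_URL}): {e}")

So the data collector failures look like a symptom of the same backend slowness, not a separate problem.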
1 Like

Again, not sure that it will have an impact on the symptoms you’re describing, but the recommendation is to keep the Rock-on related shares off the OS drive (maybe the Rock-on root, but not the config/data shares), to keep the OS separate from the data. This also makes a reinstall (when/if necessary) less painful, as the config/data settings remain intact and you only need to use the configuration restore steps (plus a few pre-reqs as described in the documentation) to get back to your original state …

1 Like

I managed to uninstall the rock-ons I had installed, but it made no difference to the responsiveness of the system.

Further investigations:

Loading the dashboard triggers many requests to various /api/ URLs. Most of the time, roughly half of these are quick, and the other half are slow and usually result in a 504. Looking at the access logs, those slow ones usually finish after 40-60 seconds, which is long after the nginx timeout.
Most of the time, it is the calls to /api/pools, /api/shares and /api/disks that time out, so it seems the problems are related to storage in some way.
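
To put numbers on it, I timed the suspicious endpoints from the machine itself with a small script (a sketch; unauthenticated requests will come back as 401/403 rather than real data, so for meaningful timings you would need to send a valid session cookie or token, but even the bare round-trip shows whether requests reach the backend promptly):

import time
import requests

BASE = "https://localhost"  # hitting nginx locally; adjust to your Rockstor IP

# Time the storage-related endpoints individually.
for endpoint in ("/api/pools", "/api/shares", "/api/disks"):
    start = time.monotonic()
    try:
        r = requests.get(BASE + endpoint, verify=False, timeout=120)
        print(endpoint, r.status_code, f"{time.monotonic() - start:.1f}s")
    except requests.exceptions.RequestException as e:
        print(endpoint, "failed after", f"{time.monotonic() - start:.1f}s:", e)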

The behavior where some requests are quick and others are super slow is sometimes a symptom of too few worker threads, so I attempted to increase the number of gunicorn workers and threads, but it made no difference.
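
For reference, what I tried was along these lines (a sketch only: Rockstor starts gunicorn via supervisord, so the real knobs live in its config; these are standard gunicorn settings with values I picked for the experiment):

# Hypothetical gunicorn.conf.py tweak. The settings are standard gunicorn
# options; the values are just what I experimented with.
workers = 4              # separate worker processes
threads = 4              # threads per worker; needs the "gthread" class
worker_class = "gthread"
timeout = 120            # seconds of silence before a worker is recycled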

Once the rock-ons were uninstalled, the rock-ons page is also responsive, which it wasn’t when there were rock-ons to show. I guess that fits with the problem being related to storage somehow: no rock-ons means no need to query storage information.

2 Likes

@mortenlj Just a quick thought on this, along with your excellent feedback.

There were a ton of changes between 4.5.8-0 and the now most current testing channel release 5.0.8-0, including some socket stuff that feeds the front-end. Try opening your browser’s developer tools on the ‘Network’ tab and ensuring that the “Disable cache” option is ticked. That will ensure all local browser cache for the Rockstor Web-UI is refreshed: but only once you then refresh the page with the developer tools open and thus configured.

You should then get more feedback on what is failing from the client (browser) perspective. It may be that you simply have older libs resident in the browser cache that are no longer compatible with Rockstor’s far newer back-end.

Just a thought and worth a try. Plus there may be more info to be had to track stuff down if this doesn’t actually fix the slow-down for you.

Hope that helps, at least by way of some context.

1 Like

I’ve been using developer tools so far, but did another run with explicit emptying/disabling of the cache.
The behavior is like before, with requests to the various storage related APIs being slow, resulting in a 504 Gateway timeout from nginx.

It seems clear that the problem lies in the backend when handling various storage-related API requests, but I have no idea where to go from here. I tried connecting to postgres to gather some stats and see if the problem lies in the database, but as far as I can tell, there are no slow queries of any kind. It looks like it’s some sort of problem in the actual Python code in the backend, but not knowing the source code, I have no idea where to start looking to debug it.
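
For anyone wanting to repeat the database check, it amounted to something like this (a sketch; the database name and user are guesses on my part, and pg_stat_activity only shows statements running at that moment):

import psycopg2

# Look for long-running queries. (Database name and user are assumptions;
# adjust to whatever Rockstor actually uses on your install.)
conn = psycopg2.connect(dbname="storageadmin", user="rocky", host="localhost")
with conn.cursor() as cur:
    cur.execute(
        """
        SELECT now() - query_start AS runtime, state, query
        FROM pg_stat_activity
        WHERE state <> 'idle'
        ORDER BY runtime DESC
        """
    )
    for runtime, state, query in cur.fetchall():
        print(runtime, state, query[:80])

That never showed anything close to the 40-60 seconds the API calls take, which is why I don’t think the database is the bottleneck.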

I guess I could attempt a reinstall without applying the upgrade (or is there a way to downgrade?), to see if things work better then.

1 Like

There’s no simple way to downgrade really; a reinstall will be far faster. If you go for that, try to get a configuration backup and save it off the Rockstor appliance, so that your settings can be restored afterwards (in case you had samba going etc.). Take a quick look at the documentation (ignore some of the aged screenshots early on; it needs to be updated):

https://rockstor.com/docs/howtos/reinstall.html

and specifically the sequence on how to restore a configuration file:
https://rockstor.com/docs/interface/system/config_backup.html#config-restore

3 Likes

It has been quite some time since I’ve tried it, but maybe you could further pinpoint the problematic API call using the Django REST framework browsable API. From memory, you can visit it using your Rockstor IP and the API endpoint. For instance, browse to https://rockstor-ip/api/disks (I forget whether our endpoints need a trailing slash; sorry, I can’t try it myself at the moment). It should try a GET by default, but you can manually adapt anything you need/desire there. Be careful when triggering POST/PUT/DELETE, of course.

It will only reproduce what you have already experienced, but maybe it’ll help pinpoint one endpoint over another or an oddity in their response.

Forgive me if you already detailed that, but can you see any error or warning in the Disks page or SMART reports of your disks?

1 Like

Tried all the mentioned endpoints, but there is no noticeable difference in their behavior; they are all super slow. I even tried hitting the gunicorn worker directly, skipping nginx, and as expected, no difference there either. The problem is in the Django backend, related to storage in some way. I’ve installed Rockstor in a VM to compare, and I’ve noticed a few discrepancies, so I think something went wrong during install. I’m going to attempt a reinstall when I have time, hopefully this evening.
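
Bypassing nginx was just a matter of requesting the gunicorn port directly, from the machine itself (port 8000 as seen in the tracebacks above):

import time
import requests

# Hit the gunicorn backend directly on its local port, skipping nginx,
# to rule the proxy out of the 504s.
start = time.monotonic()
r = requests.get("http://127.0.0.1:8000/api/disks", timeout=120)
print(r.status_code, f"{time.monotonic() - start:.1f}s")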

I’ve not been able to access the Disks page today, but from what I can remember, there haven’t been any SMART errors or warnings when I’ve viewed the page before.

2 Likes

While I had the machine connected to keyboard/monitor for reinstall, I checked the BIOS and was able to disable the floppy controller, which has eliminated these messages. :+1:

Completed a reinstall, and everything seems to be working. I have not done any kind of updates yet; I figured I would let the system sit for a day or two before attempting them, since last time the slowness only appeared on day two.

Thanks for all the suggestions so far. It will be interesting to see if the problems reappear, or if something went wrong during the original install that caused them.

3 Likes

Been running a few days after reinstall without any issues. I have not activated an update channel yet, but I did apply OS package updates yesterday. Still seems to be working fine.

I think I will wait until the testing channel is a little further along before attempting an update, although I do think there was something that went wrong during the first install that caused the problems I was seeing.

3 Likes