Web UI errors (too many open files) while running a balance

Hey folks, I added a couple of drives and started a balance last Friday. I noticed that it was proceeding slowly, so I shut down the Rock-ons service and just let it run.

Last night (Sunday), I noticed that I was receiving lots of errors just trying to navigate the Web UI and that one CPU core was pinned at 100%. When I was eventually able to load the status page, it showed the balance only 9% complete!

So, I poked around on here and found the "Balance is really disruptive" thread; running the command quoted there to disable quotas eased the CPU load and greatly sped up the balance. This morning it had gone from 9% to over 50% complete.
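
(For anyone who finds this later: I believe the command in question was just the standard btrfs one for disabling quotas. The pool mount point below is from my setup; Rockstor mounts pools under /mnt2/, so substitute your own pool name.)

    # Disable btrfs quotas on the pool to speed up the balance
    # (/mnt2/main_pool is my pool's mount point -- yours will differ)
    btrfs quota disable /mnt2/main_pool

    # Re-enable them once the balance completes:
    btrfs quota enable /mnt2/main_pool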

However, I'm still experiencing a lot of issues just trying to navigate the Web UI. Different commands and URLs lead to the issues, but they basically all boil down to either "[Errno 24] Too many open files" or nondescript "unknown internal errors." (Presumably those are also due to too-many-open-files errors.)

Here are a couple of examples:

Houston, we've had a problem.
Exception while running command(['/usr/bin/hostnamectl', '--static']): [Errno 24] Too many open files

Traceback (most recent call last):
  File "/opt/rockstor/src/rockstor/rest_framework_custom/generic_view.py", line 41, in _handle_exception
    yield
  File "/opt/rockstor/src/rockstor/storageadmin/views/appliances.py", line 41, in get_queryset
    self._update_hostname()
  File "/opt/rockstor/eggs/Django-1.8.16-py2.7.egg/django/utils/decorators.py", line 145, in inner
    return func(*args, **kwargs)
  File "/opt/rockstor/src/rockstor/storageadmin/views/appliances.py", line 48, in _update_hostname
    cur_hostname = gethostname()
  File "/opt/rockstor/src/rockstor/system/osi.py", line 760, in gethostname
    o, e, rc = run_command([HOSTNAMECTL, '--static'])
  File "/opt/rockstor/src/rockstor/system/osi.py", line 107, in run_command
    raise Exception(msg)
Exception: Exception while running command(['/usr/bin/hostnamectl', '--static']): [Errno 24] Too many open files

or

Houston, we've had a problem.
Unknown internal error doing a GET to /api/shares?page=1&format=json&page_size=5000&count=&sortby=name

Lots of different commands and URLs get one of those errors, and it's pretty consistent: roughly 4 out of 5 refreshes produce an error rather than the page I actually want.
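
In case it helps with diagnosis, this is roughly how I've been checking file descriptor usage while the errors are happening. I'm assuming here that gunicorn is the process serving the Web UI; adjust the pgrep pattern if yours differs:

    # Find the web server process (gunicorn on my install)
    PID=$(pgrep -f gunicorn | head -n1)

    # Count its currently open file descriptors...
    ls /proc/$PID/fd | wc -l

    # ...and compare against its per-process limit
    grep 'open files' /proc/$PID/limits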

Is this a common / known issue, or is it something specific to my server?

@nfriedly Hello again:

Yes, quite likely, I would say.

It is a known issue, but not one that affects all configurations, and the main cause is as yet unknown. Any progress you can make on the cause would definitely help. We have another forum thread open with others' contributions in this area, but as yet no firm progress. I haven't seen it on any of my own Rockstor instances so far.

Take a look at the following forum report by @peter:

and I'll link the currently open issue referenced in that thread here as well, for ease:

https://github.com/rockstor/rockstor-core/issues/1656

I've also just linked back to this thread and the above-mentioned forum thread in that issue, for context.

Hope that helps and please feel free to contribute any info / pointers on that issue.

Hi

I can confirm that both scheduled and interactive scrub jobs now complete successfully on 3.9.2-1.

The Web UI "too many open files" problem is still there. On a freshly rebooted system I can use the Web UI, but after several hours of uptime the problem is back.
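
Since it only appears after several hours of uptime, a rough way to watch for a slow descriptor leak is to log the count periodically (again assuming the web process is gunicorn; adjust the pattern for your setup):

    # Append a timestamped descriptor count every 10 minutes
    while true; do
        PID=$(pgrep -f gunicorn | head -n1)
        echo "$(date '+%F %T') $(ls /proc/$PID/fd | wc -l)" >> /root/fd-count.log
        sleep 600
    done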

Kind of irritating that I can't figure this out for myself. For what it's worth, I just discovered that I can reliably trigger this error in the Web UI by opening Plex (via its Rock-on) and then triggering the "Analyze" function on my Movies, TV, Music, and Photos libraries from the Plex web UI.

So perhaps some Plex maintenance routines might be the culprit?
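
If anyone wants to check what is actually being held open when this triggers, something like the following should show whether it's sockets, pipes, or regular files piling up (substitute the PID of whichever process is raising the error; I'm not sure yet that it's the same one in every case):

    # Summarise the process's open descriptors by type
    # (column 5 of lsof output is TYPE, e.g. REG, sock, FIFO)
    lsof -p <PID> | awk '{print $5}' | sort | uniq -c | sort -rn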

I just got this error on a new Rockstor VM with the following sequence of actions:

  1. LUKS format two disks, and use them as a RAID1 pool
  2. After a few hours, LUKS format a third disk, add it to the pool without changing RAID level.
  3. Let it sit for a night or two.

The VM has 4GB of RAM, a 100GB base disk, 3x32GB pool disks, 2 CPUs, and the only Rock-on I have enabled is the http to https redirect.

Do we just need to up the ulimits? I'm handy with the CLI and can give that a shot if necessary.
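
If it comes to that, my plan was a systemd drop-in along these lines. I'm assuming the unit is rockstor.service, and 65536 is just a guess at a generous ceiling:

    # Check the current limit on the service
    systemctl show rockstor -p LimitNOFILE

    # Raise it via a drop-in override: this opens an editor; add the two
    # lines shown below, then restart the service.
    #   [Service]
    #   LimitNOFILE=65536
    systemctl edit rockstor
    systemctl daemon-reload
    systemctl restart rockstor

Though I realize that only raises the ceiling rather than fixing whatever is leaking descriptors in the first place.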

@TexasDex Welcome to the Rockstor community and thanks for the additional report.

Yes, we still see this occasionally. You might like to take a look at the discussion in the earlier-cited open issue for this:

https://github.com/rockstor/rockstor-core/issues/1656

From that I suspected (in that thread) that it was AFP-related. Do you also have AFP enabled?
Or did you have the Web-UI open over a long period?

I don't think the LUKS element is relevant here, though.

You might also like to take a look at the discussion / proposed fix by forum member @erisler in the following pull request:

https://github.com/rockstor/rockstor-core/pull/1934

which in turn links back to the following forum thread:

That thread also ties in the AFP element.

I'm afraid my knowledge in these web tech areas is far surpassed by @suman and @Flyer, and I think we are awaiting @Flyer's final comment on the above pull request. Please do take a look at these links to see where we are up to so far; it would be great to finally get to the cause of this one.

Hope that helps.

Yeah, I had a tab with the Web UI open the entire time.

And yes, I believe I do have AFP enabled.

+1

Also happening to me.