Rockstor-bootstrap taking forever

Hi,

since a few days my rockstor system is not booting up probably. The “rockstor-bootstrap” service is taking forever and will never finish activating. Systemd tells me:

● rockstor-bootstrap.service - Rockstor bootstrapping tasks
   Loaded: loaded (/etc/systemd/system/rockstor-bootstrap.service; enabled; vendor preset: disabled)
   Active: activating (start) since Sat 2016-08-06 23:10:06 CEST; 11min ago
 Main PID: 3566 (bootstrap)
   CGroup: /system.slice/rockstor-bootstrap.service
           └─3566 /usr/bin/python /opt/rockstor/bin/bootstrap

The web interface is available but very slow. Rock-On (docker) and other services are not starting up (systemd status):

    State: starting
     Jobs: 10 queued
   Failed: 0 units
    Since: Sat 2016-08-06 23:08:49 CEST; 15min ago

gunicorn.log

2016-08-06 23:10:08 [3580] [INFO] Starting gunicorn 0.16.1
2016-08-06 23:10:08 [3580] [INFO] Listening at: http://127.0.0.1:8000 (3580)
2016-08-06 23:10:08 [3580] [INFO] Using worker: sync
2016-08-06 23:10:08 [3592] [INFO] Booting worker with pid: 3592
2016-08-06 23:10:08 [3591] [INFO] Booting worker with pid: 3591
2016-08-06 23:12:11 [3580] [CRITICAL] WORKER TIMEOUT (pid:3592)
2016-08-06 23:12:11 [3580] [CRITICAL] WORKER TIMEOUT (pid:3592)
2016-08-06 23:12:11 [5483] [INFO] Booting worker with pid: 5483
2016-08-06 23:14:19 [3580] [CRITICAL] WORKER TIMEOUT (pid:3591)
2016-08-06 23:14:19 [3580] [CRITICAL] WORKER TIMEOUT (pid:3591)
2016-08-06 23:14:19 [7816] [INFO] Booting worker with pid: 7816
2016-08-06 23:14:29 [3580] [CRITICAL] WORKER TIMEOUT (pid:5483)
2016-08-06 23:14:29 [3580] [CRITICAL] WORKER TIMEOUT (pid:5483)
2016-08-06 23:14:29 [8156] [INFO] Booting worker with pid: 8156
2016-08-06 23:16:32 [3580] [CRITICAL] WORKER TIMEOUT (pid:7816)
2016-08-06 23:16:32 [3580] [CRITICAL] WORKER TIMEOUT (pid:7816)
2016-08-06 23:16:32 [10198] [INFO] Booting worker with pid: 10198
2016-08-06 23:18:38 [3580] [CRITICAL] WORKER TIMEOUT (pid:10198)
2016-08-06 23:18:38 [3580] [CRITICAL] WORKER TIMEOUT (pid:10198)
[...]

Where does the output of the bootstrap goes to?
Any idea?

Thanks.

Since it is a systemd service, you might try journalctl -u rockstor-bootstrap, or else manually look through the logs since last boot with journalctl -xb.

Hi Steven,

thanks for your reply. The journalctl -u rockstor-bootstrap is empty, I think because it never gets activated. journalctl -u rockstor seems normal:

[root@majestix ~]# journalctl -xu rockstor
-- Logs begin at Sun 2016-08-07 15:39:28 CEST, end at Sun 2016-08-07 15:49:38 CEST. --
Aug 07 15:40:49 majestix supervisord[3550]: 2016-08-07 15:40:49,655 CRIT Supervisor running as root (no user in config file)
Aug 07 15:40:49 majestix supervisord[3550]: 2016-08-07 15:40:49,675 INFO RPC interface 'supervisor' initialized
Aug 07 15:40:49 majestix supervisord[3550]: 2016-08-07 15:40:49,675 CRIT Server 'unix_http_server' running without any HTTP authentication checking
Aug 07 15:40:49 majestix supervisord[3550]: 2016-08-07 15:40:49,676 INFO supervisord started with pid 3550
Aug 07 15:40:50 majestix supervisord[3550]: 2016-08-07 15:40:50,681 INFO spawned: 'nginx' with pid 3564
Aug 07 15:40:50 majestix supervisord[3550]: 2016-08-07 15:40:50,687 INFO spawned: 'gunicorn' with pid 3565
Aug 07 15:40:50 majestix supervisord[3550]: 2016-08-07 15:40:50,690 INFO spawned: 'data-collector' with pid 3566
Aug 07 15:40:50 majestix supervisord[3550]: 2016-08-07 15:40:50,693 INFO spawned: 'ztask-daemon' with pid 3567
Aug 07 15:40:52 majestix supervisord[3550]: 2016-08-07 15:40:52,774 INFO success: data-collector entered RUNNING state, process has stayed up for > than 2 seconds (startsecs)
Aug 07 15:40:52 majestix supervisord[3550]: 2016-08-07 15:40:52,776 INFO success: ztask-daemon entered RUNNING state, process has stayed up for > than 2 seconds (startsecs)
Aug 07 15:40:55 majestix supervisord[3550]: 2016-08-07 15:40:55,781 INFO success: nginx entered RUNNING state, process has stayed up for > than 5 seconds (startsecs)
Aug 07 15:40:55 majestix supervisord[3550]: 2016-08-07 15:40:55,782 INFO success: gunicorn entered RUNNING state, process has stayed up for > than 5 seconds (startsecs)

I cannot find any hint on a problem in journalctl -xb

Thanks.

This is odd, your bootstrapping process seems to hang already at /usr/bin/python /opt/rockstor/bin/bootstrap. Try running this command manually. The first thing this should do is perform a btrfs device scan, but it should generate output whether or not that succeeds.

Running it manually returns the following:

BTRFS device scan complete
Error connecting to Rockstor. Is it running?
Exception occured while bootstrapping. This could be because rockstor.service is still starting up. will wait 2 seconds and try again. Exception: HTTPConnectionPool(host='127.0.0.1', port=8000): Max retries exceeded with url: /api/commands/bootstrap (Caused by <class 'httplib.BadStatusLine'>: '')
Error connecting to Rockstor. Is it running?
Exception occured while bootstrapping. This could be because rockstor.service is still starting up. will wait 2 seconds and try again. Exception: HTTPConnectionPool(host='127.0.0.1', port=8000): Max retries exceeded with url: /api/commands/bootstrap (Caused by <class 'httplib.BadStatusLine'>: '')
Error connecting to Rockstor. Is it running?
Exception occured while bootstrapping. This could be because rockstor.service is still starting up. will wait 2 seconds and try again. Exception: HTTPConnectionPool(host='127.0.0.1', port=8000): Max retries exceeded with url: /api/commands/bootstrap (Caused by <class 'httplib.BadStatusLine'>: '')

Looks like that rockstor.service is not booting up probably. I wonder why this output is not returned by the service.

Do the system journals and/or rockstor.log give any clues on why this is so? Have you tried restarting the rockstor service or system? I can see that your gunicorn server is getting worker process timeouts, which is surprising given the long (120 seconds) timeout limit, but I can’t so far figure out the cause with the system components apparently running fine.

Thanks for your reply.

I did not count how many times I restarted the system. Also tried to restart the services. This did not help at all. As far as I understand the components involved, the timeout occurs somewhere between the gunicorn server and the nginx server? But why does the UI than runs fine? Its slow, but it works.
The system is not the fastest, but should be fine: AMD Turion II Neo, 2,2GHz, 10GB RAM.

Yesterday I tried to reinit rockstor (deleting the .initrock and running the command). This did not work also. Except that I now can not reactivate my stable subscription.

I use the docker daemon for my own containers without using rock-on ui. But this should not affect, since the docker environment is not up at this state.

If you need futher logs, just tell me.

Well, if neither journald nor /opt/rockstor/var/log contains any further clues about this timeout, I’m afraid someone with more knowledge of the web components will have to look into this. I’m fairly new to web development and still learning the ropes of the code.

It might help to also report this issue at the github repo. All I could suggest as a workaround is to manually increase the timeout values in /opt/rockstor/etc/supervisord.conf and restart rockstor.service, but frankly two minutes should be plenty and your system is even slightly faster than mine.

Thanks.

Created an issue: https://github.com/rockstor/rockstor-core/issues/1421

Maybe related to: https://github.com/rockstor/rockstor-core/issues/1419

I did some more research today and found out, that docker is not removing subvolumes when deleting/removing containers or images. So I got 1001 subvolumes laying around. Could this be the reason the bootstrap/gunicorn runs into timeout?

About the docker issue:

Just want to confirm: After cleaning up the docker directory by removing the subvolumes, rockstor is booting fine.

3 Likes