Docker weirdness (task blocked for more than 120 seconds)

doenietzomoeilijk · April 28, 2019, 12:42pm

I’m seeing some weird behaviour on my Rockstor machine, and I hope someone more knowledgeable than me might be able to shed some light on things.

Every now and then, “an action” concerning Docker seems to fail, and I’ll see something like this in the logs:

sd_journal_get_cursor() failed: 'Cannot assign requested address' [v8.24.0-34.el7]
INFO: task btrfs-cleaner:8092 blocked for more than 120 seconds.
Not tainted 4.12.4-1.el7.elrepo.x86_64 #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
INFO: task dockerd:48368 blocked for more than 120 seconds.

…and several other things timing out in a similar way, after which the Docker daemon unceremoniously shits the bed:

2019/04/28 12:06:57 http error: Unable to proxy the request via the Docker socket (err=context canceled) (code=500)

There are call traces whenever I see the hung_task_timeout messages in the logs, but those are a bit long to stuff into the thread right away. I could dump them into a file if someone thinks that might help. They tend to contain a lot of references to btrfs-related things.
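In the meantime, for a quick overview of which tasks get reported as hung without pasting the full traces, something like the Python sketch below could work. It is only a rough sketch: the regex is my own guess based on the log lines above, `find_hung_tasks` is a made-up helper name, and the sample text is just the excerpt from this post.

```python
import re

# Pattern guessed from the "blocked for more than N seconds" lines quoted above.
HUNG_TASK_RE = re.compile(
    r"INFO: task (?P<name>.+?):(?P<pid>\d+) "
    r"blocked for more than (?P<secs>\d+) seconds"
)


def find_hung_tasks(log_text):
    """Return a (task name, pid, timeout in seconds) tuple per hung-task warning."""
    hits = []
    for line in log_text.splitlines():
        match = HUNG_TASK_RE.search(line)
        if match:
            hits.append(
                (match.group("name"), int(match.group("pid")), int(match.group("secs")))
            )
    return hits


# Sample input: the log excerpt from this post.
SAMPLE = """\
INFO: task btrfs-cleaner:8092 blocked for more than 120 seconds.
Not tainted 4.12.4-1.el7.elrepo.x86_64 #1
INFO: task dockerd:48368 blocked for more than 120 seconds.
"""

for name, pid, secs in find_hung_tasks(SAMPLE):
    print(f"{name} (pid {pid}) blocked for more than {secs}s")
```

To run it against the real logs, the output of `dmesg` or `journalctl -k` could be saved to a file and its contents passed to `find_hung_tasks` instead of `SAMPLE`.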

Of course, processes in containers keep running, and I can start Docker again by restarting the Rockon service or, in the worst case, the entire server, but it’s kind of annoying to say the least. I can reliably trigger the error by running a Nextcloud container, for example, and running things like Watchtower usually brings the house down, too. Sometimes, but not always, the problems seem to affect other services, notably PHP.

I’m kinda stumped as to why this happens. I’m fairly certain that Docker things work for other people, otherwise the forum would contain more similar complaints. The server does its thing and gets through scrubs just fine, so it’s not like my disks are on the verge of fiery death. The system is somewhat low-powered (it’s a Celeron-powered HP Gen8 microserver with 6GB of RAM), but it doesn’t seem cramped, overloaded or out of RAM or something. I’ve considered blaming it on the rather ancient version of Docker running on the system, or on BTRFS, but again, it seems to work for other people, so that can’t really be it, right?

Currently I’m experiencing a variation on the theme: Portainer declared my Docker daemon dead, but running docker ps from the command line and inspecting containers through Cockpit worked just fine. I’ve stopped and started the service using the Rockstor interface, and it’s still dead. Sigh.

I’m out of ideas, and hoping someone has an interesting avenue to explore. Maybe it’s an abundance of open file descriptors or something silly like that, but I’m not knowledgeable enough to troubleshoot that properly. For now it seems like I have to reboot the machine yet again to get it to a functioning state, and that’s getting rather annoying at this point. Any help or pointers would be greatly appreciated.