Btrfs race or hang conditions

I’ve recently had occasional race or hang conditions with btrfs on my system. This seems related to having many subvolumes. The relevant log entries are below.

Sep 22 09:07:26 rocky.gamull.com kernel: Workqueue: btrfs-delayed-meta btrfs_delayed_meta_helper [btrfs]
Sep 22 09:07:26 rocky.gamull.com kernel: Call Trace:
Sep 22 09:07:26 rocky.gamull.com kernel:  __schedule+0x28a/0x880
Sep 22 09:07:26 rocky.gamull.com kernel:  schedule+0x36/0x80
Sep 22 09:07:26 rocky.gamull.com kernel:  btrfs_tree_lock+0xad/0x210 [btrfs]
Sep 22 09:07:26 rocky.gamull.com kernel:  ? remove_wait_queue+0x60/0x60
Sep 22 09:07:26 rocky.gamull.com kernel:  btrfs_search_slot+0x714/0x9b0 [btrfs]
Sep 22 09:07:26 rocky.gamull.com kernel:  ? __enqueue_entity+0x6c/0x70
Sep 22 09:07:26 rocky.gamull.com kernel:  btrfs_lookup_inode+0x3e/0xc0 [btrfs]
Sep 22 09:07:26 rocky.gamull.com kernel:  ? btrfs_insert_delayed_items+0xc0/0x480 [btrfs]
Sep 22 09:07:26 rocky.gamull.com kernel:  __btrfs_update_delayed_inode+0x74/0x220 [btrfs]
Sep 22 09:07:26 rocky.gamull.com kernel:  btrfs_async_run_delayed_root+0x1e1/0x200 [btrfs]
Sep 22 09:07:26 rocky.gamull.com kernel:  normal_work_helper+0x136/0x2f0 [btrfs]
Sep 22 09:07:26 rocky.gamull.com kernel:  btrfs_delayed_meta_helper+0x12/0x20 [btrfs]
Sep 22 09:07:26 rocky.gamull.com kernel:  process_one_work+0x149/0x360
Sep 22 09:07:26 rocky.gamull.com kernel:  worker_thread+0x4d/0x3c0
Sep 22 09:07:26 rocky.gamull.com kernel:  kthread+0x109/0x140
Sep 22 09:07:26 rocky.gamull.com kernel:  ? rescuer_thread+0x380/0x380
Sep 22 09:07:26 rocky.gamull.com kernel:  ? kthread_park+0x60/0x60
Sep 22 09:07:26 rocky.gamull.com kernel:  ? do_syscall_64+0x67/0x150
Sep 22 09:07:26 rocky.gamull.com kernel:  ret_from_fork+0x25/0x30
Sep 22 09:07:26 rocky.gamull.com kernel: INFO: task kworker/u50:4:9322 blocked for more than 120 seconds.
Sep 22 09:07:26 rocky.gamull.com kernel:       Tainted: G          I     4.12.4-1.el7.elrepo.x86_64 #1
Sep 22 09:07:26 rocky.gamull.com kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Sep 22 09:07:26 rocky.gamull.com kernel: kworker/u50:4   D    0  9322      2 0x00000080
Sep 22 09:07:26 rocky.gamull.com kernel: Workqueue: btrfs-delayed-meta btrfs_delayed_meta_helper [btrfs]
Sep 22 09:07:26 rocky.gamull.com kernel: Call Trace:
Sep 22 09:07:26 rocky.gamull.com kernel:  __schedule+0x28a/0x880
Sep 22 09:07:26 rocky.gamull.com kernel:  schedule+0x36/0x80
Sep 22 09:07:26 rocky.gamull.com kernel:  btrfs_tree_lock+0xad/0x210 [btrfs]
Sep 22 09:07:26 rocky.gamull.com kernel:  ? remove_wait_queue+0x60/0x60
Sep 22 09:07:26 rocky.gamull.com kernel:  btrfs_search_slot+0x714/0x9b0 [btrfs]
Sep 22 09:07:26 rocky.gamull.com kernel:  ? start_transaction+0x120/0x440 [btrfs]
Sep 22 09:07:26 rocky.gamull.com kernel:  btrfs_lookup_inode+0x3e/0xc0 [btrfs]
Sep 22 09:07:26 rocky.gamull.com kernel:  ? btrfs_insert_delayed_items+0xc0/0x480 [btrfs]
Sep 22 09:07:26 rocky.gamull.com kernel:  __btrfs_update_delayed_inode+0x74/0x220 [btrfs]
Sep 22 09:07:26 rocky.gamull.com kernel:  btrfs_async_run_delayed_root+0x1e1/0x200 [btrfs]
Sep 22 09:07:26 rocky.gamull.com kernel:  normal_work_helper+0x136/0x2f0 [btrfs]
Sep 22 09:07:26 rocky.gamull.com kernel:  btrfs_delayed_meta_helper+0x12/0x20 [btrfs]
Sep 22 09:07:26 rocky.gamull.com kernel:  process_one_work+0x149/0x360
Sep 22 09:07:26 rocky.gamull.com kernel:  worker_thread+0x4d/0x3c0
Sep 22 09:07:26 rocky.gamull.com kernel:  kthread+0x109/0x140
Sep 22 09:07:26 rocky.gamull.com kernel:  ? rescuer_thread+0x380/0x380
Sep 22 09:07:26 rocky.gamull.com kernel:  ? kthread_park+0x60/0x60
Sep 22 09:07:26 rocky.gamull.com kernel:  ? do_syscall_64+0x67/0x150
Sep 22 09:07:26 rocky.gamull.com kernel:  ret_from_fork+0x25/0x30

I’ve heard that adding serialization may help with this; it seems to be related to having many subvolumes.
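For anyone else seeing this, a rough way to capture the blocked-task stacks when it happens (run as root; the output lands in the kernel log) is something like:

# Ask the kernel to dump stacks of all tasks in uninterruptible (D) state
echo w > /proc/sysrq-trigger
# Then pull the hung-task reports out of the kernel log
dmesg -T | grep -A 25 "blocked for more than"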

@magicalyak Hello again.

How many subvolumes (including snapshots)?
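Something like the following should give a quick count, assuming the pool is mounted under /mnt2/<pool_name> (which is where Rockstor normally puts it; adjust the path to suit):

# All subvolumes on the pool (shares, clones and snapshots)
btrfs subvolume list /mnt2/<pool_name> | wc -l
# Snapshots only
btrfs subvolume list -s /mnt2/<pool_name> | wc -l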

Also, is this a multi-core machine? A fair bit of the kernel side of btrfs is apparently multi-threaded, and there have been reports of single-core machines having wait issues where multi-core ones don’t. Just a thought.

The referenced issue seems to mainly concern boot-up mounts. All data mounts within Rockstor are already serialised, as we run a single thread to do them; at least that is my understanding. Only the system-disk-related mounts are managed by fstab / systemd.
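That is, conceptually it’s equivalent to something like the loop below, mounting one share at a time rather than several in parallel (device and share names here are placeholders only, not the actual code):

# One mount at a time, never concurrent
for share in share1 share2 share3; do
    mount -t btrfs -o subvol=$share /dev/sdX /mnt2/$share
done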

It’s four pages of shares.
This machine is a dual-socket rack server with 6-core processors.

Sorry for necro-ing this, but I may or may not have the same issue. Things hum along mostly fine, but horsing around with a certain Docker container (not as a RockOn) seems to trigger it at some point. This results in the container not shutting down, and doing something like ls in the volume that’s mounted in that container freezes my terminal. Mounting volumes over AFP is dead, as are some of the Docker containers, though others still hum along just fine; it seems like everything touching a share on the pool hangs or at least times out. Things like btrfs-scrub just sit there, doing nothing.
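For what it’s worth, this is roughly how I check what’s actually wedged (needs root; the PID is whichever stuck process you pick out of the first command):

# Processes stuck in uninterruptible (D) sleep, plus the kernel function they're waiting in
ps -eo pid,stat,wchan:32,comm | awk '$2 ~ /^D/'
# Kernel stack of one of the stuck processes
cat /proc/<pid>/stack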

See the relevant portions of my journal here.

This is a testing release of Rockstor on an HP Gen8, so dual core with 4G of RAM, and a pool consisting of 3 disks in RAID1, with a total of 19 shares and 126 snapshots on it, the latter being purely Docker stuff.
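In case it’s useful, this is roughly how I double-check the snapshot count and keep an eye on the stuck scrub (pool mount point assumed to be under /mnt2/, adjust to your setup):

# Snapshot count on the pool
btrfs subvolume list -s /mnt2/<pool_name> | wc -l
# Whether the scrub is actually progressing
btrfs scrub status /mnt2/<pool_name>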

I hope there’s something that can be done about this (newer kernel?), or at least a way to get a better idea of what’s actually going wrong.

No problem, I’m actually hitting an issue like that now. I’m thinking of rebuilding my rockons share again and seeing if it’s just corruption or something related to that. Right now it’s very strange: my Sonarr container’s UI just doesn’t come up, but others do. At boot, docker ps takes forever and I see failed calls to docker in rockstor.log.

For example, when testing a new rockon I see:
[13/Sep/2018 10:07:18] ERROR [storageadmin.views.rockon:81] Rockon(Ghost-test) is in pending state but there is no pending or failed task for it.
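For reference, this is roughly how I’m poking at it when that happens (the rockstor.log path and the docker unit name are from memory, so adjust if yours differ):

# How long docker takes to answer at all
time docker ps
# Recent docker service messages from this boot
journalctl -u docker -b --no-pager | tail -n 100
# Rockstor's own log
tail -n 100 /opt/rockstor/var/log/rockstor.log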

The error indicates the storageadmin view, which is likely the rockons share.
I generally rebuild the rockon share instead of trying to salvage it, but that’s me.

@phillxnet just revisiting. Does this mean that we shouldn’t have the rockon share on the root disk? I had actually been doing that, but wanted to confirm that this could happen if I did.