Hi, one of the kworkers is using 100% CPU… can anyone tell me why, what to do about it, or how to troubleshoot it?
top - 21:03:32 up 1:01, 1 user, load average: 3.09, 3.17, 3.29
Tasks: 262 total, 2 running, 257 sleeping, 3 stopped, 0 zombie
%Cpu(s): 0.2/50.0 50[|||||||||||||||||||||||||| ]
KiB Mem : 7343312 total, 261904 free, 1625124 used, 5456284 buff/cache
KiB Swap: 6072316 total, 6072316 free, 0 used. 5566992 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
4874 root 20 0 0 0 0 R 100.0 0.0 31:00.10 kworker/u8+
53 root 20 0 0 0 0 D 0.0 0.0 23:56.30 kworker/u8+
4401 root 20 0 0 0 0 S 0.0 0.0 3:02.30 btrfs-tran+
4365 root 20 0 0 0 0 S 0.0 0.0 0:24.78 btrfs-tran+
4405 root 20 0 423808 53024 9264 S 0.0 0.7 0:17.75 gunicorn
4281 root 20 0 559680 48676 11368 S 0.0 0.7 0:12.16 data-colle+
4288 root 20 0 421108 52016 9156 S 0.0 0.7 0:11.96 gunicorn
5864 root 20 0 475560 15648 12012 S 0.0 0.2 0:07.30 smbd
35 root 20 0 0 0 0 S 0.0 0.0 0:04.32 kswapd0
7449 root 20 0 157808 4368 3564 T 0.0 0.1 0:02.94 top
7 root 20 0 0 0 0 S 0.0 0.0 0:01.79 rcu_sched
1 root 20 0 123504 5640 3928 S 0.0 0.1 0:01.40 systemd
7177 root 20 0 0 0 0 S 0.0 0.0 0:01.13 kworker/0:3
607 root 20 0 0 0 0 S 0.0 0.0 0:01.10 usb-storage
4265 root 20 0 222348 17388 7328 S 0.0 0.2 0:00.98 supervisord
4283 nginx 20 0 110592 9532 6860 S 0.0 0.1 0:00.97 nginx
25 root 39 19 0 0 0 S 0.0 0.0 0:00.76 khugepaged
Thanks for the info @McFaul. I wonder if this is caused by a btrfs thread. This link might help you find out which thread is responsible.
How is your Pool, Share and Snapshot makeup? How many do you have and how are they distributed?
After a little more troubleshooting I now think it is the BTRFS thread.
I have two pools: a 12-drive one and an 8-drive one, with two shares on the 12-drive pool and one share on the 8-drive pool. It’s just a giant media storage box, so nothing fancy, and no Rock-ons or anything complicated like that.
Two weeks ago I was doing a scrub on the 8-drive pool and it said that one of the disks had failed. I removed the disk, both physically and from the btrfs pool, and rebalanced so I had a 7-drive pool with no missing devices. Once the drive was out, I did a full surface scan of the “failed” drive; it was fine, so I did a full disk erase (on my Windows machine) to zero the drive. Then I put it back in the Rockstor box and re-added it to the (now 7-drive) pool; however, it hasn’t balanced yet.
Now, the reason I think it is the btrfs thread: after a few hours the CPU is basically idle… but if I try to copy a file to the 8-drive pool, that thread goes back up to 100% and stays there, and then the file transfer rate drops to zero and the copy times out with “an unexpected network error occurred”.
I tried cat /proc//stack and all I got was:
But now I’m less sure it’s btrfs.
Once the CPU was idle… I started a balance, which it does need, as 7 of the disks have 5TB each on them and the 8th only has 200GB.
Now I’m back up to 99% CPU usage… but this time it actually shows as btrfs using the CPU, rather than kworker/u8.
I’ll let the balance finish, then retry the copy.
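For reference, a balance can also be started and monitored from the shell; this is only a sketch, and /mnt2/pool8 is a placeholder mount point (Rockstor mounts pools under /mnt2, but adjust to your setup). The commands need root, so the snippet falls back to a message when they can't run:

```shell
# Sketch only: start a balance in the background and check on it.
# POOL is a placeholder; adjust to where your pool is actually mounted.
POOL=/mnt2/pool8
if command -v btrfs >/dev/null 2>&1 && [ -d "$POOL" ]; then
    btrfs balance start --bg "$POOL"   # returns immediately; balance runs in the kernel
    btrfs balance status "$POOL"       # shows chunks considered/relocated so far
else
    echo "btrfs-progs not installed or $POOL not mounted; commands shown for reference"
fi
```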
I recently hit the same problem, so I went to the btrfs IRC channel for help (since I did not get any answer here).
In my case it occurred mainly during writes (along with whole-system freezes/slowdowns). I was advised there, and in the end the explanation in my case was simple:
In principle, I had run out of space (well, “space”). It really depends on your current pool state: try “btrfs fi show /pool” and “btrfs fi df /pool” and you will get the most important info.
If used space is close to total, that is bad, and so is having no unallocated space left: btrfs is then struggling to find free chunks to allocate (and this applies not only to data but also to metadata).
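To make that concrete, here is a small sketch that pulls the per-device size/used gap out of btrfs fi show output; the sample output, UUID, and device names are made up purely for illustration:

```shell
# Hypothetical `btrfs fi show` output; on a real box run:  btrfs fi show /pool
sample='Label: none  uuid: 1234
        Total devices 2 FS bytes used 1.80TiB
        devid 1 size 2.00TiB used 1.95TiB path /dev/sda
        devid 2 size 2.00TiB used 1.20TiB path /dev/sdb'
# Per-device "used" here is *allocation*, not file data: once used == size,
# that device has no unallocated space left for btrfs to create new chunks on.
echo "$sample" | awk '/devid/ {
    gsub(/TiB/, "", $4); gsub(/TiB/, "", $6)
    printf "%s: %.2f TiB unallocated\n", $NF, $4 - $6
}'
# -> /dev/sda: 0.05 TiB unallocated
# -> /dev/sdb: 0.80 TiB unallocated
```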
I was advised that the best thing is to keep at least 10% free, but I think much more can be needed. You can also try balancing your pool, which can help ( https://btrfs.wiki.kernel.org/index.php/FAQ ).
As I understand it, this is a common problem with btrfs.
I hope this helps a bit, but of course it may not apply to your case.
You are exactly right. I knew that some of the devices were very low on space (but some had LOTS of space), and I have been trying to run a balance… it said it was working (it didn’t give a low-space error), but it wasn’t actually moving anything.
So then I figured it may not have enough space on those devices to do the balance, so I deleted a bunch of files, and now the balance is running (and I can see the full devices emptying and the empty device filling).
This has also “cured” my 100% CPU usage… but I hadn’t associated the two until I saw your post, so thanks for that!
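A related trick worth knowing when a pool is too full for a plain balance: balance only nearly-empty chunks first (they need little free space to relocate and hand unallocated space back), then raise the threshold. This is a sketch, /pool is a placeholder, and the commands are echoed rather than executed so you can review them before running as root:

```shell
# Sketch: progressively balance fuller and fuller data chunks.
# -dusage=N only touches data chunks that are at most N% full.
for pct in 5 10 20 40; do
    echo "btrfs balance start -dusage=$pct /pool"
done
```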
Hi @PumaDAce, interesting topic covering an important and well-known btrfs issue.
Talking about btrfs fi show and friends: they do return the desired info, but it takes two steps and some reading (this is why some people, checking only with fi show, think they are fine and then run out of space).
The btrfs-progs contributors are working on a better way to report real space usage (e.g. from btrfs-progs 4.5 we’ll have btrfs fi du to collect more reliable info in a single command).
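For when that lands, here is a guarded sketch of how it would be invoked; /pool/share is a placeholder path, and the snippet falls back to a message when btrfs-progs is missing or too old:

```shell
# Sketch (assumes btrfs-progs >= 4.5 and a share at /pool/share; adjust both):
if command -v btrfs >/dev/null 2>&1; then
    btrfs filesystem du -s /pool/share 2>/dev/null ||
        echo "needs btrfs-progs >= 4.5 and a path on a mounted btrfs filesystem"
else
    echo "btrfs-progs not installed; command shown for reference"
fi
```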