100% CPU usage kworker

McFaul · April 5, 2016, 8:05pm

Hi, one of the kworkers is using 100% CPU. Can anyone tell me why, what to do about it, or how to troubleshoot it?

Thanks

top - 21:03:32 up  1:01,  1 user,  load average: 3.09, 3.17, 3.29
Tasks: 262 total,   2 running, 257 sleeping,   3 stopped,   0 zombie
%Cpu(s):   0.2/50.0   50[||||||||||||||||||||||||||                           ]
KiB Mem :  7343312 total,   261904 free,  1625124 used,  5456284 buff/cache
KiB Swap:  6072316 total,  6072316 free,        0 used.  5566992 avail Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
 4874 root      20   0       0      0      0 R 100.0  0.0  31:00.10 kworker/u8+
   53 root      20   0       0      0      0 D   0.0  0.0  23:56.30 kworker/u8+
 4401 root      20   0       0      0      0 S   0.0  0.0   3:02.30 btrfs-tran+
 4365 root      20   0       0      0      0 S   0.0  0.0   0:24.78 btrfs-tran+
 4405 root      20   0  423808  53024   9264 S   0.0  0.7   0:17.75 gunicorn
 4281 root      20   0  559680  48676  11368 S   0.0  0.7   0:12.16 data-colle+
 4288 root      20   0  421108  52016   9156 S   0.0  0.7   0:11.96 gunicorn
 5864 root      20   0  475560  15648  12012 S   0.0  0.2   0:07.30 smbd
   35 root      20   0       0      0      0 S   0.0  0.0   0:04.32 kswapd0
 7449 root      20   0  157808   4368   3564 T   0.0  0.1   0:02.94 top
    7 root      20   0       0      0      0 S   0.0  0.0   0:01.79 rcu_sched
    1 root      20   0  123504   5640   3928 S   0.0  0.1   0:01.40 systemd
 7177 root      20   0       0      0      0 S   0.0  0.0   0:01.13 kworker/0:3
  607 root      20   0       0      0      0 S   0.0  0.0   0:01.10 usb-storage
 4265 root      20   0  222348  17388   7328 S   0.0  0.2   0:00.98 supervisord
 4283 nginx     20   0  110592   9532   6860 S   0.0  0.1   0:00.97 nginx
   25 root      39  19       0      0      0 S   0.0  0.0   0:00.76 khugepaged

suman · April 5, 2016, 11:56pm

Thanks for the info @McFaul. I wonder if this is caused by a btrfs thread. This link might help you find out which thread is causing it for the most part.

How is your Pool, Share and Snapshot makeup? How many do you have and how are they distributed?

McFaul · April 6, 2016, 5:29am

Hi,

After a little more troubleshooting I now think it is the BTRFS thread.

I have two pools, a 12 drive one and an 8 drive one. Two shares on the 12 drive, one share on the 8 drive. It’s just a giant media storage box, so nothing fancy, and no rockons or anything complicated like that.

Two weeks ago I was doing a scrub on the 8 drive pool and it said that one of the disks had failed. I removed the disk, both physically and from the BTRFS pool, and rebalanced so I had a 7 drive pool with no missing devices. Once the drive was out, I did a full surface scan of the “failed” drive. It was fine, so I did a full disk erase (on my Windows machine) to zero the drive. Then I put it back in the Rockstor and re-added it to the (now 7 drive) pool; however, it hasn’t balanced yet.

Now, the reason I think it is the BTRFS thread: after a few hours, the CPU is now basically idle… If I try to copy a file to the 8 drive pool, that thread goes back up to 100% and stays there… and then the file transfer rate goes to zero and the copy times out with “an unexpected network error occurred”.

McFaul · April 6, 2016, 7:37am

Hi again,

I tried cat /proc//stack and all I got was:

[<ffffffffffffffff>] 0xffffffffffffffff

But now I’m less sure it’s BTRFS.

Once the CPU was idle… I started a balance, which it does need, as 7 of the disks have 5TB each on them and the 8th only has 200GB.

Now I’m back up to 99% CPU usage… but this time it actually shows BTRFS using the CPU, rather than kworker/u8.

I’ll let the balance finish, then retry the copy.
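
To put rough numbers on how much data that balance has to move, here is a small sketch. It assumes the balance ends with data spread evenly across all 8 devices, which depends on the RAID profile and on how btrfs places chunks, and it uses decimal GB. The function name `balance_targets` is made up for illustration.

```python
def balance_targets(used_gb_per_device):
    """Return the even per-device target and the change needed on each device.

    Assumption: a finished balance leaves the same amount on every device.
    Negative deltas mean data moves off a device, positive means onto it.
    """
    total = sum(used_gb_per_device)
    target = total // len(used_gb_per_device)
    deltas = [target - used for used in used_gb_per_device]
    return target, deltas


# Figures from the post: 7 disks holding about 5 TB each, the 8th about 200 GB.
used = [5000] * 7 + [200]
target, deltas = balance_targets(used)
print(target)  # 4400
print(deltas)  # [-600, -600, -600, -600, -600, -600, -600, 4200]
```

Under that assumption, roughly 4.2 TB has to be rewritten onto the re-added disk, so a long-running balance is expected here.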

PumaDAce · April 11, 2016, 9:00pm

Hello,
I recently had the same problem, so I went to the btrfs IRC channel for help (since I did not get any answer here).
In my case it occurred mainly during writing (and the whole system froze or slowed down). I got some advice there, and in the end the cause in my case was simple:
in principle I ran out of space (well, “space”). It really depends on the current state of your pool: try “btrfs fi show /pool” and “btrfs fi df /pool” and you will get the most important info.
If used space is close to total, that is bad, and so is having no unallocated space. Basically btrfs is trying to find some free chunks or something like that (and this applies not only to data but can also apply to metadata).
I was advised that it is best to keep at least 10% free, but I think much more may be needed. You can also try balancing your pool, which can help ( https://btrfs.wiki.kernel.org/index.php/FAQ ).
As I understood it, this is a common problem with btrfs.

I hope this helps a bit. But of course it may not be what is happening in your case.
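
The 10% rule of thumb above can be checked with a few lines of Python. This is only a sketch: it assumes you have already read a device's size and its allocated space (the per-device “size” and “used” figures from btrfs fi show) and converted them to bytes by hand. The function names and the example numbers are hypothetical.

```python
def unallocated_fraction(device_size_bytes, allocated_bytes):
    """Return the share of raw device space not yet allocated to any chunk."""
    if device_size_bytes <= 0:
        raise ValueError("device size must be positive")
    return (device_size_bytes - allocated_bytes) / device_size_bytes


def is_low_on_space(device_size_bytes, allocated_bytes, min_free=0.10):
    """True when unallocated space is below the suggested margin (10% by default)."""
    return unallocated_fraction(device_size_bytes, allocated_bytes) < min_free


# Hypothetical example: a 5 TB device with 4.8 TB already allocated.
TB = 1000 ** 4
print(is_low_on_space(5 * TB, 48 * TB // 10))
```

The check would need to be run per device, since a pool can have plenty of room on one disk while the others are full.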

McFaul · April 12, 2016, 7:47am

Hi,

You are exactly right. I knew that some of the devices were very low on space (but some had LOTS of space), and I have been trying to run a balance… It said it was working (it didn’t give a low space error), but it wasn’t actually moving anything.

So then I figured it might not have enough space on those devices to do the balance, so I deleted a bunch of files, and now the balance is running (and I can see the full devices emptying and the empty device filling).

This has also “cured” my 100% CPU usage… but I had not associated the two until I saw your post, so thanks for that!

Chris

Flyer · April 12, 2016, 8:54am

Hi @PumaDAce, interesting topic, and it covers an important, well-known btrfs issue.

About btrfs fi show and btrfs fi df: together they do return the info you need, but it takes two steps and some reading (this is why some people, checking only with fi show, think they are OK and then run out of space).
The btrfs-progs contributors are working on a better way to collect real space usage (e.g. from btrfs-progs 4.5 we’ll have btrfs fi du to collect more reliable info in a single command).

Flyer