Pool\Share inaccessible - btrfs-transacti at 100% CPU

grizzly · February 13, 2016, 9:31pm

Probably more of a BTRFS question than a Rockstor one per se. Our Rockstor has a pool with a share that can no longer be browsed by an NFS or Samba client; the client just freezes. This started after I deleted some old snapshots using the Rockstor UI and then ran a test replication (the snapshot deletion is the more likely trigger). The share is pretty big: 5 TB.

It seems to be related to a process, btrfs-transacti, that sits at 100% CPU most of the time: occasionally it drops to near 0%, and I can briefly browse the share again. I’ve tried rebooting Rockstor, but the same thing happens. Are BTRFS snapshots a bit like VMware’s, in that if you delete many of them, you can expect a long wait while a merge-with-parent operation occurs? Is it possible I have a corrupt pool, or will this eventually sort itself out? It’s been 3-4 hours so far.

If it is corrupt, some googling suggests I should try booting off a USB stick and running: btrfs check --repair /dev/sdXX. I assume I’d run it against Pool1’s /dev/sdXX path? If so, how do I discover what this is? Is running this command destructive?
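On the device-path question: `btrfs filesystem show` lists each btrfs filesystem with its label and member devices. Below is a minimal Python sketch that pulls the device paths for one pool label out of that output. The names `devices_for_label` and `show_filesystems` and the `SAMPLE` text are made up for illustration, and the parsing assumes the usual `Label: ... uuid: ...` / `devid ... path ...` layout. On the "destructive" question, the btrfs-check documentation warns against using `--repair` unless advised to by a developer or experienced user, so it is best treated as a last resort.

```python
import re
import subprocess

# Made-up sample in the usual `btrfs filesystem show` layout (illustrative only).
SAMPLE = """\
Label: 'rockstor_rockstor'  uuid: 11111111-2222-3333-4444-555555555555
\tTotal devices 1 FS bytes used 2.00GiB
\tdevid    1 size 12.00GiB used 4.00GiB path /dev/sda3

Label: 'Pool1'  uuid: aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee
\tTotal devices 2 FS bytes used 4.50TiB
\tdevid    1 size 2.73TiB used 2.30TiB path /dev/sdb
\tdevid    2 size 2.73TiB used 2.30TiB path /dev/sdc
"""


def devices_for_label(show_output, label):
    """Return the device paths listed under the filesystem with this label."""
    devices = []
    in_section = False
    for line in show_output.splitlines():
        stripped = line.strip()
        if stripped.startswith("Label:"):
            # A new filesystem section starts; labels are quoted unless "none".
            match = re.match(r"Label:\s*(?:'([^']*)'|(\S+))", stripped)
            found = (match.group(1) or match.group(2)) if match else None
            in_section = found == label
        elif in_section and stripped.startswith("devid"):
            # Device lines end with "path /dev/...".
            path_match = re.search(r"\bpath\s+(\S+)", stripped)
            if path_match:
                devices.append(path_match.group(1))
    return devices


def show_filesystems():
    """Run `btrfs filesystem show` (needs root and btrfs-progs installed)."""
    return subprocess.check_output(["btrfs", "filesystem", "show"], text=True)
```

On a live system the call would look like `devices_for_label(show_filesystems(), "Pool1")`; any one of the returned paths identifies a multi-device pool to the btrfs tools.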

grizzly · February 14, 2016, 8:25pm

Seems to have sorted itself out overnight. When I woke up, btrfs-transacti had ceased processing and I could access the share again. I expect this was busy merging the snapshots I deleted. One to beware of: deleting old snapshots of large shares will freeze your NAS!

stupidcomputers · September 5, 2016, 7:32pm

I have been running Rockstor for over a year now, and I have this problem frequently as well. I make heavy use of scheduled snapshots on a 19 TB btrfs-formatted raw block device attached to a Rockstor VM.

When this happens, the btrfs-transacti process consumes all available CPU on the VM it runs in. I am not able to access shares via SMB during this time. If I leave it alone, it eventually finishes and everything comes back.

When I look at IOPS on the disk set holding the block device, I see minimal IO usage.

Are there any parameters that can be tweaked to speed up this process? What data is the CPU crunching on if it is not generating disk IO? Can that data be optimized?

grizzly · September 5, 2016, 8:32pm

Quotas are unstable in btrfs at present. Try disabling them:
btrfs quota disable /mnt2/<your pool name>

stupidcomputers · September 5, 2016, 10:01pm

The best information regarding impacts I could find on this feature is here:

I have disabled quotas as a test, on the assumption that I will no longer be able to see how much space a snapshot consumes. Let’s see what happens!

sfranzen · September 5, 2016, 10:45pm

Another thing that may be impacting performance is btrfs subvolume deletion. I don’t know how this is currently configured in Rockstor, but I can check that out tomorrow.

sfranzen · September 6, 2016, 3:13pm

Well, the command is simply run_command([BTRFS, 'subvolume', 'delete', snap], log=True), without any options, so that can’t have been the cause.
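For reference, `btrfs subvolume delete` does accept commit options (`-c`/`--commit-after` and `-C`/`--commit-each`), and the space is actually reclaimed later by a background cleaner, which `btrfs subvolume sync` can wait on. A rough Python sketch of building those commands follows. The helper names and the `BTRFS` path are hypothetical stand-ins, not Rockstor's actual code, and plain `subprocess` is used in place of Rockstor's `run_command`. Whether either option changes the CPU load seen in this thread is untested here.

```python
import subprocess

BTRFS = "/sbin/btrfs"  # assumed binary location; adjust for your system


def delete_cmd(snap_path, commit=None):
    """Build a `btrfs subvolume delete` command line.

    commit: None (no option, as in the call quoted above),
            "after" (-c, commit once after all deletions) or
            "each" (-C, commit after each deletion).
    """
    flags = {None: [], "after": ["-c"], "each": ["-C"]}
    if commit not in flags:
        raise ValueError("commit must be None, 'after' or 'each'")
    return [BTRFS, "subvolume", "delete"] + flags[commit] + [snap_path]


def sync_cmd(pool_mount):
    """Build a command that waits until deleted subvolumes are cleaned up."""
    return [BTRFS, "subvolume", "sync", pool_mount]


def delete_and_wait(snap_path, pool_mount):
    """Delete one snapshot, then block until the cleaner has finished."""
    subprocess.check_call(delete_cmd(snap_path, commit="after"))
    subprocess.check_call(sync_cmd(pool_mount))
```

Calling `delete_and_wait` on one snapshot at a time would at least make the cleanup cost of each deletion visible, instead of queuing many deletions at once.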