Snapshot limits and how to modify them

We have a Rockstor system in daily use by about 100 people with a total of 51 shares and around 20TB of data… We finally got everyone using this system a few months ago (migrated off an old Windows server).
We have configured daily snapshots of all the shares (as hourly snapshots were causing issues when being deleted), retaining 50 per share.
We are now continuing to have issues with smbd processes being in state ‘D’, which requires a reboot before some of the shares become available (and that takes over an hour!).
I believe this could all be because we have too many snapshots, and some btrfs processes iterate over all the snapshots even when they don’t need to [1].

In our experience, deleting many snapshots at once also causes a similar hung state; so I’d like to start decrementing the number of retained snapshots over time…

This would involve changing 50 schedules one at a time in the GUI each day for about 38 days to get the number down to 12… which would be extremely onerous and prone to human error.
I would like to create a cron job that reduces the number of retained snapshots by 1 each day for each of the shares, down to a minimum of 12.
Is there an API that I could utilise to modify the schedules, or shall I just hack into the postgres database?
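
For illustration, this is roughly the sort of cron script I have in mind, driving a REST API with curl and jq. The endpoint path and field names below are only my guesses at what Rockstor might expose (I haven’t found API docs yet), so please correct me if the real API looks different:

```
#!/bin/bash
# Sketch only: the endpoint path (/api/sm/tasks) and the field names
# (task_type, json_meta, max_count) are my guesses at the Rockstor API,
# not confirmed, and the token below is a placeholder.
TOKEN="..."                  # API access token for the Rockstor instance
HOST="https://localhost"
MIN=12                       # never retain fewer than 12 snapshots

# For every scheduled snapshot task, knock the retention count down by 1.
for id in $(curl -sk -H "Authorization: Bearer $TOKEN" "$HOST/api/sm/tasks" \
            | jq -r '.results[] | select(.task_type == "snapshot") | .id'); do
    current=$(curl -sk -H "Authorization: Bearer $TOKEN" "$HOST/api/sm/tasks/$id" \
              | jq -r '.json_meta.max_count')
    if [ "$current" -gt "$MIN" ]; then
        curl -sk -X PUT \
             -H "Authorization: Bearer $TOKEN" \
             -H "Content-Type: application/json" \
             -d "{\"max_count\": $((current - 1))}" \
             "$HOST/api/sm/tasks/$id"
    fi
done
```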

Many thanks for any advice!

[1] https://btrfs.wiki.kernel.org/index.php/Gotchas#Having_many_subvolumes_can_be_very_slow

@ScarabMonkey Hello again and sorry for the slow response.

Many have found that disabling quotas improves performance; this is now possible as of current stable channel updates, see:

https://github.com/rockstor/rockstor-core/issues/1592

(the issue contains pics of the closing pull request’s solution)

Maybe you could temporarily disable quotas, see if deleting multiple snapshots becomes a little more practical time-wise, and then make the required changes to the number of retained snapshots. You could always re-enable quotas later once the snapshot count / performance situation is more manageable.
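
If it helps, disabling and re-enabling quotas is just a pair of btrfs commands run against the pool’s mount point (Rockstor normally mounts pools under /mnt2/&lt;pool-name&gt;; substitute your own pool name):

```
btrfs quota disable /mnt2/your_pool   # turn off qgroup accounting
# ... prune snapshots / let deletions catch up ...
btrfs quota enable /mnt2/your_pool    # re-enable later if you still want quotas
```

If your stable channel version already has the Web-UI option from the issue above, use that instead so Rockstor stays aware of the change.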

I know this doesn’t answer your question directly, but it may be a workaround for the time being. Note that if you are running a stable channel release and have Rock-ons enabled (docker-ce), then quotas are in fact disabled (momentarily) and then re-enabled on every boot anyway. See outstanding issue:

Hope that helps but note that our quota support, enabled or disabled, is an ongoing effort; but then so is that of btrfs itself.

Let us know how you get on. The performance of large arrays with many snapshots is definitely, currently at least, an Achilles heel for btrfs and consequently Rockstor.

Apologies again for not replying sooner.

Remember to ensure via “yum info rockstor” that your installed version is actually from the stable channel.
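
Something along these lines should show it (the exact field names in yum’s output may vary a little between versions):

```
# For an installed package yum should report the repo it came from;
# I believe the field is "From repo", but check the full output if
# this grep comes up empty.
yum info rockstor | grep -iE 'name|version|repo'
```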

As a very late follow-up … yes indeed disabling quotas solved the issue!


Yeah, the quotas gave me a rough last couple of days.

I subscribed to updates and updated to the latest, which pulled in docker-ce. I had moved a couple of drives into my main pool a while back, and it was fairly unbalanced. I started a balance and promptly went to bed.

The next morning, the UI was unresponsive. I checked the balance; it was still on 0 of X chunks processed. I tried cancelling the balance, and that command hung. Tried rebooting; the system got part way through shutting down and hung… great. Waited for the disk activity light to stay quiet for a while and hard rebooted.

System restarted, hung trying to mount, with kernel timeout messages about btrfs_transaction showing up in syslog. Eventually it mounted about ten minutes later and immediately started to balance again, causing anything trying to access the pool to hang. Same with balance cancel.

Downloaded a Tumbleweed image and booted off it, then tried to access the pool. Mounting rw would time out, but ro would mount right away. The data was all there, so that was good.

After a day of googling to sort out the transaction timeouts, running various rescue operations, and getting nowhere, I grabbed a pair of 10TB drives and planned to copy the data over manually from the RO mount.

I restarted the system and missed the boot menu to boot off the USB stick. While I waited for the boot to complete, I just happened to come across a post talking about quotas, many snapshots (thanks docker), and large arrays having extreme performance issues.

Waited for the mount to complete, and the balance started back up. Disabled quotas on the pool, hoping the command would get through eventually. About 15 minutes later it completed, and I was able to cancel the balance with no issue.
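
For anyone who hits the same thing, the commands from that last step were roughly as follows (the mount point here is just an example, not my real pool name):

```
btrfs balance status /mnt2/main_pool    # confirm the balance is still running
btrfs quota disable /mnt2/main_pool     # took about 15 minutes with the pool struggling
btrfs balance cancel /mnt2/main_pool    # went through cleanly once quotas were off
```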

At least I’ve got an extra 20TB now :joy:
