Snapshot limits and how to modify them

We have a Rockstor system in daily use by about 100 people with a total of 51 shares and around 20TB of data… We finally got everyone using this system a few months ago (migrated off an old Windows server).
We have configured daily snapshots of all the shares (as hourly snapshots were causing issues when being deleted), retaining 50 per share.
We are now continuing to have issues with smbd processes being in state ‘D’, which requires a reboot before some of the shares become available (and that takes over an hour!).
I believe this could all be because we have too many snapshots, and some btrfs processes iterate over all the snapshots even when they don’t need to [1].

In our experience, deleting many snapshots at once also causes a similar hung state; so I’d like to start decrementing the number of retained snapshots over time…

This would involve changing 50 schedules one at a time in the GUI each day for about 38 days to get the number down to 12… which would be extremely onerous and prone to human error.
I would like to create a cron job that reduces the number of retained snapshots by 1 each day for each of the shares, down to a minimum of 12.
Is there an API that I could utilise to modify the schedules, or shall I just hack into the postgres database?
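
For illustration, this is roughly the sort of cron script I have in mind, driving a REST API with curl and jq. The endpoint path and field names below are only my guesses at what Rockstor might expose (I haven’t found API docs yet), so please correct me if the real API looks different:

```
#!/bin/bash
# Sketch only: the endpoint path (/api/sm/tasks) and the field names
# (task_type, json_meta, max_count) are my guesses at the Rockstor API,
# not confirmed, and the token below is a placeholder.
TOKEN="..."                  # API access token for the Rockstor instance
HOST="https://localhost"
MIN=12                       # never retain fewer than 12 snapshots

# For every scheduled snapshot task, knock the retention count down by 1.
for id in $(curl -sk -H "Authorization: Bearer $TOKEN" "$HOST/api/sm/tasks" \
            | jq -r '.results[] | select(.task_type == "snapshot") | .id'); do
    current=$(curl -sk -H "Authorization: Bearer $TOKEN" "$HOST/api/sm/tasks/$id" \
              | jq -r '.json_meta.max_count')
    if [ "$current" -gt "$MIN" ]; then
        curl -sk -X PUT \
             -H "Authorization: Bearer $TOKEN" \
             -H "Content-Type: application/json" \
             -d "{\"max_count\": $((current - 1))}" \
             "$HOST/api/sm/tasks/$id"
    fi
done
```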

Many thanks for any advice!

[1] https://btrfs.wiki.kernel.org/index.php/Gotchas#Having_many_subvolumes_can_be_very_slow

@ScarabMonkey Hello again and sorry for the slow response.

Many have found that disabling quotas improves performance; this is now possible as of current stable channel updates, see:

https://github.com/rockstor/rockstor-core/issues/1592

(the issue contains pics of the closing pull request’s solution)

Maybe you could temporarily disable quotas, see if deleting multiple snapshots becomes a little more practical time-wise, and then make the required changes to the number of retained snapshots. You could always re-enable quotas later once the snapshot count / performance situation is more manageable.
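
If it helps, disabling and re-enabling quotas is just a pair of btrfs commands run against the pool’s mount point (Rockstor normally mounts pools under /mnt2/&lt;pool-name&gt;; substitute your own pool name):

```
btrfs quota disable /mnt2/your_pool   # turn off qgroup accounting
# ... prune snapshots / let deletions catch up ...
btrfs quota enable /mnt2/your_pool    # re-enable later if you still want quotas
```

If your stable channel version already has the Web-UI option from the issue above, use that instead so Rockstor stays aware of the change.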

I know this doesn’t answer your question directly, but it may be a workaround for the time being. Note that if you are running a stable channel release and have Rock-ons enabled (docker-ce), then quotas are in fact disabled (momentarily) and then re-enabled on every boot anyway. See outstanding issue:

Hope that helps but note that our quota support, enabled or disabled, is an ongoing effort; but then so is that of btrfs itself.

Let us know how you get on. The performance of large arrays with many snapshots is definitely, currently at least, an Achilles heel for btrfs and consequently Rockstor.

Apologies again for not replying sooner.

Remember to ensure via “yum info rockstor” that your installed version is actually from the stable channel.
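
Something along these lines should show it (the exact field names in yum’s output may vary a little between versions):

```
# For an installed package yum should report the repo it came from;
# I believe the field is "From repo", but check the full output if
# this grep comes up empty.
yum info rockstor | grep -iE 'name|version|repo'
```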

As a very late follow-up … yes indeed disabling quotas solved the issue!


Yeah, the quotas gave me a rough last couple of days.

I subscribed to updates and updated to the latest, which pulled in docker-ce. I had moved a couple of drives into my main pool a while back, and it was fairly unbalanced. I started a balance and promptly went to bed.

The next morning, the UI was unresponsive. I checked the balance; it was still on 0 of X chunks processed. I tried cancelling the balance, and that command hung. Tried rebooting; the system got part way through shutting down and hung… great. Waited for the disk activity light to stay quiet for a while and hard rebooted.

System restarted, hung trying to mount, with kernel timeout messages about btrfs_transaction showing up in syslog. Eventually it mounted about ten minutes later and immediately started to balance again, causing anything trying to access the pool to hang. Same with balance cancel.

Downloaded a Tumbleweed image and booted off it, then tried to access the pool. Mounting rw would time out, but ro would mount right away. The data was all there, so that was good.

After a day of googling to sort out the transaction timeouts, running various rescue operations, and getting nowhere, I grabbed a pair of 10TB drives and planned to copy the data over manually from the RO mount.

I restarted the system and missed the boot menu to boot off the USB stick. While I waited for the boot to complete, I just happened to come across a post talking about quotas, many snapshots (thanks docker), and large arrays having extreme performance issues.

Waited for the mount to complete, and the balance started back up. Disabled quotas on the pool, hoping the command would get through eventually. About 15 minutes later it completed, and I was able to cancel the balance with no issue.
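
For anyone who hits the same thing, the commands from that last step were roughly as follows (the mount point here is just an example, not my real pool name):

```
btrfs balance status /mnt2/main_pool    # confirm the balance is still running
btrfs quota disable /mnt2/main_pool     # took about 15 minutes with the pool struggling
btrfs balance cancel /mnt2/main_pool    # went through cleanly once quotas were off
```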

At least I’ve got an extra 20TB now :joy:
