Rockstor writes more than 1 GB/day to the root disk while just idling?

Dear Rockstor community,

I was experimenting with Rockstor in a VM, and what caught my attention were the ongoing writes to the root filesystem while the system was just idling.

I'm seeing this with iotop -a, sorted by data written (pressing the left-arrow key twice). There is a process "[btrfs-transacti]" that keeps going and writing out data all day, surpassing 1000 MB.

I had originally wanted to install to a USB stick (as seen with other NAS OSes), but have already found that you advise against regular USB keys, recommending a "fast" variant (2.0, 3.0?) or, better, an external HDD or SSD. So I was planning to get a small SSD with a USB adapter. However, I kept wondering whether the observed writes wouldn't wear out, say, an 8 GB SSD more than desirable, especially since it would be even more than 1 GB/day with real activity.

Can you confirm the large amount of idle write activity? Or will installs onto USB devices use a RAM overlay, or other measures, to avoid the writes to flash, and am I just seeing the writes because the root filesystem is on a virtual HDD? Wasn't SUSE in the news with read-only root volumes for appliances and transactional updates that reboot into an updated snapshot?

PS: To check for yourself: zypper install --no-recommends iotop ; iotop -a
(then press the left-arrow key twice, and wait a while)

I was guessing that the [btrfs-transacti] process is mostly updating metadata, but this older post even seems to indicate that appropriately small root disks run out of space quickly, requiring a balance.

Is this filling up still going on today?
And are no balance and scrub tasks created for new Rockstor filesystems ("pools") by default?

@s.ma First off, a belated welcome to the Rockstor community. I jumped in on some of your other posts before doing this customary welcome.

I’m afraid I’m a little short on time now but I can chip in on this one:

We go with a slightly modified ROOT snapper config. Take a look at this in our installer config:

Scrubs are as per the default in our new Leap 15.3 upstream.
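
If you want to see which snapshot limits end up on your own install, snapper can report them directly; a quick check with standard snapper commands (nothing Rockstor specific assumed):

    # Show the ROOT snapper config, including the NUMBER_* and TIMELINE_* limits
    snapper -c root get-config
    # List the snapshots currently held for that config
    snapper -c root list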

That process is just btrfs. You need to track down what's actually owning the writes, not the write process itself within btrfs. Sorry, having to dash now. Also note that in our v3 and earlier offerings we were CentOS based; now, in v4 and beyond, we are "Built on openSUSE". So all very different on the upstream front.

We do not, on purpose, have any data pool scrubs/balances by default, as that is user configurable and we can't know what the user wants. But we stick to openSUSE defaults on the root, as we are trying to follow upstream on our system as much as possible.

Always nice to have more eyes on these things. And yes, we have made some improvements. You might also want to check whether the rotational report of your device is reading as you expect. That can affect how we do stuff. A quick look at the following may lead you to some of what we do re rotation on a system device:

Also take a look at what it calls on the rotational property of the system device:

The latter, and how/when it is called, has not received much attention for a few years, so it would be good to have other eyes on it. See what you think.
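
For reference, the rotational flag ultimately comes from the kernel; you can check what your system device reports with something like the following (assuming /dev/sda is the system disk):

    # 1 = rotational (HDD), 0 = non-rotational (SSD); VMs sometimes report unexpectedly
    cat /sys/block/sda/queue/rotational
    # Or for all disks at once:
    lsblk -d -o NAME,ROTA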

Hope that helps, at least a little. Let us know if you track down the 'writer'. Also, do you have your Rockstor Web-UI open during these large write events? Things are quieter with the Web-UI closed. Just a thought. We also need to do more db optimisation, but again: all in good time. Plus that's quite the speciality.


Thank you very much for the good pointers; they allowed me to test this further.

I was able to retry, this time installing to a virtual SSD in the VM, and got to see the optimizations. I can now even list two additional optimizations to reduce the btrfs-transacti write-amplification load (applied roughly as sketched after the list below), which may be worth linking to this issue / the above-mentioned scripts.

  • nospace_cache mount option (seemed to cut the writes a lot)
  • further reducing the number of snapshots
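
For anyone wanting to try the same, this is roughly how I applied the two changes; the fstab line and the snapper limits below are only illustrative and will differ per install:

    # /etc/fstab: add nospace_cache to the root btrfs mount options, e.g.
    #   UUID=<root-pool-uuid>  /  btrfs  defaults,nospace_cache  0 0
    # (takes effect after a remount/reboot)

    # Keep fewer ROOT snapshots around (example limits, adjust to taste)
    snapper -c root set-config NUMBER_LIMIT=2-5 NUMBER_LIMIT_IMPORTANT=2-3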

Unfortunately, even with the additional improvements there are still too many writes for USB flash devices during idle times, even without having any Web-UI open. Kind of sad that I don't seem to be able to install this for my task at hand (a small NAS for friends and family), even when preventing the system logs from hitting the disk (with Storage=volatile in the journald config).
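
For completeness, this is the journald change I meant; a minimal drop-in (the file name and size limit are just what I used):

    # Keep the systemd journal in RAM only (logs are lost on reboot)
    mkdir -p /etc/systemd/journald.conf.d
    printf '[Journal]\nStorage=volatile\nRuntimeMaxUse=64M\n' \
        > /etc/systemd/journald.conf.d/volatile.conf
    systemctl restart systemd-journald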

I guess the solution for Rockstor could, at some point, be to enable SUSE's transactional-update feature that works with a read-only root, and to move the Rockstor Postgres database into RAM during boot and back to disk on shutdown, possibly using a tool like folder2ram.
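
In case it helps, the general shape of the folder2ram idea done by hand would be something like the sketch below. The paths, tmpfs size and service names are my assumptions (nothing Rockstor ships), so treat it as a rough outline only:

    # Stop the database and its consumers first (service names assumed)
    systemctl stop rockstor rockstor-pre postgresql

    # Keep a handle on the on-disk copy, then overlay a tmpfs and seed it
    mkdir -p /var/lib/pgsql.disk
    mount --bind /var/lib/pgsql /var/lib/pgsql.disk
    mount -t tmpfs -o size=512M tmpfs /var/lib/pgsql
    rsync -a /var/lib/pgsql.disk/ /var/lib/pgsql/

    systemctl start postgresql rockstor-pre rockstor

    # Before shutdown: sync the RAM copy back to the on-disk one
    # rsync -a --delete /var/lib/pgsql/ /var/lib/pgsql.disk/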

The most "real" (i.e. not metadata/transaction) data writes I saw came from the following processes (one way to attribute such writes is sketched after the list):

postgres: stats collector
postgres: checkpointer
postgres: wal writer
dhclient
network manager
postgres: logger
nginx: worker
python2 supervisord
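
(For reference, besides iotop -a, sysstat's pidstat can attribute disk writes per process over time; a quick check:)

    # From the sysstat package: per-process disk I/O, sampled every 60 seconds
    zypper install --no-recommends sysstat
    pidstat -d 60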

Regarding the missing data pool scrubs and balances by default: OK, though since they are generally recommended, and necessary to avoid certain undesirable states, I would have thought it may be good to have or leave some sensible defaults in place, to serve as a pointer and to have examples that get tested in the wild (OK, automated installs may want a switch to skip creating the examples). However, it doesn't even seem possible at the moment to create balance jobs under the automated tasks Web-UI. Would the correct way currently be to manually edit /etc/sysconfig/btrfsmaintenance?
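
For context, I assume that would mean setting variables roughly along these lines (the mountpoint and values below are just my guess, untested):

    # /etc/sysconfig/btrfsmaintenance -- variables of interest, illustrative values:
    #   BTRFS_BALANCE_PERIOD="monthly"
    #   BTRFS_BALANCE_MOUNTPOINTS="/mnt2/main_pool"
    #   BTRFS_BALANCE_DUSAGE="5 10 20 30 40 50"
    #   BTRFS_SCRUB_PERIOD="monthly"
    # Then have the btrfsmaintenance timers re-read the config:
    systemctl restart btrfsmaintenance-refresh.service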


Oh, if somebody happens to know a tool like folder2ram that's available on SUSE, I could try to apply that to the database and nginx.

The official “read-only” feature packages seem to be these:
https://software.opensuse.org/search?utf8=✓&baseproject=ALL&q=read-only

But they seem to depend on the transactional-updates role; at least, I could not understand how one could have them copy some files back to disk before shutdown.

@s.ma Hello again.

The move to a transactional server base would be non-trivial. Many things are different, and many folks are currently thrown by such systems. But in the long run I think it would be great to have this capability, and I was chuffed that our new upstream now offers it. But it's not for us for quite some time, I suspect. It involves a whole different approach to some things. We do a lot of OS interaction, and we would then have to 'wrap' every bit in whatever 'special' treatment is required for each and every interaction. Definitely worth doing, but I think our technical debt is a priority, along with such things as establishing complete compatibility with AppArmor, for instance, which for now we simply disable.

So in short, the jump to an openSUSE transactional server install would be great, but it is currently in the long-term plans, not the medium or short term. But your investigations re temporarily hosting the database in memory are super interesting. Do keep us updated on that front.

Incidentally, we also have an issue open on the proposal to move to another openSUSE transactional endeavour, that of MicroOS:
https://github.com/rockstor/rockstor-core/issues/2217
a proposal from Mr Richard Brown himself.

But note that we, in turn, are not Tumbleweed compatible (which MicroOS was based on, at the time at least) due to our Python 2 legacy. So again, we would first have to address that little number before even considering this. We did used to have a Tumbleweed version, but then Python 2 was, understandably, dropped from TW and we were left in the past again. I enjoyed seeing TW Rockstor up and running; we were essentially completely functional for around a year in that period. Alas, all in good time. See the following for our Python 2 -> 3 endeavour. It's actually the main focus of our soon-to-start testing channel releases:
https://github.com/rockstor/rockstor-core/issues/1877

And do keep investigating the db-in-RAM thing. Also note that we have only a very lightly configured Postgres setup. That is very likely an area that could well help with the writes we do. I.e. Postgres is massively configurable and could likely 'cache' a ton of the stuff we likely have it just write through to disk. A few years ago one of the main contributors added some tweaks and this sped things up a tad, but it would be good to have more input on that front; so if you know, or get to know, stuff that could help there then super, steam in and see if it pans out in reality. We use it in an almost default config, I think (from memory only, as little time to check currently). Take a look, as it may well be a quick workaround for your current aims. And if it works well we could include it in our own code by way of a contribution / pull request.
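
For anyone who wants to dig in, the sort of postgresql.conf settings that usually govern background write volume look like the below; purely illustrative and untested on Rockstor, so treat them as a starting point for investigation only:

    # postgresql.conf knobs that typically affect idle write volume (illustrative):
    #   synchronous_commit = off        # don't flush WAL on every commit (risks last txns on crash)
    #   checkpoint_timeout = 30min      # spread checkpoints further apart
    #   wal_writer_delay = 10000ms      # flush the WAL less often when idle
    #   stats_temp_directory = '/run/postgresql'   # keep stats collector churn on tmpfs
    # Reload Postgres after editing:
    systemctl reload postgresql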

Note that as we store the db in its default location, and openSUSE has rather expertly set nocow on the relevant /var directories, we aren't that badly affected by btrfs's copy-on-write nature and a database's nature being rather opposed.
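
You can check that attribute yourself (assuming the default /var/lib/pgsql location):

    # A capital 'C' in the attribute list means No_COW is set on the directory
    lsattr -d /var/lib/pgsql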

See our:

which is an upstream default also.

Hope that helps, at least for some more context and thanks again for all the interest here.


@s.ma I meant to confirm this:

Correct, we currently only have:

Fancy opening a GitHub feature request issue for scheduled balances? :slight_smile:

Surprisingly they have not been mentioned much at all as it goes.

Again, this is not likely to receive attention from any current core developers given our backlog, but it may catch someone's eye and end up being submitted in a way we can pop in. Funny how this hasn't been noticed much, actually.

Incidentally, I stand by a default of no scheduled tasks myself. But yes, better docs advising sane defaults, like a balance every 2 to 3 months or something. Also, I'd like the scheduled balance to be able to specify a partial balance, and to what degree at that. All good stuff, and all in good time hopefully.

Again thanks for the input here. All good.


I would like to point out a couple of things, especially about SSDs as boot disks.

As mentioned in another post, current MLC, TLC, QLC, etc. consumer SSDs have TBW ratings between 400 and 700. The SIZE of the SSD has to be taken into consideration as well for the TerabytesWritten value to make sense.

In the old days with SLC, we had fairly fast reads (250-325 MB/s) and slow writes (65-95 MB/s).

Modern SSDs use higher voltages, 3D structures and multiple bits per cell to achieve more symmetrical and faster throughput. Typical flash endurance in the old SLC days was 1000 writes per bit (cell) and they largely performed that well. (I still have many and none have failed yet!)

So, since we are talking about BYTEs, an SLC drive could live up to 1000 writes per cell, assuming well-devised wear leveling in the device. (Big ASSUME!)

So a typical old-days 40 GB SLC SSD could in theory live up to 40 TBW.

A new modern SSD can live up to 400 TBW for a typical 512 GB SSD, which per gigabyte is sadly LOWER than what we used to get; in fact, it works out to about 22% less TBW life. However, they live in style and are drag racers compared to the old 1955 Volkswagen Bug SLC drives.

I see BOTH my Rockstor setups write about 5.5 GB/day. So how many days will my 80 TBW 240 GB SSD last?

80e12 bytes (80 TBW) / 5.5e9 bytes per day (5.5 GB/day) is about 14545 days, or about 39 years.

WHAT?!? 39 years? Well, it's only a rough calculation, because the controller and/or other ICs used in construction could die at any time. Also, this doesn't take into account that many modern CHEAPER SSDs use a portion of the flash itself as a buffer; better SSDs use DRAM for buffering and housekeeping.
Also, when you fill up the SSD enough to interfere with that internal flash buffer, performance drops to pencil-and-paper speeds.

A good rule of thumb for longevity is to use an SSD that is at least twice the size of what the STATIC + dynamic data is expected to be. (I prefer a factor of 10 or better myself!)

Also, take that 39 year expected lifetime and subtract 25% for flash buffer use.

Also, assume inefficient wear leveling: subtract the static + dynamic data size from the SSD size, and multiply the resulting percentage by the new TBW.

Now I come up with (80 TBW * 0.75) * ((240 GB - 20 GB) / 240 GB) = 60 TBW * 0.92 = 55.2 TBW

Finally, redo the first calculation using the modified values.

55.2e12 bytes (55.2 TBW) / 5.5e9 bytes per day (5.5 GB/day) is about 10036 days, or about 27 years.
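
If you want to redo this with your own numbers, the arithmetic is just effective TBW (in bytes) divided by bytes written per day, e.g.:

    # days of life = effective TBW (bytes) / writes per day (bytes/day)
    awk 'BEGIN { tbw=55.2e12; perday=5.5e9; d=tbw/perday; printf "%.0f days (~%.1f years)\n", d, d/365 }'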

I don't expect my Rockstor boot SSDs to die in my remaining lifetime!

Check out this spreadsheet:

Notice how the life goes to hell when you cross 50% utilization? Guess what: I totally destroyed one SSD, and just about hit TBW (write performance seriously degraded) on two others, in 59 days by using them at 80% utilization while testing them as a RAID0 buffer!

It isn't magic; the numbers don't lie, and actual testing has borne out this formula I use.

Soooo, only a few GB a day on a lightly loaded, mostly static SSD? No worries!

:sunglasses:

PS: I've pretty much decided that unless you are rich and buy Intel Optane SSDs, using consumer SSDs as a cache or buffer of any kind is simply out of the question. Systems will try to use them at 100% as things stand, and they will NOT work that way!
