Bcache - developer notes

This document represents the current and pending state of bcache in Rockstor. It is not an endorsement of its use or support status but is simply intended to share the progress and development made to date. It is intended as a developer document, so any changes made should keep this text up to date with the current code state, including, where relevant, pending PRs.

If you wish to comment on bcache support in general, as opposed to the specifics of what is actually in play currently, please consider contributing to the following forum thread: SSD Read / Write Caching. Comments and edits here are welcome in the context of code or development contributions only, as this is intended as a current state of play document.

What is bcache

The in-kernel bcache system provides the facility to use a faster device, such as an SSD, to cache the reads and writes of one or more slower devices, such as HDDs. This is not currently a supported feature of Rockstor, and if implemented via the command line it can lead to some confusing and misleading reports from Rockstor’s user interface. However, some of these ‘confusions’ can be avoided by using the udev additions / configurations detailed in this post.

Thanks to @ghensley for his encouragement and assistance in this area.

The rules as presented are only trivially altered from those supplied by @ghensley during my (@phillxnet) ‘behind the scenes’ schooling.

Required udev rules

First the following file must be added:

/etc/udev/rules.d/99-bcache-by-id.rules:

with the following contents:

# Create by-id symlinks for bcache-backed devices based on the bcache cset UUID.
# Also, set a device serial number so Rockstor accepts it as legit.

DEVPATH=="/devices/virtual/block/bcache*", \
        IMPORT{program}="bcache_gen_id $devpath"

DEVPATH=="/devices/virtual/block/bcache*", ENV{ID_BCACHE_BDEV_PARTN}=="", \
        SYMLINK+="disk/by-id/bcache-$env{ID_BCACHE_BDEV_MODEL}-$env{ID_BCACHE_BDEV_SERIAL}", \
        ENV{ID_SERIAL}="bcache-$env{ID_BCACHE_BDEV_FS_UUID}"

DEVPATH=="/devices/virtual/block/bcache*", ENV{ID_BCACHE_BDEV_PARTN}!="", \
        SYMLINK+="disk/by-id/bcache-$env{ID_BCACHE_BDEV_MODEL}-$env{ID_BCACHE_BDEV_SERIAL}-part$env{ID_BCACHE_BDEV_PARTN}", \
        ENV{ID_SERIAL}="bcache-$env{ID_BCACHE_BDEV_FS_UUID}-p$env{ID_BCACHE_BDEV_PARTN}"

this file in turn depends upon a ‘helper’ script:

/usr/lib/udev/bcache_gen_id:

containing:

#!/bin/bash

# Allow the device path to be passed as the first argument when run manually.
[ -z "$DEVPATH" ] && DEVPATH="$1"

# Resolve the backing device's sysfs path via the virtual device's 'bcache' symlink.
BCACHE_DEVPATH=$(readlink -f "/sys/$DEVPATH/bcache")
[ -z "$BCACHE_DEVPATH" ] && exit 1

# Export the cache set UUID associated with this bcache device.
echo "ID_BCACHE_CSET_UUID=$(basename "$(readlink -f "$BCACHE_DEVPATH/cache")")"

# Re-export selected backing device udev properties under ID_BCACHE_BDEV_* names.
BDEV_PROPERTIES=$(udevadm info -q property "$(dirname "$BCACHE_DEVPATH")")
echo "$BDEV_PROPERTIES" | awk -F= -f <(cat <<-'EOF'
        /^ID_MODEL=/ {
                print "ID_BCACHE_BDEV_MODEL="$2
        }
        /^ID_SERIAL_SHORT=/ {
                print "ID_BCACHE_BDEV_SERIAL="$2
        }
        /^PARTN=/ {
                print "ID_BCACHE_BDEV_PARTN="$2
        }
        /^ID_FS_UUID=/ {
                print "ID_BCACHE_BDEV_FS_UUID="$2
        }
EOF
)

and to make this file executable we do:

chmod a+x /usr/lib/udev/bcache_gen_id
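
As a quick manual sanity check of the helper, once at least one bcache virtual device exists, it can be run by hand (the device path below is illustrative only):

/usr/lib/udev/bcache_gen_id /devices/virtual/block/bcache0

which, per the echo statements above, should print lines of the form ID_BCACHE_CSET_UUID=, ID_BCACHE_BDEV_MODEL=, ID_BCACHE_BDEV_SERIAL= and ID_BCACHE_BDEV_FS_UUID= followed by the relevant values.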

Why these rules

In the above, the virtual bcache device’s serial attribution is based on the associated backing device’s fs uuid, as this seemed to be a robust solution. The reasoning here is that if one were to move an image of a bcache backing device from one physical device to another, then my understanding is that this new physical device would take the place of its predecessor. So if we ascribe our virtual device serial from the uuid of the backing device, which is presumably derived from its early-sector make-bcache -B attributes, then we have a more robust scenario: i.e. the virtual bcache device (/dev/bcache0) will remain correctly associated. Obviously a re-naming would take place on the backing device, but Rockstor can handle device re-names as long as the serial remains the same; which, if we use the lsblk reported fstype and uuid of the backing device, it should.

This way we have more info available: the virtual by-id device name already contains the backing device’s serial, so we might as well get something more from the serial info, i.e. tracking tied to the software superblock of the backing device, while the backing device itself is tracked by its own serial.
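
As a minimal sketch of the above (device names here are made up, not from any real system), the backing device’s lsblk reported fstype / uuid and the resulting virtual device serial can be compared with:

lsblk -o NAME,FSTYPE,UUID /dev/sdc

udevadm info -q property /dev/bcache0 | grep '^ID_SERIAL='

where, with the above rules in place, the reported ID_SERIAL should be ‘bcache-’ followed by the uuid lsblk reports for the bcache formatted backing device.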

Building bcache

Although bcache’s kernel component is included in Rockstor’s elrepo ml kernel (but not in our CentOS installer’s kernel due to its age), there is a requirement for some userland tools. There are Fedora packages available, linked to from the maintained main page for bcache and reproduced here for convenience.
http://pkgs.fedoraproject.org/cgit/rpms/bcache-tools.git/

It is intended that an unofficial retargeted rpm including the above udev modifications be made available for those wishing to trial this unsupported element of Rockstor’s development. If you fancy rolling this rpm before I (@phillxnet) get around to having a go then do please update this wiki with your efforts. For the time being only a ‘built from source’ approach has been trialed:

cd
wget https://github.com/g2p/bcache-tools/archive/master.zip
yum install unzip
unzip master.zip
cd bcache-tools-master/

From the README, the components to be built / installed are:

  • make-bcache
  • bcache-super-show
  • udev rules


The first half of the rules do auto-assembly and add uuid symlinks
to cache and backing devices. If util-linux’s libblkid is
sufficiently recent (2.24) the rules will take advantage of
the fact that bcache has already been detected. Otherwise
they call a small probe-bcache program that imitates blkid.

yum list installed | grep libblkid
libblkid.x86_64 2.23.2-26.el7_2.2 @anaconda/3

bcache build prerequisites

yum install libblkid-devel

then

make
make install

backup existing initramfs

cp /boot/initramfs-$(uname -r).img /boot/initramfs-$(uname -r).img.bak

If running the same kernel version as the one we will boot into:

dracut -f

else we need to specify the kernel version and arch explicitly.
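
For example (the kernel version string here is only illustrative; substitute the version actually targeted):

dracut -f /boot/initramfs-4.12.4-1.el7.elrepo.x86_64.img 4.12.4-1.el7.elrepo.x86_64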

Finally we re-generate the grub config so that the initramfs changes are taken account of at boot (note: without -o, grub2-mkconfig only prints to stdout; the output path below assumes a BIOS install):

grub2-mkconfig -o /boot/grub2/grub.cfg

Once the kernel module is loaded (automatically upon a bcache device being found, or via modprobe bcache) there should be a:

/sys/fs/bcache/

directory.
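
For example, to load the module by hand and confirm the directory has appeared:

modprobe bcache

ls -d /sys/fs/bcache/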

Also note that changes / additions to the udev rules are often only put into effect after a:

udevadm trigger

Relevant proposed changes in Rockstor

The udev additions previously covered along with proposed changes to the disk management sub-system in pr:

should mean that Rockstor is able to recognise all bcache real and virtual devices. Only very limited testing in this area has been done; but in an arrangement with a single caching device serving 2 bcache backing devices, all associated devices were correctly presented within the user interface.

N.B. all devices were whole disk; no accommodation or testing was made for partitions in the above pr regarding bcache.

In the above pr the following command was used to set up a single device to cache 2 other devices in a KVM setup. The VM device serials were chosen to aid in identifying the intended role of each device.

make-bcache -C /dev/disk/by-id/ata-QEMU_HARDDISK_bcache-cdev -B /dev/disk/by-id/ata-QEMU_HARDDISK_bcache-bdev-1 /dev/disk/by-id/ata-QEMU_HARDDISK_bcache-bdev-2
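
With the above udev rules in place, a quick illustrative check (exact names will vary with the device model / serial strings) is that by-id symlinks now exist for the virtual devices:

ls -l /dev/disk/by-id/ | grep bcache

which should include, alongside the entries for the real backing and caching devices, links of the form bcache-<model>-<serial> pointing at ../../bcache0 and ../../bcache1.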

And resulted in the following disk / device identifications:

In the above image the padlocks indicate that the bcache virtual devices have been used as LUKS containers, simply because this was a more pressing focus for the disk management subsystem. Normally they would appear as regular ‘ready to use’ Rockstor devices.


Just a tiny mention to all the people who think that they really need bcache:
“you don’t”

Before you put yourself through the misery of configuring this stuff, you need to think about the usage pattern of your FS, how reads and writes are distributed … who reads / writes it … if it’s a guy over 1Gb eth - you’re having a laugh; if it’s a guy over 10Gb eth … you are still having a laugh, you just don’t know it yet.

@phillxnet brings this stuff here for people that really have heavy FS use ( read: DB and extreme file hosting ), so unless you have a need for bonded fibres out of your box and special high performance SAS controllers with 20+ disks, treat this as a fun project, not your production system solution !!!111oneone

btw, if you want to chip in - great !

Now that SSDs are economical and accessible, like the Sun Flash Accelerator F80 PCIe.
It reads at 1800MB/s and writes at 980MB/s; these speeds in a very high endurance enterprise grade device?
Isn’t it super sweet for just $80 on eBay?

  • I wonder if it can be used as a bcache. Has anyone worked with these PCIe 4x200GB cards in Rockstor before?

I have followed the steps indicated here in this article. I’m just missing the make-bcache -C... step, as I’m not sure how to construct the command, or whether I have to partition each of the 4x 200GB SSDs in a certain way, or whether I have to make a raid0 array before making the bcache?

This is how my disk setup looks.

[root@datrom bcache-tools-master]# lsblk -o NAME,MAJ:MIN,RM,SIZE,TYPE,FSTYPE,MOUNTPOINT,UUID,PARTUUID
NAME MAJ:MIN RM SIZE TYPE FSTYPE MOUNTPOINT UUID PARTUUID
sdf 8:80 0 186.3G disk
sdd 8:48 0 186.3G disk
sdb 8:16 0 223.6G disk
├─sdb2 8:18 0 15.7G part swap [SWAP] a317f5e0-9703-4648-aca7-cce614ce8dfe
├─sdb3 8:19 0 207.4G part btrfs /mnt2/rockstor_rockstor 3a717616-0582-4c85-8f59-4e40750e5532
└─sdb1 8:17 0 500M part ext4 /boot c534cf91-37d3-4b59-b262-c98548b08ac6
sde 8:64 0 186.3G disk
sdc 8:32 0 186.3G disk
sda 8:0 1 29.1T disk btrfs /mnt2/Datrom 360efa5f-a685-4404-b508-59753d752216

This is why I’m looking for a faster solution, or faster access to repeatedly used files. Please read on:

Side note:
This is actually a production system in my small video editing workflow. I’m using 40Gb Mellanox NICs to read from / write to this Rockstor server. On my editing workstations I’m using Premiere Pro 2019. A project involves about 100GB of proxy multicam video files (reduced in size from the original video files, which are sometimes 4k video at 100Mb/s bit rates each), but the original footage is usually about 420GB: multiple shots per camera from a total of 4 or 5 cameras, and the video files need to be reviewed back and forth to select the best shots and do cuts, editing, color correction and such.

Can someone point me in the right direction? I can easily replicate this server scenario on another system I have, in case someone wants to give me a hand and use it as a testing system other than my production one.

thanks for the guide

but after installing bcache on 4.12.4-1.el7.elrepo.x86_64

the system cannot boot

the screen shows:

then I switched to 4.10.6-1.el7.elrepo.x86_64 and booted the system normally

the bcached disks showed up and could be imported

would you please check these messages and tell me what the problem is and how I can fix it?

thanks very much !

@iecs Hello again.

That is a strange one. I suggest starting a new post with your findings, as this wiki thread was intended just for development notes; folks may be more willing to respond to a general thread than to this one. And, as indicated here, bcache support is currently fledgling as it depends on so many custom arrangements, such as the custom udev rules etc. detailed above.

Also note that development on our CentOS variant ceased as of 3.9.2-57. So any updates we may need to address issues with regard to bcache will only appear in our Rockstor 4 variant, which is now ‘Built on openSUSE’, see:

I’ve not personally tried our bcache support in the ‘Built on openSUSE’ effort either, so that would be particularly helpful for the Rockstor 4 endeavour. You may also find that it is no longer a requirement to install it, as it may be part of the base Leap 15.2 that we are basing our pending installer on, see:

Hope that helps.


I used a Raid-0 with 2 & 3 & 4 SSDs and burned them out in less than 60 days. I now use 5 (and maybe 6 later) 750 WD Black CMR 2.5" drives for the same purpose. (First setup purchased for this purpose about 6 years ago?)

In combining backups from 5 other setups of old, I moved so much data through the SSDs that they couldn’t take it. Optane would probably work but “Intel” loves their stuff and the prices are out of this world! (currently $570 for 960G SSD).

Now, I know everybody’s needs are different, so an SSD cache may be helpful for some, but keep in mind the original SLC SSDs could do 1000 writes per cell, and current SSDs using MLC, TLC, 3D NAND and such are generally well below that. The current commodity range is around 400 TBW to 700 TBW for a 512G SSD.
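
As a rough worked example of what such a rating means (the sustained 100 MB/s write load is an assumption purely for illustration):

# days to exhaust a 400 TBW endurance rating at a sustained 100 MB/s write load
echo $(( 400 * 10**6 / 100 / 86400 ))
# prints 46, i.e. roughly a month and a half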

Guess what? The SSD’s write speed degrades to almost half its specified rate as you get closer to the TBW limit, in a fairly linear way! This degradation seems to appear from about 1/3 of the endurance used up until it finally gives up the ghost.

(image: SSD comparison chart)

Prices are out of date, but you can get the general idea.

Soooo, my suggestion is: if your LAN is 1Gbps, then do a simple Raid-0 setup with 2 to n fast-enough drives to account for an overhead of about 30% on LARGE sequential file transfers. If, on the other hand, you have a ton of small to tiny files that don’t change often, then an SSD setup may work for you.

If you have a 2.5, 10, or 40Gbps LAN setup, it’s time to get professional and set up a 4 to 14 disk (or more!) Raid-10 array, or consider Raid-5 or 6. Commodity SSDs simply do not have the endurance for repeated writes.

I am working on a full report as I finalize the last details of my build and will submit it for y’all to reconnoiter and criticize later.

For now, I think an SSD setup for an OFTEN USED CACHE only makes sense if “writes” to the NAS are few and far between, and then ONLY as a WRITE cache. Using one as an often-changing READ cache will kill them soon enough.

:sunglasses:
