So Ceph is kind of a mixed bag, to be honest.
To understand Ceph you have to understand that Ceph is not a file system, it’s storage … object storage to be exact … for simplicity you can think of it as something like LVM. Within Ceph you can create a pool or pools, and this is how actual data (objects) is stored in Ceph - in pools. A pool can be either replicated with X copies, or erasure-coded (roughly raid5/6) with however many parity chunks you want. If it’s replicated then it’s like btrfs raid1, Ceph just makes sure that the copies are evenly distributed … so you get a sort of quasi raid10 setup. Sorry for being boring but this is sorta important.
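For illustration, creating the two kinds of pools looks roughly like this (pool names, PG counts and the k/m values are made-up examples, not anything from my setup):

# replicated pool with 3 copies
ceph osd pool create mypool 128 128 replicated
ceph osd pool set mypool size 3
# erasure-coded pool with 4 data chunks + 2 parity chunks
ceph osd erasure-code-profile set ec42 k=4 m=2
ceph osd pool create ecpool 128 128 erasure ec42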
Now, from here there are basically two ways to use it (three, but the third one is for really crazy setups) - example commands for both are sketched right after the list:
- you can create an image, which can be mounted as a virtual disk in many ways: in a virtual machine, as a remote filesystem, mounted as an image. It is a more or less fake block device (commonly this setup is referred to as RBD)
- you can create an actual FS (called CephFS) - it requires two pools, one for storing data and a second for storing metadata.
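Very roughly, and with placeholder names, the two setups look like this (the exact flags can differ between Ceph versions):

# RBD: create a 10 GiB image and map it as a local block device
rbd create mypool/myimage --size 10240
rbd map mypool/myimage            # appears as e.g. /dev/rbd0
mkfs.xfs /dev/rbd0 && mount /dev/rbd0 /mnt/rbd

# CephFS: one data pool, one metadata pool, then the filesystem on top
ceph osd pool create cephfs_data 128
ceph osd pool create cephfs_metadata 32
ceph fs new cephfs cephfs_metadata cephfs_data
mount -t ceph mon1:6789:/ /cephfs -o name=admin,secretfile=/etc/ceph/admin.secret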
More technicalities:
- in the RBD setup your block device is chopped into 4 MB objects, and every modification to the FS = read object, modify, write back. Most software on Linux/Unix reads the whole file and writes it out as new if modified - so this results in a lot of reads and writes … at least if you perform the operation on lots of files the FS buffer will aggregate writes into larger chunks.
- in CephFS a single file smaller than 4 MB is a single object; if larger than 4 MB it consists of multiple objects (you can see this for yourself with the commands right after this list).
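These two commands (real, though the exact output format depends on the version) show the 4 MB granularity - rbd info prints the object size of an image, and CephFS exposes the file layout as a virtual xattr:

rbd info mypool/myimage            # look for something like "order 22 (4 MiB objects)"
getfattr -n ceph.file.layout /cephfs/somefile
# ceph.file.layout="stripe_unit=4194304 stripe_count=1 object_size=4194304 pool=cephfs_data"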
The single storage unit in Ceph is an OSD … it’s a daemon with (usually) a single disk for storage. The old setup was a bit crazy: you had to have 2 partitions, 1 small one where Ceph kept its journal, and a second large one where the data was ultimately stored. The FS of choice was XFS due to data integrity, stability and no size limitations (and some stuff with extended attributes that eliminated ext4). A smart person can see the problem with this setup:
Data enters the OSD -> write to the XFS journal of the journal partition -> commit the journal to disk on the journal partition -> trim the journal of the journal partition -> write the data to the journal of the data partition -> commit the journal to the data partition -> trim the journal of the data partition -> write in the journal of the journal partition that the buffered data on the journal partition should be deleted -> commit the journal of the journal partition to disk -> trim the journal of the journal partition.
This equals the data causing 6 writes to disk, and on spinning rust we know what that means = seek the disk to death.
From personal experience, moving 800K small files is unbearably slow. Storing 5K small files is not a problem, because they just sink into the journal of the journal disk and the problem is not as visible.
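(For the record, that old filestore layout is what you typically got from ceph-disk with a data device plus an optional separate journal device - the device names here are just placeholders:)

ceph-disk prepare --fs-type xfs /dev/sdb /dev/sdc
ceph-disk activate /dev/sdb1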
This situation changed 2 months ago. Ceph introduced “BlueStore”, which is a database-like FS that can guarantee atomicity of operations and is fast to search. It has a more or less btrfs-style way of putting data on disk without a journal, while still not making the store explode on power loss or a partial write, and it does not have the problems of a typical FS (like when you want to locate an object with a specific ID/filename among 10K folders). What this results in is a setup where the OSD uses a disk with one partition for storing the objects and one partition for the database - the database is just the point of reference (what sits where), the other partition just stores the objects.
With BlueStore it looks more like:
OSD receives data -> write the objects to the data partition -> commit the transaction in the database -> done.
And since it’s a database you don’t need silly write barriers etc.; the OSD can decide how many objects it wants to write at once, it does not have to write 1 file at a time to make sure everything is atomic and sequential.
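Creating a BlueStore OSD is nowadays roughly a one-liner; depending on the Ceph version it’s one of these (the device is of course just an example):

ceph-disk prepare --bluestore /dev/sdb
# or, with the newer tooling:
ceph-volume lvm create --bluestore --data /dev/sdb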
I’ve changed all my Ceph instances to BlueStore and got a performance increase of a level that I don’t really care about it anymore. So far I’ve also got peace of mind that the data does not slowly cook itself.
There is something I wanted to mention in terms of how people perceive data “safety”. XFS and btrfs will accept data very quickly, but XFS buffers it in a very long queue before it hits the disk, and btrfs puts data on disk as quickly as possible BUT it only checkpoints a new FS root every 30 seconds, so if you stored data and lost power 2 seconds later there is a high chance you will not be able to find it. Both filesystems (and a myriad of others) will confirm your write as complete without being 100% certain that the data is safely on disk.
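(If you actually want confirmation that the data is on the disk you have to ask for it explicitly - for example with dd, conv=fsync forces an fsync before dd reports completion:)

dd if=/dev/zero of=testfile bs=10M count=1000 conv=fsync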
The main slowness of Ceph is that it will only return from an application’s write call when the data is confirmed to be stored on all required mirrors. To some this may be pedantic, but it is the default behaviour and it can be overridden when a person understands the risks (which to me is actually how an everyday operating system works - but the Ceph guys are paranoid about data integrity). On top of that there is a setting for the RBD block device which does the same thing - just not for the application that writes something but for the operating system that is using this block device: the I/O call will not return until the whole storage cluster has returned OK. You can override that too and let Ceph just buffer the data before writing. Which again shows how paranoid they are about data … something the btrfs people seem to have completely forgotten.
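The RBD-side knob is, as far as I understand it, the client-side RBD cache; something along these lines in the [client] section of ceph.conf switches it towards write-back buffering (the option names are real but defaults and behaviour vary between versions - use at your own risk):

[client]
rbd cache = true
rbd cache writethrough until flush = true    # stay write-through until the guest issues its first flush
rbd cache max dirty = 25165824               # max bytes of dirty data held in the cache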
So to sum up:
- with the old backing store (filestore) and all safety locks:
writes will be VERY slow, and it can start choking when you throw at it more than the total size of the journals (the journal is usually 2% of the disk if my memory serves me right);
reads will be 5-50% of total disk throughput depending on how random the reads are (of course hyper-random reads will seek the storage to death)
- with the new backing store (BlueStore) and all safety locks:
writes will be about half of the read speed;
reads will be 30-80% of total disk throughput depending on how random the reads are (again, hyper-random reads will seek the storage to death)
- with the new backing store and the paranoid safety disabled:
writes will match reads if not exceed them due to local caching - this depends on randomness, size etc.;
reads will be the same as with the safety locks.
Another positive feature of Ceph worth mentioning is that it always scrubs in the background - for bitrot, for metadata inconsistency etc. If you access the storage it will pause, get you your data, then resume the scrub; something that with btrfs was always a pain.
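A few real commands for watching and steering scrubs, if you want to poke at it (the PG id is just an example):

ceph -s                       # cluster status, shows PGs currently scrubbing / deep-scrubbing
ceph pg deep-scrub 1.2f       # manually deep-scrub one placement group
ceph osd set noscrub          # temporarily disable light scrubs cluster-wide
ceph osd set nodeep-scrub     # temporarily disable deep scrubs
ceph osd unset noscrub        # re-enable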
For illustration, on my rig, still with the safety locks on and fully on BlueStore:
4 spinners on a SAS backplane (max 6 Gb/s I think)
read:
dd outputs:
1475+1 records in
1475+1 records out
15468645003 bytes (15 GB, 14 GiB) copied, 106.485 s, 145 MB/s
write:
root@proxmox1:/cephfs# dd if=/dev/zero of=/cephfs/zero.file bs=10M count=1000
1000+0 records in
1000+0 records out
10485760000 bytes (10 GB, 9.8 GiB) copied, 115.868 s, 90.5 MB/s
root@proxmox1:/cephfs# rm zero.file
root@proxmox1:/cephfs# dd if=/dev/zero of=/cephfs/zero.file bs=10M count=1000
1000+0 records in
1000+0 records out
10485760000 bytes (10 GB, 9.8 GiB) copied, 112.69 s, 93.0 MB/s