So Ceph is kind of a mixed bag, to be honest.
To understand Ceph you have to understand that Ceph is not a file system, it’s storage … object storage to be exact … for simplicity you can think of it as something like LVM. Within Ceph you can create a pool or pools, and this is how actual data (objects) is stored in Ceph - in pools. A pool can be either replicated with X copies, or erasure-coded (roughly raid5/6) with however many parity chunks you want. If it’s replicated then it’s like btrfs raid1, Ceph just makes sure that the copies are evenly distributed … so you get a sort of quasi raid10 setup. Sorry for being boring but this is sorta important.
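For illustration, creating the two kinds of pools looks roughly like this (pool names, PG counts and the k/m values are made-up examples, not anything from my setup):

# replicated pool with 3 copies
ceph osd pool create mypool 128 128 replicated
ceph osd pool set mypool size 3
# erasure-coded pool with 4 data chunks + 2 parity chunks
ceph osd erasure-code-profile set ec42 k=4 m=2
ceph osd pool create ecpool 128 128 erasure ec42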
Now, from here there are basically two ways to use it (three, but the third one is for really crazy setups) - example commands for both are sketched right after the list:
- you can create an image, which can be mounted as a virtual disk in many ways: in a virtual machine, as a remote filesystem, mounted as an image. It is a more or less fake block device (commonly this setup is referred to as RBD)
- you can create an actual FS (called CephFS) - it requires two pools, one for storing data and a second for storing metadata.
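Very roughly, and with placeholder names, the two setups look like this (the exact flags can differ between Ceph versions):

# RBD: create a 10 GiB image and map it as a local block device
rbd create mypool/myimage --size 10240
rbd map mypool/myimage            # appears as e.g. /dev/rbd0
mkfs.xfs /dev/rbd0 && mount /dev/rbd0 /mnt/rbd

# CephFS: one data pool, one metadata pool, then the filesystem on top
ceph osd pool create cephfs_data 128
ceph osd pool create cephfs_metadata 32
ceph fs new cephfs cephfs_metadata cephfs_data
mount -t ceph mon1:6789:/ /cephfs -o name=admin,secretfile=/etc/ceph/admin.secret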
More technicalities:
- in the RBD setup your block device is chopped into 4 MB objects, and every modification to the FS = read object, modify, write back. Most software on Linux/Unix reads the whole file and writes it out as new if modified - so this results in a lot of reads and writes … at least if you perform the operation on lots of files the FS buffer will aggregate writes into larger chunks.
- in CephFS a single file smaller than 4 MB is a single object; if larger than 4 MB it consists of multiple objects (you can see this for yourself with the commands right after this list).
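These two commands (real, though the exact output format depends on the version) show the 4 MB granularity - rbd info prints the object size of an image, and CephFS exposes the file layout as a virtual xattr:

rbd info mypool/myimage            # look for something like "order 22 (4 MiB objects)"
getfattr -n ceph.file.layout /cephfs/somefile
# ceph.file.layout="stripe_unit=4194304 stripe_count=1 object_size=4194304 pool=cephfs_data"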
The single storage unit in Ceph is an OSD … it’s a daemon with (usually) a single disk for storage. The old setup was a bit crazy: you had to have 2 partitions, 1 small one where Ceph kept its journal, and a second large one where the data was ultimately stored. The FS of choice was XFS due to data integrity, stability and no size limitations (and some stuff with extended attributes that eliminated ext4). A smart person can see the problem with this setup:
Data enters the OSD -> write to the XFS journal of the journal partition -> commit the journal to disk on the journal partition -> trim the journal of the journal partition -> write the data to the journal of the data partition -> commit the journal to the data partition -> trim the journal of the data partition -> write in the journal of the journal partition that the buffered data on the journal partition should be deleted -> commit the journal of the journal partition to disk -> trim the journal of the journal partition.
This equals the data causing 6 writes to disk, and on spinning rust we know what that means = seek the disk to death.
From personal experience, moving 800K small files is unbearably slow. Storing 5K small files is not a problem, because they just sink into the journal of the journal disk and the problem is not as visible.
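(For the record, that old filestore layout is what you typically got from ceph-disk with a data device plus an optional separate journal device - the device names here are just placeholders:)

ceph-disk prepare --fs-type xfs /dev/sdb /dev/sdc
ceph-disk activate /dev/sdb1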
This situation changed 2 months ago. Ceph introduced “BlueStore”, which is a database-like FS that can guarantee atomicity of operations and is fast to search. It has a more or less btrfs-style way of putting data on disk without a journal, while still not making the store explode on power loss or a partial write, and it does not have the problems of a typical FS (like when you want to locate an object with a specific ID/filename among 10K folders). What this results in is a setup where the OSD uses a disk with one partition for storing the objects and one partition for the database - the database is just the point of reference (what sits where), the other partition just stores the objects.
With BlueStore it looks more like:
OSD receives data -> write the objects to the data partition -> commit the transaction in the database -> done.
And since it’s a database you don’t need silly write barriers etc.; the OSD can decide how many objects it wants to write at once, it does not have to write 1 file at a time to make sure everything is atomic and sequential.
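Creating a BlueStore OSD is nowadays roughly a one-liner; depending on the Ceph version it’s one of these (the device is of course just an example):

ceph-disk prepare --bluestore /dev/sdb
# or, with the newer tooling:
ceph-volume lvm create --bluestore --data /dev/sdb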
I’ve changed all my Ceph instances to BlueStore and got a performance increase of a level that I don’t really care about it anymore. So far I’ve also got peace of mind that the data does not slowly cook itself.
There is something I wanted to mention in terms of how people perceive data “safety”. XFS and btrfs will accept data very quickly, but XFS buffers it in a very long queue before it hits the disk, and btrfs puts data on disk as quickly as possible BUT it only checkpoints a new FS root every 30 seconds, so if you stored data and lost power 2 seconds later there is a high chance you will not be able to find it. Both filesystems (and a myriad of others) will confirm your write as complete without being 100% certain that the data is safely on disk.
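(If you actually want confirmation that the data is on the disk you have to ask for it explicitly - for example with dd, conv=fsync forces an fsync before dd reports completion:)

dd if=/dev/zero of=testfile bs=10M count=1000 conv=fsync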
The main slowness of Ceph is that it will only return from an application’s write call when the data is confirmed to be stored on all required mirrors. To some this may be pedantic, but it is the default behaviour and it can be overridden when a person understands the risks (which to me is actually how an everyday operating system works - but the Ceph guys are paranoid about data integrity). On top of that there is a setting for the RBD block device which does the same thing - just not for the application that writes something but for the operating system that is using this block device: the I/O call will not return until the whole storage cluster has returned OK. You can override that too and let Ceph just buffer the data before writing. Which again shows how paranoid they are about data … something the btrfs people seem to have completely forgotten.
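The RBD-side knob is, as far as I understand it, the client-side RBD cache; something along these lines in the [client] section of ceph.conf switches it towards write-back buffering (the option names are real but defaults and behaviour vary between versions - use at your own risk):

[client]
rbd cache = true
rbd cache writethrough until flush = true    # stay write-through until the guest issues its first flush
rbd cache max dirty = 25165824               # max bytes of dirty data held in the cache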
So to sum up:
- with the old backing store (filestore) and all safety locks:
writes will be VERY slow, and it can start choking when you throw at it more than the total size of the journals (the journal is usually 2% of the disk if my memory serves me right);
reads will be 5-50% of total disk throughput depending on how random the reads are (of course hyper-random reads will seek the storage to death)
- with the new backing store (BlueStore) and all safety locks:
writes will be about half of the read speed;
reads will be 30-80% of total disk throughput depending on how random the reads are (again, hyper-random reads will seek the storage to death)
- with the new backing store and the paranoid safety disabled:
writes will match reads if not exceed them due to local caching - this depends on randomness, size etc.;
reads will be the same as with the safety locks.
Another positive feature of Ceph worth mentioning is that it always scrubs in the background - for bitrot, for metadata inconsistency etc. If you access the storage it will pause, get you your data, then resume the scrub; something that with btrfs was always a pain.
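A few real commands for watching and steering scrubs, if you want to poke at it (the PG id is just an example):

ceph -s                       # cluster status, shows PGs currently scrubbing / deep-scrubbing
ceph pg deep-scrub 1.2f       # manually deep-scrub one placement group
ceph osd set noscrub          # temporarily disable light scrubs cluster-wide
ceph osd set nodeep-scrub     # temporarily disable deep scrubs
ceph osd unset noscrub        # re-enable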
For illustration, on my rig, still with the safety locks on and fully on BlueStore:
4 spinners on a SAS backplane (max 6 Gb/s I think)
read:
dd outputs:
1475+1 records in
1475+1 records out
15468645003 bytes (15 GB, 14 GiB) copied, 106.485 s, 145 MB/s
write:
root@proxmox1:/cephfs# dd if=/dev/zero of=/cephfs/zero.file bs=10M count=1000
1000+0 records in
1000+0 records out
10485760000 bytes (10 GB, 9.8 GiB) copied, 115.868 s, 90.5 MB/s
root@proxmox1:/cephfs# rm zero.file
root@proxmox1:/cephfs# dd if=/dev/zero of=/cephfs/zero.file bs=10M count=1000
1000+0 records in
1000+0 records out
10485760000 bytes (10 GB, 9.8 GiB) copied, 112.69 s, 93.0 MB/s