The Butter goes on top (Safe BTRFS raid5/6)

Ignore all of this. I have been informed that none of it works. If you need safe software raid 5/6, use a NAS solution that has ZFS, such as FreeNAS.


So a lot of people who come to this site consistently ask for the same base features. While Rockstor meets most of these with flying colors, there has been one it has not been able to deliver: raid 5 and raid 6 with BTRFS. Two things should be pointed out to anyone new to this site: A) BTRFS is not maintained by the Rockstor devs directly, and B) kernel development is hard. So until BTRFS has stable raid 5/6 support, I humbly submit this tutorial for safe raid 5/6 in Rockstor (hint: mdadm raid with BTRFS on top).


Since BTRFS cannot yet be trusted to manage the RAID, we will be using mdadm, the Linux software RAID management tool. Start by opening a shell. Newer Rockstor releases have Shell In A Box built in; for the rest of us, just SSH into your Rockstor install via your usual method. Once you have a root shell, run the following command to make sure mdadm is installed:

> yum install mdadm

Next, find the names of the drives you want to RAID together. I personally open the Rockstor web-UI and click the create pool option under Storage > Pools, which shows a list of drives that can be added to a pool. Write these drive names down; you will need them later. Then go back to your shell and run some variation of the following command:

> mdadm --create --verbose /dev/md0 --level=5 --raid-devices=3 /dev/sdb /dev/sdc /dev/sdd

This command creates a software RAID. As written it is a RAID 5 (you can change the level to whatever you want) using 3 devices, and the devices are listed in /dev/sdX form; this is where the drive names you got from the UI earlier go. While the array is being built you can watch its progress from the terminal with:

> watch cat /proc/mdstat

This prints the status of the array assembly every 2 seconds; CTRL-C to quit. Once the build finishes, open the web-UI again and go to Storage > Disks. The drives you just RAIDed together will now carry a notation saying "disk is a mdadm member"; this is good. You should also see a new drive whose name starts with "md". This is your RAID disk. If you now go to Storage > Pools > Create New Pool, you should see this new RAID device, and you can put a "Single" BTRFS pool on top of it. Congratulations, you now have stable, safe RAID 5/6 under BTRFS.

**IMPORTANT CAVEAT** To all those who are ready to shoot down this configuration, I would like to make some final points:

1. This will not run the same way a native BTRFS RAID 5/6 would. You will lose a small amount of read/write performance running it like this. If you can't accept that, run native BTRFS RAID 10/0/1 or buy better hardware. [Link to performance benchmarks](https://phoronix.com/scan.php?page=article&item=btrfs_raid_mdadm&num=1) (not done by me, but it shows very good performance numbers for this style of setup).
2. Reconfiguring mdadm RAID is different from reconfiguring BTRFS RAID. There is no GUI for it in Rockstor and you will have to touch a bit of command line. For more information on managing mdadm arrays, consult [kernel.org](https://raid.wiki.kernel.org/index.php/RAID_setup), [Centos.org](https://www.centos.org/docs/5/html/Installation_Guide-en-US/s1-s390info-raid.html), or this [mdadm cheat sheet](http://www.ducea.com/2009/03/08/mdadm-cheat-sheet/). All of those links have excellent documentation on mdadm RAID management.
3. When BTRFS RAID 5/6 does become stable and you want to make the switch, it will be a bit of a hassle. Unfortunately kernel coding is hard and slow, and stable code takes time to release, so I don't see that happening any time soon. No doubt BTRFS RAID will one day be significantly faster and more capable than this mdadm+BTRFS setup, but until then, this is the best RAID 5/6 option I know of.
4. Even though mdadm is taking care of the RAID array, you still have all of the scrubbing features that are normally available to a single-drive BTRFS filesystem; you do not lose any of that functionality with this setup. Additionally, a lot of the filesystem size and used-space values that don't work with BTRFS RAID 5/6 do work with mdadm+BTRFS.

So let me know what you think. Is this a good way to do RAID, or do I need to go back to lurking?
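PS - for completeness, here is the whole thing as a single shell session, including making the array re-assemble by name after a reboot. The device names are just the examples from above and the mdadm.conf path is the CentOS one, so adjust both to your own box; treat it as a sketch, not a script to paste blindly.

```bash
# Example device names only (/dev/sdb ... /dev/sdd) -- substitute the drives
# you noted down from Storage > Pools in the web-UI.
yum install -y mdadm

# Build the 3-disk RAID5 array described above.
mdadm --create --verbose /dev/md0 --level=5 --raid-devices=3 \
    /dev/sdb /dev/sdc /dev/sdd

# Watch the initial sync (the array is usable, if slower, while it runs).
cat /proc/mdstat

# Persist the array definition so it comes back as /dev/md0 on every boot.
mdadm --detail --scan >> /etc/mdadm.conf

# From here the Rockstor web-UI creates the "Single" BTRFS pool on /dev/md0.
# Outside the UI the equivalent would be roughly:
#   mkfs.btrfs -L mdpool -d single /dev/md0
```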
3 Likes

What about metadata and bitrot, etc.? Would you be able to recover with this config? It seems to me like we would just be using Rockstor on top of HW raid, only using the FS. If we do any checksums, what would happen? Will we have overhead?

Good post.

I'm currently in the process of buying new drives for my Netgear NAS, big enough that they can hold all my data from my Rockstor NAS.
As it is now, I don't have a complete backup of all my data (the most important data is backed up), and since I have a RAID6 setup, I am in a position where a defective drive could potentially make me lose all my data.

So I will migrate my data to my Readynas, and then consider what to do with my Rockstor setup. Your solution could be the way to go, as this would protect against defective drives.

As I really like Rockstor and the people supporting it, I would like to continue using it.

But the BTRFS developers seem a bit slow and unwilling to fix errors that are important to me.
I know it's because the features I request are infrequently used in commercial situations, but I still find it odd that when a supported feature of a filesystem is found to have a major bug, the developers' reaction is a shrug and a note about it on the support pages. Then nothing seems to be done to fix it, leaving early adopters hanging on thin ice. Such a major bug as the one in RAID5/6 should have had the BTRFS developers scrambling to fix it, IMHO.
This is not directed towards the Rockstor developers, but towards the BTRFS developers.

You are right, this is very similar to using HW raid. It won't be quite as fast, but it is more flexible than HW raid. As for metadata and bitrot: the metadata will be stored as if it were on a single-drive BTRFS volume, so all of BTRFS's error correction features will still work. As for recovery, you will have to do drive failure recovery with mdadm, not BTRFS. However, this will recover from drive loss in the same way that any other raid config would.
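Roughly, a failed-drive swap at the mdadm level looks like the sketch below. The device names are hypothetical (the array from the tutorial with /dev/sdc dead and /dev/sde as its replacement); check the mdadm docs linked in the first post before doing this on a live array.

```bash
# See which member dropped out of the array.
mdadm --detail /dev/md0
cat /proc/mdstat

# Mark the dead disk as failed (if mdadm hasn't already) and remove it.
mdadm --manage /dev/md0 --fail /dev/sdc
mdadm --manage /dev/md0 --remove /dev/sdc

# Add the replacement disk; parity is rebuilt onto it in the background.
mdadm --manage /dev/md0 --add /dev/sde
watch cat /proc/mdstat

# The BTRFS pool on top stays mounted the whole time -- it only ever sees
# /dev/md0, never the individual member disks.
```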

Sorry mate, but your idea is a bit … bad (I don't want to insult you with bad language, but this idea is not far from deserving it :/).

So, you want to replace btrfs raid 5&6, which has issues like the write hole, with another solution that has exactly the same problems and then some.

  • dmraid has a write hole as well, and when it tries to “fix” stripes that went bad due to an unclean power-down, your btrfs will get shafted BIG TIME.
  • dmraid is a block device; it is not fully aware of bad blocks, only when the drive specifically tells it - a lot of drives will silently replace a sector with one from the spare pool, and dmraid will still think everything is OK until you specifically tell it to rebuild the raid … putting BTRFS on top of it makes things even worse: btrfs can and will discover a bad sector and try to fix the problem, but since btrfs does not have any backup copy it will give you an IO error and forbid access to the data (see the scrub sketch below this list for what that looks like). What makes it even worse, it will not let dmraid know that the sector is bad, so dmraid will never have a reason to attempt to fix the stripe. On top of that, if a bad block sits in a stripe and the hdd is not reporting it properly, then every time dmraid recreates the parity it will recreate it with the bad block's data in it; if a disk then fails, you will rebuild your data from corrupt parity! GOOD LUCK!
  • btrfs does NOT checksum parity, and neither does dmraid … so if there is a faulty sector under the parity data, dmraid will rebuild from corrupt parity :slight_smile: also dmraid does not have a scrub; you have to rebuild to find faulty parity, which means physically rewriting all the data.
  • since btrfs is COW, it will write sectors in a non-stripe-aligned fashion, so dmraid gets hit with the MASSIVE penalty of rewriting a whole stripe on every sector write, which degrades performance; and if some funky guy came along and implemented your solution on an SSD, it would kill his SSD with writes (a lot of them in the same place). So you fully understand: on dmraid you can have stripes of 512 kB while a single sector is 512 bytes - thus amplifying every write up to 1024 times … dmraid is a block device, it has no concept of delayed writes and buffering like write-back etc …
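If you want to see the "detected but not fixable" part for yourself, it only takes a scrub; a rough sketch (the mount point is made up, yours will differ):

```bash
# Hypothetical mount point of the single-profile pool sitting on /dev/md0.
POOL=/mnt2/mdpool

# Run a scrub in the foreground (-B = do not background).
btrfs scrub start -B "$POOL"

# With a "single" data profile a checksum mismatch ends up counted as an
# uncorrectable error, because there is no second copy to repair from
# (DUP metadata can usually still be fixed).
btrfs scrub status "$POOL"

# The affected files are named in the kernel log, not by scrub itself.
dmesg | grep -iE 'csum|checksum'
```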

I can carry on about the pitfalls of your idea … but honestly, storage is cheap, go for raid10 … if you really need raid 5/6 - zfs is your answer.

The reason we want and need BTRFS to fully implement raid 5&6 is that we will finally have a fully GPL, fully functioning raid implementation on cheap consumer equipment (sub-1k servers). The stuff that ZFS gives (and btrfs is a step behind on) can otherwise only be achieved with very expensive RAID controllers (10k+ servers), which connect only to SAS drives and are not designed with little people in mind (cost, temperature, maintenance, noise).

1 Like
  1. Why does BTRFS need to create a checksum for parity? Shouldn't it treat the mdadm array like a regular block device and do all of its scrubbing without worrying about how the bits actually land on the disk?

  2. If the parity is miscalculated by mdadm, shouldn't BTRFS see the bad data during the next scrub and try to correct it?

  3. Where can I read up on this? I am afraid I don't understand what you mean. [quote=“Tomasz_Kusmierz, post:5, topic:2089”]
    btrfs can and will discover a bad sector and try to fix a problem, but since btrfs does not has any backup it will give you IO error and forbid access to the data
    [/quote]

  4. SSDs… Yeah, I didn’t think of using this on SSDs but I can see how that would be very inadvisable.

  5. I realize that many people say that “storage is cheap” and I can see what they mean. However, as a broke college student I have a very limited budget, and I almost always have to use limited or donated hardware to learn on, so just sticking with RAID 10 really isn't practical in my case.

  6. I agree that it would be nice to have stable native raid 5/6 support in BTRFS and not have all the RAM and legal issues that ZFS has; however, after lurking on several of the BTRFS developers' message boards, they don't seem too worried about making stable raid 5/6 a priority.

  7. As for the mdadm stripe size issue, could this be fixed by adjusting the stripe sizing? I have seen several write-ups on how to improve performance by changing stripe sizes and read-ahead values (these knobs are sketched after this list).

  8. It seems like a lot of the write hole problems stem from unexpected shutdown. If you were to connect your Rockstor instance to a battery backup, would this config be able to protect your data against drive failure, provided the power is most likely not going to randomly go out?

  9. How would the use of the mdadm bitmap feature affect these problems? It seems like it could smooth out some of the bad sector problems.

  10. You mention dmraid in your comment; however, I used mdadm for this project. From what I have read these are significantly different tools. Does that change anything in terms of this config?

  11. I have checked several sources on using btrfs on top of mdadm raids, and all of them only point out the additional overhead (which is completely expected). Where did you find the info on the striping vs block device problems?
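For anyone else reading along, these are the knobs I mean in question 7 - a rough sketch with placeholder values rather than recommendations:

```bash
# Chunk (stripe unit) size is fixed when the array is created, e.g.:
#   mdadm --create /dev/md0 --level=5 --raid-devices=3 --chunk=512 /dev/sd[bcd]
# and can be read back afterwards:
mdadm --detail /dev/md0 | grep -i chunk

# RAID5/6 write speed is often limited by the stripe cache;
# the value is in 4 KiB pages per member device.
cat /sys/block/md0/md/stripe_cache_size
echo 4096 > /sys/block/md0/md/stripe_cache_size

# Read-ahead on the md device (units of 512-byte sectors).
blockdev --getra /dev/md0
blockdev --setra 8192 /dev/md0
```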

Sorry for the million questions. I just ditched my BTRFS raid 5 setup for this and I am trying to make it work. Also if you have links to more information I would be grateful.

1 Like
  1. So:
    a) it needs to create a checksum for parity … how else would you know that the parity is corrupt?
    b) scrubbing a btrfs with no raid level that sits on top of dmraid will result in absolutely nothing; btrfs is not aware of any extra copy it could recreate the data from.

  2. Dude, parity created by dmraid is not visible to btrfs … the whole point of having dmraid is that it creates a virtual block device with the raid hidden behind it; no filesystem is aware of what is available in the background, only IO and sysctl calls that amount to nothing.

  3. You should really go and do your homework:
    btrfs-corrupt-block
    Unmount your btrfs file system, corrupt a block on a file system that has RAID 1 and on one that has no raid (single), and see how btrfs behaves (a loop-device sketch of exactly this experiment is after this list).
    Again, in your situation the disk will silently replace your bad sector with a good sector from the spare pool (sector size = 512 bytes or 4k, default btrfs sector size = 16k, stripe size in dmraid = 4k up to a few megabytes for insane people) and you end up with a btrfs sector that has part of it replaced with 0xffffffff and a CRC that does not match. btrfs does not have any backup, because you set up btrfs with no raid on top of a dmraid that is performing the raid behind the scenes - btrfs will give you an IO error and block access to the WHOLE file, not just part of it. dmraid has no knowledge that something went wrong BECAUSE FILESYSTEMS DON'T COMMUNICATE FINDING A BROKEN SECTOR TO UNDERLYING BLOCK DEVICES! There is no mechanism for that, there never was! Nobody in ext4 or FAT32 ever found a block that seemed corrupt and said “hey HDD, you got a bad sector”; storage behaviour logic is more or less 40-year-old logic.
    (Ok, I'm calm now, sorry for the angry part … I'm too tired now to reword it to be less brusque)

  4. Yup

  5. Beg, borrow, steal. Honestly, I know your pain; I haven't always had money for a spare disk either. BUT, rather than buying new hard drives, try to do it smart - all the big enterprises ditch their disks after 2 years (to limit maintenance cost as the failure rate increases) - some time ago I even came across a listing for over 1000x 2TB WD SAS disks at 27 GBP a pop - that's less than half the price of a new WD.
    So, since we are on the subject of money: I think most folks assume that if you go for btrfs, you moderately care about data integrity (because you care about a system with data CRCs). Most will tell you not to touch btrfs without ECC ram. Yes it sucks, but most have learnt the hard way. If you don't care that much about data integrity, why btrfs at all? dmraid_raid5 + a random FS and you're golden.
    After a decade in software engineering I can give you just one piece of advice - don't kick in an open door.

  6. You can try to strip zfs of the Sun parts to make it less ram hungry, but then it's a bit buggy, the legalities are hazy, and the driver in linux is nonprintable_opinion_here. But one thing is for sure: the btrfs guys are veeeerrryyyyy concerned about getting raid 5&6 working; it's just that more serious messups were recently uncovered in places that were supposed to be stable, so the foundation is being ironed out first.

  7. Your stripe will always be biggest_sector_size * the number of disks you have. So if you go for 2TB disks, you most likely have 4kB sectors … so every time a sector is updated, the corresponding sectors on all the other disks have to be read (to create the parity), then your data + parity get written down.

  8. Yeah, about that … yes, it's feasible … the only thing people don't want to admit is that “unclean shutdown” is an umbrella term for: you losing power, the sata controller getting overloaded, resetting itself and silently losing some data (aaaahhhhhh, good old MARVELL chips :confused:), the driver in the kernel dying on you, a kernel oops, a kernel panic, the kernel OOM killer, an application in system mode exploding on you (docker comes to mind?)

  9. Bitmaps are a last-hope attempt to fix a problem that is caused by a lacking design … some folk cling to stuff too much :confused:
    Also, bitmaps are not without their own problems (just a quote from https://raid.wiki.kernel.org/index.php/Write-intent_bitmap):
    "In one configuration I have, this takes about 16 hours on a 400Gb drive. When I do 5 of them simultaneously this takes 2+ days to complete"
    Maybe zfs then? :wink:

  10. Aaahhh yes, sorry for bringing confusion into the equation:
    a) mdadm - just a management tool; it can be used to configure a multitude of different raids (even hardware ones), so mdadm does not mean kernel raid
    b) dmraid - an actual software (fake hardware) raid done in the BIOS (which silently pushes all the hard work back to the kernel), driven through Device Mapper (usually buggy as hell, since BIOS & UEFI are done by firmware engineers … and the opinion of firmware is usually bad)
    c) kernel raid (or RAID … wtf?!), which is by definition multipath raid (WTF?!) … same stuff as dmraid, but less broken because it is implemented fully in the kernel and reviewed by a lot of smart people
    Honestly, I know that calling linux raid “dmraid” might be an insult to some, but kernel raid was there to help with some use cases, and people used it to replace multi-thousand-dollar equipment and got surprised when the performance sucked.

  11. Experience, logic, the btrfs mailing list, knowing how stuff works …
    Also, I know which stuff you are referencing; those specific people run those setups in enterprise environments. The best example is a guy who ran btrfs RAID 10 on top of kernel raid, where each “disk” that btrfs saw was a mirrored RAID1 of two or more disks. That way, when a read is performed, the kernel sequences reads from 3 different disks (giving 3x throughput) up to the btrfs that sits on top. He was using it for a very high bandwidth setup with 4 xeons and a ton of ram pumping into teamed fibre network interfaces …
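Since you asked for more to read: the point 3 homework is easiest to try on loop devices so you don't risk real disks. A rough sketch, using plain dd instead of btrfs-corrupt-block (whose options vary between btrfs-progs versions); every path, size and offset here is made up, and if scrub reports zero errors the corruption simply landed in unallocated space, so pick another offset:

```bash
mkdir -p /tmp/braid1 /tmp/bsingle
truncate -s 2G /tmp/r1-a.img /tmp/r1-b.img /tmp/single.img
A=$(losetup --find --show /tmp/r1-a.img)
B=$(losetup --find --show /tmp/r1-b.img)
S=$(losetup --find --show /tmp/single.img)

# One pool with a second copy of the data, one without.
mkfs.btrfs -f -d raid1 -m raid1 "$A" "$B"
mkfs.btrfs -f -d single "$S"
mount "$A" /tmp/braid1
mount "$S" /tmp/bsingle

# Identical data on both, flushed to the backing files at umount.
dd if=/dev/urandom of=/tmp/braid1/blob bs=1M count=512
cp /tmp/braid1/blob /tmp/bsingle/blob
umount /tmp/braid1 /tmp/bsingle

# Scribble over a region well past the superblocks on one member of each pool.
dd if=/dev/zero of="$A" bs=1M seek=700 count=4 conv=notrunc
dd if=/dev/zero of="$S" bs=1M seek=700 count=4 conv=notrunc

mount "$A" /tmp/braid1
mount "$S" /tmp/bsingle
btrfs scrub start -B /tmp/braid1   # errors found AND repaired from the mirror
btrfs scrub start -B /tmp/bsingle  # errors found but uncorrectable -> IO errors on read

# Clean up when done:
#   umount /tmp/braid1 /tmp/bsingle && losetup -d "$A" "$B" "$S"
```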


Let's start again, in the spirit of not kicking in an open door: you said that you are a broke student and you want to use raid 5 - hence I assume you have 3 disks at your disposal.
From your status, and the amount of data you want to store, I assume your data is not worth that much (well, most of the gigabytes of it); it's movies or porn :wink: Fair enough, I prefer an online database for that, but to each their own.
So raid 5 on 3 disks will give you 66% capacity with one-disk-failure redundancy.
How about placing all your important data on 2 disks joined into RAID 1, and everything not so important on a no-raid filesystem? Hell, for that matter you can build a btrfs filesystem with 3 disks and have different RAID levels for different folders!!! This is where btrfs really shines: you can use dirt cheap stuff of different capacities and achieve wonders with it!

Now, if all your data is absolutely important to you, you have 3 disks of EXACTLY the same capacity, nothing other than raid 5 will do, and you have 4GB of ECC ram … zfs and don't look back! (at least I would)

On the other hand, if you are willing to go to a job and earn some cash, here are some examples of how to buy stuff smart:
I got 2 of these when they were 25 quid a pop:
http://www.ebay.co.uk/itm/182182189607?_trksid=p2055119.m1438.l2649&ssPageName=STRK%3AMEBIDX%3AIT

You can join those in series, a SAS card can be as little as 20 quid, and you just need a PCIe socket in whatever system you have - and BAM, for less than 50 quid you can start popping in cheap ex-corporate disks (again, those were cheaper before):
http://www.ebay.co.uk/itm/201628321792?_trksid=p2055119.m1438.l2649&ssPageName=STRK%3AMEBIDX%3AIT

If you need it to be a separate NAS, pick up an HP ProLiant (G6 and up) for 40 quid and seriously, you will blow anything out of the water with that!!!

1 Like

Thanks for the reply. All the info really helps. Sorry again for the million questions.

1 Like

No problem mate,

when I started with btrfs I butchered 4 file systems and lost lots of data … all fun and games … I did learn the hard way why ECC matters … why a decent PSU matters …

Hey, watch Avi Miller's video on the basics of btrfs,

Then watch Dave Chinner rip btrfs a new one:

The fun part is that those skeletons are still in the corner :confused:

zfs is not without issues itself; if you ever run into a problem and need to “resilver”, your system will grind to a halt :confused: The freenas guys always blame it on everything else, but even they admit that an ionice like btrfs has would be a life saver!

Also, sorry for jumping on you; your idea was very bad, but at least you made some effort and showed some initiative.

And seriously, consider per-folder raid levels - it might just solve your problem! Also, at 3 disks, btrfs raid1 vs raid5 will give you an 11% capacity difference. On ZFS you can't have a 3-disk raid1. On ZFS you can't change from raid 5 to raid 6 … you need to take your data somewhere else and rebuild the whole pool. Always consider that your time sacrificed in the future = time you could spend earning money for more disks OR more time you could spend on the beach relaxing!
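For reference, the "change raid level without rebuilding the pool" bit is just a balance with convert filters; a minimal sketch, assuming a pool mounted at /mnt2/pool (made-up path) with enough devices for the target profile:

```bash
# Convert data and metadata to RAID1 across the pool's devices, online,
# while the filesystem stays mounted and in use.
btrfs balance start -dconvert=raid1 -mconvert=raid1 /mnt2/pool

# Later, after adding another disk (hypothetical /dev/sde), the same
# mechanism converts onward to raid5 -- the stability caveats discussed
# throughout this thread still apply.
btrfs device add /dev/sde /mnt2/pool
btrfs balance start -dconvert=raid5 -mconvert=raid1 /mnt2/pool

# Check progress and the current profiles.
btrfs balance status /mnt2/pool
btrfs filesystem df /mnt2/pool
```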

1 Like

Hi

Thank you for the interesting thread.

I have tried Rockstor for 8-9 months now and I like the distro.
But it is let down by its underlying filesystem.

As stated in the thread, BTRFS is focusing on enterprise features (like ZFS), and I would argue some features will maybe never come -> like Raid 5/6.
I was looking for a software raid solution/distro to take over from the hardware Raid setup that I use today (12TB Raid 5).
I tried different solutions like OMV, FreeNAS, NAS4Free and Rockstor.
All of them have nice features and some have a large community that supports them well.
But none of them has the features that make it easily expandable with new disks, and Raid 1/10 in the meantime is not really an option -> Raid 5/6 on BTRFS may be 2 months away or 5 years.

So this makes me wonder where Rockstor wants to be in the future.
As I understand it, it was created to provide a Linux NAS distro that uses all the nice features of BTRFS.
But with BTRFS failing to provide home user features, I see Rockstor failing to become that.

What are the developers' thoughts on this?
I have read all the Raid 5/6 threads on the Rockstor forum, but there is really no answer as to where Rockstor wants to end up.

Regards
Sebastian

I have only made a few small contributions so far, so I'm not much of “the developer” :wink:. However, full filesystem RAID support is an awesome feature that, like everyone, I would love to have available immediately. Unfortunately, though, we can't add features btrfs doesn't have other than by contributing kernel code, which is certainly far above my programming skill. All that can be done otherwise is keeping the kernel and btrfs-progs up to date and hoping for things to be fixed, making the future of RAID5/6 in particular depend on that project.

As a home user though, I am fine with the current capabilities. My storage is still relatively small (~2 TB, no RAID) and contains mostly unimportant data, such as a media library. Backups of important data are also stored on it, but I also keep duplicates elsewhere obviously.

1 Like

Guys, I know your frustration; I would like to spend less on storage myself and use raid 6 … but it's a bit unfair to say that btrfs concentrates on enterprise features … the bugs being squashed now are things like “dies while unmounting on trim” - something anyone with a laptop with an ssd can experience. Or silent fs corruption with a kernel memory leak. Yes, some developers are paid money to develop qgroups (or stuff like that), so be it. It will benefit us all in the future, but there is a major hunt on for the bugs that make this fs occasionally eat your data.

1 Like

UPS is supported in Rockstor and @phillxnet is our man on that feature. Having a battery backup, even for a short time, to properly shut down the server is helpful.

But that still wouldn't solve any of the other problems that @Tomasz_Kusmierz mentioned in point number 8, would it?

Really good question, something I think about frequently myself. We've been patiently following btrfs development and sometimes get a bit frustrated that it's taking longer than we'd like. In those times ZoL seems like a good thing to add. But we've refrained from doing so because 1) there are many nice, overlooked features of btrfs that are not available in zfs and, furthermore, may not be possible there. So we can't provide feature parity, and before we realized it, Rockstor would become clunky trying to support both filesystems. And 2) there's more than enough work in our queue that complements btrfs features in some cases and is completely independent of the fs in others. As we improve Rockstor, raid 56 shall stabilize, hopefully. Having said all this, it won't be too hard to add ZoL support if we need to.

1 Like

It will surely help !

An additional thing I forgot to ask about: when btrfs hits an error in the filesystem it tends to raise a kernel error, which usually leads to an oops - and that results in a module reload … your shiny raid 5 & 6 mid-way write goes to hell. It's fun to read the justifications on the btrfs mailing list for why one would oops on corrupt data :slight_smile:

I've actually migrated a company server to Rockstor (it was just a file sharing server + owncloud). Right now I'm trying to build a backup server based on Rockstor and a CCTV system based on Rockstor :slight_smile: But I'm an early adopter with not much budget on hand.

2 Likes

While some discussion is going on in these forums, as to whether we should find other ways around the RAID 56 bug, it does seem that right now some active development is going into fixing these bugs.

On the BTRFS mailing list some posts have been made where they are trying to fix it. At first in btrfs-progs, and when these fixes are known to work, it will be fixed in the kernel.

I have no doubt that this is difficult and delicate work, to get done right, so we will have to be patient.

But at least it seems there is some progress now, and I’m hoping to see these fixes in Rockstor soon :slight_smile:

@KarstenV is right, here you are:

https://mail-archive.com/linux-btrfs@vger.kernel.org/msg58002.html
https://mail-archive.com/linux-btrfs@vger.kernel.org/msg57955.html

And something moving with in-band deduplication too :sunny: :
https://mail-archive.com/linux-btrfs@vger.kernel.org/msg57980.html

2 Likes

Hmm looks promising and it would be nice to get raid5/6 support back.

Hmmm, promising news!

Is it already possible to test this?