PATCH: Fix the so-called famous RAID5/6 scrub error

Looks like Qu has submitted a patch for the RAID 5/6 scrub error bug that has given BTRFS a lot of its more recent unreliability claims… http://www.spinics.net/lists/linux-btrfs/msg60595.html

It’s great news for us, and hopefully we will get some confirmation of reliability improvements with RAID 5/6… Interesting note as well: in the latest Synology DSM 6.1 BETA they highlight:

File self-healing

Btrfs file system is able to auto-detect corrupted files with mirrored metadata, and recover broken data using the supported RAID volumes, which include RAID 5, 6, 10, 1, and SHR.

I’ve chatted with Qu a few times about various minor bugs I’ve come across. He’s super committed and a major contributor to the development of BTRFS!

This is very good news.

I’m really hoping this is the fix that is needed, so that RAID56 can be deemed production ready.

I guess this will make it into the next kernel release?

I’m certain a lot of people will be testing this to see if it’s solid.

I for one would return to RAID6 if I was convinced it is stable.

Finally! Great news for btrfs.
http://www.phoronix.com/scan.php?page=news_item&px=Btrfs-RAID5-RAID6-Fixed

WOW … so that fixes like 5% of the problems with raid5&6? I’m amazed …

Let me know when they will fix:

  • zeroed fs on non-clean mount
  • inability to remove a broken device
  • write hole
  • slow, yet persistent memory leak being hit by btrfs in the kernel that even got Linus working on it

… and so on …
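
For anyone unfamiliar with the “write hole” item in that list, here is a toy sketch of the general idea. It models a single parity stripe with plain XOR and is only an illustration; the block contents and the `xor_blocks` helper are made up for the example and have nothing to do with the actual btrfs code.

```python
def xor_blocks(a: bytes, b: bytes) -> bytes:
    """XOR two equal-length blocks byte by byte (single-parity, RAID5-style)."""
    return bytes(x ^ y for x, y in zip(a, b))

# A consistent stripe: two data blocks and their parity.
d0, d1 = b"AAAA", b"BBBB"
parity = xor_blocks(d0, d1)

# d0 is overwritten, but power is lost before the matching parity
# update reaches disk, so the parity is now stale.
d0 = b"CCCC"

# Later the disk holding d1 dies and d1 is rebuilt from d0 and the
# stale parity. The result is NOT the original d1, even though d1
# itself was never written to.
rebuilt_d1 = xor_blocks(d0, parity)
print(rebuilt_d1 == d1)
```

The point of the sketch is that a block which was never touched can still come back wrong after a crash followed by a disk failure, because the parity no longer matches the data it is supposed to protect.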

Although I understand your sentiment, I don’t understand the negativity.

Yes there are many problems to resolve, but clearly work is being done.

I always thought this error, where scrub (which is basically a check / repair tool) actually destroys the data instead of repairing it, was one of the worst, and the one most likely to hit normal users (as it’s advised to run a scrub regularly). Having that fixed is a major thing, and would probably also be instrumental in getting e.g. removal/replacement of a broken device working.

I think there is still some way to go before RAID56 is proven reliable. But let’s be happy that progress is being made, and that there is seemingly serious work going into fixing it.
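
To make it concrete why a misbehaving scrub is so dangerous, here is a minimal toy model of one parity stripe. It is a sketch under my own assumptions, not the btrfs implementation: CRC32 stands in for the data checksums, and `xor_blocks` and `scrub_repair` are hypothetical helpers. It shows the intended repair path, and what is lost if the parity ends up regenerated from data that is still corrupt.

```python
import zlib
from functools import reduce

def xor_blocks(*blocks: bytes) -> bytes:
    """XOR equal-length blocks byte by byte (single-parity, RAID5-style)."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

# One stripe: two data blocks plus parity, with a checksum per data block.
d0, d1 = b"AAAA", b"BBBB"
parity = xor_blocks(d0, d1)
checksums = [zlib.crc32(d0), zlib.crc32(d1)]  # stand-in for real data checksums

# Silent corruption hits the first data block on disk.
on_disk = [b"AXAA", d1]

def scrub_repair(blocks, parity, checksums):
    """Rebuild any block whose checksum fails from parity and the other blocks.
    Toy model: assumes at most one bad block per stripe."""
    fixed = list(blocks)
    for i, block in enumerate(blocks):
        if zlib.crc32(block) != checksums[i]:
            others = [b for j, b in enumerate(blocks) if j != i]
            fixed[i] = xor_blocks(parity, *others)
    return fixed

# Intended behaviour: the bad block is rebuilt from good parity.
repaired = scrub_repair(on_disk, parity, checksums)

# Failure mode: parity gets regenerated from the corrupt data first,
# so the "repair" just reproduces the corruption.
bad_parity = xor_blocks(*on_disk)
lost = scrub_repair(on_disk, bad_parity, checksums)

print(repaired == [d0, d1], lost == [d0, d1])
```

Once the parity has been overwritten to match the corrupt block, the stripe looks self-consistent and the only good copy of the data is gone, which is why a scrub that damages parity is worse than no scrub at all.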

Again, sorry for that. In the other article I stomped on Yoshi quite hard … literally slamming the door on him.

Anyway, I’ll just paste the same statement here:

“Anyway, in my defence, I’m simply pissed off about the current situation in
the community: somebody comes in and proclaims the “raid5 fix” (the
article actually states that “Btrfs RAID5/RAID6 Support Finally Get
Fixed”), and then there is a myriad of people coming back saying “I beg
you, please, please save my non-backed-up data from raid 5”, and you want
to punch in the face the douche that wrote “yeah, it’s fixed”. And the
guy that wrote that article is doing it for advertisement; there are 5
advertisement spots around this misleading article.”

And the negativity has been brewing in me for a long time, and it’s rearing its ugly head more and more, because people come out with crazy ideas like spanning btrfs over a raid5 soft raid, proclaiming again “yeah, it fixes stuff”. This results in more chances that some poor user who does not know the intricacies of data storage will get suckered into losing his/her data, and the whole blame ends up on btrfs / rockstor.

So the whole anger comes from having to listen to people who drive other people into a brick wall and, as usual, never help them out; it’s you with a dustpan standing next to the brick wall picking up the remains.

And this does extend outside of btrfs; it’s a problem of the whole open source community, and the engineering community as well. The pattern is very simple:

  1. douche writes an article for advertisement or some other unknown reason that is generally misleading, if not outright false
  2. ten, twenty other people get their pants wet over “wow it changes everything”
  3. users jump on the wagon and, hey, surprise surprise, they get left butt-hurt.

And again, read the article title
"Btrfs RAID5/RAID6 Support Finally Get Fixed"
there is no way a user will take away from it that it’s not 100% safe to use now.

Just FYI, here is what the btrfs mailing list thinks about this article:

My concern isn’t priority. Easier bugs often get fixed first. That’s
just the way Linux development works.

I am very concerned by articles like this:

    http://phoronix.com/scan.php?page=news_item&px=Btrfs-RAID5-RAID6-Fixed

with headlines like “btrfs RAID5/RAID6 support is finally fixed” when
that’s very much not the case. Only one bug has been removed for the
key use case that makes RAID5 interesting, and it’s just the first of
many that still remain in the path of a user trying to recover from a
normal disk failure.

Admittedly this is Michael’s (Phoronix’s) problem more than Qu’s, but
it’s important to always be clear and complete when stating bug status
because people quote statements out of context. When the article quoted
the text

    "it's not a timed bomb buried deeply into the RAID5/6 code,
    but a race condition in scrub recovery code"

the commenters on Phoronix are clearly interpreting this to mean “famous
RAID5/6 scrub error” had been fixed and the issue reported by Goffredo
was the time bomb issue. It’s more accurate to say something like

    "Goffredo's issue is not the time bomb buried deeply in the
    RAID5/6 code, but a separate issue caused by a race condition
    in scrub recovery code"

Reading the Phoronix article, one might imagine RAID5 is now working
as well as RAID1 on btrfs. To be clear, it’s not–although the gap
is now significantly narrower.

And finally, a word from the person who created the patch that started this whole shit storm:

I never think RAID5/6 is stable and will never use it(in fact, the whole btrfs, in any of my working boxes).
If you think I’m expressing thing like whatever you think, that’s just your misunderstanding.

Do you have an exact link to this mailing list thread? I want to inform Michael about that. Maybe he can state it more clearly in the article that this fix doesn’t fix the whole situation.

I don’t have a link because the mailing list ends up in my mailbox and I simply read through it :confused: … yes, it keeps me busy while on the toilet :stuck_out_tongue: but I’ll look for it for you.

Edit:
here it is
http://www.spinics.net/lists/linux-btrfs/msg60684.html

I’ve forwarded that to Michael at Phoronix. He posted a status update, so everyone should be able to understand the situation better now. :slight_smile:

Edit: Typo fixed

Can he change the title from “raid 5/6 fixed” so people with ADHD (aka facebookers) will not jump right to conclusions?