Defective disk, any recommendations regarding replacement?

KarstenV · March 27, 2017, 6:37am

One of my disks has decided to die on me.

It's actually the newest of all my disks and still under warranty, so it'll probably be replaced with a new one.

But until I get this sorted out, I have ordered a new disk to replace it.

The symptoms are something like this.

Rockstor started sending me mails about pending writes to sdf.

I started investigating.

sdf is recognized by the BIOS and by Rockstor.
“btrfs fi sh” shows that it's still registered as part of the pool. The pool is not accessible, probably due to the reboot I made to see if the disk was detected by the BIOS.
The disk is definitely defective: it clicks about once a second, and every few minutes it spins down and up again.
dmesg shows a lot of read/write and other errors.
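For anyone landing here with similar symptoms, the checks described above can be sketched roughly as follows. This is a minimal sketch, not a procedure from the thread: `inspect_disk` is a hypothetical helper name, the device name is whatever your system reports, and `smartctl` assumes smartmontools is installed.

```shell
# Hypothetical helper: collect evidence about a suspect disk (run as root).
# The device name passed in is a placeholder; use the one your system reports.
inspect_disk() {
    dev="$1"                               # short kernel name, e.g. sdf
    btrfs filesystem show                  # is the device still listed as a pool member?
    dmesg | grep -i "$dev" | tail -n 20    # last 20 kernel messages mentioning the disk
    smartctl -a "/dev/$dev"                # SMART report (assumes smartmontools)
}
```

Calling `inspect_disk sdf` would print the three reports one after another.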

In my time with Rockstor I have installed and replaced a lot of disks, but never one that has gone defective.

The documentation seems a bit sketchy about this scenario.

Do I just run a btrfs replace (-r), or do I have to mount the volume in degraded mode, before replacing?

Anybody with any insights?

Tomasz_Kusmierz · March 27, 2017, 11:05am

If you can mount the pool without degraded mode, do it (btrfs will rarely let you, but hey). Then, IF you can perform a “replace”, go for it. If the missing drive no longer shows up as a device (because you unmounted after the drive failed), you can try --replace-missing (or something like that).
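On the "or something like that": as far as I know there is no `--replace-missing` option. What `btrfs replace start` does accept is a numeric devid in place of the source device path, which is how a disk that is no longer present can be addressed. A minimal sketch, where `replace_missing` is a hypothetical helper and every argument is a placeholder rather than a value from this thread:

```shell
# Sketch: replace a device that is no longer present, addressing it by devid.
# All arguments are placeholders, not values taken from this thread.
replace_missing() {
    devid="$1"      # numeric id of the missing device
    new_dev="$2"    # replacement disk, e.g. /dev/sdg
    mnt="$3"        # mount point of the (degraded) pool
    btrfs replace start "$devid" "$new_dev" "$mnt"
}
```

`btrfs filesystem show` lists the devids that are still present, so the missing one should be the gap in that list.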

phillxnet · March 27, 2017, 1:10pm

@KarstenV I’ll just chip in on this also:

Given those errors and the obviously poorly device, I would second your (-r) variant of the replace, as per the notes in:

https://btrfs.wiki.kernel.org/index.php/Using_Btrfs_with_Multiple_Devices

Also remember that you only get one rw degraded mount attempt (currently), so make sure you have everything in place prior to attempting the repair, be it a replace or a delete (if space and minimum drive count permit). Thereafter your options reduce to data retrieval as opposed to pool repair.

KarstenV · March 27, 2017, 1:43pm

Thanks for the answers. Others with any input are also welcome.

The system is shut down until the new disk arrives. When it arrives I will attach it to the computer, and try to recover.

I find the documentation in this regard rather lacking. Some pages give one advice, others another.

I’ll try to follow the howto in Rockstor's documentation, in the “Data loss Prevention and Recovery in RAID1 Pools” section (as my pool is RAID1).

I find the documentation a bit confusing in points 4-7. It instructs you to mount the pool degraded, but also tells you it's OK to power down the PC and change the disk. In my understanding, this would leave the FS impossible to mount RW again?
So repairs would not be possible?

I will try and do it in one go.

Attach new drive, boot, mount degraded, start replacement (with -r as the disk is unreadable).
And then just hope for the best.
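The "one go" plan above could look roughly like this on the command line. It is a sketch under assumptions: `replace_failed_disk` is a hypothetical helper, all device names and the mount point are placeholders, and on Rockstor the pool may already have its own mount point.

```shell
# Sketch of the "one go" plan; run as root. All names are placeholders.
replace_failed_disk() {
    good_dev="$1"   # any healthy member of the pool, e.g. /dev/sdb
    src="$2"        # failing device path, or its devid if the disk has been removed
    new_dev="$3"    # the replacement disk, e.g. /dev/sdg
    mnt="$4"        # where to mount the pool

    # Mount the pool even though one member is failing or missing.
    mount -o degraded "$good_dev" "$mnt" || return 1
    # -r: read from the source device only if no other good mirror exists.
    btrfs replace start -r "$src" "$new_dev" "$mnt" || return 1
    # -1: print the status once instead of polling until the replace finishes.
    btrfs replace status -1 "$mnt"
}
```

Doing the mount and the replace in the same session avoids spending the single rw degraded mount mentioned earlier in the thread on anything other than the repair itself.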

Tomasz_Kusmierz · March 27, 2017, 11:10pm

This is the most sane move!

KarstenV · March 30, 2017, 6:47pm

OK, so the drive is out of the machine and a new one is installed.

Replacement is ongoing, at 4.6%. It is progressing slowly, at about 1% every 5 minutes.

No errors reported yet.
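As a back-of-the-envelope check on those figures: at 1% every 5 minutes, the remaining 95.4% takes about 95.4 × 5 = 477 minutes, assuming the rate stays constant. A small sketch of that arithmetic, where `eta_minutes` is a hypothetical helper name:

```shell
# Rough ETA for a replace, assuming the progress rate stays constant.
# Usage: eta_minutes <percent_done> <percent_per_step> <minutes_per_step>
eta_minutes() {
    awk -v pct="$1" -v step="$2" -v mins="$3" \
        'BEGIN { printf "%.0f\n", (100 - pct) / step * mins }'
}

eta_minutes 4.6 1 5    # prints 477 (minutes), a little under 8 hours
```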

Edit:

When I wrote "progressing slowly", I was probably misrepresenting the system.
I see it writing to the new disk at over 150 MB/s, which is probably all the speed that disk is capable of.

Have there been changes to the replace code in btrfs?
My previous experiences with replace were always horrendously (as in horribly) slow; this is much better.

Edit 2:
The replacement went without problems. The system has been rebooted and seems to be running smoothly. All disks are reported as part of the pool.
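For readers who want to double-check a pool after a replace like this, a few follow-up commands are sketched below. These are not steps anyone in this thread reported running; `verify_pool` is a hypothetical helper and the mount point is a placeholder.

```shell
# Sketch: sanity checks after a finished replace; run as root.
verify_pool() {
    mnt="$1"                         # mount point of the pool (placeholder)
    btrfs filesystem show "$mnt"     # every member should be listed, none missing
    btrfs device stats "$mnt"        # per-device error counters
    btrfs scrub start -B "$mnt"      # -B: stay in the foreground until the scrub finishes
}
```

A scrub reads all data and checks it against its checksums, so it can take about as long as the replace did.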