I know on FreeNAS there’s a way to attempt fixing the sectors and, if they don’t fix, rebuild the pool skipping them. Is there a way to do something similar on Rockstor? I am on raid 1+0, so I suppose a disk failure wouldn’t be the worst thing, but I like fixing errors.
@coleberhorst Hello again.
In this case the error is hardware based, and given the number of sectors involved I would say this is a very poorly drive that needs to be replaced tout de suite, especially as those sectors are already identified as uncorrectable. The ‘fix’ method for asserting the correctness of stored data, in both ZFS and btrfs, is the scrub, but here you are well beyond the filesystem level and clearly in very poorly hardware territory. Look instead to the (currently command line only) method of removing or replacing this drive within its pool while, since the drive is known dodgy, treating it as read only. Note that if the pool is currently at the minimum drive count for its btrfs raid level then you will have to perform an in-place ‘replace’, and as the drive looks very poorly it would be advisable to treat it as read only, i.e. the ‘-r’ switch detailed in our following open issue:
This matters because if you drop below the minimum drive count for a given btrfs raid level you only get one chance to mount the pool in degraded mode in which to fix the issue. But it seems your pool is still holding up, so currently your options are broader. The minimum drive count for btrfs raid 10 is 4, so you would need 5 drives if you were to remove this device from the pool by simply using the Rockstor UI resize pool - remove drive option; however, that will exercise the existing poorly drive quite a bit which, given its current report, would be inadvisable. Likewise, adding an additional drive to raise the pool above the minimum count would also exercise the drive quite a bit. Hence the replace -r suggestion.
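To see where the pool currently stands before choosing between remove, add, or replace, the usual btrfs tools can be queried directly. A minimal sketch, assuming the pool is mounted at /mnt2/mypool (Rockstor mounts pools under /mnt2/&lt;poolname&gt;; the pool name here is made up):

```
# List all btrfs filesystems, their member devices and devids
btrfs filesystem show

# Show the Data / Metadata profiles (e.g. RAID10) and per-device usage
btrfs filesystem usage /mnt2/mypool
```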
If the drive ends up failing outright and drops out of existence from btrfs’s point of view, then the following issue details the procedure required at that point:
But you are not at this stage just yet.
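For future reference only, as you are not at that stage: the procedure in that issue looks roughly like the sketch below. The device names, devid placeholder, and mount point are all assumptions here, so follow the linked issue for the real steps:

```
# Mount the pool read-write in degraded mode (the one-shot opportunity
# when below minimum device count)
mount -o degraded /dev/sdb /mnt2/mypool

# Identify the devid flagged as missing
btrfs filesystem show /mnt2/mypool

# Replace the missing device (referenced by its devid) with the new drive
btrfs replace start <missing-devid> /dev/sde /mnt2/mypool
```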
So I would say you need to do a replace (with the read only ‘-r’ switch) of this disk within its existing pool, to get it out with the minimum of exercise.
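A minimal sketch of that replace, purely as an illustration: the device names and mount point below are assumptions (the failing drive as /dev/sdd, its replacement as /dev/sde, the pool mounted at /mnt2/mypool), so substitute your own from the Rockstor UI / `btrfs fi show` output:

```
# Replace the failing drive in-place; -r avoids reading from it unless
# no other good copy exists, keeping wear on the poorly drive to a minimum
btrfs replace start -r /dev/sdd /dev/sde /mnt2/mypool

# Monitor progress until it reports finished
btrfs replace status /mnt2/mypool
```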
Yes, on this note your situation is not a good candidate, but modern drives do have a degree of self-healing capability that is often not triggered until the problematic sector is written to. As a consequence there are texts suggesting that a full disk write will force all faulty / bad sectors to undergo the ‘auto replace with spare sectors’ procedure built into drives. However, although I have done this on quite a few drives, they have pretty much all produced additional bad sectors shortly thereafter, and that was with less than a handful of bad sectors. You have > 65000 and they are already marked as uncorrectable, so yes, kid glove time: step very carefully and be sure to understand the commands required.
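For completeness, the ‘full disk write’ those texts describe usually boils down to something like the following; it is entirely destructive and only ever done on a drive already removed from any pool. /dev/sdd is an assumption and, as above, not something I would recommend for this particular drive:

```
# Destructive write-mode surface test: writes patterns over the whole disk,
# giving the firmware a chance to remap bad sectors to spares
badblocks -wsv /dev/sdd

# Then re-check the reallocated / pending sector counts
smartctl -A /dev/sdd
```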
It is also possible to manually mark / remap individual sectors (rather than via the auto method just described), but you already have > 65000 !!
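For the curious, that manual route is typically a per-sector affair with hdparm, which makes it a non-starter at your counts. A sketch only, with the LBA and device made up (the failing LBA normally shows up in dmesg or in `smartctl -l selftest` output):

```
# Confirm the sector really is unreadable
hdparm --read-sector 123456789 /dev/sdd

# Overwrite it with zeros, prompting the drive to remap it to a spare
hdparm --write-sector 123456789 --yes-i-know-what-i-am-doing /dev/sdd
```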
Thanks as always for your very detailed answers and help.
After closer investigation there is also a SMART error showing the cache is failing and throwing parity errors, so this drive seems close to dead, on top of having roughly 1/8th of its sectors unreadable. I will do as you advise and remove it, probably replacing it within a week or so with a new 6TB one.
Current config is raid 10 with 2TB x 3, 4TB, 6TB; the new one will be 2TB x 2, 4TB, 6TB x 2. As you said, my hardware isn’t the best: I’m using my old desktop and some of the drives are quite old.