Failed and removed disk not shown in pool, cannot be removed

Brief description of the problem

I run Rockstor 5.1.0-0 on an old Fujitsu Primergy TX150 S8, which I rebuilt specifically for this purpose.

Until recently, it had one SSD for Rockstor and four 6 TB SATA HDDs for the Main pool, each of them LUKS-encrypted and set to auto-unlock on boot. The Main pool uses the raid1 profile.

A few days ago, I added a fifth disk, set up LUKS encryption on it and added it to the pool via the “Resize/ReRaid Pool” button/modal in the WebUI. I kept the profile as raid1, and a new balance operation started.

During the balance, which took ~32h, one of the older HDDs appeared to have failed completely:

  • both the LUKS volume and the disk itself were marked as “detached” in the Rockstor UI
  • the device file /dev/sdb was gone
  • the disk no longer showed up as a physical disk in the output of the MegaRAID CLI tool, and querying its slot returned the same output as an empty slot, and
  • the orange fault indicator LED was on next to the disk’s slot in the server’s disk enclosure.

There are also some additional warning messages in the WebUI:

  • a “Pool Device Errors Alert” message at the top of each page and
  • “(Device errors detected)” warnings in the “Disks” sections of both the overview of all pools and the Main pool overview panel.

But here’s the puzzling thing: there were and are no detached/missing disks listed in the Main pool overview, so none can be removed, either. It only lists the four working disks.

Deleting the detached disks from the Disks page did not change this, nor did physically pulling the disk out of the enclosure.

The balance operation finished successfully, and the pool still shows just the four working disks and no detached/missing ones, but the warnings persist.

I logged in via SSH and checked the Main pool’s filesystem with btrfs fi show /mnt2/Main. It still lists a total of five devices.
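
For reference, the commands involved in that check (the lsblk call is just an additional way to see which LUKS mapping sits on which physical disk; its output is obviously specific to my setup):

  # btrfs' view of the pool: devices, sizes and paths
  btrfs filesystem show /mnt2/Main

  # The block layer's view: LUKS mappings and the disks backing them
  lsblk -o NAME,TYPE,SIZE,MOUNTPOINT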

How do I proceed here?

Detailed step by step instructions to reproduce the problem

  1. Create a pool with a raid1 profile on top of some LUKS-encrypted disks (I don’t know whether it takes four of them)
  2. Add another disk to the system, LUKS-encrypt and unlock it (see the CLI sketch at the end of this section)
  3. Add it to the pool via the “Resize/ReRaid Pool” button and modal and let the balance operation start
  4. While the balance is running, physically remove one of the old disks

You should see the removed disk and its LUKS volume listed as detached on the Disks page, but not on the pool’s overview page.
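
For step 2, the LUKS part should be reproducible on the CLI with something along these lines (sdX and luks-example are placeholders; I’m not sure it matters whether the encryption is set up on the CLI or through the WebUI):

  # WARNING: luksFormat irreversibly destroys all data on the target disk
  cryptsetup luksFormat /dev/sdX
  # Unlock the volume; the mapping then shows up under /dev/mapper/
  cryptsetup open /dev/sdX luks-example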

Web-UI screenshot

Error Traceback provided on the Web-UI

Only

Pool Device Errors Alert

… at the top of each page and

(Device errors detected)

… in the Disks section of both the pools overview page and the overview panel on the Main pool’s page.

As I wrote, btrfs fi show /mnt2/Main outputs a list of 5 devices, all of them device-mapped LUKS volumes, of course.

By comparing those with the list of devices the Rockstor UI lists, I can identify the one “zombie” device that actually belongs to the removed disk.

I can confirm this by running btrfs device stats /dev/mapper/luks-<uuid>, which shows tens of millions of .write_io_errs (probably from when the balance was still running), while there aren’t any of those for the other devices.
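
To check all pool members in one go, a small loop like this should work (untested as written; it just feeds every device path from the pool into the stats command):

  # Print the error counters for every device btrfs lists for the Main pool
  for dev in $(btrfs filesystem show /mnt2/Main | awk '/devid/ {print $NF}'); do
      echo "== $dev =="
      btrfs device stats "$dev"
  done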

I could remove that device using the btrfs CLI tool, of course.

Removing it would relocate its data, which behaves much like another balance, but by the end of it the state of the btrfs filesystem might be back in sync with what Rockstor thinks the state of the pool is.
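
The command I have in mind is something like this (not run yet; <uuid> is the zombie device identified above):

  # Remove the stale device from the pool; btrfs relocates its data off it first
  btrfs device remove /dev/mapper/luks-<uuid> /mnt2/Main

If that fails because the backing disk is physically gone, btrfs device remove also accepts the keyword missing instead of a device path, although I’m not sure that applies while the stale mapping still exists.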

I don’t know whether that would get rid of the warnings, but I’d then be more confident about trying a reboot to see if it helps. Right now I’m afraid the system wouldn’t even come back up properly.

The thing is, doing this might make it impossible to debug how the system got into this peculiar state (missing/detached devices that are not shown in the pool config) in the first place. So I’m a little hesitant to fix it right away, in case a dev or someone familiar with the codebase has questions.

I assume part of the problem lies in the fact that I’m using LUKS-encrypted volumes and the extra layer of abstraction they introduce, especially since the device mapper doesn’t remove the open LUKS mapping even when the physical disk backing it has disappeared from the system.

Thinking about it, I assume Rockstor can recognise this condition because it maintains its own mapping between physical disks and LUKS volumes, so it knows that a LUKS volume must be detached if the disk it resides on is detached.

But somewhere in that fault-handling process the association between the pool and its disks went wrong: the detached LUKS volume was seemingly dropped from the pool’s device list, while the knowledge that something is wrong with the pool somehow persisted. It just can’t be fixed from the WebUI…
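
On that note, once btrfs no longer references the zombie device, the stale LUKS mapping itself could presumably be torn down by hand, roughly like this (same placeholder as above; I haven’t tried this yet):

  # Check whether the stale mapping still exists and is still open
  dmsetup info luks-<uuid>
  # Close the LUKS mapping (only once btrfs no longer uses the device)
  cryptsetup close luks-<uuid>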

Little update: I decided to give the removed disk another try to see whether it has actually failed or not.

To that end, I inserted it into an unused slot in my external disk enclosure and – would you believe it – it showed up in the output of the MegaRAID CLI tool.

I was able to retrieve the S.M.A.R.T. data from the drive, reviewed it and found nothing worrying there. I started an extended self-test. If it finishes without errors, I’m considering returning the disk to its old slot and re-integrating it into the system.

Now, because the disk is connected via an adapter card which cannot be configured to function in JBOD mode, I have to add every HDD to a “virtual disk” or “disk group” just to get it to show up in the system. As I don’t want to use the adapter card’s own limited RAID support, I always create a dedicated single-disk virtual disk for each HDD. But the one for this HDD was removed alongside the disk itself, meaning I might have to recreate it, and the disk might then not be recognised by Rockstor as the same disk it knew before.
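
Assuming the controller is driven with storcli (the exact syntax differs between MegaRAID CLI versions, so treat this as a sketch), recreating such a single-disk virtual disk would look roughly like this:

  # List the controller's enclosures and drives to find the disk's EID:slot
  storcli64 /c0 show
  # Create a single-drive RAID0 virtual disk from that drive
  storcli64 /c0 add vd type=raid0 drives=<EID>:<slot>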

I’ll post updates here on how all of that turns out.


The extended self-test finished without errors, so I took the disk out of the other slot and returned it to its original one. After some more debugging, I found that the adapter card had apparently “forgotten” its own configuration for this drive and considered it “foreign”. I was able to import this “foreign” configuration, though, and now the physical disk (for want of a better word) shows up in the system again… but not the LUKS volume that belongs to it (even though the disk has the “unlocked padlock” symbol next to it)!
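
For anyone in the same situation: with storcli, the foreign-config handling corresponds roughly to these commands (again assuming controller 0; your MegaRAID tool and numbering may differ):

  # Show any configuration the controller considers foreign
  storcli64 /c0/fall show
  # Import all foreign configurations
  storcli64 /c0/fall import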

The Main pool still lists only four devices (and the corresponding total capacity), and the warning messages are still present, too.

I don’t know how to proceed here.

How can I fix this and bring the underlying system state back in sync with Rockstor?