Degraded pool not accessible after reboot, unable to remove bad drive

Brief description of the problem

Running the older CentOS-based Rockstor version 3.9.2-57.
I had a detached disk showing in the pool and a degraded-pool alert. The stats page for the disk looked OK, so I rebooted the system to see if it would clear, and now I have no access to the pool data in the UI, although I can still see the disks in the UI and I can SSH into the box.

The detached removable drive has been there for a while with no problems; previous attempts to remove it have failed, and I tried again today with no luck. I get an error saying it can't be removed because it's not attached. I have no access to shares from Windows machines on the network, and the OpenVPN Rock-on doesn't appear to be running. Everything was working prior to the reboot, apart from the degraded-pool alert at the top of the page. I should have a replacement hard drive in a few hours.

The pool data is now available in the UI; maybe I wasn't patient enough the first time. I'm still seeing the same GET error at the top of the UI.

Detailed step by step instructions to reproduce the problem

Reboot with a degraded pool and one drive detached.

Web-UI screenshot




Error Traceback provided on the Web-UI

UI gives me the following error on all screens.

Unknown internal error doing a GET to /api/shares?page=1&format=json&page_size=9000&count=&sortby=name

Rockstor.log tail provided the following.

[root@jonesville log]# tail rockstor.log
return func(*args, **kwargs)
  File "/opt/rockstor/src/rockstor/storageadmin/views/command.py", line 348, in post
    import_shares(p, request)
  File "/opt/rockstor/src/rockstor/storageadmin/views/share_helpers.py", line 86, in import_shares
    shares_in_pool = shares_info(pool)
  File "/opt/rockstor/src/rockstor/fs/btrfs.py", line 685, in shares_info
    pool_mnt_pt = mount_root(pool)
  File "/opt/rockstor/src/rockstor/fs/btrfs.py", line 525, in mount_root
    'Command used %s' % (pool.name, mnt_cmd))
Exception: Failed to mount Pool(Removable-Backup) due to an unknown reason. Command used ['/usr/bin/mount', u'/dev/disk/by-label/Removable-Backup', u'/mnt2/Removable-Backup']

I assume I need to remove the drives via the command line to restore the pool, but I'm not sure where to start.

That would still leave you with the non-functional drive for your jones-pool pool. I think you can mount that pool as degraded using the command line, and then use the replace function via the WebUI.
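
As a rough sketch only (your exact pool label may differ, and Rockstor normally mounts pools under /mnt2/<pool-name>), a degraded mount from the command line would look something like this:

    # Sketch only: tolerate the missing device and mount the pool degraded.
    # Confirm the label first with: btrfs filesystem show
    mkdir -p /mnt2/jones-pool
    mount -o degraded /dev/disk/by-label/jones-pool /mnt2/jones-pool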

Take a look at this:

https://rockstor.com/docs/data_loss.html

While it has been updated to account for the Built on openSUSE versions, I believe the process of mounting the pool degraded and then attempting the replace (or adding a new disk and then removing the broken disk) still holds true. Ideally, once you are able to mount it as degraded, do:

  • a configuration backup via the WebUI - and offload it to a non-Rockstor location
  • ideally, a backup of your pool data to another device (which I assume is what your Removable-Backup drive is for?)

Then deal with getting that pool clean again, and then :slight_smile: move to the latest testing release, or at least the latest stable.


Luckily I just finished a full backup to two removable USB drives connected to a Windows client. The one that shows in the pool is too small these days; it's been living there for a while with no issues. Drive 3 being detached, one of the HGST drives, appears to be what caused the error. I'm running so close to full that I'm going to need to add a drive to the pool before I remove one, or I'll have to move to RAID 0, remove the drive, add a drive, and then convert back to RAID 1.

Good with the backup then!

If your existing pool is pretty full, I would then opt for adding the new drive first, if you can, before removing the other one "officially". I would not mess with the RAID level until you have a healthy pool again.
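
If you end up doing that step from the command line rather than the Resize/ReRaid wizard in the WebUI, a rough sketch (the device names are placeholders, so check yours with lsblk first) would be:

    # Add the new disk to the mounted pool first, then remove the old one.
    # /dev/sdX (new disk) and /dev/sdY (failing disk) are placeholders.
    btrfs device add /dev/sdX /mnt2/jones-pool
    btrfs device remove /dev/sdY /mnt2/jones-pool   # relocates data off the old disk; can take hours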


The IT gremlins were kind to me today. I powered down, added the new drive, re-seated the errored drive, and everything came back up. I just need to investigate the error alert and figure out whether it's caused by the missing USB drive or not.

Very nice!

If you're not using that USB drive, you could remove that pool via the WebUI; that would give you an immediate answer.

For your working-again pool, it might be useful to do a scrub and a balance to clean out inconsistencies and to spread the data more evenly across the new drive as well.
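
If you prefer kicking those off from the command line instead of the WebUI, a rough sketch (assuming the usual /mnt2/<pool-name> mount point) would be:

    btrfs scrub start /mnt2/jones-pool
    btrfs scrub status /mnt2/jones-pool                    # check progress and any error counts
    btrfs balance start --full-balance /mnt2/jones-pool    # older btrfs-progs may not know/need --full-balance
    btrfs balance status /mnt2/jones-pool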

It might be all the errors on the re-seated drive.

When I try to remove the USB drive using the Resize/ReRaid dialog, I get the following error.

Traceback (most recent call last):
  File "/opt/rockstor/eggs/gunicorn-19.7.1-py2.7.egg/gunicorn/workers/sync.py", line 68, in run_for_one
    self.accept(listener)
  File "/opt/rockstor/eggs/gunicorn-19.7.1-py2.7.egg/gunicorn/workers/sync.py", line 27, in accept
    client, addr = listener.accept()
  File "/usr/lib64/python2.7/socket.py", line 202, in accept
    sock, addr = self._sock.accept()
error: [Errno 11] Resource temporarily unavailable

@D_Jones

With regard to the repair scenarios and removing pools with all drives missing etc., we have addressed quite a few bugs on that front since the CentOS days and 3.9.2-57 (April 2020). Plus the underlying btrfs is way better, especially on our new preferred target of Leap 15.6. Not that helpful to you right now, however. So do what @Hooverdan said re the scrub etc., and once you have a healthy main pool again you can approach (time allowing) the transition to our modern openSUSE variant. Note that we have also recently updated the following how-to:

https://rockstor.com/docs/howtos/centos_to_opensuse.html

The errors reported on that flaky drive in the main pool are cumulative, i.e. they show all errors encountered since the drive became part of this pool, or since they were last reset. The command indicated there (in the text above the table) can be used to reset those stats so you can more easily see whether any new errors are being generated.
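
As a sketch only, assuming the pool is mounted at Rockstor's usual /mnt2/<pool-name> location, those stats can be viewed and zeroed like this:

    btrfs device stats /mnt2/jones-pool       # show the cumulative per-device error counters
    btrfs device stats -z /mnt2/jones-pool    # print the counters and then reset them to zero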

Assuming the USB drive is not connected, and it's the only member of that "Removable-Backup" pool, you may have success with deleting the pool. But again, we had more bugs back then in this area!
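
Before retrying the delete, it may be worth a quick command-line check that the system really no longer sees that filesystem at all; a rough sketch:

    btrfs filesystem show Removable-Backup           # does the kernel still see this label anywhere?
    ls -l /dev/disk/by-label/ | grep -i removable    # any stale by-label symlink left behind?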

Well done for stringing along this old CentOS install for so long, by the way. And once that main pool is healthy again, we can attempt to address any issues with its import. But a good scrub beforehand, with your existing kernel and without any custom mount options, can only help.

Hope that helps and keep us posted.


New drive added, old drive removed, balance running, fingers crossed! I'm going to ignore the USB drive for a while; even deleting the pool didn't work. openSUSE, here I come; it's funny how the threat of data loss can change your priorities. Unfortunately I'm out of town for 10 days before I can get to that.