Failed to start SFTP due to a system error - after replacing drive

gburian · December 8, 2020, 2:54am

Brief description of the problem

I can no longer start my SFTP service after disabling it from the web UI and replacing one drive of a 7-drive RAID 1 pool. The drive was replaced to increase the pool size.

Detailed step by step instructions to reproduce the problem

Disabled SFTP service.
Replaced drive with procedure documented here:
Data Loss-prevention and Recovery in Rockstor — Rockstor documentation
Ran a balance on the pool.
Disabled pool quota during the balance as I noticed the balance was running very slowly.
Re-enabled pool quota after the balance had completed.
Tried to restart the SFTP server from the web UI.

Web-UI screenshot

Error Traceback provided on the Web-UI


Traceback (most recent call last):
  File "/opt/rockstor/src/rockstor/smart_manager/views/sftp_service.py", line 52, in post
    toggle_sftp_service()
  File "/opt/rockstor/src/rockstor/system/ssh.py", line 82, in toggle_sftp_service
    return systemctl('sshd', 'restart')
  File "/opt/rockstor/src/rockstor/system/services.py", line 77, in systemctl
    return run_command([SYSTEMCTL_BIN, switch, service_name], log=True)
  File "/opt/rockstor/src/rockstor/system/osi.py", line 176, in run_command
    raise CommandException(cmd, out, err, rc)
CommandException: Error running a command. cmd = /usr/bin/systemctl restart sshd. rc = 1. stdout = ['']. stderr = ['Job for sshd.service failed because the control process exited with error code. See "systemctl status sshd.service" and "journalctl -xe" for details.', '']

Note: the “Backup” share that I previously served with SFTP now shows as unmounted and with the previous size:

Trying to change the share’s size also shows an error:



Traceback (most recent call last):
  File "/opt/rockstor/src/rockstor/rest_framework_custom/generic_view.py", line 41, in _handle_exception
    yield
  File "/opt/rockstor/src/rockstor/storageadmin/views/share.py", line 247, in put
    share_pqgroup_assign(share.pqgroup, share)
  File "/opt/rockstor/src/rockstor/fs/btrfs.py", line 1240, in share_pqgroup_assign
    return qgroup_assign(share.qgroup, pqgroup, mnt_pt)
  File "/opt/rockstor/src/rockstor/fs/btrfs.py", line 1294, in qgroup_assign
    raise e
CommandException: Error running a command. cmd = /usr/sbin/btrfs qgroup assign 0/258 2015/2 /mnt2//Backup_Pool. rc = 1. stdout = ['']. stderr = ['ERROR: unable to assign quota group: File exists', '']

gburian · December 8, 2020, 3:24am

Looking further, the share named “Backup” that I used to share via the SFTP service is now owned by root, which makes it so it cannot be shared via SFTP.

Trying to change this back to the user group (“backup”/“backup”) I had before results in another error:

Trying to delete the share warns me that it will delete all the share data (nope, don’t want that!), and trying to allocate a new share won’t let me share the same data as my current “Backup” share.

I seem to be stuck with no way to get my SFTP server back online.

gburian · December 10, 2020, 1:29am

Still stuck here.

Can anyone suggest how to diagnose why I cannot change the ownership of my “Backup” share?

Flox · December 12, 2020, 3:30pm

Hi @gburian,

First of all, sorry for the delayed answer; I’ll try to see if I can help.

It seems the failure to start the SFTP service might be a symptom of a broader problem but let’s try to see first what the sshd.service has to say:

systemctl status -l sshd.service

or if the output seems truncated (as in not enough lines are displayed to get a good idea of what happened), you can explore (using up/down or pageUp/pageDown keys to navigate):

journalctl -u sshd.service

Now, the reason why I’m wondering whether this is a symptom of something broader is the following:

Could you provide a screenshot of your “Pools” page? Are all shares on this pool also unmounted? Do you have another pool that is also experiencing an issue?

It could also be interesting to see what the following reports:

btrfs fi show /mnt2/Backup_Pool

The screenshot of your pool usage also seems to show no free space, with used space larger than the total available space, which looks odd as well. I’m not very familiar with how used space is reported, but this, combined with the quota quirks you mention, might be a clue. It would thus be interesting to see the following:

btrfs fi usage /mnt2/Backup_Pool/

I’m sorry for not being able to give you a clear answer, but hopefully that’ll help provide more information on what is happening.

Also, could you let us know what version of Rockstor you are using? From your logs, it seems you might be running an older version. yum info rockstor should give you a clear answer.

gburian · December 13, 2020, 5:11pm

Thanks for the reply @Flox. Ironically, a few minutes before your post I managed to get my SFTP server back online.

The main problem was that the owner and group for my Backup share had somehow been set to “root” in the Rockstor database, even though the values on disk were still correctly set to “backup”. Since they were set to “root” in the database, the SFTP service would not use this share.

When I tried to update the owner/group via the Web UI, it would fail after exactly 2 minutes. My system is an old PC circa 2008 with all spinning disks so it takes a relatively long time to run the “chown” command spawned by this change. I tracked the timeout down to the timeout set for gunicorn requests, but even after bumping that up to 30 minutes to allow the chown to complete, this still failed eventually.
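
For anyone hitting the same two-minute cutoff: gunicorn exposes its worker timeout as the `timeout` setting (or `--timeout` on the command line), expressed in seconds. The fragment below is only a sketch of what a 30-minute value looks like in a gunicorn config file, which is itself Python. Where Rockstor actually sets this value is not covered in this thread, so the file name here is an assumption.

```python
# gunicorn.conf.py (hypothetical file name; Rockstor may pass --timeout instead)
# A worker that stays silent for longer than this many seconds is killed
# and restarted, which aborts whatever request it was handling.
timeout = 30 * 60  # 30 minutes, versus the 2 minutes observed above
```

As noted above, raising the limit alone did not make the ownership change succeed here, so this is a diagnostic aid rather than a fix.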

Finally, I figured out how to use the psql command to change the right field in the database to convince Rockstor that this share’s owner/group were backup/backup. For reference in case others need to do something similar:

psql -U rocky storageadmin
(enter the password “rocky” when prompted)
select * from storageadmin_share;
update storageadmin_share set owner='backup',"group"='backup' where id=9;

Once I made this change, I was able to add the Backup share to the SFTP service and everything is now working as before.
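
A slightly safer pattern for this kind of manual database edit is to match the share by name, confirm that exactly one row changed, and roll back otherwise. The sketch below illustrates the pattern using Python's built-in sqlite3 with an in-memory stand-in table, so it runs anywhere. It does not connect to Rockstor's PostgreSQL database, and the `name` column and the `set_share_owner` helper are assumptions made for illustration; only `id`, `owner` and `"group"` come from the commands above.

```python
import sqlite3


def set_share_owner(conn, share_name, owner, group):
    """Update owner/group for exactly one share, or roll back and raise."""
    cur = conn.cursor()
    cur.execute(
        'UPDATE storageadmin_share SET owner = ?, "group" = ? WHERE name = ?',
        (owner, group, share_name),
    )
    if cur.rowcount != 1:
        # Zero or several matches means the filter was wrong: undo everything.
        conn.rollback()
        raise ValueError(f"expected 1 matching share, found {cur.rowcount}")
    conn.commit()


# Stand-in table with a hypothetical layout (the real table has more columns).
conn = sqlite3.connect(":memory:")
conn.execute(
    'CREATE TABLE storageadmin_share '
    '(id INTEGER PRIMARY KEY, name TEXT, owner TEXT, "group" TEXT)'
)
conn.execute("INSERT INTO storageadmin_share VALUES (9, 'Backup', 'root', 'root')")
conn.commit()

set_share_owner(conn, "Backup", "backup", "backup")
row = conn.execute(
    'SELECT owner, "group" FROM storageadmin_share WHERE id = 9'
).fetchone()
print(row)  # ('backup', 'backup')
```

The same idea carries over to psql by wrapping the update in BEGIN and COMMIT and checking the reported row count before committing.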

I am not sure exactly how I got the SFTP server running in the first place, but my SFTP server logs show that it was running before my Backup share was re-associated with it. During that time, clients failed to find the files they were looking for, since the share was not mounted.

The main anomaly now is the disk usage shown for the Backup share, although I see the warning at the top of the Shares screen indicating that quotas are not enforced right now so the size of a share is effectively the size of its pool.

The device of the new disk I added also shows up as “detached-0359b48b77c14907a8d3d8bd88ba1d45” which is wrong. I suspect this is because Rockstor did not account for the disk replacement correctly somehow, since the status of the pool when I look from the command line appears to be as expected (the new disk shows up as /dev/sdg).

Here are a few screenshots and command output:

Pools:

Backup_Pool details:

[root@backup ~]# btrfs fi show /mnt2/Backup_Pool
Label: 'Backup_Pool'  uuid: f89b541f-6c7e-4819-a605-124c91f9b428
	Total devices 7 FS bytes used 2.51TiB
	devid    1 size 3.64TiB used 2.51TiB path /dev/sdg
	devid    2 size 1.82TiB used 1.33TiB path /dev/sdf
	devid    3 size 931.51GiB used 433.00GiB path /dev/sdj
	devid    4 size 465.76GiB used 193.00GiB path /dev/sda
	devid    5 size 465.76GiB used 194.00GiB path /dev/sdd
	devid    6 size 465.76GiB used 194.00GiB path /dev/sdb
	devid    7 size 465.76GiB used 193.00GiB path /dev/sdc

[root@backup ~]# btrfs fi usage /mnt2/Backup_Pool/
Overall:
    Device size:		   8.19TiB
    Device allocated:		   5.02TiB
    Device unallocated:		   3.17TiB
    Device missing:		     0.00B
    Used:			   5.02TiB
    Free (estimated):		   1.58TiB	(min: 1.58TiB)
    Data ratio:			      2.00
    Metadata ratio:		      2.00
    Global reserve:		 512.00MiB	(used: 0.00B)

Data,RAID1: Size:2.50TiB, Used:2.50TiB
   /dev/sda	 193.00GiB
   /dev/sdb	 194.00GiB
   /dev/sdc	 193.00GiB
   /dev/sdd	 192.00GiB
   /dev/sdf	   1.33TiB
   /dev/sdg	   2.50TiB
   /dev/sdj	 431.00GiB

Metadata,RAID1: Size:6.00GiB, Used:4.24GiB
   /dev/sdd	   2.00GiB
   /dev/sdf	   2.00GiB
   /dev/sdg	   6.00GiB
   /dev/sdj	   2.00GiB

System,RAID1: Size:32.00MiB, Used:384.00KiB
   /dev/sdf	  32.00MiB
   /dev/sdg	  32.00MiB

Unallocated:
   /dev/sda	 272.76GiB
   /dev/sdb	 271.76GiB
   /dev/sdc	 272.76GiB
   /dev/sdd	 271.76GiB
   /dev/sdf	 498.99GiB
   /dev/sdg	   1.13TiB
   /dev/sdj	 498.51GiB
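
As a reading aid for the numbers above: with RAID1 every chunk is stored twice, so the raw "Used" figure is about double the amount of file data. The short calculation below redoes that arithmetic with values copied from the output. The formula for the free-space estimate is an approximation assumed here for illustration, not taken from the btrfs source.

```python
# Figures copied from the `btrfs fi usage` output above, all in TiB.
DATA_RATIO = 2.00          # RAID1 stores two copies of every chunk
device_unallocated = 3.17  # "Device unallocated" (raw)
raw_used = 5.02            # "Used" in the Overall section (raw)
data_size = 2.50           # Data,RAID1 Size (logical)
data_used = 2.50           # Data,RAID1 Used (logical)


def logical(raw_tib: float, ratio: float = DATA_RATIO) -> float:
    """Convert raw on-disk space to the amount of file data it can hold."""
    return raw_tib / ratio


used_logical = logical(raw_used)
# Assumed approximation: slack inside already allocated data chunks plus
# the unallocated raw space divided by the data ratio.
free_estimate = (data_size - data_used) + logical(device_unallocated)

print(f"logical used:  {used_logical:.3f} TiB")
print(f"free estimate: {free_estimate:.3f} TiB")
```

This works out to about 2.51 TiB of file data, matching "FS bytes used", and roughly 1.58 TiB free, in line with "Free (estimated)". The raw 5.02 TiB "Used" is therefore expected for this pool and not a sign of lost space.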

Shares:

I believe I am running the latest Rockstor from the Stable channel:

[root@backup log]# yum info rockstor
Loaded plugins: changelog, fastestmirror
Loading mirror speeds from cached hostfile
 * base: mirror.it.ubc.ca
 * epel: sjc.edge.kernel.org
 * extras: mirror.it.ubc.ca
 * updates: centos.les.net
Installed Packages
Name        : rockstor
Arch        : x86_64
Version     : 3.9.2
Release     : 57
Size        : 85 M
Repo        : installed
Summary     : Btrfs Network Attached Storage (NAS) Appliance.
URL         : http://rockstor.com/
License     : GPL
Description : Software raid, snapshot capable NAS solution with built-in file integrity protection.
            : Allows for file sharing between network attached devices.

Anyways, I am back up and running and the remaining issues appear to be mostly cosmetic.

Perhaps when support for replacing disks is added to the Web UI, these issues will be dealt with.

Thanks again for your response and ideas for what to look at.