Ran out of space, now rockstor won't mount drives

Hi,

I have a RAID10 array that ran out of space. It took me a while to work that out, and I've cleared a few gigs now, but Rockstor is still not mounting the pool or starting the Rock-on services. The Web UI is sluggish and keeps failing on API calls, and so does running ./bootstrap.

I can mount the drives myself using mount -U <uuid> /mnt2, which works fine, but that still doesn't fix the Web UI or whatever else is failing.

The only errors I’ve found are the following (other than errors about qgroups not existing)

ERROR [storageadmin.views.command:192] Exception while setting service statuses during bootstrap: (Error running a command. cmd = /usr/bin/systemctl start atd. rc = 4. stdout = ['']. stderr = ['Failed to start atd.service: Transaction is destructive.', "See system logs and 'systemctl status atd.service' for details.", '']).
[07/May/2019 14:51:15] ERROR [storageadmin.util:44] Exception: Exception while setting service statuses during bootstrap: (Error running a command. cmd = /usr/bin/systemctl start atd. rc = 4. stdout = ['']. stderr = ['Failed to start atd.service: Transaction is destructive.', "See system logs and 'systemctl status atd.service' for details.", '']).
Traceback (most recent call last):
  File "/opt/rockstor/src/rockstor/storageadmin/views/command.py", line 188, in post
    systemctl('atd', 'start')
  File "/opt/rockstor/src/rockstor/system/services.py", line 63, in systemctl
    return run_command([SYSTEMCTL_BIN, switch, service_name])
  File "/opt/rockstor/src/rockstor/system/osi.py", line 121, in run_command
    raise CommandException(cmd, out, err, rc)
CommandException: Error running a command. cmd = /usr/bin/systemctl start atd. rc = 4. stdout = ['']. stderr = ['Failed to start atd.service: Transaction is destructive.', "See system logs and 'systemctl status atd.service' for details.", '']

I've checked for failed drives, checked dmesg, and run a btrfs check; everything comes back clean.

Any help appreciated.

Never seen that before, but it seems to be systemd related.

First port of call is the systemd journal. Try to start the service, and then run:

journalctl -n 50

Check to see if any clues are provided.
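If the journal is noisy, narrowing the output to the unit in question can also help (unit name taken from the error in your logs), e.g.:

# show the last 50 journal entries for the atd unit only
journalctl -u atd.service -n 50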

Next, I'd try reloading the systemd daemon:

systemctl daemon-reload

Then try to start again.

Alternatively, did you recently try to shut down, suspend or reboot the machine, but have it not complete?
If so, you may have a stuck suspend process. Check with:

ps aux | grep systemd-sleep

If found, kill it by process ID, reboot, then try again.
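For example (the PID here is hypothetical; use the one shown in the ps output above):

kill 12345
# if it refuses to die:
kill -9 12345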

I've checked atd and rockstor using systemctl status; both are running at the moment.

My issue seems to be with getting Rockstor to mount the RAID array, or getting the rockstor service to do it, as I guess that is what the API calls are doing.

Have you tried reloading the systemd daemon though?
Did you attempt to reboot but not succeed?

I ask because the messages in your logs indicate a systemd problem.

What is the output of the journalctl command I provided earlier after attempting to restart atd?

I’ve rebooted my machine a couple of times without success already.

I’ve reloaded the systemd daemon, no change.
I've re-run ./bootstrap, which is spitting out the following error:
Exception occured while bootstrapping. This could be because rockstor.service is still starting up. will wait 2 seconds and try again. Exception: HTTPConnectionPool(host='127.0.0.1', port=8000): Max retries exceeded with url: /api/commands/bootstrap (Caused by <class 'httplib.BadStatusLine'>: '')
Error connecting to Rockstor. Is it running?

rockstor.service is now running, and checking its status shows that all parts have started successfully.

I grepped the journal just now for ‘rockstor’

 bootstrap[6430]: Exception occured while bootstrapping. This could be because rockstor.service is still starting up. will wait 2 seconds and try again. Exception: HTTPConnectionPool(host='127.0.0.1', port=8000): Max retries exceeded with url: /api/commands/bootstrap (Caused by <class 'httplib.BadStatusLine'>: '')
May 08 09:37:06 lazarus systemd[1]: rockstor-bootstrap.service: main process exited, code=exited, status=1/FAILURE
May 08 09:37:06 lazarus systemd[1]: Unit rockstor-bootstrap.service entered failed state.
May 08 09:37:06 lazarus systemd[1]: rockstor-bootstrap.service failed.

So it looks like the issue lies here: rockstor-bootstrap.service cannot start due to the error above.

@Haioken Just chipping in on this one: could the original 'out of space' issue have been with the system disk? That would explain the rather strange atd error, and since we mount the data drives within the rockstor-bootstrap service, an overly full system disk could have such effects.

Maybe the output of:

btrfs fi show

and a

btrfs fi usage /mnt2/<each-pool-name-in-turn>

might shed some light on things.

@Michael_Arthur Did you, for instance, use the system drive as a Rock-ons root? Maybe it got too full, and so we have these unusual systemd / atd failures.

Just a quick thought.

Hope that helps.

The system disk is practically empty:

Label: 'rockstor_rockstor00'  uuid: de89a23b-cfcd-49e2-a705-f4e463105cf5
	Total devices 1 FS bytes used 7.89GiB
	devid    1 size 224.52GiB used 12.06GiB path /dev/sda3

Label: 'redundant_storage'  uuid: 3794bd2d-fa72-46f6-b074-782862f69618
	Total devices 4 FS bytes used 1.50TiB
	devid    1 size 1.82TiB used 792.03GiB path /dev/sdb
	devid    2 size 1.82TiB used 792.03GiB path /dev/sde
	devid    3 size 1.82TiB used 792.03GiB path /dev/sdc
	devid    4 size 1.82TiB used 792.03GiB path /dev/sdd


   Overall:
Device size:		   7.28TiB
Device allocated:		   3.05TiB
Device unallocated:		   4.23TiB
Device missing:		     0.00B
Used:			   3.00TiB
Free (estimated):		   2.13TiB	(min: 2.13TiB)
Data ratio:			      2.00
Metadata ratio:		      2.00
Global reserve:		 512.00MiB	(used: 0.00B)

Data,RAID10: Size:1.52TiB, Used:1.50TiB
   /dev/sdb	 389.00GiB
   /dev/sdc	 389.00GiB
   /dev/sdd	 389.00GiB
   /dev/sde	 389.00GiB

Metadata,RAID10: Size:4.00GiB, Used:3.44GiB
   /dev/sdb	   1.00GiB
   /dev/sdc	   1.00GiB
   /dev/sdd	   1.00GiB
   /dev/sde	   1.00GiB

System,RAID10: Size:64.00MiB, Used:176.00KiB
   /dev/sdb	  16.00MiB
   /dev/sdc	  16.00MiB
   /dev/sdd	  16.00MiB
   /dev/sde	  16.00MiB

Unallocated:
   /dev/sdb	   1.44TiB
   /dev/sdc	   1.44TiB
   /dev/sdd	   1.44TiB
   /dev/sde	   1.44TiB

@Michael_Arthur OK, that rules that one out anyway.

Looks like I missed the mark on that one.

Thanks for the outputs. Hopefully others can chip in here also as this is a strange one.

Strangely, I've managed to get rockstor-bootstrap.service running:
systemctl start rockstor-bootstrap

It failed at first but seems to be running now.
I had to manually mount the raid10 array first though.
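For reference, something along these lines (pool UUID taken from the btrfs fi show output above; Rockstor normally mounts the pool by label under /mnt2):

mount -U 3794bd2d-fa72-46f6-b074-782862f69618 /mnt2/redundant_storage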

Let's see how much further I can get; the Web UI is behaving very badly and I'm getting timeouts at the moment.

Is this normal?

[09/May/2019 09:44:16] ERROR [system.osi:119] non-zero code(1) returned by command: ['/usr/sbin/btrfs', 'qgroup', 'show', '/mnt2/redundant_storage/rockons/btrfs/subvolumes/518e95a85a98e9ababc16767b81ca4185ba7ff28137d00f1cee5df14426723e3-init']. output: [''] error: ["ERROR: can't list qgroups: quotas not enabled", '']

I'm getting lots of API call failures.

Yay, more errors. Looks like I'm going to have to rebuild - I've ordered a 4TB backup… :frowning: sad days.

May 10 18:06:13 lazarus docker-wrapper[28140]: run_command(cmd)
May 10 18:06:13 lazarus docker-wrapper[28140]: File "/opt/rockstor/src/rockstor/system/osi.py", line 121, in run_command
May 10 18:06:13 lazarus docker-wrapper[28140]: raise CommandException(cmd, out, err, rc)
May 10 18:06:13 lazarus docker-wrapper[28140]: system.exceptions.CommandException: Error running a command. cmd = /usr/bin/dockerd --log-driver=journald --storage-driver btrfs --storage-opt btrfs.min_space=1G --data-root /mnt2/rockons. rc = 1. stdout = ['']. stderr = ['chmod /mnt2/rockons: operation not permitted', '']

@Michael_Arthur, Sorry for the slow response. I’ve had a thought re your more recent report:

What version of Rockstor are you using? If it is not a more recent stable updates channel version, then quotas being disabled on a pool could well explain a lot of these errors. Earlier versions of Rockstor absolutely depended on quotas being enabled; if they were disabled, it would fall flat on its face, be unable to mount that or any other pool, and likely end up in a complete mess.

Just a thought. In that case, enabling quotas on all pools via the command line and rebooting might get you sorted, but a quotas-disabled state on a pool under the testing channel (which is currently older than stable) is likely to lead to a variety of unknown states (read: Web-UI breakages that may not recover).
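Something like the following (a sketch only; pool mount points taken from your earlier btrfs fi show output) should re-enable quotas:

# re-enable quotas on each pool
btrfs quota enable /mnt2/rockstor_rockstor00
btrfs quota enable /mnt2/redundant_storage
# the quota rescan can take a while; check its status with
btrfs quota rescan -s /mnt2/redundant_storage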

Also note that if you have retroactively installed docker-ce on a testing channel release, then you will face the following upstream docker-ce related issue:

which in turn led to us having to 'cope' with a quotas-disabled state, a capability now present in the later stable release channel versions only.

I may be 'off the mark' again, but just in case.

Hope that helps and let us know how you get on.

I've paid for stable updates and have been running the same version for a long time now; however, I will investigate that.

@Michael_Arthur

Thanks for helping to support Rockstor development. It's well worth double-checking your exact version, as if you went from testing to stable we had the following bug:

https://github.com/rockstor/rockstor-core/issues/1870

and its associated pull request:

https://github.com/rockstor/rockstor-core/pull/1871

Also note that very early 3.9.2.# versions shared this quotas-disabled failure, as per the GitHub references in my last post, so it's best to check and report your version here so we can rule that one out also.

So do make sure via:

yum info rockstor

Which should show the installed and available versions.

It would be good to get to the source of your issue here, as you have reported non-Rockstor systemd tasks failing to start atd (the at daemon), which makes me think you have a low-level problem with your system disk (ssd) or its attachment (cable) / driver / controller. And of course Rockstor sits on top of a bunch of other such services. Maybe take a look at the btrfs error report for the system disk, for example via (in your case):

btrfs dev stats /mnt2/rockstor_rockstor00

Re:

You may have a stuck immutable bit, or that subvol may have gone read-only, or it is usually on the data pool but, if that pool was not mounted at the time, writes have fallen through to a directory on the system pool. For an instance of immutable-flag bugs we had a while ago, see @Haioken's and @Rene_Castberg's contributions in the following thread:

Again, this was associated with early stable channel releases.
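For reference, a minimal way to check for and clear a stuck immutable bit (path assumed from your dockerd error) would be:

# list the attributes of the directory itself; an 'i' flag means immutable
lsattr -d /mnt2/rockons
# clear the immutable flag if it is set
chattr -i /mnt2/rockons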

You have quite dispersed errors, all from a single system and seemingly spanning both your pools (assuming, in that case, that /mnt2/rockons was on your data drives).

You could also check your system's memory, i.e. take a look at our Pre-Install Best Practice (PBP) doc section, specifically the Memory Test (memtest86+) subsection.

I'm just puzzled by so many different and unrelated (except via shared OS parts) errors here.

Let us know how you get on, and think about what might have changed between your prior, presumably more stable, time and what is happening now. For instance, have you increased the load on the PSU - again a common cause of intermittent and seemingly unrelated issues? Does the system have adequate ventilation: dust bunnies etc.?

Hope that helps.

Rockstor is version 3.9.2

Shock horror: rockstor.service has recovered now, after a power cut here of all things - drives mounted and the Web UI working. :man_shrugging: I was in the process of copying everything off to potentially do a fresh install. I'm still left with the Docker not starting issue though.

[root@lazarus ~]# btrfs dev stats /mnt2/rockstor_rockstor00
[/dev/sda3].write_io_errs    0
[/dev/sda3].read_io_errs     0
[/dev/sda3].flush_io_errs    0
[/dev/sda3].corruption_errs  0
[/dev/sda3].generation_errs  0 

chattr -i /mnt2/rockons
solved the Docker issue; now I'm left with this one:

May 12 18:21:30 lazarus dockerd[10720]: User uid:    911
May 12 18:21:30 lazarus dockerd[10720]: User gid:    1000
May 12 18:21:30 lazarus dockerd[10720]: -------------------------------------
May 12 18:21:30 lazarus dockerd[10720]:
May 12 18:21:30 lazarus dockerd[10720]: chown: changing ownership of '/config': Operation not permitted
May 12 18:21:30 lazarus dockerd[10720]: [cont-init.d] 10-adduser: exited 0.
May 12 18:21:30 lazarus dockerd[10720]: [cont-init.d] 30-dbus: executing...
May 12 18:21:30 lazarus dockerd[10720]: chown: changing ownership of '/config': Operation not permitted
May 12 18:21:30 lazarus dockerd[10720]: [cont-init.d] 10-adduser: exited 0.
May 12 18:21:30 lazarus dockerd[10720]: [cont-init.d] 30-config: executing...

However, Docker is now running, and so are all the Docker images.

It's puzzling that it somehow resolved itself, though I did play around a little with re-enabling quotas, so perhaps that helped…

Since I've run out of space, I'm thinking of moving to mirrored or RAID 5; how safe is it to change that?

@Michael_Arthur

Could you post the output of the:

yum info rockstor

command, as there is more info there that is pertinent to some of your reports: the 3.9.2 release as a whole has had over 45 fixes since its release, some fairly significant, like the ability to work with quotas disabled and, a little later, Web-UI elements to disable / enable quotas, plus some fixes for the immutable attribute (chattr -i) stuff.

Glad to hear some positive change is afoot. It would be nice, and potentially useful, to have some feedback on my prior suggestions re ensuring root causes for some of these issues; I fully appreciate that everything takes time, however. But if your hardware is 'flaky', as the range of issues suggests (though not all of them, of course), it's in your interest to do the suggested checks.

Great, and thanks for the update.

Yes, quite possibly; however, to say so for sure we need the output of the above yum command. Quotas were a real pain for Rockstor for a period back there and are still a haunting element of btrfs, mainly on the performance and usage reporting front, i.e. they kill performance but are absolutely required to report usage. But the downside of significantly lower performance is being somewhat alleviated by upstream (the kernel and btrfs folks), so upon our moving to the more progressive base of openSUSE we should have that one sorted, courtesy of upstream, which will be nice.

Also note that there is a delay between enabling quotas and its size reporting coming into effect, and that you will require a later stable version to cope with a quota state change.

Any changes on a system of unknown reliability, as yours appears to be, are ill advised. Try the suggested memtest from my prior post before making any changes, as no operating system or file system is robust to bad memory; hence the common additional precaution of ECC memory in server-grade hardware.

But given the assumption of good hardware, yes, you can change to btrfs raid1 for a close equivalent to traditional mirrored raid. Btrfs raid1 effectively attempts to keep 2 copies of data and metadata on 2 independent devices, irrespective of the number of devices in the given pool. This also affords btrfs the capability, under some circumstances, of self healing: given that all data and metadata are checksummed by default, it can check data/metadata integrity and, given a raid level with multiple copies (2 in raid1/10 currently), it can substitute the known good copy when it encounters a bad one. Scheduled scrubs form an integral, proactive part of this element of btrfs. So in summary, both btrfs raid1 and raid10 keep only 2 copies of data and can each only suffer a single drive failure; I would favour raid1 unless you require the potential performance benefit of btrfs raid10. There are also slightly fewer 'moving parts' in raid1 than in raid10, as there is no striping. Just a thought. Also, I would avoid compression for the time being, simply as it's another 'moving part'. I'm a little cautious that way :slight_smile:

Note however that the btrfs parity raid levels of 5/6 are not recommended, as they are significantly less mature and have incomplete facilities within the filesystem, such as size reporting and repair capability. Btrfs raid1 / raid10 are far more robust / mature, and perform better across the board as a result. So don't use raid5/6 unless you know the risks you are taking.
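If, after the hardware checks, you do decide to move to btrfs raid1, the command-line conversion would look something like the sketch below (pool mount point assumed from your earlier outputs; expect it to take a long time on ~1.5TiB of data, and prefer the Web-UI's pool resize / raid change where available):

# convert both data and metadata to raid1, in place
btrfs balance start -dconvert=raid1 -mconvert=raid1 /mnt2/redundant_storage
# check progress from another terminal
btrfs balance status /mnt2/redundant_storage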

So chuffed your system seems to be on the 'up', but please do report the exact Rockstor version via that command, run as root on a terminal / console, as it may well help to explain a lot. There is as yet no explanation for your failed systemd atd service, which is why I suggested the memory check / PSU consideration in my last post, as those are common components for causing seemingly unrelated issues.

Well done for persevering, but do keep in mind that for forum members to help you they need complete feedback on suggested commands, and a full history if possible, i.e. why were quotas disabled? You also mention a power failure. Power failures are unfriendly but not necessarily disastrous, thanks to btrfs's copy-on-write mechanism and your use of a redundant btrfs raid level. Our current limitation to a single-device system disk is a weakness, but one that hopefully upstream will tackle in time. And if said power failure, or another that preceded it, was the cause of your atd failure, then yes, a re-install and re-import on known good hardware would be nice. Just remember, if you are going the system re-install route, to disconnect all data drives (with the system powered down) so you can be certain human/installer error doesn't result in an install over the top of a current data pool member.

We do have the following 'Reinstalling Rockstor' howto doc entry, which might be useful.

As for disk usage of the different btrfs raid levels, there is the online btrfs calculator:
btrfs disk usage calculator

Hope that helps.

Hi Phil,

Happy to do some digging around to try to work out the root cause.

Loaded plugins: changelog, fastestmirror
Loading mirror speeds from cached hostfile
 * base: mirror.xnet.co.nz
 * epel: mirror.xnet.co.nz
 * extras: mirror.xnet.co.nz
 * rpmfusion-free-updates: ucmirror.canterbury.ac.nz
 * updates: mirror.xnet.co.nz
Installed Packages
Name        : rockstor
Arch        : x86_64
Version     : 3.9.2
Release     : 48
Size        : 85 M
Repo        : installed
From repo   : Rockstor-Stable
Summary     : RockStor -- Store Smartly
License     : GPL
Description : RockStor -- Store Smartly

Here's what I believe happened. First, I ran out of disk space; after a reboot, services started failing to start, and the RAID array in turn failed to be mounted.
After investigation I tried running various btrfs commands to check for drive failure (this was my original assumption as to the cause).
Then I looked at a mixture of disabling/enabling quotas (this likely caused further issues).

I removed enough files to free up space and, having re-enabled quotas, after a reboot the services started correctly again.
Then I ran the chattr command, which allowed writing to the drives again (the stuck flag likely happened due to the power failure).

About a month ago I did have a PSU failure; I replaced the PSU with a brand new one, so I'm going to rule out the power supply. I will try to run a memtest ASAP. I think I did run one after the PSU failure, as the PSU died slowly as opposed to blowing a cap.

Let me know if I’ve missed anything. I’m now also investing in a UPS :smiley:


@Michael_Arthur Nice rundown, thanks.

Super. If you are going to have Rockstor interface directly with this UPS, i.e. the data cable of the UPS plugs into the Rockstor machine (the ideal arrangement for Rockstor), then make sure to get a model that is well supported by NUT: https://networkupstools.org/.
They have a nice Hardware compatibility list page:
https://networkupstools.org/stable-hcl.html

Many models support usbhid-ups, which is one of the most developed, if not the most feature-rich, drivers. It's less feature-rich because it's a bit of a cover-all, but really you mainly just need the basics.

Also take a look at the docs section UPS / NUT Setup, as Rockstor can act to inform other machines on the network of the power state - assuming the switch/network and other machines are also powered by the same UPS, of course. NUT is surprisingly capable that way.
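For orientation, the driver side of a NUT setup boils down to an /etc/ups/ups.conf entry along these lines (a sketch only: the UPS name and description here are made up, and Rockstor's UPS / NUT service configuration normally manages this for you):

# /etc/ups/ups.conf
[nas-ups]
    driver = usbhid-ups
    port = auto
    desc = "NAS UPS"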

Let us know how you get on.