Can't start rock-ons

Hi,

I'm running the newest stable release (all updates applied) and I get the following error:

Houston, we’ve had a problem.

Unknown internal error doing a POST to /api/sm/services/docker/start

empty traceback.

I did manage to start it with:
dockerd --log-driver=journald --storage-driver btrfs --storage-opt btrfs.min_space=1G --data-root /mnt2/rockstor_rockstor00/rockons

but I don't know why it stopped working after a reboot (starting it via systemd does not work).

It seems that home00, rockons and root00 are unmounted; I tried enabling quotas and rebooting, but no cigar. I have only run stable updates…

Sorry to read you’re having troubles.
Is this a fresh install in which this never worked, or has it been working before?

Off the top of my head, I would think that some helpful information may reside in the logs:

  • /opt/rockstor/var/log/rockstor.log

Does systemd give you some error information when you try to start the docker service?


It's the unmounted shares that are the reason; I haven't had time to figure out why they don't mount yet. It worked fine before. Manually starting dockerd works; it's just that the default /mnt2/rockons is not mounted. The rest of the shares work without problems.

I get this error in the log:
ERROR [storageadmin.views.command:70] Skipping Pool (rockstor_rockstor00) mount as there are no attached devices. Moving on.

The data is there and the system does boot from that disk (nvme).

@Jorma_Tuomainen Hello again.

OK, that’s good; looks like you are getting to the bottom of the cause for this one.

i.e. the system doesn't recognise any currently attached devices for that pool (it sees them all as detached), and so never proceeds to the share scan and mount associated with that pool (later in that same file, under command == 'bootstrap'), because it thinks no disks are attached.
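To illustrate the behaviour (a minimal hypothetical sketch of the logic just described, not the verbatim Rockstor source; the Pool/Disk classes and helper names here are assumptions):

import logging
from dataclasses import dataclass, field

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

@dataclass
class Disk:
    name: str
    detached: bool = False

@dataclass
class Pool:
    name: str
    disks: list = field(default_factory=list)

def bootstrap_pools(pools):
    # Mirror of the described behaviour: a pool whose member disks all
    # appear detached is skipped before any share scan/mount happens.
    for pool in pools:
        if not any(not d.detached for d in pool.disks):
            logger.error('Skipping Pool (%s) mount as there are no '
                         'attached devices. Moving on.', pool.name)
            continue
        logger.info('Would scan and mount shares for pool %s', pool.name)

# With the system disk mis-detected as detached, only 'Data' gets mounted:
bootstrap_pools([
    Pool('rockstor_rockstor00', [Disk('nvme0n1p4', detached=True)]),
    Pool('Data', [Disk('sda'), Disk('sdb')]),
])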

The simplest workaround for now may be to use another pool for your rock-ons share, as the system pool has a few quirks that are in the process of being improved upon. But let's try some things first.

The code in question is in storageadmin/views/command.py (the source of the 'Skipping Pool' log line you quoted).

As such, could you give us a screen grab of your Disks page and your Pools page?

This may be nvme related (a Rockstor bug), or it may have come from Rockstor failing to adapt to system disk changes post install / first boot. Have there been any system disk changes leading up to this rockstor_rockstor00 anomaly?

Hope that helps; let's see what those pages look like before we move on. It may be that there is a simple workaround, but we first need to see those pages. It's probably best in the long run not to use the system pool for data, but the rockons share is replaceable, and I do the same on one of my machines here just to make better use of its SSD-based system drive.

Yep, well I now run dockerd manually:
dockerd --log-driver=journald --storage-driver btrfs --storage-opt btrfs.min_space=1G --data-root /mnt2/rockstor_rockstor00/rockons

(then just ctrl+z, bg, disown %1)

And I had an uptime of ~50 days before the reboot, so it's some update that happened during that time.

@Jorma_Tuomainen Thanks for sharing your current workaround.

Could you please edit this post so that the images are inline (i.e. not an external reference)? You should be able to drag and drop the images into the forum web page.

Thanks. That way they are displayed inline within the forum web page and should be more stable going forward.

Also could you post your:

yum info rockstor

Just to make sure.

Done and:

Installed Packages
Name : rockstor
Arch : x86_64
Version : 3.9.2
Release : 31
Size : 79 M
Repo : installed
From repo : Rockstor-Stable
Summary : RockStor -- Store Smartly
License : GPL
Description : RockStor -- Store Smartly

@Jorma_Tuomainen Great, and thanks for the edit and version confirmation.

And can you confirm that the rockstor_rockstor00 pool is listed as having only a detached disk under the Disks column: it’s cut off in your second image.

Cheers.

Yep, only that. It’s the nvme root-disk (which is not detached :))

@Jorma_Tuomainen

OK, great.

So my current suspicion is that this is my fault, sorry. I recently found and fixed a rather long-term and elusive bug concerning Rockstor with 27 or more drives when the system drive was also named sda (plus a few other caveats). During the development of this fix there were a number of inadvertent side effects on disk recognition. For each I developed a test to ensure the 'fixed' code (at the drive recognition level) functioned as expected, and as it had done previously. The aim of course was to introduce no regressions and only add the fix. In the case of an nvme system drive I now suspect I missed the mark, and there currently exists no testing, regression or otherwise, for this arrangement: which did work at least at some point, as evidenced by your and others' reports. The change I suspect was for issue:
https://github.com/rockstor/rockstor-core/issues/1925
and was fixed by the changes in pull request:
https://github.com/rockstor/rockstor-core/pull/1946
which was released in Rockstor stable channel version 3.9.2-31 (20 days ago as of writing).

If you would indulge me a little more, I would like to ask that you further provide what I hope will be enough info for me to create a test to reproduce your issue: a missing/detached nvme system disk, i.e. it was there, and after an update it is no longer recognised as attached. This should help avoid the same regression going forward, and should also help with fixing whatever went wrong in the first place.

The procedure required on your end is a little cumbersome, but given your supplied workaround it should be trivial for you. Essentially I require the same procedure and info as I requested from @kingwavy, referenced in the indicated issue, which in turn grew from the following forum thread:

More specifically in my 14th May post in that thread:

Repeating the request in this thread for ease and slightly modified for this instance:

Could you also post the output of the following commands:

btrfs fi show

and

ls -la /dev/disk/by-id/

and

lsblk -P -o NAME,MODEL,SERIAL,SIZE,TRAN,VENDOR,HCTL,TYPE,FSTYPE,LABEL,UUID
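As an aside, a minimal sketch of parsing that lsblk -P key="value" output into per-device dicts (an assumption about the approach, not the verbatim scan_disks() code):

import re
import subprocess

LSBLK = ['lsblk', '-P', '-o',
         'NAME,MODEL,SERIAL,SIZE,TRAN,VENDOR,HCTL,TYPE,FSTYPE,LABEL,UUID']

def lsblk_devices():
    # Each output line is a series of KEY="value" pairs; fold each
    # line into a dict keyed by column name.
    out = subprocess.run(LSBLK, capture_output=True, text=True).stdout
    return [dict(re.findall(r'(\w+)="([^"]*)"', line))
            for line in out.splitlines() if line.strip()]

# e.g. [{'NAME': 'sdb', 'MODEL': 'WDC WD100EFAX-68', 'SERIAL': '7PKNDX1C', ...}, ...]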

and when the lsblk command above has a serial entry such as:

SERIAL=""

we fall back to udev via the get_disk_serial() function (in system/osi.py), which in turn parses the output of the following command. If you could execute this command on one of the drives showing the above empty serial (if any), it may help track this bug down:

udevadm info --name=devname-here
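By way of illustration, the udev fallback just described might look something like this (a sketch only; not the verbatim get_disk_serial() implementation):

import subprocess

def udev_serial(device_name):
    # When lsblk reports SERIAL="", query udev directly and pick out
    # the ID_SERIAL_SHORT property, as described above.
    out = subprocess.run(
        ['udevadm', 'info', '--name={}'.format(device_name)],
        capture_output=True, text=True, check=True).stdout
    for line in out.splitlines():
        if line.startswith('E: ID_SERIAL_SHORT='):
            return line.partition('=')[2].strip()
    return ''

# e.g. udev_serial('nvme0n1') -> 'BTPY72910KCW128A' on the system shown below.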

Also, if you could remove the "#" and following space in front of the relevant debug logging line in your installed version, located here:

/opt/rockstor/src/rockstor/system/osi.py

and then enable debug logging via:

/opt/rockstor/bin/debug-mode on

Then either reboot or restart the rockstor service via:

systemctl restart rockstor

We should then be able to see in your logs what scan_disks() is passing to _update_disk_state(), and confirm (or otherwise) my suspicion that the nvme disk is simply not being parsed correctly by scan_disks(), or at least narrow down where the problem originates. Look in the main rockstor log for the debug output, either via the UI component in System - Log Manager, or in:

/opt/rockstor/var/log/rockstor.log
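For orientation, a rough sketch of the handoff being described (the function bodies here are invented stand-ins for illustration; only the names and the shape of the debug line come from this thread):

import logging

logging.basicConfig(level=logging.DEBUG)
logger = logging.getLogger('system.osi')

def scan_disks(min_size):
    # Stand-in: the real function parses lsblk output into Disk tuples;
    # the suspicion in this thread is that the nvme device never makes
    # it into this list.
    return [{'name': 'sda'}, {'name': 'sdb'}]

def _update_disk_state():
    for d in scan_disks(min_size=1048576):
        # the uncommented debug line produces output of this shape:
        logger.debug('disks item = %s', d)
    # The db is then reconciled: devices present in the db but absent
    # from scan_disks() end up flagged as detached.

_update_disk_state()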

So, given this is suspected to be a recent regression, it would be good to get this one sorted as soon as possible, and to have some tests so that the same doesn't happen again.

Sorry to have to ask all of this, but it would be invaluable to get this output so that we can be assured that your particular instance is catered for, as it is currently the only reproducer we have had reported.

Once we narrow down what is causing this bug I will open an issue with the relevant details and a fix can then be logged against that issue.

Thanks again for your help with this one and for helping to support Rockstor development via a stable subscription.

Here:

[root@rockstor ~]# btrfs fi show
Label: 'rockstor_rockstor00' uuid: 4a05477f-cd4a-4614-b264-d029d98928ab
Total devices 1 FS bytes used 7.46GiB
devid 1 size 110.81GiB used 52.02GiB path /dev/nvme0n1p4

Label: 'Data' uuid: d2f76ce6-85fd-4615-b4f8-77e1b6a69c60
Total devices 2 FS bytes used 6.56TiB
devid 1 size 9.10TiB used 6.58TiB path /dev/sdb
devid 2 size 9.10TiB used 6.58TiB path /dev/sda

[root@rockstor ~]# ls -la /dev/disk/by-id/
total 0
drwxr-xr-x 2 root root 320 Aug 4 18:56 .
drwxr-xr-x 8 root root 160 Aug 4 18:56 ..
lrwxrwxrwx 1 root root 9 Aug 4 18:56 ata-WDC_WD100EFAX-68LHPN0_7PKNDX1C -> ../../sdb
lrwxrwxrwx 1 root root 9 Aug 4 18:56 ata-WDC_WD100EFAX-68LHPN0_7PKP0MNC -> ../../sda
lrwxrwxrwx 1 root root 13 Aug 4 18:56 nvme-eui.0000000001000000e4d25c19f1e04d01 -> ../../nvme0n1
lrwxrwxrwx 1 root root 15 Aug 4 18:56 nvme-eui.0000000001000000e4d25c19f1e04d01-part1 -> ../../nvme0n1p1
lrwxrwxrwx 1 root root 15 Aug 4 18:56 nvme-eui.0000000001000000e4d25c19f1e04d01-part2 -> ../../nvme0n1p2
lrwxrwxrwx 1 root root 15 Aug 4 18:56 nvme-eui.0000000001000000e4d25c19f1e04d01-part3 -> ../../nvme0n1p3
lrwxrwxrwx 1 root root 15 Aug 4 18:56 nvme-eui.0000000001000000e4d25c19f1e04d01-part4 -> ../../nvme0n1p4
lrwxrwxrwx 1 root root 13 Aug 4 18:56 nvme-INTEL_SSDPEKKW128G7_BTPY72910KCW128A -> ../../nvme0n1
lrwxrwxrwx 1 root root 15 Aug 4 18:56 nvme-INTEL_SSDPEKKW128G7_BTPY72910KCW128A-part1 -> ../../nvme0n1p1
lrwxrwxrwx 1 root root 15 Aug 4 18:56 nvme-INTEL_SSDPEKKW128G7_BTPY72910KCW128A-part2 -> ../../nvme0n1p2
lrwxrwxrwx 1 root root 15 Aug 4 18:56 nvme-INTEL_SSDPEKKW128G7_BTPY72910KCW128A-part3 -> ../../nvme0n1p3
lrwxrwxrwx 1 root root 15 Aug 4 18:56 nvme-INTEL_SSDPEKKW128G7_BTPY72910KCW128A-part4 -> ../../nvme0n1p4
lrwxrwxrwx 1 root root 9 Aug 4 18:56 wwn-0x5000cca251f38e60 -> ../../sdb
lrwxrwxrwx 1 root root 9 Aug 4 18:56 wwn-0x5000cca251f3d4af -> ../../sda

[root@rockstor ~]# lsblk -P -o NAME,MODEL,SERIAL,SIZE,TRAN,VENDOR,HCTL,TYPE,FSTYPE,LABEL,UUID
NAME="sdb" MODEL="WDC WD100EFAX-68" SERIAL="7PKNDX1C" SIZE="9.1T" TRAN="sata" VENDOR="ATA " HCTL="1:0:0:0" TYPE="disk" FSTYPE="btrfs" LABEL="Data" UUID="d2f76ce6-85fd-4615-b4f8-77e1b6a69c60"
NAME="sda" MODEL="WDC WD100EFAX-68" SERIAL="7PKP0MNC" SIZE="9.1T" TRAN="sata" VENDOR="ATA " HCTL="0:0:0:0" TYPE="disk" FSTYPE="btrfs" LABEL="Data" UUID="d2f76ce6-85fd-4615-b4f8-77e1b6a69c60"
NAME="nvme0n1" MODEL="INTEL SSDPEKKW128G7 " SERIAL="BTPY72910KCW128A" SIZE="119.2G" TRAN="" VENDOR="" HCTL="" TYPE="disk" FSTYPE="" LABEL="" UUID=""
NAME="nvme0n1p3" MODEL="" SERIAL="" SIZE="7.8G" TRAN="" VENDOR="" HCTL="" TYPE="part" FSTYPE="swap" LABEL="" UUID="d33115d8-3d8c-4f65-b560-8ebf72d08fbc"
NAME="nvme0n1p1" MODEL="" SERIAL="" SIZE="200M" TRAN="" VENDOR="" HCTL="" TYPE="part" FSTYPE="vfat" LABEL="" UUID="53DC-1323"
NAME="nvme0n1p4" MODEL="" SERIAL="" SIZE="110.8G" TRAN="" VENDOR="" HCTL="" TYPE="part" FSTYPE="btrfs" LABEL="rockstor_rockstor00" UUID="4a05477f-cd4a-4614-b264-d029d98928ab"
NAME="nvme0n1p2" MODEL="" SERIAL="" SIZE="500M" TRAN="" VENDOR="" HCTL="" TYPE="part" FSTYPE="ext4" LABEL="" UUID="497a9eda-a655-4fc4-bad8-2d9aa8661980"

[root@rockstor ~]# udevadm info --name=nvme0n1p3
P: /devices/pci0000:00/0000:00:1c.4/0000:03:00.0/nvme/nvme0/nvme0n1/nvme0n1p3
N: nvme0n1p3
S: disk/by-id/nvme-INTEL_SSDPEKKW128G7_BTPY72910KCW128A-part3
S: disk/by-id/nvme-eui.0000000001000000e4d25c19f1e04d01-part3
S: disk/by-partuuid/e1a514c1-5b8c-41af-980a-5aa8a35ecef9
S: disk/by-path/pci-0000:03:00.0-nvme-1-part3
S: disk/by-uuid/d33115d8-3d8c-4f65-b560-8ebf72d08fbc
E: DEVLINKS=/dev/disk/by-id/nvme-INTEL_SSDPEKKW128G7_BTPY72910KCW128A-part3 /dev/disk/by-id/nvme-eui.0000000001000000e4d25c19f1e04d01-part3 /dev/disk/by-partuuid/e1a514c1-5b8c-41af-980a-5aa8a35ecef9 /dev/disk/by-path/pci-0000:03:00.0-nvme-1-part3 /dev/disk/by-uuid/d33115d8-3d8c-4f65-b560-8ebf72d08fbc
E: DEVNAME=/dev/nvme0n1p3
E: DEVPATH=/devices/pci0000:00/0000:00:1c.4/0000:03:00.0/nvme/nvme0/nvme0n1/nvme0n1p3
E: DEVTYPE=partition
E: ID_FS_TYPE=swap
E: ID_FS_USAGE=other
E: ID_FS_UUID=d33115d8-3d8c-4f65-b560-8ebf72d08fbc
E: ID_FS_UUID_ENC=d33115d8-3d8c-4f65-b560-8ebf72d08fbc
E: ID_FS_VERSION=2
E: ID_PART_ENTRY_DISK=259:0
E: ID_PART_ENTRY_NUMBER=3
E: ID_PART_ENTRY_OFFSET=1435648
E: ID_PART_ENTRY_SCHEME=gpt
E: ID_PART_ENTRY_SIZE=16252928
E: ID_PART_ENTRY_TYPE=0657fd6d-a4ab-43c4-84e5-0933c84b4f4f
E: ID_PART_ENTRY_UUID=e1a514c1-5b8c-41af-980a-5aa8a35ecef9
E: ID_PART_TABLE_TYPE=gpt
E: ID_PATH=pci-0000:03:00.0-nvme-1
E: ID_PATH_TAG=pci-0000_03_00_0-nvme-1
E: ID_SERIAL=INTEL SSDPEKKW128G7_BTPY72910KCW128A
E: ID_SERIAL_SHORT=BTPY72910KCW128A
E: ID_WWN=eui.0000000001000000e4d25c19f1e04d01
E: MAJOR=259
E: MINOR=3
E: PARTN=3
E: SUBSYSTEM=block
E: TAGS=:systemd:
E: USEC_INITIALIZED=72618

Nothing in the logs about the nvme disk, only sda and sdb (the hdds).

I’ll try reboot later.

@Jorma_Tuomainen Thanks for these.

And given the rockstor_rockstor00 pool (vol) is on p4 (p3 is the swap partition, which we exclude by another mechanism):

Could I also have:

udevadm info --name=nvme0n1p4

i.e. for the entry showing the empty SERIAL="" in your lsblk output above:

The info that I need will only show up after the edit I requested on that source file, and then only once debug mode is enabled and the rockstor service restarted (or, as you say, after a reboot). Both the edit (uncommenting that debug logging line) and the service restart (by whatever means) are required for the log entries to start appearing; they should then appear directly after every Disks, Pools, or Shares page refresh, and on every boot.

They should look something like:

disks item = Disk(name='sda3', model='PERC H710', serial='6848f690e936450018b7c3a11330997b', size=277558067, transport=None, vendor='DELL', hctl='0:2:0:0', type='part', fstype='btrfs', label='rockstor_rockstor', uuid='7f7acdd7-493e-4bb5-b801-b7b7dc289535', parted=True, root=True, partitions={}) 

except relevant to your own disks, and there should be one for each 'relevant' device.
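For context, those lines read like instances of a named tuple along these lines (field names inferred from the printed output above; the exact definition in osi.py is an assumption):

import collections

# Field names taken from the 'disks item = Disk(...)' log line above.
Disk = collections.namedtuple('Disk', [
    'name', 'model', 'serial', 'size', 'transport', 'vendor', 'hctl',
    'type', 'fstype', 'label', 'uuid', 'parted', 'root', 'partitions'])

example = Disk(name='sda3', model='PERC H710',
               serial='6848f690e936450018b7c3a11330997b', size=277558067,
               transport=None, vendor='DELL', hctl='0:2:0:0', type='part',
               fstype='btrfs', label='rockstor_rockstor',
               uuid='7f7acdd7-493e-4bb5-b801-b7b7dc289535',
               parted=True, root=True, partitions={})
print(example)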

If you could paste the complete list generated after each Disks page refresh, that should do it. I'm expecting that there are in fact only sda and sdb entries, but it is best to paste the exact output so I can create a test to reproduce this and start working on a fix. Plus I need absolute confirmation that the nvme partition is not presented along with your data drives, to definitively pin down the location of the bug.

Thanks for your time on this one.

Bit by bit.

[root@rockstor ~]# udevadm info --name=nvme0n1p4
P: /devices/pci0000:00/0000:00:1c.4/0000:03:00.0/nvme/nvme0/nvme0n1/nvme0n1p4
N: nvme0n1p4
S: disk/by-id/nvme-INTEL_SSDPEKKW128G7_BTPY72910KCW128A-part4
S: disk/by-id/nvme-eui.0000000001000000e4d25c19f1e04d01-part4
S: disk/by-label/rockstor_rockstor00
S: disk/by-partuuid/87e031b7-6353-44af-bbc7-b674d8a8c85e
S: disk/by-path/pci-0000:03:00.0-nvme-1-part4
S: disk/by-uuid/4a05477f-cd4a-4614-b264-d029d98928ab
E: DEVLINKS=/dev/disk/by-id/nvme-INTEL_SSDPEKKW128G7_BTPY72910KCW128A-part4 /dev/disk/by-id/nvme-eui.0000000001000000e4d25c19f1e04d01-part4 /dev/disk/by-label/rockstor_rockstor00 /dev/disk/by-partuuid/87e031b7-6353-44af-bbc7-b674d8a8c85e /dev/disk/by-path/pci-0000:03:00.0-nvme-1-part4 /dev/disk/by-uuid/4a05477f-cd4a-4614-b264-d029d98928ab
E: DEVNAME=/dev/nvme0n1p4
E: DEVPATH=/devices/pci0000:00/0000:00:1c.4/0000:03:00.0/nvme/nvme0/nvme0n1/nvme0n1p4
E: DEVTYPE=partition
E: ID_FS_LABEL=rockstor_rockstor00
E: ID_FS_LABEL_ENC=rockstor_rockstor00
E: ID_FS_TYPE=btrfs
E: ID_FS_USAGE=filesystem
E: ID_FS_UUID=4a05477f-cd4a-4614-b264-d029d98928ab
E: ID_FS_UUID_ENC=4a05477f-cd4a-4614-b264-d029d98928ab
E: ID_FS_UUID_SUB=05b3b2a3-25c3-4593-b922-2e3180122b52
E: ID_FS_UUID_SUB_ENC=05b3b2a3-25c3-4593-b922-2e3180122b52
E: ID_PART_ENTRY_DISK=259:0
E: ID_PART_ENTRY_NUMBER=4
E: ID_PART_ENTRY_OFFSET=17688576
E: ID_PART_ENTRY_SCHEME=gpt
E: ID_PART_ENTRY_SIZE=232380416
E: ID_PART_ENTRY_TYPE=ebd0a0a2-b9e5-4433-87c0-68b6b72699c7
E: ID_PART_ENTRY_UUID=87e031b7-6353-44af-bbc7-b674d8a8c85e
E: ID_PART_TABLE_TYPE=gpt
E: ID_PATH=pci-0000:03:00.0-nvme-1
E: ID_PATH_TAG=pci-0000_03_00_0-nvme-1
E: ID_SERIAL=INTEL SSDPEKKW128G7_BTPY72910KCW128A
E: ID_SERIAL_SHORT=BTPY72910KCW128A
E: ID_WWN=eui.0000000001000000e4d25c19f1e04d01
E: MAJOR=259
E: MINOR=4
E: PARTN=4
E: SUBSYSTEM=block
E: TAGS=:systemd:
E: USEC_INITIALIZED=72776

I did that, but without a reboot (only the code change, enabling logging, and restarting rockstor) I get only:
DEBUG [system.osi:494] disks item = Disk(name='sda', model='WDC WD100EFAX-68', serial='7PKP0MNC', size=9771050598, transport='sata', vendor='ATA', hctl='0:0:0:0', type='disk', fstype='btrfs', label='Data', uuid='d2f76ce6-85fd-4615-b4f8-77e1b6a69c60', parted=False, root=False, partitions={})
DEBUG [system.osi:494] disks item = Disk(name='sdb', model='WDC WD100EFAX-68', serial='7PKNDX1C', size=9771050598, transport='sata', vendor='ATA', hctl='1:0:0:0', type='disk', fstype='btrfs', label='Data', uuid='d2f76ce6-85fd-4615-b4f8-77e1b6a69c60', parted=False, root=False, partitions={})

nothing about nvme disk, only the:
ERROR [storageadmin.views.command:70] Skipping Pool (rockstor_rockstor00) mount as there are no attached devices. Moving on.

@Jorma_Tuomainen This is great work, thanks.

I should now have enough to get under way with this issue, but I have two more issues (currently) pending in my queue.

It now very much looks like the issue is with scan_disks() failing to identify the nvme device as the system disk, hence the:

and the higher-order code segments consider it 'detached' as a result: since it was once known to them.
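As a purely hypothetical illustration of the kind of pitfall involved (not the confirmed root cause): nvme partition names don't follow the sdXN pattern, so naive parent-device matching can go wrong:

import re

def naive_base_dev(part_name):
    # Stripping trailing digits finds the parent for 'sda3' -> 'sda',
    # but nvme partitions use a 'p' separator, so the result is wrong.
    return re.sub(r'\d+$', '', part_name)

print(naive_base_dev('sda3'))       # sda       (correct)
print(naive_base_dev('nvme0n1p4'))  # nvme0n1p  (wrong: parent is nvme0n1)
# An unmatched parent would leave the pool with no attached devices,
# which is exactly the 'detached' symptom seen here.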

I will, once I've started, create an issue detailing my initial findings and link back to this forum thread. I shall then update this thread upon the issue's resolution.

It would be great if you could hang in there just a little longer while I sort this one out, as a 'real life' confirmation of the fix at your end would be pukka.

Thanks again, and I hope to get on to this issue as time allows: unless of course someone beats me to it, that is.

Not in that big of a hurry since I have a workaround for docker :slight_smile:

@Jorma_Tuomainen OK and cheers.

I'd like to get it sorted sooner rather than later though, as it's a regression and the associated code / tests are still fairly fresh in my head, so I'll see how it goes.

As always many things to juggle.