So my current suspicion is that this is my fault, sorry. I recently found and fixed a rather long-standing and elusive bug affecting Rockstor systems with 27 or more drives where the system drive was also named sda (plus a few other caveats). During the development of that fix there were a number of inadvertent side effects on disk recognition. For each one I developed a test to ensure the ‘fixed’ code (at the drive-recognition level) functioned as expected, and as it had done previously. The aim, of course, was to introduce no regressions and only add the fix. In the case of an NVMe system drive I now suspect I missed the mark: there is currently no testing, regression or otherwise, for this arrangement, which did work at some point, as evidenced by your report and others’. The change I suspect was for issue:
and was fixed by the changes in pull request:
which was released in Rockstor stable channel version 3.9.2-31 (20 days ago as of writing).
If you would indulge me a little more, I would like to ask that you provide what I hope will be enough info for me to create a test that reproduces your issue: a missing/detached NVMe system disk, i.e. it was there and, after an update, is no longer recognised as attached. This should help to avoid the same regression going forward and should also help with fixing whatever went wrong in the first place.
The procedure required on your end is a little cumbersome, but given your supplied workaround it should be trivial for you. Essentially I require the same procedure and info as I requested from @kingwavy, referenced in the indicated issue, which in turn grew from the following forum thread:
More specifically, in my 14th May post in that thread:
Repeating the request in this thread for ease, slightly modified for this instance:
Could you also post the output of the following commands:
and when the above command’s output has an empty serial entry such as:
SERIAL=""
we fall back to udev via get_disk_serial(), located here:
which in turn parses the output of the following command. If you could execute this command against one of the drives showing the above empty serial (if any), it may help track this bug down:
udevadm info --name=devname-here
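For example, assuming the NVMe system drive shows up as /dev/nvme0n1 on your machine (that device name is only an assumption; substitute whichever device showed the empty serial above), something along these lines should surface the serial-related udev properties:

udevadm info --name=/dev/nvme0n1

udevadm info --name=/dev/nvme0n1 | grep -i serial

The full, unfiltered output is the most useful, but the grep should make it easy to spot whether udev is reporting any serial properties for the device at all.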
Also, if you could remove the “#” and the following space in front of the following line in your installed version, located here:
/opt/rockstor/src/rockstor/system/osi.py
and then enable debug logging via:
/opt/rockstor/bin/debug-mode on
Then either reboot or restart the rockstor service via:
systemctl restart rockstor
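If it helps, you can confirm the service came back up cleanly after the restart with a standard systemd status check (nothing Rockstor specific, just a generic sanity check):

systemctl status rockstor --no-pager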
We should then be able to see in your logs what scan_disks() is passing to _update_disk_state(), and confirm (or otherwise) my suspicion that the NVMe disk is simply not being parsed correctly by scan_disks(), or at least narrow down where the problem originates. Look in the main Rockstor log for the debug output, either via the UI component in System - Log Manager, or in:
/opt/rockstor/var/log/rockstor.log
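If that log is busy, a rough filter along the following lines may help pull out the relevant entries; the search terms here are just guesses at what the debug lines will mention, so the full section of the log covering the restart is still preferred:

grep -iE 'scan_disks|serial' /opt/rockstor/var/log/rockstor.log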
So, given this is suspected to be a recent regression, it would be good to get it sorted as soon as possible and to add tests so that the same thing doesn’t happen again.
Sorry to have to ask all of this, but this output would be invaluable in assuring us that your particular instance is catered for, as it is currently the only reproducer we have had reported.
Once we have narrowed down what is causing this bug I will open an issue with the relevant details, and a fix can then be logged against that issue.
Thanks again for your help with this one and for helping to support Rockstor development via a stable subscription.