Device management in Rockstor

phillxnet · July 15, 2016, 5:36pm

This is a wikified post documenting how Rockstor manages devices and is intended to serve as a live document on the same; user documentation is available at Disks. Please update this developer document when you make any changes to the disk management subsystem within Rockstor’s code. Suggestions on how this might be done better are also very welcome, but please make sure that they are clearly indicated as such; this helps to maintain the relationship between this document and the code. The idea is to aid in on-boarding and to provide a place where those with domain specific knowledge can contribute ideas on how things might be done better. Please see About the Wiki category for overall guidance.

Subtitle: Rockstor’s Serial Obsession

Disk management primer

A prime directive of Rockstor as a storage management system is to track the underlying devices / disks. This subsystem is based primarily on the devices’ serial numbers. This was a design decision which is occasionally questioned with:

“Why don’t you use the UUID to track disks like every other distro / fstab?”

The answer is quite simple. Disks / devices don’t have UUIDs; these belong to the file systems on devices. To meet our base requirement of managing devices so that they might be added to / removed from file systems, btrfs in this case, we need a unique device identifier which is not filesystem dependent. Most distros / fstab configs are managing filesystems, not raw devices.

“So why not use by-id names, ie /dev/disk/by-id/ entries?”

(by-id background: as per Red Hat’s guidelines on Persistent Naming, by-id type names are advised for application level device access due to their boot to boot stable nature.)

The answer: we do (as from 3.8-14.02 onwards), however there is a caveat here. By-id device names are intended as application level boot stable abstractions (soft symbolic links to the canonical short temp_names of, for example, “sdb”). But the udev system that sets up these symbolic links faces the same issue of having to uniquely identify devices irrespective of their filesystem content. Similarly, udev also depends on serial numbers. The upshot of this is that without a serial number a device is essentially anonymous in machine terms, at least until it gains a filesystem, and consequently no by-id name is generated for devices where no serial can be obtained.

But Rockstor is expected to manage / track devices which, in the almost unique case of btrfs pools, can come and go, ie a pool can have devices removed or added. A disk can be a part of one pool one day and, due to operational requirements, be removed from that pool and assigned to a different pool (currently only via manual intervention). So the disk itself, along with various data held about it, should be consistent in Rockstor’s ‘memory’ / tracking system. For example a custom spin down time / SMART options or last sampled SMART data must be firmly associated with the underlying device, not the filesystem / pool. That is why Rockstor’s Disk page displays the following when no serial number is found for a device:

Because without a serial number the device is machine anonymous and also has no by-id name with which to reference it from one boot to another. This is, as explained above, due to udev relying on the serial number as well.

A further caveat here is that udev can provide a bus based name but this would change far too easily by a device simply being moved from one port / bus to another so is not suitable for our purpose. N.B. A device’s by-id and serial can also be affected by such moves but these instances are usually associated with external enclosures / adaptors proxying serial numbers and are a corner case.
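As an illustration of the relationship between by-id names and temp_names described above, the following minimal sketch lists each by-id symbolic link alongside the canonical device path it resolves to. The function name byid_to_canonical is hypothetical and is not part of Rockstor’s code; it only assumes the standard /dev/disk/by-id directory populated by udev.

```python
import os


def byid_to_canonical(byid_dir="/dev/disk/by-id"):
    """Map each by-id symlink name to the canonical path it points at.

    Hypothetical helper for illustration only; not Rockstor's own code.
    Devices for which udev found no serial have no entry in this directory.
    """
    mapping = {}
    if not os.path.isdir(byid_dir):
        return mapping
    for entry in sorted(os.listdir(byid_dir)):
        link = os.path.join(byid_dir, entry)
        if os.path.islink(link):
            # realpath follows the symlink to the temp_name, eg /dev/sdb
            mapping[entry] = os.path.realpath(link)
    return mapping


if __name__ == "__main__":
    for byid_name, dev_path in byid_to_canonical().items():
        print(byid_name, "->", dev_path)
```

A device that only appears under its temp_name and never as a target in this mapping is one of the machine anonymous devices discussed above.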

It is also why devices without a serial are not offered as potential members of a pool and why all SMART calls are disabled for these devices. These design decisions also help with keeping device path management as simple as possible, as there is then no need to deal with alternative paths to temp_name only devices that we can’t track from one boot to another anyway, ie those with only a canonical name that, due to having no serial, are also machine anonymous.

So to summarise, the canonical reference for tracking a device and its settings in Rockstor is that device’s serial number, hence the warning indicated above when this cannot be ascertained. Also, given that by-id names are similarly dependent on a device’s serial, it is often the case that the internal name used by Rockstor for these devices has to fail over to the short temp_name type, ie sdb, which is not boot safe / stable; hence the restrictions on these devices.

Internal Implementation.

This essentially boils down to 2 major processing steps.

scan_disks() in src/rockstor/system/osi.py
_update_disk_state() in src/rockstor/storageadmin/views/disk.py

The arrangement here is that scan_disks() is called by _update_disk_state(), which in turn updates Rockstor’s database of what devices are currently and previously known to the Rockstor instance. This update process is currently repeated continuously in the background using a polling system. See Future Enhancements below.

scan_disks()

This function is called exclusively by _update_disk_state() and has the sole task of returning information about what devices of interest (storage devices) are currently attached to the system. It does initial filtering of uninteresting devices (such as cdroms and devices below a parameter specified size). It also ignores swap partitions.

The mechanism employed by this base method is simply to execute the following command:

lsblk -P -p -o NAME,MODEL,SERIAL,SIZE,TRAN,VENDOR,HCTL,TYPE,FSTYPE,LABEL,UUID

And parse the result.
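As a rough sketch of what parsing this output can look like: lsblk -P emits one line per device made up of KEY="value" pairs, which a regular expression can turn into a dictionary. The sample line below is made up for illustration, and parse_lsblk_line is a hypothetical name; Rockstor’s own parsing in scan_disks() may differ.

```python
import re

# Matches the KEY="value" pairs emitted by lsblk -P (pairs output).
PAIR_RE = re.compile(r'([A-Z0-9_:-]+)="([^"]*)"')


def parse_lsblk_line(line):
    """Return a dict of column name -> value for one lsblk -P line."""
    return dict(PAIR_RE.findall(line))


# Illustrative sample only; not captured from a real system.
sample_line = (
    'NAME="/dev/sda" MODEL="EXAMPLE DISK" SERIAL="ABC123" SIZE="1.8T" '
    'TRAN="sata" VENDOR="ATA" HCTL="0:0:0:0" TYPE="disk" FSTYPE="btrfs" '
    'LABEL="pool1" UUID="00000000-0000-0000-0000-000000000000"'
)

if __name__ == "__main__":
    info = parse_lsblk_line(sample_line)
    print(info["NAME"], info["SERIAL"], info["FSTYPE"])
```

Columns that lsblk cannot fill in, such as a missing SERIAL, appear as empty strings and so remain present as keys with an empty value.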

This data, on currently attached devices only, is packaged into a list of namedtuple constructs, one for each device of interest found. A namedtuple is essentially a tuple whose fields can also be accessed by name. An important element of this initial filtering is identifying the device which currently hosts the Rockstor system itself. This is determined by a helper function in the same file called root_disk(), which in turn works by identifying the device that currently hosts the ‘/’ mount point by examining /proc/mounts.
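The two ideas in this paragraph can be sketched as follows. The Disk field names and the root_device() function are assumptions chosen for illustration; they are not the actual definitions used in osi.py, and the real root_disk() helper may do more work, for example mapping a partition back to its parent disk.

```python
from collections import namedtuple

# Hypothetical, reduced set of fields; the real namedtuple has more.
Disk = namedtuple("Disk", "name model serial size root")


def root_device(mounts_text):
    """Return the /dev/... device mounted at '/', or None if not found.

    mounts_text is the content of /proc/mounts, where each line is:
    device mount_point fstype options dump pass
    """
    for line in mounts_text.splitlines():
        fields = line.split()
        # Skip pseudo entries such as 'rootfs' that are not block devices.
        if len(fields) >= 2 and fields[1] == "/" and fields[0].startswith("/dev/"):
            return fields[0]
    return None


if __name__ == "__main__":
    with open("/proc/mounts") as mounts:
        root = root_device(mounts.read())
    disk = Disk(name=root, model=None, serial=None, size=0, root=True)
    # Fields are reachable by name as well as by position.
    print(disk.name, disk.root, disk[0])
```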

For historical reasons there is an overloading of some elements in the disk data returned, such as the partitioned flag. This has yet to be transitioned over to using a new role based system that has recently been added to the caller of scan_disks() to enable a richer storage of device ‘roles’. So currently the partition flag for the root device is unset, essentially lying to its caller (_update_disk_state()) in order to preserve what was initially a simple copy mechanism within that caller for partition status. This however needs to be updated to allow for greater flexibility going forward.

Another key responsibility of scan_disks() is to extract the serial number for every device. This is usually present in the output of the lsblk command given above, but when it is found not to be there a fail over mechanism will ‘try again’ via the get_disk_serial() helper, which in turn simply parses the output of the following command:

udevadm info --name=device_name

And because udev can hold up to 3 different representations of the serial, there is additional logic (documented within the code) to present the closest counterpart to that normally returned by lsblk, when it works, which is also the closest representation to that found printed on the label of real devices. Note also that some devices are only virtual, ie md devices, and so to cater for these devices there are various fail overs in place, such as substituting an md device’s own unique identifier in place of its non existent ‘serial’; essentially a like for like switch, given every md device is assigned a unique identifier and so can act as a viable substitute for a hardware assigned serial number.
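A minimal sketch of this kind of fail over follows. It parses the ‘E: KEY=value’ property lines printed by udevadm info and picks the first serial-like property present. The preference order shown here (ID_SCSI_SERIAL, then ID_SERIAL_SHORT, then ID_SERIAL) is an assumption for illustration only; the authoritative order and its rationale are in the comments of get_disk_serial() itself. The sample text is invented.

```python
def serial_from_udev_output(text):
    """Pick a serial from 'udevadm info --name=<dev>' output, or None.

    The preference order below is an assumption, not Rockstor's
    documented behaviour.
    """
    props = {}
    for line in text.splitlines():
        # Device properties are printed as 'E: KEY=value'.
        if line.startswith("E: ") and "=" in line:
            key, value = line[3:].split("=", 1)
            props[key] = value
    for key in ("ID_SCSI_SERIAL", "ID_SERIAL_SHORT", "ID_SERIAL"):
        if props.get(key):
            return props[key]
    return None


# Illustrative sample only; not captured from a real system.
sample_output = """N: sdb
E: DEVNAME=/dev/sdb
E: ID_SERIAL=EXAMPLE_MODEL_ABC123
E: ID_SERIAL_SHORT=ABC123
"""

if __name__ == "__main__":
    print(serial_from_udev_output(sample_output))
```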

Fake serials

So having established above that we absolutely require a serial, what do we do if one is not available: we make one up. But the rest of the Rockstor system expects this to be unique and so we make it so by randomly assigning a temporary uuid4 as part of that serial number. But to ease identification of fake from real serial numbers every fake serial takes the following form:

fake-serial-<uuid4>

Hence a simple inspection of any serial can easily identify it as fake, in a human readable form, yet it still maintains a unique nature to keep things simple elsewhere. This move then allows for the assumption of uniqueness of this field when it finally enters Rockstor’s database for management purposes.

At this point all device names are of the canonical temp_name type, ie sda etc, as we are dealing with the current boot only and assessing the ‘now’ status of attached devices.

N.B. the red serial warning displayed in the Disks page is a direct result of a device’s serial entry beginning with ‘fake-serial-’, and for legacy purposes serial numbers that are null or an empty string are also flagged with this warning. But currently no serial entry is expected to be null or empty (N.B. there is currently a rare instance of an observed null serial in the db that is the result of a bug that this document is being written to help address).
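The fake serial scheme and the warning rule described in this section can be sketched as below. The function names are hypothetical and chosen for illustration; only the ‘fake-serial-’ prefix and the use of a uuid4 come from the text above.

```python
import uuid

FAKE_SERIAL_PREFIX = "fake-serial-"


def make_fake_serial():
    """Return a unique placeholder serial of the form fake-serial-<uuid4>."""
    return FAKE_SERIAL_PREFIX + str(uuid.uuid4())


def needs_serial_warning(serial):
    """True for fake serials and, for legacy reasons, null or empty ones."""
    return not serial or serial.startswith(FAKE_SERIAL_PREFIX)


if __name__ == "__main__":
    fake = make_fake_serial()
    print(fake, needs_serial_warning(fake))
    print("ABC123", needs_serial_warning("ABC123"))
```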

_update_disk_state()

This is a static, transaction atomic method to update the database with the current disk data as returned to it by scan_disks(). Its function is to further filter / process this data and update all db entries with what current information is available. For example if a device was not reported by scan_disks() but is found to be in the database (hence the unique device level reference requirement of a serial number) then that device has its offline status changed to “True” to indicate a detached state.

N.B. as devices that are no longer attached also no longer have a system name we follow the fake-serial method of naming and give that device a ‘made up’ name:

detached-<uuid4>

this again ensures uniqueness and also presents an easily identifiable status via the name. In fact every device in the db will (during refresh within this atomic transaction only) temporarily attain a detached-uuid4 type name until it is found to be attached, via a serial match, between the db’s memory of what has been attached and what is now returned as the list of attached devices as reported by scan_disks(). This ensures we have an up to date name against every unique (by serial) device. This transient renaming of the device in the proposed db contents is also the stage where the temp_name of sda type returned by scan_disks is translated to the by-id type name using a helper function in src/rockstor/system/osi.py called get_dev_byid_name() prior to being committed to the db Disk.name field.
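The rename-then-match approach described above can be sketched with plain dictionaries standing in for db rows. This is a simplified model under stated assumptions: the reconcile() name and the dict keys are hypothetical, there is no real database or transaction here, and the by-id translation step is left out.

```python
import uuid


def reconcile(db_disks, scanned):
    """Update db_disks in place from scanned and return it.

    db_disks: list of dicts with 'serial', 'name' and 'offline' keys.
    scanned: dict mapping serial -> current device name.
    """
    # First assume every known device is detached.
    for disk in db_disks:
        disk["name"] = "detached-" + str(uuid.uuid4())
        disk["offline"] = True
    by_serial = {disk["serial"]: disk for disk in db_disks}
    # Then re-attach, by serial match, whatever the scan reported.
    for serial, name in scanned.items():
        disk = by_serial.get(serial)
        if disk is None:
            disk = {"serial": serial}
            db_disks.append(disk)
        disk["name"] = name
        disk["offline"] = False
    return db_disks


if __name__ == "__main__":
    db = [
        {"serial": "AAA", "name": "sda", "offline": False},
        {"serial": "BBB", "name": "sdb", "offline": False},
    ]
    for row in reconcile(db, {"AAA": "sdc", "CCC": "sda"}):
        print(row)
```

In this run the device with serial BBB was not reported by the scan, so it keeps its detached-<uuid4> name and offline status, while AAA picks up its new name and CCC is added as a new entry.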

The role db field, which has only recently been developed / added, is also maintained by the _update_disk_state() function. This db field is intended as a scalable catch all to hold miscellaneous information about a device and is currently only used to maintain bios raid and mdadm raid member status via flags indicating the same returned by scan_disks(), ie ‘isw_raid_member’ and ‘linux_raid_member’. The structure of this field is that of a json string and so is extensible and easily translated to a Python dict.
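As a small sketch of how a json string role field translates to and from a Python dict: the ‘mdraid’ key used below is an assumption made for illustration, as this document does not state the key names used; only the flag values ‘isw_raid_member’ and ‘linux_raid_member’ come from the text above.

```python
import json


def raid_member_role(role_field):
    """Return the raid member flag stored in a role field, or None.

    The 'mdraid' key is a hypothetical name used for illustration.
    """
    if not role_field:
        # A device with no role information has a null / empty field.
        return None
    roles = json.loads(role_field)
    return roles.get("mdraid")


if __name__ == "__main__":
    stored = json.dumps({"mdraid": "linux_raid_member"})
    print(stored)
    print(raid_member_role(stored))
    print(raid_member_role(None))
```

Because the field is a json object, further roles can be added as new keys without a change to the db schema.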

Another key db update that is maintained by _update_disk_state() is that of the SMART availability / enabled status, which is then translated into switches and their state respectively in the last column of the Disks page in the Web-UI. These properties of each device are ascertained by a ‘probe’ mechanism contained within the helper function smart.available (in src/rockstor/system/smart.py), which simply executes a smartctl --info on the given device. However this in itself can lead to an error if the device knows nothing of SMART, and so this error is interpreted as the device having no SMART capability. It does, however, currently produce a log entry which, by the nature of the poll driven calls, is continuously repeated. To keep this log spamming to a minimum while still providing feedback, certain devices are excluded from this probe based on their by-id names, as they are assumed from their name not to support SMART. The current list is contained in the re.match line used to process their by-id names, which contain their overall category, ie virtio- or md-.

Of note is that because Rockstor’s SMART subsystem assumes a device path of /dev/disk/by-id, and we are not able to track devices without a serial, we simply disable all SMART calls to devices without by-id names as well. This is implicitly done by assuming all fake-serial- devices are so because of their non trackable, non by-id name status. Another cover all is that SMART calls are disabled for detached devices. N.B. the SMART calls referred to here are not those initiated / configured by smartmontools itself, ie those configured in smartd.conf, but those initiated by Web-UI update mechanisms and user interaction within Rockstor.
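The exclusion rules above can be condensed into a single predicate, sketched below. The function name should_probe_smart and the exact pattern are assumptions for illustration; only the virtio- and md- prefixes, the fake-serial- rule and the detached rule come from this document, and the real list lives in the re.match line in the code.

```python
import re

# by-id name prefixes assumed not to support SMART (examples from the text).
NO_SMART_PATTERN = "virtio-|md-"


def should_probe_smart(byid_name, serial):
    """Decide whether a smartctl --info probe is worth attempting."""
    if not serial or serial.startswith("fake-serial-"):
        # No trackable serial implies no by-id name to address.
        return False
    if byid_name.startswith("detached-"):
        # Detached devices are not present to be probed.
        return False
    if re.match(NO_SMART_PATTERN, byid_name):
        return False
    return True


if __name__ == "__main__":
    print(should_probe_smart("ata-EXAMPLE_DISK_ABC123", "ABC123"))
    print(should_probe_smart("virtio-ABC123", "ABC123"))
```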

On the first pass of the current db state, _update_disk_state() removes any entry that represents a repeat serial instance; the db doesn’t actually require this field to be unique but the rest of Rockstor expects this to be the case. There is also a removal of all fake serial entries: given their serial is by definition untrackable, such an entry contains no information we wish to maintain. Any devices that originally caused these entries will by now have been assigned a new serial anyway and should be included in the list returned by scan_disks().
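This first pass clean up can be sketched as a simple filter over existing entries. Again plain dicts stand in for db rows and the prune_entries() name is hypothetical. Keeping the first entry seen for a repeated serial is an assumption; this document does not say which of the repeats survives.

```python
def prune_entries(db_disks):
    """Return entries with fake, empty and repeated serials removed."""
    seen = set()
    kept = []
    for disk in db_disks:
        serial = disk.get("serial")
        if not serial or serial.startswith("fake-serial-"):
            # Untrackable entry: nothing worth preserving.
            continue
        if serial in seen:
            # Repeat serial: keep only the first instance (assumption).
            continue
        seen.add(serial)
        kept.append(disk)
    return kept


if __name__ == "__main__":
    rows = [
        {"serial": "AAA", "name": "sda"},
        {"serial": "fake-serial-1234", "name": "sdb"},
        {"serial": "AAA", "name": "sdc"},
    ]
    print(prune_entries(rows))
```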

Given the above, the rough function of _update_disk_state() is to maintain existing records of devices for which a serial was previously recorded, although their name and some other info are updated, and to remove and re-establish db entries for all other devices. This way it maintains a knowledge of only those devices that can be uniquely identified (on a device level only) as well as temporary / transient entries for all other devices of potential interest.

_update_disk_state() and pools

The last major purpose of _update_disk_state() is to track and update the disk / pool relationships but that is beyond the scope of this disk management document and a prime candidate for a pool management document.

Inline code comments

Please note that further code level details of the mechanisms involved here are mostly documented in full within the code itself. Please endeavour to maintain appropriate levels of code comments on any changes as given the foundational nature of this code it is particularly important to explain what is done and why.

Planned Future Enhancements.

A move from poll driven to event driven updates, ie only upon udev events of device changes do we update the db.
Further use of the role field within the db to expand the categorization of disk information, ie for labeling removable disks as ‘backup targets’ or ‘import devices’ or whatever and to simplify the currently non intuitive special case of the root partition not being identified as a partition so that it is not flagged as unusable by the UI components.
Further simplify the code where possible.
Move to only using udev info for serial acquisition, as the current lsblk then udev serial fail over is potentially more fragile. Especially now we use udev for our device by-id names, which are themselves based on udev’s understanding of the serial rather than lsblk’s.

Unresolved issues:-

As referenced earlier there is currently a rare but observed instance of the db containing a null value for the serial. This is a bug and is currently under investigation, as the original author of this doc, and of the most recent re-working of the serial / disk management subsystem, is currently unclear on the mechanism by which this situation arises. It may only have been seen with nvme devices, which currently lack proper udev and SMART support, so it may be that rogue udev rules are the cause. This document will be updated on the resolution of this suspected cause.