Rockstor-hdparm.service has failed during/after upgrade

Hooverdan · February 11, 2020, 11:24pm

Hi,
not sure this is related to the update to 3.9.2.53 in the stable channel, but that’s when it happened (i.e. CentOS is the operating system).

When I triggered the update, after 5 minutes or so, (based on the pop-up with the countdown that comes up), the connection was lost to the WebUI, and it would not come back. At the same time, I was still connected via SSH (PuTTY) to the server itself, so it was still “running”.

Querying the status, I found, that for some reason a deleted device was causing the hdparm service to fail.

[root@rockstorw ~]# systemctl status

● rockstorw
State: degraded
Jobs: 0 queued
Failed: 1 units
Since: Thu 2020-01-09 17:56:15 PST; 1 months 2 days ago
CGroup: /
├─1 /usr/lib/systemd/systemd --system --deserialize 20
├─docker
│ ├─303a0397c2b44183b5a8072a7e206f5b9f09640f8d81ef5110a7110be4d193de
│ │ ├─ 4491 bash /usr/libexec/netdata/plugins.d/tc-qos-helper.sh 1
│ │ ├─ 9339 /usr/libexec/netdata/plugins.d/apps.plugin 1
│ │ ├─17778 /usr/sbin/netdata -D -u root -s /host -p 19999
│ │ └─22468 /usr/libexec/netdata/plugins.d/go.d.plugin 1
│ ├─28c77fcfcc59cf6b67609fd2f222e28b7799d583d70ec9ada5a5f85c9be48811
│ │ ├─ 6004 s6-svscan -t0 /var/run/s6/services
│ │ ├─ 6065 s6-supervise s6-fdholderd
│ │ ├─ 7451 s6-supervise plex
│ │ ├─11646 /usr/lib/plexmediaserver/Plex Media Server
│ │ ├─11667 Plex Plug-in [com.plexapp.system] /usr/lib/plexmediaserve
│ │ ├─11718 /usr/lib/plexmediaserver/Plex DLNA Server
│ │ ├─11721 /usr/lib/plexmediaserver/Plex Tuner Service /usr/lib/plex
│ │ └─17079 Plex EAE Service
│ ├─3154f5ae28cb47d6b5933d2964b0d4d830ba53eb8f4d0fbf9bca4677dac09776

The looking for the failed item:

[root@rockstorw ~]# systemctl --failed

UNIT LOAD ACTIVE SUB DESCRIPTION
● rockstor-hdparm.service loaded failed failed Rockstor hdparm settings

LOAD = Reflects whether the unit definition was properly loaded.
ACTIVE = The high-level unit activation state, i.e. generalization of SUB.
SUB = The low-level unit activation state, values depend on unit type.

1 loaded units listed. Pass --all to see loaded but inactive units, too.
To show all installed unit files use ‘systemctl list-unit-files’.

[root@rockstorw ~]# systemctl restart rockstor-hdparm

failed again (as I was hoping, it was something like reported a long time ago here:

github.com/rockstor/rockstor-core

failed to start rockstor hdparm settings

opened 11:22AM - 09 Jul 17 UTC

closed 12:15AM - 18 Jul 17 UTC

phillxnet

bug

An as yet unidentified system package update seems to have broken (on some syste…ms) the rockstor-hdparm.service that enacts custom Web-UI configured drive spin down time and APM levels during boot. ``` journalctl -u rockstor-hdparm.service hpp systemd[1]: Starting Rockstor hdparm settings... hpp hdparm[1727]: /dev/disk/by-id/ata-SAMSUNG_HM160HI_S1WWJ9BZ408430: No such file or directory hpp systemd[1]: rockstor-hdparm.service: main process exited, code=exited, status=2/INVALIDARGUMENT hpp systemd[1]: Failed to start Rockstor hdparm settings. hpp systemd[1]: Unit rockstor-hdparm.service entered failed state. hpp systemd[1]: rockstor-hdparm.service failed. ``` It was found that if the same rockstor-hdparm.service file was executed later in the boot process it would succeed: ``` journalctl -u rockstor-hdparm.service ... hpp systemd[1]: Starting Rockstor hdparm settings... hpp systemd[1]: Started Rockstor hdparm settings. ``` A pull request to address this issue is under development.

but alas, that didn’t work either. So, looking at this:

[root@rockstorw ~]# systemctl status rockstor-hdparm.service

● rockstor-hdparm.service - Rockstor hdparm settings
Loaded: loaded (/etc/systemd/system/rockstor-hdparm.service; enabled; vendor preset: disabled)
Active: failed (Result: exit-code) since Tue 2020-02-11 14:25:11 PST; 22s ago
Process: 22554 ExecStart=/usr/sbin/hdparm -q -B125 -q -S240 /dev/disk/by-id/ata-HGST_HDN724040ALE640_PK1334PEHZB3XS (code=exited, status=2)
Main PID: 22554 (code=exited, status=2)

journal -xe

shows this:

Feb 11 14:25:11 rockstorw systemd[1]: Starting Rockstor hdparm settings...
-- Subject: Unit rockstor-hdparm.service has begun start-up
-- Defined-By: systemd
-- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
--
-- Unit rockstor-hdparm.service has begun starting up.
Feb 11 14:25:11 rockstorw hdparm[22554]: /dev/disk/by-id/ata-HGST_HDN724040ALE640_PK1334PEHZB3XS: No such file or directory
Feb 11 14:25:11 rockstorw systemd[1]: rockstor-hdparm.service: main process exited, code=exited, status=2/INVALIDARGUMENT
Feb 11 14:25:11 rockstorw systemd[1]: Failed to start Rockstor hdparm settings.
-- Subject: Unit rockstor-hdparm.service has failed
-- Defined-By: systemd
-- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
--
-- Unit rockstor-hdparm.service has failed.
--
-- The result is failed.
Feb 11 14:25:11 rockstorw systemd[1]: Unit rockstor-hdparm.service entered failed state.
Feb 11 14:25:11 rockstorw systemd[1]: rockstor-hdparm.service failed.
Feb 11 14:25:11 rockstorw polkitd[4284]: Unregistered Authentication Agent for unix-process:22548:283853767 (system bus name :1.2061, object path /org/freedesktop/PolicyKit1/AuthenticationAgent, locale en_US.UTF-8) (disconnected from bus)
Feb 11 14:25:33 rockstorw systemd[1]: Configuration file /etc/systemd/system/rockstor-hdparm.service is marked world-inaccessible. This has no effect as configuration data is accessible via APIs without restrictions. Proceeding anyway.

As you can see above, it seems the hdparm tries to access a device that I removed about a month ago see here: Advice needed on All-Drives replacement - #8 by Hooverdan

I replaced all HGST drives with WD ones … and after each removal/detaching also ensured that I deleted the entry in the disks line.

The only way I can get the UI back is issuing a reboot via the terminal, which I did. The system comes back up, and shows the update still as available:

If @phillxnet or @Flox confirm that it’s update related, I can update the title, but I suspect, this could have happened at any point in time. It seems to me that there’s a zombie entry for the last removed device (I think the ID is from the last drive I then removed to complete my disk replacement journey) somewhere.

No idea how to fix this. Any suggestions? As of now, the box runs fine as far as I can tell, I can access the shares/open files, etc. so during the reboot this does not seem to be an issue, only as a post-upgrade piece …

Hooverdan · February 11, 2020, 11:39pm

Further investigation shows that in the rockstor-hdparm.services files these entries exist (w.r.t. to spindown, etc.), all for drives that have since been disconnected:

cat rockstor-hdparm.service

[Unit]
Description=Rockstor hdparm settings
After=suspend.target
# Rockstor-hdparm - the automated maintenance of this file requires a clear line
# at the end of both the Unit and Service sections, but no other clear lines.
# The line count of the template file in /conf is used to evaluate
# if an ‘all entries removed’ scenario has been reached. If any modification are
# made to the template file please also delete the generated file,
# /etc/systemd/system/rockstor-hdparm.service, and recreate any existing
# configuration via Rockstor’s WebUI.

[Service]
Type=oneshot
ExecStart=/usr/sbin/hdparm -q -B125 -q -S240 /dev/disk/by-id/ata-HGST_HDN724040A                                  LE640_PK1334PEHZB3XS
# 20 minutes
ExecStart=/usr/sbin/hdparm -q -B125 -q -S240 /dev/disk/by-id/ata-HGST_HDN724040A                                  LE640_PK1334PEHZB3ZS
# 20 minutes
ExecStart=/usr/sbin/hdparm -q -B125 -q -S240 /dev/disk/by-id/ata-HGST_HDN724040A                                  LE640_PK1334PEHZNA2S
# 20 minutes
ExecStart=/usr/sbin/hdparm -q -B125 -q -S240 /dev/disk/by-id/ata-HGST_HDN724040A                                  LE640_PK1334PEJ0URKS
# 20 minutes

[Install]
WantedBy=suspend.target basic.target

The current disks did not have any spin down set. For fun I changed them, and they showed up in the service file showing a spin down setting after 20 mins.

Will it be as easy as deleting the zombie entries from the file? If so, I suspect, the “deletion” of a removed disk is forgetting to also remove these settings …

Flox · February 11, 2020, 11:59pm

Hi @Hooverdan,

Thanks a lot for tracking that one down and for such a clear report. It seems you got to the bottom of it, but as you say, I, too, doubt it’s related to the 3.9.2-53 update as we didn’t have anything related to that. There was a database schema update related to the disk, yes, but I don’t see how it was related to it.

I’m unfamiliar with how the rockstor-hdparm.service is dealt with and updated, but I suspect you uncovered a case that might have been there dormant. I’ll let @phillxnet answer you (it may be the wrong timezone for him right now, though) as this is his expertise, but I’ll try to see if I can make sense of the related code and find something helpful.

Thanks again for your report, and I’m glad you’re all sorted now at least.

Hooverdan · February 12, 2020, 12:08am

@Flox,
thanks for lighting fast response
Indeed, I added a couple of spindown settings for the current disks using the WebUI. They showed up in there, alongside the zombie entries. I took a backup of the file, then changed it manually, deleting obsolete entries (/etc/systemd/system/rockstor-hdparm.service)

then executed

systemctl daemon-reload

and then for good measure

systemctl restart rockstor-hdparm.service

and voila, no error messages.

I then went back and executed the upgrade to the 3.9.2-53 version, and … that also went off without a hitch. So, the hypothesis that it’s not related to this update is likely correct, however I assume based on how this update is applied (as opposed to other ones that might require, say, a reboot), the zombie entries can cause a problem.

Let’s see what @phillxnet has to say (or you, if you can make sense of it). I’m more than happy to open an issue, if it comes to that and not just another user error

phillxnet · February 12, 2020, 1:37pm

@Hooverdan Well done on getting that sorted.

That’s the peculiarity there, and the bit we need to work out. There is not dependency on that service from within Rockstor’s other service:

the first to run (rockstor-pre.service):

https://github.com/rockstor/rockstor-core/blob/master/conf/rockstor-pre.service.in

the middle service (rockstor.service):

https://github.com/rockstor/rockstor-core/blob/master/conf/rockstor.service.in

the last to run (rockstor-bootstrap.servcie):

https://github.com/rockstor/rockstor-core/blob/master/conf/rockstor-bootstrap.service.in

So it’s strange that, in the exact same state, this hdparm service files’ outdated entries are fine during boot but end up upsetting / blocking the update process.

Yes, this service file was my way to instantiate hdparm settings (spin-down and APM settings) that were power cycle safe. And I had purposefully kept it independent of the other services as these changes can be made live so they are nothing to do with the other services actually.

Yes, that file is not only a systemd service to instantiate the setting requested (post filter / sanity check of sort via the Web-UI) but it also serves as the database of entered settings, assuming those settings were deemed ‘doable’ by said filters. Hence the strict formatting warning within it. It turns out it’s pretty much impossible to read some of these settings from a dive once they have been requested. So I just stashed the requested setting in that file and read them back into the Web-UI to ‘know’ what they were requested to be. I think, from memory, Synology used a custom kernel module to ‘stash’ some of this info; I used this file.
The original pull request for those wanting to dig into this anomaly is as follows:

github.com/rockstor/rockstor-core

drive power down interface. Fixes #885

rockstor:master ← phillxnet:885_drive_power_down_interface

opened 03:48PM - 25 Apr 16 UTC

phillxnet

+1129 -66

Adds a per drive power down interface / feature based on idle time using the sta…ndard hdparm parameters. Given this is closely related to a drive's Advanced Power Management (APM) setting a display of the current setting and an option to change this setting is also included in the disks table and on the same configuration page. Care has been taken to graphically associate the APM setting with the Power Status (spin down) setting by grouping these columns one after the other and maintaining this relationship on the settings page also. Hdparm is itself an interface to the kernel's own drive control subsystem so we are sticking to core linux utilities for this feature. The hdparm switches used are -C to set idle spin down time and -B to read and set APM level. N.B. It is not possible to read a drives current setting for -C (idle spin down) which complicates matters, the meaning for these values can also vary between drives. The settings used are drawn from man hdparm and are apparently more reliable for newer drives. Hdparm settings are maintained over a reboot but not over a power cycle. To address this a new stand alone rockstor systemd service is introduced (rockstor-hdparm.service). It does not depend on any other rockstor systemd service and no other rockstor service depends on it. The service is used simply to execute the tested hdparm commands that are otherwise executed on demand via the WebUI. That is if no error message and no non zero return code is received from a proposed (via user config entry) hdparm command then that same command is placed in this systemd service unit to be applied on the next boot to address the power cycle loss of these settings otherwise. As idle time / spin down is associated with rotational media we do not apply any setting to a device if it fails an included is_rotational test. This test is based on a drives udevadm info readings for rpm and Automatic Acoustic Management (AAM) readings. The kernels own proc reading of rotation has been frequently reported as currently unreliable hence establishing our own mechanism for this. This function (in system/osi) is considered as conservative and will err on the "not confirmed as rotational" side. If the is_rotational function returns false no setting is applied and user confirmation of this is simply to not update our disk table with any requested settings. This could be improved upon but may benefit from future enhancements in a notification system. As is the following info level message is logged:- "Skipping hdparm settings: device not confirmed as rotational" This should soon be more readily available by improvement in WebUI log view functionality. The core of this feature's function is several light weight property extensions to the disk model that provide the power status, hdparm -C setting, and current APM level to the disks page by way of the additional 2 columns. There is also included a pause function that simply executes hdparm -y to request a drive immediately enters standby mode. No full sleep mode is activated by the included mechanisms (ie hdparm -Y) as such a state requires that a drives interface be re-set and was considered overly heavy handed and potentially more problematic. The active settings change path takes the users input and tests it for sanity then attempt to run the proposed hdparm command one switch at a time; starting with the -B setting as this is more often problematic due to some drives not supporting this APM level set option. If any errors or non zero return codes are received then this setting is not included in the final systemd entry. The -C switch is then also tested in isolation and again if any errors or non zero return codes are encountered then at this stage no entry will be submitted to the systemd file. This means in some circumstances we may have only -C settings for a device but if the -B setting passed our tests then it is by default included; often simply re-affirming the default setting unless the user has specified otherwise. The core method to update the systemd service is implemented in update_hdparm_service which is utilized only by set_disk_spindown that tests the viability of the proposed hdparm switches. To facilitate the option to not include a -B setting we use the 0 value for that switch. This is a symmetrical arrangement in that if an hdparm -B reading fails or returns 'not supported' the reading mechanism conflates these responses into the 0 value (see get_disk_APM_level). This is otherwise an unused APM value (ie 1-255) and so we can maintain a simple interger for this value. This way we avoid applying any -B option to a drive that has otherwise failed to cleanly respond to our -B request or has done so but with a 'not supported' response. Like wise the mechanism is employed to remove any existing -B entry by passing zero as the requested value. The same reading mechanism translates the off reading to 255 which requires our interface logic to translate this back to 'off' but this was considered a cleaner system than dealing with the raw values of 0 (our flag) or 1-255 (written values) or 'off' which represents a reading of the the written value 255. This way we maintain symmetry with reading (using our low level reader) and writing and only translate in the UI for readability. On the UI front the APM settings are disabled by way of their tickbox option and the tick box is also disabled when an APM read of 0 (error or not supported) is encountered. Note that all hdparm commands use the dev by-id names to maintain correct targets over system reboots. Several general purpose mechanism have been included in this pr to facilitate translating the currently db drive.names of sda / sdb to these more stable by-id drive names which are created by the udev subsystem and represent simple symbolic links to their associated sda / sdb type names. It is intended that these translation functions be employed on a wider basis going forward. By default we offer a 20 minute idle spin down time if a prior setting is not found. Given prior settings cannot be established using system calls we establish the single point of truth to be direct reading from the device and the systemd file where reading from the device is not possible (ie with -C settings). This avoids using the django db and was intended to keep the whole hdparm feature as light as possible. If the systemd file does not exist it is created from the empty template held in /opt/rockstor/conf/rockstor-hdparm.service. When newly established the service is also enabled but not started as the associated commands have already been executed by way of the tests prior to their entry in the systemd file. The service being enabled is sufficient to ensure that they will be run on next boot which is all that is needed. If on the other hand no entries are found in an existing /etc/systemd/system/rockstor-hdparm.service file, ie by way of UI instigated edits, then the service is disabled and the service file removed in order to prepare for new settings and the re enabling of the service in that instance. In light of issue #1275 and it's associated pr #1287 which re-factors the disks table creation code modified in this pr I have prepared some handlebar helpers and included them in this pull request, they are however untested as they rely on what is as yet un-merged UI code. Testing of the supplied spin-down values has been mostly successful with values of 20 minutes and less as higher values are less reliable to non functional. It was also found that with no WebUI open the drives are much more reliably suspended, however opening the WebUI and viewing the disks page after drives have entered standy mode and even refreshing this view via browser or via use of the "Rescan" button doesn't then wake the drives that have entered standby. So there appears to be some command active that inhibit the drives entering standby (while the WebUI is open) but doesn't then activate these drives once they are in standby mode. This was a hard one find. Also note that even with the WebUI open multiple drives did successfully enter standby when set to 30 seconds or 1 minute but no longer. However 20 mins and below seem to work find when the WebUI isn't active. This has mostly been tested when viewing the disks page and is quite a time consuming business. Hence the submission of this pr at the beginning of a testing cycle. Reference in the user interface is also made to the recently merged "Task execution time windows" feature. Note: This pr submission text is also intended as the basis of a wiki entry upon successful review and merge. The wiki entry can then serve as a technical manual on this subsystem.

And contains an exposition of how it works. And @Hooverdan your linked issue was a minor addition to the hdparm service file to have it wait for the udev system so it’s by-id names would have been created by the time it runs.

Exactly, and I’ve just found the following:

github.com/rockstor/rockstor-core

spindown config persists on deleted disks

opened 12:25PM - 23 May 16 UTC

phillxnet

If a disk has previously had a spin down / APM setting applied and is then remov…ed and deleted from the system, it's spindown / APM config entry persists in /etc/systemd/system/rockstor-hdparm.service. This is incorrect but should be mostly harmless. The fix is to incorporate a call to system/osi/set_disk_spindown for the deleted device with a spindown_time parameter value of -1 which acts as a flag to remove that device's entry in the systemd file. This call should be executed on all devices just prior to their deletion. N.B. it is not required to do this on devices that are just missing / detached; only when they are deleted from the system. This way we preserve the spin down / APM settings for devices that are detached / reattached. But once deleted it is proper that their spin down / APM settings be removed as with all other details concerning deleted drives. A workaround for now is to delete the /etc/systemd/system/rockstor-hdparm.service file and re-apply any required spin down / APM settings.

which looks to be the ‘tidy up after yourself’ hdparm issue so we just need to work out why we have this update failure when our hdparm service has rouge entries. I’m still not convinced we have an absolute causal relation here but something is happening.

A failure in this hdparm service should not affect any others really. So we have a fragility here and the cause is a likely candidate for another issue, once we know what that cause is of course.

Agreed, I think this was waiting in the wings until the next update came along. There is a slight time difference between an update that has a database migration and one that does not. But from the pull request for that database change we have the normal sequence that is used for these Web-UI initiated rockstor updates:

systemctl stop rockstor
systemctl stop rockstor-pre
/usr/bin/find /opt/rockstor -name "*.pyc" -not -path "*/eggs/*" -type f -delete
zypper --non-interactive refresh
zypper --non-interactive install rockstor
systemctl start rockstor

with the fairly obvious yum counterparts if we find we are running on a ‘rockstor’ linux (CentOS) base.

That code is in system/pkg_mgmt.py:

github.com

rockstor/rockstor-core/blob/master/src/rockstor/system/pkg_mgmt.py#L388-L405


      
                      updates.append(line)
                      log = True
              if new_version is None:
                  logger.debug("No changelog found: trying yum update for info.")
                  # Do a second check which is valid for updates without changelog
                  # updates. eg: same day updates, testing updates.
                  new_version = pkg_latest_available(pkg, machine_arch, distro_id)
                  if new_version is None:
                      new_version = version
              return version, new_version, updates
          
          
          def pkg_latest_available(pkg_name, arch, distro_id):
              """
              Simple wrapper around "yum update pkg_name --assumeno" to retrieve
              the latest version available from "Version" column
              :return:
              """

and the distro specific commands are defined earlier:

github.com

rockstor/rockstor-core/blob/master/src/rockstor/system/pkg_mgmt.py#L373-L387


      
                      .split("rockstor-")[1]
                      .split(".{}".format(machine_arch))[0]
                  )
              if re.match("Changelogs for rockstor-", line) is not None:  # dnf-yum
                  # eg: "Changelogs for rockstor-3.9.2-51.2089.x86_64"
                  new_version = (
                      line.split()[2]
                      .split("rockstor-")[1]
                      .split(".{}".format(machine_arch))[0]
                  )
          if log is True:
              updates.append(line)
              if len(line.strip()) == 0:
                  log = False
          if re.match("\* ", line) is not None:

So from that I don’t yet see why this failure in an essentially unrelated service is tripping up our main rockstor services.

Good.

Thanks for another great report by the way. But bar the above,

I don’t have any more light to shed than yourself or @Flox, and you are both probably more aware of systemd issues than me anyway. So:

So that would be great, but we need a reproducer for this. There should be enough info here hopefully to generate one though. And once we have asserted the root cause we can get a focused issue opened and a resolution sorted.

Hope that helps and well done all.

As an aside, I’m pretty sure one can simply delete this service file and then, ideally:

systemctl daemon-reload

and all that will be lost is the spin-down and APM setting for all current and past drives. Resetting these within the Web-UI should then re-assert the hdparm service this file configures. The file is the ‘source of truth’ from Rockstor’s perspective. And this work around may be safer than editing the file anyway.