Rockstor-hdparm.service has failed during/after upgrade

Hi,
not sure this is related to the update to 3.9.2.53 in the stable channel, but that’s when it happened (i.e. CentOS is the operating system).

When I triggered the update, after 5 minutes or so, (based on the pop-up with the countdown that comes up), the connection was lost to the WebUI, and it would not come back. At the same time, I was still connected via SSH (PuTTY) to the server itself, so it was still “running”.

Querying the status, I found, that for some reason a deleted device was causing the hdparm service to fail.

[root@rockstorw ~]# systemctl status

● rockstorw
State: degraded
Jobs: 0 queued
Failed: 1 units
Since: Thu 2020-01-09 17:56:15 PST; 1 months 2 days ago
CGroup: /
├─1 /usr/lib/systemd/systemd --system --deserialize 20
├─docker
│ ├─303a0397c2b44183b5a8072a7e206f5b9f09640f8d81ef5110a7110be4d193de
│ │ ├─ 4491 bash /usr/libexec/netdata/plugins.d/tc-qos-helper.sh 1
│ │ ├─ 9339 /usr/libexec/netdata/plugins.d/apps.plugin 1
│ │ ├─17778 /usr/sbin/netdata -D -u root -s /host -p 19999
│ │ └─22468 /usr/libexec/netdata/plugins.d/go.d.plugin 1
│ ├─28c77fcfcc59cf6b67609fd2f222e28b7799d583d70ec9ada5a5f85c9be48811
│ │ ├─ 6004 s6-svscan -t0 /var/run/s6/services
│ │ ├─ 6065 s6-supervise s6-fdholderd
│ │ ├─ 7451 s6-supervise plex
│ │ ├─11646 /usr/lib/plexmediaserver/Plex Media Server
│ │ ├─11667 Plex Plug-in [com.plexapp.system] /usr/lib/plexmediaserve
│ │ ├─11718 /usr/lib/plexmediaserver/Plex DLNA Server
│ │ ├─11721 /usr/lib/plexmediaserver/Plex Tuner Service /usr/lib/plex
│ │ └─17079 Plex EAE Service
│ ├─3154f5ae28cb47d6b5933d2964b0d4d830ba53eb8f4d0fbf9bca4677dac09776

The looking for the failed item:

[root@rockstorw ~]# systemctl --failed

UNIT LOAD ACTIVE SUB DESCRIPTION
● rockstor-hdparm.service loaded failed failed Rockstor hdparm settings

LOAD = Reflects whether the unit definition was properly loaded.
ACTIVE = The high-level unit activation state, i.e. generalization of SUB.
SUB = The low-level unit activation state, values depend on unit type.

1 loaded units listed. Pass --all to see loaded but inactive units, too.
To show all installed unit files use ‘systemctl list-unit-files’.

[root@rockstorw ~]# systemctl restart rockstor-hdparm

failed again (as I was hoping, it was something like reported a long time ago here:

but alas, that didn’t work either. So, looking at this:

[root@rockstorw ~]# systemctl status rockstor-hdparm.service

● rockstor-hdparm.service - Rockstor hdparm settings
Loaded: loaded (/etc/systemd/system/rockstor-hdparm.service; enabled; vendor preset: disabled)
Active: failed (Result: exit-code) since Tue 2020-02-11 14:25:11 PST; 22s ago
Process: 22554 ExecStart=/usr/sbin/hdparm -q -B125 -q -S240 /dev/disk/by-id/ata-HGST_HDN724040ALE640_PK1334PEHZB3XS (code=exited, status=2)
Main PID: 22554 (code=exited, status=2)

journal -xe 

shows this:

Feb 11 14:25:11 rockstorw systemd[1]: Starting Rockstor hdparm settings...
-- Subject: Unit rockstor-hdparm.service has begun start-up
-- Defined-By: systemd
-- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
--
-- Unit rockstor-hdparm.service has begun starting up.
Feb 11 14:25:11 rockstorw hdparm[22554]: /dev/disk/by-id/ata-HGST_HDN724040ALE640_PK1334PEHZB3XS: No such file or directory
Feb 11 14:25:11 rockstorw systemd[1]: rockstor-hdparm.service: main process exited, code=exited, status=2/INVALIDARGUMENT
Feb 11 14:25:11 rockstorw systemd[1]: Failed to start Rockstor hdparm settings.
-- Subject: Unit rockstor-hdparm.service has failed
-- Defined-By: systemd
-- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
--
-- Unit rockstor-hdparm.service has failed.
--
-- The result is failed.
Feb 11 14:25:11 rockstorw systemd[1]: Unit rockstor-hdparm.service entered failed state.
Feb 11 14:25:11 rockstorw systemd[1]: rockstor-hdparm.service failed.
Feb 11 14:25:11 rockstorw polkitd[4284]: Unregistered Authentication Agent for unix-process:22548:283853767 (system bus name :1.2061, object path /org/freedesktop/PolicyKit1/AuthenticationAgent, locale en_US.UTF-8) (disconnected from bus)
Feb 11 14:25:33 rockstorw systemd[1]: Configuration file /etc/systemd/system/rockstor-hdparm.service is marked world-inaccessible. This has no effect as configuration data is accessible via APIs without restrictions. Proceeding anyway.

As you can see above, it seems the hdparm tries to access a device that I removed about a month ago see here: Advice needed on All-Drives replacement

I replaced all HGST drives with WD ones … and after each removal/detaching also ensured that I deleted the entry in the disks line.

The only way I can get the UI back is issuing a reboot via the terminal, which I did. The system comes back up, and shows the update still as available:

image

If @phillxnet or @Flox confirm that it’s update related, I can update the title, but I suspect, this could have happened at any point in time. It seems to me that there’s a zombie entry for the last removed device (I think the ID is from the last drive I then removed to complete my disk replacement journey) somewhere.

No idea how to fix this. Any suggestions? As of now, the box runs fine as far as I can tell, I can access the shares/open files, etc. so during the reboot this does not seem to be an issue, only as a post-upgrade piece …

2 Likes

Further investigation shows that in the rockstor-hdparm.services files these entries exist (w.r.t. to spindown, etc.), all for drives that have since been disconnected:

cat rockstor-hdparm.service

[Unit]
Description=Rockstor hdparm settings
After=suspend.target
# Rockstor-hdparm - the automated maintenance of this file requires a clear line
# at the end of both the Unit and Service sections, but no other clear lines.
# The line count of the template file in /conf is used to evaluate
# if an ‘all entries removed’ scenario has been reached. If any modification are
# made to the template file please also delete the generated file,
# /etc/systemd/system/rockstor-hdparm.service, and recreate any existing
# configuration via Rockstor’s WebUI.

[Service]
Type=oneshot
ExecStart=/usr/sbin/hdparm -q -B125 -q -S240 /dev/disk/by-id/ata-HGST_HDN724040A                                  LE640_PK1334PEHZB3XS
# 20 minutes
ExecStart=/usr/sbin/hdparm -q -B125 -q -S240 /dev/disk/by-id/ata-HGST_HDN724040A                                  LE640_PK1334PEHZB3ZS
# 20 minutes
ExecStart=/usr/sbin/hdparm -q -B125 -q -S240 /dev/disk/by-id/ata-HGST_HDN724040A                                  LE640_PK1334PEHZNA2S
# 20 minutes
ExecStart=/usr/sbin/hdparm -q -B125 -q -S240 /dev/disk/by-id/ata-HGST_HDN724040A                                  LE640_PK1334PEJ0URKS
# 20 minutes

[Install]
WantedBy=suspend.target basic.target

The current disks did not have any spin down set. For fun I changed them, and they showed up in the service file showing a spin down setting after 20 mins.

Will it be as easy as deleting the zombie entries from the file? If so, I suspect, the “deletion” of a removed disk is forgetting to also remove these settings …

1 Like

Hi @Hooverdan,

Thanks a lot for tracking that one down and for such a clear report. It seems you got to the bottom of it, but as you say, I, too, doubt it’s related to the 3.9.2-53 update as we didn’t have anything related to that. There was a database schema update related to the disk, yes, but I don’t see how it was related to it.

I’m unfamiliar with how the rockstor-hdparm.service is dealt with and updated, but I suspect you uncovered a case that might have been there dormant. I’ll let @phillxnet answer you (it may be the wrong timezone for him right now, though) as this is his expertise, but I’ll try to see if I can make sense of the related code and find something helpful.

Thanks again for your report, and I’m glad you’re all sorted now at least.

2 Likes

@Flox,
thanks for lighting fast response :slight_smile:
Indeed, I added a couple of spindown settings for the current disks using the WebUI. They showed up in there, alongside the zombie entries. I took a backup of the file, then changed it manually, deleting obsolete entries (/etc/systemd/system/rockstor-hdparm.service)

then executed

systemctl daemon-reload

and then for good measure

systemctl restart rockstor-hdparm.service

and voila, no error messages.

I then went back and executed the upgrade to the 3.9.2-53 version, and … that also went off without a hitch. So, the hypothesis that it’s not related to this update is likely correct, however I assume based on how this update is applied (as opposed to other ones that might require, say, a reboot), the zombie entries can cause a problem.

Let’s see what @phillxnet has to say (or you, if you can make sense of it). I’m more than happy to open an issue, if it comes to that and not just another user error :slight_smile:

1 Like

@Hooverdan Well done on getting that sorted.

That’s the peculiarity there, and the bit we need to work out. There is not dependency on that service from within Rockstor’s other service:

  • the first to run (rockstor-pre.service):
  • the middle service (rockstor.service):
  • the last to run (rockstor-bootstrap.servcie):

So it’s strange that, in the exact same state, this hdparm service files’ outdated entries are fine during boot but end up upsetting / blocking the update process.

Yes, this service file was my way to instantiate hdparm settings (spin-down and APM settings) that were power cycle safe. And I had purposefully kept it independent of the other services as these changes can be made live so they are nothing to do with the other services actually.

Yes, that file is not only a systemd service to instantiate the setting requested (post filter / sanity check of sort via the Web-UI) but it also serves as the database of entered settings, assuming those settings were deemed ‘doable’ by said filters. Hence the strict formatting warning within it. It turns out it’s pretty much impossible to read some of these settings from a dive once they have been requested. So I just stashed the requested setting in that file and read them back into the Web-UI to ‘know’ what they were requested to be. I think, from memory, Synology used a custom kernel module to ‘stash’ some of this info; I used this file.
The original pull request for those wanting to dig into this anomaly is as follows:

And contains an exposition of how it works. And @Hooverdan your linked issue was a minor addition to the hdparm service file to have it wait for the udev system so it’s by-id names would have been created by the time it runs.

Exactly, and I’ve just found the following:

which looks to be the ‘tidy up after yourself’ hdparm issue so we just need to work out why we have this update failure when our hdparm service has rouge entries. I’m still not convinced we have an absolute causal relation here but something is happening.

A failure in this hdparm service should not affect any others really. So we have a fragility here and the cause is a likely candidate for another issue, once we know what that cause is of course.

Agreed, I think this was waiting in the wings until the next update came along. There is a slight time difference between an update that has a database migration and one that does not. But from the pull request for that database change we have the normal sequence that is used for these Web-UI initiated rockstor updates:

systemctl stop rockstor
systemctl stop rockstor-pre
/usr/bin/find /opt/rockstor -name "*.pyc" -not -path "*/eggs/*" -type f -delete
zypper --non-interactive refresh
zypper --non-interactive install rockstor
systemctl start rockstor

with the fairly obvious yum counterparts if we find we are running on a ‘rockstor’ linux (CentOS) base.

That code is in system/pkg_mgmt.py:

and the distro specific commands are defined earlier:

So from that I don’t yet see why this failure in an essentially unrelated service is tripping up our main rockstor services.

Good.

Thanks for another great report by the way. But bar the above,

I don’t have any more light to shed than yourself or @Flox, and you are both probably more aware of systemd issues than me anyway. So:

So that would be great, but we need a reproducer for this. There should be enough info here hopefully to generate one though. And once we have asserted the root cause we can get a focused issue opened and a resolution sorted.

Hope that helps and well done all.

As an aside, I’m pretty sure one can simply delete this service file and then, ideally:

systemctl daemon-reload

and all that will be lost is the spin-down and APM setting for all current and past drives. Resetting these within the Web-UI should then re-assert the hdparm service this file configures. The file is the ‘source of truth’ from Rockstor’s perspective. And this work around may be safer than editing the file anyway.

2 Likes