SMART not supported?

Using Leap 15.4 V4.5.5-0 and a Crucial CT250P2SSD8 NVMe SSD (plus two HDDs set up as a software raid1).

The Rockstor WebUI says the SSD disk is “Not supported”, even though it works fine in CLI mode.

I have sent a bunch of logs to the support-email and linked it here.

1 Like

@aremiaskfa Hello again, and thanks for the report.

Better to pop edited snippets from the logs in here, surrounding them in triple backticks to quote them. I unfortunately have little time for the support email currently, and there are far more folks here on the forum to help.

Could you give more details of the exact situation here?

Can you give the CLI output showing SMART support for this device? We use the CLI to assess support level and indicate this on the Web-UI. The log snippets should also give the error messages associated with attempts to assess the SMART support. Or it may be we are blanket ignoring this device type, SMART wise, for some reason. See below:

I’ll note a potentially relevant part of the code here to help others chip in on what may be going wrong here:
The docstrings:

So from those code comments it does look like we are universally ignoring nvme- devices purely on the grounds of our prior (older) CentOS base.

You could try removing the “nvme-|” entry in that re.match string to see if all is then well. We are, after all, on much newer smartmontools now.

If you look after the quoted try (line 473) you will see what we do to ‘test’ for smart capability. Hopefully it will be fine for nvme devices now, but otherwise it may need some attention in their case.
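
To help others follow along, here is a minimal sketch of the kind of name-based exclusion being discussed. It is not the actual Rockstor code: the pattern strings, the function name `smart_probe_skipped`, and the example device name are all assumptions for illustration. It only shows how a `re.match` alternation acts as a set of name prefixes, and how dropping the `nvme-` alternative lets such a device through to the real capability test.

```python
import re

# Hypothetical patterns: the real string in views/disk.py may differ.
LEGACY_SKIP = "nvme-|vd|md"   # includes the nvme- entry
PROPOSED_SKIP = "vd|md"       # nvme- entry removed


def smart_probe_skipped(dev_name, pattern):
    """Return True if dev_name starts with a prefix we never probe for SMART."""
    # re.match only matches at the start of the string, so each
    # alternative in the pattern behaves as a name prefix.
    return re.match(pattern, dev_name) is not None


nvme_name = "nvme-CT250P2SSD8_EXAMPLE"  # made-up by-id style name
print(smart_probe_skipped(nvme_name, LEGACY_SKIP))    # True
print(smart_probe_skipped(nvme_name, PROPOSED_SKIP))  # False
```

With the `nvme-` alternative gone, the device is no longer skipped, so whatever probing follows would then decide the reported support level.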

Thanks again for the report, and be sure to reboot, or stop and then start all rockstor services after any python change for them to take effect. Might be a nice feature add pull request actually :).

Hope that helps, and do let us know if removing that nvme disable for smart in-code works for you.

2 Likes

WebUI:

smartctl:

smartctl 7.2 2021-09-14 r5237 [x86_64-linux-5.14.21-150400.24.38-default] (SUSE RPM)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Number:                       CT250P2SSD8
Serial Number:                      3315B635EA66
Firmware Version:                   P2CR048
PCI Vendor/Subsystem ID:            0xc0a9
IEEE OUI Identifier:                0x00a075
Total NVM Capacity:                 250,059,350,016 [250 GB]
Unallocated NVM Capacity:           0
Controller ID:                      1
NVMe Version:                       1.3
Number of Namespaces:               1
Namespace 1 Size/Capacity:          250,059,350,016 [250 GB]
Namespace 1 Formatted LBA Size:     512
Namespace 1 IEEE EUI-64:            00a075 6120000284
Local Time is:                      Mon Jan 23 15:03:50 2023 CET
Firmware Updates (0x12):            1 Slot, no Reset required
Optional Admin Commands (0x0017):   Security Format Frmw_DL Self_Test
Optional NVM Commands (0x005e):     Wr_Unc DS_Mngmt Wr_Zero Sav/Sel_Feat Timestmp
Log Page Attributes (0x0e):         Cmd_Eff_Lg Ext_Get_Lg Telmtry_Lg
Maximum Data Transfer Size:         64 Pages
Warning  Comp. Temp. Threshold:     70 Celsius
Critical Comp. Temp. Threshold:     85 Celsius

Supported Power States
St Op     Max   Active     Idle   RL RT WL WT  Ent_Lat  Ex_Lat
 0 +     3.50W       -        -    0  0  0  0        0       0
 1 +     1.90W       -        -    1  1  1  1        0       0
 2 +     1.50W       -        -    2  2  2  2        0       0
 3 -   0.0700W       -        -    3  3  3  3     5000    1900
 4 -   0.0020W       -        -    4  4  4  4    13000  100000

Supported LBA Sizes (NSID 0x1)
Id Fmt  Data  Metadt  Rel_Perf
 0 +     512       0         1
 1 -    4096       0         0

=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

SMART/Health Information (NVMe Log 0x02)
Critical Warning:                   0x00
Temperature:                        20 Celsius
Available Spare:                    100%
Available Spare Threshold:          5%
Percentage Used:                    0%
Data Units Read:                    8,100 [4.14 GB]
Data Units Written:                 93,665 [47.9 GB]
Host Read Commands:                 241,745
Host Write Commands:                1,596,732
Controller Busy Time:               16
Power Cycles:                       7
Power On Hours:                     168
Unsafe Shutdowns:                   4
Media and Data Integrity Errors:    0
Error Information Log Entries:      8
Warning  Comp. Temperature Time:    0
Critical Comp. Temperature Time:    0
Temperature Sensor 1:               31 Celsius

Error Information (NVMe Log 0x01, 16 of 16 entries)
Num   ErrCount  SQId   CmdId  Status  PELoc          LBA  NSID    VS
  0          8     0  0x1006  0x4005  0x028            0     0     -

lsblk:

NAME        MAJ:MIN RM   SIZE RO TYPE MOUNTPOINTS
sda           8:0    0   7.3T  0 disk 
sdb           8:16   0   7.3T  0 disk /mnt2/radarr-media
                                      /mnt2/radarr-download
                                      /mnt2/radarr-config
                                      /mnt2/rockons-root/btrfs
                                      /mnt2/rockons-config
                                      /mnt2/dropbox
                                      /mnt2/system-backup
                                      /mnt2/rockons-root
                                      /mnt2/omada-controller-logs
                                      /mnt2/rockons-data
                                      /mnt2/transmission
                                      /mnt2/omada-controller-data
                                      /mnt2/dropbox-config
                                      /mnt2/data
nvme0n1     259:0    0 232.9G  0 disk 
├─nvme0n1p1 259:1    0     2M  0 part 
├─nvme0n1p2 259:2    0    64M  0 part /boot/efi
├─nvme0n1p3 259:3    0     2G  0 part [SWAP]
└─nvme0n1p4 259:4    0 230.8G  0 part /mnt2/home
                                      /mnt2/ROOT
                                      /usr/local
                                      /boot/grub2/x86_64-efi
                                      /tmp
                                      /root
                                      /var
                                      /.snapshots
                                      /opt
                                      /srv
                                      /home
                                      /boot/grub2/i386-pc
                                      /

1 Like

@aremiaskfa Thanks for the additional info.
Re:

Then reboot, or restart the rockstor systemd services, and see if all is well.

I strongly suspect we have a legacy SMART ignore in the above code, as it just wasn’t supported back then. But it is obviously supported now, at least by smartmontools, which is what we use.

Hope that helps and let us know how it goes.

2 Likes

I am doing something wrong.

I have changed the file as suggested and restarted rockstor services: systemctl restart rockstor rockstor-bootstrap rockstor-pre.service
however the change was not visible.

As a sanity check, I then went so far to temporarily set all smart_available and smart_enabled vars to True and restarted rockstor again. This time all my disks displayed that they had SMART supported!

When I reverted the disk.py file back to its original state (and restarted rockstor), however, it still says that the NVMe disk has SMART supported, even though it should say Not supported.

I’ve tried killing nginx and using systemctl stop instead of restart, but no luck.
I have disabled the cache in my browser and have even used a different browser, but no luck either.
I don’t want to reboot my server.

@aremiaskfa Hello again:
Re:

Try a stop on all these, then a start.
That should do it with regard to instantiating Python code changes.

Re:

The disk Rescan button should help there.

Disk-Rescan-button

Hope that helps. And well done for the experimentation.

3 Likes

Thank you for helping, however it didn’t work.

I am changing /opt/rockstor/src/rockstor/storageadmin/views/disk.py, is this correct?
disk.pyc is being replaced when stopping and starting the services.

EDIT: OK, I think I got it.
The reason probably lies here:

When I temporarily set all smart_available and smart_enabled vars to True, that state was persisted via do.save() at the end of the method.
However, there is no such persistence in place in the code snippet above.
It simply continues with the next object in the for loop, leaving the previous one untouched.
I have added an extra do.save there, so that I’m back to the original state and can test the fix mentioned by @phillxnet
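
A minimal sketch of that effect, for anyone following along. This is not the Rockstor code: `FakeDisk`, `rescan`, and the `save_on_skip` flag are hypothetical stand-ins, and I am assuming the skipped branch resets the flag in memory before continuing. The point is only that a `continue` before `save()` leaves the previously stored value in place.

```python
class FakeDisk:
    """Hypothetical stand-in for a database-backed disk object."""

    def __init__(self, name, smart_available):
        self.name = name
        self.smart_available = smart_available        # in-memory value
        self.saved_smart_available = smart_available  # "database" value

    def save(self):
        # Persist the in-memory value, as an ORM-style save() would.
        self.saved_smart_available = self.smart_available


def rescan(disks, skip_prefix="nvme-", save_on_skip=False):
    for do in disks:
        if do.name.startswith(skip_prefix):
            # Skipped devices are reset in memory only ...
            do.smart_available = False
            if save_on_skip:
                do.save()  # ... unless the extra save is added.
            continue
        do.smart_available = True  # placeholder for the real probe
        do.save()


# A disk whose stored state was left at True by an earlier experiment.
stale = FakeDisk("nvme-example", smart_available=True)
rescan([stale])
print(stale.saved_smart_available)  # True: the stale value survives

rescan([stale], save_on_skip=True)
print(stale.saved_smart_available)  # False: the reset is now persisted
```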

EDIT 2: I have removed “nvme-”, but the smart.available() returns False, False.
Where can I find reference docs for the smart.* module?

2 Likes

@aremiaskfa Hello again, and thanks for chasing this one down.

Re:

Yes, that looks to be correct; at least as far as you have tracked this down before the smart.available() discovery anyway.

It very much looks like you are getting to the root of this one actually.
Re:

The only developer docs we have for these related functions are within their docstrings. I think the following source file and location should point you in the right direction:

Which is under (a part/file of) the system module (note the __init__.py within the system dir).

From the file you have been looking at we have the following import from elsewhere in our own code:

And so we get access to “smart” and all its goodies as we have defined them: smart.available is thus usable within this views/disk.py file.

And if you do fancy taking your findings to a pull request stage, then note that we don’t as-yet have any unit tests for this part (smart) of the system module:

It may very well be a good time/exercise to add some coverage to this area if it turns out that what we need here is an augmentation in our parsing to discern smart availability. But it may also be that nvme output is quite different. Let us know how your experimentation goes.

Also note that we have an ability to submit problematic smart output via the following shell script:

So it would be helpful to post the output of that script, for the given NVMe device, to our support email as we would then have access to this particular smart output. The testing mode you see referenced in smart.py is intended for developers to be able to test our code on systems they don’t have direct access to. I unfortunately don’t have access to a real nvme currently, but we do have some in the ‘team’; as always, it’s a matter of prioritising time. So do, if you can, keep chipping away on this one as I think you are running down a new area of development for us. As stated, we have never supported nvme smart and their output may present some new challenges.

Hope that helps, and thanks again for potentially extending our nvme smart capabilities.

3 Likes

My NVME output is indeed different to my HDDs and to what Rockstor expects.
Namely, the
'SMART support is: Available - device has SMART capability.', 'SMART support is: Enabled',
is missing!
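
A minimal sketch of why that matters, assuming a parser that only looks for those two lines. This is not Rockstor's actual `smart.available()` implementation; the function name `smart_support_flags` and the sample lines are illustrative. Given output without any `SMART support is:` lines, such a parser can only answer False, False.

```python
import re


def smart_support_flags(info_lines):
    """Return (available, enabled) from `smartctl --info` style lines."""
    available = enabled = False
    for line in info_lines:
        if re.match(r"SMART support is:\s+Available", line):
            available = True
        if re.match(r"SMART support is:\s+Enabled", line):
            enabled = True
    return available, enabled


hdd_lines = [
    "SMART support is: Available - device has SMART capability.",
    "SMART support is: Enabled",
]
nvme_lines = [
    "Model Number:                       CT250P2SSD8",
    "NVMe Version:                       1.3",
]
print(smart_support_flags(hdd_lines))   # (True, True)
print(smart_support_flags(nvme_lines))  # (False, False)
```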

“smartctl --info” output is already on your support email.
Also note that there is a bunch of info outputs available on GitHub for my exact model (though I am not sure if it will help?): https://github.com/linuxhw/SMART/tree/master/NVMe/Crucial/CT250/CT250P2SSD8

Anyway, I have decided to create a ticket: https://www.smartmontools.org/ticket/1693

I have tried installing the latest version (7.3) of smartmontools and running some updatedb command, but no change.

3 Likes

@aremiaskfa Hello again.
Re:

Yes, that is what it looks like. So I think we need to adapt our parsing to catch this kind of output. It may well be that this format is common for NVME devices. I have none here currently to test with but maybe in time.

Thanks, hopefully it can be used to establish some tests so we can extend our smart output parsing. When we initially developed this parsing there were many reports of minor failings. But via the previously mentioned shell script we ended up mostly supporting what was out there. However, your report looks to be the start of a new wave of changes likely associated with these devices.

I’m a bit behind on the support emails currently but I should get to it in time.

It will be interesting to see the response from the upstream project.

Nice find/report by the way. Let’s keep an eye on this to see how we might augment our smart ‘sensing’ parsing to accommodate it. Ideally we need an nvme that doesn’t support smart. It may be that this doesn’t exist, and/or it may be that smart similarly cannot be disabled on them. If so then we may just assume smart capability and enablement for all nvme devices. That should be easy enough.

Hope that helps and thanks for the additional info.

2 Likes

The output for NVMe does not contain SMART support is: ... because, unlike SATA, the related functionality ( SMART/Health Information ) is mandatory and cannot be disabled.

Seems like all nvme are SMART compatible and enabled?
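
If that holds, a special case could look roughly like the sketch below. This is only an illustration under stated assumptions: the function name `smart_flags_for` is made up, and treating any device name that starts with "nvme" as an NVMe device is my assumption, not something Rockstor does today.

```python
def smart_flags_for(dev_name, info_lines):
    """Return (available, enabled) for a device.

    dev_name: device name, e.g. "nvme0n1" or "sda" (illustrative only).
    info_lines: lines of `smartctl --info` output for that device.
    """
    # Assumption for this sketch: a name starting with "nvme" means NVMe.
    if dev_name.startswith("nvme"):
        # SMART/Health Information is mandatory for NVMe and cannot be
        # disabled, so no "SMART support is:" lines are expected.
        return True, True
    available = enabled = False
    for line in info_lines:
        if line.startswith("SMART support is:"):
            available = available or "Available" in line
            enabled = enabled or "Enabled" in line
    return available, enabled


sata_info = [
    "SMART support is: Available - device has SMART capability.",
    "SMART support is: Enabled",
]
print(smart_flags_for("nvme0n1", ["NVMe Version:                       1.3"]))
print(smart_flags_for("sda", sata_info))
print(smart_flags_for("sdb", []))
```

Non-NVMe devices still go through the line-based check, so existing behaviour for them would be unchanged.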

3 Likes

@aremiaskfa Nice find, re your upstream issue with smartmontools.

As you did the exploratory work on this one, do you fancy opening an issue here:

with your findings, referencing your upstream smartmontools issue. We then have attribution to yourself via the issue creation and to upstream within the issue text. And link back to this forum thread for some context.

That issue can then be used by a pull request to remove our auto-disable of nvme smart (because in our last OS host it wasn’t in smartmontools at the time) and replace it with an auto-enable. We may well have some more work to do here also regarding the subsequent parsing if they have changed/lost other entries.

If you fancy having a go at this pr yourself then indicate this on the issue by assigning yourself, or by noting the same in a comment if it won’t let you.

Thanks again for chasing this one down. Major progress here.

Have you tried jury-rigging the code to auto-enable given a device name or the like? You could then test the subsequent formatting for our display of the smart details.

Hope that helps.

2 Likes

My vacation is over and with it also my alone time.
So, we’ll see

After jury-rigging it, the SMART details dialog/page is mostly non-functional.
Only a few Identity attributes are displayed in the table.
The rest of the tabs will need to be reworked specifically for nvme drives.

Identity tab: nvme drives seem to have compatible Identity information. In this case, I suggest simply outputting the whole === START OF INFORMATION SECTION === into a nice table like it is now, but without the regex matches. Simply build the table per line, splitting at the first occurrence of a colon.

Attributes tab: nvme drives don’t have SMART Attributes. Instead they offer SMART / Health Information. The two are not compatible.

Capabilities tab: I think this one can be hidden for nvme drives.

Error Logs tab: nvme drives report errors differently. Will need to be adapted.

Self Test Logs tab: smartmontools 7.3 does not support such logs for nvme drives yet. One should use the nvme-cli package instead.

Perform Tests tab: smartmontools 7.3 doesn’t support performing tests either. Again, nvme-cli is a superior package in this regard.
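
The Identity tab suggestion above could be sketched as follows. This is a rough illustration, not a patch: the function name `parse_information_section` is made up, and the sample lines are taken from the smartctl output posted earlier in this thread. Lines without a colon (such as the power state table) are simply skipped.

```python
def parse_information_section(smartctl_lines):
    """Collect (key, value) pairs from the information section."""
    start = "=== START OF INFORMATION SECTION ==="
    rows = []
    in_section = False
    for line in smartctl_lines:
        if line.startswith("==="):
            # Any section marker switches the section on or off.
            in_section = line.strip() == start
            continue
        if not in_section or ":" not in line:
            continue
        # Split at the first colon only, so values such as timestamps
        # keep their own colons.
        key, value = line.split(":", 1)
        rows.append((key.strip(), value.strip()))
    return rows


sample = [
    "=== START OF INFORMATION SECTION ===",
    "Model Number:                       CT250P2SSD8",
    "Local Time is:                      Mon Jan 23 15:03:50 2023 CET",
    "Supported Power States",
    "=== START OF SMART DATA SECTION ===",
    "Critical Warning:                   0x00",
]
for key, value in parse_information_section(sample):
    print("{} | {}".format(key, value))
```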

One ponders what to do next… Maybe for the time being we can show the user that his nvme drive has SMART, but other stuff is not yet supported.

3 Likes

@aremiaskfa Hello again.
Re:

Yes, that was my suspicion, the output does look to be a little too different.

Re:

Sounds like we need a proof of concept pull request for this one. Fancy having a go? We have not had a SMART parsing failure reported for a number of years now so I’m reluctant to tear the whole thing down currently. But we could special-case if nvme type is found.

OK, so another special case is likely here, given all other devices are currently reported OK, at least judging from the lack of feedback in recent years.

Oh dear. I think we would be better skipping this initially in favour of watching the smartmontools space. Especially given that all we see currently also didn’t exist in our CentOS days. It can be a can of worms to perpetually expand into progressively more niche tools.

Same as my last comment on this one I think.

Agreed. However I think this is all likely going to have to wait until our next testing branch phase as the existing testing phase is now wrapping up to the first stable poetry based build. Plus smartmontools is not standing still, so we will likely have more to chew on there when it’s next visited.

Also we could suggest to folks in our fledgling nvme smart support that there is a cli option such as you have suggested: nvme-cli

Some excellent finds here. Thanks for following this issue up so persistently.

3 Likes

I will probably continue on this topic. When I next find the time I will prepare a pull request with minimal changes, as agreed.

3 Likes