SMART not supported?

Unfortunately I have another example of a failing drive to contribute. This time it is an NVMe SSD used in my workstation, running openSUSE Leap 15.6.

The SSD suddenly failed with a bunch of “Buffer I/O error on dev nvme0n1p2” messages in the kernel log.

Interestingly, after a “cool down” of about 3 h, the SSD worked again for about 1 h, letting me take a recent backup and even finish a btrfs scrub successfully.

This was the NVMe state during that time window, when everything seemed normal again:

$ sudo nvme smart-log /dev/nvme0n1
Smart Log for NVME device:nvme0n1 namespace-id:ffffffff
critical_warning                        : 0
temperature                             : 33 °C (306 K)
available_spare                         : 100%
available_spare_threshold               : 5%
percentage_used                         : 4%
endurance group critical warning summary: 0
Data Units Read                         : 138045346 (70.68 TB)
Data Units Written                      : 58826116 (30.12 TB)
host_read_commands                      : 941342021
host_write_commands                     : 459546178
controller_busy_time                    : 2745
power_cycles                            : 1137
power_on_hours                          : 4966
unsafe_shutdowns                        : 37
media_errors                            : 0
num_err_log_entries                     : 3569
Warning Temperature Time                : 0
Critical Composite Temperature Time     : 0
Thermal Management T1 Trans Count       : 0
Thermal Management T2 Trans Count       : 0
Thermal Management T1 Total Time        : 0
Thermal Management T2 Total Time        : 0

Interestingly, although the drive had failed catastrophically just 3 h before, there is no indication of anything being wrong … :person_shrugging:
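
To make the point concrete, here is a minimal sketch of the checks one would naively run against the health fields of that output. The field selection and the thresholds are my own assumptions, not an official health rule, and the dictionary simply holds the values copied from the smart-log output above. num_err_log_entries is deliberately not checked, since that counter alone does not say what kind of errors were logged (nvme error-log would be the place to look).

```python
# Values copied by hand from the smart-log output above.
smart = {
    "critical_warning": 0,
    "available_spare": 100,           # percent
    "available_spare_threshold": 5,   # percent
    "percentage_used": 4,             # percent of rated endurance
    "media_errors": 0,
    "num_err_log_entries": 3569,      # kept for reference, not evaluated
}


def health_warnings(log):
    """Return a list of warnings; an empty list means 'looks healthy'.

    The checks are illustrative assumptions, not a vendor-defined rule set.
    """
    warnings = []
    if log["critical_warning"] != 0:
        warnings.append("critical_warning flag is set")
    if log["available_spare"] < log["available_spare_threshold"]:
        warnings.append("available spare is below its threshold")
    if log["percentage_used"] >= 100:
        warnings.append("rated endurance is used up")
    if log["media_errors"] > 0:
        warnings.append("media errors were recorded")
    return warnings


print(health_warnings(smart))  # prints []
```

So by these checks the drive passes, even though it had dropped off the bus a few hours earlier.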

Afterwards, I ran an NVMe self-test, and that is when it started failing again.

From then on, even when booting into a recovery OS from a USB stick, the NVMe partition only showed up briefly. Fetching the state via the nvme command or even mounting the SSD led to errors in the kernel log, and after just a few minutes the SSD disappeared from the system completely.


The drive is just 2.5 years old, and with 30 TB written on a 2 TB SSD it should still be far from failing. I have already contacted the manufacturer (the drive comes with a 3-year warranty) and sent the drive to them. Let’s see if at least the financial loss is compensated.
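
For anyone who wants to check the arithmetic, here is a small sketch using the counters from the smart-log output above. It assumes that one NVMe “data unit” stands for 1000 blocks of 512 bytes, which matches the TB figures that nvme-cli printed; the function name is just made up for this example.

```python
# Assumption: one NVMe data unit = 1000 * 512 bytes.
BYTES_PER_DATA_UNIT = 512 * 1000


def data_units_to_tb(units):
    """Convert an NVMe data unit counter to decimal terabytes."""
    return units * BYTES_PER_DATA_UNIT / 1e12


written_tb = data_units_to_tb(58826116)   # "Data Units Written"
read_tb = data_units_to_tb(138045346)     # "Data Units Read"
capacity_tb = 2.0                         # nominal size of the SSD

full_drive_writes = written_tb / capacity_tb

print(f"{written_tb:.2f} TB written, {read_tb:.2f} TB read")
print(f"~{full_drive_writes:.1f} full drive writes")
```

That prints 30.12 TB written, 70.68 TB read and roughly 15.1 full drive writes, which fits the 4% percentage_used the drive reported about itself.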

I just wanted to share this example to show that even the new, shiny and standardized NVMe status report is not guaranteed to indicate a drive failure in advance (duh!).

Cheers
Simon
