SMART not supported?

and there is a Rockon for Scrutiny available … of course, one has to actually go look at it to notice anything amiss (I’ve been running it for a few years, but I certainly don’t check it often enough to catch an impending failure).

2 Likes

… and there goes another NVMe SSD, this time the Rockstor OS drive of my main server.

I was worried after it became unresponsive twice in one day, to the point that I had to reboot the server. Luckily, after the second instance I could still connect via SSH and saw this in the kernel log:

$ sudo dmesg
[ 4145.426952] nvme nvme1: I/O 64 (I/O Cmd) QID 1 timeout, aborting
[ 4145.426976] nvme nvme1: I/O 65 (I/O Cmd) QID 1 timeout, aborting
[ 4145.426989] nvme nvme1: I/O 66 (I/O Cmd) QID 1 timeout, aborting
[ 4145.427000] nvme nvme1: I/O 67 (I/O Cmd) QID 1 timeout, aborting
[ 4145.427011] nvme nvme1: I/O 68 (I/O Cmd) QID 1 timeout, aborting
[ 4175.462259] nvme nvme1: I/O 64 QID 1 timeout, reset controller
[ 4236.297492] nvme1n1: I/O Cmd(0x2) @ LBA 13600192, 16 blocks, I/O Error (sct 0x3 / sc 0x71)
[ 4236.297506] I/O error, dev nvme1n1, sector 13600192 op 0x0:(READ) flags 0x80700 phys_seg 2 prio class 2
[ 4236.297563] nvme1n1: I/O Cmd(0x2) @ LBA 108647784, 256 blocks, I/O Error (sct 0x3 / sc 0x71)
[ 4236.297572] I/O error, dev nvme1n1, sector 108647784 op 0x0:(READ) flags 0x84700 phys_seg 32 prio class 2
[ 4236.297582] nvme1n1: I/O Cmd(0x2) @ LBA 108648040, 256 blocks, I/O Error (sct 0x3 / sc 0x71)
[ 4236.297589] I/O error, dev nvme1n1, sector 108648040 op 0x0:(READ) flags 0x84700 phys_seg 32 prio class 2
[ 4236.297598] nvme1n1: I/O Cmd(0x2) @ LBA 108648296, 256 blocks, I/O Error (sct 0x3 / sc 0x71)
[ 4236.297605] I/O error, dev nvme1n1, sector 108648296 op 0x0:(READ) flags 0x84700 phys_seg 32 prio class 2
[ 4236.297614] nvme1n1: I/O Cmd(0x2) @ LBA 108648552, 256 blocks, I/O Error (sct 0x3 / sc 0x71)
[ 4236.297621] I/O error, dev nvme1n1, sector 108648552 op 0x0:(READ) flags 0x84700 phys_seg 29 prio class 2
[ 4236.297630] nvme1n1: I/O Cmd(0x2) @ LBA 108648808, 256 blocks, I/O Error (sct 0x3 / sc 0x71)
[ 4236.297637] I/O error, dev nvme1n1, sector 108648808 op 0x0:(READ) flags 0x84700 phys_seg 1 prio class 2
[ 4236.297646] nvme1n1: I/O Cmd(0x2) @ LBA 108649064, 256 blocks, I/O Error (sct 0x3 / sc 0x71)
[ 4236.297653] I/O error, dev nvme1n1, sector 108649064 op 0x0:(READ) flags 0x84700 phys_seg 1 prio class 2
[ 4236.297661] nvme1n1: I/O Cmd(0x2) @ LBA 108649320, 256 blocks, I/O Error (sct 0x3 / sc 0x71)
[ 4236.297668] I/O error, dev nvme1n1, sector 108649320 op 0x0:(READ) flags 0x84700 phys_seg 1 prio class 2
[ 4236.297677] nvme1n1: I/O Cmd(0x2) @ LBA 108649576, 256 blocks, I/O Error (sct 0x3 / sc 0x71)
[ 4236.297684] I/O error, dev nvme1n1, sector 108649576 op 0x0:(READ) flags 0x84700 phys_seg 1 prio class 2
[ 4236.297693] nvme1n1: I/O Cmd(0x2) @ LBA 108649832, 256 blocks, I/O Error (sct 0x3 / sc 0x71)
[ 4236.297700] I/O error, dev nvme1n1, sector 108649832 op 0x0:(READ) flags 0x84700 phys_seg 11 prio class 2
[ 4236.298174] nvme nvme1: Abort status: 0x371
[ 4236.298181] nvme nvme1: Abort status: 0x371
[ 4236.298186] nvme nvme1: Abort status: 0x371
[ 4236.298191] nvme nvme1: Abort status: 0x371
[ 4236.298196] nvme nvme1: Abort status: 0x371
[ 4236.316896] nvme nvme1: 15/0/0 default/read/poll queues

The NVMe SSD is just over 4 years old and has seen many different use cases. For roughly the last 1-2 years it served as the Rockstor OS drive. Here are the health parameters of this drive:

$ sudo nvme smart-log /dev/nvme1n1
Smart Log for NVME device:nvme1n1 namespace-id:ffffffff
critical_warning                        : 0
temperature                             : 34 °C (307 K)
available_spare                         : 100%
available_spare_threshold               : 10%
percentage_used                         : 6%
endurance group critical warning summary: 0
Data Units Read                         : 106075308 (54.31 TB)
Data Units Written                      : 82597918 (42.29 TB)
host_read_commands                      : 757942863
host_write_commands                     : 1062108684
controller_busy_time                    : 13332
power_cycles                            : 825
power_on_hours                          : 18392
unsafe_shutdowns                        : 142
media_errors                            : 0
num_err_log_entries                     : 0
Warning Temperature Time                : 0
Critical Composite Temperature Time     : 0
Thermal Management T1 Trans Count       : 0
Thermal Management T2 Trans Count       : 0
Thermal Management T1 Total Time        : 0
Thermal Management T2 Total Time        : 0
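
As a side note, the key fields above can also be pulled out programmatically if one wants to alert on them. A minimal sketch, run here against the captured output above (the `get_field` helper is just an illustration, not part of nvme-cli):

```shell
# Sample fields copied from the smart-log output above; on a live system you
# would capture them with:  sudo nvme smart-log /dev/nvme1n1
smart_log='critical_warning                        : 0
percentage_used                         : 6%
available_spare                         : 100%
media_errors                            : 0'

# Hypothetical helper: print the value of one smart-log field.
get_field() {
  printf '%s\n' "$smart_log" | awk -F' *: *' -v k="$1" '$1 == k { print $2 }'
}

get_field percentage_used   # prints: 6%
get_field media_errors      # prints: 0
```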

Notably, I am currently not able to reproduce the errors. After the initial errors I rushed to replace the drive and was able to migrate successfully to a new NVMe SSD. I have since formatted the old drive and am currently running a badblocks -wsv write test; it just finished the first pass without any errors … :person_shrugging:

Cheers Simon

2 Likes

I assume that in your research you have stumbled across this info as well?

and

and it was curious there that, even though the issue was reported quite some time ago, people were still successfully using the kernel parameter to avoid it.
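
For reference, the workaround usually quoted in those threads is to restrict NVMe power-state transitions via the kernel command line; the value is the maximum allowed power-state entry latency in microseconds, and 0 effectively disables APST:

```shell
# Appended to the kernel command line (e.g. GRUB_CMDLINE_LINUX in
# /etc/default/grub); 0 disables the autonomous power-state transitions.
nvme_core.default_ps_max_latency_us=0
```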

But since this also seems to be a combination of overly optimistic firmware and the kernel, and your drive has a couple of years under its belt, maybe it is related?

I couldn’t find whether a definitive fix within the kernel has addressed this particular problem. Though, of course, it is curious that you would run into this issue only after some time, so your diagnosis of a faulty drive further up might still be the true root cause.

2 Likes

Thank you @Hooverdan for pointing that out.

I actually hadn’t searched the internet and had just focused on replacing that drive, because it had been running for over a year without any errors.

After having a look at the issues described with APST, I think this is something completely different: those users reported no “I/O errors”, but rather a lot of APST errors, of which I have had none.

BTW, badblocks has finished 4 whole-disk write cycles without any errors :see_no_evil_monkey:

1 Like

Even if it was not explicitly present in your logs, considering you’re getting these messages I was thinking it could be related to APST … since after an APST error you would get all the I/O errors (at least the way I read the various threads). But I don’t want to be the “hammer in search of a nail” here just because it would be convenient :smile:.

1 Like

You could take a look at ShredOS/nwipe, which we document in the following doc: Pre-Install Best Practice (PBP)

Another nice doc update by @Hooverdan from our previous DBAN recommendation.

The act of writing to all blocks may force the auto-replacement of some dodgy ones with spares. Plus it could give you some feedback on whether a full write is still possible.
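
To see whether a full write actually consumed any spares, one can re-read the smart log afterwards and compare `available_spare` against `available_spare_threshold`. A small sketch, using the values Simon’s output reported above:

```shell
# Values as reported by `sudo nvme smart-log` above; re-read them after a full
# write to see whether any spare blocks were consumed.
available_spare=100
available_spare_threshold=10

if [ "$available_spare" -le "$available_spare_threshold" ]; then
  echo "WARNING: spare blocks nearly exhausted"
else
  echo "spares OK (${available_spare}% left, threshold ${available_spare_threshold}%)"
fi
```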

Hope that helps.

3 Likes

Though I didn’t know about nwipe before, it would perform the same full-disk write operation as badblocks in write mode.
The former focuses on disk sanitization (data erasure), the latter on testing disks.

As already mentioned, in total I had done 8 passes of write & verify-read operations on the whole disk without a single error (neither from the badblocks tool, nor in the dmesg log).
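
That pass count lines up with how `badblocks -w` works: per its man page it writes four test patterns and reads each one back, which is where the eight whole-disk passes come from:

```shell
# The four patterns badblocks -w writes and then reads back (per its man page).
patterns="0xaa 0x55 0xff 0x00"

passes=0
for p in $patterns; do
  passes=$((passes + 2))  # one write pass plus one verify-read pass per pattern
done
echo "$passes passes of whole-disk I/O per full badblocks -w run"
```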


I assume that had already happened when the errors were initially raised, which is why all write tests passed without a single error.

Though it is still a mystery to me why the I/O errors appeared in the first place, as I was later able to retrieve all the data without errors before running the tests. So no data was actually lost.

Cheers Simon

1 Like