Device: /dev/sdi [SAT], unable to open device

Quick question:

Yesterday at 9:58pm EST I got a communication from the server saying that it couldn’t open sdi.

Looking at my disks I don’t see an sdi. All disks are accounted for, but they are out of sequence: they go from sdh to sdj. I suppose a UUID can change, but while running?

[root@mellorockstor ~]# smartctl -A /dev/sdi
smartctl 6.2 2013-07-26 r3841 [x86_64-linux-4.6.0-1.el7.elrepo.x86_64] (local build)
Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org

Smartctl open device: /dev/sdi failed: No such device
[root@mellorockstor ~]# ls /dev/sd?
/dev/sda /dev/sdc /dev/sde /dev/sdg /dev/sdj /dev/sdl
/dev/sdb /dev/sdd /dev/sdf /dev/sdh /dev/sdk
[root@mellorockstor ~]# lsblk -o NAME,SERIAL,SIZE,TRAN,TYPE,FSTYPE,LABEL
NAME     SERIAL           SIZE   TRAN TYPE FSTYPE LABEL
sda      JK1105YAKYG9TX   1.8T   sata disk btrfs  pool
sdb      YAK71UHV         1.8T   sata disk btrfs  pool
sdc      JK11B1B9HRP24F   1.8T   sata disk btrfs  pool
sdd      154856401560     223.6G sata disk
├─sdd1                    500M        part ext4
├─sdd2                    11.8G       part swap
└─sdd3                    211.3G      part btrfs  rockstor_rockstor
sde      Y485REZTS        1.8T   sata disk btrfs  pool
sdf      YAK6D79V         1.8T   sas  disk btrfs  pool
sdg      JK11B1YBKYBTHF   1.8T   sas  disk btrfs  pool
sdh      JK1105B8G6Z14X   1.8T   sas  disk btrfs  pool
sdj      JK1171YAJSMAKS   1.8T   sas  disk btrfs  pool
sdk      JK11B1B9K3ENRF   1.8T   sas  disk btrfs  pool
sdl      WD-WCAVY4457087  1.8T   sas  disk btrfs  pool
sr0      K1HEAEA5206      1024M  sata rom

Any ideas?

Thanks!

@m3elloa This is a little worrying. For a drive to change name like this without a reboot (I’m assuming there was no reboot), it must have been on an interface that was reset (and why?) or had a dodgy cable. That suggests the drive either failed to respond for longer than /sys/block/sdX/device/timeout (usually 30 seconds) or was actually disconnected and then reconnected (hence the dodgy cable suggestion). This event then caused the renaming, i.e. sdi became maybe sdl: it jumped to the end of the available names, leaving a gap in the naming where it used to be. On that assumption, the device with serial WD-WCAVY4457087 is the suspect cable / drive.
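As a rough illustration of the "gap in the naming" idea, here is a minimal Python sketch that lists single-letter sdX names missing between the first and last name present. The function name `find_name_gaps` is made up for this example, and a gap is only a hint that a device was dropped and re-added, not proof of it.

```python
import string


def find_name_gaps(names):
    """Return single-letter sdX names missing between the lowest and highest name seen."""
    alphabet = string.ascii_lowercase
    # Keep only plain three-character names such as "sda" (no partitions, no "sdaa").
    letters = sorted(
        n[2] for n in names
        if len(n) == 3 and n.startswith("sd") and n[2] in alphabet
    )
    if not letters:
        return []
    present = set(letters)
    start = alphabet.index(letters[0])
    stop = alphabet.index(letters[-1]) + 1
    return ["sd" + c for c in alphabet[start:stop] if c not in present]


# Names taken from the `ls /dev/sd?` output in the first post.
names = ["sda", "sdb", "sdc", "sdd", "sde", "sdf", "sdg", "sdh", "sdj", "sdk", "sdl"]
print(find_name_gaps(names))
```

For the names in the first post this reports sdi as the only missing name, which matches what was observed.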

Might be worth looking through your logs at around the indicated time to try and track this one down and get more info on what the system reported.

UUIDs of filesystems don’t change (without a re-format) and nor do the serial numbers of drives. Rockstor tolerates drive name changes, as it simply updates whatever name a drive has under a given serial number; the serial number itself is assumed not to change, being a property of the hardware. However, the disconnection / reconnection is the worrying part.
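The idea of tracking a drive by serial number rather than by name can be sketched in a few lines of Python. This is not Rockstor's code, just an illustration: `detect_renames` is a made-up helper, and the "before" snapshot below is hypothetical (it assumes the WD drive was sdi before the event, which the thread only suspects).

```python
def detect_renames(before, after):
    """Compare two {serial: name} snapshots and return {serial: (old_name, new_name)}
    for every drive present in both whose name changed."""
    return {
        serial: (before[serial], after[serial])
        for serial in before
        if serial in after and before[serial] != after[serial]
    }


# Hypothetical snapshot before the event (assumption: the WD drive was sdi).
before = {"JK1105B8G6Z14X": "sdh", "WD-WCAVY4457087": "sdi", "JK1171YAJSMAKS": "sdj"}
# Names for the same serials as shown in the lsblk output in the first post.
after = {"JK1105B8G6Z14X": "sdh", "WD-WCAVY4457087": "sdl", "JK1171YAJSMAKS": "sdj"}

print(detect_renames(before, after))
```

Because the lookup key is the serial, a name change shows up as an update to an existing drive rather than as one drive vanishing and an unrelated one appearing.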

That’s all I can think of at the moment. Do report whatever you find on this one though, as it’s a curiosity for sure.

Morning.

Server was up for 14 hours when I checked, so 8 to 9 hours running.

Not too familiar with btrfs, but here’s what I see in reference to sdi:

Nov 13 10:28:11 Rockstor kernel: mptsas: ioc0: attaching sata device: fw_channel 0, fw_id 13, phy 3, sas_addr 0x90833825aca29ca9
Nov 13 10:28:11 Rockstor kernel: scsi 6:0:3:0: Direct-Access ATA WDC WD2003FYPS-2 5G09 PQ: 0 ANSI: 5
Nov 13 10:28:11 Rockstor kernel: mptsas: ioc0: attaching sata device: fw_channel 0, fw_id 11, phy 4, sas_addr 0x816f9e519ec1f6be
Nov 13 10:28:11 Rockstor kernel: sd 6:0:3:0: [sdi] 3907029168 512-byte logical blocks: (2.00 TB/1.82 TiB)

<---->

Nov 13 10:28:13 Rockstor systemd: Device dev-disk-by\x2duuid-3f010aa5\x2d5430\x2d4471\x2d8801\x2dfc1b6592b088.device appeared twice with different sysfs paths /sys/devices/pci0000:00/0000:00:1f.2/ata6/host5/tar$
$evices/pci0000:00/0000:00:1c.4/0000:05:00.0/host6/port-6:3/end_device-6:3/target6:0:3/6:0:3:0/block/sdi

<---->

and around the e-mail time, the notification after several errors:
Nov 13 21:58:11 Rockstor kernel: BTRFS error (device sda): bdev /dev/sdi errs: wr 354381, rd 8, flush 26, corrupt 0, gen 0
Nov 13 21:58:11 Rockstor kernel: BTRFS error (device sda): bdev /dev/sdi errs: wr 354382, rd 8, flush 26, corrupt 0, gen 0
Nov 13 21:58:11 Rockstor kernel: BTRFS error (device sda): bdev /dev/sdi errs: wr 354383, rd 8, flush 26, corrupt 0, gen 0
Nov 13 21:58:11 Rockstor kernel: BTRFS error (device sda): bdev /dev/sdi errs: wr 354384, rd 8, flush 26, corrupt 0, gen 0
Nov 13 21:58:11 Rockstor kernel: BTRFS error (device sda): bdev /dev/sdi errs: wr 354385, rd 8, flush 26, corrupt 0, gen 0
Nov 13 21:58:11 Rockstor kernel: BTRFS error (device sda): bdev /dev/sdi errs: wr 354386, rd 8, flush 26, corrupt 0, gen 0
Nov 13 21:58:11 Rockstor kernel: BTRFS error (device sda): bdev /dev/sdi errs: wr 354387, rd 8, flush 26, corrupt 0, gen 0
Nov 13 21:58:11 Rockstor kernel: BTRFS error (device sda): bdev /dev/sdi errs: wr 354388, rd 8, flush 26, corrupt 0, gen 0
Nov 13 21:58:11 Rockstor kernel: BTRFS error (device sda): bdev /dev/sdi errs: wr 354389, rd 8, flush 26, corrupt 0, gen 0
Nov 13 21:58:11 Rockstor kernel: BTRFS error (device sda): bdev /dev/sdi errs: wr 354390, rd 8, flush 26, corrupt 0, gen 0
Nov 13 21:58:32 Rockstor smartd[2314]: Device: /dev/sdi [SAT], open() failed: No such device
Nov 13 21:58:32 Rockstor smartd[2314]: Sending warning via /usr/libexec/smartmontools/smartdnotify to root …
Nov 13 21:58:32 Rockstor smartd[2314]: Warning via /usr/libexec/smartmontools/smartdnotify to root: successful
Nov 13 21:58:33 Rockstor kernel: btrfs_dev_stat_print_on_error: 7675 callbacks suppressed
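The BTRFS error lines above carry per-device counters (write, read, flush, corruption and generation errors). A minimal Python sketch for pulling those counters out of such a line, so they can be compared over time, might look like this. The regular expression is written against the exact line format shown in this log; other kernel versions may word the message differently, and `parse_btrfs_err` is a name invented for this example.

```python
import re

# Matches lines such as:
# "... kernel: BTRFS error (device sda): bdev /dev/sdi errs: wr 354381, rd 8, flush 26, corrupt 0, gen 0"
ERR_RE = re.compile(
    r"BTRFS error \(device (?P<fs_dev>\w+)\): bdev (?P<bdev>\S+) errs: "
    r"wr (?P<wr>\d+), rd (?P<rd>\d+), flush (?P<flush>\d+), "
    r"corrupt (?P<corrupt>\d+), gen (?P<gen>\d+)"
)

COUNTERS = ("wr", "rd", "flush", "corrupt", "gen")


def parse_btrfs_err(line):
    """Return a dict with the device names and integer counters, or None if no match."""
    match = ERR_RE.search(line)
    if match is None:
        return None
    fields = match.groupdict()
    result = {"fs_dev": fields["fs_dev"], "bdev": fields["bdev"]}
    result.update({name: int(fields[name]) for name in COUNTERS})
    return result


line = ("Nov 13 21:58:11 Rockstor kernel: BTRFS error (device sda): "
        "bdev /dev/sdi errs: wr 354381, rd 8, flush 26, corrupt 0, gen 0")
print(parse_btrfs_err(line))
```

In the lines above only the write counter is climbing, which is consistent with writes being issued to a device that is no longer there.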

@m3elloa Hello again.

The interesting bit is probably just before this, as here we see /dev/sdi already appearing to be gone from btrfs’s point of view. Just wondering if there’s anything prior to this indicating its disappearance from the block layer rather than the fs layer. The log after the email time may also show the drive’s reappearance, indicating its new name.

Might help to add the model to your lsblk command, i.e.:

lsblk -o NAME,SERIAL,MODEL,SIZE,TRAN,TYPE,FSTYPE,LABEL

but it looks pretty likely that /dev/sdi, with the model number “WDC WD2003FYPS-2 5G09” from the log entry, was somehow lost and re-appeared at sdl.

The following command will show mappings as used by Rockstor on the disks page:

ls -la /dev/disk/by-id/

By the way, this hw issue doesn’t bode well given your parity raid selection.

@phillxnet Phill thank you for your reply.

I can send you the message log with everything if you are interested. I just reflashed a couple of H200s to P20 IT and want to destroy the pool, reformat the disks, create a new pool, and put them to the test.

After that I’ll have the Perc 5/i back and will keep my eye out for errors like that, so we can chat :slight_smile:

@m3elloa Thanks, but no worries: this looks very much like a drive that failed / timed out / had a dodgy cable, and not really Rockstor related.
Sounds like you have plenty to be getting along with anyway. Destroy away, but I’d keep an eye on that drive / cable / port.

Thanks for the update and do post anything you find, the more testing the merrier.

But I’d not bother with the parity raid levels just yet, unless you intend to help with the btrfs development on them. Pretty much known as not production ready but getting revamped in the background so there’s hope.

Thanks @phillxnet. I’ll be asking @suman to reset the key once more for sure :slight_smile:

One thing though … I’ve replaced my one Perc 5/i with 2 H200 for the test I’ll be performing and noticed that my IP, manually set on my interface, has changed (!?). Go figure :slight_smile:

Will save the logs before reinstall in case you guys need anything.

Aaaaahhhh :slight_smile: I’ve had a similar issue on a pre-failure OCZ Vertex SSD. The issue here is that my Vertex, as it was slowly getting ready to shuffle off this mortal coil, was (very rarely) resetting the SATA link and experiencing a “Bourne Identity”. The system uses tricks to identify a device and give it a GUID (Globally Unique ID), so if your disk, for example, corrupts its serial number, the GUID will change … after the SATA reset your disk will get a different /dev/sd* node … but hey, btrfs will keep going no matter what :smiley:

And please remember that the data being spat out by the disk can change … let’s say there is a SATA reset and your disk has a hazy understanding of some of its data -> it returns this crap data to the system, the system assigns a different GUID, you get a different node, and 20 seconds later the disk is out of its amnesia and returns everything OK.

Yep, I’m sure it is a bad disk. smartctl is reporting lots of errors; I just didn’t expect that uid change. Not too familiar with btrfs yet :slight_smile:

As soon as I’m sure the H200s are flashed correctly they will go back into testing.

Thanks for your feedback.

@m3elloa Glad you found the dodgy drive.

Yes, drive name changes occur regularly, more often over a reboot / power cycle, but these are system-level drive names, i.e. sda, sdb etc. Btrfs as implemented in Rockstor, i.e. full disk only on data drives, uses these base drive names directly. As @Tomasz_Kusmierz points out, there are mechanisms to track drives across these temporary sda-style names. Some track only the file system (by fs-level UUIDs) and some track the partition table / boot sector area. But below those mechanisms is the tracking of the drives themselves.

The Udev system provides one of these by way of the links in:

ls -la /dev/disk/by-id/

which are based in the main on the devices’ serial numbers and bus type, and simply point back to the ‘current’ sda-type names that the btrfs commands output. These names, however, sit below btrfs and are not btrfs names as such, just normal /dev OS drive names.
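To see the same mapping programmatically, a minimal Python sketch can resolve each by-id link to the device node it currently points at. It uses only the standard library; `map_by_id` is a name invented here, and the directory argument is a parameter only so the function can be pointed at a test directory.

```python
import os


def map_by_id(by_id_dir="/dev/disk/by-id"):
    """Return {link_name: resolved_device_path} for every symlink in by_id_dir."""
    mapping = {}
    if not os.path.isdir(by_id_dir):
        # Directory missing (e.g. not a Linux/udev system): nothing to report.
        return mapping
    for entry in sorted(os.listdir(by_id_dir)):
        link = os.path.join(by_id_dir, entry)
        if os.path.islink(link):
            mapping[entry] = os.path.realpath(link)
    return mapping


for name, device in map_by_id().items():
    print(f"{name} -> {device}")
```

Running this before and after a suspected reset and comparing the two results shows which serial-based link moved to a different sdX name.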

Apologies if I misread your post on this point.

Hope the H200s work out, and let us know how the SMART info looks when retrieved through them, as some cards require custom SMART parameters. But I don’t know about these ones (yet).

Yes it does look like we have some bugs in that area and funnily enough it looks like @Tomasz_Kusmierz himself (@tomtom13 on github) is looking into this currently:

@phillxnet

Not at all needed. You and the other members have been very helpful and patient with my learning curve.

Those are a little trickier than the M1015s, but so far no errors and no bricked cards.

Excellent. Before I open a duplicate fix request on this IP issue: I also noticed that the CLI sometimes displays a message with the wrong IP. For instance, if you install the server using DHCP, it will display that address in the console. You then statically assign another one in the router and reboot. The CLI displays the old IP, but the interface has the new one and the GUI is accessible using it…

Best regards!

@m3elloa Yes, this console message code was last updated prior to a major re-working of the network code, so I think that’s where the disconnect / dysfunction lies. If we are lucky then a simple Carriage Return in the console may update it, but I haven’t looked at this for ages.
Given Rockstor is mostly deployed headless it might be better to drop that feature and improve / mend the service discovery side of things for Rockstor’s own Web-UI. Had a quick look at this once but didn’t get anywhere.

Thanks, and do dive in with GitHub issue creation; obviously the more tightly defined the issue the better though.

@m3elloa could you please dump all your network worries into that one:

Please :slight_smile:

I know that @suman won’t be happy over not separating issues on GitHub, but right now it seems like there is an endemic issue there that might require a cumulative rewrite ( there is a lot of “stale stuff” going on in the GUI as well :confused: )

Even the GUI related ones? i.e.: CLI not showing the correct IP?

@m3elloa I would be careful to distinguish the CLI (command line interface) from the initial console message here, just in case there is any confusion. I believe that the recently re-written network bits essentially interface directly with the command line app ‘nmcli’, so I would be surprised if that didn’t tally with at least the intentions of the Web-UI. Hope that helps to separate the console message from command line tools such as nmcli.

But I have to qualify this with not having looked at this area of Rockstor for a bit.

Oops… got to be careful with my translations :wink:

Thanks for keeping me straight.

@phillxnet

After reinstalling the system and reformatting the HDs, I can’t reproduce the issue. Just finished copying 6 TiB of data to the server and no e-mails were sent to me. Will keep an eye on it and re-post in case of issues.

Thank you and all for the attention to my questions.
