Device: /dev/sdi [SAT], unable to open device

m3elloa · November 14, 2016, 6:01am

Quick question:

Yesterday at 9:58pm EST I got a communication from the server saying that it couldn’t open sdi.

Looking at my disks I don’t see an sdi. All disks are accounted for, but the names are out of sequence: they go from sdh to sdj. I suppose a uuid can change, but while running?

[root@mellorockstor ~]# smartctl -A /dev/sdi
smartctl 6.2 2013-07-26 r3841 [x86_64-linux-4.6.0-1.el7.elrepo.x86_64] (local build)
Copyright © 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org

Smartctl open device: /dev/sdi failed: No such device
[root@mellorockstor ~]# ls /dev/sd?
/dev/sda /dev/sdc /dev/sde /dev/sdg /dev/sdj /dev/sdl
/dev/sdb /dev/sdd /dev/sdf /dev/sdh /dev/sdk
[root@mellorockstor ~]# lsblk -o NAME,SERIAL,SIZE,TRAN,TYPE,FSTYPE,LABEL
NAME   SERIAL            SIZE TRAN TYPE FSTYPE LABEL
sda    JK1105YAKYG9TX    1.8T sata disk btrfs  pool
sdb    YAK71UHV          1.8T sata disk btrfs  pool
sdc    JK11B1B9HRP24F    1.8T sata disk btrfs  pool
sdd    154856401560    223.6G sata disk
├─sdd1                   500M      part ext4
├─sdd2                  11.8G      part swap
└─sdd3                 211.3G      part btrfs  rockstor_rockstor
sde    Y485REZTS         1.8T sata disk btrfs  pool
sdf    YAK6D79V          1.8T sas  disk btrfs  pool
sdg    JK11B1YBKYBTHF    1.8T sas  disk btrfs  pool
sdh    JK1105B8G6Z14X    1.8T sas  disk btrfs  pool
sdj    JK1171YAJSMAKS    1.8T sas  disk btrfs  pool
sdk    JK11B1B9K3ENRF    1.8T sas  disk btrfs  pool
sdl    WD-WCAVY4457087   1.8T sas  disk btrfs  pool
sr0    K1HEAEA5206      1024M sata rom

Any ideas?

Thanks!

phillxnet · November 14, 2016, 1:57pm

@m3elloa This is a little worrying. For a drive to change name like this without a reboot (I’m assuming there was no reboot), it must have been on an interface that was reset (and why?) or had a dodgy cable. That suggests the drive either failed to respond for longer than /sys/block/sdX/device/timeout (usually 30 seconds), or was actually disconnected and then reconnected (hence the dodgy cable suggestion). That event then caused the renaming: sdi became, maybe, sdl, i.e. it jumped to the end of the available names and left a gap in the naming where it used to be. On that assumption, the device with serial WD-WCAVY4457087 is the suspect drive / cable.
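
For reference, the per-device timeout mentioned above can be read straight from sysfs. What follows is only a minimal sketch in Python: the function name `read_scsi_timeouts` and its `sysfs_root` parameter are made up for illustration, and it assumes the usual `<sysfs_root>/<name>/device/timeout` layout.

```python
from pathlib import Path


def read_scsi_timeouts(sysfs_root="/sys/block"):
    """Return {device_name: timeout_seconds} for devices exposing a timeout file.

    Assumes the usual layout <sysfs_root>/<name>/device/timeout; devices
    without that file (loop devices, for example) are skipped.
    """
    timeouts = {}
    root = Path(sysfs_root)
    if not root.is_dir():
        return timeouts
    for dev in sorted(root.iterdir()):
        timeout_file = dev / "device" / "timeout"
        try:
            timeouts[dev.name] = int(timeout_file.read_text().strip())
        except (OSError, ValueError):
            # No timeout file, unreadable, or unexpected content: skip it.
            continue
    return timeouts


if __name__ == "__main__":
    for name, seconds in read_scsi_timeouts().items():
        print(f"{name}: {seconds}s")
```

Taking the sysfs root as a parameter is just a convenience so the function can be pointed at a scratch directory when trying it out.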

Might be worth looking through your logs at around the indicated time to try and track this one down and get more info on what the system reported.

UUIDs of filesystems don’t change (without a re-format) and nor do the serial numbers of drives. Rockstor tolerates drive name changes, as it simply updates whatever name a drive has under a given serial number; the serial number itself is assumed not to change, as is the case for real hardware. The disconnection / reconnection, however, is the worrying part.
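
To illustrate the idea of following a drive by serial number rather than by name, here is a rough Python sketch. It is not Rockstor’s actual code: `serial_to_name` and `renamed_drives` are hypothetical helpers, and the input is assumed to be the text printed by `lsblk -d -P -o NAME,SERIAL` (one `KEY="value"` pair list per line).

```python
import shlex


def serial_to_name(lsblk_pairs_output):
    """Map drive serial -> kernel name from `lsblk -d -P -o NAME,SERIAL` text."""
    mapping = {}
    for line in lsblk_pairs_output.splitlines():
        if not line.strip():
            continue
        # Each line looks like: NAME="sda" SERIAL="JK1105YAKYG9TX"
        fields = dict(token.split("=", 1) for token in shlex.split(line))
        serial = fields.get("SERIAL", "")
        if serial:  # skip entries that report no serial
            mapping[serial] = fields["NAME"]
    return mapping


def renamed_drives(before, after):
    """Return {serial: (old_name, new_name)} for serials whose name changed."""
    return {
        serial: (before[serial], after[serial])
        for serial in before.keys() & after.keys()
        if before[serial] != after[serial]
    }
```

Given one snapshot taken before the event and one after, `renamed_drives(serial_to_name(old_text), serial_to_name(new_text))` would list any drive whose name moved. No "before" snapshot exists in this thread, so the sdi to sdl move remains a guess.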

That’s all I can think of at the moment. Do report whatever you find on this one though, as it’s a curiosity for sure.

m3elloa · November 14, 2016, 3:24pm

Morning.

Server was up for 14 hours when I checked, so 8 to 9 hours running.

Not too familiar with btrfs, but here’s what I see in reference to sdi:

Nov 13 10:28:11 Rockstor kernel: mptsas: ioc0: attaching sata device: fw_channel 0, fw_id 13, phy 3, sas_addr 0x90833825aca29ca9
Nov 13 10:28:11 Rockstor kernel: scsi 6:0:3:0: Direct-Access ATA WDC WD2003FYPS-2 5G09 PQ: 0 ANSI: 5
Nov 13 10:28:11 Rockstor kernel: mptsas: ioc0: attaching sata device: fw_channel 0, fw_id 11, phy 4, sas_addr 0x816f9e519ec1f6be
Nov 13 10:28:11 Rockstor kernel: sd 6:0:3:0: [sdi] 3907029168 512-byte logical blocks: (2.00 TB/1.82 TiB)

<---->

Nov 13 10:28:13 Rockstor systemd: Device dev-disk-by\x2duuid-3f010aa5\x2d5430\x2d4471\x2d8801\x2dfc1b6592b088.device appeared twice with different sysfs paths /sys/devices/pci0000:00/0000:00:1f.2/ata6/host5/tar$
$evices/pci0000:00/0000:00:1c.4/0000:05:00.0/host6/port-6:3/end_device-6:3/target6:0:3/6:0:3:0/block/sdi

<---->

And around the time of the e-mail, the notification comes after several errors:
Nov 13 21:58:11 Rockstor kernel: BTRFS error (device sda): bdev /dev/sdi errs: wr 354381, rd 8, flush 26, corrupt 0, gen 0
Nov 13 21:58:11 Rockstor kernel: BTRFS error (device sda): bdev /dev/sdi errs: wr 354382, rd 8, flush 26, corrupt 0, gen 0
Nov 13 21:58:11 Rockstor kernel: BTRFS error (device sda): bdev /dev/sdi errs: wr 354383, rd 8, flush 26, corrupt 0, gen 0
Nov 13 21:58:11 Rockstor kernel: BTRFS error (device sda): bdev /dev/sdi errs: wr 354384, rd 8, flush 26, corrupt 0, gen 0
Nov 13 21:58:11 Rockstor kernel: BTRFS error (device sda): bdev /dev/sdi errs: wr 354385, rd 8, flush 26, corrupt 0, gen 0
Nov 13 21:58:11 Rockstor kernel: BTRFS error (device sda): bdev /dev/sdi errs: wr 354386, rd 8, flush 26, corrupt 0, gen 0
Nov 13 21:58:11 Rockstor kernel: BTRFS error (device sda): bdev /dev/sdi errs: wr 354387, rd 8, flush 26, corrupt 0, gen 0
Nov 13 21:58:11 Rockstor kernel: BTRFS error (device sda): bdev /dev/sdi errs: wr 354388, rd 8, flush 26, corrupt 0, gen 0
Nov 13 21:58:11 Rockstor kernel: BTRFS error (device sda): bdev /dev/sdi errs: wr 354389, rd 8, flush 26, corrupt 0, gen 0
Nov 13 21:58:11 Rockstor kernel: BTRFS error (device sda): bdev /dev/sdi errs: wr 354390, rd 8, flush 26, corrupt 0, gen 0
Nov 13 21:58:32 Rockstor smartd[2314]: Device: /dev/sdi [SAT], open() failed: No such device
Nov 13 21:58:32 Rockstor smartd[2314]: Sending warning via /usr/libexec/smartmontools/smartdnotify to root …
Nov 13 21:58:32 Rockstor smartd[2314]: Warning via /usr/libexec/smartmontools/smartdnotify to root: successful
Nov 13 21:58:33 Rockstor kernel: btrfs_dev_stat_print_on_error: 7675 callbacks suppressed
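
As an aside, the per-device counters in the BTRFS lines above could be pulled out with something like the following Python sketch. The regular expression is written only against the line format shown in this log, and `latest_error_counts` is a hypothetical helper name.

```python
import re

# Pattern for lines of the form seen in the log above.
BTRFS_ERRS = re.compile(
    r"BTRFS error \(device (?P<fs_dev>\S+)\): bdev (?P<bdev>\S+) errs: "
    r"wr (?P<wr>\d+), rd (?P<rd>\d+), flush (?P<flush>\d+), "
    r"corrupt (?P<corrupt>\d+), gen (?P<gen>\d+)"
)

COUNTERS = ("wr", "rd", "flush", "corrupt", "gen")


def latest_error_counts(log_lines):
    """Return {bdev: {counter: value}} using the last matching line per device."""
    latest = {}
    for line in log_lines:
        match = BTRFS_ERRS.search(line)
        if match:
            latest[match["bdev"]] = {name: int(match[name]) for name in COUNTERS}
    return latest
```

Keeping only the last line per device is a deliberate simplification: the counters are cumulative, so the final value is the one of interest.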

phillxnet · November 14, 2016, 4:15pm

@m3elloa Hello again.

The interesting bit is probably just before this, as here we see /dev/sdi appearing to be gone from btrfs’s point of view. I’m just wondering if there’s anything prior to this indicating its disappearance from the block layer rather than the fs layer. The log after the email time may also show the drive’s reappearance and indicate its new name.
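
One way to narrow the log down to that window is sketched below in Python. This is an illustration only: `lines_before` is a hypothetical helper, the search strings are whatever you choose to pass in (for example `sdi` or the SCSI address `6:0:3:0` from the attach messages), and the year has to be supplied by the caller because these syslog stamps don’t carry one.

```python
from datetime import datetime, timedelta


def lines_before(log_lines, event, minutes, needles, year=2016):
    """Yield lines mentioning any needle within `minutes` before `event`.

    Assumes classic syslog stamps such as "Nov 13 21:58:11" at the start of
    each line; lines that do not start with such a stamp are ignored.
    """
    start = event - timedelta(minutes=minutes)
    for line in log_lines:
        try:
            stamp = datetime.strptime(f"{year} {line[:15]}", "%Y %b %d %H:%M:%S")
        except ValueError:
            continue  # continuation line or unexpected format
        if start <= stamp <= event and any(n in line for n in needles):
            yield line
```

For instance, `lines_before(open("/var/log/messages"), datetime(2016, 11, 13, 21, 58, 11), 30, ["sdi", "6:0:3:0"])` would list matching entries from the half hour leading up to the first BTRFS error; the log path is an assumption about where the messages live on the system.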

Might help to add MODEL to your lsblk command, i.e.:

lsblk -o NAME,SERIAL,MODEL,SIZE,TRAN,TYPE,FSTYPE,LABEL

But it looks pretty likely that /dev/sdi, model “WDC WD2003FYPS-2 5G09” according to the log entry, was somehow lost and reappeared as sdl.

The following command will show mappings as used by Rockstor on the disks page:

ls -la /dev/disk/by-id/

By the way, this hw issue doesn’t bode well given your parity raid selection.

m3elloa · November 14, 2016, 7:32pm

@phillxnet Phill thank you for your reply.

I can send you the message log with everything if you are interested. I just reflashed a couple of H200s to P20 IT and want to destroy the pool, reformat the disks, create a new pool, and put them to the test.

After that I’ll have the PERC 5/i back and will keep an eye out for errors like that, so we can chat.

phillxnet · November 14, 2016, 7:42pm

@m3elloa Thanks but no worries, looks very much like a drive failed / timed out / dodgy cable and not really Rockstor related.
Sounds like you have plenty to be getting along with anyway - Destroy away but I’d keep an eye on that drive / cable / port though.

Thanks for the update and do post anything you find, the more testing the merrier.

But I’d not bother with the parity raid levels just yet, unless you intend to help with the btrfs development on them. Pretty much known as not production ready but getting revamped in the background so there’s hope.

m3elloa · November 14, 2016, 9:27pm

Thanks @phillxnet. I’ll be asking @suman to reset the key once more for sure

One thing though … I’ve replaced my one Perc 5/i with 2 H200 for the test I’ll be performing and noticed that my IP, manually set on my interface, has changed (!?). Go figure

Will save the logs before reinstall in case you guys need anything.

Tomasz_Kusmierz · November 15, 2016, 12:43am

Aaaaahhhh, I’ve had a similar issue on a pre-failure OCZ Vertex SSD. The issue there was that my Vertex, while it was slowly getting ready to shuffle off its mortal coil, was (very rarely) resetting the SATA link and experiencing a “Bourne Identity”. The system uses tricks to identify a device and give it a GUID (globally unique ID), so if your disk, for example, corrupts its serial number, the GUID will change … after the SATA reset your disk will get a different /dev/sd* node … but hey, btrfs will keep going no matter what.

And please remember that the data being spat out by the disk can change … let’s say there is a SATA reset and your disk has a hazy understanding of some of its data: it returns this crap data to the system, the system assigns a different GUID, you get a different node, and 20 seconds later the disk is out of its amnesia and returns everything OK.

m3elloa · November 15, 2016, 1:24am

Yep, I’m sure it is a bad disk. smartctl is reporting lots of errors, I just didn’t expect that uid change. Not too familiar with btrfs yet.

As soon as I’m sure the h200 are flashed correctly they will go back into testing

Thanks for your feedback.

phillxnet · November 15, 2016, 10:20am

@m3elloa Glad you found the dodgy drive.

Yes, drive name changes occur regularly, more often over a reboot / power cycle, but these are system-level drive names, i.e. sda, sdb etc. Btrfs as implemented in Rockstor, i.e. full disk only on data drives, uses these base drive names directly. As @Tomasz_Kusmierz points out, there are mechanisms to track drives across these temporary sda-type names. Some track only the filesystem (by fs-level UUIDs) and some track the partition table / boot sector area. But below those mechanisms is tracking of the drives themselves.

The udev system provides one of these by way of the links in:

ls -la /dev/disk/by-id/

which are based in the main on the devices’ serial numbers and bus type, and simply point back to the ‘current’ sda-type names that the btrfs commands output. These names, however, sit below btrfs and are not btrfs names as such, just normal /dev OS drive names.
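
The same link-to-name mapping can be read programmatically. The Python sketch below is only an illustration (the helper name `by_id_map` is made up, and this is not how Rockstor itself does it); it resolves every symlink in the directory to the device node it currently points at.

```python
import os


def by_id_map(by_id_dir="/dev/disk/by-id"):
    """Return {link_name: resolved_device_path} for every symlink in by_id_dir."""
    mapping = {}
    try:
        entries = sorted(os.listdir(by_id_dir))
    except OSError:
        return mapping  # directory missing or unreadable
    for name in entries:
        link = os.path.join(by_id_dir, name)
        if os.path.islink(link):
            mapping[name] = os.path.realpath(link)
    return mapping


if __name__ == "__main__":
    for link_name, device in by_id_map().items():
        print(f"{link_name} -> {device}")
```

Because the link names stay put while their targets move, running this before and after an event like the one above should show the same by-id entry pointing at a different sdX node.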

Apologies if I misread your post on this point.

Hope the h200’s work out and let us know how the smart info looks when retrieved through these as some cards require custom smart parameters. But I don’t know about these ones (yet).

Regarding your manually set IP changing: yes, it does look like we have some bugs in that area, and funnily enough it looks like @Tomasz_Kusmierz himself (@tomtom13 on GitHub) is looking into this currently:

github.com/rockstor/rockstor-core

network stack overhaul

opened 01:11AM - 11 Nov 16 UTC

closed 06:05PM - 22 Dec 23 UTC

tomtom13

I would like to use this issue to fix a lot of stuff in the networking stack:
- manual settings sometimes don't work
- manual settings can magically suck in additional settings from DHCP and populate them inside the GUI (yeah, scary ... init)
- related to the previous: when trying to fix this problem by going to the edit page again and removing ALL DHCP-acquired settings, leaving only the manual IP address one would wish to have, after hitting submit all manual settings disappear and you are left with only the DHCP ones (checked with ifconfig to be true on the interface as well). After toggling the interface OFF then back ON the problem seems to be fixed.
- bonding only works on initial setup; after a reboot it's bonkers
- after setting up bonding in 802.3ad for 4 interfaces and rebooting in VMware, the interfaces went into a crazy loop and got Rockstor stuck. A forced reboot and removal of interfaces just to get to the logon screen made Rockstor forget my logon details (possible config corruption?)
- teaming is just plainly defunct
- a manual setting forces you to put in a random gateway, which adds a route with higher priority than the DHCP one, resulting in Rockstor not being able to access the internet
- clicking the pencil (edit) button on a network connection sometimes takes you to a 100% unpopulated network page ... but if you stay on the network page for longer than 30 seconds and then click the pencil button, it takes you to a properly populated page.

PS: guys, could you please pile on any problems you see with networking setup? I can join it here.

m3elloa · November 15, 2016, 12:57pm

@phillxnet

No apology needed at all. You and the other members have been very helpful and patient with my learning curve.

The H200s are a little trickier than the M1015s, but so far no errors and no bricked cards.

Excellent. Before I open a duplicate fix request on this IP issue: I also noticed that the CLI sometimes displays a message with the wrong IP. For instance, if you install the server using DHCP, the console will display that address. You then assign a different static one in the router and reboot. The CLI displays the old IP, but the interface has the new one and the GUI is accessible using it…

Best regards!

phillxnet · November 15, 2016, 1:13pm

@m3elloa Yes, this console message code was last updated prior to a major re-working of the network code, so I think that’s where the disconnect / dysfunction lies. If we are lucky then a simple Carriage Return in the console may update it, but I haven’t looked at this for ages.
Given Rockstor is mostly deployed headless it might be better to drop that feature and improve / mend the service discovery side of things for Rockstor’s own Web-UI. Had a quick look at this once but didn’t get anywhere.

Thanks, and do dive in with GitHub issue creation; obviously the more tightly defined the issue the better, though.

Tomasz_Kusmierz · November 15, 2016, 2:19pm

@m3elloa could you please dump all your network worries into that one (the “network stack overhaul” GitHub issue)?

Please

I know that @suman won’t be happy about not separating issues on GitHub, but right now it seems like there is an endemic issue there that might require a cumulative rewrite (there is a lot of “stale stuff” going on in the GUI as well).

m3elloa · November 15, 2016, 2:57pm

Even the GUI-related ones? E.g. the CLI not showing the correct IP?

phillxnet · November 15, 2016, 3:18pm

@m3elloa I would be careful to distinguish the CLI (command line interface) from the initial console message here, just in case there is a confusion. I believe that the recently re-written network bits essentially interface directly with the command line app ‘nmcli’, so I would be surprised if that didn’t tally with at least the intentions of the Web-UI. Hope that helps to separate the console message from command line tools such as nmcli.

But I have to qualify this with not having looked at this area of Rockstor for a bit.

m3elloa · November 16, 2016, 1:34am

Oops… got to be careful with my translations.

Thanks for keeping me straight.

m3elloa · November 16, 2016, 1:55pm

@phillxnet

After reinstalling the system and reformatting the HDs, I can’t reproduce the issue. I just finished copying 6 TiB of data to the server and no e-mails were sent to me. Will keep an eye on it and re-post in case of issues.

Thank you and all for the attention to my questions.