VT-d passthrough of LSI 2008 HBA not detecting disks in Rockstor 3.8-10 release

Phillxnet, much appreciated for the guidance; I forgot to thank you earlier. Still stuck in a hole, but I'll deal.

Woohoo, I got further: I was able to successfully wipe/prep the disks via this process.

```
dd if=/dev/zero of=/dev/sdb bs=1M
badblocks -wsv -o /root/badblocks-sdb.txt /dev/sdb
mkfs.ext4 -l /root/badblocks-sdb.txt /dev/sdb
```

Then the Rockstor 3.8-10 GUI gave me an option to wipe disks. What does this command initiate: btrfs commands or standard Linux filesystem wipe/clean commands? If the latter, which ones? I also found a screenshot from who knows what release of Rockstor with buttons I must not have seen before. Any idea what they are?

Not getting mpt2sas errors anymore on boot either…that was just weird. :frowning:

I am now able to snapshot, clone, re-share out that clone, and mount it up to vSphere ESXi hosts (something I was unable to do with 3.8-7), and I have replication jobs set up (appliances added to each other both ways), but replication is failing. Any idea why, or the trick to get replication working between two Rockstor appliances?

I still don't know what that "Warning: UUID not legit/unique" message is. These run on vSphere 6.0U1 infrastructure, and I have tried placing the Rockstor storage appliance on both my vSAN datastore and an OmniOS-backed NFS datastore, with the same result/complaint. Any idea how to get rid of this error/anomaly?

See screenshots.

Is this the snapshot promotion issue? It says prune, but I saw a previous forum post with similar issues. I SWEAR I did NOT allow either appliance to update, so if it was working on ship date it should still be. I took no further upstream appliance updates or manual yum updates.

I DO, however, now see this. Why would the receiver be unreachable? There are no firewalls, both appliances are on the same subnet, both interfaces I use are marked as I/O, and I told replication to use that interface. See any issues with this, or should I shut down the VM appliances and add a third interface, or just let it fall back to the mgmt interfaces?

I think I figured out what I did wrong: I had a second vNIC on the appliances but told replication to use the mgmt IP…doh. Re-testing now; will report back shortly with success/failure findings.

EDIT: No luck; I even tried swapping back to just letting replication go out over the default mgmt interfaces.

Any ideas, or is it still broken/not working?

Great release btw, getting closer to solid GOLD!

suman/phill…help a brotha out :smiley: hahah

Take care guys, look forward to hearing from you.

@whitey Glad to see you've made some progress. Got to be quick, so I'll be brief. Replication was working, but we suspect that some recent upstream updates have upset things, as the problems appeared shortly after the new replication code went in. There is another thread which discusses recent problems in this area and mentions the same error, at least at one end, as you are getting.

If you refer back to an earlier post of mine in this thread, there is an explanation of the big red serial warning that I highlighted with an "N.B." Not being funny, just brief: just give the drive a serial number. And in answer to another question of yours, I believe the command used to wipe the drive is wipefs. Also in the same N.B. post there is a reference to wiping the drives with the rubber icon.

Sorry got to dash now. Well done on persevering and sharing your findings. We are not there yet obviously but I am firmly of the opinion that things are moving in the right direction.

Note that if you have to go through all that badblocks business then things do seem a little strange with your drives.

Thanks for the reply, but BUMMER that the replication framework has regressed. :frowning:

How long do you think it 'may' be before we see that resolved, or a new release with validated working replication, or the framework that replication relies upon made more robust/resilient to code changes? Do we know yet what underlying code was affected, btw? Kernel, btrfs, or other interdependent filesystem tools essential to replication?

Blows my mind that the 3.8-10 release with NO updates and no manual yum updates is not working, if that one was a known/verified code drop with replication working. :frowning: I figured that if I tell the web UI NOT to update (and even if you do, it simply tells you you are running the latest version), we would be good to go and could just 'revision lock' it, essentially, until the upstream issues are identified or resolved.


Oh, and as a side note: these drives are Hitachi/HGST HUSSLBSS600s, typically VERY good drives (enterprise class, RIDICULOUS PBW stats). I've had a mix of good/bad luck with these; the last three I got seemed to initially throw a fit and NOT want to work in ANYTHING but an OmniOS storage appliance VM. Linux would throw a fit, and FreeNAS would as well (I STILL can't use three of the four of those disks in FreeNAS, as it complains about being unable to lay down a GPT partition/format the device). I finally got some time to troubleshoot them, apparently brought them back to life/a usable state, and decided to run several badblocks passes just to make sure they were good. What wiped them, or seemed to initially make them visible to my Ubuntu LTS utility VM, was the dd trick. Although it does seem that some OSes/storage appliances are VERY picky with previously used or unclean disks…even if you think you've cleaned them.

If there is a 'works every time, without a doubt' way or preferred method, I'm all ears. If wipefs /dev/sd* followed by wipefs -a /dev/sd* is the easiest, I can do that in the future.
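For reference, that list-then-erase wipefs flow can be wrapped in a small helper. This is only a sketch, not how Rockstor invokes it; the function names are made up for illustration and the device path is a placeholder, with a dry-run default so nothing gets erased by accident:

```python
import subprocess

def wipefs_commands(device):
    # Two-step flow: list detected signatures (read-only), then erase them.
    return [
        ["wipefs", device],        # no flags: report signatures only
        ["wipefs", "-a", device],  # -a: erase every detected signature
    ]

def wipe_disk(device, dry_run=True):
    # dry_run=True only returns the commands; nothing touches the device.
    cmds = wipefs_commands(device)
    if not dry_run:
        for cmd in cmds:
            subprocess.run(cmd, check=True)
    return cmds

print(wipe_disk("/dev/sdX"))  # dry run: shows what would be executed
```

Running `wipe_disk("/dev/sdb", dry_run=False)` as root would actually clear the signatures, so double-check the device name first.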

At the end of the day, I got 'em for $50 each for 100 GB SLC SAS SSDs, so I ain't complaining, and badblocks comes back silky clean, so I am not quite sure what to make of them. All in all, they seem to be crushing it I/O-wise for now and performing reliably for me (crosses fingers/knocks on wood). It's all lab/play stuff anyways; the critical stuff is still on my tried-and-true ZFS NAS boxes for now…someday :smiley:

@whitey, sorry, been too busy fighting package dependency fires. Replication is fixed by the just-released 3.8-10.02 update. Please see my comments here and here and proceed. It should work this time.

And, thanks for detailed posts. Makes it so much better to help :smile:

Thanks for the hard work and for updating me in this thread, suman. I DO, however, seem to have a knack for making stuff go BOOM.

I updated my primary Rockstor storage appliance to 3.8-10.02 successfully and rebooted, and now the web interface won't load. :frowning: I tried systemctl restart rockstor only to receive these errors. My secondary storage appliance won't even update from 3.8-10 to 3.8-10.02. Must be a bad IT juju day.


@whitey Hello again. To clarify, I take it the first two console screen grabs are from the 'updated but then didn't start' machine, while the third screen grab is from the Rockstor machine that failed to upgrade.

On the first machine, the one that now has no Rockstor Web UI, did you also have python-devel installed? If so:

```
yum remove python-devel
```

It's a long shot, but just in case; python-devel can block the python downgrade during boot.

What is the output from the following command on each of the machines, i.e. the Python release number?

```
yum info installed python | grep Release
```

It should be:

```
Release : 18.el7_1.1
```

if the python downgrade worked; if not, then it will be the problematic 34 release. Internally, during boot, the following command is executed to perform the python downgrade if it finds the 34 release installed:

```
yum downgrade python-2.7.5-18.el7_1.1 python-libs-2.7.5-18.el7_1.1
```

Well, I can see that my Rockstor appliances apparently DO NOT like having dual interfaces. I had them playing nice; I noticed the interface forced me to set a default gateway on the storage VLAN on the second vNIC, which I didn't want to do as it is stubbed/isolated, but I just slapped .1 in anyway to appease the Rockstor interface. After the update, this one and the other one, which I have yet to update (you are right, the first two screenshots are the primary, the third is the secondary), just seem to have lost connectivity out to the net. The routing table looks fine to me; I added a default gateway on the LAN interface that should be able to reach the internet/Rockstor repos/etc., and it is now working, so SOMETHING with that update borked up my interfaces.

LOL, never a dull day. Anyway, I think I need to go back to the KISS principle for now: re-deploy both appliances with a single interface, bite the bullet of running mgmt/data over one vNIC/VLAN for now, and split it out later once I think everything else is solid and I reach replication bliss?

I’ll keep at it gents, you are both gentlemen and scholars! hah

And here is a GOLD screenshot. I think the interfaces are still pissed at me; I can see no interface connectivity on initial boot (I may go remove the secondary vNICs that WERE to be used for storage traffic, just for now). I had to ifconfig the storage vNIC down, and then I saw a hung/stale yum job take off like a rocket; I caught one before and one after. Looks like one is happy and one maybe now. This is using 'screen' so I can give you a two-for-one view. Do I need to down-rev or up-rev one or the other? Whew, this is GNARLY! hah

EDIT: both web interfaces are back now with the secondary data vNIC down, and both appliances are at 3.8-10.02 now :-D. Help me climb this final hump if you guys would be so kind. I want to proceed cautiously; I know you said to downgrade python, but I want to see what you think of the current state with one appliance at one version and the other at another, which is very odd to me (looks to me like the primary (rockstor) did the downgrade and the secondary (rockstor-dr) did not, but that was the one I JUST upgraded after realizing net connectivity was hosing up the upgrade). SMH

And they just fixed themselves (rather, rockstor-dr/secondary did). NICE.

Both appliances are at 3.8-10.02, with python seemingly auto-magically downgraded. Am I good to go to set up a replication job, do you gents think?

I see I must have AGAIN been blind, or maybe the web interface changed: on the secondary NFS data vNICs that were causing contention/routing confusion, it seems I DO see how I can leave off the default gateway/DNS info. YEA, things seem solid now. ROCK ON FOR ROCKSTOR!

Shoot me, I just can't win: now all disks are offline (a mix of a 4-disk HGST HUSSLBSS600 SSD RAID10 and a 4-disk Intel S3700 200 GB SSD RAID10). Both were working previous to the upgrade; now there is no import/wipe option, just remove/delete. :frowning:

I feel like I am living Groundhog Day :smiley:

Disks not showing up could be a matter of scan failure. What's the output of:

```
systemctl status -l rockstor-pre rockstor rockstor-bootstrap
```

Did I ever mention how AWESOME you guys are with the quick responses? I LOVE the direct interaction and being able to provide valuable feedback; hope I am not being a PITA/bother. :smiley:

I really do think there is something seriously wrong here. I just wiped a disk clean again and built out a new RS 3.8-10 box that COULD see the disk prior to updating the box; it just said I had to wipe it to be usable, but the button was there AT LEAST. Now, after the update and a reboot, all I can do is delete it from the system, and after several reboots it is still not seeing the disk.

Could use some guidance, very dejecting.

@whitey I think that would be a bit extreme. :grinning:
All I can think of off the top of my head (for now) is that another update has upset the disk scan with your particular hardware, as we have reports of 3.8-10.02 installing, rebooting, and replicating successfully outside of our own testing. The previously referenced forum thread re Replication has such a report from @jason1 . So there is hope. We just have to narrow down why, on your system, the disk scan is finding no drives after the 3.8-10.02 testing update.

Background: between 3.8-10 (iso) and 3.8-10.02 (testing) there were fixes to how missing drives are represented in the db. However, there were no significant changes to how the drives are scanned, at least on the Rockstor side. Given this info, and from your last Web-UI screenshot (3.8-10.02), it looks like this system is working as expected, bar not seeing any of the attached drives of course; it is simply listing all those devices that used to be attached. When there are long random numbers for device names, that is simply shorthand for a missing/offline device (in testing) in the new arrangement.

The disk scan is performed by a Rockstor internal function (fs/btrfs/scan_disks), which is mostly unchanged except for how it allocates fake serial numbers to devices with repeat or blank serials. scan_disks in turn uses the following Linux command (from util-linux) to find info on all attached disks:

```
/usr/bin/lsblk -P -o NAME,MODEL,SERIAL,SIZE,TRAN,VENDOR,HCTL,TYPE,FSTYPE,LABEL,UUID
```
The result in a VM here with 3.8-10 and no data drives is:

```
NAME="sda" MODEL="QEMU HARDDISK   " SERIAL="QM00005" SIZE="8G" TRAN="sata" VENDOR="ATA     " HCTL="2:0:0:0" TYPE="disk" FSTYPE="" LABEL="" UUID=""
NAME="sda1" MODEL="" SERIAL="" SIZE="500M" TRAN="" VENDOR="" HCTL="" TYPE="part" FSTYPE="ext4" LABEL="" UUID="f34fe05e-c9bf-49eb-bcd5-d027ebaba79a"
NAME="sda2" MODEL="" SERIAL="" SIZE="820M" TRAN="" VENDOR="" HCTL="" TYPE="part" FSTYPE="swap" LABEL="" UUID="8f992f43-7100-4027-b09e-b6b29a3f0252"
NAME="sda3" MODEL="" SERIAL="" SIZE="6.7G" TRAN="" VENDOR="" HCTL="" TYPE="part" FSTYPE="btrfs" LABEL="rockstor_rockstor" UUID="e526c68b-ca0f-43d0-9c16-69c75c1a7e32"
NAME="sr0" MODEL="QEMU DVD-ROM    " SERIAL="QM00001" SIZE="1024M" TRAN="ata" VENDOR="QEMU    " HCTL="0:0:0:0" TYPE="rom" FSTYPE="" LABEL="" UUID=""
```

This info is then parsed by scan_disks and passed on, for db update purposes, to update_disk_state(). In your case it looks like scan_disks is saying there are no attached disks. Note that util-linux was one of the recent base system updates, from util-linux.x86_64 0:2.23.2-22.el7_1.1 to util-linux.x86_64 0:2.23.2-26.el7.
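To illustrate the kind of parsing involved (a rough sketch, not the actual scan_disks code): each line of `lsblk -P` output is a series of KEY="value" pairs, which map naturally onto a dict.

```python
import re

def parse_lsblk_line(line):
    # findall with two capture groups yields (key, value) tuples,
    # which dict() turns into {"NAME": "sda", "SERIAL": "QM00005", ...}.
    return dict(re.findall(r'(\w+)="([^"]*)"', line))

# Sample line taken from the lsblk output shown earlier in this thread.
sample = ('NAME="sda" MODEL="QEMU HARDDISK   " SERIAL="QM00005" SIZE="8G" '
          'TRAN="sata" VENDOR="ATA     " HCTL="2:0:0:0" TYPE="disk" '
          'FSTYPE="" LABEL="" UUID=""')
info = parse_lsblk_line(sample)
print(info["NAME"], info["SERIAL"])  # -> sda QM00005
```

Note how empty fields such as FSTYPE="" still parse to empty strings rather than being dropped; a blank SERIAL is what triggers the fake-serial handling mentioned above.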

The same command here, in this example VM after updating to 3.8-10.02 (testing channel), gives exactly the same output.

Could you post the output of that same command on your problem system with 3.8-10 and then with 3.8-10.02 testing? There shouldn't be any differences, but if there are, then that could be the cause of the problem. Thanks.

If you paste the command results here, a triple backtick on the lines before and after will aid in reading them, as it's otherwise not very eye friendly.

Also, if you run:

```
tail -n 20 -f /opt/rockstor/var/log/rockstor.log
```

and then press the Rescan button in the Disk screen, what is the output?


Output from the box for '/usr/bin/lsblk -P -o NAME,MODEL,SERIAL,SIZE,TRAN,VENDOR,HCTL,TYPE,FSTYPE,LABEL,UUID' with one data disk attached but NOT seen in the GUI:

```
NAME="sda" MODEL="Virtual disk    " SERIAL="" SIZE="20G" TRAN="spi" VENDOR="VMware  " HCTL="3:0:0:0" TYPE="disk" FSTYPE="" LABEL="" UUID=""
NAME="sda1" MODEL="" SERIAL="" SIZE="500M" TRAN="" VENDOR="" HCTL="" TYPE="part" FSTYPE="ext4" LABEL="" UUID="12936957-4a19-457f-a8a0-e55368b82830"
NAME="sda2" MODEL="" SERIAL="" SIZE="2G" TRAN="" VENDOR="" HCTL="" TYPE="part" FSTYPE="swap" LABEL="" UUID="edc4f918-2282-4317-b01b-77df15bae459"
NAME="sda3" MODEL="" SERIAL="" SIZE="17.5G" TRAN="" VENDOR="" HCTL="" TYPE="part" FSTYPE="btrfs" LABEL="rockstor_rockstor" UUID="561bccc2-0281-474d-b481-9faaead4f241"
NAME="sr0" MODEL="VMware IDE CDR10" SERIAL="10000000000000000001" SIZE="707M" TRAN="ata" VENDOR="NECVMWar" HCTL="1:0:0:0" TYPE="rom" FSTYPE="iso9660" LABEL="Rockstor 3 x86_64" UUID="2015-12-11-16-37-13-00"
```

Output from the tail command:

```
[21/Dec/2015 11:14:45] INFO [storageadmin.views.disk:69] update_disk_state() Called
[21/Dec/2015 11:14:45] INFO [storageadmin.views.disk:81] Deleting duplicate or fake (by serial) Disk db entry. Serial = fake-serial-c9f74217-b2c7-4b26-89a4-8d795aeafb81
```
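That fake-serial log entry lines up with the blank SERIAL="" that lsblk reports for the VMware virtual disk. The idea described earlier, scan_disks assigning placeholder serials to devices with blank or repeated serials, can be sketched roughly like this (the function name is hypothetical, not the actual Rockstor code):

```python
import uuid

def effective_serial(serial, seen):
    # A blank or already-seen serial gets a unique "fake-serial-<uuid>"
    # placeholder so each device can still be keyed on serial in the db.
    if serial == "" or serial in seen:
        serial = "fake-serial-" + str(uuid.uuid4())
    seen.add(serial)
    return serial

seen = set()
print(effective_serial("QM00005", seen))       # real serial kept as-is
print(effective_serial("", seen)[:12])         # blank -> fake-serial-
print(effective_serial("QM00005", seen)[:12])  # duplicate -> fake-serial-
```

Since the placeholder is regenerated on each scan, it never matches the db entry from the previous scan, which would explain entries being repeatedly deleted and re-created for a device with no real serial.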