Testing 4.6.1 Stable on Backup setup- Create RAID 0 Bug? 11/29/23 18:59 USA central "SOLVED"

Tex1954 · November 9, 2023, 6:39am

All updates applied before this happened:

Okay, changed my Backup NAS from two raid0 pools (12TB-3Disk and 4TB-2Disk) to a single 20TB-5Disk setup. All 5 drives are now the same 4TB units.

I went to Pool setup (after clearing all the old setup) and selected all 5 disks for a new Raid0 pool.

Everything seemed to go fine… the share worked fine and I started transfer speed testing…

Well, looking at the dashboard, I checked on the write speed of each disk and found one disk had no activity!!!

Hmmm, checked the pool and only 4 of the 5 disks had been used in the new Raid0 thing, but the Pool size was correctly reported!?!?!

Sooo, deleted the pool and share and reset all the disks again. Did the same thing and got the same result: only 4 disks were used in the pool. I then did the resize thing and manually added the 5th disk, and it worked.

Now everything seems to be working normally, the disk writes in the Dashboard are all showing up etc.

I have no idea if some status bit or something went wrong and it was actually using all 5 disks and the Dashboard was just not reporting or what. I have no inclination to try to fill it up in that state and see what happens!

Before I put it back online, let me know if you want screen shots or anything…

Weird…

Hooverdan · November 9, 2023, 4:46pm

That seems to be curious behavior. Have you checked the rockstor log to see whether an error occurs when it’s trying to add the “last” device to the RAID0 profile? I would imagine that you should get a cmd error raised/traceback in the WebUI if something goes wrong during the pool creation according to this:

https://github.com/rockstor/rockstor-core/blob/4d60a2c683140aaa10a8e5b201157163cc9a898b/src/rockstor/fs/btrfs.py#L281

          provided, then attempts to enable quotas for this pool.
          :param pool: Pool object.
          :param disks: list of by-id disk names without paths to make the pool from.
          :return o, err, rc from last command executed.
          """
          disks_fp = [get_device_path(d) for d in disks]
          draid = PROFILE[pool.raid].data_raid
          mraid = PROFILE[pool.raid].metadata_raid
          cmd = [MKFS_BTRFS, "-f", "-d", draid, "-m", mraid, "-L", pool.name]
          cmd.extend(disks_fp)
          # Run the create pool command, any exceptions are logged and raised by
          # run_command as a CommandException.
          out, err, rc = run_command(cmd, log=True)
          # Note that given our cmd (mkfs.btrfs) is executed with the default
          # run_command flag of throw=True then program execution is stopped in the
          # event of rc != 0 so the following clause is redundant but offers an
          # additional level of isolation.
          # Only execute enable_quota on above btrfs command having an rc=0
          if rc == 0:
              out2, err2, rc2 = enable_quota(pool)
              if rc2 != 0:

But, who knows …
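To make the quoted excerpt easier to compare against the rockstor log, here is a minimal sketch of the argument list it would assemble for a five-disk pool. The binary name, the pool label, the `raid0` metadata profile and the by-id disk names are all assumptions for illustration; only the flag order is taken from the excerpt above.

```python
# Assumption: plain binary name; the quoted code uses its own MKFS_BTRFS constant.
MKFS_BTRFS = "mkfs.btrfs"


def build_mkfs_cmd(label, data_raid, metadata_raid, device_paths):
    """Assemble the mkfs.btrfs argument list in the same order as the excerpt."""
    cmd = [MKFS_BTRFS, "-f", "-d", data_raid, "-m", metadata_raid, "-L", label]
    cmd.extend(device_paths)
    return cmd


def devices_in_cmd(cmd):
    """Return the device paths, i.e. everything after the '-L <label>' pair."""
    label_index = cmd.index("-L")
    return cmd[label_index + 2:]


# Hypothetical by-id paths, not the real disks from this thread.
disks = ["/dev/disk/by-id/ata-EXAMPLE_DISK_%d" % n for n in range(1, 6)]
cmd = build_mkfs_cmd("Backup", "raid0", "raid0", disks)

print(" ".join(cmd))
print("devices passed to mkfs:", len(devices_in_cmd(cmd)))
```

If the logged command lists all five devices, the selection made in the WebUI reached the backend intact and the problem lies later in the chain.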

After you create the pool, you could also take a look at the command line view of the pool via:

btrfs fi show would show you which device is assigned, and with btrfs fi usage /mnt2/ it would show you some size info (if run from the WebUI SystemShell you need to sudo the commands).
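As a rough sketch of how the `btrfs fi show` output could be checked against the disks that were selected, the snippet below pulls the device paths out of the `devid` lines and reports any expected device that is absent. The sample text is hand-written in the usual layout of that command, not captured from this system, and the device names are hypothetical. It also assumes the text covers a single filesystem.

```python
import re

# Illustrative sample only: hand-written, not output from the NAS in this thread.
SAMPLE_FI_SHOW = """\
Label: 'Backup'  uuid: 00000000-0000-0000-0000-000000000000
\tTotal devices 4 FS bytes used 1.00GiB
\tdevid    1 size 3.64TiB used 1.00GiB path /dev/sda
\tdevid    2 size 3.64TiB used 1.00GiB path /dev/sdb
\tdevid    3 size 3.64TiB used 1.00GiB path /dev/sdc
\tdevid    4 size 3.64TiB used 1.00GiB path /dev/sdd
"""

# Matches lines such as: "devid    1 size ... used ... path /dev/sda"
DEVID_LINE = re.compile(r"^\s*devid\s+\d+\s+.*\bpath\s+(\S+)\s*$")


def pool_devices(fi_show_text):
    """Return the device paths found on 'devid' lines, in listed order."""
    devices = []
    for line in fi_show_text.splitlines():
        match = DEVID_LINE.match(line)
        if match:
            devices.append(match.group(1))
    return devices


def missing_devices(expected, fi_show_text):
    """Return the expected device paths that do not appear in the output."""
    present = set(pool_devices(fi_show_text))
    return [dev for dev in expected if dev not in present]


if __name__ == "__main__":
    # Hypothetical selection of five disks; one never made it into the pool.
    expected = ["/dev/sda", "/dev/sdb", "/dev/sdc", "/dev/sdd", "/dev/sdf"]
    print(missing_devices(expected, SAMPLE_FI_SHOW))
```

Running the real command right after pool creation and feeding its output to `missing_devices` would show whether the fifth disk is truly absent from the filesystem or only missing from the WebUI display.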

Tex1954 · November 9, 2023, 6:54pm

Didn’t get any errors and such… almost done testing though.

I’ll try to do a video of the whole process again, not a problem at this point.

Tex1954 · November 13, 2023, 3:54am

Okay, here is the link to what is happening…

https://youtu.be/gttvbHzrzeM

Here is the status after creation.

Easy fix, but the Web GUI is slow to update or something…

Hooverdan · November 13, 2023, 6:11pm

Thanks for the video. I tried the same thing with 5 disks on a fresh Rockstor install, and in my case it did pick up all 5. Main difference for me is that I attempted to do this on a VirtualBox setup with virtual hard disks.
At least it seems to confirm that it’s probably not the WebUI/backend incorrectly evaluating how many/which disks it needs to pick up.

Stabbing in the dark, but would the behavior be the same if you swap physical disks around on your connections, to see whether it’s always a different disk that fails to be picked up? I’m wondering whether it’s related to the disk controller (not that it’s defective, but just quirky) …

Tex1954 · November 13, 2023, 6:43pm

LOL! Well, I just backed up 16 TB of stuff… took like 15 hours… But, no problem. In the interests of troubleshooting, I’ll swap the two new IronWolf drives and see if the same drive gets dropped…

Just for the curious, my main NAS setup uses all Western Digital drives… 8 4TB in Raid10 and 6 1TB in Raid10 setup. The backup NAS uses 5 Seagate IronWolf drives… I like to mix things up just to balance the error chances… both systems use Intel E3-12xx processors and each have 16GB of ECC memory… more memory made no overall difference in performance for my uses…

Hooverdan · November 13, 2023, 6:45pm

Oh man, I did not want to incur another 15 hours of backing stuff up for you … sorry. Feel free not to do it, since this might not tell us anything.

Tex1954 · November 13, 2023, 6:52pm

Not a problem! I enjoy helping when I can… backup is simply set-it and forget-it in the background.

PS: deleting the stuff is already done, have to open up the box and swap the disks now…
PPS: The ol’ “Can’t cancel reboot request” thing is still there… LOL!

8-P

Tex1954 · November 13, 2023, 10:42pm

Swapped the two hard drives, all the drives ended up in the same order as before on the Disks display. Same hard drive failed to attach as before…

So, removed all the drive data cables, rebooted, detached all the drives, rebooted again.
Connected the drives one at a time and discovered which drive was the one that didn’t attach before. Turns out it was one of the original 3 4TB drives I had in there.

Swapped the cables on the drives and rebooted again. Everything stayed the same.

It appears there is something funny with that one drive. It isn’t the motherboard or cable or power connector…

Seagate website says there is no firmware upgrade for this or my other drives, seems they are all up to date…

So, it’s all back together and working again.

Seems to be a hardware glitch on that one drive or maybe some hidden glitch in the create software or BTRFS code?

At this point, have no other things to check on my end… if y’all come up with sumtin, I will be happy to check or I can also give you remote control access to the setup as well.

Tex1954 · November 14, 2023, 3:38pm

PS: Just for grins, I ordered another new drive, exact same type. When it comes in, I’ll swap it with the “glitchy?” drive and see if that makes any difference.

Also, for what it’s worth, all my NAS drives are CMR by design…

Hooverdan · November 14, 2023, 5:30pm

@Tex1954 CMR or SMR, I would hope it doesn’t make a difference for this particular behavior, and you have all CMRs anyway. It will definitely be interesting to see whether a new drive shows the same behavior or not.
Keep us posted, as I am out of alternative ideas.

Tex1954 · November 22, 2023, 9:03pm

Update 11/22/23:

Swapped out the ??? hard drive with new unit. Performed all the tests as before… even checked the system BIOS and changed the drive specs to “Disable Hot Swap”.

Changed SATA position on MB, changed SATA cable, changed POWER cable, changed Hard Drive to the same PN IronWolf unit I bought…

After swapping and clearing and rebooting and testing and troubleshooting, the exact same symptoms appear. I even tried different RAID levels including SINGLE and Raid5… all had the same exact symptoms.

Notice all I have to do after Pool creation is click on SHARE then click back to Pool… then the disks used on the right changes…

I tried a Raid0 using RockStor Devices 1,2,3,5 and even then it will NOT add device 5!!!
Device 5, whatever it happens to be never adds to the pool, not even for 2 drives…

As noted in the above pics, ZGY8DX40 is assigned device 5, this is the NEW drive. It is exactly the same part number and specs as all the other drives.

THEN I disconnect one of the other drives and guess what? Drive ZGY8SX40 assigned a different device number STILL FAILED to connect.

I think it is finally clear, the only thing left is there is something different between the drives, but once I create the Raid0 and manually add the weird drive, things work fine.

The only other thing I will state is this problem did not show itself until this last RockStor update… I think…

I am willing to give someone 24/7 access to this setup however you would like to help troubleshoot this.

The last thing I did is see if there is a firmware update for the drives… but I recall checking before and the answer was negative… All the drives have the same firmware installed.

Final note, the drive giving the current problem was part of a 3 HD Raid0 setup for over a year and never had problems until I upgraded to 4.6.1… but just for grins, I’ll try installing a WD drive and see what happens…

Tex1954 · November 23, 2023, 12:16pm

On the setup Motherboard, I swapped the ATA0 SATA connector with the ATA5 SATA connector and it made NO difference to the glitch. WHATEVER the glitch is, it also seems to be /sdf bound.

Last thing before the WD drive swap was to see whether I could restore the pool after deleting it… and it worked fine, except it reported 18.20 TB instead of 18.19 TB size… glitches everywhere…
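For context on the 18.19 vs 18.20 figures, a small worked conversion may help. It assumes the size shown in the WebUI is in binary units (TiB) even though it is labelled TB, and that each drive is exactly 4 decimal TB. Neither assumption is confirmed in this thread, and real drives usually hold slightly more than their nominal size.

```python
TB = 10**12   # decimal terabyte, as printed on the drive label
TIB = 2**40   # binary tebibyte, as many tools report it


def tb_to_tib(terabytes):
    """Convert decimal terabytes to binary tebibytes."""
    return terabytes * TB / TIB


# Five nominal 4 TB drives striped together (assumed exact 4 * 10**12 bytes each).
raw_tib = tb_to_tib(5 * 4)
print(f"{raw_tib:.2f} TiB")  # prints "18.19 TiB"
```

Under those assumptions a 20 TB pool comes out at about 18.19, so a 0.01 difference after a re-import is within what rounding at a different point could produce. This does not identify the cause of the changed figure.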

Anybody want to remote control this thing to discover what’s causing the weirdness, lemmy know!!!

OR, I would be happy to do ANYTHING y’all ask to help stomp this critter…

Hooverdan · November 23, 2023, 7:49pm

I am wondering whether a vanilla OpenSUSE installation would give an idea whether it is related to the OS updates themselves or not …but then you have to “manually” create the pools/shares, etc.

Might be a little too much effort without knowing that it will pay off …

Tex1954 · November 23, 2023, 7:52pm

If you think it will help, I’ll try to find a 15.4 version of OpenSUSE and install it. I already tried a from scratch install of RockStor twice…

PS: Seems Leap 15.5 is the only DL version I can get…
PPS: Found it…

Tex1954 · November 24, 2023, 6:21pm

Okay, couldn’t get 15.4 Leap to install properly and gave up.

Instead, I installed Rockstor 4.09 and tested without any updates= All good!

Did updates= All good!

Did update to 4.1= All good!

Did updates= All good! Fast transfer speeds etc…


Used the instructions at “Distribution update from 15.3 to 15.4” (Rockstor documentation)
to upgrade to 4.6.1==

BAD!!!

I think I have proven there is something wrong with the software…

I will try again to install Leap 15.4 generic, almost worked last time…

PS: Rebooted after every change to make sure… Saved all the logs I could as well…

Tex1954 · November 24, 2023, 10:35pm

I have 15.4 installed, working on things… Perhaps we could move this thread to “Troubleshooting”?

Hooverdan · November 26, 2023, 6:32pm

I did as you asked

Not sure whether you made any progress … after some side conversation, here could be another approach, if you’re up for it. Should have thought about that piece a bit earlier, my apologies.

Since we have not really changed the disk/pool/share piece over the last few years (and even have development tests for 27 or so disks that seem to set up fine), it could point to some kernel related changes that are causing your symptoms.

Since Leap 15.4 will be EOL very soon (matter of weeks) we’re also looking towards moving to 15.5 as the next base OS level.

So, again, if you have nothing better to do and are up for it:
You could follow the instructions from here:

https://rockstor.com/docs/howtos/rpm_install.html

albeit on a 15.5 base OS installation, instead of 15.4. This would also install the latest testing release. The only caveat for this test might be, that replication (which you are using) might be broken (it shouldn’t be, but there is some work being done further in the development to address some issues in Rockstor’s replication functionality due to the various component upgrades that are in progress).

So, on a 15.5 base you would have to still follow the apparmor/wicked/network manager steps the same way:

systemctl disable apparmor

zypper install --no-recommends NetworkManager
systemctl disable wicked
systemctl enable NetworkManager
systemctl start NetworkManager

but the repositories to be added would now contain the 15.5 moniker, like so:

zypper --non-interactive addrepo --refresh -p105 https://download.opensuse.org/repositories/home:/rockstor/15.5/ home_rockstor
zypper --non-interactive addrepo --refresh -p97 https://download.opensuse.org/repositories/home:/rockstor:/branches:/Base:/System/15.5/ home_rockstor_branches_Base_System
rpm --import https://raw.githubusercontent.com/rockstor/rockstor-core/master/conf/ROCKSTOR-GPG-KEY
zypper addrepo -f http://updates.rockstor.com:8999/rockstor-testing/leap/15.5/ Rockstor-Testing
zypper --non-interactive --gpg-auto-import-keys refresh

And the actual install and start of Rockstor would look like this:

zypper in --no-recommends rockstor-5.0.5-0
systemctl enable --now rockstor-bootstrap
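Since the repository lines above differ only in the Leap version they contain, the substitution can be sketched as follows. The URL patterns and aliases are copied from the commands above; the helper name `repo_urls` is hypothetical, and whether matching repositories exist for any version other than those mentioned in this thread is not checked here.

```python
# URL patterns copied from the zypper commands above, with the Leap version
# replaced by a placeholder. Keys are the repository aliases used there.
REPO_TEMPLATES = {
    "home_rockstor":
        "https://download.opensuse.org/repositories/home:/rockstor/{leap}/",
    "home_rockstor_branches_Base_System":
        "https://download.opensuse.org/repositories/home:/rockstor:/branches:/Base:/System/{leap}/",
    "Rockstor-Testing":
        "http://updates.rockstor.com:8999/rockstor-testing/leap/{leap}/",
}


def repo_urls(leap_version):
    """Return alias -> URL with the given Leap version filled in."""
    return {
        alias: template.format(leap=leap_version)
        for alias, template in REPO_TEMPLATES.items()
    }


if __name__ == "__main__":
    for alias, url in repo_urls("15.5").items():
        print(alias, url)
```

Printing the URLs before typing the `zypper addrepo` lines is a cheap way to catch a leftover 15.4 in one of them.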

Depending on the outcome, this could give @phillxnet and @Flox some additional ideas, unless this “fixes” the symptoms, so we can move forward with the upcoming next test releases with most underlying components updated that should then result in a new stable release fairly soon …

Tex1954 · November 27, 2023, 3:05am

Rather a PITA, but I was able to get 15.4 installed, updated and running. Spent many hours trying to get RDP/VNC to work and no joy there.

Finally created and enabled the 18.2TB Raid0 array, was able to transfer tons of files back and forth, able to delete said files, delete and recreate (assign/format, etc.) all over again and not a single hiccup either time.

Keep in mind I did this on a separate SSD where I deleted the partition beforehand to get 15.4 to install correctly.

So far as trying the updates, no problem! I can do that!

BTW, I stopped using replication because it caused too much extra disk space to be used. A simple mirror backup works fine for me…

In any case, I’ll give it a go and let you know!

Tex1954 · November 27, 2023, 8:46pm

First problem…

I need to learn how to type better… LOL!

Error: Can’t resolve https://raw.githubusercontent.com

Anyway, link works on my w10 setup, so probably a typo or some other glitch on my end.

PS: Tried so many times, for sure no typo, simply didn’t work for unknown reason…
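One way to separate a typo from a name-resolution problem on the NAS itself is to test the hostname directly. The sketch below is only an illustration: the helper names are hypothetical, port 443 is assumed because the URL is https, and the resolver is passed in as a parameter so the logic can be exercised without network access.

```python
import socket
from urllib.parse import urlparse


def host_of(url):
    """Extract the hostname part of a URL."""
    return urlparse(url).hostname


def can_resolve(host, resolver=socket.getaddrinfo):
    """Return True if the resolver finds an address for host, else False."""
    try:
        resolver(host, 443)
    except socket.gaierror:
        # Raised by getaddrinfo when the name cannot be resolved.
        return False
    return True
```

Calling `can_resolve(host_of("https://raw.githubusercontent.com"))` on the NAS would return `False` if DNS on that machine is the problem, which would fit the link working from the Windows 10 machine but not from the fresh install.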