System unresponsive

I have a SuperMicro based system with the OS installed on a USB flash drive.

Rockstor appears to be crashing, but I don’t know how to verify that. After the assumed crash, I can’t ping the server. So far, I’ve suffered what I believe are three crashes, all yesterday.

  1. At around 5 PM, the system was running and responding. At 7 PM, my wife wanted to watch Plex, but her phone could not connect to it. Sometime between 5 and 7, the system appears to have crashed. No response to pings, it appears to have been dead to the outside world. I pressed the power button and waited half an hour for the server to shut down, but it never did, so I forced a power cycle.
  2. At 7:45ish, the server was running again. By 8:00, it had crashed. I tried to start an IPMI session, but no luck. The embedded IPMI page worked, but the IPMI java application said there was no session to connect to. I force power cycled again.
  3. Around 8:15 PM the server was running again. I opened an IPMI session and started getting: BTRFS error (device sda): parent transid verify failed on 69142933258240 wanted 8120483 found 8120481. It appeared to be otherwise fine, so I left it running and started a scrub (see below). This morning, the system had crashed again and will not honor an IPMI software shutdown or the power-button shutdown.

I looked up the transid error and found a few things (mostly on this forum). It indicates that BTRFS was in the middle of a write when the system failed. Since the "wanted" number is 2 larger than the "found" number, it appears to have lost two writes (journal entries?). When this happens, some people are unable to mount their pool and some still can; those who can may not find any obvious issues with their system. I'm in the second group: I could mount the pool and saw no obvious issues aside from the crash.

As I can mount the pool, I found two solutions.

First Solution: btrfs rescue zero-log <device> should zero out the journal entries so I stop getting that error, though I suspect whatever file was being written would be left corrupt at that point. I originally ran it as btrfs zero-log with /dev/sda as <device> and got an error indicating that zero-log was an invalid token; the subcommand has since moved under btrfs rescue, which would explain that. Note the argument is a device node (with the filesystem unmounted), not the mount point /mnt2/main_pool.
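For reference, a minimal sketch of the corrected invocation (the device path is the one from the post; adjust to your pool member, and note this is destructive to the log tree):

```shell
# Hedged sketch: zeroing the btrfs log tree. "zero-log" lives under
# "btrfs rescue" in current btrfs-progs, which is why a plain
# "btrfs zero-log" invocation is rejected as an invalid token.
# /dev/sda is the device from the post; the filesystem must be unmounted.
ZERO_LOG_CMD="btrfs rescue zero-log /dev/sda"
echo "$ZERO_LOG_CMD"
# To actually run it (only with the pool unmounted):
#   umount /mnt2/main_pool && $ZERO_LOG_CMD
```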

Second Solution: Run a BTRFS scrub. I started one last night; after about four hours it had completed 502 GB of roughly 19,000 GB, putting the estimated total at about six days, except that the system crashed after those four hours. I suspect I can resume it.
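The scrub lifecycle, for anyone following along (assuming Rockstor's usual mount location /mnt2/main_pool; substitute your pool's mount point):

```shell
# Sketch of the scrub workflow; guarded so it does nothing on a box
# where the pool isn't mounted.
MNT=/mnt2/main_pool
if mountpoint -q "$MNT"; then
    btrfs scrub start "$MNT"    # begin a scrub
    btrfs scrub status "$MNT"   # progress, bytes scrubbed, error counts
    # btrfs scrub resume "$MNT" # continue a scrub interrupted by a crash
else
    echo "not mounted: $MNT"
fi
```

A scrub runs in the background, so status (and resume after a crash) are the commands you end up using most.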

However, I suspect the parent transid errors are a side effect of the system crashing, not the cause. How can I determine what is causing the crash? I'm thinking the first thing I should do is unplug all of my drives so they aren't mounted at startup. Alternatively, I can pull the USB drive and plug it into my laptop. Maybe the first step should be to copy the logs and then start a USB scan?

Edit 1: I’ve pulled the USB flash drive from the system and copied all of its files to my laptop’s desktop. I’m also making a dd backup of the flash drive: dd if=/dev/sdc of=/home/Noggin/Desktop/rockstor_flash.img. I then plan to run badblocks on the USB drive and restore the image afterward.
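A slightly more defensive version of that imaging step, with a byte-for-byte verification pass (the device and output path are the ones from the post; image_and_verify is a made-up helper name):

```shell
# Sketch: image a device (or file) and verify the copy.
# status=progress is GNU dd; bs=4M just speeds the copy up.
image_and_verify() {
    src=$1; dst=$2
    dd if="$src" of="$dst" bs=4M status=progress &&
    cmp -s "$src" "$dst" && echo "image verified"
}
# As in the post:
#   image_and_verify /dev/sdc /home/Noggin/Desktop/rockstor_flash.img
```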

Edit 2: Badblocks completed all four passes successfully, suggesting the USB device is OK. While badblocks was running, I dug through the logs, though I’m not certain what to look for. sudo find . -type f -exec cat {} + | grep -i panic only showed that a panic handler was registered, not that any panics were recorded. That command also seems to skip some files: I know messages contains transid errors, but the same find/cat/grep pipeline with transid shows none of those lines. I suspect the next best thing is to reinstall Rockstor. I may plug the USB back in (after restoring the image and pulling power from the hard drives) to grab screenshots of setups, device IDs, and other information. Am I correct in assuming that CentOS is more stable than openSUSE?
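One likely reason for the missing lines: rotated logs are gzip-compressed, and grep stays quiet on binary data. A sketch that covers both cases (scan_logs is a hypothetical helper name):

```shell
# Sketch: list log files mentioning a pattern, including rotated .gz
# copies that a plain find/cat/grep pipeline silently misses.
scan_logs() {
    dir=$1; pat=$2
    # plain-text logs (-I skips binary files, -l lists matching files)
    grep -rIil -- "$pat" "$dir" 2>/dev/null
    # rotated, gzip-compressed logs (messages-*.gz and friends)
    find "$dir" -name '*.gz' -exec zgrep -li -e "$pat" {} + 2>/dev/null
}
# e.g.  scan_logs /var/log transid
#       scan_logs /var/log panic
```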

@Noggin Sorry I’ve not got much time on this one but re:

Quite possibly. You still need to scrub the pool, as you tried, but I suspect you have either a bad PSU or bad RAM. Investigate those before you try anything else at the OS level; with bad hardware, all OS/filesystem work is essentially doomed.

Not necessarily. I would say if you don’t need AD then definitely go with the openSUSE variant, as its btrfs core and kernel are years newer, and given that btrfs has seen some significant improvements in that time I would definitely go for the openSUSE variant via the DIY installer build method:

But in your case I strongly suspect hardware first. And the pool damage is due to that. You may not be able to repair the data pool but you may very well be able to retrieve the data, even if you have to mount read only.

Let us know what you find. And I’m assuming you are aware of utilities such as memtest86+ or whatever it’s called these days.

Hope that helps.


You might be right about the hardware issues. I’m neck deep in OS reinstallation, waiting for yum update to complete. It’s been running for about 6 hours now and probably has a couple more to go. No issues or crashes yet though. When it is done, I’ll make an image of the flash drive in case I decide to return to CentOS.

I’m familiar with Memtest86. I really don’t want to deal with bad hardware right now (but who ever does?). I can run memtest overnight, but I don’t have a good way to test any of the other hardware.

Regarding installation of openSUSE based Rockstor, a few things aren’t completely clear to me. It looks like it is suggested to build the installer using an openSUSE installation. My daily driver runs Mint, so what I read makes it sound like I shouldn’t use this system to build the installer. There’s a warning that suggests I use a “discrete OS” install, and I’m not sure what that means. Does it mean “don’t do this in a VM” or does it mean “neither VM nor live CD”? Perhaps I could install openSUSE on my target hardware (assuming my hardware isn’t toast) and build the installer using it (sounds painful, as it’ll be done on a USB flash drive).

Unless I hear otherwise, I’ll plan on using an openSUSE live image.

I built the installer using an openSUSE live image, didn’t run into any trouble building it. Only issue is that it seems that Rockstor doesn’t want to accept my activation key. My appliance ID didn’t change though, so I’m not sure why it isn’t taking my key.

I’ll run Memtest86 overnight.

@Noggin, Re:

PM me here on the forum with your Appliance ID and activation code and I’ll look into the back end of things. The openSUSE variant can subscribe to Stable with no known issues so this is likely an issue with your subscription at my end.

Well done on building the new installer and getting it installed OK. Incidentally the ‘place holder’ rpm with the Stable channel for the ‘Built on openSUSE’ variant is currently identical to the one in the testing channel for this variant anyway. Currently 4.0.4:

Hope that helps, and keep us updated on how things go. Incidentally, you can do the installer build in a VM, e.g. KVM, just fine. I’ll see about clarifying the text when I’m next in that repo.

For anyone who finds their way here because their appliance ID doesn’t work: @phillxnet edited my appliance ID on the https://appman.rockstor.com/ site to be lowercase. This matches the Rockstor Appliance Info page, and then I was able to register. Had I considered this to be the issue, I could have modified it myself.

Regarding the problem with the system becoming unresponsive…

  • It is running on openSUSE without the hard lockups. I don’t know if there was a problem with the old install or if openSUSE has better error handling with whatever problem I was having previously.
  • I still have the transid issue. I haven’t yet done anything that attempts to resolve it, so that’s expected.
  • I’m rsync’ing all of my data, at least what I care about, to local hard drives. My BTRFS pool keeps dropping to a read-only state. I don’t care, because I’m only reading data from it.
  • I have to periodically reboot the system because after 24 to 48 hours I start getting increasing numbers of I/O errors, which cause rsync to fail. After a reboot, rsync succeeds on files it previously failed on.
  • I suspect that @phillxnet is correct in that this is a power supply issue. I’ll replace it when I get my new one in. I might as well replace my thermal grease too while I’m in there monkeying around.
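For the copy-off step, a small retry wrapper can paper over reads that only fail transiently (pull_data and the example paths are made up; the rsync flags are standard):

```shell
# Sketch: rsync with a few retries, keeping partial transfers so a
# retry resumes large files instead of restarting them.
pull_data() {
    src=$1; dst=$2; tries=${3:-3}
    i=1
    while [ "$i" -le "$tries" ]; do
        rsync -a --partial "$src/" "$dst/" && return 0
        echo "rsync attempt $i failed" >&2
        i=$((i + 1))
    done
    return 1
}
# e.g.  pull_data /mnt2/main_pool/media /backup/media
```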

Installed a new power supply, replaced thermal grease, and the drives were still dropping down to read-only. I imagine they’re dropping to read-only because file system damage was already done. I’ve unmounted all mounts and am running a scrub. I suspect the scrub is going to take quite a while. Says 0.00B (bytes, not TB or TiB?) after more than 30 minutes. I’ll assume that the system is just psyching itself up before getting started.

scrub status for 07f188d4-44d6-49fd-86ac-0380734fe1d2
    scrub started at Sun Dec  6 07:20:25 2020 and was aborted after 00:00:00
    total bytes scrubbed: 0.00B with 0 errors

I’ve copied everything off that I care about already. I’m on the verge of deleting and recreating the pool. I currently have the following drives in the server:

  • Qty 2 - 2 TB (9 years old)
  • Qty 2 - 6 TB
  • Qty 4 - 8 TB

I have a 5 TB (shingled, yuck) and two 10 TB USB drives with data on them. If I remake the pools, I’d prefer to put the two 6 TBs in their own RAID1, remove the 2 TBs, and put the 8s and 10s into a pool. Am I correct in thinking that there isn’t a command to absorb an NTFS drive into a BTRFS pool? I suspect I’d need to:

  1. Make the 8 TB drives into a RAID6
  2. Copy the data from a 10 TB drive to the server
  3. Add the 10 TB to the pool and rebalance
  4. Copy the other 10 TB to the pool and rebalance
  5. Copy the 5 TB drive to the pool

Or maybe I’ll start the 8 TBs as RAID5 so I can copy everything over, then add the 10s and convert to RAID6.
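The steps above roughly map to this command sequence (device names are placeholders, and mkfs/balance are destructive, so treat it as a plan rather than a paste-ready script; there is no in-place NTFS-to-btrfs absorption, each drive gets wiped as it's added):

```shell
# Sketch of the RAID5-first migration; /dev/sdb..sdg are hypothetical.
PLAN=$(cat <<'EOF'
mkfs.btrfs -d raid5 -m raid1 /dev/sdb /dev/sdc /dev/sdd /dev/sde  # 4x 8 TB
mount /dev/sdb /mnt2/main_pool
# copy the first 10 TB drive's contents over, then absorb the drive:
btrfs device add -f /dev/sdf /mnt2/main_pool
btrfs balance start /mnt2/main_pool
# repeat for the second 10 TB drive, then convert data to RAID6:
btrfs device add -f /dev/sdg /mnt2/main_pool
btrfs balance start -dconvert=raid6 /mnt2/main_pool
EOF
)
echo "$PLAN"
```

The full balance after each device add is what spreads existing data onto the new drive; the final balance does the RAID5-to-RAID6 conversion in one pass.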

Edit: Duh, I’m blind. It says it was aborted immediately. Nuclear option it is.