I have a SuperMicro based system with the OS installed on a USB flash drive.
Rockstor appears to be crashing, but I don’t know how to verify that. After the assumed crash, I can’t ping the server. So far, I’ve suffered what I believe are three crashes, all yesterday.
- At around 5 PM, the system was running and responding. At 7 PM, my wife wanted to watch Plex, but her phone could not connect to it, so the system appears to have crashed sometime between 5 and 7. It did not respond to pings and appeared to be dead to the outside world. I pressed the power button and waited half an hour for the server to shut down, but it never did, so I forced a power cycle.
- At 7:45ish, the server was running again. By 8:00, it had crashed. I tried to start an IPMI session, but no luck. The embedded IPMI page worked, but the IPMI Java application said there was no session to connect to. I forced a power cycle again.
- Around 8:15 PM the server was running again. I opened an IPMI session and started getting this error:
BTRFS error (device sda): parent transid verify failed on 69142933258240 wanted 8120483 found 8120481
It appeared to be otherwise fine, so I left it running and started a scrub (see below). This morning, the system had crashed again and would not honor an IPMI software shutdown or a power-button shutdown.
I looked up the transid error and found out a few things (mostly on this forum). It indicates that BTRFS was in the middle of a write when it failed. As the wanted value (8120483) is 2 larger than the found value (8120481), it appears that it lost two writes (journal entries?). When this happens, some people are unable to mount their pool and some are able to. Those who can mount their pool may not find any obvious issues with their system. I’m in the second group: I could mount the pool and have seen no obvious issues aside from the crashes.
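To make the "2 larger" reading concrete, here is a small sketch that pulls the two numbers out of the error line and subtracts them. As far as I understand it, these values are transaction IDs (generations) rather than journal entries, so the gap says the block found on disk is two generations older than its parent expects. The helper name transid_gap is made up for this example.

```shell
#!/bin/sh
# Error line copied from the IPMI console output above.
line='BTRFS error (device sda): parent transid verify failed on 69142933258240 wanted 8120483 found 8120481'

# transid_gap (hypothetical helper): print wanted minus found for one error line.
transid_gap() {
    wanted=$(printf '%s\n' "$1" | sed -n 's/.* wanted \([0-9][0-9]*\) found \([0-9][0-9]*\).*/\1/p')
    found=$(printf '%s\n' "$1" | sed -n 's/.* wanted \([0-9][0-9]*\) found \([0-9][0-9]*\).*/\2/p')
    echo $(( wanted - found ))
}

echo "generation gap: $(transid_gap "$line")"
```

Running the same helper over every transid line in the logs would show whether the gap is always small or whether some blocks are much further behind.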
As I can mount the pool, I found two solutions.
First Solution: btrfs zero-log <device> should zero out the journal entries, and I shouldn’t get that error any more. I suspect that whatever file was being written would likely be corrupt at that point. I used /dev/sda as <device> and was given an error indicating that zero-log was an invalid token, but in hindsight I think I should have used /mnt2/main_pool.
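A note on the invalid token error, offered as a sketch rather than a tested fix: in the btrfs-progs versions I know of, zero-log is a subcommand of btrfs rescue, and it expects an unmounted device rather than a mount point. That would explain the error better than the choice of /dev/sda. Clearing the log tree is also, as far as I know, only useful when the log cannot be replayed at mount time, so it may not touch the transid error at all. The script below only prints the commands unless DRY_RUN=0 is set; the device and mount point are the ones named in the post, and the run and zero_log helpers are made up for illustration.

```shell
#!/bin/sh
# Assumptions: /dev/sda is a pool member and /mnt2/main_pool is where the pool mounts.
DEVICE="${DEVICE:-/dev/sda}"
MOUNTPOINT="${MOUNTPOINT:-/mnt2/main_pool}"
DRY_RUN="${DRY_RUN:-1}"

# run (hypothetical helper): print the command in dry-run mode, execute it otherwise.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

# zero_log (hypothetical helper): unmount first, then clear the log tree.
zero_log() {
    run umount "$MOUNTPOINT" && run btrfs rescue zero-log "$DEVICE"
}

zero_log
```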
Second Solution: Run a BTRFS scrub. I started one last night, and after about four hours it had completed 502GB of about 19000GB. Estimated time is going to be about 6 days, except that it crashed after four hours. I suspect I can resume it.
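As a rough check on the six-day figure, the arithmetic can be sketched as below, assuming the scrub rate stays constant. The helper name scrub_eta_hours is made up for this example. The btrfs scrub commands for checking and resuming are left as comments because they need root and the real pool; /mnt2/main_pool is the mount point mentioned above.

```shell
#!/bin/sh
# scrub_eta_hours (hypothetical helper): total hours = total * elapsed / done,
# using integer arithmetic and assuming a constant scrub rate.
scrub_eta_hours() {
    done_gb=$1
    total_gb=$2
    elapsed_h=$3
    echo $(( total_gb * elapsed_h / done_gb ))
}

total_h=$(scrub_eta_hours 502 19000 4)
echo "estimated total: ${total_h} h, about $(( total_h / 24 )) days $(( total_h % 24 )) h"

# To see how far the interrupted scrub got, and to continue it:
#   btrfs scrub status /mnt2/main_pool
#   btrfs scrub resume /mnt2/main_pool
```

With the numbers from the post this comes to 151 hours, a little over six days, which matches the estimate.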
However, I suspect the parent transid errors are a side effect of the system crashing, not the cause. How can I determine what is causing the crash? I’m thinking that the first thing I should do is unplug all of my drives so that they aren’t mounted at startup. Alternatively, I can pull the USB drive and plug it into my laptop. Maybe the first step would be to copy logs and then start a USB scan?
Edit 1: I’ve pulled the USB flash drive from the system and copied all files on it to my laptop’s desktop. I’m also making a dd backup of the flash drive (dd if=/dev/sdc of=/home/Noggin/Desktop/rockstor_flash.img). I then plan to run badblocks on the USB drive and then restore the image.
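Since the image is the only copy once badblocks starts writing, it seems worth confirming that it matches the drive before going further. Below is a minimal sketch of the image-and-verify step. It uses a scratch file as a stand-in for /dev/sdc so it can be run safely anywhere; the file names are made up for the demo. The real-device commands are left as comments, and the post does not say which badblocks flags were used, so the -w line is a guess based on the four passes.

```shell
#!/bin/sh
workdir=$(mktemp -d)
src="$workdir/fake_flash.bin"        # stand-in for /dev/sdc (demo assumption)
img="$workdir/rockstor_flash.img"    # stand-in for the image on the desktop

# Create 1 MiB of random data to play the role of the flash drive.
dd if=/dev/urandom of="$src" bs=1024 count=1024 2>/dev/null

# Take the image, then compare it byte for byte with the source.
dd if="$src" of="$img" bs=65536 2>/dev/null
if cmp -s "$src" "$img"; then
    verify_result="match"
else
    verify_result="mismatch"
fi
echo "image verification: $verify_result"

rm -rf "$workdir"

# On the real drive the same steps would look roughly like:
#   sudo dd if=/dev/sdc of=/home/Noggin/Desktop/rockstor_flash.img bs=4M
#   sudo cmp /dev/sdc /home/Noggin/Desktop/rockstor_flash.img
#   sudo badblocks -wsv /dev/sdc    # destructive write test, erases the drive
#   sudo dd if=/home/Noggin/Desktop/rockstor_flash.img of=/dev/sdc bs=4M
```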
Edit 2: Badblocks has completed all four passes successfully, suggesting that the USB device is OK. While badblocks was running, I dug through the logs, though I’m not certain what I should look for. Running sudo find . -type f -exec cat {} + | grep -i panic only showed that a panic handler was registered, not that any panics were recorded. That command also seems to be skipping some files: I know that messages has transid errors in it, but the same find/cat/grep for transid doesn’t show those lines. I suspect the next best thing for me to do is to reinstall Rockstor. I may plug the USB back in (after restoring the image and pulling power from the hard drives) and see if I can get some screenshots of setups, device IDs, and some other information. Am I correct in assuming that CentOS is more stable than openSUSE?
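On the skipped lines: one plausible explanation, not confirmed here, is that piping every file through cat mixes binary data into the stream. Once GNU grep decides its input is binary, it can suppress matching lines and print only a "Binary file ... matches" notice. Searching each file separately and forcing text mode with -a avoids that. The sketch below reproduces the situation with two throwaway files; the file names and contents are made up for the demo.

```shell
#!/bin/sh
logdir=$(mktemp -d)

# One file containing NUL bytes, and one plain-text log with a transid line.
printf 'kernel: panic handler registered\n\000\001\002 binary junk\n' > "$logdir/a_binary.log"
printf 'BTRFS error (device sda): parent transid verify failed\n' > "$logdir/messages"

# -r: recurse, -a: treat binary files as text, -i: ignore case.
# Each file is searched on its own, so matches are prefixed with the file name.
hits=$(grep -rai 'transid' "$logdir")
echo "$hits"

rm -rf "$logdir"

# On the copied log tree, the equivalent would be something like:
#   sudo grep -rai transid .
# Rotated logs that are gzip-compressed need zgrep instead.
```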