Error mounting BTRFS on reinstall after crash

Went through the steps described here: https://en.opensuse.org/SDB:BTRFS (the ones below the warning). Got as far as zero-log (output below).
It evidently found some errors. When I tried to mount the pool after this step, the system froze with a wall of text on the monitor. After restarting, the Rockstor web interface won't come up (see output below).

[root@rockstorhome1 ~]# btrfs rescue zero-log /dev/sda
Clearing log on /dev/sda, previous log_root 0, level 0
parent transid verify failed on 143327232 wanted 259603 found 247297
parent transid verify failed on 143327232 wanted 259603 found 247297
parent transid verify failed on 143327232 wanted 259603 found 247297
Ignoring transid failure
[root@rockstorhome1 ~]# mount /dev/disk/by-label/BigHomeDisk /mnt2/BigHomeDisk
packet_write_wait: Connection to 10.0.0.8 port 22: Broken pipe
[root@rockstorhome1 ~]# systemctl status -l rockstor
● rockstor.service - RockStor startup script
   Loaded: loaded (/etc/systemd/system/rockstor.service; enabled; vendor preset: enabled)
   Active: active (running) since ma. 2019-07-15 21:45:40 CEST; 14s ago
 Main PID: 3387 (supervisord)
   CGroup: /system.slice/rockstor.service
           ├─3387 /usr/bin/python /opt/rockstor/bin/supervisord -c /opt/rockstor/etc/supervisord.conf
           ├─3391 /usr/bin/python /opt/rockstor/bin/gunicorn --bind=127.0.0.1:8000 --pid=/run/gunicorn.pid --workers=2 --log-file=/opt/rockstor/var/log/gunicorn.log --pythonpath=/opt/rockstor/src/rockstor --timeout=120 --graceful-timeout=120 wsgi:application
           ├─3392 /usr/bin/python /opt/rockstor/bin/data-collector
           ├─3393 /usr/bin/python2.7 /opt/rockstor/bin/django ztaskd --noreload --replayfailed -f /opt/rockstor/var/log/ztask.log
           ├─3422 /usr/bin/python /opt/rockstor/bin/gunicorn --bind=127.0.0.1:8000 --pid=/run/gunicorn.pid --workers=2 --log-file=/opt/rockstor/var/log/gunicorn.log --pythonpath=/opt/rockstor/src/rockstor --timeout=120 --graceful-timeout=120 wsgi:application
           └─3423 /usr/bin/python /opt/rockstor/bin/gunicorn --bind=127.0.0.1:8000 --pid=/run/gunicorn.pid --workers=2 --log-file=/opt/rockstor/var/log/gunicorn.log --pythonpath=/opt/rockstor/src/rockstor --timeout=120 --graceful-timeout=120 wsgi:application

juli 15 21:45:43 rockstorhome1 supervisord[3387]: 2019-07-15 21:45:43,636 INFO spawned: 'nginx' with pid 3455
juli 15 21:45:43 rockstorhome1 supervisord[3387]: 2019-07-15 21:45:43,649 INFO exited: nginx (exit status 1; not expected)
juli 15 21:45:44 rockstorhome1 supervisord[3387]: 2019-07-15 21:45:44,651 INFO success: data-collector entered RUNNING state, process has stayed up for > than 2 seconds (startsecs)
juli 15 21:45:44 rockstorhome1 supervisord[3387]: 2019-07-15 21:45:44,651 INFO success: ztask-daemon entered RUNNING state, process has stayed up for > than 2 seconds (startsecs)
juli 15 21:45:45 rockstorhome1 supervisord[3387]: 2019-07-15 21:45:45,655 INFO spawned: 'nginx' with pid 3456
juli 15 21:45:45 rockstorhome1 supervisord[3387]: 2019-07-15 21:45:45,667 INFO exited: nginx (exit status 1; not expected)
juli 15 21:45:47 rockstorhome1 supervisord[3387]: 2019-07-15 21:45:47,670 INFO success: gunicorn entered RUNNING state, process has stayed up for > than 5 seconds (startsecs)
juli 15 21:45:48 rockstorhome1 supervisord[3387]: 2019-07-15 21:45:48,674 INFO spawned: 'nginx' with pid 3458
juli 15 21:45:48 rockstorhome1 supervisord[3387]: 2019-07-15 21:45:48,686 INFO exited: nginx (exit status 1; not expected)
juli 15 21:45:49 rockstorhome1 supervisord[3387]: 2019-07-15 21:45:49,688 INFO gave up: nginx entered FATAL state, too many start retries too quickly
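For reference, the "parent transid verify failed ... wanted 259603 found 247297" lines mean the metadata generation found on disk ("found") lags the generation the tree expected ("wanted"). A minimal sketch, plain shell with my own variable names, that pulls the two generations out of such a line to see how far behind the on-disk metadata is:

```shell
# Sample line copied from the zero-log output above.
line="parent transid verify failed on 143327232 wanted 259603 found 247297"

# Extract the numbers following the "wanted" and "found" keywords.
wanted=$(echo "$line" | awk '{for(i=1;i<=NF;i++) if($i=="wanted") print $(i+1)}')
found=$(echo "$line" | awk '{for(i=1;i<=NF;i++) if($i=="found") print $(i+1)}')

# A large gap suggests many transactions never made it to disk.
echo "generation gap: $((wanted - found))"
```

Here the gap is over twelve thousand generations, which is consistent with more damage than a stale log alone.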

So, giving up for tonight; I'll try again tomorrow morning.
Will possibly try Tumbleweed tomorrow, but if I cannot get it up and running with that, I am afraid I am out of time and will reinstall and wipe, as I need to move on. Around 7 TB less data… :zipper_mouth_face:

@jopaulsen Re:

This could very well be the following issue:

which is easy to ‘sort’ via 3 commands I have copied into that issue.
Worth a try, as you could do with a break on this one, and those commands shouldn't break anything if it isn't that issue.

Hope that helps.

Yup, that was it, thanks. I actually believe I found it through searching the net at the same time as you replied here :smile: OK, have a good night for now. Will report back tomorrow with what I find (no matter how it goes).

I followed several suggestions, and ended up destroying the pool through the tests being run on it (specifically btrfs rescue chunk-recover /dev/sdg).
However, I also got some log entries and results/error messages from this and other commands that suggested possible faulty RAM or other components. The server would hang on commands that seemingly had no reason to make it hang.
So I ended up moving the array to different hardware altogether. It is now up and running on that hardware, and it seems to be happy. I have even started loading data onto it to test it.
I also talked with some neighbors, and it looks like there were some power issues the same night that my server went down. The rig sits on a UPS, but just maybe there is some correlation there.
Will be keeping a close eye on the logs for the disks, though, and run some SMART tests.
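The SMART side can be scripted. A minimal sketch (the device list is a placeholder for the actual pool members, and it skips gracefully when smartctl or a device is not present):

```shell
# Queue a long SMART self-test on each pool member. The device list
# below is a placeholder; adjust it to the actual disks in the pool.
# `smartctl -t long` only queues the test; the drive runs it internally
# in the background, so this returns immediately.
queued=0
for dev in /dev/sda /dev/sdb; do
    if command -v smartctl >/dev/null 2>&1 && [ -b "$dev" ]; then
        smartctl -t long "$dev" && queued=$((queued + 1))
    else
        echo "skipping $dev (smartctl or block device not available)"
    fi
done
echo "queued self-tests: $queued"
```

Results show up later (a long test can take hours) in the self-test log section of `smartctl -a /dev/sdX`.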

Oof, that’s unfortunate. At least you managed to find a likely culprit; it would have been even more annoying if you had rebuilt the array on the same hardware, only to get subtle weirdness later on, I suppose.

@jopaulsen Thanks for the update. Shame you had to destroy the pool in the end. But there does seem to have been some progress re:

With regard to this you could, out of curiosity, try testing the old machine's RAM via the instructions in our Pre-Install Best Practice (PBP) doc howto, i.e. specifically its Memory Test (memtest86+) subsection. This can take quite some time, and folks often leave it running for 24 hours or so to give it a fair chance to uncover issues. Obviously decent cooling is going to be preferred with such tests. Might help to settle your mind re the cause if the RAM comes up faulty.

With regard to a UPS's protection capabilities re spikes etc., this can depend on the nature of the specific UPS, i.e. Online (Double Conversion) / Line Interactive / Offline all give varying degrees of protection. The following is a simple guide to this:

https://www.riello-ups.co.uk/questions/2-what-are-the-differences-between-online-and-offline-ups

Hope that helps.
