[SOLVED] Unknown internal error doing a DELETE to /api/shares/5/snapshots/cloud-hourly__202001050100

[Please complete the below template with details of the problem reported on your Web-UI. Be as detailed as possible. Community members, including developers, shall try and help. Thanks for your time in reporting this issue! We recommend purchasing commercial support for expedited support directly from the developers.]

Brief description of the problem

Attempting to remove snapshots fails.

Detailed step by step instructions to reproduce the problem

Selecting snapshots in list, and selecting the Delete button.

Web-UI screenshot

(Web-UI screenshot attached, taken 2020-01-06 21:39:48)

Error Traceback provided on the Web-UI

N/A. There is just an empty field as seen in the screenshot.

Thanks in advance for any assistance.

Hi @legion,

I personally am not sure for the moment, but I wonder if the logs would give more information. If not done already, would it be possible to have a look at rockstor.log right after you try deleting these snapshots? It is located at /opt/rockstor/var/log/rockstor.log, or you can see its content in the System > logs manager menu.

This might bring more information and help somebody with a better idea of the cause of the problem.
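For example, a minimal sketch (assuming the default Rockstor log path; adjust if your install differs) for capturing the tail of the log right after reproducing the failed delete:

```shell
# Default Rockstor log location (an assumption; adjust if your install differs)
LOG=${LOG:-/opt/rockstor/var/log/rockstor.log}

if [ -f "$LOG" ]; then
    # Grab the most recent lines written right after retrying the snapshot delete
    result=$(tail -n 100 "$LOG")
else
    result="log not found at $LOG"
fi
echo "$result"
```

You could then SCP the captured lines off the box, or paste the relevant traceback here.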

On a side note, do you observe this error as well for any snapshot or one on a different pool / share?

Sorry for not being more helpful here, but hopefully it’ll give somebody else more information.

Hope this helps,


Hey Flox,

I appreciate the response, and apologize for the delay in replying… I wanted to make sure I wasn’t just being impatient, but the NAS is now freezing shortly after booting, which gives me around a minute or two over SSH before it hard locks and I end up having to cold boot the box (obviously CIFS access, etc. is moot and only hastens the seizing). So I’ve been racing for the log file via SSH. I’d imagine this joy is familiar to anyone reading this.

Mentally impaired n00b says: I think at this point that I have totally ID10Ted my Btrfs, and I’m trying to script out how to terminate any processes so I can safely retrieve data off of the volume. I would rather learn if/what I did wrong and correct the error, as I’m relatively sure it is a simple fix. They all are, though, once you have fixed them.

Again amigo, thank you for the prompt response and I apologize for not getting back to you sooner (as I was rudely being preoccupied imprinting my forehead with my keyboard’s keys). At this point if I can burn the hay stack down and pick up the needle… I’ll just toss down fresh hay, instead of trying to master magnetism.

My window for exercising patience, as dictated by port 22, is growing increasingly short. Sorry to meet the community with what is more than likely a day-one Btrfs mistake that anyone less cognitively impaired would have avoided.

There is no need for any apology… I personally always prefer a delayed but useful response over a rushed and potentially inaccurate one :wink: . Plus, you were still relatively quick in answering!

Boot issues aren’t my forte, but I hope we can make some progress into figuring that one out. As you say, it’s always better to learn what the problem is/was. To do so, my first thought is to learn more about your setup and configuration:

  1. Is it a new install or an already existing install that has been working for a while without issue(s)?
  2. If the latter, has there been any substantial change to the setup, either in terms of hardware or software?
  3. What is your install media? A simple HDD, an SSD, a USB drive? How about your pool(s) configuration (if it’s a previously-existing install)?

You mentioned you were troubleshooting through SSH, so it seems the OS is booting at least far enough to have SSH up and running. It may be informative to temporarily hook up a monitor, though, and check the following:
4. When does the freeze occur? Is Rockstor boot process completing (do you see the login prompt?) or does it freeze while loading?
5. Do you see any error or information on the screen while booting?
6. I know it’s unlikely given the rapid nature of this freeze, but were you able to get some output from the logs, or even dmesg?

If you are experiencing a complete freeze (command prompt unresponsive), I would tend to think of a hardware issue, which could incidentally explain why you got an error in the web-UI when trying to delete a snapshot. As a result, you could try (if you haven’t already) disconnecting everything but the system drive and seeing if it boots. If it doesn’t, then your data pool(s) is/are potentially fine. If it does, then it’s probably something related to the pools themselves, indeed.

Let’s first see whether your system disk boots up normally and doesn’t freeze after a few minutes, so that we can rule that out (or not).
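If you do get a brief SSH window before the freeze, a sketch along these lines (output path is illustrative) can dump the kernel ring buffer to a file you can SCP off in time:

```shell
# Dump recent kernel messages to /tmp so they survive long enough to copy off
OUT=${OUT:-/tmp/nas-kernel-debug.txt}
{
    echo "=== captured $(date) ==="
    # dmesg may need root; errors are suppressed so the capture still completes
    dmesg 2>/dev/null | tail -n 200
} > "$OUT"
echo "saved kernel messages to $OUT"
```

Any "blocked for more than N seconds" or I/O error lines in that capture would be very informative here.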

Hope this helps,


Thanks Flox,

In response to your questions:

  1. Is it a new install or an already existing install that has been working for a while without issue(s)?

This is a greenfield install, though I had played with numerous NAS distros and encountered no hardware errors/issues. The NAS comprises an i3 CPU, 8GB RAM, one 1TB NVMe drive for /, and six 1TB SSDs in a hardware RAID5 configuration.

  2. If the latter, has there been any substantial change to the setup, either in terms of hardware or software?

None that I’m aware of. My impression is that the issue is the Russian doll of snapshots that Btrfs has to traverse for any form of I/O, causing a queue-stacking scenario. Just a hunch, based on the snapshots being configured as if for an archival regimen. I think the snapshot config is doing exactly what I told it to, and this is the end result.

  3. What is your install media? A simple HDD, an SSD, a USB drive? How about your pool(s) configuration (if it’s a previously-existing install)?

Installation media was a USB flash drive written from ISOs using Etcher, Rufus, etc. I’m unsure as to the pool configuration. This is my first rodeo with Btrfs, though I’m assuming it is my bad.

  4. When does the freeze occur? Is Rockstor boot process completing (do you see the login prompt?) or does it freeze while loading?

It began with the NAS booting and then slowly locking up CIFS shares, at which point I would reboot from the web GUI. It has quickly evolved into an increasingly shorter window per boot before locking up not only CIFS, but now web access, SSH, etc. It gives the impression of a branching crawl through transactional snapshot iterations for any operation, which is causing the hang.

  5. Do you see any error or information on the screen while booting?

Nyet. You had it nailed on the SSH remoting. I’m assuming that if I hook up an actual KVM (no pun intended) to her, she would boot to a prompt as usual, since I do have access to terminal sessions. I have not hooked one up to visually verify, though.

  6. I know it’s unlikely given the rapid nature of this freeze, but were you able to get some output from the logs, or even dmesg?

I was able to SCP the log file successfully. If interested in reviewing it, the log can be found here:

(QR-code image linking to the log file)

Thank you again. I sincerely appreciate the patience and assistance.

Regards,

Legion

Thanks a lot for all the information…

It seems I was on the wrong path with a hardware issue and now I tend to favor an issue with snapshots, indeed, and thus agree with your impression.

We have seen some systems show slowdowns when using a high number of snapshots, especially when combined with quotas being enabled. Since quotas are enabled by default, unless you manually disabled them (which is possible from the web-UI), you likely have them on.
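To check whether quotas are actually enabled on a pool, a quick sketch (the pool path is illustrative; `btrfs qgroup show` only succeeds when quotas are on, and `btrfs quota disable <mount>` turns them off):

```shell
# Illustrative mount point; substitute your real pool path under /mnt2
MNT=${MNT:-/mnt2/your-pool}

# 'btrfs qgroup show' lists qgroups only when quotas are enabled on the pool
if btrfs qgroup show "$MNT" >/dev/null 2>&1; then
    quota_state="enabled"
else
    quota_state="disabled (or pool not mounted / btrfs unavailable)"
fi
echo "quotas on $MNT: $quota_state"
```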

Have a look at the post below:

In particular, I would try to run the command below to see how many snapshots you have:

btrfs subvolume list /mnt2/<your-pool-name-here>/ | wc -l

The Btrfs usage report is also always a good thing to check, even if it’s unlikely to reveal something wrong in your case:

btrfs fi usage /mnt2/<your-pool-name-here>
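The two checks above can also be combined into a loop over every pool mounted under /mnt2 (Rockstor's default mount root); this sketch degrades gracefully if a pool is absent:

```shell
# Count subvolumes on every pool mounted under /mnt2
report=""
for pool in /mnt2/*/; do
    [ -d "$pool" ] || continue   # glob matched nothing; skip
    count=$(btrfs subvolume list "$pool" 2>/dev/null | wc -l)
    report="$report$pool has $count subvolume(s)
"
done
[ -n "$report" ] || report="no pools found under /mnt2"
printf '%s\n' "$report"
```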

@phillxnet explained the situation in which you may find yourself (too many snapshots) very well in the post I linked above, so I recommend trying his recommendations therein. In particular, he explains how deleting a snapshot is a (relatively) very “intensive” task, which would explain the first web-UI error (in your first message) related to deleting snapshots: the system was probably taking a while to deal with the task, the API request timed out, and you received the error.
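As a hypothetical workaround for that timeout, the snapshots could be removed one at a time from the command line instead of as one big batch through the Web-UI. The path and naming pattern below are illustrative only (Rockstor typically keeps share snapshots under `/mnt2/<pool>/.snapshots/<share>/`; substitute your real names):

```shell
# Illustrative locations: substitute your real pool/share/snapshot names
SNAPDIR=${SNAPDIR:-/mnt2/your-pool/.snapshots/your-share}
deleted=0

for snap in "$SNAPDIR"/cloud-hourly_*; do
    [ -d "$snap" ] || continue            # pattern matched nothing; skip
    # Deleting one subvolume per call keeps each operation short,
    # avoiding the single long-running request that can time out in the API
    btrfs subvolume delete "$snap" && deleted=$((deleted + 1))
done
echo "deleted $deleted snapshot(s)"
```

Note that the freed space is only reclaimed asynchronously, so the disks may stay busy for a while after the deletes return.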

This is in @phillxnet’s area of expertise so he can probably correct me if I’m wrong there.

Hope this helps,

PS: nice system you have there, by the way!


Gracias Flox. Quotas were enabled on the boot pool (rockstor_rockstor), which I have now disabled. The main file-share pool (Bubbles-Cloud) already had quotas disabled, in anticipation that (given my [lack of] understanding of Btrfs snapshots) I would have numerous snapshots taking place on the shared pool in use.

I forgot to mention that, typically, when I’m using the shell or web GUI, I notice that as the slowdown/lock-up occurs the HD LED activity increases until solid for a few moments, then goes dark (a further data point). If I reboot the NAS without establishing a connection first, I’m simply unable to connect, though intermittent/sporadic HD LED activity still occurs (which is the state the NAS is in currently). :thinking: I’m reluctant to reboot to reconnect further until either directed to, or until my Pez dispenser full of Xanax runs out… whichever takes place first (and my monthly PillPack just arrived).

The statements I was able to execute:

btrfs subvolume list /mnt2/Bubbles-Cloud/ | wc -l
btrfs fi usage /mnt2/Bubbles-Cloud/

btrfs subvolume list /mnt2/rockstor_rockstor/ | wc -l
btrfs fi usage /mnt2/rockstor_rockstor/

The statements I was unable to execute:

btrfs subvolume list /mnt2/home/ | wc -l
btrfs fi usage /mnt2/home/

btrfs subvolume list /mnt2/Cloudy-Bubbles/ | wc -l
btrfs fi usage /mnt2/Cloudy-Bubbles/

The results of the cli commands are:

Boot Device:

[root@c1 ~]# btrfs subvolume list /mnt2/rockstor_rockstor/ | wc -l
3

[root@c1 ~]# btrfs fi usage /mnt2/rockstor_rockstor/
Overall:
Device size: 110.94GiB
Device allocated: 8.06GiB
Device unallocated: 102.88GiB
Device missing: 0.00B
Used: 2.50GiB
Free (estimated): 106.49GiB (min: 55.06GiB)
Data ratio: 1.00
Metadata ratio: 2.00
Global reserve: 16.00MiB (used: 0.00B)

Data,single: Size:6.00GiB, Used:2.38GiB
/dev/nvme0n1p3 6.00GiB

Metadata,DUP: Size:1.00GiB, Used:57.83MiB
/dev/nvme0n1p3 2.00GiB

System,DUP: Size:32.00MiB, Used:16.00KiB
/dev/nvme0n1p3 64.00MiB

Unallocated:
/dev/nvme0n1p3 102.88GiB

RAID5:

[root@c1 ~]# btrfs subvolume list /mnt2/Bubbles-Cloud/ | wc -l
67

[root@c1 /]# btrfs fi usage /mnt2/Bubbles-Cloud/
Overall:
Device size: 4.66TiB
Device allocated: 1.38TiB
Device unallocated: 3.27TiB
Device missing: 0.00B
Used: 1.29TiB
Free (estimated): 3.37TiB (min: 1.73TiB)
Data ratio: 1.00
Metadata ratio: 2.00
Global reserve: 512.00MiB (used: 1.39MiB)

Data,single: Size:1.38TiB, Used:1.28TiB
/dev/md127 1.38TiB

Metadata,DUP: Size:4.00GiB, Used:2.55GiB
/dev/md127 8.00GiB

System,DUP: Size:40.00MiB, Used:176.00KiB
/dev/md127 80.00MiB

Unallocated:
/dev/md127 3.27TiB

Thanks a lot for taking the time to run these commands…

That does not look excessive to me so I don’t think your issues result from a high number of snapshots.

On the other hand:

I’m curious as to why your disks show as mdraid devices as this is not my area of expertise… @phillxnet, would you have an idea as to why we see this?

Good morning/afternoon/evening Flox.

The ceased blinking of the HD LED this morning motivated me to see what we couldn’t do about further making it happen, captain.

After connecting a physical KVM, I was able to log into the terminal shell, execute the “btrfs” command at the prompt, and then let it sit. In approximately a minute, the following (what I think is) confirmation of the initial suspicion as to the underlying issue popped onto the screen.

(screenshot of console output showing a kernel-panic-like hung-task trace)

Should I attempt to boot off of a live distro, see if I’m able to import the Btrfs volumes, and migrate the data to another NAS (or external drive) temporarily? I would prefer to tell the kernel to calm its roll and see if we can’t work out this snapshot thing with some Python scripting skillz and love. I’m not the cat I used to be cognitively, so…

Would anyone have any thoughts on how I should proceed from here? If I didn’t have a few hundred GB of data on the NAS that is not archived elsewhere and is of import, it would be moot.

Update: I know the obvious answer is… “I don’t know dude, how about typing ‘echo 0 > /proc/sys/kernel/hung_task_timeout_secs’ into the console, rebooting, and seeing what happens?” I like the way you’re thinking, and I have. I got to the web GUI, which is progress, and am attempting to transfer my data off just for CYA.

The initial issue with snapshot removal still persists, but the light at the end of the tunnel may not be a train. Thanks again for all of the assistance so far.

Hi Flox,

Sorry for the lag in response. The md designation was assigned by the kernel for the RAID 5 array of six 1TB SSDs. I was able to gain access to the web GUI briefly after issuing the “echo 0 > /proc/sys/kernel/hung_task_timeout_secs” statement noted in my earlier update. The situation remains roughly the same, though, with the NAS experiencing a kernel panic shortly after boot as it exhausts RAM.

Last error message noted in the web GUI was:

Unknown internal error doing a DELETE to /api/shares/5/snapshots?id=410,408,407,405,403,402,401,400,399,398,397,395,394,393,392,391,390,389,388,387,386,384,383,382,381,379,378,377,376,372,359,348,344

I found an interesting article discussing the “task btrfs-transacti:3912 blocked for more than 120 seconds” error presented in the terminal screenshot at:

Again, all of the assistance is appreciated.

Huh, I’ve seen those messages before… I haven’t seen them in a while though, and I’m fairly confident the root cause in my case was different. I wish I could offer you something more useful…

I appreciate the look and link. It did help in providing some of the further details that Phil asks we supply when encountering an issue with Btrfs, obtained by issuing:

btrfs fi show

Label: ‘rockstor_rockstor’ uuid: 8e993499-bf63-4ea3-9bca-5ed111de947a
Total devices 1 FS bytes used 2.44GiB
devid 1 size 110.94GiB used 8.06GiB path /dev/nvme0n1p3

Label: ‘Cloudy-Bubble’ uuid: bbe227c4-150f-4ffb-8df3-3ba59a8e972a
Total devices 1 FS bytes used 1.28TiB
devid 1 size 4.66TiB used 1.38TiB path /dev/md127

Thank you again for the other set of eyes and neurons. :slightly_smiling_face:

I’m hoping that perhaps there may be a gem in need of being polished to be found here.

Please mark this incident as closed. Thanks to all for the help along the way.

Edit: My apologies, I neglected to mention the steps I took to resolve the issue.

  1. Booted into the terminal of a livecd.

  2. I executed the following commands:

cd /
umount /dev/md127 /mnt2/Cloudy-Bubbles/
umount /dev/md127 /mnt2/Bubbles-Cloud/
fuser -i /mnt2/Cloudy-Bubble/
fuser -i /mnt2/Bubbles-Cloud/
fuser -i /dev/md127
btrfs scrub start /mnt2
btrfs scrub status /mnt2
btrfs restore /dev/md127 /mnt2/

In generic equivalent syntax, the modifications would be:

cd /
# unmount the pool(s); umount accepts the device and/or the mount point
umount /dev/your-dev /mnt2/your-devs-pool/
umount /dev/your-dev /mnt2/your-other-devs-pool/
# interactively kill any processes still holding the mounts
# (-k is required for -i to take effect)
fuser -ki /mnt2/your-devs-pool/
fuser -ki /mnt2/your-other-devs-pool/
fuser -ki /dev/your-dev
# scrub the (still-mounted) filesystem and check its progress
btrfs scrub start /mnt2
btrfs scrub status /mnt2
# recover files from the unmounted device into the destination directory
btrfs restore /dev/your-dev /mnt2/

Obviously, these steps may take quite some time depending upon your configuration and the amount of data at rest on your NAS. I hope this is of help to anyone else in a similar situation, and thank you again to all for the help along the way.
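One alternative worth trying before reaching for `btrfs restore`: a read-only mount will sometimes let you copy data off an ailing filesystem directly, since it avoids new writes and transactions. A sketch (device and mount point are illustrative):

```shell
# Illustrative device and mount point; substitute your real ones
DEV=${DEV:-/dev/your-dev}
TARGET=${TARGET:-/tmp/recovery-mnt}

mkdir -p "$TARGET"
# A read-only mount avoids new writes/transactions on a struggling filesystem;
# if it fails, fall back to btrfs restore as in the steps above
if mount -o ro "$DEV" "$TARGET" 2>/dev/null; then
    status="mounted $DEV read-only at $TARGET"
else
    status="ro mount failed; fall back to btrfs restore"
fi
echo "$status"
```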


I’m really glad you could get yourself sorted out, and thanks a lot for taking the time to share your resolution steps with the community, that is greatly appreciated!

I’m sorry again I wasn’t able to help you further in your issue.

Cheers,