Rockstor crashes when writing to samba share

geert · July 18, 2021, 10:02pm

I am using Rockstor 3.9.2-57 on a virtual machine. I was able to copy a large amount of data without any problems when I first defined my samba shares. Since then (a couple of months) I have only needed read access to the files on the samba shares, and that worked without any problems.
Now, several months later, I want to copy some new folders to my samba share, and I consistently get the same result:

Copying starts, a couple of files are copied without any problem, and then copying suddenly stops.
At that moment, the virtual machine manager shows the Rockstor VM as “Paused”. It stays like that indefinitely, and the only way to get it up again is to shut down and reboot the Rockstor VM.
After restarting and checking the samba share, some files have been copied correctly, while others exist by name only, with a file size of “0”.

This happens consistently, no matter whether I copy from my Linux desktop (which is on a different physical machine from the Rockstor VM) or from my Windows 10 laptop.

The Rockstor VM is running Linux kernel version 4.12.4-1. There is an older kernel version on the VM; I tried rebooting with it but got the same result.

I checked the log.smbd and log.nmbd log files. The only error I can see in log.smbd is “could not find child xxx”.

I am not 100% sure, but I think that the first bulk copy of files to the samba share was done with a previous version of Rockstor. After that, I switched on automatic updates, and it updated to 3.9.2-57. I did not copy any files to the samba share between then and now, which is when this problem occurred.

Any suggestions are welcome…

phillxnet · July 19, 2021, 9:56am

@geert Welcome to the Rockstor community.

This sounds like the associated pool is going read-only, or slowing down so severely that it looks like a crash. For more info on the state of your system, and to explore the space usage, the output of the following commands may help:

btrfs fi show

and to see the usage of the relevant data pool:

btrfs fi usage /mnt2/<pool-name-here>

That may help folks chip in to see what’s going on.

Also, if you could paste log entries from journalctl from the time the failure happens, that should provide more info.
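
One way to narrow the journal down to storage-related entries is sketched below. The `storage_errors` helper name and its match patterns are illustrative assumptions, not anything Rockstor-specific; they only pick out kernel messages that commonly accompany btrfs or block-device trouble.

```shell
# Hypothetical helper: keep only journal lines that look storage-related.
# The pattern list is an assumption; extend it to suit your system.
storage_errors() {
    grep -Ei 'btrfs|i/o error|read-only|blocked for more than'
}

# Example usage (left as comments so that defining the helper has no side effects):
#   journalctl -k -b --no-pager | storage_errors      # kernel messages, current boot
#   journalctl -k -b -1 --no-pager | storage_errors   # previous boot, only if the journal is persistent
```

Since the VM has to be rebooted after each failure, the previous-boot form is the more relevant one, but it only returns anything if the journal is kept across reboots.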

Tricky, I know, but we need to establish whether this is a time-out, where your host is slowing down so much that the kernel potentially ends up in a block device reset loop of sorts, or whether the pool is poorly (pools go read-only to protect what remains). Poorly pools can often work OK initially, until they are called on to do stuff; then they find disparities and jump to read-only.
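
Whether a pool has flipped to read-only can be checked from the mount table. The sketch below assumes the pool is mounted under /mnt2 as in the command above; `mount_mode` is a hypothetical helper name, and its optional second argument exists only so the function can be pointed at a saved copy of the mount table.

```shell
# Hypothetical helper: print "ro", "rw" or "not-mounted" for a mount point.
# $1 = mount point, $2 = mount table to read (defaults to /proc/mounts).
# Note: mount points containing spaces are not handled.
mount_mode() {
    awk -v mp="$1" '
        $2 == mp {
            # Field 4 holds the comma-separated mount options.
            n = split($4, opts, ",")
            for (i = 1; i <= n; i++)
                if (opts[i] == "ro" || opts[i] == "rw") mode = opts[i]
        }
        END { print (mode == "" ? "not-mounted" : mode) }
    ' "${2:-/proc/mounts}"
}

# Example usage (commented out):
#   mount_mode "/mnt2/<pool-name-here>"
```

If this prints ro straight after a failed copy, the pool has gone read-only; if it still prints rw, a time-out or stall is the more likely explanation.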

Another factor here is that you say the Rockstor instance is virtualised. You don’t mention whether the drive/drives backing the pool are passed through or hosted on, say, qcow2 or the like. This can severely impact performance, to the extent that a machine can become unworkable. Btrfs is best used directly with hardware, or at least without too much ‘fancy’ between it and the raw device. VMs can introduce their own layers to provide look-alike devices, which can in turn cause severe performance penalties.

On the performance side, you could try reducing the CPU and read/write penalties within the Rockstor instance by disabling quotas via the Rockstor Web-UI. That way you at least alleviate some of what has to be tracked, and potentially free the system up enough to remain accessible so you can see what’s actually happening. Quotas are pool-wide and can be disabled via the pools overview or details page. The effect of turning them off is immediate, but turning quotas on again can take quite a few minutes to settle in. So in this case, turn them off for now to reduce the CPU and read/write load on the pool, and see whether this is just a time-out where btrfs is getting stuck waiting for something that is taking an unexpectedly long time to resolve, such as a quota update.

In all cases you are likely to be better off with our Rockstor 4 offering, but let’s first get more info on what’s going on here. If it does come to making any changes, you are better off with the far newer kernel/btrfs in our Rockstor 4 variant. The raid level is also important: the btrfs parity raid levels of 5/6 are, by popular opinion, not yet ready for production, and they also have weaknesses in their repair capability, so best stick to 1/10 for now at least.

Hope that helps. For the time being, try disabling quotas directly after a reboot, while the system is back in its read/write (read: working) state. Hopefully that will free stuff up enough to allow you to paste the output of the above commands and to investigate further.

Also, to double check your Rockstor version, you could paste the output of the following:

yum info rockstor

Hope that helps, at least to nudge this along some.

geert · July 19, 2021, 1:37pm

Hi Philip,
Thanks for your detailed reply.
In the meantime, I have been digging a bit deeper into this problem of mine, and realised the following:

I have 3 samba shares: one on a pool that is on the same disk as the OS, one on another disk that is ext4 formatted, and one on a disk that is btrfs formatted. The one that gives me problems is the one on the btrfs formatted disk.
Answering your question about my VM: yes, both pools (on the ext4 disk and on the btrfs disk) are hosted on qcow2. If that impacts performance heavily, I may have to look into how to configure them to be passed through (I don’t have a lot of VM experience).
As for the raid configuration: the Rockstor VM is installed on a machine that has hardware raid (Raid1), so no raid is configured in Rockstor.
Turning off quotas did not help.
Checking the version with yum confirms that 3.9.2-57 rockstor stable is installed.
Now to the output of the 2 commands you suggested (the problematic share is the one on /dev/sdb):

[root@ProservRockstor]# btrfs fi show
Label: 'rockstor_rockstor'  uuid: e8d36071-8ccd-467e-baad-002886fb9971
    Total devices 1 FS bytes used 5.55GiB
    devid 1 size 17.51GiB used 8.33GiB path /dev/vda3

Label: 'RockStorPoolDisk'  uuid: 63ac0dd7-cf73-4288-80c1-6456242479f8
    Total devices 1 FS bytes used 67.31GiB
    devid 1 size 1.82TiB used 70.02GiB path /dev/sdb

Label: 'IHTSPool'  uuid: e8d15870-fffe-4acf-a3ff-4e51890f0a4b
    Total devices 1 FS bytes used 10.44GiB
    devid 1 size 1000.00GiB used 14.07GiB path /dev/sda

[root@ProservRockstor]# btrfs fi usage /mnt2/RockStorPoolDisk
Overall:
    Device size: 1.82TiB
    Device allocated: 70.02GiB
    Device unallocated: 1.75TiB
    Device missing: 0.00B
    Used: 67.41GiB
    Free (estimated): 1.75TiB (min: 895.78GiB)
    Data ratio: 1.00
    Metadata ratio: 2.00
    Global reserve: 75.47MiB (used: 0.00B)

Data,single: Size:68.01GiB, Used:67.21GiB
    /dev/sdb 68.01GiB

Metadata,DUP: Size:1.00GiB, Used:102.84MiB
    /dev/sdb 2.00GiB

System,DUP: Size:8.00MiB, Used:16.00KiB
    /dev/sdb 16.00MiB

Unallocated:
    /dev/sdb 1.75TiB
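
As a rough cross-check of the figures above, the share of /dev/sdb that btrfs has allocated can be worked out from the "Device allocated" and "Device size" lines. This is a small sketch with a hypothetical helper name (`allocated_percent`); it takes the two values as printed, in GiB and TiB respectively.

```shell
# Hypothetical helper: percentage of a device that btrfs has allocated to chunks.
# $1 = "Device allocated" in GiB, $2 = "Device size" in TiB (1 TiB = 1024 GiB).
allocated_percent() {
    awk -v alloc_gib="$1" -v size_tib="$2" \
        'BEGIN { printf "%.1f\n", 100 * alloc_gib / (size_tib * 1024) }'
}

# Values taken from the btrfs fi usage output above.
allocated_percent 70.02 1.82
```

With these values the result is a little under 4%, which is consistent with the 1.75TiB reported as unallocated and suggests the pool is not running short of space.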