Rockstor in VM on Hardware Raid

Hi all,

this is my first post, but I have been reading for a while, so big thanks for your great work and help.

Short:
Does it make sense (or, in other words: is there anything that speaks against it) to give Rockstor a virtual drive, where the host handles RAID and Rockstor basically only handles shares and access control?

Long:
I am planning on setting up a Rockstor instance on a VMware or Proxmox server whose primary purpose is VM hosting. Because I do not need advanced NAS functionality, I would like to retire the old dedicated NAS box and set up a VM for that. That way I clean up the rack and networking and reduce power consumption.
Occasionally I have to move the VMs (the NAS VM too) between different hosts, so I think I have to avoid anything that is related to forwarding/passing through hardware (HDDs, HBAs, …).
As far as I understand

  • Rockstor should be fine when the virtual disk is moved around on the host storage, as long as the identifier for that disk stays the same (see the sketch after this list)
  • the checksums and the error correction of BTRFS should work even if Rockstor is unaware of SMART and physical disk details (that seems not to be the case for XFS for example)
  • that way I can grow the disks for Rockstor if required
  • performance would probably be better with direct disk/controller access (or on bare metal), but performance for file sharing is not that critical
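As a quick sanity check for the first point, this is the sort of thing I would run inside the guest to see what serials the virtual disks actually expose (a sketch only; it just assumes a Linux guest with lsblk available, and whatever device names the hypervisor presents):

```python
#!/usr/bin/env python3
"""List whole-disk devices and the serials they expose to the guest.

Rockstor needs each disk to present a stable, unique serial, so a missing
serial here would be a red flag for this kind of virtualised setup.
"""
import json
import subprocess


def list_disk_serials():
    # -J: JSON output, -d: whole disks only (no partitions)
    out = subprocess.run(
        ["lsblk", "-J", "-d", "-o", "NAME,SERIAL,MODEL,SIZE"],
        capture_output=True, text=True, check=True,
    ).stdout
    return {d["name"]: d.get("serial") for d in json.loads(out)["blockdevices"]}


if __name__ == "__main__":
    serials = list_disk_serials()
    for name, serial in serials.items():
        print(f"/dev/{name}: serial={serial!r}")
    missing = [name for name, serial in serials.items() if not serial]
    if missing:
        print("WARNING: no serial reported for:", ", ".join(missing))
```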

I already spent a lot of time on this (and other) websites and am not sure if I missed something obvious, because this case is never really discussed…

Thanks a lot and happy to elaborate,
Ben

@b_b welcome to the Rockstor community.

A couple of questions for my own clarity:

  • the virtual drive (possibly comprising multiple hard disks) and any respective RAID set-up would be managed by the host and not Rockstor, correct?
  • You would allow Rockstor to format the virtual drive and then create shares, Samba exports, etc.?
  • Rockstor would hence also manage access control (accounts, AD, whatever you're planning to implement there)?

I think, and @phillxnet has posted this a couple of times in other contexts, that one of the reasons to choose btrfs for Rockstor was to have it manage the devices, RAID configuration, etc., as that's the only way to take advantage of its benefits. A virtual drive that already has its own RAID setup, unknown to btrfs and Rockstor, would make any failures hard to recover from.
Independent of what you hear on this forum, you probably have to do a bit of trial-and-error to validate whether your assumptions will work for this configuration.

Fundamentally, I think what you're trying to do is possible with the details you pointed out. But I am not sure whether you could manage it more easily by not relying on Rockstor and instead using a basic openSUSE VM (with a UI like GNOME or KDE) and running it that way.

But there are more experienced people on this forum with probably widely diverging opinions on this. Given the concept of Rockstor being an appliance, rather than an application, this type of setup is certainly not the "target" architecture.


Hi @Hooverdan,
thanks for your reply!

"Yes" to your three questions.

"I think, and @phillxnet has posted this a couple of times in other contexts, that one of the reasons to choose btrfs for Rockstor was to have it manage the devices, RAID configuration, etc., as that's the only way to take advantage of its benefits."

Then I did not understand BTRFS correctly (and I am far from a file system expert), because I thought the key concept is the copy-on-write with the b-tree and checksums (for metadata AND data?). And that should still be possible/available when Rockstor does not have information about the physical disks, or am I wrong here?
Sure: when it has direct access to the disks, it can "sense" hardware failures and rely on write confirmations, but on the other hand the RAID controller is battery-backed and monitors the disk health itself.
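(Side note: as far as I have read, btrfs checksums both data and metadata by default, unless data checksums are explicitly disabled with the nodatasum mount option. This little sketch is what I would use on a test device to confirm the checksum algorithm in use - /dev/vdb is just a made-up example device:)

```python
#!/usr/bin/env python3
"""Print the checksum fields from a btrfs superblock (needs root)."""
import subprocess
import sys


def csum_info(device):
    # dump-super prints the superblock fields, including csum_type/csum_size
    out = subprocess.run(
        ["btrfs", "inspect-internal", "dump-super", device],
        capture_output=True, text=True, check=True,
    ).stdout
    return [line.strip() for line in out.splitlines()
            if line.strip().startswith(("csum_type", "csum_size"))]


if __name__ == "__main__":
    device = sys.argv[1] if len(sys.argv) > 1 else "/dev/vdb"  # example device
    for line in csum_info(device):
        print(line)
```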

My thoughts that led to considering Rockstor (instead of a plain Linux VM) were that it has an integrated front end (users other than me will create and manage shares) and that all required features (checksumming filesystem, snapshots, mail notifications, graphical user interface, easy sharing and Active Directory support) are available and run out of the box**.
But to be honest, I haven't used a "plain vanilla" Linux for file sharing for a while. As far as I remember, this always involves manually editing smb.conf, and I am not aware of a good graphical frontend (editing the conf is fine for me, but difficult when someone else also has to edit shares). Do you know of a convenient way (maybe a tool with a GUI?) to set up Samba shares in, for example, openSUSE?
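For context, what I would otherwise end up scripting by hand looks roughly like this - a sketch only, with a made-up share name, path and group, and assuming openSUSE's Samba service unit is called smb:

```python
#!/usr/bin/env python3
"""Append a simple Samba share definition, validate the config, reload Samba.

Sketch only: the share name, path and group are invented, and the service
name "smb" matches openSUSE; other distributions may use "smbd".
"""
import subprocess

SHARE_BLOCK = """
[projects]
   path = /srv/shares/projects
   read only = no
   valid users = @projects-users
"""


def add_share(conf="/etc/samba/smb.conf"):
    with open(conf, "a") as f:
        f.write(SHARE_BLOCK)
    # testparm checks the configuration for syntax errors before we reload
    subprocess.run(["testparm", "-s", conf], check=True)
    subprocess.run(["systemctl", "reload", "smb"], check=True)


if __name__ == "__main__":
    add_share()
```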

Thanks again for your input!

Ben

** I am still struggling with some shares and AD groups, but I will open a separate topic if required.

@b_b Hello there. Just wanted to chip in a little here. First - what @Hooverdan said. Plus:
Re: "Then I did not understand BTRFS correctly ..." - not necessarily.

That is correct, so you are not wrong (entirely).

However:

Yes, there are then fewer layers between btrfs (kernel) and the hardware: direct disk via kernel.
But:

First part (battery backed): as is the entire system (preferred) when on a UPS - essential for serious storage.
Second part (monitors the disk health): it's not so much the disk health, more that hardware raid does not know the integrity of the data (except in extremely high-end hardware raid). As such it does not know which copy of a data chunk is good: but btrfs does. Keep in mind that all-out drive failure is far, far less common than bad sectors or corrupt or failed returns.

So in short: putting something less intelligent underneath btrfs lowers data integrity. Hardware raid can end up switching out a drive that has a partial failure and leaving the higher-level btrfs with only a corrupt version of the data left. Hardware raid works at the drive level: btrfs-raid works at the chunk level. If a drive fails to return a good btrfs data/metadata chunk (surface/head/alignment/cell damage or whatever), btrfs knows it's bad thanks to your highlighted checksumming of all things and copy-on-write. Assuming:

  1. a multi-device pool &
  2. a redundant btrfs-raid profile (i.e. raid1 or raid1c3), it can then check and verify the 'same' chunk from another disk - if good, it can return it to the calling application and re-assert it on the previously failed drive (if that drive can take the new write). Said failed drive, if of reputable design (and if we are also lucky), has now marked its failed sector as bad, but btrfs has ensured that its 'part' of the pool is now good again - along with a shiny new checksum for the very next time that bit of data is accessed from that drive. It often does round-robin on requests, so this could have been an intermittent read error, but if it's bad enough the filesystem can self-heal to an extent, as it has multiple sources of truth and redundancy (assuming 2.) - see the scrub sketch below.
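To make that self-heal concrete: a scrub reads all data and metadata and repairs any chunk that fails its checksum from a good copy, given 1. and 2. above. A rough sketch, where /mnt2/mypool is just a made-up example of a Rockstor-style pool mount point:

```python
#!/usr/bin/env python3
"""Run a blocking scrub on a pool and report per-device error counters.

Sketch only: /mnt2/mypool is a hypothetical mount point; Rockstor schedules
scrubs via its Web-UI.
"""
import subprocess

POOL = "/mnt2/mypool"


def scrub_and_report(pool=POOL):
    # -B blocks until the scrub finishes and prints a summary
    subprocess.run(["btrfs", "scrub", "start", "-B", pool], check=True)
    # per-device read/write/flush/corruption/generation error counters
    subprocess.run(["btrfs", "device", "stats", pool], check=True)


if __name__ == "__main__":
    scrub_and_report()
```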

If one puts hardware raid under btrfs, you compromise its ability to maintain integrity - even if all you want to know is which chunk (file set) fails a checksum, i.e. your data integrity. Granted, you are in a better position than without the CoW/checksum fs on the raid. But you are in a worse position than allowing only the kernel to deal directly with individual drives that have their remit (their own data), with btrfs multiplying this up to a redundant pool where it can know of partial in-flight issues reading from one disk: far more common than outright disk failure. It can then maintain fs/data consistency via the methods you mention and the redundancy afforded by having only drives under it: not some other layer second-guessing what the 'correct' piece of data is, or throwing out an entire drive (with only a partial failure in, say, one place) when it may hold the key to other partial failures on other drives. Again, partial failure is way more common than an entire drop-out. And I mention raid1c3 (raid1c4) as btrfs-raid's equivalent of high-availability options.

So the key point as I see it is: why use a non-fs-aware raid when the filesystem of our choice has one built in - especially given the above, where hardware raid weakens data integrity: it is more appropriate for availability than integrity, as it knows nothing of the latter from a filesystem point of view. An unintelligent HW raid 'repair' could end up trashing the fs on top. Also keep in mind btrfs's chunk-based raid versus the drive-based raid of hardware raid. Btrfs-raid, when given the chance, is also way more flexible (varying drive sizes etc.).

So ideally one passes 'real' drives through to the btrfs-raid setup: as we recommend with the no-raid-under-us approach. But as you indicate, and as pragmatism sometimes dictates, we may not always want, care, or be able to afford to have idealised systems. So another compromise is to map multiple virtual drives into the Rockstor setup - but they must have the required stable serials (assuming the VM can do this; virtio can). One still has the single point of failure of the raid card - which, as stated, is more creative than we often want it to be. But we can at least hope it will not make the same mistake on all virtual drives involved in the pool. This, after all, is akin to the single point of failure of multiple drives hanging off a single multi-port drive controller - but those are far less 'creative' than raid regarding data-switching - and the drives remain accessible 'individually'.
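For what it's worth, on a QEMU/KVM host one can usually pin the serial per virtual disk. The sketch below just assembles that sort of command line - the image paths, serials and memory size are made up, and other hypervisors (Proxmox, VMware, etc.) have their own equivalent per-disk settings:

```python
#!/usr/bin/env python3
"""Assemble a QEMU command line giving each virtio disk an explicit serial.

Sketch only: image paths and serial strings are invented; the point is that
the serial= property is what the guest (and hence Rockstor) will see.
"""
import shlex

DISKS = [
    ("/var/lib/libvirt/images/rockstor-data1.qcow2", "ROCKSTOR-DATA-0001"),
    ("/var/lib/libvirt/images/rockstor-data2.qcow2", "ROCKSTOR-DATA-0002"),
]


def qemu_args(disks=DISKS):
    args = ["qemu-system-x86_64", "-enable-kvm", "-m", "4096"]
    for i, (image, serial) in enumerate(disks):
        args += ["-drive", f"file={image},if=none,id=d{i},format=qcow2"]
        args += ["-device", f"virtio-blk-pci,drive=d{i},serial={serial}"]
    return args


if __name__ == "__main__":
    print(" ".join(shlex.quote(arg) for arg in qemu_args()))
```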

Likely a long-winded explanation, but more concisely: btrfs has filesystem-aware raid - no raid controller does. Why reduce data integrity by adding more layers, especially when those layers are altogether otherwise concerned: drive-based rather than data/fs/pool-integrity-based.

Hope that helps, and keep in mind that we only support a subset of btrfs's capabilities, but we continue to improve our options as we go along - e.g. the recent addition of zstd compression by forum member @StephenBrown2, selectable per whole pool or per share (btrfs subvol).
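(For the curious: outside the Web-UI, that per-share selection corresponds, roughly, to the plain btrfs compression property - a sketch, with a made-up share path:)

```python
#!/usr/bin/env python3
"""Set zstd compression on a single btrfs subvolume (share) and read it back.

Sketch only: the share path is invented; Rockstor's Web-UI exposes the
equivalent per-pool and per-share settings.
"""
import subprocess

SHARE = "/mnt2/myshare"  # hypothetical share mount point


def set_zstd(path=SHARE):
    # the 'compression' property applies to data written from now on
    subprocess.run(["btrfs", "property", "set", path, "compression", "zstd"],
                   check=True)
    subprocess.run(["btrfs", "property", "get", path, "compression"],
                   check=True)


if __name__ == "__main__":
    set_zstd()
```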


@phillxnet first of all, many thanks for your explanations - this helped a lot already!

I am not sure if I got everything and every detail, but I will try to wrap it up in my own words:

  1. BTRFS (and thus Rockstor) works without direct disk access (for example on virtual volumes or a [hw] RAID controller) and can, for example, utilize the built-in copy-on-write and checksums.

  2. On physical disks Rockstor can (at least on redundant disks) not only detect bitflips/block errors (at this point no matter the reason for the error), but also correct those errors using the redundant data from the other disks (if they match their checksum).

  3. On a physical RAID controller (or any non-transparent disk management) BTRFS can only detect that the data/block it received from the underlying layer (RAID/virtual volume/disk) is broken, but it cannot fix it, because it has no control over/access to the underlying disks.

  4. That should bring the safety/integrity of the data for BTRFS-on-HW-RAID to the same level that BTRFS has on a single disk.

  5. Wouldn't I gain the possibility to actually correct errors (not only detect them) when I forward two virtual disks to Rockstor and let Rockstor create a BTRFS-RAID1 inside its VM? (See the sketch after this list.)

  6. If number 5 is true, Rockstor is still missing the ability to "give feedback" to the drive when it detects any kind of error (bad blocks/sectors), so that the drive can disable the affected blocks/sectors.

  7. Best way to do BTRFS/Rockstor: bare metal without HW-RAID
    Second best: PCIE-passthrough of HBA to VM
    Third best: direct passthrough of whole disk to VM
    Fourth best: VM with multiple virtual disks to be managed by BTRFS inside the VM
    Fifth best: VM with only one virtual disk
    Sixth best: VM with non-BTRFS filesystem
    ā€¦
    For cases 1-3 the data integrity should be as good as it can get (as long as disk passthrough works as it should), but unfortunately all three cases prohibit migrating the system to a different host/machine with probably different hardware.
    Case 5 "only" gains copy-on-write and error detection (without the chance to correct errors) in comparison to case 6.
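To make point 5 concrete, what I have in mind inside the VM is roughly the following (device names and mount point are made up, and in practice Rockstor's Web-UI would create the pool rather than me doing it by hand):

```python
#!/usr/bin/env python3
"""Create a btrfs raid1 pool from two virtual disks inside the VM.

Sketch only: /dev/vdb and /dev/vdc are invented virtio devices and the
mount point is arbitrary; Rockstor would normally do this itself.
"""
import os
import subprocess

DEVICES = ["/dev/vdb", "/dev/vdc"]
MOUNTPOINT = "/mnt2/vmpool"


def make_raid1_pool(devices=DEVICES, mountpoint=MOUNTPOINT):
    # raid1 for data and metadata keeps two copies of every chunk, which is
    # what allows detection *and* repair of a bad copy
    subprocess.run(["mkfs.btrfs", "-f", "-d", "raid1", "-m", "raid1", *devices],
                   check=True)
    os.makedirs(mountpoint, exist_ok=True)
    subprocess.run(["mount", devices[0], mountpoint], check=True)


if __name__ == "__main__":
    make_raid1_pool()
```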

For more context:
We did use (and in most cases still use) dedicated QNAP NAS boxes (enterprise-grade rack-mount versions), but are not that happy with the experience: we had lots of problems with bitrot/broken files, the system updates were a gamble (for our boxes, about one out of five updates left us in a non-functional state and required extensive fixes in the terminal), and besides that they are loud and the cost-benefit ratio is quite bad (imho). So I want to retire them, and because there are only a couple of users, the performance of basic file sharing is not that crucial.
We already have stronger machines for VM hosting and there are plenty of resources for file sharing available, but those hosts are equipped with HW RAIDs and they will not get replaced in the near future.

On my personal NAS (Synology) I also have BTRFS as the file system. I think Synology uses BTRFS on top of mdadm - isn't this similar to using BTRFS on a hardware RAID?

Thanks and regards,

Ben


@b_b Hello again.
Re:
1, 2, 3 - all clear I think. Assuming the minimum system requirements are met - predominantly access to a static reliable serial for said devices.

That's a tricky one - as from the btrfs point of view it kind of does, and I think that was the meaning here. Hardware raid adds disk-fail redundancy (underneath - if in a redundant profile, of course) but it does it blind to the consequences. There is also the possibility of passing two raid-backed virtual drives through:

  • but you cover that in point 5. So agreed on that point. But we have the single point of failure that it will likely be the same raid controller - a multi-port HBA has this same single point of failure: however the HBA doesn't invisibly mix the drives!! So I would say passing two virtual drives is better, but it still suffers from having something that is unaware of the data integrity as a whole (hw or md/dm raid) undermining something that is aware (a checksumming CoW fs).

We agree, I think, on 5 being true - thereabouts. But the key element here (re 6.) is that drives are directly managed by the kernel - i.e. the same 'house/ball-park' as btrfs itself. The particular subsystem responsible for the type of block device has the call here. But it will be far more 'native' than what is essentially a dumb hw raid. HW raid just wants to present the OS with a (presumed, in this case, redundant-of-sorts) block device. But it will drop an entire drive to do this - and potentially good copies of other data in the 'deal'. That was all we had at one point: we now have software raid that can be cleverer at this, especially when that raid system is integrated into the fs.

We also have to throw flexibility into the mix here. We have thus far only considered redundancy: a system that continues to serve is of more use than one that must be replaced, or taken offline, to rebuild a newer, bigger pool. Btrfs can do this online - while serving. That is another element of data integrity - a system's ability to adapt to changing use. See the online growing/shrinking ReRaid capabilities, which are also a win (for some).
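To illustrate, growing a live pool is just an online device add followed by a balance (convert flags are only needed when changing the raid profile). A sketch, with a made-up device and mount point; Rockstor's Web-UI wraps the same kind of operations for its resize/ReRaid feature:

```python
#!/usr/bin/env python3
"""Grow a mounted btrfs pool online: add a device, then rebalance.

Sketch only: the device and mount point are invented; the pool stays
mounted and serving throughout.
"""
import subprocess

POOL = "/mnt2/mypool"    # hypothetical mount point
NEW_DEVICE = "/dev/vdd"  # hypothetical new virtual disk


def grow_pool(pool=POOL, new_device=NEW_DEVICE):
    subprocess.run(["btrfs", "device", "add", new_device, pool], check=True)
    # re-spread existing chunks across all devices; the convert options are
    # only required if the raid profile itself should change
    subprocess.run(["btrfs", "balance", "start",
                    "-dconvert=raid1", "-mconvert=raid1", pool], check=True)


if __name__ == "__main__":
    grow_pool()
```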

I quite like this breakdown - after some discussion we should consider patching it in somewhere in the docs, as this type of discussion comes up from time to time. I am no expert, but it's pretty much as I see things - given our minimum system requirements (specifically the 'drive'/device serial thing).

We don't actually cater to this. We are all-in on btrfs, as our entire design is built around its unique capabilities.

Why? Without hw raid and its often proprietary (or firmware-linked) drive format, one can move a set of drives from one machine to another. So I do not agree here, if we are talking about the summary presented in 7. Btrfs is an in-kernel filesystem: given an equal or later kernel, these exact same drives can be attached to equivalent ports on another machine - in any arrangement. It is very much not like hw raid in that respect.

Again: only sort of. If both virtual drives are from the same raid controller, they are not independent - a key design element that btrfs assumes with its chunk-based raid. Each chunk in, say, raid1 (2 copies of every data/metadata chunk) is placed on what are assumed to be independent devices. But if they are from the same raid controller they are not - they just look that way.

That is what makes this discussion interesting to a wider audience, I think. We have to be pragmatic with the resources we have. But I'm hoping this thread helps to clarify the associated risks, and to inform folks that Rockstor - btrfs really - can only do what its foundations allow, and can be completely hobbled if not instantiated correctly. Hence our advice as it stands. But that is all very well and ideal: much can be mitigated with other systems, say regular back-ups. And with btrfs come potentially more back-up options, snapshot capability, etc. So we definitely have to be pragmatic in real-world situations. But also realistic.

Say one has a VM instance with a pool backed by a single raid-backed virtual device that senses catastrophic corruption because 'whatever'. You lose all data access, as the pool as a whole may well be lost, with no repair scenario/options. But you were not returned corrupt data - potentially for years!! You now know this data store is defunct - that is the first step of data integrity - availability should not trump validity. I.e. 'Oh good, I got access to my data - it's all nonsense - but still'. So concerns of uptime etc. are also a thing with storage designs. Restores can take ages and cost continuity of access. But checksumming and CoW can really help with ensuring data remains data - not nonsense.

Availability is another thing: I think it is best rolled in with data integrity - hw raid can't do that in the same way, and likewise an independent sw raid. It has its own concerns: not the pool's concerns. Btrfs, when resourced for its volume management, brings flexibility and continuity. All is managed under the same roof, as it were. But yes, there is still the block layer of the kernel etc. (but that can be undermined by hw raid, is my thinking here).

Yes, a number of the major NAS providers have now moved over to btrfs. But I don't know of any that have also dropped the legacy mdraid (dedicated in-kernel software raid). They all have a very large investment in that - and it's good: more mature than the drive management in btrfs, but not integrated and not nearly as flexible. They have basically swapped their prior fs (ext2/3/4 I think it was) for btrfs to get snapshots, send/receive, CoW, etc.

Kind of, yes: but I would say that mdraid is far better than almost any hardware raid, even given it was carried predominantly by one person for quite some time. Plus you have that portability thing, where any mdraid drive can be taken to any equal or newer Linux kernel if the hardware has the same port - or can carry an adaptor. That is definitely not the case with hw raid. Plus all its management (outside the drive, of course) is in-kernel - a key in-house advantage. The kernel is then the one-stop shop for the entire storage subsystem - and all in software, down as far as is reasonable. So fixable as we go - not left with bugs that never get fixed because the seller is more interested in the next hw raid model and its maintenance - if there is any.

In time I think we will see the likes of Synology move from mdraid to the more flexible, forward-looking device management within btrfs itself. But they are a large ship, and it takes time to turn such things.

You may also, in time, be interested in our next 'layer': once we get our dependencies all up to scratch (current testing channel), I at least hope to begin laying the groundwork for GlusterFS - another whole layer of redundancy/availability. But steady now - lots to do as we stand.

Hope that helps. I am no expert, but this is how I see things, and I like to think Rockstor is fleshing out its capabilities as we go. I am entirely happy with our choice of technologies: we just need to get them all up to date and shiny before we branch out feature-wise.


I just wanted to explain why I rated cases 1-3 as impossible, or at least difficult, to migrate:
I should have added that this comes from my idea of a virtualised NAS system where I want to move the whole system to different computing/storage resources.
Using direct hardware passthrough (no matter whether disk or controller) adds dependencies on that exact hardware: I cannot move the VM to a different host that may have totally different hardware. When I want to move a VM with hardware passthrough, I have to physically move the disks or the controller.
If I want to upgrade the disks, for example, I have to attach the new disks to the VM and move the data to them from inside the VM, but that does not work (without relocating the physical disks) when the host is about to change.
With virtual disks I can use the VM management to copy/move them to a different location and run the VM there. All this is possible without the system inside the VM being aware of it.

But without the premise of a virtualised system you are right - it is probably easier to move/recover directly BTRFS-managed disks to/on another system than it is when they were attached to a RAID controller.

Ah - I am glad that I posted at least something useful :grin:

I know, but I am searching for a solution to my problem and not for an optimal BTRFS system. Don't get me wrong - Rockstor checks most boxes (that is why I am here), but in general it may be a better solution for some to use something else if they cannot meet the minimum requirements needed to get at least some benefit from using Rockstor/BTRFS.

For sure - big thanks for the time you took to create such extensive answers! :kissing_heart:

Once I sort out the Active Directory issue (from the other thread), I will give it a try on a development machine :+1:

Best,
Ben
