[HOW-TO] Rockstor (or any modern linux) inside of PROXMOX (or any QEMU + KVM) on any modern (less than 8 years old) machine with good performance

Tomasz_Kusmierz · December 10, 2016, 10:05pm

IMPORTANT !!!
Until further notice, there seems to be a problem (investigation in progress) with proxmox 4.4 where virtio_scsi (_pci / _single) will cause serious disk corruption !!! Use Proxmox 4.3, and test any newer version with your intended configuration on a test system for at least a week before pushing it to a production system !!!

Hi,

This howto is predominantly aimed at rockstor on proxmox, but as stated in the title it will work for any modern Linux environment (3.12 and above) on qemu + kvm (for which proxmox is just a very smart GUI management system).

CPU

It should come as no surprise that the CPU is very important for any software, but there seem to be some misconceptions based on old forum threads on proxmox / qemu. Namely, proxmox defaults new VM installations to “kvm64” as the processor. This cpu type (it’s just a type reported to the operating system) only exposes the functionality of a Pentium4 cpu with 64 bit capability (a 64 bit i686, if you wish) - the reasoning here is that nearly every cpu out there (99.9999%) will have at least those features in terms of SSE etc. There is also an old thread that discourages use of “host” in favour of kvm64 because kvm64 supposedly allows running code directly, untranslated, on the host machine.

Unfortunately, performance wise, that old post is just a load of rubbish and does not keep up with the reality of CPU progress. Code built for the old Pentium4 architecture is binary compatible with any modern CPU, but that architecture lacks almost every enhancement a modern CPU has.
Firstly, the obvious offenders: in-CPU hardware acceleration for encryption, checksum computing and very advanced matrix floating point operations. None of these require any code emulation from the VM environment, yet when the CPU type is set to Pentium 4 they are simply ignored by the guest machine and performed manually (in software, within the actual GUEST). That is VERY BAD, but we’re not done yet.
Secondly, Pentium4 was a CPU that did not live in the era of multi-core CPUs, hyper-threading and multi-everything … so

if you have an intel cpu with hyper threading, where you can run 2 threads per physical core - with kvm64 you can’t
if you have a multi-core CPU that can overclock itself to run fewer cores but up to 60% faster - this will not work, because the CPU requires code to understand fast core switching (the guest runs natively, but it thinks this feature does not exist because it’s a pentium4)
if you have a multi CPU system (or one of AMD’s weird modern solutions that require NUMA), NUMA will not work. NUMA lets you run a process on the CPU that is physically connected to the memory containing that process (cpu pinning). Without it your process could run on a different CPU, and the CPUs would have to talk to each other continuously to exchange that process’s memory (processes do not migrate by themselves from one memory bank to another, and even if the operating system could perform this migration, without NUMA it does not know which CPU is connected to which memory bank)
the list just goes on …

Thirdly, it seems to me that nowadays, even if you have “host” selected as your CPU, code translation is not performed, because we have features in the linux kernel that allow “jailing” of code to a confined address space, making it impossible to exit to the HOST machine.

Not bashing proxmox here, because they were VERY clear - if you want to migrate systems LIVE between nodes of your cluster use kvm64, otherwise use host. Shame that it’s buried deep within documentation that is hard to reach / read, and that all VMs are created with kvm64 by default.
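
A quick way to see what the guest actually lost is to look at the CPU flags it reports. Below is a minimal Python sketch, assuming a Linux guest where /proc/cpuinfo has a "flags" line; the list of flags to check is my own illustrative pick, not something from proxmox or qemu.

```python
# Flags worth checking; this list is an illustrative assumption, extend as needed.
INTERESTING_FLAGS = ("aes", "avx", "avx2", "sse4_2", "pclmulqdq")


def parse_cpu_flags(cpuinfo_text):
    """Return the set of CPU flags from the first 'flags' line of cpuinfo text."""
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "flags":
            return set(value.split())
    return set()


def missing_flags(cpuinfo_text, wanted=INTERESTING_FLAGS):
    """List the wanted flags that the (guest) CPU does not advertise."""
    present = parse_cpu_flags(cpuinfo_text)
    return [flag for flag in wanted if flag not in present]


def read_cpuinfo(path="/proc/cpuinfo"):
    """Read the kernel's CPU description (Linux only)."""
    with open(path) as handle:
        return handle.read()
```

Run missing_flags(read_cpuinfo()) inside the guest once with kvm64 and once with host, and compare the two results against the same call on the host itself.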

Memory

I can not express how important it is on a modern machine to enable NUMA. You don’t necessarily need a multi cpu setup to see (or justify) the benefits in the first place. NUMA is so much more than just cpu pinning. Today you can have controllers connected to only one core, with a cache coherency mechanism exchanging data between cores to emulate a unified memory model … with NUMA this problem goes away: your operating system can be aware of what is connected where and choose the best way all the time. If you have NUMA - USE IT !!
Note this:
"… Intel announced NUMA compatibility for its x86 and Itanium servers in late 2007 with its Nehalem and Tukwila CPUs …"
(https://en.wikipedia.org/wiki/Non-uniform_memory_access)
So anything past 2007 has it.

Again, not bashing proxmox, but to my mind they are too vague about the importance of it.
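
To check whether the guest really sees a NUMA topology, you can count the node directories that the Linux kernel exposes in sysfs. A small Python sketch, assuming the usual /sys/devices/system/node layout (node0, node1, …); the function names are made up for this example.

```python
import os
import re

# Directory names such as "node0", "node1" mark NUMA nodes in sysfs.
NODE_DIR_PATTERN = re.compile(r"^node(\d+)$")


def node_ids_from_names(names):
    """Pick the NUMA node IDs out of a list of sysfs directory names."""
    ids = []
    for name in names:
        match = NODE_DIR_PATTERN.match(name)
        if match:
            ids.append(int(match.group(1)))
    return sorted(ids)


def numa_node_ids(sysfs_node_dir="/sys/devices/system/node"):
    """Return the NUMA node IDs visible to this system, or [] if none are exposed."""
    try:
        return node_ids_from_names(os.listdir(sysfs_node_dir))
    except OSError:
        return []
```

If numa_node_ids() returns a single node inside the guest while the host reports several, the guest is not seeing the host topology.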

Storage

This part is what originally sent me on a reading session, because my performance was SO BAD !
Some history:
First there was a VM … so virtual it emulated everything … it even emulated the CPU, to the point of pretending to be a 386 from the 1980’s to be as compatible as possible, and it had a very badly written ATA controller emulator. This controller was so bad that it allowed the guest only a single operation at a time, and the guest had to wait for each command to be committed to disk before issuing another. If you have ever seen a sci-fi movie from the 80’s with a computer slowly printing text to the screen - it was running on that VM ATA controller.
The situation improved over time: emulation of the controller got better and more controllers were emulated, giving slightly better performance. ( 5% )

Mankind decided to move on, and even HDD controllers started to cross pollinate their features … (ATAPI comes to mind) … so there was some pressure, and SCSI emulation appeared in the VM world … it did provide some improvements but not much speed gain (it was worth a shot, since it solved a lot of the nastiness that came with emulating broken ATA controllers)

Mankind moved on even more … speeds on controllers increased … lack of any buffering in the VM was a pity … GOD came down and created virtio. At this point everybody came to the conclusion: "rather than faking stuff to the guest, emulating features in software, then translating it for the HOST operating system, then letting the HOST system perform the task - let’s simplify this !!!"
The result was a driver that did not hide from the GUEST system that it is being virtualized, and for the first time there was a driver that let an operating system talk to a purely virtual device. It exposed a very simple interface with a basic form of buffering that works with minimal overhead from the VM in terms of translation. Two main drivers were created: the virtio network interface and the virtio block device.
The virtio network interface performed flawlessly, because in the networking world, if you want to send a packet on a network interface you want to send it now and not play with it - simplicity paid off !
virtIO_blk did improve the situation, because finally the operating system did not have to jump through hoops to talk to a broken controller … that had all its brokenness emulated by the VM in software - now the driver was written to match how operating systems predominantly operate, and it did provide some buffering. Unfortunately it had some pitfalls (a ring buffer with a size of tens of messages vs hdd buffers with megabytes of space, not talking advanced protocols etc etc etc)

Mankind had mixed opinions about virtIO … so GOD, in revenge, sent two software hackers with phony accents to make mankind suffer … but they decided to fix the problem in god’s name and created VIRTIO_SCSI. Now people were happy that their guest could talk scsi, and stuff became very fast.

So we will skip the first two as those are ones to avoid.

virtio_blk
Don’t get me wrong, virtio did solve a lot of problems because we gave up on faking stuff … but while it worked just perfectly on ethernet interfaces, not so much on storage.
virtIO_blk essentially presents any storage to you as a massive contiguous memory region and gives you a 15 command ring buffer for writing data to this region and reading from it. This works absolutely perfectly if your virtio_blk is presenting the GUEST with a disk drive that lives on the host as a file inside a filesystem (the host fs and block device driver sort out all the barrier writing, scheduling, queueing etc etc) … if virtio_blk is using a full hard drive on the HOST system, things are not so good.

We have all these complex protocols to talk to storage because it’s not as simple as talking to RAM like devices (for which virtio_BLK is perfect). Because this driver hides the whole drive from the guest, the guest is not aware of any buffer on the hard drive, and it will wait for a lot of changes to hit the disk to eliminate the chance of corruption in case of power loss. On the other hand, if virtIO is using a file on the host, most changes are performed in RAM (the HOST filesystem will buffer changes in RAM for a bit and immediately issue “write OK” back to the guest), so this stuff will actually work very well there. To continue with the problems of talking directly to the disk: since the host operating system is not sure what the guest wanted to do, it will use crazy safe practices to make sure that data arrives in the desired sequence (no NCQ or TCQ), and since the guest is not aware of the host disk geometry it can not optimise its write pattern (this can hamper your writes and reads by 3x !!!). Since the guest does not “talk any protocol”, there is a need for a separate thread to pick data from this ring buffer -> figure out how to put it on storage -> translate it to a host storage command -> issue the command to the host … a bit much. The guest OS technically does not even know the true sector size, so it will chop all writes into the default 512 byte (old sector size) pieces … nowadays 4k is the norm, with crazy disks having megabytes per so called “field” (late 2016).
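
To put a number on the sector size mismatch, here is a small Python sketch that counts how many physical sectors a write only partially covers. It is a simplified model of my own (the function name and the 4096 byte default are assumptions for illustration): a partially covered physical sector is one the drive has to read, patch and write back instead of simply overwriting.

```python
def partially_written_sectors(offset, length, physical_sector=4096):
    """Count the physical sectors that a write touches without fully covering them.

    offset and length are in bytes. Only the first and the last sector of a
    contiguous write can be partial, so the result is 0, 1 or 2.
    """
    if length <= 0:
        return 0
    end = offset + length
    first = offset // physical_sector
    last = (end - 1) // physical_sector
    head_partial = offset % physical_sector != 0
    tail_partial = end % physical_sector != 0
    if first == last:
        # The whole write sits inside a single physical sector.
        return 1 if (head_partial or tail_partial) else 0
    return int(head_partial) + int(tail_partial)
```

A 512 byte write always lands inside one 4k sector and leaves it partial, while a 4k write that starts at a 4k boundary leaves none. A 4k write that starts 512 bytes off the boundary leaves two.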

virtio_scsi
On the other hand, virtio_scsi lets the guest operating system “talk” SCSI directly to the host hardware. The fun part is that on linux this does not mean that you need a SCSI or SAS (Serial Attached SCSI) disk / controller. Linux will emulate the SCSI protocol for you (libATA) in a very efficient manner. Although with a SATA disk there will be some overhead for changing SCSI commands into SATA commands, host linux will tell you EXACTLY which commands are available based on the command set of the SATA disk / controller - because of that there are no extremely complex operations like trying to change one sector size to another, recomputing CRC, sync commands etc … header and footer are replaced at minimum cost and stuff just flows to the hardware controller ! There is a myriad of other features that are nearly drop in replacements, like

NCQ: although NCQ allows only 32 messages to be queued and reordered by the drive while TCQ (the scsi standard) allows up to 128, the drive can report how many it is capable of - so libATA will just let you believe it’s SCSI TCQ with a depth of 32 messages.
drive buffer, where you can just dump your data through a UDMA transfer request and not be bothered by it
drive geometry accessible by the guest, so it can allocate data better
S.M.A.R.T … which, as a true irony, is an ATA standard that was cloned by SCSI and is byte for byte the same, so no translation there
power management
barrier writes
error management
discard / trim (very important if you have an SSD !!! )
Where virtio_scsi will fall short is if you are using it with a file as the backing store … it will work, but it’s not as good as virtio_blk (although for consistency I would let qemu use the default scsi controller - at least you will get some pretence of a buffer on the guest)
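
You can verify from inside the guest which of these properties actually made it through by reading the block queue attributes in sysfs. A minimal Python sketch, assuming a Linux guest and the standard /sys/block/<device>/queue files; the helper names are hypothetical.

```python
import os

# Standard Linux block queue attributes; each file holds one integer.
QUEUE_ATTRIBUTES = (
    "logical_block_size",
    "physical_block_size",
    "discard_max_bytes",
    "rotational",
)


def read_queue_attributes(device, sysfs_root="/sys/block"):
    """Read the queue attributes of a block device, using None where unreadable."""
    values = {}
    for name in QUEUE_ATTRIBUTES:
        path = os.path.join(sysfs_root, device, "queue", name)
        try:
            with open(path) as handle:
                values[name] = int(handle.read().strip())
        except (OSError, ValueError):
            values[name] = None
    return values


def supports_discard(attributes):
    """A device accepts discard / trim when discard_max_bytes is non-zero."""
    return bool(attributes.get("discard_max_bytes"))
```

For example, supports_discard(read_queue_attributes("sda")) tells you whether trim reaches the guest's view of the disk, where "sda" stands for whatever name your disk has in the guest.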

Other settings:

aio=[threads / native] (default is threads)
default is threads and it works a charm for virtio_blk and a file in a file system - you NEED a thread working independently to talk to the FS … the native option will cause corruption ! If you use virtio_scsi you WANT your guest to talk directly to the controller (or semi directly through libATA - which is VERY EFFICIENT !) -> use aio=native

scsihw: [virtio_scsi_pci / virtio_scsi_single]
you need to specifically state that you want virtio to work as the scsi controller, otherwise qemu will fall back to crummy old scsi emulation. Documentation of those two is VAGUE, so I’ll just use my interpretation mixed with my testing experience. It’s worth mentioning that every guest OS in qemu is presented with a PCI bus that has 16 destinations; your ethernet occupies one (1) and so does each hard drive controller - so you have to think about how you’re going to set it up so as not to run out !

virtio_scsi_pci - allows 256 hard drives per controller
virtio_scsi_single - allows ONE hard drive per controller.
I’m personally working with the rule of thumb that the fewer devices connected to a controller the better performance wise, but if you have a large array of disks this may become academic for you. Also I’ve noticed that while on a SAS controller this makes no difference, on a pure SATA controller _single seems to work with the fewest glitches.

iothread=[0 / 1]
this is a simple ON / OFF switch: if enabled (1), it forces each scsi / blk target to receive an EXCLUSIVE thread sorting out its dirty work. If you’ve read even half of this write up you will notice that with virtio_scsi and aio=native this setting makes no sense, since with those settings it will not use any thread at all … but this setting has a dangerous fallout - it forces the controller to have only ONE target device, so your virtio_scsi_pci will not be able to host 256 devices - it will be reduced to 1 device per controller (remember that limit of 16 PCI devices per guest OS ? this is where it hits you on large SAS arrays, where your machine keeps crashing for no apparent reason even though you used virtio_scsi_pci)
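
The controller arithmetic is easy to get wrong on a large array, so here is a small Python sketch of it. All the numbers (256 drives per controller, 1 for _single, 16 PCI destinations per guest) are the figures from this post, not values taken from qemu documentation, and the function names are made up for illustration.

```python
import math

# Figures as stated in this write up; treat them as assumptions, not specification.
DISKS_PER_CONTROLLER = {"virtio_scsi_pci": 256, "virtio_scsi_single": 1}
PCI_SLOT_BUDGET = 16


def controllers_needed(disk_count, scsihw, dedicated_thread=False):
    """Number of SCSI controllers the guest ends up with for disk_count disks.

    With a dedicated thread per device, every controller is limited to one disk.
    """
    per_controller = 1 if dedicated_thread else DISKS_PER_CONTROLLER[scsihw]
    return math.ceil(disk_count / per_controller)


def fits_pci_budget(disk_count, scsihw, dedicated_thread=False, other_devices=1):
    """Check controllers plus other PCI devices (e.g. one NIC) against the budget."""
    needed = controllers_needed(disk_count, scsihw, dedicated_thread)
    return needed + other_devices <= PCI_SLOT_BUDGET
```

With 24 disks on virtio_scsi_pci you need one controller and everything fits. Switch the per device thread on and the same guest needs 24 controllers, which blows the budget.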
.
.
.

So if you were not bothered to read explanation:

if you DON’T CARE about live guest migration from machine to machine and CARE about performance, use "host" rather than "kvm64" as the cpu type
for the network driver ALWAYS use virtIO
if you use a file as the hard drive for your guest OS, use virtio_blk
if you use a physical hard drive as the hard drive for your guest OS, use virtio_scsi
if you use virtio_scsi for direct disk access, use aio=native
if you use virtio_scsi you need to state that your scsihw is of virtio_scsi type (read below)

Configuration caveat:
It’s counter intuitive what to use inside of your configuration for QEMU.

to use drives as virtio_blk
virtioX: /dev/disk/by-id/ata-…
where X is a number from 0 - 15

to use drives as virtio_scsi
scsihw: virtio-scsi-pci (or virtio-scsi-single; note that the config file spells these values with hyphens)
scsiX: /dev/disk/by-id/ata-…
where X is a number from 0 - 15 (technically it’s higher but documentation here is non existent so do your own research & testing, I’m not willing to throw you under the bus)
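
If you want to sanity check a VM config against the recommendations above, something like the Python sketch below can do it. It assumes the plain "key: value" layout of a proxmox VM config with disk options appended after commas; the checks themselves are only my reading of this write up, and the disk path in the usage example is a made up placeholder.

```python
import re


def parse_vm_config(text):
    """Parse 'key: value' lines into a dict, skipping blanks and comments."""
    config = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, separator, value = line.partition(":")
        if separator:
            config[key.strip()] = value.strip()
    return config


def check_recommendations(config):
    """Return a list of warnings where the config deviates from this how-to."""
    warnings = []
    # An absent cpu line is treated as kvm64, the default described in the post.
    if config.get("cpu", "kvm64").split(",")[0] != "host":
        warnings.append("cpu is not 'host'")
    scsi_disks = [key for key in config if re.fullmatch(r"scsi\d+", key)]
    if scsi_disks and "virtio" not in config.get("scsihw", ""):
        warnings.append("scsiX disks present but scsihw is not a virtio type")
    for key in scsi_disks:
        options = config[key].split(",")[1:]
        if "aio=native" not in options:
            warnings.append(f"{key} does not set aio=native")
    return warnings
```

Feed it the text of your VM config file; an empty list means none of the checks fired. For example, a config with "cpu: host", a virtio scsihw line and "scsi0: /dev/disk/by-id/ata-EXAMPLE_DISK,aio=native" (hypothetical disk id) passes cleanly.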

Conclusion:
I’ve migrated my work’s CCTV server that does continuous motion detection on all cameras. Results are that

with “dd” I can punch data down to storage at the nominal speed of all disks in the array (actually nearing the SAS controller limit) - while doing this and reaching 500MB/s I was getting ~4% io delay in the proxmox graph, and no real CPU usage.
CPU utilisation is exactly the same as it was, and this is a computationally heavy setup: pick video from camera -> decode h264 -> create frames -> compare frame data for motion detection -> compress frames with jpeg -> store frames to storage.
all disks in the array now act like pass through: although the name and serial number are faked by qemu, the model is the same, ALL S.M.A.R.T. data is available, and all model names, serial numbers and supported protocols are available and 100% accurate.
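
If you want a rough sequential write figure without dd, the Python sketch below does a similar job. It is not the test I ran, just a minimal stand-in: it writes zero filled blocks to a file, forces them to disk with fsync and reports MB/s. Zeros can be compressed or skipped by some storage, so treat the number as indicative only; the function name and defaults are assumptions.

```python
import os
import time


def measure_write_throughput(path, total_bytes, block_size=1024 * 1024):
    """Write total_bytes of zeros to path and return the throughput in MB/s."""
    block = b"\0" * block_size
    written = 0
    start = time.perf_counter()
    with open(path, "wb") as handle:
        while written < total_bytes:
            chunk = block[: min(block_size, total_bytes - written)]
            handle.write(chunk)
            written += len(chunk)
        # Make sure the data has left the page cache before stopping the clock.
        handle.flush()
        os.fsync(handle.fileno())
    elapsed = max(time.perf_counter() - start, 1e-9)
    return written / elapsed / 1e6
```

Point it at a file on the array under test and use a size well above the RAM of the guest, otherwise caching will flatter the result.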

(FYI, I’m very well prepared to have holes, errors, typos and outright lies pointed out here … we’re big boys now and don’t cry … that often)

Flyer · December 11, 2016, 1:17pm

Thanks @Tomasz_Kusmierz, being a Proxmox guy really appreciated it
M.

Tomasz_Kusmierz · December 21, 2016, 10:47am

A tiny update:
on rockstor 4.4 “virtio_scsi_pci” will produce a lot of errors with btrfs and continuously cause it to fall back into read only (on a sas controller, not sure about other ones)

Tomasz_Kusmierz · December 27, 2016, 10:39pm

IMPORTANT !!!
Until further notice, there seems to be a problem (investigation in progress) with proxmox 4.4 where virtio_scsi (_pci / _single) will cause serious disk corruption !!! Use Proxmox 4.3, and test any newer version with your intended configuration on a test system for at least a week before pushing it to a production system !!!