Bonding vs Teaming: Any advice, benchmarks, gotchas, replication performance?

f2fbf60b · October 27, 2016, 2:14pm

TLDR
I have two HP microservers: one runs Rockstor under ESXi with the drives passed through as RAW devices (RDMs); the other runs Rockstor natively. Both have 3 LAN-facing network adapters (NICs). Does anyone have any experience of bonding or teaming NICs, and what options are best to get the most speed out of replication?

Details
Originally both appliances were running under ESXi with each using just one NIC. When replicating shares between the boxes I was getting about 35-40MB/s. I tried using NIC-teaming in ESXi and the replication rate dropped to 20MB/s - possibly because my switch doesn’t support 802.3ad link aggregation.

I have now rebuilt the second appliance to run natively, removed NIC-teaming from the ESXi Rockstor appliance and given both access to three LAN facing NICs.

I have about 3TB of data to replicate across my gigabit network and I’d like it to run as quickly as possible. Plus I’d like to get teaming / bonding to work because why not.

The ESXi appliance has 4x3TB WD Red running in RAID10 plus an SSD for Rockstor (and the other ESXi VMs). The native appliance has 7 disks of various sizes totalling 3.5TB, set up as a “single” pool, and a USB stick for Rockstor. I would really like to max out the network for replication.

Does anyone have any advice on:

Whether to choose bonding or teaming?
Which options to choose within the above - loadbalance, round robin etc?
Which would work best (if at all) for three NICs?

Note, my switch does NOT support 802.3ad, but I would consider upgrading it if there was evidence to suggest it would give me a major speed bump under a particular setup. Both appliances are running 3.8-14.

Tomasz_Kusmierz · October 27, 2016, 5:27pm

Hi mate,

You have actually hit all the points that I was looking at while trying to set up server links with more punch.

teaming vs bonding
Those are technically two different drivers doing the same thing, just in a different fashion (kernel space vs user space):

As you see, there is not much difference between the two for YOUR use case (and mine, for that matter), so I would go for bonding (the older and more tested driver).

802.3ad
I would say this is ultimately the way to go … the weird configurations that you will see are really for cheapskates. You technically don’t need a switch that supports it: just plug a few cables directly between the machines, set the IP addresses manually (plus the bonding of the designated ports) and you’re golden.
Of course, if you really want to go down the switch route, I would suggest a used enterprise switch … $50 and you are sorted! You will also be able to connect an old AC wifi router to it and bond a few ports there to get more data over wifi (if a 1Gb ethernet cable was ever the limit there … but hey, if you like to tinker).

good old ethernet
I’m raising this one up since there is a massive elephant in the room:
If you add up the cost of
switch + a few ethernet cards + cabling
it may actually be more than
2 x “SFP+cards” + 1 x 10Gb link cable with SFP+ transceivers
With SFP+ you get a 10Gb link straight out of the box with no messing about … every switch that I’ve seen will give you a LAG (Link Aggregation Group) of 8 ports MAX! With 1Gb ports that’s 8Gb maximum (not counting the overhead for LACP).

PCI-E caveat (or plain lack of bandwidth)
Have you ever checked how the ethernet interface card connects to the motherboard? I nearly bought a 4-port PCIe card whose PCIe connector could only do something in the region of 2.6Gb … not really enough for 4Gb of traffic, right? Then I started digging more, and on some motherboards the links from PCIe cards are shared between ports … EVIL!! If you go even deeper, some servers have riser cards that split PCIe coming from a port that is multiplexed with another port … which is connected to a cheap south bridge that has zero bandwidth to spare and is also serving your HDDs (so data needs to go HDD -> south bridge -> north bridge -> RAM -> north bridge -> south bridge -> PCIe -> ethernet card).

Tomasz_Kusmierz · October 27, 2016, 5:58pm

For shiets and giggles I was trying to set up link aggregation on my spare server at work and guess what: when setting it up it works OK, but after a reboot with only the LAG link connected to the switch you can’t connect to it any more. Yes, you can ping it, but no services are accessible.

f2fbf60b · October 28, 2016, 12:33pm

@Tomasz_Kusmierz thanks for your response.

I set up both appliances with three NICs bonded using 802.3ad and tried a replication. Now, I had my second server upstairs in my lab [study] where all three NICs are connected to a switch which uses a single 1Gb cat6 cable to go down to the server room [cupboard] where the primary appliance sits. In this configuration I got 20MB/s replication speed.

So I moved the second appliance down to the server room and plugged it into the same 1Gb switch as the primary appliance. I tried again, replicating another share, and still got 20MB/s replication speed.

Do you think that a different config of the NICs might speed things up? Or is the bottleneck (as you discuss above) likely to be the NIC-PCIe interface?
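
One thing worth keeping in mind with 802.3ad: the Linux bonding driver picks the outgoing link for each packet from a hash, and with the default layer2 transmit hash policy that hash depends only on the two MAC addresses, so all traffic between one pair of hosts lands on the same link. The following is a minimal sketch of that selection, following the formula given in the kernel bonding documentation. The MAC addresses are made up for illustration, and the switch applies its own hashing in the other direction.

```python
def layer2_link(src_mac: str, dst_mac: str, link_count: int,
                packet_type: int = 0x0800) -> int:
    """Pick an outgoing link index as the layer2 hash policy is documented:
    last byte of source MAC XOR last byte of destination MAC XOR packet type,
    modulo the number of links in the bond."""
    src_last = int(src_mac.split(":")[-1], 16)
    dst_last = int(dst_mac.split(":")[-1], 16)
    return (src_last ^ dst_last ^ packet_type) % link_count


# Hypothetical MAC addresses for the two appliances.
PRIMARY = "00:11:22:33:44:10"
SECONDARY = "00:11:22:33:44:25"

# Every packet of the replication stream hashes to the same link,
# so a single host-to-host transfer cannot exceed one link's speed.
links_used = {layer2_link(PRIMARY, SECONDARY, 3) for _ in range(1000)}
print(len(links_used))  # prints 1
```

Under this policy, adding a third or fourth NIC helps when many different hosts talk to the box at once, but probably not for one replication job between two machines.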

I think both appliances have IO Crest Dual Port PCIe x1 network cards, so two of the three ports on each machine will share the bandwidth of the PCIe bus (PCIe 2.0 x16 = 8GB/s). This should be enough for two full speed 1Gb/s connections, although I need to check that they’re not plugged into the x1 slot, as they would then be sharing 500MB/s between them.
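
As a rough check of the PCIe figures, here is a small sketch that works out the usable bandwidth per direction. It assumes the card and slot both run at PCIe 2.0 speeds (5 GT/s per lane with 8b/10b encoding); a PCIe 1.x slot would give half of this, and real throughput is a little lower again because of protocol overhead.

```python
# Rough PCIe bandwidth check for a dual-port gigabit NIC.
# PCIe 2.0 signals at 5 GT/s per lane with 8b/10b encoding,
# so 8 of every 10 bits on the wire carry data.

PCIE2_GT_PER_S = 5.0          # gigatransfers per second, per lane
ENCODING_EFFICIENCY = 8 / 10  # 8b/10b line coding


def pcie2_gbit_per_s(lanes: int) -> float:
    """Usable bandwidth in Gb/s, per direction, before protocol overhead."""
    return PCIE2_GT_PER_S * ENCODING_EFFICIENCY * lanes


def pcie2_mbyte_per_s(lanes: int) -> float:
    """Same figure expressed in MB/s (1 byte = 8 bits)."""
    return pcie2_gbit_per_s(lanes) * 1000 / 8


if __name__ == "__main__":
    for lanes in (1, 16):
        print(f"x{lanes}: {pcie2_gbit_per_s(lanes):.0f} Gb/s "
              f"= {pcie2_mbyte_per_s(lanes):.0f} MB/s per direction")
    nic_demand_gbit = 2 * 1.0  # two gigabit ports at line rate
    print("x1 headroom:", pcie2_gbit_per_s(1) - nic_demand_gbit, "Gb/s")
```

On these assumptions even a single lane has room for two gigabit ports, so the NIC slot is unlikely to explain a 20MB/s transfer rate.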

On the second (native, non ESXi) appliance I also have a 4-port SATA card installed, so this will use PCIe bandwidth too. However, the second appliance’s disks are configured as “single”, and I think the disks get filled up one at a time so the disks connected to the SATA card haven’t been written to yet.

So far I have tested using 35GB and 75GB shares. I have another 80GB share and then three shares of 800-950GB. It would be nice if I could get a speed bump before I start moving large amounts of data around, otherwise each replication test will take 14+ hours to run.

Any thoughts?

Tomasz_Kusmierz · October 28, 2016, 1:24pm

Just a side note: your IO Crest card is in fact a PCIe x1 card already … not the best, but not the worst for just 2 x 1Gb links.
Now, it does not really matter how many links you plug into a switch if there is physically ONE cable going downstairs … THIS is your bottleneck (LAN-wise).

Now, to the main problem:
How exactly are you replicating your shares? To me it seems like you’re trying to push data through samba … and samba IS SLOW (unless configured properly). Please describe your process, because 20MB/s (bytes) gives a link bandwidth of 160Mb/s (bits) … so something is killing your performance, and my guess is a massive transmit overhead + heavy IO SEEK on read and write + delay on the link for confirmations (+ if you’re pushing it through ssh then your performance is toasted due to the extremely secure link encryption).
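
The bytes-versus-bits arithmetic above, and what it means for the large shares, can be sketched as follows. Decimal units are assumed throughout (1GB = 1000MB), and the 110MB/s figure in the loop is only an illustrative guess at what a healthy gigabit link might sustain, not a measurement from this thread.

```python
def mbytes_to_mbits(mbytes_per_s: float) -> float:
    """Convert a rate in MB/s (bytes) to Mb/s (bits)."""
    return mbytes_per_s * 8


def transfer_hours(size_gb: float, rate_mbytes_per_s: float) -> float:
    """Hours needed to move size_gb gigabytes at a steady rate in MB/s."""
    seconds = size_gb * 1000 / rate_mbytes_per_s
    return seconds / 3600


if __name__ == "__main__":
    print(mbytes_to_mbits(20), "Mb/s on the wire at 20 MB/s")
    # 950GB is the largest share mentioned; 110 MB/s is an assumed
    # near-line-rate figure for a single gigabit link.
    for rate in (20, 40, 110):
        print(f"950 GB at {rate} MB/s: {transfer_hours(950, rate):.1f} h")
```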

f2fbf60b · October 28, 2016, 1:53pm

@Tomasz_Kusmierz yes I knew the one cable would ultimately affect things, but I just wanted to make sure that replication worked with three bonded NICs sending to three bonded NICs - before I unplugged everything, hauled it all downstairs and replugged…

Re how the replication works / is configured, I’m not sure. I’m just using the Send / Receive function built into Rockstor, so my performance should be comparable to any vanilla option.

My primary (replication Send from) box is an HP Microserver running ESXi 6.0. The drives are connected as RAW mapped disks (i.e. not virtualised) in the hope that a) not virtualising the data would make them safer (see ZFS) and b) it would run faster. ESXi will add some overhead, but it’s enterprise level virtualisation, so I can’t see how it would slow things down this much.

Is there any way to see the timings of the data as they are replicated, so I can break out the info on transmit overhead, IO seek and confirmation delay? I’m not a networks guy, so I’ve no idea how to debug this kind of issue.
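
One way to separate the network from everything else is to time a plain TCP transfer between the two boxes, with no disks, btrfs or encryption involved. Below is a minimal sketch of such a probe using only the Python standard library. The function names are made up for this example, and a dedicated tool will give more reliable numbers, especially for short transfers where the sender can finish before the receiver has drained everything.

```python
import socket
import time

CHUNK = 64 * 1024  # bytes per send/recv call


def listen(port: int, host: str = "0.0.0.0") -> socket.socket:
    """Open a listening TCP socket on the receiving box."""
    return socket.create_server((host, port))


def drain(server: socket.socket) -> int:
    """Accept one connection, discard all data, return bytes received."""
    total = 0
    with server:
        conn, _addr = server.accept()
        with conn:
            while True:
                data = conn.recv(CHUNK)
                if not data:  # sender closed the connection
                    break
                total += len(data)
    return total


def send(host: str, port: int, total_bytes: int) -> float:
    """Send total_bytes of zeros and return the achieved rate in MB/s."""
    payload = bytes(CHUNK)
    sent = 0
    start = time.perf_counter()
    with socket.create_connection((host, port)) as conn:
        while sent < total_bytes:
            n = min(CHUNK, total_bytes - sent)
            conn.sendall(payload[:n])
            sent += n
    elapsed = time.perf_counter() - start
    return sent / elapsed / 1e6
```

On the receiving box you would call `drain(listen(5001))`, and on the sending box something like `send("receiver-address", 5001, 10**9)`; the port number and address are placeholders. If this reports a rate close to what gigabit can carry while replication stays at 20MB/s, the network is probably not the limiting factor.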

f2fbf60b · October 28, 2016, 2:12pm

This chap suggests I need a managed switch to allow me to create VLANs for each port for bonding (round robin) to work. I guess (and it is only a guess) that 802.3ad does all the hard work for you.

Has anyone ever plugged two boxes together directly, without going through the switch? With three NICs available I could have one going to the LAN and two plugged directly into the other appliance.

Tomasz_Kusmierz · October 28, 2016, 2:19pm

So you physically use “btrfs send / receive” …

where do I begin …

Let me start with: “It’s not your fault”

btrfs send / receive are meant for backup … to create a “perfect” copy of a FS somewhere else. It’s meant for incremental backups etc. Send / receive was given to people so they can start working with btrfs in more production systems … do some backups … etc … it works great (when it works) but it’s not yet very optimised. Also, btrfs send / receive pushes all the data as more or less text through an SSH channel, which results in very substantial overheads:

btrfs walking its tree a lot, a major hit (on any FS for that matter).
send converts the raw data to a format more pleasant for the ssh tunnel and applies some sort of flow control to it; otherwise, if you terminated the link, you would end up with a FS on the other side that is DEAD!!!
SSH takes all that crap and encrypts it, puts it into TCP and sends it on its merry way.

btrfs send / receive is for the time being a slow beast (some other FS are dramatically slower in that field, but let’s keep flame wars to a minimum), but there is one thing that makes it stand out from other sync methods: whatever you send will be 105% the same as what the receive side puts into storage …

For example, you can set up “Syncthing” and let it sort it out in the background … at least I do between different servers … but if you want to do a carbon copy backup, send / receive is your way forward.

I know that it did not push you into a better solution, but life is tough sometimes …

[edit]
FYI Syncthing is no speed demon either … they have a messy IP stack that sometimes makes stuff grind down to 1Mb speeds … but since it’s in the background “I don’t care”

f2fbf60b · October 28, 2016, 2:50pm

Ahah, well that makes me feel a bit better. I may have a play with the settings anyway, to see if they make any difference one way or the other.

But I suppose once the data are replicated across, the diffs will be relatively minor going forwards. Still, it would have been nice to find the silver bullet.

Thanks.

Tomasz_Kusmierz · October 28, 2016, 2:55pm

In that manner, once you do an initial clone with send / receive … everything else is just incremental and VERY VERY EFFICIENT!

Any other bit of software will have to traverse the whole FS, do checksums, check whether those match and then send the files that are different. BTRFS already knows which sectors are different due to the nature of COW … and sends those straight away!!!
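
For reference, the difference between the initial clone and an incremental run shows up in the command line as one extra option. The sketch below only builds the command strings for a manual send piped over ssh; it does not run anything, and it is not necessarily how Rockstor drives replication internally. The snapshot paths and the host name `backup` are hypothetical.

```python
import shlex
from typing import List, Optional


def send_command(snapshot: str, parent: Optional[str] = None) -> List[str]:
    """Build a btrfs send command; snapshots must be read-only.

    With a parent snapshot, only the changes since that parent are sent."""
    cmd = ["btrfs", "send"]
    if parent is not None:
        cmd += ["-p", parent]
    cmd.append(snapshot)
    return cmd


def receive_command(host: str, dest_dir: str) -> List[str]:
    """Build the ssh command that runs btrfs receive on the remote host."""
    return ["ssh", host, "btrfs receive " + shlex.quote(dest_dir)]


def pipeline(snapshot: str, host: str, dest_dir: str,
             parent: Optional[str] = None) -> str:
    """Return the full shell pipeline as a string, for inspection."""
    return (shlex.join(send_command(snapshot, parent))
            + " | " + shlex.join(receive_command(host, dest_dir)))


if __name__ == "__main__":
    # Initial full clone, then an incremental run against the first snapshot.
    print(pipeline("/mnt/pool/snap1", "backup", "/mnt/pool"))
    print(pipeline("/mnt/pool/snap2", "backup", "/mnt/pool",
                   parent="/mnt/pool/snap1"))
```

The parent snapshot has to exist on both sides for the incremental form to work, which is why the initial clone cannot be skipped.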

grizzly · October 31, 2016, 9:04am

In my experience btrfs send\receive is significantly faster than Rockstor replication, although I don’t have specific throughput figures.