InfiniBand 40Gb Network - Mellanox MLNX_OFED_LINUX [SOLVED]

re: your SanDisk PCIe SSD
Be careful when using ANY kind of consumer-grade SSD when working with large files, as you can (and probably will) burn through the drive's write endurance well before the warranty is up.

I know because I've burned through the write endurance limit of four Intel 540s Series 1 TB SATA 6 Gbps SSDs, as well as an Intel 750 Series 400 GB PCIe 3.0 x4 NVMe SSD.

If you're going to be doing a lot of heavy video editing, I'd recommend getting a lot of mechanically rotating drives and putting them into RAID0 (if you want absolute performance and zero redundancy), RAID5 (if you want to be at least one-drive fault tolerant), or RAID6 (if you want to be two-drive fault tolerant). A quick comparison follows below.
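
Just to put rough numbers on the trade-off, here's a quick back-of-the-envelope sketch (simplified: it ignores filesystem/metadata overhead and hot spares, and the 12x 4 TB figures are just an example):

```python
# Rough usable capacity vs. fault tolerance for RAID0/5/6.
# Simplified: ignores filesystem/metadata overhead and hot spares.

PARITY_DRIVES = {"RAID0": 0, "RAID5": 1, "RAID6": 2}

n_drives, drive_tb = 12, 4  # example: 12x 4 TB HDDs

for level, parity in PARITY_DRIVES.items():
    usable_tb = (n_drives - parity) * drive_tb
    print(f"{level}: {usable_tb} TB usable, survives {parity} drive failure(s)")
```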

That’ll depend on your budget.

Both of my current QNAP NAS servers are set up for RAID5, so if I migrate that back over into a single system, I would create two RAID5 arrays (which is apparently still experimental on Rockstor due to the state of btrfs's parity RAID).

Otherwise, you're looking at possibly $810 per 1.92 TB Micron SATA 6 Gbps or U.2 NVMe SSD, which has a write endurance of 3 drive writes per day (DWPD). (There are other drives that go as high as 11 DWPD.)
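
For context, a DWPD rating converts to total lifetime writes pretty easily; something like this (the 5-year warranty term is an assumption on my part, so check the actual spec sheet):

```python
# Convert a DWPD rating into total lifetime writes over the warranty period.
# Assumes a 5-year warranty term -- check the drive's actual spec sheet.
capacity_tb = 1.92   # the 1.92 TB Micron drive mentioned above
dwpd = 3             # rated drive writes per day
warranty_years = 5   # assumption

tbw = dwpd * capacity_tb * 365 * warranty_years
print(f"{tbw:.0f} TB written over {warranty_years} years")  # ~10512 TB (~10.5 PB)
```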

This is one of the reasons why I’ve gone back to using a lot of mechanically rotating disks because with conventional HDDs, I don’t have to worry about write endurance. And again, if you can’t really break through 10 Gbps, then as far as your PCIe SSD is concerned, it’s barely even lifting a finger to do the work that’s asked of it.

(Also, BTW, testing even NFSoRDMA transfer speeds is difficult because there’s a significant difference moving very large (1 TB+) continuous files vs. moving a million little, tiny files.)

But after I had burned through the write endurance on my Intel SSDs (which are no slouches), the last of which I burned through in about 1.6 years of continued use, I now do NOT recommend that people use consumer-grade SSDs for any of the heavy-duty lifting. The SSD remaps worn blocks silently in the background, so you don't even know that your media is wearing out until suddenly, one day, it finally can't do it anymore, and now it's only writing at 2 MB/s because the drive is functionally dead.

With HDDs, I don’t have to worry about that at all. I just need a fair number of them to catch up with the total interface/spindle speeds.


Re: the SanDisk PCIe SSD. It's an enterprise-grade SSD with 22 PBW of write endurance; at my current video usage, that's almost 3 years of video cache files, way more than what I'm looking for. I do have a whole lot of spinning drives, though: over 12x HGST 4 TB 7.2k enterprise drives (from Sun/Oracle) in RAID0 on a Dell R510 with an H700 RAID controller, 16 GB of RAM, running Rockstor, and another identical R510 with 12x HGST (Sun/Oracle) 8 TB 7.2k SAS enterprise drives in RAID10. These two servers are for performance testing. I don't think spinning drives will perform like any of the PCIe ioDrive accelerators, since the accelerators use a full x8 PCIe link, which is way more bandwidth than the x4 RAID controllers get, even with cache enabled.

This has been a pretty good journey. No, I'm not good at all with this Linux or Rockstor stuff, but I'd love to have the kind of solution the QNAP OS offers; at least the iSCSI would be pretty handy, plus reverse SSH, so I can send some of these servers I have to my coworkers but still be able to get in to configure the server or enable functionality without having to forward ports from their routers. Do you have experience with any of this? If you do, send me a private message. I'd also love to build a render farm, but I have no idea how to do it. And by the way, all my gear (servers, HDDs, SSDs, NICs, IB, and all that stuff, even licensed OSes) comes used from eBay. There are lots of things you can find on there.

re: SanDisk (apparently, they bought Fusion-io, and SanDisk, in turn, got bought out by WD).
So…it depends on how they calculate drive writes per day.

Some people assume only 8 hours of operation per day, whereas for me (and presumably for you, because the systems run 24/7), I tend to count 24 hours of operation per day.

Given that: 22,000 TB / 6.4 TB = 3,437.5 total drive writes of write endurance, divided by the warranty life (5 years × 365 days/year = 1,825 days), works out to about 1.88 DWPD.
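
Same math as a tiny helper, using the 22 PBW / 6.4 TB / 5-year numbers from above:

```python
# DWPD from total write endurance (in TB written), drive capacity, and warranty
# length, counting 24/7 operation (i.e. every calendar day).
def dwpd(endurance_tb, capacity_tb, warranty_years):
    total_drive_writes = endurance_tb / capacity_tb
    return total_drive_writes / (warranty_years * 365)

print(dwpd(22_000, 6.4, 5))  # ~1.88 DWPD, matching the arithmetic above
```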

It's definitely better than consumer Intel or Samsung SSDs, which range from 0.3 DWPD to 0.7 DWPD, but because I ran into this problem earlier this year, my eyes are on Micron drives. (There are some WD/HGST datacenter SSDs that go up to 11 DWPD.)

I haven’t spent a great deal of time testing U.2 PCIe/NVMe SSDs in RAID0 because that’s currently out of my reach in terms of budget/capital expenditures.

So I live with what I am able to get.

re: PCIe SSDs
In THEORY they should be able to have all of the PCIe 2.0 x8 lanes available to them, but it really depends on how you have it set up.

The SanDisk PCIe SSD apparently uses a PCIe 2.0 x8 link (40 Gbps raw).

Also, apparently, the Dell R510 and the Dell PERC H700 both top out at PCIe 2.0 x8, which means that unless the drives are SAS/SATA 3 Gbps, 12 SAS/SATA 4 TB HGST drives running at 6 Gbps each would demand more total theoretical bandwidth than the PCIe 2.0 x8 slot is able to provide.

For the same reason, the drives can actually outstrip the bandwidth available to my Broadcom/Avago/LSI MegaRAID 9341-8i (8-port SAS/SATA 12 Gbps) RAID card itself: a PCIe 3.0 x8 slot can only deliver 64 Gbps, but the card, if fully populated with SAS 12 Gbps drives, would demand a total of 96 Gbps. (But this isn't a problem for me because I'm still only using SATA 6 Gbps drives.)
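
If you rough out both of those cases using raw line rates (ignoring 8b/10b / 128b/130b encoding and SAS/SATA protocol overhead, so these ceilings are generous), it looks something like this:

```python
# Aggregate drive bandwidth vs. the RAID card's PCIe slot, using raw line rates.
# Ignores encoding (8b/10b, 128b/130b) and protocol overhead, so real numbers are lower.

cases = {
    # name: (number of drives, Gbps per drive, slot Gbps)
    "PERC H700, 12x 6 Gbps drives, PCIe 2.0 x8": (12, 6, 40),
    "9341-8i, 8x 12 Gbps drives, PCIe 3.0 x8":   (8, 12, 64),
}

for name, (n, gbps_per_drive, slot_gbps) in cases.items():
    demand = n * gbps_per_drive
    verdict = "slot-limited" if demand > slot_gbps else "fits"
    print(f"{name}: drives want {demand} Gbps vs. {slot_gbps} Gbps slot -> {verdict}")
```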

With four SATA 6 Gbps HDDs writing to four SATA 6 Gbps SSDs, I can max out at about 3 GB/s locally for a very short period of time before it quickly settles down to around 500 MB/s across the two RAID0 arrays. From my testing/benchmarking, the advertised speeds are often peak, buffered speeds. Unbuffered speeds can sometimes be even slower than what the advertised random I/O operations per second (IOPS) figure works out to; you can convert that to MB/s since most vendors publish the block size they ran the test with.
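
The IOPS-to-MB/s conversion is just IOPS × block size; for example (the 100,000 IOPS / 4 KiB figures below are made-up example numbers, not any particular drive's spec):

```python
# Convert an advertised random-IOPS figure to throughput: MB/s = IOPS * block size.
iops = 100_000        # hypothetical advertised random-read IOPS
block_bytes = 4096    # 4 KiB blocks, the size most spec sheets quote

mb_per_s = iops * block_bytes / 1_000_000
print(f"{mb_per_s:.0f} MB/s")  # 100k IOPS at 4 KiB ~= 410 MB/s
```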

In other words, it has been my experience that in actual, practical usage, the advertised numbers are almost meaningless. I've been moving around about 100 TB of data (as I mentioned, preparing the data to be written to tape), and with four nodes and a headnode on 100 Gbps IB, plus two NAS units (which max out at 10 Gbps SFP+ speeds), my headnode has been reading/writing data at around 200-ish MB/s when moving lots of tiny files. If I'm moving very large files (1 TB+), then I can write at up to 1800 MB/s, but only very briefly/momentarily. It's not stable. (I'm using iotop to measure this.)

The only way for me to test anything faster would be to use U.2 NVMe/PCIe 3.0 x4 drives, but there isn't a hardware RAID adapter for that. The Broadcom card that's available is just an HBA, not a RAID HBA.

The other consideration is how often you think you’re going to replace the hardware. With mechanically rotating disks, if I have a lot of them like you do, I don’t ever have to replace them - at least not due to write endurance. With SSDs, you WILL have to replace them eventually.

My point is that, like you, I've embarked on this journey as well, but I have to balance PCIe 3.0 lane bandwidth between the actual storage devices, the GPU (even if it's something very small, with very little in the way of functional features), and networking.

My Mellanox ConnectX-4 takes up a full PCIe 3.0 x16 slot. My current GTX Titan that’s in the headnode takes up another PCIe 3.0 x16 slot. And my MegaRAID 9341-8i takes up a PCIe 3.0 x8 slot. My Core i7-4930K can only supply 40 PCIe 3.0 lanes.
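
Adding those up shows why I'm lane bound (the x16/x16/x8 numbers are just the slot widths I listed above):

```python
# PCIe 3.0 lane budget on the Core i7-4930K headnode (40 CPU lanes total).
cards = {"ConnectX-4": 16, "GTX Titan": 16, "MegaRAID 9341-8i": 8}
used = sum(cards.values())
print(f"{used} of 40 lanes used, {40 - used} to spare")  # 40 of 40 -- no headroom
```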

I've been looking at moving to either AMD Ryzen Threadripper 3000-series or AMD EPYC, but I'm waiting to see if AMD is going to release a 64-core 3rd-gen Threadripper (and how many PCIe 4.0 lanes that's going to come with), because right now I am PCIe lane bound. And the bare system alone is going to be almost $5500 US, so…it can be quite expensive. (If I go with AMD EPYC, that can jump to $13000 US.)

re: iSCSI
Now that I have all of this IB stuff, if I were to deploy iSCSI, what I'd be looking at is iSER (iSCSI Extensions for RDMA). You might have a use for that as well.

I haven’t personally set up iSCSI on my QNAP NAS, but they have tutorials on how to do it, and they make it pretty easy now with the pretty GUI. Yes, I’ve done some stuff with port forwarding.

re: eBay
eBay is awesome for this.

re: render farm
Depends on what kind of a render farm you’re looking to deploy. If you’re talking about Adobe Premiere, unfortunately, I haven’t tried that. In theory, it might be doable, and I think that Adobe has a network rendering engine so that you can add render nodes to it via the network, but I don’t have any experience configuring that.

My work is centered around mechanical engineering/high performance computing/computer aided engineering/finite element analysis/computational fluid dynamics.

Someone gave me some scripts and instructions on how to set up Blender network rendering (for a Blender networked render farm), but I haven't tried to deploy that yet, as I'm getting ready to switch to a different Linux distro (CAELinux, built off of Xubuntu 16.04) for my mechanical engineering stuff. (I'm testing different distros to find the one that works with as many of my engineering applications as possible.)

But that would be on my to-do list once I can get the core engineering cluster back up and running with this other distro. It would be a nice "fringe" benefit to have a Blender networked render farm available, since I'd already have the rest of the hardware and infrastructure there to support it.