nanog mailing list archives

Re: Broadcom vs Mellanox based platforms


From: ff () ozog com
Date: Mon, 04 Jun 2018 10:33:32 +0200

Hi Kim,

I'll share the key lessons I have learned since I started working on high-speed software networking in 2006, when everyone was laughing at me because I claimed I could achieve 10Gbps networking with a CPU.


CPU is less important than memory/QPI
On x86, the memory subsystem includes things like Cache Boxes, the Home Agent, DRAM controllers... The Home Agent is responsible for knowing which CPU node holds a given cacheline, so it can become a centralized bottleneck. DRAM controllers have a queue of pending DRAM requests (instruction pipeline, data prefetch, data...). QPI routing may also severely impact performance: I remember using a 4-socket system that had half the performance of a 2-socket system because of either bad QPI routing programming by the BIOS or a hardware issue. An order of magnitude to keep in mind is that at 100Gbps, each 64-byte packet and each associated 64-byte metadata cacheline consumes roughly a full DRAM channel. As an example, and not counting the application data to be used (FIB, DNS database...), a 100Gbps DPDK bridging application requires 3 memory channels per port (to reach line rate, if the IO allows it)... There is a lot more to say but I'll let you do your own research ;-) BTW, why would you want to do 100Gbps line rate (or very close to it)? To ensure that each node has the capacity to resist a DDoS attack powered by DPDK/ODP/native "applications".
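To put rough numbers on the memory claim above, here is a back-of-the-envelope sketch. The 20 bytes of per-frame Ethernet overhead (preamble, start-of-frame delimiter, inter-frame gap) are standard; the DDR4-2400 channel peak is only an assumed comparison point, and the function names are made up for illustration.

```python
# Back-of-the-envelope: memory traffic generated by 64-byte packets at 100Gbps.
ETH_OVERHEAD_BYTES = 20  # preamble (7) + SFD (1) + inter-frame gap (12)


def packets_per_second(link_bps, frame_bytes):
    """Line-rate packet rate for a given frame size (frame includes the FCS)."""
    return link_bps / ((frame_bytes + ETH_OVERHEAD_BYTES) * 8)


def memory_bytes_per_second(link_bps, frame_bytes, metadata_bytes=64):
    """Bytes touched per second if each packet costs a single pass over
    the frame plus one metadata cacheline (a deliberately optimistic model)."""
    return packets_per_second(link_bps, frame_bytes) * (frame_bytes + metadata_bytes)


pps = packets_per_second(100e9, 64)
traffic = memory_bytes_per_second(100e9, 64)

# ASSUMPTION: one DDR4-2400 channel, 2400 MT/s x 8 bytes = 19.2 GB/s peak.
assumed_channel_bytes_per_s = 2400e6 * 8

print(f"{pps / 1e6:.1f} Mpps")
print(f"{traffic / 1e9:.1f} GB/s per single pass")
print(f"{traffic / assumed_channel_bytes_per_s:.0%} of one assumed channel")
```

This only counts one pass over the packet and its metadata. A real forwarding path writes on receive and reads again on transmit, which is one way to see how a bridging application ends up needing several channels per port.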

PCI is your enemy (or not that good a friend)
PCI chipset behavior is complex. The typical payload on x86 is 256 bytes. So I assumed that using a 1KB max payload to support the average 670-byte internet packet size would give better results... But no: early DMA transaction acknowledgement is disabled if the payload is not 256 bytes, so performance dropped significantly. You may have an embedded switch on the NIC, so you think that offloading will give you a benefit. Yes at low speed, but you can't build a 50Gbps service chain because most NICs are in PCI x8 Gen3 slots, which are limited to 50Gbps of bandwidth. So the conclusion is: don't try to understand those limits; create a testbed that really mimics the target "size" and topology of your use case, and measure.
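As a rough illustration of why a PCI x8 Gen3 slot tops out well below its nominal 64 GT/s, here is a minimal sketch. The 8 GT/s per lane and 128b/130b encoding are Gen3 facts; the 24 bytes of per-TLP overhead is an assumption (it depends on header size and options), and the function names are hypothetical.

```python
# Rough PCIe Gen3 throughput estimate. Overhead figures are assumptions.
GEN3_TRANSFERS_PER_LANE = 8e9   # 8 GT/s per lane
GEN3_ENCODING = 128 / 130       # 128b/130b line encoding


def pcie_gen3_raw_bps(lanes):
    """Bit rate left after line encoding, before any packet overhead."""
    return lanes * GEN3_TRANSFERS_PER_LANE * GEN3_ENCODING


def pcie_payload_bps(lanes, payload_bytes, tlp_overhead_bytes=24):
    """Payload bandwidth when every TLP carries payload_bytes of data.

    tlp_overhead_bytes is an ASSUMPTION covering framing, sequence number,
    header and LCRC; adjust it for your platform.
    """
    efficiency = payload_bytes / (payload_bytes + tlp_overhead_bytes)
    return pcie_gen3_raw_bps(lanes) * efficiency


raw = pcie_gen3_raw_bps(8)
print(f"raw x8 Gen3: {raw / 1e9:.1f} Gbps")
for payload in (64, 128, 256):
    usable = pcie_payload_bps(8, payload)
    print(f"payload {payload:3d} B: {usable / 1e9:.1f} Gbps")
```

The sketch ignores flow-control traffic, descriptor fetches, completion write-backs and doorbells, all of which share the same link, so real numbers will be lower. That is one more reason to measure on a realistic testbed rather than trust a model.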

Don't do tests at 10Gbps if your target is 100Gbps.
Starting at 50Gbps you will bump into the PCI DMA transaction rate barrier. Unless you have a smart IO model (multiple packets per DMA transaction - see Netcope for instance) supported in zero-copy by the SDK architecture, you won't reach line rate or have room for an application (zero-copy of data or metadata reduction can save a DRAM channel for the application at this "speed"). I think (but I am not sure) you can squeeze two packets in a buffer with Mellanox cards: that can be instrumental in reaching 50Gbps line rate, but I don't know if DPDK supports this feature.
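To see why packing several packets into one DMA transaction matters, here is a small sketch of the transaction rate at 50Gbps. It assumes one DMA transaction per packet as the baseline and ignores descriptor traffic entirely; `packets_per_transaction` is a hypothetical knob, not a parameter of any real SDK.

```python
# DMA transactions per second needed to sustain line rate.
ETH_OVERHEAD_BYTES = 20  # preamble (7) + SFD (1) + inter-frame gap (12)


def dma_transactions_per_second(link_bps, frame_bytes, packets_per_transaction=1):
    """Transactions/s if each DMA transaction moves a fixed number of packets.

    ASSUMPTION: packet data only; descriptor reads and write-backs are ignored.
    """
    pps = link_bps / ((frame_bytes + ETH_OVERHEAD_BYTES) * 8)
    return pps / packets_per_transaction


for batch in (1, 2, 4, 8):
    rate = dma_transactions_per_second(50e9, 64, batch)
    print(f"{batch} packet(s) per transaction: {rate / 1e6:.1f} M transactions/s")
```

With 64-byte frames, going from one to two packets per transaction halves the required transaction rate, which is why even a two-packets-per-buffer feature can make a difference at this speed.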

Don't measure pps at the switch level if your target is fast VM application behavior.
Measuring that a software switch can do 10Gbps line rate with 64-byte packets does not help at all to predict TCP application performance in a VM. Factors such as GRO/GSO support are more important, as the limiting factor is TCP window opening. I measured web traffic over IPsec links between VMs. The key performance factor was the latency of the switching/IPsec combo: if latency is above a certain level, the TCP window of the endpoints does not open and the in-between software switches become under-utilized.
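The link between latency and TCP throughput can be sketched with the usual window/RTT bound: a sender cannot have more than one window of data in flight per round trip. The window size and latency values below are made up for illustration and are not the ones from my measurements.

```python
# Upper bound on a single TCP flow: one window of data per round trip.

def max_tcp_throughput_bps(window_bytes, rtt_seconds):
    """Best-case throughput of one flow for a given window and round-trip time."""
    return window_bytes * 8 / rtt_seconds


def window_needed_bytes(target_bps, rtt_seconds):
    """Window (bandwidth-delay product) needed to fill target_bps at this RTT."""
    return target_bps * rtt_seconds / 8


WINDOW = 65535  # largest window without the TCP window scale option

# ASSUMPTION: illustrative round-trip times through the switch/IPsec path.
for rtt in (100e-6, 500e-6, 1e-3):
    bound = max_tcp_throughput_bps(WINDOW, rtt)
    print(f"RTT {rtt * 1e6:6.0f} us: at most {bound / 1e9:.2f} Gbps per flow")

needed = window_needed_bytes(10e9, 1e-3)
print(f"window needed for 10Gbps at 1 ms RTT: {needed / 1e6:.2f} MB")
```

Under these assumptions, added latency in the switching path directly lowers what each flow can achieve, regardless of how many 64-byte packets per second the switch can forward.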


My view is that if you use a hardware-specific SDK to build your hardware-specific application, you will get the best out of the hardware. The gains can range from 30% to 100% depending on the HW, so it is not negligible (you may have to prove this assertion ;-). One major reason is the ability to use exactly the software metadata you need, which may shrink to a single cache line, or even to no software metadata at all, as you could use the hardware descriptor directly. The other reason is the ability to use the native IO model of the device, which DPDK may not support. The price to pay is hardware or vendor dependence.


FF


PS1: You may want to clarify your search: you haven't stated whether your interest is an L2 switch or an L3 switch, or whether you are considering bare-metal, container or VM switching. If you want L3 then you probably want to focus on VPP, Contrail or Snabb rather than the low-level packet IO frameworks. With the latest Intel AVF technology, DPDK is almost irrelevant for VPP and actually slows things down on the same hardware (Intel XL710 card). Additionally, the kernel community is working on AF_XDP, which may be relevant for your case.

PS2: I am not sure NANOG is the best list to discuss the technical details you want. That said, it may be the best place to discuss the use cases or realistic testbed setup.

On 04.06.2018 07:41, Kasper Adel wrote:
Hello

I’m asked to evaluate switching platforms that have different forwarding
chips but the same OS.

Assuming these vendors give the same SDK and similar documentation/support, then what would be comparison points to consider, other than the obvious
(price, features, bps, pps).

I’m thinking, how do I validate their claims about capability to do
leaf/spine arch, ToR/Gateways, telemetry, serviceability, facilities to
troubleshoot packet drops or FIB programming misses, hidden tools...etc

It would be great if anyone can give some thoughts around it, especially if
you have tried one or both.

Thanks
Kim

