nanog mailing list archives

Re: Broadcom vs Mellanox based platforms

From: ff () ozog com
Date: Mon, 04 Jun 2018 10:33:32 +0200

Hi Kim,

I'll share some key learnings from my work on high-speed software networking, which I started in 2006, when everyone was laughing at me because I claimed to achieve 10Gbps networking with a CPU.



CPU is less important than memory/QPI

On x86, the memory subsystem includes things like Cache Boxes, the Home Agent, DRAM controllers... The Home Agent is responsible for knowing on which CPU node a cacheline resides, so it can become a centralized bottleneck. DRAM controllers have a queue of pending DRAM requests (instruction pipeline, data prefetch, data...). QPI routing may also severely impact performance. I remember using a 4-socket system that had half the performance of a 2-socket system because of either bad QPI routing programming by the BIOS or a hardware issue.

An order of magnitude to keep in mind is that at 100Gbps, each 64-byte packet and each associated 64-byte metadata cacheline in use consumes roughly a full DRAM channel. As an example, and not counting the application data to be leveraged (FIB, DNS database...), a 100Gbps DPDK bridging application requires 3 memory channels per port (to reach line rate, if the IO allows it)... There is a lot more to say but I'll let you do your own research ;-)

BTW, why would you want to do 100Gbps line rate (or very close to it)? To ensure that each node has the capacity to resist a DDoS attack powered by DPDK/ODP/native "applications".
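As a rough back-of-the-envelope check of the order of magnitude above, the sketch below computes the 64-byte packet rate at 100Gbps and the memory traffic generated when every packet touches one 64-byte cacheline. The function names are made up for illustration. The 20 bytes of per-frame overhead are the Ethernet preamble, start-of-frame delimiter and inter-frame gap. The DDR4-2400 channel and the "one DMA write plus one CPU read per cacheline" model are assumptions chosen only for comparison; sustained DRAM bandwidth under mixed reads and writes is well below the theoretical peak.

```python
# Back-of-the-envelope sketch; names and the DDR4-2400 figure are illustrative assumptions.

ETH_OVERHEAD_BYTES = 20  # 7 preamble + 1 SFD + 12 inter-frame gap


def line_rate_pps(link_bps, frame_bytes):
    """Packets per second needed to saturate a link with fixed-size frames."""
    return link_bps / ((frame_bytes + ETH_OVERHEAD_BYTES) * 8)


def cacheline_stream_gbytes(pps, cacheline_bytes=64):
    """Memory traffic in GB/s if every packet touches one cacheline once."""
    return pps * cacheline_bytes / 1e9


pps = line_rate_pps(100e9, 64)
per_stream = cacheline_stream_gbytes(pps)
# Assumed channel: 2400 MT/s x 8 bytes per transfer (theoretical peak).
ddr4_2400_peak = 2400e6 * 8 / 1e9

print(f"packet rate: {pps / 1e6:.1f} Mpps, {1e9 / pps:.2f} ns per packet")
print(f"one cacheline touched per packet: {per_stream:.2f} GB/s")
print(f"one write plus one read of it: {2 * per_stream:.2f} GB/s "
      f"(assumed channel peak: {ddr4_2400_peak:.1f} GB/s)")
```

Under these assumptions the packet data and the metadata cacheline each generate a stream of this size, which is why the number of memory channels matters more than raw CPU speed.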


PCI is your enemy (or not that good a friend)

PCI chipset behavior is complex. The typical max payload on x86 is 256 bytes. So I assumed that using a 1KB max payload to support the average 670-byte internet packet size would give better results... But no: early DMA transaction acknowledgement is disabled if the payload is not 256 bytes, so performance dropped significantly.

You may have an embedded switch on the NIC, so you think that offloading will give you a benefit. Yes at low speed, but you can't build a 50Gbps service chain because most NICs are in PCI x8 Gen3 slots, which are limited to 50Gbps of bandwidth.

So the conclusion is: don't try to understand those limits; create a testbed that really mimics the target "size" and topology of your use case, and measure.
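To give a feel for where a figure like 50Gbps on a Gen3 x8 slot comes from, here is a minimal estimate. The 8 GT/s lane rate and 128b/130b encoding are standard for PCIe Gen3; the 24 bytes of per-TLP overhead (framing, a 4DW header and LCRC) is an assumption, and the model ignores descriptor traffic, read requests, completions and flow-control packets, all of which push the usable figure lower. It is a sketch, not a replacement for measuring on a real testbed.

```python
# Rough PCIe Gen3 estimate; the per-TLP overhead is an assumed figure.

GEN3_TRANSFERS_PER_LANE = 8e9  # 8 GT/s per lane
ENCODING = 128 / 130           # Gen3 128b/130b line encoding
TLP_OVERHEAD_BYTES = 24        # assumption: framing + 4DW header + LCRC


def raw_gbps(lanes):
    """Link bandwidth after line encoding, before any packet overhead."""
    return GEN3_TRANSFERS_PER_LANE * lanes * ENCODING / 1e9


def payload_gbps(lanes, payload_bytes):
    """Upper bound on payload bandwidth for a given TLP payload size."""
    efficiency = payload_bytes / (payload_bytes + TLP_OVERHEAD_BYTES)
    return raw_gbps(lanes) * efficiency


print(f"Gen3 x8 raw: {raw_gbps(8):.1f} Gbps")
for size in (64, 128, 256):
    print(f"{size:>3}-byte payloads: at most {payload_gbps(8, size):.1f} Gbps")
```

Even this optimistic bound shrinks quickly with small payloads, before counting the chipset behaviors described above.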


Don't do tests at 10Gbps if your target is 100Gbps.

Starting at 50Gbps you will bump into the PCI DMA transaction rate barrier. Unless you have a smart IO model (multiple packets per DMA transaction, see Netcope for instance) supported in zero-copy by the SDK architecture, you won't reach line rate or be able to run an application (zero-copy of data or metadata reduction can save a DRAM channel for the application at this "speed"). I think (but am not sure) you can squeeze two packets in a buffer with Mellanox cards: that can be instrumental in reaching 50Gbps line rate, but I don't know if DPDK supports this feature.
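A small sketch of why packing several packets into one DMA transaction helps: it divides the transaction rate the PCI subsystem has to sustain. The function name is hypothetical, and the model assumes one data transaction per batch while ignoring descriptor reads and writes, so treat the numbers as a lower bound on the real transaction count.

```python
# Illustrative model only: one DMA data transaction per batch of packets.

ETH_OVERHEAD_BYTES = 20  # preamble + SFD + inter-frame gap


def dma_transactions_per_second(link_bps, frame_bytes, packets_per_transaction=1):
    """Data DMA transactions per second needed to keep up with line rate."""
    pps = link_bps / ((frame_bytes + ETH_OVERHEAD_BYTES) * 8)
    return pps / packets_per_transaction


for batch in (1, 2, 8):
    rate = dma_transactions_per_second(50e9, 64, batch)
    print(f"{batch} packet(s) per transaction: {rate / 1e6:.1f} M transactions/s")
```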

Don't do pps at the switch level if your target is fast VM application behavior.

Measuring that a software switch can do 10Gbps line rate with 64-byte packets does not help at all to predict TCP application performance in a VM. Factors such as GRO/GSO support are more important, as the limiting factor is TCP window opening.

I measured web traffic over IPsec links between VMs. The key performance factor was the latency of the switching/IPsec combo: if latency is above a certain level, the TCP window of the endpoints does not open and the in-between software switches become under-utilized.
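One simple way to see why latency dominates here is the classic bound of one window per round trip: a TCP flow cannot send more than its window every RTT. The sketch below applies that bound; the function name, the 64 KiB window and the RTT values are illustrative assumptions, not measurements from the setup described above.

```python
# Window-per-RTT bound on a single TCP flow; all input values are illustrative.


def max_tcp_throughput_bps(window_bytes, rtt_seconds):
    """Upper bound on throughput when at most one window is in flight per RTT."""
    return window_bytes * 8 / rtt_seconds


WINDOW_BYTES = 64 * 1024  # assumed window that has not grown

for rtt_us in (50, 200, 1000):
    bps = max_tcp_throughput_bps(WINDOW_BYTES, rtt_us / 1e6)
    print(f"RTT {rtt_us:>4} us: at most {bps / 1e9:.2f} Gbps per flow")
```

Every microsecond the switching/IPsec path adds to the RTT lowers this ceiling, regardless of how many packets per second the switch can forward.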

My view is that if you use a hardware-specific SDK to build your hardware-specific application, you will get the best out of the hardware. The gains can range from 30% to 100% depending on the hardware, so they are not negligible (you may have to prove this assertion ;-). One major reason is the ability to use exactly the software metadata you need, which may shrink to a single cache line, or even to no software metadata at all, as you could leverage the hardware descriptor directly. The other reason is to leverage the native IO model of the device, which DPDK may not support. The price to pay is hardware or vendor dependence.

FF

PS1: You may want to clarify your search: you haven't stated if your interest is an L2 switch or an L3 switch, or if you consider bare-metal switching, container or VM switching.

If you want L3 then you probably want to focus on VPP, Contrail or Snabb rather than the low-level packet IO frameworks. With the latest Intel AVF technology, DPDK is almost irrelevant for VPP and actually slows things down with the same hardware (Intel XL710 card).

Additionally, the kernel community is working on AF_XDP, which may be relevant for your case.

PS2: I am not sure NANOG is the best list to discuss the technical details you want. That said, it may be the best place to discuss the use cases or a realistic testbed setup.


On 04.06.2018 07:41, Kasper Adel wrote:

Hello

I’m asked to evaluate switching platforms that have different forwarding
chips but the same OS.
Assuming these vendors give the same SDK and similar documentation/support,
then what would be comparison points to consider, other than the obvious
(price, features, bps, pps).

I’m thinking, how do I validate their claims about capability to do
leaf/spine arch, ToR/Gateways, telemetry, serviceability, facilities to
troubleshoot packet drops or FIB programming misses, hidden tools...etc
It would be great if anyone can give some thoughts around it, especially if
you have tried one or both.

Thanks
Kim
