nanog mailing list archives

RE: 400G forwarding - how does it work?


From: <ljwobker () gmail com>
Date: Fri, 5 Aug 2022 13:31:26 -0400

Disclaimer:  I work for Cisco on a bunch of silicon.  I'm not intimately familiar with any of these devices, but I'm 
familiar with the high level tradeoffs.  There are also exceptions to almost EVERYTHING I'm about to say, especially 
once you get into the second- and third-order implementation details.  Your mileage will vary...   ;-)

If you have a model where one core/block does ALL of the processing, you generally benefit from lower latency, simpler 
programming, etc.  A major downside is that to do this, every one of those cores has to have access to all of the different 
memories used to forward said packet.  Conversely, if you break the processing up into stages, you only need to connect the 
FIB lookup memory to the cores that are actually doing the FIB lookups, and the encap memories only to the 
cores/blocks doing the encapsulation work.  Those extra interconnects take up silicon space, which equates to higher 
cost and power.  
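
To put rough numbers on that interconnect cost, here is a toy back-of-the-envelope sketch in Python (the core and memory counts are made up for illustration, not taken from any real device):

# Toy comparison of memory-interconnect counts (hypothetical numbers).
# Run-to-completion: every core must reach every forwarding memory.
# Staged pipeline: each stage only reaches the one memory it actually uses.

cores = 16                                        # processing cores/blocks
memories = ["FIB", "encap", "counters", "ACL"]    # distinct memory types

# Run-to-completion: full mesh of cores x memories.
run_to_completion_links = cores * len(memories)

# Staged: split the cores evenly into one stage per memory type and
# wire each stage only to its own memory.
cores_per_stage = cores // len(memories)
staged_links = cores_per_stage * len(memories)

print("run-to-completion links:", run_to_completion_links)   # 64
print("staged pipeline links:  ", staged_links)               # 16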

Packaging two cores on a single device is beneficial in that you only have one physical chip to work with instead of 
two.  This often simplifies the board designers' job, and is often lower power than two separate chips.  This starts to 
break down as you get to exceptionally large chips, because you bump into the various physical/reticle limits on how 
large a chip you can actually build.  With newer packaging technology (2.5D chips, HBM and similar memories, chiplets 
down the road, etc.) this becomes even more complicated, but the answer to "why would you put two XYZs on a package?" is 
that it's simply cheaper and lower power from a system standpoint (and often from a pure silicon standpoint as well...)

Buffer designs are *really* hard in modern high speed chips, and there are always lots and lots of tradeoffs.  The 
"ideal" answer is an extremely large block of memory that ALL of the forwarding/queueing elements have fair/equal 
access to... but this physically looks more or less like a full mesh between the memory/buffering subsystem and all the 
forwarding engines, which becomes really unwieldy (expensive!) from a design standpoint.  The amount of memory you can 
practically put on the main NPU die is on the order of 20-200 **mega** bytes, where a single stack of HBM memory comes 
in at 4GB -- it's literally 100x the size.  Figuring out which side of this gigantic gulf you want to live on is a 
super important part of the basic architecture and also drives lots of other decisions down the line... once you've 
decided how much buffering memory you're willing/able to put down, the next challenge is coming up with ways to provide 
access to that memory from all the different potential clients.  It's a LOT easier to wire up/design a chip where you 
have four separate pipelines/cores/whatever and each one of them accesses 1/4 of the buffer memory... but that also 
means that any given port only has access to 1/4 of the memory for burst absorption.  Lots and lots of Smart People 
Time has gone into different memory designs that attempt to optimize this problem, and it's a major part of the 
intellectual property of various chip designs.  
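
As a rough illustration of what that 1/4 partitioning means for burst absorption, here's a quick back-of-the-envelope calculation (all numbers invented for the example, not from any particular chip):

# Back-of-the-envelope burst-absorption time for a partitioned vs. shared
# buffer.  All numbers are illustrative, not from any real NPU.

total_buffer_bytes = 100e6     # pretend 100 MB of on-die packet buffer
pipelines = 4                  # buffer carved into 4 private slices
port_rate_bps = 400e9          # one 400G port absorbing a line-rate burst

def burst_ms(buffer_bytes: float, rate_bps: float) -> float:
    """How many milliseconds this much buffer can absorb a full-rate burst."""
    return buffer_bytes * 8 / rate_bps * 1e3

print(f"shared buffer:      {burst_ms(total_buffer_bytes, port_rate_bps):.2f} ms")               # 2.00 ms
print(f"1/4 slice per port: {burst_ms(total_buffer_bytes / pipelines, port_rate_bps):.2f} ms")   # 0.50 ms

Carving the pool into private slices directly divides how long any one port can ride out a burst, which is exactly the wiring-vs-burst-absorption tradeoff described above.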

--lj

-----Original Message-----
From: NANOG <nanog-bounces+ljwobker=gmail.com () nanog org> On Behalf Of Saku Ytti
Sent: Friday, August 5, 2022 3:16 AM
To: Jeff Tantsura <jefftant.ietf () gmail com>
Cc: NANOG <nanog () nanog org>; Jeff Doyle <jdoyle () juniper net>
Subject: Re: 400G forwarding - how does it work?

Thank you for this.

I wish there had been a deeper dive into the lookup side. My open questions:

a) The Trio model, where a packet stays in a single PPE until done, vs. the FP model of a line of identical PPE cores: I don't 
understand the advantages of the FP model; the Trio model's advantages are clear to me. Obviously the FP model has to 
have some advantages -- what are they?

b) What exactly are the gains of putting two Trios on-package in Trio6? There is no local switching between the WANs of 
the in-package Trios; as far as I can tell they are ships in the night, and packets between the Trios go via fabric, as they 
would with separate Trios. I can understand the benefit of putting a Trio and HBM2 on the same package, to reduce 
distance so that wattage goes down or frequency goes up.

c) What evolution are they thinking of for the shallow ingress buffers in Trio6? The collateral-damage potential is 
significant, because the WAN that asks for the most gets the most instead of each getting its fair share, so an arbitrarily 
low-rate WAN ingress might be denied access to the ingress buffer entirely and see drops. Would it be practical in terms of 
wattage/area to add some sort of pre-QoS stage in front of the shallow ingress buffer, so that each WAN ingress has a fair 
guaranteed rate into the shallow buffers?
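
A minimal sketch of the kind of pre-QoS admission stage meant here, purely hypothetical and not anything any vendor has described: each WAN ingress keeps a small guaranteed slice of the shallow buffer, and only the remainder is shared first-come-first-served.

# Hypothetical pre-QoS admission to a shared shallow ingress buffer.
# Every WAN port gets a private guaranteed slice; overflow competes for
# the shared remainder.  Purely illustrative, not how Trio actually works.

class ShallowBufferAdmission:
    def __init__(self, total_cells: int, wan_ports: int, reserved_fraction: float = 0.5):
        per_port = int(total_cells * reserved_fraction) // wan_ports
        self.guaranteed = per_port                        # private slice per WAN port
        self.shared = total_cells - per_port * wan_ports  # FCFS remainder
        self.in_use = {p: 0 for p in range(wan_ports)}

    def admit(self, port: int, cells: int) -> bool:
        """True if this ingress burst may enter the shallow buffer, else drop."""
        if self.in_use[port] + cells <= self.guaranteed:
            self.in_use[port] += cells        # fits in the port's guaranteed share
            return True
        if cells <= self.shared:
            self.shared -= cells              # spill into the shared pool
            self.in_use[port] += cells
            return True
        return False                          # no guaranteed or shared space left

    def release(self, port: int, cells: int) -> None:
        """Return cells to the right pool once traffic drains from the buffer."""
        overage = max(self.in_use[port] - self.guaranteed, 0)
        self.shared += min(cells, overage)    # give the shared part back first
        self.in_use[port] -= cells

# Example: 4096 cells, 8 WAN ports -> 256 guaranteed cells per port.
adm = ShallowBufferAdmission(total_cells=4096, wan_ports=8)
print(adm.admit(0, 200))   # True: within port 0's guaranteed share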

On Fri, 5 Aug 2022 at 02:18, Jeff Tantsura <jefftant.ietf () gmail com> wrote:

Apologies for garbage/HTMLed email, not sure what happened (thanks 
Brian F for letting me know).
Anyway, the podcast with Juniper (mostly around Trio/Express) was broadcast today and is available at 
https://www.youtube.com/watch?v=1he8GjDBq9g
Next in the pipeline are:
Cisco SiliconOne
Broadcom DNX (Jericho/Qumran/Ramon)
For both, the guests are the main architects of the silicon

Enjoy


On Wed, Aug 3, 2022 at 5:06 PM Jeff Tantsura <jefftant.ietf () gmail com> wrote:

Hey,



This is not an advertisement but an attempt to help folks better understand networking HW.



Some of you might know (and love 😊) the “between 0x2 nerds” podcast that Jeff Doyle and I have been hosting for a couple of 
years.



Following up on the discussion, we have decided to dedicate a number of upcoming podcasts to networking HW, a topic 
where more information and better education are very much needed (no, you won’t have to sign an NDA before joining 😊). 
We have lined up a number of great guests, people who design and build ASICs and can talk firsthand about the evolution 
of networking HW, the complexity of the process, the differences between fixed and programmable pipelines, memories and 
databases. This Thursday (08/04) at 11:00 PST we are joined by the one and only Sharada Yeluri, Sr. Director ASIC at 
Juniper. Other vendors will join in later episodes; the usual rules apply: no marketing, no BS.

More to come, stay tuned.

Live feed: https://lnkd.in/gk2x2ezZ

Between 0x2 nerds playlist, videos will be published to: 
https://www.youtube.com/playlist?list=PLMYH1xDLIabuZCr1Yeoo39enogPA2yJB7



Cheers,

Jeff



From: James Bensley
Sent: Wednesday, July 27, 2022 12:53 PM
To: Lawrence Wobker; NANOG
Subject: Re: 400G forwarding - how does it work?



On Tue, 26 Jul 2022 at 21:39, Lawrence Wobker <ljwobker () gmail com> wrote:

So if this pipeline can do 1.25 billion PPS and I want to be able to forward 10BPPS, I can build a chip that has 
8 of these pipelines and get my performance target that way.  I could also build a "pipeline" that processes 
multiple packets per clock, if I have one that does 2 packets/clock then I only need 4 of said pipelines... and 
so on and so forth.



Thanks for the response Lawrence.



The Broadcom BCM16K KBP has a clock speed of 1.2 GHz, so I expect the J2 to have something similar (as someone already 
mentioned, most chips I've seen are in the 1-1.5 GHz range), so in this case "only" 2 pipelines would be needed to maintain 
the headline 2 Bpps rate of the J2, or even just 1 if they have managed to squeeze out two packets per cycle through 
parallelisation within the pipeline.
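
For anyone who wants to check that arithmetic, the pipelines-needed calculation is just a ceiling division (the J2 clock speed above is a guess, not a confirmed figure):

import math

def pipelines_needed(target_pps: float, clock_hz: float, packets_per_clock: int = 1) -> int:
    """Pipelines required to hit a headline packet rate, assuming each
    pipeline retires packets_per_clock packets on every clock cycle."""
    return math.ceil(target_pps / (clock_hz * packets_per_clock))

# Lawrence's example above: 1.25 Gpps per pipeline, 10 Bpps target.
print(pipelines_needed(10e9, 1.25e9))                      # 8
# The guessed J2 case: 2 Bpps headline at an assumed ~1.2 GHz clock.
print(pipelines_needed(2e9, 1.2e9))                        # 2
print(pipelines_needed(2e9, 1.2e9, packets_per_clock=2))   # 1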



Cheers,

James.





--
  ++ytti

