nanog mailing list archives

RE: 400G forwarding - how does it work?


From: <ljwobker () gmail com>
Date: Wed, 27 Jul 2022 09:56:55 -0400

The Broadcom KBP -- often called an "external TCAM" -- is really closer to a completely separate NPU than just an external
TCAM.  "Back in the day" we used external TCAMs to store forwarding state (FIB tables, ACL tables, whatever) on devices 
that were pretty much just a bunch of TCAM memory and an interface for the "main" NPU to ask for a lookup.  Today the 
modern KBP devices have WAY more functionality, they have lots of different databases and tables available, which can 
be sliced and diced into different widths and depths.  They can store lots of different kinds of state, from counters 
to LPM prefixes and ACLs.  At risk of correcting Ohta-san, note that most ACLs are implemented using TCAMs with 
wildcard/masking support, as opposed to an exact match lookup.  Exact match lookups are generally used for things that 
do not require masking or wildcard bits: MAC addresses and MPLS label values are the canonical examples here.  
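To make the TCAM-vs-exact-match distinction concrete, here's a toy sketch (my own illustration, not vendor code): a ternary entry matches when the masked key equals the masked value, so wildcard bits are simply mask=0, while an exact-match table needs no mask at all and behaves like a hash lookup. The entry widths and values are made up.

```python
# Toy illustration of ternary (TCAM-style) vs exact-match lookup.
# A TCAM entry matches when (key & mask) == (value & mask);
# wildcard bits have mask bit = 0.

class TcamEntry:
    def __init__(self, value, mask, result):
        self.value, self.mask, self.result = value, mask, result

    def matches(self, key):
        return (key & self.mask) == (self.value & self.mask)

def tcam_lookup(entries, key):
    """Return the result of the first (highest-priority) matching entry."""
    for e in entries:  # entries are ordered by priority
        if e.matches(key):
            return e.result
    return None

# ACL-style rule: match any source in 10.1.0.0/16 (upper 16 bits fixed).
acl = [TcamEntry(value=0x0A010000, mask=0xFFFF0000, result="permit")]
print(tcam_lookup(acl, 0x0A01FF02))   # 10.1.255.2 matches the /16

# Exact match (MAC address, MPLS label): no masking, just a hash table.
exact = {0x0A010001: "host-route"}
print(exact.get(0x0A010001))
```

A real TCAM does all entry comparisons in parallel in one clock; the Python loop is only there to show the matching semantics.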

The SRAM memories used in fast networking chips are almost always built such that they provide one lookup per clock, 
although hardware designers often use multiple banks of these to increase the number of *effective* lookups per clock.  
TCAMs are also generally built such that they provide one lookup/result per clock, but again you can stack up multiple 
devices to increase this.
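The "one lookup per clock, multiplied by banking" point reduces to simple arithmetic; the sketch below uses an assumed 1 GHz memory clock purely for illustration (no real device's numbers are implied).

```python
# Back-of-the-envelope: one lookup per clock per bank means the
# effective lookup rate scales with clock frequency and bank count.
# The 1 GHz clock below is an assumption, not a datasheet figure.

def effective_lookups_per_sec(clock_hz, banks):
    return clock_hz * banks

clock_hz = 1.0e9  # assumed 1 GHz memory clock
for banks in (1, 2, 4):
    rate = effective_lookups_per_sec(clock_hz, banks)
    print(f"{banks} bank(s): {rate / 1e9:.0f}G lookups/sec")
```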

Many hardware designs also allow for more flexibility in how the various memories are utilized by the software -- 
almost everyone is familiar with the idea of "I can have a million entries of X bits, or half a million entries of 2*X 
bits".  If the hardware and software complexity were free, we'd design memories that could be arbitrarily chopped into
exactly the sizes we need, but that complexity is Absolutely Not Free.... so we end up picking a few discrete sizes and 
the software/forwarding code has to figure out how to use those bits efficiently.  And you can bet your life that as 
soon as you have a memory that can function using either 80b or 160b entries, you will immediately come across a use 
case that really really needs to use entries of 81b.
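The cost of those discrete entry sizes can be sketched numerically: with supported widths of (say) 80b and 160b, an 81-bit entry burns a whole 160-bit slot. Pool size and widths here are hypothetical, chosen only to mirror the example in the text.

```python
# Sketch of the capacity tradeoff: a fixed pool of bits can hold
# N entries of width W, or N/2 entries of width 2W -- but an 81-bit
# entry must occupy a full 160-bit slot.  Sizes are made up.

POOL_BITS = 1_000_000 * 80  # hypothetical: 1M x 80b of raw SRAM

def capacity(entry_bits, widths=(80, 160)):
    """(entries that fit, slot width) after rounding up to a supported width."""
    slot = min(w for w in widths if w >= entry_bits)
    return POOL_BITS // slot, slot

for need in (80, 81, 160):
    n, slot = capacity(need)
    print(f"{need}b entries -> stored as {slot}b slots, {n:,} fit")
```

So the 81-bit use case immediately halves your capacity, which is exactly why the discrete-width choice hurts.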

FYI: There's nothing particularly magical about 40b memory widths.  When building these chips you can (more or less) 
pick whatever width of SRAM you want to build, and the memory libraries that you use spit out the corresponding 
physical design.

Ohta-san correctly mentions that a critical part of the performance analysis is how fast the different parts of the 
pipeline can talk to each other.  Note that this concept applies whether we're talking about the connection between 
very small blocks within the ASIC/NPU, or the interface between the NPU and an external KBP/TCAM, or for that matter 
between multiple NPUs/fabric chips within a system.  At some point you'll always be constrained by whatever the slowest 
link in the pipeline is, so balancing all that stuff out is Yet One More Thing for the system designer to deal with.
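The "slowest link wins" observation is just a min() over the stage rates; the stage names and per-stage rates below are invented for illustration.

```python
# The sustained packet rate of a lookup pipeline is bounded by its
# slowest stage or link.  Stage names/rates are hypothetical.

stages = {
    "parser":         900e6,  # packets/sec each stage can sustain
    "NPU<->KBP link": 600e6,
    "LPM lookup":     800e6,
    "fabric link":    700e6,
}

bottleneck = min(stages, key=stages.get)
print(f"bottleneck: {bottleneck} at {stages[bottleneck] / 1e6:.0f} Mpps")
```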



--lj

-----Original Message-----
From: NANOG <nanog-bounces+ljwobker=gmail.com () nanog org> On Behalf Of Masataka Ohta
Sent: Wednesday, July 27, 2022 9:09 AM
To: nanog () nanog org
Subject: Re: 400G forwarding - how does it work?

James Bensley wrote:

> The BCM16K documentation suggests that it uses TCAM for exact matching
> (e.g., for ACLs) in something called the "Database Array"
> (with 2M 40b entries?), and SRAM for LPM (e.g., IP lookups) in
> something called the "User Data Array" (with 16M 32b entries?).

Which documentation?

According to:

        https://docs.broadcom.com/docs/16000-DS1-PUB

figure 1 and related explanations:

        Database records 40b: 2048k/1024k.
        Table width configurable as 80/160/320/480/640 bits.
        User Data Array for associated data, width configurable as
        32/64/128/256 bits.

means that the header extracted by the 88690 is analyzed by the 16K, finally resulting in 40b of information (a lot
shorter than an IPv6 address, but perhaps still enough for an IPv6 backbone to identify sites) via a "database"
lookup -- which is obviously done by CAM, because 40b keys are painful for SRAM -- and then converted to "32/64/128/256
bits" of associated data.

> 1 second / 164473684 packets = 1 packet every 6.08 nanoseconds, which
> is within the access time of TCAM and SRAM

As high-speed TCAM and SRAM should be pipelined, it is the cycle time, which is shorter than the access time, that matters.
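The quoted arithmetic and the pipelining point can both be checked in a few lines; the 15 ns latency / 2 ns cycle time pair below is an assumed example, not a device spec.

```python
# Checking the quoted arithmetic: at 164,473,684 packets/sec the
# per-packet budget is ~6.08 ns.  A pipelined memory only needs its
# *cycle* time (issue-to-issue interval) under that budget, even if
# the full access latency is longer.

pps = 164_473_684
budget_ns = 1e9 / pps
print(f"per-packet budget: {budget_ns:.2f} ns")

# Assumed example: a lookup with 15 ns total latency but 2 ns cycle
# time still delivers one result every 2 ns once the pipeline fills.
latency_ns, cycle_ns = 15.0, 2.0
assert cycle_ns < budget_ns < latency_ns
```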

Finally, it should be pointed out that most, if not all, performance figures such as MIPS and FLOPS are merely
guaranteed not to be exceeded.

In this case, if deep packet inspection of lengthy headers is required -- for some complicated routing scheme, or to
satisfy NSA requirements -- the communication speed between the 88690 and the 16K will be the limiting factor for PPS,
resulting in far less than the maximum possible rate.

                                                Masataka Ohta
