Inside a 32x400G TH3 switch

We have already talked quite a bit about white-box switch architecture and how networking software works, with pieces on the basics, ICOS, and OpenSwitch. It was all squares, arrows, names, and other theoretical cliches, completely detached from the actual hardware.

This one will be a more physical piece, looking inside one of the most advanced products on the market: a 12.8Tbps box with 32x400G QSFP-DD ports. As of now, it’s cutting-edge hardware that hyperscalers are deploying, and it has only started to attract tentative interest outside these giants.

Today we are covering Aurora 820 by Netberg with the Broadcom Tomahawk3 BCM56980 at the core.

Aurora 820

The StrataXGS® Tomahawk® 3 doesn’t need much introduction, as the silicon announcement happened two years ago. However, unlike in the PC/server world, networking products have quite a distance to go before reaching end-users. Transceivers, cables, and even NICs all need to reach maturity and a reasonable cost before the mass market goes for them.

A single 100G port demands 12.5GB/s bandwidth from the host system. A dual-port card wants double that amount. The most recent PCIe revision in production, the 4.0, falls slightly short of 2GB/s per lane – 1.97GB/s at 16GT/s.

What happens if we go to 400G Ethernet? Even a single-port card would eat 50GB/s, while an x16 PCIe 4.0 slot delivers only about 31.5GB/s.
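The arithmetic above can be reproduced with a few lines of Python. This is only a rough sketch: it accounts for the 128b/130b line encoding used by PCIe since Gen3 and ignores protocol overhead (packet headers, flow control), so real throughput is lower still. The function names are ours, made up for illustration.

```python
# Back-of-the-envelope check: can a PCIe 4.0 x16 slot feed an Ethernet port at line rate?
# Only line-encoding overhead is modeled; real-world throughput is lower.

def pcie_lane_gbytes(gt_per_s: float, payload_bits: int = 128, encoded_bits: int = 130) -> float:
    """Usable GB/s of a single PCIe lane after 128b/130b encoding."""
    return gt_per_s * payload_bits / encoded_bits / 8


def ethernet_gbytes(gbit_per_s: float) -> float:
    """Line rate of an Ethernet port expressed in GB/s."""
    return gbit_per_s / 8


PCIE4_GT_PER_S = 16.0                      # PCIe 4.0 signaling rate per lane
lane = pcie_lane_gbytes(PCIE4_GT_PER_S)    # just under 2 GB/s
slot_x16 = 16 * lane

for ports, speed in [(1, 100), (2, 100), (1, 400)]:
    need = ports * ethernet_gbytes(speed)
    verdict = "fits" if need <= slot_x16 else "does not fit"
    print(f"{ports} x {speed}G needs {need:.1f} GB/s: {verdict} in x16 Gen4 ({slot_x16:.1f} GB/s)")
```

A dual-port 100G card still fits into an x16 Gen4 slot, while a single 400G port does not.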

So at this moment, 400G Ethernet makes sense only at aggregation/spine layers of networks. One can easily imagine one of the countless Leaf-Spine Fabric depictions.

Enough with introductions. Let’s have a look at the box itself.

External features

The 1U front panel houses 32 QSFP-DD ports and management interfaces. QSFP-DD stands for Quad Small Form Factor Pluggable Double Density, as it doubles the number of electrical lanes from four to eight compared with the previous QSFP28 modules. It is also backward compatible, allowing the use of existing QSFP modules for cost-saving and flexibility.

Aurora 820 front view

On the right side, we see two management RJ45 ports (who said daisy-chaining?), one RJ45 console, one ToD port, and one USB Type-A port. The system status LED and reset button are right below the USB port.

ToD stands for Time-of-Day, a port used for time synchronization using a 1PPS signal. A handy feature for Synchronous Ethernet (SyncE) applications.

Aurora 820 management interface area

Moving to the back, we can see two power modules and six fans – latches, handles, and more latches next to handles.

Aurora 820 rear view

As usual in this class of gear, all fans and power supply modules are hot-swappable. Fan cages have LED indicators for quick diagnosis. The AC inlet has a notch in it, so it takes a C15-type connector. It’s intentional, as the 1.3kW Platinum module demands special thick cables for safety and stable power delivery.

Aurora 820 fans and PSU

Insides

Now we come to the interesting part. The overall layout is pretty standard – the switching ASIC next to QSFP-DD ports followed by the CPU board.

Aurora 820 without its top cover

The TH3 heatsink stands out from all 100G-era products. It’s MASSIVE, two-thirds of the 19″ chassis width.

The Tomahawk 3 heatsink

A 400G particularity worth paying attention to is the heatsinks over the QSFP-DD cages. The QSFP28 power limit is only 3.5W, so those modules are OK with simple cages. QSFP-DD has two power modes, Low Power Mode and High Power Mode, and eight power classes. By the standard, the longest-reach modules may go over 14W!

Aurora 820 supports 400G QSFP-DD LR8, SR8, SR4.2 BiDi, FR8, and DR4 with up to 12W per module (Power Class 6).

QSFP-DD cages
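To get a feel for why the cages need heatsinks, here is a quick worst-case estimate of the heat produced by the optics alone. It is a sketch that uses only the figures quoted in this article (32 ports, 12W per module, 3.5W for QSFP28, a 1.3kW power module); comparing against a single PSU rating is our simplifying assumption, not a vendor power budget.

```python
# Worst-case power drawn by pluggable modules on a fully populated front panel.
# All figures come from the article; the comparison itself is illustrative only.

PORTS = 32
QSFP_DD_MAX_W = 12.0    # Power Class 6, the per-module limit quoted for this box
QSFP28_MAX_W = 3.5      # QSFP28 limit quoted above
PSU_RATING_W = 1300.0   # rating of one power module (assumed here to carry the full load)


def optics_budget(ports: int, watts_per_module: float) -> float:
    """Total power of all modules if every port holds a module at its limit."""
    return ports * watts_per_module


dd_total = optics_budget(PORTS, QSFP_DD_MAX_W)
q28_total = optics_budget(PORTS, QSFP28_MAX_W)

print(f"QSFP-DD: {dd_total:.0f} W, QSFP28: {q28_total:.0f} W")
print(f"QSFP-DD optics share of one PSU rating: {dd_total / PSU_RATING_W:.0%}")
```

Under these assumptions, a fully loaded front panel dissipates several hundred watts right at the faceplate, more than three times what a QSFP28 panel would.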

Major visible parts inside are: the Tomahawk 3 ASIC, Intel Xeon D-1527, Aspeed AST2520 BMC, and an FPGA by Lattice.

Major parts

Next to the QSFP-DD cages and the ASIC is a MachXO2-4000 FPGA that controls the LEDs during boot. Once the OS is up and running, it takes over control via the I2C bus.

MachXO2

The Aspeed AST2520 BMC brings more of a server-like control to the network equipment. Nobody likes walking to an unresponsive switch for a quick reboot!

AST2520

The CPU board has two memory slots for DDR4 ECC SO-DIMMs, up to 32GB total, one M.2 SATA connector, and even one PCIe Gen.3 x1 slot for those in need. Our SKU has 8GB of memory, 128GB of SSD storage, and a Xeon D-1527 quad-core CPU onboard. As usual, it’s configurable on demand.

CPU board

Under the CPU board, one more FPGA is hiding. Again, it’s a MachXO2-4000 by Lattice.

Another FPGA

The fan board is a passive power and signal delivering PCB. All the control comes from the mainboard.

Fan board

Another particularity we wanted to point out is how thick the main PCB is. Almost double that of the CPU board! That’s what it takes to deliver 12.8Tbps without losing signal integrity. Desktop and even server boards are playthings next to this one.

PCB thickness

Overall, a modern switch resembles a server with an accelerator. The packet processing logic that used to run on the CPU is offloaded to the ASIC to run at wire speed. Moreover, programmable chipsets resemble FPGAs in many ways: it takes quite a while to compile the program one wants to run.

Booting the switch up

Anyone who has worked with a server in the last ten-plus years will recognize the AMI Aptio BIOS setup utility. Just like in regular hardware, there are many platform options, security, boot features, etc. One tab shows the BMC settings with access to event logs, the BMC network port configuration, and resilient boot options.

In Linux (ONL in our case), the all-time favorite ipmitool allows remote control over the system power and sensors.

ipmitool
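For scripting, the same ipmitool calls can be wrapped in a few lines of Python. This is a minimal sketch, not part of the switch software: the BMC address (from a documentation-only IP range) and the credentials are placeholders, and the helper names are ours. The ipmitool options used (-I lanplus, -H, -U, -P) and the chassis power and sdr list subcommands are standard.

```python
import subprocess
from typing import List


def ipmi_argv(host: str, user: str, password: str, *command: str) -> List[str]:
    """Build an ipmitool command line that talks to a BMC over the network."""
    # Note: -P exposes the password in the process list; fine for a lab, not for production.
    return ["ipmitool", "-I", "lanplus", "-H", host, "-U", user, "-P", password, *command]


def run_ipmi(host: str, user: str, password: str, *command: str) -> str:
    """Run ipmitool and return its standard output (raises if the call fails)."""
    result = subprocess.run(
        ipmi_argv(host, user, password, *command),
        capture_output=True, text=True, check=True,
    )
    return result.stdout


# Hypothetical BMC address and credentials, for illustration only.
BMC = ("192.0.2.10", "admin", "secret")

if __name__ == "__main__":
    print(" ".join(ipmi_argv(*BMC, "chassis", "power", "status")))
    print(" ".join(ipmi_argv(*BMC, "sdr", "list")))
    # To actually query a switch, call run_ipmi(*BMC, "chassis", "power", "status").
```

Swapping the last arguments for chassis power cycle gives the remote reboot that saves a walk to the rack.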

The onlpd output with the system info:

root@localhost:~# onlpd
  System Information: = {
      Product Name: Aurora 820
      Part Number: NBA820-FtB
      Serial Number: B-E1-2019050604
      MAC: 70:b3:d5:cc:f8:c7
      MAC Range: 4
      Manufacturer: Netberg
      Manufacture Date: 15/12/2020 12:00:00
      Vendor: Netberg
      Platform Name: x86_64-netberg_aurora_610-r0
      Label Revision: 1
      Country Code: 886
      Service Tag: 0700036224
      ONIE Version: master-03161956
  }

System ports:

root@localhost:~# onlpd -S
Port  Type            Media   Status  Len    Vendor            Model             S/N
----  --------------  ------  ------  -----  ----------------  ----------------  ----------------
  0  NONE
  1  NONE
  2  NONE
  3  NONE
  4  NONE
  5  NONE
  6  NONE
  7  NONE
  8  NONE
  9  NONE
 10  NONE
 11  NONE
 12  400G-CR8        Copper  L       1m     Optech            OPQSDD-T-01-PE    OP1890KE000002
 13  NONE
 14  NONE
 15  NONE
 16  400G-CR8        Copper  L       1m     Optech            OPQSDD-T-01-PE    OP1890KE000002
 17  NONE
 18  NONE
 19  NONE
 20  NONE
 21  NONE
 22  400G-CR8        Copper  L       1m     Optech            OPQSDD-T-01-PE    OP1890KE000003
 23  NONE
 24  NONE
 25  NONE
 26  400G-CR8        Copper  L       1m     Optech            OPQSDD-T-01-PE    OP1890KE000003
 27  NONE
 28  NONE
 29  NONE
 30  NONE
 31  NONE
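The port table is easy to post-process. Below is a small Python sketch that parses text in the layout shown above and groups populated ports by module serial number. It assumes whitespace-separated columns with no spaces inside vendor or model names, which holds for this listing but is not guaranteed in general. The sample rows are copied from the output above.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple


class Port(NamedTuple):
    port: int
    type: str
    media: str
    length: str
    vendor: str
    model: str
    serial: str


def parse_onlpd_ports(text: str) -> List[Port]:
    """Return populated ports from an 'onlpd -S' style table."""
    ports = []
    for line in text.splitlines():
        fields = line.split()
        # Skip the header, the separator, and empty cages ("NONE" rows have 2 fields).
        if len(fields) != 8 or not fields[0].isdigit():
            continue
        num, ptype, media, _status, length, vendor, model, serial = fields
        ports.append(Port(int(num), ptype, media, length, vendor, model, serial))
    return ports


def group_by_serial(ports: List[Port]) -> Dict[str, List[int]]:
    """Map each module serial number to the ports that report it."""
    groups = defaultdict(list)
    for p in ports:
        groups[p.serial].append(p.port)
    return dict(groups)


SAMPLE = """\
Port  Type            Media   Status  Len    Vendor            Model             S/N
----  --------------  ------  ------  -----  ----------------  ----------------  ----------------
 11  NONE
 12  400G-CR8        Copper  L       1m     Optech            OPQSDD-T-01-PE    OP1890KE000002
 16  400G-CR8        Copper  L       1m     Optech            OPQSDD-T-01-PE    OP1890KE000002
 22  400G-CR8        Copper  L       1m     Optech            OPQSDD-T-01-PE    OP1890KE000003
 26  400G-CR8        Copper  L       1m     Optech            OPQSDD-T-01-PE    OP1890KE000003
"""

print(group_by_serial(parse_onlpd_ports(SAMPLE)))
```

Each serial number shows up on two ports, which is what one would expect if both ends of a copper cable are plugged into the same box as a loop.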

The primary purpose of a switch running ONL is developing apps. We have the OpenNSA package from Broadcom for exactly this purpose: a set of open network switch APIs for programming Broadcom silicon to whatever ends one may need. Switching, Network Packet Brokers (NPB), you name it.

Here is how it boots up:

root@localhost:~/opennsa_release/bcmshell# ./bcm.sh
[ 1040.020318] linux-kernel-bde 0000:06:00.0: enabling device (0100 -> 0102)
root@localhost:~/opennsa_release/bcmshell# ./bcm.user
Broadcom Command Monitor: Copyright (c) 1998-2021 Broadcom
Release: sdk-6.5.19 built 20210209 (Tue Feb  9 21:53:03 2021)
From qiping@s240:/home/qiping/Projects/Build/3.4/bcm_sdk
Platform: X86
OS: Unix (Posix)
DMA pool size: 67108864
BDE dev 0 (PCI), Dev 0xb980, Rev 0x11, Chip BCM56980_B0, Driver BCM56980_B0
SOC unit 0 attached to PCI device BCM56980_B0
WARNING: bcm esw command CoupledMemWrite not alphabetized
rc: unit 0 device BCM56980_B0
Loading M0 Firmware located at linkscan_led_fw.bin
Firmware download successed (0x3c5efa3a).
Loading M0 Firmware located at custom_led.bin
Firmware download successed (0x9dda8a35).
rc: MMU initialized
0:bcmi_xgs5_bfd_init: uKernel BFD application not available
rc: BCM driver initialized
rc: L2 Table shadowing enabled
rc: Port modes initialized
BCM.0> ps
                ena/        speed/ link auto    STP                  lrn  inter   max   cut             loop
          port  link  Lns   duplex scan neg?   state   pause  discrd ops   face frame  thru?    FEC     back
      cd0( 23)  down   8  400G  FD   SW  No   Forward  TX RX   None   FA    CR8  9412    No  RS544-2xN
      cd1( 22)  down   8  400G  FD   SW  No   Forward  TX RX   None   FA    CR8  9412    No  RS544-2xN
      cd2( 41)  down   8  400G  FD   SW  No   Forward  TX RX   None   FA    CR8  9412    No  RS544-2xN
      cd3( 40)  down   8  400G  FD   SW  No   Forward  TX RX   None   FA    CR8  9412    No  RS544-2xN
      cd4(  4)  down   8  400G  FD   SW  No   Forward  TX RX   None   FA    CR8  9412    No  RS544-2xN
      cd5(  3)  down   8  400G  FD   SW  No   Forward  TX RX   None   FA    CR8  9412    No  RS544-2xN
      cd6( 42)  down   8  400G  FD   SW  No   Forward  TX RX   None   FA    CR8  9412    No  RS544-2xN
      cd7( 43)  down   8  400G  FD   SW  No   Forward  TX RX   None   FA    CR8  9412    No  RS544-2xN
      xe0( 38)  down   1   10G  FD   SW  No   Forward  TX RX   None   FA    XFI  9412    No       NONE
      cd8(  1)  down   8  400G  FD   SW  No   Forward  TX RX   None   FA    CR8  9412    No  RS544-2xN
      cd9(  2)  down   8  400G  FD   SW  No   Forward  TX RX   None   FA    CR8  9412    No  RS544-2xN
     cd10( 20)  down   8  400G  FD   SW  No   Forward  TX RX   None   FA    CR8  9412    No  RS544-2xN
     cd11( 21)    up   8  400G  FD   SW  No   Forward  TX RX   None   FA    CR8  9412    No  RS544-2xN
     cd12( 60)  down   8  400G  FD   SW  No   Forward  TX RX   None   FA    CR8  9412    No  RS544-2xN
     cd13( 63)  down   8  400G  FD   SW  No   Forward  TX RX   None   FA    CR8  9412    No  RS544-2xN
     cd14( 62)  down   8  400G  FD   SW  No   Forward  TX RX   None   FA    CR8  9412    No  RS544-2xN
     cd15( 61)  down   8  400G  FD   SW  No   Forward  TX RX   None   FA    CR8  9412    No  RS544-2xN
     cd16( 80)    up   8  400G  FD   SW  No   Forward  TX RX   None   FA    CR8  9412    No  RS544-2xN
     cd17( 81)  down   8  400G  FD   SW  No   Forward  TX RX   None   FA    CR8  9412    No  RS544-2xN
     cd18( 82)  down   8  400G  FD   SW  No   Forward  TX RX   None   FA    CR8  9412    No  RS544-2xN
     cd19( 83)  down   8  400G  FD   SW  No   Forward  TX RX   None   FA    CR8  9412    No  RS544-2xN
     cd20(123)  down   8  400G  FD   SW  No   Forward  TX RX   None   FA    CR8  9412    No  RS544-2xN
     cd21(122)  down   8  400G  FD   SW  No   Forward  TX RX   None   FA    CR8  9412    No  RS544-2xN
     cd22(140)    up   8  400G  FD   SW  No   Forward  TX RX   None   FA    CR8  9412    No  RS544-2xN
     cd23(143)  down   8  400G  FD   SW  No   Forward  TX RX   None   FA    CR8  9412    No  RS544-2xN
      xe1(118)  down   1   10G  FD   SW  No   Forward  TX RX   None   FA    XFI  9412    No       NONE
     cd24(142)  down   8  400G  FD   SW  No   Forward  TX RX   None   FA    CR8  9412    No  RS544-2xN
     cd25(141)  down   8  400G  FD   SW  No   Forward  TX RX   None   FA    CR8  9412    No  RS544-2xN
     cd26(100)    up   8  400G  FD   SW  No   Forward  TX RX   None   FA    CR8  9412    No  RS544-2xN
     cd27(101)  down   8  400G  FD   SW  No   Forward  TX RX   None   FA    CR8  9412    No  RS544-2xN
     cd28(102)  down   8  400G  FD   SW  No   Forward  TX RX   None   FA    CR8  9412    No  RS544-2xN
     cd29(103)  down   8  400G  FD   SW  No   Forward  TX RX   None   FA    CR8  9412    No  RS544-2xN
     cd30(120)  down   8  400G  FD   SW  No   Forward  TX RX   None   FA    CR8  9412    No  RS544-2xN
     cd31(121)  down   8  400G  FD   SW  No   Forward  TX RX   None   FA    CR8  9412    No  RS544-2xN
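When the port list gets long, it is quicker to let a script pick out the links that are up. The sketch below parses lines in the format of the ps output above. The regular expression is our own guess at the column layout based on this one listing, so treat it as an assumption rather than a documented format. The sample rows are copied from the output.

```python
import re
from typing import List, NamedTuple


class PortStatus(NamedTuple):
    name: str      # e.g. "cd11"
    logical: int   # number shown in parentheses
    link: str      # "up" or "down"
    lanes: int
    speed: str     # e.g. "400G"


# name(logical)  link  lanes  speed ... ; header lines do not match and are skipped.
PORT_RE = re.compile(r"^\s*([a-z]+\d+)\(\s*(\d+)\)\s+(\S+)\s+(\d+)\s+(\S+)")


def parse_ps(text: str) -> List[PortStatus]:
    """Extract port name, logical port, link state, lane count, and speed."""
    rows = []
    for line in text.splitlines():
        m = PORT_RE.match(line)
        if m:
            name, logical, link, lanes, speed = m.groups()
            rows.append(PortStatus(name, int(logical), link, int(lanes), speed))
    return rows


def links_up(text: str) -> List[str]:
    """Names of the ports whose link column reads 'up'."""
    return [p.name for p in parse_ps(text) if p.link == "up"]


SAMPLE = """\
          port  link  Lns   duplex scan neg?   state   pause  discrd ops   face frame  thru?    FEC     back
      cd0( 23)  down   8  400G  FD   SW  No   Forward  TX RX   None   FA    CR8  9412    No  RS544-2xN
      xe0( 38)  down   1   10G  FD   SW  No   Forward  TX RX   None   FA    XFI  9412    No       NONE
     cd11( 21)    up   8  400G  FD   SW  No   Forward  TX RX   None   FA    CR8  9412    No  RS544-2xN
     cd16( 80)    up   8  400G  FD   SW  No   Forward  TX RX   None   FA    CR8  9412    No  RS544-2xN
"""

print(links_up(SAMPLE))
```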

Unfortunately, we didn’t get 400G optics this time. Otherwise, we would have shown how Open Optical Monitoring (OOM) and its optoe driver handle QSFP-DD modules. Many proprietary drivers can read only the first page of the EEPROM from a module, and some drivers can reach up to four pages. Only the optoe driver, instantiated as i2c devices, can read the whole content.

(c) Finisar
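To illustrate what reading the whole content means in practice, here is a sketch of how a paged module address could be translated into an offset in the flat EEPROM file that optoe exposes through sysfs. It assumes the layout we understand optoe to use for QSFP-style modules: the lower 128 bytes first, then each 128-byte upper page in order (page 0 at offset 128, page 1 at 256, and so on). The file path in the comment is hypothetical, and we could not try this on real optics.

```python
PAGE_SIZE = 128  # size of the lower memory and of every upper page, in bytes


def optoe_offset(page: int, byte: int) -> int:
    """Translate (page, byte address 0..255) into a flat file offset.

    Assumed layout: bytes 0..127 are the unpaged lower memory; upper page N
    (byte addresses 128..255) starts at offset (N + 1) * 128.
    """
    if page < 0 or not 0 <= byte <= 255:
        raise ValueError("page must be >= 0 and byte must be in 0..255")
    if byte < PAGE_SIZE:
        return byte  # lower memory is the same regardless of the selected page
    return (page + 1) * PAGE_SIZE + (byte - PAGE_SIZE)


def read_eeprom(path: str, page: int, byte: int, length: int) -> bytes:
    """Read 'length' bytes starting at (page, byte) from a flat EEPROM file."""
    with open(path, "rb") as eeprom:
        eeprom.seek(optoe_offset(page, byte))
        return eeprom.read(length)


# Hypothetical usage; the bus number and device address depend on the platform:
# data = read_eeprom("/sys/bus/i2c/devices/10-0050/eeprom", page=1, byte=128, length=16)
```

With a flat file like this, user-space tools need no page-select logic of their own; a plain seek and read reaches any page.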

Final thoughts

400G gear is only starting to gain traction, so software options are limited. There is ONL for application development, ICOS as it’s a Broadcom-based device, and the omnipresent SONiC with many features in development. Open monitoring standards are taking hold as well: optoe driver support is coming to all new models. With a more server-style switch design, the whole infrastructure management becomes simpler and more unified.

Before PCIe Gen.5 gets into mass-production servers, 400G switches will find a place as super-spines in data centers and telecom applications. The speed boost over 100G comes with increased power consumption in every aspect and a higher cost per unit. These are setbacks for sure, but there are advantages too. More bandwidth per switch flattens the network, as we need fewer switches. With that comes lower latency, as fewer hops stand between endpoints. While bandwidth is quadrupled compared with 3.2Tbps switches, unit price and power consumption are not. That leaves more of the power budget for storage and compute nodes. And don’t forget the fourfold drop in cable count!

We’ll see more 400G gear rolling out quite soon and plan to review it too.
