We have already covered white-box switch architecture and how networking software works in some detail, with pieces on the basics, ICOS, and OpenSwitch. All of it was squares, arrows, names, and other theoretical cliches, completely detached from the material base.
This one will be a more physical piece, looking inside one of the most advanced products on the market – a 12.8Tbps box with 32x400G QSFP-DD ports. As of now, it’s cutting-edge hardware that hyperscalers are deploying and that has only started to attract tentative interest outside these huge players.
Today we are covering Aurora 820 by Netberg with the Broadcom Tomahawk3 BCM56980 at the core.
The StrataXGS® Tomahawk® 3 doesn’t need much introduction, as the silicon was announced two years ago. However, unlike in the PC and server worlds, networking products have quite a distance to go before reaching end users. Transceivers, cables, even NICs all need to reach maturity and a reasonable cost before the mass market goes for them.
A single 100G port demands 12.5GB/s bandwidth from the host system. A dual-port card wants double that amount. The most recent PCIe revision in production, the 4.0, falls slightly short of 2GB/s per lane – 1.97GB/s at 16GT/s.
What happens if we go to 400G Ethernet? Even a single-port card would eat 50GB/s, while an x16 PCIe 4.0 slot delivers only 31.52GB/s.
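The arithmetic above is easy to reproduce. Here is a minimal Python sketch, assuming the standard 128b/130b line encoding of PCIe 3.0 and later and ignoring packet-level protocol overhead (so real-world numbers are somewhat lower). The function names are ours, purely for illustration.

```python
import math

PCIE4_GT_PER_S = 16.0   # raw transfer rate per PCIe 4.0 lane, GT/s
ENCODING = 128 / 130    # 128b/130b line encoding efficiency


def pcie_lane_gbytes(gt_per_s: float = PCIE4_GT_PER_S) -> float:
    """Usable bandwidth of one lane in GB/s, one direction."""
    return gt_per_s * ENCODING / 8


def ethernet_gbytes(gbit_per_s: float) -> float:
    """Line rate of an Ethernet port in GB/s."""
    return gbit_per_s / 8


def lanes_needed(gbit_per_s: float) -> int:
    """Smallest number of PCIe 4.0 lanes that covers the port's line rate."""
    return math.ceil(ethernet_gbytes(gbit_per_s) / pcie_lane_gbytes())


for speed in (100, 400):
    print(f"{speed}G needs {ethernet_gbytes(speed):.1f} GB/s, "
          f"i.e. at least {lanes_needed(speed)} PCIe 4.0 lanes")
print(f"x16 slot delivers {16 * pcie_lane_gbytes():.2f} GB/s")
```

Without rounding, the x16 figure comes out at about 31.51GB/s; the 31.52GB/s quoted above results from rounding the per-lane value to 1.97GB/s first. Either way, a single 400G port is out of reach for one x16 PCIe 4.0 slot.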
So at this moment, 400G Ethernet makes sense only at aggregation/spine layers of networks. One can easily imagine one of the countless Leaf-Spine Fabric depictions.
Enough with introductions. Let’s have a look at the box itself.
External features
The 1U front panel houses 32 QSFP-DD ports and management interfaces. QSFP-DD stands for Quad Small Form Factor Pluggable Double Density, as it doubles the number of electrical lanes over the previous QSFP28 modules, from four to eight. It’s also backward compatible, allowing the use of existing QSFP modules for cost savings and flexibility.
On the right side, we see two management RJ45 ports (who said daisy-chaining?), one RJ45 console, one ToD port, and one USB Type-A port. System status LED and reset button are right below the USB port.
ToD stands for Time-of-Day, a port used for time synchronization using a 1PPS signal. A handy feature for Synchronous Ethernet (SyncE) applications.
Moving to the back, we can see two power modules and six fans – latches, handles, and more latches next to handles.
As usual in this class of gear, all fans and power supply modules are hot-swappable. Fan cages have LED indicators for quick diagnosis. The AC inlet has a notch in it, so it takes a C15-type connector. That’s intentional, as the 1.3kW Platinum module demands special thick cables for safe and stable power delivery.
Insides
Now we come to the interesting part. The overall layout is pretty standard – the switching ASIC next to QSFP-DD ports followed by the CPU board.
The TH3 heatsink stands out from all 100G-era products. It’s MASSIVE, two-thirds of the 19″ chassis width.
A 400G particularity worth attention is the heatsinks over the QSFP-DD cages. The QSFP28 power limit is only 3.5W, so modules are OK with simple cages. QSFP-DD has two power modes, Low Power Mode and High Power Mode, and eight power classes. By the standard, the longest-reach modules may go over 14W!
Aurora 820 supports 400G QSFP-DD LR8, SR8, SR4.2 BiDi, FR8, and DR4 with up to 12W per module (Power Class 6).
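To put the power classes into perspective, here is a small Python sketch of a per-cage and per-box optics budget. The class-to-watt table is an assumption based on our reading of the QSFP-DD MSA hardware specification and should be checked against the current revision; the helper names are made up for illustration.

```python
# Maximum module power per QSFP-DD power class, in watts (assumed values,
# verify against the QSFP-DD MSA hardware specification).
# Class 8 is left out: it has no fixed ceiling, the module advertises its own.
QSFP_DD_POWER_CLASS_W = {
    1: 1.5, 2: 3.5, 3: 7.0, 4: 8.0, 5: 10.0, 6: 12.0, 7: 14.0,
}


def max_power_w(power_class: int) -> float:
    """Return the power ceiling for classes 1-7."""
    if power_class not in QSFP_DD_POWER_CLASS_W:
        raise ValueError(f"no fixed ceiling known for class {power_class}")
    return QSFP_DD_POWER_CLASS_W[power_class]


def fits(module_class: int, cage_class: int = 6) -> bool:
    """True if a module of module_class stays within the cage's class."""
    return max_power_w(module_class) <= max_power_w(cage_class)


def optics_budget_w(ports: int, power_class: int) -> float:
    """Worst-case power drawn by optics if every port is populated."""
    return ports * max_power_w(power_class)


print(optics_budget_w(32, 6))  # all 32 cages filled with 12W modules
print(fits(5), fits(7))        # 10W module vs. 14W module in a 12W cage
```

With all 32 cages populated by 12W modules, the optics alone can draw 384W, which goes a long way toward explaining the cage heatsinks and the 1.3kW power supplies.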
Major visible parts inside are: the Tomahawk 3 ASIC, Intel Xeon D-1527, Aspeed AST2520 BMC, and an FPGA by Lattice.
Next to the QSFP-DD cages and the ASIC is a MachXO2-4000 FPGA that controls the LEDs during boot. Once the system OS is up and running, it takes over control via the i2c bus.
The Aspeed AST2520 BMC brings more of a server-like control to the network equipment. Nobody likes walking to an unresponsive switch for a quick reboot!
The CPU board has two memory slots for DDR4 ECC SO-DIMMs, up to 32GB total, one M.2 SATA connector, even one PCIe Gen.3 x1 slot for those in need. Our SKU has 8GB memory, 128GB of SSD storage, and a Xeon D-1527 quad-core CPU onboard. As usual, it’s configurable on demand.
Under the CPU board, there is one more FPGA hiding. Again, it’s MachXO2-4000 by Lattice.
The fan board is a passive power and signal delivering PCB. All the control comes from the mainboard.
Another particularity we wanted to point out is how thick the main PCB is: almost double the thickness of the CPU board! That’s what it takes to deliver 12.8Tbps without losing signal integrity. Desktop and even server boards are playthings next to this one.
Overall, a modern switch resembles a server with an accelerator. Packet processing logic that once ran on the CPU is offloaded to the ASIC to run at wire speed. Moreover, programmable chipsets resemble FPGAs in many ways: it takes quite a while to compile the program one wants to run.
Booting the switch up
Anyone who has worked with a server in the last ten-plus years will recognize the AMI Aptio BIOS setup utility. Just like on regular hardware, there are many platform options, security settings, boot features, etc. One tab shows the BMC settings with access to event logs, the BMC network port configuration, and resilient boot options.
In Linux (ONL in our case), the all-time favorite ipmitool allows remote control over the system power and sensors.
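For scripting, ipmitool is easy to wrap. Below is a minimal Python sketch that builds the command line either for in-band use on the switch itself or for out-of-band access to the BMC over the network. The host address and user name in the usage comments are placeholders, not values from our unit; the `-E` option tells ipmitool to take the password from the `IPMI_PASSWORD` environment variable.

```python
import subprocess
from typing import List, Optional


def ipmitool_argv(command: str,
                  host: Optional[str] = None,
                  user: Optional[str] = None) -> List[str]:
    """Build an ipmitool command line.

    With host=None the command runs in-band against the local BMC.
    With a host, it goes over the network via the lanplus interface.
    """
    argv = ["ipmitool"]
    if host is not None:
        argv += ["-I", "lanplus", "-H", host]
        if user is not None:
            argv += ["-U", user]
        argv.append("-E")  # password comes from the IPMI_PASSWORD env variable
    argv += command.split()
    return argv


def run_ipmitool(command: str,
                 host: Optional[str] = None,
                 user: Optional[str] = None) -> str:
    """Run ipmitool and return its standard output as text."""
    result = subprocess.run(ipmitool_argv(command, host, user),
                            check=True, capture_output=True, text=True)
    return result.stdout


# Usage (placeholders, requires ipmitool and a reachable BMC):
#   run_ipmitool("sdr list")                                   # local sensors
#   run_ipmitool("chassis power status", "192.0.2.10", "admin")
#   run_ipmitool("chassis power cycle", "192.0.2.10", "admin")
```

Keeping the argument builder separate from the runner makes it possible to check the command line before anything is sent to a production BMC.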
The onlpd output with the system info:
root@localhost:~# onlpd
System Information: = {
Product Name: Aurora 820
Part Number: NBA820-FtB
Serial Number: B-E1-2019050604
MAC: 70:b3:d5:cc:f8:c7
MAC Range: 4
Manufacturer: Netberg
Manufacture Date: 15/12/2020 12:00:00
Vendor: Netberg
Platform Name: x86_64-netberg_aurora_610-r0
Label Revision: 1
Country Code: 886
Service Tag: 0700036224
ONIE Version: master-03161956
}
System ports:
root@localhost:~# onlpd -S
Port Type Media Status Len Vendor Model S/N
---- -------------- ------ ------ ----- ---------------- ---------------- ----------------
0 NONE
1 NONE
2 NONE
3 NONE
4 NONE
5 NONE
6 NONE
7 NONE
8 NONE
9 NONE
10 NONE
11 NONE
12 400G-CR8 Copper L 1m Optech OPQSDD-T-01-PE OP1890KE000002
13 NONE
14 NONE
15 NONE
16 400G-CR8 Copper L 1m Optech OPQSDD-T-01-PE OP1890KE000002
17 NONE
18 NONE
19 NONE
20 NONE
21 NONE
22 400G-CR8 Copper L 1m Optech OPQSDD-T-01-PE OP1890KE000003
23 NONE
24 NONE
25 NONE
26 400G-CR8 Copper L 1m Optech OPQSDD-T-01-PE OP1890KE000003
27 NONE
28 NONE
29 NONE
30 NONE
31 NONE
The primary purpose of a switch running ONL is developing apps. We have the OpenNSA package from Broadcom for exactly this purpose: a set of open network switch APIs for programming Broadcom silicon to whatever ends one may need. Switching, Network Packet Brokers (NPB), you name it.
Here is how it boots up:
root@localhost:~/opennsa_release/bcmshell# ./bcm.sh
[ 1040.020318] linux-kernel-bde 0000:06:00.0: enabling device (0100 -> 0102)
root@localhost:~/opennsa_release/bcmshell# ./bcm.user
Broadcom Command Monitor: Copyright (c) 1998-2021 Broadcom
Release: sdk-6.5.19 built 20210209 (Tue Feb 9 21:53:03 2021)
From qiping@s240:/home/qiping/Projects/Build/3.4/bcm_sdk
Platform: X86
OS: Unix (Posix)
DMA pool size: 67108864
BDE dev 0 (PCI), Dev 0xb980, Rev 0x11, Chip BCM56980_B0, Driver BCM56980_B0
SOC unit 0 attached to PCI device BCM56980_B0
WARNING: bcm esw command CoupledMemWrite not alphabetized
rc: unit 0 device BCM56980_B0
Loading M0 Firmware located at linkscan_led_fw.bin
Firmware download successed (0x3c5efa3a).
Loading M0 Firmware located at custom_led.bin
Firmware download successed (0x9dda8a35).
rc: MMU initialized
0:bcmi_xgs5_bfd_init: uKernel BFD application not available
rc: BCM driver initialized
rc: L2 Table shadowing enabled
rc: Port modes initialized
BCM.0> ps
ena/ speed/ link auto STP lrn inter max cut loop
port link Lns duplex scan neg? state pause discrd ops face frame thru? FEC back
cd0( 23) down 8 400G FD SW No Forward TX RX None FA CR8 9412 No RS544-2xN
cd1( 22) down 8 400G FD SW No Forward TX RX None FA CR8 9412 No RS544-2xN
cd2( 41) down 8 400G FD SW No Forward TX RX None FA CR8 9412 No RS544-2xN
cd3( 40) down 8 400G FD SW No Forward TX RX None FA CR8 9412 No RS544-2xN
cd4( 4) down 8 400G FD SW No Forward TX RX None FA CR8 9412 No RS544-2xN
cd5( 3) down 8 400G FD SW No Forward TX RX None FA CR8 9412 No RS544-2xN
cd6( 42) down 8 400G FD SW No Forward TX RX None FA CR8 9412 No RS544-2xN
cd7( 43) down 8 400G FD SW No Forward TX RX None FA CR8 9412 No RS544-2xN
xe0( 38) down 1 10G FD SW No Forward TX RX None FA XFI 9412 No NONE
cd8( 1) down 8 400G FD SW No Forward TX RX None FA CR8 9412 No RS544-2xN
cd9( 2) down 8 400G FD SW No Forward TX RX None FA CR8 9412 No RS544-2xN
cd10( 20) down 8 400G FD SW No Forward TX RX None FA CR8 9412 No RS544-2xN
cd11( 21) up 8 400G FD SW No Forward TX RX None FA CR8 9412 No RS544-2xN
cd12( 60) down 8 400G FD SW No Forward TX RX None FA CR8 9412 No RS544-2xN
cd13( 63) down 8 400G FD SW No Forward TX RX None FA CR8 9412 No RS544-2xN
cd14( 62) down 8 400G FD SW No Forward TX RX None FA CR8 9412 No RS544-2xN
cd15( 61) down 8 400G FD SW No Forward TX RX None FA CR8 9412 No RS544-2xN
cd16( 80) up 8 400G FD SW No Forward TX RX None FA CR8 9412 No RS544-2xN
cd17( 81) down 8 400G FD SW No Forward TX RX None FA CR8 9412 No RS544-2xN
cd18( 82) down 8 400G FD SW No Forward TX RX None FA CR8 9412 No RS544-2xN
cd19( 83) down 8 400G FD SW No Forward TX RX None FA CR8 9412 No RS544-2xN
cd20(123) down 8 400G FD SW No Forward TX RX None FA CR8 9412 No RS544-2xN
cd21(122) down 8 400G FD SW No Forward TX RX None FA CR8 9412 No RS544-2xN
cd22(140) up 8 400G FD SW No Forward TX RX None FA CR8 9412 No RS544-2xN
cd23(143) down 8 400G FD SW No Forward TX RX None FA CR8 9412 No RS544-2xN
xe1(118) down 1 10G FD SW No Forward TX RX None FA XFI 9412 No NONE
cd24(142) down 8 400G FD SW No Forward TX RX None FA CR8 9412 No RS544-2xN
cd25(141) down 8 400G FD SW No Forward TX RX None FA CR8 9412 No RS544-2xN
cd26(100) up 8 400G FD SW No Forward TX RX None FA CR8 9412 No RS544-2xN
cd27(101) down 8 400G FD SW No Forward TX RX None FA CR8 9412 No RS544-2xN
cd28(102) down 8 400G FD SW No Forward TX RX None FA CR8 9412 No RS544-2xN
cd29(103) down 8 400G FD SW No Forward TX RX None FA CR8 9412 No RS544-2xN
cd30(120) down 8 400G FD SW No Forward TX RX None FA CR8 9412 No RS544-2xN
cd31(121) down 8 400G FD SW No Forward TX RX None FA CR8 9412 No RS544-2xN
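A listing like this gets tedious to scan by eye. Here is a rough Python sketch that pulls the port name, the number in parentheses, the link state, the lane count, and the speed out of captured ps output. The regular expression is fitted to the lines shown above and is our assumption about the layout, not a documented format; other SDK releases may print the columns differently.

```python
import re
from typing import List, NamedTuple

# Matches lines such as "cd11( 21) up 8 400G FD SW ..." and skips headers.
PS_LINE = re.compile(
    r"^\s*(?P<name>[a-z]+\d+)\(\s*(?P<num>\d+)\)\s+"
    r"(?P<link>\S+)\s+(?P<lanes>\d+)\s+(?P<speed>\S+)"
)


class PortStatus(NamedTuple):
    name: str
    port_number: int
    link: str
    lanes: int
    speed: str


def parse_ps(text: str) -> List[PortStatus]:
    """Parse captured 'ps' output into a list of PortStatus records."""
    ports = []
    for line in text.splitlines():
        m = PS_LINE.match(line)
        if m:
            ports.append(PortStatus(m["name"], int(m["num"]), m["link"],
                                    int(m["lanes"]), m["speed"]))
    return ports


SAMPLE = """\
port link Lns duplex scan neg? state pause discrd ops face frame thru? FEC back
cd10( 20) down 8 400G FD SW No Forward TX RX None FA CR8 9412 No RS544-2xN
cd11( 21) up 8 400G FD SW No Forward TX RX None FA CR8 9412 No RS544-2xN
xe0( 38) down 1 10G FD SW No Forward TX RX None FA XFI 9412 No NONE
cd22(140) up 8 400G FD SW No Forward TX RX None FA CR8 9412 No RS544-2xN
"""

up_ports = [p.name for p in parse_ps(SAMPLE) if p.link == "up"]
print(up_ports)
```

Fed with the four sample lines, the sketch reports cd11 and cd22 as up, matching what the listing shows for those ports.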
Unfortunately, we didn’t get 400G optics this time. Otherwise, we would show how Open Optical Monitoring (OOM) and its optoe driver handle QSFP-DD modules. Many proprietary drivers can read only the first page of a module’s EEPROM, and some can reach up to four pages. Only the optoe driver, instantiated as i2c devices, can read the whole content.
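Even without optics at hand, the access pattern can be sketched. To our understanding, optoe exposes each QSFP-type module as one flat `eeprom` file in sysfs: the lower 128 bytes come first, followed by upper page 0, upper page 1, and so on. The Python sketch below maps a (page, byte) module address to an offset in that file. The mapping and the sysfs path in the usage comment are assumptions to verify against the optoe documentation for your platform, and the bus number is hypothetical.

```python
from pathlib import Path

PAGE_SIZE = 128  # size of the lower memory and of each upper page


def flat_offset(page: int, byte: int) -> int:
    """Map a module (page, byte) address to an offset in the flat eeprom file.

    Bytes 0..127 are lower memory (the page is ignored);
    bytes 128..255 belong to the selected upper page.
    """
    if not 0 <= byte <= 255:
        raise ValueError("byte must be in 0..255")
    if byte < PAGE_SIZE:
        return byte
    if page < 0:
        raise ValueError("page must be non-negative")
    return (page + 1) * PAGE_SIZE + (byte - PAGE_SIZE)


def read_module_bytes(eeprom: Path, page: int, byte: int, length: int) -> bytes:
    """Read 'length' bytes starting at (page, byte) from the eeprom file."""
    half_end = PAGE_SIZE if byte < PAGE_SIZE else 2 * PAGE_SIZE
    if length < 0 or byte + length > half_end:
        raise ValueError("read must stay within one 128-byte region")
    with open(eeprom, "rb") as f:
        f.seek(flat_offset(page, byte))
        return f.read(length)


# Usage (hypothetical bus number 18; adjust for the actual platform):
#   eeprom = Path("/sys/bus/i2c/devices/18-0050/eeprom")
#   identifier = read_module_bytes(eeprom, 0, 0, 1)
```

The driver handles page selection behind the scenes, so user space only seeks and reads, with no page-select writes of its own.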
Final thoughts
400G gear is only starting to gain traction, so software options are limited. There is ONL for application development, ICOS since it’s a Broadcom-based device, and the omnipresent SONiC with many features in development. Open monitoring standards are taking hold as well – optoe driver support is coming to all new models. With a more server-style switch design, infrastructure management as a whole becomes simpler and more unified.
Until PCIe Gen.5 gets into mass-production servers, 400G switches will find a place as super-spines in data centers and in telecom applications. The speed boost over 100G comes with increased power consumption in every respect and a higher cost per unit. These are setbacks for sure, but there are advantages too. More bandwidth per switch flattens the network, as we need fewer switches. With that comes lower latency, as fewer hops stand between endpoints. While bandwidth has quadrupled over 3.2Tbps switches, unit price and power consumption have not. It means more power for storage and compute nodes. And don’t forget the fourfold drop in cable count!
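The cabling claim is simple division, shown here as a tiny Python sketch. It counts the links needed to carry a given aggregate bandwidth at 100G and at 400G per link; the 12.8Tbps figure is just the capacity of one box, used as an example.

```python
import math


def links_needed(total_gbps: float, link_gbps: float) -> int:
    """Number of links of a given speed needed to carry total_gbps."""
    return math.ceil(total_gbps / link_gbps)


TOTAL_GBPS = 12_800  # 12.8Tbps, the capacity of one Tomahawk 3 box

cables_100g = links_needed(TOTAL_GBPS, 100)
cables_400g = links_needed(TOTAL_GBPS, 400)
print(cables_100g, cables_400g, cables_100g // cables_400g)
```

For the same 12.8Tbps, that is 128 cables at 100G against 32 at 400G.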
We’ll see more 400G gear rolling out quite soon and plan to review it too.