Dropbox needs its underlying network infrastructure to be reliable, high-performing, cost-effective, and truly scalable. In previous posts we described how the edge network was designed to improve user performance, and how the supporting multi-terabit backbone network spans continents to interconnect edge PoPs and multiple data centers. In this post we describe how we evolved the Dropbox data center network from the legacy chassis based four-post architecture to a scalable multi-tier, quad-plane fabric. Also, we successfully deployed our first fabric at our newest data center in California earlier this year!

Figure 1: Location of Dropbox global points of presence (PoPs)

We currently have global network presence and multiple data centers in California, Texas and Virginia. From a redundancy perspective, the North American continent is carved into regions—East, Central, and West—thereby having a distributed data center approach and improving resiliency in events of failure.

Legacy four-post architecture

Figure 2: Four-post cluster architecture (CC = cluster connector, CR= cluster router)

A cluster is comprised of ‘n’ rack switches and four supporting CR devices in a tier, and each CR device is a chassis based system. The cluster router (CR) tier handles inter-rack traffic within the cluster and traffic entering or leaving the cluster. Typically, a data center is comprised of more than one cluster, and interconnection between clusters is done via the cluster connector (CC) tier, itself comprised of four CC devices (see figure #2). The cluster connector (CC) tier handles inter-cluster traffic within the datacenter and is the entry/exit point for all traffic at a datacenter. We used 100G connectivity across all the links seen in the above figure. From a protocol perspective, we run eBGP between the cluster connector (CC) and cluster router (CR) tiers using private autonomous system numbers (ASN) numbers. In contrast, between a rack and cluster router (CR) we use iBGP and route-reflectors to propagate routing information within the cluster. Additionally, devices within the CC and CR tiers are fully meshed using ISIS as the IGP and peering via iBGP using loopback interfaces.

Scaling limitations and growth

As we kept adding megawatts of data center space to support growth, we quickly realized that the older four-post architecture wouldn’t be able to scale to meet our future needs. The port density on the chassis based units directly translated to the number of racks a cluster could support, making network infrastructure a limiting factor. Adding new clusters to an existing data center or upgrading cluster-uplink capacity meant having unused ports on the upstream aggregating CC-tier. Also, troubleshooting a traffic-related issue within a system using multiple line cards and supporting fabric modules could quickly become complex and time consuming.

Keeping in mind the challenges with the older design, we wanted to ensure our next generation design solved them while remaining future-proof. Below are design considerations we focussed upon while engineering the fabric: Achieving a rack count greater than that offered by the legacy design

Ability to scale horizontally while adding capacity on demand

Non-blocking fabric with increased redundancy

Chassis-free design

Utilizing an identical ASIC at every tier

All fabric links/interconnects utilize 100G connectivity

Multiple ECMP options at every layer Quad-plane, 3-tier fabric: 16 racks per pod

16 pods per fabric

64 spine switches

4 pod switches per pod

Figure 3: 256 rack fabric

As seen above in figure #3, the spine connectors (SC) and spine switches (SSW) are the building blocks of the fabric, and the pod switches (PSW) can be added incrementally on-demand, allowing the design to be elegantly scaled horizontally. Each device above, including the rack switch, is using an identical ASIC that offers 3.2 Tbps of switching capacity (32 ports of 100G), with super-low latency, and has a single rack-unit footprint, while leveraging merchant silicon. Our latest deployment includes 16 pods, however this can be further scaled to support 31 pods thereby accommodating close to 500 racks. The number of pods in the fabric is dictated by the number of ports reserved on each spine switch (SSW) device. Based on our traffic patterns, we initially reserved fewer ports for upstream connectivity responsible for traffic entering and leaving the fabric. While accommodating for future growth, we can always increase the upstream capacity and appropriately scale the number of pods in the fabric.

Cabling and optics

As seen in figure #4 below, the new design resulted in a few thousand interconnections across the building blocks of the fabric itself, not accounting for the rack uplinks! Based on our physical layout these distances were within the operating limits of a QSFP-100G-SR4 optic. Hence, we deployed QSPF-100G-SR4 in a combination with MPO connectors over multimode fiber (MMF). The longest cable in the physical layout was roughly 140 feet, which spanned from the main distribution frame (MDF) to our farthest cabinet location (again within the operating limits of a QSFP-100G-SR4 optic).



Figure 4: Zooming in on the fabric with few thousand interconnects

Physical design and layout

Figure 5: MDF row representation

In addition to logical network failure domains, we decided to separate the planes across the two main distribution frame (MDF) rows, in our case MDF ‘A’ and MDF ‘B’ to add physical diversity. Simultaneously, this allowed us to increase our failure domain when dealing with electrical anomalies as each MDF row is supported by redundant electrical power distribution panels. Furthermore, each MDF row has diverse network pathways to each cabinet position.

Figure 6: Physical representation of a small section of the fabric

Why quad-plane?

Figure 7: Quad-Plane design

Our idea of a “plane" reflects an independent failure domain. Based on our needs, each rack switch connecting to the fabric is offered 4X100G of uplink capacity, hence the term ‘quad-plane’. At any given time, we should be able to lose an entire plane’s worth of capacity and still remain healthy, this would mean losing one spine connector (SC), 16 spine switches (SSW) and 16 pod switches (PSW) identically colored.

Non-blocking and oversubscription

Non-blocking means that the number of inputs equals the number of outputs. It wasn’t a pre-requisite, but we preferred to utilize the fabric to its fullest potential starting on day one, and hence we constructed a non-blocking network with a 1:1 oversubscription ratio between any two racks within the fabric. This also helped us steer away from any potential speed mismatch related issues and simplified cabling by not needing to split a 100G port.

Colors and routing within the fabric

Every device and interconnect within the fabric is uniquely colored. Each color signifies a unique plane. A plane within the fabric is responsible for 25% of overall rack throughput. This further explains why no two different color devices are connected. Once a packet traverses a uniquely colored link or device, it will only further traverse identically colored links or devices. This applies to all traffic whether intra-fabric i.e rack-to-rack traffic or traffic exiting the fabric destined to networks external to the fabric, this is further illustrated in figure #8 and figure #9 respectively.

Intra-fabric packet traversal:

Figure 8: Intra-fabric, inter-pod packet traversal

Networks external to fabric:

Figure 9: Traffic destined to networks external to the fabric

We built the fabric entirely on BGP, specifically running eBGP between any two devices. As compared to using iBGP and route-reflectors in the legacy design, eBGP helped us keep things simple and offers deeper visibility into a routing-prefix while leveraging AS-Path information. As in our older designs, the fabric is also a pure Layer 3 network down to the rack switch, offering support for IPv4 and IPv6. The fabric relies on equal-cost multi-path routing (ECMP) & per-flow hashing resulting in equal distribution of flows equally across available links. Failure domain analysis Pod switch (PSW):

Figure10: A POD Switch failure

Traffic profile: Handles both east↔west and north↔south flows

If a PSW fails, all the rack switches in the pod lose upstream capacity by 25%, basically a failure domain of 75%

This provides better fault isolation and reduces the impact to only a single pod, compared to the existing four-post design where all racks would lose upstream capacity by 25% Spine switch (SSW):

Figure 11: A Spine Switch failure

Traffic profile: Handles both east↔west and north↔south flows

A spine switch (SSW) connects to 16 pod switches (PSW) and four spine connectors (SC). Losing a node out of 16 devices in a plane results in a failure domain of 93.75% which is really low as compared to a node failure in the four-post design wherein all racks in the cluster would lose upstream capacity by 25% Spine connector (SC):

Figure 12: A Spine Connector Failure

Traffic profile: Handles only north↔south flows

A spine connector tier is comprised of four devices. Failure of a single device results in 25% loss of north↔south capacity resulting in a failure domain of 75%

However, this failure only impacts flows entering/leaving the fabric. East↔west flows remain un-impacted!

Future scaling