When it comes to distributing load across a set of physical uplinks, which NIC teaming method reigns supreme?

Definitions

First, let me define the terms. Not being a networking guy, this was initially confusing to me because they all seem so similar. My biggest challenge when having these conversations is when the other party uses terms to mean something other than what I think they mean.

LAG: Link Aggregation Group. This is a generic term used to refer to any sort of bonding of links to achieve greater throughput potential. This would include both static EtherChannel and Link Aggregation Control Protocol (LACP).

EtherChannel: The Cisco Systems Inc. proprietary link aggregation scheme, which accomplishes more or less the same goal as the IEEE standard 802.1AX-2014. It groups up to eight physical links into a single logical link. This provides increased throughput potential while avoiding loops. This aggregation can be done statically by configuring both sides of the links in play, or it can be set up dynamically by either LACP or Port Aggregation Protocol (PAgP), which is the proprietary method of Cisco to accomplish the same thing as LACP.

LACP: Link Aggregation Control Protocol. This is also defined in the 802.1AX standard, and provides a method for automating LAG configurations. LACP-capable devices discover each other by sending LACP packets to the Slow_Protocols_Multicast address 01-80-c2-00-00-02. They then negotiate the forming (or not forming, perhaps) of the LAG. Dynamic configuration is often desirable because it helps avoid configuration issues.

LBT: Load-Based Teaming. In this context, LBT refers specifically to the VMware Inc. implementation of the Load-Based Teaming load-balancing policy, available only on a vSphere Distributed Switch (VDS); virtual standard switches (vSS) get no love. This policy is also known as “Route based on physical NIC load.” It’s important to note that no bonding of physical uplinks is done as far as the upstream switch is concerned.

The Confusion

With that out of the way, it’s time to dig into where folks get confused. There’s an assumption (partially correct) that an LAG balances utilization across all links in the group. If a vSwitch has four physical uplinks in a Port Channel (the single logical entity created by the EtherChannel protocol) to the upstream switch, the understanding is that traffic is evenly balanced between all uplinks. In my experience, most people have the same understanding of LBT: If four uplinks exist on the vSwitch, traffic will be evenly distributed across all four links.

Actually, both of these understandings are false. Neither load-balancing method “evenly distributes” traffic to achieve a uniform amount of utilization across all links. That’s not necessarily a bad thing; it’s just important to understand what’s actually happening, as it does impact design decisions. I’ll take a look at both.

LAG Does Not = Load Balancing

I’m saying LAG because it’s trendy, but I’m going to be specifically discussing EtherChannel. As many have mentioned, “load balancing” is an inaccurate phrase to describe what a LAG does. Load is not actually “balanced”; it’s distributed. There’s a distinct difference.

To balance load would mean to dynamically select uplinks based on the current utilization of all links in the group, in an attempt to utilize the same amount of each one. For example, if 20 percent of each of the four links is utilized, I’d call that balanced.

Load distribution, on the other hand, means to algorithmically assign sessions to a given uplink (based on a hash value that the algorithm has calculated). It’s then the algorithm’s job to distribute the sessions as evenly as possible.

The list of hashing possibilities is long. You can choose to hash based on MAC, IP or port; and Source, Destination or both. A given combination of those “inputs” will be chewed up by the algorithm and spit out to be used in selecting a link. This output is called the Result Bundle Hash, or RBH. The RBH value, which is 0 through 7, will be used to select the link. Each link is responsible for a given number of the possible RBH values, which depends on how many uplinks are in the group.

Note that with any number of uplinks that is not a power of 2, distribution becomes imbalanced. Because the hashing algorithm can’t evenly divide the 8 possible values, it must assign extras to the first few links. With three links, for example, the first two links each handle three RBH values and the third handles only two. Because of this, it’s recommended to not use Port Channels with a number of links other than 2, 4 or 8 if you care about distribution.
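To make the arithmetic concrete, here is a minimal Python sketch of how the 8 RBH values could be divided among the links in a group. The function name rbh_values_per_link is mine, and the "extras go to the first links" rule simply follows the description above; it is not taken from any switch's actual implementation.

```python
def rbh_values_per_link(num_links, num_values=8):
    """Return how many RBH values each link in the group is responsible for."""
    if not 1 <= num_links <= num_values:
        raise ValueError("num_links must be between 1 and num_values")
    base, extra = divmod(num_values, num_links)
    # The first `extra` links each pick up one additional RBH value.
    return [base + 1 if i < extra else base for i in range(num_links)]


for n in range(1, 9):
    print(n, "links:", rbh_values_per_link(n))
```

Only group sizes of 1, 2, 4 and 8 give every link the same share; with 3 links the split is 3, 3 and 2.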

With an understanding of how link selection is performed, I’ll walk through an example. For this example, the hashing mechanism is the source and destination IP address, or src-dst-ip. This means that a session from 10.0.0.2 to 10.0.1.3 might compute to an RBH of 0x4, and another session from 10.0.0.2 to 10.0.1.4 might compute to 0x1.
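As a rough illustration of src-dst-ip hashing, the sketch below XORs the two addresses and keeps the low-order three bits to get a value from 0 through 7. That formula is an assumption on my part for demonstration purposes; real switches use platform-specific hashes, so the values it produces are not meant to match the illustrative 0x4 and 0x1 above. The modulo mapping from RBH to link is likewise just one simple way to hand each link its share of values.

```python
import ipaddress


def rbh_src_dst_ip(src, dst):
    """Toy src-dst-ip hash: XOR both IPv4 addresses, keep the low 3 bits."""
    s = int(ipaddress.IPv4Address(src))
    d = int(ipaddress.IPv4Address(dst))
    return (s ^ d) & 0b111  # always in the range 0..7


def link_for_rbh(rbh, num_links):
    """Assumed mapping: RBH modulo the number of links in the group."""
    return rbh % num_links


for src, dst in [("10.0.0.2", "10.0.1.3"), ("10.0.0.2", "10.0.1.4")]:
    rbh = rbh_src_dst_ip(src, dst)
    print(f"{src} -> {dst}: RBH {rbh:#x}, link {link_for_rbh(rbh, 4)}")
```

The point to notice is that the inputs are only addresses. The same pair of hosts always lands on the same link, no matter how much traffic they exchange.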

Even though the algorithm may be good at evenly distributing sessions based on hash (two possible RBH values for each link in a four-link group), it’s completely unaware of the actual utilization of each uplink.

Sessions are distributed solely based on the correlation of their RBH and the uplink responsible for that RBH. Therefore, it’s feasible (however unlikely it may be) that due to widely varying traffic needs from one VM to another and the luck of the draw with the algorithm, one link could be 100 percent utilized and dropping packets, while the other three links sit idle. My networking buddies tell me that if you look at real-world distribution across the links in a port channel, it’s usually reasonable. So I’m not trying to say that this worst-case scenario should be expected; it just needs to be considered during design, because it’s a possibility.
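The worst case is easy to reproduce on paper. The sketch below uses made-up sessions, each given as an (RBH, Mbps) pair, on a four-link group and assumes the same simple "RBH modulo link count" mapping; the numbers are purely hypothetical and chosen to show the unlucky draw.

```python
def link_load(sessions, num_links):
    """Sum the traffic (in Mbps) that each link carries for the given sessions."""
    load = [0.0] * num_links
    for rbh, mbps in sessions:
        # The link is chosen from the hash alone; mbps plays no part in it.
        load[rbh % num_links] += mbps
    return load


# Hypothetical sessions: the two heavy ones hash to RBH 0 and RBH 4,
# which both belong to the first link in a four-link group.
sessions = [(0, 600.0), (4, 400.0), (1, 5.0), (2, 5.0), (3, 5.0)]
print(link_load(sessions, 4))
```

With 1 Gbps uplinks, the first link would be carrying 1,000 Mbps while the other three carry 5 Mbps each, even though the sessions themselves are spread across all four links.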

There are two key takeaways from this:

1. A Port Channel does not evenly balance utilization across the links in a group; it evenly distributes sessions based on the specified hashing policy.

2. A single session will still never be given more bandwidth than a single uplink, due to the way traffic is distributed.

Load-Based Teaming

Now, on to the second misunderstanding: LBT. It’s as common as the first, and rightly so. If I didn’t know better, I’d expect this mechanism to work just like any unsuspecting administrator thinks it does. The assumption is that because LBT is aware of the physical NIC load, it evenly distributes sessions across all available uplinks. As in the earlier example, you’d assume that all uplinks would be balanced at the same percentage of utilization. Sadly, again this is not the case.

The truth about LBT is that it initially selects uplinks the same way as Route Based on Originating Virtual Port ID. When a VM boots, the vNICs are assigned to a dvPort. That port is used to determine which uplink the traffic will use. The LBT mechanism comes into play every 30 seconds, when it polls the uplinks. If an uplink is more than 75 percent utilized during that polling period, LBT will move that dvPort to a less utilized uplink.
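Here is a minimal sketch of a single LBT polling pass, based only on the behavior described above. Several details are my assumptions rather than VMware's implementation: the initial placement uses the dvPort ID modulo the uplink count, the busiest dvPort on an overloaded uplink is the one that moves, and it moves to the least utilized uplink. All names are hypothetical.

```python
THRESHOLD = 0.75  # an uplink above this utilization triggers a move


def initial_placement(port_ids, num_uplinks):
    # Assumption: the dvPort ID modulo the uplink count picks the uplink.
    return {port: port % num_uplinks for port in port_ids}


def uplink_load(placement, port_mbps, num_uplinks):
    load = [0.0] * num_uplinks
    for port, uplink in placement.items():
        load[uplink] += port_mbps[port]
    return load


def rebalance_once(placement, port_mbps, num_uplinks, capacity_mbps):
    """One polling pass (every 30 seconds per the text); returns a new placement."""
    placement = dict(placement)
    load = uplink_load(placement, port_mbps, num_uplinks)
    for uplink in range(num_uplinks):
        if load[uplink] / capacity_mbps <= THRESHOLD:
            continue  # below the threshold, so nothing moves
        target = min(range(num_uplinks), key=lambda u: load[u])
        ports = [p for p, u in placement.items() if u == uplink]
        if target == uplink or not ports:
            continue
        # Assumption: the busiest dvPort on the overloaded uplink is moved.
        mover = max(ports, key=lambda p: port_mbps[p])
        placement[mover] = target
        load[uplink] -= port_mbps[mover]
        load[target] += port_mbps[mover]
    return placement


# Hypothetical traffic: eight dvPorts, two of them busy, on 4 x 1 Gbps uplinks.
port_mbps = {p: 10.0 for p in range(8)}
port_mbps[0], port_mbps[4] = 500.0, 400.0
before = initial_placement(range(8), 4)
after = rebalance_once(before, port_mbps, 4, 1000.0)
print(uplink_load(before, port_mbps, 4))  # [900.0, 20.0, 20.0, 20.0]
print(uplink_load(after, port_mbps, 4))   # [400.0, 520.0, 20.0, 20.0]
```

After the pass, no uplink is above 75 percent, but the load is nowhere near even. The mechanism relieves contention; it does not level utilization.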

LBT takeaways:

1. LBT is aware of link utilization and works to keep any one link from exceeding 75 percent utilization until the others have reached that level as well.

2. LBT does not evenly balance traffic across uplinks when saturation is not occurring. This may explain the confusion for some folks looking at ESXTOP metrics. It will only move an assignment to another uplink once saturation occurs.

Recommendations

If you only have a vSphere Standard Switch to work with due to environment constraints or a lack of Enterprise Plus licensing, you’re stuck with Port Channel and “Route based on IP hash.” And, frankly, unless you really need to squeeze out some performance, I’d avoid the Port Channel altogether and stick with the vanilla “Route based on originating Port ID.” It’s reliable, requires no configuration outside of the vSwitch, and performs well.

If, however, you’re blessed to be in possession of a functioning VDS, then in almost all cases I’d recommend using LBT. The awareness of actual utilization is comforting, and although I don’t feel comfortable saying that it “evenly distributes load,” I do like that it distributes load in such a way as to avoid contention until all links have been utilized. It’s just a bonus that it also doesn’t require any configuration of the physical switches.