Options for Managing vSphere Replication Traffic on Cisco UCS

I was recently designing a vSphere Replication and SRM solution for a client and I stated we would use static routes on the ESXi hosts. When asked why, I was able to explain why the default gateway on the management network wouldn't work, and to present some options for separating the vSphere Replication traffic in a way that would allow flexibility in throttling its bandwidth usage.

You won't see Network I/O Control (NIOC) listed here because this particular client didn't have Enterprise Plus licensing and therefore wasn't using a vDS. In addition, this client was running a Fibre Channel SAN on top of Cisco UCS with only a single VIC in its blades. This configuration doesn't work well with NIOC because NIOC doesn't take into account the FC traffic that is sharing bandwidth with all the Ethernet traffic NIOC *is* managing.

As an addendum to the original write-up, I'd like to add that another option would be vSphere's built-in Traffic Shaping. Traffic Shaping is configured at the portgroup level, which is one more reason I'd recommend creating a dedicated VMkernel port for vSphere Replication instead of letting replication traffic use the ESXi management VMkernel port, as it does by default.
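Portgroup-level shaping on a standard vSwitch can also be set from the command line. A minimal sketch, assuming a dedicated replication portgroup (the name "vRepl-PG" and the 1 Gbps cap are placeholders, not values from the client's environment):

```shell
# Enable egress traffic shaping on the replication portgroup.
# Bandwidth values are in Kbps (1 Gbps = 1000000 Kbps), burst size in KiB.
esxcli network vswitch standard portgroup policy shaping set \
    --portgroup-name "vRepl-PG" \
    --enabled true \
    --avg-bandwidth 1000000 \
    --peak-bandwidth 1000000 \
    --burst-size 102400

# Verify the policy took effect
esxcli network vswitch standard portgroup policy shaping get \
    --portgroup-name "vRepl-PG"
```

Note that standard vSwitch shaping applies to egress (outbound) traffic only, which is the direction that matters for the replicating site.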

Why do we need a static route on the hosts? If they have a default gateway, then they can route just fine. Is this a requirement?

An ESXi host can have only one default gateway, and it is assigned to the management network. Most other IP networks used on an ESXi host, such as IP storage (NFS and iSCSI), vMotion, and Fault Tolerance, don't need to leave their own subnet; their traffic stays local to their VLAN. vSphere Replication, when used across a Layer 3 network like a WAN (this doesn't apply to a stretched Layer 2 network between sites), needs to send traffic outside of its subnet, and that requires a static route configured on the ESXi host. This assumes the vSphere Replication traffic doesn't share a VLAN with the management network. If Management and vSphere Replication did share a VLAN, no static route would be required, because the replication traffic would originate on the management network and use its existing default gateway. This, then, leaves several options.
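For illustration, adding such a static route with esxcli might look like the following. The subnets and gateway here are placeholders, not the client's actual addressing:

```shell
# Send traffic destined for the remote site's replication subnet
# (placeholder 192.168.2.0/24) via the local replication VLAN's
# gateway (placeholder 10.0.2.1) rather than the management gateway.
esxcli network ip route ipv4 add --network 192.168.2.0/24 --gateway 10.0.2.1

# Confirm the route is in place
esxcli network ip route ipv4 list
```

Routes added this way persist across host reboots, but remember they must be configured on every host at both sites (each side needs a route back to the other's replication subnet).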

OPTION 1

Assume for clarity Management VLAN 1 and vSphere Replication VLAN 2

Create two new vNICs at the UCS level and add them to each ESXi host (requires a reboot of each host for the new vNICs to appear).

Configure traffic shaping on the new vNICs at the UCS level to match the existing management vNICs: a maximum of 1 Gbps of bandwidth.

Add static routes to the ESXi hosts.

This configuration allows for maximum traffic separation (dedicated vNICs and VLAN) and throttling of replication bandwidth, but requires reboots of the hosts and static routes on every ESXi host. Upon fabric failure, maximum bandwidth for replication is retained.
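The ESXi-side work for this option, once the new UCS vNICs are online, can be sketched as follows. All names and addresses here are assumptions: the new vNICs are presumed to back a vSwitch1, and "vRepl-PG", vmk2, and the IP addressing are placeholders:

```shell
# Portgroup for replication traffic on VLAN 2, on the vSwitch
# backed by the new vNICs
esxcli network vswitch standard portgroup add \
    --portgroup-name "vRepl-PG" --vswitch-name vSwitch1
esxcli network vswitch standard portgroup set \
    --portgroup-name "vRepl-PG" --vlan-id 2

# Dedicated VMkernel port for vSphere Replication
esxcli network ip interface add \
    --interface-name vmk2 --portgroup-name "vRepl-PG"
esxcli network ip interface ipv4 set \
    --interface-name vmk2 --ipv4 10.0.2.11 \
    --netmask 255.255.255.0 --type static
```

The static route from the earlier example would then be added on top of this, pointing out the new VMkernel port's subnet.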

OPTION 2

Assume Management VLAN 1 and vSphere Replication VLAN 2

Configure vSphere Replication traffic to use the same vNICs as Management traffic, but with the opposite Active/Standby configuration. For example, if the Management port has vNIC1 Active and vNIC2 Standby, configure vSphere Replication to use vNIC2 as Active and vNIC1 as Standby. This allows for logical separation of traffic using VLANs (but not dedicated vNICs) and keeps Management traffic and replication traffic on separate links under normal operating conditions. If a fabric fails, both traffic flows will share the same vNIC and a maximum of 1 Gbps of bandwidth, instead of each kind of traffic having a maximum of 1 Gbps.

Since vSphere Replication shares the Management vNICs, it inherits the Management traffic shaping and is capped at 1 Gbps.

Add static routes to ESXi hosts

This configuration is a compromise between option 1 and option 3. Under normal operating conditions, replication traffic is treated the same as in option 1: there's maximum separation between replication traffic and other traffic (somewhat of a dedicated vNIC, because nothing else uses it outside of a failure, and it's in its own VLAN). Another benefit is that no ESXi host reboots are required. The drawback is that under fabric failure conditions, Management traffic and replication traffic will share a vNIC *and* 1 Gbps of bandwidth, instead of each getting its own vNIC and 1 Gbps. And finally, the same as option 1, static routes are required.
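The reversed Active/Standby order is a portgroup-level failover override. A sketch, assuming the uplinks appear to ESXi as vmnic0/vmnic1 and the Management portgroup runs vmnic0 active / vmnic1 standby (portgroup and uplink names are placeholders):

```shell
# Replication portgroup uses the opposite failover order from Management:
# vmnic1 active, vmnic0 standby. Under normal conditions the two traffic
# types ride separate fabrics; on fabric failure they converge onto one.
esxcli network vswitch standard portgroup policy failover set \
    --portgroup-name "vRepl-PG" \
    --active-uplinks vmnic1 \
    --standby-uplinks vmnic0
```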

OPTION 3

Assume Management and vSphere Replication share VLAN 1

The key attribute of this option is that the vSphere Replication VMkernel ports share the same subnet as the Management traffic. vSphere Replication shares the same vNICs with Management traffic, but can still be configured to use the opposite Active/Standby vNICs. Under normal operation, replication traffic gets a dedicated 1 Gbps, but under fabric failure conditions, replication traffic will contend with Management traffic at both the vNIC level and the VLAN level.

*No* static routes are required.

The benefit of this option is that there are *no* ESXi host reboots and no static routes to configure or manage. Under normal conditions, replication gets dedicated bandwidth; under fabric failure conditions, replication and Management will share a vNIC. At all times, replication traffic and Management traffic share the same VLAN, thus receiving and processing the same broadcasts and unknown unicasts, and residing in the same Layer 2 failure domain (think spanning tree problems affecting both Management and replication at the same time instead of one or the other).

I recommended option 2 because we wouldn’t have to add any more vNICs and no host reboots would be required.