With VXLAN now prevalent in today's data centers, I wanted to offer a different design perspective: a pure layer 3 network from host to host. Many aspects of this design could be justifiably debated, but the purpose of this document is simply to give the audience one example of how to build a data center network with no layer 2 between the hardware while still keeping the flexibility to move virtual instances from host to host.

Figure 1-1 provides an overall view of the basic network that will be discussed.

Figure 1-1 Basic Network Example

I built this network virtually using the following components:

VirtualBox – Virtual environment
Cumulus VX 3.2.1 – Spines and Leaves
Ubuntu 16.04.2 LTS – Servers
LXC/LXD – OS-based containers
Cumulus Quagga – Host-based routing
BGP – Routing protocol between all hardware

This document will not focus on the installation and initial configuration of each of these components; instead, it will focus on the routing design at each level (Spine, Leaf, Server, Container) needed to achieve layer 3 routing from end to end along with the flexibility to move containers between the two servers.

In a typical network design, the routing responsibility lives within the network nodes and the servers do not participate in the routing domain. This is where my example begins to deviate from the norm. Figure 1-2 shows the BGP design for my network: every network node and every server runs BGP. To be more specific, each hardware node is configured with its own ASN and runs eBGP.

Figure 1-2 eBGP ASN Assignments

I assume the first question for most readers is "How is the server running BGP?". The server is able to run BGP thanks to the quagga package installed on the Ubuntu OS; in my example network I installed the Cumulus Networks version of the package. Quagga essentially turns a Linux host into a router capable of running routing protocols such as BGP and OSPF, just like an everyday router. The easiest way to map this design onto what an average network engineer/architect is used to is to think of each server as a router and each virtualized instance on that server as a directly connected interface/network.
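As a hedged sketch of getting the daemons running (this assumes the Debian-style quagga packaging, where daemons are toggled in /etc/quagga/daemons; the Cumulus build may differ):

```
# Enable the zebra and bgpd daemons, then restart the service
# (Debian-style packaging assumed):
sudo sed -i 's/^zebra=no/zebra=yes/; s/^bgpd=no/bgpd=yes/' /etc/quagga/daemons
sudo systemctl restart quagga
```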

Now that we have layer 3 all the way down to each server, we need to configure our virtualized instances, in this case LXD containers, so that they communicate properly. The main point is that each virtual instance must send all of its traffic to its default gateway, which in this design is the server the virtual instance resides on. To achieve this, we need to modify the route table of each container. By default, a freshly installed OS creates at least two routes in its local route table: a default route pointing to its gateway and a route for its local network pointing to its interface. Figure 1-3 shows the route table of an Ubuntu container just after I created and launched it.

Figure 1-3 Ubuntu initial route table

Deleting the second route on the Ubuntu container forces the virtual instance to send all of its traffic to its gateway, the server running eBGP, which can then route the container's traffic accordingly. One thing to note: the container network is identical across both servers and would remain identical no matter how many servers were included in this design. As you can see in Figure 1-3, the Ubuntu virtual instance believes it is on the 10.160.160.0/24 network. In this example, every container is configured with a unique IP in the 10.160.160.0/24 network and every server is configured with 10.160.160.1/24. Figure 1-4 shows the IP allocation that I used in my example environment.

Figure 1-4 IP and Network Allocations
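To make the route deletion concrete, here is a hedged sketch using my example addressing (the container name and interface name are assumptions):

```
# Remove the connected /24 route from inside the container; only the
# default route via the server (10.160.160.1) remains:
lxc exec container01 -- ip route del 10.160.160.0/24 dev eth0

# Verify the container now holds a single default route:
lxc exec container01 -- ip route
# default via 10.160.160.1 dev eth0
```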

The next step in this design is to configure each server to advertise a /32 route for each container that resides locally. This is achieved by placing a /32 route for each container into a separate custom route table on each server. Figure 1-5 shows the route table “100” on Server02 which displays a host specific route for both Container03 and Container04.

Figure 1-5 Route table 100
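A hedged sketch of populating that custom table on Server02 follows; the specific container addresses and the bridge interface br0 are assumptions based on my example:

```
# Host routes for the containers that currently reside on Server02:
sudo ip route add 10.160.160.103/32 dev br0 table 100
sudo ip route add 10.160.160.104/32 dev br0 table 100

# Verify the contents of table 100:
ip route show table 100
```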

The servers import this table and then redistribute it into BGP, thereby allowing each server to advertise into the network a /32 route for each container residing locally. Figure 1-6 shows a snippet of the router configuration on the server, covering the BGP configuration along with the import and redistribution of table "100".

Figure 1-6 Server01 router configuration
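For reference, a minimal sketch of such a server-side Quagga configuration might look like the following; the ASN, router ID, and uplink interface name are assumptions, while `ip import-table` and `redistribute table` are the import/redistribution mechanisms described above:

```
! Sketch of a server routing configuration
! Pull the custom kernel route table into zebra:
ip import-table 100
!
router bgp 65111
 bgp router-id 10.0.0.11
 ! Unnumbered eBGP session toward the leaf:
 neighbor eth1 interface remote-as external
 ! Advertise the imported /32 container routes:
 redistribute table 100
```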

At a high level, several things need to happen for a container to successfully move from one server to another. In my testing I did not use a shared storage source for my containers, so I was not able to conduct live migration tests. Instead, I moved containers between the servers, which required each container to stop, be moved, and then start back up. I believe live migration would succeed if two things were present in my network:

Shared storage – Each server would need to be able to access the same storage resource.
Shared server MAC address – A container retains its local ARP cache across a move, so the gateway MAC address would need to be configured identically across all servers. In my example network this would be interface br0.
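The second point can be satisfied by pinning the same MAC on every server's br0; a hedged sketch (the locally administered address below is an arbitrary example):

```
# Run on every server so a migrated container's cached gateway MAC
# stays valid (02:... is a locally administered example address):
sudo ip link set dev br0 address 02:10:a0:a0:a0:01
```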

Excluding a live migration, here are the high-level steps that need to occur for a container to move successfully; I was able to test this repeatedly without issue. A scripted sketch of the procedure follows the list.

1. Stop the container.
2. Remove the /32 route from the custom table.
3. Move the container.
4. Start the container.
5. Delete the network route from the container OS.
6. Install the /32 route into the custom table.
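Here is a hedged sketch of scripting that procedure for Container03 moving from Server02 to Server01; the container name, its address, the LXD remote name, and the interface names are all assumptions based on my example:

```
#!/bin/bash
# Shown as one listing for clarity; in practice the two halves run on
# different hosts.
set -e
C=container03
IP=10.160.160.103

# Steps 1-3 run on the source server (Server02):
lxc stop "$C"                                 # 1. stop the container
sudo ip route del "$IP/32" dev br0 table 100  # 2. remove the /32 route
lxc move "$C" server01:"$C"                   # 3. move to the LXD remote

# Steps 4-6 run on the destination server (Server01):
lxc start "$C"                                          # 4. start the container
lxc exec "$C" -- ip route del 10.160.160.0/24 dev eth0  # 5. drop network route
sudo ip route add "$IP/32" dev br0 table 100            # 6. install /32 route
```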

Figure 1-7 shows Spine01's BGP route table before I moved Container03 from Server02 to Server01. In an effort not to turn this document into a lengthy book, I have not discussed the Spine and Leaf configurations in depth. Briefly, the BGP peers are built using IPv6 link-local addressing while carrying IPv4 NLRI (RFC 5549), which is why my BGP route tables show IPv6 link-local addresses as the next hops for each of the /32 routes.

Figure 1-7 Spine01 BGP route table
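For context, a hedged sketch of that unnumbered peering style on Spine01 might look like this; the ASN and swp interface names are assumptions:

```
! Spine01 sketch: sessions form over each interface's IPv6 link-local
! address and carry IPv4 NLRI (RFC 5549), so IPv4 routes resolve to
! IPv6 link-local next hops.
router bgp 65201
 neighbor swp1 interface remote-as external
 neighbor swp2 interface remote-as external
```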

Spine01’s BGP route table output shows that Container01 & Container02 are reachable via Leaf01 while Container03 & Container04 are reachable via Leaf02. Figure 1-8 shows Spine01’s BGP route table after I moved Container03 from Server02 to Server01. Spine01 is now learning the IP address of Container03 via Leaf01.

Figure 1-8 Spine01 BGP route table after Container03 moved to Server01

I ended up creating several shell scripts to automate the various tasks, and I was even able to automate the buildout of iptables rules as a container started, stopped, or moved, providing security between containers, something now commonly referred to as microsegmentation.
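As one hedged illustration of what such generated rules might look like when a container starts (the addresses and policy here are assumptions, not my exact scripts), note that because each container holds only a default route, even container-to-container traffic is routed by the server and therefore traverses the FORWARD chain:

```
# Deny east-west traffic from container01 toward the rest of the
# container network while still allowing it to be routed elsewhere;
# rules like these can be added/removed by the lifecycle scripts:
sudo iptables -A FORWARD -s 10.160.160.101/32 -d 10.160.160.0/24 -j DROP
sudo iptables -A FORWARD -s 10.160.160.101/32 -j ACCEPT
```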

In this network environment I limited the routing tables at each level of the network (Spine, Leaf, Server) by only advertising and receiving what is necessary. In general, the Spines need to know all of the /32 routes; the Leaves need a default route upward and only the locally significant /32s (those directly attached to the Leaf) downward; and the Servers need only a default route upward plus their local virtual instances.
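A hedged sketch of that scoping on a leaf (the ASN, interface, and names are assumptions): originate only a default route toward the server and filter everything else outbound:

```
! Leaf01 sketch: send the attached server only a default route.
router bgp 65102
 neighbor swp1 interface remote-as external
 neighbor swp1 default-originate
 neighbor swp1 prefix-list DEFAULT-ONLY out
!
ip prefix-list DEFAULT-ONLY seq 5 permit 0.0.0.0/0
```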

In conclusion, I was able to create a layer 3 routable network design that stretched all the way down into each host. Each virtual instance on the hosts could be assigned an IP address from a common network, and each had the freedom to move between hosts without being re-addressed or losing network reachability. The specific configurations of my network are not what is important, as the end result can be achieved in a multitude of ways; what should be noted are the high-level changes needed to support a pure layer 3 network design:

Each host needs to run a routing protocol and be part of the overall routing domain.
Each virtual instance needs only a single default route to its gateway.
Each host needs the capability to advertise a /32 route for each virtual instance residing on the local host.
Each host carries the same network, gateway, and even MAC address to assist with virtual instance migration.

VirtualBox – https://www.virtualbox.org/
Cumulus Networks – https://www.cumulusnetworks.com/
Ubuntu – https://www.ubuntu.com/
LXD – https://www.ubuntu.com/containers/lxd
Cumulus Quagga – https://github.com/CumulusNetworks/quagga