This blog talks about Software Defined Networking as an architecture, the flexibility in its system design, and how it is purpose-built to simplify network and traffic management. This won’t be a discussion about the best or most efficient system architecture, but I hope that by the end you get an idea of how to design an SDN-based product or solution.

Prerequisites — Basic SDN knowledge, Openflow, some idea about SDN controllers, exposure to SDN marketing buzzwords

SDN is an architecture, not a protocol. Period.

Many people starting off with SDN believe that it is closely related or even equivalent to Openflow, but that’s only partially true. The researchers from Stanford who laid the foundation of the technology proposed Openflow as the protocol that would drive Software Defined Networking. But SDN is much more than that.

Software defined networking is a concept by which we can achieve network programmability at scale, and it can work over any communication protocol depending on the use-case. Openflow is a communication protocol purpose-built for communication between the control plane (controllers) and the data plane (switches, firewalls, load balancers). It should be seen as only one of the ways to design an SDN-based product and must not be confused with SDN itself.

Openflow is not the answer to everything. There are some noteworthy drawbacks in Openflow’s design, primarily around having to consult the controller for flows that the switch’s installed rules do not cover, which can add unnecessary overhead in some situations. That is more of a design decision. Does your use-case require this much control to sit on the controller, or would you want only certain special routing decisions to come from above, with the trivial packet forwarding rules staying in the data plane? Designing the most efficient packet forwarding mechanism is an engineering task. I mean, go nuts with system design on this one. There can be hundreds of ways to implement this, depending on your use-cases.

I’m really fascinated by the fact that, after so many years of algorithmic evolution and numerous protocol advancements, what brings the biggest step-up in the networking industry is system design and architecture.

Control Plane separation in other domains

Before ride-hailing services, drivers themselves decided which location to go to for pickups. If you think about it, what Uber provided was nothing but a control plane for all the cabs on the road. Now the decision to go for a pickup, or the route to choose in case of carpooling (multiple pickups and drops), is made by an entity that sits away from the physical boxes on wheels. Moreover, Uber gets visibility over all its cars, the routes they take and the riders they serve.

Figure 1 — Control plane segregation in other domains

Something very similar happened in the networking world. A central intelligence was provided to control all the packet forwarders, and the switches were stripped of this intelligence to make them more lightweight. How to implement this is still a design decision in the hands of the service providers: Uber might follow one algorithm and Lyft another, but the idea remains the same.

What car model comes to pick you up doesn’t matter. You generally say “I booked a cab”, not “I booked a Chevrolet XYZ”. This is (in a way) analogous to network function virtualization (NFV). Earlier you would say “I use a Cisco 2960-X running IOS” for routing/switching, but now all you need is any x86 hardware running a Linux OS.

Now, Uber provides you an interface to call cabs, schedule pickups, track rides etc., behind which sit business logic, automation and database design that control its cars on the road. This is (again, in a way) analogous to software defined networking.

Policies and Rules. Responsibility segregation

An SDN inspired design would require a controller appliance to have interfaces to ingest user-defined policies, traditionally via something known as northbound interfaces. These policies must then be mapped to flow table rules which can be understood by the underlying data plane engines. These rules are communicated to the networking appliances like switches, load balancers and firewalls, over communication channels called southbound interfaces.

Visually put, the SDN controller sits between the users managing the network infrastructure and hundreds to thousands of networking appliances. To make the infrastructure easily manageable, the controller takes in simple policies, and pushes out routing instructions/flow table rules from the other end.

Figure 2 — SDN northbound and southbound interfaces

The user could say, “all my application X packets should be routed to one of these servers which are part of a backend ServerPool”, which can be noted as a policy in the controller. The controller’s job would be to generate rules (can be dynamic/changing over time) for the corresponding policy, and send those across to the data plane.

Figure 3 — policy to rule mapping
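The policy-to-rule mapping can be sketched in a few lines of Python. This is only an illustration under assumptions: `Policy`, `FlowRule` and `compile_policy` are hypothetical names, the addresses are made up, and the round-robin spread over the pool is just one possible way a controller might generate rules.

```python
from dataclasses import dataclass
from itertools import cycle


@dataclass(frozen=True)
class Policy:
    """A user-facing intent, as it might arrive over a northbound interface."""
    app: str              # e.g. "X"
    server_pool: tuple    # backends allowed to serve this application


@dataclass(frozen=True)
class FlowRule:
    """A data plane instruction, as it might leave over a southbound interface."""
    match_app: str
    forward_to: str


def compile_policy(policy: Policy, appliances: list) -> dict:
    """Turn one policy into one rule per appliance, spreading them over the pool."""
    if not policy.server_pool:
        raise ValueError("policy has an empty server pool")
    backends = cycle(policy.server_pool)
    return {name: FlowRule(policy.app, next(backends)) for name in appliances}


# Hypothetical policy: "all application X packets go to one of these servers".
policy = Policy(app="X", server_pool=("10.0.0.11", "10.0.0.12"))
rules = compile_policy(policy, ["lb-1", "lb-2", "lb-3"])
for appliance, rule in rules.items():
    print(appliance, "->", rule)
```

The user only ever touches `Policy`; the rules can be regenerated whenever the pool or the topology changes.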

Which forwarding decisions the switch can take independently of the controller, and for which it should consult its overlord, is again debatable. If the switches can figure out a solution in less time than it would take them to consult the controller, then consulting it simply doesn’t make sense, right?

Take for example Figure 3 above. Let’s say at t=0, the load balancer (LB) is forwarding all incoming packets to App1. Now any anomaly, be it the application going down or increased packet latency, can be detected at the LB, and it could be designed in such a way that it automatically switches the flow table rule to point to one of the other available applications from the defined ServerPool.

By all means, it could have consulted the controller at t=5 or t=15, reported the anomaly and fetched the updated flow table entry, but if the appliance was purpose-built for load balancing and knows its options, it can easily skip that part and update its flow table entry itself.
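Here is a minimal sketch of that local failover, assuming a hypothetical `LoadBalancer` class, an arbitrary latency threshold of 200 ms, and a simple "move to the next pool member" strategy. None of this comes from a real appliance; it only shows the flow table entry changing without a round trip to the controller.

```python
class LoadBalancer:
    """Hypothetical data plane appliance that knows its ServerPool."""

    def __init__(self, server_pool, latency_limit_ms=200.0):
        self.server_pool = list(server_pool)
        self.latency_limit_ms = latency_limit_ms  # assumed anomaly threshold
        self.active = self.server_pool[0]         # current flow table entry

    def observe(self, backend_up: bool, latency_ms: float) -> str:
        """Take a health sample for the active backend and fail over locally if needed."""
        if not backend_up or latency_ms > self.latency_limit_ms:
            idx = self.server_pool.index(self.active)
            # Point the flow table entry at the next pool member, controller not involved.
            self.active = self.server_pool[(idx + 1) % len(self.server_pool)]
        return self.active


lb = LoadBalancer(["App1", "App2", "App3"])
lb.observe(backend_up=True, latency_ms=20.0)    # healthy, stays on App1
lb.observe(backend_up=True, latency_ms=950.0)   # too slow, moves to App2
lb.observe(backend_up=False, latency_ms=0.0)    # down, moves to App3
print(lb.active)
```

A real design would probably still report the change to the controller afterwards, so that the central view stays accurate.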

I mean, the more I think about my ride-hailing analogy, the more sense it makes. Uber drivers can roam around wherever they think they have a better chance of getting ride requests: near offices at closing time, popular public spots on weekends, etc. The point being, despite Uber’s centralized control plane, it allows its cabs on the road to have an intelligence of their own. If they were to rely on the app for these things as well, it would be (1) unfair and (2) infeasible.

The segregation of intelligence is key here. Keeping everything on the data plane is faster, since there is no controller communication overhead. But without the controller, converging on a shared view of the network would be a huge pain, and you would have to worry about configuration management at that scale. This is part of the classic centralized vs distributed architecture debate. Both have their pros and cons, but we certainly have the flexibility to try and get the best of both worlds.

SDN as a Distributed Systems problem

The entities involved in the system are -

The SDN Controller

Switches/Packet Forwarders

The switches have been running in a distributed fashion for ages, and are “distributed systems” by definition. They talk among themselves, reach a conclusion and deliver, to maintain a certain state of the infrastructure. Having 5 nodes talking to each other is manageable, but in a mesh topology with n (insert large number) interconnected switches, convergence is expected to take longer.

Offloading the compute workload and segregating control plane traffic from the data plane should, in theory, decrease convergence time and get the switches ready for packet forwarding faster.

Quick math: in a full mesh topology with n nodes, the number of links in the topology would be n(n-1)/2. So the control plane packets in a legacy network infrastructure would flood over these n(n-1)/2 links. This is likely to affect the data plane performance.

Now consider the same mesh topology but with a centralized controller. Because of the separation of control and data planes, there will be n(n-1)/2 data plane connections (same as before) but only n links for control plane communication. So with SDN, control plane packets travel over far fewer links: n instead of n(n-1)/2.
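To put numbers on the difference, the two counts can be computed directly. The sketch assumes a full mesh and exactly one dedicated control link per switch to a single controller; the function names are made up for illustration.

```python
def mesh_links(n: int) -> int:
    """Links in a full mesh of n switches: n(n-1)/2."""
    return n * (n - 1) // 2


def controller_links(n: int) -> int:
    """Control links when each of the n switches talks to one controller."""
    return n


for n in (5, 50, 500):
    print(f"n={n}: mesh links={mesh_links(n)}, control links={controller_links(n)}")
# n=5: mesh links=10, control links=5
# n=50: mesh links=1225, control links=50
# n=500: mesh links=124750, control links=500
```

The mesh count grows quadratically while the control link count grows linearly, which is why the gap widens so quickly as n increases.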

Some research papers talking about improved convergence times as part of the SDN architecture — link1, link2

Having said all that, centralizing the control plane brings its own complexities: a single point of failure (fault tolerance), a performance bottleneck due to controller-switch latency (efficiency), the controller’s ability to handle n switches (scalability), and a single point of attack (security).

*Enter the Distributed Controller Architecture*. Some solutions around this:

Availability/Fault tolerance — Run a controller cluster, with leader and worker nodes. Consensus algorithms like Paxos come into play here.

Scalability — Partitioning services on the controller cluster. There are multiple approaches to partitioning load and distributing applications/processes running on the controller cluster.

Security — A never-ending problem.
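As one illustration of the scalability point, here is a sketch that partitions switches across controller nodes using rendezvous (highest-random-weight) hashing. This is a technique swapped in for the example, not something the products or papers mentioned here are known to use, and the node and switch names are hypothetical. The useful property is that when a node fails, only the switches it owned get reassigned.

```python
import hashlib


def owner(switch_id: str, controllers: list) -> str:
    """Pick the controller node responsible for a switch via rendezvous hashing."""
    if not controllers:
        raise ValueError("no controller nodes available")

    def score(node: str) -> str:
        # Deterministic per (node, switch) pair; the highest score wins.
        return hashlib.sha256(f"{node}|{switch_id}".encode()).hexdigest()

    return max(controllers, key=score)


cluster = ["ctrl-a", "ctrl-b", "ctrl-c"]          # hypothetical controller nodes
switches = [f"sw-{i}" for i in range(12)]         # hypothetical switches

before = {sw: owner(sw, cluster) for sw in switches}

# Simulate ctrl-b failing: only its switches should need a new owner.
survivors = [node for node in cluster if node != "ctrl-b"]
after = {sw: owner(sw, survivors) for sw in switches}

moved = [sw for sw in switches if before[sw] != after[sw]]
print("reassigned after failure:", moved)
```

Agreeing on which nodes are alive in the first place is still a job for the consensus layer; this sketch only covers the partitioning.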

There has been a lot of development in the area of scaling up SDN controllers and making the whole SDN architecture faster and more efficient. This is a solid indication that people realize the downsides of a centralized controller, as far as network performance and the other aforementioned points are concerned, but also realize that these problems are solvable. A few examples:

Kandoo: Proposed by researchers at the University of Toronto, it describes a hierarchical system of controllers (global and local) that handles frequent and rare events at separate levels.

DevoFlow: Worked on by researchers from HP Labs and the University of Waterloo, it proposes ways to reduce control plane traffic to the controller, and argues for giving more intelligence to the data plane appliances.
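The idea of handling frequent and rare events at separate levels can be sketched as a two-level dispatch. To be clear, this is not Kandoo’s actual API: the class names, the event names and the choice of which events count as "frequent" are all assumptions made for the example.

```python
class RootController:
    """Global controller: sees only the events that need a network-wide view."""

    def __init__(self):
        self.handled = []

    def handle(self, switch_id: str, event: str) -> None:
        self.handled.append((switch_id, event))


class LocalController:
    """Sits close to the switches and absorbs the frequent, local events."""

    LOCAL_EVENTS = {"flow_stats", "port_counter"}  # assumed "frequent" event types

    def __init__(self, root: RootController):
        self.root = root
        self.handled = []

    def handle(self, switch_id: str, event: str) -> None:
        if event in self.LOCAL_EVENTS:
            self.handled.append((switch_id, event))   # handled without leaving this level
        else:
            self.root.handle(switch_id, event)        # rare event, escalate


root = RootController()
local = LocalController(root)
for event in ["flow_stats", "flow_stats", "port_counter", "link_down"]:
    local.handle("sw-1", event)

print(len(local.handled), "handled locally,", len(root.handled), "escalated")
```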

In the end, managing n stateful switches and m stateful controllers forces us to incorporate distributed systems design patterns in SDN solutions to add reliability, availability, fault tolerance, scalability, efficiency and robustness. The initial argument stands: SDN, as it exists today, is in fact a distributed systems problem.

Going beyond just Routing

Software defined networking was initially designed to make routing decisions on behalf of the datapath engines, so that the datapath is only responsible for moving packets across interfaces. This segregation ushered in a number of fancy marketing buzzwords, but more importantly, it enabled those things to become a reality relatively quickly.

Multi-cloud networking: It won’t take long to realize that for your product to earn the multi-cloud tag, you need a way to talk to various cloud offerings. All cloud platforms have their own lingo, popularized as part of the ecosystem they provide (Azure’s AKS, Google’s GKE and Amazon’s EKS for Kubernetes-aaS).

Now let’s say you have a product, and you want to deploy it on AWS and GCP. The product remains the same but now you need to juice up your deployment and integration code to be able to talk to both AWS and GCP infrastructure components.

There’s one way to integrate with AWS’s compute and network infrastructure, and a separate one for GCP. Both provide APIs to build things, so you end up writing something like a connector module inside your app, which holds the business logic for doing the same things differently on different cloud platforms.

It’s mostly the same case with SDN controllers and datapath engines. It’s not just about the controller: the way the datapath engine fits in and integrates with various cloud components on AWS will differ from GCP, because the model definitions for the two cloud platforms are different. This matters even more for networking appliances, because networking engines are what enable applications distributed across clouds to interconnect with each other.

Have a look at the way you can create a Virtual Network on AWS, Azure and GCP. You’ll be able to achieve almost the same thing in the end, but these guys call it different things, and their data models are different. The controller’s job would be to provide a single interface to contact these folks and talk their talk.
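A connector layer of this kind might look roughly like the sketch below. The class and method names are hypothetical, and no real cloud SDK is called: each connector only builds a description of what it would ask for, to show one interface sitting over different names and data models.

```python
from abc import ABC, abstractmethod


class CloudConnector(ABC):
    """The single interface the controller exposes, whatever the cloud underneath."""

    @abstractmethod
    def create_virtual_network(self, name: str, cidr: str) -> dict:
        ...


class AwsConnector(CloudConnector):
    def create_virtual_network(self, name, cidr):
        # AWS calls this a VPC; a real connector would call the AWS API here.
        return {"provider": "aws", "resource": "VPC", "name": name, "cidr": cidr}


class AzureConnector(CloudConnector):
    def create_virtual_network(self, name, cidr):
        # Azure calls this a Virtual Network (VNet) with an address space.
        return {"provider": "azure", "resource": "Virtual Network",
                "name": name, "address_space": [cidr]}


class GcpConnector(CloudConnector):
    def create_virtual_network(self, name, cidr):
        # In GCP, address ranges belong to subnets rather than to the network itself.
        return {"provider": "gcp", "resource": "VPC network",
                "name": name, "subnets": [{"cidr": cidr}]}


connectors = [AwsConnector(), AzureConnector(), GcpConnector()]
requests = [c.create_virtual_network("prod-net", "10.20.0.0/16") for c in connectors]
for request in requests:
    print(request)
```

The caller asks for the same thing three times and never sees that the CIDR lands in a different place in each model.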

Same goes for VMware Cloud, OpenShift/Kubernetes (container orchestration) etc.

Figure 4