Key Takeaways Developed by Datawire, Ambassador is an open source API gateway designed specifically for use with the Kubernetes container orchestration framework.

At its core, Ambassador is a control plane tailored for edge/API configuration, managing the Envoy Proxy “data plane”.

Envoy itself is a cloud native Layer 7 proxy and communication bus used for handling “edge” ingress and service-to-service networking communication.

This article provides an insight into the creation of Ambassador, and discusses the technical challenges and lessons learned from building a developer-focused control plane for managing ingress traffic within microservice-based applications that are deployed into a Kubernetes cluster.

Migrating Ambassador to the Envoy v2 configuration and Aggregated Discovery Service (ADS) APIs was a long and difficult journey that required lots of architecture and design discussions, and plenty of coding, but early feedback from the community has been positive.



Developed by Datawire, Ambassador is an open source API gateway designed specifically for use with the Kubernetes container orchestration framework. At its core, Ambassador is a control plane tailored for edge/API configuration, managing the Envoy Proxy “data plane”. Envoy itself is a cloud native Layer 7 proxy and communication bus used for handling “edge” ingress and service-to-service networking communication. Although originating from Lyft, Envoy is rapidly becoming the de facto proxy for modern networking, and can be found within practically all of the public cloud vendors' offerings, as well as in bespoke usage by many large end-user organisations such as eBay, Pinterest and Groupon.

This article provides an insight into the creation of Ambassador, and discusses the technical challenges and lessons learned from building a developer-focused control plane for managing ingress traffic within microservice-based applications that are deployed into a Kubernetes cluster.

The Emerging “Cloud Native” Fabric: Kubernetes and Envoy

Although the phrase “cloud native” is becoming as much of an overloaded term as “DevOps” and “microservices”, it is increasingly gaining traction throughout the IT industry. According to Gartner, the 2018 worldwide public cloud service revenue forecast was in the region of $175 billion, and this could grow by over 15% next year. Although the current public cloud market is dominated by only a few key players that offer mostly proprietary technologies (and increasingly, and sometimes controversially, open source-as-service), the Cloud Native Computing Foundation (CNCF) was founded in 2015 by the Linux Foundation to provide a place for discussion and hosting of "open source components of a full stack cloud native environment".

Possibly learning from the journey previously undertaken by the OpenStack community, the early projects supported by the CNCF were arguably less ambitious in scope, provided clearer (opinionated) abstractions, and were also proven in real world usage (or inspired by real world usage in the case of Kubernetes). Two key platform components that have emerged from the CNCF are the Kubernetes container orchestration framework, originally contributed by Google, and the Envoy proxy for edge and service-to-service networking, originally donated by Lyft. Even when combined, the two specific technologies don’t provide a full Platform-as-a-Service (PaaS) offering that many developers want. However, Kubernetes and Envoy are being included within many PaaS-like offerings.

Many PaaS vendors, and also end-user engineering teams, are treating these technologies as the “data plane” for cloud native systems: i.e. the part of the system that does the “heavy-lifting”, such as orchestrating containers and routing traffic based on Layer 7 metadata (such as HTTP URIs and headers, or MongoDB protocol metadata). Accordingly, a lot of innovation and commercial opportunities are focused on creating an effective “control plane”, which is where the end-user interacts with the technology, specifies configuration to be enacted by the data plane, and observes any metrics or logging.

The Kubernetes control plane is largely focused around a series of well-specified REST-like APIs (known simply as “the Kubernetes API”), and the associated ‘kubectl’ CLI tool provides a human-friendly abstraction over these APIs. The Envoy v1 control plane was initially based around JSON config loaded within files, with several loosely-defined APIs that allowed selective updating. These APIs have subsequently evolved into the Envoy v2 API, which provides a series of gRPC-based APIs that are strongly typed via the use of Protocol Buffers. However, initially there wasn’t an Envoy analogy to the Kubernetes kubectl tool, and this led to challenges in adoption by some teams. Where there are challenges, though, there are also opportunities within the implementation of a human-friendly control plane.

“Service Mesh-all-the-things”...Maybe?

If we focus on the networking control plane, it would be hard to miss the emergence of the concept of the “service mesh”. Technologies like Istio, Linkerd and Consul Connect aim to manage cross-cutting service-to-service (“east-west”) traffic within microservice-based systems. Indeed, Istio itself is effectively a control plane that enables a user to manage Envoy Proxy as the underlying data plane for managing Layer 7 networking traffic across the mesh. Linkerd offers its own (now Rust-based) proxy as the data plane, and Consul Connect offers both a bespoke proxy and, more recently, support for Envoy.

Istio architecture, showing the Envoy Proxy data plane at the top half of the diagram, and the control plane below (image courtesy of Istio documentation)

The important thing to remember with a service mesh is the assumption that you typically exert a high degree of ownership and control over both parties communicating over the mesh. For example, two services may be built by separate engineering departments but they will typically work for the same organisation, or one service may be a third-party application but it is deployed within your trusted network boundary (which may span multiple data centers or Virtual Private Clouds). Here your operations team will typically agree on sensible communication defaults, and service teams will independently configure inter-service routing. In these scenarios you may not fully trust each service, and you most certainly will want to implement protections like rate limiting and circuit breaking, but fundamentally you can investigate and change any bad behaviour detected. This is not true, however, for managing edge or ingress (“north-south”) traffic that originates from outside your network boundary.

Cluster “ingress” traffic generally originates from sources outside of your direct control

Any communication originating from outside your trusted network can be from a bad actor, with motivations that are intentional (e.g. cyber criminals) or otherwise (e.g. a broken client library within a mobile app), and therefore you must put appropriate defences in place. Here the operations team will specify sensible system defaults, and also adapt these in real-time based on external events. In addition to rate limiting, you probably also want the ability to configure global and API-specific load shedding, for example, if the backend services or datastores become overwhelmed, and also implement DDoS protection (which may also be time- or geographically-specific). Service development teams also want access to the edge to configure routing for a new API, to test or release a new service via traffic shadowing or canary releasing, or other tasks.
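To make the rate-limiting discussion above concrete, here is a minimal token-bucket sketch of the kind of protection an edge proxy applies to untrusted ingress traffic. This is purely illustrative -- the class name and parameters are hypothetical, and neither Ambassador nor Envoy implements rate limiting exactly this way.

```python
import time

class TokenBucket:
    """Allow short bursts up to `capacity`, refilled at `rate` tokens/second."""

    def __init__(self, rate, capacity):
        self.rate = rate          # tokens added per second
        self.capacity = capacity  # maximum burst size
        self.tokens = capacity
        self.last = time.monotonic()

    def allow(self):
        # Refill tokens proportionally to elapsed time, then spend one if possible.
        now = time.monotonic()
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

bucket = TokenBucket(rate=5, capacity=10)
print(all(bucket.allow() for _ in range(10)))  # a burst of 10 is allowed
print(bucket.allow())                          # the 11th immediate request is shed
```

In a real deployment this kind of policy would be applied globally by the operations team and tuned per-API, as described above.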

As a quick aside, for further discussion on the (sometimes confusing) role of API gateways, Christian Posta has recently published an interesting blog post, “API Gateways Are Going Through an Identity Crisis”. I have also written articles about the role of an API gateway during a cloud/container migration or digital transformation, and how API gateways can be integrated with modern continuous delivery patterns.

Although at first glance these service mesh and edge/API gateway use cases may appear very similar, we believe there are subtle (and not so subtle) differences, and this impacts the design of the associated inter-service and edge control planes.

Designing an Edge Control Plane

The choice of control plane is influenced heavily by the scope of control required, and by the persona(s) of the primary people using it. My colleague Rafael Schloming has talked about this before at QCon San Francisco, where he discussed how the requirement to centralise or decentralise control, and the development/operation lifecycle stage that a service is currently in (prototype, mission critical, etc.), impact the implementation of the control plane.

As mentioned above, taking an edge proxy control plane as the example, a centralised operations or SRE team may want to specify globally sensible defaults and safeguards for all ingress traffic. However, the (multiple) decentralised product development teams working at the front line and releasing functionality will want fine-grained control for their services in isolation, and potentially (if they are embracing the “freedom and responsibility” model) the ability to override global safeguards locally.

A conscious choice made by the Ambassador community was that the primary persona targeted by the Ambassador control plane is the developer or application engineer, and therefore the focus of the control plane was on decentralised configuration. Ambassador was built to be Kubernetes-specific, and so a logical choice for specifying edge configuration was close to the Kubernetes Service specifications that were contained within YAML files and loaded into Kubernetes via kubectl.

Options for specifying Ambassador configuration included using the Kubernetes Ingress object, writing custom Kubernetes annotations or defining Custom Resource Definitions (CRDs). Ultimately the use of annotations was chosen, as they were simple and presented a minimal learning curve for the end-user. Using Ingress may have appeared to be the most obvious first choice, but unfortunately the specification for Ingress has been stuck in perpetual beta, and other than the “lowest common denominator” functionality for managing ingress traffic, not much else has been agreed upon.

An example of an Ambassador annotation that demonstrates a simple endpoint-to-service routing on a Kubernetes Service can be seen here:

kind: Service
apiVersion: v1
metadata:
  name: my-service
  annotations:
    getambassador.io/config: |
      ---
      apiVersion: ambassador/v0
      kind: Mapping
      name: my_service_mapping
      prefix: /my-service/
      service: my-service
spec:
  selector:
    app: MyApp
  ports:
  - protocol: TCP
    port: 80
    targetPort: 9376

The configuration within the getambassador.io/config annotation should be relatively self-explanatory to anyone who has configured an edge proxy, reverse proxy or API gateway before. Traffic sent to the prefix endpoint will be “mapped” or routed to the “my-service” Kubernetes service. As this article is primarily focused on the design and implementation of Ambassador, we won’t cover all of the functionality that can be configured, such as routing (including traffic shadowing), canarying (with integration with Prometheus for monitoring) and rate limiting. Although Ambassador is focused on the developer persona, there is also extensive support for operators, and centralised configuration can be specified for authentication, TLS/SNI, tracing and service mesh integration.

Let’s now turn our attention back onto the evolution of Ambassador over the past two years.

Ambassador < v0.40: Envoy v1 APIs, Templating, and Hot Restarts

Ambassador itself is deployed within a container as a Kubernetes service, and uses the annotations added to Kubernetes Services as its core configuration model. This approach enables application developers to manage routing as part of their Kubernetes service definition workflow (perhaps as part of a “GitOps” approach). Translating the simple Ambassador annotation config into valid Envoy v1 config is not a trivial task. By design, Ambassador’s configuration isn’t based on the same conceptual model as Envoy’s configuration -- we deliberately wanted to aggregate and simplify operations and config -- and therefore, a fair amount of logic within Ambassador translates between one set of concepts and the other.

Specifically, when a user applies a Kubernetes manifest containing Ambassador annotations, the following steps occur:

1. Ambassador is asynchronously notified by the Kubernetes API of the change.
2. Ambassador translates the configuration into an abstract intermediate representation (IR).
3. An Envoy configuration file is generated from the IR.
4. The Envoy configuration file is validated by Ambassador (using Envoy in validation mode).
5. Assuming the file is valid configuration, Ambassador uses Envoy's hot restart mechanism to deploy the new configuration and properly drain connections.
6. Traffic flows through the restarted Envoy process.
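The translation steps above (annotation, to IR, to Envoy config) can be sketched in a few lines of Python. All of the function names and data shapes here are illustrative assumptions for exposition -- Ambassador's actual IR and generated Envoy configuration are considerably richer.

```python
def annotations_to_ir(mappings):
    """Translate Ambassador 'Mapping' annotations into a simple
    intermediate representation (IR) of route groups."""
    return {
        "groups": [
            {
                "name": m["name"],
                "match_prefix": m["prefix"],
                "upstream": m["service"],
            }
            for m in mappings
        ]
    }

def ir_to_envoy_config(ir):
    """Generate an Envoy-style route configuration from the IR.

    A real implementation would also emit listeners, clusters, filters,
    etc., and then validate the result before deploying it."""
    routes = [
        {"prefix": g["match_prefix"], "cluster": g["upstream"]}
        for g in ir["groups"]
    ]
    return {"route_config": {"routes": routes}}

# The Mapping from the earlier YAML example:
mappings = [{"name": "my_service_mapping",
             "prefix": "/my-service/",
             "service": "my-service"}]
envoy_config = ir_to_envoy_config(annotations_to_ir(mappings))
print(envoy_config["route_config"]["routes"][0]["cluster"])  # my-service
```

The point of the intermediate step is that Ambassador's user-facing concepts and Envoy's configuration model can evolve independently, with the IR absorbing the differences.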

There were many benefits with this initial implementation: the mechanics involved were fundamentally simple, the transformation of Ambassador config into Envoy config was reliable, and the file-based hot restart integration with Envoy was dependable.

However, there were also notable challenges with this version of Ambassador. First, although the hot restart was effective for the majority of use cases, it was not very fast, and some users (particularly those with large application deployments) found it was limiting the frequency with which they could change their configuration. Hot restart can also inappropriately drop connections, especially long-lived connections like WebSockets or gRPC streams.

More crucially, though, the first implementation of the Ambassador-to-Envoy intermediate representation (IR) allowed rapid prototyping but was primitive enough that it proved very difficult to make substantial changes. While this was a pain point from the beginning, it became a critical issue as Envoy shifted to the Envoy v2 API. It was clear that the v2 API would offer Ambassador many benefits -- as Matt Klein outlined in his blog post, “The universal data plane API” -- including access to new features and a solution to the connection-drop problem noted above, but it was also clear that the existing IR implementation was not capable of making the leap.

Ambassador Now: Envoy v2 APIs (with ADS), Intermediate Representations, and Testing with KAT

In consultation with the Ambassador community, the Datawire team (stewarded by Flynn, lead engineer for Ambassador) undertook a redesign of the internals of Ambassador in 2018. This was driven by two key goals. First, we wanted to integrate Envoy’s v2 configuration format, which would enable the support of features such as Server Name Indication (SNI), label-based rate limiting, and improved authentication. Second, we also wanted to do much more robust semantic validation of Envoy configuration, due to its increasing complexity (which was particularly apparent when configuring Envoy for use with large-scale application deployments).

We started by restructuring the Ambassador internals more along the lines of a multipass compiler. The class hierarchy was made to more closely mirror the separation of concerns between the Ambassador configuration resources, the IR, and the Envoy configuration resources. Core parts of Ambassador were also redesigned to facilitate contributions from the community outside Datawire. We decided to take this approach for several reasons. First, Envoy Proxy is a very fast moving project, and we realised that we needed an approach where a seemingly minor Envoy configuration change didn’t result in days of reengineering within Ambassador. In addition, we wanted to be able to provide semantic verification of configuration.

As we started working more closely with Envoy v2, a testing challenge was quickly identified. As more and more features were being supported in Ambassador, more and more bugs appeared in Ambassador’s handling of less common but completely valid combinations of features. This drove a new testing requirement: Ambassador’s test suite needed to be reworked to automatically manage many combinations of features, rather than relying on humans to write each test individually. Moreover, we wanted the test suite to be fast in order to maximise engineering productivity.

This meant that as part of the Ambassador re-architecture, we also created the Kubernetes Acceptance Test (KAT) framework. KAT is an extensible test framework that:

1. Deploys a number of services (along with Ambassador) to a Kubernetes cluster
2. Runs a series of verification queries against the spun-up APIs
3. Performs a set of assertions against those query results

KAT is designed for performance -- it batches test setup upfront, and then runs all the verification queries asynchronously with a high-performance HTTP client. The traffic driver in KAT runs locally using another of our open source tools, Telepresence, which makes it easier to debug issues.
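The combinatorial idea behind KAT -- generating test cases from feature combinations upfront, then running the queries concurrently -- can be illustrated with a short sketch. The feature names and the harness shape here are hypothetical; the real KAT framework is considerably more capable.

```python
import asyncio
import itertools

# Hypothetical boolean feature axes; each combination becomes a test case.
FEATURES = {
    "tls": [False, True],
    "auth": [False, True],
    "rate_limit": [False, True],
}

def generate_cases():
    """Expand every combination of features into a test case dict."""
    keys = sorted(FEATURES)
    for values in itertools.product(*(FEATURES[k] for k in keys)):
        yield dict(zip(keys, values))

async def run_case(case):
    # A real harness would issue an HTTP verification query against the
    # deployed services here; we simulate an immediately-passing check.
    await asyncio.sleep(0)
    return (case, "ok")

async def main():
    # Run all generated cases concurrently rather than one at a time.
    return await asyncio.gather(*(run_case(c) for c in generate_cases()))

results = asyncio.run(main())
print(len(results))  # 8 cases for three boolean features
```

Generating the matrix mechanically is what removes the reliance on humans to remember every valid feature combination.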

With the KAT test framework in place, we quickly ran into some issues with Envoy v2 configuration and hot restart, which presented the opportunity to switch to using Envoy’s Aggregated Discovery Service (ADS) APIs instead of hot restart. This completely eliminated the requirement for a process restart upon configuration changes, which previously we had found could lead to dropped connections under high load or with long-lived connections. We decided to use the Envoy Go control plane to interface with the ADS. This did, however, introduce a Go-based dependency to the previously predominantly Python-based Ambassador codebase.
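Conceptually, the ADS model replaces "write a file and hot restart" with "hold a versioned snapshot of configuration and stream changes to the proxy". The sketch below illustrates that idea only; it is not the go-control-plane API, and the class and method names are invented for exposition.

```python
class SnapshotCache:
    """Versioned per-node configuration snapshots, in the spirit of an
    ADS-style control plane: the proxy only receives a new snapshot when
    the version it knows about is out of date."""

    def __init__(self):
        self._snapshots = {}  # node_id -> (version, config)

    def set_snapshot(self, node_id, version, config):
        self._snapshots[node_id] = (version, config)

    def fetch(self, node_id, known_version):
        """Return a (version, config) pair only if it differs from what
        the proxy already has; otherwise return None (no update needed)."""
        version, config = self._snapshots.get(node_id, (None, None))
        if version != known_version:
            return version, config
        return None

cache = SnapshotCache()
cache.set_snapshot("envoy-1", "v1", {"routes": ["/my-service/"]})
print(cache.fetch("envoy-1", known_version=None))   # new snapshot delivered
print(cache.fetch("envoy-1", known_version="v1"))   # None: already up to date
```

Because updates flow over a long-lived stream rather than through a process restart, existing connections (including WebSockets and gRPC streams) survive configuration changes.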

With a new test framework, a new IR generating valid Envoy v2 configuration, and the ADS, the major architectural changes in Ambassador 0.50 were complete. Now when a user applies a Kubernetes manifest containing Ambassador annotations, Ambassador is notified by the Kubernetes API, translates the configuration into the IR, generates valid Envoy v2 configuration from it, and delivers the new configuration to Envoy via the ADS -- with no process restart required.

Just before release we hit one more issue. On the Azure Kubernetes Service, Ambassador annotation changes were no longer being detected. Working with the highly-responsive AKS engineering team, we were able to identify the issue -- namely, the Kubernetes API server in AKS is exposed through a chain of proxies that was dropping some requests. The proper mitigation for this was to support calling the FQDN of the API server, which is provided through a mutating webhook in AKS. Unfortunately, support for this feature was not available in the official Kubernetes Python client. We therefore elected to switch to the Kubernetes Golang client -- introducing yet another Go-based dependency.

Key Takeaways from Building an Envoy Control Plane (Twice!)

As Matt Klein mentioned at the inaugural EnvoyCon, with the current popularity of the Envoy Proxy in the cloud native technology domain, it’s often easier to ask who isn’t using Envoy. We know that Google’s Istio has helped raise the profile of Envoy with Kubernetes users, and all of the other major cloud vendors are investing in Envoy, for example, within AWS App Mesh and Azure Service Fabric Mesh. At EnvoyCon we also heard how several big players such as eBay, Pinterest and Groupon are migrating to using Envoy as their primary edge proxy. There are also several other open source Envoy-based edge proxy control planes emerging, such as Istio Gateway, Solo.io Gloo, and Heptio Contour. I would argue that Envoy is indeed becoming the universal data plane of cloud native communications, but there is much work still to be done within the domain of the control plane.

In this article we’ve discussed how the Datawire team and Ambassador open source community have successfully migrated the Ambassador edge control plane to use the Envoy v2 configuration and ADS APIs. We’ve learned a lot in the process of building Ambassador 0.50, and we are keen to highlight our key takeaways as follows:

Kubernetes and Envoy are very powerful frameworks, but they are also extremely fast moving targets -- there is sometimes no substitute for reading the source code and talking to the maintainers (who are fortunately all quite accessible!)

The best supported libraries in the Kubernetes / Envoy ecosystem are written in Go. While we love Python, we have had to adopt Go so that we’re not forced to maintain too many components ourselves.

Redesigning a test harness is sometimes necessary to move your software forward, and the real cost in doing so is often in porting your old tests to the new harness implementation.

Designing (and implementing) an effective control plane for the edge proxy use case has been challenging, and the feedback from the open source community around Kubernetes, Envoy and Ambassador has been extremely useful.

Migrating Ambassador to the Envoy v2 configuration and ADS APIs was a long and difficult journey that required lots of architecture and design discussions, and plenty of coding, but early feedback from the community has been positive. Ambassador 0.50 is available now, so you can take it for a test run and share your feedback with the community on our Slack channel or on Twitter.

About the Author

Daniel Bryant is leading change within organisations and technology, and currently works as a freelance consultant, of which Datawire is a client. His current work includes enabling agility within organisations by introducing better requirement gathering and planning techniques, focusing on the relevance of architecture within agile development, and facilitating continuous integration/delivery. Daniel’s current technical expertise focuses on ‘DevOps’ tooling, cloud/container platforms and microservice implementations. He is also a leader within the London Java Community (LJC), contributes to several open source projects, writes for well-known technical websites such as InfoQ, DZone and Voxxed, and regularly presents at international conferences such as QCon, JavaOne and Devoxx.