On September 14, 2016, we announced Envoy, our L7 proxy and communication bus. In a nutshell, Envoy is a “service mesh” substrate that provides common utilities such as service discovery, load balancing, rate limiting, circuit breaking, stats, logging, and tracing to polyglot (heterogeneous) application architectures.

We knew that we had built a compelling product that was central to Lyft’s ability to scale its service-oriented architecture; however, we were surprised by the industry-wide interest in Envoy following its launch. It’s been an exciting (and overwhelming) 7 months! We are thrilled by the positive reception and wide uptake Envoy has since received.

Why all the interest?

As it turns out, almost every company with a moderately sized service-oriented architecture (SoA) is having the same problems that Lyft did prior to the development and deployment of Envoy:

An architecture composed of a variety of languages, each containing a half-baked RPC library, including partial (or zero) implementations of rate limiting, circuit breaking, timeouts, retries, etc.

Differing or partial implementations of stats, logging, and tracing across both owned services and infrastructure components such as ELBs.

A desire to move to SoA for the decompositional scaling benefits, but an on-the-ground reality of chaos as application developers struggle to make sense of an inherently unreliable network substrate.

In summary: an operational and reliability headache.

Though Envoy contains an abundance of features, the industry appears to view the following design points as the most compelling:

High performance native code implementation: Like it or not, most large organizations still have a “performance checkmark” for system components like sidecar proxies, which can only be satisfied by native code, especially regarding CPU usage, memory usage, and tail latency properties. Historically, HAProxy and NGINX (including the paid Plus version) have dominated this category. HAProxy has not sustained the feature velocity required for a modern service mesh, and so is starting to fall by the wayside. NGINX has focused most of their development efforts in this space on their paid Plus product. Furthermore, NGINX is known to have a somewhat opaque development process. These points have culminated in a desire within the industry for a community-first, high performance, well-designed and extensible modern native code proxy. This desire was much larger than we realized when we first open sourced Envoy, and Envoy fills the gap.

Eventually consistent service discovery: Historically, most SoAs have used fully consistent service discovery systems that are hard to run at scale. Envoy treats service discovery as eventually consistent and lossy. At Lyft, this has led to extremely high reliability without the maintenance headache of systems typically used for this purpose, such as etcd and ZooKeeper.

API driven configuration: Fundamentally, we view Envoy as a universal dataplane for SoAs. However, every deployment is different and it makes little sense to be opinionated about all of the ancillary components that are required for Envoy to function. To this end, we are clearly documenting all of the APIs that Envoy uses to interact with control plane components and other services. For example, Envoy documents and implements the Service Discovery Service (SDS), Cluster Discovery Service (CDS), and Route Discovery Service (RDS) REST APIs that can be implemented by management systems to dynamically configure Envoy. Other defined APIs include a global rate limiting service as well as client TLS authentication. More are on the way, including gRPC variants of the REST APIs. Using the published APIs, integrators can build systems that are simultaneously extremely complex and user friendly, tailored to a particular deployment. We have open sourced the discovery and ratelimit services that we use in production as reference implementations.

Filter based L4 core: Envoy is an L4 (TCP) proxy with an extensible filter chain mechanism. This allows it to be used for a variety of use cases, including transparent TLS proxying (stunnel replacement), MongoDB sniffing, Redis proxying, as well as complex HTTP-based filtering and routing. We look forward to community contributions that add support for different protocols.

HTTP/2 first: Envoy was designed from the start to be a transparent HTTP/1 to HTTP/2 proxy in both directions. Most production proxies still do not have this capability, which means that they cannot be used for gRPC, the increasingly popular RPC protocol from Google.

It’s all about observability: From an operational and reliability standpoint, having consistent observability within an SoA is by far the most important objective to achieve. A deployed Envoy mesh immediately provides consistent stats, logs, and traces that transform SoA networking from an intractable problem to something that can be reasoned about and acted upon.
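
To make the “eventually consistent and lossy” service discovery point above more concrete, here is a minimal Python sketch. It is not Envoy’s implementation. It assumes the behavior described in Envoy’s architecture documentation: the proxy combines discovery data with its own active health checking, and forgets a host only when that host is both absent from discovery and failing health checks. The names `Host` and `reconcile` are hypothetical and exist only for this illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Host:
    address: str
    port: int


def reconcile(known, discovered, healthy):
    """Merge a possibly stale or partial discovery snapshot with local health checks.

    known:      hosts the proxy currently tracks
    discovered: hosts listed in the latest discovery response (may be lossy)
    healthy:    hosts currently passing the proxy's own active health checks

    Returns (tracked, routable) as two sets of hosts.
    """
    tracked = set()
    for host in known | discovered:
        # Assumption: a host is dropped only when discovery no longer lists it
        # AND its health checks are failing. Either signal alone keeps it.
        if host in discovered or host in healthy:
            tracked.add(host)
    # Traffic is sent only to tracked hosts that pass health checks.
    routable = tracked & healthy
    return tracked, routable


a = Host("10.0.0.1", 8080)
b = Host("10.0.0.2", 8080)
c = Host("10.0.0.3", 8080)

# Discovery has lost track of b and c; b still passes health checks, c does not.
tracked, routable = reconcile(known={a, b, c}, discovered={a}, healthy={a, b})
print(sorted(h.address for h in tracked))   # ['10.0.0.1', '10.0.0.2']
print(sorted(h.address for h in routable))  # ['10.0.0.1', '10.0.0.2']
```

In this sketch a flaky or partial discovery response does not take a healthy host out of rotation, which is why the discovery system itself does not need to be fully consistent.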

Partnerships and ecosystem

When we made the decision to open source Envoy, we were committed to developing a community and an ecosystem; however, we had no idea if anyone would show up! As it turns out, companies both large and small have arrived in substantial numbers, drawn by all of the reasons laid out above. Ultimately, it’s our hope that Envoy is a powerful tool for meeting the SoA needs of many organizations — not just Lyft.

We are excited to announce that we are working in partnership with both Google and IBM to bring Envoy to Kubernetes. Fun fact: there are now more people working on Envoy at Google than there are at Lyft! We have a lot of other things planned with Google that we will be able to share more about in the coming months.

We are eager to move Envoy forward within the open source community. In only 7 months Envoy has amassed over 40 contributors, with substantial contributions from both Google and IBM. Next week, we will be providing Google with commit access, making Envoy a true multi-organizational project. Over the next 6 months we expect to see many more public announcements about large companies using Envoy, startups beginning to coalesce around offering commercial Envoy integrations, as well as other large companies joining the project as full committers. It’s very rewarding for us to see the momentum around this project and we are excited to discover what the future holds (potentially including donation to the CNCF).

Roadmap

We have an ambitious roadmap planned over the coming months between Lyft, Google, and IBM. Some of the major features include:

Sharded Redis support with cluster discovery, consistent hashing, health checking, and self healing.

Full end-to-end flow control across both HTTP/1 and HTTP/2.

IPv6 support.

Further investment in centralized control plane APIs, including a Listener Discovery Service (LDS) API which will allow fully dynamic filter instantiation, a Health Discovery Service (HDS) API which will allow Envoy to be used as a distributed health checker, and load/failure reporting APIs which will allow Envoy runtime data to be fed back into global control plane and load balancing systems.

Bidirectional streaming gRPC variants of all of the control plane APIs, allowing for higher performance, a strongly typed IDL, and faster reaction times to updates.

Zipkin tracing support.

Further dynamic outlier detection including latency variance analysis.

Tighter integration with gRPC including gRPC to JSON transcoding.

More expressive global rate limiting via IP tagging.

More rigorous stress, performance, and fuzz testing.

Please reach out to us and let us know if there are other items that you would like to see!
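
For readers unfamiliar with the consistent hashing mentioned in the sharded Redis roadmap item, the following is a generic hash ring sketch in Python. It is not Envoy’s design, which was still on the roadmap at the time of writing. The `HashRing` class, the use of MD5, the virtual node count, and the `redis-*` node names are all illustrative assumptions.

```python
import bisect
import hashlib


def _hash(key: str) -> int:
    # Take the first 8 bytes of an MD5 digest as a 64-bit position on the ring.
    return int.from_bytes(hashlib.md5(key.encode()).digest()[:8], "big")


class HashRing:
    def __init__(self, nodes, vnodes=100):
        if not nodes:
            raise ValueError("at least one node is required")
        # Each node is placed on the ring many times to even out the load.
        self._ring = sorted(
            (_hash(f"{node}#{i}"), node) for node in nodes for i in range(vnodes)
        )
        self._points = [point for point, _ in self._ring]

    def lookup(self, key: str) -> str:
        # Walk clockwise to the first point after the key's hash, wrapping around.
        idx = bisect.bisect_right(self._points, _hash(key)) % len(self._ring)
        return self._ring[idx][1]


keys = [f"user:{i}" for i in range(1000)]
before = HashRing(["redis-a", "redis-b", "redis-c"])
after = HashRing(["redis-a", "redis-b"])

# Removing a node only remaps the keys that lived on that node.
moved = [k for k in keys if before.lookup(k) != after.lookup(k)]
assert all(before.lookup(k) == "redis-c" for k in moved)
```

The property checked at the end is the reason consistent hashing suits sharded caches: when a shard is removed, keys owned by the remaining shards stay where they were.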

Moving forward

Before we open sourced Envoy 7 months ago, never in our wildest dreams would we have imagined that we might have the chance to start a project that has the potential to become a building block of the modern internet. We were simply trying to meet the needs at Lyft as best we could. The trajectory that Envoy is now on is thrilling and daunting at the same time. There is still a ton of work to do to make service mesh networking transparently available, and it will take the efforts of many talented developers and organizations to bring Envoy’s full potential to fruition.

The next 6–12 months are likely to see a further increase in community interest, commercial availability, and adoption. If we haven’t heard from you yet and you are interested in learning more about Envoy or participating in its development, please reach out via GitHub, email, or Gitter. Onward!