Today we are incredibly excited to open source Envoy, our high performance C++ distributed proxy and communication bus designed for large service oriented architectures. The project was born out of the belief that:

The network should be transparent to applications. When network and application problems do occur, it should be easy to determine the source of the problem.

Envoy runs on every host and abstracts the network by providing common features (load balancing, circuit breaking, service discovery, etc.) in a platform-agnostic manner. When all service traffic in an infrastructure flows via an Envoy mesh, it becomes easy to visualize problem areas, tune overall performance, and add substrate features in a single place.

Use at Lyft

Envoy has been in development at Lyft for around 1.5 years. Before Envoy existed, Lyft’s networking setup was fairly standard for a company of our size. We used Amazon’s ELBs for service discovery and load balancing, and a mishmash of different libraries across both PHP and Python. In a few places we deployed HAProxy for increased performance.

At the time, we had about 30 services and even at that level of scale we faced continuous issues with sporadic networking and service call failures, to the extent that most developers were afraid to have high volume service calls in critical paths. It was incredibly difficult to understand where the problems were occurring. In the service code? In EC2 networking? In the ELB? Who knew? We relied on whatever statistics each application and HAProxy provided, as well as the extremely primitive CloudWatch ELB statistics and logging.

Envoy is influenced by years of experience observing how different companies attempt to make sense of a confusing situation. Initially we used it as our front proxy, and gradually replaced our usage of ELBs across the infrastructure with direct mesh connections and local Envoys running on every service node.

About the project

In practice, achieving complete network transparency is difficult. Envoy attempts to do so by providing the following high level features:

Out of process architecture: Envoy is a self contained process that is designed to run alongside every application server. All of the Envoys form a transparent communication mesh in which each application sends and receives messages to and from localhost and is unaware of the network topology. The out of process architecture has two substantial benefits over the traditional library approach to service to service communication:

Envoy works with any application language. A single Envoy deployment can form a mesh between Java, C++, Go, PHP, Python, etc. It is becoming increasingly common for service oriented architectures to use multiple application frameworks and languages. Envoy transparently bridges the gap.

As anyone that has worked with a large service oriented architecture knows, deploying library upgrades can be painful. Envoy can be deployed and upgraded quickly across an entire infrastructure transparently.

Modern C++11 code base: Envoy is written in C++11. Native code was chosen because we believe that an architectural component such as Envoy should get out of the way as much as possible. Modern application developers already deal with tail latencies that are difficult to understand due to deployments in shared cloud environments and the use of very productive but not particularly well performing languages such as PHP, Python, Ruby, Scala, etc. Native code provides generally excellent latency properties that don’t add additional confusion to an already confusing situation. Unlike other native code proxy solutions written in C, C++11 provides both excellent developer productivity and performance.

L3/L4 filter architecture: At its core, Envoy is an L3/L4 network proxy. A pluggable filter chain mechanism allows filters to be written to perform different L3/L4 proxy tasks and inserted into the main server. Filters have already been written to support various tasks such as raw TCP proxy, HTTP proxy, TLS client certificate authentication, etc.
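
To make the pluggable filter idea concrete, here is a minimal sketch in Python (not Envoy's actual C++ API; the names and return values are hypothetical) of a chain in which each filter sees the raw bytes and can either continue or halt processing:

```python
from typing import Callable, List

# A filter inspects raw connection data and returns "Continue" to let
# the chain proceed or "StopIteration" to halt processing.
Filter = Callable[[bytes], str]

class FilterChain:
    """Illustrative sketch of a pluggable L3/L4 filter chain."""
    def __init__(self) -> None:
        self.filters: List[Filter] = []

    def add(self, f: Filter) -> None:
        self.filters.append(f)

    def on_data(self, data: bytes) -> bool:
        # Run filters in order; any filter can halt the chain.
        for f in self.filters:
            if f(data) == "StopIteration":
                return False
        return True

chain = FilterChain()
chain.add(lambda data: "Continue")  # e.g. a stats-collection filter
chain.add(lambda data: "StopIteration" if data.startswith(b"\x00") else "Continue")

assert chain.on_data(b"GET / HTTP/1.1") is True
assert chain.on_data(b"\x00garbage") is False
```

The point of the design is that tasks like TCP proxying, TLS client certificate authentication, and HTTP handling are all just filters plugged into the same chain.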

HTTP L7 filter architecture: HTTP is such a critical component of modern application architectures that Envoy supports an additional HTTP L7 filter layer. HTTP filters that perform different tasks such as buffering, rate limiting, routing/forwarding, and sniffing Amazon’s DynamoDB can be plugged into the HTTP connection management subsystem.
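
As one example of what an HTTP filter might do, a rate limiting filter is often built on a token bucket. The sketch below is illustrative only (it is not Envoy's implementation), with an injectable clock so the behavior is deterministic:

```python
import time

class TokenBucket:
    """Token bucket: tokens refill at a fixed rate up to a burst
    capacity; each allowed request consumes one token. Illustrative
    sketch only, not Envoy's rate limiting code."""
    def __init__(self, rate_per_sec: float, burst: int, now=time.monotonic):
        self.rate = rate_per_sec
        self.capacity = burst
        self.tokens = float(burst)
        self.now = now
        self.last = now()

    def allow(self) -> bool:
        t = self.now()
        # Refill proportionally to elapsed time, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (t - self.last) * self.rate)
        self.last = t
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False

# With a burst of 2 and no elapsed time, the third request is limited.
clock = [0.0]
tb = TokenBucket(rate_per_sec=1.0, burst=2, now=lambda: clock[0])
assert tb.allow() and tb.allow()
assert not tb.allow()
clock[0] = 1.0  # one second later, one token has refilled
assert tb.allow()
```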

First class HTTP/2 support: When operating in HTTP mode, Envoy supports both HTTP/1.1 and HTTP/2. Envoy can operate as a transparent HTTP/1.1 to HTTP/2 proxy in both directions. This means that any combination of HTTP/1.1 and HTTP/2 clients and target servers can be bridged. Our recommended service to service configuration uses HTTP/2 between all Envoys to create a mesh of persistent connections that requests and responses can be multiplexed over.

HTTP L7 routing: When operating in HTTP mode, Envoy supports a routing subsystem that is capable of routing and redirecting requests based on path, authority, content type, runtime values, etc. This functionality is most useful when using Envoy as a front/edge proxy but is also leveraged when building a service to service mesh.
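
Conceptually, this kind of routing is first-match-wins over a route table. The following Python sketch uses hypothetical field names (`authority`, `path_prefix`, `cluster`) purely for illustration; it is not Envoy's route configuration format:

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Route:
    authority: str     # virtual host to match (hypothetical name)
    path_prefix: str   # path prefix to match
    cluster: str       # upstream cluster for matching requests

class Router:
    """Sketch of first-match-wins L7 routing on authority and path."""
    def __init__(self, routes: List[Route]):
        self.routes = routes

    def route(self, authority: str, path: str) -> Optional[str]:
        for r in self.routes:
            if authority == r.authority and path.startswith(r.path_prefix):
                return r.cluster
        return None

router = Router([
    Route("api.example.com", "/users", "users-service"),
    Route("api.example.com", "/", "legacy-monolith"),
])
assert router.route("api.example.com", "/users/123") == "users-service"
assert router.route("api.example.com", "/rides") == "legacy-monolith"
assert router.route("other.example.com", "/") is None
```

Because more specific routes are listed first, an edge proxy can carve traffic off a monolith one path prefix at a time.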

GRPC support: GRPC is a new RPC framework from Google that uses HTTP/2 as the underlying multiplexed transport. Envoy supports all of the HTTP/2 features required to be used as the routing and load balancing substrate for GRPC requests and responses. The two systems are very complementary.

MongoDB L7 support: MongoDB is a popular database used in modern web applications. Envoy supports L7 sniffing, statistics production, and logging for MongoDB connections. MongoDB lacks decent hooks for observability, and at Lyft we have found the statistics Envoy produces are invaluable when running sharded MongoDB clusters in production. In summary, Envoy makes MongoDB far more web scale.

DynamoDB L7 support: DynamoDB is Amazon’s hosted key/value NoSQL datastore. Envoy supports L7 sniffing and statistics production for DynamoDB connections. Similar to Envoy’s MongoDB support, having a single source of statistics for all DynamoDB connections from any application platform has been invaluable at Lyft.

Service discovery: Service discovery is a critical component of service oriented architectures. Envoy supports multiple service discovery methods including asynchronous DNS resolution and REST based lookup via a service discovery service.
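
For the REST-based case, the proxy periodically fetches the host list for a service and parses it into addresses. The response shape below is hypothetical (illustrative field names, not Envoy's exact wire format):

```python
import json
from typing import List, Tuple

def parse_discovery_response(body: str) -> List[Tuple[str, int]]:
    """Parse a hypothetical REST service discovery response into
    (ip, port) pairs. Field names are illustrative only."""
    doc = json.loads(body)
    return [(h["ip_address"], h["port"]) for h in doc["hosts"]]

body = '{"hosts": [{"ip_address": "10.0.0.1", "port": 8080}, {"ip_address": "10.0.0.2", "port": 8080}]}'
assert parse_discovery_response(body) == [("10.0.0.1", 8080), ("10.0.0.2", 8080)]
```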

Health checking: The recommended way of building an Envoy mesh is to treat service discovery as an eventually consistent process. Envoy includes a health checking subsystem which can optionally perform active health checking of upstream service clusters. Envoy then uses the union of service discovery and health checking information to determine healthy load balancing targets.
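
The combination can be sketched as: a host receives traffic only when discovery knows about it and active health checks pass, while a stale or unavailable discovery source falls back to trusting health checking alone. This is illustrative logic, not Envoy's source:

```python
def healthy_targets(discovered: set, actively_healthy: set,
                    discovery_stale: bool = False) -> set:
    """Sketch of combining eventually consistent service discovery
    with active health checking (illustrative only)."""
    if discovery_stale:
        # Discovery is eventually consistent; if it is unavailable,
        # trust health checking alone rather than dropping all hosts.
        return actively_healthy
    return discovered & actively_healthy

discovered = {"10.0.0.1", "10.0.0.2", "10.0.0.3"}
healthy = {"10.0.0.1", "10.0.0.3", "10.0.0.4"}
assert healthy_targets(discovered, healthy) == {"10.0.0.1", "10.0.0.3"}
```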

Advanced load balancing: Load balancing among different components in a distributed system is a complex problem. Because Envoy is a self contained proxy instead of a library, it is able to implement advanced load balancing techniques in a single place and have them be accessible to any application. Currently Envoy includes support for automatic retries, circuit breaking, global rate limiting via our standalone Go/GRPC rate limiting service (to be open sourced shortly), and request shadowing. Future support is planned for automatic bad host outlier ejection and request racing.
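
To illustrate how retries and circuit breaking interact, here is a minimal sketch (not Envoy's implementation) of a concurrency-based circuit breaker gating automatic retries, so an overloaded upstream fails fast instead of queueing:

```python
class CircuitBreaker:
    """Illustrative concurrency-based circuit breaker: refuse new
    requests once too many are already in flight."""
    def __init__(self, max_concurrent: int):
        self.max_concurrent = max_concurrent
        self.in_flight = 0

    def try_acquire(self) -> bool:
        if self.in_flight >= self.max_concurrent:
            return False
        self.in_flight += 1
        return True

    def release(self) -> None:
        self.in_flight -= 1

def call_with_retries(send, breaker: CircuitBreaker, max_retries: int = 2) -> str:
    """Automatic retries, each attempt gated by the circuit breaker."""
    for _ in range(max_retries + 1):
        if not breaker.try_acquire():
            return "overflow"  # fail fast instead of queueing
        try:
            if send():
                return "ok"
        finally:
            breaker.release()
    return "failed"

breaker = CircuitBreaker(max_concurrent=1)
attempts = []
def flaky():
    attempts.append(1)
    return len(attempts) >= 2  # fails once, then succeeds

assert call_with_retries(flaky, breaker) == "ok"
assert len(attempts) == 2
```

Because the proxy sits out of process, policies like these apply uniformly to every application language rather than being reimplemented per library.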

Front/edge proxy support: Although Envoy is primarily designed as a service to service communication system, there is benefit in using the same software at the edge (observability, management, identical service discovery and load balancing algorithms, etc.). Envoy includes enough features to make it usable as an edge proxy for most modern web application use cases. This includes TLS termination (including client certificate support with pinning), HTTP/1.1 and HTTP/2 support, HTTP L7 routing, as well as raw TCP/SSL proxying. Envoy’s TLS support earns Lyft an “A” on the SSL Labs report.

Best in class observability: As stated above, the primary goal of Envoy is to make the network transparent. However, problems occur both at the network level and at the application level. Envoy includes robust statistics support for all subsystems. Envoy also supports distributed tracing via third party providers.

See the full Envoy documentation for a lot more information.

Discovery service

Today we are also open sourcing our discovery service. This is a reference implementation of the service discovery API that Envoy calls, written in Python with DynamoDB as the backing store.

Looking towards the future

Today, we run Envoy on thousands of nodes and over one hundred services, which in aggregate process over 2 million requests per second, powering every system at Lyft, whether real time or otherwise. At Lyft we use Envoy to proxy for Python, Go, C++, and PHP. The fact that we use the same software to proxy for every service means that we have consistent and reliable statistics for our entire distributed system. Developers can quickly see global network health, as well as the health of any individual service to service hop. Developers write their code without knowledge of network topology or whether it is running in development, staging, or production. And most importantly, developers are no longer scared to have service to service dependencies in high volume paths.

We believe that the problems that Envoy attempts to solve are faced by many different companies and organizations. We are extremely excited to share Envoy and build a community around its usage. In the near term, our roadmap includes additional rate limiting features (including open sourcing our global rate limiting service), advanced load balancing features such as request racing and outlier ejection, consistent hash load balancing, and Redis support, among others. Beyond that, we are eager to see what feature requests we get once the community starts to take a look.

Please reach out to us if you have questions about Envoy or are considering giving it a shot. We are happy to help and would love to hear from you.

Thanks

Envoy is the product of too many people to name here individually. From direct code contributions and design feedback to usage by a large number of developers, many people have contributed to the product that we are releasing today. A huge thanks to everyone.