Oliver Gould, Linkerd product lead and CTO of Buoyant, spoke at the QCon New York 2019 Conference last week about the Linkerd service mesh, with a focus on traffic management capabilities.

Gould started the presentation with some background on Twitter's Finagle library, which was, in a sense, the first service mesh. Finagle is an extensible RPC system for the JVM, which includes a data plane and a control plane that was operated along with ZooKeeper. He said that when you have a lot of services, a library-based solution is not a good approach, because upgrades are challenging to roll out.

The idea behind the service mesh concept is to pull these capabilities out of the application and into a separate layer that operates at Layer 7 of the OSI model. Linkerd version 0.1 was released in 2016, and the project was adopted by the Cloud Native Computing Foundation (CNCF) in 2017. Version 2.0 was released last September.

The team found that a JVM-based sidecar was too resource-heavy for some users and was also challenging to configure. Sidecars need to be lightweight, so the Linkerd 2.0 team used the Rust language for the data plane. Rust offers better performance and provides strong typing, which allows many bugs to be caught at compile time. There are no GC-related constraints, as Rust relies on RAII (Resource Acquisition Is Initialization): a resource is acquired when the value that owns it is created, and whenever that value goes out of scope, its destructor is called and its owned resources are freed.
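The scope-based cleanup described above can be sketched in a few lines of Rust. This is a minimal illustration, not code from linkerd2-proxy: the `Connection` type, its `open` function, and the shared event log are hypothetical names introduced here only to make the order of acquisition and release visible.

```rust
use std::cell::RefCell;
use std::rc::Rc;

/// Hypothetical resource that records its lifecycle in a shared log.
struct Connection {
    name: String,
    log: Rc<RefCell<Vec<String>>>,
}

impl Connection {
    /// Acquiring the resource happens when the value is created.
    fn open(name: &str, log: Rc<RefCell<Vec<String>>>) -> Self {
        log.borrow_mut().push(format!("open {}", name));
        Connection { name: name.to_string(), log }
    }
}

impl Drop for Connection {
    /// Runs automatically when the value goes out of scope.
    fn drop(&mut self) {
        self.log.borrow_mut().push(format!("close {}", self.name));
    }
}

/// Opens two nested connections and returns the recorded events in order.
fn lifecycle_events() -> Vec<String> {
    let log = Rc::new(RefCell::new(Vec::new()));
    {
        let _outer = Connection::open("outer", Rc::clone(&log));
        {
            let _inner = Connection::open("inner", Rc::clone(&log));
        } // `_inner` is dropped here, with no garbage collector involved
        log.borrow_mut().push("inner scope ended".to_string());
    } // `_outer` is dropped here
    let events = log.borrow().clone();
    events
}
```

Calling `lifecycle_events()` shows that each connection is closed at the end of the block that owns it, so release points are deterministic and visible in the source.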

Gould discussed the Linkerd 2.x architecture, which includes the Rust-based data plane (called linkerd2-proxy) and the control plane (linkerd2), which is written in Go and is where most of the activity happens. Linkerd also bundles other tools, such as Prometheus and Grafana, for monitoring and visualization. If you have a functioning Kubernetes (K8s) application, you can drop in Linkerd without any configuration to integrate it with your application.

Gould reiterated the goals of Linkerd, which are to move visibility, reliability, and security capabilities out of the application execution layer and into the infrastructure layer. Linkerd supports the following features under each of those three capabilities.

Visibility: Automatic golden metrics (success rates, latencies, and throughput)

Reliability: Load balancing, retries, timeouts, circuit breaking, and deadlines

Security: Transparent mTLS, certificate validation, and policy enforcement
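To make the golden metrics concrete, the sketch below computes a success rate, throughput, and latency percentiles from a window of observed requests. It is only an illustration of the arithmetic, not how Linkerd collects or stores metrics; the `Sample` and `GoldenMetrics` types, the function names, and the nearest-rank percentile method are all assumptions made for this example.

```rust
/// One observed request (hypothetical shape, for illustration only).
#[derive(Debug, Clone, Copy)]
struct Sample {
    ok: bool,
    latency_ms: u64,
}

#[derive(Debug, PartialEq)]
struct GoldenMetrics {
    success_rate: f64,
    requests_per_sec: f64,
    p50_ms: u64,
    p99_ms: u64,
}

/// Nearest-rank percentile over a non-empty, sorted slice; `p` is in (0, 1].
fn percentile(sorted: &[u64], p: f64) -> u64 {
    let rank = (p * sorted.len() as f64).ceil() as usize;
    sorted[rank.max(1).min(sorted.len()) - 1]
}

/// Summarizes the samples seen during a window of `window_secs` seconds.
/// Returns `None` when there is nothing to summarize.
fn golden_metrics(samples: &[Sample], window_secs: f64) -> Option<GoldenMetrics> {
    if samples.is_empty() || window_secs <= 0.0 {
        return None;
    }
    let total = samples.len() as f64;
    let successes = samples.iter().filter(|s| s.ok).count() as f64;
    let mut latencies: Vec<u64> = samples.iter().map(|s| s.latency_ms).collect();
    latencies.sort_unstable();
    Some(GoldenMetrics {
        success_rate: successes / total,
        requests_per_sec: total / window_secs,
        p50_ms: percentile(&latencies, 0.50),
        p99_ms: percentile(&latencies, 0.99),
    })
}
```

For five requests over a five-second window, four of which succeed, this yields a success rate of 0.8 and a throughput of one request per second.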

Observability is provided via Grafana. There is also a feature called Tap that can be used to inspect requests that match specific criteria. He cautioned that we need to be careful about what is exposed from the request, such as sensitive headers (for example, authorization data).
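The concern about exposing sensitive headers during request inspection can be illustrated with a small redaction step. This is a sketch under stated assumptions, not Linkerd's Tap implementation: `redact_headers`, the `<redacted>` placeholder, and the list of sensitive header names are hypothetical.

```rust
/// Returns a copy of `headers` that is safe to display: the values of headers
/// whose names appear in `sensitive` (compared case-insensitively) are masked.
fn redact_headers(headers: &[(String, String)], sensitive: &[&str]) -> Vec<(String, String)> {
    headers
        .iter()
        .map(|(name, value)| {
            let is_sensitive = sensitive.iter().any(|s| s.eq_ignore_ascii_case(name));
            let shown = if is_sensitive {
                "<redacted>".to_string()
            } else {
                value.clone()
            };
            (name.clone(), shown)
        })
        .collect()
}
```

The original headers are left untouched, so the request itself is forwarded unchanged while only the displayed copy is masked.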

Reliability features include latency-aware load balancing, along with configurable service retries and timeouts.
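One common way to make load balancing latency-aware is to keep an exponentially weighted moving average (EWMA) of each endpoint's response time and to prefer the cheaper of two candidate endpoints. The sketch below shows that idea only; it is not taken from linkerd2-proxy. The `Endpoint` type, the cost formula, and the smoothing factor `alpha` are assumptions, and the two candidate indices are passed in by the caller (who would normally pick them at random) so that the function stays deterministic.

```rust
/// Hypothetical per-endpoint state tracked by the balancer.
#[derive(Debug)]
struct Endpoint {
    ewma_ms: f64,
    in_flight: u32,
}

impl Endpoint {
    fn new(initial_ms: f64) -> Self {
        Endpoint { ewma_ms: initial_ms, in_flight: 0 }
    }

    /// Folds a new latency observation into the moving average.
    /// `alpha` in (0, 1] controls how quickly old observations fade.
    fn observe(&mut self, latency_ms: f64, alpha: f64) {
        self.ewma_ms = alpha * latency_ms + (1.0 - alpha) * self.ewma_ms;
    }

    /// Estimated cost of sending one more request to this endpoint.
    fn cost(&self) -> f64 {
        self.ewma_ms * (self.in_flight as f64 + 1.0)
    }
}

/// Returns the index of the cheaper of the two candidate endpoints `a` and `b`.
fn pick_of_two(endpoints: &[Endpoint], a: usize, b: usize) -> usize {
    if endpoints[a].cost() <= endpoints[b].cost() { a } else { b }
}
```

An endpoint that starts responding slowly sees its average rise and receives less traffic, while an endpoint with many requests in flight is penalized even if its recent latency looks good.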

Security features include mutual cryptographic identity and TLS for ingress and egress. Security is transparent and on by default, and it bootstraps using Kubernetes Service Accounts.

Next, Gould talked about the "Trough of Disillusionment" for service mesh technologies. A service mesh can suffer from problems such as developers not being able to get it working at all, and the mesh trying to do too many things at once. In addition, if anything in the system goes wrong, the service mesh may be blamed by default. To address some of these issues, the Linkerd 2.0 team added several "check" (verify) commands that confirm the Linkerd setup is correct.

Gould discussed the importance of traffic management when using a service mesh. The new Service Mesh Interface (SMI) specification is a common standard API for service meshes. It covers the most common service mesh capabilities, such as traffic management, traffic policy, and traffic telemetry. SMI is intended to be the common integration layer for participants in the service mesh landscape.

He concluded the presentation with details on the Linkerd roadmap. Version 2.3 will support telemetry, retries, timeouts, auto-inject, and mTLS on by default. Release 2.4 will offer traffic shifting (blue-green, canaries) and install split. Other features are planned for future releases, including distributed tracing and mesh expansion, which allows the service mesh to work across multiple clusters.
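Traffic shifting for a canary release usually comes down to a weighted choice between backends. The sketch below illustrates that choice in isolation; it is not Linkerd's or SMI's implementation. The `choose_backend` function, the backend names, and the weights are hypothetical, and `roll` stands in for a random number that the caller would supply.

```rust
/// Picks a backend from `(name, weight)` pairs. `roll` stands in for a random
/// number supplied by the caller, which keeps this function deterministic.
/// Returns `None` when the weights sum to zero.
fn choose_backend<'a>(backends: &[(&'a str, u32)], roll: u32) -> Option<&'a str> {
    let total: u32 = backends.iter().map(|(_, w)| *w).sum();
    if total == 0 {
        return None;
    }
    // Walk the cumulative weights until the roll falls inside a backend's share.
    let mut remaining = roll % total;
    for (name, weight) in backends {
        if remaining < *weight {
            return Some(*name);
        }
        remaining -= *weight;
    }
    None
}
```

With weights of 90 and 10, one in ten rolls lands on the canary. Moving the weights step by step toward 0 and 100 completes the rollout, and switching them in a single step gives a blue-green cutover.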

If you are interested in learning more about SMI and Linkerd, check out this Q&A article on the new spec and this article on Linkerd v2.