Unknown army of requests sailing into our infrastructure

Construction has a productivity problem. PlanGrid’s mission is to solve this by giving our customers the tools they need to iterate on construction projects faster with less downtime and less organizational uncertainty.

Working on the Systems Engineering team at PlanGrid means bringing that same productivity innovation to our R&D teams. A common problem that slows down feature teams at PlanGrid is correctly handling the multiple types of authentication on requests from the internet and from other PlanGrid services. We historically launched an AWS ELB for every Kubernetes service that fielded internet requests, and as PlanGrid grew, our APIs organically expanded to support whatever authentication method made sense at the time. Today we maintain APIs that use different authentication schemes, as well as frontend services that use cookies. This becomes a problem when services try to communicate with each other as the requesting user: how do services trust each other and the internet simultaneously?

Every app developer faces the following question when exposing any endpoint in any service:

“We got a request! But from who?”

This is what we're trying to solve at the API gateway layer, abstracting that responsibility away from application developers. To do this, we use a combination of an in-house centralized authentication service and Envoy HTTP filters, translating the various headers we receive into a standard PlanGrid user ID that services can use directly for authorization, auditing, and so on. This abstracts the authentication logic out of all of our services, improving security and focus for our feature teams.

We're using Istio to gain more observability and control over requests within our service-oriented architecture. You can read more about Istio here. Istio can be adopted "à la carte," and we started by routing all requests to PlanGrid services through a single Istio ingress Gateway. Owning this gateway and routing all traffic through it to services inside an internet-protected VPN makes every service behind it part of a secure network. With header sanitization and schema enforcement at the ingress gateway, we can let our engineers assume the following:

“All requests to my service come from either the ingress Gateway or another PlanGrid service. Assuming the Gateway sanitizes requests, I can trust all headers that are present.”

Multiple clients accessing our Kubernetes services. A wildcard DNS entry maps all hosts to a single ELB; authentication and traffic routing are handled by the Istio Gateway service
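As a rough sketch, the single wildcard Gateway in the diagram above can be expressed with Istio's networking API. The names, hostname, and certificate paths here are illustrative, not our actual configuration:

```yaml
apiVersion: networking.istio.io/v1alpha3
kind: Gateway
metadata:
  name: ingress-gateway          # hypothetical name
  namespace: istio-system
spec:
  # Bind to the default Istio ingress gateway deployment
  selector:
    istio: ingressgateway
  servers:
  - port:
      number: 443
      name: https
      protocol: HTTPS
    # Illustrative wildcard host: one DNS entry covers every service
    hosts:
    - "*.example.com"
    tls:
      mode: SIMPLE
      serverCertificate: /etc/istio/ingressgateway-certs/tls.crt
      privateKey: /etc/istio/ingressgateway-certs/tls.key
```

Per-service routing then hangs off this Gateway via VirtualService resources, which is what lets a single ELB front every service behind the mesh.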

So how do we trust requests from the ingress gateway? By enforcing authentication and rejecting untrusted requests. Istio tries to solve this by exposing a JWT-based form of authentication, but it's very opinionated about how that authentication system works and doesn't allow for integration with our existing authentication services.

Luckily, Istio’s usage of Lyft’s Envoy proxy allowed us to apply custom filter logic via an HTTP filter.

Our plan was to add an HTTP filter to Istio's Gateway that would forward all headers on every request to the authentication service, and have the authentication service populate a PlanGrid-UserID header if it's able to decode the session from one of the request headers. It's like giving every request an office badge that any downstream service can check.

Request sanitization done at our ingress Gateway

Adding a custom configuration to Envoy was not trivial during the Istio 0.x releases. Before the 1.0 release, there were only two ways to amend the configuration of Envoy within the istio-proxy containers:

1. Fork the istio-proxy container, bake in an HTTP filter for Envoy, and deploy this custom istio-proxy container in the gateway pods.
2. Use the webhook dynamic filter configuration option exposed by Envoy.

Both of these were painful from an operations perspective. Maintaining our own custom istio-proxy container meant maintaining our own fork and merging with new Istio releases (which were very frequent pre-Istio 1.0). Loading the filter dynamically via webhook meant maintaining another web server whose sole purpose was to field requests from Pilot, an Istio component. This added a single point of failure to our entire system. It also meant there was no validation of configuration until runtime, which opened the possibility of taking down our entire service mesh with one bad commit or release.

Enter the EnvoyFilter kubernetes type

EnvoyFilters allowed us to deploy custom HTTP filters as Kubernetes resources. We could bake in Lua code that executes on every request to the gateway; we used it to sanitize requests and forward them to our authentication server. The filter then takes any headers in the response from the authentication server and forwards them to the destination service, as shown in the diagram above. We use Jinja to template the configuration file with environment-specific variables. Below is a generated example.

Envoy filter YAML with Lua code that proxies requests through our authentication service embedded
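The embedded gist isn't reproduced here, but a minimal sketch of such a filter might look like the following, using the pre-1.1 EnvoyFilter schema that shipped with Istio 1.0. The cluster name (`auth_service`), header names, and auth endpoint path are assumptions for illustration; `httpCall` and the `headers()` accessors are part of Envoy's Lua filter API:

```yaml
apiVersion: networking.istio.io/v1alpha3
kind: EnvoyFilter
metadata:
  name: gateway-authn            # hypothetical name
  namespace: istio-system
spec:
  workloadLabels:
    istio: ingressgateway        # apply only to the ingress gateway pods
  filters:
  - listenerMatch:
      listenerType: GATEWAY
      listenerProtocol: HTTP
    filterType: HTTP
    filterName: envoy.lua
    filterConfig:
      inlineCode: |
        -- Runs on every request passing through the gateway.
        function envoy_on_request(request_handle)
          -- Sanitize: strip any identity header a client may have spoofed.
          request_handle:headers():remove("plangrid-userid")

          -- Ask the auth service (a hypothetical Envoy cluster) to
          -- decode the session from the request's credentials.
          local resp_headers, _ = request_handle:httpCall(
            "auth_service",
            {
              [":method"] = "GET",
              [":path"] = "/v1/whoami",
              [":authority"] = "auth-service",
              ["authorization"] = request_handle:headers():get("authorization")
            },
            "",    -- no request body
            5000   -- timeout in milliseconds
          )

          -- If the session decoded, stamp the request with the user ID
          -- so downstream services can trust it.
          local user_id = resp_headers["plangrid-userid"]
          if user_id then
            request_handle:headers():add("plangrid-userid", user_id)
          end
        end
```

Because `httpCall` suspends the request until the auth service responds, every request is vetted before it ever reaches a destination service.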

Having all client traffic funneled through a single gateway and vetted by our authentication service opens the door to many future infrastructure and security enhancements. We can now add L7 load balancing to support our microservices architecture. With routing rules already in place, we can take advantage of Istio's traffic management features in a service mesh. Having a single domain name to route to also makes immutable, repeatable clusters easier to fail over to.

Istio is a young project, and its production use at PlanGrid is even younger. Do you have experience with the EnvoyFilter Kubernetes type? Stories about managing configuration for Lua scripts and deployment techniques? General experience managing Istio in production? We'd love to hear about it! And if you're interested in helping us tackle these exciting problems, visit our careers page for opportunities :)