
In an enterprise that enforces compliance controls on cloud infrastructure, EKS presents an interesting challenge.

Consider an organisational policy that prohibits developers from exposing services in development environments to the public internet. In Fargate or ECS environments this policy can be enforced during the CloudFormation review stage. However, the default configuration of an EKS cluster allows developers to define Ingresses or Services which lead to the creation of Elastic Load Balancers. Reviewing CloudFormation templates is no longer sufficient to ensure full infrastructural control and visibility.

To retain the required level of control, it is necessary to move down a layer in the stack and enforce the policy at the Kubernetes level.

To solve this problem in an EKS cluster you have a few options, depending on your definition of “solve”:

- Manually review all objects deployed to the cluster to ensure they conform to standards

- Use Kubernetes Admission Controllers to enforce this policy at the cluster level

- Remove EKS’s ability to create and manage ELBs via IAM

The first option provides some protection and should be incorporated into the code review process, but introduces room for error and does not scale very well.

The second option is ultimately more secure, as it ensures the policy is enforced even if someone bypasses the CICD pipeline. It also means developers can be given sandbox clusters with direct kubectl access, while the platform team remains confident that the cluster configuration prevents services from being exposed to the public internet.

The third option is the sledgehammer approach but can be appropriate in some cases — if you never want EKS to create an ELB, why not take away the permission entirely?

My goal is to investigate the second option, and use Admission Controllers to implement the following scenarios:

- Services cannot create load balancers

- Services can create load balancers if they have the required static annotations

- Services can create load balancers if they have the required annotations for the target namespace

Open Policy Agent

Open Policy Agent is a general-purpose policy engine which can be used to enforce policy across a range of applications. It acts as a policy middleware application — policies are defined in OPA, and applications query OPA to make decisions. It essentially provides policy-as-a-service, and decouples policies from application configurations.

The OPA docs contain an example Validating Admission Controller which enforces the semantic validation of Kubernetes objects before they are created, deleted or updated.

Admission Controllers are executed after a request is authenticated but before the object is persisted to storage. After authenticating the request, the Kubernetes API server sends a webhook request to OPA, and the webhook response controls whether the request is allowed to proceed.

[Diagram: the Kubernetes admission control workflow. Source: Banzai Cloud]

For more detail on the Admission Controller architecture and workflow see this blog post.

Installing OPA

I started by installing and configuring OPA according to the documentation.
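Under the hood, the installation registers OPA with the API server as a validating webhook. The following is a minimal sketch of that registration, adapted from the pattern in the OPA docs; the API version, webhook name and CA bundle are assumptions that will vary with your installation:

```yaml
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingWebhookConfiguration
metadata:
  name: opa-validating-webhook
webhooks:
  - name: validating-webhook.openpolicyagent.org
    admissionReviewVersions: ["v1"]
    sideEffects: None
    rules:
      # Send all create and update operations to OPA for validation.
      # A production setup would narrow this to the resources of interest.
      - apiGroups: ["*"]
        apiVersions: ["*"]
        operations: ["CREATE", "UPDATE"]
        resources: ["*"]
    clientConfig:
      caBundle: <base64-encoded CA certificate>  # placeholder; taken from your OPA TLS setup
      service:
        name: opa
        namespace: opa
```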

When running a local Kubernetes cluster in minikube or Docker for Mac, everything works out of the box. If running in AWS EKS, ensure that the cluster control plane security group is allowed to access the worker node security group on port 443; otherwise the Kubernetes masters will be unable to send the webhook request to OPA and all kubectl commands will time out.

Policies are written in the Rego language, which is somewhat unintuitive at first glance, and stored in ConfigMaps in the “opa” namespace. OPA will automatically load any ConfigMaps in this namespace.
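For example, a policy file can be pushed to the cluster with a command along these lines (the policy file and ConfigMap names here are illustrative):

```sh
kubectl create configmap restrict-load-balancers \
  --from-file=restrict-load-balancers.rego \
  --namespace opa
```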

Blocking all Load Balancer Services

In the example OPA is configured in “default allow” mode — all requests are allowed except those which are explicitly denied. The policy document will consist of rules which evaluate to true for non-compliant resources.

The first rule should block the creation of any Service of the type LoadBalancer. This can be written in Rego as:
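A minimal sketch, using the `kubernetes.admission` package convention from the OPA docs; the input paths assume the standard AdmissionReview request structure:

```rego
package kubernetes.admission

# Deny the creation of any Service of type LoadBalancer.
deny[msg] {
    input.request.kind.kind == "Service"
    input.request.operation == "CREATE"
    input.request.object.spec.type == "LoadBalancer"
    msg := "LoadBalancer Services are not permitted"
}
```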

In English, this rule says:

- if the kind of object being created is a Service, and

- if the request operation is Create, and

- if the service type is LoadBalancer, then

- deny the request.

I created an example Service to test the policy:
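Something along these lines, with an illustrative name and selector:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: nginx
spec:
  type: LoadBalancer   # this is the field the policy rejects
  ports:
    - port: 80
      targetPort: 80
  selector:
    app: nginx
```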

Attempting to create this service resulted in the error “LoadBalancer Services are not permitted”.

Enforcing an Annotation Policy

Disabling load balancers globally is quite a heavy-handed solution. What if we want to allow load balancers, but only if they conform to certain requirements?

Kubernetes uses service annotations to configure the load balancer. In AWS, these can be used to configure security groups, SSL certificates, and access logging. For example, an SSL certificate can be added to an ELB by annotating the service with service.beta.kubernetes.io/aws-load-balancer-ssl-cert=arn:aws:acm:….

With this in mind we can define a policy which allows load balancers, but only if they are using a known security group. In this example we will ensure that load balancers use the security group sg-123.

This is expressed in Rego as:
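A sketch of the rule. I am assuming the `service.beta.kubernetes.io/aws-load-balancer-security-groups` annotation key here, plus a helper rule so that a Service with no annotation at all is also denied:

```rego
package kubernetes.admission

required_sg := "sg-123"

deny[msg] {
    input.request.kind.kind == "Service"
    input.request.operation == "CREATE"
    input.request.object.spec.type == "LoadBalancer"
    not has_required_sg
    msg := sprintf("Services of type LoadBalancer must use Security Group %v", [required_sg])
}

# True only when the Service carries the expected security group annotation.
has_required_sg {
    annotations := input.request.object.metadata.annotations
    annotations["service.beta.kubernetes.io/aws-load-balancer-security-groups"] == required_sg
}
```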

I created an example service which is missing the security group annotation:
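For example (names again illustrative):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: nginx-elb
  # Note: no security group annotation, so the policy should deny this.
spec:
  type: LoadBalancer
  ports:
    - port: 80
  selector:
    app: nginx
```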

Attempting to create this service failed with “Services of type LoadBalancer must use Security Group sg-123”. After adding the missing annotation the service was created successfully.
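The fix is a two-line addition to the Service metadata, using the same annotation key the rule checks:

```yaml
metadata:
  annotations:
    service.beta.kubernetes.io/aws-load-balancer-security-groups: sg-123
```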

Dynamic Annotations

The previous example introduces finer-grained control over resource creation, but hard-coding references to resources such as security groups is not a scalable solution. Furthermore, the required security group is likely to vary depending on the cluster environment. The high-level policy might say “development environments can only be exposed to the corporate IP range”. Achieving this in a scalable way requires policies which can make use of dynamic data — in this case, to look up the allowed security groups for a particular namespace.

Fortunately, OPA can consume data from a variety of sources and use it when making policy decisions.

In the tutorial OPA was started with the flag `--replicate-cluster=v1/namespaces`. This causes OPA to continuously replicate the Kubernetes namespace state, making namespaces and their metadata available during policy evaluation.

We can store arbitrary data in a namespace’s annotations. We will use this to store a reference to the security groups which are allowed in the namespace. Services will only be allowed if they reference the correct security group for the target namespace.

I created two namespaces with the following configuration:
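Something like the following; note that `load-balancer-security-groups` is an annotation key of my own invention, and any key works as long as the policy reads the same one:

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: dev
  annotations:
    load-balancer-security-groups: sg-123
---
apiVersion: v1
kind: Namespace
metadata:
  name: prod
  annotations:
    load-balancer-security-groups: sg-456
```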

I then created the following policy:
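A sketch, assuming the replicated namespaces are available under `data.kubernetes.namespaces` and the annotation keys from the manifests above:

```rego
package kubernetes.admission

import data.kubernetes.namespaces

deny[msg] {
    input.request.kind.kind == "Service"
    input.request.operation == "CREATE"
    input.request.object.spec.type == "LoadBalancer"

    # Security group requested by the Service.
    sg := input.request.object.metadata.annotations["service.beta.kubernetes.io/aws-load-balancer-security-groups"]

    # Security group allowed for the target namespace, read from replicated state.
    allowed := namespaces[input.request.namespace].metadata.annotations["load-balancer-security-groups"]

    sg != allowed
    msg := sprintf("Security Group %v is not allowed in namespace %v", [sg, input.request.namespace])
}
```

As with the static rule, a production version would also deny Services that omit the annotation entirely.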

This rule compares the security group given in the Service specification to the security group allowed for the target namespace. The request is denied if e.g. a service in the dev namespace references the security group used in the prod namespace.

To validate this, I created a load balanced service which uses the security group for the dev namespace:
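A manifest along these lines:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: nginx-elb
  annotations:
    # Matches the dev namespace's allowed security group.
    service.beta.kubernetes.io/aws-load-balancer-security-groups: sg-123
spec:
  type: LoadBalancer
  ports:
    - port: 80
  selector:
    app: nginx
```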

I created this service in the dev namespace without error. Attempting to create the same service in the prod namespace failed as expected.

Success! Any attempt to deploy non-compliant infrastructure to the cluster will fail, and the deployment log will explain the cause of the failure.

Conclusion

Kubernetes provides the primitives needed to enforce policy at the cluster level but they require some configuration and management. As Kelsey Hightower says, Kubernetes is a platform for building platforms.

OPA has a steep learning curve but it provides a lot of power and flexibility in return. Having a generic decision-as-a-service tool introduces a range of interesting possibilities.

So far all of the examples have been in the context of the Kubernetes Admission Control cycle — YAML resource definitions are validated immediately prior to creation. If a developer attempts to deploy non-compliant infrastructure they will only find out when the deployment fails. It would be more efficient to bring this problem to the developer’s attention earlier in the build and release cycle, avoiding the cycle time increases caused by compliance errors.

At a technical level, the Kubernetes Admission Controller is simply handing the YAML definition to OPA and asking for approval to continue. OPA validates the YAML according to the defined policy and makes a decision.

Because the validation process is not tied to Admission Controllers, the YAML files could be submitted to OPA as part of the CICD process used for delivering Kubernetes resources.
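As a sketch, a CI step could evaluate the same ruleset with the opa CLI against a locally-constructed input document. The file names here are illustrative, and `input.json` would need to wrap the resource in an AdmissionReview-style request object; a non-empty deny set fails the build:

```sh
# Evaluate the admission policy outside the cluster.
# policy.rego holds the kubernetes.admission package from above;
# input.json mimics the AdmissionReview request the webhook would receive.
opa eval --format pretty \
  --data policy.rego \
  --input input.json \
  "data.kubernetes.admission.deny"
```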

This means the same policy ruleset can be validated and enforced at multiple stages in the pipeline. In the CICD process OPA acts as a quality check, informing developers at the earliest possible opportunity that their changes will not be allowed. In the Kubernetes Admission Control process OPA acts as an enforcement tool, preventing non-compliant infrastructure from being created.

Treating compliance as code means adopting best practices from the software development process. One of these is Don’t Repeat Yourself. Decoupling policy from applications, and reusing policy definitions in multiple locations, is a good implementation of this rule.

The final component in the cloud-native compliance monitoring triumvirate would be integration with a service like CloudTrail to provide real-time alerting if non-compliant infrastructure is detected. A single policy ruleset could then be used to achieve three goals:

1) Pre-release: the CICD process prevents developers from delivering non-compliant infrastructure

2) Pre-deploy: Kubernetes prevents non-compliant infrastructure from being deployed

3) Post-deploy: CloudTrail provides an audit trail proving all infrastructure was compliant at the time of creation, and identifies any infrastructure that becomes non-compliant due to a change in policy

The first makes it easy to deploy compliant infrastructure.

The second makes it impossible to deploy non-compliant infrastructure.

The third proves that the other two worked.

In summary, OPA is an interesting and flexible tool with an operating model perfectly suited for cloud-native environments. Kubernetes Admission Controllers represent the ideal interface for enforcing policy within a cluster.