Kiam Improvements

Kiam now offers:

- Increased security by splitting the process into two: an agent and a server. Only the server process needs to be permitted to perform sts:AssumeRole. Cluster operators can place user workloads on nodes with only the essential IAM policy necessary for the kubelet. This guards against privilege escalation from an application compromise.

- Prefetching credentials from AWS. This reduces response times as observed by SDK clients which, given very restrictive default timeouts, would otherwise fail.

- Load-balancing requests across multiple servers. This helps deploy updates without breaking agents. We observed SDK behaviour in production where applications would fail as soon as the proxy was restarted, even when they held valid, unexpired credentials. It also protects against errors when interacting with an individual server.

These changes are the most significant ways in which Kiam deviates from kube2iam, and most are largely a benefit of separating Kiam into two processes: server and agent. I’ll now go through each in a little more detail.

Agents and Servers

Kiam originally used exactly the same model as Kube2iam: a process deployed onto each machine via a DaemonSet. It was simple to deploy and reason about, and allowed us to quickly fix the data race issue that caused outages for our production cluster users.

This worked for a few months but, as the number of running Pods increased and, along with it, the number of pods requiring AWS credentials, we started to hit other client errors:

NoCredentialProviders: no valid providers in chain

These errors were caused by clients requesting credentials through the Kiam proxy when it was unable to respond in time: either 1) it hadn’t yet added the Pod’s data to its cache, or 2) it couldn’t fetch credentials from AWS quickly enough.

Before exploring more about how Kiam’s design was updated to reduce the likelihood of these errors it’s worth covering a little more about the AWS SDKs.

AWS SDKs and credentials

AWS SDK clients generally support a number of different providers that can fetch credentials to be used for subsequent API operations. These are composed together into a provider chain: providers are called in sequence until one is able to provide credentials.

In the early days of AWS and EC2 most users would’ve been familiar with doing something like:

$ export AWS_ACCESS_KEY_ID=AKXXXXXX

$ export AWS_SECRET_ACCESS_KEY=XXXXXXXXXXXXXX

$ aws ec2 ...

This works with one of the SDK’s credential providers: keys are read from environment variables to be presented to AWS during API calls.

One other credential provider is the Instance Metadata API: http://169.254.169.254 on an EC2 instance. This metadata API is what both kube2iam and Kiam proxy to provide seamless IAM integration for clients using AWS SDKs on Kubernetes; requests for credentials are processed by the proxies, everything else is forwarded to AWS.

Within the Java SDK the base credentials endpoint provider, which the instance credentials provider extends, has a default retry policy of CredentialsEndpointRetryPolicy.NO_RETRY. If a client experiences an error interacting with the metadata API it won’t be retried by the SDK and instead must be handled by the calling application.

This is problematic for systems like Kube2iam and Kiam that intercept and proxy the Instance Metadata API: when a client process requests credentials via InstanceProfileCredentialsProvider it’s not talking directly to AWS but to the HTTP proxy, and there are more reasons the proxy may fail to respond successfully than Amazon’s own API. Further, the EC2CredentialsUtils package used by InstanceMetadataCredentialsEndpointProvider also specifies no retries.

It’s not just errors and failures that would cause the client to fail: there are also timeouts specified within ConnectionUtils that are used within the SDK:

- Connect timeout: 2 seconds

- Read timeout: 5 seconds

Those timeouts may seem generous, but remember that should either of them be exceeded, or a failure response be returned, the credential providers are not configured to retry. Credentials are obtained from the AWS Security Token Service (STS) which, in our experience, can be among the slower APIs and can approach the timeouts mentioned above.

Timeouts and the lack of retries within the SDK clients force Kiam to do a lot to avoid transferring a fault to a calling application and causing an error. Given the high degree of fan-in to a service like Kiam, a fault here could have a significant impact on a cluster’s applications.

AssumeRole on Every Node?!

Kiam’s original DaemonSet model was a proxy on each node and thus every node would need IAM policy to permit sts:AssumeRole for all roles used on the cluster.

We use per-application IAM roles to ensure that application processes only have the access into AWS that they need. With every node able to call sts:AssumeRole for any role, the end result is that every node can still access any role, and thus the union of AWS APIs granted across the different application policies.

Although we could start associating subsets of nodes to groups of roles (per-team, for example) we didn’t want to have to get into managing different groups of nodes to improve our IAM security.

What makes Kiam novel

I’d now like to explain a little more about how Kiam’s novel design mitigates these issues.

The necessity of Prefetching

Kiam uses Kubernetes’ client-go cache package to create a process which uses two mechanisms (via the ListerWatcher interface) for tracking pods:

- Watcher: the client tells the API server which resources it’s interested in tracking and the server will stream updates as they’re available. Think of these as deltas to some state.

- Lister: this performs a (relatively expensive) List which retrieves details about all running pods. It takes longer to return but ensures you pick up details about all running pods, not just a delta.

As Kiam becomes aware of Pods they’re stored in a cache and indexed using the client library’s Indexers type. Kiam uses an index to identify Pods by their IP address: when an SDK client connects to Kiam’s HTTP proxy, the client’s IP address is used to look up the Pod in this cache.
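A toy version of that IP-keyed cache (illustrative names, not client-go’s actual Indexers API) shows the lookup the proxy performs:

```go
package main

import "fmt"

// Pod holds the minimum the proxy needs: the Pod's IP and its role
// annotation value.
type Pod struct {
	Name string
	IP   string
	Role string // from the Pod's IAM role annotation
}

// PodCache indexes Pods by IP so a connecting client's address can be
// resolved to a Pod.
type PodCache struct {
	byIP map[string]Pod
}

func NewPodCache() *PodCache { return &PodCache{byIP: make(map[string]Pod)} }

// Add records a Pod under its IP, as the watcher/lister would on updates.
func (c *PodCache) Add(p Pod) { c.byIP[p.IP] = p }

// FindByIP resolves a client IP to the Pod that owns it.
func (c *PodCache) FindByIP(ip string) (Pod, bool) {
	p, ok := c.byIP[ip]
	return p, ok
}

func main() {
	cache := NewPodCache()
	cache.Add(Pod{Name: "web-1", IP: "10.0.0.12", Role: "web-role"})
	if p, ok := cache.FindByIP("10.0.0.12"); ok {
		fmt.Println(p.Name, p.Role) // prints "web-1 web-role"
	}
}
```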

It’s important, then, that by the time an SDK client attempts to connect to Kiam the Pod cache is filled with the details of the running Pod. Based on the Java client code we saw above, Kiam has up to 5 seconds to respond with the configured role and so, by extension, 5 seconds to track a running Pod.

If Kiam can’t find the Pod details in the cache it’s possible the details from the watcher haven’t yet been delivered (but may eventually be). Inside the agent we include some retry and backoff behaviour that will keep checking for the Pod details in the cache until the SDK client disconnects. Ideally the Pod details will be filled in by either the watcher or lister process in time.

Kiam’s retries and backoffs use Go’s context package to propagate cancellation from the incoming HTTP request down through the chain of child calls that Kiam makes. This cancellable context lets us wait as long as possible for the operations to succeed and has been hugely helpful for writing a system that honours timeouts and retries.

Credential prefetching

Alongside maintaining the Pod cache, the other responsibility of the server process is to maintain a cache of AWS credentials retrieved by calling sts:AssumeRole on behalf of the running Pods.

Originally, to keep things simple and obvious, Kiam requested credentials on demand. When a client connected we would make a request to AWS in-band, store the fetched credentials in a cache and then keep refreshing them for as long as the Pod was running. But, as we saw above, the expectation from AWS SDK clients is that the metadata API returns very quickly. Kiam and Kube2iam both use Amazon STS to retrieve credentials, which is quite a bit slower than the metadata API.