This chart bootstraps a kiam deployment on a Kubernetes cluster using the Helm package manager. kiam runs as an agent…
This chart bootstraps a kiam deployment on a Kubernetes cluster using the Helm package manager. kiam runs as an agent on each node in your Kubernetes cluster and allows cluster users to associate IAM roles to Pods.
Kiam bridges Kubernetes’ Pods with Amazon’s Identity and Access Management (IAM). It makes it easy to assign short-lived AWS security credentials to your application.
We created Kiam in 2017 to quickly address correctness issues we had running kube2iam in our production clusters. We’ve made a number of changes to it’s original design to make it more secure, reliable and easier to operate. This article covers a little of the story that led to us creating Kiam and more about what makes it novel.
Kiam now offers:
I’d now like to explain a little more about how Kiam’s novel design mitigates these issues.
As Kiam becomes aware of Pods they’re stored in a cache and indexed using the client libraries’ Indexers type. Kiam uses an index to be able to identify pods from their IP address: when an SDK client connects to Kiam’s HTTP proxy it uses the client’s IP address to identify the Pod in this cache.
It’s important then that when an SDK client attempts to connect to Kiam the Pod cache is filled with the details for the running Pod. Based on the Java client code we saw above Kiam has up to 5 seconds to respond with the configured role and so, by extension, Kiam has 5 seconds to track a running Pod.
If Kiam can’t find the Pod details in the cache it’s possible the details from the watcher haven’t yet been delivered (but may eventually be). Inside the agent we include some retry and backoff behaviour that will keep checking for the pod details in the cache up until the SDK client disconnects. Ideally the pod details will either be filled by the watcher or lister processes within time.
Kiam’s retries and backoffs use the Context package to propagate cancellation from the incoming HTTP request down through the chain of child calls that Kiam makes. This cancellable context lets us wait as long as possible for the operations to succeed and has been hugely helpful for writing a system that honours timeouts and retries.
Alongside maintaining the Pod cache the other responsibility of the Server process is to maintain a cache of AWS credentials retrieved from calling sts:AssumeRole on behalf of the runnings pods.
Originally, to keep things simple and obvious, Kiam used to request credentials upon request. When a client connected we would make a request to AWS in-band, store the fetched credentials in a cache and then keep refreshing them as long as the Pod was still running. But, as we saw above, the expectations from AWS SDK clients is that the metadata API returns very quickly. Kiam and Kube2iam both use Amazon STS to retrieve credentials which is quite a bit slower than the metadata API.
We have metrics tracking the completion times for the sts:AssumeRole operation and generally it’s very stable: typical 99th percentile times of 550 milliseconds. Every now and then though it goes beyond that. Below is a plot from our monitoring showing an increase in both max (yellow) and 95th percentile (blue) to well over a few seconds. Although this was a few months ago this isn’t atypical (it’s just taken me a long time to write this article :).
It’s quite normal for such spikes to happen but it’s problematic if we hit slow responses from AWS when fetching credentials in-band: a slow response from STS would cause us to propagate a failure to our SDK clients given their strict retry and timeout policies. To mitigate this Kiam prefetches credentials.
When a Pod is tracked through an update from a Kubernetes API watcher or from a full sync it’s added to a buffered channel with the prefetcher on the other side. The prefetcher requests credentials and stores them in the credentials cache ahead of the client requesting them.
Prefetching is an optimisation: if a pod requests credentials for a role and the credentials cache doesn’t already have them they’ll be requested again. We also use a Future to wrap around the AssumeRole operations within the cache to avoid requesting credentials for the same role repeatedly while waiting for credentials to be issued.
Prefetching helps us to smooth out interactions with the STS API and return responses to clients far quicker. To see just how much of a difference there’s another plot a little later that covers the same period as above.
Prefetching was one of the drivers for why we chose to split Kiam into server and agent processes: it would’ve been too costly for us, in terms of time and volume of AWS API calls, to do when running as a homogenous process where each process on each node would request credentials for all roles.
In the original homogenous model the number of sts:AssumeRole API calls would grow as Nodes * Roles: adding an additional node or role results in more than 1 additional call and, on a large dynamic cluster, this could be quite significant. Historically we’d also seen our STS operation durations suffer when we hammered it.
One solution is for each node to only requests credentials that it’s currently using. If the proxy process restarted (because of an upgrade, failure etc.) it would also lose all credentials and refilling the caches could be slow. We’d observed previously that SDK clients had immediately raised exceptions in such a situation- causing clients to error.
Kiam’s novel Agent/Server separation causes sts:AssumeRole calls to grow as Servers * Roles but given Servers are normally pinned to a subset of nodes, with a near constant number of replicas, API growth can be simplified to linear with regards to Roles.
Fetching credentials out-of-band before a client requests them reduces the impact of a slow AWS response. Running multiple servers that fetch the same credentials provides a degree of resilience to an individual server failure to track a pod/fetch credentials: the agent can take credentials from the first server that responds successfully and can retry the operation safely across servers.
Running multiple servers with redundant caches and prefetching credentials improves the chances the Kiam HTTP proxy performs within the expectations of the AWS SDK client.
Earlier I showed a plot highlighting a period of severely slow responses from the STS API that would cause any SDK client with default timeouts to fail. The plot below covers the same period as before but shows the Kiam agent’s response times. The max (yellow) is just over 1 second (when AWS’ response was over 10 seconds) and the 95th percentile (blue) time is closer to 47ms. Both are well within the limits of the SDK clients.
As mentioned earlier, Kiam and Kube2iam both followed the same deployment model at the beginning: a DaemonSet that installed itself via iptables on every node in the cluster. Such a deployment requires all nodes running user pods to have IAM policy attached that permit the sts:AssumeRole call. Having all nodes able to assume any role though is problematic in the event of a node being exploited: a host exploit would open access to any role. This is especially undesirable when we want, as far as possible, to run soft multi-tenant, multi-environment clusters.
Carving out a separate Server process meant we could run the server processes on a subset of machines and only they would need sts:AssumeRole. Nodes running user workloads would only require IAM permissions needed for kubelet and nodes running the server processes wouldn’t be permitted to run any user processes.
gRPC communications between the agents and servers are protected with TLS and mutual verification to allow only the agent, and the server’s health checker, to interact with the server.
When kiam was still deployed as a single-process DaemonSet we had problems rolling out updates.
Pods would, via their SDK client, have fetched temporary credentials successfully but, because they’re session-based credentials, would also continually attempt to refresh them in the background. When this background refresh failed because the Kiam process was unavailable the SDK’s would immediately invalidate the credentials being used and throw errors.
Applications would immediately see their AWS operations fail despite having valid credentials. This problem hit us a number of times and caused a lot of pain for the various service teams that use the clusters.
Running separate Server and Agent processes made it easier for us to independently deploy changes. If a server is updated we can often deploy them without affecting the agent processes at all. If the agent needs to be updated, however, it can restart far quicker as it no longer needs any caches with credentials or pod data- any such requests are immediately forwarded to the server processes.
Tell us about a new Kubernetes application
Never miss a thing! Sign up for our newsletter to stay updated.
Discover and learn about everything Kubernetes