• AWS
  • 13. Sep

k8s-spot-rescheduler

K8s Spot Rescheduler is a tool that tries to reduce load on a set of Kubernetes nodes. It was designed to move Pods scheduled on AWS on-demand instances to AWS spot instances, so that the on-demand instances can be safely scaled down by the Cluster Autoscaler.

In reality the rescheduler can be used to remove load from any group of nodes onto a different group of nodes. They just need to be labelled appropriately.

For example, it could also be used to let controller nodes take up slack while new nodes are being scaled up, and then to reschedule those Pods when the new capacity becomes available, reducing the load on the controllers once again.

Attribution

This project was inspired by the Critical Pod Rescheduler and takes portions of code from both the Critical Pod Rescheduler and the Cluster Autoscaler.

Motivation

AWS spot instances are a great way to reduce your infrastructure running costs. They do, however, come with a significant drawback: at any point, the spot price for the instances you are using could rise above your bid, and your instances will be terminated. To solve this problem, you can use an AutoScaling group backed by on-demand instances and managed by the Cluster Autoscaler to take up the slack when spot instances are removed from your cluster.

The problem, however, comes when the spot price drops and new spot instances are added back to your cluster. At this point you are left with empty spot instances and full, expensive on-demand instances.

By tainting the on-demand instances with the Kubernetes PreferNoSchedule taint, we can ensure that, if at any point the scheduler needs to choose between spot and on-demand instances, it will choose the preferred spot instances to schedule the new Pods onto.

However, the scheduler won't reschedule Pods that are already running on on-demand instances, which blocks those instances from being scaled down. At this point, the K8s Spot Rescheduler is required to start the process of moving Pods from the on-demand instances back onto the spot instances.

Usage

Deploy to Kubernetes

A Docker image is available at quay.io/pusher/k8s-spot-rescheduler. These images are currently built on pushes to master. Releases will be tagged as and when they are made.

Sample Kubernetes manifests are available in the deploy folder.

To deploy in clusters using RBAC, please apply all of the manifests (Deployment, ClusterRole, ClusterRoleBinding and ServiceAccount) in the deploy folder, and uncomment the serviceAccountName in the deployment.

Requirements

For the K8s Spot Rescheduler to process nodes as expected, you will need identifying labels which can be passed to the program to allow it to distinguish which nodes it should consider as on-demand and which it should consider as spot instances.

For instance you could add labels node-role.kubernetes.io/worker and node-role.kubernetes.io/spot-worker to your on-demand and spot instances respectively.
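As a rough illustration of what these labels make possible, the sketch below splits a list of nodes into on-demand and spot groups by checking for the two example label keys. The `labelledNode` type and `classify` function are hypothetical names used only for this example; they are not taken from the rescheduler's code, and real nodes would come from the Kubernetes API rather than a literal slice.

```go
package main

import "fmt"

// labelledNode is a simplified, hypothetical stand-in for a Kubernetes node.
type labelledNode struct {
	Name   string
	Labels map[string]string
}

// Label keys from the example in the text; any distinguishing pair would work.
const (
	onDemandLabel = "node-role.kubernetes.io/worker"
	spotLabel     = "node-role.kubernetes.io/spot-worker"
)

// classify groups nodes by label presence. A node with neither label is
// ignored. Checking the spot label first is an arbitrary choice made for
// this sketch.
func classify(nodes []labelledNode) (onDemand, spot []labelledNode) {
	for _, n := range nodes {
		if _, ok := n.Labels[spotLabel]; ok {
			spot = append(spot, n)
		} else if _, ok := n.Labels[onDemandLabel]; ok {
			onDemand = append(onDemand, n)
		}
	}
	return onDemand, spot
}

func main() {
	nodes := []labelledNode{
		{Name: "a", Labels: map[string]string{onDemandLabel: ""}},
		{Name: "b", Labels: map[string]string{spotLabel: ""}},
		{Name: "c", Labels: map[string]string{}},
	}
	onDemand, spot := classify(nodes)
	fmt.Println(len(onDemand), len(spot))
}
```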

You should also add the PreferNoSchedule taint to your on-demand instances to ensure that the scheduler prefers spot instances when making its scheduling decisions.
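To show why a soft taint is enough here, the following sketch models the preference in a deliberately simplified way: a node without the taint wins whenever it has room, and a tainted node is used only as a fallback. This is not how the Kubernetes scheduler is implemented (it scores nodes across several weighted criteria); the `candidate` type and `pickNode` function are hypothetical and consider CPU only.

```go
package main

import "fmt"

// candidate is a hypothetical, simplified view of a node for scheduling.
type candidate struct {
	Name             string
	FreeCPUMilli     int64
	PreferNoSchedule bool // true for tainted (on-demand) nodes
}

// pickNode returns the first untainted node with enough free CPU. If none
// fits, it falls back to the first tainted node with enough free CPU.
func pickNode(nodes []candidate, reqMilli int64) (string, bool) {
	fallback, haveFallback := "", false
	for _, n := range nodes {
		if n.FreeCPUMilli < reqMilli {
			continue
		}
		if !n.PreferNoSchedule {
			return n.Name, true
		}
		if !haveFallback {
			fallback, haveFallback = n.Name, true
		}
	}
	return fallback, haveFallback
}

func main() {
	nodes := []candidate{
		{Name: "on-demand-1", FreeCPUMilli: 2000, PreferNoSchedule: true},
		{Name: "spot-1", FreeCPUMilli: 500, PreferNoSchedule: false},
	}
	fmt.Println(pickNode(nodes, 300))  // small Pod fits on the spot node
	fmt.Println(pickNode(nodes, 1000)) // too big for spot, falls back
}
```

The fallback branch is the important part: the taint expresses a preference, so Pods still land on on-demand nodes when spot capacity is missing.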

 

Operating logic

 

The rescheduler logic roughly follows the steps below:

  1. Get a list of on-demand and spot nodes and their respective Pods
    • Build a map of nodeInfo structs
      • Add node to struct
      • Add pods for that node to struct
      • Add requested and free CPU fields to struct
    • Map these structs based on whether they are on-demand or spot instances
    • Sort on-demand instances by least requested CPU
    • Sort spot instances by most free CPU
  2. Iterate through each on-demand node and try to drain it
    • Iterate through each pod
      • Determine if a spot node has space for the pod
      • Add the pod to the prospective spot node
      • Move on to the next node if no spot node has space available
    • Drain the node
      • Iterate through pods and evict them in turn
        • Evict pod
        • Wait for deletion and reschedule
      • Cancel all further processing
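The steps above can be sketched in Go roughly as follows. This is a minimal illustration, not the project's implementation: it looks at CPU requests only, performs no evictions, and the `planDrain` and `firstDrainable` names and the field layout of `nodeInfo` are assumptions made for the example.

```go
package main

import (
	"fmt"
	"sort"
)

type pod struct {
	Name     string
	CPUMilli int64
}

// nodeInfo mirrors the struct described in the list; the fields are assumed.
type nodeInfo struct {
	Name         string
	Pods         []pod
	RequestedCPU int64
	FreeCPU      int64
}

// planDrain tries to place every Pod of one on-demand node onto spot nodes.
// It tracks free CPU in a scratch copy, so a failed plan changes nothing.
func planDrain(onDemand nodeInfo, spot []nodeInfo) (map[string]string, bool) {
	free := make([]int64, len(spot))
	for i, s := range spot {
		free[i] = s.FreeCPU
	}
	plan := make(map[string]string, len(onDemand.Pods))
	for _, p := range onDemand.Pods {
		placed := false
		for i := range spot {
			if free[i] >= p.CPUMilli {
				free[i] -= p.CPUMilli
				plan[p.Name] = spot[i].Name
				placed = true
				break
			}
		}
		if !placed {
			return nil, false // no spot node has space: move on to the next node
		}
	}
	return plan, true
}

// firstDrainable sorts both groups in place (least requested CPU first for
// on-demand, most free CPU first for spot) and returns the first on-demand
// node whose Pods all fit, together with the Pod-to-node plan.
func firstDrainable(onDemand, spot []nodeInfo) (string, map[string]string, bool) {
	sort.SliceStable(onDemand, func(i, j int) bool {
		return onDemand[i].RequestedCPU < onDemand[j].RequestedCPU
	})
	sort.SliceStable(spot, func(i, j int) bool {
		return spot[i].FreeCPU > spot[j].FreeCPU
	})
	for _, n := range onDemand {
		if plan, ok := planDrain(n, spot); ok {
			return n.Name, plan, true
		}
	}
	return "", nil, false
}

func main() {
	onDemand := []nodeInfo{
		{Name: "od-1", RequestedCPU: 1500, Pods: []pod{{"a", 1000}, {"b", 500}}},
		{Name: "od-2", RequestedCPU: 300, Pods: []pod{{"c", 300}}},
	}
	spot := []nodeInfo{
		{Name: "s1", FreeCPU: 400},
		{Name: "s2", FreeCPU: 1200},
	}
	fmt.Println(firstDrainable(onDemand, spot))
}
```

Returning after the first drainable node reflects the "Cancel all further processing" step: one node is handled per pass, and the remaining nodes are reconsidered on the next pass.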

This process is repeated every housekeeping-interval seconds.
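A periodic loop of this kind is commonly written in Go with a ticker. The sketch below is one possible shape for it, assuming a hypothetical `runEvery` helper that takes the interval and a function for a single pass; it does not claim to match the rescheduler's own loop.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// runEvery calls pass once per interval until ctx is cancelled and returns
// the number of passes that ran.
func runEvery(ctx context.Context, interval time.Duration, pass func()) int {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	passes := 0
	for ctx.Err() == nil {
		select {
		case <-ctx.Done():
			// Cancelled while waiting; the loop condition ends the loop.
		case <-ticker.C:
			pass()
			passes++
		}
	}
	return passes
}

// runDemo runs three passes at a short interval, then cancels the loop.
func runDemo() int {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()
	seen := 0
	return runEvery(ctx, time.Millisecond, func() {
		seen++
		if seen == 3 {
			cancel()
		}
	})
}

func main() {
	fmt.Println(runDemo())
}
```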

The effect of this algorithm should be that we take the emptiest nodes first and empty those before we empty a node which is busier, thus resulting in the highest number of 'empty' nodes that can be removed from the cluster.
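A small worked example makes the ordering argument concrete. It assumes a simplified model in which all spot capacity is one shared pool of free CPU and a node counts as drainable when its total requested CPU fits in what remains; the function names are hypothetical. With on-demand nodes requesting 500m, 200m and 300m and 600m of free spot CPU, emptiest-first frees two nodes while busiest-first frees only one.

```go
package main

import (
	"fmt"
	"sort"
)

// ordered returns a sorted copy of requested, ascending or descending.
func ordered(requested []int64, ascending bool) []int64 {
	out := append([]int64(nil), requested...)
	sort.Slice(out, func(i, j int) bool {
		if ascending {
			return out[i] < out[j]
		}
		return out[i] > out[j]
	})
	return out
}

// drainableCount walks the nodes in the given order and counts how many can
// be emptied into the shared pool of free spot CPU.
func drainableCount(requested []int64, spotFree int64) int {
	count := 0
	for _, r := range requested {
		if r <= spotFree {
			spotFree -= r
			count++
		}
	}
	return count
}

func main() {
	requested := []int64{500, 200, 300} // millicores requested per on-demand node
	const spotFree = 600                // total free millicores on spot nodes

	fmt.Println("emptiest first:", drainableCount(ordered(requested, true), spotFree))
	fmt.Println("busiest first:", drainableCount(ordered(requested, false), spotFree))
}
```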
