Run your production Kubernetes Cluster on AWS Spot Instances

Akash Nair
5 min read · Aug 26, 2020

One of the best features of Kubernetes is its robustness and its ability to scale capacity up or down seamlessly. This flexibility makes it possible to use servers with uncertain uptimes (Spot Instances) and save costs, typically to the tune of 70% to 90%.

While the idea of relying on servers that can go down unexpectedly might sound like a big risk, Kubernetes, with a few bells and whistles in place, makes it quite a stable solution.

In this post, we'll see how to make use of AWS Spot Instance pools to run a production-grade cluster with virtually zero downtime.

Requirements & Assumptions:

  • You are running and maintaining your cluster with kOps. That said, it should be fairly easy to adapt these steps to whichever tool you use (e.g. Kubespray, EKS, etc.)
  • This tutorial was written against Kubernetes version 1.17. The same steps are not guaranteed to work on versions below 1.17

Adding Spot Instances to the Cluster Configuration

The first step is to configure a mixedInstancesPolicy for your worker node Instance Group. Inside InstanceGroup.spec, add the following configuration:

spec:
  image: kope.io/k8s-1.17-debian-stretch-amd64-hvm-ebs-2020-01-17
  machineType: t3.xlarge
  maxSize: 10
  minSize: 3
  mixedInstancesPolicy:
    instances:
    - t3.xlarge
    - t2.xlarge
    - t3a.xlarge
    - m5a.xlarge
    - m5.xlarge
    - m4.xlarge
    - t3a.large
    onDemandAboveBase: 0
    onDemandBase: 0
    spotAllocationStrategy: capacity-optimized
  nodeLabels:
    KubeClusterAutoscaling: Enabled
    kops.k8s.io/instancegroup: nodes
  role: Node
  subnets:
  - eu-central-1a
  - eu-central-1b

The configuration above:

  • specifies the group of instance types to use under mixedInstancesPolicy.instances
  • sets the baseline number of On-Demand instances to 0 via mixedInstancesPolicy.onDemandBase, which means the cluster will be served by Spot Instances alone
  • sets mixedInstancesPolicy.onDemandAboveBase to 0 as well, which means any capacity provisioned above the base, for example during autoscaling, will also be fulfilled exclusively with Spot Instances
  • sets the allocation strategy for Spot Instances to capacity-optimized, which picks the Spot pools with the most spare capacity and therefore the lowest interruption risk. The alternative is lowest-price, which, as the name suggests, picks the cheapest Spot pools; those pools can have little spare capacity, making Spot interruptions more likely.

That's basically all you need to get started and run your cluster on Spot Instances. Once you apply the above changes to your cluster and roll your instance group, all your current instances will be replaced with Spot Instances.
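With kOps, applying the change looks roughly like this (a sketch; the instance group name nodes comes from the example above, and the exact flags may differ slightly between kOps versions):

kops edit instancegroup nodes      # add the mixedInstancesPolicy shown above
kops update cluster --yes          # push the new launch configuration to AWS
kops rolling-update cluster --yes  # replace existing nodes with Spot Instances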

Now, to handle the important part of keeping things stable, take a look at the following safeguards.

Safeguard I: Cluster Overprovisioner

When dealing with a volatile server setup like Spot Instances, it is important to have backup capacity ready for situations where a server goes down. For this, we install the Cluster Overprovisioner using its Helm chart.

What Cluster Overprovisioner (COP) does is create two PriorityClasses, with priorities 0 and -1. The class with priority 0 is named default and is assigned to every pod running on the cluster that doesn't request another class.

The priority -1 class is assigned to the placeholder pods created by the Cluster Overprovisioner Deployment. When additional pods need to be scheduled and the cluster has no free capacity, the COP pods with priority -1 are preempted and replaced by the new pods.

The lower-priority pods thus act as placeholders on the Nodes and fulfil the requirement for overprovisioning.
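As an illustration, the two PriorityClasses look roughly like this (a sketch; the exact names, values and descriptions depend on the chart version):

apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: default
value: 0
globalDefault: true
description: "Default class for pods that don't request another PriorityClass."
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: overprovisioning
value: -1
globalDefault: false
description: "For placeholder pods; preempted first when real workloads need room."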

The chart doesn't work out of the box and needs a values.yaml with some basic configuration. Here's an example:

deployments:
- name: cluster-op
  annotations: {}
  replicaCount: 2
  nodeSelector: {}
  resources:
    requests:
      cpu: 1500m
      memory: 6Gi
  tolerations: []
  affinity: {}
  labels: {}

This will deploy two pods with priority -1, each requesting the resources specified. Ideally, each pod should request about half a Node's resources, so that the COP pods force the cluster to launch a spare Node instead of being squeezed into existing capacity.
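With the values file saved, the chart can be installed along these lines (a sketch; at the time of writing the chart was published in the stable repo, and the release name is arbitrary):

helm install cluster-overprovisioner stable/cluster-overprovisioner -f values.yaml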

After installing this add-on, you will notice that a new Node is created with the COP pod running on it.

Safeguard II: Spot Termination Handler

AWS sends a termination notice to a Spot Instance 2 minutes before it is reclaimed. If the server is terminated with pods still running on it, this can lead to downtime and other problems. This is where the Spot Termination Handler comes in. It runs on every Node and polls the instance metadata for a termination notice. When it detects one, it starts draining the Node, which means evicting its pods so they can be rescheduled onto a different Node.
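Conceptually, what the handler does on each Node boils down to something like this (a simplified sketch, not the handler's actual code; NODE_NAME is a placeholder for the Node it runs on):

# The metadata endpoint returns 404 until AWS schedules the instance
# for termination, then returns the planned termination time.
if curl -sf http://169.254.169.254/latest/meta-data/spot/termination-time; then
  # Notice received: drain the node so its pods are rescheduled elsewhere.
  kubectl drain "$NODE_NAME" --ignore-daemonsets --delete-local-data
fi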

This is where our Cluster Overprovisioner setup comes into the picture. The evicted pods are moved to the spare Node created by the Cluster Overprovisioner, preempting the COP placeholder pod in the process. In this way, we handle the Spot termination gracefully.

Spot Termination Handler can be installed out of the box by running helm install spot-termination-handler stable/k8s-spot-termination-handler

Safeguard III: Kubernetes Descheduler

Kubernetes Descheduler fills the de-scheduling gap that exists in Kubernetes: the scheduler only concerns itself with the initial placement of pods and does not adapt placements dynamically as resource availability changes. Over time this can lead to duplicate pods running on the same Node, some Nodes being overutilized while others sit underutilized, and so on.

Kubernetes Descheduler can be installed using the Helm chart.

  • Add the repo: helm repo add descheduler https://kubernetes-sigs.github.io/descheduler/
  • Install the chart. It doesn't work out of the box either, so supply the values.yaml file provided below: helm install my-release descheduler/descheduler-helm-chart -f values.yaml

values.yaml

image:
  repository: us.gcr.io/k8s-artifacts-prod/descheduler/descheduler
  tag: ""
  pullPolicy: IfNotPresent

nameOverride: ""
fullnameOverride: ""

schedule: "*/2 * * * *"

cmdOptions:
  v: 3

deschedulerPolicy:
  strategies:
    RemoveDuplicates:
      enabled: true
    RemovePodsViolatingNodeTaints:
      enabled: true
    RemovePodsViolatingNodeAffinity:
      enabled: true
      params:
        nodeAffinityType:
        - requiredDuringSchedulingIgnoredDuringExecution
    RemovePodsViolatingInterPodAntiAffinity:
      enabled: true
    LowNodeUtilization:
      enabled: true
      params:
        nodeResourceUtilizationThresholds:
          thresholds:
            cpu: 20
            memory: 20
            pods: 20
          targetThresholds:
            cpu: 80
            memory: 80
            pods: 80

priorityClassName: system-cluster-critical

rbac:
  create: true

serviceAccount:
  create: true
  name:

The most important parts are thresholds and targetThresholds. A Node whose usage is below all the thresholds values is marked as "underutilized", while a Node whose usage exceeds any targetThresholds value is marked as "overutilized". The Descheduler then evicts pods from overutilized Nodes so they can be rescheduled onto underutilized ones, bringing Node usage between thresholds and targetThresholds. With the values above, for example, a Node at 15% CPU, memory, and pod count is underutilized, while one at 85% CPU is overutilized.
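Since the schedule above runs the Descheduler as a CronJob every two minutes, a quick way to verify it is working looks something like this (a sketch; the job name is a placeholder that varies per run):

kubectl get cronjob                      # the chart's CronJob should be listed
kubectl get jobs                         # one Job per completed run
kubectl logs job/<descheduler-job-name>  # shows which pods were evicted and why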

Conclusion

With these three safeguard add-ons, you are now ready to make use of Spot Instances and take advantage of their significant cost benefits: the Cluster Overprovisioner keeps a spare Node ready, the Spot Termination Handler drains Nodes gracefully when a termination notice arrives, and the Kubernetes Descheduler dynamically moves pods around to make sure cluster resources are used optimally.
