Using the Cluster Autoscaler

This section applies only to worker Machines. Cluster Autoscaler is a tool that automatically adjusts the size of the Kubernetes cluster based on the utilization of Pods and Nodes in your cluster. For more general information about the Cluster Autoscaler, please see the project documentation.

The following instructions are a reproduction of the Cluster API provider specific documentation from the Autoscaler project documentation.

Cluster Autoscaler on Cluster API

The cluster autoscaler on Cluster API uses the cluster-api project to manage the provisioning and de-provisioning of nodes within a Kubernetes cluster.

Kubernetes Version
Starting the Autoscaler
Configuring node group auto discovery
Connecting cluster-autoscaler to Cluster API management and workload Clusters
Enabling Autoscaling
- Scale from zero support
Specifying a Custom Resource Group
Specifying a Custom Resource Version
Sample manifest
- A note on permissions
Autoscaling with ClusterClass and Managed Topologies
Special note on GPU instances
Special note on balancing similar node groups

Kubernetes Version

The cluster-api provider requires Kubernetes v1.16 or greater to run the v1alpha3 version of the API.

Starting the Autoscaler

To enable the Cluster API provider, you must first specify it in the command line arguments to the cluster autoscaler binary. For example:

cluster-autoscaler --cloud-provider=clusterapi

Please note that this example shows only the cloud provider option; you will most likely need other command line flags. For a full list of options, run cluster-autoscaler --help.

Configuring node group auto discovery

You must configure node group auto discovery to tell the cluster autoscaler which cluster to search for scalable node groups.

Limiting cluster autoscaler to only match against resources in the blue namespace

--node-group-auto-discovery=clusterapi:namespace=blue

Limiting cluster autoscaler to only match against resources belonging to Cluster test1

--node-group-auto-discovery=clusterapi:clusterName=test1

Limiting cluster autoscaler to only match against resources matching the provided labels

--node-group-auto-discovery=clusterapi:color=green,shape=square

These can be mixed and matched in any combination, for example to only match resources in the staging namespace, belonging to the purple cluster, with the label owner=jim:

--node-group-auto-discovery=clusterapi:namespace=staging,clusterName=purple,owner=jim
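
To make the structure of the discovery value concrete, the following Python sketch splits such a string into its parts. It is an illustration only, not the autoscaler's own parser: the function name parse_discovery_spec and the returned dictionary layout are hypothetical, and the sketch assumes (based on the examples above) that namespace and clusterName are the only reserved keys and that every other key=value pair is treated as a label.

```python
def parse_discovery_spec(spec):
    """Split a --node-group-auto-discovery value into namespace, cluster name, and labels.

    Hypothetical helper for illustration; not part of cluster-autoscaler.
    """
    provider, sep, rest = spec.partition(":")
    if not sep or provider != "clusterapi":
        raise ValueError(f"expected a 'clusterapi:' prefix, got {spec!r}")

    parsed = {"namespace": None, "clusterName": None, "labels": {}}
    for pair in rest.split(","):
        key, eq, value = pair.partition("=")
        if not eq or not key:
            raise ValueError(f"expected key=value, got {pair!r}")
        if key in ("namespace", "clusterName"):
            parsed[key] = value  # reserved keys shown in the examples above
        else:
            parsed["labels"][key] = value  # assumption: anything else is a label
    return parsed


print(parse_discovery_spec("clusterapi:namespace=staging,clusterName=purple,owner=jim"))
```

Under these assumptions, the combined example yields the staging namespace, the purple cluster, and a single label owner=jim.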

Connecting cluster-autoscaler to Cluster API management and workload Clusters

[!IMPORTANT] --cloud-config is the flag for specifying the mounted volume path of the Kubernetes configuration (i.e. KUBECONFIG) that the cluster-autoscaler uses to communicate with the cluster-api management cluster for the purpose of scaling machines.

[!IMPORTANT] --kubeconfig is the flag for specifying the mounted volume path of the Kubernetes configuration (i.e. KUBECONFIG) that the cluster-autoscaler uses to communicate with the cluster-api workload cluster for the purpose of watching Nodes and Pods. This flag can be affected by the desired topology for deploying the cluster-autoscaler; please see the diagrams below for more information.

You will also need to provide the path to the kubeconfig(s) for the management and workload cluster you wish cluster-autoscaler to run against. To specify the kubeconfig path for the workload cluster to monitor, use the --kubeconfig option and supply the path to the kubeconfig. If the --kubeconfig option is not specified, cluster-autoscaler will attempt to use an in-cluster configuration. To specify the kubeconfig path for the management cluster to monitor, use the --cloud-config option and supply the path to the kubeconfig. If the --cloud-config option is not specified it will fall back to using the kubeconfig that was provided with the --kubeconfig option.

Autoscaler running in a joined cluster using service account credentials

+-----------------+
| mgmt / workload |
| --------------- |
|    autoscaler   |
+-----------------+

Use in-cluster config for both management and workload cluster:

cluster-autoscaler --cloud-provider=clusterapi

Autoscaler running in workload cluster using service account credentials, with separate management cluster

+--------+              +------------+
|  mgmt  |              |  workload  |
|        | cloud-config | ---------- |
|        |<-------------+ autoscaler |
+--------+              +------------+

Use in-cluster config for workload cluster, specify kubeconfig for management cluster:

cluster-autoscaler --cloud-provider=clusterapi \
                   --cloud-config=/mnt/kubeconfig

Autoscaler running in management cluster using service account credentials, with separate workload cluster

+------------+             +----------+
|    mgmt    |             | workload |
| ---------- | kubeconfig  |          |
| autoscaler +------------>|          |
+------------+             +----------+

Use in-cluster config for management cluster, specify kubeconfig for workload cluster:

cluster-autoscaler --cloud-provider=clusterapi \
                   --kubeconfig=/mnt/kubeconfig \
                   --clusterapi-cloud-config-authoritative

Autoscaler running anywhere, with separate kubeconfigs for management and workload clusters

+--------+               +------------+             +----------+
|  mgmt  |               |     ?      |             | workload |
|        |  cloud-config | ---------- | kubeconfig  |          |
|        |<--------------+ autoscaler +------------>|          |
+--------+               +------------+             +----------+

Use separate kubeconfigs for both management and workload cluster:

cluster-autoscaler --cloud-provider=clusterapi \
                   --kubeconfig=/mnt/workload.kubeconfig \
                   --cloud-config=/mnt/management.kubeconfig

Autoscaler running anywhere, with a common kubeconfig for management and workload clusters

+---------------+             +------------+
| mgmt/workload |             |     ?      |
|               |  kubeconfig | ---------- |
|               |<------------+ autoscaler |
+---------------+             +------------+

Use a single provided kubeconfig for both management and workload cluster:

cluster-autoscaler --cloud-provider=clusterapi \
                   --kubeconfig=/mnt/workload.kubeconfig
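
The fallback rules behind the five topologies above can be summarized in a short Python sketch. This is a simplified model, not the autoscaler's code: the function resolve_configs and the IN_CLUSTER marker are hypothetical names, and the handling of --clusterapi-cloud-config-authoritative (no fallback to --kubeconfig for the management cluster) is an assumption inferred from the management-cluster example above.

```python
IN_CLUSTER = "in-cluster"  # hypothetical marker for the service account configuration


def resolve_configs(kubeconfig=None, cloud_config=None, cloud_config_authoritative=False):
    """Return which configuration is used for the workload and management clusters."""
    # Workload cluster: --kubeconfig if given, otherwise the in-cluster configuration.
    workload = kubeconfig or IN_CLUSTER

    if cloud_config:
        # Management cluster: --cloud-config if given.
        management = cloud_config
    elif cloud_config_authoritative:
        # Assumption: with the authoritative flag there is no fallback to --kubeconfig.
        management = IN_CLUSTER
    else:
        # Otherwise fall back to whatever the workload cluster uses.
        management = workload
    return {"workload": workload, "management": management}


# Autoscaler in the management cluster, separate workload cluster:
print(resolve_configs(kubeconfig="/mnt/kubeconfig", cloud_config_authoritative=True))
```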

Enabling Autoscaling

To enable the automatic scaling of components in your cluster-api managed cloud there are a few annotations you need to provide. These annotations must be applied to either MachineSet, MachineDeployment, or MachinePool resources depending on the type of cluster-api mechanism that you are using.

There are two annotations that control how a cluster resource should be scaled:

cluster.x-k8s.io/cluster-api-autoscaler-node-group-min-size - This specifies the minimum number of nodes for the associated resource group. The autoscaler will not scale the group below this number. Please note that the cluster-api provider will not scale down to, or from, zero unless that capability is enabled (see Scale from zero support).
cluster.x-k8s.io/cluster-api-autoscaler-node-group-max-size - This specifies the maximum number of nodes for the associated resource group. The autoscaler will not scale the group above this number.

The autoscaler will monitor any MachineSet, MachineDeployment, or MachinePool containing both of these annotations.
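
As a small illustration of the rule that both annotations must be present, the following Python sketch reads the size limits from a resource's annotations. The function name autoscaling_bounds is hypothetical, and the sketch deliberately performs no validation beyond integer parsing; the autoscaler itself may apply additional checks.

```python
MIN_KEY = "cluster.x-k8s.io/cluster-api-autoscaler-node-group-min-size"
MAX_KEY = "cluster.x-k8s.io/cluster-api-autoscaler-node-group-max-size"


def autoscaling_bounds(annotations):
    """Return (min, max) if both size annotations are present, otherwise None."""
    if MIN_KEY not in annotations or MAX_KEY not in annotations:
        return None  # the resource is not monitored in this simplified model
    return int(annotations[MIN_KEY]), int(annotations[MAX_KEY])


print(autoscaling_bounds({MIN_KEY: "1", MAX_KEY: "5"}))
print(autoscaling_bounds({MAX_KEY: "5"}))
```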

Note: The cluster autoscaler does not enforce the node group sizes. If a node group is below the minimum number of nodes, or above the maximum number of nodes, the cluster autoscaler will not scale that node group up or down. The cluster autoscaler can be configured to enforce the minimum node group size by enabling the --enforce-node-group-min-size flag. Please see this entry in the Cluster Autoscaler FAQ for more information.

Note: MachinePool support in cluster-autoscaler requires a provider implementation that supports the “MachinePool Machines” feature.

Scale from zero support

The Cluster API community has defined an opt-in method for infrastructure providers to enable scaling from zero-sized node groups in the Opt-in Autoscaling from Zero enhancement. As defined in the enhancement, each provider may add support for scaling from zero to their provider, but they are not required to do so. If you are expecting built-in support for scaling from zero, please check with the Cluster API infrastructure providers that you are using.

If your Cluster API provider does not have support for scaling from zero, you may still use this feature through the capacity annotations. You may add these annotations to your MachineDeployments, or to your MachineSets if you are not using MachineDeployments (they are not needed on both), to instruct the cluster autoscaler about the sizing of the nodes in the node group. At a minimum, you must specify the CPU and memory annotations; these should match the expected capacity of the nodes created from the infrastructure.

Note: The scale from zero annotations will override any capacity information supplied by the Cluster API provider in the infrastructure machine templates. If both the annotations and the provider supplied capacity information are present, the annotations will take precedence.

For example, if my MachineDeployment will create nodes that have “16000m” CPU, “128G” memory, “100Gi” ephemeral disk storage, 2 NVidia GPUs, and can support 200 max pods, the following annotations will instruct the autoscaler how to expand the node group from zero replicas:

apiVersion: cluster.x-k8s.io/v1alpha4
kind: MachineDeployment
metadata:
  annotations:
    cluster.x-k8s.io/cluster-api-autoscaler-node-group-max-size: "5"
    cluster.x-k8s.io/cluster-api-autoscaler-node-group-min-size: "0"
    capacity.cluster-autoscaler.kubernetes.io/memory: "128G"
    capacity.cluster-autoscaler.kubernetes.io/cpu: "16"
    capacity.cluster-autoscaler.kubernetes.io/ephemeral-disk: "100Gi"
    capacity.cluster-autoscaler.kubernetes.io/maxPods: "200"
    # Device Plugin
    # Comment out the below annotation if DRA is enabled on your cluster running k8s v1.32.0 or greater
    capacity.cluster-autoscaler.kubernetes.io/gpu-type: "nvidia.com/gpu"
    # Dynamic Resource Allocation (DRA)
    # Uncomment the below annotation if DRA is enabled on your cluster running k8s v1.32.0 or greater
    # capacity.cluster-autoscaler.kubernetes.io/dra-driver: "gpu.nvidia.com"
    # Common in Device Plugin and DRA
    capacity.cluster-autoscaler.kubernetes.io/gpu-count: "2"

Note: the maxPods annotation will default to 110 if it is not supplied. This value is inspired by the Kubernetes best practices Considerations for large clusters.

Note: Select either the gpu-type or the dra-driver annotation for GPUs, depending on whether you are using the Device Plugin or Dynamic Resource Allocation (DRA). The gpu-count annotation is common to both.
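
The choice between the two GPU annotations can be sketched as a small Python helper that builds the annotation set for either mode. The helper name gpu_capacity_annotations and its parameters are hypothetical; the annotation keys and the example values nvidia.com/gpu and gpu.nvidia.com are taken from the manifest above.

```python
PREFIX = "capacity.cluster-autoscaler.kubernetes.io/"


def gpu_capacity_annotations(gpu_count, use_dra,
                             gpu_type="nvidia.com/gpu", dra_driver="gpu.nvidia.com"):
    """Build the GPU capacity annotations for Device Plugin or DRA (illustrative)."""
    # gpu-count is common to both modes; annotation values are always strings.
    annotations = {PREFIX + "gpu-count": str(gpu_count)}
    if use_dra:
        annotations[PREFIX + "dra-driver"] = dra_driver
    else:
        annotations[PREFIX + "gpu-type"] = gpu_type
    return annotations


print(gpu_capacity_annotations(2, use_dra=False))
print(gpu_capacity_annotations(2, use_dra=True))
```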

RBAC changes for scaling from zero

If you are using the opt-in support for scaling from zero as defined by the Cluster API infrastructure provider, you will need to add the infrastructure machine template types to your role permissions for the service account associated with the cluster autoscaler deployment. The service account will need permission to get, list, and watch the infrastructure machine templates for your infrastructure provider.

For example, when using the Kubemark provider you will need to set the following permissions:

rules:
  - apiGroups:
    - infrastructure.cluster.x-k8s.io
    resources:
    - kubemarkmachinetemplates
    verbs:
    - get
    - list
    - watch

Pre-defined labels and taints on nodes scaled from zero

Taints for scale from zero can be configured in two ways, listed below in order of precedence (highest first):

1. Capacity annotation (highest priority)

The capacity.cluster-autoscaler.kubernetes.io/taints annotation accepts a comma-separated list of taints and always takes precedence over taints defined in the scalable resource spec.

apiVersion: cluster.x-k8s.io/v1alpha4
kind: MachineDeployment
metadata:
  annotations:
    cluster.x-k8s.io/cluster-api-autoscaler-node-group-max-size: "5"
    cluster.x-k8s.io/cluster-api-autoscaler-node-group-min-size: "0"
    capacity.cluster-autoscaler.kubernetes.io/memory: "128G"
    capacity.cluster-autoscaler.kubernetes.io/cpu: "16"
    capacity.cluster-autoscaler.kubernetes.io/labels: "key1=value1,key2=value2"
    capacity.cluster-autoscaler.kubernetes.io/taints: "key1=value1:NoSchedule,key2=value2:NoExecute"

2. Scalable resource spec (requires CAPI v1.12+ with MachineTaintPropagation feature gate enabled)

When the MachineTaintPropagation feature gate is enabled in Cluster API, taints defined in spec.template.spec.taints of a MachineSet, MachineDeployment, or MachinePool are read directly by the cluster autoscaler. If an annotation taint has the same key and effect as a spec taint, the annotation value takes precedence.

apiVersion: cluster.x-k8s.io/v1beta2
kind: MachineDeployment
metadata:
  annotations:
    cluster.x-k8s.io/cluster-api-autoscaler-node-group-max-size: "5"
    cluster.x-k8s.io/cluster-api-autoscaler-node-group-min-size: "0"
    capacity.cluster-autoscaler.kubernetes.io/memory: "128G"
    capacity.cluster-autoscaler.kubernetes.io/cpu: "16"
    # Override the value of the "dedicated" taint defined in spec below.
    capacity.cluster-autoscaler.kubernetes.io/taints: "dedicated=gpu-override:NoSchedule"
spec:
  template:
    spec:
      taints:
        - key: dedicated
          value: gpu
          effect: NoSchedule
          propagation: Always
        - key: node-setup
          value: "true"
          effect: NoSchedule
          propagation: Always

Note: For labels, the capacity annotation values are merged with the labels propagated from the scalable Cluster API resource. If the same label key is defined in both, the annotation value takes precedence. Please see the Cluster API Book chapter on Metadata propagation for more information.

For taints, annotation taints are merged with spec taints. If the same key and effect is defined in both, the annotation value takes precedence. Spec taints without a matching annotation taint are preserved.
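
The merge rule for taints can be illustrated with a short Python sketch. It is a simplified model of the behavior described above, not the autoscaler's implementation; the function names parse_taint_annotation and merge_taints are hypothetical, and only the key, value, and effect fields are carried into the result.

```python
def parse_taint_annotation(annotation):
    """Parse 'key=value:Effect,...' into a list of taint dictionaries."""
    taints = []
    for item in annotation.split(","):
        item = item.strip()
        if not item:
            continue
        key_value, sep, effect = item.rpartition(":")
        if not sep:
            raise ValueError(f"expected key=value:Effect, got {item!r}")
        key, _, value = key_value.partition("=")
        taints.append({"key": key, "value": value, "effect": effect})
    return taints


def merge_taints(spec_taints, annotation_taints):
    """Merge taints; for an identical (key, effect) pair the annotation value wins."""
    merged = {}
    # Spec taints go in first, so annotation taints overwrite matching entries.
    for taint in list(spec_taints) + list(annotation_taints):
        merged[(taint["key"], taint["effect"])] = {
            "key": taint["key"],
            "value": taint.get("value", ""),
            "effect": taint["effect"],
        }
    return list(merged.values())


spec_taints = [
    {"key": "dedicated", "value": "gpu", "effect": "NoSchedule"},
    {"key": "node-setup", "value": "true", "effect": "NoSchedule"},
]
override = parse_taint_annotation("dedicated=gpu-override:NoSchedule")
for taint in merge_taints(spec_taints, override):
    print(taint)
```

With the values from the second manifest above, the dedicated taint takes the annotation value gpu-override, while the node-setup taint from the spec is preserved unchanged.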

Pre-defined CSI driver information on nodes scaled from zero

To provide CSI driver information for scale from zero, the optional capacity annotation may be supplied as a comma separated list of driver name and volume limit key/value pairs, as demonstrated in the example below:

apiVersion: cluster.x-k8s.io/v1alpha4
kind: MachineDeployment
metadata:
  annotations:
    cluster.x-k8s.io/cluster-api-autoscaler-node-group-max-size: "5"
    cluster.x-k8s.io/cluster-api-autoscaler-node-group-min-size: "0"
    capacity.cluster-autoscaler.kubernetes.io/memory: "128G"
    capacity.cluster-autoscaler.kubernetes.io/cpu: "16"
    capacity.cluster-autoscaler.kubernetes.io/csi-driver: "ebs.csi.aws.com=25,efs.csi.aws.com=16"

Note: The CSI driver information supplied through the capacity annotation specifies which CSI drivers will be installed on nodes scaled from zero, along with their respective volume limits. The format is driver-name=volume-limit with multiple drivers separated by commas.
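
The driver-name=volume-limit format can be illustrated with a few lines of Python. The function parse_csi_drivers is a hypothetical helper written for this example and is not taken from the autoscaler source.

```python
def parse_csi_drivers(annotation):
    """Parse 'driver=limit,driver=limit' into a {driver name: volume limit} mapping."""
    drivers = {}
    for item in annotation.split(","):
        item = item.strip()
        if not item:
            continue
        name, sep, limit = item.partition("=")
        if not sep or not name:
            raise ValueError(f"expected driver-name=volume-limit, got {item!r}")
        drivers[name] = int(limit)  # raises ValueError if the limit is not an integer
    return drivers


print(parse_csi_drivers("ebs.csi.aws.com=25,efs.csi.aws.com=16"))
```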

Per-NodeGroup autoscaling options

Custom autoscaling options per node group (MachineDeployment/MachinePool/MachineSet) can be specified as annotations with a common prefix:

apiVersion: cluster.x-k8s.io/v1beta1
kind: MachineDeployment
metadata:
  annotations:
    # overrides --scale-down-utilization-threshold global value for that specific MachineDeployment
    cluster.x-k8s.io/autoscaling-options-scaledownutilizationthreshold: "0.5"
    # overrides --scale-down-gpu-utilization-threshold global value for that specific MachineDeployment
    cluster.x-k8s.io/autoscaling-options-scaledowngpuutilizationthreshold: "0.5"
    # overrides --scale-down-unneeded-time global value for that specific MachineDeployment
    cluster.x-k8s.io/autoscaling-options-scaledownunneededtime: "10m0s"
    # overrides --scale-down-unready-time global value for that specific MachineDeployment
    cluster.x-k8s.io/autoscaling-options-scaledownunreadytime: "20m0s"
    # overrides --max-node-provision-time global value for that specific MachineDeployment
    cluster.x-k8s.io/autoscaling-options-maxnodeprovisiontime: "20m0s"
    # overrides --max-node-startup-time global value for that specific MachineDeployment
    cluster.x-k8s.io/autoscaling-options-maxnodestartuptime: "20m0s"

CPU Architecture awareness for single-arch clusters

Users of single-arch non-amd64 clusters who are using scale from zero support should also set the CAPI_SCALE_ZERO_DEFAULT_ARCH environment variable to set the architecture of the nodes they want to default the node group templates to. The autoscaler will default to amd64 if it is not set, and the node group templates may not match the nodes’ architecture, specifically when the workload triggering the scale-up uses a node affinity predicate checking for the node’s architecture.
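
The mismatch described above can be modeled in a few lines of Python. This is a simplified illustration under stated assumptions: the function names are hypothetical, the environment is passed in as a plain dictionary, and the node affinity check is reduced to a comparison against the well-known kubernetes.io/arch label.

```python
def default_template_arch(environ):
    """Architecture assumed for scale-from-zero node templates (amd64 if unset)."""
    return environ.get("CAPI_SCALE_ZERO_DEFAULT_ARCH") or "amd64"


def affinity_matches_template(required_arch, environ):
    """True if a pod requiring `required_arch` would match the template node's arch label."""
    template_labels = {"kubernetes.io/arch": default_template_arch(environ)}
    return template_labels["kubernetes.io/arch"] == required_arch


# An arm64-only workload on an arm64 cluster, without and with the variable set.
print(affinity_matches_template("arm64", {}))
print(affinity_matches_template("arm64", {"CAPI_SCALE_ZERO_DEFAULT_ARCH": "arm64"}))
```

In this model the first call does not match, which corresponds to the case where a scale-up from zero would not be triggered for the arm64 workload.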

Specifying a Custom Resource Group

By default all Kubernetes resources consumed by the Cluster API provider will use the group cluster.x-k8s.io, with a dynamically acquired version. In some situations, such as testing or prototyping, you may wish to change this group variable. For these situations you may use the environment variable CAPI_GROUP to change the group that the provider will use.

Please note that setting the CAPI_GROUP environment variable will also cause the annotations for minimum and maximum size to change. This behavior will also affect the machine annotation on nodes, the machine deletion annotation, and the cluster name label. For example, if CAPI_GROUP=test.k8s.io then the minimum size annotation key will be test.k8s.io/cluster-api-autoscaler-node-group-min-size, the machine annotation on nodes will be test.k8s.io/machine, the machine deletion annotation will be test.k8s.io/delete-machine, and the cluster name label will be test.k8s.io/cluster-name.
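
The effect of CAPI_GROUP on the derived keys can be sketched in Python as follows. The function capi_keys and the dictionary field names are hypothetical; the key suffixes are the ones named in the paragraph above, with the maximum size key assumed to follow the same pattern as the minimum size key.

```python
import os

DEFAULT_GROUP = "cluster.x-k8s.io"


def capi_keys(group=None):
    """Return the annotation and label keys derived from the API group."""
    group = group or os.environ.get("CAPI_GROUP") or DEFAULT_GROUP
    return {
        "min_size_annotation": f"{group}/cluster-api-autoscaler-node-group-min-size",
        "max_size_annotation": f"{group}/cluster-api-autoscaler-node-group-max-size",
        "machine_annotation": f"{group}/machine",
        "delete_machine_annotation": f"{group}/delete-machine",
        "cluster_name_label": f"{group}/cluster-name",
    }


for name, key in capi_keys("test.k8s.io").items():
    print(f"{name}: {key}")
```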

Specifying a Custom Resource Version

When determining the group version for the Cluster API types, by default the autoscaler will look for the latest version of the group. For example, if MachineDeployments exist in the cluster.x-k8s.io group at versions v1alpha1 and v1beta1, the autoscaler will choose v1beta1.

In some cases it may be desirable to specify which version of the API the cluster autoscaler should use. This can be useful in debugging scenarios, or in situations where you have deployed multiple API versions and wish to ensure that the autoscaler uses a specific version.

Setting the CAPI_VERSION environment variable will instruct the autoscaler to use the version specified. This works in a similar fashion to the API group environment variable, with the exception that there is no default value. When this variable is not set, the autoscaler will use the behavior described above.

Sample manifest

A sample manifest that will create a deployment running the autoscaler is available. It can be deployed by passing it through envsubst, providing these environment variables to set the namespace to deploy into as well as the image and tag to use:

export AUTOSCALER_NS=kube-system
export AUTOSCALER_IMAGE=registry.k8s.io/autoscaling/cluster-autoscaler:v1.29.0
envsubst < examples/deployment.yaml | kubectl apply -f-

A note on permissions

The cluster-autoscaler-management role for accessing Cluster API scalable resources is scoped to ClusterRole. This may not be ideal for all environments (e.g. multi-tenant environments). In such cases, it is recommended to scope it to a Role mapped to a specific namespace.

Autoscaling with ClusterClass and Managed Topologies

When using ClusterClass and Managed Topologies, the Cluster Topology controller attempts to set MachineDeployment replicas based on the spec.topology.workers.machineDeployments[].replicas field. To let the Cluster Autoscaler manage the replica count instead, leave this field unset in the Cluster definition.

The below Cluster definition shows which field to leave unset:

apiVersion: cluster.x-k8s.io/v1beta1
kind: Cluster
metadata:
  name: "my-cluster"
  namespace: default
spec:
  clusterNetwork:
    services:
      cidrBlocks: ["10.128.0.0/12"]
    pods:
      cidrBlocks: ["192.168.0.0/16"]
    serviceDomain: "cluster.local"
  topology:
    class: "quick-start"
    version: v1.24.0
    controlPlane:
      replicas: 1
    workers:
      machineDeployments:
        - class: default-worker
          name: linux
          ## replicas field is not set.
          ## replicas: 1

If the replicas field is unset in the Cluster definition, autoscaling can be enabled as described above.

Special note on GPU instances

As with other providers, if the device plugin on nodes that provides GPU resources takes some time to advertise the GPU resource to the cluster, this may cause Cluster Autoscaler to unnecessarily scale out multiple times.

To avoid this, you can configure kubelet on your GPU nodes to label the node before it joins the cluster by passing it the --node-labels flag. For the CAPI cloudprovider, the label format is as follows:

cluster-api/accelerator=<gpu-type>

<gpu-type> is arbitrary.

Note that if you are using the --gpu-total flag to limit the number of GPU resources in your cluster, the <gpu-type> value must match between the command line flag and the node labels. Setting these values incorrectly can lead to the autoscaler creating too many GPU resources.

For example, if you are using the autoscaler command line flag --gpu-total=gfx-hardware:1:2 to limit the number of gfx-hardware resources to a minimum of 1 and maximum of 2, then you should use the kubelet node label flag --node-labels=cluster-api/accelerator=gfx-hardware.
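
A quick consistency check between the two settings can be sketched in Python. The helper names are hypothetical, and the sketch assumes the --gpu-total value has the three colon-separated fields shown in the example above (type, minimum, maximum).

```python
def parse_gpu_total(value):
    """Split a --gpu-total value of the form type:min:max."""
    gpu_type, minimum, maximum = value.split(":")
    return gpu_type, int(minimum), int(maximum)


def accelerator_node_label(gpu_type):
    """Build the kubelet --node-labels value for the given GPU type."""
    return f"cluster-api/accelerator={gpu_type}"


def label_matches_flag(gpu_total, node_label):
    """True if the node label uses the same GPU type as the --gpu-total flag."""
    gpu_type, _, _ = parse_gpu_total(gpu_total)
    return node_label == accelerator_node_label(gpu_type)


print(label_matches_flag("gfx-hardware:1:2", "cluster-api/accelerator=gfx-hardware"))
```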

Special note on balancing similar node groups

The Cluster Autoscaler feature for balancing similar node groups (activated with the --balance-similar-node-groups flag) is a powerful and popular feature. When enabled, the Cluster Autoscaler will attempt to add new nodes in a manner that balances their creation between similar node groups. With Cluster API, these node groups correspond directly to the scalable resources (usually MachineDeployments and MachineSets) associated with the nodes in question. For these scalable resources to be considered similar by the Cluster Autoscaler, the nodes created from them must have the same capacity, labels, and taints.

To help the Cluster Autoscaler determine which node groups are similar, the command line flags --balancing-ignore-label and --balancing-label are provided. For an expanded discussion about balancing similar node groups and the options which are available, please see the Cluster Autoscaler FAQ.

Because Cluster API can address many different cloud providers, it is important to configure the balancing labels to ignore provider-specific labels which are used for carrying zonal information on Kubernetes nodes. The Cluster Autoscaler implementation for Cluster API does not assume any labels (aside from the well-known Kubernetes labels) to be ignored when running. Users must configure their Cluster Autoscaler deployment to ignore labels which might be different between nodes, but which do not otherwise affect node behavior or size (for example when two MachineDeployments are the same except for their deployment zones). The Cluster API community has decided not to carry cloud provider specific labels in the Cluster Autoscaler to reduce the possibility for labels to clash between providers. Additionally, the community has agreed to promote documentation and the use of the --balancing-ignore-label flag as the preferred method of deployment to reduce the extended need for maintenance on the Cluster Autoscaler when new providers are added or updated. For further context around this decision, please see the Cluster API Deep Dive into Cluster Autoscaler Node Group Balancing discussion from 2022-09-12.

The following table shows some of the most common labels used by cloud providers to designate regional or zonal information on Kubernetes nodes. It is shared here as a reference for users who might be deploying on these infrastructures.

Cloud Provider	Label to ignore	Notes
Alibaba Cloud	`topology.diskplugin.csi.alibabacloud.com/zone`	Used by the Alibaba Cloud CSI driver as a target for persistent volume node affinity
AWS	`alpha.eksctl.io/instance-id`	Used by `eksctl` to identify instances
AWS	`alpha.eksctl.io/nodegroup-name`	Used by `eksctl` to identify node group names
AWS	`eks.amazonaws.com/nodegroup`	Used by EKS to identify node groups
AWS	`k8s.amazonaws.com/eniConfig`	Used by the AWS CNI for custom networking
AWS	`lifecycle`	Used by AWS as a label for spot instances
AWS	`topology.ebs.csi.aws.com/zone`	Used by the AWS EBS CSI driver as a target for persistent volume node affinity
Azure	`topology.disk.csi.azure.com/zone`	Used as the topology key by the Azure Disk CSI driver
Azure	`agentpool`	Legacy label used to specify to which Azure node pool a particular node belongs
Azure	`kubernetes.azure.com/agentpool`	Used by AKS to identify to which node pool a particular node belongs
GCE	`topology.gke.io/zone`	Used to specify the zone of the node
IBM Cloud	`ibm-cloud.kubernetes.io/worker-id`	Used by the IBM Cloud Cloud Controller Manager to identify the node
IBM Cloud	`vpc-block-csi-driver-labels`	Used by the IBM Cloud CSI driver as a target for persistent volume node affinity
IBM Cloud	`ibm-cloud.kubernetes.io/vpc-instance-id`	Used when a VPC is in use on IBM Cloud
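
As a rough illustration of what ignoring a label means for the similarity check, the following Python sketch compares two label sets after dropping the ignored keys. It models only the label part of the comparison; capacity and taints are not considered, the function name labels_match is hypothetical, and the label values shown are made up for the example.

```python
def labels_match(labels_a, labels_b, ignored):
    """Compare two node label sets after removing the ignored label keys."""
    def strip(labels):
        return {key: value for key, value in labels.items() if key not in ignored}
    return strip(labels_a) == strip(labels_b)


# Two node groups that differ only in a zonal CSI label (hypothetical values).
group_a = {"pool": "workers", "topology.ebs.csi.aws.com/zone": "zone-a"}
group_b = {"pool": "workers", "topology.ebs.csi.aws.com/zone": "zone-b"}

print(labels_match(group_a, group_b, ignored=set()))
print(labels_match(group_a, group_b, ignored={"topology.ebs.csi.aws.com/zone"}))
```

In this model the two groups only compare as equal once the zonal label is ignored, which is the situation the --balancing-ignore-label flag is meant to address.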

Defaulting of the MachineDeployment, MachineSet replicas field

Please note that the replicas field of MachineDeployment and MachineSet has special defaulting logic to provide a smooth integration with the autoscaler. The replicas field is defaulted based on the autoscaler min and max size annotations. The goal is to pick a default value that lies within the [min size, max size] range so that the autoscaler can take control of the replicas field.

The defaulting logic is as follows:

if the autoscaler min size and max size annotations are set:
- if it’s a new MachineDeployment or MachineSet, use min size
- if the replicas field of the old MachineDeployment or MachineSet is < min size, use min size
- if the replicas field of the old MachineDeployment or MachineSet is > max size, use max size
- if the replicas field of the old MachineDeployment or MachineSet is in the [min size, max size] range, keep the value from the old MachineDeployment or MachineSet
otherwise, use 1
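
The defaulting steps listed above can be expressed as a short Python function. This is a sketch of the listed rules only, not the Cluster API defaulting code itself; the function name default_replicas and its parameters are hypothetical, and the range is treated as inclusive of both bounds, as implied by the less-than and greater-than comparisons in the list.

```python
def default_replicas(min_size=None, max_size=None, old_replicas=None):
    """Pick a default replicas value following the rules listed above.

    min_size / max_size: values of the autoscaler size annotations, or None if unset.
    old_replicas: replicas of the existing object, or None for a new object.
    """
    if min_size is None or max_size is None:
        return 1  # annotations not set: fall back to 1
    if old_replicas is None:
        return min_size  # new MachineDeployment or MachineSet
    if old_replicas < min_size:
        return min_size
    if old_replicas > max_size:
        return max_size
    return old_replicas  # already within [min_size, max_size]


print(default_replicas())                         # no annotations
print(default_replicas(2, 5))                     # new object
print(default_replicas(2, 5, old_replicas=7))     # above the maximum
print(default_replicas(2, 5, old_replicas=3))     # within the range
```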

The Cluster API Book