Storage constraints on GKE Autopilot

Description

GKE Autopilot imposes a hard limit on ephemeral storage: the ephemeral storage requests of all containers in a pod may not add up to more than 10 GiB. Specifying a higher limit results in the runtime clamping the limit to the request, and specifying a higher request causes Kubernetes to reject the pod at creation. As a result, the build container may in some cases run out of ephemeral storage, which leads to a hard kill of the entire pod.
In my case, this makes it impossible to build certain Docker images with Kaniko. Kaniko has to unpack the layer filesystem, modify it according to the instructions in the Dockerfile, and then snapshot the filesystem into the new layer, which pushes the build container over the hard storage limit.
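
To make the limit concrete, here is a minimal sketch of a pod whose combined ephemeral storage requests exceed the 10 GiB total described above. The pod name, container names, images, and sizes are assumptions chosen for illustration only; they are not taken from an actual runner pod.

```yaml
# Illustrative only: names, images, and sizes are assumptions.
apiVersion: v1
kind: Pod
metadata:
  name: runner-example
spec:
  containers:
    - name: build
      image: alpine:3
      resources:
        requests:
          ephemeral-storage: 8Gi   # request of the build container
    - name: helper
      image: alpine:3
      resources:
        requests:
          ephemeral-storage: 4Gi   # request of the helper container
# 8Gi + 4Gi = 12Gi in total, which is above the 10 GiB limit,
# so per the behavior described above this pod would be rejected.
```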

The Google-recommended workaround is to use generic ephemeral volumes, which are not subject to this restriction but must be declared under a different volume key in the pod spec (ephemeral instead of emptyDir).

Proposal

This proposal aims to add generic ephemeral volumes to the set of Kubernetes volume types that can be configured, so that such a volume can replace the existing emptyDir volume that holds the repo.

Using the new volume type would change the resulting pod YAML from the current

volumes:
  - emptyDir: {}
    name: repo

to something along the lines of

volumes:
  - name: repo
    ephemeral:
      volumeClaimTemplate:
        metadata:
          labels:
            type: ephemeral
        spec:
          accessModes: ["ReadWriteOnce"]
          storageClassName: "ephemeral-ssd"
          resources:
            requests:
              storage: 16Gi

As far as I understand, this will require adding a new KubernetesVolumes struct member, as well as wiring the Manager to create the runner pod with the appropriate volume configuration to host the repo.

Important

Note that using the new volume type requires the user to create a suitable storage class in the cluster beforehand (ephemeral-ssd in the example above). The Manager should not be responsible for this, as it's not a Runner-specific task.
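
For reference, a storage class matching the ephemeral-ssd name used in the volume claim template could look roughly like the sketch below. The provisioner and disk type are assumptions based on the GKE Persistent Disk CSI driver and are not part of this proposal; cluster administrators would adjust them to their environment.

```yaml
# Sketch of a user-managed StorageClass; provisioner and parameters are assumptions.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: ephemeral-ssd                      # must match storageClassName in the volumeClaimTemplate
provisioner: pd.csi.storage.gke.io         # assumed: GKE Persistent Disk CSI driver
parameters:
  type: pd-ssd                             # assumed: SSD-backed persistent disk
volumeBindingMode: WaitForFirstConsumer    # provision the disk once the pod is scheduled
reclaimPolicy: Delete                      # remove the disk when its claim is deleted
```

Since the claim behind a generic ephemeral volume is owned by the pod, it is deleted together with the pod, and with a Delete reclaim policy the backing disk should be released as well.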

Links to related issues and merge requests / references

N/A