Kubernetes: autoscaler for idle capacity via pause pods

What does this MR do?

Adds pause pod autoscaling for the Kubernetes executor to pre-warm cluster capacity.

Problem

Jobs running on the Kubernetes executor may experience delays waiting for pods to be scheduled when node capacity is exhausted. The cluster autoscaler needs time to provision new nodes, causing job startup latency.

Solution

Implements pause pod management based on [[runners.kubernetes.autoscaler.policy]] configuration (matching the existing fleeting/taskscaler pattern). Pause pods reserve cluster capacity. When real jobs arrive, the scheduler can quickly preempt the pause pods and hand that capacity to the job pods, reducing job startup latency.

Key components:

  • Policy & Scheduling: Cron-based policy selection using gitlab.com/gitlab-org/fleeting/taskscaler/cron (depends on gitlab-org/fleeting/taskscaler!71, which is now merged)
  • Pause Pod Manager: Manages a Deployment of pause pods, reconciling replica count based on active policy
  • Provider Integration: Wraps the Kubernetes executor provider to add ManagedExecutorProvider lifecycle hooks

Configuration

[runners.kubernetes.autoscaler]
  max_pause_pods = 10
  pause_pod_image = "registry.k8s.io/pause:3.10"  # optional, this is the default

  [[runners.kubernetes.autoscaler.policy]]
    idle_count = 5
    idle_time = "30m"
    periods = ["* 8-17 * * mon-fri"]
    timezone = "UTC"

  [[runners.kubernetes.autoscaler.policy]]
    idle_count = 0
    periods = ["* * * * *"]  # default fallback
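To illustrate how the two policies above could be resolved at a given moment, here is a minimal Python sketch. It is not the MR's implementation, which uses the Go package gitlab.com/gitlab-org/fleeting/taskscaler/cron. The names Policy, period_matches, and active_policy are hypothetical. The sketch assumes the first matching policy wins, which is consistent with the fallback being listed last in the example. The actual precedence rule may differ. The cron matcher is deliberately simplified: it supports only `*`, single values, ranges, comma lists, and day names.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

# Cron day-of-week numbering: Sunday is 0.
DAY_NAMES = {"sun": 0, "mon": 1, "tue": 2, "wed": 3, "thu": 4, "fri": 5, "sat": 6}


@dataclass
class Policy:
    """Hypothetical mirror of one [[runners.kubernetes.autoscaler.policy]] entry."""
    idle_count: int
    periods: list = field(default_factory=lambda: ["* * * * *"])
    tz: str = "UTC"


def _field_matches(expr, value, names=None):
    """Match one cron field: '*', single values, ranges, and comma lists."""
    def to_int(token):
        token = token.lower()
        return names[token] if names and token in names else int(token)

    for part in expr.split(","):
        if part == "*":
            return True
        low, _, high = part.partition("-")
        lo = to_int(low)
        hi = to_int(high) if high else lo
        if lo <= value <= hi:
            return True
    return False


def period_matches(period, now):
    """Simplified 5-field match: minute, hour, day-of-month, month, day-of-week.

    Unlike full cron, all five fields are simply ANDed together.
    """
    minute, hour, dom, month, dow = period.split()
    return (
        _field_matches(minute, now.minute)
        and _field_matches(hour, now.hour)
        and _field_matches(dom, now.day)
        and _field_matches(month, now.month)
        and _field_matches(dow, now.isoweekday() % 7, DAY_NAMES)
    )


def active_policy(policies, now_utc):
    """Return the first policy with a matching period (assumed precedence), or None."""
    for policy in policies:
        tz = timezone.utc if policy.tz == "UTC" else ZoneInfo(policy.tz)
        local = now_utc.astimezone(tz)
        if any(period_matches(p, local) for p in policy.periods):
            return policy
    return None
```

Under these assumptions, calling active_policy with the two policies from the example and a timezone-aware datetime for a Wednesday at 09:30 UTC returns the policy with idle_count = 5. On a Saturday, or after 17:59 on a weekday, it returns the fallback with idle_count = 0.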

How it works

  1. The pause pod manager runs a reconciliation loop every 10 seconds
  2. It selects the active policy based on the current time and each policy's cron periods
  3. It calculates the desired replica count from idle_count and the optional scale_factor
  4. It creates or updates a Deployment to maintain the target number of pause pods
  5. Pause pods use a low-priority class, so they are preempted when real jobs need capacity
  6. On shutdown, the Deployment is cleaned up
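The replica calculation and the Deployment update in the steps above can be sketched roughly as follows. This is an illustrative Python sketch, not the MR's Go code. The exact formula is not given in this description, so the sketch makes two assumptions. First, scale_factor is applied to the number of in-use job slots, as in the existing taskscaler policies. Second, max_pause_pods acts as a hard upper bound. FakeDeployments, desired_pause_pods, and reconcile_once are hypothetical names, and FakeDeployments is an in-memory stand-in for a real Kubernetes client.

```python
import math


def desired_pause_pods(idle_count, max_pause_pods, scale_factor=0.0, in_use=0):
    """Target pause pod replicas for the active policy (assumed formula).

    Takes the larger of idle_count and in_use * scale_factor, then clamps
    the result to the range [0, max_pause_pods].
    """
    scaled = math.ceil(in_use * scale_factor)
    target = max(idle_count, scaled)
    return max(0, min(target, max_pause_pods))


class FakeDeployments:
    """In-memory stand-in for the Deployment API (hypothetical)."""

    def __init__(self):
        self.replicas = None  # None means the Deployment does not exist
        self.writes = 0       # counts create/update calls

    def apply(self, replicas):
        self.replicas = replicas
        self.writes += 1

    def delete(self):
        self.replicas = None


def reconcile_once(client, idle_count, max_pause_pods, scale_factor=0.0, in_use=0):
    """One pass of the loop: write to the API only when the target changed."""
    want = desired_pause_pods(idle_count, max_pause_pods, scale_factor, in_use)
    if client.replicas != want:
        client.apply(want)
    return want
```

In the real manager, a pass like reconcile_once would run on the 10-second ticker, and client.delete() would correspond to the cleanup on shutdown. Skipping the write when the replica count is unchanged is a design choice of this sketch that avoids needless API calls. Whether the MR does the same is not stated.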

Relates to gitlab-com/gl-infra/production-engineering#28168

Dependencies

Author's checklist

  • Tests added for new functionality
  • Documentation added
  • RBAC docs generator updated to support apps API group
Edited by Igor
