Add slot-based cgroup support for Docker executor

What does this MR do?

Add configuration options and implementation to use taskscaler slot numbers for dynamic cgroup naming, enabling persistent resource pools per slot.

  • Add UseSlotCgroups, SlotCgroupTemplate, ServiceSlotCgroupTemplate config fields
  • Implement getCgroupParent() and getServiceCgroupParent() with slot resolution (sketched after this list)
  • Add GetAcquisition() method to autoscaler AcquisitionRef for slot access
  • Update createHostConfig methods to use dynamic cgroup resolution
  • Add unit tests for all slot cgroup functionality
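
For reference, here is a minimal sketch of how the slot number could be substituted into the cgroup templates, assuming a {slot} placeholder as in the runner configuration shown later. The struct and function names below are illustrative only, not the actual runner code.

// Sketch only: slot-based cgroup parent resolution with a "{slot}" placeholder.
package main

import (
    "fmt"
    "strconv"
    "strings"
)

// Illustrative stand-in for the new Docker executor config fields.
type dockerConfig struct {
    UseSlotCgroups            bool
    CgroupParent              string // existing static setting
    SlotCgroupTemplate        string
    ServiceSlotCgroupTemplate string
}

// resolveCgroupParent substitutes the acquired slot number into the template.
// It falls back to the static cgroup parent when slot cgroups are disabled,
// no slot is available (job not running under the autoscaler), or the
// template is empty.
func resolveCgroupParent(cfg dockerConfig, template string, slot int, haveSlot bool) string {
    if !cfg.UseSlotCgroups || !haveSlot || template == "" {
        return cfg.CgroupParent
    }
    return strings.ReplaceAll(template, "{slot}", strconv.Itoa(slot))
}

func main() {
    cfg := dockerConfig{
        UseSlotCgroups:            true,
        SlotCgroupTemplate:        "runner-slot-{slot}.slice",
        ServiceSlotCgroupTemplate: "runner-slot-{slot}.slice",
    }
    // Slot 2 acquired from the taskscaler acquisition for this job.
    fmt.Println(resolveCgroupParent(cfg, cfg.SlotCgroupTemplate, 2, true))        // runner-slot-2.slice
    fmt.Println(resolveCgroupParent(cfg, cfg.ServiceSlotCgroupTemplate, 2, true)) // runner-slot-2.slice
}

The resolved value is then used as the container's cgroup parent when building the Docker host config, so the build container and its service containers land in the same slot slice.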

Why was this MR needed?

Enables persistent resource isolation by placing all containers for a job (build and services) into the same slot-derived cgroup. This allows administrators to pre-create cgroups for each slot, providing consistent resource allocation and isolation for entire jobs across executions.

What's the best way to test this MR?

This MR has unit tests, but I also ran a load test with the static fleeting plugin to verify that a job could be constrained to a single slot's worth of resources.

Test Environment

  • VM: 8 CPUs, 31GB RAM
  • Target: 4 slots with 2 CPUs each
  • GitLab Runner: docker-autoscaler executor with slot-based cgroup feature

Cgroup Configuration

Created systemd slices with CPU affinity and resource limits:

# Slot assignments
sudo systemctl set-property --runtime runner-slot-0.slice CPUQuota=200% AllowedCPUs=0,1
sudo systemctl set-property --runtime runner-slot-1.slice CPUQuota=200% AllowedCPUs=2,3
sudo systemctl set-property --runtime runner-slot-2.slice CPUQuota=200% AllowedCPUs=4,5
sudo systemctl set-property --runtime runner-slot-3.slice CPUQuota=200% AllowedCPUs=6,7

# Memory limits (2GB per slot)
for i in {0..3}; do
    sudo systemctl set-property --runtime runner-slot-$i.slice MemoryMax=2G
done

GitLab Runner Configuration

[runners.docker]
  use_slot_cgroups = true
  slot_cgroup_template = "runner-slot-{slot}.slice"
  service_slot_cgroup_template = "runner-slot-{slot}.slice"

[runners.autoscaler]
  plugin = "fleeting-plugin-static"
  capacity_per_instance = 4  # 4 slots per VM
  max_instances = 1          # Single VM with 4 slots

  [runners.autoscaler.plugin_config]
    path = "/path/to/instances.json"

Static Plugin Instance Configuration

{
    "cgroup-slot-test": {
        "os": "linux",
        "arch": "amd64",
        "protocol": "ssh",
        "username": "josephburnett",
        "key_path": "/home/josephburnett/.ssh/id_ed25519",
        "internal_addr": "10.128.0.33"
    }
}

CPU Burn Test

Used a GitLab CI job that attempts to use 8 CPUs for 15 seconds:

job:
  script: for i in {1..8}; do timeout 15s yes > /dev/null & done ; wait

CPU usage on the VM was monitored with htop during job execution.

Results:

  • Without cgroup: Used all 8 CPUs
  • In slot-0 cgroup: Limited to CPUs 0,1 only

What are the relevant issue numbers?

N/A
