Add slot-based cgroup support for Docker executor
What does this MR do?
Add configuration options and the supporting implementation to use taskscaler slot numbers for dynamic cgroup naming, enabling persistent resource pools per slot.
- Add UseSlotCgroups, SlotCgroupTemplate, ServiceSlotCgroupTemplate config fields
- Implement getCgroupParent() and getServiceCgroupParent() with slot resolution
- Add GetAcquisition() method to autoscaler AcquisitionRef for slot access
- Update createHostConfig methods to use dynamic cgroup resolution
- Add unit tests for all slot cgroup functionality
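
At runtime the `{slot}` placeholder in the configured templates is replaced with the slot number acquired from taskscaler for the current job. The sketch below shows that resolution in simplified form; the struct and helper names (`dockerConfig`, `resolveSlotCgroup`, the `haveSlot` flag) are illustrative only and are not the exact runner internals:

```go
package cgroup

import (
	"strconv"
	"strings"
)

// Illustrative subset of the Docker executor configuration; the real fields
// live in the runner's Docker config struct.
type dockerConfig struct {
	UseSlotCgroups            bool
	CgroupParent              string // static parent, used as the fallback in this sketch
	SlotCgroupTemplate        string // e.g. "runner-slot-{slot}.slice"
	ServiceSlotCgroupTemplate string
}

// resolveSlotCgroup substitutes the acquired slot number into a template.
// Hypothetical helper: the MR performs this inside getCgroupParent() and
// getServiceCgroupParent().
func resolveSlotCgroup(template string, slot int) string {
	return strings.ReplaceAll(template, "{slot}", strconv.Itoa(slot))
}

// getCgroupParent returns the slot-derived cgroup when the feature is enabled
// and a slot was acquired; otherwise it falls back to the static parent.
func getCgroupParent(cfg dockerConfig, slot int, haveSlot bool) string {
	if cfg.UseSlotCgroups && haveSlot && cfg.SlotCgroupTemplate != "" {
		return resolveSlotCgroup(cfg.SlotCgroupTemplate, slot)
	}
	return cfg.CgroupParent
}

// getServiceCgroupParent does the same for service containers, using the
// service-specific template.
func getServiceCgroupParent(cfg dockerConfig, slot int, haveSlot bool) string {
	if cfg.UseSlotCgroups && haveSlot && cfg.ServiceSlotCgroupTemplate != "" {
		return resolveSlotCgroup(cfg.ServiceSlotCgroupTemplate, slot)
	}
	return cfg.CgroupParent
}
```

When `slot_cgroup_template` and `service_slot_cgroup_template` resolve to the same value (as in the test configuration below), the build container and all of its service containers land in the same `runner-slot-N.slice`, which is what allows a whole job to be confined to one pre-created resource pool.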
Why was this MR needed?
Enables persistent resource isolation by placing all containers for a job (build and services) into the same slot-derived cgroup. This allows administrators to pre-create cgroups for each slot, providing consistent resource allocation and isolation for entire jobs across executions.
What's the best way to test this MR?
This MR has unit tests. I also ran a load test with the static fleeting plugin to verify that a job could be constrained to a single slot's worth of resources.
Test Environment
- VM: 8 CPUs, 31GB RAM
- Target: 4 slots with 2 CPUs each
- GitLab Runner: docker-autoscaler executor with slot-based cgroup feature
Cgroup Configuration
Created systemd slices with CPU affinity and resource limits:
```shell
# Slot assignments
sudo systemctl set-property --runtime runner-slot-0.slice CPUQuota=200% AllowedCPUs=0,1
sudo systemctl set-property --runtime runner-slot-1.slice CPUQuota=200% AllowedCPUs=2,3
sudo systemctl set-property --runtime runner-slot-2.slice CPUQuota=200% AllowedCPUs=4,5
sudo systemctl set-property --runtime runner-slot-3.slice CPUQuota=200% AllowedCPUs=6,7

# Memory limits (2GB per slot)
for i in {0..3}; do
  sudo systemctl set-property --runtime runner-slot-$i.slice MemoryMax=2G
done
```
GitLab Runner Configuration
```toml
[runners.docker]
use_slot_cgroups = true
slot_cgroup_template = "runner-slot-{slot}.slice"
service_slot_cgroup_template = "runner-slot-{slot}.slice"

[runners.autoscaler]
plugin = "fleeting-plugin-static"
capacity_per_instance = 4  # 4 slots per VM
max_instances = 1          # Single VM with 4 slots

[runners.autoscaler.plugin_config]
path = "/path/to/instances.json"
```
Static Plugin Instance Configuration
```json
{
  "cgroup-slot-test": {
    "os": "linux",
    "arch": "amd64",
    "protocol": "ssh",
    "username": "josephburnett",
    "key_path": "/home/josephburnett/.ssh/id_ed25519",
    "internal_addr": "10.128.0.33"
  }
}
```
CPU Burn Test
Used a GitLab CI job that attempts to use all 8 CPUs for 15 seconds:
```yaml
job:
  script: for i in {1..8}; do timeout 15s yes > /dev/null & done ; wait
```

```shell
# Monitor CPU usage during job execution
htop
```
Results:
- Without a slot cgroup: the job used all 8 CPUs
- In the slot-0 cgroup: the job was limited to CPUs 0 and 1 only
What are the relevant issue numbers?
N/A