Stabilize build job EC2 instances
Since the end of 2022-11, we have been seeing multiple connected problems:
- kernel build jobs based on spot instances are nearly all terminated before they get anything done (c5d.4xlarge)
- spot instance requests for kernel build jobs fail with no-capacity-available (c5ad.4xlarge, c6id.4xlarge)
- launching non-spot instances for kernel build jobs fails with insufficient-instance-capacity (c5ad.4xlarge)
At higher cost, this has been stabilized for now by using non-spot c5d.4xlarge instances for build jobs.
Spot instance interruptions:
<cki_bot> 🤠 P529946989 J2406340516 build aarch64 failed: Detected System failure, not retrying: job restarted externally as J2406496758 https://l.cki-project.org/DyhJS59f
Spot instances not available:
<cki_bot> 👻 Terminated 40 old spot instance request(s)
Normal instances not available (journald):
gitlab-runner[903]: ERROR: Error creating machine: Error in driver during machine creation: Error launching instance: InsufficientInstanceCapacity: We currently do not have sufficient c5ad.4xlarge capacity in the Availability Zone you requested (us-east-1a). Our system will be working on provisioning additional capacity. You can currently get c5ad.4xlarge capacity by not specifying an Availability Zone in your request or choosing ...
In the job logs, this looks like the following (https://gitlab.com/redhat/red-hat-ci-tools/kernel/cki-internal-pipelines/cki-trusted-contributors/-/jobs/2406340516):
Pulling docker image registry.gitlab.com/gitlab-org/gitlab-runner/gitlab-runner-helper:x86_64-c6bb62f6 ...
WARNING: Failed to pull image with policy "always": Cannot connect to the Docker daemon at tcp://10.29.162.188:2376. Is the docker daemon running? (manager.go:203:0s)
ERROR: Failed to cleanup volumes
ERROR: Job failed (system failure): Cannot connect to the Docker daemon at tcp://10.29.162.188:2376. Is the docker daemon running?
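To count how often each of these symptoms occurs, the log lines above could be matched against a few patterns. The following is only a rough sketch: the category names are made up for illustration, the patterns are derived solely from the excerpts quoted in this issue, and an unreachable Docker daemon is merely consistent with a terminated machine, not proof of a spot interruption.

```python
import re

# Patterns derived from the log excerpts quoted above.
# The category names are hypothetical and only used for illustration.
SIGNATURES = [
    (
        "insufficient_capacity",
        re.compile(
            r"InsufficientInstanceCapacity: .*?sufficient (?P<instance_type>\S+) "
            r"capacity in the Availability Zone you requested \((?P<az>[^)]+)\)"
        ),
    ),
    (
        "docker_daemon_unreachable",
        re.compile(r"Cannot connect to the Docker daemon at tcp://(?P<host>[\d.]+):\d+"),
    ),
    (
        "spot_requests_terminated",
        re.compile(r"Terminated (?P<count>\d+) old spot instance request"),
    ),
]


def classify(line):
    """Return (category, extracted fields) for a log line, or (None, {})."""
    for name, pattern in SIGNATURES:
        match = pattern.search(line)
        if match:
            return name, match.groupdict()
    return None, {}
```

Feeding journald or job log lines through `classify` would give per-category counts that could back a dashboard or an alert.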
Issues to investigate
- docker-machine only supports RequestSpotInstances (deprecated), and not CreateFleet (new and shiny); CreateFleet allows specifying multiple instance types and subnets and lets AWS figure out where to launch what to satisfy your need for cheap computing: gitlab-org/ci-cd/docker-machine#79
- the build jobs use machines with an SSD as fast scratch space; the SSD seems to be the bottleneck in all of this
- all build jobs run on the internal network segments, while the only thing the cki-trusted-contributors ones need is the internal Sentry (afaict); moving them to a dedicated external VPC with a lot of subnets in all AZs should increase the chance of getting a machine, esp. with CreateFleet
- GitLab is working on a successor to docker-machine, but this might still take some time: gitlab-org/gitlab-runner#29219 (closed)
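To illustrate why CreateFleet helps here, the sketch below builds a request that offers AWS every combination of instance type and subnet and asks for a single spot instance. It is a minimal example via boto3, not what docker-machine does: the launch template name and subnet IDs are hypothetical placeholders, and the "capacity-optimized" allocation strategy is an assumption about what would suit these jobs.

```python
def build_fleet_request(launch_template_name, instance_types, subnet_ids):
    """Build keyword arguments for the EC2 CreateFleet API (boto3: ec2.create_fleet)."""
    # One override per (instance type, subnet) pair: AWS picks among them.
    overrides = [
        {"InstanceType": instance_type, "SubnetId": subnet_id}
        for instance_type in instance_types
        for subnet_id in subnet_ids
    ]
    return {
        "Type": "instant",  # synchronous one-off request, no fleet to maintain
        "LaunchTemplateConfigs": [
            {
                "LaunchTemplateSpecification": {
                    "LaunchTemplateName": launch_template_name,
                    "Version": "$Latest",
                },
                "Overrides": overrides,
            }
        ],
        "TargetCapacitySpecification": {
            "TotalTargetCapacity": 1,
            "DefaultTargetCapacityType": "spot",
        },
        # Assumption: prefer pools with spare capacity over the lowest price.
        "SpotOptions": {"AllocationStrategy": "capacity-optimized"},
    }


def launch_one(request):
    """Send the request; returns (instance ids, errors reported by AWS)."""
    import boto3  # imported lazily so the builder works without AWS access

    response = boto3.client("ec2").create_fleet(**request)
    instance_ids = [
        instance_id
        for launched in response.get("Instances", [])
        for instance_id in launched["InstanceIds"]
    ]
    return instance_ids, response.get("Errors", [])
```

With the three instance types mentioned in this issue and, say, six subnets, AWS would get 18 pools to choose from instead of the single type/AZ pair that RequestSpotInstances is given today.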
Some potential TODOs
- determine if we really need SSDs on those machines
- set up a disabled non-spot runner for emergencies 🚑
- hack up docker-machine to use CreateFleet instead of RequestSpotInstances: https://gitlab.com/cki-project/docker-machine/-/merge_requests/4
- to improve metrics/logging, export CloudTrail events to Loki, set up alerts for BidEvictedEvent 📨
- we can apparently use boto3 to access CloudTrail: https://github.com/claick-oliveira/cloudtrail-export-logs
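The CloudTrail export could look roughly like the sketch below: look up BidEvictedEvent entries with boto3, convert them to the JSON body of Loki's push endpoint (/loki/api/v1/push), and POST it. This is untested against the real setup and rests on assumptions: that the events show up in the CloudTrail lookup_events history, that the Loki instance accepts unauthenticated pushes, and the Loki URL and labels, which are placeholders.

```python
import json
import urllib.request


def fetch_bid_evictions(start, end):
    """Look up BidEvictedEvent entries in CloudTrail between two datetimes."""
    import boto3  # imported lazily so the conversion below works without AWS access

    paginator = boto3.client("cloudtrail").get_paginator("lookup_events")
    pages = paginator.paginate(
        LookupAttributes=[
            {"AttributeKey": "EventName", "AttributeValue": "BidEvictedEvent"}
        ],
        StartTime=start,
        EndTime=end,
    )
    return [event for page in pages for event in page["Events"]]


def to_loki_payload(events, labels):
    """Convert CloudTrail events to the body of a Loki push request."""
    values = []
    for event in events:
        # Loki expects the timestamp as a string of nanoseconds since the epoch.
        timestamp_ns = int(event["EventTime"].timestamp()) * 1_000_000_000
        # "CloudTrailEvent" holds the full event as a JSON string.
        values.append([str(timestamp_ns), event["CloudTrailEvent"]])
    values.sort(key=lambda value: int(value[0]))
    return {"streams": [{"stream": labels, "values": values}]}


def push_to_loki(payload, url):
    """POST the payload to Loki; url is e.g. https://loki.example/loki/api/v1/push."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return response.status
```

An alert on BidEvictedEvent could then be a Loki ruler rule counting lines in that stream over a time window.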
Edited by Michael Hofmann