Stabilize build job EC2 instances

Since the end of November 2022, we have been seeing multiple connected problems:

kernel build jobs based on spot instances are nearly all terminated before they get anything done (c5d.4xlarge)
spot instance requests for kernel build jobs fail with no-capacity-available (c5ad.4xlarge, c6id.4xlarge)
launching non-spot instances for kernel build jobs fails with insufficient-instance-capacity (c5ad.4xlarge)

At higher cost, this has been stabilized for now by using non-spot c5d.4xlarge instances for build jobs.

Spot instance interruptions:

<cki_bot> 🤠 P529946989 J2406340516 build aarch64 failed: Detected System failure, not retrying: job restarted externally as J2406496758 https://l.cki-project.org/DyhJS59f

Spot instances not available:

<cki_bot> 👻 Terminated 40 old spot instance request(s)

Normal instances not available (journald):

gitlab-runner[903]: ERROR: Error creating machine: Error in driver during machine creation: Error launching instance: InsufficientInstanceCapacity: We currently do not have sufficient c5ad.4xlarge capacity in the Availability Zone you requested (us-east-1a). Our system will be working on provisioning additional capacity. You can currently get c5ad.4xlarge capacity by not specifying an Availability Zone in your request or choosing ...

In the job logs (https://gitlab.com/redhat/red-hat-ci-tools/kernel/cki-internal-pipelines/cki-trusted-contributors/-/jobs/2406340516), an interrupted instance looks like this:

Pulling docker image registry.gitlab.com/gitlab-org/gitlab-runner/gitlab-runner-helper:x86_64-c6bb62f6 ...
WARNING: Failed to pull image with policy "always": Cannot connect to the Docker daemon at tcp://10.29.162.188:2376. Is the docker daemon running? (manager.go:203:0s)
ERROR: Failed to cleanup volumes
ERROR: Job failed (system failure): Cannot connect to the Docker daemon at tcp://10.29.162.188:2376. Is the docker daemon running?

Issues to investigate

docker-machine only supports RequestSpotInstances (deprecated) and not CreateFleet (new and shiny); CreateFleet allows specifying multiple instance types and subnets and lets AWS figure out where to launch what to satisfy the need for cheap computing: gitlab-org/ci-cd/docker-machine#79
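the build jobs use machines with an SSD as fast scratch space; the SSD requirement seems to be the bottleneck in all of this, as it restricts the choice to instance types with local NVMe storage (c5d, c5ad, c6id)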
all build jobs run on the internal network segments, while the only thing the cki-trusted-contributors ones need is the internal Sentry (afaict); moving them to a dedicated external vpc with a lot of subnets in all AZs should increase the chance of getting a machine, esp. with CreateFleet:
- sentry.io: https://gitlab.cee.redhat.com/cki-project/deployment-all/-/merge_requests/2064
GitLab is working on a successor to docker-machine, but this might still take some time: gitlab-org/gitlab-runner#29219 (closed)
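
To illustrate why CreateFleet looks attractive here, the following is a minimal sketch of such a request built with boto3, not what docker-machine does today. The launch template ID, the subnet IDs, the `instant` fleet type and the `capacity-optimized` allocation strategy are all assumptions chosen for the example; `build_fleet_request` and `launch_fleet` are hypothetical helper names.

```python
def build_fleet_request(launch_template_id, instance_types, subnet_ids, spot=True):
    """Build the keyword arguments for ec2.create_fleet().

    Every (instance type, subnet) combination becomes one override, so AWS
    can pick whichever combination currently has capacity.
    """
    overrides = [
        {"InstanceType": instance_type, "SubnetId": subnet_id}
        for instance_type in instance_types
        for subnet_id in subnet_ids
    ]
    request = {
        # "instant" fleets launch synchronously and are not maintained afterwards
        "Type": "instant",
        "LaunchTemplateConfigs": [
            {
                "LaunchTemplateSpecification": {
                    "LaunchTemplateId": launch_template_id,  # assumed to exist
                    "Version": "$Latest",
                },
                "Overrides": overrides,
            }
        ],
        "TargetCapacitySpecification": {
            "TotalTargetCapacity": 1,
            "DefaultTargetCapacityType": "spot" if spot else "on-demand",
        },
    }
    if spot:
        # assumption: prefer pools with spare capacity over the lowest price
        request["SpotOptions"] = {"AllocationStrategy": "capacity-optimized"}
    return request


def launch_fleet(request, region="us-east-1"):
    """Send the request to EC2; needs boto3 and valid AWS credentials."""
    import boto3  # imported lazily so the request builder works without it

    return boto3.client("ec2", region_name=region).create_fleet(**request)
```

With the three instance types mentioned above and one subnet per AZ, a single request would cover all combinations instead of failing on one fixed type and AZ, e.g. `build_fleet_request("lt-0123456789abcdef0", ["c5d.4xlarge", "c5ad.4xlarge", "c6id.4xlarge"], ["subnet-aaa", "subnet-bbb"])` (placeholder IDs).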

Some potential TODOs

determine if we really need SSDs on those machines
set up a disabled non-spot runner for emergencies 🚑
hack up docker-machine to use CreateFleet instead of RequestSpotInstances: https://gitlab.com/cki-project/docker-machine/-/merge_requests/4
to improve metrics/logging, export CloudTrail events to Loki and set up alerts for BidEvictedEvent 📨
- CloudTrail can apparently be accessed via boto3: https://github.com/claick-oliveira/cloudtrail-export-logs
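
A rough sketch of what the CloudTrail export could look like, assuming that BidEvictedEvent records are visible through the CloudTrail `lookup_events` API and that Loki is fed through its HTTP push endpoint (`/loki/api/v1/push`). The function names and the label set are made up for illustration; this is not taken from the linked repository.

```python
def fetch_evictions(start, end, region="us-east-1"):
    """Return CloudTrail records for spot interruptions between start and end.

    Needs boto3 and AWS credentials; start and end are datetime objects.
    """
    import boto3  # imported lazily so the formatter below works without it

    client = boto3.client("cloudtrail", region_name=region)
    pages = client.get_paginator("lookup_events").paginate(
        LookupAttributes=[
            {"AttributeKey": "EventName", "AttributeValue": "BidEvictedEvent"}
        ],
        StartTime=start,
        EndTime=end,
    )
    return [event for page in pages for event in page["Events"]]


def to_loki_payload(events, labels=None):
    """Convert lookup_events records into the JSON body of a Loki push request."""
    # hypothetical labels, adjust to whatever the alert rules expect
    labels = labels or {"job": "cloudtrail", "event": "BidEvictedEvent"}
    values = []
    for event in events:
        # EventTime is a timezone-aware datetime; Loki wants nanoseconds as a
        # string (sub-second precision is dropped here)
        timestamp_ns = int(event["EventTime"].timestamp()) * 1_000_000_000
        # CloudTrailEvent is the raw event as a JSON string, used as the log line
        values.append([str(timestamp_ns), event["CloudTrailEvent"]])
    # push entries in chronological order
    values.sort(key=lambda value: int(value[0]))
    return {"streams": [{"stream": labels, "values": values}]}
```

The resulting dictionary can be sent as JSON to the Loki push endpoint; an alert rule could then count log lines with the `event="BidEvictedEvent"` label.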

Edited Dec 20, 2022 by Michael Hofmann
