# Add AWS ARM spot runners to supplement local CI

Issue #1 · mson / runners

## Context

CI currently runs on a local GitLab Runner (on my workstation for now, eventually moving to the home lab). Cloud runners would supplement this by picking up jobs when local is offline, busy, or unsuitable.

Pipeline usage over the past 2 weeks: 163 pipelines, 1,682 jobs, ~29 CPU-hours total. That projects to ~62 CPU-hours/month (29 × 30/14), with spiky days (May 2 was 5h; quiet days <1h). Cloud cost numbers below assume *all* jobs went to the cloud; actual spend will be lower in proportion to what local handles.

Goal: self-hosted cloud runners on cheap spot capacity, scaled to zero when idle, registered alongside local.

## Architecture

`gitlab-runner` with the [`docker-autoscaler`][docker-autoscaler] executor and the AWS plugin (`fleeting-plugin-aws`, installed via `gitlab-runner fleeting install`).

- **Manager**: always-on host running `gitlab-runner`. It long-polls GitLab for jobs; no webhook setup needed. All configuration lives in `config.toml`.
- **Workers**: ephemeral EC2 spot instances provisioned by the plugin when jobs queue. Jobs run in Docker on the worker; workers stay warm for `idle_time` after their last job, then terminate.
- **Scale-to-zero**: `idle_count = 0`, `idle_time = "20m"`. The first job after idle pays a ~45-90s cold start; subsequent jobs reuse warm workers.
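The effect of `idle_time` on billed hours can be estimated before deploying anything. The following is a rough sketch, not a model of the real autoscaler: it assumes a single worker, non-overlapping jobs sorted by start time, and zero boot time. The function name `simulate_single_worker` and the sample job list are made up for illustration.

```python
from datetime import timedelta


def simulate_single_worker(jobs, idle_time=timedelta(minutes=20)):
    """Estimate billed VM time and cold starts for one worker.

    jobs: list of (start, duration) timedelta pairs, sorted by start and
    non-overlapping. Boot time is ignored (simplifying assumption).
    """
    billed = timedelta()
    cold_starts = 0
    session_start = None  # when the current worker was launched
    warm_until = None     # when the current worker would terminate

    for start, duration in jobs:
        if warm_until is None or start > warm_until:
            # No warm worker: bill the previous session, launch a new one.
            if session_start is not None:
                billed += warm_until - session_start
            session_start = start
            cold_starts += 1
        # The worker stays warm for idle_time after this job ends.
        warm_until = start + duration + idle_time

    if session_start is not None:
        billed += warm_until - session_start
    return billed, cold_starts


def minutes(n):
    return timedelta(minutes=n)


# Hypothetical sample: two jobs close together, one after a long gap.
jobs = [(minutes(0), minutes(5)), (minutes(10), minutes(5)), (minutes(60), minutes(10))]
billed, cold_starts = simulate_single_worker(jobs)
busy = sum((duration for _, duration in jobs), timedelta())
print(f"billed={billed} busy={busy} cold_starts={cold_starts}")
```

In this sample, 20 minutes of work bills 65 minutes of VM time across two cold starts. Feeding in real job timestamps from the pipeline history would show whether 20 minutes is a sensible `idle_time`.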
Config sketch:

```toml
[[runners]]
  executor = "docker-autoscaler"

  [runners.autoscaler]
    plugin = "aws"
    capacity_per_instance = 1  # one job at a time per worker
    max_instances = 10         # upper bound on concurrent workers

    [runners.autoscaler.plugin_config]
      name = "my-asg"          # Auto Scaling Group the plugin scales
      region = "us-east-1"

    [[runners.autoscaler.policy]]
      idle_count = 0           # scale to zero when idle
      idle_time = "20m"        # keep a worker warm this long after its last job
```

## Cost projection (AWS ARM spot)

Based on usage above and typical us-east-1 spot prices (verify with `aws ec2 describe-spot-price-history`):

| Instance | vCPU / RAM | ~Spot $/hr | Cost per 100 VM-hr/mo |
|---|---|---|---|
| `t4g.medium` | 2 / 4GB | ~$0.012 | $1.20 |
| `t4g.large` | 2 / 8GB | ~$0.024 | $2.40 |
| `c7g.large` (sustained) | 2 / 4GB | ~$0.029 | $2.90 |

100 VM-hr/mo assumes ~30-50% idle overhead on top of 62 CPU-hours of actual work (62 × 1.3-1.5 ≈ 81-93, rounded up).

Mixed-instance fleet: default `t4g.medium` for short jobs (linters, codespell, govulncheck; most of the 1,682 jobs), tagged `c7g.large` for `mutation-test` and `integration-test`.

Realistic landed cost: **$2-5/month** for compute.

## Egress

AWS provides 100GB/month free egress across all services. Estimated runner→GitLab egress is 5-15GB/month (cache uploads, artifacts). That is comfortably under the free tier; beyond it, egress costs $0.09/GB. Container image pulls and cache downloads are ingress, which is free.

## Container image caching

Three complementary layers, ordered by where they reduce cost:

1. **Bake into the worker image.** `docker pull` each hot image during AMI build (Packer or similar). Zero cold-start pull cost for the bundled set. Travels across clouds: the same Packer provisioning steps can target Hetzner/Oracle snapshots.
2. **Pull-through cache for the rest.** AWS has a [native ECR pull-through cache for Docker Hub][ecr-ptc]. On a VPS, run a `registry:2` instance in mirror mode (~$3/mo on Hetzner). Solves Docker Hub rate limits and outages.
3. **Pin digests on slow-moving images.** `image: postgres:18@sha256:...` for service images. Renovate manages updates. `pull_policy: if-not-present` becomes correct-by-construction.
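The first layer needs a concrete list of images to bake. The following is a minimal sketch of collecting them, assuming the pipeline YAML has already been parsed into plain Python dicts and lists (for example with a YAML library). `collect_images`, the sample pipeline, and the allowlist entry are hypothetical, and references that contain CI variables are returned unresolved.

```python
def image_ref(value):
    """Return the image name from a string or a {"name": ...} mapping."""
    if isinstance(value, str):
        return value
    if isinstance(value, dict) and isinstance(value.get("name"), str):
        return value["name"]
    return None


def collect_images(node, manual_allowlist=()):
    """Walk a parsed pipeline and return a sorted, deduplicated image list."""
    found = set(manual_allowlist)

    def walk(n):
        if isinstance(n, dict):
            for key, value in n.items():
                if key == "image":
                    candidates = [value]
                elif key == "services" and isinstance(value, list):
                    candidates = value
                else:
                    walk(value)
                    continue
                for candidate in candidates:
                    ref = image_ref(candidate)
                    if ref:
                        found.add(ref)
        elif isinstance(n, list):
            for item in n:
                walk(item)

    walk(node)
    return sorted(found)


# Hypothetical parsed pipeline; image names are illustrative only.
pipeline = {
    "default": {"image": "golang:1.26"},
    "lint": {"script": ["codespell"], "image": {"name": "example/codespell:latest"}},
    "integration-test": {
        "image": "golang:1.26",
        "services": ["postgres:18", {"name": "redis:7", "alias": "cache"}],
    },
}
images = collect_images(pipeline, manual_allowlist=["alpine:3"])
print("\n".join(images))
```

The printed list, one image per line, is a convenient input for a loop of `docker pull` commands in the AMI build. Files pulled in through `include:` would need to be parsed and passed through the same function.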
Leave tool tags unpinned (`golang:1.26`) with `pull_policy: always` so CVE patches flow. `always` is cheap on warm workers: the pull only checks the manifest and downloads nothing when the digest is unchanged.

### Image discovery for AMI bake

Automate the list:

- Scan `.gitlab-ci.yml` and any `include:` files for `image:` and `services:` references.
- Combine with a manual allowlist for images used by *generated* pipelines (matrix child pipelines, ad-hoc jobs) that don't appear in source files.
- Deduplicate, then feed the list to Packer's `docker pull` step at AMI build time.

Rebake periodically (weekly?) to refresh CVE patches in baked images.

## Cross-compilation vs x86 runners

ARM workers can build linux/amd64 binaries via `GOOS=linux GOARCH=amd64 go build`. Go cross-compiles cleanly with no toolchain installation for pure-Go binaries. A dedicated x86 runner just to compile binaries is overkill.

Recommendation: cross-compile from ARM by default; add x86 capacity only if CGO or arch-specific testing becomes a requirement.

## Multi-cloud (future)

Single cloud (AWS) for the initial implementation to keep complexity down. Worth revisiting later:

- **Oracle Cloud Free Tier**: 4 ARM cores + 24GB RAM free forever. Viable as manager host or always-warm runner. Instances are reportedly hard to provision.
- **Hetzner Cloud**: ~€4/mo CX22 manager, cheap workers, no spot market. Community Fleeting plugin (not first-party).
- **GCP preemptible**: similar tradeoffs to AWS spot.

A multi-cloud fleet would distribute spot-interruption risk and avoid lock-in.

## Open questions

- Manager host: home lab, Oracle free tier, or small AWS instance. The lab is always-on, so availability isn't a factor; the tradeoff is between zero cost (lab/Oracle) and co-location with workers (AWS).
- Region: us-east-1 typically has among the cheapest spot prices; check egress patterns to GitLab.com.
- Container registry: ECR (same-region, free pulls) vs GitLab Container Registry (cross-cloud egress).
- Routing: do cloud runners pick up any queued job (overflow model), or only jobs tagged for cloud (explicit opt-in)? Overflow is simpler but means a flaky cloud worker can block any pipeline. Explicit tags keep local as primary and let you choose what runs in the cloud.
- Cloud credentials for the manager: the target deployment is Talos (k8s), so v1 is static AWS access keys in a k8s Secret. Longer term, a local Vault instance with the AWS secrets engine can issue short-lived STS credentials and provide an audit trail.

## Implementation steps

1. AWS IAM role for Fleeting (least-privilege EC2 + spot permissions).
2. Bake AMI: install Docker, then run `docker pull` for each image discovered by scanning `.gitlab-ci.yml` plus the manual allowlist (see "Container image caching").
3. Stand up the manager (Oracle free tier or local to start).
4. Configure the Fleeting AWS plugin and autoscaler policy: `idle_count = 0`, `idle_time = "20m"`, mixed instance types.
5. Register the runner, tag mutation/integration jobs for `c7g.large`, default `t4g.medium`.
6. Cut over `.gitlab-ci.yml` to use self-hosted runner tags.
7. Monitor for a week, then adjust `idle_time` based on the actual idle ratio.

[docker-autoscaler]: https://docs.gitlab.com/runner/executors/docker_autoscaler/
[ecr-ptc]: https://docs.aws.amazon.com/AmazonECR/latest/userguide/pull-through-cache.html

*Generated with Claude Code*
