Add AWS ARM spot runners to supplement local CI
## Context
CI runs on a local GitLab Runner (currently on my workstation, eventually moving to the home lab).
Cloud runners would supplement this — picking up jobs when local is offline, busy, or unsuitable.
Pipeline usage over the past 2 weeks: 163 pipelines, 1,682 jobs, ~29 CPU-hours total.
That projects to ~62 CPU-hours/month (29 × 30/14), with spiky days (May 2 was 5h; quiet days <1h).
Cloud cost numbers below assume *all* jobs went to the cloud — actual spend will be lower in proportion to what local handles.
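The monthly figure is a straight-line extrapolation from the two-week sample. A minimal sketch of that arithmetic, assuming a 30-day month (the function name is illustrative):

```python
def project_monthly(cpu_hours: float, sample_days: int, month_days: int = 30) -> float:
    """Linearly scale usage observed over sample_days to a month of month_days."""
    return cpu_hours * month_days / sample_days


# Figures from the two-week sample above.
monthly = project_monthly(cpu_hours=29, sample_days=14)
print(f"~{monthly:.0f} CPU-hours/month")
```

A linear projection hides the spiky days, so it is better read as an average than as a capacity target.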
Goal: self-hosted cloud runners on cheap spot capacity, scaled to zero when idle, registered alongside local.
## Architecture
`gitlab-runner` with the [`docker-autoscaler`][docker-autoscaler] executor and the AWS plugin (`fleeting-plugin-aws`, installed via `gitlab-runner fleeting install`).
- **Manager**: always-on host running `gitlab-runner`.
Polls GitLab via long-polling; no webhook setup needed.
All configuration in `config.toml`.
- **Workers**: ephemeral EC2 spot instances provisioned by the plugin when jobs queue.
Jobs run in Docker on the worker; workers stay warm for `idle_time` after their last job, then terminate.
- **Scale-to-zero**: `idle_count = 0`, `idle_time = "20m"`.
First job after idle pays ~45-90s cold start; subsequent jobs reuse warm workers.
Config sketch:
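```toml
[[runners]]
  executor = "docker-autoscaler"

  [runners.autoscaler]
    plugin = "aws"              # installed via `gitlab-runner fleeting install`
    capacity_per_instance = 1   # one job at a time per worker
    max_instances = 10          # upper bound on concurrent workers

    [runners.autoscaler.plugin_config]
      name = "my-asg"           # Auto Scaling Group the plugin scales
      region = "us-east-1"

    [[runners.autoscaler.policy]]
      idle_count = 0            # scale to zero when idle
      idle_time = "20m"         # keep a worker warm this long after its last job
```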
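How much `idle_time` costs depends on how jobs cluster. A rough way to estimate it from job timestamps is the single-worker model below. It is a simplification, not the autoscaler's actual scheduling logic: it assumes one worker, non-overlapping jobs, and ignores cold-start time. The job timings are made up for illustration.

```python
def billed_minutes(jobs, idle_time=20):
    """Estimate (billed, busy) minutes for a single worker.

    jobs: iterable of (start_minute, duration_minutes).
    The worker stays up for idle_time minutes after each job; a job that
    arrives later than that starts a fresh worker session.
    """
    billed = 0
    busy = 0
    session_start = None
    session_end = None  # when the worker terminates if no further job arrives

    for start, duration in sorted(jobs):
        end = start + duration
        busy += duration
        if session_start is None:
            # First job: the worker boots here.
            session_start, session_end = start, end + idle_time
        elif start <= session_end:
            # Worker is still warm: reuse it and push out the idle deadline.
            session_end = max(session_end, end + idle_time)
        else:
            # Worker already terminated: close the old session, start a new one.
            billed += session_end - session_start
            session_start, session_end = start, end + idle_time

    if session_start is not None:
        billed += session_end - session_start
    return billed, busy


# Hypothetical day: two short jobs close together, one later job.
billed, busy = billed_minutes([(0, 5), (10, 5), (60, 10)])
print(f"billed={billed} min, busy={busy} min, idle ratio={(billed - busy) / billed:.0%}")
```

Feeding in real job start times and durations would give a measured idle ratio to tune `idle_time` against.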
## Cost projection (AWS ARM spot)
Based on usage above and typical us-east-1 spot prices (verify with `aws ec2 describe-spot-price-history`):
| Instance | vCPU/RAM | ~Spot $/hr | Per 100 VM-hr/mo |
|---|---|---|---|
| `t4g.medium` | 2 / 4GB | ~$0.012 | $1.20 |
| `t4g.large` | 2 / 8GB | ~$0.024 | $2.40 |
| `c7g.large` (sustained) | 2 / 4GB | ~$0.029 | $2.90 |
100 VM-hr/mo assumes ~30-50% idle overhead on top of the 62 CPU-hours of actual work (62 × 1.3-1.5 ≈ 81-93), rounded up.
Mixed-instance fleet: default `t4g.medium` for short jobs (linters, codespell, govulncheck — most of the 1,682 jobs), tagged `c7g.large` for `mutation-test` and `integration-test`.
Realistic landed cost: **$2-5/month** for compute.
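The table arithmetic can be reproduced with a short sketch. The prices are the approximate figures from the table, and the 80/20 split in the mixed-fleet example is a hypothetical ratio, not a measured one.

```python
# Approximate us-east-1 spot prices ($/hr) from the table above.
SPOT_PRICE_PER_HOUR = {
    "t4g.medium": 0.012,
    "t4g.large": 0.024,
    "c7g.large": 0.029,
}


def vm_hours(work_hours: float, idle_overhead: float) -> float:
    """Billed VM-hours if idle time adds idle_overhead (0.3 = 30%) on top of work."""
    return work_hours * (1 + idle_overhead)


def monthly_cost(hours_by_instance: dict) -> float:
    """Sum price × hours across instance types."""
    return sum(SPOT_PRICE_PER_HOUR[t] * h for t, h in hours_by_instance.items())


low, high = vm_hours(62, 0.30), vm_hours(62, 0.50)
print(f"VM-hours/month: {low:.0f}-{high:.0f}")

for instance in SPOT_PRICE_PER_HOUR:
    print(f"{instance}: ${monthly_cost({instance: 100}):.2f} per 100 VM-hr")

# Hypothetical split for a mixed fleet; the real ratio depends on job tagging.
mixed = {"t4g.medium": 80, "c7g.large": 20}
print(f"mixed fleet: ${monthly_cost(mixed):.2f}")
```

Swapping in current prices from `aws ec2 describe-spot-price-history` keeps the estimate honest as spot rates drift.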
## Egress
AWS provides 100GB/month free egress across all services.
Estimated runner→GitLab egress is 5-15GB/month (cache uploads, artifacts).
Comfortably under free tier; even if exceeded, $0.09/GB.
Container image pulls and cache downloads are ingress, which is free.
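As a quick check on the worst case, a sketch of the egress charge, assuming the flat $0.09/GB rate quoted above and that nothing else in the account uses the free allowance:

```python
FREE_EGRESS_GB = 100        # monthly free allowance quoted above
EGRESS_PRICE_PER_GB = 0.09  # flat rate assumed; real pricing may be tiered


def egress_cost(gb_out: float) -> float:
    """Monthly egress charge: only traffic beyond the free allowance is billed."""
    return max(0.0, gb_out - FREE_EGRESS_GB) * EGRESS_PRICE_PER_GB


for gb in (5, 15, 150):
    print(f"{gb} GB out -> ${egress_cost(gb):.2f}")
```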
## Container image caching
Three complementary layers, ordered by where they reduce cost:
1. **Bake into the worker image.**
`docker pull` each hot image during AMI build (Packer or similar).
Zero cold-start pull cost for the bundled set.
Travels across clouds — same Packer config targets Hetzner/Oracle snapshots.
2. **Pull-through cache for the rest.**
AWS has a [native ECR pull-through cache for Docker Hub][ecr-ptc].
On a VPS, run a `registry:2` instance in mirror mode (~$3/mo on Hetzner).
Solves Docker Hub rate limits and outages.
3. **Pin digests on slow-moving images.**
`image: postgres:18@sha256:...` for service images.
Renovate manages updates.
`pull_policy: if-not-present` becomes correct-by-construction.
Leave tool tags unpinned (`golang:1.26`) with `pull_policy: always` so CVE patches flow.
`always` is cheap on warm workers: the pull only re-checks the manifest digest with the registry and downloads nothing if the image is unchanged.
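The pinning rule in layer 3 is easy to lint for. A minimal sketch, assuming image references are available as plain strings; the helper names are illustrative and the digest below is a placeholder, not a real image digest:

```python
import re

# A digest-pinned reference ends in "@sha256:" followed by 64 hex characters.
_DIGEST = re.compile(r"@sha256:[0-9a-f]{64}$")


def is_digest_pinned(image_ref: str) -> bool:
    """True if the image reference is pinned to a sha256 digest."""
    return bool(_DIGEST.search(image_ref))


def recommended_pull_policy(image_ref: str) -> str:
    """Pinned images can never change, so a local copy is always valid."""
    return "if-not-present" if is_digest_pinned(image_ref) else "always"


placeholder_digest = "0" * 64  # placeholder only
for ref in (f"postgres:18@sha256:{placeholder_digest}", "golang:1.26"):
    print(ref[:40], "->", recommended_pull_policy(ref))
```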
### Image discovery for AMI bake
Automate the list:
- Scan `.gitlab-ci.yml` and any `include:` files for `image:` and `services:` references.
- Combine with a manual allowlist for images used by *generated* pipelines (matrix child pipelines, ad-hoc jobs) that don't appear in source files.
- Deduplicate, feed to Packer's `docker pull` step at AMI build time.
Rebake periodically (weekly?) to refresh CVE patches in baked images.
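The scan-and-merge steps above can be sketched as a function over an already-parsed pipeline. This assumes the YAML has been loaded into a plain dict by a parser of your choice and that `include:` files are processed the same way; the function names and sample pipeline are illustrative. Image names containing CI variables (for example `$CI_REGISTRY_IMAGE/...`) are returned as-is and would need resolving before the bake.

```python
def _image_name(entry):
    """An image or service is either a plain string or a mapping with a 'name' key."""
    if isinstance(entry, str):
        return entry
    if isinstance(entry, dict):
        return entry.get("name")
    return None


def collect_images(pipeline, allowlist=()):
    """Return the sorted, deduplicated images referenced by a parsed pipeline."""
    images = set(allowlist)
    # Check the top level (global image/services) and every mapping under it
    # (jobs, `default:`, hidden templates).
    sections = [pipeline] + [v for v in pipeline.values() if isinstance(v, dict)]
    for section in sections:
        name = _image_name(section.get("image"))
        if name:
            images.add(name)
        services = section.get("services")
        for service in services if isinstance(services, list) else []:
            name = _image_name(service)
            if name:
                images.add(name)
    return sorted(images)


# Illustrative pipeline, as it would look after YAML parsing.
pipeline = {
    "stages": ["test"],
    "default": {"image": "golang:1.26"},
    "lint": {"image": {"name": "alpine:3"}, "script": ["true"]},
    "integration-test": {"services": ["postgres:18", {"name": "redis:7"}]},
}
for image in collect_images(pipeline, allowlist=["busybox:1"]):
    print(image)
```

The resulting list can be written to a file that the AMI build reads and loops over with `docker pull`.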
## Cross-compilation vs x86 runners
ARM workers can build linux/amd64 binaries via `GOOS=linux GOARCH=amd64 go build`.
Go cross-compiles cleanly with no toolchain installation for pure-Go binaries.
A dedicated x86 runner just to compile binaries is overkill.
Recommendation: cross-compile from ARM by default; add x86 capacity only if CGO or arch-specific testing becomes a requirement.
## Multi-cloud (future)
Single cloud (AWS) for initial implementation to keep complexity down.
Worth revisiting later:
- **Oracle Cloud Free Tier**: 4 ARM cores + 24GB RAM free forever.
Viable as manager host or always-warm runner.
Boxes are reportedly hard to provision.
- **Hetzner Cloud**: ~€4/mo CX22 manager, cheap workers, no spot market.
Community Fleeting plugin (not first-party).
- **GCP Spot VMs** (formerly preemptible): similar tradeoffs to AWS spot.
A multi-cloud fleet would distribute spot-interruption risk and avoid lock-in.
## Open questions
- Manager host: home lab, Oracle free tier, or small AWS instance.
Lab is always-on so availability isn't a factor; tradeoff is between zero-cost (lab/Oracle) and co-location with workers (AWS).
- Region: us-east-1 has cheapest spot; check egress patterns to GitLab.com.
- Container registry: ECR (same-region, free pulls) vs GitLab Container Registry (cross-cloud egress).
- Routing: do cloud runners pick up any queued job (overflow model), or only jobs tagged for cloud (explicit opt-in)?
Overflow is simpler but means a flaky cloud worker can block any pipeline.
Explicit tags keep local as primary and let you choose what runs in the cloud.
- Cloud credentials for the manager: target deployment is Talos (k8s), so v1 is static AWS access keys in a k8s Secret.
Longer term, a local Vault instance with the AWS secrets engine can issue short-lived STS credentials and provide an audit trail.
## Implementation steps
1. AWS IAM role for Fleeting (least-privilege EC2 + spot permissions).
2. Bake AMI: install docker, run `docker pull` for each image discovered by scanning `.gitlab-ci.yml` plus manual allowlist (see "Container image caching").
3. Stand up manager (Oracle free tier or local to start).
4. Configure the Fleeting AWS plugin and autoscaler policy: `idle_count = 0`, `idle_time = "20m"`, mixed instance types.
5. Register runner, tag mutation/integration jobs for `c7g.large`, default `t4g.medium`.
6. Cut over `.gitlab-ci.yml` to use self-hosted runner tags.
7. Monitor for a week, adjust `idle_time` based on actual idle ratio.
[docker-autoscaler]: https://docs.gitlab.com/runner/executors/docker_autoscaler/
[ecr-ptc]: https://docs.aws.amazon.com/AmazonECR/latest/userguide/pull-through-cache.html
*Generated with Claude Code*