Recommendations for shipping AWS account metrics to Mimir
<!--
This template is for GitLab Team Members seeking support from the [Production Engineering::Observability team](https://handbook.gitlab.com/handbook/engineering/infrastructure-platforms/production-engineering/observability/).
Please first look at our [Documentation Hub](https://gitlab-com.gitlab.io/gl-infra/observability/docs-hub/) to see if your question is answered there. If it isn't, please fill out the details below.
-->

## General Information:

- Point of contact for this request: @joe-shaw
- Related issue for context (if applicable): https://gitlab.com/gitlab-org/ci-cd/shared-runners/infrastructure/-/issues/46+

## Details

In the runners platform team we use AWS to host our [macOS machines](https://docs.gitlab.com/ci/runners/hosted_runners/macos/), as it is the only reliable cloud option at the moment. We currently have 3 AWS accounts: one for staging/testing, one for `saas-macos-m1-medium` hosts and one for `saas-macos-m2pro-large` hosts.

The runners that control these resources live in GCP, where most of our runners reside. This means we already get some autoscaling metrics through the runner, and we have plenty of dashboards to monitor them. However, we have no visibility into the state of the AWS accounts themselves without logging into the AWS console and looking at individual resources (e.g. Autoscaling groups).

Ideally we would be able to ship metrics from the AWS accounts into our Mimir instance so that we can create dashboards and alerts for these signals. We have also considered having the runner itself act as a probe instead of doing cloud-to-cloud data collection; this might turn out to be a good option.

Here's a particular use case: we maintain a fleet of dedicated macOS hardware (bare metal instances). It would be very useful to see how many of these are in use and how many are free, and to be able to predict capacity from this.

**NB** we're not necessarily asking for a full implementation here.
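
To make the use case concrete, here is a minimal sketch of the probe idea. It assumes we would list the dedicated hardware through the EC2 `describe_hosts` API via boto3. The metric name `aws_dedicated_hosts`, the `account` label, the three status buckets and all helper names are hypothetical, chosen only for illustration. How the output would reach Mimir (scraping, remote write, or something else) is deliberately left open.

```python
from collections import Counter


def summarize_hosts(hosts):
    """Count Dedicated Hosts per (instance_type, status).

    `hosts` is a list shaped like the `Hosts` field of an EC2
    `describe_hosts` response. The status buckets are our own
    (hypothetical) simplification, not AWS terminology.
    """
    counts = Counter()
    for host in hosts:
        instance_type = host.get("HostProperties", {}).get("InstanceType", "unknown")
        if host.get("State") != "available":
            status = "unavailable"  # pending, released, failed, ...
        elif host.get("Instances"):
            status = "in_use"       # at least one instance is placed on the host
        else:
            status = "free"
        counts[(instance_type, status)] += 1
    return counts


def render_metrics(counts, account):
    """Render the counts in the Prometheus text exposition format."""
    lines = ["# TYPE aws_dedicated_hosts gauge"]
    for (instance_type, status), n in sorted(counts.items()):
        labels = (
            f'account="{account}",'
            f'instance_type="{instance_type}",'
            f'status="{status}"'
        )
        lines.append(f"aws_dedicated_hosts{{{labels}}} {n}")
    return "\n".join(lines) + "\n"


def fetch_hosts(region):
    """Fetch all Dedicated Hosts in one region (needs AWS credentials)."""
    import boto3  # imported lazily so the helpers above work without it

    client = boto3.client("ec2", region_name=region)
    hosts = []
    for page in client.get_paginator("describe_hosts").paginate():
        hosts.extend(page["Hosts"])
    return hosts
```

Something like `render_metrics(summarize_hosts(fetch_hosts(region)), account)` could then run on the runner manager or inside each AWS account, depending on which collection model is recommended.
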
With enough guidance the runners platform team can make the changes required to make this work. I suspect this will require site-to-site VPN connections (we do this for runners already), but this might not be so easy to do with the Mimir infrastructure.

## Priority

Please check one and assign the appropriate label:

- [ ] Very urgent, blocking significant other work: ~"Production Engineering::P1"
- [ ] A blocker, but we have workarounds: ~"Production Engineering::P2"
- [ ] Not currently a blocker but will be soon: ~"Production Engineering::P3"
- [x] Not likely to be a blocker, this is a nice-to-have improvement or suggestion: ~"Production Engineering::P4"
- [ ] Unsure

<!-- please do not edit the below -->
issue