Investigate collecting Opstrace metrics with the .com Thanos deployment
Summary
In the Observability Team Sync we discussed the possibility of improving our monitoring of Opstrace infrastructure by collecting metrics using our existing Thanos deployment. This would allow us to create a proper service dashboard for Opstrace and to configure SLOs and alerts. While there is a longer-term plan to monitor Opstrace with Opstrace, for error tracking readiness it will be essential that we can monitor the service before it leaves Beta.
Thanos components
See Thanos components and GitLab.com monitoring docs for how Thanos works and what we have deployed.
Opstrace Environments
- Ensure that we do not have subnet overlap with the Opstrace subnet. We will need to pick an available range from https://gitlab.com/gitlab-com/runbooks/-/blob/master/docs/uncategorized/subnet-allocations.md
- A new bucket named gitlab-<env>-prometheus
- In the Opstrace cluster, thanos-sidecar. Colocated with each Prometheus instance, it uploads metrics from the TSDB disk to object storage buckets and answers queries from thanos-query, including external labels on metrics so that they can be attributed to an environment / shard.
- In the Opstrace cluster, thanos-store. Provides a gateway to the metrics buckets populated by thanos-sidecar.
- In the same account (environment), one deployment of thanos-compact. This is a background component that builds downsampled metrics and applies retention lifecycle rules.
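The subnet overlap check from the first item above could be scripted rather than done by eye. The following is a minimal sketch using Python's standard `ipaddress` module; the CIDR ranges and the `find_overlaps` helper are hypothetical placeholders, not values taken from the subnet allocations runbook.

```python
import ipaddress


def find_overlaps(candidate, allocated):
    """Return the allocated CIDRs that overlap with the candidate CIDR."""
    cand = ipaddress.ip_network(candidate)
    return [
        cidr for cidr in allocated
        if cand.overlaps(ipaddress.ip_network(cidr))
    ]


# Hypothetical allocations; the real list lives in the runbook linked above.
allocated = ["10.0.0.0/16", "10.1.0.0/16", "172.16.0.0/20"]

# A candidate range that collides with an existing allocation.
print(find_overlaps("10.1.128.0/17", allocated))

# A candidate range that is free under the assumed allocations.
print(find_overlaps("10.2.0.0/16", allocated))
```

An empty result means the candidate range does not collide with any of the ranges it was checked against, so it would be safe to use for peering as far as this list is concerned.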
GitLab.com Ops Environment
Thanos Query runs in the gitlab-ops environment. It will need to query recent metrics from all Prometheus instances (via thanos-sidecar) and longer-term metrics from thanos-store.
- Peer the ops environment with the Opstrace environment, with firewall rules that limit access to monitoring infrastructure
- Configure thanos-query to fetch metrics from thanos-sidecar and thanos-store by updating the list of stores in tanka-deployments
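To illustrate what updating the list of stores amounts to, here is a rough sketch that builds the store arguments for thanos-query. It assumes the repeated `--store=<host>:<port>` flag form (newer Thanos releases prefer `--endpoint`) and the default Thanos gRPC port 10901. The hostnames and the `store_flags` helper are hypothetical, and the actual change would be made in the tanka-deployments configuration rather than in Python.

```python
# Assumption: sidecar and store gateway expose the Thanos gRPC API on 10901.
DEFAULT_GRPC_PORT = 10901


def store_flags(hosts, port=DEFAULT_GRPC_PORT):
    """Build one --store flag per Store API endpoint."""
    return [f"--store={host}:{port}" for host in hosts]


# Hypothetical endpoints reachable over the peered network.
opstrace_stores = [
    "thanos-sidecar.opstrace.example.internal",  # recent metrics
    "thanos-store.opstrace.example.internal",    # longer-term metrics from the bucket
]

flags = store_flags(opstrace_stores)
print("\n".join(flags))
```

Both endpoints are listed because thanos-query reads recent data from the sidecar and older data from the store gateway.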
Edited by John Jarvis