Investigate collecting Opstrace metrics with the .com Thanos deployment

Summary

In the Observability Team Sync we discussed the possibility of improving our monitoring of Opstrace infrastructure by collecting metrics using our existing Thanos deployment. This would allow us to create a proper service dashboard for Opstrace and to configure SLOs and alerts. While there is a longer-term plan to monitor Opstrace with Opstrace, for error tracking readiness it will be essential that we can monitor the service before it leaves Beta.

Thanos components

See Thanos components and GitLab.com monitoring docs for how Thanos works and what we have deployed.

Opstrace Environments

Ensure that the Opstrace subnet does not overlap with any existing subnet. We will need to pick an available range from https://gitlab.com/gitlab-com/runbooks/-/blob/master/docs/uncategorized/subnet-allocations.md
A new bucket named gitlab-<env>-prometheus
In the Opstrace cluster, thanos-sidecar. Colocated with each Prometheus instance, it uploads metrics from the TSDB disk to object storage buckets and answers queries from thanos-query, including external labels on metrics so that they can be attributed to an environment / shard.
In the Opstrace cluster, thanos-store. Provides a gateway to the metrics buckets populated by thanos-sidecar
In the same account (environment), one deployment of thanos-compact. This is a background component that builds downsampled metrics and applies retention lifecycle rules.
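
The subnet check in the first item above can be scripted before a range is requested. The following is a minimal sketch using Python's standard ipaddress module. The environment names and CIDR ranges are hypothetical placeholders, not values from subnet-allocations.md, so the real allocations would need to be copied in by hand.

```python
import ipaddress


def find_overlaps(candidate, allocated):
    """Return the names of allocated ranges that overlap the candidate CIDR."""
    cand = ipaddress.ip_network(candidate)
    return [
        name
        for name, cidr in allocated.items()
        if cand.overlaps(ipaddress.ip_network(cidr))
    ]


# Hypothetical allocations for illustration only; the real list lives in
# the runbooks subnet-allocations.md document.
allocated = {
    "example-env-a": "10.0.0.0/16",
    "example-env-b": "10.1.0.0/16",
}

if __name__ == "__main__":
    for candidate in ("10.1.128.0/17", "10.2.0.0/16"):
        clashes = find_overlaps(candidate, allocated)
        status = "overlaps " + ", ".join(clashes) if clashes else "is free"
        print(f"{candidate} {status}")
```

An empty result from find_overlaps means the candidate range does not collide with anything in the supplied table. It says nothing about ranges that are missing from the table.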

GitLab.com Ops Environment

Thanos Query runs in the gitlab-ops environment. It will need to query recent metrics from all Prometheus instances (via thanos-sidecar) and longer-term metrics from thanos-store.

Peer the ops environment with the Opstrace environment, with firewall rules that limit access to monitoring infrastructure
Configure thanos-query to fetch metrics from thanos-sidecar and thanos-store by updating the list of stores in tanka-deployments
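
To illustrate what updating the list of stores amounts to, here is a rough sketch that turns a list of endpoints into thanos-query store arguments. It is not the tanka-deployments configuration itself, which is maintained separately. The hostnames are hypothetical, port 10901 is assumed to be the gRPC port in use, and the --store flag is assumed to apply to the deployed Thanos version (newer releases may prefer --endpoint).

```python
def store_flags(endpoints):
    """Build de-duplicated --store arguments from host:port endpoints."""
    seen = set()
    flags = []
    for endpoint in endpoints:
        host, sep, port = endpoint.rpartition(":")
        # Reject entries that are not of the form host:port.
        if not sep or not host or not port.isdigit():
            raise ValueError(f"expected host:port, got {endpoint!r}")
        if endpoint not in seen:
            seen.add(endpoint)
            flags.append(f"--store={endpoint}")
    return flags


# Hypothetical endpoints for illustration only: one thanos-sidecar and one
# thanos-store in the Opstrace environment, reachable over the peering.
opstrace_endpoints = [
    "thanos-sidecar.opstrace.example.internal:10901",
    "thanos-store.opstrace.example.internal:10901",
]

if __name__ == "__main__":
    print(" ".join(store_flags(opstrace_endpoints)))
```

Both endpoint types are listed because the sidecar serves recent data held by Prometheus while the store gateway serves older blocks from the bucket.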

Edited Aug 15, 2022 by John Jarvis