# Migrate monitoring to dedicated Prometheus servers on GKE
As described in https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/13710#note_637454201, while working on the `private` shard we decided to peer networks and re-use the existing Prometheus servers for scraping the new runner managers. This solution is, however, not ideal, and no one (the Runner, Infra, and Security groups) likes it. It was chosen as a temporary solution to make our work on testing the private infrastructure possible.
The best solution would be to have a dedicated Prometheus server installed in every project where we need one, connected to our Thanos cluster. An additional benefit is that with this in place we can deploy an almost identical setup to all `gitlab-ci-plan-free-X` projects, prepare new `gce_sd_configs` to autodiscover ephemeral VMs, and get the monitoring of job execution environments back and complete.
## Design
*The following design is still a work-in-progress*
Monitoring will be defined with almost the same configuration in all CI-related projects. It will be deployed using GKE.

GKE will use a group of reserved CIDRs to avoid conflicts with other peered networks where the monitored resources will be hosted. It will be configured to use a network with the hardcoded name `gke` with a hardcoded `gke` subnetwork. For that purpose the following CIDRs were chosen:
| CIDR | Range type | Purpose |
|---|---|---|
| 10.9.4.0/24 | Primary | GKE nodes range |
| 10.8.0.0/16 | Secondary `gke-pods` | GKE pods range |
| 10.9.0.0/22 | Secondary `gke-services` | GKE services range |
Prometheus will be deployed in at least two replicas, each with a Thanos Sidecar running alongside. The Sidecar's gRPC endpoint will be exposed publicly (we don't want to peer the `ops` or `gprd` network here), and a GCP Firewall rule will limit access to it to the Thanos Query public IPs only. The gRPC communication should be wrapped in TLS.
We will also have a GCS bucket for long-term storage created per CI project. Thanos Sidecar will be configured with access to write to this bucket.
Apart from the Sidecar, we will also have Thanos Store Gateway and Thanos Compact deployed and configured to use the same GCS bucket. The Store Gateway's gRPC endpoint will be exposed in the same way as the Sidecar's.
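Since the TODO below mentions deploying the Prometheus Operator, the replica and Sidecar setup could be sketched roughly as follows. This is only an illustration: the resource names, namespace, Thanos version, and bucket name are assumptions, not the actual deployed values.

```yaml
# Sketch of a Prometheus Operator resource: two replicas, each running a
# Thanos Sidecar that ships blocks to a per-project GCS bucket.
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: ci-monitoring          # illustrative name
  namespace: monitoring
spec:
  replicas: 2
  thanos:
    version: v0.22.0           # illustrative version
    objectStorageConfig:       # points at the secret defined below
      name: thanos-objstore
      key: objstore.yml
---
# Secret holding the Thanos object storage configuration (objstore format)
apiVersion: v1
kind: Secret
metadata:
  name: thanos-objstore
  namespace: monitoring
stringData:
  objstore.yml: |
    type: GCS
    config:
      bucket: ci-project-metrics   # per-CI-project bucket (illustrative)
```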
Traefik will be used as the ingress and load-balancing mechanism. It will expose the gRPC services on given ports (using TCP routing), as well as the Prometheus UI and Traefik's own dashboard. HTTP endpoints will be automatically redirected to HTTPS, and Let's Encrypt certificates will be used for TLS.
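The TCP routing for one of the gRPC endpoints could look roughly like the sketch below, using Traefik's `IngressRouteTCP` CRD. The entry point, service, and namespace names here are assumptions for illustration; the entry point itself would have to be declared in Traefik's static configuration.

```yaml
# Sketch: expose the Thanos Sidecar gRPC endpoint via Traefik TCP routing.
# "sidecar-grpc" is an assumed entry point (e.g. bound to :10901) defined
# in Traefik's static configuration.
apiVersion: traefik.containo.us/v1alpha1
kind: IngressRouteTCP
metadata:
  name: thanos-sidecar
  namespace: monitoring
spec:
  entryPoints:
    - sidecar-grpc
  routes:
    - match: HostSNI(`*`)
      services:
        - name: prometheus-thanos-sidecar   # illustrative service name
          port: 10901
  tls: {}   # terminate TLS at Traefik so the gRPC traffic is wrapped in TLS
```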
The K8S deployment configuration will be managed fully from CI. One project should cover all monitoring clusters in different CI projects that we will maintain.
## Architecture Graph

## TODO

*The list here is still a work-in-progress*
- Decide on the GKE subnetwork CIDR (as per https://gitlab.com/gitlab-com/runbooks/-/blob/update-runners-networking-documentation/docs/ci-runners/README.md#networking-layout-design) 👉 gitlab-com/runbooks!3771 (merged)
- Prepare GKE terraform module that will be shared across CI projects 👉 https://ops.gitlab.net/gitlab-com/gitlab-com-infrastructure/-/merge_requests/2837
  - Create a network dedicated for GKE
  - Reserve a static IP for the Load Balancer Service endpoint
  - Register a DNS A record for the Load Balancer Service endpoint IP
  - Register a DNS A record for the Prometheus service 👉 https://ops.gitlab.net/gitlab-com/gitlab-com-infrastructure/-/merge_requests/2870
  - Prepare the management service account that will be used by CI to update the configuration of the K8S deployments
  - Prepare the Service Account for accessing the GCS bucket
  - Prepare the GCS bucket for metrics storage
  - Fix static IP reservation 👉 https://ops.gitlab.net/gitlab-com/gitlab-com-infrastructure/-/merge_requests/2867
- Create GKE in the `ci` project 👉 https://ops.gitlab.net/gitlab-com/gitlab-com-infrastructure/-/merge_requests/2838
  - Create GKE with the module
  - Add required network peering
  - Give the default cluster service account permissions to list instances in GCE (needed for autodiscovery on deployed Prometheus) 👉 https://ops.gitlab.net/gitlab-com/gitlab-com-infrastructure/-/merge_requests/2846
  - Adjust firewall rules 👉 https://ops.gitlab.net/gitlab-com/gitlab-com-infrastructure/-/merge_requests/2847
- Prepare project for CI Runners monitoring configuration management 👉 https://ops.gitlab.net/ci-runners/gke-workloads
  - Create the project
  - Retrieve the keys for service accounts created with terraform and configure CI/CD variables
- Prepare K8S deployment configuration
  - Deploy Prometheus Operator
  - Prometheus configuration
    - "standard" K8S monitoring
    - GCP SD for runner managers - runner exporter (to be used in the `ci` project)
    - GCP SD for runner managers - node exporter (to be used in the `ci` project) 👉 https://ops.gitlab.net/ci-runners/gke-workloads/-/merge_requests/4
    - GCP SD for ephemeral runners (to be used in `ci` and in future in the `ci-plan-free-X` projects)
    - Add `deployment` label to targets 👉 https://ops.gitlab.net/ci-runners/gke-workloads/-/merge_requests/5
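A scrape job combining GCE service discovery with the static `deployment` label could be sketched like this. The project ID, zone, port, and instance-name pattern are assumptions for illustration, not the values used in the actual configuration.

```yaml
# Sketch of a scrape job using Prometheus GCE service discovery to find
# runner managers and attach a static `deployment` label.
scrape_configs:
  - job_name: runners-manager
    gce_sd_configs:
      - project: gitlab-ci      # illustrative GCP project ID
        zone: us-east1-c        # illustrative zone
        port: 9252              # illustrative runner exporter port
    relabel_configs:
      # keep only instances whose name marks them as runner managers
      # (the name pattern is an assumption)
      - source_labels: [__meta_gce_instance_name]
        regex: runners-manager-.*
        action: keep
      # attach a static `deployment` label to all targets of this job
      - target_label: deployment
        replacement: runners-manager
```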
  - Thanos configuration
    - Configure Thanos Sidecar
    - Configure Thanos Store Gateway
    - Configure Thanos Compact
  - Ingress/LoadBalancer configuration to expose the services
    - Secure access to Sidecar and Store ports on the load balancer with firewall rules 👉 https://ops.gitlab.net/ci-runners/gke-workloads/-/merge_requests/7
    - Expose Thanos Sidecar on a given port with access limited by the firewall 👉 https://ops.gitlab.net/ci-runners/gke-workloads/-/merge_requests/7
    - Expose Thanos Store Gateway on a given port with access limited by the firewall 👉 https://ops.gitlab.net/ci-runners/gke-workloads/-/merge_requests/7
    - Expose Prometheus on port 443 with access limited by OAuth2 with the Google provider 👉 https://ops.gitlab.net/ci-runners/gke-workloads/-/merge_requests/13
  - Add CI automation for K8S deployments handling
    - Add danger review job 👉 https://ops.gitlab.net/ci-runners/gke-workloads/-/merge_requests/9
    - Guard configuration deployments with the runner change lock 👉 https://ops.gitlab.net/ci-runners/gke-workloads/-/merge_requests/10
  - Define external labels for Prometheus and Thanos components 👉 https://ops.gitlab.net/ci-runners/gke-workloads/-/merge_requests/15
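External labels are what lets Thanos tell data from different Prometheus instances apart (and what the later `stage: cny` / `stage: main` binding steps manipulate). A minimal `prometheus.yml` sketch, with illustrative label values:

```yaml
# Sketch of Prometheus external labels for the Thanos integration.
# All values here are assumptions; the real ones come from the
# gke-workloads configuration.
global:
  external_labels:
    env: ci                  # environment this Prometheus belongs to
    stage: main              # flipped to `cny` during the canary binding
    monitor: ci-monitoring   # identifies this monitoring deployment
    replica: prometheus-0    # unique per replica so Thanos can deduplicate
```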
- Update `ci-runners` service documentation in the runbooks:
  - Update the GKE CIDR reservation 👉 gitlab-com/runbooks!3826 (merged)
  - Add the description of the monitoring stack in GKE to the runbook
- Test the `ci` deployment with a Thanos Query instance (not the production one)
- Investigate how to handle alerting rules with Thanos and the new CI monitoring
- Last steps
  - Find and use an e-mail address for the GCP Consent Screen and Let's Encrypt ACME communication
  - Decide on a final path for https://ops.gitlab.net/ci-runners/gke-workloads
  - Prepare the project with ops mirroring 👉 https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/14338
  - Merge the current project with the template
- Fix Prometheus configuration
  - Fix the `runner_manager` filter 👉 gitlab-com/runbooks!4025 (merged)
  - Configure metrics relabeling to not conflict with the GitLab Monitoring general labels like `stage`
  - Add recording rules to record some custom metrics used by the CI Runners service dashboards
  - Add missing `env` external label 👉 gitlab-com/gl-infra/ci-runners/k8s-workloads!10 (merged)
  - Add missing `fqdn` label 👉 gitlab-com/gl-infra/ci-runners/k8s-workloads!11 (merged)
  - Add project info to `fqdn` and `instance` labels 👉 gitlab-com/gl-infra/ci-runners/k8s-workloads!12 (merged)
  - Set specific `type` label for ephemeral-vms metrics 👉 gitlab-com/gl-infra/ci-runners/k8s-workloads!13 (merged)
  - Define `type` label manually in all places 👉 gitlab-com/gl-infra/ci-runners/k8s-workloads!15 (merged)
  - Change shard filtering and labeling in service discovery 👉 gitlab-com/gl-infra/ci-runners/k8s-workloads!18 (merged)
  - Add saturation recording rules
  - Move the `gitlab-com-ci.yml` alert rules that target runner metrics from `rules/` to `thanos-rules/`
  - Scale GKE cluster nodes up 👉 https://ops.gitlab.net/gitlab-com/gl-infra/config-mgmt/-/merge_requests/3126
  - Remove some less-important metrics from ephemeral VMs 👉 gitlab-com/gl-infra/ci-runners/k8s-workloads!21 (merged)
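The relabeling fix for labels that collide with the GitLab Monitoring conventions could look roughly like this sketch. The job name, target, and replacement label name (`ci_stage`) are assumptions chosen only to illustrate the mechanism.

```yaml
# Sketch: move a scraped `stage` label out of the way so it does not
# collide with the GitLab Monitoring `stage` label used for aggregation.
scrape_configs:
  - job_name: runners-manager
    static_configs:
      - targets: ['runners-manager-1.example.internal:9252']  # illustrative
    metric_relabel_configs:
      # copy the conflicting label's value to a non-conflicting name
      - source_labels: [stage]
        target_label: ci_stage
      # then drop the original label from all series
      - action: labeldrop
        regex: stage
```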
- Thanos Query integration (handle within a change management issue)
  - preparation
    - Update dashboards to include `environment` and `stage` labels in filters for all panels 👉 gitlab-com/runbooks!3951 (merged)
    - Update CI/CD "old" alerting rules to include `environment` and `stage` labels in partitioning 👉 gitlab-com/runbooks!3964 (merged)
  - "canary" test binding 👉 production#5670 (closed)
    This will be done to test that the integration with Thanos Query and the rest of the stack works, but without making the new Prometheus servers the source of the metrics used by dashboards and alerting.
    - prepare a silencer for `job="runners-manager", environment="gprd", stage="cny"`
    - set external label `stage: cny` on the K8S monitoring deployment
    - point Thanos Query to the new Prometheus servers:
      - to the Sidecar at `monitoring-lb.ci.ci-runners.gitlab.net:10901`
      - to the Store Gateway at `monitoring-lb.ci.ci-runners.gitlab.net:10903`
  - "production" test binding
    This will be done to start using the new data source for dashboards and alerting, while keeping the old ones available for a quick revert if anything goes wrong.
    - set external label `stage: main` on the K8S monitoring deployment
    - set label `stage: cny` for the scraping rules of `private` runner managers on the existing Prometheus server
  - Remove old configuration
    Fully switch to the new monitoring stack and remove the old configuration for `private` runner manager nodes.
    - remove the old configuration for `private` runner managers from the Prometheus server in GPRD
    - remove the VPC peering between `gitlab-gprd/gprd` and `gitlab-ci/ci` networks