Migrate monitoring to dedicated Prometheus servers on GKE
As described in https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/13710#note_637454201, while working on the `private` shard we decided to peer networks and reuse the existing Prometheus servers to scrape the new runner managers. This solution is not ideal, however, and none of the groups involved (Runner, Infra, Security) likes it. It was chosen as a temporary measure so that we could test the `private` infrastructure.
The best solution would be a dedicated Prometheus server installed in every project that needs one and connected to our Thanos cluster. An additional benefit is that we could deploy almost exactly the same setup to all `gitlab-ci-plan-free-X` projects and prepare new `gce_sd_configs` to autodiscover ephemeral VMs, which would give us back complete monitoring of job execution environments.
## Design
_The following design is still a work-in-progress_
Monitoring will be defined with almost the same configuration in all CI-related projects. It will be deployed on GKE.
GKE will use a group of reserved CIDRs to avoid conflicts with other peered networks where monitored resources will be hosted. It will be configured to use a network with the hardcoded name `gke` and a hardcoded `gke` subnetwork. For that purpose the following CIDRs [were chosen](https://gitlab.com/gitlab-com/runbooks/-/merge_requests/3771/diffs):
| CIDR          | Range type               | Purpose            |
|---------------|--------------------------|--------------------|
| `10.9.4.0/24` | Primary                  | GKE nodes range    |
| `10.8.0.0/16` | Secondary `gke-pods`     | GKE pods range     |
| `10.9.0.0/22` | Secondary `gke-services` | GKE services range |
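
The requirement that these ranges conflict neither with each other nor with peered networks can be checked mechanically. The following is a minimal sketch using Python's standard `ipaddress` module. The CIDRs are the ones from the table above; the helper names (`find_overlaps`, `conflicts_with`) are hypothetical and exist only for illustration.

```python
import ipaddress
from itertools import combinations

# CIDRs copied from the table above.
GKE_RANGES = {
    "nodes (primary)": ipaddress.ip_network("10.9.4.0/24"),
    "pods (gke-pods)": ipaddress.ip_network("10.8.0.0/16"),
    "services (gke-services)": ipaddress.ip_network("10.9.0.0/22"),
}


def find_overlaps(ranges):
    """Return pairs of range names whose networks overlap each other."""
    return [
        (name_a, name_b)
        for (name_a, net_a), (name_b, net_b) in combinations(ranges.items(), 2)
        if net_a.overlaps(net_b)
    ]


def conflicts_with(ranges, other_cidr):
    """Return the names of reserved ranges that overlap another CIDR,
    for example the range of a network that is about to be peered."""
    other = ipaddress.ip_network(other_cidr)
    return [name for name, net in ranges.items() if net.overlaps(other)]


if __name__ == "__main__":
    # An empty list means the three reserved ranges are disjoint.
    print(find_overlaps(GKE_RANGES))
    for name, net in GKE_RANGES.items():
        print(f"{name}: {net} ({net.num_addresses} addresses)")
```

A check like `conflicts_with(GKE_RANGES, "<peered CIDR>")` could be run before adding a new peering, since an overlapping range would make the peered network unroutable from the cluster.
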
Prometheus will be deployed with at least two replicas. Each will have a Thanos Sidecar running alongside it. The Sidecar's gRPC endpoint will be publicly accessible (we don't want to peer the `ops` or `gprd` network here), and the GCP Firewall will limit access to the Thanos Query public IPs only. gRPC communication should be wrapped in TLS.
We will also create a GCS bucket for long-term storage in each CI project. Thanos Sidecar will be configured with write access to this bucket.
Apart from the Sidecar, we will also deploy Thanos Store Gateway and Thanos Compact, both configured to use the same GCS bucket. The Store Gateway's gRPC endpoint will be exposed in the same way as the Sidecar's.
[Traefik](https://traefik.io/traefik) will be used as the ingress and load-balancing mechanism. It will expose the gRPC services on given ports (using TCP routing), as well as the Prometheus UI and its own dashboard. HTTP endpoints will be automatically redirected to HTTPS, and Let's Encrypt certificates will be used for TLS.
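
To illustrate what "gRPC wrapped in TLS behind a TCP router" means for a client, here is a minimal sketch of a TLS probe built only on Python's standard library. The host name and ports are the ones listed in the canary rollout notes of this issue. The function names are hypothetical, and the assumption that the endpoints negotiate HTTP/2 via ALPN (as gRPC over TLS normally does) is not confirmed by this document. Because the firewall only admits the Thanos Query public IPs, the probe is expected to time out from anywhere else.

```python
import socket
import ssl

# Taken from the canary rollout notes in this issue.
LB_HOST = "monitoring-lb.ci.ci-runners.gitlab.net"
GRPC_PORTS = {"thanos-sidecar": 10901, "thanos-store-gateway": 10903}


def grpc_tls_context():
    """Build a verifying TLS context that offers HTTP/2 via ALPN.

    The default context validates the certificate chain and the host name,
    which is what a publicly trusted Let's Encrypt certificate should pass.
    """
    ctx = ssl.create_default_context()
    ctx.set_alpn_protocols(["h2"])  # assumption: gRPC over TLS negotiates h2
    return ctx


def probe(host, port, timeout=5.0):
    """Open a TLS connection and report what was negotiated.

    Raises OSError (including ssl.SSLError and timeouts) on failure, for
    example when the caller's IP is not admitted by the firewall.
    """
    ctx = grpc_tls_context()
    with socket.create_connection((host, port), timeout=timeout) as raw:
        with ctx.wrap_socket(raw, server_hostname=host) as tls:
            cert = tls.getpeercert()
            return {
                "tls_version": tls.version(),
                "alpn": tls.selected_alpn_protocol(),
                "cert_not_after": cert.get("notAfter"),
            }


# Example (requires network access from an allowed source IP):
#   for name, port in GRPC_PORTS.items():
#       print(name, probe(LB_HOST, port))
```
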
The K8S deployment configuration will be managed entirely from CI. A single project should cover all the monitoring clusters in the different CI projects that we maintain.
### Architecture Graph

## TODO
_The list here is still a work-in-progress_
- [x] Decide on the GKE subnetwork CIDR (as per https://gitlab.com/gitlab-com/runbooks/-/blob/update-runners-networking-documentation/docs/ci-runners/README.md#networking-layout-design) :point_right: https://gitlab.com/gitlab-com/runbooks/-/merge_requests/3771
- [x] Prepare GKE terraform module that will be shared across CI projects :point_right: https://ops.gitlab.net/gitlab-com/gitlab-com-infrastructure/-/merge_requests/2837
- [x] Create a network [dedicated for GKE](https://gitlab.com/gitlab-com/runbooks/-/tree/master/docs/ci-runners#ephemeral-runner-vms-networking)
- [x] Reserve a static IP for Load Balancer Service endpoint
- [x] Register a DNS A record for Load Balancer Service endpoint IP
- [x] Register a DNS A record for Prometheus service :point_right: https://ops.gitlab.net/gitlab-com/gitlab-com-infrastructure/-/merge_requests/2870
- [x] Prepare the management service account that will be used by CI to update configuration of the K8S deployments
- [x] Prepare the Service Account for accessing the GCS bucket
- [x] Prepare GCS bucket for metrics storage
- [x] Fix static IP reservation :point_right: https://ops.gitlab.net/gitlab-com/gitlab-com-infrastructure/-/merge_requests/2867
- [x] Create GKE in the `ci` project :point_right: https://ops.gitlab.net/gitlab-com/gitlab-com-infrastructure/-/merge_requests/2838
- [x] Create GKE with the module
- [x] Add required network peering
- [x] Give the default cluster service account permissions to list instances in GCE (needed for autodiscovery on deployed Prometheus) :point_right: https://ops.gitlab.net/gitlab-com/gitlab-com-infrastructure/-/merge_requests/2846
- [x] Adjust firewall rules :point_right: https://ops.gitlab.net/gitlab-com/gitlab-com-infrastructure/-/merge_requests/2847
- [x] Prepare project for CI Runners monitoring configuration management :point_right: https://ops.gitlab.net/ci-runners/gke-workloads
- [x] Create the project
- [x] Retrieve the keys for service accounts created with terraform and configure CI/CD variables
- [x] Prepare K8S deployment configuration
- [x] Deploy Prometheus Operator
- [x] Prometheus configuration
- [x] "standard" K8S monitoring
- [x] GCP SD for runner managers - runner exporter (to be used in `ci` project)
- [x] GCP SD for runner managers - node exporter (to be used in `ci` project) :point_right: https://ops.gitlab.net/ci-runners/gke-workloads/-/merge_requests/4
- [x] GCP SD for ephemeral runners (to be used in `ci` and in future in the `ci-plan-free-X` projects)
- [x] Add `deployment` label to targets :point_right: https://ops.gitlab.net/ci-runners/gke-workloads/-/merge_requests/5
- [x] Thanos configuration
- [x] Configure Thanos Sidecar
- [x] Configure Thanos Store Gateway
- [x] Configure Thanos Compact
- [x] Ingress/LoadBalancer configuration to expose the services
- [x] Secure access to Sidecar and Store ports on load-balancer with firewall rules :point_right: https://ops.gitlab.net/ci-runners/gke-workloads/-/merge_requests/7
- [x] Expose Thanos Sidecar on a given port with access limited by the firewall :point_right: https://ops.gitlab.net/ci-runners/gke-workloads/-/merge_requests/7
- [x] Expose Thanos Store Gateway on a given port with access limited by the firewall :point_right: https://ops.gitlab.net/ci-runners/gke-workloads/-/merge_requests/7
- [x] Expose Prometheus on port 443 with access limited by OAuth2 with Google provider :point_right: https://ops.gitlab.net/ci-runners/gke-workloads/-/merge_requests/13
- [x] Add CI automation for K8S deployments handling
- [x] Add `danger review` job :point_right: https://ops.gitlab.net/ci-runners/gke-workloads/-/merge_requests/9
- [x] Guard configuration deployments with runner change lock :point_right: https://ops.gitlab.net/ci-runners/gke-workloads/-/merge_requests/10
- [x] Define external labels for Prometheus and Thanos components :point_right: https://ops.gitlab.net/ci-runners/gke-workloads/-/merge_requests/15
- [x] Update `ci-runners` service documentation in the runbooks:
- [x] Update the GKE CIDR reservation :point_right: https://gitlab.com/gitlab-com/runbooks/-/merge_requests/3826
- [x] Add the description of monitoring stack in GKE to the runbook
- [x] :point_right: https://gitlab.com/gitlab-com/runbooks/-/merge_requests/3845
- [x] :point_right: gitlab-com/runbooks#67
- [x] Test `ci` deployment with a Thanos Query instance (not the production one)
- [ ] Investigate how to handle alerting rules with Thanos and new CI monitoring
- [ ] Last steps
- [ ] Find and use an e-mail address for GCP Consent Screen and Let's Encrypt ACME communication
- [x] Decide on a final path for https://ops.gitlab.net/ci-runners/gke-workloads
- [x] Prepare the project with ops mirroring :point_right: https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/14338
- [x] Merge the current project with the template
- [ ] Fix Prometheus configuration
- [x] Fix the `runner_manager` filter :point_right: https://gitlab.com/gitlab-com/runbooks/-/merge_requests/4025
- [x] Configure metrics relabeling to not conflict with the GitLab Monitoring general labels like `stage`:
- [x] https://gitlab.com/gitlab-com/gl-infra/ci-runners/k8s-workloads/-/merge_requests/6
- [x] https://gitlab.com/gitlab-com/gl-infra/ci-runners/k8s-workloads/-/merge_requests/7
- [x] Add recording rules to record some custom metrics used by ~"Service::CI Runners" dashboards
- [x] https://gitlab.com/gitlab-com/gl-infra/ci-runners/k8s-workloads/-/merge_requests/8
- [x] https://gitlab.com/gitlab-com/gl-infra/ci-runners/k8s-workloads/-/merge_requests/9
- [x] Add missing `env` external label :point_right: https://gitlab.com/gitlab-com/gl-infra/ci-runners/k8s-workloads/-/merge_requests/10
- [x] Add missing `fqdn` label :point_right: https://gitlab.com/gitlab-com/gl-infra/ci-runners/k8s-workloads/-/merge_requests/11
- [x] Add project info to `fqdn` and `instance` labels :point_right: https://gitlab.com/gitlab-com/gl-infra/ci-runners/k8s-workloads/-/merge_requests/12
- [x] Set specific `type` label for ephemeral-vms metrics :point_right: https://gitlab.com/gitlab-com/gl-infra/ci-runners/k8s-workloads/-/merge_requests/13
- [x] Define `type` label manually in all places :point_right: https://gitlab.com/gitlab-com/gl-infra/ci-runners/k8s-workloads/-/merge_requests/15
- [x] Change shard filtering and labeling in service discovery :point_right: https://gitlab.com/gitlab-com/gl-infra/ci-runners/k8s-workloads/-/merge_requests/18
- [ ] Add saturation recording rules
- [ ] Move the `gitlab-com-ci.yml` alert rules that target runner metrics from `rules/` to `thanos-rules/`
- [ ] Scale GKE cluster nodes up :point_right: https://ops.gitlab.net/gitlab-com/gl-infra/config-mgmt/-/merge_requests/3126
- [x] Remove some less-important metrics from ephemeral VMs :point_right: https://gitlab.com/gitlab-com/gl-infra/ci-runners/k8s-workloads/-/merge_requests/21
- [ ] Thanos Query integration (handle within a change management issue)
- [x] preparation
- [x] Update dashboards to include `environment` and `stage` labels in filters for all panels :point_right: https://gitlab.com/gitlab-com/runbooks/-/merge_requests/3951
- [x] Update CI/CD "old" alerting rules to include `environment` and `stage` labels in partitioning :point_right: https://gitlab.com/gitlab-com/runbooks/-/merge_requests/3964
- [x] "canary" test binding :point_right: https://gitlab.com/gitlab-com/gl-infra/production/-/issues/5670
This will be done to verify that the integration with Thanos Query and the rest of the stack works, without yet making the new Prometheus servers the source of metrics used by dashboards and alerting.
- prepare silencer for `job="runners-manager", environment="gprd", stage="cny"`
- set external label `stage: cny` on the K8S monitoring deployment
- Point Thanos Query to the new Prometheus servers
- to the Sidecar at `monitoring-lb.ci.ci-runners.gitlab.net:10901`
- to the Store Gateway at `monitoring-lb.ci.ci-runners.gitlab.net:10903`
- [ ] "production" test binding
This will be done to start using the new data source for dashboards and alerting while keeping the old ones available for a quick revert if anything goes wrong.
- set external label `stage: main` on the K8S monitoring deployment
- set label `stage: cny` for the scraping rules of `private` runner managers on the existing Prometheus server
- [ ] Remove old configuration
Fully switch to the new monitoring stack and remove old configuration for `private` runner manager nodes
- remove old configuration for `private` runner managers from Prometheus server in GPRD
- remove the VPC peering between `gitlab-gprd/gprd` and `gitlab-ci/ci` networks.
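
The phased Thanos Query integration above can be summarized as data. The sketch below is only an illustration of the plan, not the actual deployment configuration: the host, ports, stage labels and silence matchers are copied from the checklist, while the names `PHASES`, `thanos_query_endpoint_args` and `is_silenced` are hypothetical. It assumes Thanos Query is pointed at gRPC endpoints with the `--endpoint` flag (older Thanos releases used `--store`, which can be passed via the `flag` parameter).

```python
# Values copied from the checklist above.
LB_HOST = "monitoring-lb.ci.ci-runners.gitlab.net"
THANOS_PORTS = {"sidecar": 10901, "store-gateway": 10903}

# Stage labels per rollout phase.
# "new_stack" is the external label on the K8S monitoring deployment.
# "old_private_scrape" is the label set on the existing Prometheus server's
# scraping rules for `private` runner managers (None = not changed in
# this phase).
PHASES = {
    "canary": {"new_stack": "cny", "old_private_scrape": None},
    "production": {"new_stack": "main", "old_private_scrape": "cny"},
}

# Matchers of the silence prepared for the canary phase.
CANARY_SILENCE = {"job": "runners-manager", "environment": "gprd", "stage": "cny"}


def thanos_query_endpoint_args(flag="--endpoint"):
    """Render one command-line argument per exposed gRPC endpoint."""
    return [f"{flag}={LB_HOST}:{port}" for port in THANOS_PORTS.values()]


def is_silenced(alert_labels, matchers=CANARY_SILENCE):
    """Return True when every matcher equals the corresponding alert label."""
    return all(alert_labels.get(key) == value for key, value in matchers.items())


if __name__ == "__main__":
    print(thanos_query_endpoint_args())
    for phase, labels in PHASES.items():
        print(phase, labels)
```

The point of the label swap is that, in each phase, only one of the two data sources carries `stage: main`, so dashboards and alerts filtered on that label read from exactly one source while the other remains available under `stage: cny`.
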