# Migrate monitoring to dedicated Prometheus servers on GKE
As described in https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/13710#note_637454201, while working on the `private` shard we decided to peer networks and re-use the existing Prometheus servers for scraping the new runner managers. This solution is, however, not ideal, and no one (the Runner, Infra, and Security groups) likes it. It was chosen as a temporary solution to make our work on testing the private infrastructure possible.
The best solution would be to have a dedicated Prometheus server installed in every project where we need one, connected to our Thanos cluster. An additional benefit is that with this in place we can deploy an almost identical setup to all `gitlab-ci-plan-free-X` projects, prepare new `gce_sd_configs` to autodiscover ephemeral VMs, and get the monitoring of job execution environments back and complete.
## Design
*The following design is still a work-in-progress*
Monitoring will be defined with almost the same configuration in all CI-related projects. It will be deployed using GKE.

GKE will use a group of reserved CIDRs to avoid conflicts with other peered networks where the monitored resources will be hosted. It will be configured to use a network with the hardcoded name `gke` with a hardcoded `gke` subnetwork. For that purpose the following CIDRs were chosen:
| CIDR | Range type | Purpose |
|---|---|---|
| 10.9.4.0/24 | Primary | GKE nodes range |
| 10.8.0.0/16 | Secondary `gke-pods` | GKE pods range |
| 10.9.0.0/22 | Secondary `gke-services` | GKE services range |
Prometheus will be deployed in at least two replicas, each with a Thanos Sidecar running alongside. The Sidecar's gRPC endpoint will be exposed publicly (we don't want to peer the `ops` or `gprd` network here), and a GCP Firewall rule will limit access to it to the Thanos Query public IPs only. The gRPC communication should be wrapped in TLS.
We will also have a GCS bucket for long-term storage created per CI project. Thanos Sidecar will be configured with access to write to this bucket.
Apart from the Sidecar, we will also have Thanos Store Gateway and Thanos Compact deployed and configured to use the same GCS bucket. The Store Gateway's gRPC endpoint will be exposed in the same way as the Sidecar's.
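Since the TODO below mentions deploying the Prometheus Operator, the replica and Sidecar setup could be sketched roughly as follows. This is only an illustration: the resource names, namespace, Thanos version, and bucket name are assumptions, not the actual deployed values.

```yaml
# Sketch of a Prometheus Operator resource: two replicas, each running a
# Thanos Sidecar that ships blocks to a per-project GCS bucket.
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: ci-monitoring          # illustrative name
  namespace: monitoring
spec:
  replicas: 2
  thanos:
    version: v0.22.0           # illustrative version
    objectStorageConfig:       # points at the secret defined below
      name: thanos-objstore
      key: objstore.yml
---
# Secret holding the Thanos object storage configuration (objstore format)
apiVersion: v1
kind: Secret
metadata:
  name: thanos-objstore
  namespace: monitoring
stringData:
  objstore.yml: |
    type: GCS
    config:
      bucket: ci-project-metrics   # per-CI-project bucket (illustrative)
```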
Traefik will be used as the ingress and load-balancing mechanism. It will expose the gRPC services on given ports (using TCP routing), as well as the Prometheus UI and Traefik's own dashboard. HTTP endpoints will be automatically redirected to HTTPS, and Let's Encrypt certificates will be used for TLS.
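The TCP routing for one of the gRPC endpoints could look roughly like the sketch below, using Traefik's `IngressRouteTCP` CRD. The entry point, service, and namespace names here are assumptions for illustration; the entry point itself would have to be declared in Traefik's static configuration.

```yaml
# Sketch: expose the Thanos Sidecar gRPC endpoint via Traefik TCP routing.
# "sidecar-grpc" is an assumed entry point (e.g. bound to :10901) defined
# in Traefik's static configuration.
apiVersion: traefik.containo.us/v1alpha1
kind: IngressRouteTCP
metadata:
  name: thanos-sidecar
  namespace: monitoring
spec:
  entryPoints:
    - sidecar-grpc
  routes:
    - match: HostSNI(`*`)
      services:
        - name: prometheus-thanos-sidecar   # illustrative service name
          port: 10901
  tls: {}   # terminate TLS at Traefik so the gRPC traffic is wrapped in TLS
```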
The K8S deployment configuration will be managed fully from CI. One project should cover all monitoring clusters in different CI projects that we will maintain.
## Architecture Graph

## TODO

*The list here is still a work-in-progress*
- Decide on the GKE subnetwork CIDR (as per https://gitlab.com/gitlab-com/runbooks/-/blob/update-runners-networking-documentation/docs/ci-runners/README.md#networking-layout-design) 👉 gitlab-com/runbooks!3771 (merged)
- Prepare GKE terraform module that will be shared across CI projects 👉 https://ops.gitlab.net/gitlab-com/gitlab-com-infrastructure/-/merge_requests/2837
  - Create a network dedicated for GKE
  - Reserve a static IP for the Load Balancer Service endpoint
  - Register a DNS A record for the Load Balancer Service endpoint IP
  - Register a DNS A record for the Prometheus service 👉 https://ops.gitlab.net/gitlab-com/gitlab-com-infrastructure/-/merge_requests/2870
  - Prepare the management service account that will be used by CI to update the configuration of the K8S deployments
  - Prepare the Service Account for accessing the GCS bucket
  - Prepare the GCS bucket for metrics storage
  - Fix static IP reservation 👉 https://ops.gitlab.net/gitlab-com/gitlab-com-infrastructure/-/merge_requests/2867
- Create GKE in the `ci` project 👉 https://ops.gitlab.net/gitlab-com/gitlab-com-infrastructure/-/merge_requests/2838
  - Create GKE with the module
  - Add required network peering
  - Give the default cluster service account permissions to list instances in GCE (needed for autodiscovery on deployed Prometheus) 👉 https://ops.gitlab.net/gitlab-com/gitlab-com-infrastructure/-/merge_requests/2846
  - Adjust firewall rules 👉 https://ops.gitlab.net/gitlab-com/gitlab-com-infrastructure/-/merge_requests/2847
- Prepare project for CI Runners monitoring configuration management 👉 https://ops.gitlab.net/ci-runners/gke-workloads
  - Create the project
  - Retrieve the keys for service accounts created with terraform and configure CI/CD variables
- Prepare K8S deployment configuration
  - Deploy Prometheus Operator
  - Prometheus configuration
    - "standard" K8S monitoring
    - GCP SD for runner managers - runner exporter (to be used in the `ci` project)
    - GCP SD for runner managers - node exporter (to be used in the `ci` project) 👉 https://ops.gitlab.net/ci-runners/gke-workloads/-/merge_requests/4
    - GCP SD for ephemeral runners (to be used in `ci` and in future in the `ci-plan-free-X` projects)
    - Add `deployment` label to targets 👉 https://ops.gitlab.net/ci-runners/gke-workloads/-/merge_requests/5
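A scrape job combining GCE service discovery with the static `deployment` label could be sketched like this. The project ID, zone, port, and instance-name pattern are assumptions for illustration, not the values used in the actual configuration.

```yaml
# Sketch of a scrape job using Prometheus GCE service discovery to find
# runner managers and attach a static `deployment` label.
scrape_configs:
  - job_name: runners-manager
    gce_sd_configs:
      - project: gitlab-ci      # illustrative GCP project ID
        zone: us-east1-c        # illustrative zone
        port: 9252              # illustrative runner exporter port
    relabel_configs:
      # keep only instances whose name marks them as runner managers
      # (the name pattern is an assumption)
      - source_labels: [__meta_gce_instance_name]
        regex: runners-manager-.*
        action: keep
      # attach a static `deployment` label to all targets of this job
      - target_label: deployment
        replacement: runners-manager
```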
  - Thanos configuration
    - Configure Thanos Sidecar
    - Configure Thanos Store Gateway
    - Configure Thanos Compact
  - Ingress/LoadBalancer configuration to expose the services
    - Secure access to Sidecar and Store ports on the load balancer with firewall rules 👉 https://ops.gitlab.net/ci-runners/gke-workloads/-/merge_requests/7
    - Expose Thanos Sidecar on a given port with access limited by the firewall 👉 https://ops.gitlab.net/ci-runners/gke-workloads/-/merge_requests/7
    - Expose Thanos Store Gateway on a given port with access limited by the firewall 👉 https://ops.gitlab.net/ci-runners/gke-workloads/-/merge_requests/7
    - Expose Prometheus on port 443 with access limited by OAuth2 with the Google provider 👉 https://ops.gitlab.net/ci-runners/gke-workloads/-/merge_requests/13
  - Add CI automation for K8S deployments handling
    - Add danger review job 👉 https://ops.gitlab.net/ci-runners/gke-workloads/-/merge_requests/9
    - Guard configuration deployments with the runner change lock 👉 https://ops.gitlab.net/ci-runners/gke-workloads/-/merge_requests/10
  - Define external labels for Prometheus and Thanos components 👉 https://ops.gitlab.net/ci-runners/gke-workloads/-/merge_requests/15
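External labels are what lets Thanos tell data from different Prometheus instances apart (and what the later `stage: cny` / `stage: main` binding steps manipulate). A minimal `prometheus.yml` sketch, with illustrative label values:

```yaml
# Sketch of Prometheus external labels for the Thanos integration.
# All values here are assumptions; the real ones come from the
# gke-workloads configuration.
global:
  external_labels:
    env: ci                  # environment this Prometheus belongs to
    stage: main              # flipped to `cny` during the canary binding
    monitor: ci-monitoring   # identifies this monitoring deployment
    replica: prometheus-0    # unique per replica so Thanos can deduplicate
```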
- Update `ci-runners` service documentation in the runbooks:
  - Update the GKE CIDR reservation 👉 gitlab-com/runbooks!3826 (merged)
  - Add the description of the monitoring stack in GKE to the runbook
- Test the `ci` deployment with a Thanos Query instance (not the production one)
- Investigate how to handle alerting rules with Thanos and the new CI monitoring
- Last steps
  - Find and use an e-mail address for the GCP Consent Screen and Let's Encrypt ACME communication
  - Decide on a final path for https://ops.gitlab.net/ci-runners/gke-workloads
  - Prepare the project with ops mirroring 👉 https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/14338
  - Merge the current project with the template
- Fix Prometheus configuration
  - Fix the `runner_manager` filter 👉 gitlab-com/runbooks!4025 (merged)
  - Configure metrics relabeling to not conflict with the GitLab Monitoring general labels like `stage`
  - Add recording rules to record some custom metrics used by the CI Runners service dashboards
  - Add missing `env` external label 👉 gitlab-com/gl-infra/ci-runners/k8s-workloads!10 (merged)
  - Add missing `fqdn` label 👉 gitlab-com/gl-infra/ci-runners/k8s-workloads!11 (merged)
  - Add project info to `fqdn` and `instance` labels 👉 gitlab-com/gl-infra/ci-runners/k8s-workloads!12 (merged)
  - Set specific `type` label for ephemeral-vms metrics 👉 gitlab-com/gl-infra/ci-runners/k8s-workloads!13 (merged)
  - Define `type` label manually in all places 👉 gitlab-com/gl-infra/ci-runners/k8s-workloads!15 (merged)
  - Change shard filtering and labeling in service discovery 👉 gitlab-com/gl-infra/ci-runners/k8s-workloads!18 (merged)
  - Add saturation recording rules
  - Move the `gitlab-com-ci.yml` alert rules that target runner metrics from `rules/` to `thanos-rules/`
  - Scale GKE cluster nodes up 👉 https://ops.gitlab.net/gitlab-com/gl-infra/config-mgmt/-/merge_requests/3126
  - Remove some less-important metrics from ephemeral VMs 👉 gitlab-com/gl-infra/ci-runners/k8s-workloads!21 (merged)
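The relabeling fix for labels that collide with the GitLab Monitoring conventions could look roughly like this sketch. The job name, target, and replacement label name (`ci_stage`) are assumptions chosen only to illustrate the mechanism.

```yaml
# Sketch: move a scraped `stage` label out of the way so it does not
# collide with the GitLab Monitoring `stage` label used for aggregation.
scrape_configs:
  - job_name: runners-manager
    static_configs:
      - targets: ['runners-manager-1.example.internal:9252']  # illustrative
    metric_relabel_configs:
      # copy the conflicting label's value to a non-conflicting name
      - source_labels: [stage]
        target_label: ci_stage
      # then drop the original label from all series
      - action: labeldrop
        regex: stage
```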
- Thanos Query integration (handle within a change management issue)
  - preparation
    - Update dashboards to include `environment` and `stage` labels in filters for all panels 👉 gitlab-com/runbooks!3951 (merged)
    - Update CI/CD "old" alerting rules to include `environment` and `stage` labels in partitioning 👉 gitlab-com/runbooks!3964 (merged)
  - "canary" test binding 👉 production#5670 (closed)
    This will be done to test that the integration with Thanos Query and the rest of the stack works, but without making the new Prometheus servers the source of the metrics used by dashboards and alerting.
    - prepare a silencer for `job="runners-manager", environment="gprd", stage="cny"`
    - set external label `stage: cny` on the K8S monitoring deployment
    - point Thanos Query to the new Prometheus servers:
      - to the Sidecar at `monitoring-lb.ci.ci-runners.gitlab.net:10901`
      - to the Store Gateway at `monitoring-lb.ci.ci-runners.gitlab.net:10903`
  - "production" test binding
    This will be done to start using the new data source for dashboards and alerting, while keeping the old ones available for a quick revert if anything goes wrong.
    - set external label `stage: main` on the K8S monitoring deployment
    - set label `stage: cny` for the scraping rules of `private` runner managers on the existing Prometheus server
  - Remove old configuration
    Fully switch to the new monitoring stack and remove the old configuration for `private` runner manager nodes.
    - remove the old configuration for `private` runner managers from the Prometheus server in GPRD
    - remove the VPC peering between `gitlab-gprd/gprd` and `gitlab-ci/ci` networks