# Connect new ci-runners Prometheus servers to Thanos for tests
**Production Change**

## Change Summary
As part of https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/13886 (and &456 (closed) in the higher scope) we've configured dedicated Prometheus servers that will monitor our "new" blue/green deployed CI Runners infrastructure - both runner managers and ephemeral runners.
To make the metrics available to our monitoring dashboards and alerting, we need to connect them to our Thanos Query - the main source of monitoring data.
This issue is the first step and covers the preparation and deployment of the connection for testing purposes. The metrics will be labeled with `stage=cny`, which is not used by the runners monitoring stack, as - at least for now - everything that we have within the Runners infrastructure is `production::main`.
We will add an alert silence so that potential failures in the configuration do not fire false alerts.
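For reference, such a silence can be prepared programmatically against the Alertmanager v2 API. This is a hedged sketch only: the payload shape follows the public Alertmanager `/api/v2/silences` API, while the duration, `createdBy`, and `comment` values here are illustrative.

```python
# Sketch of an Alertmanager v2 silence payload for the new cny test metrics.
# The matchers mirror the labels used in this change; duration and metadata
# are illustrative, not the values actually used.
import json
from datetime import datetime, timedelta, timezone

def build_silence(duration_hours=24):
    """Build a silence payload covering the cny-staged runner metrics."""
    now = datetime.now(timezone.utc)
    return {
        "matchers": [
            {"name": "tier", "value": "runners", "isRegex": False},
            {"name": "environment", "value": "gprd", "isRegex": False},
            {"name": "stage", "value": "cny", "isRegex": False},
        ],
        "startsAt": now.isoformat(),
        "endsAt": (now + timedelta(hours=duration_hours)).isoformat(),
        "createdBy": "tmaczukin",
        "comment": "Silence false alerts while testing the new ci-runners Prometheus stack",
    }

payload = build_silence()
print(json.dumps(payload, indent=2))
```

The resulting JSON can be POSTed to `/api/v2/silences` on the Alertmanager host; the silence linked in the steps below was created through the alerts.gitlab.net UI.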
This setup will be left in place for a week or two to confirm that the new stack is working properly. After that we will continue with the second step and create another change management issue to swap the `stage` labels between the new metrics source and the old one, and again leave it for a week or two.
Finally, in the last step and last change management issue, we will remove the old configuration and the VPC peering between the networks in the `gitlab-gprd` and `gitlab-ci` GCP projects.
## Change Details
- **Services Impacted** - ~"Service::CI Runners" ~"Service::Thanos"
- **Change Technician** - @tmaczukin
- **Change Reviewer** - @mwasilewski-gitlab
- **Time tracking** - 1.5h
- **Downtime Component** - no downtime expected
## Detailed steps for the change
### Pre-Change Steps - steps to be completed before execution of the change

Estimated Time to Complete (mins) - 20 min

- [ ] Set label ~"change::in-progress" on this issue
- [ ] Prepare a silence for `tier="runners", environment="gprd", stage="cny"` 👉 https://alerts.gitlab.net/#/silences/1e01ce5f-1bb5-4b0a-9609-1cc1984e03ca
- [ ] Set external label `stage: cny` on the K8S monitoring deployment 👉 gitlab-com/gl-infra/ci-runners/k8s-workloads!2 (merged)
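The external label change amounts to a small addition to the Prometheus server configuration. A minimal sketch, shown as a raw `prometheus.yml` fragment; the actual change lives in the linked k8s-workloads MR and may set it through deployment values rather than raw config:

```yaml
# prometheus.yml (illustrative fragment): every metric scraped by this server
# is tagged with stage=cny before it reaches Thanos
global:
  external_labels:
    stage: cny
```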
### Change Steps - steps to take to execute the change

Estimated Time to Complete (mins) - 20 min

- [ ] Point Thanos Query to the new Prometheus stack 👉 gitlab-com/gl-infra/k8s-workloads/tanka-deployments!212 (merged)
  - to the Sidecar at `monitoring-lb.ci.ci-runners.gitlab.net:10901`
  - to the Store Gateway at `monitoring-lb.ci.ci-runners.gitlab.net:10903`
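On the Thanos side, pointing Query at the new endpoints is equivalent to adding `--store` flags. This is an illustrative sketch only, since the real change is applied through the linked tanka-deployments MR:

```shell
# Illustrative flags only; 10901 serves recent data via the Prometheus
# Sidecar, 10903 serves historical blocks via the Store Gateway
thanos query \
  --store=monitoring-lb.ci.ci-runners.gitlab.net:10901 \
  --store=monitoring-lb.ci.ci-runners.gitlab.net:10903
```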
### Post-Change Steps - steps to take to verify the change

Estimated Time to Complete (mins) - 20 min

- [ ] Check if the new stores were added 👉 https://thanos.gitlab.net/stores
- [ ] Confirm that we can see GitLab Runner metrics with `stage="cny"` 👉 here
- [ ] Confirm that the number of example metric entries is the same for both `stage="main"` and `stage="cny"` 👉 here
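The last two checks can be run as PromQL queries in the Thanos Query UI. A sketch assuming `gitlab_runner_jobs` as the example metric (any runner-manager metric works):

```promql
# Metrics visible under the new stage label:
gitlab_runner_jobs{stage="cny"}

# Series-count difference between the old and new sources; this should be 0
# if both stacks scrape the same targets:
count(gitlab_runner_jobs{stage="main"}) - count(gitlab_runner_jobs{stage="cny"})
```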
## Rollback

### Rollback steps - steps to be taken in the event of a need to rollback this change

Estimated Time to Complete (mins) - 30 min

- [ ] Revert the Thanos Query integration 👉 LINK_TO_THE_MR
- [ ] Remove the silence
## Monitoring

### Key metrics to observe

- **Metric**: Metric Name
- **Location**: https://dashboards.gitlab.net/d/ci-runners-incident-runner-manager/ci-runners-incident-support-runner-manager?orgId=1&refresh=1m
- **What changes to this metric should prompt a rollback**: runner manager metrics becoming doubled (duplicate series reported from both Prometheus stacks)
## Summary of infrastructure changes

- [ ] Does this change introduce new compute instances?
- [ ] Does this change re-size any existing compute instances?
- [ ] Does this change introduce any additional usage of tooling like Elastic Search, CDNs, Cloudflare, etc?
## Changes checklist

- [ ] This issue has a criticality label (e.g. ~C1, ~C2, ~C3, ~C4) and a change-type label (e.g. ~"change::unscheduled", ~"change::scheduled") based on the Change Management Criticalities.
- [ ] This issue has the change technician as the assignee.
- [ ] Pre-Change, Change, Post-Change, and Rollback steps have been filled out and reviewed.
- [ ] This Change Issue is linked to the appropriate Issue and/or Epic.
- [ ] Necessary approvals have been completed based on the Change Management Workflow.
- [ ] Change has been tested in staging and results noted in a comment on this issue.
- [ ] A dry-run has been conducted and results noted in a comment on this issue.
- [ ] SRE on-call has been informed prior to change being rolled out. (In the #production channel, mention `@sre-oncall` and this issue and await their acknowledgement.)
- [ ] Release managers have been informed (if needed! cases include DB changes) prior to change being rolled out. (In the #production channel, mention `@release-managers` and this issue and await their acknowledgment.)
- [ ] There are currently no active incidents.