# Connect new ci-runners Prometheus servers to Thanos for tests
**Production Change**

## Change Summary
As part of https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/13886 (and &456 (closed) in the higher scope) we've configured dedicated Prometheus servers that will monitor our "new" blue/green deployed CI Runners infrastructure - both runner managers and ephemeral runners.
To make the metrics available to our monitoring dashboards and alerting, we need to connect them to our Thanos Query - the main source of monitoring data.
This issue is the first step and covers the preparation and deployment of the connection for testing purposes. The metrics will be labeled with `stage=cny`, which is not used by the runners monitoring stack, as - at least for now - everything that we have within the Runners infrastructure is `production::main`.
We will add an alert silence so that potential failures in the configuration do not fire false alerts.
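For reference, such a silence can be prepared programmatically against the Alertmanager v2 API. This is a hedged sketch only: the payload shape follows the public Alertmanager `/api/v2/silences` API, while the duration, `createdBy`, and `comment` values here are illustrative.

```python
# Sketch of an Alertmanager v2 silence payload for the new cny test metrics.
# The matchers mirror the labels used in this change; duration and metadata
# are illustrative, not the values actually used.
import json
from datetime import datetime, timedelta, timezone

def build_silence(duration_hours=24):
    """Build a silence payload covering the cny-staged runner metrics."""
    now = datetime.now(timezone.utc)
    return {
        "matchers": [
            {"name": "tier", "value": "runners", "isRegex": False},
            {"name": "environment", "value": "gprd", "isRegex": False},
            {"name": "stage", "value": "cny", "isRegex": False},
        ],
        "startsAt": now.isoformat(),
        "endsAt": (now + timedelta(hours=duration_hours)).isoformat(),
        "createdBy": "tmaczukin",
        "comment": "Silence false alerts while testing the new ci-runners Prometheus stack",
    }

payload = build_silence()
print(json.dumps(payload, indent=2))
```

The resulting JSON can be POSTed to `/api/v2/silences` on the Alertmanager host; the silence linked in the steps below was created through the alerts.gitlab.net UI.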
This setup will be left in place for a week or two to confirm that the new stack is working properly. After that we will continue with the second step and create another change management issue to swap the `stage` labels between the new metrics source and the old one, and again leave it for a week or two.
Finally, in the last step and last change management issue, we will remove the old configuration and the VPC peering between the networks in the `gitlab-gprd` and `gitlab-ci` GCP projects.
## Change Details
- **Services Impacted** - ~"Service::CI Runners" ~"Service::Thanos"
- **Change Technician** - @tmaczukin
- **Change Reviewer** - @mwasilewski-gitlab
- **Time tracking** - 1.5h
- **Downtime Component** - no downtime expected
## Detailed steps for the change
### Pre-Change Steps - steps to be completed before execution of the change

Estimated Time to Complete (mins) - 20 min

- [ ] Set label ~"change::in-progress" on this issue
- [ ] Prepare a silence for `tier="runners", environment="gprd", stage="cny"` 👉 https://alerts.gitlab.net/#/silences/1e01ce5f-1bb5-4b0a-9609-1cc1984e03ca
- [ ] Set external label `stage: cny` on the K8S monitoring deployment 👉 gitlab-com/gl-infra/ci-runners/k8s-workloads!2 (merged)
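The external label change amounts to a small addition to the Prometheus server configuration. A minimal sketch, shown as a raw `prometheus.yml` fragment; the actual change lives in the linked k8s-workloads MR and may set it through deployment values rather than raw config:

```yaml
# prometheus.yml (illustrative fragment): every metric scraped by this server
# is tagged with stage=cny before it reaches Thanos
global:
  external_labels:
    stage: cny
```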
### Change Steps - steps to take to execute the change

Estimated Time to Complete (mins) - 20 min

- [ ] Point Thanos Query to the new Prometheus stack 👉 gitlab-com/gl-infra/k8s-workloads/tanka-deployments!212 (merged)
  - to the Sidecar at `monitoring-lb.ci.ci-runners.gitlab.net:10901`
  - to the Store Gateway at `monitoring-lb.ci.ci-runners.gitlab.net:10903`
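On the Thanos side, pointing Query at the new endpoints is equivalent to adding `--store` flags. This is an illustrative sketch only, since the real change is applied through the linked tanka-deployments MR:

```shell
# Illustrative flags only; 10901 serves recent data via the Prometheus
# Sidecar, 10903 serves historical blocks via the Store Gateway
thanos query \
  --store=monitoring-lb.ci.ci-runners.gitlab.net:10901 \
  --store=monitoring-lb.ci.ci-runners.gitlab.net:10903
```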
### Post-Change Steps - steps to take to verify the change

Estimated Time to Complete (mins) - 20 min

- [ ] Check if the new stores were added 👉 https://thanos.gitlab.net/stores
- [ ] Confirm that we can see GitLab Runner metrics with `stage="cny"` 👉 here
- [ ] Confirm that the number of example metric entries is the same for both `stage="main"` and `stage="cny"` 👉 here
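The last two checks can be run as PromQL queries in the Thanos Query UI. A sketch assuming `gitlab_runner_jobs` as the example metric (any runner-manager metric works):

```promql
# Metrics visible under the new stage label:
gitlab_runner_jobs{stage="cny"}

# Series-count difference between the old and new sources; this should be 0
# if both stacks scrape the same targets:
count(gitlab_runner_jobs{stage="main"}) - count(gitlab_runner_jobs{stage="cny"})
```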
## Rollback

### Rollback steps - steps to be taken in the event of a need to rollback this change

Estimated Time to Complete (mins) - 30 min

- [ ] Revert the Thanos Query integration 👉 LINK_TO_THE_MR
- [ ] Remove the silence
## Monitoring

### Key metrics to observe

- **Metric**: Metric Name
- **Location**: https://dashboards.gitlab.net/d/ci-runners-incident-runner-manager/ci-runners-incident-support-runner-manager?orgId=1&refresh=1m
- **What changes to this metric should prompt a rollback**: runner manager metrics becoming doubled (duplicate series reported from both Prometheus stacks)
## Summary of infrastructure changes

- [ ] Does this change introduce new compute instances?
- [ ] Does this change re-size any existing compute instances?
- [ ] Does this change introduce any additional usage of tooling like Elastic Search, CDNs, Cloudflare, etc?
## Changes checklist

- [ ] This issue has a criticality label (e.g. ~C1, ~C2, ~C3, ~C4) and a change-type label (e.g. ~"change::unscheduled", ~"change::scheduled") based on the Change Management Criticalities.
- [ ] This issue has the change technician as the assignee.
- [ ] Pre-Change, Change, Post-Change, and Rollback steps have been filled out and reviewed.
- [ ] This Change Issue is linked to the appropriate Issue and/or Epic.
- [ ] Necessary approvals have been completed based on the Change Management Workflow.
- [ ] Change has been tested in staging and results noted in a comment on this issue.
- [ ] A dry-run has been conducted and results noted in a comment on this issue.
- [ ] SRE on-call has been informed prior to change being rolled out. (In the #production channel, mention `@sre-oncall` and this issue and await their acknowledgement.)
- [ ] Release managers have been informed (if needed! cases include DB changes) prior to change being rolled out. (In the #production channel, mention `@release-managers` and this issue and await their acknowledgment.)
- [ ] There are currently no active incidents.