Cutover Thanos Datasource
Production Change
Change Summary
Cut over the default "Global" datasource in Grafana from the Chef deployment to the GKE deployment.
Change Details
- Services Impacted - Internal Dashboards
- Change Technician - @bjk-gitlab
- Change Criticality - C4
- Change Type - changescheduled
- Change Reviewer - @craigf
- Due Date - 2021-02-22 10:00 UTC
- Time tracking - 5 minutes
- Downtime Component - N/A
Detailed steps for the change
Change Steps - steps to take to execute the change
gke_target=http://thanos-query-frontend-internal.ops.gke.gitlab.net:9090
chef_target=http://i.ops-thanos-query-int.il7.us-east1.lb.gitlab-ops.internal
- Update "Global" to GKE target.
- Update "Global - No Downsample" to GKE target.
- Update "Frank" to Chef target.
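The three updates above could also be scripted against Grafana's HTTP API instead of being made in the UI. The following is a minimal sketch, not the procedure used for this change: it assumes each datasource record has already been fetched (for example via `GET /api/datasources/name/<name>`) and only builds the body one would send back with `PUT /api/datasources/<id>`. It makes no network calls. The `PLAN` mapping, the `build_update` helper, and the sample record are hypothetical names and data introduced for illustration.

```python
import copy

# Targets copied from the change steps above.
GKE_TARGET = "http://thanos-query-frontend-internal.ops.gke.gitlab.net:9090"
CHEF_TARGET = "http://i.ops-thanos-query-int.il7.us-east1.lb.gitlab-ops.internal"

# Datasource name -> URL it should point at after the change.
PLAN = {
    "Global": GKE_TARGET,
    "Global - No Downsample": GKE_TARGET,
    "Frank": CHEF_TARGET,
}


def build_update(datasource: dict, plan: dict = PLAN) -> dict:
    """Return a copy of a datasource record with its URL set per the plan.

    Datasources not named in the plan are returned unchanged, and the
    input record is never modified.
    """
    updated = copy.deepcopy(datasource)
    target = plan.get(datasource["name"])
    if target is not None:
        updated["url"] = target
    return updated


# Hypothetical pre-change record; real records carry more fields.
current = {"id": 1, "name": "Global", "type": "prometheus", "url": CHEF_TARGET}
payload = build_update(current)
print(payload["name"], "->", payload["url"])
```

Keeping the plan as data makes it easy to review the intended end state in the issue before anything is applied.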
Post-Change Steps - steps to take to verify the change
Estimated Time to Complete (mins) - Estimated Time to Complete in Minutes
- Verify that the home dashboard works (gitlab-triage).
- Verify that using "Frank" to access the Chef deployment works.
- Verify that other dashboards load correctly.
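Beyond loading the dashboards by hand, the verification above could be backed by a simple query check. Thanos Query serves the Prometheus HTTP API, so a successful query returns JSON with `"status": "success"` and a `data.result` list. The sketch below only inspects a response body that has already been retrieved; how it is fetched (directly or through Grafana) is left open. The `query_ok` helper and both sample bodies are hypothetical.

```python
import json


def query_ok(body: str) -> bool:
    """Return True if a Prometheus-style query response reports success
    and contains at least one result."""
    try:
        doc = json.loads(body)
    except json.JSONDecodeError:
        return False
    if not isinstance(doc, dict) or doc.get("status") != "success":
        return False
    data = doc.get("data")
    if not isinstance(data, dict):
        return False
    result = data.get("result") or []
    return len(result) > 0


# Made-up sample bodies in the Prometheus API response shape.
good = (
    '{"status": "success", "data": {"resultType": "vector", "result": '
    '[{"metric": {"__name__": "up"}, "value": [1613988000, "1"]}]}}'
)
bad = '{"status": "error", "errorType": "bad_data", "error": "example failure"}'

print(query_ok(good), query_ok(bad))
```

An empty result is treated as a failure here on the assumption that a healthy "Global" datasource should always return some series for a basic query such as `up`.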
Rollback
Rollback steps - steps to be taken in the event of a need to rollback this change
Estimated Time to Complete (mins) - Estimated Time to Complete in Minutes
- Roll back "Global" to Chef target.
- Roll back "Global - No Downsample" to Chef target.
- Roll back "Frank" to GKE target.
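The rollback steps are the exact inverse of the change steps: every datasource moved to GKE returns to Chef, and vice versa. A minimal sketch of deriving the rollback mapping from the change mapping, so the two cannot drift apart; `CHANGE_PLAN` and `invert_plan` are hypothetical names introduced for illustration.

```python
# Targets copied from the change steps.
GKE_TARGET = "http://thanos-query-frontend-internal.ops.gke.gitlab.net:9090"
CHEF_TARGET = "http://i.ops-thanos-query-int.il7.us-east1.lb.gitlab-ops.internal"

# Datasource name -> URL after the change.
CHANGE_PLAN = {
    "Global": GKE_TARGET,
    "Global - No Downsample": GKE_TARGET,
    "Frank": CHEF_TARGET,
}


def invert_plan(plan: dict, a: str, b: str) -> dict:
    """Swap targets a and b for every datasource in the plan.

    Raises KeyError if the plan contains a URL other than a or b,
    which is safer than silently leaving it in place.
    """
    swap = {a: b, b: a}
    return {name: swap[url] for name, url in plan.items()}


rollback_plan = invert_plan(CHANGE_PLAN, GKE_TARGET, CHEF_TARGET)
for name, url in rollback_plan.items():
    print(f"{name} -> {url}")
```

Applying `invert_plan` twice yields the original plan again, which is a quick sanity check that the change and rollback are symmetric.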
Monitoring
Key metrics to observe
- Metric: Monitoring SLOs
- Location: https://dashboards.gitlab.net/d/monitoring-main/monitoring-overview
- What changes to this metric should prompt a rollback: SLO below thresholds.
Summary of infrastructure changes
- Does this change introduce new compute instances?
- Does this change re-size any existing compute instances?
- Does this change introduce any additional usage of tooling like Elastic Search, CDNs, Cloudflare, etc?
Summary of the above
Changes checklist
- This issue has a criticality label (e.g. C1, C2, C3, C4) and a change-type label (e.g. changeunscheduled, changescheduled) based on the Change Management Criticalities.
- This issue has the change technician as the assignee.
- Pre-Change, Change, Post-Change, and Rollback steps have been filled out and reviewed.
- Necessary approvals have been completed based on the Change Management Workflow.
- Change has been tested in staging and results noted in a comment on this issue.
- A dry-run has been conducted and results noted in a comment on this issue.
- SRE on-call has been informed prior to change being rolled out. (In #production channel, mention @sre-oncall and this issue and await their acknowledgement.)
- There are currently no active incidents.
Edited by Ben Kochie