Rollout Thanos Store index memcache
Production Change
Change Summary
Roll out memcached enable flags to Thanos Store
https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/8986
Change Details
- Services Impacted - Thanos Store (Grafana performance)
- Change Technician - @bjk-gitlab, @igorwwwwwwwwwwwwwwwwwwww
- Change Criticality - C3
- Change Type - changeunscheduled, changescheduled
- Change Reviewer - @hphilipps
- Due Date - 2020-09-11
- Time tracking - Time, in minutes, needed to execute all change steps, including rollback
- Downtime Component - None
Detailed steps for the change
Pre-Change Steps - steps to be completed before execution of the change
Estimated Time to Complete (mins) - Estimated Time to Complete in Minutes
- Silence SLO alert (URL HERE)
Change Steps - steps to take to execute the change
Estimated Time to Complete (mins) - Estimated Time to Complete in Minutes
- Manually enable memcached on thanos-store: knife node attribute set "${thanos-store-node}" 'thanos-store.index-memcached.enable' true
- Converge chef client on store node
- Validate change
- Enable non-prod (https://ops.gitlab.net/gitlab-cookbooks/chef-repo/-/merge_requests/4219)
- Enable in gprd (https://ops.gitlab.net/gitlab-cookbooks/chef-repo/-/merge_requests/4220)
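The manual canary steps above (set the node attribute, then converge) could be wrapped roughly as follows. This is only a sketch: the `run` and `enable_index_memcached` helpers and the node name `thanos-store-01` are hypothetical, and converging through `knife ssh` is an assumption, since the change plan does not say how chef-client is run. Commands are printed rather than executed unless `DRY_RUN=0` is set.

```shell
#!/bin/sh
# Sketch only: canary-enable index memcached on one Thanos Store node.
# DRY_RUN defaults to 1, so commands are printed rather than executed.
set -eu

DRY_RUN="${DRY_RUN:-1}"

# Print the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    printf '+ %s\n' "$*"
  else
    "$@"
  fi
}

# $1: Chef node name of the Thanos Store host (hypothetical in the example).
enable_index_memcached() {
  # Set the node attribute exactly as written in the change step.
  run knife node attribute set "$1" 'thanos-store.index-memcached.enable' true
  # Converge the node; doing this via `knife ssh` is an assumption.
  run knife ssh "name:$1" 'sudo chef-client'
}

# Example: enable_index_memcached thanos-store-01
```

Keeping the dry-run default makes it easy to paste the printed commands into the issue as the dry-run record before executing anything.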
Post-Change Steps - steps to take to verify the change
Estimated Time to Complete (mins) - Estimated Time to Complete in Minutes
- Enable by default in gitlab-prometheus cookbook.
- Clean up the now-default chef-repo attributes.
Rollback
Rollback steps - steps to be taken in the event of a need to rollback this change
Estimated Time to Complete (mins) - Estimated Time to Complete in Minutes
- Delete the enabling node attribute: knife node attribute delete "${thanos-store-node}" 'thanos-store.index-memcached.enable'
- Roll back any chef-repo changes.
Monitoring
Key metrics to observe
Thanos/Grafana SLO metrics
- Metric: grpc_server_handling_seconds_bucket{grpc_service="thanos.Store",grpc_type="server_stream"}
  - Location: Per-Thanos Store gRPC performance
  - What changes to this metric should prompt a rollback: Sustained performance worse than 10 seconds
- Metric: thanos_memcached_operation_skipped_total
  - Location: Per-env memcached operations skipped
  - What changes to this metric should prompt a rollback: More than 1 QPS.
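To make the "More than 1 QPS" rollback threshold concrete, the sketch below derives a per-second rate from two samples of the skipped-operations counter and compares it with 1. The helper names `skipped_qps` and `should_rollback` are hypothetical, the sample values are made up, and counter resets are ignored; on a dashboard the same check would normally be expressed with the PromQL `rate()` function instead.

```shell
#!/bin/sh
# Sketch only: rate of a monotonically increasing counter between two samples.
# Arguments: PREVIOUS_COUNT CURRENT_COUNT INTERVAL_SECONDS
# Counter resets are not handled here.

# Print the average operations skipped per second over the interval.
skipped_qps() {
  awk -v a="$1" -v b="$2" -v dt="$3" 'BEGIN { printf "%.2f\n", (b - a) / dt }'
}

# Exit 0 (true) when the rate exceeds the 1 QPS rollback threshold.
should_rollback() {
  awk -v a="$1" -v b="$2" -v dt="$3" 'BEGIN { exit !((b - a) / dt > 1) }'
}

# Example with made-up samples taken 60 seconds apart:
#   skipped_qps 100 400 60       # 300 skipped in 60 s
#   should_rollback 100 400 60 && echo "rollback threshold exceeded"
```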
Summary of infrastructure changes
- Does this change introduce new compute instances?
- Does this change re-size any existing compute instances?
- Does this change introduce any additional usage of tooling like Elastic Search, CDNs, Cloudflare, etc?
Summary of the above
Changes checklist
- This issue has a criticality label (e.g. C1, C2, C3, C4) and a change-type label (e.g. changeunscheduled, changescheduled).
- This issue has the change technician as the assignee.
- Pre-Change, Change, Post-Change, and Rollback steps have been filled out and reviewed.
- Necessary approvals have been completed based on the Change Management Workflow.
- Change has been tested in staging and results noted in a comment on this issue.
- A dry-run has been conducted and results noted in a comment on this issue.
- SRE on-call has been informed prior to change being rolled out. (In #production channel, mention @sre-oncall and this issue.)
- There are currently no active incidents.
Edited by Ben Kochie