Rollout Thanos Store index memcache

Production Change

Change Summary

Roll out the memcached enable flags to Thanos Store.

https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/8986

Change Details

Services Impacted - Thanos Store (Grafana performance)
Change Technician - @bjk-gitlab, @igorwwwwwwwwwwwwwwwwwwww
Change Criticality - C3
Change Type - changeunscheduled, changescheduled
Change Reviewer - @hphilipps
Due Date - 2020-09-11
Time tracking - Time, in minutes, needed to execute all change steps, including rollback
Downtime Component - None

Detailed steps for the change

Pre-Change Steps - steps to be completed before execution of the change

Estimated Time to Complete (mins) - Estimated Time to Complete in Minutes

Silence SLO alert (URL HERE)

Change Steps - steps to take to execute the change

Estimated Time to Complete (mins) - Estimated Time to Complete in Minutes

Manually enable memcached on a thanos-store node: knife node attribute set "${thanos-store-node}" 'thanos-store.index-memcached.enable' true
Converge chef client on store node
Validate change
Enable in non-prod (https://ops.gitlab.net/gitlab-cookbooks/chef-repo/-/merge_requests/4219)
Enable in gprd (https://ops.gitlab.net/gitlab-cookbooks/chef-repo/-/merge_requests/4220)
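
As a rough illustration of the manual step above, the sketch below builds the two knife invocations quoted in this issue (the attribute set from the change steps and the attribute delete from the rollback steps) as argument lists, so they can be reviewed or printed before anyone runs them. The function names and the example node name are hypothetical and not part of the runbook. Only the knife subcommands and the attribute name come from this issue.

```python
import shlex
from typing import List

# Attribute name as written in the change and rollback steps of this issue.
ENABLE_ATTR = "thanos-store.index-memcached.enable"


def enable_command(node: str) -> List[str]:
    """Build the knife call that sets the enable attribute to true on one node."""
    return ["knife", "node", "attribute", "set", node, ENABLE_ATTR, "true"]


def rollback_command(node: str) -> List[str]:
    """Build the knife call that deletes the enable attribute again (rollback)."""
    return ["knife", "node", "attribute", "delete", node, ENABLE_ATTR]


if __name__ == "__main__":
    # Hypothetical node name, for illustration only.
    node = "thanos-store-01.example.internal"
    for argv in (enable_command(node), rollback_command(node)):
        # Print a shell-quoted form for review; nothing is executed here.
        print(shlex.join(argv))
```

Keeping the commands as argument lists avoids quoting mistakes around the node name, and printing them first gives the reviewer a chance to check the target node before the attribute is changed.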

Post-Change Steps - steps to take to verify the change

Estimated Time to Complete (mins) - Estimated Time to Complete in Minutes

Enable by default in the gitlab-prometheus cookbook.
Clean up the now-default chef-repo attributes.

Rollback

Rollback steps - steps to be taken in the event of a need to rollback this change

Estimated Time to Complete (mins) - Estimated Time to Complete in Minutes

Delete the enabling node attribute: knife node attribute delete "${thanos-store-node}" 'thanos-store.index-memcached.enable'
Roll back any chef-repo changes.

Monitoring

Key metrics to observe

Thanos/Grafana SLO metrics

Metric: grpc_server_handling_seconds_bucket{grpc_service="thanos.Store",grpc_type="server_stream"}
- Location: Per-Thanos Store gRPC performance
- What changes to this metric should prompt a rollback: Sustained performance worse than 10 seconds
Metric: thanos_memcached_operation_skipped_total
- Location: Per-env memcached operations skipped
- What changes to this metric should prompt a rollback: More than 1 QPS.
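
The two rollback criteria above can be written down as a small check. This is a minimal sketch, not the alerting actually used for this change. The thresholds (10 seconds, 1 QPS) come from this issue. Everything else is an assumption: the issue does not define "sustained" (taken here as the last five latency readings all exceeding the threshold), does not say which latency value is read from the histogram, and the counter rate is a plain first-to-last difference that ignores counter resets.

```python
from typing import Sequence, Tuple

# Thresholds taken from the rollback criteria in this issue.
MAX_LATENCY_SECONDS = 10.0
MAX_SKIPPED_QPS = 1.0

# (unix timestamp in seconds, counter value)
Sample = Tuple[float, float]


def counter_rate(samples: Sequence[Sample]) -> float:
    """Average per-second increase between the first and last sample.

    Simplification: counter resets are not handled.
    """
    if len(samples) < 2:
        return 0.0
    (t0, v0), (t1, v1) = samples[0], samples[-1]
    if t1 <= t0:
        return 0.0
    return max(v1 - v0, 0.0) / (t1 - t0)


def latency_sustained(latencies: Sequence[float], min_points: int = 5) -> bool:
    """True if the last `min_points` readings all exceed the latency threshold.

    Assumption: `min_points` consecutive readings is what counts as "sustained".
    """
    if len(latencies) < min_points:
        return False
    return all(value > MAX_LATENCY_SECONDS for value in latencies[-min_points:])


def should_roll_back(skipped: Sequence[Sample], latencies: Sequence[float]) -> bool:
    """Combine both criteria: either one being breached suggests a rollback."""
    return counter_rate(skipped) > MAX_SKIPPED_QPS or latency_sustained(latencies)
```

For example, a thanos_memcached_operation_skipped_total counter that grows from 100 to 220 over 60 seconds averages 2 skipped operations per second, which is above the 1 QPS limit.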

Summary of infrastructure changes

Does this change introduce new compute instances?
Does this change re-size any existing compute instances?
Does this change introduce any additional usage of tooling like Elastic Search, CDNs, Cloudflare, etc?

Summary of the above

Changes checklist

This issue has a criticality label (e.g. C1, C2, C3, C4) and a change-type label (e.g. changeunscheduled, changescheduled).
This issue has the change technician as the assignee.
Pre-Change, Change, Post-Change, and Rollback steps have been filled out and reviewed.
Necessary approvals have been completed based on the Change Management Workflow.
Change has been tested in staging and results noted in a comment on this issue.
A dry-run has been conducted and results noted in a comment on this issue.
SRE on-call has been informed prior to change being rolled out. (In #production channel, mention @sre-oncall and this issue.)
There are currently no active incidents.

Edited Sep 14, 2020 by Ben Kochie
