# 2021-09-17: Set redis-01 to have a higher failover priority

Production Change

## Change Summary
We want to give redis-01 a higher failover priority as we've recently done some maintenance on this node, giving it more memory.
See also: https://gitlab.com/gitlab-com/gl-infra/production/-/issues/5547#note_680354939.
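In Redis, a replica's `replica-priority` controls which node Sentinel promotes on failover: lower non-zero values are preferred, and `0` means the replica is never promoted. A minimal sketch of that selection rule, with illustrative priorities (not the live values):

```shell
#!/usr/bin/env bash
# Simplified Sentinel candidate selection (a sketch, not the actual Sentinel
# source): among healthy replicas, the lowest non-zero replica-priority wins;
# 0 excludes a replica from promotion. Priorities below are illustrative.
declare -A prio=( [redis-01]=200 [redis-02]=100 [redis-03]=100 )
preferred=$(
  for node in "${!prio[@]}"; do
    # skip replicas with priority 0 (never promoted)
    [ "${prio[$node]}" -ne 0 ] && echo "${prio[$node]} $node"
  done | sort -n | head -1 | awk '{print $2}'
)
echo "preferred candidate: $preferred"
```

Under this rule, moving redis-01 from `0` to `200` makes it promotable again, while still ranking it behind replicas at the default priority of `100`.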
## Change Details

- Services Impacted - ~"Service::Redis"
- Change Technician - @igorwwwwwwwwwwwwwwwwwwww
- Change Reviewer - @jarv
- Time tracking - 5m
- Downtime Component - none
## Detailed steps for the change

### Pre-Change Steps - steps to be completed before execution of the change

Estimated Time to Complete (mins) - 1m

- [ ] Set label ~"change::in-progress" on this issue
- [ ] Get current `replica-priority` values:

  ```shell
  export redis_cli='REDISCLI_AUTH="$(sudo grep -m1 ^requirepass /var/opt/gitlab/redis/redis.conf|cut -d" " -f2|tr -d \")" /opt/gitlab/embedded/bin/redis-cli'
  parallel -j1 --tag 'ssh redis-{}-db-gprd.c.gitlab-production.internal "$redis_cli config get replica-priority"' ::: 01 02 03
  ```
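The `parallel --tag` invocation above prefixes each output line with the host suffix, and `config get` emits the parameter name followed by its value. A hypothetical helper (not part of the change itself) showing how those tagged lines can be reduced to `host value` pairs, using illustrative sample output:

```shell
#!/usr/bin/env bash
# Hypothetical parser for the tagged output of the replica-priority check:
# each host contributes two lines ("replica-priority", then the value);
# keep only "<host> <value>" pairs.
parse_priorities() {
  awk '$2 != "replica-priority" { print $1, $2 }'
}

# Sample input is illustrative, not captured from production.
sample=$'01\treplica-priority\n01\t200\n02\treplica-priority\n02\t100\n03\treplica-priority\n03\t100'
result=$(printf '%s\n' "$sample" | parse_priorities)
echo "$result"
```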
### Change Steps - steps to take to execute the change

Estimated Time to Complete (mins) - 1m

- [ ] Set `replica-priority` on `redis-01` to `200`:

  ```shell
  ssh redis-01-db-gprd.c.gitlab-production.internal "$redis_cli config set replica-priority 200"
  ```
### Post-Change Steps - steps to take to verify the change

Estimated Time to Complete (mins) - 1m

- [ ] Get current `replica-priority` values:

  ```shell
  parallel -j1 --tag 'ssh redis-{}-db-gprd.c.gitlab-production.internal "$redis_cli config get replica-priority"' ::: 01 02 03
  ```
## Rollback

### Rollback steps - steps to be taken in the event of a need to rollback this change

Estimated Time to Complete (mins) - 1m

- [ ] Revert `replica-priority` on `redis-01` back to `0` (avoid failing over to this node):

  ```shell
  ssh redis-01-db-gprd.c.gitlab-production.internal "$redis_cli config set replica-priority 0"
  ```
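The rollback value of `0` has a special meaning: it removes redis-01 from Sentinel's promotion pool entirely, rather than merely deprioritizing it. A sketch of that effect, with illustrative priorities:

```shell
#!/usr/bin/env bash
# Rollback semantics sketch: with replica-priority 0, redis-01 is never
# considered for promotion, leaving only the other replicas as candidates.
# Priorities below are illustrative, not the live values.
declare -A prio=( [redis-01]=0 [redis-02]=100 [redis-03]=100 )
promotable=$(
  for node in "${!prio[@]}"; do
    # priority 0 means "never promote this replica"
    [ "${prio[$node]}" -ne 0 ] && echo "$node"
  done | sort | xargs
)
echo "promotable after rollback: $promotable"
```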
## Monitoring

### Key metrics to observe

- Metric: Redis SLOs
- Location: https://dashboards.gitlab.net/d/redis-main/redis-overview?orgId=1&from=now-1h&to=now
- What changes to this metric should prompt a rollback: significant increases in latency, error rates, or saturation.
## Summary of infrastructure changes

- [ ] Does this change introduce new compute instances?
- [ ] Does this change re-size any existing compute instances?
- [ ] Does this change introduce any additional usage of tooling like Elastic Search, CDNs, Cloudflare, etc?

Summary of the above:
## Changes checklist

- [ ] This issue has a criticality label (e.g. ~C1, ~C2, ~C3, ~C4) and a change-type label (e.g. ~"change::unscheduled", ~"change::scheduled") based on the Change Management Criticalities.
- [ ] This issue has the change technician as the assignee.
- [ ] Pre-Change, Change, Post-Change, and Rollback steps have been filled out and reviewed.
- [ ] This Change Issue is linked to the appropriate Issue and/or Epic.
- [ ] Necessary approvals have been completed based on the Change Management Workflow.
- [ ] Change has been tested in staging and results noted in a comment on this issue.
- [ ] A dry-run has been conducted and results noted in a comment on this issue.
- [ ] SRE on-call has been informed prior to change being rolled out. (In the #production channel, mention @sre-oncall and this issue and await their acknowledgement.)
- [ ] Release managers have been informed (if needed; cases include DB changes) prior to change being rolled out. (In the #production channel, mention @release-managers and this issue and await their acknowledgment.)
- [ ] There are currently no active incidents.