2022-11-02: Migrate redis-ratelimiting from VMs to GKE
Production Change
Change Summary
Migrate the redis-ratelimiting instance from VMs to GKE.
Rollout to gstg: scalability#1876 (closed)
Change Details
- Services Impacted - ~Service::Redis ~Service::Web ~Service::API ~Service::Git
- Change Technician - @stejacks-gitlab
- Change Reviewer - @reprazent
- Time tracking - 240 min
Detailed steps for the change
Change Steps - steps to take to execute the change
Day 1 (2022-11-02)
Estimated Time to Complete (mins) - 60 min
- Set label ~change::in-progress: /label ~change::in-progress
- Merge https://gitlab.com/gitlab-com/gl-infra/chef-repo/-/merge_requests/2436 to migrate VMs to use this new Redis instance.
- When the change is applied, ssh into a console VM and confirm that the Redis instance is reachable from a Rails console:
  - Ensure the right config is used: `Gitlab::Redis::RateLimiting.params[:sentinels]`
  - Ensure Redis is reachable: `Gitlab::Redis::RateLimiting.with { _1.ping }`
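The reachability check above can be wrapped in a small guard so the step fails loudly instead of silently. This is a sketch: `check_redis_ping` is our own helper name, and its argument stands in for the reply returned by `Gitlab::Redis::RateLimiting.with { _1.ping }` in the console.

```shell
# Sketch: treat anything other than "PONG" as a failed reachability check.
# check_redis_ping is a hypothetical helper; $1 is the ping reply observed
# in the Rails console.
check_redis_ping() {
  if [ "$1" = "PONG" ]; then
    echo "redis reachable"
  else
    echo "redis NOT reachable: got '$1'" >&2
    return 1
  fi
}

check_redis_ping "PONG"   # → redis reachable
```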
- Merge gitlab-com/gl-infra/k8s-workloads/gitlab-com!2232 (merged) to migrate the cny stage to use this new Redis instance.
- Validate that part of the traffic is going to the new Redis instance on the service dashboard. The primary in GKE should show an increased ops rate: https://dashboards.gitlab.net/d/redis-ratelimiting-main/redis-ratelimiting-overview?orgId=1&viewPanel=22
- Validate that there are no increased error ratios or apdex drops on the dependent services' dashboards:
  - git: https://dashboards.gitlab.net/d/git-main/git-overview?orgId=1&var-PROMETHEUS_DS=Global&var-environment=gprd&var-stage=cny
  - web: https://dashboards.gitlab.net/d/web-main/web-overview?orgId=1&var-PROMETHEUS_DS=Global&var-environment=gprd&var-stage=cny
  - api: https://dashboards.gitlab.net/d/api-main/api-overview?orgId=1&var-PROMETHEUS_DS=Global&var-environment=gprd&var-stage=cny
- Trigger a failover on Redis while only cny traffic is impacted:
  - Validate which node is the primary and set it in the command below.
  - Start a Redis console:

    ```shell
    kubectl exec -it -n redis -c sentinel redis-ratelimiting-node-2 -- redis-cli -p 26379
    ```

  - Trigger a failover:

    ```shell
    sentinel failover mymaster
    ```
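Identifying the current primary before the failover can be scripted. A sketch, assuming Sentinel's `SENTINEL get-master-addr-by-name mymaster` reply of two lines (IP, then port); `primary_addr` is our own helper name, and the sample address below is illustrative.

```shell
# Sketch: join Sentinel's two-line get-master-addr-by-name reply into ip:port.
# primary_addr is a hypothetical helper reading the reply on stdin.
primary_addr() {
  awk 'NR==1{ip=$1} NR==2{port=$1} END{print ip ":" port}'
}

# In production the input would come from the sentinel console, e.g.:
#   kubectl exec -n redis -c sentinel redis-ratelimiting-node-2 -- \
#     redis-cli -p 26379 sentinel get-master-addr-by-name mymaster
printf '10.0.0.5\n6379\n' | primary_addr   # → 10.0.0.5:6379
```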
- Set label ~change::scheduled: /label ~change::scheduled, and wait for 24 hours.
Day 2 (2022-11-03)
Estimated Time to Complete (mins) - 90 min
- Set label ~change::in-progress: /label ~change::in-progress
- Merge gitlab-com/gl-infra/k8s-workloads/gitlab-com!2236 (merged) to migrate us-east1-b to use this new Redis instance.
- Validate that part of the traffic is going to the new Redis instance on the service dashboard. The primary in GKE should show an increased ops rate: https://dashboards.gitlab.net/d/redis-ratelimiting-main/redis-ratelimiting-overview?orgId=1&viewPanel=22
- Trigger a failover on Redis while only traffic to one zone is impacted:
  - Validate which node is the primary and set it in the command below.
  - Start a Redis console:

    ```shell
    kubectl exec -it -n redis -c sentinel redis-ratelimiting-node-2 -- redis-cli -p 26379
    ```

  - Trigger a failover:

    ```shell
    sentinel failover mymaster
    ```
- Wait 30 minutes.
- Merge gitlab-com/gl-infra/k8s-workloads/gitlab-com!2257 (merged) to make us-east1-c use the new Redis instance.
- Merge gitlab-com/gl-infra/k8s-workloads/gitlab-com!2233 (closed) to make the entire fleet use the new Redis instance.
- Validate that part of the traffic is going to the new Redis instance on the service dashboard. The primary in GKE should show an increased ops rate: https://dashboards.gitlab.net/d/redis-ratelimiting-main/redis-ratelimiting-overview?orgId=1&viewPanel=22
- Validate that there are no changes in error ratio or apdex on the triage dashboard: https://dashboards.gitlab.net/d/general-playlist-frontend-rails/general-triage-playlist-rails-services?orgId=1
- Trigger a failover on Redis while all traffic is going to the new Redis instance:
  - Validate which node is the primary and set it in the command below.
  - Start a Redis console:

    ```shell
    kubectl exec -it -n redis -c sentinel redis-ratelimiting-node-2 -- redis-cli -p 26379
    ```

  - Trigger a failover:

    ```shell
    sentinel failover mymaster
    ```
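After each failover, the promoted node should report `role:master` in its `INFO replication` output. A sketch of that check, assuming redis-cli's CRLF-terminated bulk reply; `node_role` is our own helper name.

```shell
# Sketch: extract the role field from `INFO replication` output.
# node_role is a hypothetical helper; redis-cli replies use CRLF line
# endings, hence the tr.
node_role() {
  grep '^role:' | cut -d: -f2 | tr -d '\r\n'
}

# In production, something like:
#   kubectl exec -n redis redis-ratelimiting-node-2 -- \
#     redis-cli info replication | node_role
printf 'role:master\r\nconnected_slaves:2\r\n' | node_role   # → master
```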
- Set label ~change::complete: /label ~change::complete
Rollback
Rollback steps - steps to be taken in the event of a need to rollback this change
Estimated Time to Complete (mins) - 30
- Revert https://gitlab.com/gitlab-com/gl-infra/chef-repo/-/merge_requests/2436 => https://gitlab.com/gitlab-com/gl-infra/chef-repo/-/merge_requests/2464
- Revert in K8S: gitlab-com/gl-infra/k8s-workloads/gitlab-com!2258 (diffs)
- Set label ~change::aborted: /label ~change::aborted
Monitoring
Key metrics to observe
- Metric: Ratelimiting health
- Location: https://dashboards.gitlab.net/d/redis-ratelimiting-main/redis-ratelimiting-overview?orgId=1
- What changes to this metric should prompt a rollback: Degradation violating SLOs.
- Metric: Redis per container resources
- Memory: https://dashboards.gitlab.net/d/alerts-sat_kube_container_memory/alerts-kube_container_memory-saturation-detail?orgId=1&var-PROMETHEUS_DS=Global&var-environment=gprd&var-type=redis-ratelimiting&var-stage=main
- CPU: https://dashboards.gitlab.net/d/alerts-sat_kube_container_cpu/alerts-kube_container_cpu-saturation-detail?orgId=1&var-PROMETHEUS_DS=Global&var-environment=gprd&var-type=redis-ratelimiting&var-stage=main
Change Reviewer checklist
- Check if the following applies:
  - The scheduled day and time of execution of the change is appropriate.
  - The change plan is technically accurate.
  - The change plan includes estimated timing values based on previous testing.
  - The change plan includes a viable rollback plan.
  - The specified metrics/monitoring dashboards provide sufficient visibility for the change.
- Check if the following applies:
  - The complexity of the plan is appropriate for the corresponding risk of the change (i.e. the plan contains clear details).
  - The change plan includes success measures for all steps/milestones during the execution.
  - The change adequately minimizes risk within the environment/service.
  - The performance implications of executing the change are well-understood and documented.
  - The specified metrics/monitoring dashboards provide sufficient visibility for the change.
    - If not, is it possible (or necessary) to make changes to observability platforms for added visibility?
  - The change has a primary and secondary SRE with knowledge of the details available during the change window.
  - The labels ~"blocks deployments" and/or ~"blocks feature-flags" are applied as necessary.
Change Technician checklist
- Check if all items below are complete:
  - The change plan is technically accurate.
  - This Change Issue is linked to the appropriate Issue and/or Epic.
  - Change has been tested in staging and results noted in a comment on this issue.
  - A dry-run has been conducted and results noted in a comment on this issue.
  - The change execution window respects the Production Change Lock periods.
  - For C1 and C2 change issues, the change event is added to the GitLab Production calendar.
  - For C1 and C2 change issues, the SRE on-call has been informed prior to the change being rolled out. (In the #production channel, mention @sre-oncall and this issue and await their acknowledgement.)
  - For C1 and C2 change issues, the SRE on-call provided approval with the ~eoc_approved label on the issue.
  - For C1 and C2 change issues, the Infrastructure Manager provided approval with the ~manager_approved label on the issue.
  - Release managers have been informed prior to the change being rolled out, if needed (cases include DB changes). (In the #production channel, mention @release-managers and this issue and await their acknowledgment.)
  - There are currently no active incidents that are ~severity::1 or ~severity::2.
  - If the change involves doing maintenance on a database host, an appropriate silence targeting the host(s) should be added for the duration of the change.
Edited by Stephanie Jackson