2022-11-02: Migrate redis-ratelimiting from VMs to GKE
Production Change
Change Summary
Migrate the redis-ratelimiting instance from VMs to GKE.
Rollout to gstg: scalability#1876 (closed)
Change Details
- Services Impacted - ~Service::Redis ~Service::Web ~Service::API ~Service::Git
- Change Technician - @stejacks-gitlab
- Change Reviewer - @reprazent
- Time tracking - 240 min
Detailed steps for the change
Change Steps - steps to take to execute the change
Day 1 (2022-11-02)
Estimated Time to Complete (mins) - 60 min
- Set label ~change::in-progress: /label ~change::in-progress
- Merge https://gitlab.com/gitlab-com/gl-infra/chef-repo/-/merge_requests/2436 to migrate VMs to use this new Redis instance.
- When the change is applied, ssh into a console VM and confirm that the Redis instance is reachable from a Rails console:
  - Ensure the right config is used: `Gitlab::Redis::RateLimiting.params[:sentinels]`
  - Ensure Redis is reachable: `Gitlab::Redis::RateLimiting.with { _1.ping }`
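The reachability check above can be wrapped in a small guard so the step fails loudly instead of silently. This is a sketch: `check_redis_ping` is our own helper name, and its argument stands in for the reply returned by `Gitlab::Redis::RateLimiting.with { _1.ping }` in the console.

```shell
# Sketch: treat anything other than "PONG" as a failed reachability check.
# check_redis_ping is a hypothetical helper; $1 is the ping reply observed
# in the Rails console.
check_redis_ping() {
  if [ "$1" = "PONG" ]; then
    echo "redis reachable"
  else
    echo "redis NOT reachable: got '$1'" >&2
    return 1
  fi
}

check_redis_ping "PONG"   # → redis reachable
```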
- Merge gitlab-com/gl-infra/k8s-workloads/gitlab-com!2232 (merged) to migrate the cny stage to use this new Redis instance.
- Validate that part of the traffic is going to the new Redis instance on the service dashboard. The primary in GKE should show an increased ops rate: https://dashboards.gitlab.net/d/redis-ratelimiting-main/redis-ratelimiting-overview?orgId=1&viewPanel=22
- Validate that there are no increased error ratios or apdex drops on the dependent services' dashboards:
  - git: https://dashboards.gitlab.net/d/git-main/git-overview?orgId=1&var-PROMETHEUS_DS=Global&var-environment=gprd&var-stage=cny
  - web: https://dashboards.gitlab.net/d/web-main/web-overview?orgId=1&var-PROMETHEUS_DS=Global&var-environment=gprd&var-stage=cny
  - api: https://dashboards.gitlab.net/d/api-main/api-overview?orgId=1&var-PROMETHEUS_DS=Global&var-environment=gprd&var-stage=cny
- Trigger a failover on Redis while only cny traffic is impacted:
  - Validate which node is the primary and set it in the command below.
  - Start a Redis console:

    ```shell
    kubectl exec -it -n redis -c sentinel redis-ratelimiting-node-2 -- redis-cli -p 26379
    ```

  - Trigger a failover:

    ```shell
    sentinel failover mymaster
    ```
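Identifying the current primary before the failover can be scripted. A sketch, assuming Sentinel's `SENTINEL get-master-addr-by-name mymaster` reply of two lines (IP, then port); `primary_addr` is our own helper name, and the sample address below is illustrative.

```shell
# Sketch: join Sentinel's two-line get-master-addr-by-name reply into ip:port.
# primary_addr is a hypothetical helper reading the reply on stdin.
primary_addr() {
  awk 'NR==1{ip=$1} NR==2{port=$1} END{print ip ":" port}'
}

# In production the input would come from the sentinel console, e.g.:
#   kubectl exec -n redis -c sentinel redis-ratelimiting-node-2 -- \
#     redis-cli -p 26379 sentinel get-master-addr-by-name mymaster
printf '10.0.0.5\n6379\n' | primary_addr   # → 10.0.0.5:6379
```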
- Set label ~change::scheduled: /label ~change::scheduled, and wait for 24 hours.
Day 2 (2022-11-03)
Estimated Time to Complete (mins) - 90 min
- Set label ~change::in-progress: /label ~change::in-progress
- Merge gitlab-com/gl-infra/k8s-workloads/gitlab-com!2236 (merged) to migrate us-east1-b to use this new Redis instance.
- Validate that part of the traffic is going to the new Redis instance on the service dashboard. The primary in GKE should show an increased ops rate: https://dashboards.gitlab.net/d/redis-ratelimiting-main/redis-ratelimiting-overview?orgId=1&viewPanel=22
- Trigger a failover on Redis while only traffic to one zone is impacted:
  - Validate which node is the primary and set it in the command below.
  - Start a Redis console:

    ```shell
    kubectl exec -it -n redis -c sentinel redis-ratelimiting-node-2 -- redis-cli -p 26379
    ```

  - Trigger a failover:

    ```shell
    sentinel failover mymaster
    ```
- Wait 30 minutes.
- Merge gitlab-com/gl-infra/k8s-workloads/gitlab-com!2257 (merged) to make us-east1-c use the new Redis instance.
- Merge gitlab-com/gl-infra/k8s-workloads/gitlab-com!2233 (closed) to make the entire fleet use the new Redis instance.
- Validate that part of the traffic is going to the new Redis instance on the service dashboard. The primary in GKE should show an increased ops rate: https://dashboards.gitlab.net/d/redis-ratelimiting-main/redis-ratelimiting-overview?orgId=1&viewPanel=22
- Validate that there are no changes in error ratio or apdex on the triage dashboard: https://dashboards.gitlab.net/d/general-playlist-frontend-rails/general-triage-playlist-rails-services?orgId=1
- Trigger a failover on Redis while all traffic is going to the new Redis instance:
  - Validate which node is the primary and set it in the command below.
  - Start a Redis console:

    ```shell
    kubectl exec -it -n redis -c sentinel redis-ratelimiting-node-2 -- redis-cli -p 26379
    ```

  - Trigger a failover:

    ```shell
    sentinel failover mymaster
    ```
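After each failover, the promoted node should report `role:master` in its `INFO replication` output. A sketch of that check, assuming redis-cli's CRLF-terminated bulk reply; `node_role` is our own helper name.

```shell
# Sketch: extract the role field from `INFO replication` output.
# node_role is a hypothetical helper; redis-cli replies use CRLF line
# endings, hence the tr.
node_role() {
  grep '^role:' | cut -d: -f2 | tr -d '\r\n'
}

# In production, something like:
#   kubectl exec -n redis redis-ratelimiting-node-2 -- \
#     redis-cli info replication | node_role
printf 'role:master\r\nconnected_slaves:2\r\n' | node_role   # → master
```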
- Set label ~change::complete: /label ~change::complete
Rollback
Rollback steps - steps to be taken in the event of a need to rollback this change
Estimated Time to Complete (mins) - 30
- Revert https://gitlab.com/gitlab-com/gl-infra/chef-repo/-/merge_requests/2436 => https://gitlab.com/gitlab-com/gl-infra/chef-repo/-/merge_requests/2464
- Revert in K8S: gitlab-com/gl-infra/k8s-workloads/gitlab-com!2258 (diffs)
- Set label ~change::aborted: /label ~change::aborted
Monitoring
Key metrics to observe
- Metric: Ratelimiting health
- Location: https://dashboards.gitlab.net/d/redis-ratelimiting-main/redis-ratelimiting-overview?orgId=1
- What changes to this metric should prompt a rollback: Degradation violating SLOs.
- Metric: Redis per container resources
- Memory: https://dashboards.gitlab.net/d/alerts-sat_kube_container_memory/alerts-kube_container_memory-saturation-detail?orgId=1&var-PROMETHEUS_DS=Global&var-environment=gprd&var-type=redis-ratelimiting&var-stage=main
- CPU: https://dashboards.gitlab.net/d/alerts-sat_kube_container_cpu/alerts-kube_container_cpu-saturation-detail?orgId=1&var-PROMETHEUS_DS=Global&var-environment=gprd&var-type=redis-ratelimiting&var-stage=main
Change Reviewer checklist
- Check if the following applies:
  - The scheduled day and time of execution of the change is appropriate.
  - The change plan is technically accurate.
  - The change plan includes estimated timing values based on previous testing.
  - The change plan includes a viable rollback plan.
  - The specified metrics/monitoring dashboards provide sufficient visibility for the change.
- Check if the following applies:
  - The complexity of the plan is appropriate for the corresponding risk of the change (i.e. the plan contains clear details).
  - The change plan includes success measures for all steps/milestones during the execution.
  - The change adequately minimizes risk within the environment/service.
  - The performance implications of executing the change are well-understood and documented.
  - The specified metrics/monitoring dashboards provide sufficient visibility for the change.
    - If not, is it possible (or necessary) to make changes to observability platforms for added visibility?
  - The change has a primary and secondary SRE with knowledge of the details available during the change window.
  - The labels ~"blocks deployments" and/or ~"blocks feature-flags" are applied as necessary.
Change Technician checklist
- Check if all items below are complete:
  - The change plan is technically accurate.
  - This Change Issue is linked to the appropriate Issue and/or Epic.
  - Change has been tested in staging and results noted in a comment on this issue.
  - A dry-run has been conducted and results noted in a comment on this issue.
  - The change execution window respects the Production Change Lock periods.
  - For C1 and C2 change issues, the change event is added to the GitLab Production calendar.
  - For C1 and C2 change issues, the SRE on-call has been informed prior to the change being rolled out. (In the #production channel, mention @sre-oncall and this issue and await their acknowledgement.)
  - For C1 and C2 change issues, the SRE on-call provided approval with the ~eoc_approved label on the issue.
  - For C1 and C2 change issues, the Infrastructure Manager provided approval with the ~manager_approved label on the issue.
  - Release managers have been informed prior to the change being rolled out, if needed (cases include DB changes). (In the #production channel, mention @release-managers and this issue and await their acknowledgment.)
  - There are currently no active incidents that are ~severity::1 or ~severity::2.
  - If the change involves doing maintenance on a database host, an appropriate silence targeting the host(s) should be added for the duration of the change.
Edited by Stephanie Jackson