# [GSTG] Migrate repository cache from redis-repository-cache to Redis Cluster

## Production Change

### Change Summary

Migrate Redis keys from ServiceRedisRepositoryCache to the new Redis Cluster using feature flags.

### Change Details

- Services Impacted - ServiceRedisRepositoryCache
- Change Technician - @schin1
- Change Reviewer - @fshabir
- Time tracking - ~1 day -- there is a ~1 day wait time for TTLs
- Downtime Component - NA
### Set Maintenance Mode in GitLab

If your change involves scheduled maintenance, add a step to set and unset maintenance mode per our runbooks. This will make sure SLA calculations adjust for the maintenance period.
### Detailed steps for the change

#### Change Steps - steps to take to execute the change

Estimated Time to Complete (90 mins + 1 day of wait)
- [ ] Set label ~change::in-progress: `/label ~change::in-progress`
- [ ] Merge the chef MR (https://gitlab.com/gitlab-com/gl-infra/chef-repo/-/merge_requests/4470) to update gstg nodes with the ServiceRedisClusterRepoCache details, following https://gitlab.com/gitlab-com/runbooks/-/blob/master/docs/redis/provisioning-redis-cluster.md#1-configure-console-instances. Note that the password needs to be added to Vault just before the MR is merged.
  - Set `."omnibus-gitlab".gitlab_rb."gitlab-rails".redis_yml_override.cluster_repository_cache.password` to the `rails` user's password.
- [ ] Merge the k8s-workloads MRs:
  - Set external secrets: gitlab-com/gl-infra/k8s-workloads/gitlab-com!3422 (merged)
  - Update the monolith configuration: gitlab-com/gl-infra/k8s-workloads/gitlab-com!3423 (merged)
- [ ] Wait for all deployments to complete and watch for any abnormalities (apdex or error spikes).
- [ ] Enable the use_primary_and_secondary_stores_for_repository_cache feature flag to start dual-write: `/chatops run feature set use_primary_and_secondary_stores_for_repository_cache true --staging`
- [ ] Wait >= 1 day for the keys' TTLs to lapse.
- [ ] Run the external validation script to migrate Redis sets, which have a 2-week TTL. The external script-based migration can start immediately while the dual-write is ongoing.
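The dual-write behaviour enabled by the first feature flag can be pictured with a minimal sketch. The class and method names below are illustrative stand-ins, not GitLab's actual `Gitlab::Redis::MultiStore` implementation: writes fan out to both stores, while reads are served only from whichever store is the current default.

```ruby
# Hypothetical sketch of a dual-write store wrapper; plain Hashes stand
# in for the two Redis connections. The real behaviour lives in
# Gitlab::Redis::MultiStore and is controlled by the two feature flags.
class DualStore
  def initialize(primary, secondary, use_primary_as_default: false)
    @primary = primary      # new Redis Cluster
    @secondary = secondary  # old redis-repository-cache
    @use_primary_as_default = use_primary_as_default
  end

  # While dual-write is enabled, every write lands in both stores, so
  # any key written (or rewritten) after the flag flip exists in both.
  def set(key, value)
    @primary[key] = value
    @secondary[key] = value
  end

  # Reads come from the default store only; flipping
  # use_primary_store_as_default cuts reads over to the cluster.
  def get(key)
    default_store[key]
  end

  private

  def default_store
    @use_primary_as_default ? @primary : @secondary
  end
end

old_store = {}
new_store = {}
store = DualStore.new(new_store, old_store)
store.set("cache:foo", "bar")
# Both stores now hold the key. After >= 1 day of dual-write, every
# short-TTL key has either expired or been rewritten into both stores;
# only the long-TTL sets still need the external migration script.
```

This is why the >= 1 day wait is in the plan: it lets the short-TTL keys converge on their own, leaving only the 2-week-TTL sets for the script.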
- [ ] Set up the environment on a console node, using #15875 (comment 1440303927) as a reference. To set up the folders:

  ```shell
  # on the local machine, in the runbooks project
  tar cvf migrate-script.tar renovate.json scripts/redis_diff.rb Gemfile scripts/redis_key_compare.rb
  scp migrate-script.tar console-01-sv-gprd.c.gitlab-production.internal:/home/<username>

  # on the console node
  tar xvf migrate-script.tar
  bundle install # gem install if the node does not have bundle
  ```
  Both redis.yml and redis-cluster-repo-cache.yml should sit at the same level as the scripts folder.

  redis.yml should be symlinked as source.yml. We can use replica nodes since we are only writing to the destination; the source is read-only.

  ```yaml
  # in redis.yml
  url: redis://:$REDIS_REDACTED@redis-repository-cache-01-db-gstg.c.gitlab-staging-1.internal:6379
  ```
  redis-cluster-repo-cache.yml should be symlinked as destination.yml:

  ```yaml
  # in redis-cluster-repo-cache.yml
  nodes:
    - host: redis-cluster-repo-cache-shard-01-01-db-gstg.c.gitlab-staging-1.internal
      port: 6379
    - host: redis-cluster-repo-cache-shard-01-02-db-gstg.c.gitlab-staging-1.internal
      port: 6379
    - host: redis-cluster-repo-cache-shard-01-03-db-gstg.c.gitlab-staging-1.internal
      port: 6379
    - host: redis-cluster-repo-cache-shard-02-01-db-gstg.c.gitlab-staging-1.internal
      port: 6379
    - host: redis-cluster-repo-cache-shard-02-02-db-gstg.c.gitlab-staging-1.internal
      port: 6379
    - host: redis-cluster-repo-cache-shard-02-03-db-gstg.c.gitlab-staging-1.internal
      port: 6379
    - host: redis-cluster-repo-cache-shard-03-01-db-gstg.c.gitlab-staging-1.internal
      port: 6379
    - host: redis-cluster-repo-cache-shard-03-02-db-gstg.c.gitlab-staging-1.internal
      port: 6379
    - host: redis-cluster-repo-cache-shard-03-03-db-gstg.c.gitlab-staging-1.internal
      port: 6379
  password: REDIS_REDACTED
  username: rails
  ```
  The passwords can be found in:

  - ServiceRedisRepositoryCache: `gitlab_rails['redis_repository_cache_instance'] = "redis://:@gstg-redis-repository-cache"`
  - ServiceRedisClusterRepoCache: created in Vault as part of https://gitlab.com/gitlab-com/runbooks/-/blob/master/docs/redis/provisioning-redis-cluster.md#2-configure-gitlab-rails; check for cluster_repository_cache.

  Symlink the files as follows:

  ```shell
  ln -s redis.yml source.yml
  ln -s redis-cluster-repo-cache.yml destination.yml
  ```
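As a quick sanity check of destination.yml before running the migration, the node list can be loaded from the YAML and turned into the `host:port` strings a Redis Cluster client would be given. This is only a sketch of that check; how redis_diff.rb actually consumes the file may differ.

```ruby
require "yaml"

# Sketch: parse a destination.yml-shaped document (inlined here for
# illustration; in practice you would read the symlinked file) and
# build the host:port list for a cluster client.
yaml = <<~YAML
  nodes:
    - host: redis-cluster-repo-cache-shard-01-01-db-gstg.c.gitlab-staging-1.internal
      port: 6379
    - host: redis-cluster-repo-cache-shard-02-01-db-gstg.c.gitlab-staging-1.internal
      port: 6379
  password: REDIS_REDACTED
  username: rails
YAML

config = YAML.safe_load(yaml)
nodes = config["nodes"].map { |n| "#{n['host']}:#{n['port']}" }
# A malformed YAML or a missing nodes key fails loudly here,
# before any migration traffic is sent.
```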
- [ ] Run `bundle exec ruby redis_diff.rb --migrate --rate=1000 --batch=300 --pool_size=30 --type=set | tee migrate-$(date +"%FT%T").out` to compare and migrate set data only.
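Conceptually, the per-key work this command performs amounts to copying a set's members and carrying over its remaining TTL from source to destination. The sketch below uses in-memory hashes as stand-ins for the two Redis clients; the real redis_diff.rb adds SCAN-based iteration, batching, rate limiting, and comparison on top.

```ruby
# Simplified sketch of migrating one Redis set: copy the members and
# preserve the remaining TTL. Hashes of { members:, ttl: } entries
# stand in for the source/destination Redis clients.
def migrate_set(source, destination, key)
  entry = source[key]
  return if entry.nil? || entry[:members].empty?

  destination[key] = { members: entry[:members].dup }

  ttl = entry[:ttl]
  # Only carry over a positive TTL (Redis reports -1 for "no expiry").
  destination[key][:ttl] = ttl if ttl && ttl > 0
end

source = {
  "project:1:branch_names" => { members: ["main", "stable"], ttl: 14 * 24 * 3600 }
}
destination = {}
migrate_set(source, destination, "project:1:branch_names")
# destination now holds the same members with the 2-week TTL intact
```

Because these sets carry a 2-week TTL, copying them once while dual-write keeps them fresh is sufficient; any set rewritten afterwards lands in both stores anyway.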
- [ ] Enable the use_primary_store_as_default_for_repository_cache feature flag to switch reads to the Redis Cluster: `/chatops run feature set use_primary_store_as_default_for_repository_cache true --staging`
- [ ] Disable the use_primary_and_secondary_stores_for_repository_cache feature flag to stop dual-write and end the migration: `/chatops run feature set use_primary_and_secondary_stores_for_repository_cache false --staging`
- [ ] Set label ~change::complete: `/label ~change::complete`
### Rollback

#### Rollback steps - steps to be taken in the event of a need to rollback this change

Estimated Time to Complete (60 mins)

Depending on which stage the CR is at:

- [ ] If there is a configuration issue, revert the k8s-workloads MR.
- [ ] If there is a migration issue:
  - If read traffic has not been cut over, disable the use_primary_and_secondary_stores_for_repository_cache feature flag to stop dual-write.
  - Otherwise, disable the use_primary_store_as_default_for_repository_cache feature flag to switch reads back to ServiceRedisRepositoryCache, then disable use_primary_and_secondary_stores_for_repository_cache after 1-2 minutes to let the Rails applications' in-memory cache expire.
- [ ] Set label ~change::aborted: `/label ~change::aborted`
### Monitoring

#### Key metrics to observe

- Metric: Apdex and error rates for ServiceRedisRepositoryCache
  - Location: https://dashboards.gitlab.net/d/redis-repository-cache-main/redis-repository-cache3a-overview?orgId=1&var-PROMETHEUS_DS=PA258B30F88C30650&var-environment=gstg
  - What changes to this metric should prompt a rollback: apdex drops below the 1h outage threshold or stays below the 6h degradation threshold
- Metric: Apdex and error rates for ServiceRedisClusterRepoCache
  - Location: https://dashboards.gitlab.net/d/redis-cluster-repo-cache-main/redis-cluster-repo-cache3a-overview?orgId=1&var-PROMETHEUS_DS=PA258B30F88C30650&var-environment=gstg
  - What changes to this metric should prompt a rollback: apdex drops below the 1h outage threshold or stays below the 6h degradation threshold
### Change Reviewer checklist

- [ ] Check if the following applies:
  - The scheduled day and time of execution of the change is appropriate.
  - The change plan is technically accurate.
  - The change plan includes estimated timing values based on previous testing.
  - The change plan includes a viable rollback plan.
  - The specified metrics/monitoring dashboards provide sufficient visibility for the change.
- [ ] Check if the following applies:
  - The complexity of the plan is appropriate for the corresponding risk of the change (i.e. the plan contains clear details).
  - The change plan includes success measures for all steps/milestones during the execution.
  - The change adequately minimizes risk within the environment/service.
  - The performance implications of executing the change are well-understood and documented.
  - The specified metrics/monitoring dashboards provide sufficient visibility for the change. If not, is it possible (or necessary) to make changes to observability platforms for added visibility?
  - The change has a primary and secondary SRE with knowledge of the details available during the change window.
  - The change window has been agreed with Release Managers in advance of the change. If the change is planned for APAC hours, this issue has an agreed pre-change approval.
  - The labels ~"blocks deployments" and/or ~"blocks feature-flags" are applied as necessary.
### Change Technician checklist

- [ ] Check if all items below are complete:
  - The change plan is technically accurate.
  - This Change Issue is linked to the appropriate Issue and/or Epic.
  - The change has been tested in staging and results noted in a comment on this issue.
  - A dry-run has been conducted and results noted in a comment on this issue.
  - The change execution window respects the Production Change Lock periods.
  - For C1 and C2 change issues, the change event is added to the GitLab Production calendar.
  - For C1 and C2 change issues, the SRE on-call has been informed prior to the change being rolled out. (In the #production channel, mention @sre-oncall and this issue and await their acknowledgement.)
  - For C1 and C2 change issues, the SRE on-call provided approval with the ~eoc_approved label on the issue.
  - For C1 and C2 change issues, the Infrastructure Manager provided approval with the ~manager_approved label on the issue.
  - Release managers have been informed prior to any C1, C2, or ~"blocks deployments" change being rolled out. (In the #production channel, mention @release-managers and this issue and await their acknowledgment.)
  - There are currently no active incidents that are ~severity::1 or ~severity::2.
  - If the change involves doing maintenance on a database host, an appropriate silence targeting the host(s) should be added for the duration of the change.