# Update `Gitlab::Issues::Rebalancing::State` to be compatible with Redis Cluster
## Problem
Originating issue: gitlab-com/gl-infra/scalability#2193 (comment 1447544432)
`Gitlab::Issues::Rebalancing::State` performs transactions over cross-slot keys and needs to be updated to support the migration of the shared-state workload into Redis Cluster.
### Methods affected
- `cleanup_cache` -- deletes keys and removes an item from a set
- `refresh_keys_expiration` -- updates the TTL of a few keys
The difficulty lies in the use of `multi` (transactions) to enable rollback on any server-side failure.
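To make the problem concrete, here is a rough sketch of the kind of cross-slot `multi` that Redis Cluster rejects. The key names are hypothetical stand-ins, not the exact keys used by the class:

```ruby
# Illustrative sketch only -- hypothetical key names, not the actual keys used
# by Gitlab::Issues::Rebalancing::State. A MULTI whose keys hash to different
# slots is rejected by Redis Cluster with a CROSSSLOT error.
Gitlab::Redis::SharedState.with do |redis|
  redis.multi do |multi|
    multi.del("rebalance:9970:issue_ids")               # per-rebalance key
    multi.del("rebalance:9970:current_index")           # per-rebalance key
    multi.srem("concurrent_running_rebalances", "9970") # global set, likely in a different slot
  end
end
```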
## Proposals
The two initial ideas to achieve compatibility with Redis Cluster without overhauling `Gitlab::Issues::Rebalancing::State` are:
- Relax the requirements by using `pipelined` instead of `multi`
- Use hash-tags around the Redis key prefix so that all keys are in the same key slot
Using hash-tags around unique identifiers like `root_namespace.id` will not work, as there are keys without unique ids in the transactions. Such keys are the `CONCURRENT_RUNNING_REBALANCES_KEY` Redis set and `recently_finished_key`, which uses the container id (project or namespace).
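As a rough illustration (the key names below are made up, not the actual ones), hash-tagging the unique id only groups the keys that share that id:

```ruby
# Hypothetical key names for illustration. Redis Cluster hashes only the part
# inside {...}, so keys sharing the same tag map to the same slot.
same_slot_keys = [
  "rebalance:{9970}:issue_ids",     # slot derived from "9970"
  "rebalance:{9970}:current_index"  # same slot as the key above
]

# These keys carry no shared unique id, so they generally hash to other slots,
# and a transaction spanning them together with the keys above stays cross-slot:
other_keys = [
  "concurrent_running_rebalances",
  "rebalance_finished:project:123"
]
```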
Option 1 of dropping the use of `multi` does not seem like a good idea.
Option 2 of using hash-tags to force all rebalancing state keys into a single key slot will work, but it could skew the workload on one node. However, looking at the logs for `Issues::RebalancingWorker` and the current maximum of 5 concurrent rebalances, I think this could work out.
## Proposed implementation of option 2
**Note:** Due to difficulties with handling deployment rollbacks (see below), we are opting for option 1 instead.
<details><summary>Click to expand</summary>

The sole caller of `Gitlab::Issues::Rebalancing::State` is `Issues::RelativePositionRebalancingService`, which is idempotent and does not change the ordering of issues (it spreads them out and shifts them "left" towards the negative region).
As the rebalancing starts from the smallest id, it can be interrupted without affecting the ordering of issues. If this assumption holds, we could change the key prefix to include hash-tags and deploy. There could be two situations (a sketch of the migration step in the second case follows this list):
- No ongoing rebalancing during rollout: this is the best-case scenario since the next rebalancing job will use the new key prefix and no migration needs to be done.
- Ongoing rebalancing during rollout: an in-flight job gets interrupted when the Sidekiq pods are terminated. The job gets retried on a new pod, the existing rebalance details for said job are migrated, and the job resumes on the new pod.
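A minimal sketch of that migration step, assuming hypothetical key names and a hash-tagged `{issue_rebalance}` prefix (the real prefix and key layout would be decided in the MR):

```ruby
# Hypothetical sketch only -- not the actual implementation. On retry, a worker
# running the new code copies an interrupted rebalance's details from the old
# namespace into the new hash-tagged namespace before resuming.
container_id = 9970 # hypothetical root namespace id

Gitlab::Redis::SharedState.with do |redis|
  old_key = "issue_rebalance:#{container_id}:current_index"
  new_key = "{issue_rebalance}:#{container_id}:current_index"

  if (old_index = redis.get(old_key))
    redis.set(new_key, old_index)
    redis.del(old_key)
  end
  # the sorted set of issue ids would be copied the same way (ZRANGE + ZADD)
end
```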
### On deployment rollbacks
If a deployment is rolled back, the Sidekiq pods (running the code from before the change MR) will not find any running jobs, since all running jobs are in the new namespace with hash-tags. The keys will remain in Redis for up to 10 days, since that is the TTL set on them. On subsequent deployments, the jobs will be unstuck by a cron run of `reschedule_stuck_issue_rebalances_worker`.
### Concerns
- If the rollback interrupts a job and lasts for an extended period of time, the reordering of issues for the group will be blocked until the job is resumed (in a new deployment).
### Other considered options

- A post-deployment migration could be performed to move the keys from the old namespace to the new namespace, but blocking issue reordering during that window is undesirable.
- Introduce a cronjob to run the migration from the old namespace to the new namespace, instead of running it in the new worker. This seems excessive, as we only need the migration to happen once.

</details>
## Proposed implementation for option 1
The `cleanup_cache` method can be simplified and reordered such that any failure leads to a safe state: the cleanup is either re-run in full or becomes a no-op.
### Why `current_project_key` got removed

`current_project_key` should be empty by the time the job reaches the cleanup step, since a remove is run in `preload_issue_ids`. If it is not empty, `get_issue_ids` would trigger a `preload_issue_ids`.
### Why is the rest of the key cleanup reordered in this manner?

We want to ensure that any error raised in the `cleanup_cache` method will not result in an invalid state. When an error is raised, the Sidekiq job fails and retries. The commands are ordered in increasing order of importance so that a job retry does not create an invalid state (we either pass through or retry). Without a transaction, we assume that each Redis command could raise an error. Note that this ordering is important.
Looking at the proposed cleanup logic:

```ruby
redis.srem?(CONCURRENT_RUNNING_REBALANCES_KEY, value) # order doesn't matter since it will still run on a retry
redis.del(issue_ids_key) # if the issue ids get deleted and the job reruns, it just preloads again
redis.del(current_index_key) # last to delete, since once it is gone a retry rebalances from scratch
redis.set(self.class.recently_finished_key(rebalanced_container_type, rebalanced_container_id), true, ex: 1.hour)
```
If the `srem?` fails to remove the entry from the `CONCURRENT_RUNNING_REBALANCES_KEY` set, the Sidekiq job will retry, preload the sorted set, and break out of the loop immediately (since `get_issue_ids` would return an empty list).
Failing on the `issue_ids_key` deletion would mean that if the job is resumed/retried, `preload_issue_ids` will fill the zset and the job will break out of the loop immediately too (since `current_index_key` points to the last index).
Failing on the `current_index_key` deletion would mean that if the job is resumed/retried:

1. `preload_issue_ids` would fill the set again.
2. The loop would break since `current_index_key` points to the last index.
3. `cleanup_cache` would rerun.
Failing on setting `recently_finished_key` would mean that if the job is resumed/retried, it is rerun from scratch (index 0).
I'm working with the assumption that rerunning a job is safe (though we would prefer not to), since https://gitlab.com/gitlab-org/gitlab/-/blob/0dc78611ad928d7673a8ff4b6dcb63b506ec19d0/spec/services/issues/relative_position_rebalancing_service_spec.rb#L66 tests for its idempotency.