# Update `Gitlab::Issues::Rebalancing::State` to be compatible with Redis Cluster
## Problem
Originating issue: gitlab-com/gl-infra/scalability#2193 (comment 1447544432)
`Gitlab::Issues::Rebalancing::State` performs transactions over cross-slot keys and needs to be updated to support the migration of the shared-state workload into Redis Cluster.
### Methods affected
- `cleanup_cache` -- deletes keys and removes an item from a set
- `refresh_keys_expiration` -- updates the TTL of a few keys
The difficulty lies in the use of `multi` (transactions) to enable rollback on any server-side failure.
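To make the problem concrete, here is a rough sketch of the kind of cross-slot `multi` that Redis Cluster rejects. The key names are hypothetical stand-ins, not the exact keys used by the class:

```ruby
# Illustrative sketch only -- hypothetical key names, not the actual keys used
# by Gitlab::Issues::Rebalancing::State. A MULTI whose keys hash to different
# slots is rejected by Redis Cluster with a CROSSSLOT error.
Gitlab::Redis::SharedState.with do |redis|
  redis.multi do |multi|
    multi.del("rebalance:9970:issue_ids")               # per-rebalance key
    multi.del("rebalance:9970:current_index")           # per-rebalance key
    multi.srem("concurrent_running_rebalances", "9970") # global set, likely in a different slot
  end
end
```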
## Proposals
The two initial ideas to achieve compatibility with Redis Cluster without overhauling `Gitlab::Issues::Rebalancing::State` are:
- Relax the requirements by using `pipelined` instead of `multi`
- Use hash-tags around the Redis key prefix so that all keys are in the same key slot
Using hash-tags around unique identifiers like `root_namespace.id` will not work, as there are keys without unique ids in the transactions. Such keys are the `CONCURRENT_RUNNING_REBALANCES_KEY` Redis set and `recently_finished_key`, which uses the container id (project or namespace).
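As a rough illustration (the key names below are made up, not the actual ones), hash-tagging the unique id only groups the keys that share that id:

```ruby
# Hypothetical key names for illustration. Redis Cluster hashes only the part
# inside {...}, so keys sharing the same tag map to the same slot.
same_slot_keys = [
  "rebalance:{9970}:issue_ids",     # slot derived from "9970"
  "rebalance:{9970}:current_index"  # same slot as the key above
]

# These keys carry no shared unique id, so they generally hash to other slots,
# and a transaction spanning them together with the keys above stays cross-slot:
other_keys = [
  "concurrent_running_rebalances",
  "rebalance_finished:project:123"
]
```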
Option 1 of dropping the use of `multi` does not seem like a good idea.
Option 2 of using hash-tags to force all rebalancing state keys into a single key slot will work, but it could skew the workload on one node. However, looking at the logs for `Issues::RebalancingWorker` and the current maximum of 5 concurrent rebalances, I think this could work out.
## Proposed implementation of option 2
**Note:** Due to difficulties with handling deployment rollbacks (see below), we are opting for option 1 instead.
<details><summary>Click to expand</summary>

The sole caller of `Gitlab::Issues::Rebalancing::State` is `Issues::RelativePositionRebalancingService`, which is idempotent and does not change the ordering of issues (it spreads them out and shifts them "left" towards the negative region).
As the rebalancing starts from the smallest id, it can be interrupted without affecting the ordering of issues. If this assumption holds, we could change the key prefix to include hash-tags and deploy. There could be two situations (a sketch of the migration step in the second case follows this list):
- No ongoing rebalancing during rollout: this is the best-case scenario since the next rebalancing job will use the new key prefix and no migration needs to be done.
- Ongoing rebalancing during rollout: an in-flight job gets interrupted when the Sidekiq pods are terminated. The job gets retried on a new pod, the existing rebalance details for said job are migrated, and the job resumes on the new pod.
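A minimal sketch of that migration step, assuming hypothetical key names and a hash-tagged `{issue_rebalance}` prefix (the real prefix and key layout would be decided in the MR):

```ruby
# Hypothetical sketch only -- not the actual implementation. On retry, a worker
# running the new code copies an interrupted rebalance's details from the old
# namespace into the new hash-tagged namespace before resuming.
container_id = 9970 # hypothetical root namespace id

Gitlab::Redis::SharedState.with do |redis|
  old_key = "issue_rebalance:#{container_id}:current_index"
  new_key = "{issue_rebalance}:#{container_id}:current_index"

  if (old_index = redis.get(old_key))
    redis.set(new_key, old_index)
    redis.del(old_key)
  end
  # the sorted set of issue ids would be copied the same way (ZRANGE + ZADD)
end
```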
### On deployment rollbacks
If a deployment is rolled back, the Sidekiq pods (running the code from before the change MR) will not find any running jobs, since all running jobs are in the new namespace with hash-tags. The keys will remain in Redis for up to 10 days, since that is the TTL set on them. On subsequent deployments, the jobs will be unstuck by a cron run of `reschedule_stuck_issue_rebalances_worker`.
### Concerns
- If the rollback interrupts a job and lasts for an extended period of time, the reordering of issues for the group will be blocked until the job is resumed (in a new deployment).
### Other considered options

- A post-deployment migration could be performed to move the keys from the old namespace to the new namespace, but blocking issue reordering during that window is undesirable.
- Introduce a cronjob to run the migration from the old namespace to the new namespace, instead of running it in the new worker. This seems excessive, as we only need the migration to happen once.

</details>
## Proposed implementation for option 1
The `cleanup_cache` method can be simplified and reordered such that any failure leads to a safe state: the cleanup is either re-run in full or becomes a no-op.
### Why `current_project_key` got removed

`current_project_key` should be empty by the time the job reaches the cleanup step, since a remove is run in `preload_issue_ids`. If it is not empty, `get_issue_ids` would trigger a `preload_issue_ids`.
### Why is the rest of the key cleanup reordered in this manner?

We want to ensure that any error raised in the `cleanup_cache` method will not result in an invalid state. When an error is raised, the Sidekiq job fails and retries. The commands are ordered in increasing order of importance so that a job retry does not create an invalid state (we either pass through or retry). Without a transaction, we assume that each Redis command could raise an error. Note that this ordering is important.
Looking at the proposed cleanup logic:

```ruby
redis.srem?(CONCURRENT_RUNNING_REBALANCES_KEY, value) # order doesn't matter since it will still run on a retry
redis.del(issue_ids_key) # if the issue ids get deleted and the job reruns, it just preloads again
redis.del(current_index_key) # last to delete, since once it is gone a retry rebalances from scratch
redis.set(self.class.recently_finished_key(rebalanced_container_type, rebalanced_container_id), true, ex: 1.hour)
```
If the `srem?` fails to remove the entry from the `CONCURRENT_RUNNING_REBALANCES_KEY` set, the Sidekiq job will retry, preload the sorted set, and break out of the loop immediately (since `get_issue_ids` would return an empty list).
Failing on the `issue_ids_key` deletion would mean that if the job is resumed/retried, `preload_issue_ids` will fill the zset and the job will break out of the loop immediately too (since `current_index_key` points to the last index).
Failing on the `current_index_key` deletion would mean that if the job is resumed/retried:

1. `preload_issue_ids` would fill the set again.
2. The loop would break since `current_index_key` points to the last index.
3. `cleanup_cache` would rerun.
Failing on setting `recently_finished_key` would mean that if the job is resumed/retried, it is rerun from scratch (index 0).
I'm working with the assumption that rerunning a job is safe (though we would prefer not to), since https://gitlab.com/gitlab-org/gitlab/-/blob/0dc78611ad928d7673a8ff4b6dcb63b506ec19d0/spec/services/issues/relative_position_rebalancing_service_spec.rb#L66 tests for its idempotency.