Establish strategy for deprecating Sidekiq namespace on GitLab SaaS
This issue discusses the strategy and details of the application, configuration and rollouts. Details on monitoring and exact steps will be in the change management issues.
Ability to handle jobs in both namespaced and non-namespaced Redis data structures
We use the environment variable SIDEKIQ_POLL_NON_NAMESPACED to control the dual-polling mechanism in Sidekiq servers (servers poll various data structures, such as sorted sets and queues, to perform tasks). Relevant merge requests:
- semi-reliable fetcher: gitlab-org/gitlab!116379 (merged)
- scheduled poller: gitlab-org/gitlab!129565 (merged)
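As a rough illustration of what dual-polling means for queue keys, here is a minimal sketch. It is not the code from the merge requests above. The helper name queue_keys_to_poll, the "resque:gitlab" namespace prefix, and the convention that the variable is enabled by the string "true" are all assumptions made for this example.

```ruby
# Sketch only: which Redis list keys a server would look at for a set of queues.
# The namespace value and the "true" convention are assumptions, not taken
# from the actual implementation.
def queue_keys_to_poll(queues, env = ENV, namespace: "resque:gitlab")
  namespaced = queues.map { |queue| "#{namespace}:queue:#{queue}" }

  # Without dual-polling, only the namespaced keys are fetched from.
  return namespaced unless env["SIDEKIQ_POLL_NON_NAMESPACED"] == "true"

  # With dual-polling, the same queues are also checked without the prefix.
  namespaced + queues.map { |queue| "queue:#{queue}" }
end

queue_keys_to_poll(%w[default], { "SIDEKIQ_POLL_NON_NAMESPACED" => "true" })
# returns ["resque:gitlab:queue:default", "queue:default"]
```

The point of polling both key sets is that servers can be switched on first, so that no job is stranded when clients later start enqueueing without the namespace.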
Gradual rollout without conventional feature-flags
To toggle enqueues in Sidekiq clients, the environment variable SIDEKIQ_ENQUEUE_NON_NAMESPACED controls whether Sidekiq.redis is configured with :namespace, which in turn determines the Redis key that the job is lpushed to.
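A minimal sketch of that toggle, assuming the variable is enabled with the string "true". The helper names sidekiq_redis_options and enqueue_key, the REDIS_URL fallback, and the "resque:gitlab" namespace are hypothetical and only serve the example. The real initializer may look quite different.

```ruby
# Sketch only: build the options hash that would be handed to Sidekiq's
# Redis configuration. The :namespace key is omitted when the toggle is on.
def sidekiq_redis_options(env = ENV, namespace: "resque:gitlab")
  options = { url: env.fetch("REDIS_URL", "redis://localhost:6379") }
  options[:namespace] = namespace unless env["SIDEKIQ_ENQUEUE_NON_NAMESPACED"] == "true"
  options
end

# The list key a job for the given queue would be lpushed to.
def enqueue_key(queue, env = ENV)
  [sidekiq_redis_options(env)[:namespace], "queue:#{queue}"].compact.join(":")
end

enqueue_key("default", {})
# returns "resque:gitlab:queue:default"
enqueue_key("default", { "SIDEKIQ_ENQUEUE_NON_NAMESPACED" => "true" })
# returns "queue:default"
```

Because the switch is read from the environment at boot, it takes effect per deployment group rather than per request, which is why the rollout is staged by group.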
The gradual rollout can happen by setting SIDEKIQ_ENQUEUE_NON_NAMESPACED in the k8s-workload config of the following groups:
- cny webservice in us-east-c (smallest group)
- webservice in us-east-1b
- webservice in us-east-1c
- webservice in us-east-1d
- sidekiq low-urgency-cpu-bound shard (smallest substantial shard)
- sidekiq urgent-other shard
- sidekiq urgent-cpu-bound shard
- sidekiq catchall shard and the rest
These groups are roughly equal-sized contributors. The exact rollout cadence (for example, whether we group several shards into one deploy) can be discussed.
Handling crons during migration
If SIDEKIQ_ENQUEUE_NON_NAMESPACED is enabled in only some of the shards, there will be two parallel sets of Sidekiq servers whose cron pollers poll separate sorted sets. This makes it possible for multiple cron jobs of the same class to be scheduled at the same time in different namespaces. See the implementation of should_enque? at https://github.com/sidekiq-cron/sidekiq-cron/blob/master/lib/sidekiq/cron/job.rb#L23.
Although Sidekiq job deduplication is not affected by namespaces, deduplication only takes effect for idempotent! jobs, and not all GitLab crons are idempotent!.
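The double-scheduling risk can be sketched with an in-memory stand-in for Redis. This is a simplified model, not sidekiq-cron's code. It assumes the "only enqueue once per run" guard works by adding the run time to a per-job sorted set and enqueueing only if the member is new. The key layout "cron_job:<name>:enqueued", the "resque:gitlab" namespace, the FakeRedis class, and the job name are assumptions for the example.

```ruby
require "set"

# In-memory stand-in for the single Redis behaviour that matters here:
# add a member to a sorted set only if it is not already present.
class FakeRedis
  def initialize
    @zsets = Hash.new { |hash, key| hash[key] = Set.new }
  end

  # Returns true when the member was newly added, false if it already existed.
  def zadd_nx(key, member)
    !@zsets[key].add?(member).nil?
  end
end

# A poller "wins" the right to enqueue a cron run if it is first to record it.
def claim_cron_run(redis, job_name, run_at, namespace: nil)
  key = [namespace, "cron_job:#{job_name}:enqueued"].compact.join(":")
  redis.zadd_nx(key, run_at)
end

redis = FakeRedis.new
run_at = "2023-09-01 00:00:00"

claim_cron_run(redis, "example_cron", run_at, namespace: "resque:gitlab") # true
claim_cron_run(redis, "example_cron", run_at, namespace: "resque:gitlab") # false, same namespace
claim_cron_run(redis, "example_cron", run_at)                             # true, different key
```

Two pollers in the same namespace contend for the same key, so only one enqueues. A poller in the other namespace writes to a different key, so the guard does not see the earlier claim and the same run is enqueued a second time.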
Possible solution
We can disable the Sidekiq cron pollers of a certain namespace by setting the poll interval to a negative number, because the launcher will then not set a @cron_poller.
GitLab.com can continue polling crons with a smaller fleet since the polling frequency scales with the number of processes. The process count is calculated using a Sidekiq.redis connection, so it will match the number of namespaced processes.
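To make the scaling argument concrete, here is a small arithmetic sketch. It assumes the cron poller scales like Sidekiq's scheduled poller, where each process waits on average about process_count times the configured average interval. The helper names and the 30-second interval are assumptions for illustration. The sketch also treats any non-positive interval as disabled, which goes slightly beyond the "negative number" stated above.

```ruby
# Sketch only: a poller is created only for a positive interval (assumption).
def cron_poller_enabled?(cron_poll_interval)
  cron_poll_interval.positive?
end

# Average wait between polls for a single process, in seconds.
def per_process_interval(process_count, average_interval)
  process_count * average_interval
end

# Average number of polls per minute across the whole fleet.
def fleet_polls_per_minute(process_count, average_interval)
  return 0.0 if process_count.zero?

  60.0 * process_count / per_process_interval(process_count, average_interval)
end

fleet_polls_per_minute(100, 30) # returns 2.0
fleet_polls_per_minute(10, 30)  # returns 2.0
cron_poller_enabled?(-1)        # returns false
```

Under this model the fleet-wide polling rate does not depend on how many processes still poll, so disabling cron pollers on some shards leaves the overall cron cadence unchanged as long as at least one process keeps polling.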
Example of per-shard rollout w.r.t cron polling
Assuming shards A, B, and C:
Rollout: cronjobs are handled by namespaced pollers until step 3
1. Enable SIDEKIQ_ENQUEUE_NON_NAMESPACED and disable crons in shard A
2. Enable SIDEKIQ_ENQUEUE_NON_NAMESPACED and disable crons in shard B
3. Enable SIDEKIQ_ENQUEUE_NON_NAMESPACED in shard C
4. Enable crons in A and B (can be done with step 3 for fewer deployment pipelines)
Rollback: cronjobs are handled by non-namespaced pollers until step 3
1. Disable SIDEKIQ_ENQUEUE_NON_NAMESPACED and disable crons in shard A
2. Disable SIDEKIQ_ENQUEUE_NON_NAMESPACED and disable crons in shard B
3. Disable SIDEKIQ_ENQUEUE_NON_NAMESPACED in shard C
4. Enable crons in A and B (can be done with step 3 for fewer deployment pipelines)
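The rollout ordering can be sanity-checked with a small simulation. This is a toy model, not deployment tooling. The Shard struct, the helper names, and the representation of each step as a hash of attribute changes are assumptions made for the example. It checks the property the plan relies on: after every step, all active cron pollers live in a single namespace.

```ruby
# Toy model of a shard: whether it uses non-namespaced Redis keys and
# whether its cron poller is running.
Shard = Struct.new(:name, :non_namespaced, :crons_enabled)

# Namespaces that currently have at least one active cron poller.
def active_cron_namespaces(shards)
  shards.select(&:crons_enabled)
        .map { |shard| shard.non_namespaced ? :non_namespaced : :namespaced }
        .uniq
end

# Applies each step in order and records the active cron namespaces after it.
def simulate(shards, steps)
  steps.map do |step|
    step.each do |name, changes|
      shard = shards.find { |candidate| candidate.name == name }
      changes.each { |attribute, value| shard[attribute] = value }
    end
    active_cron_namespaces(shards)
  end
end

ROLLOUT_STEPS = [
  { "A" => { non_namespaced: true, crons_enabled: false } },
  { "B" => { non_namespaced: true, crons_enabled: false } },
  { "C" => { non_namespaced: true } },
  { "A" => { crons_enabled: true }, "B" => { crons_enabled: true } }
].freeze

shards = %w[A B C].map { |name| Shard.new(name, false, true) }
simulate(shards, ROLLOUT_STEPS)
# returns [[:namespaced], [:namespaced], [:non_namespaced], [:non_namespaced]]
```

The rollback is the mirror image: start with every shard non-namespaced and flip non_namespaced to false in the same order. In both directions shard C is the only shard that changes namespace while its cron poller is still enabled, which is what keeps the two sets of pollers from overlapping.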
