Establish strategy for deprecating Sidekiq namespace on GitLab SaaS

This issue discusses the strategy and details of the application, configuration and rollouts. Details on monitoring and exact steps will be in the change management issues.

Ability to handle jobs in both namespaced and non-namespaced Redis data structures

We use the environment variable SIDEKIQ_POLL_NON_NAMESPACED to control the dual-polling mechanism in Sidekiq servers. (Servers poll various data structures, such as sorted sets and queues, to perform tasks.)
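
As a rough illustration of dual polling, the sketch below computes which Redis keys a server would read for a given logical key. The helper name `keys_to_poll`, the `"true"` value check, the `resque:gitlab` prefix, and the choice to always poll the namespaced key are assumptions made for this example, not the actual implementation.

```ruby
# Hypothetical sketch, not GitLab's actual code.
NAMESPACE = "resque:gitlab" # assumed namespace prefix

# Returns the Redis keys a server would poll for a logical key
# such as "queue:default" or "schedule".
def keys_to_poll(key, env = ENV)
  keys = ["#{NAMESPACE}:#{key}"] # assumption: the namespaced key is always polled
  # Assumption: the variable is considered enabled when it equals "true".
  keys << key if env["SIDEKIQ_POLL_NON_NAMESPACED"] == "true"
  keys
end

p keys_to_poll("queue:default", { "SIDEKIQ_POLL_NON_NAMESPACED" => "true" })
p keys_to_poll("schedule", {})
```

With the variable enabled, the server reads from both locations, so jobs are picked up no matter which side the client enqueued to.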

Gradual rollout without conventional feature-flags

To toggle where Sidekiq clients enqueue, the environment variable SIDEKIQ_ENQUEUE_NON_NAMESPACED controls whether Sidekiq.redis is configured with :namespace, which in turn determines the Redis key that the job is lpushed to.
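
A minimal sketch of how such a toggle could look on the client side is shown below. The helpers `client_redis_options` and `enqueue_key`, the `"true"` value check, and the example option values are hypothetical. The real wiring of Sidekiq.redis may differ.

```ruby
# Hypothetical sketch: derive the Redis options a client would use.
def client_redis_options(base, env = ENV)
  # Assumption: the variable is considered enabled when it equals "true".
  if env["SIDEKIQ_ENQUEUE_NON_NAMESPACED"] == "true"
    base.reject { |key, _| key == :namespace } # drop :namespace, keep the rest
  else
    base
  end
end

# Key that an enqueue to `queue` would LPUSH to under the given options.
def enqueue_key(queue, options)
  [options[:namespace], "queue:#{queue}"].compact.join(":")
end

base = { url: "redis://localhost:6379", namespace: "resque:gitlab" } # example values
puts enqueue_key("default", client_redis_options(base, {}))
puts enqueue_key("default", client_redis_options(base, { "SIDEKIQ_ENQUEUE_NON_NAMESPACED" => "true" }))
```

Because the switch is an environment variable per deployment group, each group can be moved independently, which is what makes the gradual rollout possible without feature flags.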

The gradual rollout can happen by setting that variable in the following groups' k8s-workload config:

  • cny webservice in us-east-c (smallest group)
  • webservice in us-east-1b
  • webservice in us-east-1c
  • webservice in us-east-1d
  • sidekiq low-urgency-cpu-bound shard (smallest substantial shard)
  • sidekiq urgent-other shard
  • sidekiq urgent-cpu-bound shard
  • sidekiq catchall shard and the rest

These groups are roughly equal-sized contributors. The exact rollout cadence (for example, whether we group several shards into one deploy) can be discussed.

Handling crons during migration

If SIDEKIQ_ENQUEUE_NON_NAMESPACED is enabled in only some of the shards, there will be two parallel sets of Sidekiq cron pollers, each polling a separate sorted set. As a result, multiple cronjobs of the same class may be scheduled at the same time in different namespaces. See the implementation of should_enque? at https://github.com/sidekiq-cron/sidekiq-cron/blob/master/lib/sidekiq/cron/job.rb#L23.
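
The following in-memory sketch illustrates why separate namespaces defeat this guard. It assumes the guard amounts to a ZADD on a per-job sorted set that succeeds only for the first poller to record a given tick. `FakeRedis`, the simplified `should_enqueue?` helper, the key layout, and the job name are stand-ins for illustration, not sidekiq-cron's actual code.

```ruby
require "set"

# In-memory stand-in for Redis: maps a key to a set of members.
class FakeRedis
  def initialize
    @sets = Hash.new { |hash, key| hash[key] = Set.new }
  end

  # Mimics ZADD for a new member: true only if the member was not present yet.
  def zadd(key, member)
    !@sets[key].add?(member).nil?
  end
end

# Assumed shape of the guard: the first poller to record the tick wins.
def should_enqueue?(redis, namespace, job_name, tick)
  key = [namespace, "cron_job:#{job_name}:enqueued"].compact.join(":") # assumed key layout
  redis.zadd(key, tick)
end

redis = FakeRedis.new
tick = "2023-08-17 12:00:00"
first_namespaced  = should_enqueue?(redis, "resque:gitlab", "example_cron", tick)
second_namespaced = should_enqueue?(redis, "resque:gitlab", "example_cron", tick)
non_namespaced    = should_enqueue?(redis, nil, "example_cron", tick)
p [first_namespaced, second_namespaced, non_namespaced]
```

Within one namespace the second poller is rejected, but the non-namespaced poller writes to a different key and is therefore also allowed to enqueue the same tick.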

Although Sidekiq job deduplication is not affected by namespaces, it only takes effect for idempotent! jobs, and not all GitLab crons are idempotent!.

Possible solution

We can disable the Sidekiq cron pollers of a given namespace by setting the poll interval to a negative number, because the launcher then does not set a @cron_poller.

GitLab.com can continue polling crons with a smaller fleet because the polling frequency scales with the number of processes. The process count is calculated using a Sidekiq.redis connection, so it will match the number of namespaced processes.
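
To make the scaling argument concrete, here is a small model. It assumes each process waits roughly (process count × a target average interval) between polls, which is how Sidekiq's scheduled poller scales its interval. Whether the deployed cron poller applies the same formula is an assumption, and the helper names and numbers are illustrative only.

```ruby
# Hypothetical model of interval scaling; names and numbers are illustrative.

# Seconds each process waits between polls, on average.
def per_process_interval(process_count, target_average)
  process_count * target_average
end

# Polls per second across all processes that share one namespace.
def fleet_polls_per_second(process_count, target_average)
  process_count / per_process_interval(process_count, target_average).to_f
end

puts fleet_polls_per_second(100, 5) # large fleet
puts fleet_polls_per_second(10, 5)  # smaller fleet, same overall rate
```

Under this model, a smaller fleet polls more often per process, so the overall polling rate for the namespace stays the same.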

Example of per-shard rollout w.r.t cron polling

Assuming shards A, B, and C:

Rollout: cronjobs are handled by namespaced pollers until step 3

  1. Enable SIDEKIQ_ENQUEUE_NON_NAMESPACED and disable crons in shard A
  2. Enable SIDEKIQ_ENQUEUE_NON_NAMESPACED and disable crons in shard B
  3. Enable SIDEKIQ_ENQUEUE_NON_NAMESPACED in shard C
  4. Enable crons in A and B (can be done with step 3 for fewer deployment pipelines)

Rollback: cronjobs are handled by non-namespaced pollers until step 3

  1. Disable SIDEKIQ_ENQUEUE_NON_NAMESPACED and disable crons in shard A
  2. Disable SIDEKIQ_ENQUEUE_NON_NAMESPACED and disable crons in shard B
  3. Disable SIDEKIQ_ENQUEUE_NON_NAMESPACED in shard C
  4. Enable crons in A and B (can be done with step 3 for fewer deployment pipelines)
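
The rollout and rollback sequences above can be checked with a small simulation. It assumes that a shard's cron poller uses the same namespace setting as its enqueues. The `Shard` struct and both helpers are hypothetical names for this sketch.

```ruby
# Hypothetical model of the steps above; names are illustrative.
Shard = Struct.new(:name, :non_namespaced, :crons_enabled)

# Namespaces that currently have at least one active cron poller.
def active_cron_namespaces(shards)
  shards.select(&:crons_enabled)
        .map { |shard| shard.non_namespaced ? :non_namespaced : :namespaced }
        .uniq
end

# Applies the four steps, moving every shard to `target`
# (true = non-namespaced), and returns the namespaces with
# active cron pollers after each step.
def run_steps(shards, target)
  a, b, c = shards
  steps = [
    -> { a.non_namespaced = target; a.crons_enabled = false },
    -> { b.non_namespaced = target; b.crons_enabled = false },
    -> { c.non_namespaced = target },
    -> { a.crons_enabled = true; b.crons_enabled = true }
  ]
  steps.map do |step|
    step.call
    active_cron_namespaces(shards)
  end
end

rollout  = run_steps(%w[A B C].map { |n| Shard.new(n, false, true) }, true)
rollback = run_steps(%w[A B C].map { |n| Shard.new(n, true, true) }, false)
p rollout
p rollback
```

In both directions, exactly one namespace has active cron pollers after every step, so no cronjob is scheduled from two namespaces at once.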
Edited by Sylvester Chin