Allow Sidekiq jobs to use readonly database replicas

Currently, Sidekiq always uses the primary, even though it does not always need to. This means that all of Sidekiq's database traffic hits the primary, whereas only some web database traffic does. From https://dashboards.gitlab.net/d/000000144/postgresql-overview, we can see that none of the replicas is used as much as the primary, although all of them are in the same ballpark.

Overall, our metrics suggest that we spend more database time in web transactions (green line) than Sidekiq jobs (orange line), but Sidekiq is still a significant percentage:

[Image: database time spent in web transactions (green line) vs. Sidekiq jobs (orange line)]

We currently have no way to tell whether a given worker requires read-only or read-write access to data. If we started annotating workers, we could send most of their queries to replicas instead, for operations that do not require read-write access or fully up-to-date data, such as:

  • all notifications
  • all webhooks
  • all ...

This would allow us to remove a number of SELECT statements from the primary.

The Scalability group is spending a lot of effort on annotating workers; maybe we could follow the same pattern here.

Proposed solution

  1. We should be able to define the data consistency requirement for a worker:
  • always: the worker is required to use the primary (the default).
  • sticky: the worker uses a replica for as long as possible, but switches to the primary on a write or on long replication lag. Use it for jobs that need to be executed as fast as possible.
  • delayed: the worker switches to the primary only on a write and otherwise uses a replica. If there is long replication lag, the job is delayed; only if the replica is still not up to date on the next retry does the job switch to the primary. Use it for jobs where, given their importance, we are fine to delay execution: expiring caches, executing hooks...

It is also possible to control the data consistency configuration of each worker with a feature flag:

data_consistency :delayed, feature_flag: :load_balancing_for_build_hooks_worker

  2. To be safer, we should be able to control load balancing for Sidekiq by setting the ENV variable ENABLE_LOAD_BALANCING_FOR_SIDEKIQ to 'true'.
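
The three consistency levels can be read as a small decision table. The following is a minimal sketch of that reading, not the actual implementation: `choose_database`, its keyword arguments (`wrote`, `replica_caught_up`, `retried`, `enabled`), and the `:retry_later` return value are all hypothetical names made up for illustration.

```ruby
# Hypothetical sketch of the routing rules described in this proposal.
# None of these names come from the GitLab codebase.
STRATEGIES = %i[always sticky delayed].freeze

# Returns :primary, :replica, or :retry_later for a single job execution.
#
# strategy          - one of STRATEGIES
# wrote             - true if the job has already performed a write
# replica_caught_up - true if the replica has the data the job needs
# retried           - true if the job was already delayed once for replica lag
# enabled           - whether load balancing is switched on for Sidekiq at all
def choose_database(strategy, wrote:, replica_caught_up:, retried: false,
                    enabled: ENV['ENABLE_LOAD_BALANCING_FOR_SIDEKIQ'] == 'true')
  raise ArgumentError, "unknown strategy: #{strategy}" unless STRATEGIES.include?(strategy)

  # With load balancing disabled, after a write, or for :always jobs,
  # everything goes to the primary.
  return :primary if !enabled || wrote || strategy == :always
  return :replica if replica_caught_up

  # The replica is lagging: sticky jobs fall back to the primary at once.
  return :primary if strategy == :sticky

  # Delayed jobs wait for one retry before falling back to the primary.
  retried ? :primary : :retry_later
end
```

Under these assumptions, a `:delayed` job that finds a lagging replica is rescheduled once, while a `:sticky` job in the same situation reads from the primary immediately, which is the trade-off between latency and primary load that the two levels are meant to express.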

Rollout plan:

  1. For GitLab.com, and possibly Omnibus too, we should make sure that the pgbouncer nodes for the read-only replicas are configured with a Sidekiq pool. At present (iirc), only the primary has a Sidekiq pool, since a replica pool would be unused. @jarv opened a Charts issue, gitlab-org/charts/gitlab#2619 (closed), for this; once it is done we will need to add an option to allow Sidekiq to use the load balancing config.
  2. We will also need to configure the read-replica pgbouncer pools on the patroni nodes: https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/12871
  3. Enable load balancing for Sidekiq by setting ENV['ENABLE_LOAD_BALANCING_FOR_SIDEKIQ'] = 'true'. This enables load balancing, but all workers will still use the primary database, since a worker's data_consistency defaults to :always.
  4. In #324232 (closed) we will set the BuildHooksWorker data_consistency to :delayed, controlled by the feature flag load_balancing_for_build_hooks_worker.
  5. Roll out the feature flag load_balancing_for_build_hooks_worker.
  6. If everything is fine, we will proceed with updating the other workers listed in &5592 (closed).
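
To make the effect of steps 3 to 5 concrete, here is a simplified model of how the ENV variable, the per-worker declaration, and the feature flag could combine. It is only a sketch under stated assumptions: `SketchWorker`, `effective_consistency`, `OtherWorker`, and the `env:` / `enabled_flags:` parameters are hypothetical, and the `BuildHooksWorker` class below is a stand-in, not the real worker.

```ruby
# Hypothetical, simplified model of the rollout; not GitLab's real classes.
class SketchWorker
  class << self
    # Declares the level a worker asks for, optionally guarded by a flag.
    def data_consistency(level, feature_flag: nil)
      @declared_level = level
      @feature_flag = feature_flag
    end

    # Resolves the level that actually applies, given the environment
    # and the list of enabled feature flags.
    def effective_consistency(env:, enabled_flags:)
      return :always unless env['ENABLE_LOAD_BALANCING_FOR_SIDEKIQ'] == 'true'
      return :always if @declared_level.nil?
      return :always if @feature_flag && !enabled_flags.include?(@feature_flag)

      @declared_level
    end
  end
end

# Stand-in for the worker annotated in step 4.
class BuildHooksWorker < SketchWorker
  data_consistency :delayed, feature_flag: :load_balancing_for_build_hooks_worker
end

# A worker without any annotation keeps the default.
class OtherWorker < SketchWorker; end

flag = :load_balancing_for_build_hooks_worker
on = { 'ENABLE_LOAD_BALANCING_FOR_SIDEKIQ' => 'true' }

# Before step 3: ENV variable not set, so the flag has no effect.
BuildHooksWorker.effective_consistency(env: {}, enabled_flags: [flag]) # => :always
# After step 3, flag still off: load balancing enabled, primary still used.
BuildHooksWorker.effective_consistency(env: on, enabled_flags: [])     # => :always
# After step 5: the flag is rolled out and the declared level applies.
BuildHooksWorker.effective_consistency(env: on, enabled_flags: [flag]) # => :delayed
# Unannotated workers are unaffected throughout.
OtherWorker.effective_consistency(env: on, enabled_flags: [flag])      # => :always
```

In this model each stage can be rolled back independently: unsetting the ENV variable or disabling the feature flag returns the worker to `:always` without a code change.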
Edited by Nikola Milojevic