Limit the number of Redis clients/concurrency of sidekiq-cluster

From https://gitlab.com/gitlab-com/production/issues/431#note_97230109, we see that as the number of Redis connections increases, Redis becomes less responsive, because it services all requests on a single thread.

With our existing Sidekiq fleet, it seems that when we use around 5900 Redis connections, we start to see more frequent Redis client timeouts and failovers. Note that the size of the Redis Sidekiq/shared state cluster is only 2 GB; the cache cluster is about 80 GB but "only" has ~2500 connections.
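The connection count can be checked with `INFO clients` (via `redis-cli info clients`, or `redis.info('clients')` in redis-rb). A minimal sketch of pulling the number out of that output; the helper name and sample string are illustrative, not from the issue:

```ruby
# Hypothetical helper: extract the client count from the text returned by
# Redis's `INFO clients` command.
def connected_clients(info_text)
  match = info_text[/^connected_clients:(\d+)/, 1]
  match && Integer(match)
end

# Sample INFO output, mirroring the ~5900-connection threshold seen above:
sample = "# Clients\r\nconnected_clients:5900\r\nblocked_clients:0\r\n"
connected_clients(sample)
# => 5900
```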

Currently each sidekiq-besteffort node uses around 500 connections (and threads) with the following config:

  • 4 processes per node
  • 6 nodes
  • 134 Sidekiq threads per process (see below for the calculation)

4 * 6 * 134 = 3216 connections in total.
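As a quick sanity check, the same arithmetic in Ruby (numbers taken from this issue):

```ruby
processes_per_node = 4
nodes              = 6
threads_per_proc   = 134 # one thread (and Redis connection) per listened queue

total_connections = processes_per_node * nodes * threads_per_proc
# => 3216
```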

By shutting down 2 nodes (sidekiq-besteffort-05 and sidekiq-besteffort-06), Redis PING latency decreased from 3 ms to 1 ms, and timeouts/failovers stopped happening.

The sidekiq-cluster concurrency is set to the number of queues here: https://gitlab.com/gitlab-org/gitlab-ee/blob/78efa878e30456f19fc3a3e29abb6d9618b5a50c/ee/lib/gitlab/sidekiq_cluster.rb#L74

https://github.com/mperham/sidekiq/wiki/Advanced-Options#concurrency explicitly mentions, "Don't set the concurrency higher than 50."

@yorickpeterse I'm wondering whether we should just specify a cap for the number of threads. Do we really need to have one connection per queue?
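One possible shape for such a cap, as a sketch only (the helper name and the exact ceiling are assumptions; sidekiq-cluster's actual interface may differ), using the limit of 50 from the Sidekiq wiki quoted above:

```ruby
# Hypothetical: derive a process's concurrency from its queue count,
# but never exceed a fixed ceiling (the Sidekiq wiki advises <= 50).
MAX_CONCURRENCY = 50

def capped_concurrency(queue_count, max: MAX_CONCURRENCY)
  [queue_count, max].min
end

capped_concurrency(134) # capped at 50 for today's besteffort config
capped_concurrency(10)  # small queue sets are unaffected
```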

Just as we have pgbouncer in front of PostgreSQL, we could also alleviate this problem by putting twemproxy in front of Redis: https://gitlab.com/gitlab-com/infrastructure/issues/4841

The 134-thread figure comes from counting all queues minus those already routed to dedicated nodes (run in a Rails console on a Sidekiq node):

# All queues known to the Rails app
all_queues = Gitlab::SidekiqConfig.worker_queues('/opt/gitlab/embedded/service/gitlab-rails')
# Queues handled by dedicated sidekiq-cluster nodes, excluded from besteffort
negate = "admin_emails,authorized_projects,build,cronjob:update_all_mirrors,delete_merged_branches,delete_user,elastic_batch_project_indexer,elastic_commit_indexer,elastic_indexer,email_receiver,emails_on_push,expire_build_instance_artifacts,export_csv,geo,geo_repository_update,gitlab_shell,group_destroy,ldap_group_sync,mail_scheduler,merge,namespaceless_project_destroy,new_issue,new_merge_request,new_note,pages,pipeline,pipeline_background:archive_trace,pipeline_cache,pipeline_creation,pipeline_default,pipeline_hooks,pipeline_processing,post_receive,process_commit,project_destroy,project_export,project_import_schedule,project_service,project_update_repository_storage,propagate_service_template,reactive_caching,repository_fork,repository_import,repository_update_mirror,update_merge_requests,web_hook,object_storage:object_storage_background_move,object_storage:migrate_uploads".split(',')
(all_queues - negate).count
# => 134
Edited Aug 28, 2018 by Stan Hu