Roll out queue-per-shard to workers of shard `elasticsearch`

Introduction

The Scalability team is migrating from a queue-per-worker to a queue-per-shard strategy, so that all workers in the same shard share a single queue. This is an attempt to reduce daily peak CPU saturation on redis-sidekiq. To achieve this, we applied the queue routing rules mechanism to determine the destination queue of a particular job at scheduling time. This mechanism is a drop-in replacement for the queue selector. We already rolled it out for the default and mailers queues on production (see #1073 (closed) for more information).
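As a rough illustration of how such a routing mechanism picks a queue, here is a minimal, self-contained sketch (illustrative only, not GitLab's actual implementation; the rule syntax mirrors the queue selector's `key=value&key=value` form, with a `*` catch-all):

```ruby
# Returns true if the worker's attributes satisfy a selector query.
# Terms are ANDed with '&'; values within a term are ORed with ','.
def matches?(query, attributes)
  return true if query == '*'

  query.split('&').all? do |term|
    key, values = term.split('=')
    values.split(',').include?(attributes[key.to_sym].to_s)
  end
end

# Rules are [query, queue] pairs, evaluated top to bottom; first match wins.
def route_queue(rules, attributes)
  rules.each { |query, queue| return queue if matches?(query, attributes) }
  nil
end

rules = [
  ['feature_category=global_search&urgency=throttled', 'elasticsearch'],
  ['*', 'default']
]

route_queue(rules, feature_category: :global_search, urgency: :throttled)
# => "elasticsearch"
route_queue(rules, feature_category: :source_code_management, urgency: :high)
# => "default"
```

With a scheme like this, every worker matching the shard's selector is scheduled onto the single per-shard queue, regardless of its per-worker queue name.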

We would like to continue rolling out this mechanism to all workers of the elasticsearch shard. Some notable information about this shard:

  • The elasticsearch shard runs on the Kubernetes cluster.
  • Queue selector configuration is located here.
  • Queue selector rule: feature_category=global_search&urgency=throttled
  • Workers in this shard, filtered by the queue selector rule and cross-checked against the logs of completed jobs by shard:
| Worker | Feature Category | Current Queue | Maintaining Group | 7-day job completions |
| --- | --- | --- | --- | --- |
| ElasticClusterReindexingCronWorker | global_search | cronjob:elastic_cluster_reindexing_cron | Global Search | 1,008 |
| ElasticIndexBulkCronWorker | global_search | cronjob:elastic_index_bulk_cron | Global Search | 10,079 |
| ElasticIndexInitialBulkCronWorker | global_search | cronjob:elastic_index_initial_bulk_cron | Global Search | 10,080 |
| Elastic::MigrationWorker | global_search | cronjob:elastic_migration | Global Search | 336 |
| ElasticCommitIndexerWorker | global_search | elastic_commit_indexer | Global Search | 10,080,127 |
| ElasticDeleteProjectWorker | global_search | elastic_delete_project | Global Search | 33,349 |
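Applied as a routing rule, the selector above would translate to something like the following (a sketch of a `routing_rules` entry in `config/gitlab.yml`; the exact nesting and surrounding keys here are assumptions based on the earlier default/mailers rollout, so check the real configuration before copying):

```yaml
# config/gitlab.yml (sketch; each rule is a [query, queue] pair,
# evaluated top to bottom -- first match wins)
production:
  sidekiq:
    routing_rules:
      - ["feature_category=global_search&urgency=throttled", "elasticsearch"]
      - ["*", "default"]
```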

Pre-check

Before rolling out, the following checklist must be completed to ensure reliability and safety:

  • No worker depends on its queue name

    • They should not implement their own capacity control by checking their own queue size. If they do, we should migrate them to LimitedCapacity::Worker instead.
    • They should not store or cache anything in Redis under a key derived from the queue name.
    • If there is a good reason for a worker to depend on its queue name, the corresponding documentation should be updated to reflect this semantic change.
  • Maintaining stage groups should be aware of this change. This is especially necessary for shards with a specific purpose, like elasticsearch.

  • It's absolutely normal for a worker to have no logs at all. We still need to include and inspect such workers.

  • Test both the Sidekiq client and server in the local environment. The most thorough way is to actually apply the configuration and test a full flow through the UI; however, that is too complicated and time-consuming. Instead, we can:

    • Test the Sidekiq client in the local environment. One simple way is to update config/gitlab.yml to include the new routing rules, start a console, and inspect the queues of the targeted workers: Screen_Shot_2021-06-21_at_11.31.01

    • Test the Sidekiq server in the local environment. It's highly recommended to bring up a real Kubernetes cluster locally and start the Sidekiq cluster with the dry-run flag to compare the listening queues against the existing queues of the aforementioned workers. The dry-run command is bin/sidekiq-cluster --dryrun .... Another method is to inspect the cmdline of the Sidekiq pod's container process with ps -ww -fp [PID]. Screen_Shot_2021-06-21_at_11.28.51

  • Zero-downtime consideration. As stated in #1136 (comment 607419452), the full rollout may feasibly take tens of minutes. Since we are updating both Sidekiq clients and Sidekiq servers while rolling out to our fleets, there are the following scenarios:

    • Old clients, before they are restarted, still send jobs to the per-worker queues. Those jobs can be picked up by both new servers and old servers, so this is not a problem.
    • New clients send jobs to the per-shard queue. Those jobs can only be picked up by new servers. As a result, there could be a period during which jobs sit in the per-shard queue without any server pulling from it. Therefore, it's critical to apply the configuration to the servers before the clients.
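The ordering constraint above can be illustrated with a toy coverage check (queue names taken from the table; this is a model for reasoning, not GitLab code):

```ruby
# Toy model of the rollout ordering: at every step of the rollout, every
# queue a client can push to must have at least one server listening on it.
def uncovered(client_queues, server_queues)
  client_queues - server_queues
end

per_worker = %w[elastic_commit_indexer elastic_delete_project]
per_shard  = ['elasticsearch']

# Servers updated first: servers listen on both old and new queues, so
# nothing pushed by an old or new client is left behind.
servers_first = uncovered(per_worker + per_shard, per_worker + per_shard)
# => []

# Clients updated first: new clients push to the per-shard queue while only
# old servers are running -- those jobs sit with no listener.
clients_first = uncovered(per_worker + per_shard, per_worker)
# => ["elasticsearch"]
```

An empty result means every queue is covered; a non-empty result names the queues whose jobs would stall, which is exactly why servers must be reconfigured first.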

Migrations

Please follow the linked issues for the detailed migration steps.

Appendix

Script to fetch workers of a shard
```ruby
# Run in a GitLab Rails console (Gitlab::SidekiqConfig is part of the app).
require 'net/http'
require 'uri'
require 'yaml'

url = URI("https://gitlab.com/gitlab-com/www-gitlab-com/raw/master/data/stages.yml")
request = Net::HTTP::Get.new(url)
response = Net::HTTP.new(url.host, url.port).tap { |http| http.use_ssl = true }.request(request)
groups = YAML.safe_load(response.read_body)["stages"].values.flat_map { |stage| stage["groups"].values }

worker_metadatas = Gitlab::SidekiqConfig::CliMethods.worker_metadatas
matcher = Gitlab::SidekiqConfig::WorkerMatcher.new('feature_category=global_search&urgency=throttled')

worker_metadatas.select { |w| matcher.match?(w) }.each do |w|
  group = groups.find { |g| g['categories'].include?(w[:feature_category].to_s) }
  puts "| #{w[:worker_name]} | #{w[:feature_category]} | [#{group['name']}](http://about.gitlab.com/#{group['group_link']}) |"
end
```