# Roll out queue-per-shard to workers of shard `elasticsearch`

## Introduction
The Scalability team is migrating from the queue-per-worker to the queue-per-shard strategy, so that all workers in the same shard share the same queue. This is an attempt to reduce daily peak CPU saturation on redis-sidekiq. To achieve this goal, we applied the queue routing rules mechanism to determine the destination queue of a particular job when it is scheduled. This mechanism is a drop-in replacement for the queue selector. We already applied and rolled it out for the `default` and `mailers` queues on production (see #1073 (closed) for more information).
We would like to continue to roll out this mechanism for all workers of elasticsearch shard. Some notable information about this shard:
- The `elasticsearch` shard stays on a Kubernetes cluster.
- The queue selector configuration is located here.
- Queue selector rule: `feature_category=global_search&urgency=throttled`
- Workers in this shard, filtered by the queue selector rule and compared with the logs of completed jobs by shard:
| Worker | Feature Category | Current Queue | Maintaining Group | 7-day job completions |
|---|---|---|---|---|
| ElasticClusterReindexingCronWorker | global_search | cronjob:elastic_cluster_reindexing_cron | Global Search | 1,008 |
| ElasticIndexBulkCronWorker | global_search | cronjob:elastic_index_bulk_cron | Global Search | 10,079 |
| ElasticIndexInitialBulkCronWorker | global_search | cronjob:elastic_index_initial_bulk_cron | Global Search | 10,080 |
| Elastic::MigrationWorker | global_search | cronjob:elastic_migration | Global Search | 336 |
| ElasticCommitIndexerWorker | global_search | elastic_commit_indexer | Global Search | 10,080,127 |
| ElasticDeleteProjectWorker | global_search | elastic_delete_project | Global Search | 33,349 |
- Prometheus metrics: https://dashboards.gitlab.net/d/sidekiq-shard-detail/sidekiq-shard-detail?orgId=1&var-PROMETHEUS_DS=Global&var-environment=gprd&var-stage=main&var-shard=elasticsearch
- Kibana logs: https://log.gprd.gitlab.net/goto/9ea13456cbdb3aa851b7bf688a19abe9
- Sentry events: https://sentry.gitlab.net/gitlab/gitlabcom/?query=is%3Aunresolved+type%3Asidekiq+shard%3Aelasticsearch
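For illustration, a routing rule mapping this shard's workers to a single per-shard queue might look like the following in `config/gitlab.yml`. This is a hedged sketch only: the exact keys and values should be checked against the queue routing rules documentation and the actual queue selector configuration linked above.

```yaml
production:
  sidekiq:
    routing_rules:
      # Route all throttled global_search workers to one per-shard queue.
      - ["feature_category=global_search&urgency=throttled", "elasticsearch"]
      # Fallback: everything else keeps going to the default queue.
      - ["*", "default"]
```

Rules are evaluated top to bottom, so the wildcard fallback must come last.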
## Pre-check
Before rolling out, the following checklist must be completed to ensure reliability and safety:
- [ ] All workers do not depend on their queue:
  - They should not implement their own capacity control by checking their own queue size. If they do, we should redirect them to use `LimitedCapacity::Worker` instead.
  - They should not store or cache anything in Redis under the queue name key.
  - If there is a good reason for a particular worker to depend on the queue name, the corresponding documentation should be updated to reflect this semantic change.
- [ ] Maintaining stage groups should be aware of this change. This is especially necessary for shards with a specific purpose, like `elasticsearch`.
- [ ] It's absolutely normal for a worker not to have any logs. We still need to include and inspect such workers.
- [ ] Test both the Sidekiq client and server in the local environment. The best way is to actually apply the configuration and test a full flow in the UI. However, that is too complicated and time-consuming. Instead, we can:
  - Test the Sidekiq client in the local environment. One simple way is to update `config/gitlab.yml` to include the new routing rules, start a console, and inspect the queues of the targeted workers.
  - Test the Sidekiq server in the local environment. It's highly recommended to bring up a real Kubernetes cluster locally and start the Sidekiq cluster with the dry-run flag to compare the listening queues against the existing queues of the aforementioned workers. The dry-run command is `bin/sidekiq-cluster --dryrun ...`. Another method is to inspect the cmdline of the Sidekiq pod's container process with `ps -ww -fp [PID]`.
- [ ] Zero-downtime consideration. As stated in #1136 (comment 607419452), it is feasible that the full rollout may take tens of minutes. We are updating both Sidekiq clients and Sidekiq servers while rolling out to our fleets, so there are the following scenarios:
  - Old clients, before being suspended, still send jobs to the per-worker queues. Those jobs can be captured by both new and old servers. This is not a problem.
  - New clients send jobs to the per-shard queue. Those jobs can only be captured by new servers. As a result, there could be a period of time when jobs stay in the per-shard queue without any new servers pulling from it. Therefore, it's critical to apply the configuration for the servers before the clients.
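To make the selector rule concrete, the following self-contained sketch shows how a query such as `feature_category=global_search&urgency=throttled` can be matched against worker metadata. `match_worker?` and the sample hash are hypothetical simplifications; the real implementation is `Gitlab::SidekiqConfig::WorkerMatcher`, used in the appendix script.

```ruby
# Hypothetical, simplified matcher: every `attribute=value` condition in
# the query (joined by `&`) must hold for the worker's metadata.
def match_worker?(query, worker)
  query.split('&').all? do |condition|
    attribute, expected = condition.split('=', 2)
    worker[attribute.to_sym].to_s == expected
  end
end

# Sample metadata mirroring ElasticCommitIndexerWorker from the table above.
worker = {
  worker_name: 'ElasticCommitIndexerWorker',
  feature_category: :global_search,
  urgency: :throttled
}

match_worker?('feature_category=global_search&urgency=throttled', worker) # => true
match_worker?('feature_category=global_search&urgency=low', worker)       # => false
```

The real matcher also supports negation and set membership, but the principle is the same: a worker is routed to the shard's queue if and only if all conditions hold.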
## Migrations
Please follow the linked issues for the detailed migration steps.
## Appendix

### Script to fetch workers of a shard
```ruby
require 'net/http'
require 'uri'
require 'yaml'

# Fetch the list of stage groups from www-gitlab-com to map feature
# categories to their maintaining groups.
url = URI("https://gitlab.com/gitlab-com/www-gitlab-com/raw/master/data/stages.yml")
request = Net::HTTP::Get.new(url)
response = Net::HTTP.new(url.host, url.port).tap { |http| http.use_ssl = true }.request(request)
groups = YAML.safe_load(response.read_body)["stages"].values.flat_map { |stage| stage["groups"].values }

# Collect the workers matching the shard's queue selector rule and print a
# Markdown table row for each of them.
worker_metadatas = Gitlab::SidekiqConfig::CliMethods.worker_metadatas
matcher = Gitlab::SidekiqConfig::WorkerMatcher.new('feature_category=global_search&urgency=throttled')

worker_metadatas.select { |w| matcher.match?(w) }.each do |w|
  group = groups.find { |g| g['categories'].include?(w[:feature_category].to_s) }
  puts "| #{w[:worker_name]} | #{w[:feature_category]} | [#{group['name']}](http://about.gitlab.com/#{group['group_link']}) |"
end
```

