Scheduled repository storage move workers should be urgency=throttled
The GitLab API now supports scheduled repository moves - https://docs.gitlab.com/ee/api/project_repository_storage_moves.html#schedule-a-repository-storage-move-for-a-project
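For reference, scheduling a single move through that endpoint looks roughly like the following Ruby sketch. The instance URL, token, project ID, and destination storage name are all placeholders, and the request is only built here, not sent:

```ruby
require "net/http"
require "json"
require "uri"

# Build (but do not send) a request that schedules a repository storage move
# for one project, per the project_repository_storage_moves API.
def build_move_request(base_url:, token:, project_id:, destination:)
  uri = URI("#{base_url}/api/v4/projects/#{project_id}/repository_storage_moves")
  req = Net::HTTP::Post.new(uri)
  req["PRIVATE-TOKEN"] = token
  req["Content-Type"] = "application/json"
  req.body = JSON.generate(destination_storage_name: destination)
  [uri, req]
end

uri, req = build_move_request(
  base_url: "https://gitlab.example.com", # placeholder instance
  token: "REDACTED",                      # placeholder token
  project_id: 42,                         # placeholder project
  destination: "nfs-file22"               # placeholder Gitaly storage name
)
# To actually send it:
# Net::HTTP.start(uri.host, uri.port, use_ssl: true) { |http| http.request(req) }
```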
However, at present, `ProjectUpdateRepositoryStorageWorker` runs on the catchall shard: https://dashboards.gitlab.net/d/sidekiq-queue-detail/sidekiq-queue-detail?orgId=1&var-PROMETHEUS_DS=Global&var-environment=gprd&var-stage=main&var-queue=project_update_repository_storage
This means that we need to drip-feed scheduled repository updates to the API, or risk overwhelming the system with too many concurrent updates, possibly also causing saturation on the catchall fleet along with IO saturation in the Gitaly fleet.
## Proposal
- Make `ProjectUpdateRepositoryStorageWorker` run as `urgency=throttled` and assign a dedicated shard (`repository_updates` or `gitaly_throttled`?) to run these jobs. The Sidekiq selector for these jobs would be `feature_category=gitaly&urgency=throttled`
- Updating the total concurrency of the shard (through the number of pods and the Sidekiq concurrency setting) will allow control over the number of concurrent jobs that can be executed.
- This Sidekiq shard could run in Kubernetes (cc @jarv)
- CirepoM would continue to work as it does right now, but could be simplified to issue all moves in a single batch while continuing to poll for status updates. Migration jobs would run back to back, with the next job starting immediately once the previous one had completed, on both failure and success (at present, we need to time out for failures)
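As a rough illustration of how such a selector would pick up the worker (in GitLab itself the worker class would declare attributes along the lines of `urgency :throttled` and `feature_category :gitaly`), here is a minimal, self-contained model of `key=value&key=value` selector matching. The worker metadata below is made up for the example, not GitLab's real registry:

```ruby
# Illustrative worker-to-attributes mapping; the entries are assumptions.
WORKERS = {
  "project_update_repository_storage" => { feature_category: "gitaly", urgency: "throttled" },
  "post_receive"                      => { feature_category: "source_code_management", urgency: "high" },
  "gitaly_cleanup"                    => { feature_category: "gitaly", urgency: "low" }
}.freeze

# Parse "key=value&key=value" into a Hash of requirements.
def parse_selector(selector)
  selector.split("&").to_h { |pair| pair.split("=", 2) }
end

# Return the queue names whose attributes satisfy every requirement.
def select_queues(workers, selector)
  wanted = parse_selector(selector)
  workers.select do |_queue, attrs|
    wanted.all? { |key, value| attrs[key.to_sym] == value }
  end.keys
end

select_queues(WORKERS, "feature_category=gitaly&urgency=throttled")
# => ["project_update_repository_storage"]
```

Under this model, only the storage-move queue matches both conditions, so a dedicated shard started with that selector would process those jobs and nothing else.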
cc for comments @glopezfernandez @nnelson @zj-gitlab @marin @proglottis
## Rollout Plan
### Prep work
- Any time: gitlab-com/gl-infra/k8s-workloads/gitlab-helmfiles!92 (merged) "Adds relabeling and log configuration for throttled shards"
- Any time: gitlab-com/runbooks!2427 (merged) "Preparatory refactor to allow HPA saturation rules"
- Any time: gitlab-com/runbooks!2426 (merged) "Autogenerate kubernetes HPA alerting rules"
- Monitoring and logging updates for the shards: gitlab-com/gl-infra/k8s-workloads/gitlab-helmfiles!92 (merged)
- Merge gitlab-org/gitlab!35230 (merged) "Throttle ProjectUpdateRepositoryStorageWorker Jobs". At this point, the jobs will continue to run safely on the catchall fleet.
- Because we are using maximum replicas for throttling, we will also need to ignore these shards in the HPA alert: https://gitlab.com/gitlab-com/runbooks/-/blob/master/rules/kubernetes-hpa.yml#L32-43
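That exclusion could be expressed by filtering the throttled shards out of the alert expression's label matchers. A purely hypothetical sketch (the real rule lives in `rules/kubernetes-hpa.yml`; the alert name, metric names, and label regex here are assumptions, not copied from that file):

```yaml
# Hypothetical: suppress the "HPA at max replicas" alert for throttled shards,
# since running pinned at max replicas is the intended throttling mechanism there.
- alert: HPAMaxedOut
  expr: >
    kube_hpa_status_desired_replicas
      >= on(hpa, namespace) kube_hpa_spec_max_replicas{hpa!~".*-throttled-.*"}
  for: 25m
```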
### Staging
- gitlab-com/gl-infra/k8s-workloads/gitlab-com!288 (merged): Add the throttled shards to non-prod environments
- https://ops.gitlab.net/gitlab-cookbooks/chef-repo/-/merge_requests/3766: Exclude the database and throttled jobs in staging (reverted, see https://ops.gitlab.net/gitlab-cookbooks/chef-repo/-/merge_requests/3797)
- Confirm the following:
  - Queues are no longer running in catchall
  - Pods are created in staging
  - Queues are running in Kubernetes
  - Queues are processing jobs and logging
### Production
- Change issue: production#2378 (closed)