# 2021-09-24: move repository import jobs back to the imports shard

## Production Change

### Change Summary
In gitlab-org/gitlab#332616 (closed) and gitlab-com/gl-infra/k8s-workloads/gitlab-com!930 (merged), we moved `RepositoryImportWorker` jobs to a new `imports` Sidekiq shard.

Unfortunately, we (by which I mean me) messed up a related project: changing our Sidekiq shard configuration. We created scalability#1073 (closed) to codify all the existing shards, but we did this before the new shard was created. That meant we missed it in our work on &469 (closed), and so this worker is running on the `catchall` shard, and has been for about two months! https://thanos-query.ops.gitlab.net/classic/graph?g0.range_input=90d&g0.max_source_resolution=0s&g0.expr=sum%20by%20(shard%2C%20worker)%20(rate(sidekiq_jobs_completion_seconds_count%7Benv%3D%22gprd%22%2C%20worker%3D%22RepositoryImportWorker%22%7D%5B5m%5D))&g0.tab=0
The shard itself still has containers running: https://dashboards.gitlab.net/d/sidekiq-kube-containers/sidekiq-kube-containers-detail?viewPanel=23&orgId=1
They just aren't doing anything: https://dashboards.gitlab.net/d/sidekiq-shard-detail/sidekiq-shard-detail?orgId=1&var-PROMETHEUS_DS=Global&var-environment=gprd&var-stage=main&var-shard=imports
### Change Details

- Services Impacted - ~"Service::Sidekiq"
- Change Technician - @smcgivern
- Change Reviewer - @smcgivern
- Time tracking - 1 hour 40 minutes
- Downtime Component - none
### Detailed steps for the change

#### Pre-Change Steps - steps to be completed before execution of the change

Estimated Time to Complete (mins) - 50 mins

- [ ] Set label ~"change::in-progress" on this issue
- [ ] Get approval on gitlab-com/gl-infra/k8s-workloads/gitlab-com!1261 (merged) (staging MR) and follow the roll-out steps there
- [ ] Get approval on gitlab-com/gl-infra/k8s-workloads/gitlab-com!1260 (merged)
#### Change Steps - steps to take to execute the change

Estimated Time to Complete (mins) - 40 mins

- [ ] Merge gitlab-com/gl-infra/k8s-workloads/gitlab-com!1260 (merged) and wait for it to apply
- [ ] Merge https://gitlab.com/gitlab-com/gl-infra/chef-repo/-/merge_requests/625 and wait for it to apply
#### Post-Change Steps - steps to take to verify the change

Estimated Time to Complete (mins) - 10 mins

- [ ] Review the metrics below
- [ ] Check the routing rules on a console node: `pp Gitlab.config.sidekiq.routing_rules; nil`
- [ ] Migrate all scheduled-in-the-future jobs from the old queues to the new one. In a Rails console, run:
  - `::Gitlab::SidekiqMigrateJobs.new('retry', logger: Logger.new($stdout)).execute(::Gitlab::SidekiqConfig.worker_queue_mappings)`
  - `::Gitlab::SidekiqMigrateJobs.new('interrupted', logger: Logger.new($stdout)).execute(::Gitlab::SidekiqConfig.worker_queue_mappings)`
  - `::Gitlab::SidekiqMigrateJobs.new('schedule', logger: Logger.new($stdout)).execute(::Gitlab::SidekiqConfig.worker_queue_mappings)`
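The migration step above exists because jobs already sitting in Sidekiq's retry, interrupted, and scheduled sets carry the queue name they were enqueued with, so jobs created before the routing change would still be pushed onto the old queues when they become due. The idea can be sketched in plain Ruby as follows; note that the job payloads, the `worker_queue_mappings` hash, and the `migrate_jobs` helper here are all illustrative stand-ins, not GitLab's actual implementation (which operates on Redis sorted sets):

```ruby
require 'json'

# Hypothetical stand-in for one Sidekiq scheduled set: an array of
# JSON-encoded job payloads. Real Sidekiq stores these in a Redis
# sorted set, scored by the scheduled-at timestamp.
scheduled_set = [
  { 'class' => 'RepositoryImportWorker', 'queue' => 'default', 'args' => [1] },
  { 'class' => 'SomeOtherWorker', 'queue' => 'default', 'args' => [2] }
].map(&:to_json)

# Hypothetical worker-class => queue mapping, analogous in shape to
# what Gitlab::SidekiqConfig.worker_queue_mappings returns.
worker_queue_mappings = { 'RepositoryImportWorker' => 'repository_import' }

# Rewrite each job's queue to match the current mapping; jobs whose
# worker class is not in the mapping are left untouched.
def migrate_jobs(set, mappings)
  set.map do |payload|
    job = JSON.parse(payload)
    target = mappings[job['class']]
    job['queue'] = target if target
    JSON.generate(job)
  end
end

migrate_jobs(scheduled_set, worker_queue_mappings).each { |job| puts job }
```

Running the three real commands per set ('retry', 'interrupted', 'schedule') covers every place a not-yet-running job can be waiting.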
### Rollback

#### Rollback steps - steps to be taken in the event of a need to rollback this change

Estimated Time to Complete (mins) - 40 mins

- [ ] Revert gitlab-com/gl-infra/k8s-workloads/gitlab-com!1260 (merged), get approval, and merge it
### Monitoring

#### Key metrics to observe

- Metric: RepositoryImportWorker details
  - Location: https://dashboards.gitlab.net/d/sidekiq-worker-detail/sidekiq-worker-detail?orgId=1&var-PROMETHEUS_DS=Global&var-environment=gprd&var-stage=main&var-worker=RepositoryImportWorker
  - What changes to this metric should prompt a rollback: decrease in apdex, increase in error rate, etc. The error rate should go down with this change, as it's already quite high.
### Summary of infrastructure changes

- Does this change introduce new compute instances?
- Does this change re-size any existing compute instances?
- Does this change introduce any additional usage of tooling like Elastic Search, CDNs, Cloudflare, etc?

N/A
Changes checklist
-
This issue has a criticality label (e.g. C1, C2, C3, C4) and a change-type label (e.g. changeunscheduled, changescheduled) based on the Change Management Criticalities. -
This issue has the change technician as the assignee. -
Pre-Change, Change, Post-Change, and Rollback steps and have been filled out and reviewed. -
This Change Issue is linked to the appropriate Issue and/or Epic -
Necessary approvals have been completed based on the Change Management Workflow. -
Change has been tested in staging and results noted in a comment on this issue: #5585 (comment 686477337) -
A dry-run has been conducted and results noted in a comment on this issue. -
SRE on-call has been informed prior to change being rolled out. (In #production channel, mention @sre-oncall
and this issue and await their acknowledgement.) -
Release managers have been informed (If needed! Cases include DB change) prior to change being rolled out. (In #production channel, mention @release-managers
and this issue and await their acknowledgment.) -
There are currently no active incidents.