2023-11-22: High database load caused slow and unresponsive merge requests and CI pipelines
Original title: SidekiqServiceSidekiqQueueingApdexSLOViolationSingleShard for urgent-cpu-bound
Customer Impact
Customers will experience slowness on merge requests, CI pipelines and background processing.
Current Status

Summary for CMOC notice / Exec summary:

- Customer Impact: Customers experienced slowness on merge requests, CI pipelines and background processing.
- Service Impact: Service::Postgres, Service::Sidekiq
- Impact Duration: 2023-11-22 1817 - 2023-11-23 1905 (1488 minutes = 24.8 hours)
- Root cause: A SELECT query on the primary db was inefficient under load (the query generated 85K ids in its IN clause), and the worker ended up saturating the Sidekiq queue trying to process all the jobs on the queue.
- Further analysis: Incident Review: High database load caused slow... (#17171 - closed)
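
The root cause above points at a single statement carrying roughly 85K ids in its IN list. One common mitigation, sketched here in plain Ruby, is to split the id list into fixed-size batches so that each statement carries a bounded list. This is only an illustration and is not necessarily what gitlab-org/gitlab!137707 does: the table name `example_policies`, the helper `batched_in_queries`, and the batch size are all hypothetical.

```ruby
# Hypothetical sketch: split a large id list into batches so each SQL
# statement carries a bounded IN list. Table and helper names are
# illustrative and do not come from the GitLab codebase.
BATCH_SIZE = 1_000 # assumed value; tune against the real workload

def batched_in_queries(ids, batch_size: BATCH_SIZE)
  ids.each_slice(batch_size).map do |batch|
    # Integer() raises on anything that is not an integer, which keeps
    # the interpolated list safe in this string-building sketch.
    id_list = batch.map { |id| Integer(id) }.join(", ")
    "SELECT id FROM example_policies WHERE id IN (#{id_list})"
  end
end

queries = batched_in_queries((1..85_000).to_a)
puts queries.length # 85 statements of 1,000 ids each
```

In a Rails codebase the same idea would more likely be expressed through ActiveRecord batching (for example `in_batches`) than through hand-built SQL strings.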
Done
- Changed the worker's queue assignment to `database_throttled`.
- Canary deployment is unblocked now by quarantining the spec.
- Enabled the worker feature flag once again, which will allow the worker to very slowly plow through the queued jobs.
- MR with the query improvement for the worker is being reviewed: Improve delete_software_license_policies query (gitlab-org/gitlab!137707 - merged)
- Wait for gitlab-org/gitlab!137707 (merged) to be deployed to production
  - You can check the progress with `/chatops run auto_deploy status https://gitlab.com/gitlab-org/gitlab/-/merge_requests/137707` in #production
  - The workflow::production label will be added to the merge request
- MR rolled out to production as of 1420 UTC.
- Worker is now performing better per #17168 (comment 1664172064).
- Move the worker to the `quarantine` shard, but be ready to disable it again via the feature flag if the workload on the primary db gets dangerously busy.
  - Merge gitlab-com/gl-infra/k8s-workloads/gitlab-com!3244 (merged)
- To disable the feature flag through the rails console:
    Feature.disable(:"run_sidekiq_jobs_Security::ProcessScanResultPolicyWorker") # returns `true`
    Feature.enabled?(:"run_sidekiq_jobs_Security::ProcessScanResultPolicyWorker") # returns `false`
- Things to check on the Primary
- Merge gitlab-com/gl-infra/k8s-workloads/gitlab-com!3245 (merged) to increase resources for `database_throttled` to clear the queue faster.
- Wait for `database_throttled` to be fully drained.
  - Check the queue length dashboard, or query Redis directly:
      ssh redis-sidekiq-01-db-gprd.c.gitlab-production.internal sudo gitlab-redis-cli --raw llen queue:database_throttled
- Expire silences for `database_throttled`
- Merge gitlab-com/gl-infra/k8s-workloads/gitlab-com!3246 (merged), which reverts the previous maxreplicas change for `database_throttled` and restores the standard setting.
- If all looks well, move the job back to the `catchall` shard by merging gitlab-com/gl-infra/k8s-workloads/gitlab-com!3247 (merged)
- Monitor for 1 hour.
- Revert the chef-repo change that lowered sidekiq pgbouncer amount. https://gitlab.com/gitlab-com/gl-infra/chef-repo/-/merge_requests/4234
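
The "wait for the queue to drain" step in the list above amounts to repeating the `llen` check until it reports zero. A minimal Ruby sketch of that loop follows. The helper `wait_until_drained` and its parameters are hypothetical, and the block stands in for whatever returns the current queue length (for instance the output of the `llen` command); it is stubbed here so the example runs without a live Redis.

```ruby
# Hypothetical helper: poll a queue-length source until it reports zero.
# The block is a stand-in for the real length lookup; nothing here is an
# actual GitLab or Sidekiq API.
def wait_until_drained(max_polls:, interval: 0)
  max_polls.times do |attempt|
    length = yield
    return attempt + 1 if length.zero? # number of polls it took
    sleep(interval)
  end
  nil # queue still had jobs after max_polls checks
end

# Usage with a stubbed sequence of queue lengths.
samples = [1200, 300, 0]
polls = wait_until_drained(max_polls: 10) { samples.shift }
puts polls # 3
```

Returning `nil` when the poll budget runs out makes the "still not drained" case explicit, so a caller can decide whether to keep waiting or escalate.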

References and helpful links

Recent Events (available internally only):
- Feature Flag Log - Chatops to toggle Feature Flags Documentation
- Infrastructure Configurations
- GCP Events (e.g. host failure)
Deployment Guidance
- Deployments Log | Gitlab.com Latest Updates
- Reach out to Release Managers for S1/S2 incidents to discuss Rollbacks, Hot Patching or speeding up deployments. | Rollback Runbook | Hot Patch Runbook
Use the following links to create related issues to this incident if additional work needs to be completed after it is resolved:
- Corrective action ❙ Infradev
- Incident Review ❙ Infra investigation followup
- Confidential Support contact ❙ QA investigation
Note: In some cases we need to redact information from public view. We only do this in a limited number of documented cases. This might include the summary, timeline or any other bits of information, as laid out in our handbook page. Any such confidential data will be in a linked issue, only visible internally. By default, all information we can share will be public, in accordance with our transparency value.
Security Note: If anything abnormal is found during the course of your investigation, please do not hesitate to contact security.