2023-10-25: High Sidekiq Job Latency
Summary
Starting at 15:05 UTC, background job processing slowed down, which manifested as a general GitLab.com slowdown.
Initially, the team noticed PgBouncer saturation on both the CI database and the main database. Many sessions were idle in transaction, holding their transactions open, which pointed to the Sidekiq workers being busy doing something else.
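As a rough illustration of how such sessions can be spotted, the sketch below queries `pg_stat_activity` for long-lived idle-in-transaction sessions; the DSN and the 60-second threshold are illustrative assumptions, not values from this incident.

```python
# Sketch: list sessions that have held a transaction open while idle.
# The DSN and the 60-second threshold are illustrative assumptions.
import psycopg2

IDLE_IN_TX_SQL = """
SELECT pid,
       usename,
       application_name,
       now() - xact_start AS tx_age,
       left(query, 120)   AS last_query
FROM pg_stat_activity
WHERE state = 'idle in transaction'
  AND now() - xact_start > interval '60 seconds'
ORDER BY tx_age DESC;
"""

with psycopg2.connect("dbname=gitlabhq_production") as conn:  # hypothetical DSN
    with conn.cursor() as cur:
        cur.execute(IDLE_IN_TX_SQL)
        for row in cur.fetchall():
            print(row)
```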
We then noticed that, on the main database's primary node, Sidekiq was spending a lot of its backend time on NewIssueWorker, which was running a slow lookup against the notes table for a noteable ID with roughly 600,000 matching records.
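For context on why that lookup was slow, here is a minimal sketch of the kind of sanity check an engineer might run against the primary; the noteable_id/noteable_type columns mirror GitLab's notes schema, but the ID shown is a placeholder and this is not the exact query NewIssueWorker executes.

```python
# Sketch: confirm how many notes rows share a single noteable.
# NOTEABLE_ID is a placeholder; the real ID from the incident is not shown.
import psycopg2

NOTEABLE_ID = 12345  # placeholder
COUNT_SQL = """
SELECT count(*)
FROM notes
WHERE noteable_type = %s
  AND noteable_id = %s;
"""

with psycopg2.connect("dbname=gitlabhq_production") as conn:  # hypothetical DSN
    with conn.cursor() as cur:
        cur.execute(COUNT_SQL, ("Issue", NOTEABLE_ID))
        print(cur.fetchone()[0])  # ~600,000 in this incident
```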
At the same time, we noticed a user heavily hitting a CI API endpoint. We blocked that user, and Sidekiq queue lengths and database saturation immediately dropped. Around the same time, however, the user who had been creating issues stopped doing so, and the rollback finished.
We do not believe the rollback helped. Blocking the CI user relieved pressure on the CI database, whereas the user who stopped creating issues relieved pressure on the main database.
Customer Impact
Between 2023-10-15 15:03 and 17:10 UTC, we saw a significant drop in Sidekiq Apdex, with the drop below Sidekiq's Apdex SLO beginning around 15:05 UTC: Source
For end users of GitLab.com, this was visible as CI jobs not being picked up and some UI interactions being slow or failing to load. Merge request diffs (which are produced by background jobs) were slow to load.
Two different problematic request patterns were found while investigating the incident. Blocking those patterns alleviated the issue.
The Reliability teams have produced notes on how to detect and mitigate the problematic requests that generated contention on our CI and main database nodes.
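As a rough illustration of one detection approach (not the teams' actual notes), the sketch below tallies API requests per user and route from a structured JSON log; the log path and field names are assumptions about GitLab's api_json.log format.

```python
# Sketch: find the noisiest API callers in a structured (JSON lines) API log.
# The log path and field names ("meta.user", "route") are assumptions about
# GitLab's api_json.log format, not taken from the incident itself.
import json
from collections import Counter

LOG_PATH = "/var/log/gitlab/gitlab-rails/api_json.log"  # assumed location

counts = Counter()
with open(LOG_PATH) as log:
    for line in log:
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue
        counts[(event.get("meta.user"), event.get("route"))] += 1

for (user, route), hits in counts.most_common(10):
    print(f"{hits:>8}  {user}  {route}")
```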
Mitigation Response
- rolled back to the version deployed before the performance degradation was first observed
- blocked the busy CI API user (one approach is sketched below)
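For reference, one way to block a user is GitLab's admin REST API (`POST /users/:id/block`); the sketch below assumes an admin personal access token and uses placeholder values, and is not necessarily the mechanism used during this incident.

```python
# Sketch: block a user via GitLab's admin REST API (POST /users/:id/block).
# The token, instance URL, and user ID are placeholders, not incident data.
import os
import requests

GITLAB_URL = "https://gitlab.example.com"   # placeholder instance
USER_ID = 42                                # placeholder user ID
TOKEN = os.environ["GITLAB_ADMIN_TOKEN"]    # admin personal access token

response = requests.post(
    f"{GITLAB_URL}/api/v4/users/{USER_ID}/block",
    headers={"PRIVATE-TOKEN": TOKEN},
    timeout=10,
)
response.raise_for_status()
print(f"User {USER_ID} blocked")
```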
Resolution
After we blocked the busy user, database saturation started to come down and the Sidekiq queues started to drain. Three events happened around this time: we blocked the busy CI user, another user who had been creating issues through the API stopped their activity, and the rollback completed. The team believes, however, that blocking the busy user is what brought the database saturation down.
Current Status - Resolved
More information will be added as we investigate the issue. Customers who believe they were affected by this incident should subscribe to this issue or monitor our status page for further updates.
📚 References and helpful links
Recent Events (available internally only):
- Feature Flag Log - Chatops to toggle Feature Flags Documentation
- Infrastructure Configurations
- GCP Events (e.g. host failure)
Deployment Guidance
- Deployments Log | GitLab.com Latest Updates
- Reach out to Release Managers for S1/S2 incidents to discuss Rollbacks, Hot Patching, or speeding up deployments. | Rollback Runbook | Hot Patch Runbook
Use the following links to create issues related to this incident if additional work needs to be completed after it is resolved:
- Corrective action ❙ Infradev
- Incident Review ❙ Infra investigation followup
- Confidential Support contact ❙ QA investigation
Note: In some cases we need to redact information from public view. We only do this in a limited number of documented cases, laid out in our handbook page; this might include the summary, timeline, or other details. Any confidential data will be in a linked issue that is only visible internally. By default, all information we can share will be public, in accordance with our transparency value.