2023-10-25: High Sidekiq Job Latency
Summary
Starting at 15:05 UTC, background job processing slowed down, which manifested as a general GitLab.com slowdown.
Initially, the team noticed PgBouncer saturation on both the CI database and the main database. Many sessions were idle in transaction, holding their transactions open, which pointed to the Sidekiq workers being busy doing something else.
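As a rough illustration of how such sessions can be spotted, the sketch below queries `pg_stat_activity` for long-lived idle-in-transaction sessions; the DSN and the 60-second threshold are illustrative assumptions, not values from this incident.

```python
# Sketch: list sessions that have held a transaction open while idle.
# The DSN and the 60-second threshold are illustrative assumptions.
import psycopg2

IDLE_IN_TX_SQL = """
SELECT pid,
       usename,
       application_name,
       now() - xact_start AS tx_age,
       left(query, 120)   AS last_query
FROM pg_stat_activity
WHERE state = 'idle in transaction'
  AND now() - xact_start > interval '60 seconds'
ORDER BY tx_age DESC;
"""

with psycopg2.connect("dbname=gitlabhq_production") as conn:  # hypothetical DSN
    with conn.cursor() as cur:
        cur.execute(IDLE_IN_TX_SQL)
        for row in cur.fetchall():
            print(row)
```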
We then noticed that, on the main database's primary node, Sidekiq was spending a lot of its backend time on NewIssueWorker, which was running a slow lookup against the notes table for a noteable ID with roughly 600,000 matching records.
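For context on why that lookup was slow, here is a minimal sketch of the kind of sanity check an engineer might run against the primary; the noteable_id/noteable_type columns mirror GitLab's notes schema, but the ID shown is a placeholder and this is not the exact query NewIssueWorker executes.

```python
# Sketch: confirm how many notes rows share a single noteable.
# NOTEABLE_ID is a placeholder; the real ID from the incident is not shown.
import psycopg2

NOTEABLE_ID = 12345  # placeholder
COUNT_SQL = """
SELECT count(*)
FROM notes
WHERE noteable_type = %s
  AND noteable_id = %s;
"""

with psycopg2.connect("dbname=gitlabhq_production") as conn:  # hypothetical DSN
    with conn.cursor() as cur:
        cur.execute(COUNT_SQL, ("Issue", NOTEABLE_ID))
        print(cur.fetchone()[0])  # ~600,000 in this incident
```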
At the same time, we noticed a user heavily hitting a CI API endpoint. We blocked that user, and Sidekiq queue lengths and database saturation immediately dropped. Around the same time, however, the user who had been creating issues stopped doing so, and the rollback finished.
We do not believe the rollback helped. Blocking the CI user relieved pressure on the CI database, whereas the user who stopped creating issues relieved pressure on the main database.
Customer Impact
Between 2023-10-15 15:03 and 17:10 UTC, we saw a significant drop in Sidekiq Apdex, with the drop below Sidekiq's Apdex SLO beginning around 15:05 UTC: Source
For end users of GitLab.com, this was visible as CI jobs not being picked up and some UI interactions being slow or failing to load. Merge request diffs (which are produced by background jobs) were slow to load.
Two different problematic request patterns were found while investigating the incident. Blocking those patterns alleviated the issue.
The Reliability teams have produced notes on how to detect and mitigate the problematic requests that generated contention on our CI and main database nodes.
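As a rough illustration of one detection approach (not the teams' actual notes), the sketch below tallies API requests per user and route from a structured JSON log; the log path and field names are assumptions about GitLab's api_json.log format.

```python
# Sketch: find the noisiest API callers in a structured (JSON lines) API log.
# The log path and field names ("meta.user", "route") are assumptions about
# GitLab's api_json.log format, not taken from the incident itself.
import json
from collections import Counter

LOG_PATH = "/var/log/gitlab/gitlab-rails/api_json.log"  # assumed location

counts = Counter()
with open(LOG_PATH) as log:
    for line in log:
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue
        counts[(event.get("meta.user"), event.get("route"))] += 1

for (user, route), hits in counts.most_common(10):
    print(f"{hits:>8}  {user}  {route}")
```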
Mitigation Response
- rolled back to the version deployed before the performance degradation was first observed
- blocked the busy CI API user (one approach is sketched below)
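For reference, one way to block a user is GitLab's admin REST API (`POST /users/:id/block`); the sketch below assumes an admin personal access token and uses placeholder values, and is not necessarily the mechanism used during this incident.

```python
# Sketch: block a user via GitLab's admin REST API (POST /users/:id/block).
# The token, instance URL, and user ID are placeholders, not incident data.
import os
import requests

GITLAB_URL = "https://gitlab.example.com"   # placeholder instance
USER_ID = 42                                # placeholder user ID
TOKEN = os.environ["GITLAB_ADMIN_TOKEN"]    # admin personal access token

response = requests.post(
    f"{GITLAB_URL}/api/v4/users/{USER_ID}/block",
    headers={"PRIVATE-TOKEN": TOKEN},
    timeout=10,
)
response.raise_for_status()
print(f"User {USER_ID} blocked")
```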
Resolution
After we blocked the busy user, database saturation started to come down and the Sidekiq queues started to drain. Three events happened around this time: we blocked the busy CI user, another user who had been creating issues through the API stopped their activity, and the rollback completed. The team believes, however, that blocking the busy user is what brought the database saturation down.
Current Status - Resolved
More information will be added as we investigate the issue. Customers who believe they were affected by this incident should subscribe to this issue or monitor our status page for further updates.
📚 References and helpful links
Recent Events (available internally only):
- Feature Flag Log - Chatops to toggle Feature Flags Documentation
- Infrastructure Configurations
- GCP Events (e.g. host failure)
Deployment Guidance
- Deployments Log | GitLab.com Latest Updates
- Reach out to Release Managers for S1/S2 incidents to discuss Rollbacks, Hot Patching, or speeding up deployments. | Rollback Runbook | Hot Patch Runbook
Use the following links to create issues related to this incident if additional work needs to be completed after it is resolved:
- Corrective action ❙ Infradev
- Incident Review ❙ Infra investigation followup
- Confidential Support contact ❙ QA investigation
Note: In some cases we need to redact information from public view. We only do this in a limited number of documented cases, laid out in our handbook page; this might include the summary, timeline, or other details. Any confidential data will be in a linked issue that is only visible internally. By default, all information we can share will be public, in accordance with our transparency value.