Multiple duplicate ProjectImportScheduleWorker jobs caused PostgreSQL session exhaustion while waiting for database record lock
Summary
Twice over a weekend, a large customer's GitLab instance exhausted all available connections to PostgreSQL owing to sessions waiting to acquire a record/tuple lock while running:
update "project_mirror_data" set "last_update_scheduled_at" = ?,
"status" = ? where "project_mirror_data"."id" = ?;
This caused an outage until the customer identified the session blocking the other queries and killed it in the database.
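For reference, the blocking session can usually be identified from PostgreSQL itself. The sketch below uses standard catalog functions and views (pg_stat_activity, pg_blocking_pids, pg_terminate_backend) and is not taken from the customer's environment:

```sql
-- Sessions that are blocked, together with the PID(s) blocking them.
SELECT pid,
       pg_blocking_pids(pid) AS blocked_by,
       wait_event_type,
       state,
       left(query, 80)       AS query_fragment
FROM pg_stat_activity
WHERE cardinality(pg_blocking_pids(pid)) > 0;

-- Once the blocking PID is known, it can be terminated:
-- SELECT pg_terminate_backend(<blocking_pid>);
```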
- The customer isn't running a `statement_timeout`, so it's possible that this occurs in other instances and is handled by the query/queries being cancelled. Session exhaustion started slowly and then accelerated; in the last few seconds enough new sessions started that session exhaustion would have been likely even with a reasonable `statement_timeout` of 15-60 seconds (see the sketch after this list).
- The customer is running Geo, but it had recently switched to a 'paused' state owing to a database replication issue. I mention this because of similarities with #5994 (closed).
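For illustration, a server-wide `statement_timeout` can be applied as below. The 60-second value is only an example from the 15-60 second range mentioned above, and on Omnibus installations PostgreSQL parameters are normally managed through the package's configuration rather than `ALTER SYSTEM`:

```sql
-- Example only: apply a 60 second statement timeout instance-wide.
ALTER SYSTEM SET statement_timeout = '60s';
SELECT pg_reload_conf();  -- reload so the new default takes effect
```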
The underlying Sidekiq jobs don't appear to be deduplicating properly, or at all. The behaviour observed here looks a lot like #385008 (closed) for another Sidekiq job, which was observed in the same customer's environment. The fix there was to ensure multiple jobs didn't attempt to run for the same work.
The customer provided a dump of PostgreSQL's pg_stat_activity view showing the queries being attempted by all sessions, and the locked sessions were all running the ProjectImportScheduleWorker query. The worker had been repeatedly scheduled at a high or very high frequency. The value being written to last_update_scheduled_at in each query is consistent with pg_stat_activity.query_start:
- 2023-02-04 03:40: 15 sessions/queries
- 2023-02-04 03:41: 17
- 2023-02-04 03:42: 17
- 2023-02-04 03:43: 19
- 2023-02-04 03:44: 24
- 2023-02-04 03:45: 6
- 2023-02-04 03:46: 3
- 2023-02-04 04:01: 3
- 2023-02-04 04:02: 30
- 2023-02-04 04:03: 37, of which 27 started in one second (04:03:10) and the rest at 04:03:12
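A per-minute breakdown like the one above can be produced from pg_stat_activity with something along these lines. This is a sketch using standard columns, not the exact query used to analyse the customer's dump:

```sql
-- Count lock-waiting sessions per minute for the problem statement.
SELECT date_trunc('minute', query_start) AS started,
       count(*)                          AS sessions
FROM pg_stat_activity
WHERE wait_event_type = 'Lock'
  AND query LIKE 'UPDATE "project_mirror_data"%'
GROUP BY 1
ORDER BY 1;
```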
Every single query was for the same "project_mirror_data"."id". The marginalia indicate these 173 sessions came from 109 unique correlation_id/job_id combinations, and seven unique correlation_id values:
UPDATE "project_mirror_data" SET "last_update_scheduled_at" = '2023-02-04 04:03:12.374456', "status" = 'scheduled' WHERE "project_mirror_data"."id" = 109706
GitLab team members can read more in two tickets: it was one of the last issues on a long-running support ticket, with the investigation continuing on a new ticket.
Steps to reproduce
Example Project
What is the current bug behavior?
Sidekiq is scheduling multiple ProjectImportScheduleWorker jobs for the same mirror, as evidenced by the queries all these jobs were trying to run but were unable to complete because the record was locked:
UPDATE "project_mirror_data"
SET "last_update_scheduled_at" = '2023-02-04 04:03:12.374456', "status" = 'scheduled'
WHERE "project_mirror_data"."id" = 109706
The customer's configuration (no statement_timeout) is unusual, but there may be other circumstances under which this situation could arise.
Since 67 of these sessions arrived over 04:03:10-04:03:12, even the 15-second timeout configured on GitLab.com would not protect a database from session exhaustion.
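To gauge how close an instance is to this failure mode, the current session count can be compared against the configured limit. This is a generic PostgreSQL check, not anything specific to the customer's environment:

```sql
-- Compare active sessions against the configured connection limit.
SELECT count(*)                                AS current_sessions,
       current_setting('max_connections')::int AS max_connections
FROM pg_stat_activity;
```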
What is the expected correct behavior?
Does it make sense to schedule more jobs when an existing job is already running for the same mirror?
See #385008 (closed) for a similar example.
Relevant logs and/or screenshots
Output of checks
Results of GitLab environment info
Expand for output related to GitLab environment info
(For installations with omnibus-gitlab package run and paste the output of: `sudo gitlab-rake gitlab:env:info`) (For installations from source run and paste the output of: `sudo -u git -H bundle exec rake gitlab:env:info RAILS_ENV=production`)
Results of GitLab application Check
Expand for output related to the GitLab application check
(For installations with omnibus-gitlab package run and paste the output of: `sudo gitlab-rake gitlab:check SANITIZE=true`) (For installations from source run and paste the output of: `sudo -u git -H bundle exec rake gitlab:check RAILS_ENV=production SANITIZE=true`) (we will only investigate if the tests are passing)