Multiple duplicate ProjectImportScheduleWorker jobs caused PostgreSQL session exhaustion while waiting for a database record lock

Summary

Twice over a weekend, a large customer's GitLab instance exhausted all available connections to PostgreSQL owing to sessions waiting to acquire a record (tuple) lock for this statement:

update "project_mirror_data" set "last_update_scheduled_at" = ?, 
"status" = ? where "project_mirror_data"."id" = ?;

This caused an outage until the customer identified the session blocking the other queries and killed it in the database.
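For reference, a minimal sketch of how a blocking session can be identified and terminated from psql, assuming PostgreSQL 9.6 or later for pg_blocking_pids(). This is illustrative rather than the exact queries the customer ran:

-- sessions that are waiting on a lock, and the PIDs blocking them
SELECT pid, pg_blocking_pids(pid) AS blocked_by, wait_event_type, state, query_start, query
FROM pg_stat_activity
WHERE cardinality(pg_blocking_pids(pid)) > 0;

-- terminate the session holding the lock (replace <blocking pid> with the PID reported above)
SELECT pg_terminate_backend(<blocking pid>);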

  • The customer isn't running with a statement_timeout, so it's possible this occurs on other instances and is normally handled by the queries being cancelled. Session exhaustion started slowly and then accelerated; in the last few seconds enough new sessions started that exhaustion would likely have occurred even with a reasonable statement_timeout of 15-60 seconds (see the sketch after this list).
  • The customer is running Geo, but it had recently switched to a 'paused' state owing to a database replication issue. I mention this because of the similarities with #5994 (closed).
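The final burst (below) means a timeout alone would not have saved this instance, but for completeness, here is a sketch of how statement_timeout can be inspected and set directly in PostgreSQL. On Omnibus GitLab this is normally managed through GitLab's own configuration rather than ALTER SYSTEM, so treat this as illustrative:

-- check the current value
SHOW statement_timeout;

-- apply a 15 second timeout to the current session only
SET statement_timeout = '15s';

-- or persist it server-wide (requires superuser) and reload the configuration
ALTER SYSTEM SET statement_timeout = '15s';
SELECT pg_reload_conf();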

The underlying Sidekiq jobs don't appear to be deduplicated properly, or at all. The behaviour observed here looks a lot like #385008 (closed) for another Sidekiq job, which was also observed in the same customer's environment. The fix there was to ensure that multiple jobs didn't attempt to run for the same work.

The customer provided a dump of PostgreSQL's pg_stat_activity view showing the queries being attempted by all sessions, and the blocked sessions all belonged to ProjectImportScheduleWorker, which had been scheduled repeatedly at a high or very high frequency. The value being written to last_update_scheduled_at in each query is consistent with pg_stat_activity.query_start (a sketch of how such counts can be tallied follows the list):

  • 2023-02-04 03:40: 15 sessions/queries
  • 2023-02-04 03:41: 17
  • 2023-02-04 03:42: 17
  • 2023-02-04 03:43: 19
  • 2023-02-04 03:44: 24
  • 2023-02-04 03:45: 6
  • 2023-02-04 03:46: 3
  • 2023-02-04 04:01: 3
  • 2023-02-04 04:02: 30
  • 2023-02-04 04:03: 37, of which 27 started in one second (04:03:10) and the rest at 04:03:12
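A sketch of the kind of aggregation that produces counts like those above, assuming it is run against pg_stat_activity (or a table the customer's dump was loaded back into):

-- count sessions attempting the mirror UPDATE, grouped by the minute they started
SELECT date_trunc('minute', query_start) AS started, count(*) AS sessions
FROM pg_stat_activity
WHERE query LIKE '%project_mirror_data%'
  AND state <> 'idle'
GROUP BY 1
ORDER BY 1;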

Every single query was for the same "project_mirror_data"."id"; the marginalia indicate these 173 sessions came from 109 unique correlation_id/job_id combinations, and seven unique correlation_id values:

UPDATE "project_mirror_data" SET "last_update_scheduled_at" = '2023-02-04 04:03:12.374456', "status" = 'scheduled' WHERE "project_mirror_data"."id" = 109706

GitLab team members can read more in two tickets: it was one of the last issues on a long-running support ticket, with the investigation continuing on a new ticket.

Steps to reproduce

Example Project

What is the current bug behavior?

Sidekiq is scheduling multiple ProjectImportScheduleWorker jobs for the same mirror, as evidenced by the queries all of these jobs were trying to run but could not complete because the record was locked:

UPDATE "project_mirror_data" 
SET "last_update_scheduled_at" = '2023-02-04 04:03:12.374456', "status" = 'scheduled' 
WHERE "project_mirror_data"."id" = 109706

The customer's configuration (no statement_timeout) is unusual, but there may be other circumstances under which this situation could arise.

Since 67 of these sessions arrived over 04:03:10-04:03:12, even the 15-second statement_timeout configured on GitLab.com would not protect a database from session exhaustion.

What is the expected correct behavior?

Does it make sense to schedule more jobs when an existing job is already running?

See #385008 (closed) for a similar example.
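Deduplicating the jobs, as was done for #385008, looks like the right direction. Purely to illustrate how the database-side pile-up could be avoided (this is a sketch of the idea, not the application's actual code), a conditional update using SKIP LOCKED would let later jobs no-op instead of queuing on the locked row:

-- claim the mirror only if it isn't already scheduled and isn't locked by another session
UPDATE "project_mirror_data"
SET "last_update_scheduled_at" = now(), "status" = 'scheduled'
WHERE "project_mirror_data"."id" IN (
  SELECT "id" FROM "project_mirror_data"
  WHERE "id" = 109706 AND "status" <> 'scheduled'
  FOR UPDATE SKIP LOCKED
);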

Relevant logs and/or screenshots

Output of checks

Results of GitLab environment info

Expand for output related to GitLab environment info

(For installations with omnibus-gitlab package run and paste the output of:
`sudo gitlab-rake gitlab:env:info`)

(For installations from source run and paste the output of:
`sudo -u git -H bundle exec rake gitlab:env:info RAILS_ENV=production`)

Results of GitLab application Check

Expand for output related to the GitLab application check

(For installations with omnibus-gitlab package run and paste the output of:
`sudo gitlab-rake gitlab:check SANITIZE=true`)

(For installations from source run and paste the output of:
`sudo -u git -H bundle exec rake gitlab:check RAILS_ENV=production SANITIZE=true`)

(we will only investigate if the tests are passing)

Possible fixes
