Database::PartitionManagementWorker seems to be triggering session lock storms on self-managed instances

Summary

Evidence is accumulating that Database::PartitionManagementWorker is causing incidents on self-managed instances. Sessions pile up in a lock storm behind:

LOCK TABLE "p_ci_finished_build_ch_sync_events" IN ACCESS EXCLUSIVE MODE

This is the most restrictive table-level lock mode. From the PostgreSQL documentation:

Conflicts with locks of all modes (ACCESS SHARE, ROW SHARE, ROW EXCLUSIVE, SHARE UPDATE EXCLUSIVE, SHARE, SHARE ROW EXCLUSIVE, EXCLUSIVE, and ACCESS EXCLUSIVE). This mode guarantees that the holder is the only transaction accessing the table in any way.

I think the problem is that on a busy instance there is a lot of activity relating to finished CI jobs. If the ACCESS EXCLUSIVE lock can't be acquired immediately because other sessions hold conflicting locks on the table, the request waits in the lock queue, and every new session that needs a lock on the table then has to wait behind it.

As a result, the database very rapidly accumulates sessions attempting to INSERT INTO "p_ci_finished_build_ch_sync_events".
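
The queueing behaviour described above can be illustrated with a toy model. This is only a sketch, not GitLab or PostgreSQL code: the TableLock class, the session names, and the reduced conflict table (three lock modes only) are assumptions made for illustration. It shows that once an ACCESS EXCLUSIVE request is waiting, later INSERTs (which take ROW EXCLUSIVE) queue behind it even though they would not conflict with the lock that is currently held.

```python
# Toy model of a table-level lock queue. Hypothetical names throughout;
# only the three lock modes relevant to this report are modelled.
CONFLICTS = {
    "ACCESS SHARE": {"ACCESS EXCLUSIVE"},
    "ROW EXCLUSIVE": {"ACCESS EXCLUSIVE"},
    "ACCESS EXCLUSIVE": {"ACCESS SHARE", "ROW EXCLUSIVE", "ACCESS EXCLUSIVE"},
}


def blocked(mode, others):
    """True if `mode` conflicts with any (session, mode) pair in `others`."""
    return any(other_mode in CONFLICTS[mode] for _, other_mode in others)


class TableLock:
    def __init__(self):
        self.granted = []  # (session, mode) pairs currently holding the lock
        self.waiting = []  # (session, mode) pairs queued in arrival order

    def request(self, session, mode):
        """Grant the lock if nothing held or queued conflicts; else queue."""
        if blocked(mode, self.granted + self.waiting):
            self.waiting.append((session, mode))
            return False
        self.granted.append((session, mode))
        return True

    def release(self, session):
        """Drop a session's locks, then wake waiters in arrival order."""
        self.granted = [(s, m) for s, m in self.granted if s != session]
        still_waiting = []
        for s, m in self.waiting:
            # A waiter stays queued if it conflicts with a held lock
            # or with an earlier waiter that could not be woken.
            if blocked(m, self.granted + still_waiting):
                still_waiting.append((s, m))
            else:
                self.granted.append((s, m))
        self.waiting = still_waiting


lock = TableLock()
# An open transaction already holds ROW EXCLUSIVE from an earlier INSERT.
first = lock.request("long-running transaction", "ROW EXCLUSIVE")
# The partition manager asks for ACCESS EXCLUSIVE and has to wait.
manager = lock.request("partition manager", "ACCESS EXCLUSIVE")
# New INSERTs now queue behind the waiting ACCESS EXCLUSIVE request.
inserts = [lock.request(f"insert-{i}", "ROW EXCLUSIVE") for i in range(5)]
print(first, manager, inserts, len(lock.waiting))
# prints: True False [False, False, False, False, False] 6
```

In this model the five INSERT sessions stay queued until the long-running transaction finishes and the partition manager has acquired and released its lock. On a real instance each queued session also occupies a connection for as long as it waits.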

Customers see two issues:

  • session exhaustion in PgBouncer/PostgreSQL
  • all Sidekiq worker threads fill up

Steps to reproduce

Example Project

What is the current bug behavior?

Database::PartitionManagementWorker causes GitLab outages

What is the expected correct behavior?

Relevant logs and/or screenshots

Output of checks

Looks to affect GitLab 16.8 and later.

Possible fixes

Edited by Ben Prescott (ex-GitLab)