Database::PartitionManagementWorker seems to be triggering session lock storms on self-managed instances
Summary
Evidence seems to be accumulating that Database::PartitionManagementWorker is causing incidents on self-managed instances, owing to a lock storm backed up behind:
LOCK TABLE "p_ci_finished_build_ch_sync_events" IN ACCESS EXCLUSIVE MODE
This is a fairly intrusive lock:
Conflicts with locks of all modes (ACCESS SHARE, ROW SHARE, ROW EXCLUSIVE, SHARE UPDATE EXCLUSIVE, SHARE, SHARE ROW EXCLUSIVE, EXCLUSIVE, and ACCESS EXCLUSIVE). This mode guarantees that the holder is the only transaction accessing the table in any way.
I think the problem is that on a busy instance there is a lot of activity relating to finished CI jobs. If this lock can't be acquired because other sessions already hold conflicting locks on the table, the pending ACCESS EXCLUSIVE request then blocks every new session that needs a lock on that table, since they all have to queue behind it.
So the database very rapidly accumulates sessions attempting to INSERT INTO "p_ci_finished_build_ch_sync_events"
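To make the suspected mechanism concrete, here is a toy Python model of a single table's lock queue. It is only a sketch of the reasoning above, not PostgreSQL's actual lock manager: the `TableLock` class and the session names are hypothetical, the conflict table covers only the three modes relevant here, and it assumes the backup holds ACCESS SHARE (as pg_dump does) while each INSERT needs ROW EXCLUSIVE.

```python
from dataclasses import dataclass, field

# Subset of PostgreSQL's table-level lock conflict matrix
# (only the modes that matter for this issue).
CONFLICTS = {
    "ACCESS SHARE": {"ACCESS EXCLUSIVE"},
    "ROW EXCLUSIVE": {"SHARE", "SHARE ROW EXCLUSIVE", "EXCLUSIVE", "ACCESS EXCLUSIVE"},
    "ACCESS EXCLUSIVE": {
        "ACCESS SHARE", "ROW SHARE", "ROW EXCLUSIVE", "SHARE UPDATE EXCLUSIVE",
        "SHARE", "SHARE ROW EXCLUSIVE", "EXCLUSIVE", "ACCESS EXCLUSIVE",
    },
}


@dataclass
class TableLock:
    """Hypothetical model of one table's lock state."""
    granted: list = field(default_factory=list)  # (session, mode) pairs holding a lock
    waiting: list = field(default_factory=list)  # FIFO queue of (session, mode) pairs

    def request(self, session, mode):
        # A request is granted only if it conflicts with no current holder
        # AND with no request that is already waiting ahead of it.
        ahead = [m for _, m in self.granted + self.waiting]
        if any(m in CONFLICTS[mode] for m in ahead):
            self.waiting.append((session, mode))
            return False
        self.granted.append((session, mode))
        return True


# Without the worker: INSERTs coexist with a running backup.
quiet = TableLock()
quiet.request("backup", "ACCESS SHARE")
insert_without_worker = quiet.request("sidekiq-0", "ROW EXCLUSIVE")

# With the worker: its pending ACCESS EXCLUSIVE request blocks every later INSERT.
lock = TableLock()
backup_granted = lock.request("backup", "ACCESS SHARE")
worker_granted = lock.request("PartitionManagementWorker", "ACCESS EXCLUSIVE")
inserts_granted = [lock.request(f"sidekiq-{i}", "ROW EXCLUSIVE") for i in range(5)]
```

In this model the INSERT succeeds while only the backup holds its lock, but once the worker's ACCESS EXCLUSIVE request is queued, every later INSERT joins the wait queue behind it, even though none of them conflicts with the backup itself.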
Customers see two issues:
- session exhaustion in PgBouncer/PostgreSQL
- all Sidekiq worker threads fill up
Steps to reproduce
- The following customers ran into this because backups were running; Database::PartitionManagementWorker gets stuck waiting for exclusive lock.
- Database::PartitionManagementWorker holds the transaction open
- ERROR: no partition of relation "p_ci_finished_build_ch_sync_events" found for row - I'm unclear why customers run into this; I would expect all partitions to be created ahead of time.
Example Project
What is the current bug behavior?
Database::PartitionManagementWorker causes GitLab outages
What is the expected correct behavior?
Relevant logs and/or screenshots
Output of checks
Looks to affect GitLab 16.8 and later.