Running a number of PipelineUpdateWorkers can cause many blocking queries
We've seen on GitLab.com that on occasion we get a lot of `SELECT FOR UPDATE` queries on the `ci_commits` table that appear to accumulate and block on one another, leading to 502 timeouts and graphs like these:
Today @ayufan ran a test:

- On dev: ran 1000 queries of `PipelineUpdateWorker.perform_async(<some pipeline ID>)` (reproduction sketched below). No problems.
- On GitLab.com: repeated the same experiment. We saw lots of blocking queries.
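For reference, the reproduction amounts to enqueueing the worker many times for the same pipeline from a Rails console (the pipeline ID below is a placeholder):

```ruby
# Enqueue 1000 status updates for a single pipeline; on GitLab.com this was
# enough to pile up blocking SELECT FOR UPDATE queries on ci_commits.
pipeline_id = 12345 # placeholder: any pipeline ID with recent activity
1000.times { PipelineUpdateWorker.perform_async(pipeline_id) }
```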
Often the 5-minute statement timeout hits, freeing these blocked queries. This is reflected in the Sidekiq graphs for `PipelineUpdateWorker`:
The kicker is that `PipelineUpdateWorker` just does a simple state transition update in the DB (e.g. `pending` -> `running`):
```ruby
def update_status
  with_lock do
    case latest_builds_status
    when 'pending' then enqueue
    when 'running' then run
    when 'success' then succeed
    when 'failed' then drop
    when 'canceled' then cancel
    when 'skipped' then skip
    end
  end
end
```
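For anyone unfamiliar with `with_lock`: roughly speaking it behaves like the sketch below (simplified from ActiveRecord's pessimistic locking, not the exact Rails source). This is where the `SELECT ... FOR UPDATE` on `ci_commits` comes from: every worker updating the same pipeline opens a transaction and then serializes on that row lock until commit.

```ruby
# Simplified sketch of ActiveRecord's with_lock (pessimistic locking):
# it opens a transaction and re-reads the row with FOR UPDATE, so any
# other transaction touching the same ci_commits row blocks until commit.
def with_lock(lock = true)
  transaction do
    lock!(lock) # SELECT ... FROM ci_commits WHERE id = ? FOR UPDATE
    yield
  end
end
```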
Looking at Sidekiq TTIN traces, it appears that threads are blocked on the `with_lock` call. If we look at the `pg_locks` view, it appears there is some tuple lock that's not being released:
```
gitlabhq_production=# select * from pg_locks where pid = 34805;
   locktype    | database | relation |  page  | tuple | virtualxid | transactionid | classid | objid | objsubid | virtualtransaction |  pid  |        mode         | granted | fastpath
---------------+----------+----------+--------+-------+------------+---------------+---------+-------+----------+--------------------+-------+---------------------+---------+----------
 relation      |    16385 |    34092 |        |       |            |               |         |       |          | 19/784908          | 34805 | AccessShareLock     | t       | t
 relation      |    16385 |    33272 |        |       |            |               |         |       |          | 19/784908          | 34805 | RowShareLock        | t       | t
 virtualxid    |          |          |        |       | 19/784908  |               |         |       |          | 19/784908          | 34805 | ExclusiveLock       | t       | t
 transactionid |          |          |        |       |            |       4822878 |         |       |          | 19/784908          | 34805 | ExclusiveLock       | t       | f
 tuple         |    16385 |    33272 | 105613 |     7 |            |               |         |       |          | 19/784908          | 34805 | AccessExclusiveLock | f       | f
(5 rows)
```
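To cross-check which backends are stuck waiting and what they are running, something like the following can be run from a Rails console (a sketch; it assumes PostgreSQL earlier than 9.6, where `pg_stat_activity` still exposes the boolean `waiting` column):

```ruby
# List backends currently waiting on a lock, oldest transaction first,
# along with the query they are blocked on.
ActiveRecord::Base.connection.select_all(<<-SQL).to_a
  SELECT pid, now() - xact_start AS xact_age, query
  FROM pg_stat_activity
  WHERE waiting
  ORDER BY xact_age DESC
SQL
```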
We're not exactly sure how this could happen. Is there some deadlock occurring? Is there a race condition when multiple hosts attempt to grab the same lock? Is some Sidekiq thread not properly releasing the lock?
Even though we don't fully understand the problem, here are the merge requests that should help:
- Implement optimistic locking (rough sketch below): https://gitlab.com/gitlab-org/gitlab-ce/merge_requests/7040
- Run only one pipeline and project when scheduled multiple times: https://gitlab.com/gitlab-org/gitlab-ce/merge_requests/7005
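For context on the first MR: optimistic locking in ActiveRecord relies on an integer `lock_version` column and raises `ActiveRecord::StaleObjectError` on conflicting writes, instead of making every worker queue up behind a `SELECT FOR UPDATE`. The sketch below only illustrates the general shape of that approach; it is hypothetical and not necessarily what MR 7040 actually does.

```ruby
# Hypothetical optimistic-locking variant of update_status, assuming
# ci_commits gains an integer lock_version column. Conflicting updates
# raise StaleObjectError; we reload and retry instead of blocking in the DB.
def update_status
  retries ||= 0
  case latest_builds_status
  when 'pending'  then enqueue
  when 'running'  then run
  when 'success'  then succeed
  when 'failed'   then drop
  when 'canceled' then cancel
  when 'skipped'  then skip
  end
rescue ActiveRecord::StaleObjectError
  raise if (retries += 1) > 3
  reload
  retry
end
```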
We should also consider lowering the default statement timeout of 5 minutes.
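For a quick experiment, the timeout can be lowered for a single database session from a Rails console (a sketch; the 30-second value is arbitrary, and a permanent change would belong in the PostgreSQL configuration instead):

```ruby
# Lower the statement timeout for the current database session only.
ActiveRecord::Base.connection.execute("SET statement_timeout = '30s'")
```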